Eigenstate: myrddin-dev mailing list

Re: [PATCH] implement graphemewidth


Nice work!

On Thu, 26 Oct 2017 05:18:02 -0400
"S. Gilles" <sgilles@xxxxxxxxxxxx> wrote:

> ---
> Also rename it to cellwidth, because that's what I really want it
> to do and I'm not sure if what I implemented really deals with
> graphemes.

> The goal is the following:
> 
>         use std
>         const main = {
>                 std.put("|0123456789|\n")        /* |0123456789| */
>                 std.put("|{w=10}|\n", "foobar")  /* |    foobar| */
>                 std.put("|{w=10}|\n", "施氏食")  /* |    施氏食| */
>                 std.put("|{w=10}|\n", "человек") /* |   человек| */
>         }

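Some other test cases should include accents, both precomposed (é as
U+00E9) and combining (e followed by U+0301), since a combining mark
should take up zero cells.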
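
To make the accent case concrete, here is a rough C sketch that counts
the cells in a UTF-8 string. Everything in it is an assumption for
illustration only: the names decode_utf8, toy_cell_width and
string_cells are made up, the width ranges are a tiny excerpt rather
than real Unicode tables, and none of it is the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Decode one UTF-8 sequence from s (n bytes available, n > 0).
 * Stores the code point in *cp and returns the bytes consumed.
 * Malformed input is consumed one byte at a time as U+FFFD.
 * Simplification: overlong forms and surrogates are not rejected.
 */
static size_t decode_utf8(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; *cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; *cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; *cp = s[0] & 0x07; }
    else { *cp = 0xFFFD; return 1; }

    if (len > n) { *cp = 0xFFFD; return 1; }
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Hypothetical, heavily truncated width rules (NOT complete tables). */
static int toy_cell_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F) return 0; /* combining diacritics */
    if (cp >= 0x4E00 && cp <= 0x9FFF) return 2; /* CJK unified ideographs */
    return 1;
}

/* Sum the cell widths of every code point in a NUL-terminated string. */
size_t string_cells(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t n = strlen(str), cells = 0;

    while (n > 0) {
        uint32_t cp;
        size_t used = decode_utf8(s, n, &cp);
        cells += (size_t)toy_cell_width(cp);
        s += used;
        n -= used;
    }
    return cells;
}
```

With these rules, "café" should come out as 4 cells whether the é is
precomposed or written as e plus U+0301, and "施氏食" should come out
as 6, which matches the padding in the example above.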

> I wasn't particularly happy with any of the high-performance
> implementations of wcwidth() I surveyed (in particular, musl's is
> too clever for me to understand), so I ended up using the approach
> of http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c . That appears to
> be based on a pretty old version of Unicode, however, because it
> looks like a lot of the exceptions aren't necessary anymore.

> Doing a full binary search is probably wasteful, but if someone
> wants to process that much non-ASCII data, they are probably in a
> better position to contribute vectorized, SSE2-aware, triaxilating
> frequency algorithms than I am.

Right now, we're stuck on a pretty old version of Unicode ourselves, so
that's not ideal, but it's acceptable. And we already do a binary search
for isupper/islower/isdigit/etc.
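
For reference, the binary-search approach looks roughly like this in C.
This is a sketch in the spirit of the wcwidth.c linked above, not a copy
of it and not the Myrddin code in the patch; the names (struct interval,
in_table, cell_width) are made up and the tables are tiny excerpts.

```c
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points that share one width. */
struct interval {
    uint32_t first;
    uint32_t last;
};

/* Illustrative excerpts only, NOT complete Unicode data. Must be sorted. */
static const struct interval zero_width[] = {
    { 0x0300, 0x036F },  /* combining diacritical marks */
    { 0x200B, 0x200F },  /* zero width space, joiners, direction marks */
};

static const struct interval double_width[] = {
    { 0x1100, 0x115F },  /* Hangul Jamo initial consonants */
    { 0x3040, 0x30FF },  /* Hiragana and Katakana blocks */
    { 0x4E00, 0x9FFF },  /* CJK unified ideographs */
    { 0xAC00, 0xD7A3 },  /* Hangul syllables */
};

/* Binary search: return 1 if cp falls inside any interval of tab. */
static int in_table(uint32_t cp, const struct interval *tab, size_t n)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < tab[mid].first)
            hi = mid;
        else if (cp > tab[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* Width in terminal cells: 0, 1 or 2. Anything not listed counts as 1. */
int cell_width(uint32_t cp)
{
    if (in_table(cp, zero_width, sizeof zero_width / sizeof zero_width[0]))
        return 0;
    if (in_table(cp, double_width, sizeof double_width / sizeof double_width[0]))
        return 2;
    return 1;
}
```

The cost is O(log n) comparisons per code point against each table,
which is the same shape of lookup as an interval-table isupper or
isdigit.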

I've got plans to update https://git.eigenstate.org/ori/mkchartab.git to
generate the tables, and to move to a "chunked array" structure for the
lookup, which should be faster. I'm very bottlenecked on time, though, so I
don't know when I'll get to it.
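
One common reading of a "chunked array" is a two-stage table: the high
bits of the code point pick a chunk, the low bits index into it, and
identical chunks are stored once. The C sketch below assumes that
reading; it is not the planned mkchartab output. The names, the 256-entry
chunk size, and the toy width rules are all made up, and a real generator
would emit the tables statically instead of building them at startup.

```c
#include <stdint.h>
#include <string.h>

enum {
    CHUNK_BITS = 8,
    CHUNK_SIZE = 1 << CHUNK_BITS,          /* 256 code points per chunk */
    NUM_INDEX  = 0x110000 >> CHUNK_BITS,   /* one slot per chunk of Unicode */
    MAX_CHUNKS = 16                        /* plenty for this toy data */
};

static uint8_t chunk_of[NUM_INDEX];             /* stage 1: which chunk */
static uint8_t chunks[MAX_CHUNKS][CHUNK_SIZE];  /* stage 2: widths */

/* Hypothetical, heavily truncated width rules used to fill the tables. */
static uint8_t toy_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F) return 0; /* combining diacritics */
    if (cp >= 0x4E00 && cp <= 0x9FFF) return 2; /* CJK unified ideographs */
    return 1;
}

/*
 * Fill both stages, sharing identical chunks.
 * Returns the number of distinct chunks, or -1 if MAX_CHUNKS is too small.
 */
int build_tables(void)
{
    uint8_t tmp[CHUNK_SIZE];
    int nchunks = 0;
    uint32_t hi, lo;

    for (hi = 0; hi < (uint32_t)NUM_INDEX; hi++) {
        int c;

        for (lo = 0; lo < (uint32_t)CHUNK_SIZE; lo++)
            tmp[lo] = toy_width((hi << CHUNK_BITS) | lo);

        /* Reuse an existing chunk with the same contents if there is one. */
        for (c = 0; c < nchunks; c++)
            if (memcmp(chunks[c], tmp, CHUNK_SIZE) == 0)
                break;
        if (c == nchunks) {
            if (nchunks == MAX_CHUNKS)
                return -1;
            memcpy(chunks[nchunks++], tmp, CHUNK_SIZE);
        }
        chunk_of[hi] = (uint8_t)c;
    }
    return nchunks;
}

/* Two array loads, no search. Returns -1 for values beyond U+10FFFF. */
int chunked_width(uint32_t cp)
{
    if (cp >= 0x110000)
        return -1;
    return chunks[chunk_of[cp >> CHUNK_BITS]][cp & (CHUNK_SIZE - 1)];
}
```

The tradeoff is memory for speed: the lookup is constant time, and
because most of the code space is uniform, most index slots end up
pointing at the same few chunks.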

If you want to take a crack at it, there's a really nice description in
chapter 13 of http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.465.9112&rep=rep1&type=pdf

The book overall is excellent, too.

> As a disclaimer: I'm not a unicode guy, I just get emails from
> people with non-ASCII names with text like “H⁰(P•) ≅ ℤ”. I haven't
> even tried to verify that any of the more exotic scripts work as
> expected.
> ---
>  lib/std/bld.sub       |   1 +
>  lib/std/cellwidth.myr | 526 ++++++++++++++++++++++++++++++++++++++++++++++++++

Take a look at lib/std/utf.myr, where the other tables live. I think
this table can go there too.

>  lib/std/fmt.myr       |  11 +-
>  3 files changed, 529 insertions(+), 9 deletions(-)
>  create mode 100644 lib/std/cellwidth.myr
 
<snip>

-- 
Ori Bernstein <ori@xxxxxxxxxxxxxx>
