Eigenstate: myrddin-dev mailing list

Re: [PATCH] implement graphemewidth


On 2017-10-26T11:15:54-0700, Ori Bernstein wrote:
> Nice work!
> 
> On Thu, 26 Oct 2017 05:18:02 -0400
> "S. Gilles" <sgilles@xxxxxxxxxxxx> wrote:
>  [snip]
> > The goal is the following:
> > 
> >         use std
> >         const main = {
> >                 std.put("|0123456789|\n")        /* |0123456789| */
> >                 std.put("|{w=10}|\n", "foobar")  /* |    foobar| */
> >                 std.put("|{w=10}|\n", "施氏食")  /* |    施氏食| */
> >                 std.put("|{w=10}|\n", "человек") /* |   человек| */
> >         }
> 
> Some other test cases should include accents.

Good point - the thing that made me notice this was an accent, so
they should definitely work. (More to the point, I guess this should
*have* tests. I'll put them in v2.)
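
As a rough idea of what such tests might check, here is a minimal sketch
in Python rather than Myrddin. It is not the patch's code: display_width
and pad_left are hypothetical names, and the width rule (combining marks
count 0, East Asian Wide/Fullwidth count 2, everything else 1) is a
simplifying assumption built on Python's unicodedata module.

```python
import unicodedata


def display_width(s):
    """Approximate column width of s (simplified rule, see lead-in)."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch) != 0:
            continue  # combining marks (e.g. accents) occupy no column
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        width += 2 if wide else 1
    return width


def pad_left(s, w):
    """Right-align s in a field of w columns, like the {w=10} examples."""
    return " " * max(0, w - display_width(s)) + s


# The cases from the original mail, plus one with a combining accent.
for sample in ("foobar", "施氏食", "человек", "cafe\u0301"):
    print("|" + pad_left(sample, 10) + "|")
```

The last sample spells the accented letter as "e" followed by U+0301
(COMBINING ACUTE ACCENT), which is the case a plain codepoint count gets
wrong.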

> > I wasn't particularly happy with any of the high-performance
> > implementations of wcwidth() I surveyed (in particular, musl's is
> > too clever for me to understand), so I ended up using the approach
> > of http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c . That appears to
> > be based on a pretty old version of Unicode, however, because it
> > looks like a lot of the exceptions aren't necessary anymore.
> 
> > Doing a full binary search is probably wasteful, but if someone
> > wants to process that much non-ASCII data, they are probably in a
> > better position to contribute vectorized, SSE2-aware, triaxilating
> > frequency algorithms than I am.
> 
> Right now, we're stuck on a pretty old version of Unicode, so that's not
> ideal, but acceptable. And we already do binary search for
> isupper/islower/isdigit/etc.

I generated my tables based on the latest UCD, so age shouldn't be
a problem. It's just that the URL I linked specifies a bunch of
extra tweaks to the interval generation that I think(?) are no
longer necessary, in case anyone was wondering why I didn't copy
the commands exactly.
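
For anyone following along, the interval approach amounts to roughly the
following. This is a sketch in Python, not the Myrddin from the patch:
the two tables are tiny hand-picked excerpts rather than anything
generated from the UCD, all names are made up for illustration, and
control characters (which wcwidth.c reports as -1) are ignored.

```python
# Sorted, non-overlapping (first, last) codepoint ranges.
# Hand-picked excerpts for illustration only, NOT real generated tables.
ZERO_WIDTH = [
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x200B, 0x200F),  # zero-width space, joiners, direction marks
]
DOUBLE_WIDTH = [
    (0x1100, 0x115F),  # Hangul Jamo initial consonants
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3),  # Hangul Syllables
    (0xFF01, 0xFF60),  # Fullwidth Forms
]


def in_table(cp, table):
    """Binary search: is codepoint cp inside any range of table?"""
    lo, hi = 0, len(table) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        first, last = table[mid]
        if cp < first:
            hi = mid - 1
        elif cp > last:
            lo = mid + 1
        else:
            return True
    return False


def codepoint_width(cp):
    """Columns for one codepoint: 0, 2, or (by default) 1."""
    if in_table(cp, ZERO_WIDTH):
        return 0
    if in_table(cp, DOUBLE_WIDTH):
        return 2
    return 1


def string_width(s):
    """Sum of per-codepoint widths; no grapheme clustering is attempted."""
    return sum(codepoint_width(ord(ch)) for ch in s)
```

Each lookup costs O(log n) in the number of ranges, which is the "full
binary search" cost discussed above.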

> I've got plans to update https://git.eigenstate.org/ori/mkchartab.git to
> generate the tables, as well as moving to a "chunked array" structure for the
> lookup which should be faster, but I'm very bottlenecked on time, so I don't
> know when I'll get to it.

That sounds like a step up from "this one program I found on SO,
plus sed". I'll look at that and see if it's necessary for v2.
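
One common reading of "chunked array" is a two-stage table: split the
codepoint space into fixed-size chunks, store each distinct chunk once,
and keep an index from chunk number to stored chunk. Whether that is
exactly the structure meant here is an assumption. A minimal sketch in
Python, with a toy width function standing in for real data and every
name hypothetical:

```python
BLOCK = 256          # codepoints per chunk (arbitrary choice for this sketch)
MAX_CP = 0x110000    # one past the last Unicode codepoint


def toy_width(cp):
    """Stand-in for real width data: two ranges only, for illustration."""
    if 0x0300 <= cp <= 0x036F:
        return 0
    if 0x4E00 <= cp <= 0x9FFF:
        return 2
    return 1


def build_two_stage(width_of):
    """Return (index, chunks): index maps chunk number -> slot in chunks."""
    index = []
    chunks = []
    seen = {}  # chunk contents -> slot, so identical chunks are shared
    for start in range(0, MAX_CP, BLOCK):
        chunk = tuple(width_of(cp) for cp in range(start, start + BLOCK))
        if chunk not in seen:
            seen[chunk] = len(chunks)
            chunks.append(chunk)
        index.append(seen[chunk])
    return index, chunks


def lookup(index, chunks, cp):
    """Two array reads, no searching."""
    return chunks[index[cp // BLOCK]][cp % BLOCK]


INDEX, CHUNKS = build_two_stage(toy_width)
```

The trade-off against the interval tables is memory for speed: lookups
become constant-time, and the table stays small only as long as most
chunks are identical.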

> If you want to take a crack, there's a really nice description in chapter
> 13 of http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.465.9112&rep=rep1&type=pdf
> 
> The book overall is excellent, too.

Will do.

> [snip]
> 
> Take a look at lib/std/utf.myr, where the other tables live. I think
> this table can go there too.

Sounds good, though I think you mean lib/std/chartype.myr for the
tables. I'll put my tables in chartype.myr and my functions in
utf.myr for v2, if that's sensible.

-- 
S. Gilles

Follow-Ups:
  [PATCH] Implement graphemewidth, from "S. Gilles" <sgilles@xxxxxxxxxxxx>
References:
  [PATCH] implement graphemewidth, from "S. Gilles" <sgilles@xxxxxxxxxxxx>
  Re: [PATCH] implement graphemewidth, from Ori Bernstein <ori@xxxxxxxxxxxxxx>