Re: [PATCH] implement graphemewidth
- Subject: Re: [PATCH] implement graphemewidth
- From: "S. Gilles" <sgilles@xxxxxxxxxxxx>
- Reply-to: myrddin-dev@xxxxxxxxxxxxxx
- Date: Thu, 26 Oct 2017 14:54:21 -0400
- To: Ori Bernstein <ori@xxxxxxxxxxxxxx>
- Cc: myrddin-dev@xxxxxxxxxxxxxx
On 2017-10-26T11:15:54-0700, Ori Bernstein wrote:
> Nice work!
>
> On Thu, 26 Oct 2017 05:18:02 -0400
> "S. Gilles" <sgilles@xxxxxxxxxxxx> wrote:
> [snip]
> > The goal is the following:
> >
> >     use std
> >     const main = {
> >         std.put("|0123456789|\n") /* |0123456789| */
> >         std.put("|{w=10}|\n", "foobar") /* |    foobar| */
> >         std.put("|{w=10}|\n", "施氏食") /* |    施氏食| */
> >         std.put("|{w=10}|\n", "человек") /* |   человек| */
> >     }
>
> Some other test cases should include accents.

Good point - the thing that made me notice this was an accent, so
they should definitely work. (More to the point, I guess this should
*have* tests. I'll put them in v2.)

> > I wasn't particularly happy with any of the high-performance
> > implementations of wcwidth() I surveyed (in particular, musl's is
> > too clever for me to understand), so I ended up using the approach
> > of http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c . That appears to
> > be based on a pretty old version of Unicode, however, because it
> > looks like a lot of the exceptions aren't necessary anymore.
>
> > Doing a full binary search is probably wasteful, but if someone
> > wants to process that much non-ASCII data, they are probably in a
> > better position to contribute vectorized, SSE2-aware, triaxilating
> > frequency algorithms than I am.
>
> Right now, we're stuck on a pretty old version of Unicode, so that's not
> ideal, but acceptable. And we already do binary search for
> isupper/islower/isdigit/etc.

I generated my tables based on the latest UCD, so age shouldn't be
a problem. It's just that the URL I linked specifies a bunch of
extra tweaks to the interval generation that I think(?) are no
longer necessary, in case anyone was wondering why I didn't copy
the commands exactly.

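For anyone skimming the thread, the interval-table approach can be
sketched roughly as below. This is C rather than Myrddin, and it is
not the patch: the table entries are a tiny hand-picked excerpt
rather than anything generated from the UCD, control characters are
ignored, and the names `struct interval`, `in_table`, and
`cell_width` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Closed interval [first, last] of code points. */
struct interval {
    uint32_t first;
    uint32_t last;
};

/* Illustrative excerpts only; must be sorted and non-overlapping. */
static const struct interval zero_width[] = {
    { 0x0300, 0x036F },  /* Combining Diacritical Marks */
    { 0x200B, 0x200B },  /* ZERO WIDTH SPACE */
};

static const struct interval double_width[] = {
    { 0x1100, 0x115F },  /* Hangul Jamo initial consonants */
    { 0x4E00, 0x9FFF },  /* CJK Unified Ideographs */
    { 0xAC00, 0xD7A3 },  /* Hangul Syllables */
};

/* Binary search: returns 1 if cp falls inside any interval of t. */
static int in_table(uint32_t cp, const struct interval *t, size_t n)
{
    size_t lo = 0, hi = n;  /* candidates live in the half-open range [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (cp < t[mid].first)
            hi = mid;
        else if (cp > t[mid].last)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* Number of terminal cells a code point is assumed to occupy. */
int cell_width(uint32_t cp)
{
    if (in_table(cp, zero_width, sizeof zero_width / sizeof zero_width[0]))
        return 0;
    if (in_table(cp, double_width, sizeof double_width / sizeof double_width[0]))
        return 2;
    return 1;
}
```

With these excerpt tables, a combining acute accent (U+0301) comes
out as width 0, a CJK ideograph as width 2, and Latin or Cyrillic
letters as width 1, which is what the padding in the example needs.
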
> I've got plans to update https://git.eigenstate.org/ori/mkchartab.git to
> generate the tables, as well as moving to a "chunked array" structure for the
> lookup which should be faster, but I'm very bottlenecked on time, so I don't
> know when I'll get to it.

That sounds like a step up from "this one program I found on SO,
plus sed". I'll look at that and see if it's necessary for v2.

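One possible reading of "chunked array" is a two-stage table: split
the code point into a chunk number and an offset, and store each
distinct chunk only once. The sketch below is a guess at that idea,
not what mkchartab does or will do. It is in C, every name in it
(`build_tables`, `chunked_width`, `slow_width`, the constants) is
hypothetical, and `slow_width` is a stand-in with two hard-coded
ranges. A real generator would presumably emit the tables offline
rather than build them at startup.

```c
#include <stdint.h>
#include <string.h>

#define CHUNK_BITS   8
#define CHUNK_SIZE   (1u << CHUNK_BITS)
#define MAX_CP       0x10FFFFu
#define NCHUNKS      ((MAX_CP >> CHUNK_BITS) + 1)
#define MAX_DISTINCT 8   /* enough for the stand-in data below */

static uint16_t chunk_index[NCHUNKS];              /* stage 1: chunk number -> slot */
static uint8_t  chunks[MAX_DISTINCT][CHUNK_SIZE];  /* stage 2: deduplicated chunks */
static unsigned nchunks;

/* Stand-in for real width data (assumption, two ranges only). */
static uint8_t slow_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;   /* Combining Diacritical Marks */
    if (cp >= 0x4E00 && cp <= 0x9FFF)
        return 2;   /* CJK Unified Ideographs */
    return 1;
}

/* Fill both stages, sharing identical chunks. Returns 0 on success, -1 if full. */
int build_tables(void)
{
    uint8_t tmp[CHUNK_SIZE];

    nchunks = 0;
    for (uint32_t c = 0; c < NCHUNKS; c++) {
        unsigned k;

        for (uint32_t i = 0; i < CHUNK_SIZE; i++)
            tmp[i] = slow_width((c << CHUNK_BITS) | i);

        /* Linear scan for an identical chunk already stored. */
        for (k = 0; k < nchunks; k++)
            if (memcmp(chunks[k], tmp, CHUNK_SIZE) == 0)
                break;

        if (k == nchunks) {
            if (nchunks == MAX_DISTINCT)
                return -1;
            memcpy(chunks[nchunks++], tmp, CHUNK_SIZE);
        }
        chunk_index[c] = (uint16_t)k;
    }
    return 0;
}

/* Two array loads and no search; build_tables() must have run first. */
int chunked_width(uint32_t cp)
{
    if (cp > MAX_CP)
        return 1;
    return chunks[chunk_index[cp >> CHUNK_BITS]][cp & (CHUNK_SIZE - 1)];
}
```

The trade-off against binary search is memory for speed: lookups
become constant-time, and because most chunks are identical, the
second stage stays small.
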
> If you want to take a crack, there's a really nice description in chapter
> 13 of http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.465.9112&rep=rep1&type=pdf
>
> The book overall is excellent, too.

Will do.

> [snip]
>
> Take a look at lib/std/utf.myr, where the other tables live. I think
> this table can go there too.

Sounds good, though I think you mean lib/std/chartype.myr for the
tables. I'll put my tables in chartype.myr and my functions in
utf.myr for v2, if that's sensible.

--
S. Gilles