Re: [PATCH] implement graphemewidth
- Subject: Re: [PATCH] implement graphemewidth
- From: Ori Bernstein <ori@xxxxxxxxxxxxxx>
- Reply-to: myrddin-dev@xxxxxxxxxxxxxx
- Date: Thu, 26 Oct 2017 11:15:54 -0700
- To: myrddin-dev@xxxxxxxxxxxxxx
- Cc: "S. Gilles" <sgilles@xxxxxxxxxxxx>
Nice work!
On Thu, 26 Oct 2017 05:18:02 -0400
"S. Gilles" <sgilles@xxxxxxxxxxxx> wrote:
> ---
> Also rename it to cellwidth, because that's what I really want it
> to do and I'm not sure if what I implemented really deals with
> graphemes.
> The goal is the following:
>
> use std
> const main = {
>     std.put("|0123456789|\n")        /* |0123456789| */
>     std.put("|{w=10}|\n", "foobar")  /* |    foobar| */
>     std.put("|{w=10}|\n", "施氏食")  /* |    施氏食| */
>     std.put("|{w=10}|\n", "человек") /* |   человек| */
> }
The test cases should also include accented characters.
> I wasn't particularly happy with any of the high-performance
> implementations of wcwidth() I surveyed (in particular, musl's is
> too clever for me to understand), so I ended up using the approach
> of http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c . That appears to
> be based on a pretty old version of Unicode, however, because it
> looks like a lot of the exceptions aren't necessary anymore.
> Doing a full binary search is probably wasteful, but if someone
> wants to process that much non-ASCII data, they are probably in a
> better position to contribute vectorized, SSE2-aware, triaxilating
> frequency algorithms than I am.
Right now, we're stuck on a pretty old version of Unicode, so that's not
ideal, but acceptable. And we already do binary search for
isupper/islower/isdigit/etc.
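
For anyone following along, here is a minimal sketch of the binary-search
approach over sorted range tables, written in C so it can be compared with
wcwidth.c. The ranges are an illustrative subset loosely following that file,
not the tables in lib/std, and the names (struct range, inranges, cellwidth,
strcells) are made up for this sketch. Control characters are not handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Inclusive codepoint range [lo, hi]. */
struct range { uint32_t lo, hi; };

/* Illustrative subsets only; each table must stay sorted and non-overlapping. */
static const struct range zerowidth[] = {
    { 0x0300, 0x036F },   /* combining diacritical marks */
    { 0x200B, 0x200F },   /* zero width space, joiners, direction marks */
    { 0x20D0, 0x20EF },   /* combining marks for symbols */
};

static const struct range wide[] = {
    { 0x1100, 0x115F },   /* Hangul Jamo initial consonants */
    { 0x2E80, 0x303E },   /* CJK radicals and symbols */
    { 0x3040, 0xA4CF },   /* kana through Yi, including CJK ideographs */
    { 0xAC00, 0xD7A3 },   /* Hangul syllables */
    { 0xF900, 0xFAFF },   /* CJK compatibility ideographs */
    { 0xFF00, 0xFF60 },   /* fullwidth forms */
};

#define NELEM(a) (sizeof(a) / sizeof((a)[0]))

/* Binary search: returns 1 if c falls inside any range of tab, else 0. */
static int inranges(uint32_t c, const struct range *tab, size_t n)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < tab[mid].lo)
            hi = mid;
        else if (c > tab[mid].hi)
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}

/* Number of terminal cells a single codepoint occupies: 0, 1 or 2. */
static int cellwidth(uint32_t c)
{
    if (inranges(c, zerowidth, NELEM(zerowidth)))
        return 0;
    if (inranges(c, wide, NELEM(wide)))
        return 2;
    return 1;
}

/* Total cell width of an already-decoded sequence of codepoints. */
static size_t strcells(const uint32_t *cp, size_t n)
{
    size_t w = 0;

    for (size_t i = 0; i < n; i++)
        w += (size_t)cellwidth(cp[i]);
    return w;
}
```

With these tables, "e" followed by U+0301 (combining acute) takes one cell,
which is the sort of accent case the tests should cover.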
I plan to update https://git.eigenstate.org/ori/mkchartab.git to generate
the tables, and to move to a "chunked array" structure for the lookup, which
should be faster, but I'm very bottlenecked on time, so I don't know when
I'll get to it.
If you want to take a crack at it, there's a really nice description in chapter
13 of http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.465.9112&rep=rep1&type=pdf
The book overall is excellent, too.
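
To give a rough idea of the shape, here is a sketch of a two-level lookup in
C. This is one reading of "chunked array", not code from the book or from
mkchartab: the codespace is cut into 256-codepoint chunks, identical chunks
are stored once, and a small index maps each chunk number to its stored copy.
All names and the ranges in slowwidth are hypothetical, and a real generator
would emit the tables ahead of time instead of building them at startup.

```c
#include <stdint.h>
#include <string.h>

enum {
    CHUNKBITS = 8,
    CHUNKSZ   = 1 << CHUNKBITS,          /* codepoints per chunk */
    NCHUNK    = 0x110000 >> CHUNKBITS,   /* chunks covering all of Unicode */
    MAXUNIQ   = 16                       /* room for distinct chunks */
};

static uint8_t chunk[MAXUNIQ][CHUNKSZ];  /* deduplicated chunk contents */
static uint8_t chunkidx[NCHUNK];         /* chunk number -> slot in chunk[] */
static int nuniq;                        /* slots of chunk[] in use */

/* Slow reference classifier; the ranges are illustrative only. */
static int slowwidth(uint32_t c)
{
    if (c >= 0x0300 && c <= 0x036F) return 0;  /* combining marks */
    if (c >= 0x1100 && c <= 0x115F) return 2;  /* Hangul Jamo */
    if (c >= 0x4E00 && c <= 0x9FFF) return 2;  /* CJK ideographs */
    if (c >= 0xAC00 && c <= 0xD7A3) return 2;  /* Hangul syllables */
    return 1;
}

/* Fill both levels; returns 0 on success, -1 if MAXUNIQ is too small. */
static int buildtab(void)
{
    uint8_t tmp[CHUNKSZ];

    nuniq = 0;
    for (uint32_t b = 0; b < NCHUNK; b++) {
        int j;

        for (uint32_t i = 0; i < CHUNKSZ; i++)
            tmp[i] = (uint8_t)slowwidth((b << CHUNKBITS) | i);

        /* Reuse an existing chunk if one has identical contents. */
        for (j = 0; j < nuniq; j++)
            if (memcmp(chunk[j], tmp, CHUNKSZ) == 0)
                break;
        if (j == nuniq) {
            if (nuniq == MAXUNIQ)
                return -1;
            memcpy(chunk[nuniq++], tmp, CHUNKSZ);
        }
        chunkidx[b] = (uint8_t)j;
    }
    return 0;
}

/* Two array indexes, no search. Returns -1 for values above U+10FFFF. */
static int fastwidth(uint32_t c)
{
    if (c >= 0x110000)
        return -1;
    return chunk[chunkidx[c >> CHUNKBITS]][c & (CHUNKSZ - 1)];
}
```

With these made-up ranges the whole codespace collapses to five distinct
chunks, so the tables stay small and each lookup costs two array indexes.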
> As a disclaimer: I'm not a unicode guy, I just get emails from
> people with non-ASCII names with text like “H⁰(P•) ≅ ℤ”. I haven't
> even tried to verify that any of the more exotic scripts work as
> expected.
> ---
> lib/std/bld.sub | 1 +
> lib/std/cellwidth.myr | 526 ++++++++++++++++++++++++++++++++++++++++++++++++++
Take a look at lib/std/utf.myr, where the other tables live. I think
this table can go there too.
> lib/std/fmt.myr | 11 +-
> 3 files changed, 529 insertions(+), 9 deletions(-)
> create mode 100644 lib/std/cellwidth.myr
<snip>
--
Ori Bernstein <ori@xxxxxxxxxxxxxx>