unicode double-width character support #34

obfusk · 2021-02-16T18:57:30Z

Patch mentioned in #33 (but for master).

obfusk · 2021-02-17T02:09:28Z

I managed to get rid of the FIXMEs and it seems to work quite well 😄
It's still a bit rough though.

mattip · 2021-02-17T04:16:35Z

Does this work with python2/python3?

mattip · 2021-02-17T04:18:22Z

I guess so, tests are passing

obfusk · 2021-02-17T11:51:16Z

Does this work with python2/python3?

Yes. The tests originally failed with python2, but I fixed that.
I've now tested it myself with pypy, pypy3, and cpython3.9.

obfusk · 2021-02-17T12:03:52Z

It doesn't handle e.g. zero width joiners, though stand-alone emoji work fine.
But readline doesn't really handle those either, and neither do many terminals AFAIK.

blueyed · 2021-02-17T19:26:15Z

Thanks for this already!
I wonder how good test coverage in this regard really is, and think that it would be great to get e.g. the corner cases you've fixed covered.

Unfortunately codecov is acting up, at least with the latest commits, since it would be useful to revisit/see its patch status (i.e. how much / which code is covered in tests after all).

obfusk · 2021-02-17T21:14:49Z

I wonder how good test coverage in this regard really is, and think that it would be great to get e.g. the corner cases you've fixed covered.

The tests don't really seem to cover any of the curses stuff. Which is of course hard to test.
At least the tests helped me fix the python2 compatibility.

I think I may have fixed one or two existing bugs. But I may also have introduced some 😅

blueyed

(Manual codecov link: https://app.codecov.io/gh/pypy/pyrepl/compare/34/diff)

blueyed · 2021-02-25T14:53:28Z

pyrepl/reader.py

+def width(c):
+    return 2 if unicodedata.east_asian_width(c) in "FW" else 1
+def wlen(s):
+    return sum(map(width, s))


What do you think about using https://pypi.org/project/wcwidth/?

Looking in from the PyPy side, wcwidth seems like a small package with JIT-freindly structure (no complicated classes or deep function chains). It was dormant from 2016 to 2020, but seems to be active again. The license is MIT, which is fine for PyPy.

I agree that that bit of code could be improved, but what would be the advantage of using wcwidth over python's built-in unicode support? It seems to provide mostly the same functionality except with its own unicode tables that need to be kept up to date.

wcwidth can also return 0 or -1, which the current code doesn't know how to handle. And I'm not sure what the best way to handle those would be tbh.

For improved performance, we could use e.g. @functools.lru_cache.

(note that pyrepl already shows e.g. control characters as ^c etc.)

it is possible to find the Unicode version of a terminal by somewhat intrusive detection https://github.com/jquast/ucs-detect

According to the table in the README, lxterminal and kitty support the same unicode version (11.0.0).
But as my screenshot shows, they don't display emoji the same way.

(Slightly OT: I'd consider switching to kitty, but unfortunately it doesn't support Japanese input 😿)

I briefly tested using wcwidth: obfusk/pyrepl-wide@eba676a.

I don't notice any improvement atm.

I had to modify the original prompt length calculation (because that subtracts the length of ansi escape sequences).

_my_unctrl() currently turns a zero-width joiner into "\u200d" and I'm not sure returning it unmodified won't break something else.

@blueyed is there a suspicion that this PR is worse in some cases than the current code? If not maybe it can be merged as-is and other alternatives could be parts of future PRs?

Even with wcwidth's upcoming width() function (and setting aside the differences between terminals), handling zero-width joiners etc. properly is not easy, since width(s) == sum(width(c) for c in s) is no longer true for all strings. This would presumably require more careful and extensive modifications to the code than the current PR.

I think wcwidth is likely to be quite useful in improving unicode support even more in the future, but for now I think this PR is a good first step. Further improvements get a lot more complicated and should probably be separate, future PRs.

I'd also point out that readline doesn't really seem to handle emoji any better.

AFAIK this PR doesn't make anything worse. Unfortunately (but understandably since I don't really know a good way to test all the curses stuff), the test suite can't rule that out.

I do however think the PR is still a little rough and could use a code review and probably some updates to comments and documentation before merging.

Looking in from the PyPy side, wcwidth seems like a small package with JIT-freindly structure (no complicated classes or deep function chains). It was dormant from 2016 to 2020, but seems to be active again. The license is MIT, which is fine for PyPy.

~~I don't think wcwidth supports python2 though, which I presume is an issue for PyPy.~~ [edit: I was mistaken]

obfusk · 2021-02-25T15:48:02Z

(Manual codecov link: https://app.codecov.io/gh/pypy/pyrepl/compare/34/diff)

That link gives me a 403.

mattip · 2021-03-15T09:05:38Z

Any more thoughts here? I would like to get this in the upcoming PyPy release.

obfusk · 2021-03-19T14:33:05Z

Any more thoughts here? I would like to get this in the upcoming PyPy release.

To summarise:

This handles double-width characters as found in e.g. Japanese and Chinese reasonably well.
It seems to be an improvement over the current lack of unicode support and AFAIK doesn't make anything worse than before.
It doesn't handle emoji (but that's hard and neither does readline afaik).
It doesn't seem to handle e.g. Arabic (but I don't know much about that), but I don't think it makes it worse either.

I've tested it with Japanese and found no (further) bugs, but it would be nice to have a comprehensive test suite.
I'd like someone to review the code to make sure I didn't miss anything.
The PR is a bit rough and could probably use a bit of refactoring; some old comments should probably be updated.

Future improvements would probably benefit from switching to wcwidth, but this will require more extensive modifications as some invariants/assumptions (such as width(a) + width(b) == width(a+b)) will no longer hold.

obfusk force-pushed the unicode branch 2 times, most recently from 3634ce2 to a1dd08f Compare February 16, 2021 23:33

[WIP] unicode support

ad177c8

obfusk force-pushed the unicode branch from a1dd08f to ad177c8 Compare February 16, 2021 23:54

obfusk added 2 commits February 17, 2021 02:19

handle wrapped lines

3a8a8f2

handle last FIXME

5814deb

obfusk changed the title ~~[proof of concept] unicode support~~ unicode support Feb 17, 2021

obfusk marked this pull request as ready for review February 17, 2021 02:07

fix some edge cases

d5048d4

blueyed added the enhancement New feature or request label Feb 17, 2021

blueyed closed this Feb 25, 2021

blueyed reopened this Feb 25, 2021

This comment has been minimized.

Sign in to view

blueyed reviewed Feb 25, 2021

View reviewed changes

obfusk changed the title ~~unicode support~~ unicode double-width character support Apr 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unicode double-width character support #34

unicode double-width character support #34

obfusk commented Feb 16, 2021 •

edited

Loading

obfusk commented Feb 17, 2021

mattip commented Feb 17, 2021

mattip commented Feb 17, 2021

obfusk commented Feb 17, 2021

obfusk commented Feb 17, 2021

blueyed commented Feb 17, 2021

obfusk commented Feb 17, 2021

This comment has been minimized.

blueyed left a comment

blueyed Feb 25, 2021

mattip Feb 25, 2021

obfusk Feb 25, 2021

obfusk Feb 25, 2021

obfusk Feb 25, 2021

obfusk Feb 25, 2021

obfusk Feb 25, 2021

mattip Feb 25, 2021

obfusk Feb 26, 2021

obfusk Feb 26, 2021 •

edited

Loading

obfusk commented Feb 25, 2021

mattip commented Mar 15, 2021

obfusk commented Mar 19, 2021

unicode double-width character support #34

Are you sure you want to change the base?

unicode double-width character support #34

Conversation

obfusk commented Feb 16, 2021 • edited Loading

obfusk commented Feb 17, 2021

mattip commented Feb 17, 2021

mattip commented Feb 17, 2021

obfusk commented Feb 17, 2021

obfusk commented Feb 17, 2021

blueyed commented Feb 17, 2021

obfusk commented Feb 17, 2021

This comment has been minimized.

blueyed left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

obfusk Feb 26, 2021 • edited Loading

Choose a reason for hiding this comment

obfusk commented Feb 25, 2021

mattip commented Mar 15, 2021

obfusk commented Mar 19, 2021

obfusk commented Feb 16, 2021 •

edited

Loading

obfusk Feb 26, 2021 •

edited

Loading