Preserving word order for Arabic, Hebrew #582

Open
malthe opened this issue Jun 6, 2024 · 3 comments

malthe commented Jun 6, 2024

Consumers of Unicode text in RTL (right-to-left) languages such as Arabic or Hebrew must identify the string direction, for example by observing the strong directional property that Unicode assigns to some characters, such as Arabic letters.

That is, if for example a paragraph begins with an Arabic letter, we should align the whole paragraph right and render the glyphs right to left as we progress logically through the string.
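As a rough illustration of the "first strong character" idea above, here is a minimal sketch using only the JDK's `Character.getDirectionality`. The class and method names are hypothetical and not part of docx4j, and the sketch ignores the isolate-skipping details of the full Unicode Bidirectional Algorithm.

```java
// Hypothetical helper, not part of docx4j: guesses paragraph direction
// from the first strongly directional character in the string.
final class FirstStrongDirection {

    /** Returns true if the first strong character is right-to-left. */
    static boolean isRtl(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            switch (Character.getDirectionality(cp)) {
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT:        // e.g. Hebrew letters
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC: // e.g. Arabic letters
                    return true;
                case Character.DIRECTIONALITY_LEFT_TO_RIGHT:        // e.g. Latin letters
                    return false;
                default:
                    break; // weak or neutral (digits, spaces, punctuation): keep scanning
            }
            i += Character.charCount(cp);
        }
        return false; // assumption: no strong character means left-to-right
    }

    public static void main(String[] args) {
        // Hebrew word followed by Latin text
        System.out.println(isRtl("\u05E9\u05DC\u05D5\u05DD world"));
    }
}
```

Falling back to left-to-right when no strong character is found is a choice made for this sketch; a caller could equally inherit the direction from the surrounding context.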

In our testing, this does not seem to happen automatically in this library; bidi elements are not emitted.

While there's an ArabicScriptProcessor, we often don't know the specific language of a given paragraph.

Shouldn't this be more or less an automatic process, working out of the box?

Owner

plutext commented Jun 8, 2024

If you are creating docx files, you need to set w:pPr/w:bidi and w:rPr/w:rtl appropriately, as well as w:rPr/w:lang.
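To make the markup concrete, here is a small sketch that builds the WordprocessingML for one paragraph as a plain string, with `w:bidi` in the paragraph properties and `w:rtl` in the run properties. The class and method names are hypothetical, this is not how docx4j itself serializes content, and the `w` prefix is assumed to be declared on an enclosing element. `w:lang` is left out because the sketch assumes the language is unknown.

```java
// Hypothetical helper, not part of docx4j: shows where w:bidi and w:rtl
// sit in the XML of a single-run paragraph.
final class BidiParagraphXml {

    /** Escapes the characters that are significant in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a w:p fragment; the w namespace must be declared by the caller. */
    static String paragraph(String text, boolean rtl) {
        StringBuilder sb = new StringBuilder("<w:p>");
        if (rtl) {
            sb.append("<w:pPr><w:bidi/></w:pPr>"); // paragraph-level direction
        }
        sb.append("<w:r>");
        if (rtl) {
            sb.append("<w:rPr><w:rtl/></w:rPr>"); // run-level direction
        }
        sb.append("<w:t xml:space=\"preserve\">")
          .append(escape(text))
          .append("</w:t></w:r></w:p>");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(paragraph("\u05E9\u05DC\u05D5\u05DD", true));
    }
}
```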

See:

A program exporting docx then needs to be sensitive to these attributes. For example, docx4j's PDF output via FO should do this correctly.

If these attributes are not present, then the procedure recommended in your Strings on the Web reference might be a good fallback. (I wonder what Word does?)

Or are you suggesting that docx4j is the consumer and as such the methods to add text to a run at https://github.com/plutext/docx4j/blob/VERSION_11_4_12/docx4j-openxml-objects/src/main/java/org/docx4j/wml/R.java#L201 should set appropriate attributes?

Author

malthe commented Jun 12, 2024

We're basically adding a paragraph of text to the main document's content, providing a regular string when creating a Text object:

Text t = factory.createText();
t.setValue(string);

Unicode text can be bidirectional, implicitly and sometimes explicitly, so it would be convenient to have a way to create a content element from a string that automatically works out whether bidi elements are necessary.

But we don't know whether the string is Arabic, Hebrew, or a mix of languages, which is why something like an Arabic script processor doesn't really make sense. Ideally, this interface should simply follow the best practices for interpreting Unicode as a bidirectional language container.
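One language-agnostic way to approach this might be the JDK's `java.text.Bidi`, which implements the Unicode Bidirectional Algorithm. The sketch below splits a string into directional runs, each of which could then become its own run with or without `w:rtl`. The `BidiRuns` and `Run` names are hypothetical, and nothing here is an existing docx4j API.

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of docx4j: splits text into directional runs.
final class BidiRuns {

    /** A slice of the input text together with its resolved direction. */
    static final class Run {
        final String text;
        final boolean rtl;
        Run(String text, boolean rtl) { this.text = text; this.rtl = rtl; }
    }

    /** True if the paragraph's base direction resolves to right-to-left. */
    static boolean paragraphIsRtl(String text) {
        // DIRECTION_DEFAULT_LEFT_TO_RIGHT: take the base direction from the
        // first strong character, falling back to LTR if there is none.
        return !new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT).baseIsLeftToRight();
    }

    /** Splits text into runs, in logical order, by resolved embedding level. */
    static List<Run> split(String text) {
        List<Run> runs = new ArrayList<>();
        if (text.isEmpty()) {
            return runs;
        }
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String part = text.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            boolean rtl = (bidi.getRunLevel(i) & 1) == 1; // odd level means RTL
            runs.add(new Run(part, rtl));
        }
        return runs;
    }

    public static void main(String[] args) {
        for (Run r : split("Hello \u05E9\u05DC\u05D5\u05DD")) {
            System.out.println((r.rtl ? "RTL: " : "LTR: ") + r.text);
        }
    }
}
```

A caller could use `paragraphIsRtl` to decide whether the paragraph needs `w:bidi`, and the per-run flag to decide which runs need `w:rtl`.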

Owner

plutext commented Jun 17, 2024
