Preserving word order for Arabic, Hebrew #582

malthe · 2024-06-06T12:07:37Z

In Unicode text, consumers of RTL (right-to-left) text such as Arabic or Hebrew must identify the string direction, for example by observing the strong directional property of certain characters such as Arabic letters.

That is, if for example a paragraph begins with an Arabic letter, we should align the whole paragraph right and render the glyphs right to left as we progress logically through the string.
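The "first strong character" heuristic described above can be sketched with the JDK alone. This is only an illustration and not part of docx4j: the class name FirstStrong and the method startsRightToLeft are hypothetical, and the choice to fall back to left-to-right when no strong character is found is an assumption.

```java
final class FirstStrong {
    /**
     * Returns true if the first strongly directional character in the text
     * is right-to-left (Hebrew, Arabic, and so on). Neutral and weak
     * characters such as digits, spaces, and punctuation are skipped.
     * Assumption: text with no strong character is treated as left-to-right.
     */
    static boolean startsRightToLeft(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            switch (Character.getDirectionality(cp)) {
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT:
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:
                    return true;
                case Character.DIRECTIONALITY_LEFT_TO_RIGHT:
                    return false;
                default:
                    break; // weak or neutral: keep scanning
            }
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

A caller could use the result to decide whether a paragraph should be marked as right-to-left before building the document content.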

In our testing, this does not seem to happen automatically in this library; bidi elements are not emitted.

While there's an ArabicScriptProcessor, we often don't know the specific language of a given paragraph.

Shouldn't this be more or less an automatic process, working out of the box?

plutext · 2024-06-08T01:33:52Z

If you are creating docx files, you need to set w:pPr/w:bidi and w:rPr/w:rtl appropriately, as well as w:pPr/w:lang.

See:

A program exporting docx then needs to be sensitive to these attributes. For example, docx4j's PDF output via FO should do this correctly.

If these attributes are not present, then the procedure recommended in your Strings on the Web reference might be a good fallback. (I wonder what Word does?)

Or are you suggesting that docx4j is the consumer and as such the methods to add text to a run at https://github.com/plutext/docx4j/blob/VERSION_11_4_12/docx4j-openxml-objects/src/main/java/org/docx4j/wml/R.java#L201 should set appropriate attributes?

malthe · 2024-06-12T07:03:54Z

We're basically adding a paragraph of text to the main document's content, providing a regular string when creating a Text object:

Text t = factory.createText();
t.setValue(string);

Now, Unicode text can be bidirectional, implicitly and sometimes explicitly, and it would be convenient if there were a way to create a content element from a string that automatically figured out whether bidi elements are necessary.

But we don't know if the string is Arabic, Hebrew, or a mix of languages, which is why something like an Arabic script processor doesn't really make sense. Ideally, this interface should simply follow the best practices for treating a Unicode string as potentially bidirectional text.

plutext · 2024-06-17T08:05:51Z

Does https://docs.oracle.com/javase/7/docs/api/java/text/Bidi.html help you?
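As a rough sketch of how java.text.Bidi might be used here, the snippet below splits a paragraph into directional runs without knowing its language. The BidiRuns class and its nested Run type are hypothetical helpers, not docx4j API; the assumption is that each run flagged rtl would become a docx run with w:rPr/w:rtl set, and that w:pPr/w:bidi would follow the paragraph's base direction (available from Bidi.baseIsLeftToRight()).

```java
import java.text.Bidi;
import java.util.ArrayList;
import java.util.List;

final class BidiRuns {
    /** Hypothetical holder: a slice of the paragraph plus its direction. */
    static final class Run {
        final String text;
        final boolean rtl;

        Run(String text, boolean rtl) {
            this.text = text;
            this.rtl = rtl;
        }
    }

    /**
     * Splits a paragraph into directional runs in logical order.
     * The base direction is taken from the first strong character,
     * defaulting to left-to-right if there is none.
     */
    static List<Run> split(String paragraph) {
        Bidi bidi = new Bidi(paragraph, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        List<Run> runs = new ArrayList<>();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String text = paragraph.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            // An odd embedding level means the run is right-to-left.
            boolean rtl = (bidi.getRunLevel(i) & 1) == 1;
            runs.add(new Run(text, rtl));
        }
        return runs;
    }
}
```

Each Run keeps its text in logical order, so the string itself is not reordered; only the direction flag would need to be carried into the run properties.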
