Supported Languages and Systems

The goal of Romanization.NET is to provide a simple, extensive way to romanize widely-used languages as accurately as possible.

Below is a list of all supported languages and systems, with explanations of caveats and limitations where necessary. Languages are listed in alphabetical order.

Chinese

The Hànyǔ Pīnyīn system is considered a Readings System, and supports all Hànzì characters in the Unihan database.

The reading types to use can be specified, but default to using all of them.

The order in which readings are returned is as follows:

  1. Hànyǔ Pīnyīn
  2. Hànyǔ Pínlǜ - Hànyǔ Pīnyīn as it appeared in Xiàndài Hànyǔ Pínlǜ Cídiǎn
  3. XHC - Hànyǔ Pīnyīn as it appeared in Xiàndài Hànyǔ Cídiǎn
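As a rough illustration of what a Readings System does, the sketch below collects readings for one character from Unihan-style lines and returns them in the order listed above. It is not the library's API: `readings_for` and `parse_field` are hypothetical helpers, the mapping of the three reading types to the Unihan fields `kHanyuPinyin`, `kHanyuPinlu`, and `kXHC1983` is an assumption, and the sample lines only mimic the Unihan value format rather than quoting the database.

```python
import re

# Assumed Unihan field names for the three reading types, in return order.
READING_ORDER = ["kHanyuPinyin", "kHanyuPinlu", "kXHC1983"]


def parse_field(field, value):
    """Extract the bare readings from one Unihan field value."""
    readings = []
    for entry in value.split():
        if field == "kHanyuPinlu":
            # Entries look like "reading(frequency)".
            readings.append(re.sub(r"\(\d+\)$", "", entry))
        else:
            # Entries look like "location[,location]:reading[,reading]".
            readings.extend(entry.split(":", 1)[1].split(","))
    return readings


def readings_for(char, unihan_lines, types=READING_ORDER):
    """Return (field, readings) pairs for char, ordered as in READING_ORDER."""
    code = f"U+{ord(char):04X}"
    found = {}
    for line in unihan_lines:
        codepoint, field, value = line.split("\t", 2)
        if codepoint == code and field in types:
            found[field] = parse_field(field, value)
    return [(t, found[t]) for t in READING_ORDER if t in types and t in found]


# Illustrative sample lines (made-up locations and frequencies).
SAMPLE = [
    "U+884C\tkXHC1983\t0001.010:háng 0002.020:xíng",
    "U+884C\tkHanyuPinlu\txíng(100) háng(10)",
    "U+884C\tkHanyuPinyin\t10001.010:xíng,háng",
]

all_readings = readings_for("行", SAMPLE)
only_xhc = readings_for("行", SAMPLE, types=["kXHC1983"])
```

Passing a shorter `types` list restricts the result to those reading types, while the relative order of the remaining types stays the same.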

Japanese

The Modified Hepburn system is a revised version of the romanization system first published by James Curtis Hepburn, and is the one in most widespread use in Japan.

It only supports Kana (Hiragana and Katakana), not Kanji. See below for Kanji support.

This supports syllabic n (ん), long consonants (sokuon, or っ), and long vowels (chōonpu (ー) only).
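The three features above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `romanize_kana` is a hypothetical function, and the `KANA` table is deliberately partial, covering only the kana used in the examples.

```python
# Partial Hepburn table (hypothetical, demo-sized); two-character keys are digraphs.
KANA = {
    "に": "ni", "ぽ": "po", "き": "ki", "て": "te", "ま": "ma",
    "ち": "chi", "ちゃ": "cha", "ほ": "ho", "や": "ya",
    "ラ": "ra", "メ": "me",
}
MACRON = {"a": "ā", "i": "ī", "u": "ū", "e": "ē", "o": "ō"}
VOWELS_AND_Y = set("aiueoy")


def romanize_kana(text):
    out = []
    geminate = False  # set when a sokuon precedes the next syllable
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "っッ":  # sokuon: double the following consonant
            geminate = True
            i += 1
            continue
        if ch in "んン":  # syllabic n: apostrophe before a vowel or y
            nxt = KANA.get(text[i + 1:i + 2], "")
            out.append("n'" if nxt[:1] in VOWELS_AND_Y else "n")
            i += 1
            continue
        if ch == "ー":  # chōonpu: lengthen the preceding vowel
            if out and out[-1][-1:] in MACRON:
                out[-1] = out[-1][:-1] + MACRON[out[-1][-1]]
            i += 1
            continue
        pair = text[i:i + 2]
        key = pair if pair in KANA else ch
        roman = KANA.get(key, key)  # unknown characters pass through unchanged
        if geminate:
            # Hepburn writes a doubled "ch" as "tch".
            roman = ("t" if roman.startswith("ch") else roman[:1]) + roman
            geminate = False
        out.append(roman)
        i += len(key)
    return "".join(out)


examples = {w: romanize_kana(w) for w in ["にっぽん", "まっちゃ", "ほんや", "ラーメン"]}
```

With this table, the four example words come out as `nippon`, `matcha`, `hon'ya`, and `rāmen`.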

Limitations

In the Modified Hepburn system, certain pairs of subsequent vowels in the romanized result are to be combined into single long vowels, often indicated with a macron (aa => ā, for example).

According to the system's specification, however, whether two adjacent vowels are combined depends on whether they belong to different morphemes, which is not something the program can know. Some vowel combinations do not have this requirement and could be applied, but to keep the output consistent, no vowel combination is performed.

Kanji are effectively Japan's Hànzì, and share many of the same considerations and even symbols.

While Kana are syllabaries (each character is one syllable, and therefore maps neatly to a distinct sound), Kanji are logographic symbols whose readings can span a variable number of syllables. To make things more complicated, each Kanji can have multiple readings, in both Kun'yomi and On'yomi.

This is why this system is considered a Readings System for the purposes of this library, which means you can get every known reading from the Unihan database for each character.

The two reading types supported are:

  1. Kun'yomi - often referred to as just Kun - the native reading
  2. On'yomi - often referred to as just On - the Sino-Japanese reading

Additional Notes

Because Kanji often appear alongside supplementary Kana, the system also has a small convenience function that romanizes both Kanji and Kana, using the system of your choice for Kana.

Korean

The Revised Romanization of Korean system is the most commonly used, and does not make use of accents or macrons.

The system has a few provisions for certain kinds of content, which change the romanization somewhat:

  • Certain special pairs of Jamo are not combined in given names
  • Whether or not aspiration is reflected in the romanization depends on whether or not the word is a noun
  • Sometimes it can be helpful to hyphenate syllables, which occasionally makes a difference in disambiguating words with the same romanization (ga-eul vs. gae-ul)

The library's implementation of this system supports all of these provisions as options that can be supplied to the function.
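To show why the hyphenation option matters, here is a simplified sketch that decomposes precomposed Hangeul syllables with Unicode arithmetic and romanizes each one independently. It is an assumption-laden illustration rather than the library's implementation: `romanize_hangul` and its `hyphenate` parameter are hypothetical names, the final-consonant table uses syllable-final values only, and none of the sound-change rules between syllables are applied.

```python
# Revised Romanization letters indexed by Unicode jamo order.
INITIALS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
            "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
MEDIALS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
           "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
# Simplified: each final is given its syllable-final value, ignoring assimilation.
FINALS = ["", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
          "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"]

HANGUL_BASE = 0xAC00   # first precomposed syllable (가)
SYLLABLE_COUNT = 11172  # 19 initials * 21 medials * 28 finals


def romanize_syllable(ch):
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < SYLLABLE_COUNT:
        return ch  # not a precomposed Hangeul syllable; leave as-is
    initial, rest = divmod(offset, 21 * 28)
    medial, final = divmod(rest, 28)
    return INITIALS[initial] + MEDIALS[medial] + FINALS[final]


def romanize_hangul(text, hyphenate=False):
    sep = "-" if hyphenate else ""
    words = text.split(" ")
    return " ".join(sep.join(romanize_syllable(c) for c in w) for w in words)


plain = (romanize_hangul("가을"), romanize_hangul("개울"))
hyphenated = (romanize_hangul("가을", hyphenate=True),
              romanize_hangul("개울", hyphenate=True))
```

Without hyphens, both 가을 and 개울 come out as `gaeul`; with hyphens they become `ga-eul` and `gae-ul`, which is the ambiguity described in the list above.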

Hanja => Hangeul Readings

Hanja, like Kanji, came from China and share their symbols with Hànzì. As a result, this is also considered a Readings System as some Hanja have multiple possible readings.

As with the other Hànzì-related characters, the supported Hanja are all from the Unihan database.

Only one reading type is supported, which is the Hangeul equivalent pronunciation for each Hanja character.

Additional Notes

Because the goal of this package is, as the name suggests, romanization, the implementation also includes a function for first converting the Hanja to Hangeul, then romanizing the Hangeul using the system of your choice.

Russian

At the time of writing, Russian has no single international standard for romanization/transliteration. Instead, different systems are used by different groups for different purposes. As a result, many systems exist, all of which produce very similar transliterations.

The BGN/PCGN system was developed jointly by the United States Board on Geographic Names and the Permanent Committee on Geographical Names for British Official Use, and is designed to be easier for anglophones to pronounce.

Because of this, it's likely a solid choice for romanizing text specifically for English speakers (US/CA/UK audience).

GOST 7.79-2000(A) focuses on mapping one Cyrillic character to one Latin character, potentially with diacritics.

ISO 9:1995 is the current standard for Slavic transliteration from the ISO, and is based on ISO/R 9:1968.

The two systems are functionally identical and in this library are combined into one, under the name of GOST 7.79-2000 System A. This is to retain consistency with the other GOST systems included, as it may be strange to have GOST 7.79-2000 System B but have A under a different name.

In contrast to the above, GOST 7.79-2000(B) focuses on mapping one Cyrillic character to potentially several Latin characters (e.g. щ -> shh), but without the use of diacritics.
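The difference between the two GOST 7.79-2000 systems can be sketched with a pair of lookup tables. This is only an illustration: the tables are deliberately partial (a handful of lowercase letters), `transliterate` is a hypothetical helper, and the System A values shown are the ISO 9:1995 single-character forms.

```python
# Partial tables covering only the letters used in the examples below.
SYSTEM_A = {  # one Cyrillic character -> one Latin character, with diacritics
    "ж": "ž", "ч": "č", "ш": "š", "щ": "ŝ",
    "а": "a", "к": "k", "у": "u",
}
SYSTEM_B = {  # one Cyrillic character -> one or more ASCII characters
    "ж": "zh", "ч": "ch", "ш": "sh", "щ": "shh",
    "а": "a", "к": "k", "у": "u",
}


def transliterate(text, table):
    # Characters missing from the partial table are passed through unchanged.
    return "".join(table.get(ch, ch) for ch in text)


pike_a = transliterate("щука", SYSTEM_A)
pike_b = transliterate("щука", SYSTEM_B)
beetle_a = transliterate("жук", SYSTEM_A)
beetle_b = transliterate("жук", SYSTEM_B)
```

For щука this gives `ŝuka` under System A and `shhuka` under System B. System A keeps the output the same length as the input, while System B stays within plain ASCII.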

GOST 16876-71(1) focuses on mapping one Cyrillic character to one Latin character, potentially with diacritics.

It was recommended by the United Nations Group of Experts on Geographical Names (UNGEGN) in 1987.

GOST 16876-71 was most recently updated in 1980, and was abandoned in favour of GOST 7.79-2000 in 2002 by the Russian Federation.

GOST 16876-71(2) is another table in GOST 16876-71, and focuses on mapping one Cyrillic character to potentially several Latin characters (e.g. щ -> shh), but without the use of diacritics.

The Scholarly transliteration system for Russian actually covers many Slavic languages, with Russian being one of them. It tries to preserve the pronunciation of the original characters while remaining unambiguous about its transformations.

Similar to the scholarly system, ISO/R 9 was created in 1954 and updated in 1968. It also supports many Slavic languages, and was the ISO's earliest adoption of scholarly transliteration.

The ALA-LC system was initially established in 1904, and has remained largely unchanged since 1941. It is used primarily in US, Canadian, and British libraries.

This system uses some diacritics, and uses two-letter tie characters for some Cyrillic characters.

BS 2979:1958 is the main system of Oxford University Press, and was used by the British Library up until 1975.

The ALA-LC system is now used by the British Library instead.

ICAO Doc 9303 was created by the International Civil Aviation Organization, a UN agency, and is designed to make travel documents machine-readable.

It contains tables for transliteration to Latin characters from many alphabets, including Cyrillic. The system uses no diacritics whatsoever, only standard ASCII characters.

The system was put into effect by the Russian government in 2013 for all citizen passports.

A separate system is generally used for romanization on road signs and the like.

This originally followed GOST 10807-78 (tables 17, 18), but now follows GOST R 52290-2004 (tables Г.4, Г.5).