Heterophones not in the list? #23

Open
sancarn opened this issue Jun 26, 2021 · 3 comments

Comments

sancarn commented Jun 26, 2021

Heterophones like "tear" and "record" are not present in the IPA dictionary for English. Is there a specific reason for this? As I understand it, these wouldn't fit neatly into the data structure. Or is it just an oversight?

Member

dohliam commented Jun 26, 2021

Thanks for raising this question, but I'm not sure I quite understand the issue. Heterophones are explicitly included in the data structure, as described in the Readme:

> Where multiple possible pronunciations exist for a given entry, they should all be listed (separated by commas), even if they have different senses. For example, the word est has two different pronunciations in French (/ɛst/ and /ɛ/), depending on whether it is a noun or an (unrelated) verb, so the entry for est lists both of these pronunciations.

Furthermore, the entries for tear and record are included in the English dictionary as follows:

tear	/ˈtɛɹ/, /ˈtɪɹ/
record	/ˈɹɛkɝd/, /ɹəˈkɔɹd/, /ɹɪˈkɔɹd/

These seem to pretty unambiguously include the multiple possible pronunciations for each word.
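For anyone checking this programmatically, here is a minimal sketch of reading entries in the format shown above. It assumes each line is a word, a tab, and then a comma-separated list of pronunciations wrapped in slashes; the function name `parse_entry` and the inline sample are illustrative only and are not part of the project.

```python
def parse_entry(line: str) -> tuple[str, list[str]]:
    """Split 'word<TAB>/ipa1/, /ipa2/' into (word, [ipa1, ipa2]).

    Assumes the tab-separated layout shown in the entries above.
    """
    word, _, ipa_field = line.rstrip("\n").partition("\t")
    # Each pronunciation is wrapped in slashes and separated by commas.
    pronunciations = [
        part.strip().strip("/") for part in ipa_field.split(",") if part.strip()
    ]
    return word, pronunciations


# The two entries quoted above, used here as sample input.
sample = "tear\t/ˈtɛɹ/, /ˈtɪɹ/\nrecord\t/ˈɹɛkɝd/, /ɹəˈkɔɹd/, /ɹɪˈkɔɹd/"

entries = dict(parse_entry(line) for line in sample.splitlines())

# Words with more than one listed pronunciation.
multi = {word: ipa for word, ipa in entries.items() if len(ipa) > 1}

for word, ipa in multi.items():
    print(word, len(ipa), ipa)
```

Running the same filter over a whole dictionary file would list every word that has more than one pronunciation recorded.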

If you are referring to the data in the en_UK list, this list derives from the ipacards project and is somewhat less complete than the cmudict-ipa based dictionary used by en_US. In that case, it is simply a matter of the en_UK dictionary needing further additions/contributions to make up for any missing words. Pull requests to update the dictionary are very welcome!

Author

sancarn commented Jun 26, 2021

> If you are referring to the data in the en_UK list, this list derives from the ipacards project and is somewhat less complete than the cmudict-ipa based dictionary used by en_US. In that case, it is simply a matter of the en_UK dictionary needing further additions/contributions to make up for any missing words. Pull requests to update the dictionary are very welcome!

Right, yes! Sorry, I was. I am British, so I naturally gravitated towards that list 😛

On the topic of heterophones, as this would affect my pull request: what would the opinion be about including different pronunciations from different accents? For instance, grass is pronounced /ɡrɑːs/ in most areas of Britain but /ɡraːs/ in other dialects. Would we include them all?

Member

dohliam commented Jun 28, 2021

@sancarn The more accents/dialects/speech varieties the better! The current approach is to separate these into different dictionaries so that the list for each language variant is, internally speaking, as phonemically consistent as possible. So, just as I'm hoping someone will eventually generate en_AU and en_IE (not to mention en_SG, etc.) dictionaries, any and all regional variants from around the UK would also be very welcome as new dictionary lists. (This is all assuming that the distinction between two different pronunciations is indeed a matter of geographic region and not just a case of allophones within the same standard or community.)

In terms of expanding the en_UK dictionary, I took a look at the ipacards repo and if I recall correctly our version is taken from the pre-generated list there which contains about 65K entries.

Looking at the code that generated that list, it appears that heteronyms are intentionally stripped out based on a list of 972 heteronyms (which includes both tear and record, among others). So it should be possible to regenerate this list with the missing heteronyms by removing those three lines (or by just running the script again using the heteronyms file as the main vocabulary source, which might be easier).
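To illustrate the idea, here is a rough sketch of what a filtering step of this kind might look like and what skipping it changes. It does not reproduce the actual ipacards code; `build_vocabulary`, its `strip_heteronyms` flag, and the tiny word lists are all hypothetical.

```python
def build_vocabulary(words, heteronyms, strip_heteronyms=True):
    """Return the words that would be passed on to IPA generation.

    With strip_heteronyms=True, any word found in the heteronym list is
    dropped, which is the behaviour described above. With False, every
    word is kept.
    """
    if strip_heteronyms:
        return [word for word in words if word not in heteronyms]
    return list(words)


# Hypothetical stand-ins for the main vocabulary and the heteronym list.
words = ["grass", "tear", "record"]
heteronyms = {"tear", "record"}

filtered = build_vocabulary(words, heteronyms)
unfiltered = build_vocabulary(words, heteronyms, strip_heteronyms=False)

# Words that go missing from the output because of the filter.
missing = sorted(set(unfiltered) - set(filtered))
print(filtered)
print(missing)
```

The second option mentioned above corresponds to passing the heteronym list itself as `words` with the filter disabled, then merging the result into the existing dictionary.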

If you'd like to give this a try, please go ahead, and I would be happy to accept the resulting PR. I would also be glad to look into this myself but likely won't have a chance to do so until early August at the earliest.
