Multilingual Coref

AngledLuffa released this 12 Sep 19:40


multilingual coref!

Added models which cover several different languages: one for the combined Germanic and Romance languages, and one for the Slavic languages available in UDCoref #1406

new features

streamlit visualizer for semgrex/ssurgeon #1396
updates to the constituency parser ensemble #1387
accuracy improvements to the IN_ORDER oracle #1391
Split-only MWT model, which only splits tokens and therefore cannot hallucinate text, as sometimes happens for OOV words. Currently available for EN and HE #1417 #1419
download_method=None now turns off HF downloads as well, for use on machines with no internet access #1408 #1399
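
For the offline case in the last item above, a minimal sketch of building a pipeline with `download_method=None`. The language, the processor list, and the helper name `build_offline_pipeline` are illustrative assumptions, and the models are assumed to already be in the local resources directory.

```python
# Keyword arguments for an offline pipeline.
# download_method=None is the setting named in the release notes;
# the language and processor list are illustrative assumptions.
OFFLINE_KWARGS = {
    "lang": "en",
    "processors": "tokenize",
    "download_method": None,
}


def build_offline_pipeline(**overrides):
    """Build a stanza Pipeline that should not attempt any downloads.

    Hypothetical helper for illustration. The models must already
    exist locally, otherwise loading is expected to fail.
    """
    # Imported lazily so this snippet can be loaded without stanza installed.
    import stanza

    kwargs = {**OFFLINE_KWARGS, **overrides}
    return stanza.Pipeline(**kwargs)
```

Calling `build_offline_pipeline(lang="he")` would reuse the same offline setting for another language, again assuming its models were downloaded beforehand.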

new models

Spanish combined models #1395
Add IACLT knesset to the HE combined models
NER based on IACLT
XCL (Classical Armenian) models with word vectors from Caval

bugfixes

update tqdm usage to remove some duplicate code: #1413 3de69ca
long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: #1410
Occasionally train the tokenizer with the sentence-final punctuation of a batch removed. This helps the tokenizer avoid learning to always split off the last character, whether or not it is punctuation. This was also related to the Spanish tokenization issue: 56350a0
actually include the visualization: #1421 thank you @bollwyvl
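
The punctuation-dropping idea from the tokenizer fix above can be sketched as a small augmentation step. This is not the code from the commit: the function name, the punctuation set, and the probability are all assumptions made for illustration.

```python
import random

# Hypothetical set of sentence-final punctuation marks;
# the actual training code may use a different criterion.
FINAL_PUNCT = {".", "!", "?", "。", "？", "！"}


def drop_final_punct(batch, prob=0.1, rng=random):
    """Return a copy of `batch` (a list of text chunks) in which, with
    probability `prob`, the final punctuation of the last chunk is removed."""
    # Most of the time the batch is passed through unchanged.
    if not batch or rng.random() >= prob:
        return list(batch)
    stripped = batch[-1].rstrip()
    # Nothing to remove if the batch does not end in punctuation.
    if not stripped or stripped[-1] not in FINAL_PUNCT:
        return list(batch)
    return list(batch[:-1]) + [stripped[:-1].rstrip()]
```

With a small `prob`, most batches keep their punctuation, so the tokenizer still sees the common case while occasionally seeing text that ends on an ordinary character.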

Contributors

bollwyvl
