Releases: stanfordnlp/stanza
Stanza v1.5.0
Ssurgeon interface
Headlining this release is the initial release of Ssurgeon, a rule-based dependency graph editing tool. Building on the existing Semgrex integration with CoreNLP, Ssurgeon allows rewriting of dependency graphs, such as those in the UD datasets. More information is in the GURT 2023 paper: https://aclanthology.org/2023.tlt-1.7/
Alongside Ssurgeon, this release includes two other CoreNLP integrations, a long list of bugfixes, a few other minor features, and a long list of constituency parser experiments whose results ranged from "ineffective" to "small improvements" and which are available for people to experiment with.
CoreNLP integration:
- Ssurgeon interface! New interface allows for editing of dependency graphs using Semgrex patterns and Ssurgeon rules. #1205 https://aclanthology.org/2023.tlt-1.7/
- English Morphology class (deterministic English lemmatizer) 6aed177
- English constituency -> dependency converter 0987794
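To give a feel for what a Ssurgeon rule does, here is a pure-Python toy (emphatically not the Ssurgeon API or its rule syntax): a dependency graph as head/relation/dependent triples, and a rewrite that relabels every edge picked out by a match condition, which is the same shape as a Semgrex match driving a Ssurgeon relabel operation.

```python
# Toy illustration of a Ssurgeon-style edit (NOT the Ssurgeon API):
# a dependency graph as (head, deprel, dependent) triples, plus a rule
# that relabels every edge whose parts satisfy a match condition.

def relabel_edges(graph, match, new_deprel):
    """Return a new edge list with matching edges relabeled."""
    return [(head, new_deprel if match(head, deprel, dep) else deprel, dep)
            for head, deprel, dep in graph]

# "I watched dogs run": suppose an older conversion labeled "run" as an
# obj of "watched", and a UD revision wants xcomp instead.
graph = [("watched", "nsubj", "I"),
         ("watched", "obj", "dogs"),
         ("watched", "obj", "run")]

fixed = relabel_edges(graph,
                      match=lambda h, rel, d: rel == "obj" and d == "run",
                      new_deprel="xcomp")
```

Real Ssurgeon rules are declarative strings interpreted by CoreNLP, and Semgrex patterns can match on much richer node and edge attributes; see the linked paper for the actual rule language.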
Bugfixes:
- Bugfix for older versions of torch: 376d7ea
- Bugfix for training (integration with new scoring script) #1167 9c39636
- Demo was showing constituency parser along with dependency parsing, even with conparse off: cbc13b0
- Replace absurdly long characters with UNK (thank you @khughitt) #1137 #1140
- Package all relevant pretrains into default.zip - otherwise pretrains used by NER models which are not the default pretrain were being missed. 435685f
- stanza-train NER training bugfix (wrong pretrain): 2757cb4
- Pass around device everywhere instead of calling cuda(). This should fix models occasionally being split over multiple devices. It would also allow for use of MPS, but the current torch implementation for MPS is buggy #1209 #1159
- Fix error in preparing tokenizer datasets (thanks @dvzubarev): #1161
- Fix unnecessary slowness in preparing tokenizer datasets (again, thanks @dvzubarev): #1162
- Fix using the correct pretrain when rebuilding POS tags for a Depparse dataset (again, thanks @dvzubarev): #1170
- When using the tregex interface to corenlp, add parse if it isn't already there (again, depparse was being confused with parse): b118473
- Update use of emoji to match latest releases: #1195 ea345a8
Features:
- Mechanism for resplitting tokens into MWT #95 8fac17f
- CLI for tokenizing text into one paragraph per line, whitespace separated (useful for Glove, for example) cfd44d1
- `detach().cpu()` speeds things up significantly in some cases ccfbc56
- Potentially use a constituency model as a classifier - WIP research project #1190
- Add an output format `"{:C}"` for document objects which prints out documents as CoNLL: #1169
- If a constituency tree is available, include it when outputting conll format for documents: #1171
- Same with sentiment: abb5819
- Additional language code coverage (thank you @juanro49) 5802b10 f06bf86 32f83fa 3450575
- Allow loading a pipeline for new languages (useful when developing a new suite of models) e7fcd26
- Script to count the work done by annotators on aws sagemaker private workforce: #1186
- Streaming interface which batch processes items in the stream: 2c9fe3d #550
- Can pass a defaultdict to MultilingualPipeline, useful for specifying the processors for each language at once: 70fd2fd #1199
- Transformer at bottom layer of POS - currently only available in English as the `en_combined_bert` model, others to come #1132
New models:
- Armenian NER model using an NER labeling of armtdp (thanks to @ShakeHakobyan): https://github.com/myavrum/ArmTDP-NER #1206 #1212
- Sindhi tokenization from ISRA #1117
- Sindhi NER from SiNER: 2a8ded4
- Erzya from UD 2.11 0344ac3
Conparser experiments:
- Transformer stack (initial implementation did not help) https://arxiv.org/abs/2010.10669 110031e
- TREE_LSTM constituent composition method (didn't beat MAX) 2f722c8
- Learned weighting between bert layers (this did help a little) 2d0c69e
- Silver trees: train 10 models, use those models to vote on good trees, then use those trees to train new models. Helps smaller treebanks such as IT and VI, but has no effect on EN #1148
- New in_order_compound transition scheme: no improvement f560b08
- Multistage training with madgrad or adamw: definite improvement. madgrad included as optional dependency 2706c4b f500936
- Report the scores of tags when retagging (does not affect the conparser training) 7663419
- FocalLoss on the transitions using optional dependency: didn't help https://arxiv.org/abs/1708.02002 90a8337
- LargeMarginSoftmax: didn't help https://github.com/tk1980/LargeMarginInSoftmax 5edd724
- Maxout layer: didn't help https://arxiv.org/abs/1302.4389 c708ce7
- Reverse parsing: not expected to help, potentially can be useful when building silver treebanks. May also be useful as a two step parser in the future. 4954845
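The silver-tree experiment above boils down to keeping only trees that enough independently trained parsers agree on. A small sketch of that selection step (the agreement threshold and string-valued tree representation are illustrative assumptions, not the exact recipe from #1148):

```python
# Sketch of silver-tree voting: several parsers each propose a
# bracketing for a sentence; keep the sentence's tree only when at
# least min_votes models produce the identical tree. The threshold
# here is an assumption for illustration.

from collections import Counter

def vote_silver_tree(predictions, min_votes):
    """predictions: list of tree strings, one per model.
    Returns the most common tree if it has >= min_votes, else None."""
    tree, count = Counter(predictions).most_common(1)[0]
    return tree if count >= min_votes else None

preds = ["(S (NP I) (VP run))"] * 6 + ["(S (NP I run))"] * 4
print(vote_silver_tree(preds, min_votes=6))  # majority tree survives
```

Sentences where no tree clears the threshold are simply dropped, so the silver treebank trades coverage for label quality.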
Stanza v1.4.2
Stanza v1.4.2: Minor version bump to improve (python) dependencies
- Pipeline cache in Multilingual is a single OrderedDict #1115 (comment) ba3f64d
- Don't require `pytest` for all installations unless needed for testing #1120 8c1d9d8
- Hide SiLU and Mish imports if the version of torch installed doesn't have those nonlinearities #1120 6a90ad4
- Reorder & normalize installations in setup.py #1124
Stanza v1.4.1
Stanza v1.4.1: Improvements to pos, conparse, and sentiment, jupyter visualization, and wider language coverage
Overview
We improve the quality of the POS, constituency, and sentiment models, add an integration to displaCy, and add new models for a variety of languages.
New NER models
- New Polish NER model based on NKJP from Karol Saputa and ryszardtuora #1070 #1110
- Make GermEval2014 the default German NER model, including an optional Bert version #1018 #1022
- Japanese conversion of GSD by Megagon #1038
- Marathi NER dataset from L3Cube. Includes a Sentiment model as well #1043
- Thai conversion of LST20 555fc03
- Kazakh conversion of KazNERD de6cd25
Other new models
- Sentiment conversion of Tass2020 for Spanish #1104
- VIT constituency dataset for Italian 149f144 ... and many subsequent updates
- For UD models with a small train dataset & larger test dataset, flip the datasets: UD_Buryat-BDT, UD_Kazakh-KTB, UD_Kurmanji-MG, UD_Ligurian-GLT, UD_Upper_Sorbian-UFAL #1030 9618d60
- Spanish conparse model from multiple sources - AnCora, LDC-NW, LDC-DF 47740c6
Model improvements
- Pretrained charlm integrated into POS. Gives a small to decent gain for most languages without much additional cost #1086
- Pretrained charlm integrated into Sentiment. Improves English, others not so much #1025
- LSTM, 2d maxpool as optional items in the Sentiment model, from the paper "Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling" #1098
- First learn with AdaDelta, then with another optimizer in conparse training. Very helpful b1d10d3
- Grad clipping in conparse training 365066a
Pipeline interface improvements
- GPU memory savings: charlm reused between different processors in the same pipeline #1028
- Word vectors not saved in the NER models. Saves bandwidth & disk space #1033
- Functions to return tagsets for NER and conparse models #1066 #1073 36b84db 2db43c8
- displaCy integration with NER and dependency trees 2071413
Bugfixes
- Fix that it takes forever to tokenize a single long token (catastrophic backtracking in regex). TY to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook) #1056
- Starting a new corenlp client w/o server shouldn't wait for the server to be available. TY to Mariano Crosetti #1059 #1061
- Read raw GloVe word vectors (they have no header information) #1074
- Ensure that illegal languages are not chosen by the LangID model #1076 #1077
- Fix loading of previously unseen languages in Multilingual pipeline #1101 e551ebe
- Fix that conparse would occasionally train to NaN early in the training c4d7857
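The long-token bugfix above is a classic case of catastrophic regex backtracking: a pattern with nested, ambiguous quantifiers can take exponential time to *fail* on a long input. A generic illustration (not Stanza's actual tokenizer pattern) of the problem shape and the linear-time rewrite:

```python
import re

# Generic illustration of catastrophic backtracking (NOT Stanza's actual
# tokenizer regex). In (\w+\s*)+ the \w run can be split between group
# iterations in exponentially many ways because \s* may match empty, so
# a long token that ultimately fails to match forces the engine to try
# them all. An unambiguous equivalent runs in linear time.

SLOW = re.compile(r"^(\w+\s*)+$")   # ambiguous: nested quantifiers
FAST = re.compile(r"^[\w\s]*\w$")   # unambiguous: single char-class star

token = "a" * 30 + "!"              # long token that fails to match
# SLOW.match(token) would hang for a very long time on this failure;
# the unambiguous pattern rejects it immediately:
print(FAST.match(token))            # None: no exponential blowup
```

The two patterns accept the same strings (word-character runs separated by whitespace, ending in a word character); only the failure-time behavior differs, which is exactly what bites on a single absurdly long token.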
Improved training tools
- W&B integration for all models: can be activated with --wandb flag in the training scripts #1040
- New webpages for building charlm, NER, and Sentiment: https://stanfordnlp.github.io/stanza/new_language_charlm.html https://stanfordnlp.github.io/stanza/new_language_ner.html https://stanfordnlp.github.io/stanza/new_language_sentiment.html
- Script to download Oscar 2019 data for charlm from HF (requires `datasets` module) #1014
- Unify sentiment training into a Python script, replacing the old shell script #1021 #1023
- Convert sentiment to use .json inputs. In particular, this helps with languages with spaces in words such as Vietnamese #1024
- Slightly faster charlm training #1026
- Data conversion of WikiNER generalized for retraining / add new WikiNER models #1039
- XPOS factory now determined at start of POS training. Makes addition of new languages easier #1082
- Checkpointing and continued training for charlm, conparse, sentiment #1090 0e6de80 e5793c9
- Option to write the results of a NER model to a file #1108
- Add fake dependencies to a conllu formatted dataset for better integration with evaluation tools 6544ef3
- Convert an AMT NER result to Stanza .json cfa7e49
- Add a ton of language codes, including 3 letter codes for languages we generally treat as 2 letters 5a5e918 b32a98e and others
Stanza v1.4.0
Stanza v1.4.0: Transformer integration to NER and conparse
Overview
As part of the new Stanza release, we integrate transformer inputs to the NER and conparse modules. In addition, we now support several additional languages for NER and conparse.
Pipeline interface improvements
- Download resources.json and models into temp dirs first to avoid race conditions between multiple processors #213 #1001
- Download models for Pipelines automatically, without needing to call `stanza.download(...)` #486 #943
- Add ability to turn off downloads 68455d8
- Add a new interface where both processors and package can be set #917 f370429
- When using pretokenized tokens, get character offsets from text if available #967 #975
- If Bert or other transformers are used, cache the models rather than loading multiple times #980
- Allow for disabling processors on individual runs of a pipeline #945 #947
Other general improvements
- Upgrades to EN, IT, and Indonesian models #1003 #1008. IT improvements with the help of @attardi and @msimi
- Fix improper tokenization of Chinese text with leading whitespace #920 #924
- Check if a CoreNLP model exists before downloading it (thank you @Internull) #965
- Convert the run_charlm script to python #942
- stanza-train examples now compatible with the python training scripts #896
NER features
- Swedish model (thank you @EmilStenstrom) #912 #857
- Persian model #797
- Danish model 3783cc4
- Norwegian model (both NB and NN) 31fa23e
- Myanmar model (thank you UCSY) #845
- Fix inconsistencies in B/S/I/E tags #928 (comment) #961
- Add an option for multiple NER models at the same time, merging the results together #928 #955
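The B/S/I/E fix concerns tag sequences that violate the BIOES scheme, such as an I- or E- tag that doesn't continue a span of the same entity type. One common repair (illustrative only, not necessarily the exact rules applied in the linked fix) is to restart an orphaned continuation tag as a new span:

```python
# Illustrative BIOES repair (not necessarily the exact rules in #961):
# an I- or E- tag that does not continue a same-type B-/I- span is
# inconsistent, so restart it -- as S- if it was a lone E- (a complete
# one-word span), or as B- if it was an orphaned I-.

def repair_bioes(tags):
    fixed = []
    prev = "O"
    for tag in tags:
        if tag[0] in "IE":
            etype = tag[2:]
            continues = prev[0] in "BI" and prev[2:] == etype
            if not continues:
                tag = ("S-" if tag[0] == "E" else "B-") + etype
        fixed.append(tag)
        prev = tag
    return fixed

print(repair_bioes(["O", "I-PER", "E-PER", "E-LOC"]))
```

Consistent sequences pass through untouched, which makes a repair pass like this safe to run over an entire converted dataset.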
Constituency parser
- Dynamic oracle (improves accuracy a bit) #866
- Bugfix of () not being escaped when output in a tree eaf134c
- charlm integration by default #799
- Bert integration (not the default model) (thank you @vythaihn and @hungbui0411) 05a0b04 0bbe8d1
- Preemptive bugfix for incompatible devices from @zhaochaocs #989 #1002
- New models: DA, based on Arboretum; IT, based on the Turin treebank; JA, based on ALT; PT, based on Cintil; TR, based on Starlang; ZH, based on CTB7
Stanza 1.3.0: LangID and Constituency Parser
Overview
Stanza 1.3.0 introduces a language id model, a constituency parser, a dictionary in the tokenizer, and some additional features and bugfixes.
New features
- Langid model and multilingual pipeline, based on "A reproduction of Apple's bi-directional LSTM models for language identification in short strings" by Toftrup et al 2021 (154b0e8)
- Constituency parser, based on "In-Order Transition-based Constituent Parsing" by Jiangming Liu and Yue Zhang. Currently an `en_wsj` model is available, with more to come. (9031802)
- Evalb interface to CoreNLP. Useful for evaluating the parser - requires CoreNLP 4.3.0 or later
- Dictionary tokenizer feature. Noticeably improved performance for ZH, VI, TH (#776)
Bugfixes / Reliability
Stanza v1.2.3: Two new NER models and some minor bugfixes
Overview
In anticipation of a larger release with some new features, we make a small update to fix some existing bugs and add two more NER models.
Bugfixes
- Sentiment models would crash on no text (issue #769, fixed by 47889e3)
- Java processes as a context were not properly closed (a39d2ff)
Interface improvements
- Downloading tokenize now downloads mwt for languages which require it (issue #774, fixed by #777, from davidrft)
- NER model can finetune and save to/from different filenames (0714a01)
- NER model now displays a confusion matrix at the end of training (9bbd3f7)
NER models
Stanza v1.2.2
Overview
A regression in NER results occurred in 1.2.1 when fixing a bug in VI models based around spaces.
Bugfixes
- Fix Sentiment not loading correctly on Windows because of pickling issue (#742) (thanks to @BramVanroy)
- Fix NER bulk process not filling out data structures as expected (#721) (#722)
- Fix NER space issue causing a performance regression (#739) (#732)
Interface improvements
- Add an NER run script (#738)
Stanza v1.2.1
Overview
All models other than NER and Sentiment were retrained with the new UD 2.8 release. All of the updates include the data augmentation fixes applied in 1.2.0, along with new augmentations addressing tokenization issues and end-of-sentence issues. This release also features various enhancements, bug fixes, and performance improvements, along with 4 new NER models.
Model improvements
- Add Bulgarian, Finnish, Hungarian, Vietnamese NER models
- The Bulgarian model is trained on BSNLP 2019 data.
- The Finnish model is trained on the Turku NER data.
- The Hungarian model is trained on a combination of the NYTK dataset and earlier business and criminal NER datasets.
- The Vietnamese model is trained on the VLSP 2018 data.
- Furthermore, the script for preparing the lang-uk NER data has been integrated (c1f0bee)
- Use new word vectors for Armenian, including better coverage for the new Western Armenian dataset (d9e8301)
- Add copy mechanism in the seq2seq model. This fixes some unusual Spanish multi-word token expansion errors and potentially improves lemmatization performance. (#692 #684)
- Fix Spanish POS and depparse mishandling sentences with a leading `¿` missing (#699 #698)
- Fix tokenization breaking when a newline splits a Chinese token (#632 #531)
- Fix tokenization of parentheses in Chinese (452d842)
- Fix various issues with characters not present in UD training data such as ellipses characters or unicode apostrophe (db05552 f01a142 85898c5)
- Fix a variety of issues with Vietnamese tokenization - remove a language specific model improvement which got roughly 1% F1 but caused numerous hard-to-track issues (3ccb132)
- Fix spaces in Vietnamese words not being found in the embedding used for POS and depparse (1972122)
- Include UD_English-GUMReddit in the GUM models (9e6367c)
- Add Pronouns & PUD to the mixed English models (various data improvements made this more appealing) (f74bef7)
Interface enhancements
- Add ability to pass a Document to the pipeline in pretokenized mode (f88cd8c #696)
- Track comments when reading and writing conll files (#676, originally from @danielhers in #155)
- Add a proxy parameter for downloads to pass through to the requests module (#638)
- Add sent_idx to tokens (ee6135c)
Bugfixes
- Fix Windows encoding issues when reading conll documents, from @yanirmr (b40379e #695)
- Fix tokenization breaking when second batch is exactly eval_length (7263686 #634 #631)
Efficiency improvements
- Bulk process for tokenization - greatly speeds up the use case of many small docs (5d2d39e)
- Optimize MWT usage in pipeline & fix MWT bulk_process (#642 #643 #644)
CoreNLP integration
Stanza v1.2.0
Overview
All models other than NER and Sentiment were retrained with the new UD 2.7 release. Quite a few of them have data augmentation fixes for problems which arise in common use rather than when running an evaluation task. This release also features various enhancements, bug fixes, and performance improvements.
New features and enhancements
- Models trained on combined datasets in English and Italian: The default models for English are now a combination of EWT and GUM. The default models for Italian now combine ISDT, VIT, Twittiro, PosTWITA, and a custom dataset including MWT tokens.
- NER Transfer Learning: Allows users to fine-tune all or part of the parameters of trained NER models on a new dataset for transfer learning (#351, thanks to @gawy for the contribution)
- Multi-document support: The Stanza `Pipeline` now supports multi-`Document` input! To process multiple documents without having to worry about document boundaries, simply pass a list of Stanza `Document` objects into the `Pipeline`. (#70 #577)
- Added API links from token to sentence: It's easier to access Stanza data objects from related ones. To access the sentence object of a token or a word, simply use `token.sent` or `word.sent`. (#533 #554)
- New external tokenizer for Thai with PyThaiNLP: Try it out with, for example, `stanza.Pipeline(lang='th', processors={'tokenize': 'pythainlp'}, package=None)`. (#567)
- Faster tokenization: We have improved how the data pipeline works internally to reduce redundant data wrangling, and significantly sped up the tokenization of long texts. If you have a really long line of text, you could experience up to 10x speedup or more without changing anything. (#522)
- Added a method for getting all the supported languages from the resources file: Wondering what languages Stanza supports and want to determine it programmatically? Wonder no more! Try `stanza.resources.common.list_available_languages()`. (#511 fa52f85)
- Load mwt automagically if a model needs it: Multi-word token expansion is one of the most common things to miss from your `Pipeline` instantiation, and remembering to include it is a pain -- until now. (#516 #515 and many others)
- Vietnamese sentiment model based on VSFC: This is now part of the default language package for Vietnamese that you get from `stanza.download("vi")`. Enjoy!
- More informative errors for missing models: Stanza now throws more helpful exceptions with informative exception messages when you are missing models (#437 #430 ... #324 #438 ... #529 9539665 ... #575 #578)
Bugfixes
- Fixed NER documentation for German to correctly point to the GermEval 2014 model for download. (4ee9f12 #559)
- External tokenization library integration respects `no_ssplit`, so you can enjoy using them without messing up your preferred sentence segmentation, just like Stanza tokenizers. (#523 #556)
- Telugu lemmatizer and tokenizer improvements: Telugu models set to use identity lemmatizer by default, and the tokenizer is retrained to separate sentence final punctuation (#524 ba0aec3)
- Spanish model would not tokenize `foo,bar`. Now fixed (#528 123d502)
- Arabic model would not tokenize `asdf .` Now fixed (#545 03b7cea)
- Various tokenization models would split URLs and/or emails. Now URLs and emails are robustly handled with regexes. (#539 #588)
- Various parser and pos models would deterministically label "punct" for the final word. Resolved via data augmentation (#471 #488 #491)
- Norwegian tokenizers retrained to separate final punct. The fix is an upstream data fix (#305 UniversalDependencies/UD_Norwegian-Bokmaal#5)
- Bugfix for conll eval: Fix the error in data conversion from python object of Document to CoNLL format. (#484 #483, thanks @m0re4u)
- Less randomness in sentiment results: Fixes prediction fluctuation in sentiment prediction. (#458 274474c)
- Bugfix which should make it easier to use in jupyter / colab: This fixes the issue where jupyter notebooks (and by extension colab) don't like it when you use sys.stderr as the stderr of popen (#434 #431)
- Misc fixes for training, concurrency, and edge cases in basic Pipeline usage
BREAKING CHANGES
- Renamed `stanza.models.tokenize` -> `stanza.models.tokenization` #452. This stops the tokenize directory shadowing a built-in library
Stanza v1.1.1
Overview
This release features support for extending the capability of the Stanza pipeline with customized processors, a new sentiment analysis tool, improvements to the `CoreNLPClient` functionality, new models for a few languages (including Thai, which is supported for the first time in Stanza), new biomedical and clinical English packages, alternative servers for downloading resource files, and various improvements and bugfixes.
New Features and Enhancements
- New Sentiment Analysis Models for English, German, Chinese: The default Stanza pipelines for English, German and Chinese now include sentiment analysis models. The released models are based on a convolutional neural network architecture, and predict three-way sentiment labels (negative/neutral/positive). For more information and details on the datasets used to train these models and their performance, please visit the Stanza website.
- New Biomedical and Clinical English Model Packages: Stanza now features syntactic analysis and named entity recognition functionality for English biomedical literature text and clinical notes. These newly introduced packages include: 2 individual biomedical syntactic analysis pipelines, 8 biomedical NER models, 1 clinical syntactic pipeline and 2 clinical NER models. For detailed information on how to download and use these pipelines, please visit Stanza's biomedical models page.
- Support for Adding User Customized Processors via Python Decorators: Stanza now supports adding customized processors or processor variants (i.e., alternatives to existing processors) into existing pipelines. The name and implementation of the added customized processors or processor variants can be specified via `@register_processor` or `@register_processor_variant` decorators. See the Stanza website for more information and examples (see custom Processors and Processor variants). (PR #322)
- Support for Editable Properties For Data Objects: We have made it easier to extend the functionality of the Stanza neural pipeline by adding new annotations to Stanza's data objects (e.g., `Document`, `Sentence`, `Token`, etc). Aside from the annotations they already support, additional annotations can be easily attached through `data_object.add_property()`. See our documentation for more information and examples. (PR #323)
- Support for Automated CoreNLP Installation and CoreNLP Model Download: CoreNLP can now be easily downloaded in Stanza with `stanza.install_corenlp(dir='path/to/corenlp/installation')`; CoreNLP models can now be downloaded with `stanza.download_corenlp_models(model='english', version='4.1.0', dir='path/to/corenlp/installation')`. For more details please see the Stanza website. (PR #363)
- Japanese Pipeline Supports SudachiPy as External Tokenizer: You can now use the SudachiPy library as the tokenizer in a Stanza Japanese pipeline. Turn this on when building a pipeline with `nlp = stanza.Pipeline('ja', processors={'tokenize': 'sudachipy'})`. Note that this will require a separate installation of the SudachiPy library via pip. (PR #365)
- New Alternative Server for Stable Download of Resource Files: Users in certain areas of the world that do not have stable access to GitHub servers can now download models from an alternative Stanford server by specifying a new `resources_url` argument. For example, `stanza.download(lang='en', resources_url='stanford')` will now download the resource file and English pipeline from Stanford servers. (Issue #331, PR #356)
- `CoreNLPClient` Supports New Multiprocessing-friendly Mechanism to Start the CoreNLP Server: The `CoreNLPClient` now supports new `Enum` values with better semantics for its `start_server` argument, for finer-grained control over how the server is launched, including a new option called `StartServer.TRY_START` that launches the CoreNLP Server if one isn't running already, but doesn't fail if one has already been launched. This option makes it easier for `CoreNLPClient` to be used in a multiprocessing environment. Boolean values are still supported for backward compatibility, but we recommend `StartServer.FORCE_START` and `StartServer.DONT_START` for better readability. (PR #302)
- New Semgrex Interface in CoreNLP Client for Dependency Parses of Arbitrary Languages: Stanford CoreNLP has a module which allows searches over dependency graphs using a regex-like language. Previously, this was only usable for languages for which CoreNLP already supported dependency trees. This release expands it to dependency graphs for any language. (Issue #399, PR #392)
- New Tokenizer for Thai Language: The available UD data for Thai is quite small. The authors of pythainlp helped provide us two tokenization datasets, Orchid and Inter-BEST. Future work will include POS, NER, and Sentiment. (Issue #148)
- Support for Serialization of Document Objects: Now you can serialize and deserialize the entire document by running `serialized_string = doc.to_serialized()` and `doc = Document.from_serialized(serialized_string)`. The serialized string can be decoded into Python objects by running `objs = pickle.loads(serialized_string)`. (Issue #361, PR #366)
- Improved Tokenization Speed: Previously, the tokenizer was the slowest member of the neural pipeline, several times slower than any of the other processors. This release brings it in line with the others. The speedup is from improving the text processing before the data is passed to the GPU. (Relevant commits: 546ed13, 8e2076c, 7f5be82, etc.)
- User provided Ukrainian NER model: We now have a model built from the lang-uk NER dataset, provided by a user for redistribution.
Breaking Interface Changes
- Token.id is Tuple and Word.id is Integer: The `id` attribute for a token will now return a tuple of integers to represent the indices of the token (or a singleton tuple in the case of a single-word token), and the `id` for a word will now return an integer to represent the word index. Previously both attributes were encoded as strings and required manual conversion for downstream processing. This change brings more convenient handling of these attributes. (Issue: #211, PR: #357)
- Changed Default Pipeline Packages for Several Languages for Improved Robustness: Languages that have changed default packages include: Polish (default is now `PDB` model, from previous `LFG`, #220), Korean (default is now `GSD`, from previous `Kaist`, #276), Lithuanian (default is now `ALKSNIS`, from previous `HSE`, #415).
- CoreNLP 4.1.0 is required: `CoreNLPClient` requires CoreNLP 4.1.0 or a later version. The client expects recent modifications that were made to the CoreNLP server.
- Properties Cache removed from CoreNLP client: The properties cache has been removed from `CoreNLPClient`, and the `CoreNLPClient`'s `annotate()` method no longer has a `properties_key` argument. Python dictionaries with custom request properties should be directly supplied to `annotate()` via the `properties` argument.
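The Token.id change means a multi-word token now carries a tuple spanning the integer ids of its words. A small sketch of consuming the new convention, using plain dicts as stand-ins for Stanza's Token and Word objects (the data below is illustrative, not output from a real pipeline):

```python
# Sketch of the new id convention (plain dicts standing in for Stanza's
# Token/Word objects): a Spanish MWT like "del" has a tuple id spanning
# its words, while a single-word token has a 1-tuple and each word id
# is a plain integer.

tokens = [
    {"id": (1,),   "text": "Yo"},    # single-word token: singleton tuple
    {"id": (2, 3), "text": "del"},   # MWT covering words 2 and 3
]
words = [
    {"id": 1, "text": "Yo"},
    {"id": 2, "text": "de"},
    {"id": 3, "text": "el"},
]

def words_of(token, word_list):
    """Look up the words a token covers via its integer id range."""
    first, last = token["id"][0], token["id"][-1]
    return [w["text"] for w in word_list if first <= w["id"] <= last]

print(words_of(tokens[1], words))  # the two words under "del"
```

Since both single-word and multi-word tokens now use tuples, downstream code can treat `token["id"][0]` / `token["id"][-1]` uniformly instead of parsing strings like `"2-3"`.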
Bugfixes and Other Improvements
- Fixed Logging Behavior: This mainly fixes the issue that Stanza would override the global logging setting in Python and influence downstream logging behaviors. (Issue #278, PR #290)
- Compatibility Fix for PyTorch v1.6.0: We've updated several processors to adapt to new API changes in PyTorch v1.6.0. (Issues #412 #417, PR #406)
- Improved Batching for Long Sentences in Dependency Parser: This mainly fixes an issue where long sentences would cause an out of GPU memory issue in the dependency parser. (Issue #387)
- Improved neural tokenizer robustness to whitespaces: The neural tokenizer is now more robust to the presence of multiple consecutive whitespace characters (PR #380)
- Resolved properties issue when switching languages with requests to CoreNLP server: An issue with default properties has been resolved. Users can now switch between CoreNLP supported languages and get expected properties for each language by default.