Releases: stanfordnlp/stanza

Stanza v1.5.0

14 Mar 05:09

Ssurgeon interface

Headlining this release is the initial version of Ssurgeon, a rule-based dependency graph editing tool. Together with the existing Semgrex integration with CoreNLP, Ssurgeon allows rewriting of dependencies, such as those in the UD datasets. More information is in the GURT 2023 paper: https://aclanthology.org/2023.tlt-1.7/

In addition, this release includes two other CoreNLP integrations, a long list of bugfixes, a few minor features, and a long list of constituency parser experiments which ranged from "ineffective" to "small improvements" and are available for people to experiment with.

CoreNLP integration:

Bugfixes:

  • Bugfix for older versions of torch: 376d7ea
  • Bugfix for training (integration with new scoring script) #1167 9c39636
  • Demo was showing constituency parser along with dependency parsing, even with conparse off: cbc13b0
  • Replace absurdly long characters with UNK (thank you @khughitt) #1137 #1140
  • Package all relevant pretrains into default.zip - otherwise pretrains used by NER models which are not the default pretrain were being missed. 435685f
  • stanza-train NER training bugfix (wrong pretrain): 2757cb4
  • Pass around device everywhere instead of calling cuda(). This should fix models occasionally being split over multiple devices, and would also allow for use of MPS, but the current torch implementation for MPS is buggy #1209 #1159
  • Fix error in preparing tokenizer datasets (thanks @dvzubarev): #1161
  • Fix unnecessary slowness in preparing tokenizer datasets (again, thanks @dvzubarev): #1162
  • Fix using the correct pretrain when rebuilding POS tags for a Depparse dataset (again, thanks @dvzubarev): #1170
  • When using the tregex interface to CoreNLP, add parse if it isn't already there (again, depparse was being confused with parse): b118473
  • Update use of emoji to match latest releases: #1195 ea345a8

Features:

  • Mechanism for resplitting tokens into MWT #95 8fac17f
  • CLI for tokenizing text into one paragraph per line, whitespace separated (useful for GloVe, for example) cfd44d1
  • detach().cpu() speeds things up significantly in some cases ccfbc56
  • Potentially use a constituency model as a classifier - WIP research project #1190
  • Add an output format "{:C}" for Document objects which prints documents as CoNLL: #1169
  • If a constituency tree is available, include it when outputting conll format for documents: #1171
  • Same with sentiment: abb5819
  • Additional language code coverage (thank you @juanro49) 5802b10 f06bf86 32f83fa 3450575
  • Allow loading a pipeline for new languages (useful when developing a new suite of models) e7fcd26
  • Script to count the work done by annotators on aws sagemaker private workforce: #1186
  • Streaming interface which batch processes items in the stream: 2c9fe3d #550
  • Can pass a defaultdict to MultilingualPipeline, useful for specifying the processors for each language at once: 70fd2fd #1199
  • Transformer at bottom layer of POS - currently only available in English as the en_combined_bert model, others to come #1132
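The streaming feature above batches items drawn from a stream; the core idea can be sketched in plain Python (stream_batches and batch_size are illustrative names, not Stanza's actual API):

```python
from itertools import islice

def stream_batches(items, batch_size=3):
    """Yield successive fixed-size batches from any iterable, so a consumer
    can process a possibly unbounded stream in chunks (e.g. batched
    annotation of documents arriving one at a time)."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

batches = list(stream_batches(range(7), batch_size=3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

Because islice consumes the shared iterator, the generator never materializes the whole stream, which is the point of a streaming interface.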

New models:

Conparser experiments:

  • Transformer stack (initial implementation did not help) https://arxiv.org/abs/2010.10669 110031e
  • TREE_LSTM constituent composition method (didn't beat MAX) 2f722c8
  • Learned weighting between bert layers (this did help a little) 2d0c69e
  • Silver trees: train 10 models, use those models to vote on good trees, then use those trees to train new models. Helps smaller treebanks such as IT and VI, but has no effect on EN #1148
  • New in_order_compound transition scheme: no improvement f560b08
  • Multistage training with madgrad or adamw: definite improvement. madgrad included as optional dependency 2706c4b f500936
  • Report the scores of tags when retagging (does not affect the conparser training) 7663419
  • FocalLoss on the transitions using optional dependency: didn't help https://arxiv.org/abs/1708.02002 90a8337
  • LargeMarginSoftmax: didn't help https://github.com/tk1980/LargeMarginInSoftmax 5edd724
  • Maxout layer: didn't help https://arxiv.org/abs/1302.4389 c708ce7
  • Reverse parsing: not expected to help, but potentially useful when building silver treebanks. May also be useful as a two-step parser in the future. 4954845
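The silver-tree voting step from the experiments above can be sketched as follows (a hypothetical illustration; vote_silver_trees and min_votes are invented names, not Stanza code):

```python
from collections import Counter

def vote_silver_trees(model_parses, min_votes=6):
    """Keep a sentence's tree only when at least min_votes of the models
    produced exactly the same parse. model_parses is a list of
    {sent_id: tree_string} dicts, one per trained model."""
    silver = {}
    for sent_id in model_parses[0]:
        counts = Counter(parses[sent_id] for parses in model_parses)
        tree, votes = counts.most_common(1)[0]
        if votes >= min_votes:
            silver[sent_id] = tree
    return silver
```

The agreement threshold trades coverage against tree quality: a higher min_votes keeps fewer but more reliable silver trees.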

Stanza v1.4.2

15 Sep 05:47

Stanza v1.4.2: Minor version bump to improve (python) dependencies

  • Pipeline cache in Multilingual is a single OrderedDict
    #1115 (comment)
    ba3f64d

  • Don't require pytest for all installations unless needed for testing
    #1120
    8c1d9d8

  • Hide SiLU and Mish imports if the installed version of torch doesn't have those nonlinearities
    #1120
    6a90ad4

  • Reorder & normalize installations in setup.py
    #1124
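The single-OrderedDict pipeline cache mentioned above is essentially an LRU cache keyed by language; a minimal sketch (illustrative names, not Stanza's actual classes):

```python
from collections import OrderedDict

class PipelineCache:
    """Sketch of an LRU pipeline cache built on one OrderedDict,
    in the spirit of the Multilingual cache change described above."""
    def __init__(self, max_size=2):
        self.max_size = max_size
        self._cache = OrderedDict()

    def get(self, lang, build_pipeline):
        if lang in self._cache:
            self._cache.move_to_end(lang)        # mark as most recently used
        else:
            self._cache[lang] = build_pipeline(lang)
            if len(self._cache) > self.max_size:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[lang]
```

Using a single OrderedDict keeps insertion order and recency in one structure, so eviction is a constant-time popitem from the front.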

Stanza v1.4.1

14 Sep 16:41

Stanza v1.4.1: Improvements to pos, conparse, and sentiment, jupyter visualization, and wider language coverage

Overview

We improve the quality of the POS, constituency, and sentiment models, add an integration to displaCy, and add new models for a variety of languages.

New NER models

  • New Polish NER model based on NKJP from Karol Saputa and ryszardtuora
    #1070
    #1110

  • Make GermEval2014 the default German NER model, including an optional Bert version
    #1018
    #1022

  • Japanese conversion of GSD by Megagon
    #1038

  • Marathi NER dataset from L3Cube. Includes a Sentiment model as well
    #1043

  • Thai conversion of LST20
    555fc03

  • Kazakh conversion of KazNERD
    de6cd25

Other new models

  • Sentiment conversion of Tass2020 for Spanish
    #1104

  • VIT constituency dataset for Italian
    149f144
    ... and many subsequent updates

  • Combined UD models for Hebrew
    #1109
    e4fcf00

  • For UD models with a small train dataset & a larger test dataset, flip the datasets
    UD_Buryat-BDT UD_Kazakh-KTB UD_Kurmanji-MG UD_Ligurian-GLT UD_Upper_Sorbian-UFAL
    #1030
    9618d60

  • Spanish conparse model from multiple sources - AnCora, LDC-NW, LDC-DF
    47740c6

Model improvements

  • Pretrained charlm integrated into POS. Gives a small to decent gain for most languages without much additional cost
    #1086

  • Pretrained charlm integrated into Sentiment. Improves English, others not so much
    #1025

  • LSTM, 2d maxpool as optional items in the Sentiment
    from the paper Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling
    #1098

  • First learn with AdaDelta, then with another optimizer in conparse training. Very helpful
    b1d10d3

  • Grad clipping in conparse training
    365066a

Pipeline interface improvements

  • GPU memory savings: charlm reused between different processors in the same pipeline
    #1028

  • Word vectors not saved in the NER models. Saves bandwidth & disk space
    #1033

  • Functions to return tagsets for NER and conparse models
    #1066
    #1073
    36b84db
    2db43c8

  • displaCy integration with NER and dependency trees
    2071413

Bugfixes

  • Fix tokenization taking forever on a single long token (catastrophic backtracking in regex)
    TY to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook)
    #1056

  • Starting a new CoreNLP client without a server no longer waits for the server to be available
    TY to Mariano Crosetti
    #1059
    #1061

  • Read raw GloVe word vectors (they have no header information)
    #1074

  • Ensure that illegal languages are not chosen by the LangID model
    #1076
    #1077

  • Fix cache in Multilingual pipeline
    #1115
    cdf18d8

  • Fix loading of previously unseen languages in Multilingual pipeline
    #1101
    e551ebe

  • Fix conparse occasionally training to NaN early in training
    c4d7857
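The catastrophic-backtracking fix above reflects a general principle: avoid nested quantifiers that can backtrack exponentially on long tokens that ultimately fail to match. A hedged illustration (these patterns are examples, not Stanza's actual tokenizer regexes):

```python
import re

# Patterns with nested repetition such as (\w+\.)+\w can explore an
# exponential number of split points on a long non-matching token.
# A flat character class with a single quantifier stays linear.
long_token = "a" * 50000

url_like = re.compile(r"[\w.]+\.\w+$")
assert url_like.search(long_token) is None        # fails fast, no blow-up
assert url_like.search("example.com") is not None
```

The safe pattern still matches the same dotted strings, but each input position is visited a bounded number of times during backtracking.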

Improved training tools

Stanza v1.4.0

23 Apr 06:01

Stanza v1.4.0: Transformer integration to NER and conparse

Overview

As part of the new Stanza release, we integrate transformer inputs to the NER and conparse modules. In addition, we now support several additional languages for NER and conparse.

Pipeline interface improvements

  • Download resources.json and models into temp dirs first to avoid race conditions between multiple processors
    #213
    #1001

  • Download models for Pipelines automatically, without needing to call stanza.download(...)
    #486
    #943

  • Add ability to turn off downloads
    68455d8

  • Add a new interface where both processors and package can be set
    #917
    f370429

  • When using pretokenized tokens, get character offsets from text if available
    #967
    #975

  • If Bert or other transformers are used, cache the models rather than loading multiple times
    #980

  • Allow for disabling processors on individual runs of a pipeline
    #945
    #947
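Recovering character offsets for pretokenized tokens, as in the improvement above, amounts to scanning the original text forward for each token; a minimal sketch (align_offsets is an illustrative name, not Stanza's function):

```python
def align_offsets(text, tokens):
    """Return (start_char, end_char) for each pretokenized token by
    searching forward through the original text, skipping whatever
    whitespace the tokenization dropped."""
    offsets, pos = [], 0
    for token in tokens:
        start = text.index(token, pos)
        pos = start + len(token)
        offsets.append((start, pos))
    return offsets

# align_offsets("Hello  world!", ["Hello", "world", "!"])
# → [(0, 5), (7, 12), (12, 13)]
```

Searching from the previous end position keeps the alignment correct even when a token string occurs earlier in the text.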

Other general improvements

  • Add # text and # sent_id to conll output
    #918
    #983
    #995

  • Add ner to the token conll output
    #993
    #996

  • Fix missing Slovak MWT model
    #971
    5aa19ec

  • Upgrades to EN, IT, and Indonesian models
    #1003
    #1008
    IT improvements with the help of @attardi and @msimi

  • Fix improper tokenization of Chinese text with leading whitespace
    #920
    #924

  • Check if a CoreNLP model exists before downloading it (thank you @Internull)
    #965

  • Convert the run_charlm script to python
    #942

  • Typing and lint fixes (thank you @asears)
    #833
    #856

  • stanza-train examples now compatible with the python training scripts
    #896
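The # sent_id and # text lines added to the conll output follow the standard CoNLL-U comment convention; a simplified sketch of the resulting shape (sentence_to_conllu is an invented helper, and the token rows are assumed preformatted):

```python
def sentence_to_conllu(sent_id, text, rows):
    """Emit one sentence in CoNLL-U form: comment lines first, then one
    tab-separated 10-column row per token, then a blank line."""
    lines = [f"# sent_id = {sent_id}", f"# text = {text}"]
    lines.extend("\t".join(row) for row in rows)
    return "\n".join(lines) + "\n\n"
```

Downstream CoNLL-U readers treat lines starting with # as sentence-level metadata, which is why the comments survive round-tripping.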

NER features

Constituency parser

Stanza 1.3.0: LangID and Constituency Parser

06 Oct 06:28
f91ca21

Overview

Stanza 1.3.0 introduces a language id model, a constituency parser, a dictionary in the tokenizer, and some additional features and bugfixes.

New features

  • Langid model and multilingual pipeline
    Based on "A reproduction of Apple's bi-directional LSTM models for language identification in short strings." by Toftrup et al. 2021
    (154b0e8)

  • Constituency parser
    Based on "In-Order Transition-based Constituent Parsing" by Jiangming Liu and Yue Zhang. Currently an en_wsj model is available, with more to come.
    (9031802)

  • Evalb interface to CoreNLP
    Useful for evaluating the parser - requires CoreNLP 4.3.0 or later

  • Dictionary tokenizer feature
    Noticeably improved performance for ZH, VI, TH
    (#776)
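The dictionary tokenizer feature helps most for languages written without whitespace word boundaries. One simple way a lexicon can guide segmentation is greedy longest match; this is a rough sketch of the idea only, not Stanza's neural tokenizer, which uses the dictionary as an input feature:

```python
def dict_tokenize(text, lexicon, max_len=4):
    """Greedy longest-match segmentation: at each position take the
    longest dictionary entry, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in lexicon:
                tokens.append(piece)
                i += length
                break
    return tokens
```

Greedy matching is a classic baseline for ZH/VI/TH segmentation; a learned model with dictionary features can recover from cases where the greedy choice is wrong.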

Bugfixes / Reliability

  • HuggingFace integration
    No more git issues complaining about unavailable models! (Hopefully)
    (f7af504)

  • Sentiment processor crashes on certain inputs
    (issue #804, fixed by e232f67)

Stanza v1.2.3: Two new NER models and some minor bugfixes

09 Aug 23:12

Overview

In anticipation of a larger release with some new features, we make a small update to fix some existing bugs and add two more NER models.

Bugfixes

  • Sentiment models would crash on no text (issue #769, fixed by 47889e3)

  • Java processes as a context were not properly closed (a39d2ff)

Interface improvements

  • Downloading tokenize now downloads mwt for languages which require it (issue #774, fixed by #777, from davidrft)

  • NER model can finetune and save to/from different filenames (0714a01)

  • NER model now displays a confusion matrix at the end of training (9bbd3f7)

NER models

  • Afrikaans, trained on NCHLT (6f1f04b)

  • Italian, trained on a model from FBK (d9a361f)

Stanza v1.2.2

15 Jul 18:49
de44be8

Overview

This release fixes a regression in NER results that occurred in 1.2.1 while fixing a space-related bug in the Vietnamese models.

Bugfixes

  • Fix Sentiment not loading correctly on Windows because of pickling issue (#742) (thanks to @BramVanroy)

  • Fix NER bulk process not filling out data structures as expected (#721) (#722)

  • Fix NER space issue causing a performance regression (#739) (#732)

Interface improvements

  • Add an NER run script (#738)

Stanza v1.2.1

17 Jun 17:12
68aa426

Overview

All models other than NER and Sentiment were retrained with the new UD 2.8 release. All of the updates include the data augmentation fixes applied in 1.2.0, along with new augmentations for tokenization issues and end-of-sentence issues. This release also features various enhancements, bug fixes, and performance improvements, along with 4 new NER models.

Model improvements

  • Add Bulgarian, Finnish, Hungarian, Vietnamese NER models

    • The Bulgarian model is trained on BSNLP 2019 data.
    • The Finnish model is trained on the Turku NER data.
    • The Hungarian model is trained on a combination of the NYTK dataset and earlier business and criminal NER datasets.
    • The Vietnamese model is trained on the VLSP 2018 data.
    • Furthermore, the script for preparing the lang-uk NER data has been integrated (c1f0bee)
  • Use new word vectors for Armenian, including better coverage for the new Western Armenian dataset (d9e8301)

  • Add copy mechanism in the seq2seq model. This fixes some unusual Spanish multi-word token expansion errors and potentially improves lemmatization performance. (#692 #684)

  • Fix Spanish POS and depparse mishandling a missing leading ¿ (#699 #698)

  • Fix tokenization breaking when a newline splits a Chinese token (#632 #531)

  • Fix tokenization of parentheses in Chinese (452d842)

  • Fix various issues with characters not present in UD training data such as ellipses characters or unicode apostrophe
    (db05552 f01a142 85898c5)

  • Fix a variety of issues with Vietnamese tokenization - remove a language-specific model improvement which got roughly 1% F1 but caused numerous hard-to-track issues (3ccb132)

  • Fix spaces in Vietnamese words not being found in the embedding used for POS and depparse (1972122)

  • Include UD_English-GUMReddit in the GUM models (9e6367c)

  • Add Pronouns & PUD to the mixed English models (various data improvements made this more appealing) (f74bef7)

Interface enhancements

  • Add ability to pass a Document to the pipeline in pretokenized mode (f88cd8c #696)

  • Track comments when reading and writing conll files (#676 originally from @danielhers in #155)

  • Add a proxy parameter for downloads to pass through to the requests module (#638)

  • Add sent_idx to tokens (ee6135c)

Bugfixes

  • Fix Windows encoding issues when reading conll documents, from @yanirmr (b40379e #695)

  • Fix tokenization breaking when the second batch is exactly eval_length (7263686 #634 #631)

Efficiency improvements

  • Bulk process for tokenization - greatly speeds up the use case of many small docs (5d2d39e)

  • Optimize MWT usage in pipeline & fix MWT bulk_process (#642 #643 #644)

CoreNLP integration

  • Add a UD Enhancer tool which interfaces with CoreNLP's generic enhancer (#675)

  • Add an interface to CoreNLP tokensregex using stanza tokenization (#659)

Stanza v1.2.0

29 Jan 20:05
9aa915e

Overview

All models other than NER and Sentiment were retrained with the new UD 2.7 release. Quite a few of them have data augmentation fixes for problems which arise in common use rather than when running an evaluation task. This release also features various enhancements, bug fixes, and performance improvements.

New features and enhancements

  • Models trained on combined datasets in English and Italian The default models for English are now a combination of EWT and GUM. The default models for Italian now combine ISDT, VIT, Twittiro, PosTWITA, and a custom dataset including MWT tokens.

  • NER Transfer Learning Allows users to fine-tune all or part of the parameters of trained NER models on a new dataset for transfer learning (#351, thanks to @gawy for the contribution)

  • Multi-document support The Stanza Pipeline now supports multi-Document input! To process multiple documents without having to worry about document boundaries, simply pass a list of Stanza Document objects into the Pipeline. (#70 #577)

  • Added API links from token to sentence It's easier to access Stanza data objects from related ones. To access the sentence object containing a token or a word, simply use token.sent or word.sent. (#533 #554)

  • New external tokenizer for Thai with PyThaiNLP Try it out with, for example, stanza.Pipeline(lang='th', processors={'tokenize': 'pythainlp'}, package=None). (#567)

  • Faster tokenization We have improved how the data pipeline works internally to reduce redundant data wrangling, and significantly sped up the tokenization of long texts. If you have a really long line of text, you could experience up to 10x speedup or more without changing anything. (#522)

  • Added a method for getting all the supported languages from the resources file Wondering what languages Stanza supports and want to determine it programmatically? Wonder no more! Try stanza.resources.common.list_available_languages(). (#511 fa52f85)

  • Load mwt automagically if a model needs it Multi-word token expansion is one of the most common things to miss from your Pipeline instantiation, and remembering to include it is a pain -- until now. (#516 #515 and many others)

  • Vietnamese sentiment model based on VSFC This is now part of the default language package for Vietnamese that you get from stanza.download("vi"). Enjoy!

  • More informative errors for missing models Stanza now throws more helpful exceptions with informative exception messages when you are missing models (#437 #430 ... #324 #438 ... #529 9539665 ... #575 #578)

Bugfixes

  • Fixed NER documentation for German to correctly point to the GermEval 2014 model for download. (4ee9f12 #559)

  • External tokenization library integration respects no_ssplit so you can enjoy using them without messing up your preferred sentence segmentation just like Stanza tokenizers. (#523 #556)

  • Telugu lemmatizer and tokenizer improvements Telugu models set to use identity lemmatizer by default, and the tokenizer is retrained to separate sentence final punctuation (#524 ba0aec3)

  • Spanish model would not tokenize foo,bar Now fixed (#528 123d502)

  • Arabic model would not tokenize asdf . Now fixed (#545 03b7cea)

  • Various tokenization models would split URLs and/or emails Now URLs and emails are robustly handled with regexes. (#539 #588)

  • Various parser and pos models would deterministically label "punct" for the final word Resolved via data augmentation (#471 #488 #491)

  • Norwegian tokenizers retrained to separate final punct The fix is an upstream data fix (#305 UniversalDependencies/UD_Norwegian-Bokmaal#5)

  • Bugfix for conll eval Fix the error in data conversion from python object of Document to CoNLL format. (#484 #483, thanks @m0re4u )

  • Less randomness in sentiment results Fixes prediction fluctuation in sentiment prediction. (#458 274474c)

  • Bugfix which should make it easier to use in jupyter / colab This fixes the issue where jupyter notebooks (and by extension colab) don't like it when you use sys.stderr as the stderr of popen (#434 #431)

  • Misc fixes for training, concurrency, and edge cases in basic Pipeline usage

    • Fix for mwt training (#446)
    • Fix for race condition in seq2seq models (#463 #462)
    • Fix for race condition in CRF (#566 #561)
    • Fix for empty text in pipeline (#475 #474)
    • Fix for resources not freed when downloading (#502 #503)
    • Fix for vietnamese pipeline not working (#531 #535)
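The data augmentation used for the final-"punct" fix above boils down to also training on copies of sentences with the final punctuation stripped, so the tagger cannot learn that the last word is always punctuation. A hypothetical sketch (augment_final_punct and rate are invented names, not the actual training code):

```python
import random

def augment_final_punct(sentences, rate=0.3, seed=0):
    """Append punctuation-stripped copies of a fraction of the training
    sentences (each sentence is a list of token strings)."""
    rng = random.Random(seed)
    augmented = list(sentences)
    for sent in sentences:
        if sent and sent[-1] in {".", "!", "?"} and rng.random() < rate:
            augmented.append(sent[:-1])
    return augmented
```

The original sentences are all kept; only the extra copies vary with the augmentation rate, so evaluation-time accuracy on well-formed text is unaffected.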

BREAKING CHANGES

  • Renamed stanza.models.tokenize -> stanza.models.tokenization #452 This stops the tokenize directory shadowing a built-in library

Stanza v1.1.1

13 Aug 06:26
@J38

Overview

This release features support for extending the capability of the Stanza pipeline with customized processors, a new sentiment analysis tool, improvements to the CoreNLPClient functionality, new models for a few languages (including Thai, which is supported for the first time in Stanza), new biomedical and clinical English packages, alternative servers for downloading resource files, and various improvements and bugfixes.

New Features and Enhancements

  • New Sentiment Analysis Models for English, German, Chinese: The default Stanza pipelines for English, German and Chinese now include sentiment analysis models. The released models are based on a convolutional neural network architecture, and predict three-way sentiment labels (negative/neutral/positive). For more information and details on the datasets used to train these models and their performance, please visit the Stanza website.

  • New Biomedical and Clinical English Model Packages: Stanza now features syntactic analysis and named entity recognition functionality for English biomedical literature text and clinical notes. These newly introduced packages include: 2 individual biomedical syntactic analysis pipelines, 8 biomedical NER models, 1 clinical syntactic pipelines and 2 clinical NER models. For detailed information on how to download and use these pipelines, please visit Stanza's biomedical models page.

  • Support for Adding User Customized Processors via Python Decorators: Stanza now supports adding customized processors or processor variants (i.e., an alternative of existing processors) into existing pipelines. The name and implementation of the added customized processors or processor variants can be specified via @register_processor or @register_processor_variant decorators. See Stanza website for more information and examples (see custom Processors and Processor variants). (PR #322)

  • Support for Editable Properties For Data Objects: We have made it easier to extend the functionality of the Stanza neural pipeline by adding new annotations to Stanza's data objects (e.g., Document, Sentence, Token, etc). Aside from the annotation they already support, additional annotation can be easily attached through data_object.add_property(). See our documentation for more information and examples. (PR #323)

  • Support for Automated CoreNLP Installation and CoreNLP Model Download: CoreNLP can now be easily downloaded in Stanza with stanza.install_corenlp(dir='path/to/corenlp/installation'); CoreNLP models can now be downloaded with stanza.download_corenlp_models(model='english', version='4.1.0', dir='path/to/corenlp/installation'). For more details please see the Stanza website. (PR #363)

  • Japanese Pipeline Supports SudachiPy as External Tokenizer: You can now use the SudachiPy library as tokenizer in a Stanza Japanese pipeline. Turn on this when building a pipeline with nlp = stanza.Pipeline('ja', processors={'tokenize': 'sudachipy'}). Note that this will require a separate installation of the SudachiPy library via pip. (PR #365)

  • New Alternative Server for Stable Download of Resource Files: Users in certain areas of the world that do not have stable access to GitHub servers can now download models from alternative Stanford server by specifying a new resources_url argument. For example, stanza.download(lang='en', resources_url='stanford') will now download the resource file and English pipeline from Stanford servers. (Issue #331, PR #356)

  • CoreNLPClient Supports New Multiprocessing-friendly Mechanism to Start the CoreNLP Server: The CoreNLPClient now supports new Enum values with better semantics for its start_server argument for finer-grained control over how the server is launched, including a new option called StartServer.TRY_START that launches the CoreNLP Server if one isn't running already, but doesn't fail if one has already been launched. This option makes it easier for CoreNLPClient to be used in a multiprocessing environment. Boolean values are still supported for backward compatibility, but we recommend StartServer.FORCE_START and StartServer.DONT_START for better readability. (PR #302)

  • New Semgrex Interface in CoreNLP Client for Dependency Parses of Arbitrary Languages: Stanford CoreNLP has a module which allows searches over dependency graphs using a regex-like language. Previously, this was only usable for languages which CoreNLP already supported dependency trees. This release expands it to dependency graphs for any language. (Issue #399, PR #392)

  • New Tokenizer for Thai Language: The available UD data for Thai is quite small. The authors of pythainlp helped provide us two tokenization datasets, Orchid and Inter-BEST. Future work will include POS, NER, and Sentiment. (Issue #148)

  • Support for Serialization of Document Objects: Now you can serialize and deserialize the entire document by running serialized_string = doc.to_serialized() and doc = Document.from_serialized(serialized_string). The serialized string can be decoded into Python objects by running objs = pickle.loads(serialized_string). (Issue #361, PR #366)

  • Improved Tokenization Speed: Previously, the tokenizer was the slowest member of the neural pipeline, several times slower than any of the other processors. This release brings it in line with the others. The speedup is from improving the text processing before the data is passed to the GPU. (Relevant commits: 546ed13, 8e2076c, 7f5be82, etc.)

  • User provided Ukrainian NER model: We now have a model built from the lang-uk NER dataset, provided by a user for redistribution.
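The serialization round trip described above can be illustrated with a minimal stand-in class; Stanza's real Document carries full annotations, and this sketch only shows the to_serialized / from_serialized pattern built on pickle:

```python
import pickle

class Document:
    """Minimal stand-in for demonstrating the serialization interface;
    not Stanza's actual Document class."""
    def __init__(self, text):
        self.text = text

    def to_serialized(self):
        return pickle.dumps(self)

    @classmethod
    def from_serialized(cls, serialized_string):
        return pickle.loads(serialized_string)
```

As the release note says, the serialized bytes are plain pickle output, so pickle.loads can also decode them directly.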

Breaking Interface Changes

  • Token.id is Tuple and Word.id is Integer: The id attribute for a token will now return a tuple of integers to represent the indices of the token (or a singleton tuple in the case of a single-word token), and the id for a word will now return an integer to represent the word index. Previously both attributes were encoded as strings and required manual conversion for downstream processing. This change brings more convenient handling of these attributes. (Issue: #211, PR: #357)

  • Changed Default Pipeline Packages for Several Languages for Improved Robustness: Languages that have changed default packages include: Polish (default is now PDB model, from previous LFG, #220), Korean (default is now GSD, from previous Kaist, #276), Lithuanian (default is now ALKSNIS, from previous HSE, #415).

  • CoreNLP 4.1.0 is required: CoreNLPClient requires CoreNLP 4.1.0 or a later version. The client expects recent modifications that were made to the CoreNLP server.

  • Properties Cache removed from CoreNLP client: The properties_cache has been removed from CoreNLPClient and the CoreNLPClient's annotate() method no longer has a properties_key argument. Python dictionaries with custom request properties should be directly supplied to annotate() via the properties argument.
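Code that consumed the old string ids can adapt to the new tuple scheme with a small helper; a sketch (covered_word_ids is an invented name):

```python
def covered_word_ids(token_id):
    """Under the new scheme, token.id is a tuple: (3,) for a plain token,
    (4, 5) for a multi-word token spanning words 4 through 5. Return the
    integer word ids the token covers."""
    return list(range(token_id[0], token_id[-1] + 1))
```

Because a singleton tuple and a span tuple share the same shape, the same expression handles both cases without string parsing.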

Bugfixes and Other Improvements

  • Fixed Logging Behavior: This is mainly for fixing the issue that Stanza will override the global logging setting in Python and influence downstream logging behaviors. (Issue #278, PR #290)

  • Compatibility Fix for PyTorch v1.6.0: We've updated several processors to adapt to new API changes in PyTorch v1.6.0. (Issues #412 #417, PR #406)

  • Improved Batching for Long Sentences in Dependency Parser: This is mainly for fixing an issue where long sentences will cause an out of GPU memory issue in the dependency parser. (Issue #387)

  • Improved neural tokenizer robustness to whitespaces: the neural tokenizer is now more robust to the presence of multiple consecutive whitespace characters (PR #380)

  • Resolved properties issue when switching languages with requests to CoreNLP server: An issue with default properties has been resolved. Users can now switch between CoreNLP-supported languages and get the expected properties for each language by default.
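The logging fix above follows the standard library-friendly pattern: a library configures only its own logger, attaches a NullHandler, and leaves global logging configuration to the application. A general sketch of that pattern, not Stanza's exact code:

```python
import logging

# Library code should get its own named logger and add a NullHandler,
# rather than calling logging.basicConfig() at import time, so that the
# application keeps full control over handlers, levels, and formatting.
logger = logging.getLogger("stanza")
logger.addHandler(logging.NullHandler())
logger.debug("silenced unless the application opts in")
```

With this setup, library log records propagate to whatever handlers the application installs, and nothing is printed if the application installs none.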