Releases: stanfordnlp/stanza

Stanza v1.5.0

14 Mar 05:09

Ssurgeon interface

Headlining this release is the initial version of Ssurgeon, a rule-based dependency graph editing tool. Together with the existing Semgrex integration with CoreNLP, Ssurgeon allows rewriting of dependencies, such as those in the UD datasets. More information is in the GURT 2023 paper: https://aclanthology.org/2023.tlt-1.7/

In addition, this release includes two other CoreNLP integrations, a long list of bugfixes, a few minor features, and a long list of constituency parser experiments which ranged from "ineffective" to "small improvements" and are available for people to experiment with.

CoreNLP integration:

Bugfixes:

  • Bugfix for older versions of torch: 376d7ea
  • Bugfix for training (integration with new scoring script) #1167 9c39636
  • Demo was showing constituency parser along with dependency parsing, even with conparse off: cbc13b0
  • Replace absurdly long characters with UNK (thank you @khughitt) #1137 #1140
  • Package all relevant pretrains into default.zip - otherwise pretrains used by NER models which are not the default pretrain were being missed. 435685f
  • stanza-train NER training bugfix (wrong pretrain): 2757cb4
  • Pass around device everywhere instead of calling cuda(). This should fix models occasionally being split over multiple devices, and would also allow for use of MPS, but the current torch implementation for MPS is buggy #1209 #1159
  • Fix error in preparing tokenizer datasets (thanks @dvzubarev): #1161
  • Fix unnecessary slowness in preparing tokenizer datasets (again, thanks @dvzubarev): #1162
  • Fix using the correct pretrain when rebuilding POS tags for a Depparse dataset (again, thanks @dvzubarev): #1170
  • When using the tregex interface to CoreNLP, add parse if it isn't already there (again, depparse was being confused with parse): b118473
  • Update use of emoji to match latest releases: #1195 ea345a8

Features:

  • Mechanism for resplitting tokens into MWT #95 8fac17f
  • CLI for tokenizing text into one paragraph per line, whitespace separated (useful for GloVe, for example) cfd44d1
  • detach().cpu() speeds things up significantly in some cases ccfbc56
  • Potentially use a constituency model as a classifier - WIP research project #1190
  • Add an output format "{:C}" for Document objects which prints documents as CoNLL: #1169
  • If a constituency tree is available, include it when outputting conll format for documents: #1171
  • Same with sentiment: abb5819
  • Additional language code coverage (thank you @juanro49) 5802b10 f06bf86 32f83fa 3450575
  • Allow loading a pipeline for new languages (useful when developing a new suite of models) e7fcd26
  • Script to count the work done by annotators on aws sagemaker private workforce: #1186
  • Streaming interface which batch processes items in the stream: 2c9fe3d #550
  • Can pass a defaultdict to MultilingualPipeline, useful for specifying the processors for each language at once: 70fd2fd #1199
  • Transformer at bottom layer of POS - currently only available in English as the en_combined_bert model, others to come #1132
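The streaming feature above batches items drawn from a stream; the core idea can be sketched in plain Python (stream_batches and batch_size are illustrative names, not Stanza's actual API):

```python
from itertools import islice

def stream_batches(items, batch_size=3):
    """Yield successive fixed-size batches from any iterable, so a consumer
    can process a possibly unbounded stream in chunks (e.g. batched
    annotation of documents arriving one at a time)."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

batches = list(stream_batches(range(7), batch_size=3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

Because islice consumes the shared iterator, the generator never materializes the whole stream, which is the point of a streaming interface.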

New models:

Conparser experiments:

  • Transformer stack (initial implementation did not help) https://arxiv.org/abs/2010.10669 110031e
  • TREE_LSTM constituent composition method (didn't beat MAX) 2f722c8
  • Learned weighting between bert layers (this did help a little) 2d0c69e
  • Silver trees: train 10 models, use those models to vote on good trees, then use those trees to train new models. Helps smaller treebanks such as IT and VI, but has no effect on EN #1148
  • New in_order_compound transition scheme: no improvement f560b08
  • Multistage training with madgrad or adamw: definite improvement. madgrad included as optional dependency 2706c4b f500936
  • Report the scores of tags when retagging (does not affect the conparser training) 7663419
  • FocalLoss on the transitions using optional dependency: didn't help https://arxiv.org/abs/1708.02002 90a8337
  • LargeMarginSoftmax: didn't help https://github.com/tk1980/LargeMarginInSoftmax 5edd724
  • Maxout layer: didn't help https://arxiv.org/abs/1302.4389 c708ce7
  • Reverse parsing: not expected to help, but potentially useful when building silver treebanks. May also be useful as a two-step parser in the future. 4954845
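The silver-tree voting step from the experiments above can be sketched as follows (a hypothetical illustration; vote_silver_trees and min_votes are invented names, not Stanza code):

```python
from collections import Counter

def vote_silver_trees(model_parses, min_votes=6):
    """Keep a sentence's tree only when at least min_votes of the models
    produced exactly the same parse. model_parses is a list of
    {sent_id: tree_string} dicts, one per trained model."""
    silver = {}
    for sent_id in model_parses[0]:
        counts = Counter(parses[sent_id] for parses in model_parses)
        tree, votes = counts.most_common(1)[0]
        if votes >= min_votes:
            silver[sent_id] = tree
    return silver
```

The agreement threshold trades coverage against tree quality: a higher min_votes keeps fewer but more reliable silver trees.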

Stanza v1.4.2

15 Sep 05:47

Stanza v1.4.2: Minor version bump to improve (python) dependencies

  • Pipeline cache in Multilingual is a single OrderedDict
    #1115 (comment)
    ba3f64d

  • Don't require pytest for all installations unless needed for testing
    #1120
    8c1d9d8

  • Hide SiLU and Mish imports if the installed version of torch doesn't have those nonlinearities
    #1120
    6a90ad4

  • Reorder & normalize installations in setup.py
    #1124
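The single-OrderedDict pipeline cache mentioned above is essentially an LRU cache keyed by language; a minimal sketch (illustrative names, not Stanza's actual classes):

```python
from collections import OrderedDict

class PipelineCache:
    """Sketch of an LRU pipeline cache built on one OrderedDict,
    in the spirit of the Multilingual cache change described above."""
    def __init__(self, max_size=2):
        self.max_size = max_size
        self._cache = OrderedDict()

    def get(self, lang, build_pipeline):
        if lang in self._cache:
            self._cache.move_to_end(lang)        # mark as most recently used
        else:
            self._cache[lang] = build_pipeline(lang)
            if len(self._cache) > self.max_size:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[lang]
```

Using a single OrderedDict keeps insertion order and recency in one structure, so eviction is a constant-time popitem from the front.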

Stanza v1.4.1

14 Sep 16:41

Stanza v1.4.1: Improvements to pos, conparse, and sentiment, jupyter visualization, and wider language coverage

Overview

We improve the quality of the POS, constituency, and sentiment models, add an integration to displaCy, and add new models for a variety of languages.

New NER models

  • New Polish NER model based on NKJP from Karol Saputa and ryszardtuora
    #1070
    #1110

  • Make GermEval2014 the default German NER model, including an optional Bert version
    #1018
    #1022

  • Japanese conversion of GSD by Megagon
    #1038

  • Marathi NER dataset from L3Cube. Includes a Sentiment model as well
    #1043

  • Thai conversion of LST20
    555fc03

  • Kazakh conversion of KazNERD
    de6cd25

Other new models

  • Sentiment conversion of Tass2020 for Spanish
    #1104

  • VIT constituency dataset for Italian
    149f144
    ... and many subsequent updates

  • Combined UD models for Hebrew
    #1109
    e4fcf00

  • For UD models with a small train dataset & a larger test dataset, flip the datasets
    UD_Buryat-BDT UD_Kazakh-KTB UD_Kurmanji-MG UD_Ligurian-GLT UD_Upper_Sorbian-UFAL
    #1030
    9618d60

  • Spanish conparse model from multiple sources - AnCora, LDC-NW, LDC-DF
    47740c6

Model improvements

  • Pretrained charlm integrated into POS. Gives a small to decent gain for most languages without much additional cost
    #1086

  • Pretrained charlm integrated into Sentiment. Improves English, others not so much
    #1025

  • LSTM, 2d maxpool as optional items in the Sentiment
    from the paper Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling
    #1098

  • First learn with AdaDelta, then with another optimizer in conparse training. Very helpful
    b1d10d3

  • Grad clipping in conparse training
    365066a

Pipeline interface improvements

  • GPU memory savings: charlm reused between different processors in the same pipeline
    #1028

  • Word vectors not saved in the NER models. Saves bandwidth & disk space
    #1033

  • Functions to return tagsets for NER and conparse models
    #1066
    #1073
    36b84db
    2db43c8

  • displaCy integration with NER and dependency trees
    2071413

Bugfixes

  • Fix tokenization taking forever on a single long token (catastrophic backtracking in regex)
    TY to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook)
    #1056

  • Starting a new CoreNLP client without a server no longer waits for the server to be available
    TY to Mariano Crosetti
    #1059
    #1061

  • Read raw GloVe word vectors (they have no header information)
    #1074

  • Ensure that illegal languages are not chosen by the LangID model
    #1076
    #1077

  • Fix cache in Multilingual pipeline
    #1115
    cdf18d8

  • Fix loading of previously unseen languages in Multilingual pipeline
    #1101
    e551ebe

  • Fix conparse occasionally training to NaN early in training
    c4d7857
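The catastrophic-backtracking fix above reflects a general principle: avoid nested quantifiers that can backtrack exponentially on long tokens that ultimately fail to match. A hedged illustration (these patterns are examples, not Stanza's actual tokenizer regexes):

```python
import re

# Patterns with nested repetition such as (\w+\.)+\w can explore an
# exponential number of split points on a long non-matching token.
# A flat character class with a single quantifier stays linear.
long_token = "a" * 50000

url_like = re.compile(r"[\w.]+\.\w+$")
assert url_like.search(long_token) is None        # fails fast, no blow-up
assert url_like.search("example.com") is not None
```

The safe pattern still matches the same dotted strings, but each input position is visited a bounded number of times during backtracking.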

Improved training tools

Stanza v1.4.0

23 Apr 06:01

Stanza v1.4.0: Transformer integration to NER and conparse

Overview

As part of the new Stanza release, we integrate transformer inputs to the NER and conparse modules. In addition, we now support several additional languages for NER and conparse.

Pipeline interface improvements

  • Download resources.json and models into temp dirs first to avoid race conditions between multiple processors
    #213
    #1001

  • Download models for Pipelines automatically, without needing to call stanza.download(...)
    #486
    #943

  • Add ability to turn off downloads
    68455d8

  • Add a new interface where both processors and package can be set
    #917
    f370429

  • When using pretokenized tokens, get character offsets from text if available
    #967
    #975

  • If Bert or other transformers are used, cache the models rather than loading multiple times
    #980

  • Allow for disabling processors on individual runs of a pipeline
    #945
    #947
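Recovering character offsets for pretokenized tokens, as in the improvement above, amounts to scanning the original text forward for each token; a minimal sketch (align_offsets is an illustrative name, not Stanza's function):

```python
def align_offsets(text, tokens):
    """Return (start_char, end_char) for each pretokenized token by
    searching forward through the original text, skipping whatever
    whitespace the tokenization dropped."""
    offsets, pos = [], 0
    for token in tokens:
        start = text.index(token, pos)
        pos = start + len(token)
        offsets.append((start, pos))
    return offsets

# align_offsets("Hello  world!", ["Hello", "world", "!"])
# → [(0, 5), (7, 12), (12, 13)]
```

Searching from the previous end position keeps the alignment correct even when a token string occurs earlier in the text.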

Other general improvements

  • Add # text and # sent_id to conll output
    #918
    #983
    #995

  • Add ner to the token conll output
    #993
    #996

  • Fix missing Slovak MWT model
    #971
    5aa19ec

  • Upgrades to EN, IT, and Indonesian models
    #1003
    #1008
    IT improvements with the help of @attardi and @msimi

  • Fix improper tokenization of Chinese text with leading whitespace
    #920
    #924

  • Check if a CoreNLP model exists before downloading it (thank you @Internull)
    #965

  • Convert the run_charlm script to python
    #942

  • Typing and lint fixes (thank you @asears)
    #833
    #856

  • stanza-train examples now compatible with the python training scripts
    #896
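The # sent_id and # text lines added to the conll output follow the standard CoNLL-U comment convention; a simplified sketch of the resulting shape (sentence_to_conllu is an invented helper, and the token rows are assumed preformatted):

```python
def sentence_to_conllu(sent_id, text, rows):
    """Emit one sentence in CoNLL-U form: comment lines first, then one
    tab-separated 10-column row per token, then a blank line."""
    lines = [f"# sent_id = {sent_id}", f"# text = {text}"]
    lines.extend("\t".join(row) for row in rows)
    return "\n".join(lines) + "\n\n"
```

Downstream CoNLL-U readers treat lines starting with # as sentence-level metadata, which is why the comments survive round-tripping.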

NER features

Constituency parser

Stanza 1.3.0: LangID and Constituency Parser

06 Oct 06:28
f91ca21

Overview

Stanza 1.3.0 introduces a language id model, a constituency parser, a dictionary in the tokenizer, and some additional features and bugfixes.

New features

  • Langid model and multilingual pipeline
    Based on "A reproduction of Apple's bi-directional LSTM models for language identification in short strings." by Toftrup et al. 2021
    (154b0e8)

  • Constituency parser
    Based on "In-Order Transition-based Constituent Parsing" by Jiangming Liu and Yue Zhang. Currently an en_wsj model is available, with more to come.
    (9031802)

  • Evalb interface to CoreNLP
    Useful for evaluating the parser - requires CoreNLP 4.3.0 or later

  • Dictionary tokenizer feature
    Noticeably improved performance for ZH, VI, TH
    (#776)
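The dictionary tokenizer feature helps most for languages written without whitespace word boundaries. One simple way a lexicon can guide segmentation is greedy longest match; this is a rough sketch of the idea only, not Stanza's neural tokenizer, which uses the dictionary as an input feature:

```python
def dict_tokenize(text, lexicon, max_len=4):
    """Greedy longest-match segmentation: at each position take the
    longest dictionary entry, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in lexicon:
                tokens.append(piece)
                i += length
                break
    return tokens
```

Greedy matching is a classic baseline for ZH/VI/TH segmentation; a learned model with dictionary features can recover from cases where the greedy choice is wrong.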

Bugfixes / Reliability

  • HuggingFace integration
    No more git issues complaining about unavailable models! (Hopefully)
    (f7af504)

  • Sentiment processor crashes on certain inputs
    (issue #804, fixed by e232f67)

Stanza v1.2.3: Two new NER models and some minor bugfixes

09 Aug 23:12

Overview

In anticipation of a larger release with some new features, we make a small update to fix some existing bugs and add two more NER models.

Bugfixes

  • Sentiment models would crash on no text (issue #769, fixed by 47889e3)

  • Java processes as a context were not properly closed (a39d2ff)

Interface improvements

  • Downloading tokenize now downloads mwt for languages which require it (issue #774, fixed by #777, from davidrft)

  • NER model can finetune and save to/from different filenames (0714a01)

  • NER model now displays a confusion matrix at the end of training (9bbd3f7)

NER models

  • Afrikaans, trained on NCHLT (6f1f04b)

  • Italian, trained on a model from FBK (d9a361f)

Stanza v1.2.2

15 Jul 18:49
de44be8

Overview

This release fixes a regression in NER results that occurred in 1.2.1 while fixing a space-related bug in the Vietnamese models.

Bugfixes

  • Fix Sentiment not loading correctly on Windows because of pickling issue (#742) (thanks to @BramVanroy)

  • Fix NER bulk process not filling out data structures as expected (#721) (#722)

  • Fix NER space issue causing a performance regression (#739) (#732)

Interface improvements

  • Add an NER run script (#738)

Stanza v1.2.1

17 Jun 17:12
68aa426

Overview

All models other than NER and Sentiment were retrained with the new UD 2.8 release. All of the updates include the data augmentation fixes applied in 1.2.0, along with new augmentations for tokenization issues and end-of-sentence issues. This release also features various enhancements, bug fixes, and performance improvements, along with 4 new NER models.

Model improvements

  • Add Bulgarian, Finnish, Hungarian, Vietnamese NER models

    • The Bulgarian model is trained on BSNLP 2019 data.
    • The Finnish model is trained on the Turku NER data.
    • The Hungarian model is trained on a combination of the NYTK dataset and earlier business and criminal NER datasets.
    • The Vietnamese model is trained on the VLSP 2018 data.
    • Furthermore, the script for preparing the lang-uk NER data has been integrated (c1f0bee)
  • Use new word vectors for Armenian, including better coverage for the new Western Armenian dataset (d9e8301)

  • Add copy mechanism in the seq2seq model. This fixes some unusual Spanish multi-word token expansion errors and potentially improves lemmatization performance. (#692 #684)

  • Fix Spanish POS and depparse mishandling a missing leading ¿ (#699 #698)

  • Fix tokenization breaking when a newline splits a Chinese token (#632 #531)

  • Fix tokenization of parentheses in Chinese (452d842)

  • Fix various issues with characters not present in UD training data such as ellipses characters or unicode apostrophe
    (db05552 f01a142 85898c5)

  • Fix a variety of issues with Vietnamese tokenization - remove a language-specific model improvement which got roughly 1% F1 but caused numerous hard-to-track issues (3ccb132)

  • Fix spaces in Vietnamese words not being found in the embedding used for POS and depparse (1972122)

  • Include UD_English-GUMReddit in the GUM models (9e6367c)

  • Add Pronouns & PUD to the mixed English models (various data improvements made this more appealing) (f74bef7)

Interface enhancements

  • Add ability to pass a Document to the pipeline in pretokenized mode (f88cd8c #696)

  • Track comments when reading and writing conll files (#676 originally from @danielhers in #155)

  • Add a proxy parameter for downloads to pass through to the requests module (#638)

  • Add sent_idx to tokens (ee6135c)

Bugfixes

  • Fix Windows encoding issues when reading conll documents, from @yanirmr (b40379e #695)

  • Fix tokenization breaking when the second batch is exactly eval_length (7263686 #634 #631)

Efficiency improvements

  • Bulk process for tokenization - greatly speeds up the use case of many small docs (5d2d39e)

  • Optimize MWT usage in pipeline & fix MWT bulk_process (#642 #643 #644)

CoreNLP integration

  • Add a UD Enhancer tool which interfaces with CoreNLP's generic enhancer (#675)

  • Add an interface to CoreNLP tokensregex using stanza tokenization (#659)

Stanza v1.2.0

29 Jan 20:05
9aa915e

Overview

All models other than NER and Sentiment were retrained with the new UD 2.7 release. Quite a few of them have data augmentation fixes for problems which arise in common use rather than when running an evaluation task. This release also features various enhancements, bug fixes, and performance improvements.

New features and enhancements

  • Models trained on combined datasets in English and Italian The default models for English are now a combination of EWT and GUM. The default models for Italian now combine ISDT, VIT, Twittiro, PosTWITA, and a custom dataset including MWT tokens.

  • NER Transfer Learning Allows users to fine-tune all or part of the parameters of trained NER models on a new dataset for transfer learning (#351, thanks to @gawy for the contribution)

  • Multi-document support The Stanza Pipeline now supports multi-Document input! To process multiple documents without having to worry about document boundaries, simply pass a list of Stanza Document objects into the Pipeline. (#70 #577)

  • Added API links from token to sentence It's easier to access Stanza data objects from related ones. To access the sentence object containing a token or a word, simply use token.sent or word.sent. (#533 #554)

  • New external tokenizer for Thai with PyThaiNLP Try it out with, for example, stanza.Pipeline(lang='th', processors={'tokenize': 'pythainlp'}, package=None). (#567)

  • Faster tokenization We have improved how the data pipeline works internally to reduce redundant data wrangling, and significantly sped up the tokenization of long texts. If you have a really long line of text, you could experience up to 10x speedup or more without changing anything. (#522)

  • Added a method for getting all the supported languages from the resources file Wondering what languages Stanza supports and want to determine it programmatically? Wonder no more! Try stanza.resources.common.list_available_languages(). (#511 fa52f85)

  • Load mwt automagically if a model needs it Multi-word token expansion is one of the most common things to miss from your Pipeline instantiation, and remembering to include it is a pain -- until now. (#516 #515 and many others)

  • Vietnamese sentiment model based on VSFC This is now part of the default language package for Vietnamese that you get from stanza.download("vi"). Enjoy!

  • More informative errors for missing models Stanza now throws more helpful exceptions with informative exception messages when you are missing models (#437 #430 ... #324 #438 ... #529 9539665 ... #575 #578)

Bugfixes

  • Fixed NER documentation for German to correctly point to the GermEval 2014 model for download. (4ee9f12 #559)

  • External tokenization library integration respects no_ssplit so you can enjoy using them without messing up your preferred sentence segmentation just like Stanza tokenizers. (#523 #556)

  • Telugu lemmatizer and tokenizer improvements Telugu models set to use identity lemmatizer by default, and the tokenizer is retrained to separate sentence final punctuation (#524 ba0aec3)

  • Spanish model would not tokenize foo,bar Now fixed (#528 123d502)

  • Arabic model would not tokenize asdf . Now fixed (#545 03b7cea)

  • Various tokenization models would split URLs and/or emails Now URLs and emails are robustly handled with regexes. (#539 #588)

  • Various parser and pos models would deterministically label "punct" for the final word Resolved via data augmentation (#471 #488 #491)

  • Norwegian tokenizers retrained to separate final punct The fix is an upstream data fix (#305 UniversalDependencies/UD_Norwegian-Bokmaal#5)

  • Bugfix for conll eval Fix the error in data conversion from python object of Document to CoNLL format. (#484 #483, thanks @m0re4u )

  • Less randomness in sentiment results Fixes prediction fluctuation in sentiment prediction. (#458 274474c)

  • Bugfix which should make it easier to use in jupyter / colab This fixes the issue where jupyter notebooks (and by extension colab) don't like it when you use sys.stderr as the stderr of popen (#434 #431)

  • Misc fixes for training, concurrency, and edge cases in basic Pipeline usage

    • Fix for mwt training (#446)
    • Fix for race condition in seq2seq models (#463 #462)
    • Fix for race condition in CRF (#566 #561)
    • Fix for empty text in pipeline (#475 #474)
    • Fix for resources not freed when downloading (#502 #503)
    • Fix for vietnamese pipeline not working (#531 #535)
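The data augmentation used for the final-"punct" fix above boils down to also training on copies of sentences with the final punctuation stripped, so the tagger cannot learn that the last word is always punctuation. A hypothetical sketch (augment_final_punct and rate are invented names, not the actual training code):

```python
import random

def augment_final_punct(sentences, rate=0.3, seed=0):
    """Append punctuation-stripped copies of a fraction of the training
    sentences (each sentence is a list of token strings)."""
    rng = random.Random(seed)
    augmented = list(sentences)
    for sent in sentences:
        if sent and sent[-1] in {".", "!", "?"} and rng.random() < rate:
            augmented.append(sent[:-1])
    return augmented
```

The original sentences are all kept; only the extra copies vary with the augmentation rate, so evaluation-time accuracy on well-formed text is unaffected.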

BREAKING CHANGES

  • Renamed stanza.models.tokenize -> stanza.models.tokenization #452 This stops the tokenize directory shadowing a built-in library

Stanza v1.1.1

13 Aug 06:26
@J38

Overview

This release features support for extending the capability of the Stanza pipeline with customized processors, a new sentiment analysis tool, improvements to the CoreNLPClient functionality, new models for a few languages (including Thai, which is supported for the first time in Stanza), new biomedical and clinical English packages, alternative servers for downloading resource files, and various improvements and bugfixes.

New Features and Enhancements

  • New Sentiment Analysis Models for English, German, Chinese: The default Stanza pipelines for English, German and Chinese now include sentiment analysis models. The released models are based on a convolutional neural network architecture, and predict three-way sentiment labels (negative/neutral/positive). For more information and details on the datasets used to train these models and their performance, please visit the Stanza website.

  • New Biomedical and Clinical English Model Packages: Stanza now features syntactic analysis and named entity recognition functionality for English biomedical literature text and clinical notes. These newly introduced packages include: 2 individual biomedical syntactic analysis pipelines, 8 biomedical NER models, 1 clinical syntactic pipelines and 2 clinical NER models. For detailed information on how to download and use these pipelines, please visit Stanza's biomedical models page.

  • Support for Adding User Customized Processors via Python Decorators: Stanza now supports adding customized processors or processor variants (i.e., an alternative of existing processors) into existing pipelines. The name and implementation of the added customized processors or processor variants can be specified via @register_processor or @register_processor_variant decorators. See Stanza website for more information and examples (see custom Processors and Processor variants). (PR #322)

  • Support for Editable Properties For Data Objects: We have made it easier to extend the functionality of the Stanza neural pipeline by adding new annotations to Stanza's data objects (e.g., Document, Sentence, Token, etc). Aside from the annotation they already support, additional annotation can be easily attached through data_object.add_property(). See our documentation for more information and examples. (PR #323)

  • Support for Automated CoreNLP Installation and CoreNLP Model Download: CoreNLP can now be easily downloaded in Stanza with stanza.install_corenlp(dir='path/to/corenlp/installation'); CoreNLP models can now be downloaded with stanza.download_corenlp_models(model='english', version='4.1.0', dir='path/to/corenlp/installation'). For more details please see the Stanza website. (PR #363)

  • Japanese Pipeline Supports SudachiPy as External Tokenizer: You can now use the SudachiPy library as tokenizer in a Stanza Japanese pipeline. Turn on this when building a pipeline with nlp = stanza.Pipeline('ja', processors={'tokenize': 'sudachipy'}). Note that this will require a separate installation of the SudachiPy library via pip. (PR #365)

  • New Alternative Server for Stable Download of Resource Files: Users in certain areas of the world that do not have stable access to GitHub servers can now download models from alternative Stanford server by specifying a new resources_url argument. For example, stanza.download(lang='en', resources_url='stanford') will now download the resource file and English pipeline from Stanford servers. (Issue #331, PR #356)

  • CoreNLPClient Supports New Multiprocessing-friendly Mechanism to Start the CoreNLP Server: The CoreNLPClient now supports new Enum values with better semantics for its start_server argument for finer-grained control over how the server is launched, including a new option called StartServer.TRY_START that launches the CoreNLP Server if one isn't running already, but doesn't fail if one has already been launched. This option makes it easier for CoreNLPClient to be used in a multiprocessing environment. Boolean values are still supported for backward compatibility, but we recommend StartServer.FORCE_START and StartServer.DONT_START for better readability. (PR #302)

  • New Semgrex Interface in CoreNLP Client for Dependency Parses of Arbitrary Languages: Stanford CoreNLP has a module which allows searches over dependency graphs using a regex-like language. Previously, this was only usable for languages which CoreNLP already supported dependency trees. This release expands it to dependency graphs for any language. (Issue #399, PR #392)

  • New Tokenizer for Thai Language: The available UD data for Thai is quite small. The authors of pythainlp helped provide us two tokenization datasets, Orchid and Inter-BEST. Future work will include POS, NER, and Sentiment. (Issue #148)

  • Support for Serialization of Document Objects: Now you can serialize and deserialize the entire document by running serialized_string = doc.to_serialized() and doc = Document.from_serialized(serialized_string). The serialized string can be decoded into Python objects by running objs = pickle.loads(serialized_string). (Issue #361, PR #366)

  • Improved Tokenization Speed: Previously, the tokenizer was the slowest member of the neural pipeline, several times slower than any of the other processors. This release brings it in line with the others. The speedup is from improving the text processing before the data is passed to the GPU. (Relevant commits: 546ed13, 8e2076c, 7f5be82, etc.)

  • User provided Ukrainian NER model: We now have a model built from the lang-uk NER dataset, provided by a user for redistribution.
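The serialization round trip described above can be illustrated with a minimal stand-in class; Stanza's real Document carries full annotations, and this sketch only shows the to_serialized / from_serialized pattern built on pickle:

```python
import pickle

class Document:
    """Minimal stand-in for demonstrating the serialization interface;
    not Stanza's actual Document class."""
    def __init__(self, text):
        self.text = text

    def to_serialized(self):
        return pickle.dumps(self)

    @classmethod
    def from_serialized(cls, serialized_string):
        return pickle.loads(serialized_string)
```

As the release note says, the serialized bytes are plain pickle output, so pickle.loads can also decode them directly.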

Breaking Interface Changes

  • Token.id is Tuple and Word.id is Integer: The id attribute for a token will now return a tuple of integers to represent the indices of the token (or a singleton tuple in the case of a single-word token), and the id for a word will now return an integer to represent the word index. Previously both attributes were encoded as strings and required manual conversion for downstream processing. This change brings more convenient handling of these attributes. (Issue: #211, PR: #357)

  • Changed Default Pipeline Packages for Several Languages for Improved Robustness: Languages that have changed default packages include: Polish (default is now PDB model, from previous LFG, #220), Korean (default is now GSD, from previous Kaist, #276), Lithuanian (default is now ALKSNIS, from previous HSE, #415).

  • CoreNLP 4.1.0 is required: CoreNLPClient requires CoreNLP 4.1.0 or a later version. The client expects recent modifications that were made to the CoreNLP server.

  • Properties Cache removed from CoreNLP client: The properties_cache has been removed from CoreNLPClient and the CoreNLPClient's annotate() method no longer has a properties_key argument. Python dictionaries with custom request properties should be directly supplied to annotate() via the properties argument.
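Code that consumed the old string ids can adapt to the new tuple scheme with a small helper; a sketch (covered_word_ids is an invented name):

```python
def covered_word_ids(token_id):
    """Under the new scheme, token.id is a tuple: (3,) for a plain token,
    (4, 5) for a multi-word token spanning words 4 through 5. Return the
    integer word ids the token covers."""
    return list(range(token_id[0], token_id[-1] + 1))
```

Because a singleton tuple and a span tuple share the same shape, the same expression handles both cases without string parsing.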

Bugfixes and Other Improvements

  • Fixed Logging Behavior: This is mainly for fixing the issue that Stanza will override the global logging setting in Python and influence downstream logging behaviors. (Issue #278, PR #290)

  • Compatibility Fix for PyTorch v1.6.0: We've updated several processors to adapt to new API changes in PyTorch v1.6.0. (Issues #412 #417, PR #406)

  • Improved Batching for Long Sentences in Dependency Parser: This is mainly for fixing an issue where long sentences will cause an out of GPU memory issue in the dependency parser. (Issue #387)

  • Improved neural tokenizer robustness to whitespaces: the neural tokenizer is now more robust to the presence of multiple consecutive whitespace characters (PR #380)

  • Resolved properties issue when switching languages with requests to CoreNLP server: An issue with default properties has been resolved. Users can now switch between CoreNLP-supported languages and get the expected properties for each language by default.
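The logging fix above follows the standard library-friendly pattern: a library configures only its own logger, attaches a NullHandler, and leaves global logging configuration to the application. A general sketch of that pattern, not Stanza's exact code:

```python
import logging

# Library code should get its own named logger and add a NullHandler,
# rather than calling logging.basicConfig() at import time, so that the
# application keeps full control over handlers, levels, and formatting.
logger = logging.getLogger("stanza")
logger.addHandler(logging.NullHandler())
logger.debug("silenced unless the application opts in")
```

With this setup, library log records propagate to whatever handlers the application installs, and nothing is printed if the application installs none.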