
Stanza v1.5.0

@AngledLuffa released this 14 Mar 05:09

Ssurgeon interface

Headlining this release is Ssurgeon, a rule-based dependency graph editing tool making its first appearance here. Along with the existing Semgrex integration with CoreNLP, Ssurgeon allows rewriting of dependencies, such as those in the UD datasets. More information is in the GURT 2023 paper: https://aclanthology.org/2023.tlt-1.7/
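
Here is a minimal sketch of driving Ssurgeon from Python. The process_doc call shape is an assumption based on the new stanza.server.ssurgeon module, so check the Ssurgeon docs for the exact signature; the client also launches CoreNLP, so a CoreNLP install with CORENLP_HOME set is required.

```python
from stanza.utils.conll import CoNLL
from stanza.server.ssurgeon import process_doc  # call shape assumed; see the Ssurgeon docs

# A one-sentence CoNLL-U document whose obj edge we will relabel.
SAMPLE_DOC = """1	Jennifer	Jennifer	PROPN	NNP	Number=Sing	2	nsubj	_	_
2	has	have	VERB	VBZ	_	0	root	_	_
3	lovely	lovely	ADJ	JJ	Degree=Pos	4	amod	_	_
4	antennae	antenna	NOUN	NNS	Number=Plur	2	obj	_	SpaceAfter=No
5	.	.	PUNCT	.	_	2	punct	_	_
"""

doc = CoNLL.conll2doc(input_str=SAMPLE_DOC)

# Semgrex locates the edge; the Ssurgeon operation rewrites it.
semgrex = "{}=source >obj=edge {}=target"
ssurgeon = "relabelNamedEdge -edge edge -reln dobj"

edited = process_doc(doc, semgrex, ssurgeon)  # assumed to return the rewritten Document
print("{:C}".format(edited))
```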

In addition, there are two other CoreNLP integrations, a long list of bugfixes, a few other minor features, and a long list of constituency parser experiments which ranged from "ineffective" to "small improvements" and are available for people to experiment with.

CoreNLP integration:

Bugfixes:

  • Bugfix for older versions of torch: 376d7ea
  • Bugfix for training (integration with new scoring script) #1167 9c39636
  • Demo was showing constituency parser along with dependency parsing, even with conparse off: cbc13b0
  • Replace absurdly long characters with UNK (thank you @khughitt) #1137 #1140
  • Package all relevant pretrains into default.zip; previously, pretrains used by NER models other than the default pretrain were being missed. 435685f
  • stanza-train NER training bugfix (wrong pretrain): 2757cb4
  • Pass around the device everywhere instead of calling cuda(). This should fix models occasionally being split over multiple devices. It would also allow the use of MPS, but the current torch implementation of MPS is buggy (see the device sketch after this list). #1209 #1159
  • Fix error in preparing tokenizer datasets (thanks @dvzubarev): #1161
  • Fix unnecessary slowness in preparing tokenizer datasets (again, thanks @dvzubarev): #1162
  • Fix using the correct pretrain when rebuilding POS tags for a Depparse dataset (again, thanks @dvzubarev): #1170
  • When using the tregex interface to CoreNLP, add parse if it isn't already there (again, depparse was being confused with parse): b118473
  • Update use of emoji to match latest releases: #1195 ea345a8
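
For the device refactor above, a hedged sketch of what pinning a pipeline to one device looks like; the device keyword is an assumption here, so check the Pipeline options for the exact spelling on your version:

```python
import stanza

# Assumed keyword: device.  Pin every processor to a single device so
# no model ends up split across GPUs; "cuda:0" could be "cpu" or, once
# the torch backend stabilizes, "mps".
nlp = stanza.Pipeline("en", processors="tokenize,pos", device="cuda:0")
doc = nlp("All models now live on one device.")
print(doc.sentences[0].words[0].upos)
```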

Features:

  • Mechanism for resplitting tokens into MWT #95 8fac17f
  • CLI for tokenizing text into one paragraph per line, whitespace separated (useful for GloVe, for example): cfd44d1
  • detach().cpu() speeds things up significantly in some cases ccfbc56
  • Potentially use a constituency model as a classifier - WIP research project #1190
  • Add an output format "{:C}" for document objects which prints out documents as CoNLL (see the example after this list): #1169
  • If a constituency tree is available, include it when outputting conll format for documents: #1171
  • Same with sentiment: abb5819
  • Additional language code coverage (thank you @juanro49) 5802b10 f06bf86 32f83fa 3450575
  • Allow loading a pipeline for new languages (useful when developing a new suite of models) e7fcd26
  • Script to count the work done by annotators on an AWS SageMaker private workforce: #1186
  • Streaming interface which batch processes items in the stream: 2c9fe3d #550
  • Can pass a defaultdict to MultilingualPipeline, useful for specifying the processors for every language at once (see the second example after this list): 70fd2fd #1199
  • Transformer at the bottom layer of POS - currently only available in English as the en_combined_bert model, others to come #1132
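
For instance, the "{:C}" format spec from the list above hooks into Python's standard formatting, so converting a processed document to CoNLL is a one-liner (a small sketch; requires the English models to be downloaded first):

```python
import stanza

# Requires the English models: stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = nlp("Stanza prints CoNLL now.")

# Document objects implement __format__, so the standard formatting
# machinery produces CoNLL output directly:
print("{:C}".format(doc))  # whole document as CoNLL
print(f"{doc:C}")          # same thing via an f-string
```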
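
A sketch of the defaultdict feature for MultilingualPipeline follows. The lang_configs parameter name and the per-language config shape are assumptions based on the existing MultilingualPipeline options, so verify them against the docs:

```python
from collections import defaultdict
from stanza.pipeline.multilingual import MultilingualPipeline

# Assumed parameter name: lang_configs maps a language code to that
# language's Pipeline config.  A defaultdict supplies the same
# processors for any language the langid step detects.
lang_configs = defaultdict(lambda: {"processors": "tokenize,pos"})
nlp = MultilingualPipeline(lang_configs=lang_configs)

docs = nlp(["This is an English sentence.", "C'est une phrase française."])
for doc in docs:
    print(doc.lang, doc.sentences[0].words[0].upos)
```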

New models:

Conparser experiments:

  • Transformer stack (initial implementation did not help) https://arxiv.org/abs/2010.10669 110031e
  • TREE_LSTM constituent composition method (didn't beat MAX) 2f722c8
  • Learned weighting between bert layers (this did help a little) 2d0c69e
  • Silver trees: train 10 models, use those models to vote on good trees, then use those trees to train new models (see the sketch after this list). Helps smaller treebanks such as IT and VI, but no effect on EN #1148
  • New in_order_compound transition scheme: no improvement f560b08
  • Multistage training with madgrad or adamw: definite improvement. madgrad is included as an optional dependency 2706c4b f500936
  • Report the scores of tags when retagging (does not affect the conparser training) 7663419
  • FocalLoss on the transitions using optional dependency: didn't help https://arxiv.org/abs/1708.02002 90a8337
  • LargeMarginSoftmax: didn't help https://github.com/tk1980/LargeMarginInSoftmax 5edd724
  • Maxout layer: didn't help https://arxiv.org/abs/1302.4389 c708ce7
  • Reverse parsing: not expected to help, but potentially useful when building silver treebanks. May also be useful as a two-step parser in the future. 4954845
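
The silver-tree loop above is easy to picture with a toy voting step. This is purely illustrative pseudologic, not Stanza's actual training code: real parser checkpoints replace the stubs, and the surviving trees become extra training data for a fresh model.

```python
from collections import Counter

def make_toy_parser(seed):
    """Stand-in for one trained constituency parser checkpoint."""
    def parse(sentence):
        # Toy behavior: most ensemble members agree; one dissents.
        return "(S (NP the cat) (VP sat))" if seed != 0 else "(S the cat sat)"
    return parse

def collect_silver_trees(sentences, parsers, min_votes=7):
    """Keep a tree only when enough ensemble members produce it verbatim."""
    silver = []
    for sentence in sentences:
        votes = Counter(parser(sentence) for parser in parsers)
        tree, count = votes.most_common(1)[0]
        if count >= min_votes:
            silver.append((sentence, tree))
    return silver

parsers = [make_toy_parser(i) for i in range(10)]  # "train 10 models"
print(collect_silver_trees(["the cat sat"], parsers))
```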