Releases · himkt/konoha
Release v5.3.0
Release v5.2.1
Release v5.2.0
konoha v5.2.0 is now available.
This version makes the sentence tokenizer configurable (#159).
You can find the details in the README.
- sentence splitter
from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない.だが,「かわいい。それで十分だろう」。"
tokenizer = SentenceTokenizer(period=".")
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。名前なんてものはない.', 'だが,「かわいい。それで十分だろう」。']
- bracket expression
import re

from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない。だが,『かわいい。それで十分だろう』。"
tokenizer = SentenceTokenizer(
    patterns=SentenceTokenizer.PATTERNS + [re.compile(r"『.*?』")],
)
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが,『かわいい。それで十分だろう』。']
core feature
- Configurable sentence tokenizer (#159)
dependencies
- Remove poetry-dynamic-versioning (#158)
Release v5.1.0
Release v5.0.1
Release v5.0.0
konoha v5.0.0 is now available. This version includes several major interface changes.
🚨 Breaking changes
Remove the with_postag option
Before
from konoha import WordTokenizer

tokenizer_with_postag = WordTokenizer(tokenizer="mecab", with_postag=True)
tokenizer_without_postag = WordTokenizer(tokenizer="mecab", with_postag=False)
After
The with_postag option was simply removed. Note that it was also removed from the API.
from konoha import WordTokenizer

tokenizer = WordTokenizer(tokenizer="mecab")
Add /api/v1/batch_tokenize and prohibit passing texts to /api/v1/tokenize
Konoha 4.x.x allowed users to pass texts to /api/v1/tokenize.
From 5.0.0, we provide the new endpoint /api/v1/batch_tokenize for batch tokenization.
Before
curl localhost:8000/api/v1/tokenize \
-X POST \
-H "Content-Type: application/json" \
-d '{"tokenizer": "mecab", "texts": ["自然言語処理"]}'
After
curl localhost:8000/api/v1/batch_tokenize \
-X POST \
-H "Content-Type: application/json" \
-d '{"tokenizer": "mecab", "texts": ["自然言語処理"]}'
core feature
- Remove postag information from __repr__ (#144)
- Remove with_postag from WordTokenizer (#141)
- Remove konoha.konoha_token (#140)
- Extract batch tokenization from WordTokenizer.tokenize (#137); see the sketch after this list
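A minimal sketch of the batch interface extracted in #137, assuming the new method is named batch_tokenize to match the single-text tokenize:

from konoha import WordTokenizer

tokenizer = WordTokenizer(tokenizer="mecab")
# One call tokenizes every sentence in the list,
# instead of looping over tokenize per sentence.
for tokens in tokenizer.batch_tokenize(["自然言語処理", "私は猫だ"]):
    print(tokens)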
other
- Introduce rich (#143)
- Import libraries in initializers of tokenizer classes (#142)
- Update tests (#136)
Release v4.6.5
Thanks to the contribution by @altescy, Konoha v4.6.5 supports UniDic for MeCab!
>>> from konoha import WordTokenizer
>>> # [important] you have to include `unidic` in the file path
>>> tk = WordTokenizer(system_dictionary_path="mecab/dict/mecab-unidic")
>>> tk.tokenize("オレンジジュースを飲む")
[オレンジ, ジュース, を, 飲む]
integration
- Upgrade dependency for AllenNLP (#131)
Release v4.6.4
[beta] New feature
WordTokenizer now supports a new argument endpoint.
You can use konoha without installing tokenizers on your computer.
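A sketch of remote tokenization via the new argument; the host below is illustrative and assumes a konoha API server is already running there:

from konoha import WordTokenizer

# Tokenization is delegated to the remote server,
# so MeCab itself does not need to be installed locally.
tokenizer = WordTokenizer(tokenizer="mecab", endpoint="localhost:8000")
print(tokenizer.tokenize("自然言語処理"))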
Changes
api
- Update API module: change endpoint path and upgrade Ubuntu to 20.04 (#120)
- Feature/remote tokenization (#123)
other
- Packaging/poetry dynamic versioning (#118)
- Update max-line-length for linters and remove poetry-dynamic-versioning.substitution (#119)
- Update Dockerfile to reduce image size (#122)
documentation
- Update README to add description of breaking change (#121)
Release v4.6.3
core feature
- [Thanks @hppRC for opening the issue!] Pre-compile regular expressions before tokenization (#117); see the sketch below
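An illustrative sketch of the optimization, not konoha's actual code: compiling the pattern once at module load avoids rebuilding it on every call.

import re

# Compiled once at import time and reused for every call.
SENTENCE_BOUNDARY = re.compile(r"(?<=。)")

def split_sentences(text: str) -> list:
    # Splitting on a zero-width lookbehind keeps the 。 attached
    # to each sentence; the pattern itself is never recompiled here.
    return [s for s in SENTENCE_BOUNDARY.split(text) if s]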
bug
- [Thanks @Akirtn for opening the issue!] Support AllenNLP>=1.3.0 by renaming the path to the Token class (#116); see the import sketch below
- Note that this also drops support for AllenNLP <1.3.0.
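For reference, a hedged sketch of the kind of import-path change involved, assuming the rename is AllenNLP's move of the Token module to token_class in 1.3:

# AllenNLP >= 1.3
from allennlp.data.tokenizers.token_class import Token
# AllenNLP < 1.3 used: from allennlp.data.tokenizers.token import Token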
other
- Apply isort (#112)
Release v4.6.2
bug fix
- Fix bug in traversing directories on Amazon S3 (#111)