Releases · himkt/konoha
Release v5.3.0
Release v5.2.1
Release v5.2.0
konoha v5.2.0 is now available.
This version makes the sentence tokenizer configurable (#159).
You can find the details in the README.
- sentence splitter
from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない.だが,「かわいい。それで十分だろう」。"
tokenizer = SentenceTokenizer(period=".")
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。名前なんてものはない.', 'だが,「かわいい。それで十分だろう」。']
- bracket expression
import re

from konoha import SentenceTokenizer

sentence = "私は猫だ。名前なんてものはない。だが,『かわいい。それで十分だろう』。"
tokenizer = SentenceTokenizer(
    patterns=SentenceTokenizer.PATTERNS + [re.compile(r"『.*?』")],
)
print(tokenizer.tokenize(sentence))
# => ['私は猫だ。', '名前なんてものはない。', 'だが,『かわいい。それで十分だろう』。']
core feature
- Configurable sentence tokenizer (#159)
dependencies
- Remove poetry-dynamic-versioning (#158)
Release v5.1.0
Release v5.0.1
Release v5.0.0
konoha v5.0.0 is now available. This version includes several major interface changes.
🚨 Breaking changes
Remove the with_postag option
Before
from konoha import WordTokenizer

tokenizer_with_postag = WordTokenizer(tokenizer="mecab", with_postag=True)
tokenizer_without_postag = WordTokenizer(tokenizer="mecab", with_postag=False)
After
The with_postag option was simply removed. Note that it was also removed from the API.
from konoha import WordTokenizer

tokenizer = WordTokenizer(tokenizer="mecab")
Add /api/v1/batch_tokenize and prohibit passing texts to /api/v1/tokenize
Konoha 4.x.x allowed users to pass texts to /api/v1/tokenize.
From 5.0.0, we provide the new endpoint /api/v1/batch_tokenize for batch tokenization.
Before
curl localhost:8000/api/v1/tokenize \
-X POST \
-H "Content-Type: application/json" \
-d '{"tokenizer": "mecab", "texts": ["自然言語処理"]}'
After
curl localhost:8000/api/v1/batch_tokenize \
-X POST \
-H "Content-Type: application/json" \
-d '{"tokenizer": "mecab", "texts": ["自然言語処理"]}'
core feature
- Remove postag information from __repr__ (#144)
- Remove with_postag from WordTokenizer (#141)
- Remove konoha.konoha_token (#140)
- Extract batch tokenization from WordTokenizer.tokenize (#137); see the sketch after this list
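A minimal sketch of the batch interface extracted in #137, assuming the new method is named batch_tokenize to match the single-text tokenize:

from konoha import WordTokenizer

tokenizer = WordTokenizer(tokenizer="mecab")
# One call tokenizes every sentence in the list,
# instead of looping over tokenize per sentence.
for tokens in tokenizer.batch_tokenize(["自然言語処理", "私は猫だ"]):
    print(tokens)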
other
- Introduce rich (#143)
- Import libraries in initializers of tokenizer classes (#142)
- Update tests (#136)
Release v4.6.5
Thanks to the contribution by @altescy, Konoha v4.6.5 supports UniDic for MeCab!
>>> from konoha import WordTokenizer
>>> # [important] you have to include `unidic` in the file path
>>> tk = WordTokenizer(system_dictionary_path="mecab/dict/mecab-unidic")
>>> tk.tokenize("オレンジジュースを飲む")
[オレンジ, ジュース, を, 飲む]
integration
- Upgrade dependency for AllenNLP (#131)
Release v4.6.4
[beta] New feature
WordTokenizer now supports a new argument endpoint.
You can use konoha without installing tokenizers on your computer.
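A sketch of remote tokenization via the new argument; the host below is illustrative and assumes a konoha API server is already running there:

from konoha import WordTokenizer

# Tokenization is delegated to the remote server,
# so MeCab itself does not need to be installed locally.
tokenizer = WordTokenizer(tokenizer="mecab", endpoint="localhost:8000")
print(tokenizer.tokenize("自然言語処理"))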
Changes
api
- Update API module: change endpoint path and upgrade Ubuntu to 20.04 (#120)
- Feature/remote tokenization (#123)
other
- Packaging/poetry dynamic versioning (#118)
- Update max-line-length for linters and remove poetry-dynamic-versioning.substitution (#119)
- Update Dockerfile to reduce image size (#122)
documentation
- Update README to add description of breaking change (#121)
Release v4.6.3
core feature
- [Thanks @hppRC for opening the issue!] Pre-compile regular expressions before tokenization (#117); see the sketch below
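An illustrative sketch of the optimization, not konoha's actual code: compiling the pattern once at module load avoids rebuilding it on every call.

import re

# Compiled once at import time and reused for every call.
SENTENCE_BOUNDARY = re.compile(r"(?<=。)")

def split_sentences(text: str) -> list:
    # Splitting on a zero-width lookbehind keeps the 。 attached
    # to each sentence; the pattern itself is never recompiled here.
    return [s for s in SENTENCE_BOUNDARY.split(text) if s]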
bug
- [Thanks @Akirtn for opening the issue!] Support AllenNLP>=1.3.0 by renaming the path to the Token class (#116); see the import sketch below
- Note that this also drops support for AllenNLP <1.3.0.
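For reference, a hedged sketch of the kind of import-path change involved, assuming the rename is AllenNLP's move of the Token module to token_class in 1.3:

# AllenNLP >= 1.3
from allennlp.data.tokenizers.token_class import Token
# AllenNLP < 1.3 used: from allennlp.data.tokenizers.token import Token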
other
- Apply isort (#112)
Release v4.6.2
bug fix
- Fix bug in traversing directories on Amazon S3 (#111)