get_aligned_BERT_emb

Get the aligned BERT embedding for sequence labeling tasks

Why this repo?

In BERT's original extract_features.py script, tokens may be split into word pieces, as follows:

orig_tokens = ["John", "Johanson", "'s",  "house"]
bert_tokens = ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
orig_to_tok_map = [1, 2, 4, 6]
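
The map above can be reproduced with a short loop that records where each original token starts in the word-piece sequence. This is a minimal sketch, not the repo's code: `toy_wordpiece` and `build_alignment` are hypothetical names, and `toy_wordpiece` is a stand-in for a real WordPiece tokenizer that only knows the four words of this example.

```python
# Hypothetical stand-in for a WordPiece tokenizer; it covers only this example.
TOY_PIECES = {
    "John": ["john"],
    "Johanson": ["johan", "##son"],
    "'s": ["'", "s"],
    "house": ["house"],
}


def toy_wordpiece(token):
    """Return the word pieces of a token (assumed lookup, not a real tokenizer)."""
    return TOY_PIECES[token]


def build_alignment(orig_tokens, tokenize=toy_wordpiece):
    """Return (bert_tokens, orig_to_tok_map) for a list of original tokens."""
    bert_tokens = ["[CLS]"]
    orig_to_tok_map = []
    for orig_token in orig_tokens:
        # Record the index of the first word piece of this original token.
        orig_to_tok_map.append(len(bert_tokens))
        bert_tokens.extend(tokenize(orig_token))
    bert_tokens.append("[SEP]")
    return bert_tokens, orig_to_tok_map


bert_tokens, orig_to_tok_map = build_alignment(["John", "Johanson", "'s", "house"])
```

Each entry of `orig_to_tok_map` points at the first piece of a word, so the pieces of word `i` run from `orig_to_tok_map[i]` up to (but not including) the next entry.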

We investigate three alignment strategies (first, mean, and max) for maintaining an original-to-tokenized alignment. Take "Johanson -> johan, ##son" as an example:

first: take the representation of johan as the representation of the whole word Johanson
mean: take the reduce_mean of the representations of johan and ##son as the representation of the whole word Johanson
max: take the reduce_max of the representations of johan and ##son as the representation of the whole word Johanson
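
The three strategies can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation in get_aligned_bert_emb.py: `align_embeddings` is a hypothetical helper, vectors are plain lists of floats, and the input is assumed to include the [CLS] and [SEP] positions.

```python
def align_embeddings(piece_vecs, orig_to_tok_map, strategy="first"):
    """Pool word-piece vectors into one vector per original token.

    piece_vecs: one vector (list of floats) per BERT token, including
        the [CLS] and [SEP] positions (an assumption of this sketch).
    orig_to_tok_map: index of the first word piece of each original token.
    """
    # A word's pieces end where the next word starts; the last word ends before [SEP].
    ends = list(orig_to_tok_map[1:]) + [len(piece_vecs) - 1]
    pooled = []
    for start, end in zip(orig_to_tok_map, ends):
        span = piece_vecs[start:end]
        if strategy == "first":
            pooled.append(list(span[0]))
        elif strategy == "mean":
            # Element-wise mean over the pieces of this word.
            pooled.append([sum(col) / len(span) for col in zip(*span)])
        elif strategy == "max":
            # Element-wise max over the pieces of this word.
            pooled.append([max(col) for col in zip(*span)])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return pooled
```

For a word that is not split, all three strategies return the same vector; they differ only for words with several pieces, such as Johanson.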

How to use this repo?

sh run.sh input_file output_file BERT_BASE_DIR
# For example:
sh run.sh your_data your_data.bert path/to/bert/uncased_L-12_H-768_A-12

You can modify layers and align_strategies in run.sh.

How to load the output embeddings?

After the above procedure, you should get an output file of contextual embeddings (e.g., your_data_6_mean). You can then load this file like conventional word embeddings. For example, in a Python script:

with open("your_data_6_mean", "r", encoding="utf-8") as bert_f:
    for line in bert_f:
        # Token vectors on a line are separated by "|||"; values within a vector by whitespace.
        bert_vec = [[float(value) for value in token.split()] for token in line.strip().split("|||")]
