BiodivBERT is a biodiversity-domain language model. It is fine-tuned on two downstream tasks, Named Entity Recognition (NER) and Relation Extraction (RE), using several state-of-the-art datasets.
Quick Download
- BiodivBERT on Huggingface hub
- Pre-processed Datasets for Fine-tuning
- Pre-training corpora
BiodivBERT pre-training involves two steps: data crawling and the pre-training task itself.
- Pre-training from scratch requires a large corpus of unlabeled text. Since we are interested in the biodiversity domain, we crawl two well-known publishers in the Life Sciences: Elsevier and Springer.
- We provide our `data_crawling` script in this repo.
To crawl the data:
- Create `elsevier_config.json` and `springer_config.json` with your API keys.
- Adjust the settings for the target API in `config.py` and enable the API you want to crawl.
- Launch `main.py`; it will crawl either the abstracts or the full text, depending on the settings you selected in `config.py`.
- To ensure you process unique DOIs and avoid duplicate text at the end, run `filter_DOI.py` and specify the correct target, e.g., `targets=['Springer']`. This creates a text file with unique DOIs.
- To crawl the full text, make sure that `OPENACCESS=True` and `FULL=True` in `config.py`. The main script will download the PDFs from Springer and the already-parsed text from Elsevier.
- To obtain the full text from Springer, a GROBID service must be hosted and the client must be up and running. For more information on how to set up GROBID, please visit their page; we recommend using Docker to run the service. We provide a wrapper for the client in `run_grobid.py` (see the sketch after this list).
- A data cleaning step is mandatory here: run `data_clean.py`, which cleans directory by directory based on the configuration.
- To create train and evaluation sets, use `train_test_split.py`.
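For illustration only, here is a minimal sketch of how a wrapper like `run_grobid.py` might send the downloaded Springer PDFs to a locally hosted GROBID service over its REST API; the directory names are placeholders and the actual wrapper in this repo may work differently.

```python
# Illustrative sketch only -- the repo's run_grobid.py wrapper may differ.
import pathlib
import requests

# Default GROBID endpoint when the service runs locally (e.g., via Docker on port 8070).
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

pdf_dir = pathlib.Path("data/springer_pdfs")   # hypothetical input directory with PDFs
tei_dir = pathlib.Path("data/springer_tei")    # hypothetical output directory
tei_dir.mkdir(parents=True, exist_ok=True)

for pdf in pdf_dir.glob("*.pdf"):
    # GROBID expects the PDF as a multipart form field named "input".
    with pdf.open("rb") as fh:
        resp = requests.post(GROBID_URL, files={"input": fh}, timeout=300)
    resp.raise_for_status()
    # The response body is TEI XML containing the parsed full text.
    (tei_dir / f"{pdf.stem}.tei.xml").write_text(resp.text, encoding="utf-8")
```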
- The pre-training code is published in this repo under `\pre-training`.
- We provide two `config.py` files, one for abstract-based and one for full-text pre-training.
  - Before you start pre-training, make sure you change `root=Your data folder`.
  - You can adjust the hyperparameters, e.g., we select `pre_device_batch_size=16`, which is the maximum that could fit on a single V100 GPU in our case.
- We recommend using the `datasets` library by HuggingFace for efficient data loading (a minimal sketch follows below).
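As an illustration of `datasets`-based loading for masked-language-model pre-training, here is a minimal sketch using the HuggingFace `Trainer`. The corpus file names, base checkpoint, and most hyperparameters are placeholders rather than the exact BiodivBERT configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder corpus files (e.g., produced by train_test_split.py); names are illustrative.
raw = load_dataset("text", data_files={"train": "corpus/train.txt",
                                       "validation": "corpus/eval.txt"})

# Placeholder starting checkpoint; BiodivBERT's own vocabulary/initialization may differ.
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective: randomly mask 15% of the tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="biodivbert-pretraining",
    per_device_train_batch_size=16,   # the maximum that fit on a single V100 GPU in our case
    num_train_epochs=3,               # illustrative value
    save_steps=10_000,
)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"]).train()
```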
- We have fine-tuned BiodivBERT on two downstream tasks, Named Entity Recognition and Relation Extraction, using state-of-the-art datasets from the biodiversity domain.
- Datasets:
- COPIOUS
- QEMP
- BiodivNER
- Species-800
- LINNAEUS
- Code:
- We have fine-tuned BiodivBERT for NER (`/NER`) on a single TPU provided by Colab Pro, for a few hours per dataset (see the fine-tuning sketch below).
- Datasets:
- GAD
- EU-ADR
- BioRelEx
- BiodivRE
- As with NER, we have fine-tuned BiodivBERT for RE (`/RE`) on a single TPU provided by Colab Pro, for a few hours per dataset.
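For reference, here is a minimal token-classification fine-tuning sketch on a toy BIO-tagged example; the label set, example sentence, and hyperparameters are illustrative, not the configuration used for the datasets listed above. Relation extraction follows the same recipe with `AutoModelForSequenceClassification` and one label per sentence.

```python
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

# Toy stand-in for a BIO-tagged NER corpus such as BiodivNER (labels are illustrative).
label_list = ["O", "B-Organism", "I-Organism"]
toy = Dataset.from_dict({
    "tokens": [["Quercus", "robur", "grows", "across", "Europe", "."]],
    "ner_tags": [[1, 2, 0, 0, 0, 0]],
})

tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")

def tokenize_and_align(example):
    # Re-align word-level tags to subword tokens; special tokens get -100 (ignored by the loss).
    # For simplicity, continuation subwords reuse their word's tag.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [-100 if wid is None else example["ner_tags"][wid]
                     for wid in enc.word_ids()]
    return enc

train_ds = toy.map(tokenize_and_align, remove_columns=["tokens", "ner_tags"])

model = AutoModelForTokenClassification.from_pretrained(
    "NoYo25/BiodivBERT", num_labels=len(label_list))

args = TrainingArguments(output_dir="biodivbert-ner",
                         per_device_train_batch_size=16,  # illustrative
                         num_train_epochs=3)              # illustrative

Trainer(model=model, args=args, train_dataset=train_ds,
        data_collator=DataCollatorForTokenClassification(tokenizer)).train()
```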
- Masked Language Model
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForMaskedLM.from_pretrained("NoYo25/BiodivBERT")
```
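As a quick sanity check, you can wrap the loaded model in a `fill-mask` pipeline; the example sentence below is only illustrative.

```python
from transformers import pipeline

# Reuses the tokenizer/model objects loaded above.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"Biodiversity loss is driven by habitat {tokenizer.mask_token}."))
```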
- Token Classification - Named Entity Recognition
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForTokenClassification.from_pretrained("NoYo25/BiodivBERT")
```
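A minimal inference sketch for the token-classification head follows; the sentence is illustrative, and the head should be fine-tuned on one of the NER datasets above before the predicted tags are meaningful.

```python
import torch

# Illustrative sentence; predictions are only meaningful after NER fine-tuning.
inputs = tokenizer("Quercus robur is widespread across European forests.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, seq_len, num_labels)
predicted_tag_ids = logits.argmax(dim=-1)[0]  # one tag id per subword token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, predicted_tag_ids.tolist())))
```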
- Sequence Classification - Relation Extraction
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("NoYo25/BiodivBERT")
model = AutoModelForSequenceClassification.from_pretrained("NoYo25/BiodivBERT")
```
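And a matching inference sketch for relation extraction as sequence classification; the entity-marker style below follows common biomedical RE preprocessing (e.g., for GAD/EU-ADR) and is illustrative, and the classifier head should be fine-tuned first for the predicted label to be meaningful.

```python
import torch

# Entity-masked input in the style of common biomedical RE preprocessing (illustrative).
text = "@GENE$ variants were significantly associated with @DISEASE$ risk."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, num_labels)
predicted_label_id = logits.argmax(dim=-1).item()
print(predicted_label_id)                      # map back to a relation label after fine-tuning
```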