Contains adapters for connecting EMBL content into GBIF.
This repository contains an EMBL API crawler whose output is used to produce DwC-A files suitable for ingestion into GBIF.
The results are four EMBL-EBI datasets:
- INSDC Sequences
- INSDC Environment Sample Sequences
- INSDC Host Organism Sequences
- The European Nucleotide Archive (ENA) taxonomy
Expected use of the EMBL API by the crawler is described in this working document.
The adapter is configured to run once a week at a specific time. See the properties startTime and frequencyInDays in the gbif-configuration project.
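As an illustration of how those two properties drive the schedule, here is a minimal sketch; the property values and the scheduling code itself are assumptions for the example, not the adapter's actual implementation:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulerSketch {
  public static void main(String[] args) {
    // Placeholder values; the real ones come from the startTime and
    // frequencyInDays properties in gbif-configuration.
    LocalTime startTime = LocalTime.parse("02:00");
    int frequencyInDays = 7;

    // Delay until the next occurrence of startTime, then repeat every frequencyInDays.
    LocalDateTime now = LocalDateTime.now();
    LocalDateTime firstRun = now.toLocalDate().atTime(startTime);
    if (firstRun.isBefore(now)) {
      firstRun = firstRun.plusDays(1);
    }
    long initialDelayMinutes = Duration.between(now, firstRun).toMinutes();

    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleAtFixedRate(
        () -> System.out.println("Run EMBL adapter"),
        initialDelayMinutes,
        TimeUnit.DAYS.toMinutes(frequencyInDays),
        TimeUnit.MINUTES);
  }
}
```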
Basic steps of the adapter:
- Request data from the ENA portal API: two requests for each dataset plus one optional taxonomy request
- Store raw data in the database
- Process the data (backend deduplication) and store the processed data in the database
- Clean up temporary files
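A rough outline of one run, with hypothetical method names standing in for the real class structure:

```java
// Illustrative outline of a single adapter run; the method names are hypothetical
// and do not reflect the actual class structure.
public class AdapterRunSketch {

  public void run() {
    // Two portal API requests per dataset (result=sequence and result=wgs_set),
    // plus an optional taxonomy request.
    downloadSequenceData();
    downloadWgsSetData();
    downloadTaxonomy();

    // Raw responses go into PostgreSQL first.
    storeRawData();

    // Backend deduplication and taxonomy join, written to the processed tables.
    processAndStoreData();

    // Temporary download files are removed at the end of the run.
    cleanTemporaryFiles();
  }

  private void downloadSequenceData() { /* ... */ }
  private void downloadWgsSetData() { /* ... */ }
  private void downloadTaxonomy() { /* ... */ }
  private void storeRawData() { /* ... */ }
  private void processAndStoreData() { /* ... */ }
  private void cleanTemporaryFiles() { /* ... */ }
}
```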
We get data from https://www.ebi.ac.uk/ena/portal/api. See the API documentation provided by EBI.
Requests requestUrl1 (sequence) and requestUrl2 (wgs_set) can be seen in the gbif-configuration project.
Request with result=sequence
- a dataset for eDNA: environmental_sample=true, no host
  query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND environmental_sample=true AND host!="*"
  - include records with environmental_sample=true
  - include records with coordinates and/or specimen_voucher
  - exclude records with dataclass="CON", see here
  - exclude records with a host
- a dataset for sequenced organisms: environmental_sample=false, no host
  query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND environmental_sample=false AND host!="*"
  - include records with environmental_sample=false
  - include records with coordinates and/or specimen_voucher
  - exclude records with dataclass="CON", see here
  - exclude records with a host
- a dataset for host organisms
  query=(specimen_voucher="*" OR country="*") AND dataclass!="CON" AND host="*" AND host!="human" AND host!="Homo sapiens" AND host!="Homo_sapiens"
  - include records with coordinates and/or specimen_voucher
  - include records with a host
  - exclude records with dataclass="CON", see here
  - exclude records with a human host
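For illustration, the eDNA request above could be issued against the portal API search endpoint roughly as follows. The field list, the TSV format and the limit value are assumptions for this sketch; the actual request URLs are the ones configured in gbif-configuration:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class EnaSearchSketch {
  public static void main(String[] args) throws Exception {
    // Query for the eDNA dataset, as described above.
    String query = "(specimen_voucher=\"*\" OR country=\"*\") AND dataclass!=\"CON\""
        + " AND environmental_sample=true AND host!=\"*\"";

    // Illustrative field list; the real one is defined in gbif-configuration.
    String fields = "accession,scientific_name,country,location,specimen_voucher,"
        + "collection_date,sample_accession,sequence_md5,host";

    String url = "https://www.ebi.ac.uk/ena/portal/api/search"
        + "?result=sequence"
        + "&query=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
        + "&fields=" + URLEncoder.encode(fields, StandardCharsets.UTF_8)
        + "&format=tsv"
        + "&limit=0"; // limit=0 requests all matching records

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
              HttpResponse.BodyHandlers.ofString());

    System.out.println(response.body());
  }
}
```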
Request with result=wgs_set
These requests are almost the same, with a few differences:
- the sequence_md5 field is not supported, so specimen_voucher is used twice to keep the number of fields the same
- the dataclass filter is not used
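Applied to the sketch above, the result=wgs_set variant might look like this (again illustrative only; the configured requests are authoritative):

```java
public class WgsSetRequestSketch {
  public static void main(String[] args) {
    // Same structure as the sequence request above, with the two differences applied:
    // the dataclass filter is dropped, and specimen_voucher stands in for the
    // unsupported sequence_md5 field so the column count stays the same.
    String query = "(specimen_voucher=\"*\" OR country=\"*\")"
        + " AND environmental_sample=true AND host!=\"*\"";
    String fields = "accession,scientific_name,country,location,specimen_voucher,"
        + "collection_date,sample_accession,specimen_voucher,host";
    System.out.println("result=wgs_set&query=" + query + "&fields=" + fields);
  }
}
```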
The adapter requests the taxonomy separately: it downloads a zipped archive, unzips it and stores the content in the database. Configuration is here.
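A minimal sketch of the taxonomy step, assuming a placeholder archive URL (the real location and the target table are defined in the configuration):

```java
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class TaxonomyDownloadSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder URL: the real taxonomy archive location comes from the configuration.
    String taxonomyUrl = "https://example.org/ena-taxonomy.zip";

    HttpResponse<InputStream> response = HttpClient.newHttpClient()
        .send(HttpRequest.newBuilder(URI.create(taxonomyUrl)).GET().build(),
              HttpResponse.BodyHandlers.ofInputStream());

    Path workDir = Files.createTempDirectory("ena-taxonomy");
    try (ZipInputStream zip = new ZipInputStream(response.body())) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        // Extract each entry; the unzipped file(s) would then be loaded into the taxonomy table.
        Path target = workDir.resolve(entry.getName());
        if (entry.isDirectory()) {
          Files.createDirectories(target);
        } else {
          Files.createDirectories(target.getParent());
          Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
        }
      }
    }
    System.out.println("Taxonomy extracted to " + workDir);
  }
}
```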
The data is stored in the PostgreSQL database after execution. Each dataset has its own table with raw and processed data.
The database is created only once in the target environment, and the tables are cleaned up before every run.
Database creation scripts for data and taxonomy.
See gbif-configuration here and here for connection properties.
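For illustration, cleaning the tables before a run might look roughly like this; the connection details and table names below are placeholders, not the configured ones:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TableCleanupSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder connection properties; the real ones are in gbif-configuration.
    // Requires the PostgreSQL JDBC driver on the classpath.
    String url = "jdbc:postgresql://localhost:5432/embl";
    String user = "embl";
    String password = "changeme";

    // Hypothetical table names: one raw and one processed table per dataset.
    String[] tables = {
        "insdc_sequences", "insdc_sequences_processed",
        "insdc_environment_samples", "insdc_environment_samples_processed",
        "insdc_host_organisms", "insdc_host_organisms_processed"
    };

    try (Connection connection = DriverManager.getConnection(url, user, password);
         Statement statement = connection.createStatement()) {
      for (String table : tables) {
        // Tables are emptied before every run; the database itself is created only once.
        statement.executeUpdate("TRUNCATE TABLE " + table);
      }
    }
  }
}
```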
We perform several deduplication steps:
- Run the SQL (local copy here) to remove some duplicates and join the data with the taxonomy; based on the issue here
- Get rid of records missing both specimen_voucher and collection_date
- Keep only one record for each combination of sample_accession and scientific_name and get rid of the rest
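A minimal PostgreSQL-flavoured sketch of the last two steps (table and column names are illustrative, and the taxonomy join from the first step is omitted):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DeduplicationSketch {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:postgresql://localhost:5432/embl"; // placeholder connection details
    // Hypothetical table name; the real processing SQL is kept with the project configuration.
    String table = "insdc_sequences";

    try (Connection connection = DriverManager.getConnection(url, "embl", "changeme");
         Statement statement = connection.createStatement()) {

      // Drop records missing both specimen_voucher and collection_date.
      statement.executeUpdate(
          "DELETE FROM " + table
              + " WHERE (specimen_voucher IS NULL OR specimen_voucher = '')"
              + " AND (collection_date IS NULL OR collection_date = '')");

      // Keep a single record per (sample_accession, scientific_name) pair.
      statement.executeUpdate(
          "CREATE TABLE " + table + "_processed AS"
              + " SELECT DISTINCT ON (sample_accession, scientific_name) *"
              + " FROM " + table
              + " ORDER BY sample_accession, scientific_name, accession");
    }
  }
}
```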
The adapter stores all processed data back into the database (tables with the suffix _processed), which are then used by the IPT as SQL sources.
Test datasets (UAT):
- https://www.gbif-uat.org/dataset/ee8da4a4-268b-4e91-ab5a-69a04ff58e7a
- https://www.gbif-uat.org/dataset/768eeb1f-a208-4170-9335-2968d17c7bdc
- https://www.gbif-uat.org/dataset/10628730-87d4-42f5-b593-bd438185517f
and production ones (prod):
- https://www.gbif.org/dataset/583d91fe-bbc0-4b4a-afe1-801f88263016
- https://www.gbif.org/dataset/393b8c26-e4e0-4dd0-a218-93fc074ebf4e
- https://www.gbif.org/dataset/d8cd16ba-bb74-4420-821e-083f2bac17c2
Remember that all configuration files are in the private gbif-configuration project!
Configuration files in the directory src/main/resources do not affect the adapter and can be used, for example, for testing (local run).
Use the scripts start.sh and start-taxonomy.sh for local testing. Remember to provide valid logback and config files for the scripts (you may need to create the databases before running them).