related-sciences/ensembl-genes

Extract the Ensembl gene catalog to simple tables

This repository extracts a catalog of genes from the Ensembl database for multiple species including human, rat, and mouse. It is ideal for situations where you want to represent genes via stable Ensembl identifiers. Data is extracted by a series of SQL queries, followed by additional transformations in Python using Pandas. Tables are exported in TSV and Parquet formats to output branches that correspond to the Ensembl version.

Motivation

NCBI publishes the Homo_sapiens.gene_info.gz dataset of human genes with one row per gene. It includes useful metadata like the gene symbol, synonyms, and chromosome. However, we weren't able to find a comparable dataset for Ensembl genes (please let us know if this exists). Therefore, we combined several SQL queries, guided by Biostars answers (for example, to retrieve symbols, alternative sequence allele groups, and chromosomes) and by Open Targets pipelines, to extract simplified tabular datasets.

Note that the Ensembl core schema consists of many tables. There is a chance we have made mistakes, and we would appreciate any feedback or contributions. Please use GitHub Issues for contact.

Usage

Ensembl stores gene information in databases, where each database corresponds to a specific combination of species, release, and genome assembly. Each supported core database receives a corresponding output branch in this repository. For example, see the output/homo_sapiens_core_104_38 branch for datasets generated from Ensembl release 104 of the human genome using the GRCh38 assembly.
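The naming convention above can be taken apart programmatically. The following is a minimal sketch, not code from this repository: the names `CoreDatabase`, `parse_core_database`, and `output_branch` are hypothetical, and the pattern assumes database names always follow the `<species>_core_<release>_<assembly>` shape seen in the example.

```python
import re
from typing import NamedTuple


class CoreDatabase(NamedTuple):
    """Components of an Ensembl core database name (hypothetical helper type)."""

    species: str
    release: int
    assembly: str


# Assumed shape: <species>_core_<release>_<assembly>,
# for example homo_sapiens_core_104_38.
_CORE_DB_PATTERN = re.compile(
    r"^(?P<species>[a-z0-9_]+)_core_(?P<release>\d+)_(?P<assembly>\w+)$"
)


def parse_core_database(name: str) -> CoreDatabase:
    """Split a core database name into species, release, and assembly."""
    match = _CORE_DB_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognized core database name: {name!r}")
    return CoreDatabase(
        species=match["species"],
        release=int(match["release"]),
        assembly=match["assembly"],
    )


def output_branch(name: str) -> str:
    """Return the output branch name for a core database."""
    return f"output/{name}"


db = parse_core_database("homo_sapiens_core_104_38")
print(db.species, db.release, db.assembly)
print(output_branch("homo_sapiens_core_104_38"))
```

The assembly component is kept as a string because it is only an assembly version suffix (38 for GRCh38 in the human example), not a number to compute with.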

If you'd like to download all files for a specific gene catalog, you can use a command like the following (replacing homo_sapiens_core_104_38 with the desired database, see all current databases here):

# clone the relevant output branch to a local directory
git clone --branch=output/homo_sapiens_core_104_38 --depth=1 https://github.com/related-sciences/ensembl-genes.git
# optionally uninitialize git from the data directory
cd ensembl-genes && rm -rf .git
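Once a branch is cloned, the TSV tables can be read with standard tooling. The sketch below uses only the Python standard library; the function name `read_tsv` and the file name `genes.tsv` are assumptions for illustration, so check the output branch for the actual file and column names.

```python
import csv
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated table into a list of dicts keyed by column name."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    # Hypothetical file name: replace with a table present in the cloned branch.
    table_path = Path("ensembl-genes") / "genes.tsv"
    if table_path.exists():
        rows = read_tsv(table_path)
        print(f"read {len(rows)} rows")
        if rows:
            print("columns:", list(rows[0].keys()))
```

All values are returned as strings. For typed columns, the Parquet files loaded with a dataframe library such as Pandas are likely the more convenient choice.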

Maintainers can create exports for new Ensembl releases by running the export workflow (which is a workflow_dispatch GitHub Action). In addition, CI checks for a new Ensembl release every week, as reported by Bioversions, and runs an export for each species-specific database that does not already have one.
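The "export only if none exists" check can be pictured as a simple set difference between the databases expected for a release and the output branches already present. This is an illustrative sketch only, not the repository's CI code; the function name `databases_missing_export` and the input lists are hypothetical.

```python
def databases_missing_export(expected_databases, existing_branches):
    """Return the expected databases that have no output/<database> branch yet."""
    existing = set(existing_branches)
    return [
        database
        for database in expected_databases
        if f"output/{database}" not in existing
    ]


# Hypothetical inputs: databases for the latest release and current branch names.
expected = ["homo_sapiens_core_104_38", "mus_musculus_core_104_39"]
branches = ["main", "output/homo_sapiens_core_104_38"]
print(databases_missing_export(expected, branches))
```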

Development

# Install the environment
poetry install --no-root

# Update the lock file
poetry update

# Export datasets to output (change 104 to desired release)
poetry run ensembl_genes datasets --species=human --release=104

# Export notebooks to output (change 104 to desired release)
poetry run ensembl_genes notebooks --species=human --release=104

# Run tests
pytest

# Set up the git pre-commit hooks.
# `git commit` will now trigger automatic checks including linting.
pre-commit install

# Run all pre-commit checks (CI will also run this).
pre-commit run --all

License

This repository is released under the Apache License 2.0 (see LICENSE.md). In addition, output datasets are released under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.

Please familiarize yourself with the Ensembl data disclaimer:

Ensembl imposes no restrictions on access to, or use of, the data provided and the software used to analyse and present it. Ensembl data generated by members of the project are available without restriction. …

Some of the data and software included in the distribution may be subject to third-party constraints. Users of the data and software are solely responsible for establishing the nature of and complying with any such restrictions.

The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) provides this data and software in good faith, but make no warranty, express or implied, nor assume any legal liability or responsibility for any purpose for which they are used.

Readings

Here is a list of relevant works that help explain aspects of Ensembl genes:

  1. Accessing alternate sequences in human
    Bronwen Aken
    Ensembl Blog (2011-05-20)

  2. Patches and Haplotypes in the Human Genome
    Ensembl Training
    YouTube (2012-01-21)

  3. Ensembl insights: Annotating readthrough transcription in Ensembl
    Erin Haskell
    Ensembl Blog (2019-02-11)

  4. The Ensembl gene annotation system
    Bronwen L Aken, Sarah Ayling, Daniel Barrell, Laura Clarke, Valery Curwen, Susan Fairley, Julio Fernandez Banet, Konstantinos Billis, Carlos García Girón, Thibaut Hourlier, … Stephen MJ Searle
    Database (2016-06-23)
    DOI: 10.1093/database/baw093 · PMID: 27337980 · PMCID: PMC4919035

  5. Ensembl core software resources: storage and programmatic access for DNA sequence and genome annotation
    Magali Ruffier, Andreas Kähäri, Monika Komorowska, Stephen Keenan, Matthew Laird, Ian Longden, Glenn Proctor, Steve Searle, Daniel Staines, Kieron Taylor, … Paul Flicek
    Database (2017-01-01)
    DOI: 10.1093/database/bax020 · PMID: 28365736 · PMCID: PMC5467575