Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[How to use with custom data] NCBI human gene model #265

Open
damianosmel opened this issue Jun 3, 2022 · 1 comment
Open

[How to use with custom data] NCBI human gene model #265

damianosmel opened this issue Jun 3, 2022 · 1 comment

Comments

@damianosmel
Copy link

Dear pyensembl team,

I would like to use pyensembl for the genemodel gff for human species download from NCBI.

As this is not from ensembl, I have followed the your mentioned guide.
However, I could not in the end do "pyensembl install .."

My detailed steps were:

  1. convert the initial ncbi gff to gtf format
    agat_convert_sp_gff2gtf.pl --gff /path/to/GCF_000001405.39_GRCh38.p13_genomic.gff -o /path/to/ncbi_grch38.gtf > /path/to/gff2gff_conversion.log

  2. install this new genome model
    pyensembl install --reference-name GRCh38 --annotation-name NCBI --gtf "/path/to/ncbi_grch38.gtf"

However at the second step I get the attached log output and including the error:

raise ValueError(
ValueError: Column 'transcript_id' can't be primary key of table 'transcript' because it contains repeated values

pyensembl_install_ncbi.log

Grateful to your suggestion on how to resolve this error and so use the ncbi genome annotation :)

Many thanks!
Damianos

@damianosmel
Copy link
Author

Dear pyensembl team,

I have tried to re-install the ncbi genome model and the relevant fasta files.

to do so, I have done:

from pyensembl import Genome
ncbi_genome_model = Genome(reference_name='GRCh38',annotation_name='ncbi',gtf_path_or_url='GCF_000001405.39_GRCh38.p13_genomic.gtf.gz',transcript_fasta_paths_or_urls='GCF_000001405.39_GRCh38.p13_genomic.fna.gz',protein_fasta_paths_or_urls='GCF_000001405.39_GRCh38.p13_translated_cds.faa.gz',copy_local_files_to_cache=True,cache_directory_path='/home/path/to/.cache/pyensembl')
ncbi_genome_model.index()

Then when I would like to extract the coding sequence for a transcript using ncbi transcript id I get:

slc26a4_transcript = ncbi_genome_model.transcript_by_id("NM_000441.2")
print(slc26a4_transcript.coding_sequence)
# will output: None

But using ensembl annotation it works:

from pyensembl import EnsemblRelease
data = EnsemblRelease(106)
slc26a4_transcript_ensembl = data.transcript_by_id("ENST00000644269")
print(slc26a4_transcript_ensembl.coding_sequence)
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/damianos.melidis/.cache/pyensembl/GRCh38/ensembl106/Homo_sapiens.GRCh38.cdna.all.fa.gz.pickle
INFO:pyensembl.sequence_data:Loaded sequence dictionary from /home/damianos.melidis/.cache/pyensembl/GRCh38/ensembl106/Homo_sapiens.GRCh38.ncrna.fa.gz.pickle
# will output: ATGGCA...TGA (skipped the rest of the coding sequence for space)

The output of defining the genome based on ncbi annotation and indexing the annotation is attached.
index_ncbi_model_in_code.log

I feel I had some progress by adding the transcript and protein sequence data, but still I can't get the same amount of information as using the ensembl.

Grateful to your ideas on what I am doing wrong here..

Many thanks,
Damianos

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant