Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Tabix #125

Open
romanzenka opened this issue Jan 22, 2019 · 9 comments
Open

Support Tabix #125

romanzenka opened this issue Jan 22, 2019 · 9 comments

Comments

@romanzenka
Copy link

This is similar to issue #107.

We would like an interface where one can specify chromosome + region, and obtains only the parts of the VCF that correspond to this query. This would most likely mean supporting tabix indexed vcf files. Reading the whole VCF is simply not going to happen.

The current skip and nrows parameters are not useful to us. To know those values, we would first have to index the vcf somehow - exactly the job for Tabix.

@knausb
Copy link
Owner

knausb commented Jan 22, 2019

Hello, I'd say this is relevant to #124 as well. Here is the documentation for tabix. I believe that bgzip -i creates the same index that tabix does. Could you validate that for me?

@romanzenka
Copy link
Author

romanzenka commented Jan 22, 2019

@knausb I do not think bgzip -i creates the same index. This is what source code of bgzip says:

fprintf(fp, "   -i, --index                compress and create BGZF index\n");
fprintf(fp, "   -I, --index-name FILE      name of BGZF index file [file.gz.gzi]\n");

The thing is, I do not see any input parameters for bgzip that would let you specify chromosome and position columns to base the index on. So I believe the BGZF index file must be something else - probably a map of all block offsets and their starts in the original file. In other words - this would let you seek if you know byte offsets to read from, but it won't let you make a query "chromosome 2, positions 10,000-20,000" like Tabix does. Would need to dig further to confirm.

@knausb
Copy link
Owner

knausb commented Jan 22, 2019

The index created by bgzip is *.gzi while tabix creates *.tbi and the tabix index is a larger file. So they are not the same.

@romanzenka
Copy link
Author

Yeah. Well, we need the tabix index - how to go from genomic coordinates to a chunk of data.

@knausb
Copy link
Owner

knausb commented Jan 23, 2019

Thanks for the clarification in the indexes. I think you're correct. I found Rhtslib which I do not know how to use. But it looks like it may be a solution.

@romanzenka
Copy link
Author

romanzenka commented Jan 23, 2019

Ah, I just contributed a pull request to samtools/htsjdk#1248 which is a Java version of htslib. I do not have experience with the C version, but if it is anything like the java one, that's definitely a way to go.

Except I looked at Rhtslib and it is literally nothing but a wrapper around htslib. It has no interface, I believe you have to write C code to link to it. So it's not super useful for a casual R developer.

@knausb
Copy link
Owner

knausb commented Jan 23, 2019

Thanks for the validation, that helps! I see Rhtslib at 1.7.0 while htslib is at 1.9.0. Does that matter? According to the documentation it sounds like they would be amenable to to incrementing Rhtslib.

"not super useful for a casual R developer" yeah, it looks like there's going to be a learning curve. But I think its the the way to go. File IO is a bottle neck and one of the things that inspired me to create vcfR. vcfR currently uses Rcpp to get that performance. And I want to believe all of these libraries will play nicely together. But it might take me a bit to figure it all out.

@romanzenka
Copy link
Author

romanzenka commented Jan 24, 2019

@knausb So my understanding is that you wrote vcfR using C code you made from scratch. I do not see anything wrong with your approach - htslib is however more of a building block for many kinds of genomic I/O beyond VCF. It might be wise to join forces with them, vcfR becoming a comfy wrapper around htslib, inheriting all the new fetures - that said, I do not know enough about their C code quality at the moment. It is quite possible your code is better, because it is more specialized and more oriented towards the R integration. That's something that would have to be benchmarked and figured out.

@RichardJActon
Copy link

This package might be useful for interfacing with tabix: https://github.com/zhanxw/seqminer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants