Short Homo Sapien Assembly from Genome in a Bottle Data #23
Comments
Hi, I'm very interested in the answer to this issue, as I have exactly the same problem with very small final assemblies. I would appreciate your help. Thanks, Maxime
You might use kmergenie to estimate the "optimal" k for the Illumina reads, then try that instead of the default 49.
Thanks! I will try it out. However, I still don't understand why the long read subsampling produces the same file even when I use 2 different long read files or different subsampling thresholds.
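One way to check this more precisely is to compare the read names in the two subsampled files rather than only their sizes. Below is a minimal Python sketch, not part of HASLR; the function names are made up for illustration, and it assumes the read identifier is the first word after `>` on each FASTA header line.

```python
def fasta_ids(lines):
    """Collect read identifiers (first word after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def compare_fasta_ids(lines_a, lines_b):
    """Return counts of identifiers (shared, only in A, only in B)."""
    a = fasta_ids(lines_a)
    b = fasta_ids(lines_b)
    return len(a & b), len(a - b), len(b - a)
```

To use it, open the two `lr*x.fasta` files from the different runs and pass the file handles to `compare_fasta_ids`. If both "only in" counts are zero, the two runs really did select the same reads.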
After changing to k=19 according to kmergenie, I got another error:
Are you sure kmergenie gave you k=19? I was expecting something more like k~101, depending on the read length. I'm not sure about the new error.
@NTNguyen13 thanks for reporting the small final assembly, as well as the error with k=19.
@jelber2 hi, this is the k-mer histogram: NA12878_R1_15X_merge_kmer.dat.pdf
But I also tried using the lr25x.fasta of the #1 scenario with short read k=19, and it resulted in
Edit: I tried with k=49, and it still gives a non-zero exit status for minimap2.
Hi @haghshenas, I re-downloaded the long read files. This time I used sra-toolkit to download the SRA files and then converted them to fastq, to make sure all files are intact. However, the same long read subsampling problem still persists:
Did this issue ever get resolved? I am also getting very short genome assemblies compared to the reference genome.
Hi, I'm trying HASLR using data from GIAB: https://github.com/genome-in-a-bottle/giab_data_indexes/tree/master/NA12878
I used `cat` to combine the long reads into a single fastq.gz file, and `cat` to combine the short reads into 2 paired-end fastq files.
I used haslr with this command
However, the resulting `asm.final.fa` is only 576MB in size and only covers around 10% of GRCh38, as reported by QUAST. I even tried increasing the `genome_size` option to 4G and `--cov-l` from 25 to 30, but HASLR still generates exactly the same `lr*x.fasta` and `asm.final.fa`. I even tried using only `cat` on the 2 original long read files, but the `lr*x.fasta` is still the same.
What could possibly have gone wrong in my case?
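One explanation worth ruling out, offered here only as an assumption about how the subsampling behaves: if `lr*x.fasta` is built by keeping long reads up to roughly coverage × genome size bases, then an input with fewer total bases than that target would be kept in full, and changing the genome size or the coverage option would not change the output. The Python sketch below estimates how much coverage the long reads actually provide. The function names and the example values in the usage note are hypothetical, not taken from this thread.

```python
import gzip


def total_bases(fastq_lines):
    """Sum sequence lengths, assuming standard 4-line FASTQ records."""
    total = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # second line of every record holds the sequence
            total += len(line.strip())
    return total


def estimated_coverage(n_bases, genome_size):
    """Average depth if n_bases were spread evenly over the genome."""
    return n_bases / genome_size


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def report(path, genome_size, requested_cov):
    """Print available coverage and whether subsampling could drop any reads."""
    with open_text(path) as handle:
        cov = estimated_coverage(total_bases(handle), genome_size)
    print(f"available: {cov:.1f}x, requested: {requested_cov}x")
    if cov <= requested_cov:
        print("all reads fit under the target, so the subsample may equal the input")
```

For example, `report("long_reads.fastq.gz", 3.1e9, 25)` would print the estimated depth for a placeholder file against an approximate human genome size. If the available coverage is below the requested value, identical `lr*x.fasta` files across runs would be the expected outcome under the assumption above.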