Short Homo Sapien Assembly from Genome in a Bottle Data #23

NTNguyen13 · 2020-10-26T04:04:45Z

Hi, I'm trying HASLR using data from GIAB: https://github.com/genome-in-a-bottle/giab_data_indexes/tree/master/NA12878

All PacBio HIFI reads (~30X), cat into a single fastq.gz file.
Subsampled Illumina Short Reads from 300X to 55X, cat into 2 paired-end fastq files.

I used haslr with this command

haslr.py \
    -t 14 \
    -g 3g \
    -l $long_read \
    -x corrected \
    -s $short_read \
    -o ~/NA12878/Assembly_HASLR_55X_30X \

However, the result asm.final.fa only have 576MB in size, and only cover around 10% of the GRCh38, reported by QUAST. I even tried to increase genome_size option to 4G, and --cov-l from 25 to 30, but HASLR still generate exactly the same lr*x.fasta and asm.final.fa. I even tried using only cat on 2 original long read file, but the lr*x.fasta is still the same.

What have possibly go wrong in my case?

The text was updated successfully, but these errors were encountered:

CourcelleM · 2020-10-26T09:41:07Z

Hi,

I'm highly interested in the answer to this issue as I have exactly the same problem with very small final assemblies.

I would appreciated your help,

Thanks,

Maxime

jelber2 · 2020-10-26T11:30:06Z

You might use kmergenie to estimate the „optimal“ k for the Illumina reads then try that instead of the default 49.

NTNguyen13 · 2020-10-27T02:23:12Z

Thanks! I will try it out. However, I still don't understand why the long read subsampling produces the same file despite using 2 different long read files, or different subsampling threshold

NTNguyen13 · 2020-10-27T15:43:46Z

After changing k=19 according to kmergenie, I got another error:

ERROR: "haslr_assemble" returned non-zero exit status

jelber2 · 2020-10-27T15:58:51Z

Are you sure kmergenie gave you a k=19? I was expecting something more like k~101 depending on the read length. Not sure about the new error.

haghshenas · 2020-10-27T16:26:35Z

@NTNguyen13 thanks for reporting low-sized final assembly, as well as the error with k=19.
I'm going to follow the steps you have done and see if I can reproduce what you get. In general, if the average length of long reads is low I could expect to see more fragmented and lower-sized final assembly (this is because "shorter" long reads might not be able to connect distant unique short read contigs). However, I'm going to check if this is the case here.
With regards to k=19, my suspicion is that contigs generated by Minia are very short and therefore not useful for HASLR.
I'll try both cases and get back to you.

NTNguyen13 · 2020-10-28T04:19:41Z

NA12878_R1_15X_merge_kmer.dat.pdf

@jelber2 hi, this is the histogram of kmer size
@haghshenas thank you for your support. I suspect that the problem come from using cat on fastq.gz file of CCS. I tried 3 screnarios:

HASLR on heterozygous genomes #1 Using 1 original read: lr25x.fasta size is ~30GB
minimum coverage requirements? #2 Cat 2 original reads: lr25x.fasta size is ~5.2GB
Arranging files with paired end short-reads #3 Cat all original reads: lr25x.fasta size is ~5.2GB. File content is exactly the same with the minimum coverage requirements? #2 scenario.

But I also tried using the lr25x.fasta of #1 scenario with short read k=19, it resulted in

[28-Oct-2020 09:56:26] aligning long reads to short read assembly using minimap2... failed
ERROR: "minimap2" returned non-zero exit status

Edit: I tried with k=49, it still gives non-zero exit status for minimap2

NTNguyen13 · 2020-11-04T09:55:27Z

hi @haghshenas, I re-downloaded the Long read file, this time I use sra-toolkit to download the SRA files, then convert them to fastq to make sure all files are well-preserved.

However, the same problem about subsampling long read still persist:

1. Set --cov-l 10, file size ~26.1 GB
1. Set --cov-l 25, file size ~26.1 GB
1. Set --cov-l 30, file size ~26.1 GB

NTNguyen13 · 2020-11-09T03:22:34Z

for checking integrity of Long read fastq file, I aligned it to HG38 using minimap2, the coverage is around 30X, as expected.

iek · 2021-09-17T17:00:08Z

Did this issue ever get resolved? I am also having very short genome assemblies compared to the reference genome.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Short Homo Sapien Assembly from Genome in a Bottle Data #23

Short Homo Sapien Assembly from Genome in a Bottle Data #23

NTNguyen13 commented Oct 26, 2020 •

edited

Loading

CourcelleM commented Oct 26, 2020

jelber2 commented Oct 26, 2020

NTNguyen13 commented Oct 27, 2020

NTNguyen13 commented Oct 27, 2020

jelber2 commented Oct 27, 2020

haghshenas commented Oct 27, 2020

NTNguyen13 commented Oct 28, 2020 •

edited

Loading

NTNguyen13 commented Nov 4, 2020

NTNguyen13 commented Nov 9, 2020

iek commented Sep 17, 2021

Short Homo Sapien Assembly from Genome in a Bottle Data #23

Short Homo Sapien Assembly from Genome in a Bottle Data #23

Comments

NTNguyen13 commented Oct 26, 2020 • edited Loading

CourcelleM commented Oct 26, 2020

jelber2 commented Oct 26, 2020

NTNguyen13 commented Oct 27, 2020

NTNguyen13 commented Oct 27, 2020

jelber2 commented Oct 27, 2020

haghshenas commented Oct 27, 2020

NTNguyen13 commented Oct 28, 2020 • edited Loading

NTNguyen13 commented Nov 4, 2020

NTNguyen13 commented Nov 9, 2020

iek commented Sep 17, 2021

NTNguyen13 commented Oct 26, 2020 •

edited

Loading

NTNguyen13 commented Oct 28, 2020 •

edited

Loading