Skip to content

Latest commit

 

History

History
125 lines (88 loc) · 7.47 KB

0.04_PGGB_smoothxg.md

File metadata and controls

125 lines (88 loc) · 7.47 KB

Pangenome: PGGB smoothxg

Tags: #pangenome #pggb #smoothxg 🏠 home


[! info] Purpose The smoothxg polishing of the pangenome allows to increase the compression level of the pangenome by solving nodes with a length < k-mer size selected. It will create more nodes, but will reduce the overall size of the pangenome.

https://github.com/pangenome/pggb https://github.com/pangenome/smoothxg

At the time of the analysis, smoothxg was not intended to be run as a standalone tool. The key was to look how the PGGB automatic pipeline calls smoothxg.

round 1

cd smoothxg

for chr in {01..19}; do smoothxg -t 48 -T 48 -g ../seqwish/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.gfa -w 68017 -K -X 100 -I 0.85 -R 0 -j 0 -e 0 -l 4001 -P "1,4,6,2,26,1" -O 0.03 -Y 1700 -d 0 -D 0 -m all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.presmooth.maf -Q "Consensus_" -V -o all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.presmooth.gfa 2> all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.presmooth.log; done

round 2

for chr in {01..19}; do smoothxg -t 48 -T 48 -g all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.presmooth.gfa -w 76619 -K -X 100 -I 0.85 -R 0 -j 0 -e 0 -l 4507 -P "1,4,6,2,26,1" -O 0.03 -Y 1700 -d 0 -D 0 -m all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.maf -Q "Consensus_" -V -o all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.gfa 2> all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.log; done

[! summary] Summary The pangenomes per chromosome are ready, now the node ID space must be corrected for further use. Since the pangenome was made per chromosome, for each chromosome, the first node is node #1. The correction will make the node of chromosome 2 to continue after chromosome 1 instead of restarting at 1.

convert to vg

for chr in {01..19}; do vg view --gfa-in smoothxg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.gfa --vg --threads 48 > vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.vg 2> logs/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.gfa2vg.err; done

correct IDs space

https://github.com/vgteam/vg/wiki/Index-Construction#with-haplotypes-or-with-many-paths

With 1 vg per chr, the id space is overlapping, step needed to correct for it.

cd vg

cp *vg backups/.

vg ids --join $(ls *seqwish_k49.smooth.vg)

convert back to gfa

for chr in {01..19}; do vg view --gfa --threads 48 vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.vg > vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.gfa 2> logs/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.vg2gfa.err; done

statistics

summary stats

for chr in {01..19}; do vg stats -z --threads 48 vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.gfa > vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.stats; done

extract fasta

# extract fasta
for chr in {01..19}; do awk 'BEGIN{FS="\t";OFS="\t"} $1=="S"' vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.gfa | awk 'BEGIN{FS="\t";OFS="\t"}  {gsub("S", ">s", $1); print $1$2"\n"$3}' > fasta/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.fasta; done

# extract len
for chr in {01..19}; do seqkit fx2tab --length --name fasta/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.fasta > fasta/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.fasta.len; done

extract paths

# extract paths with strand
for chr in {01..19}; do awk '$1=="P"' vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.gfa | cut -f 2-3 > vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path; done

# extract names
for chr in {01..19}; do cut -f 1 vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path > vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.id; done

# from the different chr graph, extract paths and concatenate per haplotype
for chr in {01..19}; do tot_len=$(wc -l vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.id | cut -f 1 -d ' '); for line_nb in `seq 1 ${tot_len}`; do name=$(sed -n "${line_nb}p" vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.id); echo "sed -n "${line_nb}p" vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path | cut -f 2 | tr ',' '\n' | sed 's:+:\\t+:g' | sed 's:-:\\t-:g' | sed 's:^:s:g'> vg/${name}.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path"; done; done > extract_smooth_join_path.sh

parallel -j 24 :::: extract_smooth_join_path.sh

# concatenate per genome
for chr in {01..19}; do for genome in $(cat genome_loc.txt); do name=$(basename ${genome}); echo "cat vg/${name}.hap[1-2].chr${chr}.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path | cut -f 1 | sort -u > path/${name}.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path"; done; done > smooth_join_path_concat_perGenome.sh

parallel -j 24 :::: smooth_join_path_concat_perGenome.sh

class paths

# concatenate all genomes paths
for chr in {01..19}; do for genome in $(cat genome_loc.txt); do name=$(basename ${genome}); cat vg/${name}.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path; done > vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path; done

# get freq per node
for chr in {01..19}; do cat vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path | sort | uniq -c > vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.freq; done

# check that the length of the node names
cat path/*smooth.join.freq | sed 's:^.*s:s:g' | wc -l
# is the same than the length of the unique node names
cat path/*smooth.join.freq | sed 's:^.*s:s:g' | sort -u | wc -l

# class
# core
for chr in {01..19}; do awk '$1==9' vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.freq > vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.core.freq; done

# disp
for chr in {01..19}; do awk '$1 < 9 && $1 > 1' vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.freq > vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.disp.freq; done

# priv
for chr in {01..19}; do awk '$1==1' vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.freq > vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.priv.freq; done

stats per class checkpoint

# get fasta len per class
for chr in {01..19}; do for i in $(ls vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.*.freq); do class=$(basename $i .freq | sed "s:allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.::g"); grep -wFf <(sed 's:^.*s:s:g' $i) fasta/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.fasta.len > fasta/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.${class}.fasta.len; done; done

Using the different stats, the number of nodes and their total lengths can be checked.

[! check] Success The correction of the ID space worked, node # starts from chromosome 1 and ends at the end of chromosome 19. Length and number match the statistics obtained prior to correction. The pangenome is ready.

seqwish <- Previous Step | home | Next Step -> modeling