Tags: #pangenome #pggb #smoothxg 🏠 home
[! info] Purpose The smoothxg polishing of the pangenome allows to increase the compression level of the pangenome by solving nodes with a length < k-mer size selected. It will create more nodes, but will reduce the overall size of the pangenome.
https://github.com/pangenome/pggb https://github.com/pangenome/smoothxg
At the time of the analysis, smoothxg was not intended to be run as a standalone tool. The key was to look how the PGGB automatic pipeline calls smoothxg.
cd smoothxg
for chr in {01..19}; do smoothxg -t 48 -T 48 -g ../seqwish/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.gfa -w 68017 -K -X 100 -I 0.85 -R 0 -j 0 -e 0 -l 4001 -P "1,4,6,2,26,1" -O 0.03 -Y 1700 -d 0 -D 0 -m all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.presmooth.maf -Q "Consensus_" -V -o all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.presmooth.gfa 2> all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.presmooth.log; done
for chr in {01..19}; do smoothxg -t 48 -T 48 -g all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.presmooth.gfa -w 76619 -K -X 100 -I 0.85 -R 0 -j 0 -e 0 -l 4507 -P "1,4,6,2,26,1" -O 0.03 -Y 1700 -d 0 -D 0 -m all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.maf -Q "Consensus_" -V -o all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.gfa 2> all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.log; done
[! summary] Summary The pangenomes per chromosome are ready, now the node ID space must be corrected for further use. Since the pangenome was made per chromosome, for each chromosome, the first node is node #1. The correction will make the node of chromosome 2 to continue after chromosome 1 instead of restarting at 1.
for chr in {01..19}; do vg view --gfa-in smoothxg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.gfa --vg --threads 48 > vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.vg 2> logs/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.gfa2vg.err; done
https://github.com/vgteam/vg/wiki/Index-Construction#with-haplotypes-or-with-many-paths
With 1 vg per chr, the id space is overlapping, step needed to correct for it.
cd vg
cp *vg backups/.
vg ids --join $(ls *seqwish_k49.smooth.vg)
for chr in {01..19}; do vg view --gfa --threads 48 vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.vg > vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.gfa 2> logs/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.vg2gfa.err; done
for chr in {01..19}; do vg stats -z --threads 48 vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.gfa > vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.stats; done
# extract fasta
for chr in {01..19}; do awk 'BEGIN{FS="\t";OFS="\t"} $1=="S"' vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.gfa | awk 'BEGIN{FS="\t";OFS="\t"} {gsub("S", ">s", $1); print $1$2"\n"$3}' > fasta/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.fasta; done
# extract len
for chr in {01..19}; do seqkit fx2tab --length --name fasta/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.fasta > fasta/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.fasta.len; done
# extract paths with strand
for chr in {01..19}; do awk '$1=="P"' vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.gfa | cut -f 2-3 > vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path; done
# extract names
for chr in {01..19}; do cut -f 1 vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path > vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.id; done
# from the different chr graph, extract paths and concatenate per haplotype
for chr in {01..19}; do tot_len=$(wc -l vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.id | cut -f 1 -d ' '); for line_nb in `seq 1 ${tot_len}`; do name=$(sed -n "${line_nb}p" vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.id); echo "sed -n "${line_nb}p" vg/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path | cut -f 2 | tr ',' '\n' | sed 's:+:\\t+:g' | sed 's:-:\\t-:g' | sed 's:^:s:g'> vg/${name}.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path"; done; done > extract_smooth_join_path.sh
parallel -j 24 :::: extract_smooth_join_path.sh
# concatenate per genome
for chr in {01..19}; do for genome in $(cat genome_loc.txt); do name=$(basename ${genome}); echo "cat vg/${name}.hap[1-2].chr${chr}.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path | cut -f 1 | sort -u > path/${name}.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path"; done; done > smooth_join_path_concat_perGenome.sh
parallel -j 24 :::: smooth_join_path_concat_perGenome.sh
# concatenate all genomes paths
for chr in {01..19}; do for genome in $(cat genome_loc.txt); do name=$(basename ${genome}); cat vg/${name}.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path; done > vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path; done
# get freq per node
for chr in {01..19}; do cat vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.path | sort | uniq -c > vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.freq; done
# check that the length of the node names
cat path/*smooth.join.freq | sed 's:^.*s:s:g' | wc -l
# is the same than the length of the unique node names
cat path/*smooth.join.freq | sed 's:^.*s:s:g' | sort -u | wc -l
# class
# core
for chr in {01..19}; do awk '$1==9' vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.freq > vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.core.freq; done
# disp
for chr in {01..19}; do awk '$1 < 9 && $1 > 1' vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.freq > vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.disp.freq; done
# priv
for chr in {01..19}; do awk '$1==1' vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.freq > vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.priv.freq; done
# get fasta len per class
for chr in {01..19}; do for i in $(ls vg/allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.*.freq); do class=$(basename $i .freq | sed "s:allGenome.thru.all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.::g"); grep -wFf <(sed 's:^.*s:s:g' $i) fasta/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.fasta.len > fasta/all.on.all.wfmash_s10000p85n1.chr${chr}.seqwish_k49.smooth.join.${class}.fasta.len; done; done
Using the different stats, the number of nodes and their total lengths can be checked.
[! check] Success The correction of the ID space worked, node # starts from chromosome 1 and ends at the end of chromosome 19. Length and number match the statistics obtained prior to correction. The pangenome is ready.