Tags: #pangenome #pggb 🏠 home
[! info] Purpose Model the pangenome at sequence- and gene-level.
Prepare a list of files containing all the possible combinations of the genomes from n = 1 to n = total number of genomes.
PANGENOME.prep_genomes_combinations_4server.R
For every combination of genomes from, 1 genome to n genomes, the total number of nodes and the total length of nodes is determined for each class; core (core), dispensable (disp), and private (priv).
# concatenate the path file for the combination
for comb in $(ls comb/genome_nb${i}*txt); do comb_name=$(basename ${comb} .txt); cat ${comb} | while read line; do cat vg/${line}.thru.all.on.all.wfmash_s10000p85n1.chr*.seqwish_k49.smooth.join.path; done > comb/${comb_name}.path; done
# get the freq for the combination
for path in $(ls comb/genome_nb${i}*path); do comb_name=$(basename ${path} .path); sort ${path} | uniq -c > comb/${comb_name}.freq; done
# reformat the freq
for freq in $(ls comb/genome_nb${i}*freq); do comb_name=$(basename ${freq} .freq); echo "paste <(sed 's:s.*\$::g' ${freq} | sed 's: ::g') <(sed 's:^.*s:s:g' ${freq} | sed 's: ::g') > comb/${comb_name}_freq.txt"; done > reformat_freq.sh
parallel -j 9 :::: reformat_freq.sh
# class
for freq in $(ls comb/genome_nb${i}*_freq.txt); do comb_name=$(basename ${freq} _freq.txt); echo "grep -wFf <(awk '\$1==1' ${freq} | cut -f 2) fasta/all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.fasta.len > comb/${comb_name}.priv.len; grep -wFf <(awk -v "i=$i" '\$1>1 && \$1<i' ${freq} | cut -f 2) fasta/all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.fasta.len > comb/${comb_name}.disp.len; grep -wFf <(awk -v "i=$i" '\$1==i' ${freq} | cut -f 2) fasta/all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.fasta.len > comb/${comb_name}.core.len"; done > class_freq.sh
parallel -j 9 :::: class_freq.sh
# stats
for freq in $(ls comb/genome_nb${i}*.len); do comb_name=$(basename ${freq} .len); echo "paste <(wc -l ${freq} | cut -f 1 -d ' ') <(awk '{s+=\$2}END{print s}' ${freq}) > comb/${comb_name}.stats"; done > get_stats.sh
parallel -j 9 :::: get_stats.sh
# merge
for stats in $(ls comb/*stats); do name=$(basename ${stats} .stats); genome_nb=$(echo ${name} | sed 's:genome_nb::g' | sed 's:\..*$::g'); comb=$(echo ${name} | sed 's:^.*comb::g' | sed 's:\..*$::g'); type=$(echo ${name} | sed 's:^.*\.::g'); n=$(cat ${stats} | cut -f 1); len=$(cat ${stats} | cut -f 2); echo -e "${genome_nb}\t${comb}\t${type}\t${n}\t${len}"; done > comb/all_stats.txt
PANGENOME.seq_pangenome_modeling.R
To intersect the pangenome nodes with any annotations, we need first to get a bed files describing the node composition of each genome.
PANGENOME.smoothxg_chr_path2bed_4server.R
for genome in $(cat genome_loc.txt); do name=$(basename ${genome}); echo "cat vg/${name}.hap*.chr*thru.all.on.all.wfmash_s10000p85n1.chr*.seqwish_k49.smooth.join.bed > vg/${name}.thru.all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.bed"; done > merge_bed.smooth.join.sh
parallel -j 24 :::: merge_bed.smooth.join.sh
For each genome, an awk '$3=="gene"'
was performed on the gene annotation gff3.
[! attention] Do not run intersect in parallel
# intersect
for genome in $(cat genome_loc.txt); do name=$(basename ${genome}); bedtools intersect -wo -a assembly/${name}.gene.gff3 -b vg/${name}.thru.all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.bed > intersect/${name}.thru.all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.intersect_gene.txt; done
# simplify
for genome in $(cat genome_loc.txt); do name=$(basename ${genome}); echo "awk 'BEGIN{FS=\"\t\";OFS=\"\t\"} {gsub(/;Name.*\$/, \"\", \$9); gsub(/;Description.*\$/, \"\", \$9); gsub(/ID=/, \"\", \$9); print \$9,\$13,\$16}' intersect/${name}.thru.all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.intersect_gene.txt > intersect/${name}.thru.all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.intersect_gene_ez.txt"; done > simplify_intersect_smooth_join.sh
parallel -j 9 :::: simplify_intersect_smooth_join.sh
The modeling of the gene pangenome is more complex than for the pangenome graph because it is hard to identify the genomes from which come a dispensable gene. To make it simpler, the process is based on a binary matrix that class the combinations of all the genes from all genomes as core or private genes. From this matrix, in any combination with more than two genomes, dispensable genes can be identified and counted.
PANGENOME.gene_intersect_smoothxg_path_modeling_4server.R
smoothxg <- Previous Step | home | Next Step -> Graph-inferred gene pangenome