Skip to content

Latest commit

 

History

History
88 lines (54 loc) · 5.06 KB

0.05_modeling.md

File metadata and controls

88 lines (54 loc) · 5.06 KB

Pangenome: modeling

Tags: #pangenome #pggb 🏠 home


[! info] Purpose Model the pangenome at sequence- and gene-level.

create all possible combinations of genomes

Prepare a list of files containing all the possible combinations of the genomes from n = 1 to n = total number of genomes.

PANGENOME.prep_genomes_combinations_4server.R

extract class stats per combination

For every combination of genomes from, 1 genome to n genomes, the total number of nodes and the total length of nodes is determined for each class; core (core), dispensable (disp), and private (priv).

# concatenate the path file for the combination
for comb in $(ls comb/genome_nb${i}*txt); do comb_name=$(basename ${comb} .txt); cat ${comb} | while read line; do cat vg/${line}.thru.all.on.all.wfmash_s10000p85n1.chr*.seqwish_k49.smooth.join.path; done > comb/${comb_name}.path; done

# get the freq for the combination
for path in $(ls comb/genome_nb${i}*path); do comb_name=$(basename ${path} .path); sort ${path} | uniq -c > comb/${comb_name}.freq; done

# reformat the freq
for freq in $(ls comb/genome_nb${i}*freq); do comb_name=$(basename ${freq} .freq); echo "paste <(sed 's:s.*\$::g' ${freq} | sed 's: ::g') <(sed 's:^.*s:s:g' ${freq} | sed 's: ::g') > comb/${comb_name}_freq.txt"; done > reformat_freq.sh

parallel -j 9 :::: reformat_freq.sh

# class
for freq in $(ls comb/genome_nb${i}*_freq.txt); do comb_name=$(basename ${freq} _freq.txt); echo "grep -wFf <(awk '\$1==1' ${freq} | cut -f 2) fasta/all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.fasta.len > comb/${comb_name}.priv.len; grep -wFf <(awk -v "i=$i" '\$1>1 && \$1<i' ${freq} | cut -f 2) fasta/all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.fasta.len > comb/${comb_name}.disp.len; grep -wFf <(awk -v "i=$i" '\$1==i' ${freq} | cut -f 2) fasta/all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.fasta.len > comb/${comb_name}.core.len"; done > class_freq.sh

parallel -j 9 :::: class_freq.sh

# stats
for freq in $(ls comb/genome_nb${i}*.len); do comb_name=$(basename ${freq} .len); echo "paste <(wc -l ${freq} | cut -f 1 -d ' ') <(awk '{s+=\$2}END{print s}' ${freq}) > comb/${comb_name}.stats"; done > get_stats.sh

parallel -j 9 :::: get_stats.sh

# merge
for stats in $(ls comb/*stats); do name=$(basename ${stats} .stats); genome_nb=$(echo ${name} | sed 's:genome_nb::g' | sed 's:\..*$::g'); comb=$(echo ${name} | sed 's:^.*comb::g' | sed 's:\..*$::g'); type=$(echo ${name} | sed 's:^.*\.::g'); n=$(cat ${stats} | cut -f 1); len=$(cat ${stats} | cut -f 2); echo -e "${genome_nb}\t${comb}\t${type}\t${n}\t${len}"; done > comb/all_stats.txt

sequence modeling

PANGENOME.seq_pangenome_modeling.R

gene modeling

create beds

To intersect the pangenome nodes with any annotations, we need first to get a bed files describing the node composition of each genome.

PANGENOME.smoothxg_chr_path2bed_4server.R

merge chromosome bed per genome

for genome in $(cat genome_loc.txt); do name=$(basename ${genome}); echo "cat vg/${name}.hap*.chr*thru.all.on.all.wfmash_s10000p85n1.chr*.seqwish_k49.smooth.join.bed > vg/${name}.thru.all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.bed"; done > merge_bed.smooth.join.sh

parallel -j 24 :::: merge_bed.smooth.join.sh

intersect nodes with genes per genome

For each genome, an awk '$3=="gene"' was performed on the gene annotation gff3.

[! attention] Do not run intersect in parallel

# intersect
for genome in $(cat genome_loc.txt); do name=$(basename ${genome}); bedtools intersect -wo -a assembly/${name}.gene.gff3 -b vg/${name}.thru.all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.bed > intersect/${name}.thru.all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.intersect_gene.txt; done

# simplify
for genome in $(cat genome_loc.txt); do name=$(basename ${genome}); echo "awk 'BEGIN{FS=\"\t\";OFS=\"\t\"} {gsub(/;Name.*\$/, \"\", \$9); gsub(/;Description.*\$/, \"\", \$9); gsub(/ID=/, \"\", \$9); print \$9,\$13,\$16}' intersect/${name}.thru.all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.intersect_gene.txt > intersect/${name}.thru.all.on.all.wfmash_s10000p85n1.seqwish_k49.smooth.join.intersect_gene_ez.txt"; done > simplify_intersect_smooth_join.sh

parallel -j 9 :::: simplify_intersect_smooth_join.sh

modeling

The modeling of the gene pangenome is more complex than for the pangenome graph because it is hard to identify the genomes from which come a dispensable gene. To make it simpler, the process is based on a binary matrix that class the combinations of all the genes from all genomes as core or private genes. From this matrix, in any combination with more than two genomes, dispensable genes can be identified and counted.

PANGENOME.gene_intersect_smoothxg_path_modeling_4server.R

smoothxg <- Previous Step | home | Next Step -> Graph-inferred gene pangenome