-
Notifications
You must be signed in to change notification settings - Fork 4
Day 4
In this tutorial, we will replicate some current analyses using the new tools we have developed. By the end of this tutorial, you will be able to:
- Map ancient metagenomic reads to a database of reference genomes to understand the sample's composition.
- Identify specific genomes in the ancient metagenome.
- Authenticate genomes based on damage patterns.
- Perform functional profiling of an ancient metagenomic sample.
First, let's create the necessary folder structure in our working directory (wdir
):
cd ~/course/wdir
mkdir -p day4
cd day4
ln -s ~/course/data/day4/ data
Now, activate the environment for today's session:
conda activate day4
Before moving forward, we need to install a missing package:
mamba install bbmap
pip install tabview
If everything went as expected, we will have our system ready for today.
Our approach for analysing ancient microbial data is a bit different from what is the norm in the field. After QC'ing our sequences we follow the following steps:
extension -> dereplication -> mapping -> filtering -> damage estimation
One of the first steps we do is to extend the reads by doing very gentle assemblies, but before proceeding let's get some statistics for our fastQ files:
seqkit stats -j 5 -T data/fastq/PRI-TJPGK-CATN-160-162.fq.gz | csvtk -t pretty
ℹ️ seqkit, csvtk and taxonkit are very useful tools. Check them at https://bioinf.shenwei.me/
And extend the reads using bbmap. We will need to do two different steps, first calculate how much memory we need so the extension is deterministic and then do the extension using the estimates:
MEM=$(loglog.sh seed=1234 k=17 in=data/fastq/PRI-TJPGK-CATN-160-162.fq.gz ignorebadquality 2> >(grep Cardinality) \
| awk -vP=0.5 -vB=16 -vH=3 '{{print int( (((B*H)/8)*$2)/P )}}')
tadpole.sh -Xmx10G \
k=17 \
in=data/fastq/PRI-TJPGK-CATN-160-162.fq.gz \
out=PRI-TJPGK-CATN-160-162.extended.fq.gz \
mode=extend \
ibb=f \
prefilter=0 \
el=100 er=100 \
threads=5 \
overwrite=true \
trimends=9 \
ecc=f ecco=f \
filtermem="${MEM}" \
conservative=t \
ignorebadquality
Get the stats of PRI-TJPGK-CATN-160-162.extended.fq.gz
Once the reads have been extended, we will dereplicate them:
seqkit rmdup -j 5 -s -o PRI-TJPGK-CATN-160-162.extended.derep.fq.gz PRI-TJPGK-CATN-160-162.extended.fq.gz
Get the stats of PRI-TJPGK-CATN-160-162.extended.derep.fq.gz
Before mapping, we will get the original reads using the extended-dereplicated as a guide. Although we could use this set of reads for mapping, at the moment we prefer not to (any ideas why?):
filterbyname.sh in=data/fastq/PRI-TJPGK-CATN-160-162.fq.gz out=PRI-TJPGK-CATN-160-162.mapping.fastq.gz names=PRI-TJPGK-CATN-160-162.extended.derep.fq.gz threads=5 overwrite=t include=t
So finally we have the reads we can use for mapping.Althoug, we use Bowtie2, we follow a slightly different strategy for mapping, First let's activate the conda environment that we need:
conda activate mapping
And let's do the mapping:
bowtie2 -p 5 -k 100 -D 10 -R 2 \
-N 0 -D 5 -R 1 -L 22 -i S,0,2.50 \
--np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.1" \
-x data/db/aegenomics.db \
-q PRI-TJPGK-CATN-160-162.mapping.fastq.gz --no-unal \
| samtools view -F 4 -b \
| samtools sort -@ 32 -m 8G -o PRI-TJPGK-CATN-160-162.sorted.bam
Expand for an explanation of the parameters
There's a lot going on in this last command. First, with -k we ask for 100 alignments per read. This is because many of those reads are going to be mapping equally well to multiple references. We can adjust the % identity in --score-min "L,0,-0.1", where the -0.1 == 90, 0.05 == 95% We will do a fast search by using -N 1 -D 5 -R 1 -L 22 -i S,0,2.50. Those are the preset parameters for a fast search in Bowtie2, but with the difference that we are using -N 1 to allow a mismatch in the seed to increase sensitivity and we will use.The part -i S,0,2.5 sets the interval function f to f(x) = 1 + 2.5 * sqrt(x), where x is the read length. It means how many seed substrings will generate.
This is just an example for the tutorial, in real life you might want to use:
recommended, good compromise between sensitivity, specificity and speed: -D 15 -R 2 -N 1 -L 22 -i S,1,1.15 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.1"
faster: -D 10 -R 2 -N 1 -L 22 -i S,0,2.50 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.1"
more sensitive, but ~10X slower: -D 15 -R 2 -N 1 -L 20 -i S,1,1.15 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.1"
super sensitive, but ~20X slower (Reduce -L if you want to be more sensitive): -D 15 -R 3 -N 1 -L 20 -i S,1,0.5 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.1"
Now we have the reads mapped and ready for the next step!
Now is time to remove some noise. As we set -k 100
for the mapping step, we will have lots of unspecific mappings. Also we will have references with uneven coverage or that the reads are clumped in a few regions.
First we will remove duplicates from the BAM files:
And we will use sambamba to remove duplicates. Sambamba has similar heuristics than Picard, but is way faster:
sambamba markdup -r -t 5 -p PRI-TJPGK-CATN-160-162.sorted.bam PRI-TJPGK-CATN-160-162.sorted.rmdup.bam
And then, we will filter our the noise:
filterBAM --chunk-size 25 \
--bam PRI-TJPGK-CATN-160-162.sorted.rmdup.bam \
-N \
-r data/misc/aegenomics.db.len.map \
-A 92 \
-a 94 \
-n 100 \
-b 0.75 \
-B 0.01 \
-t 5 \
--sort-memory 8G \
--include-low-detection \
--stats PRI-TJPGK-CATN-160-162.stats.tsv.gz \
--stats-filtered PRI-TJPGK-CATN-160-162.stats-filtered.tsv.gz \
--bam-filtered PRI-TJPGK-CATN-160-162.filtered.bam
Expand for an explanation of the parameters
There's a lot going on in this last command. We use -N to sort the filtered BAM files by name, so it can be used by metaDMG. The `-r` specifies where we can find the non-concatenated length of the references for the abundance estimation. With `-A` we will only keep those reads with at least 92% ANI and `-a` will keep those references with at least 94% of average ANI. The `-n` will keep only those references with at least 100 reads mapping.
Let's deactivate the mapping
environment:
conda deactivate
Let's have a look at the results of the filtering:
zcat PRI-TJPGK-CATN-160-162.stats-filtered.tsv.gz \
| csvtk cut -t -T -f "reference,n_reads,read_ani_mean,read_ani_std,coverage_mean,breadth,exp_breadth,breadth_exp_ratio,norm_entropy,norm_gini,cov_evenness,tax_abund_tad" \
| csvtk grep -r -t -v -f reference -p _plas -p _mito \
| csvtk sort -t -T -k "n_reads:Nr" \
| tabview -
We will explore four different examples to understand the effect of each filter. We will get the references GCA_002781685.1
, IMGVR_UViG_3300027782_000260
, GCA_014380485.1
and one artificial difficult case.
printf "GCA_002781685.1\nIMGVR_UViG_3300027782_000260\nGCA_014380485.1" > ref-list.txt
getRPercId --bam PRI-TJPGK-CATN-160-162.sorted.rmdup.bam --reference-list ref-list.txt --threads 5 --sort-memory 8G
And let's explore the coverage patterns:
/home/antonio/opt/conda/envs/day2/bin/bamcov -w 0 -m GCA_002781685.1.bam
/home/antonio/opt/conda/envs/day2/bin/bamcov -w 0 -m GCA_014380485.1.bam
/home/antonio/opt/conda/envs/day2/bin/bamcov -w 0 -m IMGVR_UViG_3300027782_000260.bam
/home/antonio/opt/conda/envs/day2/bin/bamcov -w 0 -m data/misc/example-4.bam
No that we have our set of refined references, is time to authenticate them, this means to look for damage patterns. We recently developed a new method that can estimate damage over thousands of taxa (LCA mode), references (local mode) or provide a global estimate (global mode).
We will use the Local mode to have an overview of the potential references that might show damage in the sample. First, we need to create the config file:
metaDMG config --config-file PRI-TJPGK-CATN-160-162.local.yaml \
--metaDMG-cpp /usr/local/bin/metaDMG-cpp \
--parallel-samples 1 \
--cores-per-sample 5 \
--output-dir PRI-TJPGK-CATN-160-162.local \
--max-position 35 \
--min-similarity-score 0.92 \
--damage-mode local \
PRI-TJPGK-CATN-160-162.filtered.bam
Let's compute:
metaDMG compute PRI-TJPGK-CATN-160-162.local.yaml
And we will export the results to a CSV:
metaDMG convert --add-fit-predictions \
--output PRI-TJPGK-CATN-160-162.csv.gz \
--results PRI-TJPGK-CATN-160-162.local/results
metaDMG dashboard --results PRI-TJPGK-CATN-160-162.local/results/
Optionally, we can also run the LCA mode:
conda activate metaDMG
metaDMG config --config-file PRI-TJPGK-CATN-160-162.lca.yaml \
--custom-database \
--names data/taxonomy/names.dmp \
--nodes data/taxonomy/nodes.dmp \
--acc2tax data/taxonomy/acc2taxid.map.gz \
--metaDMG-cpp /usr/local/bin/metaDMG-cpp \
--parallel-samples 1 \
--cores-per-sample 5 \
--output-dir PRI-TJPGK-CATN-160-162.lca \
--max-position 35 \
--lca-rank '' \
--min-similarity-score 0.92 \
--damage-mode lca \
--weight-type 1 \
PRI-TJPGK-CATN-160-162.filtered.bam
And compute the results:
metaDMG compute PRI-TJPGK-CATN-160-162.lca.yaml
Let's launch the interactive dashboard and have a look at the results:
metaDMG dashboard --results PRI-TJPGK-CATN-160-162.lca/results/
For the functional profiling we will use MMseqs2 with parameters tunned for ancient DNA in combination with the KEGG genes database.
conda activate day4
mmseqs easy-search PRI-TJPGK-CATN-160-162.extended.derep.fq.gz ~/course/data/day4/mmseqs2/kegg_genes-i0.9-c0.8 PRI-TJPGK-CATN-160-162.search.tsv tmp/ --comp-bias-corr 0 --mask 0 -e 1e-20 --exact-kmer-matching 1 --sub-mat PAM30.out -s 3 -k 6 --spaced-kmer-pattern 11011101 --min-length 15 --format-mode 2 -c 0.8 --cov-mode 2 --min-seq-id 0.9
This will take a while, use the pre-computed results in ~/course/data/day4/results/PRI-TJPGK-CATN-160-162.search.tsv
And we will filter the results using xFilter:
cp ~/course/data/day4/results/PRI-TJPGK-CATN-160-162.search.tsv .
xFilter --input PRI-TJPGK-CATN-160-162.search.tsv \
--prefix PRI-TJPGK-CATN-160-162 \
--n-iters 25 \
--evalue 1e-20 \
--scale 0.9 \
--bitscore 60 \
--filter depth_evenness \
--depth-evenness 1.0 \
--threads 5 \
--anvio \
--mapping-file data/mmseqs2/ko_genes.list
The results from the filtering can be used with anvi'o to estimate the presence and abundance of metabolic modules in KEGG.
In case you want to test it, let's create an environment and install anvi'o:
mamba env create data/envs/anvio.yml
conda activate anvio
pip install https://github.com/merenlab/anvio/releases/download/v8/anvio-8.tar.gz
Let's setup the KEGG data:
anvi-setup-kegg-data
and run the metabolism estimations:
anvi-estimate-metabolism \
--add-coverage \
--enzymes-txt PRI-TJPGK-CATN-160-162_group-abundances-anvio.tsv.gz \
--output-modes modules,module_paths,module_steps,hits \
-O PRI-TJPGK-CATN-160-162 \
--include-kos-not-in-kofam
More information here
We will open the R scripts in the folder ~/course/data/day4/src
in VS Code and load the r
extension.