Michael G. Campana and Ellie E. Armstrong, 2019-2024
Smithsonian Institution
Stanford University
Here we provide a brief tutorial for running RatesTools. This tutorial assumes that you have already installed the RatesTools pipeline and its dependencies (See here for details). The dataset (available here) can be used to test your RatesTools installation and configuration. Please note that results will vary slightly from the provided final example output files due to the random number generators used by the software dependencies. The wolf sequencing reads are subset from SRA accessions SRR1518530-SRR1518532 from BioProject PRJNA255370 [1] and SRR9095635-SRR9095636,SRR9095638 from BioProject PRJNA543877 [2]. The reference genome chromosomes derive from the domestic dog genome assembly canFam3.1 [3].
- Create a base directory for the run, e.g.
mkdir test
. - Decompress the reference genome (
dog_test.fna.gz
) (gunzip dog_test.fna.gz
). Place the reference genome and the list of required chromosomes (dog_test_chr.csv
) within the base directory. - Decompress the
RawData.tar.gz
archive (tar xvfz RawData.tar.gz
). Place the directoryRawData
and the FASTQ sequence files within it into the base directory.
-
We provide two example configurations for running the test dataset: one using the Genome Analysis Toolkit (GATK) [4] v. 3.8-1 (
wolf_gatk3.config
) and the other for GATK v. >= 4.2.3.0 (wolf_gatk4.config
). A custom configuration can be made by executing theconfigure.sh
script and completing the prompts (See here). Please note that, in the following steps, we describe how to configure the run manually using a text editor. Theconfigure.sh
script automates the setting of these values except for the final configuration of the executor profile, as this is highly variable across systems. -
Set the run parameters. These The parameters are located in the
params
directive. Take careful notice of the value enclosure of the strings (either' '
or" "
). Values enclosed in' '
do not perform variable substitution, while those using" "
do. Note the special variable $launchDir indicates the launch directory of the run.
2a.refseq
specifies the path to the reference sequence (in FASTA format).
2b.libraries
specifies a CSV giving sample information, library names, and their corresponding read files. See (README.md#specifying-library-information) for details.
2c.readDir
specifies the path to the sequence FASTQ files. Each library is expected to be represented by a file containing forward reads and another containing the corresponding reverse reads.
2d.bwa_alg
specifies the BWA [5] index algorithm. Setting this value to an empty string ("") instructs BWA to infer the indexing algorithm.
2e.markDuplicates
specifies whether duplicate marking should use Picard [7] (value "picard"), SAMtools [8] or Sambamba [9] (value "sambamba").
2f.picard_conda
determines whether to install Picard using Conda. Set the value to true to install using Conda.
2g.picard
specifies the path to the Picard jar file. This can be omitted ifpicard_conda
is set to true.
2h.picard_java
specifies additional Java parameters for Picard operations. Enter the command-line Java options as a string.
2i.gatk_conda
determines whether to install GATK using Conda. Set the value to true to install using Conda.
2j.gatk
specifies the path to the GATK jar file. This can be omitted ifgatk_conda
is set to true.
2k.gatk_build
specifies whether the GATK jar is major version 3 or 4.
2l.gatk_java
specifies additional Java parameters for GATK operations. Enter the command-line Java options as a string.
2m.filter_bams
determines whether to filter BAM alignments (e.g. for secondary alignments and duplicates) before genotyping. Set the value to true to filter.
2n.region_filter
determines whether to remove low-quality regions from the called sites. Set to true to remove regions (recommended). 2o.phase
determines whether to phase genotypes using WhatsHap trio-phasing.
2p.whatshap_opts
specifies additional options (as a string) to pass to WhatsHap phase.
2q.gm_tmpdir
specifies the path to a temporary directory for GenMap [10].
2r.gm_opts
specifies mapping parameters (as a string) for GenMap. Use cpus in the process configuration to set number of concurrent threads.
2s.rm_species
specifies the species library for RepeatMasker [11].
2t.rm_mask_opts
specifies parameters (as a string) for RepeatMasker. Use cpus in the process configuration to set number of concurrent threads.
2u.rm_model_opts
specifies parameters (as a string) for RepeatModeler [12]. Use cpus in the process configuration to set number of concurrent threads and omit the-lib
parameter.
2v.indelpad
specifies the number of bases up- and downstream of an indel to remove.
2w.prefix
specifies the file prefix for output files.
2x.outdir
specifies the name of the output directory.
2y.dam
specifies the name of the dam. This must match the name derivied from file globbing of the reads.
2z.sire
specifies the name of the sire. This must match the name derivied from file globbing of the reads.
2aa.vcftools_site_filters
specifies the site-specific filters using VCFtools [13] as a string. Setting this value to "NULL" bypasses this filter. See the VCFtools documentation for details. We recommend restricting to biallelic sites (either using VCFtools or GATK).
2ab.gatk_site_filters
specifies the site-specific filters using GATK as a string. Setting this value 'NULL' bypasses this filter. See the GATK documentation for details. Be sure to use GATK 3 syntax for GATK 3 runs and GATK 4 syntax for GATK 4 runs. We recommend restricting to biallelic sites (either using VCFtools or GATK).
2ac.chr_file
specifies the path to the list of chromosomes to retain. Set the value to "NULL" to ignore this filter.
2ad.min_contig_length
specifies the minimum length (in bp) of contigs to retain before applying site and region filters. 2ae.min_filt_contig_length
specifies the minimum length (in bp) of contigs to retain after applying site and region filters. 2af.dnm_opts
specifies the options for calc_denovo_mutation_rate.rb as a string. See the documentation for details.
2ag.dnm_clump
species the number of bases in a window to consider a DNM candidate clump. 0 does not removed clumps.
2ah.email
specifies an email address to send alerts regarding pipeline completion, termination and errors. Set to "NULL" to turn off email alerts. -
Update the module list. In the
modules
directive, there is a list of software. Enter the name of any modulefiles needed for each program. If no modules are needed, leave the value as an empty string. -
Configure the executor profiles for your system. If you are running locally, the standard local profile provided should be sufficient (but may need some adaptation depending on your hardware). The number of threads for parallelizable software (BWA, SAMtools, GATK, GenMap, RepeatMasker, RepeatModeler) can be controlled using the $task.cpus variable for each process. Executor profiles are passed to the pipeline using the config file specified at runtime using the
-c
option. If you are running on a cluster or a cloud service, consult the Nextflow documentation and your system administrators to optimize the parameter profile. -
Place the final configuration file into the run base directory.
- Execute the analysis using the command:
nextflow run campanam/RatesTools -r <version> -c <config_file.config>
. You can use the-bg
option to send the Nextflow process to the background (useful for cluster systems) and the-profile <profile_name>
option to specify your custom parameter profile. - The pipeline will generate the following files in the output directory:
2a.01_FinalBams
: The final BAM alignments used in the analysis.
2b.02_gVCFs
: Individual gVCFs for each sample.
2c.03_CombinedVCF
: Combined joint-genotyped all-sites VCF for all samples in the analysis.
2d.04_RepeatMasking
: Output of RepeatMasker and RepeatModeler.
2e.05_ExcludedRegions
: A BED file of all genomic regions containining indels, repetitive sequence and low-mappability sequence. These regions are excluded from the analysis.
2f.06_FilterChrVCFs
: All-sites VCF including only the chromosomes listed in the chromosome file specified inparams.chr_list
.
2g.07_TrioPhasedVCFs
: All-sites VCF after trio-phasing (optional output).
2h.08_SplitTrioVCFs
: VCFs of all parent-offspring trios after splitting the all-sample joint-genotyping VCF.
2i.09_gVCFs_GP_DQ
: Summary statistics and graphs of DP and GQ for all individuals.
2j.10_SplitVCFs
: Trio VCFs split by chromosome for parallelization.
2k.11_VCFtoolsSiteFilteredVCFs
: Trio chromosome VCFs filtered at the site level by VCFtools.
2l.12_GATKSiteFilteredVCFs
: Trio chromosome VCFs filtered at the site level by GATK.
2m.13_RegionFilteredVCFs
: Trio chromosome VCFs after removal of unreliable regions.
2n.14_SplitCalcDNMLogs
: De novo mutations (DNMs) and DNM rates calculated per chromosome for each trio.
2o.15_SummarizeDNMLogs
: Summarized DNMs and DNM rates for all chromosomes for each trio.
2p.16_SummaryStats
: Summary statistics of the total number of sites retained after each filtration step and the number of each mutation class. It also identifies and removes clumped candidate mutations and overlapping candidate mutations between siblings. The script then recalculates mutation rate estimates and confidence intervals assuming these sites are erroneous. - We provide example output of the
SummarizeDNMLogs
andSummaryStats
files. A successful run should produce complete versions of these files with similar results.
- Fan, Z., Silva, P., Gronau, I., Wang, S., Serres Armero, A., Schweizer, R.M., Ramirez, O., Pollinger, J., Galaverni, M., Ortega Del-Vecchyo, D., Du, L., Zhang, W., Zhang, Z., Xing, J., Vilà, C., Marques-Bonet, T., Godinho, R., Yue, B., Wayne, R.K. (2016) Worldwide patterns of genomic variation and admixture in gray wolves. Genome Res, 26, 163-173. DOI:10.1101/gr.197517.115.
- Koch, E.M., Schweizer, R.M., Schweizer, T.M., Stahler, D.R., Smith, D.W., Wayne, R.K., Novembre, J. (2019). De novo mutation rate estimation in wolves of known pedigree. Mol Biol Evol, 36, 2536-2547, doi: 10.1093/molbev/msz159.
- Hoeppner, M.P., Lunquist, A., Pirun, M., Meadows, J.R.S., Zamani, N., Johnson, J., Sundström, G., Cook, A., FitzGerald, M.G., Swofford, R., Mauceli, E., Moghadam, B.T., Greka, A., Alföldi, J., Abouelleil, A., Aftuck, L., Bessette, D., Berlin, A., Brown, A., Gearin, G., Lui, A., Macdonald, J.P., Priest, M., Shea, T., Turner-Maier, J., Zimmer, A., Lander, E.S., di Palma, F., Lindblad-Toh, K., Grabherr, M.G. (2014) An improved canine genome and a comprehensive catalogue of coding genes and non-coding transcripts. PLOS One, 9, e91172. DOI: 10.1371/journal.pone.0091172.
- McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., DePristo, M.A. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 20, 1297-1303. DOI: 10.1101/gr.107524.110.
- Li, H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv, 1303.3997v2.
- Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-2079. DOI: 10.1093/bioinformatics/btp352.
- Broad Institute (2020). Picard v. 2.23.8 (https://broadinstitute.github.io/picard/).
- Danecek, P., Bonfield, J.K., Liddle, J., Marshall, J., Ohan, V., Pollard, M.O., Whitwham, A., Keane, T., McCarthy, S.A., Davies, R.M., Li, H. (2021) Twelve years of SAMtools and BCFtools. GigaScience, 10, giab008. DOI: 10.1093/gigascience/giab008.
- Tarasov, A., Vilella, A.J., Cuppen, E., Nijman, I.J., Prins, P. (2015) Sambamba: fast processing of NGS alignment formats. Bioinformatics, 31, 2032–2034. DOI: 10.1093/bioinformatics/btv098.
- Pockrandt, C., Alzamel, M., Iliopoulos, C.S., Reinert, K. (2020) GenMap: ultra-fast computation of genome mappability. Bioinformatics, 36, 3687–3692, doi: 10.1093/bioinformatics/btaa222.
- Smit, A.F.A., Hubley, R., Green, P. (2013-2015) RepeatMasker Open-4.0. (http://www.repeatmasker.org).
- Flynn, J.M., Hubley, R., Goubert, C., Rosen, J. Clark,. A.G., Feschotte, C., Smit, A.F. (2020) RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A, 117, 9451-9457. DOI: 10.1073/pnas.1921046117.
- Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158. DOI: 10.1093/bioinformatics/btr330.