A pipeline for high-quality plasmid assemblies.
Gene annotation is optional. The pipeline collects quality information on both input and output datasets.
flowchart TD
short_reads --> fastp(fastp)
fastp -- trimmed_short_reads --> plassembler(plassembler)
long_reads --> filtlong(filtlong)
filtlong -- filtered_long_reads --> plassembler
plassembler --> assembly[assembly]
assembly --> prokka(prokka*)
assembly --> bakta(bakta*)
assembly --> quast(quast)
assembly --> bandage(bandage)
prokka --> prokka_annotated_assembly
bakta --> bakta_annotated_assembly
quast --> assembly_qc
bandage --> assembly_diagram
*Optional processes
- Read trimming & QC: fastp and filtlong
- Genome Assembly: plassembler (long reads, or hybrid)
- Gene Annotation: prokka or bakta
- Assembly QC: quast, bandage
By default, plassembler will be used, and no gene annotation will be run:
nextflow run BCCDC-PHL/plasmid-assembly \
--fastq_input <short-read fastq input directory> \
--fastq_input_long <long-read fastq input directory> \
--outdir <output directory>
Prokka and/or bakta can be enabled with the --prokka and --bakta flags:
nextflow run BCCDC-PHL/plasmid-assembly \
--fastq_input <short-read fastq input directory> \
--fastq_input_long <long-read fastq input directory> \
--prokka \
--bakta \
--outdir <output directory>
The pipeline also supports a 'samplesheet input' mode. Pass a samplesheet.csv file with the headers ID, R1, R2, LONG:
nextflow run BCCDC-PHL/plasmid-assembly \
--samplesheet_input <samplesheet.csv> \
--outdir <output directory>
For example, a samplesheet might look like:
ID,R1,R2,LONG
sample-01,/path/to/sample-01_R1.fastq.gz,/path/to/sample-01_R2.fastq.gz,/path/to/sample-01_RL.fastq.gz
sample-02,/path/to/sample-02_R1.fastq.gz,/path/to/sample-02_R2.fastq.gz,/path/to/sample-02_RL.fastq.gz
sample-03,/path/to/sample-03_R1.fastq.gz,/path/to/sample-03_R2.fastq.gz,/path/to/sample-03_RL.fastq.gz
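A quick pre-flight check can catch a malformed samplesheet before the pipeline is launched. The sketch below is not part of the pipeline: `check_samplesheet` is a hypothetical helper that assumes the header must match `ID,R1,R2,LONG` exactly and that every data row has four comma-separated fields.

```shell
# check_samplesheet is a hypothetical helper, not something the pipeline provides.
check_samplesheet() {
    sheet="$1"
    # Read the header line, dropping a carriage return if the file uses CRLF endings.
    header=$(head -n 1 "$sheet" | tr -d '\r')
    if [ "$header" != "ID,R1,R2,LONG" ]; then
        echo "unexpected header: $header" >&2
        return 1
    fi
    # Every non-empty data row should have exactly four comma-separated fields.
    awk -F',' 'NR > 1 && NF > 0 && NF != 4 { bad = 1 } END { exit bad }' "$sheet"
}
```

Running `check_samplesheet samplesheet.csv && echo ok` before `nextflow run` is one way to use it. It does not check that the listed fastq files exist.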
(Note: this section is currently incomplete. It will be updated as output files are finalized.)
An output directory will be created for each sample under the directory provided with the --outdir flag. The directory will be named by sample ID, inferred either from the fastq filenames (all characters before the first underscore) or from the ID field of the samplesheet, if one is used.
If we have sample-01_R{1,2}.fastq.gz in our --fastq_input directory, the output directory will be:
sample-01
├── sample-01_20211125165316_provenance.yml
├── sample-01_fastp.csv
├── sample-01_fastp.json
├── sample-01_plassembler_hybrid.fa
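The sample ID rule (all characters before the first underscore in the fastq filename) can be reproduced in shell to predict the output directory name. This is only an illustration of the rule as described here; `sample_id_from_fastq` is a hypothetical helper, and the pipeline's own implementation may differ.

```shell
# sample_id_from_fastq is a hypothetical helper illustrating the naming rule.
sample_id_from_fastq() {
    # Drop any leading directory, then keep everything before the first underscore.
    name=$(basename "$1")
    printf '%s\n' "${name%%_*}"
}

sample_id_from_fastq /path/to/sample-01_R1.fastq.gz   # prints sample-01
```

One consequence of the rule is that a sample ID containing an underscore would be truncated at that underscore, so the samplesheet mode may be a safer choice for such names.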
For each pipeline invocation, each sample will produce a provenance.yml file with the following contents:
- pipeline_name: BCCDC-PHL/plasmid-assembly
  pipeline_version: 0.1.0
- timestamp_analysis_start: 2022-08-16T13:22:11.553143
- input_filename: sample-01_R1.fastq.gz
  sha256: 4ac3055ac5f03114a005aff033e7018ea98486cbebdae669880e3f0511ed21bb
  file_type: fastq-input
- input_filename: sample-01_R2.fastq.gz
  sha256: 8db388f56a51920752319c67b5308c7e99f2a566ca83311037a425f8d6bb1ecc
  file_type: fastq-input
- process_name: fastp
  tools:
    - tool_name: fastp
      tool_version: 0.23.1
- process_name: plassembler
  tools:
    - tool_name: plassembler
      tool_version: 1.4.1
- process_name: prokka
  tools:
    - tool_name: prokka
      tool_version: 1.14.5
      parameters:
        - parameter: --compliant
          value: null
- process_name: quast
  tools:
    - tool_name: quast
      tool_version: 5.0.2
      parameters:
        - parameter: --space-efficient
          value: null
        - parameter: --fast
          value: null
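The sha256 values recorded for each input file make it possible to confirm later that a fastq file on disk is the one that was analyzed. A minimal sketch, assuming `sha256sum` (GNU coreutils) or `shasum` is available; `file_sha256` and `verify_sha256` are hypothetical helpers, and the expected digest is copied by hand from the provenance file.

```shell
# file_sha256 and verify_sha256 are hypothetical helpers, not part of the pipeline.
file_sha256() {
    # Prefer sha256sum (GNU coreutils); fall back to shasum where it is missing.
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d ' ' -f 1
    else
        shasum -a 256 "$1" | cut -d ' ' -f 1
    fi
}

verify_sha256() {
    # $1: path to the file, $2: expected digest taken from provenance.yml
    [ "$(file_sha256 "$1")" = "$2" ]
}
```

For example, `verify_sha256 sample-01_R1.fastq.gz 4ac3055ac5f03114a005aff033e7018ea98486cbebdae669880e3f0511ed21bb && echo match` would print `match` only if the file is byte-for-byte identical to the recorded input.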
The filename of the provenance file includes a timestamp in the format YYYYMMDDHHMMSS to ensure that re-analysis of the same sample creates a unique provenance.yml file.
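The naming pattern can be sketched in shell, which may help when writing scripts that locate provenance files. `provenance_filename` is a hypothetical helper; it assumes the pattern `<sample ID>_<timestamp>_provenance.yml` seen in the example output directory, and the pipeline's own naming code may differ.

```shell
# provenance_filename is a hypothetical helper illustrating the naming pattern.
provenance_filename() {
    # $1: sample ID, $2: optional timestamp (defaults to the current local time)
    ts="${2:-$(date +%Y%m%d%H%M%S)}"
    printf '%s_%s_provenance.yml\n' "$1" "$ts"
}

provenance_filename sample-01 20211125165316
# prints sample-01_20211125165316_provenance.yml
```

Because the timestamp has one-second resolution, two runs of the same sample would only collide if they started within the same second.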