GitHub - Chriswinefield/RepARK: Repetitive motif detection by Assembly of Repetitive K-mers

Chriswinefield / RepARK Public

forked from PhKoch/RepARK

Notifications You must be signed in to change notification settings
Fork 0
Star 0

Repetitive motif detection by Assembly of Repetitive K-mers

www.ncbi.nlm.nih.gov/pubmed/24634442

0 stars 5 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
README		README
RepARK.pl		RepARK.pl
pe400.fq.gz		pe400.fq.gz

Repository files navigation

## Description ##
RepARK.pl is a wrapper script for constructing a repeat library from sequencing reads using Jellyfish and either Velvet or CLC.


## System Requirements ##
- Jellyfish (tested with versions 1.1.6, 2.1.1, 2.1.4)
	http://www.cbcb.umd.edu/software/jellyfish
- Velvet (tested with version 1.2.08)
	http://www.ebi.ac.uk/~zerbino/velvet
- CLC Assembler (optional, tested with version 4.4.0)
	http://www.clcbio.com/

These programs should be in the PATH, or the appropriate variables (e.g. $jellyfish_path) in RepARK.pl should be modified accordingly.


## Input Data ##
Input data can be one or more fasta or fastq files. They can be given as arguments with the option -l (multiple times) or included in the RepARK.pl
The reads can be in gzipped format ending with ".gz". Note that it is currently not supported to mix compressd and uncompressed libraries.

## Example Demo Data ##
In order to test the pipeline run ./RepARK.pl without arguments.
The command line parameters are optional. If either are omitted, RepARK.pl will use the defaults which are adapted to the demo data.
Check if the assembly results in 133 sequences of a size between 27,902-27,917bp. Due to velvet creates slightly different assemblies with the same data, we provided this size range and 10 libraries based on the demo data with defauld parameters that can be downloaded from [1]. If you discover greater disagreements please contact the authors (see Support).


## Output ##
The repeat consensuses can be found in the file repeat_lib.fasta within the working directory.


## Usage ##

RepARK.pl [ options ]

Options: [default_values]
  -o  output dir [RepARK_working]
  -l  library file - can be used multiple times [pe400.fq.gz]
  -a  assembler  (options: clc,velvet) [velvet]
  -s  jellyfish hash size [100000000]
  -k  jellyfish kmer size [31]
  -p  number of threads used by jellyfish [1]
  -t  set manually a threshold, if automatic prediction is not working (this step will then be skipped)
  -d  debug mode gives some more information
  -n --nojf  skip Jellyfish computation (jf_RepARK.kmers|histo must exist in the working dir)
  -h --help  this help


## Notes ##
For 40x human data, we set -s to 54100000000 which reserves 370Gb of RAM for jellyfish (v 1.1.6) but there is no merging of the intermediate files needed.

In order to discriminate between the error peak and the unique-kmer peak, sufficient read coverage depth is required (>10x).

## Support ##
In case of problems, please run the script in debugging mode (-d) and send the output to [email protected].



[1] ftp://genome.fli-leibniz.de/pub/repeat-assemblies/RepARKdemoruns.tar.gz


Change Log
---------
1.3.0	- clc_assembler support
	- gzipped input file support
	
1.2.2	- Checking for k <= velvet's hash size

1.2.1	- Included support for jellyfish v2.x

1.2	- Initial release