
tauqeer9/bioinformatics-one-liners-1

 
 

One line scripts for bioinformatics

Extracting fasta records

Use samtools faidx

First index the fasta file

samtools faidx mysequences.fasta

Then retrieve sequences by ID

samtools faidx mysequences.fasta id1 id2 id3

If you have a file with lots of IDs, or some other command (e.g. awk) that produces IDs one per line, you can use xargs to pass these IDs to samtools faidx. For example, if all the IDs are in a file called ids.txt:

cat ids.txt | xargs samtools faidx mysequences.fasta
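Where samtools is not installed, the same selection can be approximated in plain awk. The following is only a sketch on made-up data, not a replacement for faidx: it assumes each ID is the first word of a header line, and the temporary directory and sample records are hypothetical.

```shell
# Hypothetical demo data: a three-record FASTA and a list of two wanted IDs.
workdir=$(mktemp -d)
printf '>id1 first\nACGT\nAC\n>id2 second\nGGCC\n>id3 third\nTTAA\n' > "$workdir/mysequences.fasta"
printf 'id1\nid3\n' > "$workdir/ids.txt"

# First pass (NR == FNR) loads the IDs; second pass prints every record
# whose header's first word (minus the leading ">") is in that list.
extracted=$(awk 'NR == FNR { wanted[$1]; next }
     /^>/ { keep = (substr($1, 2) in wanted) }
     keep' "$workdir/ids.txt" "$workdir/mysequences.fasta")
printf '%s\n' "$extracted"
```

Records come out in the order they appear in the fasta file, and multi-line sequences are kept as they are.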

Reformatting fasta records

A great tool for this is bioawk.

For example, adding a fasta compatible prefix so that a header like this

>comp12345_c0_seq1
becomes
>lcl|comp12345_c0_seq1

can be done with the following bioawk command

bioawk -c fastx '{printf ">lcl|%s\n%s\n", $name, $seq}' original.fasta > reformatted.fasta
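If bioawk is not to hand, a plain awk sketch can do the same prefixing. This is an alternative rather than the command above: it edits header lines in place, so sequence line wrapping and any text after the ID are left untouched. The sample records and temporary directory are made up for the demo.

```shell
# Hypothetical demo input with two records.
workdir=$(mktemp -d)
printf '>comp12345_c0_seq1\nACGTACGT\n>comp12345_c0_seq2\nGGCCTTAA\n' > "$workdir/original.fasta"

# On header lines, swap the leading ">" for ">lcl|"; print every line.
awk '/^>/ { sub(/^>/, ">lcl|") } { print }' "$workdir/original.fasta" > "$workdir/reformatted.fasta"
cat "$workdir/reformatted.fasta"
```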

Counting the number of sequences in a fasta file

grep -c "^>" mysequences.fasta
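With several fasta files the same idea gives either per-file counts or a grand total. A small sketch on made-up files (the file names and temporary directory are hypothetical):

```shell
# Hypothetical demo data: two small FASTA files.
workdir=$(mktemp -d)
printf '>a\nACGT\n>b\nGG\n' > "$workdir/one.fasta"
printf '>c\nTTAA\n' > "$workdir/two.fasta"

# With several files, grep -c prints one "file:count" line per file.
grep -c '^>' "$workdir"/*.fasta

# Concatenating first gives a single grand total.
total=$(cat "$workdir"/*.fasta | grep -c '^>')
echo "total: $total"
```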

Untarring multiple files in a folder

for file in *.gz; do tar -xvzf "$file"; done
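The loop can be tried safely on throwaway archives first. The sketch below builds two hypothetical tarballs in a temporary directory and unpacks them. Note that the option letter f must come last in the bundle, because the word after it is taken as the archive name. The demo assumes the archives end in .tar.gz.

```shell
workdir=$(mktemp -d)
(
    cd "$workdir" || exit 1

    # Hypothetical demo archives, each holding a single small file.
    for name in sampleA sampleB; do
        echo "reads for $name" > "$name.txt"
        tar -czf "$name.tar.gz" "$name.txt"
        rm "$name.txt"
    done

    # Match only *.tar.gz so plain gzip files are skipped, and quote
    # "$file" so names containing spaces survive.
    for file in *.tar.gz; do
        tar -xzf "$file"
    done
)
ls "$workdir"
```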

Annotating entries in a fasta file (not quite one line)

Assuming we have an indexed blast database built as follows

makeblastdb -in uniprot_sprot.fasta -input_type 'fasta' -dbtype 'prot' -parse_seqids

Find homologs with blastp

blastp -query contigs.fasta -db uniprot_sprot.fasta -outfmt '6 qseqid sacc evalue stitle' -max_target_seqs 1 > contigs.blastp

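Join the blastp results to the fasta file. Because hit titles contain spaces, use a tab as the join delimiter, and sort both inputs on the first column as join requires. This just prints the result in tabular format

join -t $'\t' <(bioawk -c fastx '{print $name"\t"$seq}' contigs.fasta | sort -t $'\t' -k1,1) <(sort -t $'\t' -k1,1 contigs.blastp)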

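And finally reformat the output to fasta with annotations. The joined columns are name, sequence, accession, e-value and title, so $5 puts the hit title in the header

join -t $'\t' <(bioawk -c fastx '{print $name"\t"$seq}' contigs.fasta | sort -t $'\t' -k1,1) <(sort -t $'\t' -k1,1 contigs.blastp) | awk -F '\t' '{printf(">%s %s\n%s\n",$1,$5,$2)}'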
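To see how the join and reformat steps fit together without running BLAST, here is a sketch on toy data using only sort, join and awk. The table of names and sequences stands in for the bioawk output, and the accession and title are invented for the demo. Temporary files are used in place of process substitution so it also runs in a plain POSIX shell.

```shell
workdir=$(mktemp -d)
tab=$(printf '\t')

# Hypothetical stand-ins: name<TAB>sequence rows, and one BLAST outfmt 6
# row with qseqid, sacc, evalue, stitle (all values invented).
printf 'contig2\tGGCC\ncontig1\tACGT\n' > "$workdir/seqs.tsv"
printf 'contig1\tX00001\t1e-50\tExample protein\n' > "$workdir/contigs.blastp"

# join needs both inputs sorted on the key column.
sort -t "$tab" -k1,1 "$workdir/seqs.tsv" > "$workdir/seqs.sorted"
sort -t "$tab" -k1,1 "$workdir/contigs.blastp" > "$workdir/blastp.sorted"

# Joined columns: 1 name, 2 sequence, 3 accession, 4 e-value, 5 title.
annotated=$(join -t "$tab" "$workdir/seqs.sorted" "$workdir/blastp.sorted" |
    awk -F '\t' '{ printf(">%s %s\n%s\n", $1, $5, $2) }')
printf '%s\n' "$annotated"
```

Contigs without a hit (contig2 here) drop out, because join only prints keys present in both inputs by default.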
