Use samtools faidx
First index the fasta file
samtools faidx mysequences.fasta
Then retrieve sequences by ID
samtools faidx mysequences.fasta id1 id2 id3
If you have a file with lots of ID's or you have some other command (eg awk) that produces IDs with one on each line you can use xargs
to pass these IDs to samtools faidx
. For example if all the IDs are in a file called ids.txt
;
cat ids.txt | xargs samtools faidx mysequences.fasta
A great tool for this is bioawk .
For example to add a fasta compatible prefix like this
>comp12345_c0_seq1
to
>lcl|comp12345_c0_seq1
can be done with the following bioawk command
bioawk -c fastx '{printf ">lcl|%s\n%s\n", $name, $seq}' original.fasta > reformatted.fasta
cat mysequences.fasta | grep -c ">"
for file in *.gz; do tar -xvfz $file;done
Assuming we have an indexed blast database built as follows
makeblastdb -in uniprot_sprot.fasta -input_type 'fasta' -dbtype 'prot' -parse_seqids
Find homologs with blastp
blastp -query contigs.fasta -db uniprot_sprot.fasta -outfmt '6 qseqid sacc evalue stitle' -max_target_seqs 1 > contigs.blastp
Join blastp results to the fasta file. This just prints in tabular format
join <(bioawk -c fastx '{print $name,$seq}' contigs.fasta) <(awk -F '\t' '{print $1,$2,$3,$4}' contigs.blastp)
And finally reformat output to fasta with annotations
join <(bioawk -c fastx '{print $name,$seq}' contigs.fasta) <(awk -F '\t' '{print $1,$2,$3,$4}' contigs.blastp) | awk -F '\t' {printf(">%s %s\n%s\n",$1,$4,$2)}'