You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Noticed on a real dataset where my Python code did not get the same median contig length, reduced to a minimal test case:
$ cat /tmp/median.fasta
>one
A
>two
AA
>three
AAA
>four
AAAA
There are an even number of sequences, 1, 2, 3, 4, meaning the median should be the mean of the middle two values, which is 2.5 in this trivial case.
As can be seen below, fasta_summary.pl picks the larger of the middle two values instead (which in the context of sequences will ensure the answer is an integer), so in practise the error will often be larger:
$ perl fasta_summary.pl -i /tmp/median.fasta -o /tmp && more /tmp/stats.txt
Directory '/tmp' exists, so the existing fasta_summary.pl output files will be overwritten
Statistics for read lengths:
Min read length: 1
Max read length: 4
Mean read length: 2.50
Standard deviation of read length: 1.12
Median read length: 3
N50 read length: 3
Statistics for numbers of reads:
Number of reads: 4
Number of reads >=1kb: 0
Number of reads in N50: 2
Statistics for bases in the reads:
Number of bases in all reads: 10
Number of bases in reads >=1kb: 0
GC Content of reads: 0.00 %
Simple Dinucleotide repeats:
Number of reads with over 70% dinucleotode repeats: 0.00 % (0 reads)
AT: 0.00 % (0 reads)
CG: 0.00 % (0 reads)
AC: 0.00 % (0 reads)
TG: 0.00 % (0 reads)
AG: 0.00 % (0 reads)
TC: 0.00 % (0 reads)
Simple mononucleotide repeats:
Number of reads with over 50% mononucleotode repeats: 50.00 % (2 reads)
AA: 50.00 % (2 reads)
TT: 0.00 % (0 reads)
CC: 0.00 % (0 reads)
GG: 0.00 % (0 reads)
CC original author @sujaikumar - given the Perl script has been used widely it may be simpler just to document this behaviour? Is there an official repository for this script?
The text was updated successfully, but these errors were encountered:
Noticed on a real dataset where my Python code did not get the same median contig length, reduced to a minimal test case:
There are an even number of sequences, 1, 2, 3, 4, meaning the median should be the mean of the middle two values, which is 2.5 in this trivial case.
As can be seen below,
fasta_summary.pl
picks the larger of the middle two values instead (which in the context of sequences will ensure the answer is an integer), so in practise the error will often be larger:CC original author @sujaikumar - given the Perl script has been used widely it may be simpler just to document this behaviour? Is there an official repository for this script?
The text was updated successfully, but these errors were encountered: