Irwin Jungreis edited this page Jun 18, 2021 · 8 revisions

The "Omega Test" is a simplified operational mode of PhyloCSF that requires fewer pre-computed parameters about the phylogeny under analysis. In particular, while PhyloCSF requires both a phylogenetic tree and "empirical codon models" with thousands of parameters estimated from known coding and non-coding regions aligned in the species of interest, the "Omega Test" requires only a phylogenetic tree. On the other hand, the test is less accurate and slower than the standard PhyloCSF mode.

We used the Omega Test to distinguish coding and non-coding genes in zebrafish RNA-seq data (Pauli et al. 2011). We've also written a brief (dense) description of the mathematical details of the method.

Activating the Omega Test

To use the Omega Test with one of the existing phylogenies supported by PhyloCSF, simply add --strategy omega to the command line. For example, scoring the tal-AA alignment of the 12 fly genomes (bundled with PhyloCSF) in the standard mode and then with the Omega Test:

$ ./PhyloCSF 12flies PhyloCSF_Examples/tal-AA.fa
PhyloCSF_Examples/tal-AA.fa	score(decibans)	297.6235
$ ./PhyloCSF 12flies PhyloCSF_Examples/tal-AA.fa --strategy omega
PhyloCSF_Examples/tal-AA.fa	score(decibans)	132.5892

The score is still reported as a log-likelihood ratio in units of decibans. In this example, the (known) coding region is given a highly positive score in both modes, but the score is lower in magnitude with the Omega Test. This is typical: the Omega Test observes fewer informative features of the alignment than the full PhyloCSF method, and as a result usually reports less overall evidence one way or the other.
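
To make the units concrete, here is a minimal Python sketch for reading scores like those above. It is not part of PhyloCSF: the function names are hypothetical, and the parser assumes only the three-field output line shown in the example (path, the literal label score(decibans), and a number). A deciban is ten times the base-10 logarithm of a likelihood ratio, so dividing the score by 10 gives the log10 ratio.

```python
def parse_score_line(line):
    """Split one PhyloCSF output line into (alignment_path, score_in_decibans).

    ASSUMPTION: the line has the shape shown in the example above,
    i.e. <path> <whitespace> score(decibans) <whitespace> <number>.
    Other kinds of output lines are rejected with ValueError.
    """
    fields = line.strip().rsplit(None, 2)
    if len(fields) != 3 or fields[1] != "score(decibans)":
        raise ValueError("not a score line: %r" % line)
    return fields[0], float(fields[2])


def decibans_to_log10_ratio(score_decibans):
    """Convert decibans to log10 of the likelihood ratio (1 ban = 10 decibans)."""
    return score_decibans / 10.0


if __name__ == "__main__":
    full = parse_score_line("PhyloCSF_Examples/tal-AA.fa\tscore(decibans)\t297.6235")
    omega = parse_score_line("PhyloCSF_Examples/tal-AA.fa\tscore(decibans)\t132.5892")
    for mode, (path, score) in (("standard", full), ("omega", omega)):
        print(mode, path, "log10 likelihood ratio:", decibans_to_log10_ratio(score))
```

Both example scores correspond to very large likelihood ratios in favor of coding, which is why the lower Omega Test score still counts as strong evidence.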

Providing a new phylogeny

To use the Omega Test with a phylogeny not supported by PhyloCSF, you have to provide a phylogenetic tree (with branch lengths) relating the species under analysis, and place it in the PhyloCSF_Parameters directory in Newick format. As an example, this is the content of 12flies.nh from the PhyloCSF distribution:

((((((dmel:0.061361,(dsim:0.054894,dsec:0.031243):0.031837):0.063495,(dyak:0.111338,dere:0.100461):0.039892):0.357431,dana:0.581114):0.243592,(dpse:0.033045,dper:0.036095):0.495254):0.224541,dwil:0.801425):0.249420,((dvir:0.301255,dmoj:0.453117):0.141069,dgri:0.434875):0.249455);

When you tell PhyloCSF to use the 12flies phylogeny on the command line, it looks for 12flies.nh in the PhyloCSF_Parameters directory. (If not using --strategy omega, it also looks for files containing the ECM parameters.) The name of your Newick tree file thus determines how you tell PhyloCSF to use that phylogeny. The names of the species in the Newick tree must be consistent with the alignments you intend to provide. All other details of formatting the alignments are identical to the normal PhyloCSF mode (see the main page).
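
Because the species names in the tree must match those in the alignments, it can help to compare the two before running PhyloCSF. The following is a rough Python sketch, not a PhyloCSF feature. All function names are hypothetical. It assumes the alignment is FASTA text whose header lines begin with the species name (taken here as everything up to the first "|" or whitespace; adjust this if your headers differ), and that the Newick tree uses plain unquoted labels as in 12flies.nh.

```python
import re

# The 12flies.nh tree quoted above, used here as sample input.
TWELVE_FLIES = (
    "((((((dmel:0.061361,(dsim:0.054894,dsec:0.031243):0.031837):0.063495,"
    "(dyak:0.111338,dere:0.100461):0.039892):0.357431,dana:0.581114):0.243592,"
    "(dpse:0.033045,dper:0.036095):0.495254):0.224541,dwil:0.801425):0.249420,"
    "((dvir:0.301255,dmoj:0.453117):0.141069,dgri:0.434875):0.249455);"
)


def newick_leaf_names(newick):
    """Return the set of leaf labels in a Newick string.

    A leaf label directly follows "(" or ","; internal-node labels follow ")"
    and branch lengths follow ":", so neither is captured.
    Simplification: quoted labels are not handled.
    """
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


def alignment_species(fasta_text):
    """Return species names from FASTA headers, in order of appearance.

    ASSUMPTION: the species name is the header text up to the first "|"
    or whitespace character.
    """
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            names.append(re.split(r"[|\s]", line[1:].strip(), maxsplit=1)[0])
    return names


def species_missing_from_tree(newick, fasta_text):
    """List alignment species that have no matching leaf in the tree."""
    leaves = newick_leaf_names(newick)
    return [name for name in alignment_species(fasta_text) if name not in leaves]
```

An empty list from species_missing_from_tree means every sequence name in the alignment has a matching leaf in the tree. Any name it returns points to a spelling or naming mismatch to fix in either the tree file or the alignment.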

The topology and/or branch lengths in the phylogenetic tree can be estimated using standard packages (e.g. PAML, PHAST, HyPhy, or MrBayes), using a modest sample of aligned codon sites. The branch lengths should technically be in units of "codon substitutions per codon site", but in practice it is mainly necessary that they be accurate in proportion (relative to each other).
