Merge branch 'newpsf' into lazy_par

schneebergerlab · Sep 11, 2024 · e05698e
2 parents 371a979 + 9bd3b6a
commit e05698e
Showing 3 changed files with 47 additions and 24 deletions.
diff --git a/docs/format.md b/docs/format.md
@@ -1,33 +1,39 @@
-# The Population syntenty File Format (PFF)
+# The Population Synteny File Format (PSF)
+Updated to v0.3 file format 2024-09-11 Leon Rauschning
+Updated to v0.2 file format 2024-05-22 Leon Rauschning
+Created 2022-05-11 Leon Rauschning
 
 ## Premise
-Finding synteny in multiple genomes at once presents novel challenges, as not all syntenic sequences may be present on a linear reference.
-To usefully represent this information, a new file format is needed; at the same time, as much compatibility as possible should be maintained with tools built for pairwise genome comparisons.
-For this reason, PFF retains some of the basic structure of the BED format.
-The information presented in PFF is richer than what msyd exports to VCF, in particular it is able to contain synteny not present in the linear reference.
-For compatibility with BED format, a position on a linear reference is reported for all annotations, even those not present on the reference; in these cases, they are placed at the first position they can be at given the constraint of the multisyntenic regions surrounding them.
+
+Finding synteny in multiple genomes at once presents a new way of working with the genomes of a population. To represent this novel kind of information, a new file format is needed; at the same time, as much compatibility as possible should be maintained with tools built for pairwise genome comparisons.
+For this reason, in designing the Population Synteny File Format, we have chosen to retain some of the basic structure of the BED format and to stick to a simple yet flexible tab-separated layout.
+
+This rough specification uses some terms specific to msyd; these are explained in glossary.md.
+
+The information presented in PSF is richer than what msyd exports to VCF, in that it represents base-level alignments of multisyntenic regions and can represent synteny not found on a single linear reference.
+To provide full compatibility with the BED format, even these annotations can optionally be assigned a coordinate on a linear reference; the region is then placed at the earliest position permitted by the constraints of its neighbouring regions.
+
 
 ## Spec
-- A PFF file consists of a body and a header, and is tab-separated
-- The header starts with '\#ANN' and contains the names of all organisms in the callset, separated by tabs
-- The first three columns contain the genomic coordinates on a linear reference; if a sequence does not have a position on the reference, it is annotated as a single base somewhere between the two nearest regions that constrain its position on the reference.
-- The fourth column contains the ID of each syntenic region, labeling it as either coresyntenic (syntenic across all samples) or merasyntenic (syntenic across only a subset of sequences). IDs present a way to consistently order all merasyntenic regions, even when they do not have a position on a linear reference; if possible, order of genomic coordinates on a reference are preserved.
-- The following columns contain the position of each syntenic region in each sample, as well as the sample this region was aligned against to make the call and optionally the CIGAR string of the alignment, separated by `,`.
-- The position is given as a range, consisting of an optional sample name specifier, a chromosome number and haplotype character separated by : from each other and the start and end positions separated by -
-- x represents an unknown haplotype; X represents a locus in all haplotypes
+- A PSF file consists of a header and a body, and is tab-separated. A `.` represents a missing field wherever one is allowed.
+- The header starts with '\#' and contains the names of all organisms in the callset, separated by tabs
+- The first three columns contain the genomic coordinates on a linear reference; if a sequence does not have a position on the reference, these may be annotated as missing values or as a single base at the earliest position allowed by the constraints of other regions present on the reference. The latter option is provided to maintain compatibility with BED and facilitate tabix indexing.
+- The fourth column contains the ID of the multisyntenic annotation, labeling it as either coresyntenic (syntenic across all samples), merasyntenic (syntenic across only a subset of sequences) or private (not syntenic in any other sample). IDs present a way to consistently order all merasyntenic regions, even when they do not have a position on a linear reference.
+- The fifth column contains the name of the sample chosen as representative for this multisyntenic region; it must be either present in the header annotation, or be 'ref' (the linear reference genome, by convention). The representative sample is the reference to which the alignments of the other samples are reported.
+- The sixth to eighth columns store the position of the multisyntenic region on the representative sample.
+- The following columns contain the position of each syntenic region in each sample and the CIGAR string of the alignment, separated by `,`. The CIGAR string may be elided for smaller file sizes and easier downstream processing, though further multisynteny identification may not work.
+- The position is given as a range, consisting of an optional sample name specifier, a chromosome identifier and haplotype character separated by : from each other and the start and end positions separated by -. Both start and end positions are inclusive.
 - A range is inverted if the start position is after the end position. In this case, any alignments referring to this range have this sequence reversed.
 - A range can contain whitespace characters that are not TAB at arbitrary positions; they are removed during parsing
-- Merasyntenic regions, in particular those not present in the reference may be collapsed into one record by separating the annotations in each column with a `;`
 
 ## Example
 
-ANN|	Ref|	Qry1|	Qry2
--|	-|	-|	-
-SYN|	5:X:10-100,90=|	5:X:10-100,90=|	5:a:10-100,45=5X30=10D;5:b:10-100,90=
-TRANS|	8:X:20-30,10=;9:a:440-450,4=2:X:4=|	8:X:20-30,10=|	8:X:20-30,8=2D
-SNP|	12:X:400-400,1=|	12:X:400-400,1:X:|	12:a:400-400,1=;12:b:400-4001:X:
-INV|	2:X:100-200,100=|	2:X:200-100,100=|	2:X:100-200
-DEL|	4:X:10-20,10=|	|	4:a:10-20,10=
-INS|	|	10:b:30-50,20I|	10:X:30-50,20I
-DUP|	5:X:20-30,10=|	5:X:20-30,10=;5:a:30-40,10I|	5:X:20-30,10=
+CIGAR strings are shortened using [...]
 
+\#CHR|START|END|ANN|REF|CHR|START|END|c24|eri|ler|sha
+---
+Chr1|124|530|MERASYN1|ref|.|.|.|.|Chr1:513-919,53=1X11=1X29=1X33=1X16=1X5=1X34=1X190=1X5=1X22=|.|.
+Chr1|531|1084|MERASYN2|ref|.|.|.|.|Chr1:920-1458,44=1X27=1X94=1X12=1X43=1X136=1X133=15D26=1X4=1X2=1X3=1X5=|.|Chr1:1581-2126,44=1X21=1X5=1X22=1X71=1X12=1X43=1X136=1X173=8D1=1X8=1X
+Chr1|1090|17001|CORESYN1|ref|.|.|.|Chr1:13-15966,1464=1X[...]|Chr1:1464-17369,1=2X[...]|Chr1:6-15923,5235=1I2344=[...]|Chr1:2132-18108,3=5I50=1X1=4D[...]
+Chr1|17002|18729|MERASYN3|ref|.|.|.|Chr1:15967-17694,1X8=1X14=1X4=1X17=1X256=1X95=1X36=1X125=1I265=1X42=1X519=1X3=1X48=1X104=1D178=|.|Chr1:15924-17651,1055=1X672=|.
+Chr1|18730|68988|CORESYN2|ref|.|.|.|Chr1:17695-67953,1793=10D3=1X[...]|Chr1:17378-67673,26=1X33=1X129=2I[...]|Chr1:17652-67896,229=1D1550=4D11696=[...]|Chr1:18117-67579,26=1X33=1X129=2I[...]
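
The sample cells in the example rows above (range, then an optional CIGAR string after a `,`) can be taken apart along the lines of the spec. The following is a minimal sketch in Python, not msyd's own parser: `SampleCell`, `parse_cell` and `span_length` are hypothetical names, and the parts before the coordinates (sample name, chromosome, haplotype) are deliberately left uninterpreted as a tuple, since the spec allows some of them to be omitted.

```python
import re
from dataclasses import dataclass
from typing import Optional

# One CIGAR operation: a count followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([=XIDMNSHP])")


@dataclass
class SampleCell:
    prefix: tuple  # parts before the coordinates, e.g. ("Chr1",); not interpreted here
    start: int     # inclusive
    end: int       # inclusive
    cigar: list    # list of (count, op) tuples; empty if the CIGAR was elided


def parse_cell(cell: str) -> Optional[SampleCell]:
    """Parse one sample column of a PSF body row; returns None for a missing field."""
    # The spec allows non-TAB whitespace anywhere in a range; drop it first.
    cell = "".join(cell.split())
    if cell == ".":
        return None
    rng, _, cigar = cell.partition(",")
    prefix, _, span = rng.rpartition(":")
    start, _, end = span.partition("-")
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    return SampleCell(tuple(prefix.split(":")), int(start), int(end), ops)


def span_length(cell: SampleCell) -> int:
    """Number of bases covered; both ends are inclusive per the spec."""
    return abs(cell.end - cell.start) + 1


# The ler cell of the MERASYN3 row in the example.
ler = parse_cell("Chr1:15924-17651,1055=1X672=")
print(ler.prefix, ler.start, ler.end, span_length(ler), ler.cigar)
```

Keeping the prefix as an uninterpreted tuple avoids guessing whether a two-part prefix means sample and chromosome or chromosome and haplotype, which the spec text alone does not settle.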
diff --git a/docs/glossary.md b/docs/glossary.md
@@ -0,0 +1,17 @@
+# Glossary
+Created 2024-09-11, Leon Rauschning
+
+In the msyd documentation and codebase, we use a number of terms that are specific to msyd, or that carry a different meaning than in general usage or other genomics contexts.
+Through the development of msyd, these terms have evolved somewhat – owing as much to the shifting nature of methods for multi-genomics analyses as to the idiosyncrasies small groups like to develop.
+This glossary represents an attempt to capture the meaning of the most important terms and the intuition behind them as msyd is nearing release, in the hopes that it will be useful for working with msyd and its output.
+
+- Synteny: Two genomic regions are considered to be in synteny (syntenic to each other) when they are found in the same large-scale genomic context and align to each other. It can be thought of as the absence of structural variation. Synteny forms a backbone along the chromosomes of a species, and enables biological processes like meiotic pairing and recombination. In contrast to other tools, SyRI and by extension msyd do not consider genome annotations when finding synteny.
+- Structural Variation (SV): Structural variants are large-scale differences in the genomes of a population. Various processes may cause structural variation, including transposon activity, double-strand breaks and non-homologous recombination. SVs may contain small variants like SNPs or indels, or other SVs.
+- Multisynteny: A multisyntenic region is a region in the genome that is syntenic between any two individual genomes in a subset of a population. It can be either mera- or coresyntenic. msyd stores its position in each of the genomes that contain it. One individual genome of this subset is chosen by msyd as the representative of that multisyntenic region, and alignments of this representative to other organisms are (optionally) reported as CIGAR strings. If a multisyntenic region is present in a linear reference given to msyd, the reference will always be chosen as the representative sample.
+- Degree of multisynteny: The number of organisms a multisyntenic region is shared by. It is somewhat analogous to the concept of allele frequency. The degree of a multisyntenic region is always between two and the number of individuals in the population.
+- Coresynteny: A multisyntenic region encompassing the entire population. Its degree will always be the number of individuals in the population. Because coresyntenic regions are structurally conserved and occur in every genome, they have a single defined ordering shared by all individuals. This can be thought to represent the syntenic backbone of a population.
+- Merasynteny: A merasyntenic region is multisyntenic between a proper subset of the sampled population, so its degree is always less than the number of individuals in the population. Merasynteny can be thought of as corresponding to alleles of a structural variant. In contrast to coresyntenic regions, merasyntenic regions do not have a defined position in all genomes of the population, though they are implicitly constrained by the surrounding coresyntenic regions. If prompted, msyd places these regions at the first position allowed by this constraint in the coordinate space of a linear reference.
+- Private: Structurally private regions are regions that are not syntenic to any other region in any genome in the population. This does not necessarily mean that they do not exist elsewhere; they may simply be in a different structural context, or too diverged or repetitive to align. In practice, private regions are highly enriched in the centromeres and rDNA arrays.
+- msyd/multisynteny detector (pronounced /msi:d/, M-SEE-D): Software for efficiently finding multisynteny in a population of genomes. msyd uses different algorithms to achieve this, most notably the Synteny Intersection Algorithm and Iterative Realignment. The native file format of msyd is the Population Synteny File Format (PSF), though msyd can also export multisynteny calls to VCF. msyd also provides subcommands that can filter PSF files, compute statistics on them or use multisynteny for ordering genomes for plotsr plotting. msyd can also merge small variant calls of different samples in structurally conserved regions.
+- Population Synteny File (Format): The native file format used by msyd to store called multisynteny. It loosely retains some BED-like features. PSF can contain CIGAR strings to allow further processing and reading by msyd, or be stored without CIGAR strings for better readability and smaller file sizes.
+
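
The degree of multisynteny defined in the glossary above can be read off a PSF body row by counting the genomes that carry the region. The sketch below is illustrative only: `degree` is a hypothetical helper, not part of msyd. It assumes eight leading columns before the per-sample columns, as laid out in docs/format.md, and it assumes (this is not stated in the docs) that a representative of `ref` counts as one additional genome, because the reference has no sample column of its own.

```python
def degree(row: str, n_fixed: int = 8) -> int:
    """Count the genomes sharing the region described by one PSF body row.

    n_fixed is the number of leading columns before the per-sample columns
    (assumed to be 8, following docs/format.md).
    """
    fields = row.rstrip("\n").split("\t")
    representative = fields[4]
    # A "." marks a sample in which the region is not present.
    present = sum(1 for cell in fields[n_fixed:] if cell.strip() != ".")
    # Assumption: the linear reference counts as a genome when it is the representative.
    return present + (1 if representative == "ref" else 0)


# MERASYN1 from the format.md example, with the CIGAR string elided.
row = "Chr1\t124\t530\tMERASYN1\tref\t.\t.\t.\t.\tChr1:513-919\t.\t."
print(degree(row))  # 2
```

Under the same assumption, a region is coresyntenic when its degree equals the number of sample columns plus one for the reference, and merasyntenic otherwise.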
diff --git a/msyd/pyxfiles/io.pyx b/msyd/pyxfiles/io.pyx
@@ -640,7 +640,7 @@ cpdef save_df_to_psf(df, buf, save_cigars=True, force_ref_pos=False):
         int coreend = 0
         str corechr = ''
 
-    buf.write("#CHR\tSTART\tEND\tANN\tREF\tCHR\tSTART\tEND\t")
+    buf.write("#CHR\tSTART\tEND\tANN\tREP\tRCHR\tRSTART\tREND\t")
     buf.write("\t".join(orgs))
     buf.write("\n")
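
For reference, the header produced by the changed line can be checked in isolation. This is a standalone sketch that mirrors the three `buf.write` calls from the hunk above; `write_psf_header` is a hypothetical wrapper, not a function in `io.pyx`, and the sample names are the ones used in the docs/format.md example.

```python
import io


def write_psf_header(buf, orgs):
    """Write the PSF header line as in the patched save_df_to_psf hunk."""
    buf.write("#CHR\tSTART\tEND\tANN\tREP\tRCHR\tRSTART\tREND\t")
    buf.write("\t".join(orgs))
    buf.write("\n")


buf = io.StringIO()
write_psf_header(buf, ["c24", "eri", "ler", "sha"])
header = buf.getvalue()

# Strip the leading '#' and split into column names.
columns = header.rstrip("\n").lstrip("#").split("\t")
print(columns)
```

One visible effect of the rename is that every column name in the header is now unique, whereas the old header repeated `CHR`, `START` and `END`; this presumably makes it easier to load a PSF file into tools that key columns by name.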