PROFUNGIS post-processing: loading per-sample ZOTU FASTA files into a reference ZOTU table
The following scripts post-process ZOTU sequence data generated by the PROFUNGIS pipeline.
Script1: name: generate_zotu_ref1.py desc: generates a reference file from a FASTA file and appends attributes, so that the result can be used as a template for uploading to the ZOTU entity of MDDB.
Script2: name: update_ref_map.py desc: updates the ZOTU reference file with new ZOTUs and keeps track of ZOTUs that are already present in the reference.
Please note that PROFUNGIS must be run before these scripts, because its output FASTA files are their input. The PROFUNGIS output FASTA files are used to update the MDDB tables refseq ZOTU and contains (the ZOTU - sample mapping).
generate_zotu_ref1.py
The script takes two arguments from the user: a FASTA file and a flag ('Y'/'N') stating whether primary keys should be generated.
If the second argument is 'N', a CSV file is generated with the column 'seq_id', taken from the original FASTA sequence label, followed by the column 'sequence', which contains the sequence itself (the ITS2 marker in this case).
If the second argument is 'Y', a CSV file is generated that includes a 'refseq_pk' column; the key is built using a specific suffix that refers to the db table of the ZOTU reference set.
A relationship table is also created, in which the following mapping is stored: SRR name | Zotu label | Primary key assigned to the Zotu label
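The two modes described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the function names (`parse_fasta`, `build_reference_rows`, `write_csv`), the `ZOTU_` key prefix, and the zero-padded key format are hypothetical, and the sketch assumes that the 'Y' output carries the same two columns plus the key. A small hand-written FASTA parser stands in for Biopython so that the example is self-contained.

```python
import csv
import io


def parse_fasta(handle):
    """Yield (label, sequence) pairs from a FASTA-formatted text handle."""
    label, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            header = line[1:].split()
            label, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def build_reference_rows(records, with_pk, prefix="ZOTU_"):
    """Build one row per ZOTU; the key format (prefix + counter) is an assumption."""
    rows = []
    for number, (seq_id, sequence) in enumerate(records, start=1):
        row = {"seq_id": seq_id, "sequence": sequence}
        if with_pk:
            # Put the hypothetical primary key in front of the other columns.
            row = {"refseq_pk": f"{prefix}{number:06d}", **row}
        rows.append(row)
    return rows


def write_csv(rows, handle):
    """Write the rows as CSV with a header line; do nothing for an empty list."""
    if not rows:
        return
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Toy input: two ZOTUs, the first one wrapped over two lines.
fasta_text = ">Zotu1\nACGTACGT\nTTGA\n>Zotu2\nGGCCAATT\n"
records = list(parse_fasta(io.StringIO(fasta_text)))

rows_n = build_reference_rows(records, with_pk=False)  # second argument 'N'
rows_y = build_reference_rows(records, with_pk=True)   # second argument 'Y'

out = io.StringIO()
write_csv(rows_y, out)
print(out.getvalue())
```

Writing to an in-memory buffer keeps the sketch free of side effects; replacing `io.StringIO()` with an opened file would produce a CSV on disk.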
update_ref_map.py
This script can be run after a reference ZOTU file has been generated (for example with generate_zotu_ref1.py); it updates the reference and traces ZOTU sequences that already exist in it (ZOTU detected). This version of the script takes one FASTA argument from the user (string). Update: the argument can be either a single FASTA file name <fasta.fa filename> or a file listing several FASTA files <fasta_list.txt filename>.
A ZOTU reference file is provided as input; it is usually generated as a dump of the existing MDDB ZOTU reference table. The previous tracker is also provided, so that the script can update it and keep count of how many new reference ZOTU sequences are generated.
The script also generates a mapping RefZOTU - NewZOTU in the following format:
SRR id | Zotu label | Primary key of ref ZOTU
The script assigns new primary keys to ZOTUs that are not detected among the reference ZOTUs. For ZOTUs that have already been detected as a reference ZOTU, the mapping gives the primary key of that reference ZOTU.
This mapping is useful because it provides two types of information:
- which ZOTUs are new,
- which ZOTUs are shared among different samples
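The update step can be illustrated with a short sketch. Several things here are assumptions rather than documented behaviour: a ZOTU counts as "detected" when its sequence is identical to a reference sequence, the tracker is approximated by the size of the reference, and the function name `update_reference`, the `ZOTU_` key format, and the SRR ids are placeholders.

```python
from collections import defaultdict


def update_reference(reference, mapping, srr_id, new_records, prefix="ZOTU_"):
    """Match new ZOTUs against the reference by exact sequence identity.

    reference: dict of upper-case sequence -> primary key (updated in place)
    mapping:   list of (srr_id, zotu_label, primary_key) rows (updated in place)
    Returns the labels of the ZOTUs that were not yet in the reference.
    """
    new_labels = []
    next_number = len(reference) + 1  # stand-in for the tracker file
    for label, sequence in new_records:
        key = sequence.upper()
        pk = reference.get(key)
        if pk is None:
            # Unknown sequence: create a new reference entry with a new key.
            pk = f"{prefix}{next_number:06d}"
            reference[key] = pk
            next_number += 1
            new_labels.append(label)
        # Known or new, the sample's ZOTU label is mapped to a reference key.
        mapping.append((srr_id, label, pk))
    return new_labels


# Existing reference and mapping, e.g. loaded from the previous CSV files.
reference = {"ACGTACGTTTGA": "ZOTU_000001", "GGCCAATT": "ZOTU_000002"}
mapping = [
    ("SRR0000001", "Zotu1", "ZOTU_000001"),
    ("SRR0000001", "Zotu2", "ZOTU_000002"),
]

# ZOTUs from a second sample: one already known, one new.
new_records = [("Zotu1", "GGCCAATT"), ("Zotu2", "TTTTCCCC")]
new_labels = update_reference(reference, mapping, "SRR0000002", new_records)

# Reference ZOTUs observed in more than one sample.
samples_per_pk = defaultdict(set)
for srr, _label, pk in mapping:
    samples_per_pk[pk].add(srr)
shared = sorted(pk for pk, srrs in samples_per_pk.items() if len(srrs) > 1)

print("new:", new_labels)
print("shared:", shared)
```

Keying the reference on the sequence rather than on the label matters because ZOTU labels such as Zotu1 are reused in every sample, so only the sequence can identify a ZOTU across samples.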
####generate_zotu_ref1.py
generate_zotu_ref1.py <srr_filename.fa>
where <srr_filename.fa> is in FASTA format -- PROFUNGIS names its ZOTU FASTA files after the SRA run accession (SRR)
(ex: SRR1502226_zotus_final.fa)
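Because the file name carries the SRR accession, the SRR name used in the mapping table can be recovered from it. The helper below is a hypothetical sketch (the name `srr_from_filename` and the pattern are assumptions); whether the script derives the id this way is not stated here.

```python
import re
from pathlib import Path

# Assumes the file name starts with "SRR" followed by digits.
SRR_PATTERN = re.compile(r"^(SRR\d+)")


def srr_from_filename(path):
    """Return the SRR accession at the start of the file name, or None."""
    match = SRR_PATTERN.match(Path(path).name)
    return match.group(1) if match else None


srr_id = srr_from_filename("SRR1502226_zotus_final.fa")
print(srr_id)
```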
####update_ref_map.py
update_ref_map.py <srr_filename.fa> <RefZOTU.csv> <mapping.csv>
where <srr_filename.fa> -- the next (new) ZOTU file (FASTA format) generated by PROFUNGIS
<RefZOTU.csv> -- ZOTU reference file (unique ZOTUs)
<mapping.csv> -- the mapping table that records which ZOTU sequence belongs to which sample
HINT: for testing you can use the output files generated by generate_zotu_ref1.py
(ex: update_ref_map.py <newfasta.fa> <refseq_table_pk.csv> <mapping_table_pk_zotu_srr.csv>)
generate_zotu_ref1.py requires the following Python modules:
- csv
- pandas
- re
- Bio
update_ref_map.py requires the following Python modules:
- csv
- pandas
- re
- Bio
###OUTPUTS
- generate_zotu_ref1.py -> ZOTU reference file with extended annotation
  outputs:
  - mapping_table_pk_zotu_srr.csv -> traces the mapping of the original FASTA label to the assigned PK
  - otu_seq_mapping_to_update.csv -> a simple reference ZOTU table generated from the given FASTA
  - record_track.csv -> tracker of how many reference ZOTUs have been generated
  - refseq_table_pk.csv -> the ZOTU list with extended annotation used as reference
- update_ref_map.py -> update ZOTU reference table and mappings
  outputs:
  - mapping_table_pk_zotu_srr.csv -> updates the mapping table of the original FASTA label to the assigned PK, or to a new PK generated if not found
  - otu_seq_mapping_to_update.csv -> provides the table format of the new ZOTUs coming in for update
  - record_track.csv -> updates the tracker of how many reference ZOTUs have been generated from the new FASTA
  - refseq_table_pk.csv -> the updated ZOTU list, with a new PK generated if a new ZOTU was detected