How is a cluster representative sequence determined #881

nooryoussef · 2024-08-26T16:01:41Z

I used MMseqs2 to cluster sequences and am trying to understand how the representative sequence for each cluster is determined. Here’s a summary of the code I ran:

# Step 1: Convert the fasta file to a database (only needs to be done once for each fasta file)
mmseqs createdb fasta_file.fa queryDB

# Step 2: Cluster the database
mmseqs cluster queryDB queryDB_clus tmp --min-seq-id 0.75 -c 0.8 --cov-mode 1

# Step 3: Create a TSV file of the clusters
mmseqs createtsv queryDB queryDB queryDB_clus queryDB_clus.tsv

To investigate how the representative sequence is selected, I ran the following additional steps:

# Step 4: Align all sequences within each cluster against each other
mmseqs alignall queryDB queryDB_clus queryDB_alnall

# Step 5: Create a TSV file of the alignment results
mmseqs createtsv queryDB queryDB queryDB_alnall queryDB_alnall.tsv

I then computed, for every sequence, its average identity and average alignment score against the other members of its cluster. On neither metric does the sequence chosen by MMseqs2 as the representative rank highest. How is the representative sequence determined in this case?
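
For reference, the comparison I describe can be sketched roughly as follows. This is a minimal sketch, not my exact script: the function names are made up for illustration, the toy rows stand in for the two TSV files, and it assumes that the identity sits in the fourth column of the alignment TSV and that the cluster TSV lists the representative first and the member second. Those column positions should be checked against the actual files.

```python
from collections import defaultdict


def mean_identity_per_member(alnall_rows, id_col=3):
    """Average the identity of all non-self alignments listed for each query.

    id_col=3 is an assumption about the createtsv column layout.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in alnall_rows:
        query, target = row[0], row[1]
        if query == target:
            continue  # a self-alignment would inflate the average
        totals[query] += float(row[id_col])
        counts[query] += 1
    return {q: totals[q] / counts[q] for q in totals}


def best_member_per_cluster(cluster_rows, mean_identity):
    """Map each representative to the cluster member with the highest mean identity."""
    members = defaultdict(list)
    for row in cluster_rows:
        rep, member = row[0], row[1]  # assumed order: representative, member
        members[rep].append(member)
    best = {}
    for rep, seqs in members.items():
        scored = [s for s in seqs if s in mean_identity]
        best[rep] = max(scored, key=mean_identity.get) if scored else None
    return best


# Toy data with made-up identifiers and values, standing in for
# queryDB_clus.tsv and queryDB_alnall.tsv.
cluster_rows = [("A", "A"), ("A", "B"), ("A", "C")]
alnall_rows = [
    ("A", "A", "100", "1.00"),
    ("A", "B", "80", "0.80"),
    ("A", "C", "78", "0.78"),
    ("B", "A", "80", "0.80"),
    ("B", "C", "90", "0.90"),
    ("C", "A", "78", "0.78"),
    ("C", "B", "90", "0.90"),
]

means = mean_identity_per_member(alnall_rows)
best = best_member_per_cluster(cluster_rows, means)
for rep, member in best.items():
    print(f"representative={rep} highest_mean_identity={member}")
```

With the real files, the rows can be read with `csv.reader(open("queryDB_alnall.tsv"), delimiter="\t")` and likewise for `queryDB_clus.tsv`. In the toy data the representative `A` is not the member with the highest mean identity, which is the same pattern I see in my clusters.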
