How is a cluster representative sequence determined #881

Open
nooryoussef opened this issue Aug 26, 2024 · 0 comments

I used MMseqs2 to cluster sequences and am trying to understand how the representative sequence of each cluster is chosen. Here is a summary of the commands I ran:

# Step 1: Convert the fasta file to a database (only needs to be done once for each fasta file)
mmseqs createdb fasta_file.fa queryDB

# Step 2: Cluster the database
mmseqs cluster queryDB queryDB_clus tmp --min-seq-id 0.75 -c 0.8 --cov-mode 1

# Step 3: Create a TSV file of the clusters
mmseqs createtsv queryDB queryDB queryDB_clus queryDB_clus.tsv

To investigate how the representative sequence is selected, I ran the following additional steps:

# Step 4: Align all pairs of sequences within each cluster
mmseqs alignall queryDB queryDB_clus queryDB_alnall

# Step 5: Create a TSV file of the alignment results
mmseqs createtsv queryDB queryDB queryDB_alnall queryDB_alnall.tsv

After computing the average sequence identity and the average alignment score of each sequence against the other members of its cluster, I found that the sequence chosen by MMseqs2 as the representative does not have the highest value for either metric. How is the representative sequence determined in this case?
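
For reference, the per-cluster comparison described above can be sketched roughly as follows. This is a minimal Python sketch rather than the exact script used, and the function names are hypothetical. It assumes that queryDB_clus.tsv has the representative in the first column and the member in the second, and that queryDB_alnall.tsv has query, target, alignment score, and sequence identity as its first four tab-separated columns; the identity column index is a parameter in case the layout differs.

```python
from collections import defaultdict


def load_clusters(tsv_lines):
    """Map each member ID to its representative ID.

    Assumes two tab-separated columns: representative, member.
    """
    rep_of = {}
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rep_of[fields[1]] = fields[0]
    return rep_of


def mean_identity_per_sequence(aln_lines, rep_of, id_col=3):
    """Average identity of each sequence to the other members of its cluster.

    id_col is the 0-based column holding sequence identity (assumed to be 3).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in aln_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= id_col:
            continue
        query, target = fields[0], fields[1]
        rep = rep_of.get(query)
        # Ignore self-alignments and pairs that are not in the same cluster.
        if rep is None or query == target or rep != rep_of.get(target):
            continue
        totals[query] += float(fields[id_col])
        counts[query] += 1
    return {seq: totals[seq] / counts[seq] for seq in totals}


def best_by_mean_identity(rep_of, mean_ident):
    """For each representative, return the member with the highest mean identity."""
    best = {}
    for seq, value in mean_ident.items():
        rep = rep_of[seq]
        if rep not in best or value > mean_ident[best[rep]]:
            best[rep] = seq
    return best
```

Calling `load_clusters(open("queryDB_clus.tsv"))`, then `mean_identity_per_sequence(open("queryDB_alnall.tsv"), rep_of)`, and finally `best_by_mean_identity(rep_of, mean_ident)` yields, for each representative, the member with the highest mean identity. Any cluster where the two differ is a case of the kind described above.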
