I used MMseqs2 to cluster sequences and am trying to understand how the representative sequence for each cluster is determined. Here’s a summary of the code I ran:
# Step 1: Convert the fasta file to a database (only needs to be done once for each fasta file)
mmseqs createdb fasta_file.fa queryDB
# Step 2: Cluster the database
mmseqs cluster queryDB queryDB_clus tmp --min-seq-id 0.75 -c 0.8 --cov-mode 1
# Step 3: Create a TSV file of the clusters
mmseqs createtsv queryDB queryDB queryDB_clus queryDB_clus.tsv
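For reference, here is a minimal sketch of how the cluster TSV can be summarised, assuming it has two tab-separated columns with the representative in the first column and a cluster member in the second. The function name `cluster_sizes` is my own and not part of MMseqs2.

```shell
# Hypothetical helper (not an MMseqs2 command): print each representative
# together with the number of members in its cluster.
# Assumes a two-column, tab-separated file: representative<TAB>member.
cluster_sizes() {
    awk -F'\t' '{ n[$1]++ } END { for (r in n) print r "\t" n[r] }' "$1" | sort
}
```

It would be called as `cluster_sizes queryDB_clus.tsv`.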
To investigate how the representative sequence is selected, I ran the following additional steps:
# Step 4: Compute distances between all sequences within a cluster
mmseqs alignall queryDB queryDB_clus queryDB_alnall
# Step 5: Create a TSV file of the alignment results
mmseqs createtsv queryDB queryDB queryDB_alnall queryDB_alnall.tsv
After analyzing the average identity and average score between sequences within each cluster, I found that neither of these metrics is highest for the sequence chosen by MMseqs2 as the representative. How is the representative sequence determined in this case?
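The averaging I describe can be sketched roughly as follows. This is an illustration only: the function name `mean_identity` is hypothetical, and the assumption that sequence identity sits in column 4 of the alignment TSV should be checked against the actual file (the column index can be passed as the second argument). Self-alignments are skipped so they do not inflate the mean.

```shell
# Hypothetical helper (not an MMseqs2 command): mean identity of each
# sequence against the other sequences it is aligned to.
# $1: tab-separated alignment file with query in column 1, target in column 2
# $2: 1-based column holding sequence identity (ASSUMED to be 4 by default)
mean_identity() {
    awk -F'\t' -v col="${2:-4}" '
        $1 != $2 { sum[$1] += $col; n[$1]++ }   # skip self-alignments
        END { for (q in n) printf "%s\t%.3f\n", q, sum[q] / n[q] }
    ' "$1" | sort
}
```

It would be called as `mean_identity queryDB_alnall.tsv`, and the result can then be compared with the representatives in the first column of `queryDB_clus.tsv`.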