Can Mash accurately classify subspecies? #172

Open
rpalcab opened this issue Mar 30, 2022 · 1 comment

@rpalcab
rpalcab commented Mar 30, 2022

Hello,

I'm currently working on Mycobacterium caprae and Mycobacterium bovis. These subspecies of the M. tuberculosis complex are phylogenetically very similar, so the task of identifying them is not always trivial.

In one of my analyses, I expected all the samples to be M. caprae, but in the Mash screen results many of them could be assigned to either subspecies: the top hits have the same shared-hashes score and p-value, or differ by only 1 in the shared-hashes score.

#Sample A

0.99957	991/1000	77	0	GCF_001941665.1_ASM194166v1_genomic.fna.gz	NZ_CP016401.1 Mycobacterium caprae strain Allgaeu genome
0.99957	991/1000	77	0	GCF_001483905.1_ASM148390v1_genomic.fna.gz	NZ_CP013741.1 Mycobacterium bovis strain BCG-1 (Russia), complete genome
0.99957	991/1000	77	0	GCF_001274555.1_ASM127455v1_genomic.fna.gz	NZ_CP009243.1 Mycobacterium bovis BCG strain Russia 368, complete genome

#Sample B

0.999377	987/1000	193	0	GCF_000195835.1_ASM19583v1_genomic.fna.gz	NC_002945.3 Mycobacterium bovis AF2122/97 chromosome, complete genome
0.999329	986/1000	193	0	GCF_001941665.1_ASM194166v1_genomic.fna.gz	NZ_CP016401.1 Mycobacterium caprae strain Allgaeu genome
0.999329	986/1000	193	0	GCF_001580385.1_ASM158038v1_genomic.fna.gz	NZ_CP014566.1 Mycobacterium bovis BCG str. Tokyo 172 substrain TRCS, complete genome

This makes me wonder whether Mash screen can identify organisms at the subspecies level. Also, is a difference of 1 in the shared-hashes score robust enough to determine the taxonomy of an organism?

Thanks in advance
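
To put the numbers above in perspective, here is a minimal Python sketch of how the identity column relates to the shared-hashes column. It assumes the identity is the containment fraction raised to the power 1/k and that the sketches used Mash's default k of 21; the issue does not state k, so treat that as an assumption. The function name `screen_identity` is made up for illustration and is not part of Mash.

```python
def screen_identity(shared: int, sketch_size: int, k: int = 21) -> float:
    """Identity estimate from a containment fraction.

    Assumes identity = (shared / sketch_size) ** (1 / k), with k = 21
    taken as the k-mer size (an assumption; the issue does not give k).
    """
    return (shared / sketch_size) ** (1.0 / k)


# Shared-hash counts taken from the Sample A and Sample B tables above.
for shared in (991, 987, 986):
    print(f"{shared}/1000 -> {screen_identity(shared, 1000):.6g}")

# Identity gap produced by a one-hash difference (987 vs 986 out of 1000).
gap = screen_identity(987, 1000) - screen_identity(986, 1000)
print(f"identity gap for one hash: {gap:.2e}")
```

Under these assumptions the computed values agree with the identity column in both tables, and a one-hash difference moves the identity by roughly 0.00005, which is the smallest step a sketch of 1000 hashes can express.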

@sheikki
sheikki commented Nov 7, 2023

In my experience, a k-mer size of 17 (-k 17) and a sketch size of 50000 (-s 50000) are enough to differentiate Salmonella serovars. The default sketch size of just 1000 certainly doesn't provide enough resolution for subspecies and similar fine-grained distinctions.
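
A rough way to see why sketch size matters is to treat each sketch hash as an independent trial and look at the binomial spread of the shared-hash count. This is only a back-of-the-envelope approximation, not how Mash computes its p-values, and it covers only the sketch size, not the change in k. The containment of 0.987 is taken from Sample B in the original post, and the helper name `shared_hash_std` is hypothetical.

```python
import math


def shared_hash_std(containment: float, sketch_size: int) -> float:
    """Binomial standard deviation of the shared-hash count.

    Assumes each of the sketch_size hashes is an independent trial that is
    shared with probability `containment` (a simplifying assumption).
    """
    return math.sqrt(sketch_size * containment * (1.0 - containment))


# Compare the default sketch size with the larger one suggested above.
for s in (1000, 50000):
    sd = shared_hash_std(0.987, s)
    print(f"s={s}: sd of about {sd:.1f} hashes, {sd / s:.5f} as a fraction")
```

Under this approximation the spread at s = 1000 is a few hashes, so a difference of 1 shared hash sits well inside the sampling noise. The spread expressed as a fraction of the sketch shrinks with the square root of the sketch size, so going from 1000 to 50000 hashes reduces it about sevenfold.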
