Gene Name attributes that start with gensp. #187
Here are my opinions about those cases:
Notes: *[2]
For these, I'd like a second opinion from @adf-ncgr and Joann if appropriate.
Thanks for the opinions, @StevenCannon-USDA. I'll put these into a to-do checkbox list here and start updating them after giving @adf-ncgr and @joannmudge a chance to object. As for the Zhou, Silverstein, et al. genomes, I agree that the Name attribute should be just the final piece (
OK, I'm in agreement with most of these, but I think the original Names for tripr were actually like "Tp57577_TGAC_v2_gene10066", not just "gene10066", so should we use that instead, like Phytozome and Ensembl seem to do? I'm still a little unclear about what the principle is here (originalism or aesthetics), though we once tried to pin it down here: legumeinfo/datastore-specifications#44.

I would personally vote to keep medtr.HMxxx.g1 (or at least HMxxx.g1), which is seemingly no more problematic than having GlymaLee as part of a name; it just happens to also be identical with part of our full yuck system. But if we think g1 is better for any given medicago accession, I think that implies that strict Name originalism is the principle here, no matter how bad we think the names are, meaning we should be stuck with Tp57577_TGAC_v2_gene10066. Whatever we decide, let's take it as an opportunity to resolve the open questions in legumeinfo/datastore-specifications#44.
I think we're close to convergence, and are down to the point of splitting hairs - which I guess is unavoidable. And a key clause:
I would say that "exceptionally cumbersome strings ... if those prefixes do not contribute to the uniqueness of the names within the annotation file" is a fair description of Tp57577_TGAC_v2_gene10066. I mean: the Trifolium team has encoded genus (T), species (p), accession (57577, I think), sequencing center (TGAC), and assembly version (v2). I think this is a worthy case for an exception (shortening it to "gene10066"). But I won't fight anyone over it. If Sam is implementing, I say: go ahead and do what you think is right, and we'll be prepared to be delighted.
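To make the "does not contribute to uniqueness" test concrete, here is a minimal Python sketch. It assumes the prefix is known in advance (here "Tp57577_TGAC_v2_", read off the example above); the function name `shorten_if_redundant` and the second gene name are made up for illustration and are not part of any Datastore tooling.

```python
def shorten_if_redundant(names, prefix):
    """Strip `prefix` from every name, but only if doing so is safe.

    "Safe" here means: every name carries the prefix, no name becomes
    empty, and the shortened names are exactly as unique as the originals.
    Otherwise the names are returned unchanged.
    """
    names = list(names)
    if not all(n.startswith(prefix) for n in names):
        return names
    shortened = [n[len(prefix):] for n in names]
    if "" in shortened or len(set(shortened)) != len(set(names)):
        return names
    return shortened


# The first name is from this thread; the second is a hypothetical neighbor.
example = ["Tp57577_TGAC_v2_gene10066", "Tp57577_TGAC_v2_gene10067"]
print(shorten_if_redundant(example, "Tp57577_TGAC_v2_"))
```

If any name in the file lacked the prefix, or two names collapsed to the same string, the sketch would leave everything untouched, which matches the cautious reading of the clause.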
Thanks @StevenCannon-USDA, sounds like that clause is indeed the final refuge for the hair-splitters! I am in favor of shortening where there is substantial overlap with what full yuck is accomplishing. I think this would mean that we'd allow:
@adf-ncgr - yeah, I think I'd leave ... When you're running an Airbnb and some guests insist on bringing all their own furniture.
This issue (see the title) is about the gensp prefixes, which it appears we all agree should be dropped. A protocol for how we populate the Name attribute otherwise is certainly a Good Thing. I don't see any argument for keeping the gensp prefix here, so I'll yank those from the appropriate places, and we can move the discussion of Names in general back to legumeinfo/datastore-specifications#44. I'll keep this issue open just so I can hit my checkboxes.
And yes, in the few cases where Name is full-yuck, I'll de-yuckify it down to the non-yuck portion (example: glyma.Lee.gnm1.ann1.GlymaLee.01G000100 ==> GlymaLee.01G000100).
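For anyone who wants to try the de-yuckifying step on a Name string, here is a rough Python sketch. The regular expression is an assumption about what a full-yuck prefix looks like (five lowercase letters for gensp, then accession, gnmN, annN), inferred from the example above and the file names in this issue; it is not an official Datastore definition, and `deyuckify` is a made-up helper name.

```python
import re

# ASSUMPTION: a full-yuck prefix has the shape gensp.accession.gnmN.annN.
# where gensp is five lowercase letters and the accession contains no dots.
FULL_YUCK_PREFIX = re.compile(r"^[a-z]{5}\.[^.]+\.gnm\d+\.ann\d+\.")


def deyuckify(name: str) -> str:
    """Return `name` without a leading full-yuck prefix, if it has one."""
    return FULL_YUCK_PREFIX.sub("", name, count=1)


print(deyuckify("glyma.Lee.gnm1.ann1.GlymaLee.01G000100"))  # GlymaLee.01G000100
print(deyuckify("GlymaLee.01G000100"))  # unchanged: no full-yuck prefix
```

Names that only carry a gensp prefix (without the gnm/ann parts) are deliberately left alone by this pattern; those would need a separate rule.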
We talked about this already, but I'd like to take action and update the Datastore where appropriate.

There are a number of gene_models_main GFFs in which the Name attribute starts with the gensp prefix. This seems non-conformant to me, in the sense that Name is meant to be what a gene is called in the source material, and the gensp prefixes tend to be an LIS thing.

Here's the list with an example of a GFF line for each case. I'd like @StevenCannon-USDA to confirm that the gene Name attributes should, in fact, contain the gensp prefix in these cases or, when not, to update the GFFs (outsourcing that to me is fine). Also, @adf-ncgr may have some arcane reasons for including the gensp prefix in certain cases. (Name uniqueness does not qualify as a reason, in my opinion, but he may have some JBrowse-related or other reasons for doing so.)
cajca.ICPL87119.gnm1.ann1.Y27M.gene_models_main.gff3.gz
cicar.CDCFrontier.gnm1.ann1.nRhs.gene_models_main.gff3.gz
cicar.ICC4958.gnm2.ann1.LCVX.gene_models_main.gff3.gz
Glycine/max/annotations/Lee.gnm1.ann1.6NZV/glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3.gz
Glycine/max/annotations/Wm82_ISU01.gnm2.ann1.FGFB/glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz
Glycine/soja/annotations/PI483463.gnm1.ann1.3Q3Q/glyso.PI483463.gnm1.ann1.3Q3Q.gene_models_main.gff3.gz
Glycine/soja/annotations/W05.gnm1.ann1.T47J/glyso.W05.gnm1.ann1.T47J.gene_models_main.gff3.gz
Lupinus/albus/annotations/Amiga.gnm1.ann1.3GKS/lupal.Amiga.gnm1.ann1.3GKS.gene_models_main.gff3.gz
Lupinus/angustifolius/annotations/Tanjil.gnm1.ann1.nnV9/lupan.Tanjil.gnm1.ann1.nnV9.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM056.gnm1.ann1.CHP6/medtr.HM056.gnm1.ann1.CHP6.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM058.gnm1.ann1.LXPZ/medtr.HM058.gnm1.ann1.LXPZ.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM060.gnm1.ann1.H41P/medtr.HM060.gnm1.ann1.H41P.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM095.gnm1.ann1.55W4/medtr.HM095.gnm1.ann1.55W4.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM125.gnm1.ann1.KY5W/medtr.HM125.gnm1.ann1.KY5W.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM129.gnm1.ann1.7FTD/medtr.HM129.gnm1.ann1.7FTD.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM185.gnm1.ann1.GB3D/medtr.HM185.gnm1.ann1.GB3D.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM324.gnm1.ann1.SQH2/medtr.HM324.gnm1.ann1.SQH2.gene_models_main.gff3.gz
Trifolium/pratense/annotations/MilvusB.gnm2.ann1.DFgp/tripr.MilvusB.gnm2.ann1.DFgp.gene_models_main.gff3.gz
Vigna/unguiculata/annotations/IT97K-499-35.gnm1.ann2.FD7K/vigun.IT97K-499-35.gnm1.ann2.FD7K.gene_models_main.gff3.gz
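A check like the one that produced this list could be sketched in Python as follows. This is only an illustration, not the script actually used: the function names are made up, it only looks at rows whose type column is "gene", and the demo GFF lines at the bottom are invented rather than taken from any of the files above.

```python
from urllib.parse import unquote


def name_attribute(attributes: str):
    """Return the value of the Name tag from a GFF3 column-9 string, or None."""
    for field in attributes.split(";"):
        key, sep, value = field.partition("=")
        if sep and key.strip() == "Name":
            return unquote(value)  # GFF3 attribute values may be URL-escaped
    return None


def gensp_prefixed_names(lines, gensp):
    """Yield gene Names that start with '<gensp>.' from an iterable of GFF3 lines."""
    prefix = gensp + "."
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        name = name_attribute(cols[8])
        if name is not None and name.startswith(prefix):
            yield name


# Invented demo lines (NOT real Datastore content).
demo = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=x.gene1;Name=cajca.gene1",
    "chr1\tsrc\tgene\t200\t300\t.\t-\t.\tID=x.gene2;Name=gene2",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=x.gene1.1;Name=cajca.gene1.1",
]
print(list(gensp_prefixed_names(demo, "cajca")))
```

For a real file, the lines could come from `gzip.open(path, "rt")`, with the gensp value taken from the first dot-separated field of the file name.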