
Create a "non-data" full release pipeline (ontology metadata curation only) #382

Open
kltm opened this issue Jul 23, 2024 · 7 comments

kltm commented Jul 23, 2024

To support the joint pipeline with GOA, we want to create a high-frequency, high-success pipeline that produces all of the data products that GOA needs to complete their parts of the pipeline.

We want to produce:

  • Ontology
  • Users, groups, dbxrefs, and other metadata
  • Curation tool resources
    • PAINT annotations
    • Noctua-derived annotations (standard and GO-CAM)
    • "Automated upstreams" (MGI)

These should be published in a way that enables easy pickup and signalling for GOA.

kltm commented Jul 23, 2024

Current discussion is looking at:

  • daily
  • dated/released
  • signalled to GOA
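
One way to picture the three points above, as a minimal sketch only: a dated release prefix plus a small JSON "ready" marker that a consumer such as GOA could poll for. The path layout, the marker contents, and the field names are all hypothetical illustrations, not something this pipeline currently defines.

```python
import json
from datetime import date


def release_prefix(day: date) -> str:
    """Return a dated path prefix such as '2024-07-23/' (hypothetical layout)."""
    return day.isoformat() + "/"


def ready_marker(day: date, products: list[str]) -> str:
    """Build a JSON marker listing the products in one dated release.

    The field names are illustrative assumptions, not an agreed format.
    """
    payload = {
        "release": day.isoformat(),
        "prefix": release_prefix(day),
        "products": sorted(products),
    }
    return json.dumps(payload, sort_keys=True)


# Example: a marker for a single daily release.
marker = ready_marker(date(2024, 7, 23), ["ontology", "metadata"])
print(marker)
```

Writing the marker last, after all products are in place, would let a downstream consumer treat its presence as the signal that the dated release is complete.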

kltm moved this to In Progress in GOC / GOA Joint Pipeline Jul 23, 2024

kltm commented Jul 23, 2024

Tagging @pgaudet

kltm changed the title from 'Create a "non-data" full release pipeline (ontology and metadata only)' to 'Create a "non-data" full release pipeline (ontology metadata curation only)' Jul 23, 2024

kltm commented Jul 23, 2024

Considering doing a full "raw data" release, including Zenodo and a CF endpoint. This may actually be easiest, as it mirrors what we already do: basically the first stage of snapshot, plus the second stage's "publish" step. I'll want to check the size and whether Zenodo can digest it; we want this fully automated and on smooth rails. Maybe skip Zenodo, as we will still have the full release there.

kltm commented Aug 8, 2024

Basing this around "raw-data".

kltm added a commit to geneontology/go-site that referenced this issue Aug 8, 2024
kltm added a commit to geneontology/go-site that referenced this issue Aug 11, 2024
…files are now the same (empty) and it seems to be tripping up for some reason; for geneontology/pipeline#382
kltm added a commit to geneontology/go-site that referenced this issue Aug 11, 2024
kltm added a commit to geneontology/go-site that referenced this issue Aug 11, 2024

kltm commented Aug 14, 2024

I'm exploring a partial run. Looking at what I have, I expect all raw upstreams and first-order products (excluding blazegraph and solr) to run about 10G. This puts us well under typical limits for Zenodo and well under our usual publications (which clock in at nearly 50G). If working weekly, this would allow us to use a monthly buffer (with or instead of an S3 lifecycle) or Zenodo as transport without incurring too much overhead or cost.
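
The arithmetic behind that estimate, as a rough sketch. The 10G and 50G figures come from the comment above; the number of weekly releases held in a monthly buffer (five, to cover a long month) is an assumption made here for illustration.

```python
RAW_RELEASE_GB = 10   # estimated size of one raw-data release (from the comment above)
FULL_RELEASE_GB = 50  # approximate size of a usual full publication
WEEKS_BUFFERED = 5    # assumption: a monthly buffer holds at most five weekly releases


def buffer_size_gb(release_gb: int, releases_kept: int) -> int:
    """Total storage held when keeping the last `releases_kept` releases."""
    return release_gb * releases_kept


total = buffer_size_gb(RAW_RELEASE_GB, WEEKS_BUFFERED)
print(f"monthly buffer: {total}G")
print(f"equivalent full releases: {total / FULL_RELEASE_GB:.1f}")
```

Under these assumptions, a month of weekly raw-data releases takes roughly the same storage as a single full release.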

kltm added a commit that referenced this issue Aug 14, 2024
…iate targets and flush for raw-data.geneontology.org; for #382
kltm added a commit that referenced this issue Aug 15, 2024
kltm added a commit that referenced this issue Aug 16, 2024

kltm commented Aug 22, 2024

From talking to @pgaudet, I think I'll move raw-data.geneontology.org a little closer to where we want it to be by removing "annotations/" and "blazegraph/".

kltm added a commit that referenced this issue Aug 22, 2024

kltm commented Aug 22, 2024

TBD after talking to Alex: the best way to package and communicate our data for remote processing and re-ingest.

Projects
Status: In Progress