Argo GDAC parquet on s3 #67

Open
4 of 5 tasks
tcarval opened this issue Jul 24, 2024 · 1 comment

tcarval commented Jul 24, 2024

To do list

  • EasyOneArgo : generate a parquet file containing all Argo profiles (core, bgc, deep, ...)
  • EasyOneArgoLight : generate a parquet file containing all Argo profiles (core, bgc, deep, ...) reduced to 40 levels
  • Push EasyOneArgo & EasyOneArgoLight on Argo GDAC S3
  • Document "Argo GDAC parquet on s3"
  • Publish a demo notebook

Argo EasyOneArgo parquet on s3

@tcarval tcarval converted this from a draft issue Jul 24, 2024

tcarval commented Sep 20, 2024

Here is the documentation for the EasyOneArgo and EasyOneArgoLight products. If accepted, it should go in a product section of the OneArgo GitHub.

One of the conclusions of the "Argo and Copernicus Marine link" task of the European project Euro-Argo Rise was that modelers find Argo data difficult to use because of its complexity. This led to the specification of a simplified product called EasyArgo.
Argo vertical profiles are continuously aggregated and distributed via Copernicus Marine as NetCDF “PF” files (Profiling Floats).

User feedback:
• It is difficult to manage real-time, adjusted real-time, and delayed mode variables simultaneously. For example, finding the best salinity requires dealing with seven different variables or attributes (PSAL, PSAL:DATA_MODE, PSAL_QC, PSAL_ADJUSTED, PSAL_ADJUSTED:DATA_MODE, PSAL_ADJUSTED_QC, PSAL_ADJUSTED_DM).
• Biogeochemical (BGC) profiles are scattered between Core+BGC and BGC-only profiles, with heterogeneous vertical sampling, which creates duplicate profiles.

EasyArgo product specifications:
• Only profiles with "good" quality control flags are included (QC = 1, 5, 8: good value, value changed, estimated value).
• Biogeochemical (BGC) profiles are taken from the BGC synthetic profiles; all other profiles come from Core-Argo profiles.
• Adjusted variables are reported as essential ocean variables (EOV) along with their data mode (real-time, adjusted in real time, or delayed mode).

For example, salinity is reported using only two variables (PSAL, PSAL_DM) instead of seven, simplifying the analysis.
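The reduction from seven salinity fields to two can be sketched as follows. This is a minimal illustration, not the code used to generate the product: the function name `best_value`, the per-sample dictionary layout, the key `PSAL_DATA_MODE`, and the rule "use the adjusted value when the data mode is A or D, otherwise the real-time value" are all assumptions made for the example. Only the QC flags (1, 5, 8) and the idea of one value plus its data mode come from the specification above.

```python
# QC flags kept by the product: good value, value changed, estimated value.
GOOD_QC = {"1", "5", "8"}


def best_value(sample, name="PSAL"):
    """Return (value, data_mode) for one sample, or None if its QC is not good.

    `sample` is a hypothetical dict holding the raw Argo fields for one level.
    Data modes follow the Argo convention: "R" real-time, "A" adjusted in
    real time, "D" delayed mode.
    """
    mode = sample[f"{name}_DATA_MODE"]
    if mode in ("A", "D"):
        # Assumed rule: adjusted data exist, so prefer them.
        value, qc = sample[f"{name}_ADJUSTED"], sample[f"{name}_ADJUSTED_QC"]
    else:
        value, qc = sample[name], sample[f"{name}_QC"]
    if str(qc) not in GOOD_QC:
        return None
    return value, mode


# Hypothetical delayed-mode sample: the adjusted salinity is selected.
raw = {
    "PSAL": 35.012,
    "PSAL_QC": "1",
    "PSAL_ADJUSTED": 35.004,
    "PSAL_ADJUSTED_QC": "1",
    "PSAL_DATA_MODE": "D",
}
psal, psal_dm = best_value(raw)
```

With this kind of selection done once when the product is built, a user only has to read the value and its data mode.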

The EasyArgo vertical profiles are available in a cloud-optimized parquet format, ready for analysis.

  • The EasyArgo.parquet file contains all Argo core and biogeochemical observations, totaling 6.5 billion data points in a single 45 GB file.
  • The EasyArgoLight.parquet file is a subset of 35 levels extracted from the main file, totaling 19 million data points in a 500 MB file.
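Because parquet is columnar, a reader can fetch only the columns and rows it needs instead of the whole file. The sketch below shows one way to do this with `pyarrow.parquet.read_table`, using its `columns` and `filters` arguments. The column names `LATITUDE` and `LONGITUDE`, the helper names, and the file path are assumptions for illustration; the actual schema and S3 location are not given here.

```python
def region_filters(lat_min, lat_max, lon_min, lon_max):
    """Build a pyarrow-style filter list (all predicates are ANDed).

    LATITUDE and LONGITUDE are assumed column names, not confirmed by the
    product description.
    """
    return [
        ("LATITUDE", ">=", lat_min),
        ("LATITUDE", "<=", lat_max),
        ("LONGITUDE", ">=", lon_min),
        ("LONGITUDE", "<=", lon_max),
    ]


def load_region(path, columns, lat_min, lat_max, lon_min, lon_max):
    """Read only the requested columns and matching rows from a parquet file.

    `path` is a placeholder for wherever the file is stored.
    """
    import pyarrow.parquet as pq  # imported lazily; requires pyarrow

    return pq.read_table(
        path,
        columns=columns,
        filters=region_filters(lat_min, lat_max, lon_min, lon_max),
    )
```

For example, `load_region("EasyArgoLight.parquet", ["LATITUDE", "LONGITUDE", "PSAL", "PSAL_DM"], 30, 46, -6, 36)` would return a `pyarrow.Table` limited to a Mediterranean-sized box, assuming a local copy of the file and that those columns exist.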

Labels
None yet
Projects
Status: In Progress
2 participants