ESGF_Publishing_Services

(Next Generation) ESGF Publishing Services

Introduction

ESGF is working on a "next generation" publishing framework which will be simpler, more general, and more powerful.

The main advantages of the next framework will be:

The capability to publish resources of any kind, not just files (mainly in NetCDF format) listed in THREDDS catalogs
The capability to publish resources by invoking either "push" or "pull" services
The capability to validate records upon ingestion
The publishing services will be available to clients via HTTP RESTful endpoints

Note that this development concerns the ESGF services on the server side; no work is currently being undertaken to change publication on the client side. The current Hessian-based publishing services will still be supported for the time being, so current publication clients will continue to work, although their maintainers are encouraged to migrate to the new services. It is also expected that new clients will be developed to take advantage of the new functionality.

The new services will be secured in exactly the same way as the old services. Specifically, publication is controlled by local policies on the Data Node, which match classes of resources to groups and roles authorized for publishing. As part of the request, the publishing client needs to transmit to the server an X509 certificate containing the identity of the publication agent. The server uses the agent identity to invoke the local authorization service.

API

The publishing services API consists of matching RESTful endpoints for publishing and unpublishing metadata, in both pull and push mode. Additionally, a service is provided to delete records by identifier.

Behavior common to all services:

All services must be invoked by clients via HTTPS POST requests
The client request must include an X509 certificate for authentication and authorization
Currently, all invocations are synchronous: upon receiving a request, the server starts processing and returns a response to the client only when the operation has completed (successfully or not). Note that this is the same behavior as the old services. Asynchronous invocations are planned for a future release.
The response returned to the client contains:
- The standard HTTP status code indicating the result of the publishing operation. In particular, the following codes may be returned:
  - 200 OK: the publishing operation was successful
  - 400 Bad Request: request parameters are missing or have incorrect values
  - 401 Unauthorized: publishing operation failed because agent lacked the proper permission
  - 500 Internal Server Error: publishing operation failed because an unspecified error arose on the server side
- A body encoded as XML containing a short confirmation message in case of success, or an error message in case of failure
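
As a minimal client-side sketch of the response handling described above, the snippet below maps the listed status codes to a short outcome. The helper name `summarize_response` and the outcome strings are hypothetical, and the response body is treated as opaque text because its exact XML layout is not specified here.

```python
# Hypothetical helper, not part of any ESGF client library.
# Only the four status codes listed above are mapped; anything else is
# reported as unexpected.
OUTCOMES = {
    200: "success",
    400: "bad request",
    401: "unauthorized",
    500: "server error",
}


def summarize_response(status_code, body):
    """Return (succeeded, message) for a publishing response.

    The body is assumed to be the short XML confirmation or error message;
    it is passed through as text rather than parsed.
    """
    outcome = OUTCOMES.get(status_code, "unexpected status")
    return status_code == 200, f"{status_code} ({outcome}): {body.strip()}"
```

Because all invocations are synchronous, a client can act on this result as soon as the call returns.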

"Pull" Operations

In "pull" mode, the client requests the server to harvest metadata from a repository. Records are generated on the server side, validated, and sent to the metadata store for ingestion.

Pull Publishing Service

* URL: " https://<hostname>/esg-search/ws/harvest " 

* HTTP POST data: encoded as form (key, value) pairs 
  * "uri": location identifier of remote metadata repository or catalog 
  * "metadataRepositoryType": type of metadata repository, chosen from controlled vocabulary 
  * "schema": optional URI of additional schema for record validation 
* Examples: 
  * curl --insecure --key ~/.esg/credentials.pem --cert ~/.esg/credentials.pem --verbose \
        --data "uri=http://esg-datanode.jpl.nasa.gov/thredds/esgcet/1/obs4MIPs.NASA-JPL.AIRS.mon.v1.xml&metadataRepositoryType=THREDDS&schema=cmip5" \
        https://test-datanode.jpl.nasa.gov/esg-search/ws/harvest
  * wget --no-check-certificate --certificate ~/.esg/credentials.pem --private-key ~/.esg/credentials.pem --verbose \
        --post-data="uri=http://esg-datanode.jpl.nasa.gov/thredds/esgcet/1/obs4MIPs.NASA-JPL.AIRS.mon.v1.xml&metadataRepositoryType=THREDDS&schema=cmip5" \
        https://test-datanode.jpl.nasa.gov/esg-search/ws/harvest

* Note: authorization is based on the resource "uri"
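
For clients written in Python rather than shell, the same pull request can be assembled with the standard library. This is a sketch under assumptions: the function name `build_harvest_request` is hypothetical, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_harvest_request(hostname, uri, repository_type, schema=None):
    """Build (but do not send) a POST request for the harvest endpoint."""
    params = {"uri": uri, "metadataRepositoryType": repository_type}
    if schema is not None:
        # "schema" is optional and requests additional validation.
        params["schema"] = schema
    body = urlencode(params).encode("ascii")
    return Request(
        f"https://{hostname}/esg-search/ws/harvest", data=body, method="POST"
    )


request = build_harvest_request(
    "test-datanode.jpl.nasa.gov",
    "http://esg-datanode.jpl.nasa.gov/thredds/esgcet/1/obs4MIPs.NASA-JPL.AIRS.mon.v1.xml",
    "THREDDS",
    schema="cmip5",
)

# Sending requires the X509 client certificate, for example:
#   import os, ssl, urllib.request
#   pem = os.path.expanduser("~/.esg/credentials.pem")
#   context = ssl.create_default_context()
#   context.load_cert_chain(pem, pem)
#   urllib.request.urlopen(request, context=context)
```

Unlike the `--insecure` and `--no-check-certificate` flags in the examples above, the default SSL context verifies the server certificate.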

Pull UnPublishing Service

* URL: " https://<hostname>/esg-search/ws/unharvest " 

* HTTP POST data: encoded as form (key, value) pairs 
  * "uri": location identifier of remote metadata repository or catalog 
  * "metadataRepositoryType": type of metadata repository, chosen from controlled vocabulary 
* Examples: 
  * curl --insecure --key ~/.esg/credentials.pem --cert ~/.esg/credentials.pem --verbose \
        --data "uri=http://esg-datanode.jpl.nasa.gov/thredds/esgcet/1/obs4MIPs.NASA-JPL.AIRS.mon.v1.xml&metadataRepositoryType=THREDDS" \
        https://test-datanode.jpl.nasa.gov/esg-search/ws/unharvest
  * wget --no-check-certificate --certificate ~/.esg/credentials.pem --private-key ~/.esg/credentials.pem --verbose \
        --post-data="uri=http://esg-datanode.jpl.nasa.gov/thredds/esgcet/1/obs4MIPs.NASA-JPL.AIRS.mon.v1.xml&metadataRepositoryType=THREDDS" \
        https://test-datanode.jpl.nasa.gov/esg-search/ws/unharvest

* Note: authorization is based on the resource "uri"

"Push" Operations

In "push" mode, the client sends already generated metadata records to the server. The server validates the records and sends them to the metadata store for ingestion.

Push Publishing Service

* URL: " https://<hostname>/esg-search/ws/publish " 

* HTTP POST data: metadata record encoded as Solr/XML (with optional "<doc schema=...>" attribute for additional project-specific validation). 

* Examples: 
  * curl --insecure --key ~/.esg/credentials.pem --cert ~/.esg/credentials.pem --verbose \
        -X POST -d @cmip5_dataset.xml --header "Content-Type:application/xml" \
        https://test-datanode.jpl.nasa.gov/esg-search/ws/publish
  * wget --no-check-certificate --certificate ~/.esg/credentials.pem --private-key ~/.esg/credentials.pem --verbose \
        --post-file=cmip5_dataset.xml \
        https://test-datanode.jpl.nasa.gov/esg-search/ws/publish

* Note: authorization is based on the record identifier 
* Note: example XML records are contained in the "esgf-search" module Git repository, and attached to this web page.
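
To illustrate what a push payload might look like, the sketch below builds a record with Python's standard XML library. It assumes the record is a `<doc>` element with `<field name="...">` children, as in standard Solr/XML; the exact layout should be checked against the example records in the "esgf-search" repository. The function name `build_record` and the title value are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_record(fields, schema=None):
    """Serialize a metadata record as a Solr/XML-style <doc> element.

    `fields` maps a field name to one value or a list of values.
    `schema` (optional) sets the "schema" attribute that requests
    additional project-specific validation.
    """
    doc = ET.Element("doc")
    if schema is not None:
        doc.set("schema", schema)
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            # One <field> element per value, so repeated values are allowed.
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value
    return ET.tostring(doc, encoding="unicode")


record_xml = build_record(
    {
        "id": "obs4MIPs.NASA-JPL.AIRS.mon.v1|esg-datanode.jpl.nasa.gov",
        "type": "Dataset",
        "title": "Illustrative title",  # hypothetical value
    },
    schema="cmip5",
)
```

The resulting string can be written to a file and posted with the curl or wget commands above.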

Push UnPublishing Service

* URL: " https://<hostname>/esg-search/ws/unpublish " 

* HTTP POST data: metadata record encoded as XML 
* Examples: 
  * curl --insecure --key ~/.esg/credentials.pem --cert ~/.esg/credentials.pem --verbose \
        -X POST -d @dataset.xml --header "Content-Type:application/xml" \
        https://test-datanode.jpl.nasa.gov/esg-search/ws/unpublish
  * wget --no-check-certificate --certificate ~/.esg/credentials.pem --private-key ~/.esg/credentials.pem --verbose \
        --post-file=dataset.xml \
        https://test-datanode.jpl.nasa.gov/esg-search/ws/unpublish

* Note: authorization is based on the record identifier

Delete Operations

A generic "delete" service is provided to remove records by identifier from the metadata store.

Delete UnPublishing Service

* URL: " https://<hostname>/esg-search/ws/delete " 

* HTTP POST data: encoded as form (key, value) pairs 
  * "id": identifier of record to be deleted (key and value pairs may be repeated any number of times to delete more than one record at a time) 
* Examples: 
  * curl --insecure --key ~/.esg/credentials.pem --cert ~/.esg/credentials.pem --verbose \
        --data "id=obs4MIPs.NASA-JPL.AIRS.mon.v1|esg-datanode.jpl.nasa.gov" \
        https://test-datanode.jpl.nasa.gov/esg-search/ws/delete
  * wget --no-check-certificate --certificate ~/.esg/credentials.pem --private-key ~/.esg/credentials.pem --verbose \
        --post-data="id=obs4MIPs.NASA-JPL.AIRS.mon.v1|esg-datanode.jpl.nasa.gov" -O response.xml \
        https://test-datanode.jpl.nasa.gov/esg-search/ws/delete

* Note: authorization is based on the record identifier
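
Since the "id" pair may be repeated, a client deleting several records needs a form body with one "id" entry per record. A minimal sketch, assuming a hypothetical helper name `build_delete_body` and a hypothetical second identifier:

```python
from urllib.parse import urlencode


def build_delete_body(record_ids):
    """Encode one "id" (key, value) pair per record to delete."""
    # A list of tuples (rather than a dict) lets the same key repeat.
    return urlencode([("id", record_id) for record_id in record_ids])


body = build_delete_body(
    [
        "obs4MIPs.NASA-JPL.AIRS.mon.v1|esg-datanode.jpl.nasa.gov",
        "obs4MIPs.NASA-JPL.AIRS.mon.v2|esg-datanode.jpl.nasa.gov",  # hypothetical
    ]
)
```

Note that `urlencode` percent-encodes the "|" separator as `%7C`, which a form-decoding server turns back into "|".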

Validation

The (new) ESGF Publishing Services enforce record validation: before being ingested into the metadata index, all incoming records are validated for basic compliance requirements, and optionally for project-specific compliance.

Record validation is based on meta-schemas: XML document instances that encode the rules to be applied by the validation engine. Currently, meta-schemas are distributed as part of the "esgf-search" module in the ESGF Git repository; in the future they might be made available at, and read from, a URL. Meta-schemas are modularized to encode distinct sets of requirements. Specifically, the following schemas are available at this time:

esgf.xml : contains core requirements that ALL XML records must comply with (each record must have an "id", a "type", a "title", etc.). This schema is ALWAYS enforced.
geo.xml : contains requirements for Earth Sciences data (for geospatial and temporal location). This schema is always enforced, but all of its elements are optional so it only applies to Earth Sciences datasets.
cmip5.xml : contains requirements specific to CMIP5 model data (and similar datasets such as obs4MIPs and ana4MIPs). This schema is enforced ONLY if the publisher agent explicitly requests it (in "pull" operations), or flags the records as such (in "push" operations).

ESGF meta-schemas are XML documents (conforming to a single XSD schema) that can encode complex validation semantics, specifically:

The required and optional metadata elements and their cardinality.
The metadata element type, including advanced or custom types such as "uuid". The default type if not specified is "string".
The record types to which they apply: "Dataset", "File", "Aggregation". By default, they apply to all record types.
An optional minimum value and maximum value for numeric fields.
An optional controlled vocabulary for string fields.
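
The kinds of rules listed above can be illustrated with a small Python sketch. This is not the ESGF meta-schema format or the actual validation engine (which operates on Java objects); the `FieldRule` class, the `validate` function, and the "tracking_id" and "north_degrees" field names are assumptions made for illustration only.

```python
import uuid
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FieldRule:
    name: str
    required: bool = False
    max_occurs: Optional[int] = None              # None means unbounded
    type: str = "string"                          # default type
    record_types: Optional[Sequence[str]] = None  # None means all record types
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    vocabulary: Optional[Sequence[str]] = None


def check_value(rule, value):
    """Return a list of error messages for a single field value."""
    if rule.type == "uuid":
        try:
            uuid.UUID(value)
        except ValueError:
            return [f"{rule.name}: '{value}' is not a valid uuid"]
    elif rule.type == "float":
        try:
            number = float(value)
        except ValueError:
            return [f"{rule.name}: '{value}' is not numeric"]
        if rule.minimum is not None and number < rule.minimum:
            return [f"{rule.name}: {value} is below {rule.minimum}"]
        if rule.maximum is not None and number > rule.maximum:
            return [f"{rule.name}: {value} is above {rule.maximum}"]
    if rule.vocabulary is not None and value not in rule.vocabulary:
        return [f"{rule.name}: '{value}' is not in the controlled vocabulary"]
    return []


def validate(record, record_type, rules):
    """Check a record (field name -> list of string values) against rules."""
    errors = []
    for rule in rules:
        # Skip rules that do not apply to this record type.
        if rule.record_types is not None and record_type not in rule.record_types:
            continue
        values = record.get(rule.name, [])
        if rule.required and not values:
            errors.append(f"{rule.name}: required field is missing")
        if rule.max_occurs is not None and len(values) > rule.max_occurs:
            errors.append(f"{rule.name}: at most {rule.max_occurs} value(s) allowed")
        for value in values:
            errors.extend(check_value(rule, value))
    return errors


rules = [
    FieldRule("id", required=True, max_occurs=1),
    FieldRule("type", required=True, max_occurs=1,
              vocabulary=("Dataset", "File", "Aggregation")),
    FieldRule("title", required=True),
    FieldRule("tracking_id", type="uuid", record_types=("File",)),
    FieldRule("north_degrees", type="float", minimum=-90, maximum=90),
]
```

Calling `validate(record, "Dataset", rules)` returns an empty list for a compliant record and one message per violation otherwise; rules restricted to "File" records are skipped for datasets.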

Meta-schemas are parsed by the ESGF validation engine, which enforces the corresponding rules on the incoming XML records (after converting them to Java objects). The validation engine may also apply higher-level logic to specific fields: for example, "url" fields are checked to be of the form "url|mimeType|serverName". Note that the ESGF validation engine currently adopts a "lenient" approach: if a metadata field is found in the incoming XML record but is not constrained by any meta-schema requirement, it is still ingested as a multi-valued field of type "string" (assuming it is not defined otherwise by the Solr engine's own schema, in which case the field will pass ESGF validation but may be rejected by Solr).
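
A client can apply the same "url" field check before publishing. The sketch below is an assumption about how such a check might look, not the engine's actual logic; the function name `parse_url_field` and the sample MIME type and server name are hypothetical.

```python
def parse_url_field(value):
    """Split a "url" field of the form url|mimeType|serverName.

    Raises ValueError if the value does not have exactly three
    non-empty, pipe-separated parts.
    """
    parts = value.split("|")
    if len(parts) != 3 or not all(part.strip() for part in parts):
        raise ValueError(f"expected 'url|mimeType|serverName', got: {value!r}")
    url, mime_type, server_name = parts
    return url, mime_type, server_name


# Hypothetical MIME type and server name, for illustration only.
parsed = parse_url_field(
    "http://esg-datanode.jpl.nasa.gov/thredds/esgcet/1/obs4MIPs.NASA-JPL.AIRS.mon.v1.xml"
    "|application/xml|Catalog"
)
```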

As mentioned, "core" validation is enforced for all publishing operations - both "push" and "pull". Additional validation based on some other schema (such as "schema=cmip5") must be requested by the client:

pull publishing : the additional HTTP POST parameter "schema=...." must be specified.
push publishing : the Solr/XML document that is the payload of the request must be flagged as conforming to the desired schema via the "doc@schema" XML attribute (which is ignored by the Solr engine upon ingestion).

Note that no validation is applied to un-publishing or delete operations (as the only field that matters is the record "id").

Appendix: XSD versus ESGF meta schemas

Historically, the following considerations led to the decision of using custom meta-schemas instead of standard XSD documents for validating ESGF records:

Advantages of XSD documents:
- Validation can be performed by standard libraries available in all languages.
- XML records conforming to XSD schemas can be validated by non-ESGF software and engines.
Disadvantages of XSD schemas:
- XSDs are written in a very complex and verbose syntax that requires specific training to understand and modify.
- At this time, XSDs require optional metadata fields to follow a specific order in the document (this will change in the near future, but library support will take some time).
Advantages of ESGF meta-schemas:
- They are much simpler than XSD schemas...
- ...yet they can encode richer semantics such as record type dependency, custom metadata types, etc.
- They can be enforced on XML documents conforming to Solr/XML, as opposed to documents conforming to some other XSD schema. Solr/XML documents are very simple, uniform across all projects, and can be directly ingested into Solr without the need for an additional conversion.
- The validating engine can apply any additional custom logic (for example, for "url" fields).
