University of Colorado
The ESGF Publishing Services enforce record validation: before being ingested into the metadata index, all incoming records are validated for basic compliance requirements, and optionally for project-specific compliance.
Record validation is based on meta-schemas: XML document instances that encode the rules to be applied by the validation engine. Currently, meta-schemas are distributed as part of the "esgf-search" module in the ESGF GitHub repository; in the future they might be made available at, and read from, a URL. Meta-schemas are modularized to encode distinct sets of requirements. At this time the following schemas are available:
esgf.xml : contains core requirements that ALL XML records must comply with (each record must have an "id", a "type", a "title", etc.). This schema is always enforced.
geo.xml : contains requirements for Earth Sciences data. This schema is always enforced, but all of its elements are optional so it only applies to Earth Sciences datasets.
cmip5.xml : contains requirements specific to CMIP5 model data (and similar datasets such as obs4MIPs and ana4MIPs). This schema is enforced only if the publisher agent explicitly requests it (in "pull" operations), or flags the records as such (in "push" operations).
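The selection rule described above can be sketched as follows. This is only an illustration, not the ESGF implementation (the actual validation engine is part of the Java "esgf-search" module): the function name schemas_to_enforce and the lookup table are hypothetical, and only the three file names come from the list above.

```python
# Hypothetical sketch of which meta-schemas apply to a publishing request.
# Only the file names are taken from the documentation; the rest is assumed.

# Schemas that are enforced for every record.
ALWAYS_ENFORCED = ("esgf.xml", "geo.xml")

# Schemas enforced only when the publisher asks for them (assumed mapping
# from the requested schema name to the meta-schema file).
OPTIONAL_SCHEMAS = {"cmip5": "cmip5.xml"}


def schemas_to_enforce(requested=None):
    """Return the list of meta-schema files to apply, in order."""
    schemas = list(ALWAYS_ENFORCED)
    if requested is not None:
        if requested not in OPTIONAL_SCHEMAS:
            raise ValueError(f"unknown schema: {requested!r}")
        schemas.append(OPTIONAL_SCHEMAS[requested])
    return schemas
```

Calling schemas_to_enforce() yields only the two core schemas, while schemas_to_enforce("cmip5") appends cmip5.xml to them.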
Solr uses its own schema definition document (schema.xml inside each core's conf/ directory) to validate the incoming metadata records. Unfortunately, this document does not contain all of the information needed for ESGF validation, and therefore ESGF metadata must be defined in two different places.
Luckily, the definition of ESGF metadata fields within the Solr schema.xml document may be greatly shortened by defining some naming conventions:
ESGF meta-schemas are XML documents (conforming to a single XSD schema) that can encode complex validation semantics, specifically:
Meta-schemas are parsed by the ESGF validation engine, which enforces the corresponding rules on the incoming XML records (after converting them to Java objects). The validation engine may also apply higher-level logic to specific fields: for example, "url" fields are checked to be of the form "url|mimeType|serverName". Note that the ESGF validation engine currently adopts a "lenient" approach: if a metadata field is found in the incoming XML record but is not constrained by any meta-schema requirement, it is still ingested as a multi-valued field of type "string" (assuming it is not defined otherwise by the Solr engine's own schema, in which case the field will pass ESGF validation but may be rejected by Solr).
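As an illustration of the "url" check and of the lenient handling of unconstrained fields, here is a minimal sketch in Python. It is not the engine's code (which is written in Java): the function names, the dictionary representation of a record, and the sample values are all assumptions made for the example.

```python
# Hypothetical sketch: field-level check plus lenient record validation.

def check_url_field(value):
    """Split a "url|mimeType|serverName" value into its three parts.

    Raises ValueError if the value does not have exactly three
    non-empty, pipe-separated parts.
    """
    parts = [part.strip() for part in value.split("|")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'url|mimeType|serverName', got {value!r}")
    return tuple(parts)


def validate_record(record, constrained_fields):
    """Validate a record given as {field name: [values]}.

    constrained_fields maps a field name to a checker that raises
    ValueError on bad input. Fields with no checker are accepted as-is,
    mirroring the lenient approach described above.
    Returns a list of (field name, error message) pairs.
    """
    errors = []
    for name, values in record.items():
        checker = constrained_fields.get(name)
        if checker is None:
            continue  # unconstrained field: kept as multi-valued strings
        for value in values:
            try:
                checker(value)
            except ValueError as exc:
                errors.append((name, str(exc)))
    return errors
```

For instance, a record with a malformed "url" value and an unknown field such as "my_custom_field" would produce a single error for "url", while the unknown field passes through untouched.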
As mentioned, "core" validation is enforced for all publishing operations - both "push" and "pull". Additional validation based on some other schema (such as "schema=cmip5") must be requested by the client:
Historically, the following considerations led to the decision to use custom meta-schemas instead of standard XSD documents for validating ESGF records: