# Metadata Infrastructure
*Portals, Harvesters, Indexers, and Serialisation*
Metadata becomes truly powerful only when it can be shared, discovered, aggregated, and reused across systems. For this to work, research infrastructures rely on a set of services that collect, expose, and standardise metadata at scale. These services form the metadata infrastructure that underpins FAIR data ecosystems.
This chapter introduces the main components of that infrastructure: metadata portals, metadata harvesters, indexing services, and the serialisation formats that allow metadata to move between systems.
## Why Local Metadata Alone Is Not Enough
Creating a metadata record inside a repository is essential, but metadata that stays inside a single system remains effectively invisible to the wider research ecosystem. A dataset may be perfectly documented, but if its metadata is not exposed in standardised, machine‑readable ways, it cannot be:
- discovered by search engines
- harvested by aggregators
- indexed by disciplinary or cross‑disciplinary portals
- linked to related datasets, publications, or software
- integrated into knowledge graphs
- found by researchers outside the hosting institution
In other words, metadata is only as useful as the infrastructure that distributes it.
Several factors limit the findability of “local‑only” metadata:
- Siloing — each repository stores metadata in its own database, often with its own schema
- Lack of standardisation — metadata may not follow common schemas or vocabularies
- No machine‑readable exposure — metadata may be visible on a webpage but not accessible via APIs or harvesting protocols
- No cross‑repository search — users must know the repository exists before they can find the data
- No integration with PID services — identifiers may not be resolvable or linked to global discovery systems
This is why FAIR emphasises that metadata must be findable and accessible, not merely present.
To achieve this, repositories must expose metadata through standard protocols (e.g., OAI‑PMH, ResourceSync, REST APIs) so that external services can collect and index it.
The figure below shows the architecture that makes metadata actionable for researchers and machines. The nodes in light green mark the components that interact directly with users and machines, and which you may already have used.
```mermaid
%%{init: {
  "look": "handDrawn",
  "flowchart": { "htmlLabels": false }
}}%%
flowchart TD
    %% Consumer at the top
    subgraph Consumer["Consumer"]
        USER["Researchers and Machines"]
    end

    %% Repository
    subgraph Repository["Repository"]
        DATA["Data Files"]
        META["Local Metadata<br>(JSON, XML, RDF)"]
        API["Expose Metadata<br>via OAI-PMH / REST API"]
        META -->|"offered through"| API
    end

    %% Metadata Service
    subgraph MetadataService["Metadata Service"]
        HARV["Harvester<br>Collect and Normalise"]
        INDEX["Indexer<br>Search Index / Knowledge Graph"]
        PORTAL["Metadata Portal<br>Discovery Interface"]
        HARV -->|"provide harvested metadata"| INDEX
        INDEX -->|"serve indexed metadata"| PORTAL
    end

    %% These edges ensure Consumer stays ABOVE the two subgraphs
    USER -->|"search / explore"| PORTAL
    USER -->|"access and download"| DATA

    %% Metadata flow (Repository -> Metadata Service)
    API -->|"offer metadata"| HARV

    %% Highlight nodes (light green fill, dark green text)
    style USER fill:#d8f3dc,stroke:#1b4332,color:#1b4332
    style DATA fill:#d8f3dc,stroke:#1b4332,color:#1b4332
    style PORTAL fill:#d8f3dc,stroke:#1b4332,color:#1b4332

    %% Link styles
    linkStyle 0 labelPosition:0.5, align:top;
    linkStyle 1 labelPosition:0.5, align:top;
    linkStyle 2 labelPosition:0.5, align:top;
    linkStyle 3 labelPosition:0.5, align:top;
    linkStyle 4 labelPosition:0.5, align:top;
```
We will now go through the individual components of this metadata infrastructure.
## Metadata Portals
Metadata portals are platforms that provide human‑ and machine‑readable access to metadata records. They act as central entry points for discovering datasets, publications, software, and other research outputs.
Portals typically:
- expose metadata through search interfaces
- provide landing pages for datasets
- offer APIs for programmatic access
- integrate controlled vocabularies and ontologies
- link metadata to identifiers such as DOIs, ORCID iDs, and ROR IDs (see Chapter Persistent Identifiers)
### Examples
- DataCite Commons — aggregates metadata for DOIs assigned to datasets, software, and publications
- OpenAIRE Explore — integrates metadata from repositories across Europe
- FAIRsharing — registry of metadata standards, databases, and policies
- Discipline‑specific portals (e.g., PANGAEA, GBIF, ICPSR)
Portals are the public face of metadata: they make research outputs discoverable and provide the context needed for reuse.
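To illustrate what programmatic access looks like, the sketch below queries the public DataCite REST API for datasets matching a keyword. The query parameters and the response handling are a minimal example, not a complete client:

```python
import requests

# Ask the DataCite REST API for DOIs whose metadata matches a keyword.
# The query string and page size below are illustrative choices.
response = requests.get(
    "https://api.datacite.org/dois",
    params={"query": "soil moisture", "page[size]": 5},
    timeout=30,
)
response.raise_for_status()

for record in response.json()["data"]:
    attributes = record["attributes"]
    titles = attributes.get("titles") or [{"title": "(untitled)"}]
    print(attributes["doi"], "-", titles[0]["title"])
```

Each portal offers its own API flavour, but the pattern is the same: send a query, receive structured metadata (here JSON) that machines can process further.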
## Metadata Harvesters
Metadata harvesters solve the visibility problem by collecting metadata from many repositories and bringing it together into a unified index. They ensure that metadata created locally becomes part of a broader, interoperable ecosystem.
Harvesters typically:
- poll repositories at regular intervals
- retrieve metadata records using standard protocols
- normalise or enrich metadata
- detect updates, deletions, and new versions
- feed metadata into search indexes or knowledge graphs
### Common harvesting protocols

- **OAI‑PMH** (Open Archives Initiative Protocol for Metadata Harvesting): widely used in repositories; supports incremental harvesting and multiple metadata formats.
- **ResourceSync**: a modern protocol for synchronising large collections of metadata and content.
- **REST APIs**: many repositories expose metadata through JSON‑based APIs.
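As a minimal sketch of how OAI‑PMH harvesting works in practice, the Python snippet below lists record headers from an endpoint and follows resumption tokens to page through the collection. DataCite's public OAI‑PMH service is used for illustration; error handling and storage are omitted:

```python
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "https://oai.datacite.org/oai"  # illustrative public OAI-PMH endpoint

def harvest(endpoint, metadata_prefix="oai_dc"):
    """Yield (identifier, datestamp) pairs, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(requests.get(endpoint, params=params, timeout=60).content)
        for header in root.iter(f"{OAI}header"):
            yield (header.findtext(f"{OAI}identifier"),
                   header.findtext(f"{OAI}datestamp"))
        token = root.findtext(f"{OAI}ListRecords/{OAI}resumptionToken")
        if not token:
            break
        # Subsequent requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token}

for i, (identifier, datestamp) in enumerate(harvest(ENDPOINT)):
    print(datestamp, identifier)
    if i >= 9:  # stop after ten records for the demo
        break
```

Incremental harvesting would add a `from` date parameter to the first request, so that only records changed since the last run are retrieved.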
### Examples of harvesters
- OpenAIRE Harvester — collects metadata from thousands of repositories
- DataCite OAI‑PMH service — exposes DOI metadata for harvesting
- Europeana Harvester — aggregates cultural heritage metadata
Harvesters are the connective tissue of metadata infrastructure: they ensure that metadata does not remain siloed.
## What Is a Metadata Indexer?
A metadata indexer is a service that takes harvested metadata and organises it into a searchable, structured index.
Indexers:
- parse metadata into fields
- normalise identifiers (e.g., DOI, ORCID, ROR)
- enrich metadata with external information
- build search indexes (e.g., Elasticsearch, Solr)
- support faceted search, filtering, and ranking
- may generate knowledge graphs or entity‑resolution layers
In short:

> Harvesters collect metadata; indexers make it findable.
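As a toy illustration of the indexing step (production systems use Elasticsearch, Solr, or a triple store instead), the sketch below parses harvested records into fields and builds a small inverted index mapping terms to record identifiers:

```python
from collections import defaultdict

# Toy harvested records; a real indexer would receive thousands of these.
records = [
    {"id": "doi:10.1234/a", "title": "Soil Moisture Dataset 2020-2023",
     "subject": "Soil Moisture"},
    {"id": "doi:10.1234/b", "title": "River Discharge Time Series",
     "subject": "Hydrology"},
]

# Build an inverted index: every term points to the records that contain it.
index = defaultdict(set)
for record in records:
    for field in ("title", "subject"):
        for term in record[field].lower().split():
            index[term].add(record["id"])

def search(query):
    """Return records containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in terms)))

print(search("soil moisture"))  # -> ['doi:10.1234/a']
```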
### Examples
- OpenAIRE Graph — a large‑scale scholarly knowledge graph integrating publications, datasets, software, and projects
- DataCite Search Index — supports DOI discovery and metadata enrichment
- Google Dataset Search — indexes structured metadata exposed on the web
- Semantic indexes built on RDF triple stores (e.g., Blazegraph, GraphDB)
Indexers turn metadata into actionable information, enabling discovery, analytics, and automated linking.
## Comparison of Metadata Portals, Harvesters, and Indexers
| Component | Purpose | Operates On | Key Functions | Examples |
|---|---|---|---|---|
| Metadata Portal | Human‑ and machine‑friendly discovery | Indexed metadata | Search, landing pages, APIs, PID linking | DataCite Commons, OpenAIRE Explore |
| Metadata Harvester | Collect metadata from multiple sources | OAI‑PMH, APIs, ResourceSync | Harvest, normalise, deduplicate, enrich | OpenAIRE Harvester, Europeana Harvester |
| Metadata Indexer | Make metadata searchable | Harvested metadata | Parse, extract fields, build index/graph | OpenAIRE Graph, DataCite Index |
## Metadata Serialisation Formats
To move metadata between systems, it must be expressed in serialisation formats that machines can read. These formats encode metadata according to schemas, ontologies, and controlled vocabularies.
Serialisation formats fall into two broad categories:
### 1. Key–Value and Document‑Oriented Formats
These formats are widely used in repositories, APIs, and metadata templates.
- JSON — common for REST APIs; flexible and human‑readable
- XML — used in OAI‑PMH, DataCite XML, Dublin Core XML
- YAML — popular in research workflows and documentation
- CSV / TSV — often used for simple metadata templates or variable‑level metadata
These formats are easy to generate and consume, but relationships between entities are implicit.
### 2. Semantic Web and Linked Data Formats
These formats express metadata as graphs, enabling interoperability and machine reasoning.
- RDF/XML
- Turtle (TTL)
- JSON‑LD
- N‑Triples / N‑Quads
These formats encode metadata as subject–predicate–object triples, using URIs to identify entities. They are used in:
- knowledge graphs
- ontology‑driven metadata
- schema.org markup for dataset discovery
- provenance tracking (e.g., PROV‑O)
### Why serialisation matters
Serialisation determines:
- how metadata is exchanged between systems
- how easily it can be validated
- whether it supports Linked Data
- how well it integrates with harvesters and indexers
In practice, repositories often support multiple serialisations to serve different use cases.
Below we give an example of the same metadata expressed in different formats.
## Example: The Same Metadata in JSON, XML, and Turtle

The following examples describe a fictitious dataset using three common serialisation formats.
The metadata fields are:
- title: Soil Moisture Dataset 2020–2023
- creator: Anna Vermeer
- license: CC BY 4.0
- issued: 2024
- subject: Soil Moisture
### JSON (Key–Value Metadata)

```json
{
  "title": "Soil Moisture Dataset 2020–2023",
  "creator": "Anna Vermeer",
  "license": "CC BY 4.0",
  "issued": "2024",
  "subject": "Soil Moisture"
}
```

Characteristics:
- Simple, human‑readable
- No explicit semantics
- Relationships are implicit
### XML (Structured, Schema‑Driven Metadata)

```xml
<resource>
  <title>Soil Moisture Dataset 2020–2023</title>
  <creator>Anna Vermeer</creator>
  <license>CC BY 4.0</license>
  <issued>2024</issued>
  <subject>Soil Moisture</subject>
</resource>
```

Characteristics:
- Hierarchical structure
- Often validated with schemas (e.g., XSD)
- Still no explicit semantics unless combined with namespaces
### Turtle (RDF / Linked Data)

```turtle
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

ex:dataset1
    dc:title        "Soil Moisture Dataset 2020–2023" ;
    dc:creator      "Anna Vermeer" ;
    dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
    dcterms:issued  "2024"^^xsd:gYear ;
    dc:subject      "Soil Moisture" .
```

Characteristics:
- Fully semantic
- Uses URIs for identifiers
- Machine‑interpretable relationships
- Suitable for knowledge graphs and FAIR Linked Data
| Format | Human‑Readable | Machine‑Readable | Supports Semantics | Best For |
|---|---|---|---|---|
| JSON | ✔ | ✔ | ✖ | APIs, lightweight metadata |
| XML | ✔ | ✔ | ◑ (with namespaces) | OAI‑PMH, schema‑driven metadata |
| Turtle (RDF) | ◑ | ✔✔ | ✔✔ | Linked Data, knowledge graphs, FAIR metadata |
## When to Choose Which Serialisation Format?
Different serialisation formats serve different purposes, and the right choice depends on how the metadata will be used.
JSON is ideal when metadata needs to be exchanged quickly and flexibly between systems. It is the natural choice for REST APIs, lightweight services, and modern repositories because it is easy to generate, easy to parse, and widely supported by programming languages.
XML is preferred when metadata must follow a strict schema and be validated against a formal structure. Many harvesting protocols (such as OAI‑PMH) and long‑standing metadata standards (like Dublin Core XML or DataCite XML) rely on XML because of its predictability and strong validation capabilities.
YAML works well for human‑authored metadata, such as workflow descriptions, lab templates, or configuration files. It is readable and concise, but less suitable for large‑scale automated processing.
CSV/TSV is the simplest option for tabular or variable‑level metadata. It is easy to create and edit, especially for researchers working in spreadsheets, but it cannot express hierarchical or semantic relationships.
RDF serialisations such as Turtle, JSON‑LD, and RDF/XML are the best choice when metadata needs to be interoperable, machine‑interpretable, and semantically rich. They allow metadata to be integrated into knowledge graphs, linked to ontologies, and connected across systems using globally unique identifiers. JSON‑LD is particularly useful for embedding semantic metadata on the web, while Turtle is preferred for human‑readable Linked Data.
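For instance, the soil‑moisture record from the example above could be embedded in a dataset landing page as JSON‑LD using schema.org terms; the property mapping shown here is one plausible choice, not a prescribed one:

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Soil Moisture Dataset 2020–2023",
  "creator": {
    "@type": "Person",
    "name": "Anna Vermeer"
  },
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "datePublished": "2024",
  "keywords": "Soil Moisture"
}
```

This is exactly the kind of structured markup that services such as Google Dataset Search harvest from the web.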
In practice, many repositories support multiple serialisations: a simple JSON or XML representation for APIs and harvesting, and a semantic RDF representation for Linked Data and FAIR interoperability. The choice depends on whether the priority is ease of use, strict validation, or semantic richness.
| Format | Type | Strengths | Limitations | Common Uses |
|---|---|---|---|---|
| JSON | Document / key–value | Simple, flexible, API‑friendly | Limited semantics | REST APIs, lightweight metadata |
| XML | Document | Schema validation, structured | Verbose, rigid | OAI‑PMH, schema‑based standards |
| YAML | Human‑friendly key–value | Very readable, good for configs | Indentation‑sensitive, not ideal for automation | Workflow files, templates |
| CSV/TSV | Tabular | Simple, spreadsheet‑friendly | No hierarchy or semantics | Variable‑level metadata, tables |
| RDF/XML | Semantic graph | Expressive, ontology‑driven | Verbose, harder to read | Linked Data, semantic repos |
| Turtle (TTL) | Semantic graph | Compact, readable | Requires RDF knowledge | Knowledge graphs, ontologies |
| JSON‑LD | Semantic graph | Web‑friendly, schema.org | Needs context definitions | Dataset discovery, web embedding |
| N‑Triples | Semantic graph | Simple, line‑based | Not human‑friendly | Large RDF dumps, streaming |
## Summary
A typical metadata flow looks like this:
- Repository stores metadata (JSON, XML, or RDF)
- PID service assigns identifiers and links metadata to PIDs
- Metadata portal exposes metadata to users and APIs
- Harvester collects metadata from many repositories
- Indexer builds a searchable, enriched metadata graph
- Discovery services make metadata findable across domains
Together, these components form the metadata infrastructure that enables FAIR, interoperable, and reusable research data.
## A Note on Linked Data, Triple Stores, and Combining Knowledge Graphs
Linked Data changes how metadata is shared and connected across research infrastructures. Instead of exchanging whole records through harvesting protocols, repositories can publish their metadata directly as RDF using persistent identifiers (DOIs, ORCID iDs, ROR IDs) and shared vocabularies. This allows metadata to be expressed as a graph, where entities are nodes and their relationships are edges.
Repositories can expose RDF in several ways:
- Static RDF files (e.g., Turtle or JSON‑LD generated on demand or published as dumps)
- Content negotiation (e.g., returning RDF when a client requests `text/turtle` or `application/ld+json`)
- API endpoints that output RDF serialisations
- SPARQL or Linked Data endpoints when the repository uses a graph‑based backend
Only the last option requires a triple store; the others simply generate RDF from the repository’s internal metadata model.
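A minimal content‑negotiation sketch: the client sends an HTTP `Accept` header asking for Turtle, and a Linked Data‑aware server returns RDF instead of an HTML landing page. The URL here is a placeholder:

```python
import requests

# Placeholder URL; a repository landing page or DOI that supports
# content negotiation would behave the same way.
url = "https://example.org/dataset1"

response = requests.get(url, headers={"Accept": "text/turtle"}, timeout=30)
print(response.headers.get("Content-Type"))  # ideally "text/turtle"
print(response.text[:500])                   # start of the RDF document
```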
A triple store is a specialised database designed to store and query RDF triples—statements in the form subject–predicate–object. It supports SPARQL queries, reasoning, and graph traversal, and can automatically merge graphs from different sources when they use the same identifiers. If two repositories describe the same dataset, person, or organisation using the same URI, a triple store treats them as the same node. No custom mapping or harmonisation layer is required; the semantics are embedded directly in the data model.
This makes combining knowledge graphs remarkably straightforward. Multiple RDF sources can be loaded into the same triple store, and shared identifiers cause the graphs to interlink automatically. The infrastructure does not need to normalise formats, deduplicate records, or align schemas—the graph model handles this naturally.
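The sketch below imitates this merging behaviour with rdflib, an in‑memory Python RDF library standing in for a real triple store. Two Turtle fragments from different "repositories" reuse the same subject URI, so loading them into one graph links them automatically, and a single SPARQL query spans both (the URIs and predicates are illustrative):

```python
from rdflib import Graph

# Two RDF fragments from different sources describing the same dataset URI.
repo_a = """
@prefix dcterms: <http://purl.org/dc/terms/> .
<http://example.org/dataset1> dcterms:title "Soil Moisture Dataset 2020-2023" .
"""
repo_b = """
@prefix dcterms: <http://purl.org/dc/terms/> .
<http://example.org/dataset1> dcterms:license <https://creativecommons.org/licenses/by/4.0/> .
"""

# Loading both into one graph merges them automatically: the shared subject
# URI makes the statements attach to the same node, no mapping layer needed.
graph = Graph()
graph.parse(data=repo_a, format="turtle")
graph.parse(data=repo_b, format="turtle")

# One SPARQL query now spans statements that originated in both sources.
query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?title ?license WHERE {
    ?ds dcterms:title ?title ;
        dcterms:license ?license .
}
"""
for row in graph.query(query):
    print(row.title, row.license)
```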
Because of this, Linked Data reduces the need for traditional metadata harvesters. A harvester normally collects XML or JSON from many repositories, normalises it, deduplicates it, and prepares it for indexing. In a Linked Data environment, much of this work disappears: repositories publish RDF, and the triple store ingests it directly. The infrastructure no longer needs to “fix” metadata during harvesting.
However, this convenience for infrastructures shifts responsibility upstream. For Linked Data to work well, researchers and data stewards must provide clean identifiers, consistent vocabularies, and well‑structured RDF. Instead of the infrastructure harmonising metadata after the fact, the quality and interoperability of the combined graph depend on the metadata producers themselves.
In short: triple stores make integration easier for infrastructures, but require more discipline and modelling effort from researchers. When done well, the result is a metadata ecosystem that is inherently connected, machine‑interpretable, and ready for reuse across systems.
The figure below shows this Linked Data variant of the architecture; as before, the light green nodes are the components that researchers and machines interact with directly.

```mermaid
%%{init: {
  "look": "handDrawn",
  "flowchart": { "htmlLabels": false }
}}%%
flowchart TD
    %% Consumer at the top
    subgraph Consumer["Consumer"]
        USER["Researchers and Machines"]
    end

    %% Repository
    subgraph Repository["Repository"]
        DATA["Data Files"]
        RDF["RDF Metadata<br>(Turtle, JSON-LD, RDF/XML)"]
        ENDPOINT["Expose RDF<br>via Content Negotiation / API"]
        RDF -->|"published as RDF"| ENDPOINT
    end

    %% Linked Data Service
    subgraph LinkedDataService["Linked Data Service"]
        TRIPLE["Triple Store<br>(GraphDB, Blazegraph, Fuseki)"]
        SPARQL["SPARQL Endpoint<br>Query Interface"]
        TRIPLE -->|"query graph"| SPARQL
    end

    %% Metadata flow (Repository -> Triple Store)
    ENDPOINT -->|"load RDF graphs"| TRIPLE

    %% Consumer interactions
    USER -->|"query / explore"| SPARQL
    USER -->|"access and download"| DATA

    %% Highlight nodes (light green fill, dark green text)
    style USER fill:#d8f3dc,stroke:#1b4332,color:#1b4332
    style DATA fill:#d8f3dc,stroke:#1b4332,color:#1b4332
    style SPARQL fill:#d8f3dc,stroke:#1b4332,color:#1b4332

    %% Link styles
    linkStyle 0 labelPosition:0.5, align:top;
    linkStyle 1 labelPosition:0.5, align:top;
    linkStyle 2 labelPosition:0.5, align:top;
    linkStyle 3 labelPosition:0.5, align:top;
```