# Metadata Infrastructure
*Portals, Harvesters, Indexers, and Serialisation*
Metadata becomes truly powerful only when it can be shared, discovered, aggregated, and reused across systems. For this to work, research infrastructures rely on a set of services that collect, expose, and standardise metadata at scale. These services form the metadata infrastructure that underpins FAIR data ecosystems.
This chapter introduces the main components of that infrastructure: metadata portals, metadata harvesters, indexing services, and the serialisation formats that allow metadata to move between systems.
## Why Local Metadata Alone Is Not Enough
Creating a metadata record inside a repository is essential, but metadata that stays inside a single system remains effectively invisible to the wider research ecosystem. A dataset may be perfectly documented, but if its metadata is not exposed in standardised, machine‑readable ways, it cannot be:
- discovered by search engines
- harvested by aggregators
- indexed by disciplinary or cross‑disciplinary portals
- linked to related datasets, publications, or software
- integrated into knowledge graphs
- found by researchers outside the hosting institution
In other words, metadata is only as useful as the infrastructure that distributes it.
Several factors limit the findability of “local‑only” metadata:
- Siloing — each repository stores metadata in its own database, often with its own schema
- Lack of standardisation — metadata may not follow common schemas or vocabularies
- No machine‑readable exposure — metadata may be visible on a webpage but not accessible via APIs or harvesting protocols
- No cross‑repository search — users must know the repository exists before they can find the data
- No integration with PID services — identifiers may not be resolvable or linked to global discovery systems
This is why FAIR emphasises that metadata must be findable and accessible, not merely present.
To achieve this, repositories must expose metadata through standard protocols (e.g., OAI‑PMH, ResourceSync, REST APIs) so that external services can collect and index it.
The figure below shows the architecture that makes metadata actionable for researchers and machines. The nodes in light green mark the components that interact directly with users and machines, and which you may already have used.
```mermaid
%%{init: {
  "look": "handDrawn",
  "flowchart": { "htmlLabels": false }
}}%%
flowchart TD
    %% Consumer at the top
    subgraph Consumer["Consumer"]
        USER["Researchers and Machines"]
    end

    %% Repository
    subgraph Repository["Repository"]
        DATA["Data Files"]
        META["Local Metadata<br>(JSON, XML, RDF)"]
        API["Expose Metadata<br>via OAI-PMH / REST API"]
        META -->|"offered through"| API
    end

    %% Metadata Service
    subgraph MetadataService["Metadata Service"]
        HARV["Harvester<br>Collect and Normalise"]
        INDEX["Indexer<br>Search Index / Knowledge Graph"]
        PORTAL["Metadata Portal<br>Discovery Interface"]
        HARV -->|"provide harvested metadata"| INDEX
        INDEX -->|"serve indexed metadata"| PORTAL
    end

    %% These edges ensure Consumer stays ABOVE the two subgraphs
    USER -->|"search / explore"| PORTAL
    USER -->|"access and download"| DATA

    %% Metadata flow (Repository -> Metadata Service)
    API -->|"offer metadata"| HARV

    %% Highlight nodes (light green fill, dark green text)
    style USER fill:#d8f3dc,stroke:#1b4332,color:#1b4332
    style DATA fill:#d8f3dc,stroke:#1b4332,color:#1b4332
    style PORTAL fill:#d8f3dc,stroke:#1b4332,color:#1b4332

    %% Link styles
    linkStyle 0 labelPosition:0.5, align:top;
    linkStyle 1 labelPosition:0.5, align:top;
    linkStyle 2 labelPosition:0.5, align:top;
    linkStyle 3 labelPosition:0.5, align:top;
    linkStyle 4 labelPosition:0.5, align:top;
```
We will now go through the individual components of this metadata infrastructure.
## Metadata Portals
Metadata portals are platforms that provide human‑ and machine‑readable access to metadata records. They act as central entry points for discovering datasets, publications, software, and other research outputs.
Portals typically:
- expose metadata through search interfaces
- provide landing pages for datasets
- offer APIs for programmatic access
- integrate controlled vocabularies and ontologies
- link metadata to identifiers such as DOIs, ORCID iDs, and ROR IDs (see Chapter Persistent Identifiers)
### Examples
- DataCite Commons — aggregates metadata for DOIs assigned to datasets, software, and publications
- OpenAIRE Explore — integrates metadata from repositories across Europe
- FAIRsharing — registry of metadata standards, databases, and policies
- Discipline‑specific portals (e.g., PANGAEA, GBIF, ICPSR)
Portals are the public face of metadata: they make research outputs discoverable and provide the context needed for reuse.
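To illustrate what programmatic access looks like, the sketch below queries the public DataCite REST API for datasets matching a keyword. The query parameters and the response handling are a minimal example, not a complete client:

```python
import requests

# Ask the DataCite REST API for DOIs whose metadata matches a keyword.
# The query string and page size below are illustrative choices.
response = requests.get(
    "https://api.datacite.org/dois",
    params={"query": "soil moisture", "page[size]": 5},
    timeout=30,
)
response.raise_for_status()

for record in response.json()["data"]:
    attributes = record["attributes"]
    titles = attributes.get("titles") or [{"title": "(untitled)"}]
    print(attributes["doi"], "-", titles[0]["title"])
```

Each portal offers its own API flavour, but the pattern is the same: send a query, receive structured metadata (here JSON) that machines can process further.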
## Metadata Harvesters
Metadata harvesters solve the visibility problem by collecting metadata from many repositories and bringing it together into a unified index. They ensure that metadata created locally becomes part of a broader, interoperable ecosystem.
Harvesters typically:
- poll repositories at regular intervals
- retrieve metadata records using standard protocols
- normalise or enrich metadata
- detect updates, deletions, and new versions
- feed metadata into search indexes or knowledge graphs
### Common harvesting protocols

- **OAI‑PMH** (Open Archives Initiative Protocol for Metadata Harvesting): widely used in repositories; supports incremental harvesting and multiple metadata formats.
- **ResourceSync**: a modern protocol for synchronising large collections of metadata and content.
- **REST APIs**: many repositories expose metadata through JSON‑based APIs.
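As a minimal sketch of how OAI‑PMH harvesting works in practice, the Python snippet below lists record headers from an endpoint and follows resumption tokens to page through the collection. DataCite's public OAI‑PMH service is used for illustration; error handling and storage are omitted:

```python
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "https://oai.datacite.org/oai"  # illustrative public OAI-PMH endpoint

def harvest(endpoint, metadata_prefix="oai_dc"):
    """Yield (identifier, datestamp) pairs, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(requests.get(endpoint, params=params, timeout=60).content)
        for header in root.iter(f"{OAI}header"):
            yield (header.findtext(f"{OAI}identifier"),
                   header.findtext(f"{OAI}datestamp"))
        token = root.findtext(f"{OAI}ListRecords/{OAI}resumptionToken")
        if not token:
            break
        # Subsequent requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token}

for i, (identifier, datestamp) in enumerate(harvest(ENDPOINT)):
    print(datestamp, identifier)
    if i >= 9:  # stop after ten records for the demo
        break
```

Incremental harvesting would add a `from` date parameter to the first request, so that only records changed since the last run are retrieved.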
### Examples of harvesters
- OpenAIRE Harvester — collects metadata from thousands of repositories
- DataCite OAI‑PMH service — exposes DOI metadata for harvesting
- Europeana Harvester — aggregates cultural heritage metadata
Harvesters are the connective tissue of metadata infrastructure: they ensure that metadata does not remain siloed.
## What Is a Metadata Indexer?
A metadata indexer is a service that takes harvested metadata and organises it into a searchable, structured index.
Indexers:
- parse metadata into fields
- normalise identifiers (e.g., DOI, ORCID, ROR)
- enrich metadata with external information
- build search indexes (e.g., Elasticsearch, Solr)
- support faceted search, filtering, and ranking
- may generate knowledge graphs or entity‑resolution layers
In short:

> Harvesters collect metadata; indexers make it findable.
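As a toy illustration of the indexing step (production systems use Elasticsearch, Solr, or a triple store instead), the sketch below parses harvested records into fields and builds a small inverted index mapping terms to record identifiers:

```python
from collections import defaultdict

# Toy harvested records; a real indexer would receive thousands of these.
records = [
    {"id": "doi:10.1234/a", "title": "Soil Moisture Dataset 2020-2023",
     "subject": "Soil Moisture"},
    {"id": "doi:10.1234/b", "title": "River Discharge Time Series",
     "subject": "Hydrology"},
]

# Build an inverted index: every term points to the records that contain it.
index = defaultdict(set)
for record in records:
    for field in ("title", "subject"):
        for term in record[field].lower().split():
            index[term].add(record["id"])

def search(query):
    """Return records containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in terms)))

print(search("soil moisture"))  # -> ['doi:10.1234/a']
```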
### Examples
- OpenAIRE Graph — a large‑scale scholarly knowledge graph integrating publications, datasets, software, and projects
- DataCite Search Index — supports DOI discovery and metadata enrichment
- Google Dataset Search — indexes structured metadata exposed on the web
- Semantic indexes built on RDF triple stores (e.g., Blazegraph, GraphDB)
Indexers turn metadata into actionable information, enabling discovery, analytics, and automated linking.
## Comparison of Metadata Portals, Harvesters, and Indexers
| Component | Purpose | Operates On | Key Functions | Examples |
|---|---|---|---|---|
| Metadata Portal | Human‑ and machine‑friendly discovery | Indexed metadata | Search, landing pages, APIs, PID linking | DataCite Commons, OpenAIRE Explore |
| Metadata Harvester | Collect metadata from multiple sources | OAI‑PMH, APIs, ResourceSync | Harvest, normalise, deduplicate, enrich | OpenAIRE Harvester, Europeana Harvester |
| Metadata Indexer | Make metadata searchable | Harvested metadata | Parse, extract fields, build index/graph | OpenAIRE Graph, DataCite Index |
## Metadata Serialisation Formats
To move metadata between systems, it must be expressed in serialisation formats that machines can read. These formats encode metadata according to schemas, ontologies, and controlled vocabularies.
Serialisation formats fall into two broad categories:
### 1. Key–Value and Document‑Oriented Formats
These formats are widely used in repositories, APIs, and metadata templates.
- JSON — common for REST APIs; flexible and human‑readable
- XML — used in OAI‑PMH, DataCite XML, Dublin Core XML
- YAML — popular in research workflows and documentation
- CSV / TSV — often used for simple metadata templates or variable‑level metadata
These formats are easy to generate and consume, but relationships between entities are implicit.
### 2. Semantic Web and Linked Data Formats
These formats express metadata as graphs, enabling interoperability and machine reasoning.
- RDF/XML
- Turtle (TTL)
- JSON‑LD
- N‑Triples / N‑Quads
These formats encode metadata as subject–predicate–object triples, using URIs to identify entities. They are used in:
- knowledge graphs
- ontology‑driven metadata
- schema.org markup for dataset discovery
- provenance tracking (e.g., PROV‑O)
### Why serialisation matters
Serialisation determines:
- how metadata is exchanged between systems
- how easily it can be validated
- whether it supports Linked Data
- how well it integrates with harvesters and indexers
In practice, repositories often support multiple serialisations to serve different use cases.
Below we give an example of the same metadata expressed in different formats.
## Example: The Same Metadata in JSON, XML, and Turtle

The following examples describe a fictitious dataset using three common serialisation formats.
The metadata fields are:
- title: Soil Moisture Dataset 2020–2023
- creator: Anna Vermeer
- license: CC BY 4.0
- issued: 2024
- subject: Soil Moisture
### JSON (Key–Value Metadata)

```json
{
  "title": "Soil Moisture Dataset 2020–2023",
  "creator": "Anna Vermeer",
  "license": "CC BY 4.0",
  "issued": "2024",
  "subject": "Soil Moisture"
}
```

Characteristics:
- Simple, human‑readable
- No explicit semantics
- Relationships are implicit
### XML (Structured, Schema‑Driven Metadata)

```xml
<resource>
  <title>Soil Moisture Dataset 2020–2023</title>
  <creator>Anna Vermeer</creator>
  <license>CC BY 4.0</license>
  <issued>2024</issued>
  <subject>Soil Moisture</subject>
</resource>
```

Characteristics:
- Hierarchical structure
- Often validated with schemas (e.g., XSD)
- Still no explicit semantics unless combined with namespaces
### Turtle (RDF / Linked Data)

```turtle
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

ex:dataset1
    dc:title        "Soil Moisture Dataset 2020–2023" ;
    dc:creator      "Anna Vermeer" ;
    dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
    dcterms:issued  "2024"^^xsd:gYear ;
    dc:subject      "Soil Moisture" .
```

Characteristics:
- Fully semantic
- Uses URIs for identifiers
- Machine‑interpretable relationships
- Suitable for knowledge graphs and FAIR Linked Data
| Format | Human‑Readable | Machine‑Readable | Supports Semantics | Best For |
|---|---|---|---|---|
| JSON | ✔ | ✔ | ✖ | APIs, lightweight metadata |
| XML | ✔ | ✔ | ◑ (with namespaces) | OAI‑PMH, schema‑driven metadata |
| Turtle (RDF) | ◑ | ✔✔ | ✔✔ | Linked Data, knowledge graphs, FAIR metadata |
## When to Choose Which Serialisation Format?
Different serialisation formats serve different purposes, and the right choice depends on how the metadata will be used.
JSON is ideal when metadata needs to be exchanged quickly and flexibly between systems. It is the natural choice for REST APIs, lightweight services, and modern repositories because it is easy to generate, easy to parse, and widely supported by programming languages.
XML is preferred when metadata must follow a strict schema and be validated against a formal structure. Many harvesting protocols (such as OAI‑PMH) and long‑standing metadata standards (like Dublin Core XML or DataCite XML) rely on XML because of its predictability and strong validation capabilities.
YAML works well for human‑authored metadata, such as workflow descriptions, lab templates, or configuration files. It is readable and concise, but less suitable for large‑scale automated processing.
CSV/TSV is the simplest option for tabular or variable‑level metadata. It is easy to create and edit, especially for researchers working in spreadsheets, but it cannot express hierarchical or semantic relationships.
RDF serialisations such as Turtle, JSON‑LD, and RDF/XML are the best choice when metadata needs to be interoperable, machine‑interpretable, and semantically rich. They allow metadata to be integrated into knowledge graphs, linked to ontologies, and connected across systems using globally unique identifiers. JSON‑LD is particularly useful for embedding semantic metadata on the web, while Turtle is preferred for human‑readable Linked Data.
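For instance, the soil‑moisture record from the example above could be embedded in a dataset landing page as JSON‑LD using schema.org terms; the property mapping shown here is one plausible choice, not a prescribed one:

```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Soil Moisture Dataset 2020–2023",
  "creator": {
    "@type": "Person",
    "name": "Anna Vermeer"
  },
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "datePublished": "2024",
  "keywords": "Soil Moisture"
}
```

This is exactly the kind of structured markup that services such as Google Dataset Search harvest from the web.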
In practice, many repositories support multiple serialisations: a simple JSON or XML representation for APIs and harvesting, and a semantic RDF representation for Linked Data and FAIR interoperability. The choice depends on whether the priority is ease of use, strict validation, or semantic richness.
| Format | Type | Strengths | Limitations | Common Uses |
|---|---|---|---|---|
| JSON | Document / key–value | Simple, flexible, API‑friendly | Limited semantics | REST APIs, lightweight metadata |
| XML | Document | Schema validation, structured | Verbose, rigid | OAI‑PMH, schema‑based standards |
| YAML | Human‑friendly key–value | Very readable, good for configs | Indentation‑sensitive, not ideal for automation | Workflow files, templates |
| CSV/TSV | Tabular | Simple, spreadsheet‑friendly | No hierarchy or semantics | Variable‑level metadata, tables |
| RDF/XML | Semantic graph | Expressive, ontology‑driven | Verbose, harder to read | Linked Data, semantic repos |
| Turtle (TTL) | Semantic graph | Compact, readable | Requires RDF knowledge | Knowledge graphs, ontologies |
| JSON‑LD | Semantic graph | Web‑friendly, schema.org | Needs context definitions | Dataset discovery, web embedding |
| N‑Triples | Semantic graph | Simple, line‑based | Not human‑friendly | Large RDF dumps, streaming |
## Summary
A typical metadata flow looks like this:
- Repository stores metadata (JSON, XML, or RDF)
- PID service assigns identifiers and links metadata to PIDs
- Metadata portal exposes metadata to users and APIs
- Harvester collects metadata from many repositories
- Indexer builds a searchable, enriched metadata graph
- Discovery services make metadata findable across domains
Together, these components form the metadata infrastructure that enables FAIR, interoperable, and reusable research data.
## A Note on Linked Data, Triple Stores, and Combining Knowledge Graphs
Linked Data changes how metadata is shared and connected across research infrastructures. Instead of exchanging whole records through harvesting protocols, repositories can publish their metadata directly as RDF using persistent identifiers (DOIs, ORCID iDs, ROR IDs) and shared vocabularies. This allows metadata to be expressed as a graph, where entities are nodes and their relationships are edges.
Repositories can expose RDF in several ways:
- Static RDF files (e.g., Turtle or JSON‑LD generated on demand or published as dumps)
- Content negotiation (e.g., returning RDF when a client requests `text/turtle` or `application/ld+json`)
- API endpoints that output RDF serialisations
- SPARQL or Linked Data endpoints when the repository uses a graph‑based backend
Only the last option requires a triple store; the others simply generate RDF from the repository’s internal metadata model.
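A minimal content‑negotiation sketch: the client sends an HTTP `Accept` header asking for Turtle, and a Linked Data‑aware server returns RDF instead of an HTML landing page. The URL here is a placeholder:

```python
import requests

# Placeholder URL; a repository landing page or DOI that supports
# content negotiation would behave the same way.
url = "https://example.org/dataset1"

response = requests.get(url, headers={"Accept": "text/turtle"}, timeout=30)
print(response.headers.get("Content-Type"))  # ideally "text/turtle"
print(response.text[:500])                   # start of the RDF document
```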
A triple store is a specialised database designed to store and query RDF triples—statements in the form subject–predicate–object. It supports SPARQL queries, reasoning, and graph traversal, and can automatically merge graphs from different sources when they use the same identifiers. If two repositories describe the same dataset, person, or organisation using the same URI, a triple store treats them as the same node. No custom mapping or harmonisation layer is required; the semantics are embedded directly in the data model.
This makes combining knowledge graphs remarkably straightforward. Multiple RDF sources can be loaded into the same triple store, and shared identifiers cause the graphs to interlink automatically. The infrastructure does not need to normalise formats, deduplicate records, or align schemas—the graph model handles this naturally.
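The sketch below imitates this merging behaviour with rdflib, an in‑memory Python RDF library standing in for a real triple store. Two Turtle fragments from different "repositories" reuse the same subject URI, so loading them into one graph links them automatically, and a single SPARQL query spans both (the URIs and predicates are illustrative):

```python
from rdflib import Graph

# Two RDF fragments from different sources describing the same dataset URI.
repo_a = """
@prefix dcterms: <http://purl.org/dc/terms/> .
<http://example.org/dataset1> dcterms:title "Soil Moisture Dataset 2020-2023" .
"""
repo_b = """
@prefix dcterms: <http://purl.org/dc/terms/> .
<http://example.org/dataset1> dcterms:license <https://creativecommons.org/licenses/by/4.0/> .
"""

# Loading both into one graph merges them automatically: the shared subject
# URI makes the statements attach to the same node, no mapping layer needed.
graph = Graph()
graph.parse(data=repo_a, format="turtle")
graph.parse(data=repo_b, format="turtle")

# One SPARQL query now spans statements that originated in both sources.
query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?title ?license WHERE {
    ?ds dcterms:title ?title ;
        dcterms:license ?license .
}
"""
for row in graph.query(query):
    print(row.title, row.license)
```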
Because of this, Linked Data reduces the need for traditional metadata harvesters. A harvester normally collects XML or JSON from many repositories, normalises it, deduplicates it, and prepares it for indexing. In a Linked Data environment, much of this work disappears: repositories publish RDF, and the triple store ingests it directly. The infrastructure no longer needs to “fix” metadata during harvesting.
However, this convenience for infrastructures shifts responsibility upstream. For Linked Data to work well, researchers and data stewards must provide clean identifiers, consistent vocabularies, and well‑structured RDF. Instead of the infrastructure harmonising metadata after the fact, the quality and interoperability of the combined graph depend on the metadata producers themselves.
In short: triple stores make integration easier for infrastructures, but require more discipline and modelling effort from researchers. When done well, the result is a metadata ecosystem that is inherently connected, machine‑interpretable, and ready for reuse across systems.
The figure below shows this Linked Data variant of the architecture; as before, the light green nodes are the components that researchers and machines interact with directly.

```mermaid
%%{init: {
  "look": "handDrawn",
  "flowchart": { "htmlLabels": false }
}}%%
flowchart TD
    %% Consumer at the top
    subgraph Consumer["Consumer"]
        USER["Researchers and Machines"]
    end

    %% Repository
    subgraph Repository["Repository"]
        DATA["Data Files"]
        RDF["RDF Metadata<br>(Turtle, JSON-LD, RDF/XML)"]
        ENDPOINT["Expose RDF<br>via Content Negotiation / API"]
        RDF -->|"published as RDF"| ENDPOINT
    end

    %% Linked Data Service
    subgraph LinkedDataService["Linked Data Service"]
        TRIPLE["Triple Store<br>(GraphDB, Blazegraph, Fuseki)"]
        SPARQL["SPARQL Endpoint<br>Query Interface"]
        TRIPLE -->|"query graph"| SPARQL
    end

    %% Metadata flow (Repository -> Triple Store)
    ENDPOINT -->|"load RDF graphs"| TRIPLE

    %% Consumer interactions
    USER -->|"query / explore"| SPARQL
    USER -->|"access and download"| DATA

    %% Highlight nodes (light green fill, dark green text)
    style USER fill:#d8f3dc,stroke:#1b4332,color:#1b4332
    style DATA fill:#d8f3dc,stroke:#1b4332,color:#1b4332
    style SPARQL fill:#d8f3dc,stroke:#1b4332,color:#1b4332

    %% Link styles
    linkStyle 0 labelPosition:0.5, align:top;
    linkStyle 1 labelPosition:0.5, align:top;
    linkStyle 2 labelPosition:0.5, align:top;
    linkStyle 3 labelPosition:0.5, align:top;
```