Metadata

What Is Metadata in Research Data Management?

In the context of Research Data Management (RDM), metadata is structured information that describes, explains, and contextualises a digital object—such as a dataset, document, image, or software file. It acts as a layer of documentation that makes a digital object findable, understandable, and reusable, both now and in the future.

Metadata typically captures:

  • Content information — what the object contains
  • Provenance — how, when, and by whom it was created
  • Technical characteristics — file formats, structures, and software requirements
  • Administrative information — rights, licences, and access conditions
  • Relationships — links to other versions, datasets, or publications

In short, metadata provides the context needed to interpret and reuse data beyond its original purpose.

A fictitious example for a published dataset

| Field | Example Value |
| --- | --- |
| Title | Soil Moisture Dataset 2020–2023 |
| Creator | Anna Vermeer |
| Affiliation | Utrecht University |
| Publication Year | 2024 |
| Description | Measurements of soil moisture across multiple sites in the Netherlands over three years. |
| Keywords / Subject | soil moisture, hydrology, climate |
| License | CC BY 4.0 |
| Provenance | Collected using sensor network; cleaned and validated by research team; processed using R scripts, R version 4.2 |
| Dataset DOI | 10.5281/zenodo.1234567 |
| Domain Detail (DDI) | Variable-level metadata: soil_moisture, temperature, precipitation; measurement units, missing values, coding scheme documented. |
| Domain Detail (ISA) | Assay-level metadata for soil sensors: sensor type, calibration protocol, data logger model, measurement frequency. |

A fictitious example for accompanying software

| Field | Example Value |
| --- | --- |
| Title | Soil Moisture Data Cleaning Script |
| Creator / Author | Anna Vermeer |
| Affiliation | Utrecht University |
| Version | 1.0.2 |
| Programming Language | R |
| Description | Script to clean and standardize soil moisture measurements from multiple sensor networks, including missing value imputation and unit conversions. |
| License | MIT License |
| Repository / DOI | https://github.com/soil-data/cleaning-script or DOI: 10.5281/zenodo.7654321 |
| Dependencies | R packages: tidyverse 1.4.3, lubridate 1.9.0 |
| Provenance | Developed as part of Soil Moisture Project 2020–2023; tested on dataset version 3; last updated March 2024 |
| Domain Detail (CodeMeta) | softwareRequirements: R 4.2; programmingLanguage: R; codeRepository: GitHub; maintainer: Anna Vermeer |
| Domain Detail (ISA) | If linked to experimental workflows: script applied to assay-level data (sensor readings), preserving measurement protocols and calibration details. |

Ontologies, Controlled Vocabularies, Metadata Schemas, and Templates

Understanding the Building Blocks of Structured Research Metadata

In Research Data Management, terms such as ontologies, controlled vocabularies, metadata schemas, and metadata templates are often used together. While closely related, they serve distinct roles. Together, they ensure that research outputs are described in a consistent, machine-readable, and interoperable way.

Controlled Vocabularies

Controlled vocabularies are curated lists of approved terms used to describe data consistently.

They:

  • Provide standardised terms for describing concepts
  • Reduce ambiguity (e.g., “soil moisture” vs. “soil humidity”)
  • Improve searchability and interoperability

Examples

When a metadata field requires a subject keyword, a controlled vocabulary ensures that everyone uses the same term for the same concept.
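
The effect of a controlled vocabulary can be sketched in a few lines of Python: the vocabulary is a fixed set of approved terms, and anything outside it is flagged. The terms and the helper name below are illustrative, not part of any real standard.

```python
# A tiny controlled vocabulary, sketched as a set of approved terms
# (illustrative values taken from the soil-moisture example).
VOCAB = {"soil moisture", "hydrology", "climate"}

def invalid_keywords(keywords):
    """Return the keywords that are not in the controlled vocabulary."""
    return [k for k in keywords if k not in VOCAB]

# "soil humidity" is not an approved term, so it is flagged:
print(invalid_keywords(["soil moisture", "soil humidity"]))  # → ['soil humidity']
```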

Ontologies

Ontologies extend controlled vocabularies by not only defining terms, but also specifying the relationships between them.

They:

  • Provide a formal, machine-readable model of concepts
  • Define hierarchies (e.g., “soil moisture” is a type of “hydrological variable”)
  • Enable reasoning and automated linking between datasets

Examples

Ontologies allow machines to understand that terms like “precipitation” and “rainfall” are related, enabling more intelligent search and data integration.
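
The hierarchy an ontology adds can be sketched as a broader-term mapping; walking up the mapping lets a machine discover that two terms share a parent concept. The terms below are illustrative, not drawn from a real ontology.

```python
# A miniature ontology fragment: each term points to its broader concept.
BROADER = {
    "rainfall": "precipitation",
    "precipitation": "hydrological variable",
    "soil moisture": "hydrological variable",
}

def ancestors(term):
    """Walk from a term up to the root of the hierarchy."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain

# "rainfall" and "soil moisture" meet at "hydrological variable",
# so a search for hydrological variables can match both.
print(ancestors("rainfall"))       # → ['precipitation', 'hydrological variable']
print(ancestors("soil moisture"))  # → ['hydrological variable']
```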

Metadata Schemas

A metadata schema defines which fields are used to describe a digital object and how those fields should be structured.

They:

  • Specify required and optional fields
  • Define field types (e.g., text, date, identifier)
  • Ensure consistency across repositories and disciplines

Examples

A metadata schema tells you what information to provide, such as:

  • Title
  • Creator
  • Description
  • Keywords
  • License
  • Persistent identifier

Importantly, it does not define which terms to use—that is the role of controlled vocabularies.
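
This division of labour can be sketched in Python: the schema below specifies only which fields exist and what type they have, deliberately saying nothing about which values are allowed. The field names are illustrative, not taken from any real schema.

```python
# A minimal schema sketch: required field names and their expected types.
SCHEMA = {
    "title": str,
    "creator": str,
    "publication_year": int,
}

def validate(record):
    """Check a metadata record against the schema (fields and types only)."""
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type: {field}")
    return errors

record = {"title": "Soil Moisture Dataset 2020–2023",
          "creator": "Anna Vermeer",
          "publication_year": "2024"}  # a string where an integer is expected
print(validate(record))  # → ['wrong type: publication_year']
```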

Metadata Templates

A metadata template is a practical, user-friendly implementation of a metadata schema.

They:

  • Provide forms or structured documents for data entry
  • Translate schema fields into prompts
  • Include guidance or examples
  • Often embed controlled vocabularies (e.g., dropdowns or autocomplete)

Examples

  • A Zenodo upload form
  • A Dataverse dataset form
  • A lab-specific metadata spreadsheet
  • A README template aligned with DataCite

Templates are what researchers actually interact with. They operationalise schemas and often help enforce consistency.

These components work together as complementary layers:

| Concept | Purpose | Relationship |
| --- | --- | --- |
| Controlled vocabularies | Standardised terms | Provide the values used in metadata |
| Ontologies | Concepts + relationships | Add semantic meaning and structure |
| Metadata schemas | Field definitions | Specify what metadata to capture |
| Metadata templates | Practical tools | Implement schemas for users |

A simple analogy

  • Controlled vocabularies → the dictionary
  • Ontologies → the dictionary plus relationships
  • Metadata schemas → the blueprint
  • Metadata templates → the form you fill in

A real-world example

When uploading a dataset to Zenodo:

  1. A schema (e.g., DataCite) defines the required fields.
  2. The platform presents these fields through a template (web form).
  3. Controlled vocabularies may guide keyword selection.
  4. Ontologies may link your dataset to related concepts behind the scenes.

Linked Data and Key–Value Metadata

Metadata can range from simple structures to fully semantic representations. Two key approaches are key–value metadata and Linked Data.

Key–Value Metadata

Key–value metadata is the simplest and most widely used format. It consists of pairs:

  • Key → the field name
  • Value → the content

Examples

  • title: Soil Moisture Dataset 2020–2023
  • creator: Anna Vermeer
  • license: CC BY 4.0

Characteristics

  • Easy to create and understand
  • Common in spreadsheets, JSON, YAML, and repository forms
  • Human-readable, but limited in machine interpretation

How it fits

  • Schemas define the keys
  • Controlled vocabularies constrain the values
  • Templates present both to the user

Key–value metadata forms the foundation of most metadata practices.
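
As a concrete sketch, the key–value fields from the dataset example above can be written as a Python dictionary and serialised to JSON; the field names follow the fictitious example, not any particular schema.

```python
import json

# Flat key–value metadata: each key is a field name, each value plain content.
metadata = {
    "title": "Soil Moisture Dataset 2020–2023",
    "creator": "Anna Vermeer",
    "affiliation": "Utrecht University",
    "publication_year": 2024,
    "license": "CC BY 4.0",
}

# JSON is one common serialisation for this kind of metadata.
print(json.dumps(metadata, indent=2, ensure_ascii=False))
```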

Linked Data

Linked Data is a more advanced, semantic approach that represents metadata as interconnected statements.

Core idea

Information is expressed as subject–predicate–object triples:

  • Dataset — hasCreator → Anna Vermeer
  • Dataset — hasSubject → Soil Moisture
  • Soil Moisture — is a → Hydrological Variable

Each element is identified by a URI, ensuring global uniqueness.

Characteristics
  • Highly interoperable
  • Machine-interpretable
  • Enables automated reasoning
  • Connects datasets across systems

How it fits

  • Ontologies define relationships
  • Controlled vocabularies provide globally unique identifiers for the concepts and terms used
  • Schemas can be expressed in machine-readable relationships (triples) using globally unique identifiers
  • Templates may generate Linked Data automatically

Linked Data turns metadata into a network of meaning, rather than a collection of fields.
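
A toy triple store makes the idea concrete: statements are stored as subject–predicate–object tuples, and queries with wildcards can chain across them. This is a pure-Python sketch, not an RDF library; the names come from the triples above.

```python
# Each statement is a (subject, predicate, object) tuple.
triples = [
    ("Dataset", "hasCreator", "Anna Vermeer"),
    ("Dataset", "hasSubject", "Soil Moisture"),
    ("Soil Moisture", "isA", "Hydrological Variable"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern (None acts as a wildcard)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Chain two queries: find the dataset's subject, then ask what kind of thing it is.
subject = match(s="Dataset", p="hasSubject")[0][2]   # "Soil Moisture"
print(match(s=subject, p="isA"))  # → [('Soil Moisture', 'isA', 'Hydrological Variable')]
```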

A Simple Comparison

Describing a book:

  • Key–value metadata
    • title: The Hobbit
    • author: J.R.R. Tolkien
  • Controlled vocabulary
    • subject: Fantasy Fiction
  • Ontology
    • Fantasy Fiction → is a → Fiction Genre
  • Schema → defines the fields
  • Template → the form you fill in
  • Linked Data → a connected network (knowledge graph) of relationships

Key–value pairs versus Linked Data

Metadata often begins its life as a simple list of key–value pairs. This is the format most researchers encounter in spreadsheets, repository submission forms, or lightweight JSON/YAML files.

| Field (Key) | Value |
| --- | --- |
| Title | Soil Moisture Dataset 2020–2023 |
| Creator | Anna Vermeer |
| Affiliation | Utrecht University |
| Publication Year | 2024 |
| Subject | Soil moisture |
| License | CC BY 4.0 |

In this representation:

  • Each row stands alone as an independent field–value pair
  • Relationships (e.g., that Anna Vermeer is affiliated with Utrecht University) are implicit
  • Values are plain text, without global identifiers or machine‑interpretable meaning

Such metadata is typically serialised as CSV or TSV files: simple, familiar, and easy to edit, but limited in how much structure or semantics they can express.

In the diagram below, the same metadata is expressed as a small knowledge graph. Instead of isolated fields, we now have entities (dataset, person, organisation, concept) connected by explicit relationships.

```mermaid
%%{init: {
  "look": "handDrawn",
  "flowchart": { "htmlLabels": false }
}}%%
flowchart TD

    D[Dataset]

    D -->|dc:title| T["Soil Moisture Dataset 2020–2023"]
    D -->|dc:issued| Y["2024"]
    D -->|dc:license| L["CC BY 4.0 (URI)"]
    D -->|dc:creator| P[Person: Anna Vermeer]
    D -->|dc:subject| S[Concept: Soil Moisture]

    P -->|schema:affiliation| O[Organisation: Utrecht University]
```

Figure 6.1: Knowledge Graph

Compared to the table:

  • The table presents metadata as a flat list of fields
  • The graph presents metadata as a network of connected entities
  • The person, organisation, and subject become first‑class nodes
  • Relationships such as creator and affiliation are explicit and machine‑interpretable

This is the essential shift: from fields with values to entities with relationships.

To serialise such a graph, we use a Linked Data format such as Turtle:

```turtle
@prefix dc:     <http://purl.org/dc/elements/1.1/> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix schema: <http://schema.org/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# Dataset node (using schema:Dataset)
<https://data.example.org/dataset/soil-moisture-2020-2023>
    a schema:Dataset ;
    dc:title "Soil Moisture Dataset 2020–2023" ;
    dct:issued "2024"^^xsd:gYear ;
    dct:license <https://creativecommons.org/licenses/by/4.0/> ;
    dct:creator <https://data.example.org/person/anna-vermeer> ;
    dct:subject <https://vocab.example.org/concept/soil-moisture> .

# Person node (FOAF)
<https://data.example.org/person/anna-vermeer>
    a foaf:Person ;
    foaf:name "Anna Vermeer" ;
    schema:affiliation <https://data.example.org/org/utrecht-university> .

# Organisation node (FOAF)
<https://data.example.org/org/utrecht-university>
    a foaf:Organization ;
    foaf:name "Utrecht University" .

# Concept node (SKOS)
<https://vocab.example.org/concept/soil-moisture>
    a skos:Concept ;
    skos:prefLabel "Soil Moisture"@en ;
    skos:definition "The amount of water contained in soil, typically expressed as a percentage."@en .
```

In this Turtle serialisation, the RDF graph not only describes the dataset, the creator, and the organisation, but also provides a stable identifier for the subject concept (“Soil Moisture”). This makes the metadata interoperable: other datasets can refer to the same concept, and machines can follow the URI to retrieve its meaning.

At the top of the file, we declare the ontologies and vocabularies we use. For example:

```turtle
@prefix dc:     <http://purl.org/dc/elements/1.1/> .
```

This tells us that dc:title is shorthand for the full URI
http://purl.org/dc/elements/1.1/title,
which is the authoritative definition of the Dublin Core title property.

By giving each entity (dataset, person, organisation, and concept) its own URI, the graph becomes a reusable and extensible structure. Machines can infer relationships, merge graphs, or enrich them with external knowledge. While it is technically possible to use URIs as keys in key–value metadata, such formats lack the semantics and inference capabilities that RDF provides.
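
Merging is where URIs pay off: because both graphs below use the same URI for the dataset, their triples can simply be unioned, with no column renaming or key reconciliation. A pure-Python sketch using the example URIs from above:

```python
# Two independently produced metadata graphs about the same dataset.
dataset = "https://data.example.org/dataset/soil-moisture-2020-2023"

graph_a = {
    (dataset, "http://purl.org/dc/elements/1.1/title",
     "Soil Moisture Dataset 2020–2023"),
}
graph_b = {
    (dataset, "http://purl.org/dc/terms/creator",
     "https://data.example.org/person/anna-vermeer"),
}

# Shared URIs make the merge a plain set union.
merged = graph_a | graph_b
print(len(merged))  # → 2
```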

We explore metadata serialisation and metadata infrastructure in more detail in the chapter Metadata Infrastructure.

Bringing It All Together

  • Key–value metadata is the practical starting point
  • Linked Data is the semantic, interoperable extension
  • Controlled vocabularies and ontologies provide meaning and standardisation of terms
  • Schemas and templates provide structure

Together, they form the foundation of FAIR, reusable research metadata.

Understanding and Combining Metadata Standards

Metadata standards vary in scope—from general to highly specialised. Understanding how they complement each other helps in selecting the right approach.

Dublin Core

A general-purpose, lightweight metadata schema.

It provides:

  • 15 core elements (e.g., Title, Creator, Subject, Date)
  • Broad applicability across domains

Why it’s useful:

  • Simple and widely supported
  • Suitable for discovery and basic description

Role: Provides baseline descriptive metadata.
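
A minimal Dublin Core record can be generated with the Python standard library. The four elements used below are among the 15 core elements; the root element name `record` is an arbitrary choice for this sketch, not part of the standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Build a record with four Dublin Core elements.
record = ET.Element("record")
for element, value in [
    ("title", "Soil Moisture Dataset 2020–2023"),
    ("creator", "Anna Vermeer"),
    ("subject", "soil moisture"),
    ("date", "2024"),
]:
    ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))
```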

PROV-O

A provenance ontology describing how data was created.

It provides:

  • Entities, activities, and agents
  • Relationships such as wasGeneratedBy and used

Why it’s useful:

  • Supports reproducibility
  • Captures workflows and processes

Role: Adds semantic provenance and process context.
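
The relationships above can be sketched as triples. The entity, activity, and agent identifiers here are made up for the soil-moisture example, while the predicates (`wasGeneratedBy`, `used`, `wasAssociatedWith`) are real PROV-O properties.

```python
PROV = "http://www.w3.org/ns/prov#"

# Provenance of a cleaned dataset: an entity generated by an activity
# that used another entity and was carried out by an agent.
provenance = [
    ("cleaned-dataset", PROV + "wasGeneratedBy", "cleaning-run"),
    ("cleaning-run", PROV + "used", "raw-sensor-data"),
    ("cleaning-run", PROV + "wasAssociatedWith", "anna-vermeer"),
]

# Trace how the cleaned dataset came to be.
generated_by = [o for s, p, o in provenance
                if s == "cleaned-dataset" and p.endswith("wasGeneratedBy")]
print(generated_by)  # → ['cleaning-run']
```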

Discipline-Specific Schema - Data Documentation Initiative

Data Documentation Initiative (DDI) organises metadata into three main levels:

| Level | What it describes | Examples |
| --- | --- | --- |
| Study level | The overall research project | Title, investigators, methodology, sampling |
| Dataset level | The data files | File format, number of variables, version |
| Variable level | Individual variables | Question text, response categories, coding |

The variable level is what makes DDI especially powerful.

Study level

  • Title: European Social Survey 2022
  • Method: Survey
  • Sample: Random sample of EU residents

Dataset level

  • File: ess2022.csv
  • Cases: 30,000 respondents
  • Variables: 250

Variable level

  • Variable: trust_gov
  • Question: “How much do you trust the national government?”
  • Values:
    • 0 = No trust
    • 10 = Complete trust

DDI enables researchers to understand:

  • what the data contains (variables),
  • how it was collected (methodology), and
  • how to reuse it correctly.
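
Variable-level metadata of this kind is often captured in a codebook. A sketch as a Python structure, mirroring the trust_gov example (the keys are illustrative, not DDI element names):

```python
# Codebook entry for one variable.
trust_gov = {
    "name": "trust_gov",
    "question": "How much do you trust the national government?",
    "values": {0: "No trust", 10: "Complete trust"},
}

def label(variable, code):
    """Translate a coded response into its human-readable label."""
    return variable["values"].get(code, "unknown code")

print(label(trust_gov, 10))  # → Complete trust
```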

It provides:

  • Detailed methodological and variable-level metadata
  • Support for complex datasets

Why it’s useful:

  • Captures domain knowledge essential for reuse

Role: Provides deep, field-specific context.

Discipline-Specific Schema - Investigation–Study–Assay (ISA)

ISA is a discipline-specific metadata framework used in the life sciences to describe experimental workflows, especially in genomics, proteomics, and other omics research.

ISA organises metadata into three hierarchical levels:

| Level | What it describes | Examples |
| --- | --- | --- |
| Investigation | The overall research context | Project title, researchers, objectives |
| Study | A specific experiment or dataset | Study design, subjects, sample characteristics |
| Assay | Analytical measurements and technologies | Sequencing, mass spectrometry, protocols |

The assay level captures how data was actually generated.

Investigation level

  • Title: Gut Microbiome and Diet Study
  • Objective: Analyse microbiome changes under different diets

Study level

  • Subjects: 100 participants
  • Design: Controlled dietary intervention
  • Samples: Stool samples collected weekly

Assay level

  • Technology: DNA sequencing
  • Platform: Illumina
  • Output: Microbial abundance profiles

ISA enables researchers to understand:

  • what was studied (samples and subjects),
  • how experiments were conducted, and
  • how measurements were generated.
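
The three ISA levels nest naturally. A sketch as nested Python mappings, mirroring the example above (the keys are illustrative, not the ISA-Tab field names):

```python
# Investigation → Study → Assay hierarchy.
investigation = {
    "title": "Gut Microbiome and Diet Study",
    "studies": [{
        "design": "Controlled dietary intervention",
        "subjects": 100,
        "assays": [{
            "technology": "DNA sequencing",
            "platform": "Illumina",
        }],
    }],
}

# Drill down from the investigation to the assay platform.
platform = investigation["studies"][0]["assays"][0]["platform"]
print(platform)  # → Illumina
```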

It provides:

  • Detailed methodological and assay-level metadata
  • Support for complex datasets

Why it’s useful:

  • Captures domain knowledge essential for reuse

Role: Provides deep, field-specific context.

How They Work Together

| Layer | Purpose | Example |
| --- | --- | --- |
| General description | What is it? | Dublin Core |
| Provenance | How was it created? | PROV-O |
| Domain detail | What does it mean in context? | DDI (social sciences), ISA (life sciences) |

Together, these layers create rich, FAIR, and interoperable metadata.

References

  • https://www.w3.org/wiki/LinkedData
  • https://www.w3.org/DesignIssues/LinkedData
  • https://www.rd-alliance.org/group_output/rda-tdwg-attribution-metadata-working-group-final-recommendations/
  • https://ddialliance.org/
  • https://isa-tools.org/format/specification.html