Metadata

What Is Metadata in Research Data Management?

In the context of Research Data Management (RDM), metadata is structured information that describes, explains, and contextualises a digital object—such as a dataset, document, image, or software file. It acts as a layer of documentation that makes a digital object findable, understandable, and reusable, both now and in the future.

Metadata typically captures:

  • Content information — what the object contains
  • Provenance — how, when, and by whom it was created
  • Technical characteristics — file formats, structures, and software requirements
  • Administrative information — rights, licences, and access conditions
  • Relationships — links to other versions, datasets, or publications

In short, metadata provides the context needed to interpret and reuse data beyond its original purpose.

A fictitious example for a published dataset

| Field | Example Value |
| --- | --- |
| Title | Soil Moisture Dataset 2020–2023 |
| Creator | Anna Vermeer |
| Affiliation | Utrecht University |
| Publication Year | 2024 |
| Description | Measurements of soil moisture across multiple sites in the Netherlands over three years. |
| Keywords / Subject | soil moisture, hydrology, climate |
| License | CC BY 4.0 |
| Provenance | Collected using sensor network; cleaned and validated by research team; processed using R scripts, R version 4.2 |
| Dataset DOI | 10.5281/zenodo.1234567 |
| Domain Detail (DDI) | Variable-level metadata: soil_moisture, temperature, precipitation; measurement units, missing values, coding scheme documented. |
| Domain Detail (ISA) | Assay-level metadata for soil sensors: sensor type, calibration protocol, data logger model, measurement frequency. |

A fictitious example for accompanying software

| Field | Example Value |
| --- | --- |
| Title | Soil Moisture Data Cleaning Script |
| Creator / Author | Anna Vermeer |
| Affiliation | Utrecht University |
| Version | 1.0.2 |
| Programming Language | R |
| Description | Script to clean and standardize soil moisture measurements from multiple sensor networks, including missing value imputation and unit conversions. |
| License | MIT License |
| Repository / DOI | https://github.com/soil-data/cleaning-script or DOI: 10.5281/zenodo.7654321 |
| Dependencies | R packages: tidyverse 1.4.3, lubridate 1.9.0 |
| Provenance | Developed as part of Soil Moisture Project 2020–2023; tested on dataset version 3; last updated March 2024 |
| Domain Detail (CodeMeta) | softwareRequirements: R 4.2; programmingLanguage: R; codeRepository: GitHub; maintainer: Anna Vermeer |
| Domain Detail (ISA) | If linked to experimental workflows: script applied to assay-level data (sensor readings), preserving measurement protocols and calibration details. |

Ontologies, Controlled Vocabularies, Metadata Schemas, and Templates

Understanding the Building Blocks of Structured Research Metadata

In Research Data Management, terms such as ontologies, controlled vocabularies, metadata schemas, and metadata templates are often used together. While closely related, they serve distinct roles. Together, they ensure that research outputs are described in a consistent, machine-readable, and interoperable way.

Controlled Vocabularies

Controlled vocabularies are curated lists of approved terms used to describe data consistently.

They:

  • Provide standardised terms for describing concepts
  • Reduce ambiguity (e.g., “soil moisture” vs. “soil humidity”)
  • Improve searchability and interoperability

Examples

When a metadata field requires a subject keyword, a controlled vocabulary ensures that everyone uses the same term for the same concept.
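
The effect of a controlled vocabulary can be sketched in a few lines of Python: the vocabulary is a fixed set of approved terms, and anything outside it is flagged. The terms and the helper name below are illustrative, not part of any real standard.

```python
# A tiny controlled vocabulary, sketched as a set of approved terms
# (illustrative values taken from the soil-moisture example).
VOCAB = {"soil moisture", "hydrology", "climate"}

def invalid_keywords(keywords):
    """Return the keywords that are not in the controlled vocabulary."""
    return [k for k in keywords if k not in VOCAB]

# "soil humidity" is not an approved term, so it is flagged:
print(invalid_keywords(["soil moisture", "soil humidity"]))  # → ['soil humidity']
```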

Ontologies

Ontologies extend controlled vocabularies by not only defining terms, but also specifying the relationships between them.

They:

  • Provide a formal, machine-readable model of concepts
  • Define hierarchies (e.g., “soil moisture” is a type of “hydrological variable”)
  • Enable reasoning and automated linking between datasets

Examples

Ontologies allow machines to understand that terms like “precipitation” and “rainfall” are related, enabling more intelligent search and data integration.
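
The hierarchy an ontology adds can be sketched as a broader-term mapping; walking up the mapping lets a machine discover that two terms share a parent concept. The terms below are illustrative, not drawn from a real ontology.

```python
# A miniature ontology fragment: each term points to its broader concept.
BROADER = {
    "rainfall": "precipitation",
    "precipitation": "hydrological variable",
    "soil moisture": "hydrological variable",
}

def ancestors(term):
    """Walk from a term up to the root of the hierarchy."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain

# "rainfall" and "soil moisture" meet at "hydrological variable",
# so a search for hydrological variables can match both.
print(ancestors("rainfall"))       # → ['precipitation', 'hydrological variable']
print(ancestors("soil moisture"))  # → ['hydrological variable']
```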

Metadata Schemas

A metadata schema defines which fields are used to describe a digital object and how those fields should be structured.

They:

  • Specify required and optional fields
  • Define field types (e.g., text, date, identifier)
  • Ensure consistency across repositories and disciplines

Examples

A metadata schema tells you what information to provide, such as:

  • Title
  • Creator
  • Description
  • Keywords
  • License
  • Persistent identifier

Importantly, it does not define which terms to use—that is the role of controlled vocabularies.
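
This division of labour can be sketched in Python: the schema below specifies only which fields exist and what type they have, deliberately saying nothing about which values are allowed. The field names are illustrative, not taken from any real schema.

```python
# A minimal schema sketch: required field names and their expected types.
SCHEMA = {
    "title": str,
    "creator": str,
    "publication_year": int,
}

def validate(record):
    """Check a metadata record against the schema (fields and types only)."""
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"wrong type: {field}")
    return errors

record = {"title": "Soil Moisture Dataset 2020–2023",
          "creator": "Anna Vermeer",
          "publication_year": "2024"}  # a string where an integer is expected
print(validate(record))  # → ['wrong type: publication_year']
```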

Metadata Templates

A metadata template is a practical, user-friendly implementation of a metadata schema.

They:

  • Provide forms or structured documents for data entry
  • Translate schema fields into prompts
  • Include guidance or examples
  • Often embed controlled vocabularies (e.g., dropdowns or autocomplete)

Examples

  • A Zenodo upload form
  • A Dataverse dataset form
  • A lab-specific metadata spreadsheet
  • A README template aligned with DataCite

Templates are what researchers actually interact with. They operationalise schemas and often help enforce consistency.

These components work together as complementary layers:

| Concept | Purpose | Relationship |
| --- | --- | --- |
| Controlled vocabularies | Standardised terms | Provide the values used in metadata |
| Ontologies | Concepts + relationships | Add semantic meaning and structure |
| Metadata schemas | Field definitions | Specify what metadata to capture |
| Metadata templates | Practical tools | Implement schemas for users |

A simple analogy

  • Controlled vocabularies → the dictionary
  • Ontologies → the dictionary plus relationships
  • Metadata schemas → the blueprint
  • Metadata templates → the form you fill in

A real-world example

When uploading a dataset to Zenodo:

  1. A schema (e.g., DataCite) defines the required fields.
  2. The platform presents these fields through a template (web form).
  3. Controlled vocabularies may guide keyword selection.
  4. Ontologies may link your dataset to related concepts behind the scenes.

Linked Data and Key–Value Metadata

Metadata can range from simple structures to fully semantic representations. Two key approaches are key–value metadata and Linked Data.

Key–Value Metadata

Key–value metadata is the simplest and most widely used format. It consists of pairs:

  • Key → the field name
  • Value → the content

Examples

  • title: Soil Moisture Dataset 2020–2023
  • creator: Anna Vermeer
  • license: CC BY 4.0

Characteristics

  • Easy to create and understand
  • Common in spreadsheets, JSON, YAML, and repository forms
  • Human-readable, but limited in machine interpretation

How it fits

  • Schemas define the keys
  • Controlled vocabularies constrain the values
  • Templates present both to the user

Key–value metadata forms the foundation of most metadata practices.
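
As a concrete sketch, the key–value fields from the dataset example above can be written as a Python dictionary and serialised to JSON; the field names follow the fictitious example, not any particular schema.

```python
import json

# Flat key–value metadata: each key is a field name, each value plain content.
metadata = {
    "title": "Soil Moisture Dataset 2020–2023",
    "creator": "Anna Vermeer",
    "affiliation": "Utrecht University",
    "publication_year": 2024,
    "license": "CC BY 4.0",
}

# JSON is one common serialisation for this kind of metadata.
print(json.dumps(metadata, indent=2, ensure_ascii=False))
```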

Linked Data

Linked Data is a more advanced, semantic approach that represents metadata as interconnected statements.

Core idea

Information is expressed as subject–predicate–object triples:

  • Dataset — hasCreator → Anna Vermeer
  • Dataset — hasSubject → Soil Moisture
  • Soil Moisture — is a → Hydrological Variable

Each element is identified by a URI, ensuring global uniqueness.

Characteristics
  • Highly interoperable
  • Machine-interpretable
  • Enables automated reasoning
  • Connects datasets across systems

How it fits

  • Ontologies define relationships
  • Controlled vocabularies provide globally unique identifiers for the concepts and terms used
  • Schemas can be expressed in machine-readable relationships (triples) using globally unique identifiers
  • Templates may generate Linked Data automatically

Linked Data turns metadata into a network of meaning, rather than a collection of fields.
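
A toy triple store makes the idea concrete: statements are stored as subject–predicate–object tuples, and queries with wildcards can chain across them. This is a pure-Python sketch, not an RDF library; the names come from the triples above.

```python
# Each statement is a (subject, predicate, object) tuple.
triples = [
    ("Dataset", "hasCreator", "Anna Vermeer"),
    ("Dataset", "hasSubject", "Soil Moisture"),
    ("Soil Moisture", "isA", "Hydrological Variable"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern (None acts as a wildcard)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Chain two queries: find the dataset's subject, then ask what kind of thing it is.
subject = match(s="Dataset", p="hasSubject")[0][2]   # "Soil Moisture"
print(match(s=subject, p="isA"))  # → [('Soil Moisture', 'isA', 'Hydrological Variable')]
```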

A Simple Comparison

Describing a book:

  • Key–value metadata
    • title: The Hobbit
    • author: J.R.R. Tolkien
  • Controlled vocabulary
    • subject: Fantasy Fiction
  • Ontology
    • Fantasy Fiction → is a → Fiction Genre
  • Schema → defines the fields
  • Template → the form you fill in
  • Linked Data → a connected network (knowledge graph) of relationships

Key–value pairs versus Linked Data

Metadata often begins its life as a simple list of key–value pairs. This is the format most researchers encounter in spreadsheets, repository submission forms, or lightweight JSON/YAML files.

| Field (Key) | Value |
| --- | --- |
| Title | Soil Moisture Dataset 2020–2023 |
| Creator | Anna Vermeer |
| Affiliation | Utrecht University |
| Publication Year | 2024 |
| Subject | Soil moisture |
| License | CC BY 4.0 |

In this representation:

  • Each row stands alone as an independent field–value pair
  • Relationships (e.g., that Anna Vermeer is affiliated with Utrecht University) are implicit
  • Values are plain text, without global identifiers or machine‑interpretable meaning

Such metadata is typically serialised as CSV or TSV files: simple, familiar, and easy to edit, but limited in how much structure or semantics they can express.

In the diagram below, the same metadata is expressed as a small knowledge graph. Instead of isolated fields, we now have entities (dataset, person, organisation, concept) connected by explicit relationships.

```mermaid
%%{init: {
  "look": "handDrawn",
  "flowchart": { "htmlLabels": false }
}}%%
flowchart TD

    D[Dataset]

    D -->|dc:title| T["Soil Moisture Dataset 2020–2023"]
    D -->|dc:issued| Y["2024"]
    D -->|dc:license| L["CC BY 4.0 (URI)"]
    D -->|dc:creator| P[Person: Anna Vermeer]
    D -->|dc:subject| S[Concept: Soil Moisture]

    P -->|schema:affiliation| O[Organisation: Utrecht University]
```

Figure 6.1: Knowledge Graph

Compared to the table:

  • The table presents metadata as a flat list of fields
  • The graph presents metadata as a network of connected entities
  • The person, organisation, and subject become first‑class nodes
  • Relationships such as creator and affiliation are explicit and machine‑interpretable

This is the essential shift: from fields with values to entities with relationships.

To serialise such a graph, we use a Linked Data format such as Turtle:

```turtle
@prefix dc:     <http://purl.org/dc/elements/1.1/> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix schema: <http://schema.org/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# Dataset node (using schema:Dataset)
<https://data.example.org/dataset/soil-moisture-2020-2023>
    a schema:Dataset ;
    dc:title "Soil Moisture Dataset 2020–2023" ;
    dct:issued "2024"^^xsd:gYear ;
    dct:license <https://creativecommons.org/licenses/by/4.0/> ;
    dct:creator <https://data.example.org/person/anna-vermeer> ;
    dct:subject <https://vocab.example.org/concept/soil-moisture> .

# Person node (FOAF)
<https://data.example.org/person/anna-vermeer>
    a foaf:Person ;
    foaf:name "Anna Vermeer" ;
    schema:affiliation <https://data.example.org/org/utrecht-university> .

# Organisation node (FOAF)
<https://data.example.org/org/utrecht-university>
    a foaf:Organization ;
    foaf:name "Utrecht University" .

# Concept node (SKOS)
<https://vocab.example.org/concept/soil-moisture>
    a skos:Concept ;
    skos:prefLabel "Soil Moisture"@en ;
    skos:definition "The amount of water contained in soil, typically expressed as a percentage."@en .
```

In this Turtle serialisation, the RDF graph not only describes the dataset, the creator, and the organisation, but also provides a stable identifier for the subject concept (“Soil Moisture”). This makes the metadata interoperable: other datasets can refer to the same concept, and machines can follow the URI to retrieve its meaning.

At the top of the file, we declare the ontologies and vocabularies we use. For example:

```turtle
@prefix dc:     <http://purl.org/dc/elements/1.1/> .
```

This tells us that dc:title is shorthand for the full URI
http://purl.org/dc/elements/1.1/title,
which is the authoritative definition of the Dublin Core title property.

By giving each entity (dataset, person, organisation, and concept) its own URI, the graph becomes a reusable and extensible structure. Machines can infer relationships, merge graphs, or enrich them with external knowledge. While it is technically possible to use URIs as keys in key–value metadata, such formats lack the semantics and inference capabilities that RDF provides.
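
Merging is where URIs pay off: because both graphs below use the same URI for the dataset, their triples can simply be unioned, with no column renaming or key reconciliation. A pure-Python sketch using the example URIs from above:

```python
# Two independently produced metadata graphs about the same dataset.
dataset = "https://data.example.org/dataset/soil-moisture-2020-2023"

graph_a = {
    (dataset, "http://purl.org/dc/elements/1.1/title",
     "Soil Moisture Dataset 2020–2023"),
}
graph_b = {
    (dataset, "http://purl.org/dc/terms/creator",
     "https://data.example.org/person/anna-vermeer"),
}

# Shared URIs make the merge a plain set union.
merged = graph_a | graph_b
print(len(merged))  # → 2
```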

We explore metadata serialisation and metadata infrastructure in more detail in the chapter Metadata Infrastructure.

Bringing It All Together

  • Key–value metadata is the practical starting point
  • Linked Data is the semantic, interoperable extension
  • Controlled vocabularies and ontologies provide meaning and standardisation of terms
  • Schemas and templates provide structure

Together, they form the foundation of FAIR, reusable research metadata.

Understanding and Combining Metadata Standards

Metadata standards vary in scope—from general to highly specialised. Understanding how they complement each other helps in selecting the right approach.

Dublin Core

A general-purpose, lightweight metadata schema.

It provides:

  • 15 core elements (e.g., Title, Creator, Subject, Date)
  • Broad applicability across domains

Why it’s useful:

  • Simple and widely supported
  • Suitable for discovery and basic description

Role: Provides baseline descriptive metadata.
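
A minimal Dublin Core record can be generated with the Python standard library. The four elements used below are among the 15 core elements; the root element name `record` is an arbitrary choice for this sketch, not part of the standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Build a record with four Dublin Core elements.
record = ET.Element("record")
for element, value in [
    ("title", "Soil Moisture Dataset 2020–2023"),
    ("creator", "Anna Vermeer"),
    ("subject", "soil moisture"),
    ("date", "2024"),
]:
    ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))
```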

PROV-O

A provenance ontology describing how data was created.

It provides:

  • Entities, activities, and agents
  • Relationships such as wasGeneratedBy and used

Why it’s useful:

  • Supports reproducibility
  • Captures workflows and processes

Role: Adds semantic provenance and process context.
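
The relationships above can be sketched as triples. The entity, activity, and agent identifiers here are made up for the soil-moisture example, while the predicates (`wasGeneratedBy`, `used`, `wasAssociatedWith`) are real PROV-O properties.

```python
PROV = "http://www.w3.org/ns/prov#"

# Provenance of a cleaned dataset: an entity generated by an activity
# that used another entity and was carried out by an agent.
provenance = [
    ("cleaned-dataset", PROV + "wasGeneratedBy", "cleaning-run"),
    ("cleaning-run", PROV + "used", "raw-sensor-data"),
    ("cleaning-run", PROV + "wasAssociatedWith", "anna-vermeer"),
]

# Trace how the cleaned dataset came to be.
generated_by = [o for s, p, o in provenance
                if s == "cleaned-dataset" and p.endswith("wasGeneratedBy")]
print(generated_by)  # → ['cleaning-run']
```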

Discipline-Specific Schema - Data Documentation Initiative

Data Documentation Initiative (DDI) organises metadata into three main levels:

| Level | What it describes | Examples |
| --- | --- | --- |
| Study level | The overall research project | Title, investigators, methodology, sampling |
| Dataset level | The data files | File format, number of variables, version |
| Variable level | Individual variables | Question text, response categories, coding |

The variable level is what makes DDI especially powerful.

Study level

  • Title: European Social Survey 2022
  • Method: Survey
  • Sample: Random sample of EU residents

Dataset level

  • File: ess2022.csv
  • Cases: 30,000 respondents
  • Variables: 250

Variable level

  • Variable: trust_gov
  • Question: “How much do you trust the national government?”
  • Values:
    • 0 = No trust
    • 10 = Complete trust

DDI enables researchers to understand:

  • what the data contains (variables),
  • how it was collected (methodology), and
  • how to reuse it correctly.
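
Variable-level metadata of this kind is often captured in a codebook. A sketch as a Python structure, mirroring the trust_gov example (the keys are illustrative, not DDI element names):

```python
# Codebook entry for one variable.
trust_gov = {
    "name": "trust_gov",
    "question": "How much do you trust the national government?",
    "values": {0: "No trust", 10: "Complete trust"},
}

def label(variable, code):
    """Translate a coded response into its human-readable label."""
    return variable["values"].get(code, "unknown code")

print(label(trust_gov, 10))  # → Complete trust
```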

It provides:

  • Detailed methodological and variable-level metadata
  • Support for complex datasets

Why it’s useful:

  • Captures domain knowledge essential for reuse

Role: Provides deep, field-specific context.

Discipline-Specific Schema - Investigation–Study–Assay (ISA)

ISA is a discipline-specific metadata framework used in the life sciences to describe experimental workflows, especially in genomics, proteomics, and other omics research.

ISA organises metadata into three hierarchical levels:

| Level | What it describes | Examples |
| --- | --- | --- |
| Investigation | The overall research context | Project title, researchers, objectives |
| Study | A specific experiment or dataset | Study design, subjects, sample characteristics |
| Assay | Analytical measurements and technologies | Sequencing, mass spectrometry, protocols |

The assay level captures how data was actually generated.

Investigation level

  • Title: Gut Microbiome and Diet Study
  • Objective: Analyse microbiome changes under different diets

Study level

  • Subjects: 100 participants
  • Design: Controlled dietary intervention
  • Samples: Stool samples collected weekly

Assay level

  • Technology: DNA sequencing
  • Platform: Illumina
  • Output: Microbial abundance profiles

ISA enables researchers to understand:

  • what was studied (samples and subjects),
  • how experiments were conducted, and
  • how measurements were generated.
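
The three ISA levels nest naturally. A sketch as nested Python mappings, mirroring the example above (the keys are illustrative, not the ISA-Tab field names):

```python
# Investigation → Study → Assay hierarchy.
investigation = {
    "title": "Gut Microbiome and Diet Study",
    "studies": [{
        "design": "Controlled dietary intervention",
        "subjects": 100,
        "assays": [{
            "technology": "DNA sequencing",
            "platform": "Illumina",
        }],
    }],
}

# Drill down from the investigation to the assay platform.
platform = investigation["studies"][0]["assays"][0]["platform"]
print(platform)  # → Illumina
```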

It provides:

  • Detailed methodological and assay-level metadata
  • Support for complex datasets

Why it’s useful:

  • Captures domain knowledge essential for reuse

Role: Provides deep, field-specific context.

How They Work Together

| Layer | Purpose | Example |
| --- | --- | --- |
| General description | What is it? | Dublin Core |
| Provenance | How was it created? | PROV-O |
| Domain detail | What does it mean in context? | DDI (social sciences), ISA (life sciences) |

Together, these layers create rich, FAIR, and interoperable metadata.

References

  • https://www.w3.org/wiki/LinkedData
  • https://www.w3.org/DesignIssues/LinkedData
  • https://www.rd-alliance.org/group_output/rda-tdwg-attribution-metadata-working-group-final-recommendations/
  • https://ddialliance.org/
  • https://isa-tools.org/format/specification.html