```mermaid
%%{init: {
  "look": "handDrawn",
  "flowchart": { "htmlLabels": false }
}}%%
flowchart TD
  D[Dataset]
  D -->|dc:title| T["Soil Moisture Dataset 2020–2023"]
  D -->|dc:issued| Y["2024"]
  D -->|dc:license| L["CC BY 4.0 (URI)"]
  D -->|dc:creator| P["Person: Anna Vermeer"]
  D -->|dc:subject| S["Concept: Soil Moisture"]
  P -->|schema:affiliation| O["Organisation: Utrecht University"]
```
Metadata
What Is Metadata in Research Data Management?
In the context of Research Data Management (RDM), metadata is structured information that describes, explains, and contextualises a digital object—such as a dataset, document, image, or software file. It acts as a layer of documentation that makes a digital object findable, understandable, and reusable, both now and in the future.
Metadata typically captures:
- Content information — what the object contains
- Provenance — how, when, and by whom it was created
- Technical characteristics — file formats, structures, and software requirements
- Administrative information — rights, licences, and access conditions
- Relationships — links to other versions, datasets, or publications
In short, metadata provides the context needed to interpret and reuse data beyond its original purpose.
A fictitious example for a published dataset
| Field | Example Value |
|---|---|
| Title | Soil Moisture Dataset 2020–2023 |
| Creator | Anna Vermeer |
| Affiliation | Utrecht University |
| Publication Year | 2024 |
| Description | Measurements of soil moisture across multiple sites in the Netherlands over three years. |
| Keywords / Subject | soil moisture, hydrology, climate |
| License | CC BY 4.0 |
| Provenance | Collected using a sensor network; cleaned and validated by the research team; processed using R (version 4.2) scripts |
| Dataset DOI | 10.5281/zenodo.1234567 |
| Domain Detail (DDI) | Variable-level metadata: soil_moisture, temperature, precipitation; measurement units, missing values, coding scheme documented. |
| Domain Detail (ISA) | Assay-level metadata for soil sensors: sensor type, calibration protocol, data logger model, measurement frequency. |
A fictitious example for accompanying software
| Field | Example Value |
|---|---|
| Title | Soil Moisture Data Cleaning Script |
| Creator / Author | Anna Vermeer |
| Affiliation | Utrecht University |
| Version | 1.0.2 |
| Programming Language | R |
| Description | Script to clean and standardise soil moisture measurements from multiple sensor networks, including missing value imputation and unit conversions. |
| License | MIT License |
| Repository / DOI | https://github.com/soil-data/cleaning-script or DOI: 10.5281/zenodo.7654321 |
| Dependencies | R packages: tidyverse 1.4.3, lubridate 1.9.0 |
| Provenance | Developed as part of Soil Moisture Project 2020–2023; tested on dataset version 3; last updated March 2024 |
| Domain Detail (CodeMeta) | softwareRequirements: R 4.2; programmingLanguage: R; codeRepository: GitHub; maintainer: Anna Vermeer |
| Domain Detail (ISA) | If linked to experimental workflows: script applied to assay-level data (sensor readings), preserving measurement protocols and calibration details. |
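The CodeMeta details in the table above could be serialised as a minimal `codemeta.json` file. A sketch in Python with the fictitious values from the table; the `@context` URL is the published CodeMeta 2.0 context, but the exact selection of fields here is illustrative, not a complete CodeMeta record:

```python
import json

# Minimal codemeta.json sketch for the (fictitious) cleaning script.
# Field names follow the CodeMeta vocabulary; values come from the table above.
codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "Soil Moisture Data Cleaning Script",
    "version": "1.0.2",
    "programmingLanguage": "R",
    "license": "https://spdx.org/licenses/MIT",
    "codeRepository": "https://github.com/soil-data/cleaning-script",
    "softwareRequirements": ["R 4.2"],
}

# Serialise to the JSON file a repository or harvester would read.
print(json.dumps(codemeta, indent=2))
```

Because CodeMeta is JSON-LD, the `@context` line is what turns these plain keys into globally defined terms.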
Ontologies, Controlled Vocabularies, Metadata Schemas, and Templates
Understanding the Building Blocks of Structured Research Metadata
In Research Data Management, terms such as ontologies, controlled vocabularies, metadata schemas, and metadata templates are often used together. While closely related, they serve distinct roles. Together, they ensure that research outputs are described in a consistent, machine-readable, and interoperable way.
Controlled Vocabularies
Controlled vocabularies are curated lists of approved terms used to describe data consistently.
They:
- Provide standardised terms for describing concepts
- Reduce ambiguity (e.g., “soil moisture” vs. “soil humidity”)
- Improve searchability and interoperability
Examples
- AGROVOC (agriculture)
- MeSH (biomedical terms)
- GCMD keywords (Earth science)
When a metadata field requires a subject keyword, a controlled vocabulary ensures that everyone uses the same term for the same concept.
Ontologies
Ontologies extend controlled vocabularies by not only defining terms, but also specifying the relationships between them.
They:
- Provide a formal, machine-readable model of concepts
- Define hierarchies (e.g., “soil moisture” is a type of “hydrological variable”)
- Enable reasoning and automated linking between datasets
Examples
- Gene Ontology (GO)
- ENVO (Environment Ontology)
- PROV-O (Provenance Ontology)
Ontologies allow machines to understand that terms like “precipitation” and “rainfall” are related, enabling more intelligent search and data integration.
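As a toy illustration of this idea, an ontology's hierarchy can be walked by a machine to discover that two terms share a broader concept. The terms and links below are invented for the sketch, not taken from a real ontology:

```python
# Toy ontology: each term points to its broader concept (invented hierarchy).
broader = {
    "rainfall": "precipitation",
    "snowfall": "precipitation",
    "precipitation": "hydrological variable",
    "soil moisture": "hydrological variable",
}

def ancestors(term):
    """Collect every broader concept reachable from a term."""
    result = set()
    while term in broader:
        term = broader[term]
        result.add(term)
    return result

# A machine can see that rainfall is a kind of precipitation, and that
# rainfall and soil moisture share a broader concept.
print(sorted(ancestors("rainfall") & ancestors("soil moisture")))  # ['hydrological variable']
```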
Metadata Schemas
A metadata schema defines which fields are used to describe a digital object and how those fields should be structured.
They:
- Specify required and optional fields
- Define field types (e.g., text, date, identifier)
- Ensure consistency across repositories and disciplines
Examples
- Dublin Core (general-purpose)
- DataCite (research outputs)
- DDI (social sciences)
- CodeMeta (software)
A metadata schema tells you what information to provide, such as:
- Title
- Creator
- Description
- Keywords
- License
- Persistent identifier
Importantly, it does not define which terms to use—that is the role of controlled vocabularies.
Metadata Templates
A metadata template is a practical, user-friendly implementation of a metadata schema.
They:
- Provide forms or structured documents for data entry
- Translate schema fields into prompts
- Include guidance or examples
- Often embed controlled vocabularies (e.g., dropdowns or autocomplete)
Examples
- A Zenodo upload form
- A Dataverse dataset form
- A lab-specific metadata spreadsheet
- A README template aligned with DataCite
Templates are what researchers actually interact with. They operationalise schemas and often help enforce consistency.
These components work together as complementary layers:
| Concept | Purpose | Relationship |
|---|---|---|
| Controlled vocabularies | Standardised terms | Provide the values used in metadata |
| Ontologies | Concepts + relationships | Add semantic meaning and structure |
| Metadata schemas | Field definitions | Specify what metadata to capture |
| Metadata templates | Practical tools | Implement schemas for users |
A simple analogy
- Controlled vocabularies → the dictionary
- Ontologies → the dictionary plus relationships
- Metadata schemas → the blueprint
- Metadata templates → the form you fill in
A real-world example
When uploading a dataset to Zenodo:
- A schema (e.g., DataCite) defines the required fields.
- The platform presents these fields through a template (web form).
- Controlled vocabularies may guide keyword selection.
- Ontologies may link your dataset to related concepts behind the scenes.
Linked Data and Key–Value Metadata
Metadata can range from simple structures to fully semantic representations. Two key approaches are key–value metadata and Linked Data.
Key–Value Metadata
Key–value metadata is the simplest and most widely used format. It consists of pairs:
- Key → the field name
- Value → the content
Examples
```yaml
title: Soil Moisture Dataset 2020–2023
creator: Anna Vermeer
license: CC BY 4.0
```
Characteristics
- Easy to create and understand
- Common in spreadsheets, JSON, YAML, and repository forms
- Human-readable, but limited in machine interpretation
How it fits
- Schemas define the keys
- Controlled vocabularies constrain the values
- Templates present both to the user
Key–value metadata forms the foundation of most metadata practices.
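The division of labour described above can be sketched in a few lines: a hypothetical schema supplies the required keys, and a hypothetical controlled vocabulary constrains one of the values. Both lists are illustrative, not a real schema or vocabulary:

```python
# Illustrative schema (required keys) and controlled vocabulary (allowed values).
REQUIRED_KEYS = {"title", "creator", "license"}
ALLOWED_LICENSES = {"CC BY 4.0", "CC BY-SA 4.0", "CC0 1.0"}

record = {
    "title": "Soil Moisture Dataset 2020–2023",
    "creator": "Anna Vermeer",
    "license": "CC BY 4.0",
}

# The schema checks *which* fields are present...
missing = REQUIRED_KEYS - record.keys()
# ...while the vocabulary checks *what* a field may contain.
license_ok = record["license"] in ALLOWED_LICENSES

print(missing, license_ok)  # set() True
```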
Linked Data
Linked Data is a more advanced, semantic approach that represents metadata as interconnected statements.
Core idea
Information is expressed as subject–predicate–object triples:
- Dataset — hasCreator → Anna Vermeer
- Dataset — hasSubject → Soil Moisture
- Soil Moisture — is a → Hydrological Variable
Each element is identified by a URI, ensuring global uniqueness.
Characteristics
- Highly interoperable
- Machine-interpretable
- Enables automated reasoning
- Connects datasets across systems
How it fits
- Ontologies define relationships
- Controlled vocabularies provide globally unique identifiers for the concepts and terms used
- Schemas can be expressed in machine-readable relationships (triples) using globally unique identifiers
- Templates may generate Linked Data automatically
Linked Data turns metadata into a network of meaning, rather than a collection of fields.
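The three triples above can be represented directly as data, which makes the "network of meaning" concrete. A minimal sketch in which plain labels stand in for the URIs that real Linked Data would use:

```python
# The subject–predicate–object triples from the text, as Python tuples.
triples = {
    ("Dataset", "hasCreator", "Anna Vermeer"),
    ("Dataset", "hasSubject", "Soil Moisture"),
    ("Soil Moisture", "is a", "Hydrological Variable"),
}

def objects(subject, predicate):
    """Return every object linked to `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# Follow two hops: the dataset's subject is a kind of hydrological variable.
subject = objects("Dataset", "hasSubject").pop()  # 'Soil Moisture'
print(objects(subject, "is a"))                   # {'Hydrological Variable'}
```

This two-hop lookup is the seed of the "automated reasoning" listed above: no single triple states that the dataset concerns a hydrological variable, yet a machine can derive it.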
A Simple Comparison
Describing a book:
Key–value metadata
```yaml
title: The Hobbit
author: J.R.R. Tolkien
```
Controlled vocabulary
subject: Fantasy Fiction
Ontology
- Fantasy Fiction → is a → Fiction Genre
Schema → defines the fields
Template → the form
Linked Data → a connected network (knowledge graph) of relationships
Key–value pairs versus Linked Data
Metadata often begins its life as a simple list of key–value pairs. This is the format most researchers encounter in spreadsheets, repository submission forms, or lightweight JSON/YAML files.
| Field (Key) | Value |
|---|---|
| Title | Soil Moisture Dataset 2020–2023 |
| Creator | Anna Vermeer |
| Affiliation | Utrecht University |
| Publication Year | 2024 |
| Subject | Soil moisture |
| License | CC BY 4.0 |
In this representation:
- Each row stands alone as an independent field–value pair
- Relationships (e.g., that Anna Vermeer is affiliated with Utrecht University) are implicit
- Values are plain text, without global identifiers or machine‑interpretable meaning
Such metadata is typically serialised as CSV or TSV files: simple, familiar, and easy to edit, but limited in how much structure or semantics they can express.
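Serialising the table above to CSV takes only a few lines; a sketch using Python's standard `csv` module:

```python
import csv
import io

# The key–value table above, as (field, value) rows.
rows = [
    ("Title", "Soil Moisture Dataset 2020–2023"),
    ("Creator", "Anna Vermeer"),
    ("Affiliation", "Utrecht University"),
    ("Publication Year", "2024"),
    ("Subject", "Soil moisture"),
    ("License", "CC BY 4.0"),
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Field", "Value"])  # header row
writer.writerows(rows)

print(buffer.getvalue())
```

Note what the format cannot say: nothing in the CSV records that "Anna Vermeer" and "Utrecht University" are connected, which is exactly the limitation discussed above.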
In the knowledge-graph diagram at the start of this chapter, the same metadata is expressed as a small graph. Instead of isolated fields, we now have entities (dataset, person, organisation, concept) connected by explicit relationships.
Compared to the table:
- The table presents metadata as a flat list of fields
- The graph presents metadata as a network of connected entities
- The person, organisation, and subject become first‑class nodes
- Relationships such as `creator` and `affiliation` are explicit and machine‑interpretable
This is the essential shift: from fields with values to entities with relationships.
To serialise such a graph, we use a Linked Data format such as Turtle:
```turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix schema: <http://schema.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Dataset node (using schema:Dataset)
<https://data.example.org/dataset/soil-moisture-2020-2023>
    a schema:Dataset ;
    dc:title "Soil Moisture Dataset 2020–2023" ;
    dct:issued "2024"^^xsd:gYear ;
    dct:license <https://creativecommons.org/licenses/by/4.0/> ;
    dct:creator <https://data.example.org/person/anna-vermeer> ;
    dct:subject <https://vocab.example.org/concept/soil-moisture> .

# Person node (FOAF)
<https://data.example.org/person/anna-vermeer>
    a foaf:Person ;
    foaf:name "Anna Vermeer" ;
    schema:affiliation <https://data.example.org/org/utrecht-university> .

# Organisation node (FOAF)
<https://data.example.org/org/utrecht-university>
    a foaf:Organization ;
    foaf:name "Utrecht University" .

# Concept node (SKOS)
<https://vocab.example.org/concept/soil-moisture>
    a skos:Concept ;
    skos:prefLabel "Soil Moisture"@en ;
    skos:definition "The amount of water contained in soil, typically expressed as a percentage."@en .
```
In this Turtle serialisation, the RDF graph not only describes the dataset, the creator, and the organisation, but also provides a stable identifier for the subject concept (“Soil Moisture”). This makes the metadata interoperable: other datasets can refer to the same concept, and machines can follow the URI to retrieve its meaning.
At the top of the file, we declare the ontologies and vocabularies we use. For example:

```turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .
```

This tells us that `dc:title` is shorthand for the full URI `http://purl.org/dc/elements/1.1/title`, which points to the authoritative definition of the Dublin Core title property.
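The prefix mechanism itself is easy to sketch in code. A minimal, illustrative expansion function using the prefix mappings declared in the Turtle example:

```python
# Prefix declarations, as a plain mapping (taken from the Turtle example).
prefixes = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
    "schema": "http://schema.org/",
}

def expand(qname):
    """Expand a prefixed name such as 'dc:title' to its full URI."""
    prefix, local = qname.split(":", 1)
    return prefixes[prefix] + local

print(expand("dc:title"))  # http://purl.org/dc/elements/1.1/title
```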
By giving each entity (dataset, person, organisation, and concept) its own URI, the graph becomes a reusable and extensible structure. Machines can infer relationships, merge graphs, or enrich them with external knowledge. While it is technically possible to use URIs as keys in key–value metadata, such formats lack the semantics and inference capabilities that RDF provides.
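Merging is where globally unique URIs pay off: two graphs that mention the same URI can be combined with a plain set union. An illustrative sketch, reusing an example URI from above and writing the properties as prefixed-name strings:

```python
# Two small graphs that both mention the same person URI.
PERSON = "https://data.example.org/person/anna-vermeer"

graph_a = {
    (PERSON, "foaf:name", "Anna Vermeer"),
}
graph_b = {
    (PERSON, "schema:affiliation",
     "https://data.example.org/org/utrecht-university"),
}

# Because the subject URI is globally unique, set union is a correct merge:
# both statements now describe the same node.
merged = graph_a | graph_b
print(len(merged))  # 2
```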
We explore metadata serialisation and metadata infrastructure in more detail in the chapter Metadata Infrastructure.
Bringing It All Together
- Key–value metadata is the practical starting point
- Linked Data is the semantic, interoperable extension
- Controlled vocabularies and ontologies provide meaning and standardisation of terms
- Schemas and templates provide structure
Together, they form the foundation of FAIR, reusable research metadata.
Understanding and Combining Metadata Standards
Metadata standards vary in scope—from general to highly specialised. Understanding how they complement each other helps in selecting the right approach.
Dublin Core
A general-purpose, lightweight metadata schema.
It provides:
- 15 core elements (e.g., Title, Creator, Subject, Date)
- Broad applicability across domains
Why it’s useful:
- Simple and widely supported
- Suitable for discovery and basic description
Role: Provides baseline descriptive metadata.
PROV-O
A provenance ontology describing how data was created.
It provides:
- Entities, activities, and agents
- Relationships such as `wasGeneratedBy` and `used`
Why it’s useful:
- Supports reproducibility
- Captures workflows and processes
Role: Adds semantic provenance and process context.
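These relationships can be sketched as triples. The entity and activity names below are invented for the sketch; the `prov:` terms (`wasGeneratedBy`, `used`, `wasAssociatedWith`) are genuine PROV-O properties:

```python
# A tiny provenance graph in the PROV-O pattern (names are illustrative).
provenance = {
    ("cleaned_dataset", "prov:wasGeneratedBy", "cleaning_run"),
    ("cleaning_run", "prov:used", "raw_sensor_data"),
    ("cleaning_run", "prov:wasAssociatedWith", "Anna Vermeer"),
}

def trace_inputs(entity):
    """Find what an entity was derived from, via its generating activity."""
    activities = {o for s, p, o in provenance
                  if s == entity and p == "prov:wasGeneratedBy"}
    return {o for s, p, o in provenance
            if s in activities and p == "prov:used"}

# Reproducibility question: where did the cleaned dataset come from?
print(trace_inputs("cleaned_dataset"))  # {'raw_sensor_data'}
```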
Discipline-Specific Schema - Data Documentation Initiative
Data Documentation Initiative (DDI) organises metadata into three main levels:
| Level | What it describes | Examples |
|---|---|---|
| Study level | The overall research project | Title, investigators, methodology, sampling |
| Dataset level | The data files | File format, number of variables, version |
| Variable level | Individual variables | Question text, response categories, coding |
The variable level is what makes DDI especially powerful.
Study level
- Title: European Social Survey 2022
- Method: Survey
- Sample: Random sample of EU residents
Dataset level
- File: `ess2022.csv`
- Cases: 30,000 respondents
- Variables: 250
Variable level
Variable: `trust_gov`
Question: “How much do you trust the national government?”
Values:
- 0 = No trust
- 10 = Complete trust
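Variable-level metadata like this is essentially a codebook. A minimal sketch of the `trust_gov` example above; the structure and field names are illustrative, not the actual DDI serialisation:

```python
# A codebook: variable-level metadata documenting question text and codes.
codebook = {
    "trust_gov": {
        "question": "How much do you trust the national government?",
        "values": {0: "No trust", 10: "Complete trust"},
    }
}

def label(variable, code):
    """Translate a stored numeric code into its documented meaning."""
    return codebook[variable]["values"].get(code, f"code {code}")

print(label("trust_gov", 0))   # No trust
print(label("trust_gov", 10))  # Complete trust
```

Without this layer, a column of 0s and 10s in `ess2022.csv` would be uninterpretable to anyone outside the original team.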
DDI enables researchers to understand:
- what the data contains (variables),
- how it was collected (methodology), and
- how to reuse it correctly.
It provides:
- Detailed methodological and variable-level metadata
- Support for complex datasets
Why it’s useful:
- Captures domain knowledge essential for reuse
Role: Provides deep, field-specific context.
Discipline-Specific Schema - Investigation–Study–Assay (ISA)
ISA is a discipline-specific metadata framework used in the life sciences to describe experimental workflows, especially in genomics, proteomics, and other omics research.
ISA organises metadata into three hierarchical levels:
| Level | What it describes | Examples |
|---|---|---|
| Investigation | The overall research context | Project title, researchers, objectives |
| Study | A specific experiment or dataset | Study design, subjects, sample characteristics |
| Assay | Analytical measurements and technologies | Sequencing, mass spectrometry, protocols |
The assay level captures how data was actually generated.
Investigation level
- Title: Gut Microbiome and Diet Study
- Objective: Analyse microbiome changes under different diets
Study level
- Subjects: 100 participants
- Design: Controlled dietary intervention
- Samples: Stool samples collected weekly
Assay level
- Technology: DNA sequencing
- Platform: Illumina
- Output: Microbial abundance profiles
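The three ISA levels nest naturally. A sketch of the example above as one nested structure; the field names are illustrative, not the official ISA-Tab/ISA-JSON format:

```python
# Investigation → Study → Assay, nested (values from the example above).
isa_record = {
    "investigation": {
        "title": "Gut Microbiome and Diet Study",
        "studies": [
            {
                "design": "Controlled dietary intervention",
                "subjects": 100,
                "assays": [
                    {"technology": "DNA sequencing", "platform": "Illumina"},
                ],
            },
        ],
    },
}

# The assay level records how the data was actually generated.
assay = isa_record["investigation"]["studies"][0]["assays"][0]
print(assay["technology"])  # DNA sequencing
```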
ISA enables researchers to understand:
- what was studied (samples and subjects),
- how experiments were conducted, and
- how measurements were generated.
It provides:
- Detailed methodological and assay-level metadata
- Support for complex experimental workflows
Why it’s useful:
- Captures domain knowledge essential for reuse
Role: Provides deep, field-specific context.
How They Work Together
| Layer | Purpose | Example |
|---|---|---|
| General description | What is it? | Dublin Core |
| Provenance | How was it created? | PROV-O |
| Domain detail | What does it mean in context? | DDI (social sciences), ISA (life sciences) |
Together, these layers create rich, FAIR, and interoperable metadata.
References
- https://www.w3.org/wiki/LinkedData
- https://www.w3.org/DesignIssues/LinkedData
- https://www.rd-alliance.org/group_output/rda-tdwg-attribution-metadata-working-group-final-recommendations/
- https://ddialliance.org/
- https://isa-tools.org/format/specification.html