The Chicago School Library: Research Data Management: Describing Data

About Metadata for Data

Metadata is often defined as "data about data," a characterization that fails to capture why it is important and what it does. Paul Miller provides a richer description that illustrates metadata's many values and purposes:

“In essence, metadata is the extra baggage associated with any resource that enables a real or potential user to find that resource; to decide whether or not it is of value to them; to discover where, when and by whom it was created, as well as for what purpose; to know what tools will be needed to manipulate the resource; to determine whether or not they will actually be allowed access to the resource itself and how much this will cost them. Metadata is, in short, a means by which largely meaningless data may be transformed into information, interpretable and reusable by those other than the creator of the data resource.”¹

Metadata is structured information about an object, like a dataset, and has value to both the original creator and other users. Complete metadata allows researchers to locate data they created and recall the circumstances and context under which they created and analyzed the data. It allows researchers outside of the original research team to:

Find the data
Know who created the data or contributed to the creation of the data (e.g., a funder)
Understand how the data was created and manipulated
Know when the data was created
Determine tools needed to view, manipulate, and use the data
Understand rights and use conditions surrounding the data
Connect to related information objects

1. Miller, Paul (2004). Metadata: What it means for memory institutions. In Metadata applications and management, ed. G.E. Gorman and Daniel G. Dorner. Lanham, MD: Scarecrow Press, p. 4.

Metadata Standards for Data

Metadata is structured information that provides context for information objects of all kinds, including research data, and in doing so enables discovery, use, exchange, and preservation of those objects. Metadata for data typically includes information about the researchers involved with the data creation, a name or title of the data set, dates associated with the creation of the data, a brief description or abstract, and terms and conditions associated with the data set.
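The typical elements listed above can be sketched as a simple record. The following is a minimal illustration in Python that is not tied to any particular metadata standard; the field names (`creators`, `title`, `dates`, `description`, `rights`), the sample values, and the `missing_fields` helper are all assumptions made for this example.

```python
import json

# Hypothetical field names mirroring the elements named in the text;
# they are not taken from any specific metadata standard.
REQUIRED_FIELDS = ["creators", "title", "dates", "description", "rights"]


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Illustrative values only.
record = {
    "creators": ["Example Researcher"],
    "title": "Example survey responses",
    "dates": {"created": "2020-01-15"},
    "description": "A brief abstract describing the data set.",
    "rights": "",  # left empty on purpose to show the completeness check
}

print(json.dumps(record, indent=2))
print("Missing:", missing_fields(record))
```

A check like this can flag incomplete records before deposit, but which fields are actually required depends on the repository and the standard in use.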

A variety of metadata standards exist for describing data sets. They differ by discipline, by scope (some are international standards), and by other characteristics of the data. Academic disciplines have supported initiatives to formalize metadata specifications within their communities. The type of resource being represented and the intended uses of that resource will influence the choice of metadata standard. Some examples of widely adopted metadata standards include the following:

General

Dublin Core: A general-purpose metadata standard for describing a variety of resources.
DataCite: A domain-agnostic list of core metadata properties chosen for the accurate and consistent identification of data for citation and retrieval purposes.
Digital Curation Centre - List of repository metadata standards, including tools and use cases.
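As a rough illustration of what a general-purpose record can look like, the sketch below uses Python's standard library to write a handful of Dublin Core elements (`title`, `creator`, `date`, `description`, `rights`) as XML. The wrapper element name `metadata`, the sample values, and the `dublin_core_record` helper are assumptions for the example; a real repository will prescribe its own schema and wrapper.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build an XML tree with one Dublin Core element per entry in `fields`."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return root


# Illustrative values only.
root = dublin_core_record({
    "title": "Example survey responses",
    "creator": "Example Researcher",
    "date": "2020-01-15",
    "description": "A brief abstract describing the data set.",
    "rights": "Terms of use go here.",
})

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```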

Sciences

Digital Curation Centre - List of Biology metadata standards, including tools and use cases
Digital Curation Centre - List of Earth Sciences metadata standards, including tools and use cases
Digital Curation Centre - List of Physical Sciences metadata standards, including tools and use cases

Social Sciences

DDI: An international standard for describing data from the social, behavioral, and economic sciences. Expressed in XML, the DDI metadata specification supports the entire research data life cycle.
Digital Curation Centre - List of Social Sciences metadata standards, including tools and use cases

Humanities

CDWA: Categories for the Description of Works of Art serve as a foundational framework for the description of cultural heritage materials.
VRA Core: Visual Resource Association Core Categories, a data standard for the description of works of visual culture as well as the images that document them.
TEI: Text Encoding Initiative, a standard for the digital encoding of literary and linguistic texts.

Data Documentation

For data to be interpretable and useful to others, researchers should document their research workflow, the decisions they make during the research process, and their manipulation of the data. The UK Data Archive outlines a set of best practices for data documentation, which are summarized here:

Good data documentation includes information on:

the context of data collection: project history, aim, objectives and hypotheses
data collection methods: sampling, data collection process, instruments used, hardware and software used, scale and resolution, temporal and geographic coverage and secondary data sources used
dataset structure of data files, study cases, relationships between files
data validation, checking, proofing, cleaning and quality assurance procedures carried out
changes made to data over time since their original creation and identification of different versions of data files
information on access and use conditions or data confidentiality

At data-level, datasets should also be documented with:

names, labels and descriptions for variables, records and their values
explanation of codes and classification schemes used
codes of, and reasons for, missing values
derived data created after collection, with code, algorithm or command file used to create them
weighting and grossing variables created
data listing with descriptions for cases, individuals or items studied

Variable-level descriptions may be embedded within a dataset itself as metadata. Other documentation may be contained in user guides, reports, publications, working papers and laboratory books (see Managing and Sharing Data UK Data Archive).
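Data-level items such as variable labels, code explanations, and missing-value codes can also be kept in a small machine-readable codebook. The sketch below shows one possible shape for that in Python; the variable name `employment`, its codes and labels, and the `decode` helper are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical codebook for a single survey variable.
CODEBOOK = {
    "employment": {
        "label": "Employment status of respondent",
        "codes": {1: "Employed", 2: "Unemployed", 3: "Not in labour force"},
        # Missing-value codes are kept apart from substantive codes,
        # each with the reason the value is missing.
        "missing": {-8: "Don't know", -9: "Refused"},
    }
}


def decode(variable, value):
    """Return (meaning, missing_reason) for a coded value.

    Exactly one of the two items is None: a substantive code has no
    missing reason, and a missing-value code has no meaning.
    """
    entry = CODEBOOK[variable]
    if value in entry["missing"]:
        return None, entry["missing"][value]
    return entry["codes"][value], None


for raw in (1, -9):
    print(raw, "->", decode("employment", raw))
```

Keeping missing-value codes separate from substantive codes makes it harder for a later user to treat a value such as -9 as a real measurement.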

Readme Files

In the context of research data, a readme file is a plain text file (.txt) that helps others understand your data and interconnections among data files. By titling the file "readme," the data creator signals to other users that this file should be read first. For researchers depositing data in D-Scholarship@Pitt, the information in the readme file may mirror and augment information included in the metadata form and, if the deposit includes multiple files, may explain the file naming structure, relationship among the files, and abbreviations used.

Cornell University's Research Data Management Service Group has made a useful readme file template available for download. At a minimum, the Cornell group recommends completing the following sections in the readme file template:

General information

Data set title
Name and contact information for investigators
Date (or date range) of data collection
Geographic location of data collection

Data and file overview

A short description of each file
Date that the file was created

Methodological information

Description of methods for data collection
Description of methods for data processing

Data-specific information

Variable list, with full names and definitions of column headings if tabular data
Units of measurement
Definitions for codes or symbols used to record missing information

(See Cornell University, Guide to writing "readme" style metadata.)
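The minimum sections listed above can be turned into a plain-text skeleton that is filled in as a project progresses. The following Python sketch is an illustration only and is not the Cornell template itself; the `SECTIONS` layout, the `[to be completed]` placeholder, and the `build_readme` helper are assumptions made for the example.

```python
# Section headings and prompts follow the minimum list in the text;
# the layout and wording of the output are assumptions.
SECTIONS = {
    "GENERAL INFORMATION": [
        "Data set title",
        "Investigator name and contact information",
        "Date (or date range) of data collection",
        "Geographic location of data collection",
    ],
    "DATA AND FILE OVERVIEW": [
        "Short description of each file",
        "Date each file was created",
    ],
    "METHODOLOGICAL INFORMATION": [
        "Methods for data collection",
        "Methods for data processing",
    ],
    "DATA-SPECIFIC INFORMATION": [
        "Variable list with full names and definitions",
        "Units of measurement",
        "Codes or symbols used for missing information",
    ],
}


def build_readme(values):
    """Return readme text, marking any unanswered prompt as a placeholder."""
    lines = []
    for heading, prompts in SECTIONS.items():
        lines.append(heading)
        for prompt in prompts:
            answer = values.get(prompt, "[to be completed]")
            lines.append(f"{prompt}: {answer}")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


readme_text = build_readme({"Data set title": "Example survey responses"})
print(readme_text)
```

The resulting text can be saved as readme.txt alongside the data files; visible placeholders make it easy to spot which sections still need attention before deposit.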

Data Dictionaries

A data dictionary describes all the data stored in a data set or used by a database, including their types, attributes, structure, relationships, and usage in the database or software program. A good data dictionary can be a valuable part of the metadata describing a data set, giving a user a clear understanding of the content and organization of the data and of how the data could be modified, if necessary. In the context of a database or software package, the data dictionary may be an essential piece of software that programmers and the database management system require to access and use the data properly. The user view of a data dictionary is usually presented as a table or spreadsheet. Dictionaries may also be expressed in XML or other markup languages. A data dictionary does not contain the data; it only describes the data.

A data dictionary typically contains a list of all files in the database, names for each file, the type of data included, a list of all field names and variable names, a description of the information contained in each field, and the various attributes of each field. These may include type (text, date, numeric, etc.), standard formats, units, field length, description, unique identifiers, default values, whether a value is required or not, and more, depending on the specific data.
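A first draft of such a dictionary can be generated from a tabular file and then completed by hand. The sketch below, in Python, reads a small CSV sample and records a few of the attributes mentioned above (type, field length, whether a value is required) for each field. The sample data, the attribute names, and the `infer_type` and `draft_dictionary` helpers are assumptions for illustration; the type inference is deliberately crude and descriptions still have to be written by someone who knows the data.

```python
import csv
import io

# Illustrative data only; empty cells stand for values that were not recorded.
SAMPLE_CSV = """site_id,visit_date,temperature_c,notes
A1,2020-01-15,4.5,clear
A2,2020-01-16,,windy
B1,2020-01-16,3.0,
"""


def infer_type(values):
    """Classify a column as 'numeric', 'text', or 'unknown' (all empty).

    Dates are not detected and are reported as 'text'.
    """
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "text"
    return "numeric"


def draft_dictionary(csv_text):
    """Return one dictionary entry per column of the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        entries.append({
            "field": name,
            "type": infer_type(values),
            "max_length": max((len(v) for v in values), default=0),
            "required": all(v != "" for v in values),
            "description": "",  # to be written by the data creator
        })
    return entries


for entry in draft_dictionary(SAMPLE_CSV):
    print(entry)
```

Note that the output describes the fields and contains none of the data values themselves, which matches the distinction drawn above between a data dictionary and the data it documents.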

For some examples of data dictionaries, check the following sites:

Data Dictionary Examples – Ag Data Commons – National Agricultural Library - USDA
Sample Dataset 2014 - Statistical Consulting at University Libraries, Kent State University. Click on the link to “Data definitions (*pdf)” in the Sample Data Files section.
Fleet DNA Data Dictionary – National Renewable Energy Laboratory (NREL).
Protein Data Bank Exchange Data Dictionary (PDBx/mmCIF V4.0) – Worldwide Protein Data Bank. There are separate tabs for Category Groups, Data Categories, and Data Items.