An important but sometimes neglected step in generating research data is writing documentation or metadata to accompany it. Metadata provides context and meaning to your research data or research materials and enables you and others to make sense of and reuse your data in the future.
It is best practice to prepare metadata while you are conducting research. You will also be asked to provide metadata when you deposit your research data in a repository for preservation.
On this page, you’ll find guidance on how to describe your research data or research materials, and how to structure this metadata.
What to include
At a minimum, your documentation or metadata should clearly tell the story of how you gathered and used your data or research materials and for what purpose.
You may also need to provide some broader context to explain the motivation for the design decisions you have taken and the significance of what you found.
More specifically, you may need to include some of the following elements:
- details of the equipment used, such as the make and model of the instrument, the settings used, and how it was calibrated
- the text of survey instruments used, including questionnaires and interview templates
- details of who collected the data and when
- citations for any third-party data you have used
- key features of the methodology, such as the sampling technique, whether the experiment was blinded, and how sample groups were subdivided
- legal and ethical agreements relating to the data, such as consent forms, data licences, approval documents or COSHH forms
- details of the file formats and standard data structures used to record data and supporting information
- a glossary of column names, abbreviations, coding and controlled vocabulary used, explaining for example which measurement each column heading represents and the units of measurement used
- the codebook used to analyse and encode content
- the workflow used to process and manipulate data, including steps such as applying a statistical test or removing outliers
- details of the software used to generate or process the data, including version number and platform
You may be recording some of this information in a lab notebook or research journal. If so, you may find it convenient to record the corresponding page numbers alongside the data files until you have an opportunity to transfer the information into a documentation file.
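As a minimal sketch of the column glossary described above, the following Python snippet renders a small data dictionary as CSV text that could be saved alongside the data files. The column names, units and codes are invented for illustration, and the four-field layout is an assumption rather than a required convention.

```python
import csv
import io

# Hypothetical columns for an illustrative dataset; replace with your own.
DATA_DICTIONARY = [
    {"column": "temp_c", "description": "Air temperature at 2 m",
     "units": "degrees Celsius", "codes": ""},
    {"column": "site_id", "description": "Sampling site identifier",
     "units": "", "codes": "A = upland; B = lowland"},
    {"column": "resp", "description": "Respondent answer to question 4",
     "units": "", "codes": "1 = yes; 2 = no; 9 = missing"},
]

FIELDS = ["column", "description", "units", "codes"]


def render_data_dictionary(rows):
    """Return the data dictionary as CSV text: a header row, then one row per column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def undocumented_columns(data_header, rows):
    """Return the data columns that have no entry in the data dictionary."""
    documented = {row["column"] for row in rows}
    return [name for name in data_header if name not in documented]


print(render_data_dictionary(DATA_DICTIONARY))
# Check a (hypothetical) data file header against the dictionary.
print(undocumented_columns(["temp_c", "site_id", "humidity"], DATA_DICTIONARY))
```

Running a check like `undocumented_columns` whenever a data file changes is one way to notice columns that were added without being described.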
Where to record the information
Depending on the context, you can record documentation in several places:
- Within the data file: Some file formats can record information in addition to the main data content. For example, the Observations and Measurements XML standard provides a way of recording sampling strategies and observation procedures as well as measurement values
- In a separate metadata file: Some disciplines have developed special file formats or data structures for recording supporting information. For example, the Agricultural Metadata Element Set (AgMES) provides a way of describing an agricultural dataset using the subject-predicate-object structure of the Resource Description Framework (RDF)
- In a readme file: Any information that cannot be recorded in a structured way (i.e. as the values of fields in a data or metadata file) can be recorded as free text within a readme file. A readme file is a plain text file that is named 'readme' to encourage users to read it before looking at the remainder of the content. It can contain documentation directly or instruct the reader where to look to find more information. The file should be structured into sections as an aid to the reader
- In a published journal article: Some of the information needed to understand data would normally be provided in a journal article reporting the research. To avoid duplicating effort, you can refer to an article for more information about a dataset, but before doing so you should be sure that (a) the article provides sufficient detail, and (b) the article will be available on open access
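One way to keep a readme file consistently structured into sections is to generate it from a fixed list of headings. The sketch below is only an illustration: the section headings and example values are assumptions, not a prescribed template, and should be adapted to your discipline or repository.

```python
# Section headings are illustrative assumptions, not a required standard.
README_SECTIONS = [
    "Title",
    "Creators",
    "Date of data collection",
    "Methodology",
    "File formats and structure",
    "Licence and access conditions",
    "Related publication",
]


def build_readme(details):
    """Return readme text with one underlined heading per section.

    Sections with no entry in `details` are kept and marked as outstanding,
    so that gaps in the documentation stay visible.
    """
    blocks = []
    for heading in README_SECTIONS:
        body = details.get(heading, "TO BE COMPLETED")
        blocks.append(f"{heading}\n{'-' * len(heading)}\n{body}\n")
    return "\n".join(blocks)


# Hypothetical project details used only to demonstrate the output.
example = build_readme({
    "Title": "Example soil moisture survey (hypothetical)",
    "Creators": "A. Researcher",
    "File formats and structure": "readings.csv: one row per site visit",
})
print(example)
```

Keeping unfilled sections in the output, rather than omitting them, makes it easier to see what still needs to be transferred from a lab notebook or research journal.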
Structured metadata
Metadata are most useful when they have been structured, that is, arranged as properties and values. The metadata you provide for reuse will depend on the field of your research:
- Social scientists often package their data and metadata together using DDI (Data Documentation Initiative), MODS (Metadata Object Description Schema) or, if the data are strongly statistical in nature, SDMX (Statistical Data and Metadata eXchange)
- In the humanities, commonly used metadata standards include TEI (Text Encoding Initiative) and VRA Core (from the Visual Resources Association)
- Many types of biological and biomedical investigation have a corresponding Minimum Information standard, setting out what information would be needed to interpret the data unambiguously and reproduce the experiment
- Ecological and natural sciences use EML (Ecological Metadata Language), DIF (Directory Interchange Format), and Darwin Core
- For life sciences, consult the FAIRsharing registry of metadata standards (formerly the BioSharing Standards Registry)
- Geospatial data are usually packaged in a format that complies with the standard ISO 19115. There are many profiles of this standard aimed at different communities; UK researchers are encouraged to use UK GEMINI, which is in turn compliant with the European INSPIRE Directive
- Dublin Core is a general-purpose metadata schema that can be used to describe data from any discipline
- Some subject-specific data archives ask for data to be submitted in a particular format. For example, the NCBI Gene Expression Omnibus specifies a metadata set to be submitted along with data, and has developed the spreadsheet-based GEOarchive format for capturing it
If you decide or are required to offer your data to a subject-specific data centre, you should contact them in the early stages of your project to discuss their metadata requirements. This can save a lot of additional work later on as some metadata can only be collected accurately at the point of data creation.
For more information about the data and metadata standards available for your subject area, see this list of Disciplinary Metadata from the Digital Curation Centre.
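To illustrate what "arranged as properties and values" can look like in practice, the following sketch builds a simple Dublin Core record using Python's standard library. The property names come from the Dublin Core element set; the wrapper element name `metadata` and all of the values are assumptions made for the example, and a repository may expect a different packaging.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(properties):
    """Build a simple XML record from (element, value) pairs.

    `properties` is a list of pairs rather than a dict so that an
    element such as 'creator' can be repeated.
    """
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for element, value in properties:
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return root


# Hypothetical dataset description used only to demonstrate the structure.
record = dublin_core_record([
    ("title", "Example soil moisture survey (hypothetical)"),
    ("creator", "Researcher, A."),
    ("creator", "Colleague, B."),
    ("date", "2020-01-31"),
    ("format", "text/csv"),
    ("rights", "CC BY 4.0"),
])
print(ET.tostring(record, encoding="unicode"))
```

The same list of property and value pairs could equally be written out in another structured form; the point is that each value is attached to a named property rather than buried in free text.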
Vocabularies
As an aid to clarity, some subject areas have agreed on a common set of terminology to use when describing data. If metadata standards list the properties that need to be known, vocabularies help with providing useful values.
- The NERC Vocabulary Server provides access to many different vocabularies in use in geoscience and oceanography.
- The Open Knowledge Foundation runs the Linked Open Vocabularies service, which provides schemas for over 750 different vocabularies from Air Traffic Data to XML Schema for use in Resource Description Framework (RDF) applications.
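To show how a vocabulary constrains the values of a metadata property, here is a minimal sketch that checks keywords against a list of permitted terms. The terms are invented for the example; in practice the list would be taken from a published vocabulary such as those listed above.

```python
# Hypothetical controlled vocabulary; real terms would come from a published source.
PERMITTED_TERMS = {"sea surface temperature", "salinity", "wind speed"}


def check_terms(values, vocabulary):
    """Split values into those found in the vocabulary and those that are not.

    Matching ignores case and surrounding whitespace. Accepted values are
    returned in their normalised form; rejected values are returned unchanged.
    """
    accepted, rejected = [], []
    for value in values:
        normalised = value.strip().lower()
        if normalised in vocabulary:
            accepted.append(normalised)
        else:
            rejected.append(value)
    return accepted, rejected


accepted, rejected = check_terms(["Salinity", " wind speed ", "SST"], PERMITTED_TERMS)
print("accepted:", accepted)
print("not in vocabulary:", rejected)
```

Here the abbreviation "SST" is rejected even though a person would recognise it, which is the kind of inconsistency a shared vocabulary is meant to surface.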
More information
You can find further guidance on how to document your data from the UK Data Service, and read this guide to writing ‘readme’ style metadata from Cornell University.
You can also find detailed guidance on documentation and metadata from MIT Libraries.
Contact us
For further support and guidance, contact the Research Data Management Officer at [email protected].