As a data creator, you have certain rights over your work and an opportunity to license your data appropriately to facilitate sharing and reuse. How copyright and licensing apply depends on several factors, chiefly whether your data set contains quantitative data, qualitative data, or sensitive information.
Best practices include:
The information presented here is a brief overview of a very complicated topic. Please get in touch with the Research Data Management team for help with any of the rights and permissions considerations below.
By quantitative data, we mean data that are numerical values or measurements of facts about the universe. Because facts are not subject to copyright, most quantitative data are not copyrightable in the United States and copyright laws usually do not apply or are not enforceable.
However, the arrangement, selection, and coordination of the data set as a whole may be subject to copyright. This depends on the creativity involved with arranging and displaying the data.
Many researchers believe in the importance of sharing data openly to facilitate the greatest possible reuse of the data. For example, Dryad and the Panton Principles for Open Data strongly recommend that data be contributed to the public domain. When a data set is dedicated to the Public Domain, the creator declares that others may use the data set in its current form (and, therefore, any potential copyright in the arrangement, selection, and coordination of the data set is also dedicated to the Public Domain). Below are two examples of licenses that a data set creator can apply to a quantitative data set to dedicate it to the Public Domain.
By qualitative data, we mean data that contain observations, texts, conversations, and artistic or creative works, which are usually collected in the humanities and social science fields. Some examples of qualitative data include text corpora, interviews, photographs, and social media output. Because these are often creative expressions made by individuals that are fixed in a tangible form, many of these data sets are subject to copyright, and permission may need to be obtained for their use. For those compiling qualitative data sets, privacy, ethics, and licenses are key concerns.
For those collecting interviews or other recordings and documentation made by research subjects, clear guidelines for the usage and ownership of these materials should be set out in a Consent Form and cleared with the IRB. This is also the case when research work is conducted via the Internet.
Considerations and Recommendations Concerning Internet Research and Human Subjects Research Regulations from the US Department of Health and Human Services
Guidelines for Ethical Conduct in Participant Observation (University of Toronto) - contains advice on what to consider when writing a consent form and protocol.
Communicating Qualitative Research Study Designs to Research Ethics Review Boards (2011) by Carolyn Ells - a discussion of ethical issues in collecting data for qualitative research studies and how to construct a protocol that reflects these considerations.
Researchers must identify whether the data are in the public domain, are subject to licensing terms, or can be used under the doctrine of Fair Use. Because qualitative data sets often involve substantial transformative use of the source material, a Fair Use argument may be particularly powerful for them.
Copyright and Intellectual Property Toolkit: Public Domain - contains information on how to determine if an item is in the public domain.
Copyright and Intellectual Property Toolkit: Fair Use - contains information about the doctrine of Fair Use and tools for making a Fair Use argument.
Understanding Fair Use: Transformative Use - read more about the "Fifth Factor" of Fair Use, Transformation.
When obtaining data from the Internet via scraping tools, the restrictions in the source website's Terms of Service and Developer Policies apply. This is especially relevant for data from social media websites.
Fair Use in the Age of Social Media - an article from Forbes magazine covering the basics of Fair Use in social media contexts.
Challenges of Using Twitter as a Data Source - from the London School of Economics, covers some of the issues with using and sharing qualitative social media data sets, including licensing issues. See also Twitter's Developer Policy, which applies to those creating data sets by scraping Twitter.
For data sets that contain sensitive information, e.g. data from human subjects research, access control may be an option. Mixed levels of access control may be put in place for some data, combining controlled access to confidential data with standard access to non-confidential data.
Before data collected during research with human subjects are published, researchers should ensure the removal of any personally identifiable information (PII). A documented plan for anonymizing the data will serve to mitigate the risk to participants, encourage consistent practices among the research team throughout the project, and help future users understand what decisions were made during anonymization.
Some approaches for anonymization include:
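As one illustration only, the sketch below shows how a documented anonymization step might look in Python. It assumes three common techniques (dropping direct identifiers, replacing an identifier with a keyed pseudonym, and generalizing exact ages into bands). The field names (name, email, age), the function names, and the secret key are all hypothetical and would need to be adapted to the actual data set and consent agreement. Pseudonymized data can still carry re-identification risk, so this is a starting point rather than a complete plan.

```python
import hashlib
import hmac

# Hypothetical column names for direct identifiers; adapt to your data set.
DIRECT_IDENTIFIERS = {"name", "email"}


def pseudonym(value: str, secret_key: bytes) -> str:
    """Return a stable keyed hash, so one participant always maps to one code."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "P-" + digest[:12]


def age_band(age: int, width: int = 10) -> str:
    """Generalize an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers, add a pseudonymous ID, and band the age."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["participant_id"] = pseudonym(record["email"], secret_key)
    cleaned["age"] = age_band(record["age"])
    return cleaned


if __name__ == "__main__":
    # Illustrative record; not real data.
    example = {"name": "Ada", "email": "ada@example.org", "age": 34, "response": "yes"}
    print(anonymize(example, secret_key=b"keep-this-key-out-of-the-published-data"))
```

Keeping the key separate from the published files lets the research team link records internally, while recording the rules in code documents the decisions for future users.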
Beyond the Public Domain licensing options above, there are some other licensing options that can apply to data sets. Creative Commons licenses allow creators to specify the rights for reuse - typically with attribution to the creator, but potentially also including bans on commercial use and derivatives. It is not recommended to prohibit derivative works on a data set, as this will compromise the usability of that data set.
How to License Research Data by the Digital Curation Centre (UK)
Copyright and Intellectual Property Toolkit: Creative Commons, Copyleft, and Other Licenses
Open Data Commons - licenses specifically created for data reuse, including a Public Domain dedication as well as an Attribution Required license.
Licenses can work in tandem with access control, Fair Use, and ethical considerations detailed above. For complex situations, contact us for guidance.
Copyright law protects the original creative expressions that are fixed in either physical or digital form. The US Copyright Law provides examples of creative works that are protected -- including literary works, musical works, and motion pictures -- and works that are not eligible for copyright protection -- including ideas, processes, and concepts. Factual information has been interpreted as being outside of the protection of the copyright law, which has implications for data. Peter Hirtle of Cornell University Library cautions, "Not all data is in the public domain. A project might, for example, be built around copyrighted photographs; the photographs are part of the project’s 'data.' But in many cases, the data in a data management system as well as the metadata describing that data will be factual, and hence not protected by copyright." For more information, see Cornell's "Introduction to Intellectual Property Rights in Data Management."
Even if datasets are not protected under copyright, researchers who are not the creators may be uncertain whether they are indeed allowed to use them for their own work. Licenses that clearly outline the terms of use can help to alleviate this uncertainty and to promote the use of the data. Creative Commons licenses and Open Data Commons licenses are two noteworthy instruments for specifying the terms of use for datasets.
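One way to make the terms of use clear is to record the chosen license in a small machine-readable metadata file that travels with the data set. The Python sketch below is a minimal illustration under stated assumptions: the metadata layout, the function name license_metadata, and the example title and creator are hypothetical, and the license identifiers follow the SPDX license list naming. It is not a required format for any particular repository.

```python
import json

# A few licenses relevant to data sets, keyed by SPDX-style identifier.
KNOWN_LICENSES = {
    "CC0-1.0": "Creative Commons Zero v1.0 Universal",
    "PDDL-1.0": "Open Data Commons Public Domain Dedication & License 1.0",
    "CC-BY-4.0": "Creative Commons Attribution 4.0 International",
    "ODC-By-1.0": "Open Data Commons Attribution License v1.0",
}


def license_metadata(title: str, creator: str, license_id: str) -> dict:
    """Build a metadata record (hypothetical layout) that states the license."""
    if license_id not in KNOWN_LICENSES:
        raise ValueError(f"Unrecognized license identifier: {license_id}")
    return {
        "title": title,
        "creator": creator,
        "license": {"id": license_id, "name": KNOWN_LICENSES[license_id]},
    }


if __name__ == "__main__":
    # Illustrative values only.
    record = license_metadata("Example survey data", "Example Lab", "CC0-1.0")
    print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the data files gives later users an unambiguous statement of the terms, whichever license is chosen.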