Research Guides: Data Literacy for Researchers: Finding Data

Narrowing the Search for Data

To determine what data you need, you must first define the area of interest for your research question. Here are some questions to consider:

What is the subject or topic of your research?
How could your research question be measured? (e.g. by recorded observations, a survey instrument, demographic counts)
What are the geographic constraints or units?
What are the time constraints (a range of years; monthly, quarterly or annually)?
Do you need cross-sectional data (observations of many different subjects at a given time) or longitudinal data (the same type of information tracked on the same subjects at multiple points in time)?
Do you need quantitative data (measurements, counts, rankings) or qualitative data (texts, surveys, opinions, words), or both?

Questions adapted from Partlo, Kristin. 2009. "The Pedagogical Data Reference Interview." IASSIST Quarterly 33 (4): 6-10. Available at: https://iassistdata.org/sites/default/files/iqvol334_341partlo.pdf. Accessed via Staff and Faculty Work. Library. Carleton Digital Commons https://digitalcommons.carleton.edu/libr_staff_faculty/5
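The difference between cross-sectional and longitudinal data can be easier to see with a small example. The sketch below uses invented records (the subject IDs, years, and income values are all hypothetical) and shows that the same table can be read either way, depending on whether you fix the time point or fix the subject.

```python
# Hypothetical records: (subject_id, year, income). All values are invented.
records = [
    ("A", 2020, 41000), ("A", 2021, 43000), ("A", 2022, 46000),
    ("B", 2020, 52000), ("B", 2021, 51000), ("B", 2022, 55000),
    ("C", 2020, 38000), ("C", 2021, 40000), ("C", 2022, 39500),
]


def cross_section(rows, year):
    """Many subjects observed at a single point in time."""
    return {subject: value for subject, y, value in rows if y == year}


def longitudinal(rows, subject_id):
    """One subject followed across multiple points in time, ordered by year."""
    return sorted((y, value) for subject, y, value in rows if subject == subject_id)


print(cross_section(records, 2021))   # every subject, one year
print(longitudinal(records, "A"))     # one subject, every year
```

If your research question is about change within subjects over time, you will likely need a dataset that supports the second view, meaning it keeps a stable subject identifier across collection periods.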

Conduct a literature review to identify existing datasets used by studies relevant to your research question, as well as gaps in these datasets.
Following a literature review, consider how your research question addresses existing gaps and whether generating new data sets will help you answer that question.

Finding Datasets in Repositories

Data repositories contain published datasets that are typically associated with publications or ongoing research projects. Data repositories are used to store and preserve data so that researchers can access and analyze it.

There are two types of repositories we will discuss: scholarly and public.

Scholarly: Scholarly data repositories are managed by organizations or scientific societies and often have stricter guidelines on the format and level of detail in submissions. They are generally well maintained and contain data sets from well-controlled studies, with detailed descriptions and metadata. Access to these repositories may be restricted.
Public: While some data repositories are accessible only to select user groups, many are publicly accessible to anyone with an internet connection. Public data repositories can have a lot of interesting and useful data, but make sure to carefully evaluate the source and scope of the data, as well as the inclusion and preservation practices of the repository itself.

When searching for data in scholarly repositories, check whether each data set is associated with a research publication, and make sure the record contains enough information for you to reuse the data correctly.

Here are some scholarly repositories:

Sage Data (powered by Data Planet)
A repository of publicly, privately, and commercially sourced statistical time-series data, with integrated analysis and mapping tools.

And here is a list of public repositories:

When collecting data, there are two main considerations.

First, research data sometimes has to be purchased and/or used under strict terms of agreement, or following specific privacy protocols. When purchasing data sets, or downloading protected data, be sure the data is stored in a safe and secure environment. It's important to respect copyright permissions and understand what constitutes fair use.

Second, examine the context and content of the data carefully to understand any potential biases or limitations and prevent misuse. Published data are often associated with specific experimental strategies. While the strategies and limitations are often discussed when data are published with research papers, this information is not always available with the data set. When selecting a data set for use:

Understand the methodological limitations of the data collection
Look for a data dictionary, or some sort of readme file that contains clear information about the variables and observations found in the data set
Try to find information on the context in which the data was collected, as this can inform limitations on how data should be used
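One practical way to act on the data dictionary advice above is to compare the columns in a data file against the variables the dictionary documents. The following is a minimal sketch under stated assumptions: the variable names, descriptions, and file contents are hypothetical, and the dictionary is represented as a simple Python mapping rather than any particular repository's format.

```python
import csv
import io

# Hypothetical data dictionary: variable name -> description.
data_dictionary = {
    "county_id": "County identifier",
    "year": "Calendar year of observation",
    "population": "Resident population estimate (persons)",
}

# Hypothetical file contents; in practice this would be a downloaded file
# opened with open(path, newline="").
sample_csv = io.StringIO(
    "county_id,year,population,pop_flag\n"
    "00001,2020,1000,\n"
)


def check_against_dictionary(csv_file, dictionary):
    """Return (undocumented columns, documented variables absent from the file)."""
    header = next(csv.reader(csv_file))
    columns = [name.strip() for name in header]
    undocumented = [c for c in columns if c not in dictionary]
    missing = [v for v in dictionary if v not in columns]
    return undocumented, missing


undocumented, missing = check_against_dictionary(sample_csv, data_dictionary)
print("Undocumented columns:", undocumented)
print("Documented but absent:", missing)
```

A column that appears in the file but not in the dictionary (here, the invented pop_flag) is a signal to look for further documentation before using that variable.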

Copyright and Fair Use Tools (Stanford)
Fair Use Infographic
The Association of Research Libraries (ARL)
Ask the Copyright Genie if the work is covered by copyright
A Guide to Works You Can Use Freely
University of Montana library guide
Copyright Basics
University of Cincinnati library research guide

If your research question cannot be answered with existing datasets, it may be necessary to create your own. Creating data can be done through practices such as observation, surveying, simulation, and experimentation, as well as through methods that extract data from existing bodies of information such as web-scraping or text & data mining (TDM).

Data collection looks different for different disciplines. Here we include some generalized resources to assist with creating datasets:

When collecting original data to answer research questions, there are a few key things to think about in order to be sure the findings are accurate and can be used to draw conclusions relating to your research question.

Take detailed notes about the methods you are using to collect data. If data collection does not go as planned, record which aspects of the methods were changed.
If working with human-subjects data, or protected data, be sure to check with your local Institutional Review Board (IRB) office to see what kinds of protections need to be put in place for storing and reporting your data.
It is important to collect samples that properly represent your subject of study. If you expect to see certain results, compare your experimental sample with a sample with known results to see if the result is aligned with your expectations.
- A sample with known results is called a control sample. A positive control is a sample where you expect to see the effect you are testing for. A negative control is a sample where you expect not to see that effect.
If you plan to collect data, be sure to outline how you will organize and analyze the data beforehand. If the work is to be published, have a plan to share the data. See more about data management below.
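The role of control samples described above can be sketched as a simple check: a measurement on the experimental sample is only interpreted if both controls behaved as expected. This is an illustrative sketch, not a statistical method. The function names, the single numeric threshold, and the measurement values are all assumptions made for the example; real experiments typically use replicates and discipline-specific criteria.

```python
def run_is_valid(positive_control, negative_control, threshold):
    """A run is interpretable only if both controls behave as expected."""
    positive_ok = positive_control >= threshold   # the effect should appear
    negative_ok = negative_control < threshold    # the effect should be absent
    return positive_ok and negative_ok


def interpret(sample, positive_control, negative_control, threshold):
    """Interpret the experimental sample only when the controls are valid."""
    if not run_is_valid(positive_control, negative_control, threshold):
        return "invalid run: controls did not behave as expected"
    return "effect detected" if sample >= threshold else "no effect detected"


# Hypothetical measurements on an arbitrary 0-1 scale.
print(interpret(sample=0.82, positive_control=0.95, negative_control=0.04, threshold=0.5))
print(interpret(sample=0.82, positive_control=0.10, negative_control=0.04, threshold=0.5))
```

In the second call the positive control fails to show the effect, so the same sample value is not interpreted at all. Recording control results alongside the experimental data makes this kind of check possible later.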

Getting Started with IRB
The UCLA Institutional Review Board provides guidelines, training, and approval for working with protected data
Explaining Experimental Controls
Video and transcript explaining how control samples are used in scientific experimentation.

Data Management

Data Management Plans
Ethics

Data management plans (DMPs) are formal plans that describe the data you expect to acquire or generate through your research, along with how you plan to manage it, analyze it, and share it. Here are some resources focused on data management planning:

Data Management Plans (DMPs) are crucial to reproducible research practice because they provide a framework for how research data are managed and stored. This helps prevent data rot, the loss and/or corruption of data stored on individual computer hard drives. It also makes it easier to find data after research projects are completed and improves transparency by allowing collaborators or like-minded researchers to download the data and verify the analysis. DMPs also prompt researchers to think through the necessary privacy and compliance considerations, especially if the research involves human subjects or some other type of protected data. For these reasons, DMPs are often required as a part of research funding proposals.
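One common way to detect the kind of silent corruption described above is to record a checksum for each data file and re-check it periodically. This guide does not prescribe a particular technique, so the sketch below is only an illustration: the function names and the sample file are hypothetical, and it uses SHA-256 from Python's standard hashlib module.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    folder = Path(folder)
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder, manifest):
    """List files whose contents no longer match the manifest, or are missing."""
    current = build_manifest(folder)
    return sorted(name for name in manifest if current.get(name) != manifest[name])


# Demonstration in a temporary folder with an invented data file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "survey.csv"
    data_file.write_text("id,score\n1,4\n")
    manifest = build_manifest(tmp)            # record checksums at deposit time
    data_file.write_text("id,score\n1,9\n")   # simulate later corruption
    print(changed_files(tmp, manifest))
```

A DMP could state where such a manifest is stored and how often it is checked, so that a damaged file can be restored from a backup copy.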