Data Literacy for Researchers

An introductory guide to finding, analyzing, and communicating data in the research process.

Narrowing the Search for Data

To determine what data you need, you must first define the area of interest for your research question. As you begin looking for data, here are some questions to consider:

  • What is the subject or topic of your research?
  • How could your research question be measured? (e.g. by recorded observations, a survey instrument, demographic counts)
  • What are the geographic constraints or units?
  • What are the time constraints (a range of years; monthly, quarterly, or annual intervals)?
  • Do you need cross-sectional data (observations of many different subjects at a given time) or longitudinal data (the same type of information tracked on the same subjects at multiple points in time)?
  • Do you need quantitative data (measurements, counts, rankings) or qualitative data (texts, surveys, opinions, words), or both?

Questions adapted from Partlo, Kristin. 2009. "The Pedagogical Data Reference Interview." IASSIST Quarterly 33 (4): 6-10. Available at: https://iassistdata.org/sites/default/files/iqvol334_341partlo.pdf. Accessed via Staff and Faculty Work, Library, Carleton Digital Commons: https://digitalcommons.carleton.edu/libr_staff_faculty/5
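The difference between cross-sectional and longitudinal data can be made concrete with a small check. The following is a minimal sketch in Python, not a standard tool: the function name classify_design, the (subject, period, value) record layout, and the sample values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def classify_design(records):
    """Classify (subject, period, value) records by study design.

    Hypothetical helper for illustration only.
    """
    # Collect the set of time periods in which each subject was observed.
    periods_by_subject = defaultdict(set)
    for subject, period, _value in records:
        periods_by_subject[subject].add(period)

    # All distinct time periods that appear anywhere in the data.
    all_periods = set().union(*periods_by_subject.values())

    if len(all_periods) <= 1:
        # Many subjects, one point in time.
        return "cross-sectional"
    if all(len(periods) > 1 for periods in periods_by_subject.values()):
        # Every subject is observed at more than one point in time.
        return "longitudinal"
    # Several time points, but not every subject is followed over time.
    return "mixed"


# Made-up sample records: (subject, year, measurement).
one_year = [("A", 2020, 51), ("B", 2020, 48), ("C", 2020, 60)]
panel = [("A", 2019, 50), ("A", 2020, 51), ("B", 2019, 47), ("B", 2020, 48)]

print(classify_design(one_year))  # cross-sectional
print(classify_design(panel))     # longitudinal
```

Knowing which design a dataset follows matters because it determines whether you can study change within the same subjects over time or only compare subjects at one moment.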

Next:

  • Conduct a literature review to identify existing datasets used by studies relevant to your research question, as well as gaps in these datasets.
  • Following the literature review, consider how your research question addresses existing gaps and whether generating new datasets will help you answer that question.

Finding Datasets in Repositories

Data repositories contain published datasets that are typically associated with publications or ongoing research projects. Data repositories are used to store, preserve, and retrieve data so that other researchers can access and analyze it.

Here are some key general-purpose data repositories that hold a wide variety of data types. When searching for data in general repositories, check whether each dataset is associated with a research publication, and make sure the record contains enough information for you to reuse the data correctly.

Some repositories are designed for specific disciplines or data types. Discipline-specific repositories are often managed by organizations or scientific societies and enforce strict guidelines on the format and level of detail in submissions. Datasets found in disciplinary repositories might not be associated with a research publication, but they generally come from well-controlled studies and are described in detail.

There are too many discipline-specific repositories to list. Below we include some tools for finding more specialized repositories.

While some data repositories are accessible only to select user groups, many are publicly accessible to anyone with an internet connection. When using public data repositories, make sure to carefully evaluate the source and scope of the data, as well as the inclusion and preservation practices of the repository itself.

When collecting data, there are two main considerations.

First, research data sometimes has to be purchased and/or used under strict terms of agreement, or following specific privacy protocols. When purchasing data sets, or downloading protected data, be sure the data is stored in a safe and secure environment. It's important to respect copyright permissions and understand what constitutes fair use.

Second, looking carefully into the context and content of the data helps you understand its potential biases and limitations, and so avoid misusing it. Published data is often associated with specific experimental strategies. While the strategies and limitations are often discussed when data are published with research papers, this information is not always available with the data set. When selecting a data set for use:

  • Understand the methodological limitations of the data collection
  • Look for a data dictionary or a readme file that contains clear information about the variables and observations found in the data set
  • Try to find information on the context in which the data was collected, as this can inform limitations on how data should be used
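One way to act on the data dictionary advice above is to compare the column names in a data file against the variables the dictionary documents. The following is a minimal sketch in Python using only the standard library. The function name compare_with_dictionary, the column names, and the descriptions are hypothetical examples, not taken from any real data set.

```python
import csv
import io


def compare_with_dictionary(csv_text, dictionary):
    """Return (undocumented columns, dictionary entries missing from the data).

    Hypothetical helper for illustration only.
    """
    # Read only the header row of the CSV text.
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = [name.strip() for name in header]

    # Columns present in the data but not described in the dictionary.
    undocumented = [c for c in columns if c not in dictionary]
    # Variables described in the dictionary but absent from the data.
    missing = [v for v in dictionary if v not in columns]
    return undocumented, missing


# Made-up data file contents and data dictionary.
sample_csv = "site_id,year,temp_c,q1\n1,2020,14.2,3\n"
sample_dictionary = {
    "site_id": "Identifier of the collection site",
    "year": "Calendar year of the observation",
    "temp_c": "Mean air temperature in degrees Celsius",
    "humidity": "Relative humidity in percent",
}

undocumented, missing = compare_with_dictionary(sample_csv, sample_dictionary)
print(undocumented)  # ['q1']
print(missing)       # ['humidity']
```

An undocumented column such as the made-up q1 is a signal to look for more documentation or to contact the data provider before relying on that variable.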

Creating Datasets

If your research question cannot be answered with existing datasets, it may be necessary to create your own. Creating data can be done through practices such as observation, surveying, simulation, and experimentation, as well as through methods that extract data from existing bodies of information such as web-scraping or text & data mining (TDM).

Data collection looks different for different disciplines. Here we include some generalized resources to assist with creating datasets:

When collecting original data to answer research questions, there are a few key things to think about in order to be sure the findings are accurate and can be used to draw conclusions relating to your research question.

  • Take detailed notes about the methods you are using to collect data. If data collection does not go as planned, record which aspects of the methods were changed.

  • If working with human-subjects data, or protected data, be sure to check with your local Institutional Review Board (IRB) office to see what kinds of protections need to be put in place for storing and reporting your data.

  • It is important to collect samples that properly represent your subject of study. If you expect to see certain results, compare your experimental sample with a sample whose results are already known to check that your method behaves as expected.

    • A sample with known results is called a control sample. A positive control is a sample in which you expect to see the effect you are testing for. A negative control is a sample in which you expect to see no effect.

  • If you intend to collect data, decide how you will organize and analyze it before you collect it. If the work is to be published, have a plan to share the data. See more about data management below.

Data Management

Data can be challenging to organize because it comes in many different forms, none of which takes up physical space that you can see. Instead, we must organize the information into files and folders on computers or servers. It is important to recognize the types of data you will be working with during the research process and how much storage space you will need. Since important research findings are often shared, creating a plan to share the data in a data archive (also called a data repository) will make it easier for people to view and review your work.

Data management plans (DMPs) are formal plans that describe the data you expect to acquire or generate through your research, along with how you plan to manage it, analyze it, and share it. Below we have included some resources focused on data management planning:

DMPs are crucial to reproducible research practice because they provide a framework for how research data are managed and stored. This helps prevent data rot, the loss or corruption of data stored on individual computer hard drives. It also makes it easier to find data after research projects are completed and improves transparency by allowing collaborators or other interested researchers to download the data and verify the analysis. DMPs often prompt researchers to think about the necessary privacy and compliance considerations, especially if the project involves human-subjects research or some other type of protected data. For these reasons, DMPs are often required as part of research funding proposals.
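One common practice for detecting corrupted files, which this guide does not prescribe but which fits naturally into a DMP, is to record a checksum for each data file and recheck it later. The following is a minimal sketch in Python using the standard hashlib module. The function names build_manifest and changed_files, the file name data.csv, and its contents are hypothetical, and the example runs in a temporary folder so it touches no real data.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's relative path to its checksum (hypothetical helper)."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder, manifest):
    """List files that are missing or whose contents no longer match."""
    current = build_manifest(folder)
    return sorted(name for name in manifest if current.get(name) != manifest[name])


# Demonstration in a temporary folder with a made-up data file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "data.csv"
    data_file.write_text("id,value\n1,10\n")
    manifest = build_manifest(tmp)          # record checksums once
    before = changed_files(tmp, manifest)   # nothing has changed yet
    data_file.write_text("id,value\n1,99\n")  # simulate an unwanted change
    after = changed_files(tmp, manifest)

print(before)  # []
print(after)   # ['data.csv']
```

A checksum only reveals that a file has changed or gone missing. Recovering the original still depends on keeping backups or depositing the data in a repository.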