
Data Visualization

Data Visualization Overview

Data visualization is one of the most critical aspects of research. It allows researchers to see patterns and trends that are not easily observable in raw data. Over the last two decades, technology has dramatically increased the scale at which data can be visualized. In addition, more public data is available than ever before, allowing both researchers and citizen scientists to tell compelling visual stories with their data.

There are two different approaches to data visualization:

  • Exploratory: Your purpose for visualization is to explore and understand trends in your data, either to analyze the data or to prepare it for analysis.

  • Explanatory: Your purpose for visualization is to clearly and effectively communicate something about your data to a wider audience.

With a focus on explanatory data visualization, this guide is designed to help users find datasets and resources, to outline general approaches to data visualization for effective communication, and to provide a list of tools useful for creating beautiful visualizations.

Why Visualize Data?

A good example of the significance of data visualization is Anscombe's Quartet, a set of four datasets that share nearly identical summary statistics but look very different when plotted.

Source: MultiThreaded

These datasets have nearly identical means and variances for their x and y variables, nearly identical correlations between x and y, and the same linear regression line (to at least two decimal places). A table of these simple descriptive statistics would therefore make them look essentially indistinguishable from one another.

However, the table is misleading: the datasets have very different distributions, which becomes immediately apparent when they are graphed. Visualizations are powerful because they allow us to quickly and intuitively understand data and the patterns it holds.

Source: Anscombe, Francis J. (1973) Graphs in statistical analysis. American Statistician, 27, 17–21
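The claim about matching summary statistics can be checked directly. The following is a minimal sketch in Python using only the standard library. The data values are the quartet as commonly reproduced from Anscombe (1973), and the `summarize` helper and `ANSCOMBE` dictionary are names introduced here for illustration only.

```python
from statistics import mean, variance

# Anscombe's Quartet as commonly reproduced from Anscombe (1973).
# Datasets I-III share the same x values; dataset IV has its own.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the simple descriptive statistics discussed in the text."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 denominator)
    # Sample covariance, computed by hand to avoid newer-library dependencies.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / vx                   # least-squares regression slope
    return {
        "mean_x": mx,
        "var_x": vx,
        "mean_y": my,
        "var_y": vy,
        "corr": cov / (vx * vy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


for name, (x, y) in ANSCOMBE.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed rows are nearly identical across the four datasets, which is exactly why a summary table alone cannot distinguish them. Plotting each dataset as a scatter plot, with any charting tool, reveals the differences.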

Finding Datasets in Repositories

Data repositories contain published datasets that are typically associated with publications or ongoing research projects. Data repositories are used to store and preserve data so that researchers can access and analyze it. 

There are two types of repositories we will discuss: scholarly and public.

  • Scholarly: Scholarly data repositories are managed by organizations or scientific societies and often have stricter guidelines on the format and level of detail in submissions. They are generally well maintained and contain datasets from well-controlled studies, with detailed descriptions and metadata. Access to these repositories may be restricted.

  • Public: While some data repositories are accessible only to select user groups, many are publicly accessible to anyone with an internet connection. Public data repositories can have a lot of interesting and useful data, but make sure to carefully evaluate the source and scope of the data, as well as the inclusion and preservation practices of the repository itself.

When searching for data in scholarly repositories, check whether the dataset is associated with a research publication. Make sure the record contains enough information for you to reuse the data correctly.

Here are some scholarly repositories:

 

And here is a list of public repositories:

When collecting data, there are two main considerations.

First, research data sometimes has to be purchased and/or used under strict terms of agreement, or following specific privacy protocols. When purchasing data sets, or downloading protected data, be sure the data is stored in a safe and secure environment. It's important to respect copyright permissions and understand what constitutes fair use.

Second, carefully examining the context and content of the data can help you understand its potential biases and limitations and avoid misusing it. Published data is often associated with specific experimental strategies. While these strategies and their limitations are usually discussed when data is published with a research paper, this information is not always available with the dataset itself. When selecting a dataset for use:

  • Understand the methodological limitations of the data collection
  • Look for a data dictionary, or some sort of readme file that contains clear information about the variables and observations found in the data set
  • Try to find information on the context in which the data was collected, as this can inform limitations on how data should be used
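As one way to act on the data dictionary advice above, the sketch below checks whether every column in a dataset is described in its data dictionary. It is a hypothetical example: it assumes the dataset and the dictionary are both CSV files, and that the dictionary lists variable names in a column called `variable`. Real repositories use many different layouts, so the function name, column names, and sample values here are illustrative only.

```python
import csv
import io


def undocumented_columns(data_file, dictionary_file, name_field="variable"):
    """Return dataset columns that the data dictionary does not describe.

    Both arguments are open text files (or file-like objects) in CSV format.
    `name_field` is the dictionary column holding variable names (assumed).
    """
    header = next(csv.reader(data_file), [])  # first row of the dataset
    documented = {
        row[name_field].strip() for row in csv.DictReader(dictionary_file)
    }
    return [col for col in header if col.strip() not in documented]


# Made-up in-memory files standing in for a downloaded dataset and its
# data dictionary.
data = io.StringIO("site_id,temp_c,obs_date\n1,21.5,2020-01-01\n")
dictionary = io.StringIO(
    "variable,description\n"
    "site_id,Monitoring site identifier\n"
    "temp_c,Air temperature in degrees Celsius\n"
)

missing = undocumented_columns(data, dictionary)
print(missing)  # columns with no entry in the dictionary
```

Any column reported here is one whose meaning, units, or collection method you would need to confirm from another source before using it.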