Data Literacy for Researchers

An introductory guide to finding, analyzing, and communicating data in the research process.


Preparing Your Data for Analysis and Visualization

“Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.” (Source: Tableau) 

Data cleaning (sometimes referred to as "preprocessing") prepares data for the most effective analysis and visualization. It typically includes the following key considerations:

  • Remove duplicated data, which often originates from combining data from different sources
  • Remove data that is irrelevant to the problem you are trying to analyze
    • By removing extraneous information, you make it easier to analyze and visualize the most relevant data
  • Correct for typos along with variations in spelling and spacing
    • Computers can't tell the difference between "California" and "CA"; standardizing these variations will make your analysis more straightforward
  • Ensure units are the same for a given variable
  • Identify outliers and the reasons behind them
    • Removing certain outliers is sometimes appropriate for data visualization, but not always for statistical analysis
    • Outliers can be exceptions or inconsistencies found in representative samples
    • Do not simply remove outliers: be aware of why they appear in your sample, and be prepared to justify any data you remove
  • Decide how to handle missing values, since many computer programs cannot tell when a value is missing
    • Carefully weigh your options, which include filling in missing values based on existing data, filtering them out of specific analyses, or completely dropping data with missing values (a sketch of these steps follows this list)
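
To make these steps concrete, here is a minimal sketch using the Python library pandas (one common choice; similar steps are possible in R, Excel, or OpenRefine). The dataset, column names, and replacement rules are all hypothetical:

```python
# A minimal data cleaning sketch using pandas.
# The columns ("state", "income") and all values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "state":  ["California", "CA", "ca ", "Nevada", "Nevada"],
    "income": [52000, 61000, None, 48000, 48000],
})

# Remove exact duplicate rows (often introduced when combining sources).
df = df.drop_duplicates()

# Standardize spelling and spacing so "California", "CA", and "ca "
# are all treated as the same value.
df["state"] = df["state"].str.strip().str.upper().replace({"CALIFORNIA": "CA"})

# Decide how to handle missing values: here we fill with the column
# median, but dropping those rows (df.dropna()) is another option.
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```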


Dropping or altering values without proper consideration harms the integrity of your data. There are no hard-and-fast rules that apply in every situation, such as always dropping data with missing values or outliers; in general, your data cleaning decisions should be justifiable, consistent, transparent, and determined before any hypothesis testing.


Statistical Analysis

Statistics is a branch of mathematics that is used to summarize data and compare differences seen between samples and groups. Statistical tests can be used to check if patterns you might observe are real, or simply due to random variation naturally seen in the world.

In the research process we can collect data from a group representing what we want to study, but we usually cannot measure everyone in the entire population. Summary statistics such as the mean, median, standard deviation, and variance can provide insight into the center and spread of the data collected for the sample. These summary characteristics can be used to predict patterns, compare differences between groups, or extrapolate information about the larger population.
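
For example, the summary statistics named above can be computed with Python's built-in statistics module; the sample values here are made up for illustration:

```python
# Summary statistics on a small, hypothetical sample.
import statistics

sample = [4.1, 4.8, 5.0, 5.2, 5.3, 6.9, 12.0]

print("mean:    ", statistics.mean(sample))      # center of the data
print("median:  ", statistics.median(sample))    # center, robust to outliers
print("std dev: ", statistics.stdev(sample))     # spread around the mean
print("variance:", statistics.variance(sample))  # spread (std dev squared)
```

Note how the outlier (12.0) pulls the mean above the median; comparing the two is a quick check for skewed data.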

In the media, average values (like the mean and median) are often described as statistics; however, the field of statistics goes far beyond these simple numbers. Check out the other tabs in this section for more information.

Statistical tests are used to compare values between groups. The result of a statistical test is often a "p-value": the probability of observing differences at least as large as those seen in your data if, in reality, there were no true difference between the groups. For example, a p-value of 0.05 means that differences this large would arise by random chance about 5% of the time.

Statistical tests often make assumptions that must be met for their results to be valid, and low p-values are generally used to determine whether something is "statistically significant." Essentially, a low percent chance that the result arose randomly helps researchers judge whether an observed difference is meaningful. The function of these tests can be categorized as follows (a worked example appears after the list):

  • Comparing quantitative measurements between two populations
    • Examples: t-test, z-test, proportion tests, regression, permutation testing
  • Comparing quantitative measurements across more than two qualitative categories
    • Example: ANOVA
  • Comparing observed frequencies of qualitative (categorical) outcomes against expected ones
    • Examples: chi-squared test, goodness-of-fit tests
  • Summarizing a large number of quantitative variables measured across different qualitative categories (a dimensionality-reduction technique rather than a hypothesis test)
    • Example: principal component analysis
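
As an illustration, here is a sketch of a two-sample t-test using the scipy library (an assumption; the measurements for the two groups are made up):

```python
# A two-sample t-test comparing hypothetical measurements from two groups.
from scipy import stats

group_a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
group_b = [5.6, 5.8, 5.5, 6.0, 5.7, 5.9]

# ttest_ind assumes roughly normal data and, by default, equal variances.
result = stats.ttest_ind(group_a, group_b)
print("t statistic:", result.statistic)
print("p-value:    ", result.pvalue)

# If the p-value were 0.03, it would mean: assuming no true difference
# between the groups, a gap at least this large would appear by random
# chance about 3% of the time.
```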


Data analysis strategies and statistics used on research data should be clearly justified and carefully described. Competition for funding and the desire for career advancement can lead to hasty analysis, data falsification, and inadvertently lying through statistics. This can happen in a variety of ways:

  • Improper sampling: as discussed in Creating Datasets Ethics, samples that do not accurately represent the population being studied will lead to faulty conclusions
  • P-hacking: repeatedly running statistical tests in order to selectively report one that yields a statistically significant p-value, or otherwise manipulating data to achieve statistical significance (the simulation after this list shows why this is misleading)
  • Incomplete reporting: omitting information like sample size, study assumptions, or excluded variables can make it difficult to verify statistics
  • Poor representation: under-representing things like the variability in data can make results seem more significant than they really are
  • Exaggeration: this can include making large assumptions about how the study sample relates to the broader population, or claiming causation when there is only a correlation
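
The danger of p-hacking can be demonstrated with a short simulation: when many tests are run on pure random noise, roughly 5% will come out "statistically significant" at the conventional 0.05 threshold. This sketch assumes scipy is installed:

```python
# Simulating p-hacking: repeated tests on random noise still produce
# "significant" p-values about 5% of the time at the 0.05 threshold.
import random
from scipy import stats

random.seed(42)
n_tests = 1000
false_positives = 0

for _ in range(n_tests):
    # Both groups are drawn from the SAME distribution: no real difference.
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests were 'significant' by chance alone")
```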


Visualization

Data visualization is the process of creating summaries of datasets in the form of images or graphics. There are two central approaches to data visualizations.

  • Exploratory data visualization is used to understand patterns or trends in data. These may depict averages, most common values, variability, and range of data.
  • Explanatory data visualization is used to communicate data to an audience. This can be used to compare values between populations, or to depict trends or relationships between variables.

A central principle of data visualization is to increase the density or volume of data that you can view at once. Since data is most commonly organized in tables or spreadsheets, an excellent data visualization helps a reader grasp the contents of those tables efficiently, without overloading them with information.
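
As a simple example, the sketch below builds an exploratory histogram with the matplotlib library (an assumption; the measurements are randomly generated stand-ins for real data):

```python
# An exploratory histogram: one image summarizes 500 table rows at a glance.
import random
import matplotlib.pyplot as plt

random.seed(1)
measurements = [random.gauss(50, 10) for _ in range(500)]  # stand-in data

plt.hist(measurements, bins=25, edgecolor="black")
plt.xlabel("Measurement value")
plt.ylabel("Count")
plt.title("Distribution of sample measurements")
plt.show()
```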


As graphics are typically expected to provide a quick and intuitive understanding, manipulation or a lack of proper consideration can lead to false impressions.

As part of accessible design, any colors you use in your visualization should be easily distinguishable from each other by colorblind and non-colorblind people alike, regardless of the type of colorblindness they may have.
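
One practical approach, sketched below with matplotlib, is to draw colors from the widely cited Okabe-Ito palette, which was designed to stay distinguishable under the common forms of color vision deficiency (the chart data is hypothetical):

```python
# A bar chart using colorblind-safe colors from the Okabe-Ito palette.
import matplotlib.pyplot as plt

# Orange, sky blue, bluish green, vermillion from the Okabe-Ito palette.
okabe_ito = ["#E69F00", "#56B4E9", "#009E73", "#D55E00"]
groups = ["A", "B", "C", "D"]  # hypothetical categories
values = [12, 17, 9, 14]       # hypothetical values

plt.bar(groups, values, color=okabe_ito)
plt.ylabel("Value")
plt.title("Group comparison with a colorblind-safe palette")
plt.show()
```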


Tools and Resources

Here are some commonly used tools for data cleaning, statistical analysis, and visualization.

UCLA offers various free and discounted licenses for some software products, so make sure to check the list before paying for a program.

  • Stack Overflow
    • A community where people can ask, answer, and search for questions related to programming
  • Software Carpentry
    • Lessons to build software skills, part of the larger community The Carpentries, which aims to teach foundational computational and data science skills to researchers
  • GitHub
    • A cloud-based service built on the Git version control software that lets developers store and manage their code; it is especially helpful for version control during collaboration. Software Carpentry has a lesson on Git and GitHub where you can learn more