
Data Literacy for Researchers

An introductory guide to finding, analyzing, and communicating data in the research process.

Data Cleaning: Preparing Your Data for Analysis and Visualization

Data cleaning (sometimes referred to as "preprocessing") prepares data for the most effective analysis and visualization. Here are some key considerations:

Removing duplicate and irrelevant observations
  • Remove duplicated data, a common consequence of human error during data collection or of combining data from different sources
  • Remove data that is irrelevant to the problem you are trying to analyze
  • By removing extraneous information, you can make it easier to analyze and visualize the most relevant data
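
As a minimal sketch of these first steps, the Python snippet below uses pandas; the file name and column names are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical dataset; the file and column names are placeholders
df = pd.read_csv("survey_responses.csv")

# Remove rows that are exact duplicates of another row
df = df.drop_duplicates()

# Remove columns that are irrelevant to the question being analyzed
df = df.drop(columns=["internal_notes", "entry_timestamp"])
```
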
Standardizing the data and fixing structural errors
  • Correct for typos along with variations in spelling and spacing
  • Ensure units are the same for a given variable
  • Computer programs are not like humans and will consider "California" and "CA" to be completely separate and unrelated. Standardizing these types of variations will make your analysis more straightforward
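
A hedged sketch of this kind of standardization in pandas, using made-up column names and values:

```python
import pandas as pd

# Made-up example; the column names and values are hypothetical
df = pd.DataFrame({
    "state": ["California", "CA", " ca "],
    "height": [170.0, 68.0, 65.0],
    "height_unit": ["cm", "in", "in"],
})

# Fix spacing and case, then map abbreviations to one canonical spelling
df["state"] = df["state"].str.strip().str.upper().replace({"CA": "CALIFORNIA"})

# Convert all heights to a single unit (inches to centimeters here)
inches = df["height_unit"] == "in"
df.loc[inches, "height"] = df.loc[inches, "height"] * 2.54
df.loc[inches, "height_unit"] = "cm"
```
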
Dealing with outliers
  • Identify outliers and the reasons behind them
  • Removing certain outliers is sometimes appropriate for data visualization, but not always for statistical analysis
  • Outliers can be genuine but unusual observations in a representative sample, or they can be errors and inconsistencies in the data
  • Rather than removing outliers by default, understand why they appear in your sample and be prepared to justify any data you exclude
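
One common way to flag (rather than silently delete) outliers is the 1.5 × IQR rule of thumb, sketched below with made-up numbers:

```python
import pandas as pd

# Made-up measurements; the last value is a likely data-entry error
values = pd.Series([4.1, 3.9, 4.4, 4.0, 3.8, 41.0])

# Flag points more than 1.5 * IQR beyond the middle 50% of the data
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Inspect the flagged values and document anything you decide to exclude
print(values[outliers])
```
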
Handling Missing Values
  • Many analysis methods, including most machine learning methods, are not inherently equipped to handle missing values, so carefully evaluate which method (or methods) of dealing with missing values is suitable given your data and its context
  • Options include imputing values (filling in the missing values based on existing data) and dropping features or instances that have too many missing values
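
The sketch below shows both options in pandas; the data and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data with gaps; column names are placeholders
df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "income": [52000, 61000, np.nan, np.nan],
    "notes": [np.nan, np.nan, np.nan, "follow up"],
})

# Drop a column that is almost entirely missing
df = df.drop(columns=["notes"])

# Option A: drop rows that still have any missing values
complete_rows = df.dropna()

# Option B: impute missing values, here with each column's median
imputed = df.fillna(df.median())
```
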


Statistical Analysis

Statistics is a branch of mathematics that is used to summarize data and compare differences seen between samples and groups. Statistical tests can be used to check if patterns you might observe are real, or simply due to random variation naturally seen in the world.

In the research process we can collect data from a group representing what we want to study, but we might not be able to measure everyone in the entire population. Summary statistics such as the mean, median, standard deviation, and variance can provide insight into the center and spread of the data collected for the sample. These summarizing characteristics can be used to make predictions about patterns, compare differences between groups, or extrapolate information about the larger population.
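
For example, with made-up measurements these summary statistics can be computed directly in Python; note how the one extreme value pulls the mean away from the median.

```python
import pandas as pd

# Made-up sample of measurements
sample = pd.Series([12.1, 14.3, 13.8, 15.0, 12.9, 55.2])

print(sample.mean())    # center: arithmetic mean (pulled up by the extreme value)
print(sample.median())  # center: middle value, less affected by extremes
print(sample.std())     # spread: standard deviation
print(sample.var())     # spread: variance (standard deviation squared)
```
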

In popular usage, average values (like the mean and median) are often described as statistics; however, the field of statistics goes far beyond these simple numbers. Check out the other tabs in this section for more information.

Statistical tests are used to compare values between groups. The result of a statistical test is often a "p-value," which represents the probability of seeing differences at least as large as those observed purely by random chance, assuming there is no real difference between the groups. Put simply, multiplying the p-value by 100 gives the percent chance of getting such a result by random chance alone.

Statistical tests often make assumptions which need to be met in order for their results to be valid, and low p-values are generally used to determine whether something is "statistically significant." Essentially, a low percent chance that the result was random helps researchers judge how meaningful a difference might be. The function of these tests can be categorized as follows (a small worked example appears after the list):

  • Comparing quantitative measurements between two populations
    • Examples: t-test, z-test, proportion tests, regression, permutation testing
  • Comparing measurements or counts across more than two qualitative categories
    • Examples: ANOVA, goodness of fit, chi-squared
  • Exploring patterns across a large number of quantitative variables that may fall under different qualitative categories
    • Example: Principal component analysis
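
As a small worked example of the first category, the sketch below runs a two-sample t-test with SciPy on made-up measurements; the group names and numbers are invented, and the p-value is read as described above.

```python
from scipy import stats

# Invented measurements for two groups (e.g., control vs. treatment)
group_a = [5.1, 4.8, 5.4, 5.0, 4.9, 5.2]
group_b = [5.8, 6.1, 5.6, 6.0, 5.9, 6.2]

# Two-sample t-test comparing the group means
result = stats.ttest_ind(group_a, group_b)

# A small p-value (commonly < 0.05) is usually read as "unlikely to be
# due to random chance alone," provided the test's assumptions are met
print(result.pvalue)
```
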


Data analysis strategies and statistics used on research data should be clearly justified and carefully described. Competition for funding and desire for career advancement can lead to hasty analysis, data falsification, and misleading use of statistics, whether deliberate or inadvertent. This can happen in a variety of ways:

  • Improper sampling: as discussed with Creating Datasets Ethics, samples that do not accurately represent the population being studied will lead to faulty conclusions.
  • P-hacking: repeatedly running statistical tests and selectively reporting the one that yields a statistically significant p-value, or otherwise manipulating data to achieve a statistically significant result (see the sketch after this list)
  • Incomplete reporting: not including information like sample size, study assumptions, or excluded variables can make it difficult to verify statistics. 
  • Poor representation: failing to show things like variability in the data can make results seem more significant than they really are
  • Exaggeration: this can include making big assumptions as to how the study sample relates to the broader population or claiming causation when there is only a correlation
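
To see why repeated testing is a problem, the sketch below (assuming NumPy and SciPy are available) runs many t-tests on pure random noise; roughly 5% of them will look "significant" at the 0.05 level even though there is no real difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests = 1000
false_positives = 0

for _ in range(n_tests):
    # Both groups are drawn from the same distribution: no real effect exists
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

# Roughly 5% of the tests come out "significant" by chance alone
print(false_positives / n_tests)
```
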


Visualization

Data visualization is the process of creating summaries of datasets in the form of images or graphics. There are two central approaches to data visualization:

  • Exploratory data visualization is used to understand patterns or trends in data. These may depict averages, most common values, variability, and range of data.
  • Explanatory data visualization is used to communicate data to an audience. This can be used to compare values between populations, or to depict trends or relationships between things.

A central principle of data visualization is to increase the density or volume of data that you can view at once. Since data is most commonly organized in tables or spreadsheets, an excellent data visualization helps a reader take in the contents of those tables efficiently, without overloading them with information.
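
As a small example of the exploratory approach, the sketch below draws a quick histogram with Matplotlib; the measurements are made up.

```python
import matplotlib.pyplot as plt

# Made-up measurements for a quick exploratory look at the distribution
measurements = [4.2, 4.8, 5.1, 5.0, 4.9, 5.3, 4.7, 6.8, 5.2, 4.6]

plt.hist(measurements, bins=5)
plt.xlabel("Measurement value")
plt.ylabel("Count")
plt.title("Exploratory view of the sample distribution")
plt.show()
```
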


As graphics are typically expected to provide a quick and intuitive understanding, manipulation or a lack of proper consideration can lead to false impressions.

As a part of accessibility consideration, any colors you use in your visualization should be easily distinguishable from each other by non-colorblind and colorblind people alike, regardless of what type of colorblindness they may have.
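
One low-effort option, sketched below with made-up values, is to draw colors from a perceptually uniform Matplotlib colormap such as 'viridis' rather than hand-picking red/green pairs; palettes designed specifically for colorblind viewers are another choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values for three groups
groups = ["Group A", "Group B", "Group C"]
counts = [12, 18, 9]

# Sample well-separated colors from the 'viridis' colormap
colors = plt.cm.viridis(np.linspace(0.15, 0.85, len(groups)))

plt.bar(groups, counts, color=colors)
plt.ylabel("Count")
plt.show()
```
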


Data Analysis Tools and Resources

Here are some commonly used tools for data cleaning, statistical analysis and visualization.

UCLA offers various free and discounted licenses for some software products, so make sure to check the list before paying for a program.

Python is a general-purpose programming language with widely used libraries for data cleaning, analysis, and visualization (such as pandas and Matplotlib).

R is a programming language and environment designed for statistical computing and graphics.

MATLAB is a proprietary programming language and numeric computing environment.

OpenRefine is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.

Tableau Software helps people see and understand data. Tableau allows anyone to perform sophisticated data analytics and share their findings with online dashboards.

Stata is a proprietary, general-purpose statistical software package for data manipulation, visualization, statistics, and automated reporting. It is used by researchers in many fields, including biomedicine, economics, epidemiology and sociology.

SPSS (Statistical Package for the Social Sciences) is a software package used for the analysis of statistical data. Although the name of SPSS reflects its original use in the field of social sciences, its use has since expanded into other data markets.

ArcGIS is geospatial software to view, edit, manage, and analyze geographic data. It enables users to visualize spatial data and create maps.

Stack Overflow

  • A community where people can ask, answer, and search for questions related to programming

Software Carpentry

  • Lessons to build software skills, part of The Carpentries, a larger community that aims to teach foundational computational and data science skills to researchers

GitHub

  • A cloud-based hosting service built on the Git version-control software that allows developers to store and manage their code; especially helpful for version control during collaboration. Software Carpentry has a lesson on Git and GitHub where you can learn more

Open Data Tools

  • List of tools and resources to explore, publish, and share public datasets with sections specifically for visualization, data, source code, and information.

Data Science Notebooks

  • List of interactive computing platforms for data science, includes comparison table at the bottom of the page

Gephi

  • Open graph visualization platform, well-known as a tool for network visualization