Data cleaning (sometimes referred to as "preprocessing") prepares data for the most effective analysis and visualization. Here are some key considerations:
Statistics is a branch of mathematics used to summarize data and compare differences between samples and groups. Statistical tests can be used to check whether patterns you observe are real or simply due to random variation naturally seen in the world.
In the research process we can collect data from a sample representing what we want to study, but we usually cannot measure everyone in the entire population. Summary statistics such as the mean, median, standard deviation, and variance can provide insight into the center and spread of the data collected for the sample. These summarizing characteristics can be used to predict patterns, compare differences between groups, or extrapolate information about the larger population.
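As a quick illustration, the summary statistics above can be computed with Python's built-in statistics module (the sample values below are invented for demonstration):

```python
import statistics

# Hypothetical sample: systolic blood pressure readings from a small study group
sample = [118, 125, 130, 122, 119, 128, 135, 121]

mean = statistics.mean(sample)          # center: arithmetic average
median = statistics.median(sample)      # center: middle value, robust to outliers
stdev = statistics.stdev(sample)        # spread: sample standard deviation
variance = statistics.variance(sample)  # spread: average squared deviation

print(f"mean={mean:.2f}, median={median:.2f}, sd={stdev:.2f}, var={variance:.2f}")
```

Note that the median (123.5 here) differs from the mean (124.75); when a sample contains outliers, the gap between the two is often much larger, which is one reason to report both.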
In popular usage, average values (like mean and median) are often described as statistics, however the field of statistics goes far beyond these simple numbers. Check out the other tabs in this section for more information.
Statistical tests are used to compare values between groups. The result of a statistical test is often a "p-value," which represents the probability of observing a difference at least as large as the one seen if the groups did not truly differ and only random variation were at work. Simply put, if you multiply the p-value by 100, you get the percent chance that a difference of that size would appear by random chance alone.
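For example, a minimal two-sample comparison can be run with the SciPy library (assuming it is installed; the measurements below are invented for illustration):

```python
from scipy import stats

# Hypothetical measurements from two groups (e.g., control vs. treatment)
group_a = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2]
group_b = [4.9, 5.1, 4.7, 5.3, 4.8, 5.0]

# Two-sample t-test: assumes roughly normal data with similar variances
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# A small p-value means a difference this large would be unlikely
# if both groups really came from the same population.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Here the group means are far apart relative to their spread, so the p-value comes out very small and the difference would conventionally be called statistically significant.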
Statistical tests often make assumptions that need to be met in order for their results to be valid, and low p-values are generally used to determine whether something is “statistically significant.” Essentially, a low percent chance that the result was random helps researchers judge how meaningful a difference might be. The function of these tests can be categorized as follows:
Resources:
Data analysis strategies and statistics used on research data should be clearly justified and carefully described. Competition for funding and the desire for career advancement can lead to hasty analysis, data falsification, and misleading use of statistics, whether deliberate or inadvertent. This can happen in a variety of ways:
Resources:
Data visualization is the process of creating summaries of datasets in the form of images or graphics. There are two central approaches to data visualization.
A central principle of data visualization is to increase the density or volume of data that you can view at once. Since data is most commonly organized in the form of tables or spreadsheets, an excellent data visualization will help a reader view the contents of the tables in an efficient way, without overloading them with information.
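As a small sketch of this principle, the matplotlib library (assuming it is installed; the counts below are invented) can condense a table of numbers into a single chart that a reader can scan at a glance:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical table: monthly counts that are slow to compare as raw numbers
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
counts = [23, 31, 28, 45, 52, 48]

fig, ax = plt.subplots()
ax.bar(months, counts)           # one bar per table row
ax.set_xlabel("Month")
ax.set_ylabel("Count")
ax.set_title("Monthly observations")
fig.savefig("monthly_counts.png")  # one image summarizing the whole table
```

Always label axes and give units where they apply; an unlabeled chart forces the reader back to the table it was meant to replace.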
Data visualization resources:
As graphics are typically expected to provide a quick and intuitive understanding, manipulation or a lack of proper consideration can lead to false impressions.
As a part of accessibility consideration, any colors you use in your visualization should be easily distinguishable from each other by non-colorblind and colorblind people alike, regardless of what type of colorblindness they may have.
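One practical approach (a sketch, assuming matplotlib is available) is to use a palette designed for color vision deficiency, such as the Okabe-Ito colors, and to pair color with a second visual channel like line style:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Okabe-Ito palette: widely recommended as distinguishable under
# common forms of colorblindness
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
             "#0072B2", "#D55E00", "#CC79A7", "#000000"]

fig, ax = plt.subplots()
for i, color in enumerate(OKABE_ITO[:4]):
    # Pair each color with a distinct line style so that no
    # category relies on hue alone
    ax.plot([0, 1, 2], [i, i + 1, i + 3], color=color,
            linestyle=["-", "--", ":", "-."][i], label=f"group {i + 1}")
ax.legend()
fig.savefig("palette_demo.png")
```

Redundant encoding (color plus line style, marker shape, or direct labels) also keeps the figure readable when printed in grayscale.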
Resources:
Here are some commonly used tools for data cleaning, statistical analysis and visualization.
UCLA offers various free and discounted licenses for some software products, so make sure to check the list before paying for a program.
Python is a general-purpose programming language widely used for data analysis through libraries such as pandas, NumPy, and matplotlib.
R is a programming language designed for statistical computing and graphics.
MATLAB is a proprietary programming language and numeric computing environment.
OpenRefine is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data.
Tableau Software helps people see and understand data. Tableau allows anyone to perform sophisticated data analytics and share their findings through online dashboards.
Stata is a proprietary, general-purpose statistical software package for data manipulation, visualization, statistics, and automated reporting. It is used by researchers in many fields, including biomedicine, economics, epidemiology and sociology.
SPSS (Statistical Package for the Social Sciences) is a software package used for the analysis of statistical data. Although the name of SPSS reflects its original use in the field of social sciences, its use has since expanded into other data markets.
ArcGIS is geospatial software to view, edit, manage, and analyze geographic data. It enables users to visualize spatial data and create maps.
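To make the cleaning step concrete, here is a minimal sketch of the kind of fixes a tool like OpenRefine automates, written with the pandas library (assuming it is installed; the table below is invented for illustration):

```python
import pandas as pd

# Hypothetical messy table: inconsistent capitalization, stray
# whitespace, and an exact duplicate row
df = pd.DataFrame({
    "city": [" Los Angeles", "los angeles", "San Diego ", "San Diego "],
    "count": [10, 12, 7, 7],
})

df["city"] = df["city"].str.strip().str.title()   # normalize text values
df = df.drop_duplicates().reset_index(drop=True)  # remove exact duplicates

print(df)
```

After normalization the two "San Diego" rows become identical and collapse to one, while the two "Los Angeles" rows survive because their counts differ; deciding how to reconcile those is a judgment call that should be documented as part of the analysis.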