An introductory guide to finding, analyzing, and communicating data in the research process.

Data cleaning (sometimes referred to as "preprocessing") prepares data for the most effective analysis and visualization. Here are some key considerations:

- Remove duplicated data, which is a common consequence of human error in data collection and attempts to combine data from different places
- Remove data that is irrelevant to the problem you are trying to analyze
- By removing extraneous information, you can make it easier to analyze and visualize the most relevant data
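
As a rough illustration of the two removal steps above, here is a minimal sketch in plain Python. The records, field names, and the choice of which fields count as relevant are all hypothetical assumptions made for the example, not part of any particular dataset.

```python
# Hypothetical survey records; the field names are illustrative assumptions.
records = [
    {"id": 1, "state": "CA", "age": 34, "favorite_color": "blue"},
    {"id": 2, "state": "NY", "age": 51, "favorite_color": "red"},
    {"id": 1, "state": "CA", "age": 34, "favorite_color": "blue"},  # exact duplicate
]

# Assumed to be the only fields the analysis needs.
RELEVANT_FIELDS = ("id", "state", "age")


def clean(rows, fields):
    """Keep only the relevant fields and drop rows that repeat an earlier row."""
    seen = set()
    cleaned = []
    for row in rows:
        trimmed = tuple(row[f] for f in fields)
        if trimmed in seen:
            continue  # an identical row was already kept
        seen.add(trimmed)
        cleaned.append(dict(zip(fields, trimmed)))
    return cleaned


cleaned_records = clean(records, RELEVANT_FIELDS)
print(cleaned_records)
```

Note that duplicates are detected after the irrelevant fields are dropped, so two rows that differ only in an irrelevant field are treated as the same row. Whether that is appropriate depends on your data.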

- Correct for typos along with variations in spelling and spacing
- Ensure units are the same for a given variable
- Computer programs are not like humans and will consider "California" and "CA" to be completely separate and unrelated. Standardizing these types of variations will make your analysis more straightforward
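
The standardization described above might look something like the following sketch. The alias table and the unit conversion factors are assumptions chosen for the example; a real project would build its own lookup from the variants actually present in the data.

```python
# Hypothetical lookup table; extend it to cover the variants in your own data.
STATE_ALIASES = {
    "ca": "California",
    "calif.": "California",
    "california": "California",
}

# Conversion factors to kilograms for the units assumed to appear in the data.
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def standardize_state(raw):
    """Trim and collapse whitespace, then map known variants to one spelling."""
    key = " ".join(raw.split()).lower()
    # Unknown values are returned trimmed but otherwise unchanged.
    return STATE_ALIASES.get(key, raw.strip())


def to_kilograms(value, unit):
    """Convert a weight to kilograms so the variable uses a single unit."""
    return value * TO_KG[unit]


states = [standardize_state(s) for s in ["California", " CA", "california ", "Calif."]]
print(states)
print(to_kilograms(500, "g"))
```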

- Identify outliers and the reasons behind them
- Removing certain outliers is sometimes appropriate for data visualization, but not always for statistical analysis
- Outliers can be exceptions or inconsistencies found in representative samples
- Rather than removing outliers by default, find out why they appear in your sample, and be prepared to justify any data you exclude from your analysis
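
One common way to identify candidate outliers, shown here only as a sketch, is the interquartile range (IQR) rule. The guide does not prescribe this method; it is an assumption for the example, as are the measurements and the conventional multiplier of 1.5. The function flags values for review instead of deleting them, in keeping with the advice above.

```python
import statistics

# Hypothetical measurements; 98.0 sits far from the rest.
values = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 98.0, 12.2]


def flag_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] without removing them."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


flagged = flag_outliers(values)
print(flagged)
```

A flagged value is a prompt to investigate (data entry error, different unit, genuine extreme case), not an automatic deletion.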

- As many methods for data analysis such as machine learning are not inherently equipped to handle missing values, you must carefully evaluate what method or methods of dealing with missing values are suitable given your data and its context
- Options include imputing values, or filling in the missing values based on existing data, and dropping features or instances that have too many missing values
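
The two options above can be sketched as follows. The rows and field names are hypothetical, and median imputation is just one simple imputation strategy among many; which one is suitable depends on your data and its context.

```python
import statistics

# Hypothetical rows; None marks a missing value.
rows = [
    {"height_cm": 170.0, "weight_kg": 65.0},
    {"height_cm": None, "weight_kg": 72.0},
    {"height_cm": 182.0, "weight_kg": None},
    {"height_cm": 176.0, "weight_kg": 70.0},
]


def drop_incomplete(data):
    """Option 1: drop every row (instance) that has any missing value."""
    return [r for r in data if all(v is not None for v in r.values())]


def impute_median(data, field):
    """Option 2: fill missing values of one field with the observed median."""
    observed = [r[field] for r in data if r[field] is not None]
    fill = statistics.median(observed)
    # Build new rows so the original data is left untouched.
    return [{**r, field: fill if r[field] is None else r[field]} for r in data]


complete_rows = drop_incomplete(rows)
imputed_rows = impute_median(rows, "height_cm")
print(len(complete_rows), imputed_rows[1]["height_cm"])
```

Dropping rows shrinks the sample, while imputing keeps the sample size but introduces values that were never observed, so either choice should be reported.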

- Integrated Ethics Labs — Data Cleaning: Integrated Ethics Labs offers lesson plans designed to build a foundation of ethics in computer science, data science, and statistics courses
- OpenRefine: A powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.
- Tidy Data: A guide to tidying messy datasets

Statistics is a branch of mathematics that is used to summarize data and compare differences seen between samples and groups. Statistical tests can be used to check if patterns you might observe are real, or simply due to random variation naturally seen in the world.

In the research process we can collect data from a group representing what we want to study, but we might not be able to measure everyone in the entire population. Summary statistics such as the mean, median, standard deviation, and variance can provide insight into the center and spread of the data collected for the sample. These summarizing characteristics can be used to predict patterns, compare differences between groups, or extrapolate information about the larger population.
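
As a small worked example of the summary statistics named above, the following sketch uses Python's built-in `statistics` module on a made-up sample. The numbers are hypothetical; `variance` and `stdev` here are the sample versions, which divide by n - 1.

```python
import statistics

# Hypothetical sample of eight measurements.
sample = [4.0, 8.0, 6.0, 5.0, 3.0, 7.0, 9.0, 6.0]

mean = statistics.mean(sample)          # center: arithmetic average
median = statistics.median(sample)      # center: middle value when sorted
variance = statistics.variance(sample)  # spread: sample variance (n - 1)
stdev = statistics.stdev(sample)        # spread: square root of the variance

print(mean, median, variance, stdev)
```

For this sample the mean and median are both 6.0, the sample variance is 4.0, and the standard deviation is 2.0.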

In popular usage, average values (like mean and median) are often described as statistics; however, the field of statistics goes far beyond these simple numbers. Check out the other tabs in this section for more information.

Statistical tests are used to compare values between groups. The result of a statistical test is often a "p-value," which is the probability of observing differences at least as large as those in your data if there were truly no difference between the groups (the null hypothesis). For example, a p-value of 0.03 means that, if the groups did not really differ, random variation alone would produce a difference this large about 3% of the time. It is not the probability that the result itself occurred by chance.
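
To make the idea of a p-value concrete, here is a minimal sketch of an exact permutation test on the difference in means between two small groups. The data are hypothetical, and the permutation test is used here only because it can be written in a few lines of plain Python. It asks: if group labels were assigned at random, how often would the difference in means be at least as large as the one observed?

```python
from itertools import combinations
from statistics import mean

# Hypothetical measurements for two groups of four.
group_a = [12.0, 15.0, 14.0, 16.0]
group_b = [9.0, 10.0, 11.0, 8.0]


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference in group means."""
    pooled = a + b
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    extreme = 0
    count = 0
    # Try every way of relabeling len(a) of the pooled values as "group a".
    for idx in combinations(range(len(pooled)), len(a)):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / len(a) - (total - sum_a) / len(b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count


p_value = permutation_p_value(group_a, group_b)
print(round(p_value, 4))
```

Here only 2 of the 70 possible relabelings produce a difference as large as the observed one, so the p-value is 2/70, or about 0.029. Enumerating every relabeling is only practical for small samples; larger datasets typically use a random subset of relabelings.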

Statistical tests often make assumptions that must be met for their results to be valid. A low p-value, commonly below a threshold such as 0.05, is generally used to decide whether a result is “statistically significant.” A low p-value indicates that the observed difference would be unlikely if only random variation were at work; it does not, by itself, measure how large or important the difference is. The function of these tests can be categorized as follows:

- Comparing quantitative measurements between two populations
- Examples: t-test, z-test, proportion tests, regression, permutation testing

- Comparing quantitative measurements between more than two qualitative categories
- Examples: ANOVA, goodness of fit, chi-squared

- Comparing quantitative measurements between a large number of quantitative observations that fall under different qualitative categories.
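- Example: Principal component analysis (a dimension-reduction technique for exploring patterns, rather than a hypothesis test that produces a p-value)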

Resources:

- Introduction to Inferential Statistics: Scribbr guide with examples
- What Statistical Analysis Should I Use? Using SPSS: UCLA Institute for Digital Research and Education (IDRE) Office of Statistics
- A Beginner Guide to t-test and ANOVA: Includes examples in R programming

Data analysis strategies and statistics used on research data should be clearly justified and carefully described. Competition for funding and desire for career advancement can lead to hasty analysis, data falsification, and inadvertently lying through statistics. This can happen in a variety of ways:

- **Improper sampling**: as discussed with Creating Datasets Ethics, samples that do not accurately represent the population being studied will lead to faulty conclusions.
- **P-hacking**: the repeated use of statistical tests in order to selectively report one that results in a p-value that is statistically significant, or otherwise manipulating data to achieve a statistically significant p-value.
- **Incomplete reporting**: not including information like sample size, study assumptions, or excluded variables can make it difficult to verify statistics.
- **Poor representation**: poorly representing things like variability in data can make the data seem more significant than it really is.
- **Exaggeration**: this can include making big assumptions as to how the study sample relates to the broader population or claiming causation when there is only a correlation.
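
The arithmetic behind p-hacking can be sketched briefly. Assuming independent tests and no real effect in any of them, the chance of at least one "significant" result at threshold alpha across k tests is 1 - (1 - alpha)^k. The Bonferroni correction shown alongside it is one common safeguard; it is an addition for illustration, not something this guide prescribes.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when no real effect exists in any of them."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Stricter per-test threshold that keeps the overall rate at or below alpha."""
    return alpha / num_tests


for k in (1, 5, 20):
    print(k, round(familywise_error_rate(k), 3), bonferroni_threshold(k))
```

With 20 tests, the chance of at least one false positive rises to roughly 64%, which is why reporting only the one test that "worked" is misleading.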

Resources:

- Lying with Statistics Workshop Slides: From the UCLA Library Data Literacy Workshop Series
- Ethics and Data Analysis: JSTOR journal article, may require campus VPN to access

Data visualization is the process of creating summaries of datasets in the form of images or graphics. There are two central approaches to data visualizations.

- **Exploratory data visualization** is used to understand patterns or trends in data. These may depict averages, most common values, variability, and range of data.
- **Explanatory data visualization** is used to communicate data to an audience. This can be used to compare values between populations, or depict trends or relationships between things.

A central principle of data visualization is to increase the density or volume of data that you can view at once. Since data is most commonly organized in the form of tables or spreadsheets, an excellent data visualization will help a reader view the contents of the tables in an efficient way, without overloading them with information.

Data visualization resources:

- Data Visualization Guide: UCLA Library Guide (still under construction)
- Data Visualization Checklist: Checklist for graphics meant for presentation, prioritizing viewer ability to read, interpret, and retain content
- Show Me the Data: Data Visualization Webinar Series: Two introductory webinar videos on data visualization
- Exploratory Data Analysis Handbook: Engineering Statistics Handbook from National Institute of Standards and Technology

As graphics are typically expected to provide a quick and intuitive understanding, manipulation or a lack of proper consideration can lead to false impressions.

As a part of accessibility consideration, any colors you use in your visualization should be easily distinguishable from each other by non-colorblind *and* colorblind people alike, regardless of what type of colorblindness they may have.

Resources:

- The Ethics of Data Visualization: Common deception techniques in data visualization
- Colors in Action: Takes a user-specified color palette and shows what various visualizations would look like to people with different types of colorblindness

Here are some commonly used tools for data cleaning, statistical analysis and visualization.

UCLA offers various free and discounted licenses for some software products, so make sure to check the list before paying for a program.

Python is a programming language that enables data analysis.

- Download Python (free)
- Google Colaboratory (no download)
- Allows you to interactively write and execute Python in your browser with easy storage and sharing through Google Drive. It is a cloud-based version of Jupyter Notebook.

- Plotting and Programming in Python (Software Carpentries)
- The Python Tutorial (Python Documentation)
- Data Visualization with Python (GeeksforGeeks)
- Python Tutorial (W3Schools)
- Scikit-learn: Machine Learning Library

R is a programming language that enables data analysis.

- Download R (free)
- R Resources from UCLA OARC Stats Consulting
- Detailed Introduction to R
- R for Reproducible Scientific Analysis (Software Carpentries)
- R for Data Science

MATLAB is a proprietary programming language and numeric computing environment.

- How to Get Matlab (free for UCLA students, staff and faculty)
- Matlab Statistics & Machine Learning Toolbox
- Matlab Plotting (Tutorialspoint)
- Advanced Graphics and Visualization Techniques with MATLAB

OpenRefine is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.

- Download OpenRefine (free)
- OpenRefine Introduction
- OpenRefine Documentation and Support
- Lesson on OpenRefine (Library Carpentries)
- Cleaning Data with OpenRefine (The Programming Historian)
- Fetching and Parsing Data from the Web with OpenRefine (The Programming Historian)
- Using OpenRefine to Clean Your Data (Berkeley Advanced Media Institute)

Tableau Software helps people see and understand data. Tableau allows anyone to perform sophisticated education analytics and share their findings with online dashboards.

- Tableau Download (free for full-time UCLA students)
- Tableau Getting Started Overview
- Tableau Help Guide (Princeton)

- Get the Microsoft Office 365 Education Suite (free for UCLA students)
- Data Analysis with Excel
- Excel Data Analysis Overview (Tutorialspoint)

Stata is a proprietary, general-purpose statistical software package for data manipulation, visualization, statistics, and automated reporting. It is used by researchers in many fields, including biomedicine, economics, epidemiology and sociology.

- Order Stata (discounted for students and education)
- STATA Resources from UCLA OARC Stats Consulting
- Stata User Guide
- Stata Coding Guide
- Online Stata Tutorial
- Getting Started in Data Analysis using Stata

SPSS (Statistical Package for the Social Sciences) is a software package used for the analysis of statistical data. Although the name of SPSS reflects its original use in the field of social sciences, its use has since expanded into other data markets.

ArcGIS is geospatial software to view, edit, manage and analyze geographic data. It enables users to visualize spatial data and create maps.

- A community where people can ask, answer, and search for questions related to programming

- Lessons to build software skills, part of the larger community The Carpentries which aims to teach foundational computational and data science skills to researchers

- Cloud-based service based on Git software that allows developers to store and manage their code, especially helpful for version control during collaboration. The Software Carpentries has a lesson on Git and GitHub where you can learn more

- List of tools and resources to explore, publish, and share public datasets with sections specifically for visualization, data, source code, and information.

- List of interactive computing platforms for data science, includes comparison table at the bottom of the page

- Open graph visualization platform, well-known as a tool for network visualization

- Last Updated: May 17, 2024 2:17 PM
- URL: https://guides.library.ucla.edu/data-research