Data cleaning (sometimes referred to as "preprocessing") prepares data for the most effective analysis and visualization. Here are some key considerations:
Statistics is a branch of mathematics used to summarize data and compare differences between samples and groups. Statistical tests can be used to check whether patterns you observe are real or simply due to random variation naturally seen in the world.
In the research process we can collect data from a sample representing what we want to study, but we usually cannot measure everyone in the entire population. Summary statistics such as the mean, median, standard deviation, and variance can provide insight into the center and spread of the data collected for the sample. These summarizing characteristics can be used to predict patterns, compare differences between groups, or extrapolate information about the larger population.
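As a quick illustration, the summary statistics above can be computed with Python's built-in statistics module (the sample values below are invented for demonstration):

```python
import statistics

# Hypothetical sample: systolic blood pressure readings from a small study group
sample = [118, 125, 130, 122, 119, 128, 135, 121]

mean = statistics.mean(sample)          # center: arithmetic average
median = statistics.median(sample)      # center: middle value, robust to outliers
stdev = statistics.stdev(sample)        # spread: sample standard deviation
variance = statistics.variance(sample)  # spread: average squared deviation

print(f"mean={mean:.2f}, median={median:.2f}, sd={stdev:.2f}, var={variance:.2f}")
```

Note that the median (123.5 here) differs from the mean (124.75); when a sample contains outliers, the gap between the two is often much larger, which is one reason to report both.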
In popular usage, average values (like mean and median) are often described as statistics, however the field of statistics goes far beyond these simple numbers. Check out the other tabs in this section for more information.
Statistical tests are used to compare values between groups. The result of a statistical test is often a "p-value," which represents the probability of observing a difference at least as large as the one seen if the groups did not truly differ and only random variation were at work. Simply put, if you multiply the p-value by 100, you get the percent chance that a difference of that size would appear by random chance alone.
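For example, a minimal two-sample comparison can be run with the SciPy library (assuming it is installed; the measurements below are invented for illustration):

```python
from scipy import stats

# Hypothetical measurements from two groups (e.g., control vs. treatment)
group_a = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2]
group_b = [4.9, 5.1, 4.7, 5.3, 4.8, 5.0]

# Two-sample t-test: assumes roughly normal data with similar variances
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# A small p-value means a difference this large would be unlikely
# if both groups really came from the same population.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Here the group means are far apart relative to their spread, so the p-value comes out very small and the difference would conventionally be called statistically significant.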
Statistical tests often make assumptions that need to be met in order for their results to be valid, and low p-values are generally used to determine whether something is “statistically significant.” Essentially, a low percent chance that the result was random helps researchers judge how meaningful a difference might be. The function of these tests can be categorized as follows:
Resources:
Data analysis strategies and statistics used on research data should be clearly justified and carefully described. Competition for funding and the desire for career advancement can lead to hasty analysis, data falsification, and misleading use of statistics, whether deliberate or inadvertent. This can happen in a variety of ways:
Resources:
Data visualization is the process of creating summaries of datasets in the form of images or graphics. There are two central approaches to data visualization.
A central principle of data visualization is to increase the density or volume of data that you can view at once. Since data is most commonly organized in the form of tables or spreadsheets, an excellent data visualization will help a reader view the contents of the tables in an efficient way, without overloading them with information.
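As a small sketch of this principle, the matplotlib library (assuming it is installed; the counts below are invented) can condense a table of numbers into a single chart that a reader can scan at a glance:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical table: monthly counts that are slow to compare as raw numbers
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
counts = [23, 31, 28, 45, 52, 48]

fig, ax = plt.subplots()
ax.bar(months, counts)           # one bar per table row
ax.set_xlabel("Month")
ax.set_ylabel("Count")
ax.set_title("Monthly observations")
fig.savefig("monthly_counts.png")  # one image summarizing the whole table
```

Always label axes and give units where they apply; an unlabeled chart forces the reader back to the table it was meant to replace.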
Data visualization resources:
As graphics are typically expected to provide a quick and intuitive understanding, manipulation or a lack of proper consideration can lead to false impressions.
As a part of accessibility consideration, any colors you use in your visualization should be easily distinguishable from each other by non-colorblind and colorblind people alike, regardless of what type of colorblindness they may have.
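One practical approach (a sketch, assuming matplotlib is available) is to use a palette designed for color vision deficiency, such as the Okabe-Ito colors, and to pair color with a second visual channel like line style:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Okabe-Ito palette: widely recommended as distinguishable under
# common forms of colorblindness
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
             "#0072B2", "#D55E00", "#CC79A7", "#000000"]

fig, ax = plt.subplots()
for i, color in enumerate(OKABE_ITO[:4]):
    # Pair each color with a distinct line style so that no
    # category relies on hue alone
    ax.plot([0, 1, 2], [i, i + 1, i + 3], color=color,
            linestyle=["-", "--", ":", "-."][i], label=f"group {i + 1}")
ax.legend()
fig.savefig("palette_demo.png")
```

Redundant encoding (color plus line style, marker shape, or direct labels) also keeps the figure readable when printed in grayscale.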
Resources:
Here are some commonly used tools for data cleaning, statistical analysis and visualization.
UCLA offers various free and discounted licenses for some software products, so make sure to check the list before paying for a program.
Python is a general-purpose programming language widely used for data analysis through libraries such as pandas, NumPy, and matplotlib.
R is a programming language designed for statistical computing and graphics.
MATLAB is a proprietary programming language and numeric computing environment.
OpenRefine is a powerful tool for working with messy data: cleaning it, transforming it from one format into another, and extending it with web services and external data.
Tableau Software helps people see and understand data. Tableau allows anyone to perform sophisticated data analytics and share their findings through online dashboards.
Stata is a proprietary, general-purpose statistical software package for data manipulation, visualization, statistics, and automated reporting. It is used by researchers in many fields, including biomedicine, economics, epidemiology and sociology.
SPSS (Statistical Package for the Social Sciences) is a software package used for the analysis of statistical data. Although the name of SPSS reflects its original use in the field of social sciences, its use has since expanded into other data markets.
ArcGIS is geospatial software to view, edit, manage, and analyze geographic data. It enables users to visualize spatial data and create maps.
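To make the cleaning step concrete, here is a minimal sketch of the kind of fixes a tool like OpenRefine automates, written with the pandas library (assuming it is installed; the table below is invented for illustration):

```python
import pandas as pd

# Hypothetical messy table: inconsistent capitalization, stray
# whitespace, and an exact duplicate row
df = pd.DataFrame({
    "city": [" Los Angeles", "los angeles", "San Diego ", "San Diego "],
    "count": [10, 12, 7, 7],
})

df["city"] = df["city"].str.strip().str.title()   # normalize text values
df = df.drop_duplicates().reset_index(drop=True)  # remove exact duplicates

print(df)
```

After normalization the two "San Diego" rows become identical and collapse to one, while the two "Los Angeles" rows survive because their counts differ; deciding how to reconcile those is a judgment call that should be documented as part of the analysis.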