Exploratory data analysis and data visualization
This is a brief introduction to exploratory data analysis (EDA) and data visualization. You will come across several unfamiliar terms and graphs, but you don't have to fully understand them at this moment. The purpose of this write-up is to make you aware of what tools are available and what can be done. The philosophy and specific techniques of EDA will be introduced in further readings.
When some people claim that their methodology is exploratory, what they actually mean is that they don't know what they are doing. Unfortunately, poor research is often conducted in the name of EDA. During data collection, researchers flood the subjects with hundreds of pages of surveys because research questions are not clearly defined and variables are not identified. It is true that EDA does not require a pre-determined hypothesis to be tested, but this justifies neither the absence of research questions, nor ill-defined variables, nor trying every test until a significant p value is obtained (p-hacking) (Jebb, Parrigon, & Woo, 2017).
EDA techniques are abundant and well-structured. Exploratory data analysis, as a supplement to confirmatory data analysis (CDA), was founded by John Tukey (1977, 1980). Tukey often related EDA to detective work. In EDA, the role of the researcher is to explore the data in as many ways as possible until a plausible "story" of the data emerges. A detective does not collect just any information. Instead he collects evidence and clues related to the central question of the case. So, from now on you can call me "Detective Yu."
Velleman and Hoaglin (1981) outlined four basic elements of exploratory data analysis as the following:
- Data visualization
- Residual analysis
- Data transformation or re-expression
- Resistance procedures
The rationale for data visualization is: "A picture is worth a thousand words." It is easier to detect a data pattern from a picture than from a numeric output.
Generally speaking, there are six major categories of research goals. All of them can use graphing techniques to deepen our understanding of the data:
- Spotting outliers
- Discriminating clusters
- Checking distributional and other assumptions
- Examining relationships
- Comparing mean differences
- Observing a time-based process
The following are some examples. The interpretation of these graphs is very involved. Just get the idea of visualization and don't get absorbed in the details.
Discriminating clusters
By visualization we can cluster either variables or subjects. This example shows how brushing is used to cluster subjects as an aid to regression analysis. Please view this animated demo.
Checking distributional and other assumptions
Data normality is required by many parametric tests. The researcher can use a simple histogram to examine the distribution. A more sophisticated way is to check the data with a normal probability plot. If the data are perfectly normal, the graph should show a diagonal straight line. The deviation from the straight line indicates the degree of non-normality.
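The idea behind the normal probability plot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the samples are simulated, the plotting position (i - 0.5) / n is one common convention among several, and the helper names normal_quantile_pairs and pearson_r are made up for this example. Instead of drawing the plot, the sketch summarizes how close the points are to a straight line with a correlation coefficient.

```python
import random
from statistics import NormalDist, mean

def normal_quantile_pairs(data):
    """Pair each sorted observation with its expected standard normal quantile."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting position (i - 0.5) / n is an assumption; other conventions exist.
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

def pearson_r(pairs):
    """Correlation between theoretical quantiles and observed values."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Simulated (hypothetical) samples: one roughly normal, one right-skewed.
rng = random.Random(2017)
normal_scores = [rng.gauss(100, 15) for _ in range(200)]
skewed_scores = [rng.expovariate(1 / 20) for _ in range(200)]

r_normal = pearson_r(normal_quantile_pairs(normal_scores))
r_skewed = pearson_r(normal_quantile_pairs(skewed_scores))
print(f"normal sample: r = {r_normal:.3f}")
print(f"skewed sample: r = {r_skewed:.3f}")
```

A correlation close to 1 means the points fall near the diagonal line; the skewed sample should give a visibly lower value. Plotting the pairs with any graphing tool gives the actual probability plot.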
When interaction effects are present, regression lines are not consistent across all levels of other variables. The moving mesh surface depicts this change. If the animation annoys you, please press the stop button on your browser to freeze the animation. This type of data visualization can be performed in Mathematica and DataDesk.
Comparing group differences
Comparing mean differences is usually conducted using parametric tests such as a t-test or an F-test. Nonetheless, graphs can be used to supplement test statistics. A typical example is to use a cell-mean plot to examine the main effects and the interaction effect. Examples of advanced graphs for comparing differences are the diamond plot and the leverage plot.
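The numbers behind a cell-mean plot are simply the means of each combination of factor levels. The following Python sketch computes them for a made-up 2 x 2 design; the factor names, levels, and scores are all hypothetical and serve only to show the arithmetic.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical 2 x 2 data: (teaching method, prior experience, test score).
records = [
    ("lecture", "novice", 70), ("lecture", "novice", 74),
    ("lecture", "expert", 68), ("lecture", "expert", 72),
    ("online", "novice", 85), ("online", "novice", 89),
    ("online", "expert", 71), ("online", "expert", 75),
]

# Group the scores by cell, then average within each cell.
cells = defaultdict(list)
for method, experience, score in records:
    cells[(method, experience)].append(score)
cell_means = {cell: mean(scores) for cell, scores in cells.items()}

# Simple effect of method within each experience level.
effect_novice = cell_means[("online", "novice")] - cell_means[("lecture", "novice")]
effect_expert = cell_means[("online", "expert")] - cell_means[("lecture", "expert")]

# If the two simple effects differ, the lines in the plot are not parallel.
interaction_contrast = effect_novice - effect_expert

for cell, value in sorted(cell_means.items()):
    print(cell, value)
print("interaction contrast:", interaction_contrast)
```

In a cell-mean plot these four means would be drawn as two lines. Parallel lines suggest no interaction, while a non-zero contrast like the one computed here shows up as lines with different slopes.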
EDA follows the model that data = fit + residual or data = model + error. The fit or the model consists of the expected values of the data. The residual or the error is the amount by which each observed value deviates from its expected value. By examining the residuals, the researcher can assess the model adequacy. A simple example can be found in regression analysis. The scatterplot on the left shows the residuals in a regression model.
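The decomposition data = fit + residual can be made concrete with a small Python sketch. It assumes a simple least-squares line as the model and uses made-up data (hours studied and test scores); the function name fit_line is hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for one predictor."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 60, 68, 71, 70, 79]

intercept, slope = fit_line(hours, scores)
fitted = [intercept + slope * x for x in hours]          # the "fit"
residuals = [y - f for y, f in zip(scores, fitted)]      # the "residual"

for x, y, f, e in zip(hours, scores, fitted, residuals):
    print(f"x={x}  data={y}  fit={f:.2f}  residual={e:+.2f}")
```

Each observed score equals its fitted value plus its residual. Plotting the residuals against the predictor, and looking for curves or funnels, is one way to judge whether the straight-line model is adequate.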
Today it is not difficult to see why we should examine residuals to check how well the data fits the model. Nonetheless, "residual" is a modern concept. A few centuries ago even very well-trained scientists had a weak sense of residual. * Unfortunately, at the present time this problem still exists among several researchers who tend to take modeling for granted and ignore residuals.
In the past this iterative process was performed manually by the analyst, as in the 2-way fit approach. Today machine learning algorithms automate this process. Boosting, also known as the boosted tree, is a good example of an automated iteration.
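To give a flavor of how boosting iterates on residuals, here is a deliberately simplified Python sketch. It is an assumption-laden toy, not the algorithm of any particular package: each round fits a one-split "stump" to the current residuals and adds a fraction of it (the learning_rate) to the running fit. The data and the function names fit_stump and boost are made up.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Return (cut, left_mean, right_mean) for the best single split on x."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        cut = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= cut]
        right = [r for x, r in zip(xs, residuals) if x > cut]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, cut, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, rounds=10, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    fit = [mean(ys)] * len(ys)
    sse_history = [sum((y - f) ** 2 for y, f in zip(ys, fit))]
    for _ in range(rounds):
        residuals = [y - f for y, f in zip(ys, fit)]
        cut, left_mean, right_mean = fit_stump(xs, residuals)
        fit = [f + learning_rate * (left_mean if x <= cut else right_mean)
               for x, f in zip(xs, fit)]
        sse_history.append(sum((y - f) ** 2 for y, f in zip(ys, fit)))
    return fit, sse_history

# Hypothetical curved data.
xs = list(range(1, 13))
ys = [2, 3, 5, 4, 6, 9, 12, 11, 15, 20, 24, 31]

fit, sse_history = boost(xs, ys)
print([round(s, 1) for s in sse_history])
```

The printed sum of squared residuals shrinks from round to round, which is the automated counterpart of an analyst repeatedly re-fitting what the previous model left unexplained.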
Data transformation happens in our everyday life: converting US dollars into Canadian dollars, or converting a GPA on a 5-point scale to a GPA on a 4-point scale. However, these examples are linear transformations, which do not affect the shape of the data distribution. In EDA, a non-linear transformation is usually used, and it therefore changes the data pattern. Data re-expression is exploratory in nature because prior to the transformation, the researcher never knows which re-expression approach will achieve desirable results.
There are four major objectives of transforming data:
- Normalize the distribution: Non-normal data violate the assumptions of parametric tests, and thus a transformation is advisable. It is a common misconception that converting raw scores to z-scores yields a normal distribution. Actually, the raw-to-z transformation is a linear transformation. The following figure shows that after a raw-to-z transformation, the distribution shape of the z-scores still resembles that of the raw scores. The appropriate procedure would be a natural log transformation or an inverse probability transformation.
- Stabilize the variances: Data with unequal variances are also detrimental to parametric tests. A typical example of variance stabilizing transformation is square root transformation: y* = sqrt(y).
- Linearize the trend: Regression analysis requires the assumption of linearity. When the data show a curvilinear relationship, the researcher can either apply non-linear regression analysis or straighten the data with a linearizing transformation. A logarithmic transformation is a typical example of the latter.
- Orthogonalize collinear variables: In multiple regression, a lack of independence between predictors could make the model unstable. In terms of hyper-space, the vectors representing these variables are non-orthogonal. To rectify the situation, the variables can be orthogonalized by centering the scores, using the Gram–Schmidt process, or other transformation techniques.
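The point about z-scores in the first objective above can be checked numerically. The Python sketch below uses a made-up, right-skewed set of incomes and a simple moment-based skewness coefficient (one of several possible definitions) to compare the raw scores, their z-scores, and their natural logs.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

# Hypothetical right-skewed incomes (in thousands of dollars).
incomes = [18, 21, 22, 25, 27, 30, 33, 38, 45, 60, 95, 240]

m, s = mean(incomes), pstdev(incomes)
z_scores = [(v - m) / s for v in incomes]       # linear transformation
log_incomes = [math.log(v) for v in incomes]    # non-linear transformation

print(f"skewness of raw scores: {skewness(incomes):.2f}")
print(f"skewness of z-scores:   {skewness(z_scores):.2f}")
print(f"skewness of log scores: {skewness(log_incomes):.2f}")
```

The raw scores and the z-scores have the same skewness because a linear transformation only shifts and rescales the data. The log transformation pulls in the long right tail, so its skewness is smaller, although for these particular numbers it does not reach zero.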
Nonetheless, every statistical procedure has limitations and should be used with caution. Data transformation is no exception. Osborne (2002) advised that data transformation should be used appropriately; many transformations reduce non-normality by changing the spacing between data points, but this raises issues in the interpretation of the data. If transformations are done correctly, all data points should remain in the same relative order as prior to the transformation, and this does not affect how researchers interpret the scores. But it might be problematic if the original variables were meant to be interpreted in a straightforward fashion (such as annual income and years of age). After the transformations, the new variables might become much more complex to interpret.
Parametric tests are based on the mean estimation, which is sensitive to outliers or skewed distributions. In EDA, robust estimators are usually used. For example:
- Median: The middle point of the data.
- Trimean: A measure of central tendency based on the arithmetic average of the values of the first quartile, the third quartile, and the median counted twice.
- Winsorized mean: A robust version of the mean in which extreme scores are pulled back to the majority of the data.
- Trimmed mean: A mean computed after a fixed proportion of the most extreme scores has been removed from each tail of the data.
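The four estimators listed above can be sketched with Python's standard library. The scores are made up, with one obvious outlier. Two choices in the code are assumptions rather than fixed rules: the quartiles use the "inclusive" method of statistics.quantiles (requires Python 3.8 or later), and the trimming and Winsorizing proportion is set to 10% per tail.

```python
from statistics import mean, median, quantiles

def trimean(values):
    """(Q1 + 2 * median + Q3) / 4, with quartiles from the 'inclusive' method."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4

def trimmed_mean(values, proportion=0.1):
    """Drop the given proportion of scores from each tail, then average."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return mean(ordered[k:len(ordered) - k])

def winsorized_mean(values, proportion=0.1):
    """Replace the extreme scores in each tail with the nearest retained score."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    if k:
        ordered[:k] = [ordered[k]] * k
        ordered[-k:] = [ordered[-k - 1]] * k
    return mean(ordered)

# Hypothetical scores with one extreme value.
scores = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

print("mean:           ", mean(scores))
print("median:         ", median(scores))
print("trimean:        ", trimean(scores))
print("trimmed mean:   ", trimmed_mean(scores))
print("winsorized mean:", winsorized_mean(scores))
```

The ordinary mean is dragged upward by the single score of 95, while the four resistant estimators stay close to the bulk of the data.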
In your first stat course you learned that the mode is more resistant against outliers than the median. You may ask why the median, instead of the mode, is used. Indeed, in most situations the median and the mode are equally robust against outliers. Please view this animated demo.
It is important to point out that there is a subtle difference between "resistance" and "robustness," though the two terms are usually used interchangeably. EDA is more concerned with resistance while hypothesis testing pays more attention to robustness. Resistance is about being immune to outliers while robustness is about being immune to assumption violations. In the former, the goal is to obtain a data summary while in the latter the goal is to make a probabilistic inference.
DataDesk
DataDesk (Data Description, Inc., 2008) was developed by Paul Velleman, a student of John Tukey. DataDesk is the ideal tool for beginners in exploratory data analysis. It is feature-rich and flexible enough for data manipulation, yet requires little prior knowledge of computer operation. For instance, the data re-expression described above can be performed using a wide variety of transformation functions in DataDesk.
DataDesk has a richer version called Data Desk Plus, which incorporates a multimedia-based statistics tutorial entitled ActivStats.
JMP
JMP (SAS Institute, 2016) is a very versatile statistical program. There are two variants of JMP, namely, JMP and JMP Pro. As the name implies, JMP Pro is a professional version that includes many powerful procedures. But for most users JMP is sufficient for EDA. Like DataDesk, JMP has built-in data transformation options as shown below.
The design philosophy of JMP is similar to that of Apple's iPod: upon installation you can start exploring your data without reading the manual. In addition to common graphing features, such as the histogram and the boxplot, Graph Builder in JMP also provides users with Geographical Information System (GIS) features.
XLISP-STAT
If you would like to gain complete control through programming, XLISP-STAT should be considered. For example, data visualization often involves data smoothing. Through programming you can view the data at different levels of detail.
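The idea of viewing data at different levels of detail through programming can be illustrated with a simple smoother. The sketch below is written in Python rather than XLISP-STAT, purely as a stand-in, and uses a centered moving average, which is only one of many smoothing techniques. The series and the function names moving_average and roughness are made up.

```python
from statistics import mean

def moving_average(values, window):
    """Centered moving average; the window shrinks at the two ends."""
    half = window // 2
    return [mean(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def roughness(values):
    """Total absolute change between neighboring points."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

# Hypothetical zigzag series with an upward trend.
series = [3, 7, 4, 9, 6, 11, 8, 13, 9, 15, 12, 17]

fine_detail = moving_average(series, window=3)    # keeps more local wiggles
coarse_detail = moving_average(series, window=7)  # shows mainly the trend

print("raw roughness:   ", roughness(series))
print("window 3:        ", round(roughness(fine_detail), 2))
print("window 7:        ", round(roughness(coarse_detail), 2))
```

Changing the window argument is the programming equivalent of turning a dial: a narrow window preserves local detail, and a wide window reveals the overall trend.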
LISP stands for List Processing. Some people jokingly call it "Lots of idiotic and silly parentheses." LISP was created during 1956-62 by John McCarthy at MIT for non-numerical computation. Later it was used specifically for the development of artificial intelligence. There are many different versions of LISP, such as Common Lisp and Franz LISP. XLISP, which was developed by David Betz, is one of these many dialects. Later Luke Tierney (1990) developed XLISP-STAT for statistical visualization. This package has many built-in statistical graphing functions. Based on XLISP-STAT, Cook and Weisberg (1994) developed a set of regression graphing tools called R-code. Another comprehensive EDA package named ViSta (Young, 1999) is also written in XLISP-STAT.
XLISP-STAT is cross-platform. However, it is an interpreted rather than a compiled language, and therefore you must load the written program into XLISP-STAT to run it.
Tukey's (1977) book is considered a classic in EDA. In his time computer resources were not easily accessible, but today most of his suggested graphing techniques are available in many software packages.
Behrens (1997) and Behrens & Yu (2003) are essential for both beginners and intermediate learners. Both chapters cover the detail of visualization, data transformation, residual analysis, and resistance procedures, which are briefly mentioned in this lesson.
For a quick overview of EDA, visit the NIST Engineering Statistics Handbook. Although this site gives many examples of graphing techniques, it does not tell you which specific software packages can generate those graphs.
For the philosophical foundation of EDA, please consult Yu (1994 April, 2006). EDA is a philosophy/attitude rather than a collection of techniques.
To acquire a deeper understanding of data visualization, please read Yu and Behrens (1995) and Yu (2010, 2014).
* For example, Gregor Mendel (1822-1884), who is considered the founder of modern genetics, established through his scientific findings the notion that physical properties of species are subject to heredity. Mendel conducted a fertilization experiment to confirm his belief. In his experiment, he followed several generations of plants to observe how specific genes were carried from one generation to another. While the reported data largely conform to the inheritance hypothesis, R. A. Fisher (1936) questioned the validity of Mendel's study. Fisher pointed out that Mendel's data seemed "too good to be true." Using chi-square tests, Fisher found that Mendel's results were so close to what would be expected that such agreement could happen by chance less than once in 10,000 times.
Another example can be found in the story of Johannes Kepler (1571-1630), the first astronomer to propose that the earth and other planets orbit around the sun in an elliptical fashion, rather than in circles as Galileo believed. Kepler worked under another well-known astronomer, Brahe, who collected a huge database of planetary orbits. Using Brahe's data, Kepler found that the data fit the elliptical hypothesis rather than the circular hypothesis. However, almost 400 years later when William Donahue redid Kepler's calculation, he found that the orbit data and the elliptical model do not fit each other as claimed.
Further, there is a widespread urban legend that British physicist Arthur Eddington substantiated Einstein’s theory of general relativity by observing the positions of stars during the 1919 solar eclipse. However, in the 1980s scholars found that Eddington did not collect sufficient data to reach a conclusion. Rather, he distorted the result to make it fit the theory (Swayer, 2012).
Kepler, Mendel, and Eddington are not the only three scientists who failed to accept the residuals between the data and the model. William Harvey, Isaac Newton, and Charles Darwin also had the same problem; the list goes on and on. While reviewing this phenomenon in the history of science, some scholars denounced those scientists as committing fraud. In a milder tone, Press and Tanur (2001) said that the problem was caused by "the subjectivity of scientists."
My view is that those scientists had a weak sense of residuals. They conducted science in a confirmatory mode, in which only a dichotomous answer could result. Even if residuals existed, they tended to embrace the model because by admitting any inconsistency, the entire model would be rejected. In other words, they accepted the notion that DATA = MODEL.
Last revision: May, 2017