Chapter 4 Exploratory data analyses

R is more graphically oriented than most other statistical packages; it relies more on plots and figures for initial exploratory data analysis. Numerical summaries are of course also available.

4.1 hist

This command produces a histogram. There is a useful optional argument breaks to specify the number of bins (bars), or a vector of breaks between bins.

4.2 plot

The default version of this command produces a scattergram of two variables. If you enter just one variable, then the index numbers of the observations are used on the horizontal axis, and the values on the vertical axis. Useful arguments are title, and xlab and ylab for axis labels. In addition, you can use a third variable to vary the plot symbols.

4.3 rug

This command produces tick marks at the actual data values, yielding the visual effect of a rug. This is useful in combination with a scattergram or histogram. Try it out, with the following commands:7

4.4 boxplot

This yields a boxplot summary (Tukey 1977) of one variable. You can also specify the dependent and independent variable, with argument dv~iv1; this will produce multiple boxplots for the dependent variable, broken down by the independent variable(s). Two useful arguments for this command are:
notch=T to give additional information about the distribution, and
varwidth=T to scale the width of the boxes to the numbers of observations.

4.5 qqnorm

This produces a quantile–quantile (QQ) plot. This plots the observed quantiles against the expected quantiles if the argument variable would be distributed normally. If the variable is indeed distributed normally, then the data should fall on an approximately straight line. Deviations of this line indicate deviations from normality. You can also add the expected line with qqline.

4.6 summary

This command produces a numerical summary of the argument variable. However it does not supply standard measures of variability. We often need

4.7 var

to compute the variance of the argument variable. Related functions are sd to compute standard deviation, cov to compute the covariance between two variables, and cor to compute their correlation.

4.8 length

returns the length of the argument variable, i.e. the number of observations in that vector. This is useful for checking the number of data, as a preliminary for further analyses.

In the last command above, we have programmed a new function valid.n, using standard functions provided with R.

4.9 unique

returns a vector containing the (unsorted) unique values of the argument variable, without duplicates. This is also useful for checking your data.

## [1] 3 1 2
## [1] 1 2 3

4.10 table

returns a frequency table, i.e. reports the number of observations for each (combination of) value(s) of one or more variables. This again is helpful for checking your data, to inspect the numbers of observations per cell (see also 3.2). Tables can be one-dimensional (see example below) or multidimensional. It is also possible to make a frequency table of a frequency table!

## x
## 1 2 3 
## 7 7 7
## 7 
## 3

A useful trick: if all cells should have the same number of observations, then the table-of-table should contain a single number representing the number of cells.

## 6 7 
## 2 1

The lattice package (see Chapter 8) contains additional functions for higher-order data visualization: xyplot, histogram, densityplot, cloud, and more. For more information, enter the command help(lattice).

Now that we have obtained some insightful figures, we might want to include these in our documents. The best procedure in Rstudio is to go to the Plots tab in the lower right pane, and select Export > Save as image... or Export > Save as PDF.... Figures in PDF format (extension pdf) are easy to include in LaTeX documents, and are best for publication. Other formats (e.g. png) are easy to include in LaTeX, MS Word and web documents.


Tukey, John W. 1977. Exploratory Data Analysis. Addison-Wesley.

  1. The semicolon separates multiple R commands on a single line of input.