Statistics: An Introduction using R, by Michael J Crawley

A book review by Bobulous.

Despite sitting through A Level mathematics and further mathematics, plus university maths courses, I always end up forgetting just how complex statistics is.

Sure, I vaguely remember topics such as averages, variance, the Normal distribution, the central limit theorem, and the like. But it takes a book like this to remind me just how ignorant I am when it comes to the subject of looking for patterns in collected data.

Statistics: An Introduction using R begins by going over the fundamentals that should be (but rarely are) engraved into the mind of anyone aiming to explain the world with statistics. The author gives a nice example of sampling bias by describing a forest where trees are, naturally, distributed in a less than uniform way. Someone who wants to take a random sample of the trees in the whole forest may generate a set of random coordinates within that forest, and then pick the tree nearest to each of those coordinates to form their sample. The author clearly illustrates why this will not be an unbiased sample, as the uneven scattering of trees will mean that some trees are more likely to be chosen by this method than others. A sample only counts as unbiased if every single member of the target population has an exactly equal chance of being selected, which means that the only way to take an unbiased sample of the tree population is to number every tree in the entire forest, and then draw numbers randomly from the range of one up to the number of the last tree.
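The bias in the nearest-tree method is easy to see in a small simulation. What follows is a minimal sketch in Python rather than R, and it is not taken from the book: the forest layout (twenty trees crowded into one corner plus one isolated tree), the function names and the number of draws are all made up for illustration.

```python
import random
from collections import Counter

def nearest_tree(point, trees):
    """Return the index of the tree closest to the given (x, y) point."""
    px, py = point
    return min(
        range(len(trees)),
        key=lambda i: (trees[i][0] - px) ** 2 + (trees[i][1] - py) ** 2,
    )

def simulate(trees, draws, rng):
    """Pick random coordinates in the unit square and count the nearest tree each time."""
    counts = Counter()
    for _ in range(draws):
        point = (rng.random(), rng.random())
        counts[nearest_tree(point, trees)] += 1
    return counts

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical forest: 20 trees crowded into one corner, plus one isolated tree.
clustered = [(rng.uniform(0.0, 0.2), rng.uniform(0.0, 0.2)) for _ in range(20)]
isolated = (0.8, 0.8)
trees = clustered + [isolated]

draws = 10_000
counts = simulate(trees, draws, rng)

isolated_share = counts[len(trees) - 1] / draws  # how often the lone tree was picked
fair_share = 1 / len(trees)                      # what an unbiased sample would give

print(f"isolated tree picked {isolated_share:.1%} of the time")
print(f"an unbiased method would pick it {fair_share:.1%} of the time")
```

The isolated tree "owns" a large empty area of the forest, so random coordinates land nearest to it far more often than to any single tree in the crowded corner. Numbering the trees and drawing with something like `rng.sample(range(len(trees)), k)` gives each tree the same chance instead.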

The book then moves on to explain how to import data into R, an open source statistical analysis package (available from the R Project for Statistical Computing). The book explains that once a data set is loaded into R it is called a dataframe, and then explains how to summarise and select parts of a dataframe. This chapter is very short, so I strongly recommend you then read Appendix 1 Fundamentals Of The R Language, which briefs you on the syntax and features of R which you'll need to know to make sense of the examples in the main chapters.

Returning to the main chapters, central tendency and variance are described. Next is a chapter that shows how to analyse a single sample using R, calculating mean and variance, probabilities, and checking whether a sample is normally distributed using Q-Q plots and histograms and by calculating skew and kurtosis. The next chapter covers how to compare two samples, using Fisher's F test, Student's t-test, and various other tests designed to calculate how likely it is that two samples come from the same underlying distribution.
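To give a flavour of the quantities involved, here is a rough Python sketch (again not the book's R code) of the summary statistics for one sample and the test statistics for comparing two. The sample values are invented, the function names are my own, and I have assumed the simple moment-based definitions of skew and kurtosis, which may differ slightly from the definitions the book uses. Turning the variance ratio and the t statistic into probabilities needs the F and t distributions, which R provides and this sketch leaves out.

```python
import math
import statistics

def central_moment(sample, k):
    """k-th central moment, dividing by n (an assumed, simple definition)."""
    m = statistics.mean(sample)
    return sum((x - m) ** k for x in sample) / len(sample)

def skewness(sample):
    """Moment-based skew: zero for a perfectly symmetric sample."""
    return central_moment(sample, 3) / central_moment(sample, 2) ** 1.5

def excess_kurtosis(sample):
    """Moment-based kurtosis minus 3, so a normal distribution scores about zero."""
    return central_moment(sample, 4) / central_moment(sample, 2) ** 2 - 3

def variance_ratio(a, b):
    """Larger sample variance divided by the smaller, as used in an F test."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return max(va, vb) / min(va, vb)

def t_statistic(a, b):
    """Difference in means divided by the standard error of that difference."""
    se = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / se

# Invented data, purely for illustration.
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 3, 4, 5, 6]

print("mean:", statistics.mean(sample_a))
print("variance:", statistics.variance(sample_a))
print("skew:", skewness(sample_a))
print("excess kurtosis:", excess_kurtosis(sample_a))
print("variance ratio:", variance_ratio(sample_a, sample_b))
print("t statistic:", t_statistic(sample_a, sample_b))
```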

With the foundations covered, the rest of the book then concentrates on statistical modelling. (This is where my grasp began to slip, so forgive me if the following is inaccurate.) The aim of statistical modelling is to find the simplest formula that relates the response variable (usually marked on the vertical axis of a graph) to the explanatory variables (usually marked on the horizontal axis of a graph). R has a large number of functions and tools for creating such models, for checking how well they fit (such as making sure that the residuals, the differences between the observed and fitted values, roughly follow a normal distribution), and for checking whether a simpler model fits the observed data to within an acceptable discrepancy. Using these tools, you can strip away factors that have no significant effect on the response variable, and boil the model down until it contains only the factors that do significantly affect the response variable.

The book lays on plenty of examples, using regression (used when both explanatory and response variables are continuous), Anova (analysis of variance, used when the explanatory variables are categorical and the response is continuous), Ancova (analysis of covariance, used when there are both categorical and continuous explanatory variables, and the response is continuous), multiple regression (used when there are two or more continuous explanatory variables, and the response is continuous), and later examples where the response variable is not continuous (such as count data or proportion data). Despite the author carefully describing the mathematics behind the calculations of fit (with boxes that outline the definitions of the error sum of squares, total sum of squares, regression sum of squares, and other terms necessary for understanding the main text), I began to feel lost, and had to re-read some sections several times. Possibly it didn't help that I was reading on the train, with no data of my own to work with.
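For the simplest case, a straight line fitted to one continuous explanatory variable, the sums of squares mentioned above fit together as in the following Python sketch. This is my own illustration under standard least-squares definitions, not a reproduction of the book's boxes: the data are invented and the names (`fit_line`, `ssy`, `ssr`, `sse`) are hypothetical.

```python
def fit_line(x, y):
    """Least-squares straight line y = intercept + slope * x, with its sums of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    ssx = sum((xi - x_bar) ** 2 for xi in x)                         # spread of x
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # co-variation
    ssy = sum((yi - y_bar) ** 2 for yi in y)                         # total sum of squares

    slope = ssxy / ssx
    intercept = y_bar - slope * x_bar

    # Error sum of squares: squared residuals around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Regression sum of squares: the variation explained by the line.
    ssr = slope * ssxy

    return {
        "slope": slope,
        "intercept": intercept,
        "ssy": ssy,
        "ssr": ssr,
        "sse": sse,
        "r_squared": ssr / ssy,
        # F ratio for the line against a mean-only model:
        # 1 degree of freedom for the slope, n - 2 for the error.
        "f_ratio": ssr / (sse / (n - 2)),
    }

# Invented data, purely for illustration.
x = [0, 1, 2, 3, 4]
y = [1, 3, 4, 7, 9]

fit = fit_line(x, y)
for name, value in fit.items():
    print(f"{name}: {value:.3f}")
```

The total sum of squares splits into the part explained by the line and the part left over as error, so `ssy` equals `ssr + sse` up to rounding. A large F ratio suggests that the line explains far more variation than a model with just the mean, and comparing models in this way is the basic idea behind simplifying a model one term at a time.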

The book is accompanied by a website, Statistics: An Introduction using R, where you can download the data sets used in the book's examples. I found model simplification easy enough with basic data sets, but very difficult with more complex sets, failing to end up with the same minimal adequate model that the author reached even after several attempts.

Having read the whole book, I can't say I'd feel completely safe analysing real data samples with multiple factors. But I certainly gained a better understanding of how real statistical analysis should work, and I was left with the feeling that almost every real-world survey, questionnaire and poll is carried out in such a way that its conclusions are meaningless. For someone who does collect unbiased samples of data as part of their work, and needs to be able to analyse the numbers intelligently, R is an excellent piece of free software. And Statistics: An Introduction using R is an excellent and detailed book for delving into serious analysis of such samples.