After my previous post on IQ and Height, I thought it would be good to revisit the subjects of scatterplots and correlation. A scatterplot is often an excellent first step in any data analysis. You can learn a lot from a scatterplot, because humans are good at pattern recognition. A lot of silly analysis could be avoided by looking at a scatterplot. One of my main complaints about the articles I read for the IQ and Height post was that none of them included a scatterplot.
However, they all reported an r or r² value. Correlation coefficients seem very simple, but they are particularly dangerous because of a hidden assumption in the mathematics: that the data fall along a straight line with only random variation. When there is some other kind of relationship, the correlation coefficient will not tell you. There are purely mathematical ways to check for this, but why bother when you can see it?
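To make the straight-line assumption concrete, here is a minimal Python sketch. It is not taken from any of the articles; the helper name `pearson_r` and the toy data are my own illustrative choices. It computes r for a perfect straight line and for a perfect parabola over the same x values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy data (made up for illustration): x runs from -5 to 5
xs = list(range(-5, 6))
line = [2 * x + 1 for x in xs]   # exact linear relationship
parabola = [x * x for x in xs]   # exact, but non-linear, relationship

r_line = pearson_r(xs, line)
r_parabola = pearson_r(xs, parabola)

print(f"line:     r = {r_line:.3f}")
print(f"parabola: r = {r_parabola:.3f}")
```

Both y series are completely determined by x, yet only the line gets a high r. The symmetric parabola comes out with essentially no correlation, even though a scatterplot would reveal the pattern at a glance.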
The best illustration of this I have ever seen was published by F. J. Anscombe in The American Statistician in 1973. I have seen this same graph referenced repeatedly in other places, and the data set used to create the graph is even contained in the standard R distribution. If you would like to see the data and graph yourself in R, check out the code.
Anscombe created these four data sets to have not only the same r value, but also the same means and the same least-squares fit. For the first three, Anscombe noted that you might not even notice something was amiss if you just ran your eye down the data table, because the data points are not always in order. Hopefully if your data looked like case 4, you would notice.
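If you would rather check those matching statistics numerically, the following Python sketch computes the means, r, and the least-squares line for each of the four sets. The values are typed in by hand from the `anscombe` data set that ships with R, so treat them as a transcription and verify them against R if it matters. The names `QUARTET` and `summarize` are my own.

```python
import math

# Anscombe's four data sets, transcribed by hand from R's built-in `anscombe`.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared x for sets 1-3
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return (mean_x, mean_y, r, slope, intercept) for one data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx                      # least-squares slope
    intercept = mean_y - slope * mean_x    # least-squares intercept
    r = sxy / math.sqrt(sxx * syy)         # Pearson correlation
    return mean_x, mean_y, r, slope, intercept

results = {k: summarize(xs, ys) for k, (xs, ys) in QUARTET.items()}
for k, (mx, my, r, slope, intercept) in results.items():
    print(f"set {k}: mean x = {mx:.2f}, mean y = {my:.2f}, "
          f"r = {r:.3f}, fit: y = {intercept:.2f} + {slope:.3f} x")
```

All four rows should agree to about two decimal places, which is exactly why the summary numbers alone cannot tell the four cases apart.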
The moral here: beware of correlations, and graph your data.
Anscombe's Data Sets