Why Regression to the Moon is No Bueno

Students really enjoy learning about scatter plots, correlation, and the least squares regression line. They find it fascinating that as ice cream sales in urban areas increase, so do homicide rates, and the lurking variable in the background - ambient temperature and its effect on behavior! - is the culprit. The content gives students the opportunity to develop a robust BS detector (calm down, BS stands for Bad Statistics).

One data set that challenges students' notions about regression and the LSRL is Anscombe's Quartet. We are using the activity in AP Stats class today.

APS Anscombes Quartet Activity

Below is a sample student response to the worksheet. The paragraphs students wrote at the bottom of the page revealed many things about their current understanding of LSRL concepts.

Exhibit A: Sample student response to the task.

I wanted to show students that because the coefficient of correlation depends on several non-resistant measures, namely the means and standard deviations of both x and y, a single data point can have a devastating impact on the LSRL. Specifically, one data point can completely reverse the direction of a linear model. The images below support this claim. I demonstrated this for the students with a dynamic GeoGebra worksheet posted by another teacher.
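For reference, here is one common textbook way to write r and the LSRL coefficients. These formulas are standard, but they are not taken from the worksheet or the GeoGebra file, and the symbols a and b for the intercept and slope are my own labels.

```latex
r = \frac{1}{n-1}\sum_{i=1}^{n}
    \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right),
\qquad
b = r\,\frac{s_y}{s_x},
\qquad
a = \bar{y} - b\,\bar{x}
```

Every ingredient is a mean or a standard deviation, and neither is resistant to outliers, so one extreme point can shift all of them at once.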

Exhibit B: Initial LSRL. The source site can be found here.

Drag a point or two to an extreme of the screen and suddenly the direction of the relationship changes.

Exhibit C: Moving points and checking the effect on the sum of the areas of the squares whose side lengths are the residuals for each point.
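If you don't have the GeoGebra file handy, the same effect can be sketched in a few lines of Python. This is only a rough stand-in for the dynamic worksheet: the five data points, the helper name `lsrl`, and the location of the "dragged" point are all made up for illustration.

```python
import numpy as np

def lsrl(x, y):
    """Return (slope, intercept, r, sse) for the least squares line of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    # Slope and intercept built from r, the SDs, and the means.
    slope = r * y.std(ddof=1) / x.std(ddof=1)
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    # Sum of squared residuals: the total area of the squares in the demo.
    sse = float(np.sum(residuals ** 2))
    return slope, intercept, r, sse

# Hypothetical data: five points with a clear positive trend.
x = [1, 2, 3, 4, 5]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
before = lsrl(x, y)

# "Drag" the last point far to the lower right of the screen.
x_moved = x[:-1] + [30]
y_moved = y[:-1] + [-10]
after = lsrl(x_moved, y_moved)

for label, (slope, intercept, r, sse) in [("before", before), ("after", after)]:
    print(f"{label}: slope={slope:.3f}, intercept={intercept:.3f}, "
          f"r={r:.3f}, sum of squared residuals={sse:.3f}")
```

Moving just one of the five points flips the sign of both the slope and r, and the sum of squared residuals grows as well.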

Coupling the Anscombe's Quartet worksheet with the LSRL activity in GeoGebra convinced the students that the coefficient of correlation, r, and the coefficient of determination, r^2, do not necessarily provide a complete summary of bivariate data. Furthermore, neither measure tells us definitively whether a linear model is appropriate for a bivariate data set.
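To see how little the summary numbers reveal, here is a small sketch that computes r, r^2, and the LSRL for each of the four sets. The values are the ones commonly reproduced from Anscombe's 1973 paper; I am assuming they match the table on the worksheet, which is not shown in this post. The function name `summarize` is just a label I picked.

```python
import numpy as np

# Anscombe's quartet (sets I-III share the same x values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return r, r^2, slope, and intercept for the LSRL of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 least squares fit
    r = np.corrcoef(x, y)[0, 1]
    return {"r": r, "r2": r ** 2, "slope": slope, "intercept": intercept}

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}

for name, s in summaries.items():
    print(f"Set {name}: r={s['r']:.3f}, r^2={s['r2']:.3f}, "
          f"y-hat = {s['intercept']:.2f} + {s['slope']:.3f}x")
```

The four lines of output are nearly indistinguishable, yet the scatter plots look nothing alike, which is why the graph has to come before the numbers.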
