"To be a statistician is great! You never have to be 'absolutely sure' of something. Being 'reasonably certain' is enough."

Pavel E. Guarisma, North Carolina State University

 Here are those AP STATers again. Each is holding a copy of The Practice of Statistics, by Yates, Moore, McCabe. Considering the textbooks as points, what characteristic would the least-squares regression line possess? (A) It would be approximately y = x. (B) It would be approximately horizontal. (C) It would be approximately vertical.

3.3 LEAST-SQUARES REGRESSION (Pages 137-160)

OVERVIEW: If a scatterplot shows a linear relationship between two quantitative variables, least-squares regression is a method for finding a line that summarizes the relationship between the two variables, at least within the domain of the explanatory variable, x. The least-squares regression line (LSRL) is a mathematical model for the data.

Regression Line: A straight line that describes how a response variable y changes as an explanatory variable x changes. It can sometimes be used to predict the value of y for a given value of x.

A residual is the difference between an observed y and a predicted y: residual = y - y(hat).

Important facts about the least squares regression line.

• It is a mathematical model for the data.
• It is the line that makes the sum of the squares of the residuals as small as possible.
• The point (x(bar), y(bar)) is on the line, where x(bar) is the mean of the x values and y(bar) is the mean of the y values.
• Its form is y(hat) = a + bx. (Note that b is the slope and a is the y-intercept.)
• b = r(s_y/s_x), where s_x and s_y are the standard deviations of x and y. (On the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y.)
• a = y(bar) - b·x(bar).
• The slope b is the approximate change in y when x increases by 1. (The word "approximate" is important here.)
• The y-intercept a is the predicted value of y when x = 0. (Note that this only has meaning when x can assume values close to 0, and the word "predicted" is important.)
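
As a rough illustration of the formulas b = r(s_y/s_x) and a = y(bar) - b·x(bar), here is a minimal Python sketch. It assumes the three data points (2, 6), (4, 12), (6, 15) from the r^2 example in this section; the variable names are chosen here for illustration and do not come from the textbook.

```python
from statistics import mean, stdev

# Data points taken from the r^2 example in this section.
x = [2, 4, 6]
y = [6, 12, 15]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divide by n - 1)

# Correlation r: sum of products of standardized values, divided by n - 1.
n = len(x)
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

b = r * s_y / s_x        # slope
a = y_bar - b * x_bar    # y-intercept

print(f"r = {r:.4f}")
print(f"y(hat) = {a:.2f} + {b:.2f}x")
```

Because a is defined as y(bar) - b·x(bar), substituting x = x(bar) into the line returns y(bar), which is why the point (x(bar), y(bar)) always lies on the LSRL.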

r^2 in regression: The coefficient of determination, r^2, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

Calculation of r^2 for a simple example:

r^2 = (SSM - SSE)/SSM, where

SSM = sum (y - y(bar))^2 (Sum of squares about the mean y(bar))
SSE = sum (y - y(hat))^2 (Sum of squares of residuals)

In this example, y(hat) = 2 + 2.25x, the mean of x is 4, and the mean of y is 11.

x        y     y-11    (y-11)^2    y(hat)    residual = y-y(hat)    (residual)^2
2        6      -5        25         6.5            -0.5                0.25
4       12       1         1        11.0             1.0                1.00
6       15       4        16        15.5            -0.5                0.25
TOTALS           0        42 = SSM                   0.0                1.50 = SSE

r^2 = (SSM - SSE)/SSM = (42 - 1.5)/42 = 40.5/42 ≈ 0.9643
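
The arithmetic in this example can be checked with a short Python sketch. It assumes the given line y(hat) = 2 + 2.25x and the three data points from the table; the variable names are illustrative only.

```python
# Data and fitted line from the worked example above.
x = [2, 4, 6]
y = [6, 12, 15]

y_bar = sum(y) / len(y)               # mean of y (11)
y_hat = [2 + 2.25 * xi for xi in x]   # predicted values from y(hat) = 2 + 2.25x

residuals = [yi - yh for yi, yh in zip(y, y_hat)]

ssm = sum((yi - y_bar) ** 2 for yi in y)   # sum of squares about the mean
sse = sum(e ** 2 for e in residuals)       # sum of squared residuals

r_squared = (ssm - sse) / ssm

print("SSM =", ssm)                        # 42.0
print("SSE =", sse)                        # 1.5
print("sum of residuals =", sum(residuals))  # 0.0
print("r^2 =", round(r_squared, 4))        # 0.9643
```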

THINGS TO NOTE:

• Sum of deviations from mean = 0.
• Sum of residuals = 0.
• r^2 > 0 does not mean r > 0. If x and y are negatively associated, then r < 0.

Outlier: A point that lies outside the overall pattern of the other points in a scatterplot. (It can be an outlier in the x direction, in the y direction, or in both directions.)

Influential point: A point that, if removed, would considerably change the position of the regression line. (Points that are outliers in the x direction are often influential.)
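
To see how an outlier in the x direction can be influential, here is a small Python sketch. The data set is made up for illustration (it is not from the textbook), and lsrl is a hypothetical helper written for this example.

```python
def lsrl(points):
    """Return (a, b) for the least-squares line y(hat) = a + b*x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Made-up data: five points exactly on y = 2x, plus one outlier in the x direction.
base = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
outlier = (15, 5)

a_without, b_without = lsrl(base)
a_with, b_with = lsrl(base + [outlier])

print(f"without outlier: y(hat) = {a_without:.2f} + {b_without:.2f}x")
print(f"with outlier:    y(hat) = {a_with:.2f} + {b_with:.2f}x")
```

In this made-up case the single far-right point pulls the slope from 2 down to roughly 0.08, which is the kind of considerable change the definition describes.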

NOTE: Do not confuse the slope b of the LSRL with the correlation r. The relation between the two is given by the formula b = r(s_y/s_x). If you are working with normalized data, then b does equal r, since s_y = s_x = 1. (When you normalize a data set, the normalized data has mean = 0 and standard deviation = 1.) In that case the regression line has the simple form y_n = r·x_n, where x_n and y_n are the normalized x and y values, respectively. Since the regression line contains the point whose coordinates are the mean of x and the mean of y, and since normalized data has a mean of 0, the regression line for normalized x and y values contains (0,0).
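
The claim that the slope equals r for normalized data can be checked numerically. The sketch below assumes the data points (2, 6), (4, 12), (6, 15) from the r^2 example in this section; standardize and lsrl are hypothetical helper names introduced here.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert values to normalized form: mean 0, standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def lsrl(x, y):
    """Return (a, b) for the least-squares line y(hat) = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b

x = [2, 4, 6]
y = [6, 12, 15]

xn, yn = standardize(x), standardize(y)

# Correlation r: sum of products of normalized values, divided by n - 1.
r = sum(u * v for u, v in zip(xn, yn)) / (len(x) - 1)

a_n, b_n = lsrl(xn, yn)
print(f"r = {r:.4f}, slope on normalized data = {b_n:.4f}, intercept = {a_n:.4f}")
```

Up to floating-point rounding, the slope on the normalized data matches r and the intercept is 0, so the fitted line passes through (0,0).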

PHACS (Procedure, Hypothesis, Assumptions, Calculations, Summarize)