Least squares regression

"To be a statistician is great! You never have to be 'absolutely sure' of something. Being 'reasonably certain' is enough."

Pavel E. Guarisma, North Carolina State University

Here are those AP STATers again. Each is holding a copy of The Practice of Statistics, by Yates, Moore, McCabe. Considering the textbooks as points, what characteristic would the least-squares regression line possess?

(A) It would be approximately y = x.

(B) It would be approximately horizontal.

3.3 LEAST-SQUARES REGRESSION (Pages 137-160)

OVERVIEW: If a scatterplot shows a linear relationship between two quantitative variables, least-squares regression is a method for finding a line that summarizes the relationship between the two variables, at least within the domain of the explanatory variable, x. The least-squares regression line (LSRL) is a mathematical model for the data.

Regression Line: A straight line that describes how a response variable y changes as an explanatory variable x changes. It can sometimes be used to predict the value of y for a given value of x.

A residual is the difference between an observed y and a predicted y: residual = y - y(hat).

Important facts about the least squares regression line.

It is a mathematical model for the data.
It is the line that makes the sum of the squares of the residuals as small as possible.
The point (x(bar), y(bar)) is on the line, where x(bar) is the mean of the x values and y(bar) is the mean of the y values.
Its form is y(hat) = a + bx. (Note that b is the slope and a is the y-intercept.)
b = r(s_y/s_x). (On the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y.)
a = y(bar) - b·x(bar).
The slope b is the approximate change in y when x increases by 1. (The word "approximate" is important here.)
The y-intercept a is the predicted value of y when x = 0. (Note that this only has meaning when x can assume values close to 0, and the word "predicted" is important.)
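
The slope and intercept formulas above can be checked numerically. What follows is a minimal Python sketch, assuming the three data points from the r² example on this page, (2, 6), (4, 12) and (6, 15). The function names correlation and lsrl are illustrative choices, not part of the textbook.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from standardized values (divisor n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def lsrl(xs, ys):
    """Return (a, b) for y(hat) = a + b*x, using b = r*(s_y/s_x) and a = y_bar - b*x_bar."""
    r = correlation(xs, ys)
    b = r * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

xs = [2, 4, 6]
ys = [6, 12, 15]
a, b = lsrl(xs, ys)
print(f"y(hat) = {a:.2f} + {b:.2f}x")

# The line passes through the point of means (x_bar, y_bar).
on_line = abs((a + b * mean(xs)) - mean(ys)) < 1e-9
print(on_line)
```

For these points the sketch reproduces the line y(hat) = 2 + 2.25x used in the r² example.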

r² in regression: The coefficient of determination, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

Calculation of r² for a simple example:

r² = (SSM - SSE)/SSM, where
SSM = sum(y - y(bar))² (Sum of squares about the mean y(bar))
SSE = sum(y - y(hat))² (Sum of squares of residuals)
In this example, y(hat) = 2 + 2.25x, the mean of x is 4, and the mean of y is 11.

x	y	y-11	(y-11)²	y(hat)	residual=y-y(hat)	(residual)²
2	6	-5	25	6.5	-0.5	0.25
4	12	1	1	11.0	1.0	1.00
6	15	4	16	15.5	-0.5	0.25
TOTALS		0	42 = SSM		0.0	1.50 = SSE

r² = (SSM - SSE)/SSM = (42 - 1.5)/42 ≈ 0.9642857143
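
The table's arithmetic can be reproduced in a few lines. This is a small Python sketch using the same three points and the same line y(hat) = 2 + 2.25x; the variable names are only illustrative.

```python
xs = [2, 4, 6]
ys = [6, 12, 15]

y_bar = sum(ys) / len(ys)              # mean of y
y_hat = [2 + 2.25 * x for x in xs]     # predicted values from y(hat) = 2 + 2.25x

# Sum of squares about the mean of y
ssm = sum((y - y_bar) ** 2 for y in ys)
# Sum of squares of the residuals y - y(hat)
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))

r_squared = (ssm - sse) / ssm
print(ssm, sse, round(r_squared, 4))
```

The values of ssm and sse should agree with the TOTALS row of the table.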

THINGS TO NOTE:

Sum of deviations from mean = 0.
Sum of residuals = 0.
r² > 0 does not mean r > 0. If x and y are negatively associated, then r < 0.
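
These three notes can be checked with a short Python sketch. It assumes the example data above, plus a hypothetical second data set made by listing the same y values in reverse order so that the association is negative. The function name correlation is an illustrative choice.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from standardized values (divisor n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

xs = [2, 4, 6]
ys = [6, 12, 15]

deviations = [y - mean(ys) for y in ys]                    # y - y(bar)
residuals = [y - (2 + 2.25 * x) for x, y in zip(xs, ys)]   # y - y(hat)
print(sum(deviations), sum(residuals))

# Hypothetical negatively associated data: same y values, reversed.
ys_reversed = ys[::-1]
r_pos = correlation(xs, ys)
r_neg = correlation(xs, ys_reversed)
print(r_pos > 0, r_neg < 0)
print(abs(r_pos ** 2 - r_neg ** 2) < 1e-12)   # same r², opposite signs of r
```

Both data sets give the same r², so r² alone cannot tell you the direction of the association.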

Outlier: A point that lies outside the overall pattern of the other points in a scatterplot. (It can be an outlier in the x direction, in the y direction, or in both directions.)

Influential point: A point that, if removed, would considerably change the position of the regression line. (Points that are outliers in the x direction are often influential.)
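
To see how much one point can move the line, here is a minimal Python sketch. It takes the three example points from the r² calculation and adds one made-up point, (20, 10), that is an outlier in the x direction. That extra point and the function name slope_intercept are assumptions for illustration only.

```python
def slope_intercept(points):
    """Least-squares slope b and intercept a.

    Uses b = Sxy/Sxx, which is algebraically the same as b = r*(s_y/s_x).
    """
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return b, a

points = [(2, 6), (4, 12), (6, 15), (20, 10)]   # (20, 10) is a made-up x outlier

b_all, a_all = slope_intercept(points)
b_without, a_without = slope_intercept(points[:-1])

print(f"with (20, 10):    y(hat) = {a_all:.2f} + {b_all:.2f}x")
print(f"without (20, 10): y(hat) = {a_without:.2f} + {b_without:.2f}x")
```

Removing the single point changes the slope from nearly flat to 2.25, which is what makes it influential.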

NOTE: Do not confuse the slope b of the LSRL with the correlation r. The relation between the two is given by the formula b = r(s_y/s_x). If you are working with normalized data, then b does equal r, since s_y = s_x = 1. (When you normalize a data set, the normalized data has mean = 0 and standard deviation = 1.) If you are working with normalized data, the regression line has the simple form y_n = r·x_n, where x_n and y_n are normalized x and y values, respectively. Since the regression line contains the mean of x and the mean of y, and since normalized data has a mean of 0, the regression line for normalized x and y values contains (0,0).
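
The claim about normalized data can be checked directly. The Python sketch below assumes the three example points from the r² calculation and uses hypothetical helper names (standardize, slope_intercept). It normalizes x and y, fits the least-squares line, and compares the slope with r.

```python
from statistics import mean, stdev

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def slope_intercept(xs, ys):
    """Least-squares slope b = Sxy/Sxx and intercept a = y_bar - b*x_bar."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return b, y_bar - b * x_bar

xs = [2, 4, 6]
ys = [6, 12, 15]

x_n = standardize(xs)
y_n = standardize(ys)

# Correlation r as the average product of normalized values (divisor n - 1)
r = sum(xn * yn for xn, yn in zip(x_n, y_n)) / (len(xs) - 1)

b_n, a_n = slope_intercept(x_n, y_n)
print(abs(b_n - r) < 1e-9)    # slope of the normalized line equals r
print(abs(a_n) < 1e-9)        # intercept is 0, so the line contains (0,0)
```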

PHACS (Procedure, Hypothesis, Assumptions, Calculations, Summarize)
