Copyright © Philip M. Parker, INSEAD. Terms of Use.

Linear Regression


Synonym: Linear Regression

Synonym: rectilinear regression (n). (additional references)


Specialty Definition: Linear regression

(From Wikipedia, the free Encyclopedia)

Linear regression is a method of data analysis intended to be used with a set of paired observations on two variables on the same set of statistical units. Conventionally, we refer to one of the variables as independent (usually labeled x) and the other as dependent (labeled y).

The notion of an independent variable often (but not always) implies the ability to choose the levels of the independent variable and that the dependent variable will respond naturally as in the stimulus-response model. The independent variable x may be a scalar or a vector. In the former case we may write one of the simplest linear-regression models as follows:

$$ y = \alpha + \beta x + \varepsilon $$

where ε is a random "error".

Historically, in applications to measurements in astronomy, the "error" was actually a random measurement error, but in many applications, ε is merely the amount by which the individual y-value differs from the average y-value among individuals having the same x-value. The average value of the random "error" is zero. Often in linear regression problems statisticians rely on the Gauss-Markov assumptions:

  1. The random errors have expected value 0.
  2. The random errors are uncorrelated (this is weaker than an assumption of probabilistic independence).
  3. The random errors are "homoscedastic", i.e., they all have the same variance.
(See also Gauss-Markov theorem. That result says that under the assumptions above, least-squares estimators are in a certain sense optimal.)

Sometimes stronger assumptions are relied on:

  1. The random errors have expected value 0.
  2. They are independent.
  3. They are normally distributed.
  4. They all have the same variance.

If x is a vector we can take the product βx to be a "dot-product".

It is often erroneously thought that the reason the technique is called "linear regression" is that the graph of y = α + βx is a line. But in fact, if the model is

$$ y = \alpha + \beta x + \gamma x^2 + \varepsilon $$

(in which case we have put the vector (β, γ) in the role formerly played by β and the vector (x, x²) in the role formerly played by x), then the problem is still one of linear regression, even though the graph is not a straight line. The rationale for this terminology will be explained below.
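To make the point concrete, here is a minimal sketch (not the article's own method) that fits a model with an x² term by ordinary least squares. The helper names solve and fit_quadratic, and the noise-free data generated from hypothetical coefficients 1, 2 and 3, are assumptions made only for this illustration. The estimates come from solving a system of linear equations, even though the fitted curve is a parabola.

```python
def solve(A, rhs):
    """Solve the square system A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def fit_quadratic(xs, ys):
    """Least-squares estimates for y = alpha + beta*x + gamma*x^2 via the normal equations."""
    rows = [[1.0, x, x * x] for x in xs]  # each row plays the role of the vector (1, x, x^2)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve(xtx, xty)


# Hypothetical, noise-free data generated from alpha=1, beta=2, gamma=3.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]
alpha_hat, beta_hat, gamma_hat = fit_quadratic(xs, ys)
print(alpha_hat, beta_hat, gamma_hat)
```

Because the data contain no noise, the three estimates should reproduce the generating coefficients up to rounding error.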

A statistician will usually estimate the unobservable values of the parameters α and β by the method of least squares, which consists of finding the values of a and b that minimize the sum of squares of the residuals

$$ \hat{\varepsilon}_i = y_i - (a + b x_i), \qquad i = 1, \dots, n. $$

Those values are the "least-squares estimates." The residuals may be regarded as estimates of the errors.
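The following sketch illustrates the minimization on a small made-up data set. The function names and the data are assumptions for illustration only, and the closed-form expressions used for a and b (in terms of deviations from the means) are the standard solution of the minimization rather than something stated in the text above.

```python
def sum_sq_residuals(a, b, xs, ys):
    """Sum of squared residuals y_i - (a + b*x_i) for candidate values a, b."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Closed-form least-squares estimates for the line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b


# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
best = sum_sq_residuals(a, b, xs, ys)

# Nudging a or b away from the estimates should only increase the sum of squares.
nearby = [sum_sq_residuals(a + da, b + db, xs, ys)
          for da in (-0.1, 0.0, 0.1)
          for db in (-0.1, 0.0, 0.1)
          if (da, db) != (0.0, 0.0)]
print(a, b, best, min(nearby))
```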

Notice that, whereas the errors are independent, the residuals cannot be independent because the use of least-squares estimates implies that the sum of the residuals must be 0, and the dot-product of the vector of residuals with the vector of x-values must be 0, i.e., writing \hat{\varepsilon}_i for the ith residual, we must have

$$ \sum_{i=1}^{n} \hat{\varepsilon}_i = 0 $$

and

$$ \sum_{i=1}^{n} \hat{\varepsilon}_i x_i = 0. $$

These two linear constraints imply that the vector of residuals must lie within a certain (n − 2)-dimensional subspace of R^n; hence we say that there are "n − 2 degrees of freedom for error". If one assumes the errors are normally distributed and independent, then it can be shown to follow that 1) the sum of squares of residuals

$$ \sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2} $$

is distributed as

$$ \sigma^2 \chi^2_{n-2}, $$

i.e., the sum of squares divided by the error-variance σ² has a chi-square distribution with n − 2 degrees of freedom, and 2) the sum of squares of residuals is actually probabilistically independent of the estimates a, b of the parameters α and β.
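The two constraints on the residuals can be checked numerically. The sketch below uses a small made-up data set and variable names chosen for this illustration. It also computes the sum of squared residuals divided by n − 2, which is the usual estimate of the error variance (an assumption here, since the text above does not name it).

```python
# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Least-squares estimates a and b for the line y = a + b*x.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Both quantities should be zero up to floating-point rounding.
sum_res = sum(residuals)
dot_res_x = sum(e * x for e, x in zip(residuals, xs))

# Two linear constraints leave n - 2 degrees of freedom for error.
dof = n - 2
s2 = sum(e * e for e in residuals) / dof  # usual estimate of the error variance

print(sum_res, dot_res_x, dof, s2)
```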

These facts make it possible to use Student's t-distribution with n − 2 degrees of freedom (so named in honor of the pseudonymous "Student") to find confidence intervals for α and β.
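As a sketch of how such intervals are usually written under the normal-error assumptions, the standard textbook forms are given below. The symbols s, \bar{x}, the confidence level 1 − γ, and the quantile notation t_{n-2,\,1-\gamma/2} are notation assumed here for illustration and do not come from the text above.

```latex
\begin{aligned}
s^2 &= \frac{1}{n-2}\sum_{i=1}^{n}\hat{\varepsilon}_i^{\,2},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \\[4pt]
\beta &:\quad b \;\pm\; t_{n-2,\,1-\gamma/2}\,
      \frac{s}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}, \\[4pt]
\alpha &:\quad a \;\pm\; t_{n-2,\,1-\gamma/2}\; s\,
      \sqrt{\frac{1}{n} + \frac{\bar{x}^{2}}{\sum_{i=1}^{n}(x_i-\bar{x})^2}} .
\end{aligned}
```

Here t_{n-2,\,1-\gamma/2} denotes the 1 − γ/2 quantile of Student's t-distribution with n − 2 degrees of freedom.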

Denote by capital Y the column vector whose ith entry is y_i, and by capital X the n × 2 matrix whose second column contains x_i as its ith entry and whose first column contains n 1s. Let ε be the column vector containing the errors ε_i. Let δ and d be, respectively, the 2 × 1 column vector containing α and β and the 2 × 1 column vector containing the estimates a and b. Then the model can be written as

$$ Y = X\delta + \varepsilon $$

where ε is normally distributed with expected value 0 (i.e., a column vector of 0s) and variance σ² I_n, where I_n is the n × n identity matrix. The vector Xd (where, remember, d is the vector of estimates) is then the orthogonal projection of Y onto the column space of X.

Then it can be shown that

$$ d = (X'X)^{-1} X'Y $$

(where X' is the transpose of X) and the sum of squares of residuals is

$$ Y'\left(I_n - X(X'X)^{-1}X'\right)Y. $$

The fact that the matrix X(X'X)⁻¹X' is a symmetric idempotent matrix is incessantly relied on both in computations and in proofs of theorems. The linearity of d as a function of the vector Y, expressed above by saying d = (X'X)⁻¹X'Y, is the reason why this is called "linear" regression. Nonlinear regression uses nonlinear methods of estimation.
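A minimal sketch of the matrix formula for the two-parameter case follows. The function name fit_matrix_form and the data are assumptions for illustration, and the 2 × 2 matrix X'X is inverted explicitly, which is reasonable for a sketch but is not how numerical libraries usually solve least-squares problems.

```python
def fit_matrix_form(xs, ys):
    """Return d = (X'X)^{-1} X'Y for the n x 2 matrix X whose rows are [1, x_i]."""
    n = len(xs)
    # Entries of the symmetric 2 x 2 matrix X'X.
    s00, s01, s11 = float(n), float(sum(xs)), float(sum(x * x for x in xs))
    # Entries of the 2 x 1 vector X'Y.
    t0, t1 = float(sum(ys)), float(sum(x * y for x, y in zip(xs, ys)))
    det = s00 * s11 - s01 * s01  # zero only if all x_i are equal
    inv = [[s11 / det, -s01 / det],
           [-s01 / det, s00 / det]]
    d = [inv[0][0] * t0 + inv[0][1] * t1,
         inv[1][0] * t0 + inv[1][1] * t1]
    return d, inv


# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

d, xtx_inv = fit_matrix_form(xs, ys)

# Xd, the orthogonal projection of Y onto the column space of X.
fitted = [d[0] + d[1] * x for x in xs]
ssr_direct = sum((y - f) ** 2 for y, f in zip(ys, fitted))

# Y'(I - X(X'X)^{-1}X')Y expands to Y'Y - (X'Y)'d.
xty = [sum(ys), sum(x * y for x, y in zip(xs, ys))]
ssr_matrix = sum(y * y for y in ys) - (xty[0] * d[0] + xty[1] * d[1])

print(d, ssr_direct, ssr_matrix)
```

The two ways of computing the sum of squares of residuals should agree up to rounding error.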

The matrix I_n − X(X'X)⁻¹X' that appears above is a symmetric idempotent matrix of rank n − 2. Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional spectral theorem of linear algebra says that any real symmetric matrix M can be diagonalized by an orthogonal matrix G, i.e., the matrix G'MG is a diagonal matrix. If the matrix M is also idempotent, then the diagonal entries in G'MG must be idempotent numbers. Only two real numbers are idempotent: 0 and 1. Since the rank equals the number of nonzero diagonal entries, I_n − X(X'X)⁻¹X', after diagonalization, has n − 2 1s and two 0s on the diagonal. That is most of the work in showing that the sum of squares of residuals has a chi-square distribution with n − 2 degrees of freedom.
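These properties can be checked numerically for a small example. The sketch below builds the matrix for a made-up set of x-values (the function name residual_maker is an assumption for illustration) and verifies symmetry, idempotency, and that the trace equals n − 2. For an idempotent matrix the trace equals the rank, because the trace is the sum of the eigenvalues and each eigenvalue is 0 or 1.

```python
def residual_maker(xs):
    """Build M = I_n - X(X'X)^{-1}X' for the n x 2 matrix X whose rows are [1, x_i]."""
    n = len(xs)
    s0, s1, s2 = float(n), float(sum(xs)), float(sum(x * x for x in xs))
    det = s0 * s2 - s1 * s1
    inv = [[s2 / det, -s1 / det],
           [-s1 / det, s0 / det]]
    M = []
    for i in range(n):
        row = []
        for j in range(n):
            # h_ij = [1, x_i] (X'X)^{-1} [1, x_j]'
            h = (inv[0][0] + inv[0][1] * xs[j]
                 + xs[i] * (inv[1][0] + inv[1][1] * xs[j]))
            row.append((1.0 if i == j else 0.0) - h)
        M.append(row)
    return M


xs = [1, 2, 3, 4, 5]  # made-up illustrative x-values
n = len(xs)
M = residual_maker(xs)

# M times M, computed entry by entry.
MM = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]

idempotent_gap = max(abs(MM[i][j] - M[i][j]) for i in range(n) for j in range(n))
symmetric_gap = max(abs(M[i][j] - M[j][i]) for i in range(n) for j in range(n))
trace_M = sum(M[i][i] for i in range(n))

print(idempotent_gap, symmetric_gap, trace_M)
```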

Note: A useful alternative to linear regression is robust regression. In one common variant, least absolute deviations, the mean absolute error is minimized instead of the mean squared error as in linear regression. Such methods are computationally more intensive than least squares, which has a closed-form solution, and are somewhat more difficult to implement as well.

Summarizing the data

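We sum the observations, the squares of the Y's and X's, and the products X*Y to obtain the following quantities.

$$ S_X = x_1 + x_2 + \cdots + x_n $$

and S_Y similarly.

$$ S_{XX} = x_1^2 + x_2^2 + \cdots + x_n^2 $$

and S_YY similarly.

$$ S_{XY} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n $$

Estimating beta

We use the summary statistics above to calculate b, the estimate of beta.

$$ b = \frac{n S_{XY} - S_X S_Y}{n S_{XX} - S_X^2} $$

Estimating alpha

We use the estimate of beta and the other statistics to estimate alpha by:

$$ a = \frac{S_Y - b S_X}{n} $$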
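The calculation from summary sums can be sketched as follows. The data are made up for illustration, and the variable names s_x, s_y, s_xx, s_yy and s_xy are assumptions standing for the sums of the x's, the y's, their squares, and the products.

```python
# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Summary sums.
s_x = sum(xs)
s_y = sum(ys)
s_xx = sum(x * x for x in xs)
s_yy = sum(y * y for y in ys)
s_xy = sum(x * y for x, y in zip(xs, ys))

# Estimate of beta, then estimate of alpha.
b = (n * s_xy - s_x * s_y) / (n * s_xx - s_x * s_x)
a = (s_y - b * s_x) / n

print(s_x, s_y, s_xx, s_yy, s_xy)
print(a, b)
```

Forming these sums directly can lose precision when the values are large relative to their spread. Working with deviations from the means is a common alternative in that situation.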

Displaying the residuals

The first method of displaying the residuals uses a histogram or cumulative distribution to depict the similarity (or lack thereof) to a normal distribution. Non-normality suggests that the model may not be a good summary description of the data.

The second method plots the residuals against the independent variable, X. There should be no discernible trend or pattern if the model is satisfactory for this data. Some of the possible problems are:

Ancillary statistics

The sum of squared deviations can be partitioned as in ANOVA to indicate what part of the dispersion of the dependent variable is explained by the independent variable.
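A minimal sketch of this partition, on made-up data with variable names chosen for this illustration: the total sum of squared deviations of y about its mean splits into a part explained by the fitted line and a residual part.

```python
# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Least-squares line y = a + b*x.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar
fitted = [a + b * x for x in xs]

total = sum((y - y_bar) ** 2 for y in ys)                # dispersion of y about its mean
explained = sum((f - y_bar) ** 2 for f in fitted)        # part accounted for by x
residual = sum((y - f) ** 2 for y, f in zip(ys, fitted)) # part left unexplained

print(total, explained, residual)
```

For a least-squares fit that includes an intercept, explained + residual should equal total up to rounding error.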

The correlation coefficient, r, can be calculated from the summary sums by

$$ r = \frac{n S_{XY} - S_X S_Y}{\sqrt{\left(n S_{XX} - S_X^2\right)\left(n S_{YY} - S_Y^2\right)}} $$

This statistic is a measure of how well a straight line describes the data. Values near zero suggest that the model is ineffective. r² is frequently interpreted as the fraction of the variability explained by the independent variable, X.
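As an illustration, the sketch below computes r from sums of the x's, the y's, their squares, and their products on a small made-up data set. The variable names are assumptions chosen for this example.

```python
import math

# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Sums of the observations, their squares, and their products.
s_x = sum(xs)
s_y = sum(ys)
s_xx = sum(x * x for x in xs)
s_yy = sum(y * y for y in ys)
s_xy = sum(x * y for x, y in zip(xs, ys))

r = (n * s_xy - s_x * s_y) / math.sqrt(
    (n * s_xx - s_x ** 2) * (n * s_yy - s_y ** 2))
r_squared = r * r  # fraction of the variability in y explained by x

print(r, r_squared)
```

On this data set r² comes out close to 0.6, so the fitted line accounts for about 60% of the dispersion of the y-values.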

Source: adapted by the editor from Wikipedia, the free encyclopedia under a copyleft GNU Free Documentation License (GFDL) from the article "Linear regression."


Crosswords: Linear Regression

English words defined with "linear regression": regression coefficient, regression curve, regression line. (references)
Specialty definitions using "linear regression": Linear Models, multicollinearity. (references)


Commercial Usage: Linear Regression

Domain    Title

Books

  • Adaptive Linear Regression (reference)

  • Applied Linear Regression (Wiley Series in Probability and Mathematical Statistics) (reference)

  • Linear Regression Analysis (Wiley Series in Probability and Mathematical Statistics) (reference)

  • Sensitivity Analysis in Linear Regression (reference)


Source: compiled by the editor from various references; see credits.


Frequency of Internet Keywords: Linear Regression

The following statistics estimate the number of searches per day across the major English-language search engines as identified by various trade publications. Hyperlinks lead to commercial use of the expression at Amazon.com.
 
Expression                          Frequency per Day

linear regression                   162
multiple linear regression          37
simple linear regression            21
non linear regression               11
linear regression example           7
linear regression model             5
example linear regression simple    4
channel linear regression           4
linear regression excel             4
barbie bungee linear regression     4
linear regression line              3
linear regression formula           3
correlation linear regression       3
equation linear regression          2
error linear regression standard    2
Source: compiled by the editor from various references; see credits.


Modern Translations: Linear Regression

Language Translations for "linear regression"; alternative meanings/domain in parentheses.

Danish: lineaer regression, lineær regression. (various references)

Dutch: lineaire regressie. (various references)

Finnish: lineaarinen regressio. (various references)

French: régression linéaire. (various references)

German: lineare Regression, lineare Progression. (various references)

Greek: γραμμική παλινδρόμηση. (various references)

Italian: regressione lineare. (various references)

Pig Latin: inearlay egressionray

Portuguese: regressão linear. (various references)

Spanish: regresión lineal. (various references)

Swedish: linjär regression. (various references)

Source: compiled by the editor from various translation references.


Misspellings: Linear Regression

Misspellings

"Linear Regression" is suggested in spellcheckers for the following: linear reggression. (additional references)

Source: compiled by the editor, based on several corpora (additional references).


Anagrams: Linear Regression

Scrabble® Enable2K-Verified Anagrams

Words within the letters "a-e-e-e-g-i-i-l-n-n-o-r-r-r-s-s"

-4 letters: legionnaires.

-5 letters: generalises, legionaries, legionnaire, rereleasing, reseasoning.

Source: compiled by the editor from various references; see credits.

SCRABBLE® is a registered trademark. All intellectual property rights in and to the game are owned in the U.S.A and Canada by Hasbro Inc., and throughout the rest of the world by J.W. Spear & Sons Limited of Maidenhead, Berkshire, England, a subsidiary of Mattel Inc. Mattel and Spear are not affiliated with Hasbro.


Alternative Orthography: Linear Regression


Hexadecimal (or equivalents, 770AD-1900s) (references)

4C 69 6E 65 61 72      52 65 67 72 65 73 73 69 6F 6E

Binary Code (1918-1938, probably earlier) (references)

01001100 01101001 01101110 01100101 01100001 01110010 00100000 01010010 01100101 01100111 01110010 01100101 01110011 01110011 01101001 01101111 01101110

HTML Code (1990) (references)

&#76 &#105 &#110 &#101 &#97 &#114 &#32 &#82 &#101 &#103 &#114 &#101 &#115 &#115 &#105 &#111 &#110

ISO 10646 (1991-1993) (references)

004C 0069 006E 0065 0061 0072      0052 0065 0067 0072 0065 0073 0073 0069 006F 006E

Encryption (beginner's substitution cypher): (references)

467580716784252717384718585758180
