Linear Regression#

The importance of the sweep operation in statistical computing is not so much that it is an inversion technique, but rather that it is a conceptual tool for understanding the least squares process. Without this conceptual tool, it is extremely difficult to explain concepts such as absorption and what the R notation is testing in terms of the parameters of the model.
--James Goodnight (1979)
[6]:
import sweepystats as sw
import numpy as np

Let's generate some random data. Here we simulate 10 samples, each with 3 covariates.

[2]:
X = np.random.normal(10, 3, size=(10, 3))
beta = np.array([1., 2., 3.])
y = X @ beta + np.random.normal(size=10)  # add i.i.d. standard Gaussian noise

We can form an instance of the LinearRegression class and fit it as follows:

[3]:
ols = sw.LinearRegression(X, y)
ols.fit()
100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 3306.94it/s]

The resulting \(\hat{\beta}\) (estimated effect sizes) can be extracted as

[4]:
beta = ols.coef()
beta
[4]:
array([1.09983632, 2.02886888, 3.2716904 ])

The residual sum of squares is

[5]:
resid = ols.resid()
resid
[5]:
np.float64(2.3953241840840747)
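As a sanity check, the residual sum of squares is just \(\|y - X\hat{\beta}\|^2\), so it can be recomputed directly with numpy. A minimal sketch (the helper name `rss` is ours, not part of either package; it assumes the `X`, `y`, and `ols` defined in the cells above):

```python
import numpy as np

def rss(X, y, beta_hat):
    """Residual sum of squares, ||y - X @ beta_hat||^2."""
    r = y - X @ beta_hat
    return float(r @ r)

# With the X, y, and ols from the cells above, this should match
# ols.resid() up to floating-point error:
# rss(X, y, ols.coef())
```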

Var(\(\hat{\beta}\)), the covariance matrix of the coefficient estimates:

[6]:
cov = ols.cov()
cov
[6]:
array([[ 0.00323325, -0.00144217, -0.00143833],
       [-0.00144217,  0.00344722, -0.00208484],
       [-0.00143833, -0.00208484,  0.00358656]])

Standard deviation of \(\hat{\beta}\):

[7]:
std = ols.coef_std()
std
[7]:
array([0.05686169, 0.05871306, 0.05988785])

\(R^2\) (coefficient of determination):

[8]:
ols.R2()
[8]:
np.float64(0.9969400431529486)
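For intuition, \(R^2\) is commonly defined as \(1 - \mathrm{RSS}/\mathrm{TSS}\), with the total sum of squares taken about the mean of \(y\). A minimal numpy sketch of that definition (we assume this is the convention `sweepystats` follows; for models without an intercept, some packages use the uncentered TSS instead):

```python
import numpy as np

def r_squared(X, y, beta_hat):
    """Coefficient of determination: 1 - RSS / TSS, with TSS centered at mean(y)."""
    rss = np.sum((y - X @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

# With the X, y, and ols from the cells above, this should be close to ols.R2():
# r_squared(X, y, ols.coef())
```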

Comparison with numpy#

For comparison, let's check whether the answer agrees with the least squares solution implemented in the numpy package.

[9]:
# least squares solution via numpy (SVD-based LAPACK solver)
beta, resid, _, _ = np.linalg.lstsq(X, y)
beta
[9]:
array([1.09983632, 2.02886888, 3.2716904 ])
[10]:
resid # residual sum of squares returned by lstsq (a length-1 array)
[10]:
array([2.39532418])

numpy doesn’t have built-in methods to extract Var(\(\hat{\beta}\)) or the standard deviation of \(\hat{\beta}\), but we can compute them manually:

[11]:
# true Var(beta)
n, p = 10, 3
sigma2 = resid[0] / (n - p)
beta_cov = sigma2 * np.linalg.inv(X.T @ X)
beta_cov
[11]:
array([[ 0.00323325, -0.00144217, -0.00143833],
       [-0.00144217,  0.00344722, -0.00208484],
       [-0.00143833, -0.00208484,  0.00358656]])
[12]:
# true std of beta
beta_std = np.sqrt(np.diag(beta_cov))
beta_std
[12]:
array([0.05686169, 0.05871306, 0.05988785])
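The manual steps above can be collected into one small helper. This is only a sketch using numpy (the function name and interface are ours, not part of `sweepystats` or numpy), and it assumes a full-rank design with \(n > p\), since `lstsq` returns an empty residual array otherwise:

```python
import numpy as np

def ols_summary(X, y):
    """OLS via numpy: coefficients, Var(beta_hat), and standard errors.

    Assumes X has full column rank and more rows than columns, so that
    np.linalg.lstsq returns a nonempty residual sum of squares.
    """
    n, p = X.shape
    beta, rss, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = rss[0] / (n - p)               # unbiased noise-variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)   # Var(beta_hat)
    std = np.sqrt(np.diag(cov))             # standard errors of beta_hat
    return beta, cov, std
```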

Rank deficient case#

If the design matrix is rank deficient, the sweep operator still returns one of the infinitely many least squares solutions, corresponding to solving the normal equations with a generalized inverse (which is not, in general, the Moore-Penrose pseudo-inverse).

[37]:
# example from https://blogs.sas.com/content/iml/2018/11/21/generalized-inverses-for-matrices.html
X = np.array(
    [[100.,  50, 20, 10],
     [ 50, 106, 46, 23],
     [ 20,  46, 56, 28],
     [ 10,  23, 28, 14]])
y = np.array([130, 776, 486, 243])

# X is not full rank
np.linalg.matrix_rank(X)
[37]:
np.int64(3)

Fitting a rank-deficient matrix uses exactly the same syntax:

[38]:
ols = sw.LinearRegression(X, y)
ols.fit()
ols.coef() # solution even if X is not full rank
100%|████████████████████████████████████████████| 4/4 [00:00<00:00, 4008.89it/s]
[38]:
array([-3.,  7.,  4.,  0.])

Compare the solution against the one obtained from the Moore-Penrose pseudo-inverse. The two differ because the sweep operator uses a different generalized inverse, but both satisfy the normal equations:

[44]:
np.linalg.pinv(X.T @ X) @ X.T @ y
[44]:
array([-3. ,  7. ,  3.2,  1.6])
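Although the two coefficient vectors differ, both solve the normal equations and therefore produce identical fitted values (here the linear system happens to be consistent, so both fit \(y\) exactly); the pseudo-inverse solution is additionally the one with minimum Euclidean norm. A quick check with numpy, restating the `X` and `y` above so the snippet stands alone:

```python
import numpy as np

X = np.array(
    [[100.,  50, 20, 10],
     [ 50, 106, 46, 23],
     [ 20,  46, 56, 28],
     [ 10,  23, 28, 14]])
y = np.array([130., 776., 486., 243.])

b_sweep = np.array([-3., 7., 4., 0.])        # solution from the sweep operator above
b_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y   # Moore-Penrose solution

# Both are least squares solutions, so the fitted values agree
# (and match y exactly, since this system is consistent)
print(np.allclose(X @ b_sweep, y), np.allclose(X @ b_pinv, y))

# Among all solutions, the pseudo-inverse one has the smallest norm
print(np.linalg.norm(b_pinv) < np.linalg.norm(b_sweep))
```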