Linear Regression#

The importance of the sweep operation in statistical computing is not so much that it is an inversion technique, but rather that it is a conceptual tool for understanding the least squares process. Without this conceptual tool, it is extremely difficult to explain concepts such as absorption and what the R notation is testing in terms of the parameters of the model.
--James Goodnight (1979)
[6]:
import sweepystats as sw
import numpy as np

Let's generate some random data. Here we simulate 10 samples, each with 3 covariates.

[2]:
X = np.random.normal(10, 3, size=(10, 3))
beta = np.array([1., 2., 3.])
y = X @ beta + np.random.normal(size=10)  # add i.i.d. standard Gaussian noise

We can form an instance of the LinearRegression class and fit it as follows:

[3]:
ols = sw.LinearRegression(X, y)
ols.fit()
100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 3306.94it/s]

The resulting \(\hat{\beta}\) (estimated effect sizes) can be extracted as

[4]:
beta = ols.coef()
beta
[4]:
array([1.09983632, 2.02886888, 3.2716904 ])

The residual sum of squares is

[5]:
resid = ols.resid()
resid
[5]:
np.float64(2.3953241840840747)
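As a sanity check, the residual sum of squares is just \(\|y - X\hat{\beta}\|^2\), so it can be recomputed directly with numpy. A minimal sketch (the helper name `rss` is ours, not part of either package; it assumes the `X`, `y`, and `ols` defined in the cells above):

```python
import numpy as np

def rss(X, y, beta_hat):
    """Residual sum of squares, ||y - X @ beta_hat||^2."""
    r = y - X @ beta_hat
    return float(r @ r)

# With the X, y, and ols from the cells above, this should match
# ols.resid() up to floating-point error:
# rss(X, y, ols.coef())
```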

Var(\(\hat{\beta}\)), the covariance matrix of the coefficient estimates:

[6]:
cov = ols.cov()
cov
[6]:
array([[ 0.00323325, -0.00144217, -0.00143833],
       [-0.00144217,  0.00344722, -0.00208484],
       [-0.00143833, -0.00208484,  0.00358656]])

Standard deviation of \(\hat{\beta}\):

[7]:
std = ols.coef_std()
std
[7]:
array([0.05686169, 0.05871306, 0.05988785])

\(R^2\) (coefficient of determination):

[8]:
ols.R2()
[8]:
np.float64(0.9969400431529486)
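For intuition, \(R^2\) is commonly defined as \(1 - \mathrm{RSS}/\mathrm{TSS}\), with the total sum of squares taken about the mean of \(y\). A minimal numpy sketch of that definition (we assume this is the convention `sweepystats` follows; for models without an intercept, some packages use the uncentered TSS instead):

```python
import numpy as np

def r_squared(X, y, beta_hat):
    """Coefficient of determination: 1 - RSS / TSS, with TSS centered at mean(y)."""
    rss = np.sum((y - X @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

# With the X, y, and ols from the cells above, this should be close to ols.R2():
# r_squared(X, y, ols.coef())
```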

Comparison with numpy#

For comparison, let's check whether the answer agrees with the least squares solution implemented in the numpy package.

[9]:
# least squares solution via numpy (SVD-based LAPACK solver)
beta, resid, _, _ = np.linalg.lstsq(X, y)
beta
[9]:
array([1.09983632, 2.02886888, 3.2716904 ])
[10]:
resid # residual sum of squares returned by lstsq (a length-1 array)
[10]:
array([2.39532418])

numpy doesn’t have built-in methods to extract Var(\(\hat{\beta}\)) or the standard deviation of \(\hat{\beta}\), but we can compute them manually:

[11]:
# true Var(beta)
n, p = 10, 3
sigma2 = resid[0] / (n - p)
beta_cov = sigma2 * np.linalg.inv(X.T @ X)
beta_cov
[11]:
array([[ 0.00323325, -0.00144217, -0.00143833],
       [-0.00144217,  0.00344722, -0.00208484],
       [-0.00143833, -0.00208484,  0.00358656]])
[12]:
# true std of beta
beta_std = np.sqrt(np.diag(beta_cov))
beta_std
[12]:
array([0.05686169, 0.05871306, 0.05988785])
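The manual steps above can be collected into one small helper. This is only a sketch using numpy (the function name and interface are ours, not part of `sweepystats` or numpy), and it assumes a full-rank design with \(n > p\), since `lstsq` returns an empty residual array otherwise:

```python
import numpy as np

def ols_summary(X, y):
    """OLS via numpy: coefficients, Var(beta_hat), and standard errors.

    Assumes X has full column rank and more rows than columns, so that
    np.linalg.lstsq returns a nonempty residual sum of squares.
    """
    n, p = X.shape
    beta, rss, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = rss[0] / (n - p)               # unbiased noise-variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)   # Var(beta_hat)
    std = np.sqrt(np.diag(cov))             # standard errors of beta_hat
    return beta, cov, std
```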

Rank deficient case#

If the design matrix is rank deficient, the sweep operator still returns one of the infinitely many least squares solutions, corresponding to solving the normal equations with a generalized inverse (which is not, in general, the Moore-Penrose pseudo-inverse).

[37]:
# example from https://blogs.sas.com/content/iml/2018/11/21/generalized-inverses-for-matrices.html
X = np.array(
    [[100.,  50, 20, 10],
     [ 50, 106, 46, 23],
     [ 20,  46, 56, 28],
     [ 10,  23, 28, 14]])
y = np.array([130, 776, 486, 243])

# X is not full rank
np.linalg.matrix_rank(X)
[37]:
np.int64(3)

Fitting a rank-deficient matrix uses exactly the same syntax:

[38]:
ols = sw.LinearRegression(X, y)
ols.fit()
ols.coef() # solution even if X is not full rank
100%|████████████████████████████████████████████| 4/4 [00:00<00:00, 4008.89it/s]
[38]:
array([-3.,  7.,  4.,  0.])

Compare the solution against the one obtained from the Moore-Penrose pseudo-inverse. The two differ because the sweep operator uses a different generalized inverse, but both satisfy the normal equations:

[44]:
np.linalg.pinv(X.T @ X) @ X.T @ y
[44]:
array([-3. ,  7. ,  3.2,  1.6])
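Although the two coefficient vectors differ, both solve the normal equations and therefore produce identical fitted values (here the linear system happens to be consistent, so both fit \(y\) exactly); the pseudo-inverse solution is additionally the one with minimum Euclidean norm. A quick check with numpy, restating the `X` and `y` above so the snippet stands alone:

```python
import numpy as np

X = np.array(
    [[100.,  50, 20, 10],
     [ 50, 106, 46, 23],
     [ 20,  46, 56, 28],
     [ 10,  23, 28, 14]])
y = np.array([130., 776., 486., 243.])

b_sweep = np.array([-3., 7., 4., 0.])        # solution from the sweep operator above
b_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y   # Moore-Penrose solution

# Both are least squares solutions, so the fitted values agree
# (and match y exactly, since this system is consistent)
print(np.allclose(X @ b_sweep, y), np.allclose(X @ b_pinv, y))

# Among all solutions, the pseudo-inverse one has the smallest norm
print(np.linalg.norm(b_pinv) < np.linalg.norm(b_sweep))
```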