Linear Regression#
[6]:
import sweepystats as sw
import numpy as np
Let's generate some random data. Here we simulate 10 samples, each with 3 covariates.
[2]:
X = np.random.normal(10, 3, size=(10, 3))
beta = np.array([1., 2., 3.])
y = X @ beta + np.random.normal(5)  # adds a single N(5, 1) draw as a constant offset
We can form an instance of the LinearRegression class and fit it as follows:
[3]:
ols = sw.LinearRegression(X, y)
ols.fit()
100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 3306.94it/s]
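Under the hood, the fit is computed with the sweep operator: sweeping the first \(p\) diagonal entries of the augmented matrix \(\begin{bmatrix} X^\top X & X^\top y \\ y^\top X & y^\top y \end{bmatrix}\) leaves \(\hat{\beta}\) in the top-right block and the sum of squared residuals in the bottom-right corner. A minimal numpy sketch of this idea (not sweepystats' actual implementation), on its own seeded data:

```python
import numpy as np

def sweep(A, k):
    """Sweep the symmetric matrix A on pivot k (Goodnight's convention)."""
    B = A.copy()
    d = A[k, k]
    B[k, :] = A[k, :] / d                 # scale the pivot row
    B[:, k] = A[:, k] / d                 # scale the pivot column
    others = [i for i in range(A.shape[0]) if i != k]
    for i in others:
        for j in others:
            B[i, j] = A[i, j] - A[i, k] * A[k, j] / d
    B[k, k] = -1.0 / d                    # pivot entry becomes -1/d
    return B

rng = np.random.default_rng(0)
X = rng.normal(10, 3, size=(10, 3))
y = X @ np.array([1., 2., 3.]) + rng.normal(0, 1, size=10)

# augmented matrix [[X'X, X'y], [y'X, y'y]]
M = np.block([[X.T @ X, (X.T @ y)[:, None]],
              [(X.T @ y)[None, :], np.array([[y @ y]])]])
for k in range(3):                        # sweep the first p = 3 pivots
    M = sweep(M, k)

beta_sweep = M[:3, 3]                     # top-right block: beta-hat
ssr_sweep = M[3, 3]                       # bottom-right: sum of squared residuals
```

After the three sweeps, `beta_sweep` matches `np.linalg.lstsq(X, y, rcond=None)[0]` and `ssr_sweep` matches the squared residual norm.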
The estimated coefficients \(\hat{\beta}\) (effect sizes) can be extracted as
[4]:
beta = ols.coef()
beta
[4]:
array([1.09983632, 2.02886888, 3.2716904 ])
The sum of squared residuals is
[5]:
resid = ols.resid()
resid
[5]:
np.float64(2.3953241840840747)
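As a sanity check on what `resid()` returns, the same quantity — the sum of squared residuals \(\sum_i (y_i - x_i^\top \hat{\beta})^2\) — can be computed with plain numpy. The data here are regenerated with a fixed seed, so the numbers differ from the cells above:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(10, 3, size=(10, 3))
y = X @ np.array([1., 2., 3.]) + rng.normal(0, 1, size=10)

# lstsq reports the squared residual norm; compare with the manual sum
beta_hat, ssr, _, _ = np.linalg.lstsq(X, y, rcond=None)
ssr_manual = np.sum((y - X @ beta_hat) ** 2)
```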
The covariance matrix Var(\(\hat{\beta}\)):
[6]:
cov = ols.cov()
cov
[6]:
array([[ 0.00323325, -0.00144217, -0.00143833],
[-0.00144217, 0.00344722, -0.00208484],
[-0.00143833, -0.00208484, 0.00358656]])
Standard deviation of \(\hat{\beta}\):
[7]:
std = ols.coef_std()
std
[7]:
array([0.05686169, 0.05871306, 0.05988785])
\(R^2\) (coefficient of determination):
[8]:
ols.R2()
[8]:
np.float64(0.9969400431529486)
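The coefficient of determination is \(R^2 = 1 - \mathrm{SSR}/\mathrm{SST}\). A hedged numpy sketch of that formula (assuming sweepystats uses the usual centered total sum of squares; data regenerated with a fixed seed, so the value differs from the cell above):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(10, 3, size=(10, 3))
y = X @ np.array([1., 2., 3.]) + rng.normal(0, 1, size=10)

beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
ssr = np.sum((y - X @ beta_hat) ** 2)    # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)        # centered total sum of squares
r2 = 1 - ssr / sst                       # close to 1 since the noise is small
```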
Comparison with numpy#
For comparison, let's check whether the answer agrees with the least squares solution implemented in the numpy package.
[9]:
# least squares solution by QR
beta, resid, _, _ = np.linalg.lstsq(X, y)
beta
[9]:
array([1.09983632, 2.02886888, 3.2716904 ])
[10]:
resid # sum of squared residuals
[10]:
array([2.39532418])
numpy doesn’t have built-in methods for Var(\(\hat{\beta}\)) or the standard deviation of \(\hat{\beta}\), but we can compute them manually:
[11]:
# true Var(beta)
n, p = 10, 3
sigma2 = resid[0] / (n - p)
beta_cov = sigma2 * np.linalg.inv(X.T @ X)
beta_cov
[11]:
array([[ 0.00323325, -0.00144217, -0.00143833],
[-0.00144217, 0.00344722, -0.00208484],
[-0.00143833, -0.00208484, 0.00358656]])
[12]:
# true std of beta
beta_std = np.sqrt(np.diag(beta_cov))
beta_std
[12]:
array([0.05686169, 0.05871306, 0.05988785])
Rank deficient case#
If the design matrix is rank deficient, the sweep operator can still return a solution (one of the infinitely many), corresponding to solving the normal equations with a generalized inverse. Note that this need not coincide with the minimum-norm solution given by the Moore-Penrose pseudo-inverse.
[37]:
# example from https://blogs.sas.com/content/iml/2018/11/21/generalized-inverses-for-matrices.html
X = np.array(
[[100., 50, 20, 10],
[50, 106, 46, 23],
[20, 46, 56, 28],
[10, 23, 28, 14]])
y = np.array([130, 776, 486, 243])
# X is not full rank (its 4th row and column are half the 3rd)
np.linalg.matrix_rank(X)
[37]:
np.int64(3)
Fitting with a rank-deficient design matrix uses the exact same syntax:
[38]:
ols = sw.LinearRegression(X, y)
ols.fit()
ols.coef() # solution even if X is not full rank
100%|████████████████████████████████████████████| 4/4 [00:00<00:00, 4008.89it/s]
[38]:
array([-3., 7., 4., 0.])
Compare the solution against the minimum-norm one obtained from the Moore-Penrose pseudo-inverse (the two differ, but both solve the normal equations):
[44]:
np.linalg.pinv(X.T @ X) @ X.T @ y
[44]:
array([-3. , 7. , 3.2, 1.6])
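Both vectors are exact solutions of this (consistent) system: they differ because the sweep operator picks one generalized-inverse solution, while `pinv` picks the minimum-norm one. A quick check:

```python
import numpy as np

X = np.array(
    [[100., 50, 20, 10],
     [50, 106, 46, 23],
     [20, 46, 56, 28],
     [10, 23, 28, 14]])
y = np.array([130., 776, 486, 243])

b_sweep = np.array([-3., 7., 4., 0.])        # sweep-based solution from above
b_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y   # Moore-Penrose solution

print(np.allclose(X @ b_sweep, y))           # True: exact solution
print(np.allclose(X @ b_pinv, y))            # True: also an exact solution
print(np.linalg.norm(b_pinv) <= np.linalg.norm(b_sweep))  # True: minimum norm
```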