ANOVA#
[1]:
import sweepystats as sw
import numpy as np
import pandas as pd
1-way ANOVA#
Suppose we are given an example data set, and we want to know:
Does Group have different Outcomes?
[2]:
df = pd.DataFrame({
'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
})
df
[2]:
| | Outcome | Group |
|---|---|---|
| 0 | 3.6 | A |
| 1 | 3.5 | A |
| 2 | 4.2 | B |
| 3 | 2.7 | B |
| 4 | 4.1 | A |
| 5 | 5.2 | C |
| 6 | 3.0 | B |
| 7 | 4.8 | C |
| 8 | 4.0 | C |
Statistically, we want to test whether the mean of each group (i.e. categories A vs B vs C) is different. The null hypothesis is \(\mu_A = \mu_B = \mu_C\) . For this, we can conduct a 1-way ANOVA.
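Before running the test, it can help to eyeball the group means directly (a quick pandas sketch, not part of sweepystats):

```python
import pandas as pd

df = pd.DataFrame({
    'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
    'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
})
# Sample means per group: the quantities the 1-way ANOVA compares
group_means = df.groupby('Group', observed=True)['Outcome'].mean()
print(group_means)  # A ≈ 3.73, B = 3.30, C ≈ 4.67
```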
Sweepystats accepts patsy-style formulas to specify the model.
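To see what such a formula expands to, we can ask patsy for the design matrix directly (an illustrative aside; sweepystats handles this internally):

```python
import pandas as pd
from patsy import dmatrices

df = pd.DataFrame({
    'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
    'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
})
# "Outcome ~ Group" becomes an intercept plus treatment-coded dummies
y, X = dmatrices("Outcome ~ Group", data=df, return_type="dataframe")
print(X.columns.tolist())  # ['Intercept', 'Group[T.B]', 'Group[T.C]']
```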
[3]:
formula = "Outcome ~ Group"
one_way = sw.ANOVA(df, formula)
one_way.fit()
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 6754.11it/s]
The F-statistic and p-value can be extracted as:
[4]:
f_stat, pval = one_way.f_test("Group")
f_stat, pval
[4]:
(np.float64(3.966867469879486), np.float64(0.0798456235718277))
Since the p-value (≈ 0.08) exceeds \(\alpha = 0.05\), we fail to reject the null: there is no statistically significant difference among the group means at the 0.05 level.
Check answer is correct#
We can verify that the answer computed via the sweep operator is correct using the statsmodels package:
[5]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Fit the model
model = ols('Outcome ~ Group', data=df).fit()
# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=3) # Type III ANOVA
anova_table
[5]:
| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| Intercept | 41.813333 | 1.0 | 113.349398 | 0.000040 |
| Group | 2.926667 | 2.0 | 3.966867 | 0.079846 |
| Residual | 2.213333 | 6.0 | NaN | NaN |
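The same F-statistic can be recovered by hand from the classical between/within sum-of-squares decomposition; here is a plain numpy/scipy sketch using the data above:

```python
import numpy as np
from scipy import stats

groups = {
    "A": [3.6, 3.5, 4.1],
    "B": [4.2, 2.7, 3.0],
    "C": [5.2, 4.8, 4.0],
}
all_vals = np.concatenate([np.asarray(v) for v in groups.values()])
grand_mean = all_vals.mean()

# Between-group and within-group sums of squares
ss_between = sum(len(v) * (np.mean(v) - grand_mean) ** 2 for v in groups.values())
ss_within = sum(((np.asarray(v) - np.mean(v)) ** 2).sum() for v in groups.values())

df_between = len(groups) - 1              # 2
df_within = len(all_vals) - len(groups)   # 6
f_stat = (ss_between / df_between) / (ss_within / df_within)
pval = stats.f.sf(f_stat, df_between, df_within)
print(f_stat, pval)  # ≈ 3.9669, 0.0798
```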
\(k\)-way ANOVA#
Now suppose we have another covariate Factor that was measured, and we want to know:
Do Group and Factor have different Outcomes?
[6]:
df = pd.DataFrame({
'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
'Factor': pd.Categorical(["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"])
})
df
[6]:
| | Outcome | Group | Factor |
|---|---|---|---|
| 0 | 3.6 | A | X |
| 1 | 3.5 | A | X |
| 2 | 4.2 | B | Y |
| 3 | 2.7 | B | X |
| 4 | 4.1 | A | Y |
| 5 | 5.2 | C | Y |
| 6 | 3.0 | B | X |
| 7 | 4.8 | C | Y |
| 8 | 4.0 | C | X |
We previously saw, using a 1-way ANOVA, that Group alone is not significant. Let's additionally adjust for Factor and the interaction effect between Group and Factor.
[7]:
formula = "Outcome ~ Group + Factor + Group:Factor"
two_way = sw.ANOVA(df, formula)
two_way.fit()
100%|████████████████████████████████████████████| 6/6 [00:00<00:00, 7861.86it/s]
Now we can test for the significance of Group, Factor, and their interaction using F-tests. For example:
[8]:
# test for Group variable
f_stat, pval = two_way.f_test("Group")
f_stat, pval
[8]:
(np.float64(11.561538461537321), np.float64(0.03891754069189004))
[17]:
# test for Factor variable
f_stat, pval = two_way.f_test("Factor")
f_stat, pval
[17]:
(np.float64(4.653846153845692), np.float64(0.11988267006105482))
[9]:
# test for interaction
f_stat, pval = two_way.f_test("Group:Factor")
f_stat, pval
[9]:
(np.float64(2.474358974358741), np.float64(0.2318655632501541))
Conclusion:
- If we test Group by itself (in a 1-way ANOVA), it is not significant.
- If we add Factor, then Group becomes significant, while Factor is not.
Check answer is correct#
Again, we can verify that the answer is correct using the statsmodels package:
[10]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Fit the model
model = ols('Outcome ~ Group + Factor + Group:Factor', data=df).fit()
# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=3) # Type III ANOVA (note: use type 2 if no interaction term)
print(anova_table)
sum_sq df F PR(>F)
Intercept 25.205000 1.0 581.653846 0.000156
Group 1.002000 2.0 11.561538 0.038918
Factor 0.201667 1.0 4.653846 0.119883
Group:Factor 0.214444 2.0 2.474359 0.231866
Residual 0.130000 3.0 NaN NaN
ANCOVA - analysis of covariance#
Now suppose we also measured a continuous covariate Environment, and we want to adjust for its effect. Our question becomes:
Do Group and Factor have different Outcomes, after adjusting for Environment?
[11]:
df = pd.DataFrame({
'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
'Factor': pd.Categorical(["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"]),
'Environment': [-1.2, 0.3, 3.3, 0.0, -2.7, -1.1, -0.1, 0.1, 1.0]
})
df
[11]:
| | Outcome | Group | Factor | Environment |
|---|---|---|---|---|
| 0 | 3.6 | A | X | -1.2 |
| 1 | 3.5 | A | X | 0.3 |
| 2 | 4.2 | B | Y | 3.3 |
| 3 | 2.7 | B | X | 0.0 |
| 4 | 4.1 | A | Y | -2.7 |
| 5 | 5.2 | C | Y | -1.1 |
| 6 | 3.0 | B | X | -0.1 |
| 7 | 4.8 | C | Y | 0.1 |
| 8 | 4.0 | C | X | 1.0 |
[12]:
formula = "Outcome ~ Group + Factor + Environment"
ancova = sw.ANOVA(df, formula)
ancova.fit()
100%|████████████████████████████████████████████| 5/5 [00:00<00:00, 9541.18it/s]
[13]:
f_stat, pval = ancova.f_test("Group")
f_stat, pval
[13]:
(np.float64(12.498763543030575), np.float64(0.019028215317113274))
[14]:
f_stat, pval = ancova.f_test("Factor")
f_stat, pval
[14]:
(np.float64(29.73336179288068), np.float64(0.00549619765483064))
Of course, we can also check the significance of Environment:
[15]:
f_stat, pval = ancova.f_test("Environment")
f_stat, pval
[15]:
(np.float64(1.3761221387331437), np.float64(0.30585068386326636))
Conclusion: both Group and Factor are significant after adjusting for Environment!
Check answer#
[16]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Fit the model
model = ols('Outcome ~ Group + Factor + Environment', data=df).fit()
# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
sum_sq df F PR(>F)
Group 1.601574 2.0 12.498764 0.019028
Factor 1.904996 1.0 29.733362 0.005496
Environment 0.088167 1.0 1.376122 0.305851
Residual 0.256277 4.0 NaN NaN
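Each of these partial F-tests is equivalent to comparing nested least-squares fits: the full model versus the model with the tested term dropped. A minimal numpy sketch for the Group test in the ANCOVA model (treatment coding with A and X as reference levels is an assumption matching patsy's default):

```python
import numpy as np

y = np.array([3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0])
group = ["A", "A", "B", "B", "A", "C", "B", "C", "C"]
factor = ["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"]
env = np.array([-1.2, 0.3, 3.3, 0.0, -2.7, -1.1, -0.1, 0.1, 1.0])

# Treatment-coded dummies (A and X are the reference levels)
g_b = np.array([g == "B" for g in group], dtype=float)
g_c = np.array([g == "C" for g in group], dtype=float)
f_y = np.array([f == "Y" for f in factor], dtype=float)
ones = np.ones(len(y))

def rss(X):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

X_full = np.column_stack([ones, g_b, g_c, f_y, env])
X_reduced = np.column_stack([ones, f_y, env])  # Group dropped

df_num = 2                          # parameters removed with Group
df_den = len(y) - X_full.shape[1]   # residual df of the full model (4)
f_stat = ((rss(X_reduced) - rss(X_full)) / df_num) / (rss(X_full) / df_den)
print(f_stat)  # ≈ 12.4988, matching the table above
```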