ANOVA#

[1]:
import sweepystats as sw
import numpy as np
import pandas as pd

1-way ANOVA#

Suppose we are given an example data set, and we want to know:

Question: Do samples in different Group have different Outcomes?
[2]:
df = pd.DataFrame({
    'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
    'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
})
df
[2]:
Outcome Group
0 3.6 A
1 3.5 A
2 4.2 B
3 2.7 B
4 4.1 A
5 5.2 C
6 3.0 B
7 4.8 C
8 4.0 C

Statistically, we want to test whether the mean of each group (i.e. categories A vs B vs C) is different. The null hypothesis is \(\mu_A = \mu_B = \mu_C\) . For this, we can conduct a 1-way ANOVA.

Sweepystats accepts a patsy-style formula to specify the response (left of ~) and the explanatory variables (right of ~).
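To make concrete what such a formula encodes, here is a minimal numpy sketch of the design matrix for Outcome ~ Group. It assumes treatment (dummy) coding with A as the reference level, which is patsy's default for categorical variables. The matrix is built by hand for illustration only; it is not how sweepystats constructs it internally.

```python
import numpy as np

outcome = np.array([3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0])
group = np.array(["A", "A", "B", "B", "A", "C", "B", "C", "C"])

# Columns: intercept, indicator for Group == B, indicator for Group == C.
# Level A is the reference level and gets no column of its own.
X = np.column_stack([
    np.ones(len(outcome)),
    (group == "B").astype(float),
    (group == "C").astype(float),
])

# Least-squares fit: beta[0] is the mean of group A,
# beta[1] and beta[2] are the offsets of B and C relative to A.
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(beta)
```

Under this coding, the null hypothesis \(\mu_A = \mu_B = \mu_C\) is the same as saying that both offset coefficients are zero.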

[3]:
formula = "Outcome ~ Group"
one_way = sw.ANOVA(df, formula)
one_way.fit()
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 6754.11it/s]

The F-statistic and p-value can be extracted as:

[4]:
f_stat, pval = one_way.f_test("Group")
f_stat, pval
[4]:
(np.float64(3.966867469879486), np.float64(0.0798456235718277))

At the \(\alpha = 0.05\) level, the p-value (about 0.08) exceeds 0.05, so we fail to reject the null hypothesis: there is no statistically significant evidence that the group means differ.
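As a sanity check, the F-statistic above can be reproduced by hand from the between-group and within-group sums of squares. The sketch below uses only the Python standard library. The p-value line relies on a closed form for the F survival function that holds only when the numerator degrees of freedom equal 2 (as here, with 3 groups); for the general case one would use a library routine such as scipy.stats.f.sf.

```python
from collections import defaultdict

outcome = [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0]
group = ["A", "A", "B", "B", "A", "C", "B", "C", "C"]

# Collect the observations of each group.
by_group = defaultdict(list)
for y, g in zip(outcome, group):
    by_group[g].append(y)

n, k = len(outcome), len(by_group)
grand_mean = sum(outcome) / n

# Between-group sum of squares: group means around the grand mean.
ss_between = sum(
    len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2 for ys in by_group.values()
)
# Within-group sum of squares: observations around their own group mean.
ss_within = sum(
    (y - sum(ys) / len(ys)) ** 2 for ys in by_group.values() for y in ys
)

df_between, df_within = k - 1, n - k
f_stat = (ss_between / df_between) / (ss_within / df_within)

# Closed form for P(F > f), valid ONLY because df_between == 2.
p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)
print(f_stat, p_value)
```

The two sums of squares computed here correspond to the Group and Residual rows of a standard ANOVA table.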

Check answer is correct#

We can check that the answer obtained via the sweep operator is correct by comparing it against the statsmodels package:

[5]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the model
model = ols('Outcome ~ Group', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=3)  # Type III ANOVA
anova_table
[5]:
sum_sq df F PR(>F)
Intercept 41.813333 1.0 113.349398 0.000040
Group 2.926667 2.0 3.966867 0.079846
Residual 2.213333 6.0 NaN NaN

\(k\)-way ANOVA#

Now suppose we have another covariate Factor that was measured, and we want to know:

Question: Do samples in different Group and Factor have different Outcomes?
[6]:
df = pd.DataFrame({
    'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
    'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
    'Factor': pd.Categorical(["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"])
})
df
[6]:
Outcome Group Factor
0 3.6 A X
1 3.5 A X
2 4.2 B Y
3 2.7 B X
4 4.1 A Y
5 5.2 C Y
6 3.0 B X
7 4.8 C Y
8 4.0 C X

We previously saw, using 1-way ANOVA, that Group alone is not significant. Let's additionally adjust for Factor and the interaction effect between Group and Factor.

[7]:
formula = "Outcome ~ Group + Factor + Group:Factor"
two_way = sw.ANOVA(df, formula)
two_way.fit()
100%|████████████████████████████████████████████| 6/6 [00:00<00:00, 7861.86it/s]

Now, we can test for significance of Group, Factor, and their interaction using an F-test. For example,

[8]:
# test for Group variable
f_stat, pval = two_way.f_test("Group")
f_stat, pval
[8]:
(np.float64(11.561538461537321), np.float64(0.03891754069189004))
[17]:
# test for Factor variable
f_stat, pval = two_way.f_test("Factor")
f_stat, pval
[17]:
(np.float64(4.653846153845692), np.float64(0.11988267006105482))
[9]:
# test for interaction
f_stat, pval = two_way.f_test("Group:Factor")
f_stat, pval
[9]:
(np.float64(2.474358974358741), np.float64(0.2318655632501541))

Conclusion:

  • If we test Group by itself (in a 1-way ANOVA), then it is not significant.

  • If we add Factor and the Group:Factor interaction, then Group becomes significant (p ≈ 0.039), while Factor (p ≈ 0.12) and the interaction (p ≈ 0.23) are not.

NOTE: in each of these tests, internally we are NOT refitting the reduced model - we simply swept out the (one-hot encoded) variable from the full model!
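The idea in the note can be illustrated with a hand-written sweep operator in numpy. This is a minimal sketch under stated assumptions, not sweepystats's internal code: the `sweep` helper, the hand-built treatment-coded design matrix, and the variable names are all hypothetical. Sweeping every column of the augmented Gram matrix leaves the full-model residual sum of squares in the bottom-right corner; inverse-sweeping the two Group columns then yields the residual sum of squares of the model without them, with no refit.

```python
import numpy as np

y = np.array([3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0])
group = np.array(["A", "A", "B", "B", "A", "C", "B", "C", "C"])
factor = np.array(["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"])

# Treatment-coded design matrix (reference levels: Group A, Factor X).
g_b = (group == "B").astype(float)
g_c = (group == "C").astype(float)
f_y = (factor == "Y").astype(float)
X = np.column_stack([np.ones_like(y), g_b, g_c, f_y, g_b * f_y, g_c * f_y])
n, p = X.shape


def sweep(A, k, inverse=False):
    """Return a copy of symmetric matrix A swept (or inverse-swept) on index k."""
    A = A.copy()
    d = A[k, k]
    row, col = A[k, :].copy(), A[:, k].copy()
    A -= np.outer(col, row) / d          # update all entries off row/column k
    sign = -1.0 if inverse else 1.0
    A[k, :] = sign * row / d
    A[:, k] = sign * col / d
    A[k, k] = -1.0 / d
    return A


# Augmented Gram matrix [[X'X, X'y], [y'X, y'y]].
Z = np.column_stack([X, y])
A = Z.T @ Z

# Sweep in all p predictors: bottom-right entry is the full-model RSS.
for k in range(p):
    A = sweep(A, k)
rss_full = A[-1, -1]

# Inverse-sweep the two Group dummies (columns 1 and 2) out of the full model.
B = A
for k in (1, 2):
    B = sweep(B, k, inverse=True)
rss_reduced = B[-1, -1]

q = 2  # number of columns removed
f_group = ((rss_reduced - rss_full) / q) / (rss_full / (n - p))
print(rss_full, rss_reduced, f_group)
```

The resulting F-statistic is the one reported for Group above; only two cheap matrix updates were needed after the initial fit.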

Check answer is correct#

Again, we can check that the answer is correct using the statsmodels package:

[10]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the model
model = ols('Outcome ~ Group + Factor + Group:Factor', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=3)  # Type III ANOVA (note: use type 2 if no interaction term)
print(anova_table)
                 sum_sq   df           F    PR(>F)
Intercept     25.205000  1.0  581.653846  0.000156
Group          1.002000  2.0   11.561538  0.038918
Factor         0.201667  1.0    4.653846  0.119883
Group:Factor   0.214444  2.0    2.474359  0.231866
Residual       0.130000  3.0         NaN       NaN

ANCOVA - analysis of co-variance#

Now suppose we also measured a continuous covariate Environment, and we want to adjust for its effect on Outcome when testing Group and Factor. Our question becomes:

Question: Do samples in different Group and Factor have different Outcomes, after adjusting for Environment?
[11]:
df = pd.DataFrame({
    'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
    'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
    'Factor': pd.Categorical(["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"]),
    'Environment': [-1.2, 0.3, 3.3, 0.0, -2.7, -1.1, -0.1, 0.1, 1.0]
})
df
[11]:
Outcome Group Factor Environment
0 3.6 A X -1.2
1 3.5 A X 0.3
2 4.2 B Y 3.3
3 2.7 B X 0.0
4 4.1 A Y -2.7
5 5.2 C Y -1.1
6 3.0 B X -0.1
7 4.8 C Y 0.1
8 4.0 C X 1.0
[12]:
formula = "Outcome ~ Group + Factor + Environment"
ancova = sw.ANOVA(df, formula)
ancova.fit()
100%|████████████████████████████████████████████| 5/5 [00:00<00:00, 9541.18it/s]
[13]:
f_stat, pval = ancova.f_test("Group")
f_stat, pval
[13]:
(np.float64(12.498763543030575), np.float64(0.019028215317113274))
[14]:
f_stat, pval = ancova.f_test("Factor")
f_stat, pval
[14]:
(np.float64(29.73336179288068), np.float64(0.00549619765483064))

Of course, we can also check for the importance of Environment:

[15]:
f_stat, pval = ancova.f_test("Environment")
f_stat, pval
[15]:
(np.float64(1.3761221387331437), np.float64(0.30585068386326636))

Conclusion: at the \(\alpha = 0.05\) level, both Group and Factor are significant after adjusting for Environment, while Environment itself is not (p ≈ 0.31).
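For comparison, the same F-statistics can be obtained the slow way, by explicitly refitting a reduced model for each term and comparing residual sums of squares. The sketch below does this with plain numpy least squares. The helper names `rss` and `partial_f` and the hand-built treatment-coded design matrix are assumptions made for illustration; this is exactly the refitting that the sweep-based approach avoids.

```python
import numpy as np

y = np.array([3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0])
group = np.array(["A", "A", "B", "B", "A", "C", "B", "C", "C"])
factor = np.array(["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"])
env = np.array([-1.2, 0.3, 3.3, 0.0, -2.7, -1.1, -0.1, 0.1, 1.0])

# Columns: 0 intercept, 1-2 Group dummies (B, C), 3 Factor dummy (Y), 4 Environment.
X = np.column_stack([
    np.ones_like(y),
    (group == "B").astype(float),
    (group == "C").astype(float),
    (factor == "Y").astype(float),
    env,
])


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def partial_f(X, y, drop):
    """F-statistic for removing the columns in `drop` from the full model."""
    keep = [j for j in range(X.shape[1]) if j not in drop]
    rss_full = rss(X, y)
    rss_reduced = rss(X[:, keep], y)
    df_resid = X.shape[0] - X.shape[1]
    return ((rss_reduced - rss_full) / len(drop)) / (rss_full / df_resid)


f_group = partial_f(X, y, drop=[1, 2])
f_factor = partial_f(X, y, drop=[3])
f_env = partial_f(X, y, drop=[4])
print(f_group, f_factor, f_env)
```

Because the model has no interaction term, dropping one term at a time from the full model corresponds to a Type II test of that term.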

Check answer#

[16]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the model
model = ols('Outcome ~ Group + Factor + Environment', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
               sum_sq   df          F    PR(>F)
Group        1.601574  2.0  12.498764  0.019028
Factor       1.904996  1.0  29.733362  0.005496
Environment  0.088167  1.0   1.376122  0.305851
Residual     0.256277  4.0        NaN       NaN