ANOVA

[1]:

import sweepystats as sw
import numpy as np
import pandas as pd

1-way ANOVA

Suppose we are given an example data set, and we want to know:

Question: Do samples in different Group have different Outcomes?

[2]:

df = pd.DataFrame({
    'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
    'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
})
df

[2]:

	Outcome	Group
0	3.6	A
1	3.5	A
2	4.2	B
3	2.7	B
4	4.1	A
5	5.2	C
6	3.0	B
7	4.8	C
8	4.0	C

Statistically, we want to test whether the group means (i.e. the means of categories A, B, and C) differ. The null hypothesis is \(\mu_A = \mu_B = \mu_C\). For this, we can conduct a 1-way ANOVA.
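To make the test concrete: the F-statistic compares the variability between group means with the variability within groups. The following is a minimal NumPy sketch of that textbook computation on the data above. It is not how Sweepystats computes the statistic internally, and the variable names are illustrative.

```python
import numpy as np

outcome = np.array([3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0])
group = np.array(["A", "A", "B", "B", "A", "C", "B", "C", "C"])

labels = np.unique(group)          # the k distinct groups
n, k = outcome.size, labels.size
grand_mean = outcome.mean()

# Between-group sum of squares: group size times squared distance
# of each group mean from the grand mean.
ss_between = sum(
    (group == g).sum() * (outcome[group == g].mean() - grand_mean) ** 2
    for g in labels
)

# Within-group sum of squares: squared deviations from each group's own mean.
ss_within = sum(
    ((outcome[group == g] - outcome[group == g].mean()) ** 2).sum()
    for g in labels
)

# F = (between mean square) / (within mean square)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(ss_between, ss_within, f_stat)
```

Under the null hypothesis this statistic follows an F distribution with \(k - 1 = 2\) and \(n - k = 6\) degrees of freedom.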

Sweepystats accepts a patsy-style formula to specify the response and the explanatory variables.

[3]:

formula = "Outcome ~ Group"
one_way = sw.ANOVA(df, formula)
one_way.fit()

100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 6754.11it/s]

The F-statistic and p-value can be extracted as:

[4]:

f_stat, pval = one_way.f_test("Group")
f_stat, pval

[4]:

(np.float64(3.966867469879486), np.float64(0.0798456235718277))

Since \(p \approx 0.080 > 0.05\), we fail to reject the null hypothesis at the \(\alpha = 0.05\) level: there is no statistically significant evidence that any group mean differs from the others.
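As a sanity check on the p-value, note that when the numerator degrees of freedom equal 2 (three groups), the upper tail of the F distribution has the closed form \(P(F > f) = (1 + 2f/d_2)^{-d_2/2}\), where \(d_2\) is the residual degrees of freedom. The sketch below uses that special case only; the helper name is hypothetical, and a general F-test would need a library routine for the F distribution instead.

```python
def f_sf_numerator_df2(f_stat, df_resid):
    """Upper-tail probability P(F > f_stat) for an F(2, df_resid) distribution.

    Valid only when the numerator degrees of freedom equal 2.
    """
    return (1.0 + 2.0 * f_stat / df_resid) ** (-df_resid / 2.0)

f_stat = 3.966867469879486   # F-statistic reported above
alpha = 0.05

pval = f_sf_numerator_df2(f_stat, df_resid=6)
reject_null = pval < alpha
print(pval, reject_null)
```

Here `reject_null` is `False`, matching the decision stated above.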

Check answer is correct

We can check that the answer obtained via the sweep operator is correct using the statsmodels package:

[5]:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the model
model = ols('Outcome ~ Group', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=3)  # Type III ANOVA
anova_table

[5]:

	sum_sq	df	F	PR(>F)
Intercept	41.813333	1.0	113.349398	0.000040
Group	2.926667	2.0	3.966867	0.079846
Residual	2.213333	6.0	NaN	NaN

\(k\)-way ANOVA

Now suppose a second categorical variable, Factor, was also measured, and we want to know:

Question: Do samples in different Group and Factor have different Outcomes?

[6]:

df = pd.DataFrame({
    'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
    'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
    'Factor': pd.Categorical(["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"])
})
df

[6]:

	Outcome	Group	Factor
0	3.6	A	X
1	3.5	A	X
2	4.2	B	Y
3	2.7	B	X
4	4.1	A	Y
5	5.2	C	Y
6	3.0	B	X
7	4.8	C	Y
8	4.0	C	X

We previously saw, using 1-way ANOVA, that Group alone is not significant at the 0.05 level. Let's additionally adjust for Factor and the interaction effect between Group and Factor.

[7]:

formula = "Outcome ~ Group + Factor + Group:Factor"
two_way = sw.ANOVA(df, formula)
two_way.fit()

100%|████████████████████████████████████████████| 6/6 [00:00<00:00, 7861.86it/s]
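To see what the interaction term `Group:Factor` adds to the model, the sketch below builds a design matrix by hand with pandas. It assumes treatment (reference-level) coding, which is patsy's default for categorical variables; whether Sweepystats stores its columns in exactly this order or under these names is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Group": pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
    "Factor": pd.Categorical(["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"]),
})

# Main effects: one indicator column per non-reference level.
g = pd.get_dummies(df["Group"], prefix="Group", drop_first=True, dtype=float)
f = pd.get_dummies(df["Factor"], prefix="Factor", drop_first=True, dtype=float)

X = pd.concat([g, f], axis=1)
X.insert(0, "Intercept", 1.0)

# Interaction: elementwise product of each Group indicator
# with each Factor indicator.
for gc in g.columns:
    for fc in f.columns:
        X[f"{gc}:{fc}"] = g[gc] * f[fc]

rank = np.linalg.matrix_rank(X.to_numpy())
print(X.columns.tolist())
print(X.shape, rank)
```

The resulting matrix has 6 columns and full column rank, which leaves \(9 - 6 = 3\) residual degrees of freedom, in line with the Residual row of the statsmodels table for this model.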

Now, we can test for significance of Group, Factor, and their interaction using an F-test. For example,

[8]:

# test for Group variable
f_stat, pval = two_way.f_test("Group")
f_stat, pval

[8]:

(np.float64(11.561538461537321), np.float64(0.03891754069189004))

[17]:

# test for Factor variable
f_stat, pval = two_way.f_test("Factor")
f_stat, pval

[17]:

(np.float64(4.653846153845692), np.float64(0.11988267006105482))

[9]:

# test for interaction
f_stat, pval = two_way.f_test("Group:Factor")
f_stat, pval

[9]:

(np.float64(2.474358974358741), np.float64(0.2318655632501541))

Conclusion:

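If we test Group by itself (in a 1-way ANOVA), then it is not significant at the \(\alpha = 0.05\) level (\(p \approx 0.080\)).
If we add Factor and the Group:Factor interaction, then Group becomes significant (\(p \approx 0.039\)), while Factor (\(p \approx 0.120\)) and the interaction (\(p \approx 0.232\)) are not.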

NOTE: in each of these tests, internally we are NOT refitting the reduced model - we simply swept out the (one-hot encoded) variable from the full model!
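To illustrate why sweeping avoids refitting, here is a minimal NumPy sketch of a sweep operator, applied to the smaller 1-way model for brevity. It uses one common textbook convention for the sweep; the convention and data layout inside Sweepystats may differ. For simplicity the sketch sweeps variables in one at a time (intercept first, then the Group indicators) rather than sweeping them out of a full model, but the point is the same: the residual sum of squares of each nested model can be read off the bottom-right corner without solving a new regression.

```python
import numpy as np

def sweep(A, k):
    """Return a copy of the square matrix A swept on pivot k."""
    A = A.astype(float).copy()
    d = A[k, k]
    A[k, :] /= d
    for i in range(A.shape[0]):
        if i != k:
            b = A[i, k]
            A[i, :] -= b * A[k, :]
            A[i, k] = -b / d
    A[k, k] = 1.0 / d
    return A

y = np.array([3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0])
group = np.array(["A", "A", "B", "B", "A", "C", "B", "C", "C"])

# Design matrix: intercept plus indicators for B and C (A is the reference).
X = np.column_stack([np.ones(9), group == "B", group == "C"]).astype(float)

# Augmented Gram matrix [[X'X, X'y], [y'X, y'y]].
Z = np.column_stack([X, y])
M = Z.T @ Z

# After sweeping a set of columns, the bottom-right entry is the
# residual sum of squares of the model containing those columns.
reduced = sweep(M, 0)                   # intercept only
rss_reduced = reduced[-1, -1]
full = sweep(sweep(reduced, 1), 2)      # intercept + Group
rss_full = full[-1, -1]

n, p, q = 9, 3, 2                       # observations, columns, tested columns
f_stat = ((rss_reduced - rss_full) / q) / (rss_full / (n - p))
print(rss_reduced, rss_full, f_stat)
```

The resulting F-statistic agrees with the 1-way value reported earlier (about 3.9669).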

Check answer is correct

Again, we can check that the answer is correct using the statsmodels package:

[10]:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the model
model = ols('Outcome ~ Group + Factor + Group:Factor', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=3)  # Type III ANOVA (note: use type 2 if no interaction term)
print(anova_table)

                 sum_sq   df           F    PR(>F)
Intercept     25.205000  1.0  581.653846  0.000156
Group          1.002000  2.0   11.561538  0.038918
Factor         0.201667  1.0    4.653846  0.119883
Group:Factor   0.214444  2.0    2.474359  0.231866
Residual       0.130000  3.0         NaN       NaN

ANCOVA - analysis of covariance

Now suppose we also measured a continuous covariate, Environment, and we want to adjust for it when testing Group and Factor. Our question becomes:

Question: Do samples in different Group and Factor have different Outcomes, after adjusting for Environment?

[11]:

df = pd.DataFrame({
    'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
    'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]),
    'Factor': pd.Categorical(["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"]),
    'Environment': [-1.2, 0.3, 3.3, 0.0, -2.7, -1.1, -0.1, 0.1, 1.0]
})
df

[11]:

	Outcome	Group	Factor	Environment
0	3.6	A	X	-1.2
1	3.5	A	X	0.3
2	4.2	B	Y	3.3
3	2.7	B	X	0.0
4	4.1	A	Y	-2.7
5	5.2	C	Y	-1.1
6	3.0	B	X	-0.1
7	4.8	C	Y	0.1
8	4.0	C	X	1.0

[12]:

formula = "Outcome ~ Group + Factor + Environment"
anvoca = sw.ANOVA(df, formula)
anvoca.fit()

100%|████████████████████████████████████████████| 5/5 [00:00<00:00, 9541.18it/s]

[13]:

f_stat, pval = anvoca.f_test("Group")
f_stat, pval

[13]:

(np.float64(12.498763543030575), np.float64(0.019028215317113274))

[14]:

f_stat, pval = anvoca.f_test("Factor")
f_stat, pval

[14]:

(np.float64(29.73336179288068), np.float64(0.00549619765483064))

Of course, we can also check for the importance of Environment:

[15]:

f_stat, pval = anvoca.f_test("Environment")
f_stat, pval

[15]:

(np.float64(1.3761221387331437), np.float64(0.30585068386326636))

Conclusion: both Group and Factor are significant at the \(\alpha = 0.05\) level after adjusting for Environment, while Environment itself is not (\(p \approx 0.306\)).
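Each of these F-tests can be read as a comparison between the full model and a reduced model with one term removed. The sketch below reproduces that comparison with ordinary least squares in NumPy, assuming treatment coding with A and X as reference levels. It refits each reduced model explicitly, which is exactly the work the sweep-based approach avoids; the helper names `rss` and `f_drop` are hypothetical.

```python
import numpy as np

y = np.array([3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0])
group = np.array(["A", "A", "B", "B", "A", "C", "B", "C", "C"])
factor = np.array(["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"])
env = np.array([-1.2, 0.3, 3.3, 0.0, -2.7, -1.1, -0.1, 0.1, 1.0])

# Columns of the full design matrix, keyed by name.
cols = {
    "Intercept": np.ones(9),
    "Group_B": (group == "B").astype(float),
    "Group_C": (group == "C").astype(float),
    "Factor_Y": (factor == "Y").astype(float),
    "Environment": env,
}

def rss(names):
    """Residual sum of squares of the least-squares fit on the named columns."""
    X = np.column_stack([cols[name] for name in names])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

full = list(cols)
rss_full = rss(full)
df_resid = y.size - len(full)

def f_drop(term_cols):
    """F-statistic for removing the given columns from the full model."""
    reduced = [name for name in full if name not in term_cols]
    extra_ss = rss(reduced) - rss_full
    return (extra_ss / len(term_cols)) / (rss_full / df_resid)

f_group = f_drop(["Group_B", "Group_C"])
f_factor = f_drop(["Factor_Y"])
f_env = f_drop(["Environment"])
print(f_group, f_factor, f_env)
```

Because this model has no interaction term, dropping one term at a time corresponds to a Type II analysis.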

Check answer

[16]:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the model
model = ols('Outcome ~ Group + Factor + Environment', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

               sum_sq   df          F    PR(>F)
Group        1.601574  2.0  12.498764  0.019028
Factor       1.904996  1.0  29.733362  0.005496
Environment  0.088167  1.0   1.376122  0.305851
Residual     0.256277  4.0        NaN       NaN