Contingency Tables

Author

Dr. Cohen

Contingency Tables

Definition: A contingency table is an array of non-negative integers, arranged in matrix form, where the entries represent counts (frequencies).

	Col 1	Col 2	Totals
row 1	\(O_{11}\)	\(O_{12}\)	\(r_{1}\)
row 2	\(O_{21}\)	\(O_{22}\)	\(r_{2}\)
Totals	\(c_{1}\)	\(c_{2}\)	N

2 x 2 contingency table

Chi-squared Test for differences in Probabilities

Data

	Class 1	Class 2	Totals
Population 1	\(O_{11}\) (\(p_1\))	\(O_{12}\)	\(n_1\)
Population 2	\(O_{21}\) (\(p_2\))	\(O_{22}\)	\(n_2\)
Totals	\(c_1\)	\(c_2\)	N

2 x 2 contingency table

Assumption

Each sample is a random sample.
The 2 samples are independent.
Each observation can be classified into either class 1 or class 2.

Test Statistic

\[ T = \frac{\sqrt{N} (O_{11}O_{22} - O_{12}O_{21})}{\sqrt{n_1 n_2 c_1 c_2}} \]
Null distribution: \(T \sim N(0,1)\)

Hypothesis: Two-tailed test
\[ H_0: p_1 = p_2 \]
\[ H_1: p_1 \neq p_2 \]
- \(p_1\): the probability that a randomly selected observation from population 1 will be in class 1.
- \(p_2\): the probability that a randomly selected observation from population 2 will be in class 1.

P-value\(= 2\times \min\{ P( T \leq T_{Obs}), P(T \geq T_{Obs}) \}\)
Decision: If p_value < \(\alpha\) then REJECT \(H_0\)

Hypothesis: Lower-tailed test
\[ H_0: p_1 = p_2 \]
\[ H_1: p_1 < p_2 \]
- \(p_1\): the probability that a randomly selected observation from population 1 will be in class 1.

P-value\(= P( T \leq T_{Obs})\)
Decision: If p_value < \(\alpha\) then REJECT \(H_0\)

Hypothesis: Upper-tailed test
\[ H_0: p_1 = p_2 \]
\[ H_1: p_1 > p_2 \]
- \(p_1\): the probability that a randomly selected observation from population 1 will be in class 1.

P-value\(= P( T \geq T_{Obs})\)
Decision: If p_value < \(\alpha\) then REJECT \(H_0\)
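
The statistic and the three p-values described in this section can be sketched in R as follows. This is only an illustration: the helper name `two_by_two_test` and the demo counts are hypothetical, and the code assumes a 2 x 2 matrix with populations in rows and classes in columns.

```r
# Sketch: T statistic and p-values for a 2 x 2 table
# (rows = populations, columns = classes).
# `two_by_two_test` is a hypothetical helper, not a function from any package.
two_by_two_test <- function(O, alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  n  <- rowSums(O)   # n_1, n_2
  cs <- colSums(O)   # c_1, c_2
  N  <- sum(O)
  T_obs <- sqrt(N) * (O[1, 1] * O[2, 2] - O[1, 2] * O[2, 1]) /
    sqrt(n[1] * n[2] * cs[1] * cs[2])
  lower <- pnorm(T_obs)                      # P(T <= T_obs)
  upper <- pnorm(T_obs, lower.tail = FALSE)  # P(T >= T_obs)
  p_value <- switch(alternative,
                    two.sided = 2 * min(lower, upper),
                    less      = lower,
                    greater   = upper)
  list(statistic = unname(T_obs), p.value = unname(p_value))
}

# Illustrative counts only (not taken from the lecture examples)
O_demo   <- rbind(c(10, 20), c(20, 10))
res_demo <- two_by_two_test(O_demo)
res_demo
```

For a two-tailed test, squaring the returned statistic should reproduce the X-squared value from `chisq.test(O_demo, correct = FALSE)`, since the chi-squared statistic without Yates' correction equals \(T^2\).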

Chi-squared Test - Example 1

Items in two carloads are classified as defective or non defective.

Data

	Defective	Non defective	Totals
Carload 1	a = 13	b = 73	\(n_1\) = 86
Carload 2	c = 17	d = 57	\(n_2\) = 74
Totals	\(c_1\) = 30	\(c_2\) = 130	N = 160

2 x 2 contingency table

Question: Test whether there are differences in proportions of defective items between the two carloads.

Define the null and alternative hypotheses
Answer: \[ H_0: p_1 = p_2 \] \[H_1: p_1 \neq p_2 \] This is a two-tailed test.
Find the Test statistic observed and null distribution
Answer: \(T_{obs}=-1.2695\) and \(T\sim N(0,1)\)
Determine critical values (rejection region)
Answer: \(\pm 1.959964\) (reject \(H_0\) if \(|T_{obs}| > 1.959964\) at \(\alpha = 0.05\))
Find P-value
Answer: p-value=0.2042628
Decision
Answer: Since P-value > 0.05 then Fail to Reject \(H_0\).
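
The numbers in the answers above can be reproduced directly from the test statistic formula. A minimal sketch, assuming \(\alpha = 0.05\) and the carload table entered with carloads in rows; the variable names are chosen here for illustration:

```r
# Example 1 by hand: rows = carloads, columns = (defective, non defective)
O  <- rbind(c(13, 73), c(17, 57))
n  <- rowSums(O)   # n_1 = 86, n_2 = 74
cs <- colSums(O)   # c_1 = 30, c_2 = 130
N  <- sum(O)       # 160

# T = sqrt(N) (O11 O22 - O12 O21) / sqrt(n1 n2 c1 c2)
T_obs <- sqrt(N) * (O[1, 1] * O[2, 2] - O[1, 2] * O[2, 1]) / sqrt(prod(n) * prod(cs))

crit    <- qnorm(0.975)  # two-tailed critical value at alpha = 0.05
p_value <- 2 * min(pnorm(T_obs), pnorm(T_obs, lower.tail = FALSE))

c(T_obs = T_obs, critical = crit, p_value = p_value)
```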

Chi-squared Test - Example 1 with R

data = cbind(c(13,17),c(73,57)) # create data
chisq.test(data, # table data
           correct = FALSE # find p-value without Yates' correction
           )


    Pearson's Chi-squared test

data:  data
X-squared = 1.6116, df = 1, p-value = 0.2043

chisq.test(data, # table data
           correct = TRUE # find p-value with Yates' correction
           )


    Pearson's Chi-squared test with Yates' continuity correction

data:  data
X-squared = 1.1372, df = 1, p-value = 0.2863

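Interpretation: Fail to reject \(H_0\). There is insufficient evidence that the proportions of defective items differ between the two carloads (p-value = 0.20); the data are compatible with equal proportions.
Note: \(T_{obs}^2 = (-1.2695)^2 \approx 1.6116\), which matches the X-squared value reported without Yates' correction.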

Chi-squared Test - Example 2

A new toothpaste is tested to compare the preferences of men and women.

Data

	Like	Do not like	Totals
Men	a = 64	b = 36	\(n_1\) = 100
Women	c = 74	d = 26	\(n_2\) = 100
Totals	\(c_1\) = 138	\(c_2\) = 62	N = 200

2 x 2 contingency table

Question: Do men and women differ in their preferences regarding the new toothpaste?

Define the null and alternative hypotheses
Answer: \[ H_0: p_1 = p_2 \] \[H_1: p_1 \neq p_2 \] This is a two-tailed test.
Find the Test statistic observed and null distribution
Answer: \(T_{obs}=-1.53\) and \(T\sim N(0,1)\)
Determine critical values (rejection region)
Answer: \(\pm 1.959964\) (reject \(H_0\) if \(|T_{obs}| > 1.959964\) at \(\alpha = 0.05\))
Find P-value
Answer: p-value=0.1260167
Decision and Interpretation
Answer: Since P-value > 0.05 then Fail to Reject \(H_0\).

There is insufficient evidence to support that men and women differ in their preferences regarding the new toothpaste.
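
Example 2 can be checked in R in the same way as Example 1. The sketch below assumes the table is entered with men and women in rows; the object names are illustrative. The p-value should be close to the 0.126 reported above, with small differences in later decimals because the answer above uses \(T_{obs}\) rounded to -1.53.

```r
# Example 2: rows = (men, women), columns = (like, do not like)
tooth <- rbind(Men = c(64, 36), Women = c(74, 26))

# Without Yates' correction, X-squared equals T^2
out <- chisq.test(tooth, correct = FALSE)

# Recover the sign of T from the difference in sample proportions
p_hat <- tooth[, 1] / rowSums(tooth)
T_obs <- unname(sign(p_hat[1] - p_hat[2]) * sqrt(out$statistic))

T_obs
out$p.value
```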

Fisher’s Exact Test

Data

	col 1	col 2	Totals
row 1	X (\(p_1\))	r-X	r
row 2	c-X (\(p_2\))	N-r-c+X	N-r
Totals	c	N-c	N

2 x 2 contingency table

Assumption

Each observation is classified into exactly one cell.
The row and column totals are fixed.

Test Statistic

T = X = number of obs. in row 1 and col 1.

Under \(H_0\): \[ T \sim hypergeometric(N,r,c) \]
The PMF is:

\[ P(T=x) = \frac{\binom{r}{x}\binom{N-r}{c-x}}{\binom{N}{c}} \] x = 0, 1, 2, …, min(r,c)
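
The PMF above corresponds to base R's `dhyper(x, m, n, k)` with m = r, n = N - r, and k = c. A minimal sketch comparing the two, using the margins from the hiring example later in this section (N = 14, r = 10, c = 4); `fisher_pmf` is a hypothetical helper name:

```r
# Null PMF of T = X written out with binomial coefficients.
# `fisher_pmf` is a hypothetical helper used only for this sketch.
fisher_pmf <- function(x, N, r_tot, c_tot) {
  choose(r_tot, x) * choose(N - r_tot, c_tot - x) / choose(N, c_tot)
}

# Margins from the hiring example: N = 14, r = 10, c = 4
x <- 0:4
pmf_manual <- fisher_pmf(x, N = 14, r_tot = 10, c_tot = 4)
pmf_dhyper <- dhyper(x, m = 10, n = 14 - 10, k = 4)  # base R parameterization

cbind(x, pmf_manual, pmf_dhyper)
```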

Hypothesis: Two-tailed test
\[ H_0: p_1 = p_2 \]
\[ H_1: p_1 \neq p_2 \]
- \(p_1\): the probability that a randomly selected observation from row 1 will be in col 1.
- \(p_2\): the probability that a randomly selected observation from row 2 will be in col 1.

P-value\(= 2\times \min\{ P( T \leq T_{Obs}), P(T \geq T_{Obs}) \}\)
Decision: If p_value < \(\alpha\) then REJECT \(H_0\)

Hypothesis: Lower-tailed test
\[ H_0: p_1 = p_2 \]
\[ H_1: p_1 < p_2 \]
- \(p_1\): the probability that a randomly selected observation from row 1 will be in col 1.

P-value\(= P( T \leq T_{Obs})\)
Decision: If p_value < \(\alpha\) then REJECT \(H_0\)

Hypothesis: Upper-tailed test
\[ H_0: p_1 = p_2 \]
\[ H_1: p_1 > p_2 \]
- \(p_1\): the probability that a randomly selected observation from row 1 will be in col 1.

P-value\(= P( T \geq T_{Obs})\)
Decision: If p_value < \(\alpha\) then REJECT \(H_0\)

Fisher’s Exact Test - Example

14 newly hired business majors:
- 10 males and 4 females
- 2 types of jobs need to be filled: 10 Tellers and 4 Account Rep.

Data

	Account Rep.	Tellers	Totals
Males	X = 1	9	r = 10
Females	3	1	4
Totals	c = 4	10	N = 14

2 x 2 contingency table

Question: Test if females are more likely than males to get the account Rep. job.

Define the null and alternative hypotheses
Answer: \[ H_0: p_1 \geq p_2 \] \[H_1: p_1 < p_2 \] This is a lower-tailed test.
Find the Test statistic observed and null distribution
Answer: \(T_{obs}=X=1\) and \(T\sim hypergeometric(14,10,4)\)
Find P-value
Answer: p-value=0.040959
Decision and Interpretation
Answer: Since P-value < 0.05 then Reject \(H_0\).
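
The lower-tailed p-value above is \(P(T \leq 1)\) under the hypergeometric(14, 10, 4) null distribution. A short sketch with base R's `phyper`, where m = r = 10, n = N - r = 4, and k = c = 4:

```r
# Lower-tailed p-value P(T <= 1) under hypergeometric(N = 14, r = 10, c = 4)
p_value <- phyper(1, m = 10, n = 4, k = 4)
p_value

# Same value from the PMF: P(T = 0) + P(T = 1) = (1 + 40) / choose(14, 4)
p_by_hand <- (choose(10, 0) * choose(4, 4) + choose(10, 1) * choose(4, 3)) / choose(14, 4)
all.equal(p_value, p_by_hand)
```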

Fisher’s Exact Test - Example with R

data = cbind(c(1,3),c(9,1)) # create data
fisher.test(data,alternative = "l")


    Fisher's Exact Test for Count Data

data:  data
p-value = 0.04096
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.000000 0.897734
sample estimates:
odds ratio 
0.05545513

Interpretation: Reject \(H_0\). There is evidence that females are more likely than males to get the Account Rep. job (p-value = 0.041).

Mantel-Haenszel Test 2x2xk

An extension of Fisher’s exact test to several 2x2 tables.

mydata =array(c(10,12,1,1,9,11,0,1,8,7,0,3),
               dim=c(2,2,3),
               dimnames=list(c("Treat.","Control"),c("Success","Failure"),c("Group 1","Group2","Group 3")))
mydata

, , Group 1

        Success Failure
Treat.       10       1
Control      12       1

, , Group2

        Success Failure
Treat.        9       0
Control      11       1

, , Group 3

        Success Failure
Treat.        8       0
Control       7       3

mantelhaen.test(mydata,alternative = "g")


    Mantel-Haenszel chi-squared test with continuity correction

data:  mydata
Mantel-Haenszel X-squared = 1.0114, df = 1, p-value = 0.1573
alternative hypothesis: true common odds ratio is greater than 1
95 percent confidence interval:
 0.7087777       Inf
sample estimates:
common odds ratio 
         4.357143
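
The "common odds ratio" in the output can be reproduced by hand. The sketch below assumes the reported estimate is the usual Mantel-Haenszel estimator, \(\sum_k (a_k d_k / n_k) / \sum_k (b_k c_k / n_k)\), where a, b, c, d are the cells of the k-th 2x2 table and \(n_k\) is its total; the object names are illustrative.

```r
# Same three 2 x 2 tables as above (dimnames omitted for brevity)
mydata <- array(c(10, 12, 1, 1, 9, 11, 0, 1, 8, 7, 0, 3), dim = c(2, 2, 3))

# Mantel-Haenszel estimate of the common odds ratio (assumed formula, see lead-in)
num <- sum(apply(mydata, 3, function(tab) tab[1, 1] * tab[2, 2] / sum(tab)))
den <- sum(apply(mydata, 3, function(tab) tab[1, 2] * tab[2, 1] / sum(tab)))
or_mh <- num / den
or_mh
```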

Chi-squared test rxc Table Difference in Probabilities

M =  rbind("PrivateS"=c(6,14,17,9), "PublicS"=c(30,32,17,3))
M

         [,1] [,2] [,3] [,4]
PrivateS    6   14   17    9
PublicS    30   32   17    3

chisq.test(M)

Warning in chisq.test(M): Chi-squared approximation may be incorrect


    Pearson's Chi-squared test

data:  M
X-squared = 17.286, df = 3, p-value = 0.0006172

P-value < 0.001. The conclusion is that test scores are distributed differently among public and private high school students.
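
In R, `chisq.test` issues the warning shown above when any expected count is below 5. The sketch below inspects the expected counts and, as one possible alternative, computes a Monte Carlo p-value with `simulate.p.value = TRUE`. The seed and the number of replicates B are arbitrary choices made for this illustration.

```r
M <- rbind("PrivateS" = c(6, 14, 17, 9), "PublicS" = c(30, 32, 17, 3))

out <- suppressWarnings(chisq.test(M))
out$expected        # expected counts under H0
min(out$expected)   # smallest expected count (below 5 triggers the warning)

# Monte Carlo p-value, which does not rely on the chi-squared approximation
set.seed(1)         # arbitrary seed, only for reproducibility
sim <- chisq.test(M, simulate.p.value = TRUE, B = 10000)
sim$p.value
```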

Chi-squared test rxc Table Test for Independence

M = rbind("InState"=c(16,14,13,13), "OutState"=c(14,6,10,8))
M

         [,1] [,2] [,3] [,4]
InState    16   14   13   13
OutState   14    6   10    8

chisq.test(M)


    Pearson's Chi-squared test

data:  M
X-squared = 1.5242, df = 3, p-value = 0.6767

The conclusion is that the college in which a student is enrolled is independent of whether high school training was in state or out of state.
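
The X-squared value above can be reproduced from the expected counts under independence, \(E_{ij} = r_i c_j / N\). A minimal sketch with illustrative object names:

```r
M <- rbind("InState" = c(16, 14, 13, 13), "OutState" = c(14, 6, 10, 8))

# Expected counts under independence: E_ij = (row total i) * (column total j) / N
E <- outer(rowSums(M), colSums(M)) / sum(M)

X2      <- sum((M - E)^2 / E)               # Pearson chi-squared statistic
df      <- (nrow(M) - 1) * (ncol(M) - 1)    # (r - 1)(c - 1)
p_value <- pchisq(X2, df, lower.tail = FALSE)

c(X2 = X2, df = df, p_value = p_value)
```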

The Median Test

Test for equal medians.

\[H_0: \text{All C populations have the same median} \]
\[H_1: \text{At least two populations have different medians} \]

Data

C random samples are independent
Arrange the data as follows:
- Find the Grand Median (GM), that is the median of the combined samples.
- Set up a 2 by C contingency table as follows:

	Sample 1	Sample 2	…	Sample C	Totals
\(>\) GM	\(O_{11}\)	\(O_{12}\)	…	\(O_{1C}\)	a
\(\leq\) GM	\(O_{21}\)	\(O_{22}\)	…	\(O_{2C}\)	b
Totals	\(n_{1}\)	\(n_{2}\)	…	\(n_{C}\)	N

Test Statistic

\[ T = \frac{N^2}{ab} \sum_{i=1}^{C} \frac{O^2_{1i}}{n_i} - \frac{Na}{b} \]

Under Null hypothesis: \(T \sim \chi^2_{C-1}\); a chi-square distribution with C-1 degrees of freedom.

P-value \(=P(T \geq T_{obs})\)
Decision: If p_value < \(\alpha\) then REJECT \(H_0\)
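
The steps above (grand median, 2 by C table, test statistic, p-value) can be sketched in base R. The helper `median_test_stat` and the small data set are hypothetical and serve only to illustrate the formula; this is not the implementation used by any package, and it makes no special provision for ties at the grand median.

```r
# Sketch of the median test statistic T (hypothetical helper, base R only)
median_test_stat <- function(values, groups) {
  gm    <- median(values)                      # grand median of the combined samples
  above <- tapply(values > gm, groups, sum)    # O_1i: counts above GM per sample
  n_i   <- tapply(values, groups, length)      # sample sizes n_i
  N <- length(values)
  a <- sum(above)                              # total above GM
  b <- N - a                                   # total at or below GM
  T_obs <- N^2 / (a * b) * sum(above^2 / n_i) - N * a / b
  df <- length(n_i) - 1
  list(grand_median = gm, statistic = T_obs, df = df,
       p.value = pchisq(T_obs, df, lower.tail = FALSE))
}

# Made-up data for illustration: three samples of four observations each
values <- c(1, 2, 3, 4,   5, 6, 7, 8,   3, 4, 9, 10)
groups <- rep(c("A", "B", "C"), each = 4)
res <- median_test_stat(values, groups)
res
```

For these made-up data the grand median is 4.5, the counts above it are 0, 4, and 2, and the formula gives \(T = \frac{144}{36}(0 + 4 + 1) - 12 = 8\) on 2 degrees of freedom.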

The Median Test - Example

Four methods of growing corn are used.
The yield per acre is measured and compared across the 4 methods.

Question: Does the median yield per acre differ across the 4 methods?

Define the null and alternative hypotheses
\[H_0: \text{All methods have the same median yield per acre} \]
\[H_1: \text{At least two of the methods have different median yields per acre} \]
Set up data: See lecture notes
Test statistic:

\[T_{obs} = 17.6\]

Under Null hypothesis: \(T \sim \chi^2_{3}\)

The Median Test - Example R

# install.packages("agricolae")
library(agricolae) # package

data(corn) # data

# The Median Test
median_test_out= Median.test(corn$observation,corn$method)


The Median Test for corn$observation ~ corn$method 

Chi Square = 17.54306   DF = 3   P.Value 0.00054637
Median = 89 

  Median  r Min Max   Q25   Q75
1   91.0  9  83  96 89.00 92.00
2   86.0 10  81  91 83.25 89.75
3   95.0  7  91 101 93.50 98.00
4   80.5  8  77  82 78.75 81.00

Post Hoc Analysis

Groups according to probability of treatment differences and alpha level.

Treatments with the same letter are not significantly different.

  corn$observation groups
3             95.0      a
1             91.0      b
2             86.0      b
4             80.5      c

Multiple Comparison

# Visualization
plot(median_test_out)

Cramer’s Contingency Coefficient

Measures row x column association. Similar to a correlation coefficient between two continuous variables.

The high school state vs College Example

#install.packages("lsr")
library(lsr)
M = rbind("InState"=c(16,14,13,13), "OutState"=c(14,6,10,8))
cramersV(M)

[1] 0.1273375

The high school type vs score Example

#install.packages("lsr")
library(lsr)
M =  rbind("PrivateS"=c(6,14,17,9), "PublicS"=c(30,32,17,3))
cramersV(M)

Warning in stats::chisq.test(...): Chi-squared approximation may be incorrect

[1] 0.3674853
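
Both values can be reproduced without the `lsr` package. The sketch below assumes the standard definition \(V = \sqrt{X^2 / (N \cdot \min(r-1, c-1))}\), where \(X^2\) is the Pearson chi-squared statistic; `cramers_v` is a hypothetical helper name.

```r
# Cramer's V from the Pearson chi-squared statistic (hypothetical helper)
cramers_v <- function(M) {
  X2 <- suppressWarnings(chisq.test(M, correct = FALSE))$statistic
  unname(sqrt(X2 / (sum(M) * (min(dim(M)) - 1))))
}

state  <- rbind("InState"  = c(16, 14, 13, 13), "OutState" = c(14, 6, 10, 8))
school <- rbind("PrivateS" = c(6, 14, 17, 9),   "PublicS"  = c(30, 32, 17, 3))

v_state  <- cramers_v(state)    # high school state vs college
v_school <- cramers_v(school)   # high school type vs score
c(state = v_state, school = v_school)
```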