Contingency Tables

Author

Dr. Cohen

Contingency Tables

Definition A contingency table is an array of natural numbers in matrix form where those numbers represent counts / frequencies.

Col 1 Col2 Totals
row 1 \(O_{11}\) \(O_{12}\) \(r_{1}\)
row 2 \(O_{21}\) \(O_{22}\) \(r_{2}\)
Totals \(c_{1}\) \(c_{2}\) N

2 x 2 contingency table

Chi-squared Test for differences in Probabilities

Data

Class 1 Class2 Totals
Population 1 \(O_{11}\) (\(p_1\)) \(O_{12}\) \(n_1\)
Population 2 \(O_{21}\) (\(p_2\)) \(O_{22}\) \(n_2\)
Totals \(c_1\) \(c_2\) N

2 x 2 contingency table

Assumption

  • Each sample is Random Sample
  • The 2 samples are independent
  • Each observation can be classified into class 1 and class 2

Test Statistic

\[ T= \frac{\sqrt{N} (O_{11}O_{22} - O_{12}O_{21})}{\sqrt{n_1n_2c_1c_1}}\] Null distirbution: \(T \sim N(0,1)\)

Hypothesis: Two-tailed test \[ H_0:p_1 = p_2 \] \[H_1: p_1 \neq p_2 \] - \(p_1\) the probability that a randomly selected obs from the population 1 will be in class 1. - \(p_2\) the probability that a randomly selected obs from the population 2 will be in class 1.

  • P-value\(= 2\times \min\{ P( T \leq T_{Obs}), P(T \geq T_{Obs}) \}\)

  • Decision: If p_value < \(\alpha\) then REJECT \(H_0\)

Hypothesis: Lower-tailed test \[ H_0:p_1 = p_2 \] \[H_1: p_1 < p_2 \] - \(p_1\) the probability that a randomly selected obs from the population 1 will be in class 1.

  • P-value\(= P( T \leq T_{Obs})\)

  • Decision: If p_value < \(\alpha\) then REJECT \(H_0\)

Hypothesis: Upper-tailed test \[ H_0:p_1 = p_2 \] \[H_1: p_1 > p_2 \] - \(p_1\) the probability that a randomly selected obs from the population 1 will be in class 1.

  • P-value\(= P( T \geq T_{Obs})\)

  • Decision: If p_value < \(\alpha\) then REJECT \(H_0\)

Chi-squared Test - Example 1

The number of items in two car loads.

Data

Defective Non defective Totals
Carload 1 a =13 b=73 \(n_1\) = 86
Carload 2 c = 17 d=57 \(n_2\) = 74
Totals \(c_1\) = 30 \(c_2\)= 130 N = 160

2 x 2 contingency table

Question: Test whether there are differences in proportions of defective items between the two carloads.

  1. Define the null and alternative hypotheses
    Answer: \[ H_0: p_1 = p_2 \] \[H_1: p_1 \neq p_2 \] This is an two-tailed test.
  2. Find the Test statistic observed and null distribution
    Answer: \(T_{obs}=-1.2695\) and \(T\sim N(0,1)\)
  3. Determine critical values (rejection region)
    Answer: +/- -1.959964
  4. Find P-value
    Answer: p-value=0.2042628
  5. Decision
    Answer: Since P-value > 0.05 then Fail to Reject \(H_0\).

Chi-squared Test - Example 1 with R

data = cbind(c(13,17),c(73,57)) # create data
chisq.test(data, # table data
           correct = FALSE # find p-value without Yates' correction
           ) 

    Pearson's Chi-squared test

data:  data
X-squared = 1.6116, df = 1, p-value = 0.2043
chisq.test(data, # table data
           correct = TRUE # find p-value with Yates' correction
           ) 

    Pearson's Chi-squared test with Yates' continuity correction

data:  data
X-squared = 1.1372, df = 1, p-value = 0.2863
  • Interpretation: Fail to Reject \(H_0\). There is evidence to support that the data is compatible with equal proportions \(p-value=0.2\).
  • Note: \(T_{obs}^2 = (-1.2695)^2\) = 1.6116303

Chi-squared Test - Example 2

A new toothpaste is tested for men and women preferences.

Data

Like Do not like Totals
Men a =64 b=36 \(n_1\) = 100
Women c = 74 d=26 \(n_2\) = 100
Totals \(c_1\) = 138 \(c_2\)= 62 N = 200

2 x 2 contingency table

Question: Do men and women differ in their preferences regarding the new toothpaste?

  1. Define the null and alternative hypotheses
    Answer: \[ H_0: p_1 = p_2 \] \[H_1: p_1 \neq p_2 \] This is an two-tailed test.
  2. Find the Test statistic observed and null distribution
    Answer: \(T_{obs}=-1.53\) and \(T\sim N(0,1)\)
  3. Determine critical values (rejection region)
    Answer: +/- -1.959964
  4. Find P-value
    Answer: p-value=0.1260167
  5. Decision and Interpretation
    Answer: Since P-value > 0.05 then Fail to Reject \(H_0\).

There is insufficient evidence to support that men and women differ in their preferences regarding the new toothpaste.

Fisher’s Exact Test

Data

col 1 col 2 Totals
row 1 X (\(p_1\)) r-X r
row 2 c-X (\(p_2\)) N-r-c+X N-r
Totals c N-c N

2 x 2 contingency table

Assumption

  • Each observation can be in one cell
  • The row and column totals are fixed.

Fisher’s Exact Test

Test Statistic

T = X = number of obs. in row 1 and col 1.

\[ T (H_0) \sim hypergeometric(N,r,C) \] The PMF is:

\[ P(T=x) = \frac{\binom{r}{x}\binom{N-r}{c-x}}{\binom{N}{c}} \] x=0,1,2,…,min(r,m)

Fisher’s Exact Test

Hypothesis: Two-tailed test \[ H_0:p_1 = p_2 \] \[H_1: p_1 \neq p_2 \] - \(p_1\) the probability that a randomly selected obs from the row 1 will be in col 1.

  • P-value\(= 2\times \min\{ P( T \leq T_{Obs}), P(T \geq T_{Obs}) \}\)

  • Decision: IF p_value < \(\alpha\) then REJECT \(H_0\)

Fisher’s Exact Test

Hypothesis: Lower-tailed test \[ H_0:p_1 = p_2 \] \[H_1: p_1 < p_2 \] - \(p_1\) the probability that a randomly selected obs from the row 1 will be in col 1.

  • P-value\(= P( T \leq T_{Obs})\)

  • Decision: IF p_value < \(\alpha\) then REJECT \(H_0\)

Hypothesis: Upper-tailed test \[ H_0:p_1 = p_2 \] \[H_1: p_1 > p_2 \] - \(p_1\) the probability that a randomly selected obs from the row 1 will be in col 1.

  • P-value\(= P( T \geq T_{Obs})\)

  • Decision: IF p_value < \(\alpha\) then REJECT \(H_0\)

Fisher’s Exact Test - Example

14 newly hired business majors. - 10 males and 4 females - 2 Jobs are needed: 10 Tellers and 4 Account Rep.

Data

Account Rep. Tellers Totals
Males X=1 9 r = 10
Females 3 1 4
Totals c= 4 10 N = 14

2 x 2 contingency table

Question: Test if females are more likely than males to get the account Rep. job.

Fisher’s Exact Test - Example

  1. Define the null and alternative hypotheses
    Answer: \[ H_0: p_1 \geq p_2 \] \[H_1: p_1 < p_2 \] This is an lower-tailed test.
  2. Find the Test statistic observed and null distribution
    Answer: \(T_{obs}=X=1\) and \(T\sim hypergeometric(14,10,4)\)`
  3. Find P-value
    Answer: p-value=0.040959
  4. Decision and Interpretation
    Answer: Since P-value < 0.05 then Reject \(H_0\).

Fisher’s Exact Test - Example with R

data = cbind(c(1,3),c(9,1)) # create data
fisher.test(data,alternative = "l")

    Fisher's Exact Test for Count Data

data:  data
p-value = 0.04096
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.000000 0.897734
sample estimates:
odds ratio 
0.05545513 
  • Interpretation: Reject \(H_0\). There is evidence to support that the data is compatible with the assumption that females are more likely than males to get the account Rep. job.

Mantel-Haenszel Test 2x2xk

An extension of Fisher’s exact to several 2x2 tables.

mydata =array(c(10,12,1,1,9,11,0,1,8,7,0,3),
               dim=c(2,2,3),
               dimnames=list(c("Treat.","Control"),c("Success","Failure"),c("Group 1","Group2","Group 3")))
mydata
, , Group 1

        Success Failure
Treat.       10       1
Control      12       1

, , Group2

        Success Failure
Treat.        9       0
Control      11       1

, , Group 3

        Success Failure
Treat.        8       0
Control       7       3
mantelhaen.test(mydata,alternative = "g")

    Mantel-Haenszel chi-squared test with continuity correction

data:  mydata
Mantel-Haenszel X-squared = 1.0114, df = 1, p-value = 0.1573
alternative hypothesis: true common odds ratio is greater than 1
95 percent confidence interval:
 0.7087777       Inf
sample estimates:
common odds ratio 
         4.357143 

Chi-squared test rxc Table Difference in Probabilities

M =  rbind("PrivateS"=c(6,14,17,9), "PublicS"=c(30,32,17,3))
M
         [,1] [,2] [,3] [,4]
PrivateS    6   14   17    9
PublicS    30   32   17    3
chisq.test(M)
Warning in chisq.test(M): Chi-squared approximation may be incorrect

    Pearson's Chi-squared test

data:  M
X-squared = 17.286, df = 3, p-value = 0.0006172

P-value < 0.001. The conclusion is that test scores are distributed differently among public and private high school students

Chi-squared test rxc Table Test for Independence

M = rbind("InState"=c(16,14,13,13), "OutState"=c(14,6,10,8))
M
         [,1] [,2] [,3] [,4]
InState    16   14   13   13
OutState   14    6   10    8
chisq.test(M)

    Pearson's Chi-squared test

data:  M
X-squared = 1.5242, df = 3, p-value = 0.6767

The conclusion is that the college in which a student is enrolled is independent of whether high school training was in state or out of state

The Median Test

Test for equal medians.

\[H_0: \text{All C populations have the same median} \] \[H_1: \text{At least two populations have different medians} \]

Data

  • C random samples are independent
  • Arrange the data as follows:
    • Find the Grand Median (GM), that is the median of the combined samples.
    • Set up a 2 by C contingency table as follows:
Sample 1 Sample 2 Sample C Totals
\(>\) GM \(O_{11}\) \(O_{12}\) \(O_{1C}\) a
\(\leq\) GM \(O_{21}\) \(O_{22}\) \(O_{2C}\) b
Totals \(n_{1}\) \(n_{1}\) \(n_{C}\) N
  • Test Statistic

\[ T = \frac{N^2}{ab} \sum_{i=1}^{C} \frac{O^2_{1i}}{n_i} - \frac{Na}{b} \]

Under Null hypothesis: \(T \sim \chi^2_{C-1}\); a chi-square distribution with C-1 degrees of freedom.

  • P-value \(=P(T \geq T_{obs})\)

  • Decision: IF p_value < \(\alpha\) then REJECT \(H_0\)

The Median Test - Example

  • 4 methods of growing corn is used.
  • The yield per acre is measured and compared across the 4 methods.

Question: Do the medians yield per acre differ across the 4 methods.

  1. Define the null and alternative hypotheses \[H_0: \text{All methods have the same median yield per acre} \] \[H_1: \text{At least two of the methods medians differ} \]

  2. Set up data: See lecture notes

  3. Test statistic:

\[T_{obs} = 17.6\]

Under Null hypothesis: \(T \sim \chi^2_{3}\)

The Median Test - Example R

# install.packages("agricolae")
library(agricolae) # package

data(corn) # data

# The Median Test
median_test_out= Median.test(corn$observation,corn$method)

The Median Test for corn$observation ~ corn$method 

Chi Square = 17.54306   DF = 3   P.Value 0.00054637
Median = 89 

  Median  r Min Max   Q25   Q75
1   91.0  9  83  96 89.00 92.00
2   86.0 10  81  91 83.25 89.75
3   95.0  7  91 101 93.50 98.00
4   80.5  8  77  82 78.75 81.00

Post Hoc Analysis

Groups according to probability of treatment differences and alpha level.

Treatments with the same letter are not significantly different.

  corn$observation groups
3             95.0      a
1             91.0      b
2             86.0      b
4             80.5      c

Multiple Comparison

# Visualization
plot(median_test_out)

Cramer’s Contingency Coefficient

Measures row x column association. Similar to a correlation coefficient between two continuous variables.

The high school state vs College Example

#install.packages("lsr")
library(lsr)
M = rbind("InState"=c(16,14,13,13), "OutState"=c(14,6,10,8))
cramersV(M)
[1] 0.1273375

The high school type vs score Example

#install.packages("lsr")
library(lsr)
M =  rbind("PrivateS"=c(6,14,17,9), "PublicS"=c(30,32,17,3))
cramersV(M)
Warning in stats::chisq.test(...): Chi-squared approximation may be incorrect
[1] 0.3674853