[1] 1 2 3 4 5 6 7 8 9 10 11 12
50%
6.5
25% 50% 75%
3.75 6.50 9.25
Dr. Cohen
This course covers an introduction to nonparametric statistics methods as follows:
Let’s \(X \sim p(x)\) be a discrete r.v., then the expected value:
\[E(X) = \sum_{\forall x} x p(x)\]
Let’s \(X \sim f(x)\) be a continuous r.v., then the expected value:
\[E(X) = \int_{\forall x} x f(x) dx \]
R is a free software environment for statistical computing and graphics.
RStudio is an integrated development environment (IDE) for R and Python, with a console, syntax-highlighting editor, and other features.
R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both save and execute code to generate high quality reports.
Quarto is an open-source scientific and technical publishing system.
In this class, we will use the UWF RStudio Server. Log in using your UWF account. Go to UWF RStudio Server
Definition
: The number \(x_p\) (\(0 \leq p \leq 1\)) is called the \(p^{th}\) quantile of a random variable \(X\) if:
Continuous: \(P(X\leq x_p) = p\)
Discrete: \(P(X < x_p) \leq p\) AND \(P(X\leq x_p) \geq p\)
We can think of quantiles as points that split the distribution (values of \(X\)) into equal intervals.
Examples:
Let’s \(X\) be a r.v. with PMF:
x | 0 | 1 | 2 | 3 |
---|---|---|---|---|
P(X=x) | 1/4 | 1/4 | 1/3 | 1/6 |
Solution
: Find \(x_{0.75}\) that satisfies \(P(X < x_{0.75}) \leq 0.75\) AND \(P(X\leq x_{0.75}) \geq 0.75\); if more than one value satisfy this then the average of all values is the answer. (\(x_{0.75}\)=2)
Solution
: Find \(x_{0.5}\) that satisfies \(P(X < x_{0.5}) \leq 0.5\) AND \(P(X\leq x_{0.5}) \geq 0.5\); if more than one value satisfy this then the average of all values is the answer.(\(x_{0.5}\)=1.5)
Example 1:
[1] 1 2 3 4 5 6 7 8 9 10 11 12
50%
6.5
25% 50% 75%
3.75 6.50 9.25
Example 2:
Hypothesis Testing is the process of inferring from a sample whether or not a given statement about the population appears to be true.
Null hypothesis - \(H_0\): it is usually formulated for the express purpose of being rejected. No differences. Example: “The process is in-control” in quality control.
Alternative Hypothesis - \(H_1\): \(H_0\) is rejected. It is the statement that the experimenter would like to prove. There are differences. For example, the alternative hypothesis could be “the quality of the product or service is unsatis- factory” or process is out of control”.
The Test Statistic is chosen to be sensitive to the difference between the null hypothesis and the alternative hypothesis.
Level of significance (\(\alpha\)): When the null hypothesis and the alternative hypothesis have been stated, and when the appropriate statistical test has been selected, the next step is to specify a level of significance \(\alpha\) and a sample size. Since the level of significance goes into the determination of whether \(H_0\) is or is not rejected, the requirement of objectivity demands that \(\alpha\) be set in advance. This is also called Type I Error.
\[ \alpha = P(Reject~H_0~|~ H_0~is~True) \]
The Null distribution is the distribution of the test statistic when \(H_0\) is TRUE
. This defines the rejection region along with the level of significance.
The Power: The probability of rejecting \(H_0\) when it is in fact FALSE
.
\[Power = 1 - \beta = P(Rejecting~ H_0 | H_0 ~is~FALSE)\]
P-value is the probability, computed assuming that \(H_0\) is TRUE
, that the test statistic would take a value as extreme or more extreme than that actually observed.
Accepting or Failing to Reject \(H_0\) does not mean that the data prove the null hypothesis to be true. Read the ASA statement on p-values
A binomial experiment has the following properties:
random variable
\(Y\) is the number of successes (\(yesS\)) observed during the \(n\) trials.Consider \(Y \sim binom(n,p)\), then \(Y\) can take on values \(0, 1, 2, 3,\ldots, n\). The probability mass function is given by \[\begin{align} p(y)=&P(Y=y)=\binom{n}{y}p^y q^{n-y}, \\ y=&0,1,2,\ldots,n ; \qquad 0\leq p \leq 1 ; \quad q=1-p \end{align}\] The expected value of \(Y\) is: \[\begin{align}E(Y)=np\end{align}\] The Variance \(Y\) is: \[\begin{align} V(Y)=npq \end{align}\]
There are some populations which are conceived as consisting of only two classes:
For such cases all the population observations will fall into either one or the other of the two classes. If we know that the proportion of cases in one class is \(p\), sometimes we are interested to estimate/test the population proportion \(p\).
1.DATA:
The sample consists of outcomes of \(n\) independent trails. Each trail has two outcomes (mutually exclusive= they cannot both occur at same time). Let consider \(O_1\) to be \(\#\) of observations in class1 and \(O_2\) to be the \(\#\) of observations in class2. We have \(n=O_1+O_2\).
2.Assumptions:
- \(n\) observation are independent. - Each trial has probability \(p\) of resulting in the outcome \(O_1\), where \(p\) is the same for all \(n\) trials.
3.Test Statistic:
Since we are interested in the probability of an outcome to be in class1, we will let the test statistic \(T\) be the number of times the outcomes is in class1
. Therefore, the test statistic follows the binomial distribution (null distribution) with \(n\) and \(\hat{p}\) (the hypothesized proportion) parameters.
4.Hypothesis: Two-tailed test
\[ H_0: p=p^* \] \[H_1: p \neq p^* \]
\[ P( T \leq t_1) \approx \alpha/2 \] \[ P( T \leq t_2) \approx 1- \alpha/2 \]
Decision
: IF \((T_{Observed} \leq t_1\) OR \(T_{Obs}> t_2\)) then REJECT \(H_0\)
P-value
\(= 2\times \min\{ P( Y \leq T_{Obs}), P(Y \geq T_{Obs}) \}\); If greater than 1 then \(P-value=1\).
4.Hypothesis: Upper-tailed test
\[ H_0: p \leq p^* \]
\[H_1: p > p^* \]
The rejection region is defined by \(t\) as follows:
\[P( T \leq t_2) \approx 1-\alpha \]
Decision
: IF \(T_{Obs} > t\) then REJECT \(H_0\)
P-value
\(=P(Y \geq T_{Obs})\)
4.Hypothesis: Lower-tailed test
\[ H_0: p \geq p^* \]
\[H_1: p < p^* \]
\[P( T \leq t_1) \approx \alpha \]
Decision
: IF \(T_{Obs} \leq t\) then REJECT \(H_0\)
P-value
\(=P(Y \leq T_{Obs})\)
We would like to know whether the passing rate for a course is \(75\%\). We have a sample of \(n=15\) students taking the course and only 8 passed. Level of significance is \(5\%\). Test the appropriate hypothesis as follows:
Answer:
\[ H_0: p=0.75 \] \[H_1: p \neq 0.75 \]Answer:
\(T_{obs}=8\) and \(T\sim bin(15,0.75)\)Answer:
t_1=8 and t_2=14Answer:
p-value=0.1132406Answer:
Since P-value > 0.05 then Fail to Reject \(H_0\).Solution
binom.test(x = 8, # number of successes (#students pass)
n = 15, # sample size
p = 0.75, # hypothesized p
alternative = 'two.sided', # two-tailed test
conf.level = 0.95) # Confidence Interval
Exact binomial test
data: 8 and 15
number of successes = 8, number of trials = 15, p-value = 0.06998
alternative hypothesis: true probability of success is not equal to 0.75
95 percent confidence interval:
0.2658613 0.7873333
sample estimates:
probability of success
0.5333333
Interpretation
: Fail to Reject \(H_0\). There is evidence to support that the data is compatible with the null hypothesis \(\hat{p}=0.533\); \(95\%CI (0.27,0.79)\).The sign test is a binomial test with \(p=0.5\).
It is useful for testing whether one random variable in a pair \((X, Y)\) tends to be larger than the other random variable in the pair.
Example
: Consider a clinical investigation to assess the effectiveness of a new drug designed to reduce repetitive behavior, we can compare time before and after taking the new drug.
This test can be use as alternative to the parametric t-paired test
.
1.DATA:
The data consist of observations on a bivariate random sample \((X_i, Y_i)\), where \(n'\) is the number of the pairs
. If the X’s and Y’s are independent, then the more powerful Mann-Whitney test is more appropriate.
Within each pair \((X_i, Y_i)\) a comparison is made and the pair is classified as:
"+"
if \(X_i < Y_i\)"-"
if \(X_i > Y_i\)"0"
if \(X_i = Y_i\) (tie)2.Assumptions:
- The bivariate random variables \((X_i, Y_i)\) are mutually independent . - The measurement scale is at least ordinal.
3.Test Statistic:
The test statistic is defined as follows:
\[T = \text{Total number of +'s}\]
The null distribution is the binomial distribution with \(p=0.5\) and \(n=\text{the number of non-tied pairs}\).
4.Hypothesis: Two-tailed test
\[ H_0:E(X_i) = E(Y_i) \text{ or } P(+) = P(-) \]
\[H_1: E(X_i) \neq E(Y_i) \text{ or } P(+) \neq P(-) \]
\[ P( Y \leq t) \approx \alpha/2 \]
Decision
: IF \((T_{Obs} \leq t\) or $T_{Obs} n-t) $ REJECT \(H_0\)
P-value
\(= 2\times \min\{ P( Y \leq T_{Obs}), P(Y \geq T_{Obs}) \}\)
4.Hypothesis: Upper-tailed test
\[ H_0:E(X_i) \geq E(Y_i) \text{ or } P(+) \leq P(-) \]
\[H_1: E(X_i) < E(Y_i) \text{ or } P(+) > P(-) \]
\[P( Y \leq t) \approx \alpha \]
Decision
: IF \(T_{Obs} \geq n-t\) REJECT \(H_0\)
P-value
\(= P( Y \geq T_{Obs})\)
4.Hypothesis: Lower-tailed test
\[ H_0:E(X_i) \leq E(Y_i) \text{ or } P(+) \geq P(-) \]
\[H_1: E(X_i) > E(Y_i) \text{ or } P(+) < P(-)\]
\[P( Y \leq t) \approx \alpha \]
Decision
: IF \(T_{Obs} \leq t\) REJECT \(H_0\)
P-value
\(= P( Y \leq T_{Obs})\)
Question:
Is the RTAL significantly longer than RTBL?
Solution:
We have \(n'=18\) pairs, each one for a worker:
\[ (RTBL,RTAL) = (X,Y)\]
"+"
if \(RTBL < RTAL\) (12)"-"
if \(RTBL > RTAL\) (5)"0"
if \(RTBL = RTAL\) (tie) (1)then \(n=17\)
Solution
\[ H_0: E(RTBL) \leq E(RTAL) \text{ OR } P(+) \leq P(-) \] \[H_1: E(RTBL) < E(RTAL) \text{ OR } P(+) > P(-) \]
This is an upper-tailed test.
Test statistic: \(T_{obs}=12\) and \(T\sim bin(17,0.5)\)
Determine critical values (rejection region): t = 12
p-value=0.0717316
Decision: P-value < 0.05 then Reject \(H_0\).
Solution
Exact binomial test
data: 12 and 17
number of successes = 12, number of trials = 17, p-value = 0.07173
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
0.4780823 1.0000000
sample estimates:
probability of success
0.7058824
Interpretation
: Fail to Reject \(H_0\). There is evidence to support that the data is compatible with the null hypothesis \(\hat{p}=0.71\); \(95\%CI (0.48,1)\).\[ P(X_{(r)} \leq \text{a fraction q of the population} \leq X_{(n+m-1)}) \geq 1-\alpha \]
The tolerance limits can be use to find A sample size \(n\) needed to have at least \(q\) proportion of the population between the tolerance limits with \(1-\alpha\) probability.
\[ n \approx \frac{1}{4} \chi^2_{1-\alpha;2(r+m)} \frac{1+q}{1-q} + \frac{1}{2} (r+m-1)\]
where \(\chi^2_{1-\alpha;2(r+m)}\) is the quantile of a chi-squared random variable.
We can find The percent \(q\) of the population that is within the tolerance limits, given n, \(1-\alpha\), \(r\), and \(m\):
\[q=\frac{4n-2(r+m-1)-\chi^2_{1-\alpha;2(r+m)}}{4n-2(r+m-1) + \chi^2_{1-\alpha;2(r+m)}}\]
Electric seat adjusters for a luxury car manufacturer wants to know what range
of vertical adjustment is needed to be \(90\%\) sure that at least \(80\%\) of population of potential buyers will be able to adjust their seat.
To answer the question we need to find \(n\)
Next, we need to randomly pick 18 people from the population of potential buyers and collect their adjustments.
Definition
A contingency table is an array of natural numbers in matrix form where those numbers represent counts / frequencies
Col 1 | Col2 | Totals | |
---|---|---|---|
row 1 | a | b | a+b |
row 2 | c | d | c+d |
Totals | a+c | b+d | a+b+c+d |
2 x 2 contingency table
Data
Class 1 | Class2 | Totals |
|
---|---|---|---|
Population 1 | a (\(p_1\)) | b | \(n_1\) = a+b |
Population 2 | c (\(p_2\)) | d | \(n_2\) = c+d |
Totals |
\(c_1\) = a+c | \(c_2\)= b+d | N = a+b+c+d |
2 x 2 contingency table
Assumption
Test Statistic
\[ T= \frac{\sqrt{N} (ad - bc)}{\sqrt{n_1n_2c_1c_1}}\]
Null distirbution
: \(T \sim N(0,1)\)
Hypothesis: Two-tailed test
\[ H_0:p_1 = p_2 \]
\[H_1: p_1 \neq p_2 \]
\(p_1\) the probability that a randomly selected obs from the population 1 will be in class 1.
P-value
\(= 2\times \min\{ P( T \leq T_{Obs}), P(T \geq T_{Obs}) \}\)
Decision
: IF p_value < \(\alpha\) then REJECT \(H_0\)
Hypothesis: Lower-tailed test
\[ H_0:p_1 = p_2 \]
\[H_1: p_1 < p_2 \]
\(p_1\) the probability that a randomly selected obs from the population 1 will be in class 1.
P-value
\(= P( T \leq T_{Obs})\)
Decision
: IF p_value < \(\alpha\) then REJECT \(H_0\)
Hypothesis: Upper-tailed test
\[ H_0:p_1 = p_2 \]
\[H_1: p_1 > p_2 \]
\(p_1\) the probability that a randomly selected obs from the population 1 will be in class 1.
P-value
\(= P( T \geq T_{Obs})\)
Decision
: IF p_value < \(\alpha\) then REJECT \(H_0\)
The number of items in two car loads.
Data
Defective | Non defective | Totals |
|
---|---|---|---|
Carload 1 | a =13 | b=73 | \(n_1\) = 86 |
Carload 2 | c = 17 | d=57 | \(n_2\) = 74 |
Totals |
\(c_1\) = 30 | \(c_2\)= 130 | N = 160 |
2 x 2 contingency table
Question
: Test whether there are differences in proportions of defective items between the two carloads.
\[ H_0: p_1 = p_2 \] \[H_1: p_1 \neq p_2 \]
This is an two-tailed test.
Find the Test statistic observed and null distribution
\(T_{obs}=-1.2695\) and \(T\sim N(0,1)\)
Determine critical values (rejection region): +/- 1.96
Find P-value: p-value = 0.204
Decision: Since P-value > 0.05 then Fail to Reject \(H_0\).
data = cbind(c(13,17),c(73,57)) # create data
chisq.test(data, # table data
correct = FALSE # find p-value without Yates' correction
)
Pearson's Chi-squared test
data: data
X-squared = 1.6116, df = 1, p-value = 0.2043
Pearson's Chi-squared test with Yates' continuity correction
data: data
X-squared = 1.1372, df = 1, p-value = 0.2863
Interpretation
: Fail to Reject \(H_0\). There is evidence to support that the data is compatible with equal proportions \(p-value=0.2\).A new toothpaste is tested for men and women preferences.
Data
Like | Do not like | Totals |
|
---|---|---|---|
Men | a =64 | b=36 | \(n_1\) = 100 |
Women | c = 74 | d=26 | \(n_2\) = 100 |
Totals |
\(c_1\) = 138 | \(c_2\)= 62 | N = 200 |
2 x 2 contingency table
Question
: Do men and women differ in their preferences regarding the new toothpaste?
\[ H_0: p_1 = p_2 \] \[H_1: p_1 \neq p_2 \]
This is an two-tailed test.
Find the Test statistic observed and null distribution
\(T_{obs}=-1.53\) and \(T\sim N(0,1)\)
Determine critical values (rejection region): +/- 1.96
Find P-value: p-value=0.1260167
Decision and Interpretation: Since P-value > 0.05 then Fail to Reject \(H_0\). There is insufficient evidence to support that men and women differ in their preferences regarding the new toothpaste.
Data
col 1 | col 2 | Totals |
|
---|---|---|---|
row 1 | X (\(p_1\)) | r-X | r |
row 2 | c-X (\(p_2\)) | N-r-c+X | N-r |
Totals |
c | N-c | N |
2 x 2 contingency table
Assumption
Test Statistic
T = X = number of obs. in row 1 and col 1.
\[ T (H_0) \sim hypergeometric(N,r,C) \]
The PMF is:
\[ P(T=x) = \frac{\binom{r}{x}\binom{N-r}{c-x}}{\binom{N}{c}} \]
x=0,1,2,…,min(r,m)
Hypothesis: Two-tailed test
\[ H_0:p_1 = p_2 \]
\[H_1: p_1 \neq p_2 \]
\(p_1\) the probability that a randomly selected obs from the row 1 will be in col 1.
P-value
\(= 2\times \min\{ P( T \leq T_{Obs}), P(T \geq T_{Obs}) \}\)
Decision
: IF p_value < \(\alpha\) then REJECT \(H_0\)
Hypothesis: Lower-tailed test
\[ H_0:p_1 = p_2 \]
\[H_1: p_1 < p_2 \]
\(p_1\) the probability that a randomly selected obs from the row 1 will be in col 1.
P-value
\(= P( T \leq T_{Obs})\)
Decision
: IF p_value < \(\alpha\) then REJECT \(H_0\)
Hypothesis: Upper-tailed test
\[ H_0:p_1 = p_2 \]
\[H_1: p_1 > p_2 \]
\(p_1\) the probability that a randomly selected obs from the row 1 will be in col 1.
P-value
\(= P( T \geq T_{Obs})\)
Decision
: IF p_value < \(\alpha\) then REJECT \(H_0\)
14 newly hired business majors. - 10 males and 4 females - 2 Jobs are needed: 10 Tellers and 4 Account Rep.
Data
Account Rep. | Tellers | Totals |
|
---|---|---|---|
Males | X=1 | 9 | r = 10 |
Females | 3 | 1 | 4 |
Totals |
c= 4 | 10 | N = 14 |
2 x 2 contingency table
Question
: Test if females are more likely than males to get the account Rep. job.
Answer:
\[ H_0: p_1 \geq p_2 \] \[H_1: p_1 < p_2 \] This is an lower-tailed test.Answer:
\(T_{obs}=X=1\) and \(T\sim hypergeometric(14,10,4)\)`Answer:
p-value=0.040959Answer:
Since P-value < 0.05 then Reject \(H_0\).
Fisher's Exact Test for Count Data
data: data
p-value = 0.04096
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
0.000000 0.897734
sample estimates:
odds ratio
0.05545513
Interpretation
: Reject \(H_0\). There is evidence to support that the data is compatible with the assumption that females are more likely than males to get the account Rep. job.Test for equal medians.
\[H_0: \text{All C populations have the same median} \]
\[H_1: \text{At least two populations have different medians} \]
Data
Sample 1 | Sample 2 | … | Sample C | Totals |
|
---|---|---|---|---|---|
\(>\) GM | \(O_{11}\) | \(O_{12}\) | … | \(O_{1C}\) | a |
\(\leq\) GM | \(O_{21}\) | \(O_{22}\) | … | \(O_{2C}\) | b |
Totals |
\(n_{1}\) | \(n_{1}\) | … | \(n_{C}\) | N |
Test Statistic
\[ T = \frac{N^2}{ab} \sum_{i=1}^{C} \frac{O^2_{1i}}{n_i} - \frac{Na}{b} \]
Under Null hypothesis: \(T \sim \chi^2_{C-1}\); a chi-square distribution with C-1 degrees of freedom.
P-value
\(=P(T \geq T_{obs})\)
Decision
: IF p_value < \(\alpha\) then REJECT \(H_0\)
Question
: Do the medians yield per acre differ across the 4 methods.
\[H_0: \text{All methods have the same median yield per acre} \]
\[H_1: \text{At least two of the methods medians differ} \]
Set up data: See lecture notes
Test statistic
:
\[T_{obs} = 17.6\]
Under Null hypothesis: \(T \sim \chi^2_{3}\)
# install.packages("agricolae")
library(agricolae) # package
data(corn) # data
# The Median Test
median_test_out= Median.test(corn$observation,corn$method)
The Median Test for corn$observation ~ corn$method
Chi Square = 17.54306 DF = 3 P.Value 0.00054637
Median = 89
Median r Min Max Q25 Q75
1 91.0 9 83 96 89.00 92.00
2 86.0 10 81 91 83.25 89.75
3 95.0 7 91 101 93.50 98.00
4 80.5 8 77 82 78.75 81.00
Post Hoc Analysis
Groups according to probability of treatment differences and alpha level.
Treatments with the same letter are not significantly different.
corn$observation groups
3 95.0 a
1 91.0 b
2 86.0 b
4 80.5 c
Multiple Comparison
Test whether a sample is coming from a known distribution. For example, test the normality assumption.
Test if speed
data is normally distributed.
Data
Interpretation
: Fail to reject H0. There is evidence to support that data is normally distributed.
Asymptotic one-sample Kolmogorov-Smirnov test
data: cars$speed
D = 0.068539, p-value = 0.9729
alternative hypothesis: two-sided
Interpretation
: Fail to reject H0. There is evidence to support that data is normally distributed.
p1 = pnorm(5,mean_speed,sd_speed)# P(X <= 5)
p2 = pnorm(10,mean_speed,sd_speed)-pnorm(5,mean_speed,sd_speed) # P(5<=X <= 10)
p3 = pnorm(15,mean_speed,sd_speed)-pnorm(10,mean_speed,sd_speed) # P(10<=X <= 15)
p4 = pnorm(20,mean_speed,sd_speed)-pnorm(15,mean_speed,sd_speed) # P(15<=X <= 20)
p5 = 1- pnorm(20,mean_speed,sd_speed) # P(X> 20)
Pj = c(p1,p2,p3,p4,p5) # put everything in one array
Pj
[1] 0.02460029 0.12896802 0.31628124 0.33798729 0.19216316
[1] 1
Chi-squared test for given probabilities
data: Ob
X-squared = 1.3267, df = 4, p-value = 0.8568
The degrees of freedom of the test is df=C-1-k=5-1-2=2. k=2 because we did estimate the mean and variance from the sample. We need to recompute
the p-value.
Comparing two independent samples.
Data
Two random samples are of interest here. Let \(X_1, X_2, \ldots, X_n\) be a random sample of size \(n\) from population 1. And Let \(Y_1, Y_2, \ldots, Y_m\) be a random sample of size \(m\) from population 2.
Assign ranks 1 to \(n+m\) to the all observations and let \(R(X_i)\) be the rank assigned to \(X_i\). The same let \(R(Y_i)\) be the rank assigned to \(Y_i\). \(N=n+m\).
If several samples values are exactly equal to each other (tied), assign to each their average of the ranks that would have been used to them if there had been no ties.
Assumptions
Test Statistic
If no ties: \(T = \sum_{i=1}^n R(X_i)\)
If there are ties: \(T_1 = \frac{T - \frac{n(N+1)}{2}}{\sqrt{\frac{nm}{N(N-1)} \sum_{i=1}^N R^2_i -\frac{nm(N+1)^2}{4(N-1)}}}\), where \(\sum_{i=1}^N R^2_i\) is the sum of the squares of all Ranks.
The null distribution
is approximated with the standard normal distribution for \(T_1\). However, \(T\) has the exact distribution for small sample sizes \(<20\).
Hypotheses
Let \(F(x)\) be the distribution function corresponding to \(X\).
Let \(G(x)\) be the distribution function corresponding to \(Y\).
Two-tailed test
\[ H_0: F(x) = G(x) \quad \text{for all x} \]
\[H_1: F(x) \neq G(x) \quad \text{for some x} \]
The rejection region is defined by the quantiles \(T_{\alpha/2} (T_1)\) and \(T_{1-\alpha/2} (T_1)\). \(\alpha\) is the significance level.
Upper-tailed test
\[ H_0: F(x) = G(x)\]
\[H_1: E(X) > E(Y) \]
Here \(X\) tends to be greater than \(Y\).
Lower-tailed test
\[ H_0: F(x) = G(x)\]
\[H_1: E(X) < E(Y) \]
Here \(X\) tends to be smaller than \(Y\).
High temperatures are randomly selected from days in a summer in 2 cities.
Question
: Test if the mean high temperature in city1 is higher than the mean high temperature in city2.
temp_city1 = c(83,91,89,89,94,96,91,92,90)
temp_city2 = c(78,82,81,77,79,81,80,81)
wilcox.test(temp_city1,temp_city2,alternative = "g")
Wilcoxon rank sum test with continuity correction
data: temp_city1 and temp_city2
W = 72, p-value = 0.0003033
alternative hypothesis: true location shift is greater than 0
Comparing two dependent samples. This test can be used as an alternative to the paired t-test when the assumptions are not met.
Data
\[ \mid D_i\mid = \mid Y_i - X_i \mid, \qquad i=1,2\ldots,n' \]
Omit the pairs where the difference is zero. Let \(n\) be the remaining pairs. Ranks from 1 to \(n\) are assigned to the \(n\) pairs according to \(\mid D_i\mid\).
If several samples values are exactly equal to each other (tied), assign to each their average of the ranks that would have been used to them if there had been no ties.
Test Statistic
Let \(R_i\) called the signed rank be defined for each pair \((X_i, Y_i)\) as follows: the rank is positive if the D is positive, and negative if the difference D is negative.
If no ties: \(T^+ = \sum (R_i \text{where} D_i \text{is positive})\). Tables for exact lower quantiles, under \(H_0\), exist. The upper quantile is given: \(w_p = n(n+1)/2 - w_{1-p}\). We will use R.
If there are ties: \(T = \frac{\sum_{i=1}^n R_i}{\sqrt{\sum_{i=1}^n R_i^2}}\sim\) Standard normal distribution, under \(H_0\).
Hypotheses
Two-tailed test
\[ H_0: E(D) = 0 \]
\[H_1: E(D) \neq 0 \]
The rejection region is defined by the quantiles \(T^+_{\alpha/2} (or T)\) and \(T^+_{1-\alpha/2} (or T)\). \(\alpha\) is the significance level.
Upper-tailed test
\[ H_0: E(D) \leq 0\]
\[H_1: E(D) > 0 \]
The rejection region is defined by the quantiles \(T^+_{1-\alpha} ( or T)\). \(\alpha\) is the significance level.
Lower-tailed test
\[ H_0: E(D) \geq 0 \]
\[H_1: E(D) <0 \]
The rejection region is defined by the quantiles \(T^+_{\alpha} (or T)\). \(\alpha\) is the significance level.
Question
: Test if first born twin is more aggressive than the other.
first_born= c(86,71,77,68,91,72,77,91,70,71,88,87)
second_born = c(88,77,76,64,96,72,65,90,65,80,81,72)
wilcox.test(first_born,second_born,paired=T,alternative = "g",correct = T,exact = F)
Wilcoxon signed rank test with continuity correction
data: first_born and second_born
V = 41.5, p-value = 0.2382
alternative hypothesis: true location shift is greater than 0
P-value > 0.05. Fail to reject H0. There is evidence to suggest that the first born is not more aggressive than the second born.
STA4051 - Nonparametric Statistics