Author: Ben Reynwar
Copyright: Whatever wikipedia uses since there are probably bits cut and pasted.
Created: 2011 April 1
Last Edited: 2011 May 28
These are notes I made while working through the book "Statistics Explained". They're rather terse but include R examples so I thought they might possibly be useful.
Type I error - finding an effect that isn't real.
Type II error - not finding a real effect.
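A quick way to see the type I error rate in action is to simulate many tests where the null hypothesis is true; roughly a fraction alpha of them still come out "significant". A minimal sketch (the sample sizes, number of replications and seed are arbitrary choices for the illustration):
# Sketch: with no real effect, about 5% of t-tests at alpha = 0.05 still reject
# the null hypothesis - these rejections are type I errors.
set.seed(42)                                              # arbitrary seed
p.values <- replicate(10000, t.test(rnorm(20), rnorm(20))$p.value)
mean(p.values < 0.05)                                     # should be close to 0.05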
The chi-square distribution with k degrees of freedom is the distribution of the sum of squares of k independent standard normal variables.
It is a special case of the gamma distribution.
Useful for estimating the variance of a population. Suppose we have n observations from a normal population \(N(\mu, \sigma^2)\) and \(S\) is their sample standard deviation; then \(\frac{(n-1)S^2}{\sigma^2}\) has a chi-squared distribution with n-1 degrees of freedom. Since \(\sigma\) is the only unknown, one can get confidence intervals for the population variance.
> # Number of observations in sample.
> n <- 10
> # Set alpha for 95% confidence interval (two-sided)
> alpha <- 0.05
> # Get a sample. Pretend we don't know mean or standard deviation.
> sample = rnorm(n, 25, 8)
> top = var(sample)*(n-1)
> # Get bounds for confidence interval for the standard deviation.
> lower = sqrt(top/qchisq(1-alpha/2, n-1))
> upper = sqrt(top/qchisq(alpha/2, n-1))
> c(lower, upper)
[1] 3.531557 9.373243
The ratio of two independent chi-squared random variables, each divided by its degrees of freedom, has an F-distribution.
> # We take two samples from a normal population and look at the ratio of their
> # sample variances. The ratio follows an F-distribution whose degrees of
> # freedom are the sample sizes minus one; call these n1 and n2.
> n1 <- 3; n2 <- 4
> # The central 95% interval for the ratio (the population variances are equal) is:
> alpha <- 0.05
> c(qf(alpha/2, n1, n2), qf(1-alpha/2, n1, n2))
[1] 0.06622087 9.97919853
> mu <- 10; sigma <- 5
> # Sample sizes are one more than the chosen degrees of freedom.
> sample1 <- rnorm(n1 + 1, mu, sigma)
> sample2 <- rnorm(n2 + 1, mu, sigma)
> ratio <- var(sample1)/var(sample2)
> # Calculated ratio is:
> ratio
[1] 0.2974744
> # Calculate the value of x for which the cumulative distribution of the
> # t-distribution is 0.05, for degrees of freedom = 1000.
> qt(0.05, 1000)
[1] -1.646379
> # Now for degrees of freedom = 3.
> qt(0.05, 3)
[1] -2.353363
> # Get the density of a t-distribution.
> dt(0.05, 1000)
[1] 0.3983438
> # Get the cumulative distribution.
> pt(1, 1000)
[1] 0.8412238
> # Generate a random vector containing 10 values from a normal distribution
> # with mean 10 and standard deviation 2.
> x = rnorm(10, 10, 2)
> x
[1] 7.288907 10.679290 7.021853 12.356291 11.973576 13.452286 10.765847
[8] 12.815315 10.776181 11.478132
> # Generate another similar random vector.
> y = rnorm(10, 10, 2)
> y
[1] 10.228506 9.307230 9.007056 11.699393 12.524014 15.061729 9.449236
[8] 14.378682 10.251995 9.862443
> # Perform the t-test on them.
> # The p-value is the chance that the difference between the means would be
> # at least this large if the null hypothesis were true.
> t.test(x, y, var.equal=TRUE)
Two Sample t-test
data: x and y
t = -0.3273, df = 18, p-value = 0.7472
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.346419 1.713897
sample estimates:
mean of x mean of y
10.86077 11.17703
Variance Ratio(F) = (Between conditions variance)/(Error variance)
Assume samples come from normally distributed populations with equal variances.
F statistic depends on the dof for the two kinds of variances. These should always be given with the F value:
\(F(df_{bt.conds}, df_{error})\) = calculated value
The p-value is then found from a table or calculation.
Between conditions variance = \(\sum_i n_i(\bar{Y}_{i} - \bar{Y})^2/(K-1)\); Error variance = \(\sum_{ij} (Y_{ij}-\bar{Y}_{i})^2/(N-K)\), where \(K\) is the number of conditions and \(N\) the total number of observations.
For two conditions it is mathematically equivalent to the t-test (\(F = t^2\)).
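As a quick check of that equivalence, here is a minimal sketch; the group sizes, means and seed are arbitrary choices made for the illustration:
# Sketch: with two conditions the one-way ANOVA F-test and the pooled
# two-sample t-test agree (F = t^2 and the p-values are identical).
set.seed(1)                                        # arbitrary seed
g1 <- rnorm(10, 10, 2)
g2 <- rnorm(10, 12, 2)
two <- data.frame(group = rep(c("g1", "g2"), each = 10), value = c(g1, g2))
tt <- t.test(g1, g2, var.equal = TRUE)
ft <- anova(aov(value ~ group, data = two))
c(t.squared = unname(tt$statistic)^2, F = ft[1, "F value"])    # equal
c(t.p = tt$p.value, F.p = ft[1, "Pr(>F)"])                     # equal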
> # Number of points in each data set.
> ni <- 4
> # Generate four sets of data with different means.
> a <- rnorm(ni, 10, 2)
> b <- rnorm(ni, 12, 2)
> c <- rnorm(ni, 14, 2)
> d <- rnorm(ni, 16, 2)
> # Merge them all together into a data frame.
> values <- c(a, b, c, d)
> letters <- c(rep('a', ni), rep('b', ni), rep('c', ni), rep('d', ni))
> df <- data.frame(letter=letters, value=values)
> # Perform an anova analysis.
> fit <- aov(value ~ letter, data=df)
> summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
letter 3 91.902 30.6342 24.829 1.964e-05 ***
Residuals 12 14.805 1.2338
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
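The numbers in that table can be reproduced directly from the between-conditions and error variance formulas above. A minimal sketch, reusing the data frame df built in the transcript:
# Sketch: recompute the aov table entries from the definitions above.
grand.mean <- mean(df$value)
group.means <- tapply(df$value, df$letter, mean)
group.n <- tapply(df$value, df$letter, length)
K <- length(group.means)                                   # number of conditions
N <- nrow(df)                                              # total observations
ss.between <- sum(group.n * (group.means - grand.mean)^2)  # Sum Sq for letter
ss.error <- sum((df$value - group.means[df$letter])^2)     # Sum Sq for Residuals
(ss.between / (K - 1)) / (ss.error / (N - K))              # the F value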
If we find an effect with the F-test then we need to work out where it is coming from. We do this with post-hoc tests. The risk is the increased chance of a type I error.
> th <- TukeyHSD(fit)
> th
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = value ~ letter, data = df)
$letter
diff lwr upr p adj
b-a 0.7224116 -1.609443 3.054266 0.7949923
c-a 4.8536226 2.521768 7.185477 0.0002394
d-a 5.3724864 3.040632 7.704341 0.0000919
c-b 4.1312110 1.799356 6.463066 0.0009963
d-b 4.6500748 2.318220 6.981929 0.0003539
d-c 0.5188638 -1.812991 2.850718 0.9097766
>
> # Now try to do the same thing more manually, for the (d-a) comparison.
> # Variance of residual errors.
> vare <- mean(c(var(a), var(b), var(c), var(d)))
> # Normalize the difference between the means by the estimate of the standard deviation
> # of the means.
> st_range <- (mean(d)-mean(a))/(sqrt(vare/ni))
> # ptukey: ptukey(q, nmeans, df, nranges = 1, lower.tail = TRUE, log.p = FALSE)
> # q - a given studentized range
> # nmeans - the number of samples (i.e. number of means in this case)
> # df - degrees of freedom in the calculation of the stdev.
> # It returns the probability that the studentized range of the sample of means is
> # less than the given value of q.
> # For our example nmeans is clearly 4 (a,b,c and d)
> # And df is 4*(ni-1) because we calculated the variance from 4 sets of samples each of
> # which had (ni-1) degrees of freedom.
> manual <- 1 - ptukey(st_range, 4, 4*(ni-1))
> manual
[1] 9.19243e-05
> abs(th$letter["d-a", "p adj"] - manual) < 1e-5
[1] TRUE
Scheffé's method is very similar to the Tukey method, except that we do not limit ourselves to pairwise comparisons.
Let μ1, ..., μr be the means of some variable in r disjoint populations.
Begin cut and paste from wikipedia: An arbitrary contrast is defined by
\(C = \sum_{i=1}^r c_i\mu_i\)
where
\(\sum_{i=1}^r c_i = 0.\)
If μ1, ..., μr are all equal to each other, then all contrasts among them are 0. Otherwise, some contrasts differ from 0.
Technically there are infinitely many contrasts. The simultaneous confidence coefficient is exactly 1 − α, whether the factor level sample sizes are equal or unequal. (Usually only a finite number of comparisons are of interest. In this case, Scheffé's method is typically quite conservative, and the experimental error rate will generally be much smaller than α.)
We estimate C by
\(\hat{C} = \sum_{i=1}^r c_i\bar{Y}_i\)
for which the estimated variance is
\(s_{\hat{C}}^2 = \hat{\sigma}_e^2\sum_{i=1}^r \frac{c_i^2}{n_i}.\)
It can be shown that the probability is 1 − α that all confidence limits of the type
\(\hat{C}\pm s_{\hat{C}}\sqrt{\left(r-1\right)F_{\alpha ; r-1 ; N-r}}\)
are simultaneously correct.
> # Perform a Scheffe test to see if the average of samples a and b is significantly
> # different from the average of c and d.
> cs <- c(-0.5, -0.5, 0.5, 0.5)
> means <- tapply(df$value, df$letter, mean)
> c.hat <- sum(cs*means)
> s2 <- vare/ni * sum(cs^2)
> # Number of groups and total number of observations.
> K <- 4; N <- K*ni
> # Following the confidence-limit formula above, the F statistic for a single
> # contrast is c.hat^2/s2 divided by (K-1), with (K-1, N-K) degrees of freedom.
> fval <- c.hat^2/s2/(K-1)
> fval
[1] 24.40183
> # The chance of any contrast deviating by this much, if all samples were from
> # the same normal population, is 1 - pf(fval, K-1, N-K); here it is far below
> # 0.05, so the (a, b) versus (c, d) difference is significant.
> 1 - pf(fval, K-1, N-K) < 0.05
[1] TRUE
A population can be divided in c categories. The chance of an observation being of category i is \(p_i\). We make N observations and find \(n_i\) in category i.
The statistic \(\sum_i{\frac{(n_i-N p_i)^2}{N p_i}}\) approximately follows a chi-squared distribution with c-1 degrees of freedom, as long as \(N p_i\) is greater than 5 for all categories.
Possible Uses:
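One such use is a goodness-of-fit test of observed counts against hypothesised category probabilities. A minimal sketch using R's chisq.test; the counts and probabilities below are made up for the illustration:
# Sketch: goodness-of-fit test. chisq.test computes the statistic above and
# compares it with a chi-squared distribution on c-1 degrees of freedom.
observed <- c(18, 55, 27)                # the n_i; N = 100
expected.p <- c(0.2, 0.5, 0.3)           # the hypothesised p_i
chisq.test(observed, p = expected.p)
# The same thing by hand:
N <- sum(observed)
stat <- sum((observed - N * expected.p)^2 / (N * expected.p))
1 - pchisq(stat, df = length(observed) - 1)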
We have a sample of (x, y) pairs. The Pearson correlation coefficient is:
\(r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \sum_i (Y_i - \bar{Y})^2}}\)
i.e. the sum of the XY products of deviations divided by the square root of the product of the X and Y sums of squares.
r=1 is perfect positive correlation, r=0 is uncorrelated, r=-1 is perfect negative correlation. It is the slope of the line of best fit through the standardized variables.
The significance of r can be tested with a t-distribution: under the null hypothesis of no correlation, \(t = r\sqrt{\frac{n-2}{1 - r^2}}\) follows a t-distribution with n-2 degrees of freedom.
> # Create some correlated data.
> n <- 10
> a <- rnorm(n, 12, 1)
> b <- 3*a + 6 + rnorm(n, 0, 3)
> r <- cor(a, b)
> r
[1] 0.5985606
> t <- r * sqrt((n-2)/(1-r^2))
> t
[1] 2.113384
> # The probability of a correlation being this far from 0 by chance is:
> 2 * (1 - pt(t, n-2))
[1] 0.06751678
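The same test is packaged as cor.test; run on the same a and b it reports the identical t statistic, n-2 degrees of freedom and two-sided p-value. A minimal sketch:
# Sketch: cor.test does the whole calculation in one call (Pearson by default).
ct <- cor.test(a, b)
c(ct$estimate, ct$statistic, ct$p.value)   # r, t and the two-sided p-value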