5.7: Test of a Single Variance - Mathematics

A test of a single variance assumes that the underlying distribution is normal. The test statistic is:

[chi^{2} = frac{(n-1)s^{2}}{sigma^{2}} label{test}]


  • (n) is the the total number of data
  • (s^{2}) is the sample variance
  • (sigma^{2}) is the population variance

You may think of (s) as the random variable in this test. The number of degrees of freedom is (df = n - 1). A test of a single variance may be right-tailed, left-tailed, or two-tailed. The next example will show you how to set up the null and alternative hypotheses. The null and alternative hypotheses contain statements about the population variance.

Example (PageIndex{1})

Math instructors are not only interested in how their students do on exams, on average, but how the exam scores vary. To many instructors, the variance (or standard deviation) may be more important than the average.

Suppose a math instructor believes that the standard deviation for his final exam is five points. One of his best students thinks otherwise. The student claims that the standard deviation is more than five points. If the student were to conduct a hypothesis test, what would the null and alternative hypotheses be?


Even though we are given the population standard deviation, we can set up the test using the population variance as follows.

  • (H_{0}: sigma^{2} = 5^{2})
  • (H_{a}: sigma^{2} > 5^{2})

Exercise (PageIndex{1})

A SCUBA instructor wants to record the collective depths each of his students dives during their checkout. He is interested in how the depths vary, even though everyone should have been at the same depth. He believes the standard deviation is three feet. His assistant thinks the standard deviation is less than three feet. If the instructor were to conduct a test, what would the null and alternative hypotheses be?

  • (H_{0}: sigma^{2} = 3^{2})
  • (H_{a}: sigma^{2} > 3^{2})

Example (PageIndex{2})

With individual lines at its various windows, a post office finds that the standard deviation for normally distributed waiting times for customers on Friday afternoon is 7.2 minutes. The post office experiments with a single, main waiting line and finds that for a random sample of 25 customers, the waiting times for customers have a standard deviation of 3.5 minutes.

With a significance level of 5%, test the claim that a single line causes lower variation among waiting times (shorter waiting times) for customers.


Since the claim is that a single line causes less variation, this is a test of a single variance. The parameter is the population variance, (sigma^{2}), or the population standard deviation, (sigma).

Random Variable: The sample standard deviation, (s), is the random variable. Let (s = ext{standard deviation for the waiting times}).

  • (H_{0}: sigma^{2} = 7.2^{2})
  • (H_{a}: sigma^{2} < 7.2^{2})

The word "less" tells you this is a left-tailed test.

Distribution for the test: (chi^{2}_{24}), where:

  • (n = ext{the number of customers sampled})
  • (df = n - 1 = 25 - 1 = 24)

Calculate the test statistic (Equation ef{test}):

[chi^{2} = frac{(n-1)s^{2}}{sigma^{2}} = frac{(25-1)(3.5)^{2}}{7.2^{2}} = 5.67 onumber]

where (n = 25), (s = 3.5), and (sigma = 7.2).


Figure (PageIndex{1}).

Probability statement: (p ext{-value} = P(chi^{2} < 5.67) = 0.000042)

Compare (alpha) and the (p ext{-value}):

[alpha = 0.05 (p ext{-value} = 0.000042 alpha > p ext{-value} onumber]

Make a decision: Since (alpha > p ext{-value}), reject (H_{0}). This means that you reject (sigma^{2} = 7.2^{2}). In other words, you do not think the variation in waiting times is 7.2 minutes; you think the variation in waiting times is less.

Conclusion: At a 5% level of significance, from the data, there is sufficient evidence to conclude that a single line causes a lower variation among the waiting times or with a single line, the customer waiting times vary less than 7.2 minutes.

In2nd DISTR, use7:χ2cdf. The syntax is(lower, upper, df)for the parameter list. For Example,χ2cdf(-1E99,5.67,24). The (p ext{-value} = 0.000042).

Exercise (PageIndex{2})

The FCC conducts broadband speed tests to measure how much data per second passes between a consumer’s computer and the internet. As of August of 2012, the standard deviation of Internet speeds across Internet Service Providers (ISPs) was 12.2 percent. Suppose a sample of 15 ISPs is taken, and the standard deviation is 13.2. An analyst claims that the standard deviation of speeds is more than what was reported. State the null and alternative hypotheses, compute the degrees of freedom, the test statistic, sketch the graph of the p-value, and draw a conclusion. Test at the 1% significance level.

  • (H_{0}: sigma^{2} = 12.2^{2})
  • (H_{a}: sigma^{2} > 12.2^{2})

In2nd DISTR, use7:χ2cdf. The syntax is(lower, upper, df)for the parameter list.χ2cdf(16.39,10^99,14). The (p ext{-value} = 0.2902).

(df = 14)

[ ext{chi}^{2} ext{test statistic} = 16.39 onumber]

Figure (PageIndex{2}).

The (p ext{-value}) is (0.2902), so we decline to reject the null hypothesis. There is not enough evidence to suggest that the variance is greater than (12.2^{2}).


To test variability, use the chi-square test of a single variance. The test may be left-, right-, or two-tailed, and its hypotheses are always expressed in terms of the variance (or standard deviation).

Formula Review

(chi^{2} = frac{(n-1) cdot s^{2}}{sigma^{2}}) Test of a single variance statistic where:

(n: ext{sample size})

(s: ext{sample standard deviation})

(sigma: ext{population standard deviation})

(df = n – 1 ext{Degrees of freedom})

Test of a Single Variance

  • Use the test to determine variation.
  • The degrees of freedom is the ( ext{number of samples} - 1).
  • The test statistic is (frac{(n-1) cdot s^{2}}{sigma^{2}}), where (n = ext{the total number of data}), (s^{2} = ext{sample variance}), and (sigma^{2} = ext{population variance}).
  • The test may be left-, right-, or two-tailed.

Use the following information to answer the next three exercises: An archer’s standard deviation for his hits is six (data is measured in distance from the center of the target). An observer claims the standard deviation is less.

Exercise (PageIndex{3})

What type of test should be used?


a test of a single variance

Exercise (PageIndex{4})

State the null and alternative hypotheses.

Exercise (PageIndex{5})

Is this a right-tailed, left-tailed, or two-tailed test?


a left-tailed test

Use the following information to answer the next three exercises: The standard deviation of heights for students in a school is 0.81. A random sample of 50 students is taken, and the standard deviation of heights of the sample is 0.96. A researcher in charge of the study believes the standard deviation of heights for the school is greater than 0.81.

Exercise (PageIndex{6})

What type of test should be used?

Exercise (PageIndex{5})

State the null and alternative hypotheses.


(H_{0}: sigma^{2} = 0.81^{2});

(H_{a}: sigma^{2} > 0.81^{2})

Exercise (PageIndex{6})

(df =) ________

Use the following information to answer the next four exercises: The average waiting time in a doctor’s office varies. The standard deviation of waiting times in a doctor’s office is 3.4 minutes. A random sample of 30 patients in the doctor’s office has a standard deviation of waiting times of 4.1 minutes. One doctor believes the variance of waiting times is greater than originally thought.

Exercise (PageIndex{7})

What type of test should be used?


a test of a single variance

Exercise (PageIndex{8})

What is the test statistic?

Exercise (PageIndex{9})

What is the (p ext{-value})?



Exercise (PageIndex{10})

What can you conclude at the 5% significance level?

Standard Deviation Calculator

Please provide numbers separated by comma to calculate the standard deviation, variance, mean, sum, and margin of error.

Standard deviation in statistics, typically denoted by σ, is a measure of variation or dispersion (refers to a distribution's extent of stretching or squeezing) between values in a set of data. The lower the standard deviation, the closer the data points tend to be to the mean (or expected value), &mu. Conversely, a higher standard deviation indicates a wider range of values. Similarly to other mathematical and statistical concepts, there are many different situations in which standard deviation can be used, and thus many different equations. In addition to expressing population variability, the standard deviation is also often used to measure statistical results such as the margin of error. When used in this manner, standard deviation is often called the standard error of the mean, or standard error of the estimate with regard to a mean. The calculator above computes population standard deviation and sample standard deviation, as well as confidence interval approximations.

Population Standard Deviation

The population standard deviation, the standard definition of σ, is used when an entire population can be measured, and is the square root of the variance of a given data set. In cases where every member of a population can be sampled, the following equation can be used to find the standard deviation of the entire population:

For those unfamiliar with summation notation, the equation above may seem daunting, but when addressed through its individual components, this summation is not particularly complicated. The i=1 in the summation indicates the starting index, i.e. for the data set 1, 3, 4, 7, 8, i=1 would be 1, i=2 would be 3, and so on. Hence the summation notation simply means to perform the operation of (xi - &mu 2 ) on each value through N, which in this case is 5 since there are 5 values in this data set.

EX: &mu = (1+3+4+7+8) / 5 = 4.6
σ = &radic [(1 - 4.6) 2 + (3 - 4.6) 2 + . + (8 - 4.6) 2 )]/5
σ = &radic (12.96 + 2.56 + 0.36 + 5.76 + 11.56)/5 = 2.577

Sample Standard Deviation

In many cases, it is not possible to sample every member within a population, requiring that the above equation be modified so that the standard deviation can be measured through a random sample of the population being studied. A common estimator for σ is the sample standard deviation, typically denoted by s. It is worth noting that there exist many different equations for calculating sample standard deviation since unlike sample mean, sample standard deviation does not have any single estimator that is unbiased, efficient, and has a maximum likelihood. The equation provided below is the "corrected sample standard deviation." It is a corrected version of the equation obtained from modifying the population standard deviation equation by using the sample size as the size of the population, which removes some of the bias in the equation. Unbiased estimation of standard deviation however, is highly involved and varies depending on distribution. As such, the "corrected sample standard deviation" is the most commonly used estimator for population standard deviation, and is generally referred to as simply the "sample standard deviation." It is a much better estimate than its uncorrected version, but still has significant bias for small sample sizes (N<10).

Refer to the "Population Standard Deviation" section for an example on how to work with summations. The equation is essentially the same excepting the N-1 term in the corrected sample deviation equation, and the use of sample values.

Applications of Standard Deviation

Standard deviation is widely used in experimental and industrial settings to test models against real-world data. An example of this in industrial applications is quality control for some product. Standard deviation can be used to calculate a minimum and maximum value within which some aspect of the product should fall some high percentage of the time. In cases where values fall outside the calculated range, it may be necessary to make changes to the production process to ensure quality control.

Standard deviation is also used in weather to determine differences in regional climate. Imagine two cities, one on the coast and one deep inland, that have the same mean temperature of 75°F. While this may prompt the belief that the temperatures of these two cities are virtually the same, the reality could be masked if only the mean is addressed and the standard deviation ignored. Coastal cities tend to have far more stable temperatures due to regulation by large bodies of water, since water has a higher heat capacity than land essentially, this makes water far less susceptible to changes in temperature, and coastal areas remain warmer in winter, and cooler in summer due to the amount of energy required to change the temperature of water. Hence, while the coastal city may have temperature ranges between 60°F and 85°F over a given period of time to result in a mean of 75°F, an inland city could have temperatures ranging from 30°F to 110°F to result in the same mean.

Another area in which standard deviation is largely used is finance, where it is often used to measure the associated risk in price fluctuations of some asset or portfolio of assets. The use of standard deviation in these cases provides an estimate of the uncertainty of future returns on a given investment. For example, in comparing stock A that has an average return of 7% with a standard deviation of 10% against stock B, that has the same average return but a standard deviation of 50%, the first stock would clearly be the safer option, since standard deviation of stock B is significantly larger, for the exact same return. That is not to say that stock A is definitively a better investment option in this scenario, since standard deviation can skew the mean in either direction. While Stock A has a higher probability of an average return closer to 7%, Stock B can potentially provide a significantly larger return (or loss).

These are only a few examples of how one might use standard deviation, but many more exist. Generally, calculating standard deviation is valuable any time it is desired to know how far from the mean a typical value from a distribution can be.

Analysis of Variance

Ronald N. Forthofer , . Mike Hernandez , in Biostatistics (Second Edition) , 2007


In this chapter we presented several basic models of analysis of variance . The one-way ANOVA is used to analyze data from a completely randomized experimental design. The two-way ANOVA can be used for a randomized block design as well as for a two-factor design with interaction. To use these analytical methods properly, we must be aware of how the data were collected and make sure that the data meet the ANOVA assumptions. Finally, we discussed the problems and methods for analyzing unbalanced data. In the next chapter, we will expand the linear model to regression models.


This definition encompasses random variables that are generated by processes that are discrete, continuous, neither, or mixed. The variance can also be thought of as the covariance of a random variable with itself:

In other words, the variance of X is equal to the mean of the square of X minus the square of the mean of X . This equation should not be used for computations using floating point arithmetic, because it suffers from catastrophic cancellation if the two components of the equation are similar in magnitude. For other numerically stable alternatives, see Algorithms for calculating variance.

Discrete random variable Edit

(When such a discrete weighted variance is specified by weights whose sum is not 1, then one divides by the sum of the weights.)

Absolutely continuous random variable Edit

In these formulas, the integrals with respect to d x and d F ( x ) are Lebesgue and Lebesgue–Stieltjes integrals, respectively.

Exponential distribution Edit

The exponential distribution with parameter λ is a continuous distribution whose probability density function is given by

on the interval [0, ∞) . Its mean can be shown to be

Using integration by parts and making use of the expected value already calculated, we have:

Thus, the variance of X is given by

Fair dice Edit

A fair six-sided dice can be modeled as a discrete random variable, X , with outcomes 1 through 6, each with equal probability 1/6. The expected value of X is ( 1 + 2 + 3 + 4 + 5 + 6 ) / 6 = 7 / 2. Therefore, the variance of X is

The general formula for the variance of the outcome, X , of an n -sided die is

Commonly used probability distributions Edit

The following table lists the variance for some commonly used probability distributions.

Name of the probability distribution Probability distribution function Mean Variance
Binomial distribution Pr ( X = k ) = ( n k ) p k ( 1 − p ) n − k >p^(1-p)^> n p n p ( 1 − p )
Geometric distribution Pr ( X = k ) = ( 1 − p ) k − 1 p p> 1 p


( 1 − p ) p 2 >>>
Normal distribution f ( x ∣ μ , σ 2 ) = 1 2 π σ 2 e − ( x − μ ) 2 2 σ 2 ight)=>>>e^<-><2sigma ^<2>>>>> μ σ 2 >
Uniform distribution (continuous) f ( x ∣ a , b ) = < 1 b − a for a ≤ x ≤ b , 0 for x < a or x > b >&< ext>aleq xleq b,[3pt]0&< ext>x<a< ext< or >>x>bend>> a + b 2 <2>>> ( b − a ) 2 12 ><12>>>
Exponential distribution f ( x ∣ λ ) = λ e − λ x > 1 λ >> 1 λ 2 >>>
Poisson distribution f ( x ∣ λ ) = e − λ λ x x ! lambda ^>>> λ λ

Basic properties Edit

Variance is non-negative because the squares are positive or zero:

The variance of a constant is zero.

Conversely, if the variance of a random variable is 0, then it is almost surely a constant. That is, it always has the same value:

Variance is invariant with respect to changes in a location parameter. That is, if a constant is added to all values of the variable, the variance is unchanged:

If all values are scaled by a constant, the variance is scaled by the square of that constant:

The variance of a sum of two random variables is given by

These results lead to the variance of a linear combination as:

Issues of finiteness Edit

If a distribution does not have a finite expected value, as is the case for the Cauchy distribution, then the variance cannot be finite either. However, some distributions may not have a finite variance, despite their expected value being finite. An example is a Pareto distribution whose index k satisfies 1 < k ≤ 2.

Sum of uncorrelated variables (Bienaymé formula) Edit

One reason for the use of the variance in preference to other measures of dispersion is that the variance of the sum (or the difference) of uncorrelated random variables is the sum of their variances:

This statement is called the Bienaymé formula [2] and was discovered in 1853. [3] [4] It is often made with the stronger condition that the variables are independent, but being uncorrelated suffices. So if all the variables have the same variance σ 2 , then, since division by n is a linear transformation, this formula immediately implies that the variance of their mean is

That is, the variance of the mean decreases when n increases. This formula for the variance of the mean is used in the definition of the standard error of the sample mean, which is used in the central limit theorem.

To prove the initial statement, it suffices to show that

The general result then follows by induction. Starting with the definition,

Using the linearity of the expectation operator and the assumption of independence (or uncorrelatedness) of X and Y, this further simplifies as follows:

Sum of correlated variables Edit

With correlation and fixed sample size Edit

In general, the variance of the sum of n variables is the sum of their covariances:

(Note: The second equality comes from the fact that Cov(Xi,Xi) = Var(Xi) .)

Here, Cov ⁡ ( ⋅ , ⋅ ) (cdot ,cdot )> is the covariance, which is zero for independent random variables (if it exists). The formula states that the variance of a sum is equal to the sum of all elements in the covariance matrix of the components. The next expression states equivalently that the variance of the sum is the sum of the diagonal of covariance matrix plus two times the sum of its upper triangular elements (or its lower triangular elements) this emphasizes that the covariance matrix is symmetric. This formula is used in the theory of Cronbach's alpha in classical test theory.

So if the variables have equal variance σ 2 and the average correlation of distinct variables is ρ, then the variance of their mean is

This implies that the variance of the mean increases with the average of the correlations. In other words, additional correlated observations are not as effective as additional independent observations at reducing the uncertainty of the mean. Moreover, if the variables have unit variance, for example if they are standardized, then this simplifies to

This formula is used in the Spearman–Brown prediction formula of classical test theory. This converges to ρ if n goes to infinity, provided that the average correlation remains constant or converges too. So for the variance of the mean of standardized variables with equal correlations or converging average correlation we have

Therefore, the variance of the mean of a large number of standardized variables is approximately equal to their average correlation. This makes clear that the sample mean of correlated variables does not generally converge to the population mean, even though the law of large numbers states that the sample mean will converge for independent variables.

I.i.d. with random sample size Edit

There are cases when a sample is taken without knowing, in advance, how many observations will be acceptable according to some criterion. In such cases, the sample size N is a random variable whose variation adds to the variation of X, such that,

Var(∑X) = E(N)Var(X) + Var(N)E 2 (X). [5]

If N has a Poisson distribution, then E(N) = Var(N) with estimator N = n. So, the estimator of Var(∑X) becomes nS 2 X + n X 2 giving

standard error( X) = √[(S 2 X + X 2 )/n].

Matrix notation for the variance of a linear combination Edit

This implies that the variance of the mean can be written as (with a column vector of ones)

Weighted sum of variables Edit

The scaling property and the Bienaymé formula, along with the property of the covariance Cov(aX, bY) = ab Cov(X, Y) jointly imply that

This implies that in a weighted sum of variables, the variable with the largest weight will have a disproportionally large weight in the variance of the total. For example, if X and Y are uncorrelated and the weight of X is two times the weight of Y, then the weight of the variance of X will be four times the weight of the variance of Y.

The expression above can be extended to a weighted sum of multiple variables:

Product of independent variables Edit

If two variables X and Y are independent, the variance of their product is given by [7]

Equivalently, using the basic properties of expectation, it is given by

Product of statistically dependent variables Edit

In general, if two variables are statistically dependent, the variance of their product is given by:

Decomposition Edit

A similar formula is applied in analysis of variance, where the corresponding formula is

This can also be derived from the additivity of variances, since the total (observed) score is the sum of the predicted score and the error score, where the latter two are uncorrelated.

Similar decompositions are possible for the sum of squared deviations (sum of squares, S S >> ):

Calculation from the CDF Edit

The population variance for a non-negative random variable can be expressed in terms of the cumulative distribution function F using

This expression can be used to calculate the variance in situations where the CDF, but not the density, can be conveniently expressed.

Characteristic property Edit

Units of measurement Edit

Unlike expected absolute deviation, the variance of a variable has units that are the square of the units of the variable itself. For example, a variable measured in meters will have a variance measured in meters squared. For this reason, describing data sets via their standard deviation or root mean square deviation is often preferred over using the variance. In the dice example the standard deviation is √ 2.9 ≈ 1.7 , slightly larger than the expected absolute deviation of 1.5.

The standard deviation and the expected absolute deviation can both be used as an indicator of the "spread" of a distribution. The standard deviation is more amenable to algebraic manipulation than the expected absolute deviation, and, together with variance and its generalization covariance, is used frequently in theoretical statistics however the expected absolute deviation tends to be more robust as it is less sensitive to outliers arising from measurement anomalies or an unduly heavy-tailed distribution.

The delta method uses second-order Taylor expansions to approximate the variance of a function of one or more random variables: see Taylor expansions for the moments of functions of random variables. For example, the approximate variance of a function of one variable is given by

provided that f is twice differentiable and that the mean and variance of X are finite.

Real-world observations such as the measurements of yesterday's rain throughout the day typically cannot be complete sets of all possible observations that could be made. As such, the variance calculated from the finite set will in general not match the variance that would have been calculated from the full population of possible observations. This means that one estimates the mean and variance that would have been calculated from an omniscient set of observations by using an estimator equation. The estimator is a function of the sample of n observations drawn without observational bias from the whole population of potential observations. In this example that sample would be the set of actual measurements of yesterday's rainfall from available rain gauges within the geography of interest.

The simplest estimators for population mean and population variance are simply the mean and variance of the sample, the sample mean and (uncorrected) sample variance – these are consistent estimators (they converge to the correct value as the number of samples increases), but can be improved. Estimating the population variance by taking the sample's variance is close to optimal in general, but can be improved in two ways. Most simply, the sample variance is computed as an average of squared deviations about the (sample) mean, by dividing by n. However, using values other than n improves the estimator in various ways. Four common values for the denominator are n, n − 1, n + 1, and n − 1.5: n is the simplest (population variance of the sample), n − 1 eliminates bias, n + 1 minimizes mean squared error for the normal distribution, and n − 1.5 mostly eliminates bias in unbiased estimation of standard deviation for the normal distribution.

Firstly, if the omniscient mean is unknown (and is computed as the sample mean), then the sample variance is a biased estimator: it underestimates the variance by a factor of (n − 1) / n correcting by this factor (dividing by n − 1 instead of n) is called Bessel's correction. The resulting estimator is unbiased, and is called the (corrected) sample variance or unbiased sample variance. For example, when n = 1 the variance of a single observation about the sample mean (itself) is obviously zero regardless of the population variance. If the mean is determined in some other way than from the same samples used to estimate the variance then this bias does not arise and the variance can safely be estimated as that of the samples about the (independently known) mean.

Secondly, the sample variance does not generally minimize mean squared error between sample variance and population variance. Correcting for bias often makes this worse: one can always choose a scale factor that performs better than the corrected sample variance, though the optimal scale factor depends on the excess kurtosis of the population (see mean squared error: variance), and introduces bias. This always consists of scaling down the unbiased estimator (dividing by a number larger than n − 1), and is a simple example of a shrinkage estimator: one "shrinks" the unbiased estimator towards zero. For the normal distribution, dividing by n + 1 (instead of n − 1 or n) minimizes mean squared error. The resulting estimator is biased, however, and is known as the biased sample variation.

Population variance Edit

In general, the population variance of a finite population of size N with values xi is given by

where the population mean is

The population variance can also be computed using

The population variance matches the variance of the generating probability distribution. In this sense, the concept of population can be extended to continuous random variables with infinite populations.

Sample variance Edit

Biased sample variance Edit

In many practical situations, the true variance of a population is not known a priori and must be computed somehow. When dealing with extremely large populations, it is not possible to count every object in the population, so the computation must be performed on a sample of the population. [9] Sample variance can also be applied to the estimation of the variance of a continuous distribution from a sample of that distribution.

We take a sample with replacement of n values Y1, . Yn from the population, where n < N, and estimate the variance on the basis of this sample. [10] Directly taking the variance of the sample data gives the average of the squared deviations:

Unbiased sample variance Edit

Correcting for this bias yields the unbiased sample variance, denoted s 2 > :

Either estimator may be simply referred to as the sample variance when the version can be determined by context. The same proof is also applicable for samples taken from a continuous probability distribution.

The use of the term n − 1 is called Bessel's correction, and it is also used in sample covariance and the sample standard deviation (the square root of variance). The square root is a concave function and thus introduces negative bias (by Jensen's inequality), which depends on the distribution, and thus the corrected sample standard deviation (using Bessel's correction) is biased. The unbiased estimation of standard deviation is a technically involved problem, though for the normal distribution using the term n − 1.5 yields an almost unbiased estimator.

The unbiased sample variance is a U-statistic for the function ƒ(y1, y2) = (y1y2) 2 /2, meaning that it is obtained by averaging a 2-sample statistic over 2-element subsets of the population.

The positive square root of Variance is called Standard deviation . That is, standard deviation is the positive square root of the mean of the squares of deviations of the given values from their mean. It is denoted by σ.

Standard deviation gives a clear idea about how far the values are spreading or deviating from the mean.

Calculating Standard Deviation from Variance

In finance and in most other disciplines, standard deviation is used more frequently than variance. Both are measures of dispersion or volatility in a data set and they are closely related.

Standard deviation is the square root of variance.

And vice versa, variance is standard deviation squared.

To calculate standard deviation from variance, take the square root.

In our example, variance is 200, therefore standard deviation is square root of 200, which is 14.14.

To calculate standard deviation of a data set, first calculate the variance and then the square root of that.


The bootstrap was published by Bradley Efron in "Bootstrap methods: another look at the jackknife" (1979), [5] [6] [7] inspired by earlier work on the jackknife. [8] [9] [10] Improved estimates of the variance were developed later. [11] [12] A Bayesian extension was developed in 1981. [13] The bias-corrected and accelerated (BCa) bootstrap was developed by Efron in 1987, [14] and the ABC procedure in 1992. [15]

The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modelled by resampling the sample data and performing inference about a sample from resampled data (resampled → sample). As the population is unknown, the true error in a sample statistic against its population value is unknown. In bootstrap-resamples, the 'population' is in fact the sample, and this is known hence the quality of inference of the 'true' sample from resampled data (resampled → sample) is measurable.

More formally, the bootstrap works by treating inference of the true probability distribution J, given the original data, as being analogous to inference of the empirical distribution Ĵ, given the resampled data. The accuracy of inferences regarding Ĵ using the resampled data can be assessed because we know Ĵ. If Ĵ is a reasonable approximation to J, then the quality of inference on J can in turn be inferred.

As an example, assume we are interested in the average (or mean) height of people worldwide. We cannot measure all the people in the global population, so instead we sample only a tiny part of it, and measure that. Assume the sample is of size N that is, we measure the heights of N individuals. From that single sample, only one estimate of the mean can be obtained. In order to reason about the population, we need some sense of the variability of the mean that we have computed. The simplest bootstrap method involves taking the original data set of heights, and, using a computer, sampling from it to form a new sample (called a 'resample' or bootstrap sample) that is also of size N. The bootstrap sample is taken from the original by using sampling with replacement (e.g. we might 'resample' 5 times from [1,2,3,4,5] and get [2,5,4,4,1]), so, assuming N is sufficiently large, for all practical purposes there is virtually zero probability that it will be identical to the original "real" sample. This process is repeated a large number of times (typically 1,000 or 10,000 times), and for each of these bootstrap samples we compute its mean (each of these are called bootstrap estimates). We now can create a histogram of bootstrap means. This histogram provides an estimate of the shape of the distribution of the sample mean from which we can answer questions about how much the mean varies across samples. (The method here, described for the mean, can be applied to almost any other statistic or estimator.)

Advantages Edit

A great advantage of bootstrap is its simplicity. It is a straightforward way to derive estimates of standard errors and confidence intervals for complex estimators of the distribution, such as percentile points, proportions, odds ratio, and correlation coefficients. Bootstrap is also an appropriate way to control and check the stability of the results. Although for most problems it is impossible to know the true confidence interval, bootstrap is asymptotically more accurate than the standard intervals obtained using sample variance and assumptions of normality. [16] Bootstrapping is also a convenient method that avoids the cost of repeating the experiment to get other groups of sample data.

Disadvantages Edit

Although bootstrapping is (under some conditions) asymptotically consistent, it does not provide general finite-sample guarantees. The result may depend on the representative sample. The apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g. independence of samples) where these would be more formally stated in other approaches. Also, bootstrapping can be time-consuming.

Recommendations Edit

Scholars have recommended more bootstrap samples as available computing power has increased. If the results may have substantial real-world consequences, then one should use as many samples as is reasonable, given available computing power and time. Increasing the number of samples cannot increase the amount of information in the original data it can only reduce the effects of random sampling errors which can arise from a bootstrap procedure itself. Moreover, there is evidence that numbers of samples greater than 100 lead to negligible improvements in the estimation of standard errors. [17] In fact, according to the original developer of the bootstrapping method, even setting the number of samples at 50 is likely to lead to fairly good standard error estimates. [18]

Adèr et al. recommend the bootstrap procedure for the following situations: [19]

  • When the theoretical distribution of a statistic of interest is complicated or unknown. Since the bootstrapping procedure is distribution-independent it provides an indirect method to assess the properties of the distribution underlying the sample and the parameters of interest that are derived from this distribution.
  • When the sample size is insufficient for straightforward statistical inference. If the underlying distribution is well-known, bootstrapping provides a way to account for the distortions caused by the specific sample that may not be fully representative of the population.
  • When power calculations have to be performed, and a small pilot sample is available. Most power and sample size calculations are heavily dependent on the standard deviation of the statistic of interest. If the estimate used is incorrect, the required sample size will also be wrong. One method to get an impression of the variation of the statistic is to use a small pilot sample and perform bootstrapping on it to get impression of the variance.

However, Athreya has shown [20] that if one performs a naive bootstrap on the sample mean when the underlying population lacks a finite variance (for example, a power law distribution), then the bootstrap distribution will not converge to the same limit as the sample mean. As a result, confidence intervals on the basis of a Monte Carlo simulation of the bootstrap could be misleading. Athreya states that "Unless one is reasonably sure that the underlying distribution is not heavy tailed, one should hesitate to use the naive bootstrap".

In univariate problems, it is usually acceptable to resample the individual observations with replacement ("case resampling" below) unlike subsampling, in which resampling is without replacement and is valid under much weaker conditions compared to the bootstrap. In small samples, a parametric bootstrap approach might be preferred. For other problems, a smooth bootstrap will likely be preferred.

For regression problems, various other alternatives are available. [21]

Case resampling Edit

Bootstrap is generally useful for estimating the distribution of a statistic (e.g. mean, variance) without using normal theory (e.g. z-statistic, t-statistic). Bootstrap comes in handy when there is no analytical form or normal theory to help estimate the distribution of the statistics of interest, since bootstrap methods can apply to most random quantities, e.g., the ratio of variance and mean. There are at least two ways of performing case resampling.

  1. The Monte Carlo algorithm for case resampling is quite simple. First, we resample the data with replacement, and the size of the resample must be equal to the size of the original data set. Then the statistic of interest is computed from the resample from the first step. We repeat this routine many times to get a more precise estimate of the Bootstrap distribution of the statistic.
  2. The 'exact' version for case resampling is similar, but we exhaustively enumerate every possible resample of the data set. This can be computationally expensive as there are a total of ( 2 n − 1 n ) = ( 2 n − 1 ) ! n ! ( n − 1 ) ! >=>> different resamples, where n is the size of the data set. Thus for n = 5, 10, 20, 30 there are 126, 92378, 6.89 × 10 10 and 5.91 × 10 16 different resamples respectively. [22]

Estimating the distribution of sample mean Edit

Consider a coin-flipping experiment. We flip the coin and record whether it lands heads or tails. Let X = x1, x2, …, x10 be 10 observations from the experiment. xi = 1 if the i th flip lands heads, and 0 otherwise. From normal theory, we can use t-statistic to estimate the distribution of the sample mean,

Regression Edit

In regression problems, case resampling refers to the simple scheme of resampling individual cases – often rows of a data set. For regression problems, as long as the data set is fairly large, this simple scheme is often acceptable. However, the method is open to criticism [ citation needed ] .

In regression problems, the explanatory variables are often fixed, or at least observed with more control than the response variable. Also, the range of the explanatory variables defines the information available from them. Therefore, to resample cases means that each bootstrap sample will lose some information. As such, alternative bootstrap procedures should be considered.

Bayesian bootstrap Edit

Smooth bootstrap Edit

Under this scheme, a small amount of (usually normally distributed) zero-centered random noise is added onto each resampled observation. This is equivalent to sampling from a kernel density estimate of the data. Assume K to be a symmetric kernel density function with unit variance. The standard kernel estimator f ^ h ( x ) >_(x)> of f ( x ) is

where h is the smoothing parameter. And the corresponding distribution function estimator F ^ h ( x ) >_(x)> is

Parametric bootstrap Edit

Based on the assumption that the original data set is a realization of a random sample from a distribution of a specific parametric type, in this case a parametric model is fitted by parameter θ, often by maximum likelihood, and samples of random numbers are drawn from this fitted model. Usually the sample drawn has the same sample size as the original data. Then the estimate of original function F can be written as F ^ = F θ ^ >=F_>> . This sampling process is repeated many times as for other bootstrap methods. Considering the centered sample mean in this case, the random sample original distribution function F θ > is replaced by a bootstrap random sample with function F θ ^ >> , and the probability distribution of X n ¯ − μ θ >>-mu _< heta >> is approximated by that of X ¯ n ∗ − μ ∗ >_^<*>-mu ^<*>> , where μ ∗ = μ θ ^ =mu _>> , which is the expectation corresponding to F θ ^ >> . [25] The use of a parametric model at the sampling stage of the bootstrap methodology leads to procedures which are different from those obtained by applying basic statistical theory to inference for the same model.

Resampling residuals Edit

Another approach to bootstrapping in regression problems is to resample residuals. The method proceeds as follows.

This scheme has the advantage that it retains the information in the explanatory variables. However, a question arises as to which residuals to resample. Raw residuals are one option another is studentized residuals (in linear regression). Although there are arguments in favour of using studentized residuals in practice, it often makes little difference, and it is easy to compare the results of both schemes.

Gaussian process regression bootstrap Edit

When data are temporally correlated, straightforward bootstrapping destroys the inherent correlations. This method uses Gaussian process regression (GPR) to fit a probabilistic model from which replicates may then be drawn. GPR is a Bayesian non-linear regression method. A Gaussian process (GP) is a collection of random variables, and any finite number of which have a joint Gaussian (normal) distribution. A GP is defined by a mean function and a covariance function, which specify the mean vectors and covariance matrices for each finite collection of the random variables. [26]

Gaussian process prior:

Gaussian process posterior:

According to GP prior, we can get

Let x1 * . xs * be another finite collection of variables, it's obvious that

According to the equations above, the outputs y are also jointly distributed according to a multivariate Gaussian. Thus,

Wild bootstrap Edit

The wild bootstrap, proposed originally by Wu (1986), [27] is suited when the model exhibits heteroskedasticity. The idea is, like the residual bootstrap, to leave the regressors at their sample value, but to resample the response variable based on the residuals values. That is, for each replicate, one computes a new y based on

  • The standard normal distribution
  • A distribution suggested by Mammen (1993). [28]
  • Or the simpler distribution, linked to the Rademacher distribution:

Block bootstrap Edit

The block bootstrap is used when the data, or the errors in a model, are correlated. In this case, a simple case or residual resampling will fail, as it is not able to replicate the correlation in the data. The block bootstrap tries to replicate the correlation by resampling inside blocks of data. The block bootstrap has been used mainly with data correlated in time (i.e. time series) but can also be used with data correlated in space, or among groups (so-called cluster data).

Time series: Simple block bootstrap Edit

In the (simple) block bootstrap, the variable of interest is split into non-overlapping blocks.

Time series: Moving block bootstrap Edit

In the moving block bootstrap, introduced by Künsch (1989), [29] data is split into nb + 1 overlapping blocks of length b: Observation 1 to b will be block 1, observation 2 to b + 1 will be block 2, etc. Then from these nb + 1 blocks, n/b blocks will be drawn at random with replacement. Then aligning these n/b blocks in the order they were picked, will give the bootstrap observations.

This bootstrap works with dependent data, however, the bootstrapped observations will not be stationary anymore by construction. But, it was shown that varying randomly the block length can avoid this problem. [30] This method is known as the stationary bootstrap. Other related modifications of the moving block bootstrap are the Markovian bootstrap and a stationary bootstrap method that matches subsequent blocks based on standard deviation matching.

Time series: Maximum entropy bootstrap Edit

Vinod (2006), [31] presents a method that bootstraps time series data using maximum entropy principles satisfying the Ergodic theorem with mean-preserving and mass-preserving constraints. There is an R package, meboot, [32] that utilizes the method, which has applications in econometrics and computer science.

Cluster data: block bootstrap Edit

Cluster data describes data where many observations per unit are observed. This could be observing many firms in many states, or observing students in many classes. In such cases, the correlation structure is simplified, and one does usually make the assumption that data is correlated within a group/cluster, but independent between groups/clusters. The structure of the block bootstrap is easily obtained (where the block just corresponds to the group), and usually only the groups are resampled, while the observations within the groups are left unchanged. Cameron et al. (2008) discusses this for clustered errors in linear regression. [33]

The bootstrap is a powerful technique although may require substantial computing resources in both time and memory. Some techniques have been developed to reduce this burden. They can generally be combined with many of the different types of Bootstrap schemes and various choices of statistic.

Poisson bootstrap Edit

The ordinary bootstrap requires the random selection of n elements from a list, which is equivalent to drawing from a multinomial distribution. This may require a large number of passes over the data and is challenging to run these computations in parallel. For large values of n, the Poisson bootstrap is an efficient method of generating bootstrapped data sets. [34] When generating a single bootstrap sample, instead of randomly drawing from the sample data with replacement, each data point is assigned a random weight distributed according to the Poisson distribution with λ = 1 . For large sample data, this will approximate random sampling with replacement. This is due to the following approximation:

This method also lends itself well to streaming data and growing data sets, since the total number of samples does not need to be known in advance of beginning to take bootstrap samples.

Bag of Little Bootstraps Edit

The bootstrap distribution of a point estimator of a population parameter has been used to produce a bootstrapped confidence interval for the parameter's true value, if the parameter can be written as a function of the population's distribution.

A Bayesian point estimator and a maximum-likelihood estimator have good performance when the sample size is infinite, according to asymptotic theory. For practical problems with finite samples, other estimators may be preferable. Asymptotic theory suggests techniques that often improve the performance of bootstrapped estimators the bootstrapping of a maximum-likelihood estimator may often be improved using transformations related to pivotal quantities. [36]

The bootstrap distribution of a parameter-estimator has been used to calculate confidence intervals for its population-parameter. [ citation needed ]

Bias, asymmetry, and confidence intervals Edit

  • Bias: The bootstrap distribution and the sample may disagree systematically, in which case bias may occur. If the bootstrap distribution of an estimator is symmetric, then percentile confidence-interval are often used such intervals are appropriate especially for median-unbiased estimators of minimum risk (with respect to an absoluteloss function). Bias in the bootstrap distribution will lead to bias in the confidence-interval. Otherwise, if the bootstrap distribution is non-symmetric, then percentile confidence-intervals are often inappropriate.

Methods for bootstrap confidence intervals Edit

There are several methods for constructing confidence intervals from the bootstrap distribution of a real parameter:

  • Basic bootstrap, [36] also known as the Reverse Percentile Interval. [37] The basic bootstrap is a simple scheme to construct the confidence interval: one simply takes the empirical quantiles from the bootstrap distribution of the parameter (see Davison and Hinkley 1997, equ. 5.6 p. 194):
  • Percentile bootstrap. The percentile bootstrap proceeds in a similar way to the basic bootstrap, using percentiles of the bootstrap distribution, but with a different formula (note the inversion of the left and right quantiles!):
  • Studentized bootstrap. The studentized bootstrap, also called bootstrap-t, is computed analogously to the standard confidence interval, but replaces the quantiles from the normal or student approximation by the quantiles from the bootstrap distribution of the Student's t-test (see Davison and Hinkley 1997, equ. 5.7 p. 194 and Efron and Tibshirani 1993 equ 12.22, p. 160):
  • Bias-corrected bootstrap – adjusts for bias in the bootstrap distribution.
  • Accelerated bootstrap – The bias-corrected and accelerated (BCa) bootstrap, by Efron (1987), [14] adjusts for both bias and skewness in the bootstrap distribution. This approach is accurate in a wide variety of settings, has reasonable computation requirements, and produces reasonably narrow intervals. [citation needed]

Bootstrap hypothesis testing Edit

Smoothed bootstrap Edit

In 1878, Simon Newcomb took observations on the speed of light. [41] The data set contains two outliers, which greatly influence the sample mean. (The sample mean need not be a consistent estimator for any population mean, because no mean need exist for a heavy-tailed distribution.) A well-defined and robust statistic for central tendency is the sample median, which is consistent and median-unbiased for the population median.

The bootstrap distribution for Newcomb's data appears below. A convolution method of regularization reduces the discreteness of the bootstrap distribution by adding a small amount of N(0, σ 2 ) random noise to each bootstrap sample. A conventional choice is σ = 1 / n >> for sample size n. [ citation needed ]

Histograms of the bootstrap distribution and the smooth bootstrap distribution appear below. The bootstrap distribution of the sample-median has only a small number of values. The smoothed bootstrap distribution has a richer support.

In this example, the bootstrapped 95% (percentile) confidence-interval for the population median is (26, 28.5), which is close to the interval for (25.98, 28.46) for the smoothed bootstrap.

Relationship to other resampling methods Edit

The bootstrap is distinguished from:

  • the jackknife procedure, used to estimate biases of sample statistics and to estimate variances, and , in which the parameters (e.g., regression weights, factor loadings) that are estimated in one subsample are applied to another subsample.

Bootstrap aggregating (bagging) is a meta-algorithm based on averaging the results of multiple bootstrap samples.

U-statistics Edit

In situations where an obvious statistic can be devised to measure a required characteristic using only a small number, r, of data items, a corresponding statistic based on the entire sample can be formulated. Given an r-sample statistic, one can create an n-sample statistic by something similar to bootstrapping (taking the average of the statistic over all subsamples of size r). This procedure is known to have certain good properties and the result is a U-statistic. The sample mean and sample variance are of this form, for r = 1 and r = 2.

One-way anova

Use one-way anova when you have one nominal variable and one measurement variable the nominal variable divides the measurements into two or more groups. It tests whether the means of the measurement variable are the same for the different groups.

When to use it

Analysis of variance (anova) is the most commonly used technique for comparing the means of groups of measurement data. There are lots of different experimental designs that can be analyzed with different kinds of anova in this handbook, I describe only one-way anova, nested anova and two-way anova.

In a one-way anova (also known as a one-factor, single-factor, or single-classification anova), there is one measurement variable and one nominal variable. You make multiple observations of the measurement variable for each value of the nominal variable. For example, here are some data on a shell measurement (the length of the anterior adductor muscle scar, standardized by dividing by length I'll call this "AAM length") in the mussel Mytilus trossulus from five locations: Tillamook, Oregon Newport, Oregon Petersburg, Alaska Magadan, Russia and Tvarminne, Finland, taken from a much larger data set used in McDonald et al. (1991).

0.06590.0725 0.0689

The nominal variable is location, with the five values Tillamook, Newport, Petersburg, Magadan, and Tvarminne. There are six to ten observations of the measurement variable, AAM length, from each location.

Null hypothesis

The statistical null hypothesis is that the means of the measurement variable are the same for the different categories of data the alternative hypothesis is that they are not all the same. For the example data set, the null hypothesis is that the mean AAM length is the same at each location, and the alternative hypothesis is that the mean AAM lengths are not all the same.

How the test works

The basic idea is to calculate the mean of the observations within each group, then compare the variance among these means to the average variance within each group. Under the null hypothesis that the observations in the different groups all have the same mean, the weighted among-group variance will be the same as the within-group variance. As the means get further apart, the variance among the means increases. The test statistic is thus the ratio of the variance among means divided by the average variance within groups, or Fs. This statistic has a known distribution under the null hypothesis, so the probability of obtaining the observed Fs under the null hypothesis can be calculated.

The shape of the F-distribution depends on two degrees of freedom, the degrees of freedom of the numerator (among-group variance) and degrees of freedom of the denominator (within-group variance). The among-group degrees of freedom is the number of groups minus one. The within-groups degrees of freedom is the total number of observations, minus the number of groups. Thus if there are n observations in a groups, numerator degrees of freedom is a-1 and denominator degrees of freedom is n-a. For the example data set, there are 5 groups and 39 observations, so the numerator degrees of freedom is 4 and the denominator degrees of freedom is 34. Whatever program you use for the anova will almost certainly calculate the degrees of freedom for you.

The conventional way of reporting the complete results of an anova is with a table (the "sum of squares" column is often omitted). Here are the results of a one-way anova on the mussel data:

sum of
among groups0.0045240.0011137.122.8吆 -4
within groups0.00539340.000159

If you're not going to use the mean squares for anything, you could just report this as "The means were significantly heterogeneous (one-way anova, F4, 34=7.12, P=2.8吆 -4 )." The degrees of freedom are given as a subscript to F, with the numerator first.

Note that statisticians often call the within-group mean square the "error" mean square. I think this can be confusing to non-statisticians, as it implies that the variation is due to experimental error or measurement error. In biology, the within-group variation is often largely the result of real, biological variation among individuals, not the kind of mistakes implied by the word "error." That's why I prefer the term "within-group mean square."


One-way anova assumes that the observations within each group are normally distributed. It is not particularly sensitive to deviations from this assumption if you apply one-way anova to data that are non-normal, your chance of getting a P value less than 0.05, if the null hypothesis is true, is still pretty close to 0.05. It's better if your data are close to normal, so after you collect your data, you should calculate the residuals (the difference between each observation and the mean of its group) and plot them on a histogram. If the residuals look severely non-normal, try data transformations and see if one makes the data look more normal.

If none of the transformations you try make the data look normal enough, you can use the Kruskal-Wallis test. Be aware that it makes the assumption that the different groups have the same shape of distribution, and that it doesn't test the same null hypothesis as one-way anova. Personally, I don't like the Kruskal-Wallis test I recommend that if you have non-normal data that can't be fixed by transformation, you go ahead and use one-way anova, but be cautious about rejecting the null hypothesis if the P value is not very far below 0.05 and your data are extremely non-normal.

One-way anova also assumes that your data are homoscedastic, meaning the standard deviations are equal in the groups. You should examine the standard deviations in the different groups and see if there are big differences among them.

If you have a balanced design, meaning that the number of observations is the same in each group, then one-way anova is not very sensitive to heteroscedasticity (different standard deviations in the different groups). I haven't found a thorough study of the effects of heteroscedasticity that considered all combinations of the number of groups, sample size per group, and amount of heteroscedasticity. I've done simulations with two groups, and they indicated that heteroscedasticity will give an excess proportion of false positives for a balanced design only if one standard deviation is at least three times the size of the other, and the sample size in each group is fewer than 10. I would guess that a similar rule would apply to one-way anovas with more than two groups and balanced designs.

Heteroscedasticity is a much bigger problem when you have an unbalanced design (unequal sample sizes in the groups). If the groups with smaller sample sizes also have larger standard deviations, you will get too many false positives. The difference in standard deviations does not have to be large a smaller group could have a standard deviation that's 50% larger, and your rate of false positives could be above 10% instead of at 5% where it belongs. If the groups with larger sample sizes have larger standard deviations, the error is in the opposite direction you get too few false positives, which might seem like a good thing except it also means you lose power (get too many false negatives, if there is a difference in means).

You should try really hard to have equal sample sizes in all of your groups. With a balanced design, you can safely use a one-way anova unless the sample sizes per group are less than 10 and the standard deviations vary by threefold or more. If you have a balanced design with small sample sizes and very large variation in the standard deviations, you should use Welch's anova instead.

If you have an unbalanced design, you should carefully examine the standard deviations. Unless the standard deviations are very similar, you should probably use Welch's anova. It is less powerful than one-way anova for homoscedastic data, but it can be much more accurate for heteroscedastic data from an unbalanced design.

Additional analyses

Tukey-Kramer test

If you reject the null hypothesis that all the means are equal, you'll probably want to look at the data in more detail. One common way to do this is to compare different pairs of means and see which are significantly different from each other. For the mussel shell example, the overall P value is highly significant you would probably want to follow up by asking whether the mean in Tillamook is different from the mean in Newport, whether Newport is different from Petersburg, etc.

It might be tempting to use a simple two-sample t&ndashtest on each pairwise comparison that looks interesting to you. However, this can result in a lot of false positives. When there are a groups, there are (a 2 &minusa)/2 possible pairwise comparisons, a number that quickly goes up as the number of groups increases. With 5 groups, there are 10 pairwise comparisons with 10 groups, there are 45, and with 20 groups, there are 190 pairs. When you do multiple comparisons, you increase the probability that at least one will have a P value less than 0.05 purely by chance, even if the null hypothesis of each comparison is true.

There are a number of different tests for pairwise comparisons after a one-way anova, and each has advantages and disadvantages. The differences among their results are fairly subtle, so I will describe only one, the Tukey-Kramer test. It is probably the most commonly used post-hoc test after a one-way anova, and it is fairly easy to understand.

In the Tukey–Kramer method, the minimum significant difference (MSD) is calculated for each pair of means. It depends on the sample size in each group, the average variation within the groups, and the total number of groups. For a balanced design, all of the MSDs will be the same for an unbalanced design, pairs of groups with smaller sample sizes will have bigger MSDs. If the observed difference between a pair of means is greater than the MSD, the pair of means is significantly different. For example, the Tukey MSD for the difference between Newport and Tillamook is 0.0172. The observed difference between these means is 0.0054, so the difference is not significant. Newport and Petersburg have a Tukey MSD of 0.0188 the observed difference is 0.0286, so it is significant.

There are a couple of common ways to display the results of the Tukey&ndashKramer test. One technique is to find all the sets of groups whose means do not differ significantly from each other, then indicate each set with a different symbol.

Locationmean AAMTukey&ndashKramer
Magadan0.0780a, b
Tillamook0.0802a, b
Tvarminne0.0957b, c

Then you explain that "Means with the same letter are not significantly different from each other (Tukey&ndashKramer test, P>0.05)." This table shows that Newport and Magadan both have an "a", so they are not significantly different Newport and Tvarminne don't have the same letter, so they are significantly different.

Another way you can illustrate the results of the Tukey&ndashKramer test is with lines connecting means that are not significantly different from each other. This is easiest when the means are sorted from smallest to largest:

Mean AAM (anterior adductor muscle scar standardized by total shell length) for Mytilus trossulus from five locations. Pairs of means grouped by a horizontal line are not significantly different from each other (Tukey&ndashKramer method, P>0.05).

There are also tests to compare different sets of groups for example, you could compare the two Oregon samples (Newport and Tillamook) to the two samples from further north in the Pacific (Magadan and Petersburg). The Scheffé test is probably the most common. The problem with these tests is that with a moderate number of groups, the number of possible comparisons becomes so large that the P values required for significance become ridiculously small.

Partitioning variance

The most familiar one-way anovas are "fixed effect" or "model I" anovas. The different groups are interesting, and you want to know which are different from each other. As an example, you might compare the AAM length of the mussel species Mytilus edulis, Mytilus galloprovincialis, Mytilus trossulus and Mytilus californianus you'd want to know which had the longest AAM, which was shortest, whether M. edulis was significantly different from M. trossulus, etc.

The other kind of one-way anova is a "random effect" or "model II" anova. The different groups are random samples from a larger set of groups, and you're not interested in which groups are different from each other. An example would be taking offspring from five random families of M. trossulus and comparing the AAM lengths among the families. You wouldn't care which family had the longest AAM, and whether family A was significantly different from family B they're just random families sampled from a much larger possible number of families. Instead, you'd be interested in how the variation among families compared to the variation within families in other words, you'd want to partition the variance.

Under the null hypothesis of homogeneity of means, the among-group mean square and within-group mean square are both estimates of the within-group parametric variance. If the means are heterogeneous, the within-group mean square is still an estimate of the within-group variance, but the among-group mean square estimates the sum of the within-group variance plus the group sample size times the added variance among groups. Therefore subtracting the within-group mean square from the among-group mean square, and dividing this difference by the average group sample size, gives an estimate of the added variance component among groups. The equation is:

where no is a number that is close to, but usually slightly less than, the arithmetic mean of the sample size (ni) of each of the a groups:

Each component of the variance is often expressed as a percentage of the total variance components. Thus an anova table for a one-way anova would indicate the among-group variance component and the within-group variance component, and these numbers would add to 100%.

Although statisticians say that each level of an anova "explains" a proportion of the variation, this statistical jargon does not mean that you've found a biological cause-and-effect explanation. If you measure the number of ears of corn per stalk in 10 random locations in a field, analyze the data with a one-way anova, and say that the location "explains" 74.3% of the variation, you haven't really explained anything you don't know whether some areas have higher yield because of different water content in the soil, different amounts of insect damage, different amounts of nutrients in the soil, or random attacks by a band of marauding corn bandits.

Partitioning the variance components is particularly useful in quantitative genetics, where the within-family component might reflect environmental variation while the among-family component reflects genetic variation. Of course, estimating heritability involves more than just doing a simple anova, but the basic concept is similar.

Another area where partitioning variance components is useful is in designing experiments. For example, let's say you're planning a big experiment to test the effect of different drugs on calcium uptake in rat kidney cells. You want to know how many rats to use, and how many measurements to make on each rat, so you do a pilot experiment in which you measure calcium uptake on 6 rats, with 4 measurements per rat. You analyze the data with a one-way anova and look at the variance components. If a high percentage of the variation is among rats, that would tell you that there's a lot of variation from one rat to the next, but the measurements within one rat are pretty uniform. You could then design your big experiment to include a lot of rats for each drug treatment, but not very many measurements on each rat. Or you could do some more pilot experiments to try to figure out why there's so much rat-to-rat variation (maybe the rats are different ages, or some have eaten more recently than others, or some have exercised more) and try to control it. On the other hand, if the among-rat portion of the variance was low, that would tell you that the mean values for different rats were all about the same, while there was a lot of variation among the measurements on each rat. You could design your big experiment with fewer rats and more observations per rat, or you could try to figure out why there's so much variation among measurements and control it better.

There's an equation you can use for optimal allocation of resources in experiments. It's usually used for nested anova, but you can use it for a one-way anova if the groups are random effect (model II).

Partitioning the variance applies only to a model II (random effects) one-way anova. It doesn't really tell you anything useful about the more common model I (fixed effects) one-way anova, although sometimes people like to report it (because they're proud of how much of the variance their groups "explain," I guess).


Here are data on the genome size (measured in picograms of DNA per haploid cell) in several large groups of crustaceans, taken from Gregory (2014). The cause of variation in genome size has been a puzzle for a long time I'll use these data to answer the biological question of whether some groups of crustaceans have different genome sizes than others. Because the data from closely related species would not be independent (closely related species are likely to have similar genome sizes, because they recently descended from a common ancestor), I used a random number generator to randomly choose one species from each family.

7.16 0.402.672.446.79
8.48 0.475.452.668.60
13.49 0.636.812.788.82
16.09 0.87 2.80
27.00 2.77 2.83
50.91 2.91 3.01
64.62 4.34

After collecting the data, the next step is to see if they are normal and homoscedastic. It's pretty obviously non-normal most of the values are less than 10, but there are a small number that are much higher. A histogram of the largest group, the decapods (crabs, shrimp and lobsters), makes this clear:

Histogram of the genome size in decapod crustaceans.

The data are also highly heteroscedastic the standard deviations range from 0.67 in barnacles to 20.4 in amphipods. Fortunately, log-transforming the data make them closer to homoscedastic (standard deviations ranging from 0.20 to 0.63) and look more normal:

Histogram of the genome size in decapod crustaceans after base-10 log transformation.

Analyzing the log-transformed data with one-way anova, the result is F6,76=11.72, P=2.9×10 &minus9 . So there is very significant variation in mean genome size among these seven taxonomic groups of crustaceans.

The next step is to use the Tukey-Kramer test to see which pairs of taxa are significantly different in mean genome size. The usual way to display this information is by identifying groups that are not significantly different here I do this with horizontal bars:

Neans and 95% confidence limits of genome size in seven groups of crustaceans. Horizontal bars link groups that are not significantly different (Tukey-Kramer test, P>0.05). Analysis was done on log-transformed data, then back-transformed for this graph.

This graph suggests that there are two sets of genome sizes, groups with small genomes (branchiopods, ostracods, barnacles, and copepods) and groups with large genomes (decapods and amphipods) the members of each set are not significantly different from each other. Isopods are in the middle the only group they're significantly different from is branchiopods. So the answer to the original biological question, "do some groups of crustaceans have different genome sizes than others," is yes. Why different groups have different genome sizes remains a mystery.

Graphing the results

The usual way to graph the results of a one-way anova is with a bar graph. The heights of the bars indicate the means, and there's usually some kind of error bar, either 95% confidence intervals or standard errors. Be sure to say in the figure caption what the error bars represent.

Similar tests

If you have only two groups, you can do a two-sample t&ndashtest. This is mathematically equivalent to an anova and will yield the exact same P value, so if all you'll ever do is comparisons of two groups, you might as well call them t&ndashtests. If you're going to do some comparisons of two groups, and some with more than two groups, it will probably be less confusing if you call all of your tests one-way anovas.

If there are two or more nominal variables, you should use a two-way anova, a nested anova, or something more complicated that I won't cover here. If you're tempted to do a very complicated anova, you may want to break your experiment down into a set of simpler experiments for the sake of comprehensibility.

If the data severely violate the assumptions of the anova, you can use Welch's anova if the standard deviations are heterogeneous or use the Kruskal-Wallis test if the distributions are non-normal.

How to do the test


I have put together a spreadsheet to do one-way anova on up to 50 groups and 1000 observations per group. It calculates the P value, does the Tukey&ndashKramer test, and partitions the variance.

Some versions of Excel include an "Analysis Toolpak," which includes an "Anova: Single Factor" function that will do a one-way anova. You can use it if you want, but I can't help you with it. It does not include any techniques for unplanned comparisons of means, and it does not partition the variance.

Web pages

Several people have put together web pages that will perform a one-way anova one good one is here. It is easy to use, and will handle three to 26 groups and 3 to 1024 observations per group. It does not do the Tukey-Kramer test and does not partition the variance.

Salvatore Mangiafico's R Companion has a sample R program for one-way anova.

There are several SAS procedures that will perform a one-way anova. The two most commonly used are PROC ANOVA and PROC GLM. Either would be fine for a one-way anova, but PROC GLM (which stands for "General Linear Models") can be used for a much greater variety of more complicated analyses, so you might as well use it for everything.

Here is a SAS program to do a one-way anova on the mussel data from above.

The output includes the traditional anova table the P value is given under "Pr > F".

PROC GLM doesn't calculate the variance components for an anova. Instead, you use PROC VARCOMP. You set it up just like PROC GLM, with the addition of METHOD=TYPE1 (where "TYPE1" includes the numeral 1, not the letter el. The procedure has four different methods for estimating the variance components, and TYPE1 seems to be the same technique as the one I've described above. Here's how to do the one-way anova, including estimating the variance components, for the mussel shell example.

The results include the following:

The output is not given as a percentage of the total, so you'll have to calculate that. For these results, the among-group component is 0.0001254/(0.0001254+0.0001586)=0.4415, or 44.15% the within-group component is 0.0001587/(0.0001254+0.0001586)=0.5585, or 55.85%.

Welch's anova

If the data show a lot of heteroscedasticity (different groups have different standard deviations), the one-way anova can yield an inaccurate P value the probability of a false positive may be much higher than 5%. In that case, you should use Welch's anova. I've written a spreadsheet to do Welch's anova. It includes the Games-Howell test, which is similar to the Tukey-Kramer test for a regular anova. (Note: the original spreadsheet gave incorrect results for the Games-Howell test it was corrected on April 28, 2015). You can do Welch's anova in SAS by adding a MEANS statement, the name of the nominal variable, and the word WELCH following a slash. Unfortunately, SAS does not do the Games-Howell post-hoc test. Here is the example SAS program from above, modified to do Welch's anova:

Here is part of the output:

Power analysis

To do a power analysis for a one-way anova is kind of tricky, because you need to decide what kind of effect size you're looking for. If you're mainly interested in the overall significance test, the sample size needed is a function of the standard deviation of the group means. Your estimate of the standard deviation of means that you're looking for may be based on a pilot experiment or published literature on similar experiments.

If you're mainly interested in the comparisons of means, there are other ways of expressing the effect size. Your effect could be a difference between the smallest and largest means, for example, that you would want to be significant by a Tukey-Kramer test. There are ways of doing a power analysis with this kind of effect size, but I don't know much about them and won't go over them here.

To do a power analysis for a one-way anova using the free program G*Power, choose "F tests" from the "Test family" menu and "ANOVA: Fixed effects, omnibus, one-way" from the "Statistical test" menu. To determine the effect size, click on the Determine button and enter the number of groups, the standard deviation within the groups (the program assumes they're all equal), and the mean you want to see in each group. Usually you'll leave the sample sizes the same for all groups (a balanced design), but if you're planning an unbalanced anova with bigger samples in some groups than in others, you can enter different relative sample sizes. Then click on the "Calculate and transfer to main window" button it calculates the effect size and enters it into the main window. Enter your alpha (usually 0.05) and power (typically 0.80 or 0.90) and hit the Calculate button. The result is the total sample size in the whole experiment you'll have to do a little math to figure out the sample size for each group.

As an example, let's say you're studying transcript amount of some gene in arm muscle, heart muscle, brain, liver, and lung. Based on previous research, you decide that you'd like the anova to be significant if the means were 10 units in arm muscle, 10 units in heart muscle, 15 units in brain, 15 units in liver, and 15 units in lung. The standard deviation of transcript amount within a tissue type that you've seen in previous research is 12 units. Entering these numbers in G*Power, along with an alpha of 0.05 and a power of 0.80, the result is a total sample size of 295. Since there are five groups, you'd need 59 observations per group to have an 80% chance of having a significant (P<0.05) one-way anova.


McDonald, J.H., R. Seed and R.K. Koehn. 1991. Allozymes and morphometric characters of three species of Mytilus in the Northern and Southern Hemispheres. Marine Biology 111:323-333.

&lArr Previous topic|Next topic &rArr Table of Contents

This page was last revised July 20, 2015. Its address is It may be cited as:
McDonald, J.H. 2014. Handbook of Biological Statistics (3rd ed.). Sparky House Publishing, Baltimore, Maryland. This web page contains the content of pages 145-156 in the printed version.

©2014 by John H. McDonald. You can probably do what you want with this content see the permissions page for details.

Inferences for Two Populations

Rudolf J. Freund , . Donna L. Mohr , in Statistical Methods (Third Edition) , 2010

5.2.2 The Sampling Distribution of the Difference between Two Means

Since sample means are random variables, the difference between two sample means is a linear function of two random variables. That is,

In terms of the linear function specified above, n = 2 , a 1 = 1 , and a 2 = − 1 . Using these specifications, the sampling distribution of the difference between two means has a mean of ( μ 1 − μ 2 ) .

Further, since the y ¯ 1 and y ¯ 2 are sample means, the variance of y ¯ 1 is σ 1 2 ∕ n 1 and the variance of y ¯ 2 is σ 2 2 ∕ n 2 . Also, because we have made the assumption that the two samples are independently drawn from the two populations, the two sample means are independent random variables. Therefore, the variance of the difference ( y ¯ 1 − y ¯ 2 ) is

Note that for the special case where σ 1 2 = σ 2 2 = σ 2 and n 1 = n 2 = n , the variance of the difference is 2 σ 2 ∕ n .

Finally, the central limit theorem states that if the sample sizes are sufficiently large, y ¯ 1 and y ¯ 2 are normally distributed hence for most applications L is also normally distributed.

Thus, if the variances σ 1 2 and σ 2 2 are known, we can determine the variance of the difference ( y ¯ 1 − y ¯ 2 ) . As in the one-population case we first present inference procedures that assume that the population variances are known. Procedures using estimated variances are presented later in this section.


While the analysis of variance reached fruition in the 20th century, antecedents extend centuries into the past according to Stigler. [1] These include hypothesis testing, the partitioning of sums of squares, experimental techniques and the additive model. Laplace was performing hypothesis testing in the 1770s. [2] Around 1800, Laplace and Gauss developed the least-squares method for combining observations, which improved upon methods then used in astronomy and geodesy. It also initiated much study of the contributions to sums of squares. Laplace knew how to estimate a variance from a residual (rather than a total) sum of squares. [3] By 1827, Laplace was using least squares methods to address ANOVA problems regarding measurements of atmospheric tides. [4] Before 1800, astronomers had isolated observational errors resulting from reaction times (the "personal equation") and had developed methods of reducing the errors. [5] The experimental methods used in the study of the personal equation were later accepted by the emerging field of psychology [6] which developed strong (full factorial) experimental methods to which randomization and blinding were soon added. [7] An eloquent non-mathematical explanation of the additive effects model was available in 1885. [8]

Ronald Fisher introduced the term variance and proposed its formal analysis in a 1918 article The Correlation Between Relatives on the Supposition of Mendelian Inheritance. [9] His first application of the analysis of variance was published in 1921. [10] Analysis of variance became widely known after being included in Fisher's 1925 book Statistical Methods for Research Workers.

Randomization models were developed by several researchers. The first was published in Polish by Jerzy Neyman in 1923. [11]

The analysis of variance can be used to describe otherwise complex relations among variables. A dog show provides an example. A dog show is not a random sampling of the breed: it is typically limited to dogs that are adult, pure-bred, and exemplary. A histogram of dog weights from a show might plausibly be rather complex, like the yellow-orange distribution shown in the illustrations. Suppose we wanted to predict the weight of a dog based on a certain set of characteristics of each dog. One way to do that is to explain the distribution of weights by dividing the dog population into groups based on those characteristics. A successful grouping will split dogs such that (a) each group has a low variance of dog weights (meaning the group is relatively homogeneous) and (b) the mean of each group is distinct (if two groups have the same mean, then it isn't reasonable to conclude that the groups are, in fact, separate in any meaningful way).

In the illustrations to the right, groups are identified as X1, X2, etc. In the first illustration, the dogs are divided according to the product (interaction) of two binary groupings: young vs old, and short-haired vs long-haired (e.g., group 1 is young, short-haired dogs, group 2 is young, long-haired dogs, etc.). Since the distributions of dog weight within each of the groups (shown in blue) has a relatively large variance, and since the means are very similar across groups, grouping dogs by these characteristics does not produce an effective way to explain the variation in dog weights: knowing which group a dog is in doesn't allow us to predict its weight much better than simply knowing the dog is in a dog show. Thus, this grouping fails to explain the variation in the overall distribution (yellow-orange).

An attempt to explain the weight distribution by grouping dogs as pet vs working breed and less athletic vs more athletic would probably be somewhat more successful (fair fit). The heaviest show dogs are likely to be big, strong, working breeds, while breeds kept as pets tend to be smaller and thus lighter. As shown by the second illustration, the distributions have variances that are considerably smaller than in the first case, and the means are more distinguishable. However, the significant overlap of distributions, for example, means that we cannot distinguish X1 and X2 reliably. Grouping dogs according to a coin flip might produce distributions that look similar.

An attempt to explain weight by breed is likely to produce a very good fit. All Chihuahuas are light and all St Bernards are heavy. The difference in weights between Setters and Pointers does not justify separate breeds. The analysis of variance provides the formal tools to justify these intuitive judgments. A common use of the method is the analysis of experimental data or the development of models. The method has some advantages over correlation: not all of the data must be numeric and one result of the method is a judgment in the confidence in an explanatory relationship.

ANOVA is a form of statistical hypothesis testing heavily used in the analysis of experimental data. A test result (calculated from the null hypothesis and the sample) is called statistically significant if it is deemed unlikely to have occurred by chance, assuming the truth of the null hypothesis. A statistically significant result, when a probability (p-value) is less than a pre-specified threshold (significance level), justifies the rejection of the null hypothesis, but only if the a priori probability of the null hypothesis is not high.

In the typical application of ANOVA, the null hypothesis is that all groups are random samples from the same population. For example, when studying the effect of different treatments on similar samples of patients, the null hypothesis would be that all treatments have the same effect (perhaps none). Rejecting the null hypothesis is taken to mean that the differences in observed effects between treatment groups are unlikely to be due to random chance.

By construction, hypothesis testing limits the rate of Type I errors (false positives) to a significance level. Experimenters also wish to limit Type II errors (false negatives). The rate of Type II errors depends largely on sample size (the rate is larger for smaller samples), significance level (when the standard of proof is high, the chances of overlooking a discovery are also high) and effect size (a smaller effect size is more prone to Type II error).

The terminology of ANOVA is largely from the statistical design of experiments. The experimenter adjusts factors and measures responses in an attempt to determine an effect. Factors are assigned to experimental units by a combination of randomization and blocking to ensure the validity of the results. Blinding keeps the weighing impartial. Responses show a variability that is partially the result of the effect and is partially random error.

ANOVA is the synthesis of several ideas and it is used for multiple purposes. As a consequence, it is difficult to define concisely or precisely.

"Classical" ANOVA for balanced data does three things at once:

  1. As exploratory data analysis, an ANOVA employs an additive data decomposition, and its sums of squares indicate the variance of each component of the decomposition (or, equivalently, each set of terms of a linear model).
  2. Comparisons of mean squares, along with an F-test . allow testing of a nested sequence of models.
  3. Closely related to the ANOVA is a linear model fit with coefficient estimates and standard errors. [12]

ANOVA "has long enjoyed the status of being the most used (some would say abused) statistical technique in psychological research." [13]

ANOVA is difficult to teach, particularly for complex experiments, with split-plot designs being notorious. [14] In some cases the proper application of the method is best determined by problem pattern recognition followed by the consultation of a classic authoritative test. [15]

There are three classes of models used in the analysis of variance, and these are outlined here.

Fixed-effects models Edit

The fixed-effects model (class I) of analysis of variance applies to situations in which the experimenter applies one or more treatments to the subjects of the experiment to see whether the response variable values change. This allows the experimenter to estimate the ranges of response variable values that the treatment would generate in the population as a whole.

Random-effects models Edit

Random-effects model (class II) is used when the treatments are not fixed. This occurs when the various factor levels are sampled from a larger population. Because the levels themselves are random variables, some assumptions and the method of contrasting the treatments (a multi-variable generalization of simple differences) differ from the fixed-effects model. [16]

Mixed-effects models Edit

A mixed-effects model (class III) contains experimental factors of both fixed and random-effects types, with appropriately different interpretations and analysis for the two types.

Example: Teaching experiments could be performed by a college or university department to find a good introductory textbook, with each text considered a treatment. The fixed-effects model would compare a list of candidate texts. The random-effects model would determine whether important differences exist among a list of randomly selected texts. The mixed-effects model would compare the (fixed) incumbent texts to randomly selected alternatives.

Defining fixed and random effects has proven elusive, with competing definitions arguably leading toward a linguistic quagmire. [17]

The analysis of variance has been studied from several approaches, the most common of which uses a linear model that relates the response to the treatments and blocks. Note that the model is linear in parameters but may be nonlinear across factor levels. Interpretation is easy when data is balanced across factors but much deeper understanding is needed for unbalanced data.

Textbook analysis using a normal distribution Edit

The analysis of variance can be presented in terms of a linear model, which makes the following assumptions about the probability distribution of the responses: [18] [19] [20] [21]

    of observations – this is an assumption of the model that simplifies the statistical analysis. – the distributions of the residuals are normal.
  • Equality (or "homogeneity") of variances, called homoscedasticity — the variance of data in groups should be the same.

The separate assumptions of the textbook model imply that the errors are independently, identically, and normally distributed for fixed effects models, that is, that the errors ( ε ) are independent and

Randomization-based analysis Edit

In a randomized controlled experiment, the treatments are randomly assigned to experimental units, following the experimental protocol. This randomization is objective and declared before the experiment is carried out. The objective random-assignment is used to test the significance of the null hypothesis, following the ideas of C. S. Peirce and Ronald Fisher. This design-based analysis was discussed and developed by Francis J. Anscombe at Rothamsted Experimental Station and by Oscar Kempthorne at Iowa State University. [22] Kempthorne and his students make an assumption of unit treatment additivity, which is discussed in the books of Kempthorne and David R. Cox. [ citation needed ]

Unit-treatment additivity Edit

The assumption of unit treatment additivity usually cannot be directly falsified, according to Cox and Kempthorne. However, many consequences of treatment-unit additivity can be falsified. For a randomized experiment, the assumption of unit-treatment additivity implies that the variance is constant for all treatments. Therefore, by contraposition, a necessary condition for unit-treatment additivity is that the variance is constant.

The use of unit treatment additivity and randomization is similar to the design-based inference that is standard in finite-population survey sampling.

Derived linear model Edit

Kempthorne uses the randomization-distribution and the assumption of unit treatment additivity to produce a derived linear model, very similar to the textbook model discussed previously. [26] The test statistics of this derived linear model are closely approximated by the test statistics of an appropriate normal linear model, according to approximation theorems and simulation studies. [27] However, there are differences. For example, the randomization-based analysis results in a small but (strictly) negative correlation between the observations. [28] [29] In the randomization-based analysis, there is no assumption of a normal distribution and certainly no assumption of independence. On the contrary, the observations are dependent!

The randomization-based analysis has the disadvantage that its exposition involves tedious algebra and extensive time. Since the randomization-based analysis is complicated and is closely approximated by the approach using a normal linear model, most teachers emphasize the normal linear model approach. Few statisticians object to model-based analysis of balanced randomized experiments.

Statistical models for observational data Edit

However, when applied to data from non-randomized experiments or observational studies, model-based analysis lacks the warrant of randomization. [30] For observational data, the derivation of confidence intervals must use subjective models, as emphasized by Ronald Fisher and his followers. In practice, the estimates of treatment-effects from observational studies generally are often inconsistent. In practice, "statistical models" and observational data are useful for suggesting hypotheses that should be treated very cautiously by the public. [31]

Summary of assumptions Edit

The normal-model based ANOVA analysis assumes the independence, normality and homogeneity of variances of the residuals. The randomization-based analysis assumes only the homogeneity of the variances of the residuals (as a consequence of unit-treatment additivity) and uses the randomization procedure of the experiment. Both these analyses require homoscedasticity, as an assumption for the normal-model analysis and as a consequence of randomization and additivity for the randomization-based analysis.

However, studies of processes that change variances rather than means (called dispersion effects) have been successfully conducted using ANOVA. [32] There are no necessary assumptions for ANOVA in its full generality, but the F-test used for ANOVA hypothesis testing has assumptions and practical limitations which are of continuing interest.

Problems which do not satisfy the assumptions of ANOVA can often be transformed to satisfy the assumptions. The property of unit-treatment additivity is not invariant under a "change of scale", so statisticians often use transformations to achieve unit-treatment additivity. If the response variable is expected to follow a parametric family of probability distributions, then the statistician may specify (in the protocol for the experiment or observational study) that the responses be transformed to stabilize the variance. [33] Also, a statistician may specify that logarithmic transforms be applied to the responses, which are believed to follow a multiplicative model. [24] [34] According to Cauchy's functional equation theorem, the logarithm is the only continuous transformation that transforms real multiplication to addition. [ citation needed ]

ANOVA is used in the analysis of comparative experiments, those in which only the difference in outcomes is of interest. The statistical significance of the experiment is determined by a ratio of two variances. This ratio is independent of several possible alterations to the experimental observations: Adding a constant to all observations does not alter significance. Multiplying all observations by a constant does not alter significance. So ANOVA statistical significance result is independent of constant bias and scaling errors as well as the units used in expressing observations. In the era of mechanical calculation it was common to subtract a constant from all observations (when equivalent to dropping leading digits) to simplify data entry. [35] [36] This is an example of data coding.

The calculations of ANOVA can be characterized as computing a number of means and variances, dividing two variances and comparing the ratio to a handbook value to determine statistical significance. Calculating a treatment effect is then trivial: "the effect of any treatment is estimated by taking the difference between the mean of the observations which receive the treatment and the general mean". [37]

Partitioning of the sum of squares Edit

The fundamental technique is a partitioning of the total sum of squares SS into components related to the effects used in the model. For example, the model for a simplified ANOVA with one type of treatment at different levels.

The number of degrees of freedom DF can be partitioned in a similar way: one of these components (that for error) specifies a chi-squared distribution which describes the associated sum of squares, while the same is true for "treatments" if there is no treatment effect.

The F-test Edit

The F-test is used for comparing the factors of the total deviation. For example, in one-way, or single-factor ANOVA, statistical significance is tested for by comparing the F test statistic

There are two methods of concluding the ANOVA hypothesis test, both of which produce the same result:

  • The textbook method is to compare the observed value of F with the critical value of F determined from tables. The critical value of F is a function of the degrees of freedom of the numerator and the denominator and the significance level (α). If F ≥ FCritical, the null hypothesis is rejected.
  • The computer method calculates the probability (p-value) of a value of F greater than or equal to the observed value. The null hypothesis is rejected if this probability is less than or equal to the significance level (α).

The ANOVA F-test is known to be nearly optimal in the sense of minimizing false negative errors for a fixed rate of false positive errors (i.e. maximizing power for a fixed significance level). For example, to test the hypothesis that various medical treatments have exactly the same effect, the F-test's p-values closely approximate the permutation test's p-values: The approximation is particularly close when the design is balanced. [27] [38] Such permutation tests characterize tests with maximum power against all alternative hypotheses, as observed by Rosenbaum. [nb 2] The ANOVA F-test (of the null-hypothesis that all treatments have exactly the same effect) is recommended as a practical test, because of its robustness against many alternative distributions. [39] [nb 3]

Extended logic Edit

ANOVA consists of separable parts partitioning sources of variance and hypothesis testing can be used individually. ANOVA is used to support other statistical tools. Regression is first used to fit more complex models to data, then ANOVA is used to compare models with the objective of selecting simple(r) models that adequately describe the data. "Such models could be fit without any reference to ANOVA, but ANOVA tools could then be used to make some sense of the fitted models, and to test hypotheses about batches of coefficients." [40] "[W]e think of the analysis of variance as a way of understanding and structuring multilevel models—not as an alternative to regression but as a tool for summarizing complex high-dimensional inferences . " [40]

The simplest experiment suitable for ANOVA analysis is the completely randomized experiment with a single factor. More complex experiments with a single factor involve constraints on randomization and include completely randomized blocks and Latin squares (and variants: Graeco-Latin squares, etc.). The more complex experiments share many of the complexities of multiple factors. A relatively complete discussion of the analysis (models, data summaries, ANOVA table) of the completely randomized experiment is available.

For a single factor, there are some alternatives of one-way analysis of variance namely, Welch's heteroscedastic F test, Welch's heteroscedastic F test with trimmed means and Winsorized variances, Brown-Forsythe test, AlexanderGovern test, James second order test and Kruskal-Wallis test, available in onewaytests R package. [41]

ANOVA generalizes to the study of the effects of multiple factors. When the experiment includes observations at all combinations of levels of each factor, it is termed factorial. Factorial experiments are more efficient than a series of single factor experiments and the efficiency grows as the number of factors increases. [42] Consequently, factorial designs are heavily used.

The use of ANOVA to study the effects of multiple factors has a complication. In a 3-way ANOVA with factors x, y and z, the ANOVA model includes terms for the main effects (x, y, z) and terms for interactions (xy, xz, yz, xyz). All terms require hypothesis tests. The proliferation of interaction terms increases the risk that some hypothesis test will produce a false positive by chance. Fortunately, experience says that high order interactions are rare. [43] [ verification needed ] The ability to detect interactions is a major advantage of multiple factor ANOVA. Testing one factor at a time hides interactions, but produces apparently inconsistent experimental results. [42]

Caution is advised when encountering interactions Test interaction terms first and expand the analysis beyond ANOVA if interactions are found. Texts vary in their recommendations regarding the continuation of the ANOVA procedure after encountering an interaction. Interactions complicate the interpretation of experimental data. Neither the calculations of significance nor the estimated treatment effects can be taken at face value. "A significant interaction will often mask the significance of main effects." [44] Graphical methods are recommended to enhance understanding. Regression is often useful. A lengthy discussion of interactions is available in Cox (1958). [45] Some interactions can be removed (by transformations) while others cannot.

A variety of techniques are used with multiple factor ANOVA to reduce expense. One technique used in factorial designs is to minimize replication (possibly no replication with support of analytical trickery) and to combine groups when effects are found to be statistically (or practically) insignificant. An experiment with many insignificant factors may collapse into one with a few factors supported by many replications. [46]

Some analysis is required in support of the design of the experiment while other analysis is performed after changes in the factors are formally found to produce statistically significant changes in the responses. Because experimentation is iterative, the results of one experiment alter plans for following experiments.

Preparatory analysis Edit

The number of experimental units Edit

In the design of an experiment, the number of experimental units is planned to satisfy the goals of the experiment. Experimentation is often sequential.

Early experiments are often designed to provide mean-unbiased estimates of treatment effects and of experimental error. Later experiments are often designed to test a hypothesis that a treatment effect has an important magnitude in this case, the number of experimental units is chosen so that the experiment is within budget and has adequate power, among other goals.

Reporting sample size analysis is generally required in psychology. "Provide information on sample size and the process that led to sample size decisions." [47] The analysis, which is written in the experimental protocol before the experiment is conducted, is examined in grant applications and administrative review boards.

Besides the power analysis, there are less formal methods for selecting the number of experimental units. These include graphical methods based on limiting the probability of false negative errors, graphical methods based on an expected variation increase (above the residuals) and methods based on achieving a desired confidence interval. [48]

Power analysis Edit

Power analysis is often applied in the context of ANOVA in order to assess the probability of successfully rejecting the null hypothesis if we assume a certain ANOVA design, effect size in the population, sample size and significance level. Power analysis can assist in study design by determining what sample size would be required in order to have a reasonable chance of rejecting the null hypothesis when the alternative hypothesis is true. [49] [50] [51] [52]

Effect size Edit

Several standardized measures of effect have been proposed for ANOVA to summarize the strength of the association between a predictor(s) and the dependent variable or the overall standardized difference of the complete model. Standardized effect-size estimates facilitate comparison of findings across studies and disciplines. However, while standardized effect sizes are commonly used in much of the professional literature, a non-standardized measure of effect size that has immediately "meaningful" units may be preferable for reporting purposes. [53]

Model confirmation Edit

Sometimes tests are conducted to determine whether the assumptions of ANOVA appear to be violated. Residuals are examined or analyzed to confirm homoscedasticity and gross normality. [54] Residuals should have the appearance of (zero mean normal distribution) noise when plotted as a function of anything including time and modeled data values. Trends hint at interactions among factors or among observations.

Follow-up tests Edit

A statistically significant effect in ANOVA is often followed by additional tests. This can be done in order to assess which groups are different from which other groups or to test various other focused hypotheses. Follow-up tests are often distinguished in terms of whether they are "planned" (a priori) or "post hoc." Planned tests are determined before looking at the data, and post hoc tests are conceived only after looking at the data (though the term "post hoc" is inconsistently used).

The follow-up tests may be "simple" pairwise comparisons of individual group means or may be "compound" comparisons (e.g., comparing the mean pooling across groups A, B and C to the mean of group D). Comparisons can also look at tests of trend, such as linear and quadratic relationships, when the independent variable involves ordered levels. Often the follow-up tests incorporate a method of adjusting for the multiple comparisons problem.

There are several types of ANOVA. Many statisticians base ANOVA on the design of the experiment, [55] especially on the protocol that specifies the random assignment of treatments to subjects the protocol's description of the assignment mechanism should include a specification of the structure of the treatments and of any blocking. It is also common to apply ANOVA to observational data using an appropriate statistical model. [ citation needed ]

Some popular designs use the following types of ANOVA:

    is used to test for differences among two or more independent groups (means), e.g. different levels of urea application in a crop, or different levels of antibiotic action on several different bacterial species, [56] or different levels of effect of some medicine on groups of patients. However, should these groups not be independent, and there is an order in the groups (such as mild, moderate and severe disease), or in the dose of a drug (such as 5 mg/mL, 10 mg/mL, 20 mg/mL) given to the same group of patients, then a linear trend estimation should be used. Typically, however, the one-way ANOVA is used to test for differences among at least three groups, since the two-group case can be covered by a t-test. [57] When there are only two means to compare, the t-test and the ANOVA F-test are equivalent the relation between ANOVA and t is given by F = t 2 .
    ANOVA is used when there is more than one factor. ANOVA is used when the same subjects are used for each factor (e.g., in a longitudinal study). (MANOVA) is used when there is more than one response variable.

Balanced experiments (those with an equal sample size for each treatment) are relatively easy to interpret Unbalanced experiments offer more complexity. For single-factor (one-way) ANOVA, the adjustment for unbalanced data is easy, but the unbalanced analysis lacks both robustness and power. [58] For more complex designs the lack of balance leads to further complications. "The orthogonality property of main effects and interactions present in balanced data does not carry over to the unbalanced case. This means that the usual analysis of variance techniques do not apply. Consequently, the analysis of unbalanced factorials is much more difficult than that for balanced designs." [59] In the general case, "The analysis of variance can also be applied to unbalanced data, but then the sums of squares, mean squares, and F-ratios will depend on the order in which the sources of variation are considered." [40] The simplest techniques for handling unbalanced data restore balance by either throwing out data or by synthesizing missing data. More complex techniques use regression.

ANOVA is (in part) a test of statistical significance. The American Psychological Association (and many other organisations) holds the view that simply reporting statistical significance is insufficient and that reporting confidence bounds is preferred. [53]

ANOVA is considered to be a special case of linear regression [60] [61] which in turn is a special case of the general linear model. [62] All consider the observations to be the sum of a model (fit) and a residual (error) to be minimized.

The Kruskal–Wallis test and the Friedman test are nonparametric tests, which do not rely on an assumption of normality. [63] [64]

Connection to linear regression Edit

Below we make clear the connection between multi-way ANOVA and linear regression.

With this notation in place, we now have the exact connection with linear regression. We simply regress response y k > against the vector X k > . However, there is a concern about identifiability. In order to overcome such issues we assume that the sum of the parameters within each set of interactions is equal to zero. From here, one can use F-statistics or other methods to determine the relevance of the individual factors.

Example Edit

We can consider the 2-way interaction example where we assume that the first factor has 2 levels and the second factor has 3 levels.

Watch the video: Περιγραφική Στατιστική Excel (October 2021).