In this section, we'll use the survival experiment to illustrate that both
qualitative and quantitative understanding are important, and to show how to
appropriately use statistics at several levels. We'll analyze data taken from an
earlier experiment.

Exposure Time in Seconds | Count Plate 1 | Count Plate 2 | Count Plate 3 | Concentration Factor |
---|---|---|---|---|
0 | 129 | 127 | 119 | 1 |
5 | 101 | 140 | 109 | 1 |
10 | 96 | 82 | 62 | 1 |
15 | 39 | 32 | 29 | 1 |
15 | 298 | 357 | 322 | 10 |
20 | 149 | 122 | 128 | 10 |
25 | 52 | 38 | 52 | 10 |
30 | 22 | 27 | 24 | 10 |

On the other hand, using these differing dilutions can be confusing. The 300+ survivors at the 15-second dose should be compared to the 1,250+ survivors we would have seen had we used the same dilution factor at the 0-second dose. Converting all the data to the same dilution simplifies the analysis. Here is the table we get if we convert everything to the dilution factor of 10:

Exposure Time in Seconds | Count Plate 1 | Count Plate 2 | Count Plate 3 |
---|---|---|---|
0 | 1290 | 1270 | 1190 |
5 | 1010 | 1400 | 1090 |
10 | 960 | 820 | 620 |
15 | 390 | 320 | 290 |
15 | 298 | 357 | 322 |
20 | 149 | 122 | 128 |
25 | 52 | 38 | 52 |
30 | 22 | 27 | 24 |
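The conversion is simple enough to script. Here is a minimal Python sketch (the data layout and the function name `to_common_dilution` are ours, not from the original write-up):

```python
# Convert raw plate counts to a common dilution factor.
# Each row: (exposure_seconds, [plate counts], concentration_factor).
raw_rows = [
    (0,  [129, 127, 119], 1),
    (5,  [101, 140, 109], 1),
    (10, [96, 82, 62],    1),
    (15, [39, 32, 29],    1),
    (15, [298, 357, 322], 10),
    (20, [149, 122, 128], 10),
    (25, [52, 38, 52],    10),
    (30, [22, 27, 24],    10),
]

def to_common_dilution(rows, target=10):
    """Scale every count so all rows share one concentration factor."""
    converted = []
    for exposure, counts, factor in rows:
        scale = target // factor          # e.g. 10 // 1 = 10, 10 // 10 = 1
        converted.append((exposure, [c * scale for c in counts]))
    return converted

for exposure, counts in to_common_dilution(raw_rows):
    print(exposure, counts)
```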

We are justified in "massaging the data" in this way because otherwise we might
mislead people who haven't done the experiment, and who don't know or care
about the details of the dilution factors.

To figure out what is going on, and to explain the results to other people,
plot the data on a graph. As an example, we have plotted the results for plate #2 in
Figure 1. The difference between the 25- and 30-second exposures is hard to see
on this graph because it covers a big range: 27 to 1400. All the exposures
are easier to see if you use a semi-log graph, on which the vertical
scale has the same distance between 1 and 10 as between 10 and 100, and between
100 and 1000. The same data are plotted, much more clearly, on a semi-log
graph in Figure 2. We can see that all the data are different, but we cannot yet draw
any conclusion about how survival depends on dose. In fact, this graph suggests
that a little UV actually helps colonies grow. If we plot ALL the data from the
table on this graph, however, we learn more, as we see in Figure 3.

Figure 1: Linear plot of column two data.

Figure 2: Semi-log plot of column two data.

Figure 3: Semi-log plot of all data.

Figure 3 is revealing. In the first place, survival more clearly depends on
dose. The low dose behavior is still uncertain, but the idea that a little UV helps
survival looks less likely. We need more data to study this question. For larger
doses, the log of the surviving colonies clearly decreases roughly linearly as the
dose increases. Note the smoothing effect of lots of data: each individual data
point represents some random fluctuation just like the dilutions do, and each point
also contains some experimental errors, but the errors and fluctuations push one
point one way, another point in the opposite direction, and so the collection of
points becomes much more useful than the individual ones.

This plot can effectively demonstrate the results from an entire class. It is
actually a form of "statistics": we can roughly estimate the number of survivors at
each dose, note how much variability there is in this number and, using that
marvelous analytical engine, our brain, even fill in an approximate line through
the data. The next step is to make these rough insights quantitative, but the three
steps we have already taken are useful in themselves if you have limited time.

First, we have identified an effect to be studied.

Second, we've described the result qualitatively.

Third, we've presented all our data clearly on a useful graph.

To go further, average the three plate counts at each dose:

Exposure Time in Seconds | Average Colonies per Plate | Concentration Factor |
---|---|---|
0 | 125 | 1 |
5 | 117 | 1 |
10 | 80 | 1 |
15 | 33 | 1 |
15 | 325 | 10 |
20 | 133 | 10 |
25 | 47 | 10 |
30 | 24 | 10 |

But what should we do if five runs at an exposure get 22, 25, 26, 31 and 196 cells? Should we just average these five and say the average is 60? Doesn't this look a little misleading? Discussion will probably suggest that the odd result is likely to be a wrong dilution, or some other procedural problem, and ought to be left out. This may be a proper time to "massage the data" because we should not report data that we have reason to believe are wrong. However, we should first re-check that exposure, keep the odd result in mind, and consider the possibility that it reflects an interesting unanticipated effect.

To understand this better, consider the results of the 5-second exposure in our data. The points are 101, 109, and 140. The 140 is more than any of the 0-second plates, and 140 is really quite far from the 117 average. If we omitted this number, the drop from 0 to 5 seconds would look more like the other changes. Should we drop it? Before deciding this, we'll calculate the other quantity that we report when using only averages instead of all the data: the "standard deviation."

The standard deviation measures how variable the data are. For example, the standard deviation of the closely bunched 0-second data is just over 4, while that for the 5-second data is 17. You can figure out the standard deviation using the 5-second data as follows.

First, find the average:

(101 + 109 + 140) / 3 = 350 / 3 ≈ 117 cells.

Next, find each point's deviation from the average: 101 − 117 = −16, 109 − 117 = −8, and 140 − 117 = +23. You square these deviations from the average to get positive numbers that measure the spread: 256, 64, and 529. The average of these numbers is

(256 + 64 + 529) / 3 ≈ 283 cells²,

and is called the "variance" of the data. To get rid of the peculiar "cells squared," we take the square root:

√283 ≈ 17 cells.
For practice, figure out the standard deviation of the 0-second data (the
answer is 4.3). Most spreadsheets have statistical functions, and you should
experiment with yours to learn how to ask your computer for the average and
standard deviation of a bunch of numbers. (Be forewarned that some programs
offer two types of standard deviation, and only the "population" type, which
divides by the number of points, reproduces the calculation above. For
reference, we used Lotus 1-2-3.)
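The calculation is also easy to check in a few lines of Python (a sketch; `population_sd` is our name for the divide-by-n form used above):

```python
import math
import statistics

def population_sd(xs):
    """Standard deviation dividing by n, matching the hand calculation
    in the text (the "population" form)."""
    m = sum(xs) / len(xs)
    variance = sum((x - m) ** 2 for x in xs) / len(xs)
    return math.sqrt(variance)

five_second = [101, 109, 140]
zero_second = [129, 127, 119]
print(round(population_sd(five_second)))        # 17
print(round(population_sd(zero_second), 1))     # 4.3

# The warning about two types of standard deviation: statistics.pstdev
# divides by n (matches the text), while statistics.stdev divides by
# n - 1 and gives a noticeably larger answer for small samples.
print(round(statistics.pstdev(five_second)))    # 17
print(round(statistics.stdev(five_second)))     # 21
```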

Now we can construct a table of both the average number of surviving cells
and the experimental variability in this number for each dose.

Exposure Time in Seconds | Average Colonies per Plate | Standard Deviation | Concentration Factor |
---|---|---|---|
0 | 125 | +/- 4.3 | 1 |
5 | 117 | +/- 17 | 1 |
10 | 80 | +/- 14 | 1 |
15 | 33 | +/- 4.2 | 1 |
15 | 325 | +/- 24 | 10 |
20 | 133 | +/- 12 | 10 |
25 | 47 | +/- 7 | 10 |
30 | 24 | +/- 2 | 10 |

Now, should we retain or throw away the 140-cell plate in the 5-second
exposure? The standard deviation of 101, 109, and 140 is 17, so 140 is only about
1.4 standard deviations away from the average. This is not very far; in fact, in
typical experiments about 1/3 of the points will be more than one standard
deviation away from the average. On the other hand, if the data for 30 seconds
had been 22, 25, 26, 31, and 196, the average would be 60 and the standard
deviation about 68, and even that spread is misleading because the 196 itself
inflates it. The other four points average 26 with a standard deviation of only
3.2, so 196 lies more than fifty of those standard deviations from where the rest
of the data cluster. Such a result is extremely unlikely, so we would drop that
point and report only the other four: an average of 26 and a standard deviation
of 3.2. (See Figure 4 for a plot of the data in this table with the standard
deviations used as error bars.)
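The keep-or-drop reasoning can be sketched in Python (the helper name `sigma_distance` is ours; the numbers are from the text):

```python
import statistics

def sigma_distance(sample, value):
    """How many population standard deviations `value` lies from the
    mean of `sample`."""
    return abs(value - statistics.mean(sample)) / statistics.pstdev(sample)

# The 5-second plates: 140 sits only about 1.4 standard deviations from
# the average of all three points, so we keep it.
print(round(sigma_distance([101, 109, 140], 140), 1))   # 1.4

# The hypothetical 30-second run: after dropping the wild 196,
# report the remaining four points.
others = [22, 25, 26, 31]
print(round(statistics.mean(others)))                   # 26
print(round(statistics.pstdev(others), 1))              # 3.2
```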

The two sets of data at 15 seconds furnish a final lesson about errors, the
usefulness of more data, and the justifiable massaging of data. The standard
deviation of the dilution-1 data is about 0.13 of the mean; therefore, we only know
the answer to 13%. The dilution-10 data are, as is typical, more accurate: we know
the mean to about 7%. The larger the mean, the smaller the relative error (=
error/mean). We should probably use the more accurate data, but consider this
question before discarding the smaller numbers: How accurately do you suppose
the plates with 300+ colonies were counted? Mistakes easily occur when too
many colonies grow on a plate; we may miss colonies or have two colonies
counted as one because they are growing right on top of each other. If an average
of 500 colonies are growing on some plates, we would expect a lot of
experimental error because the number of colonies would probably be
systematically under-counted. This kind of error is called "systematic error." It is
an error that creeps into our data because of problems with our procedure. How
would you choose between data with a mean of 500 +/- 25 or of 50 +/- 8? We
would probably accept the 50 even though the relative error is 8/50 = 0.16,
because the data with the 500 mean is likely to contain a big systematic error. On
the other hand, a further dilution that gives data with a mean of 5 +/- 2 gives even
poorer data. Although essentially no error occurs in counting so few colonies, the
relative error due to statistical fluctuations is 0.40. This sort of error, called "statistical error,"
is a big problem when you are dealing with small numbers. We could therefore
discard the plates giving 5 +/- 2 because of the large statistical error and just report
the 50 +/- 8 data. We should be conscious of statistical and systematic errors
when selecting data, and consider which of our procedures are likely to give the
most reliable data.
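The relative errors quoted above take only a few lines to reproduce (a sketch; the annotations are ours):

```python
# Relative error (= error / mean) for the three candidate dilutions
# discussed above; the means and errors are the ones quoted in the text.
candidates = [(500, 25), (50, 8), (5, 2)]
for mean, sd in candidates:
    print(f"mean {mean:3d}: relative error {sd / mean:.2f}")
# mean 500: 0.05 -- small, but crowded plates invite systematic under-counting
# mean  50: 0.16 -- our pick: countable plates, modest statistical error
# mean   5: 0.40 -- small numbers, large statistical fluctuations
```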

Discard the widespread prejudice that all data are sacred, but don't fall into the temptation to discard results that are just a little higher or lower than you were expecting; you would then report only the data that reinforce your expectations. Do not discard data because they disagree with a pre-existing theory; subject all results to the same rigorous scrutiny and accept nothing at face value. If you think that a procedure was faulty, then it is only honest to discard the data gathered using that procedure. No magic rule governs such decisions; we just have to think hard and be honest.

Figure 4: Semi-log plot of survival with a straight line fit.

Keeping only the more reliable dilution-10 data at 15 seconds, our working table becomes:

Exposure Time in Seconds | Average Colonies per Plate | Standard Deviation | Concentration Factor |
---|---|---|---|
0 | 125 | +/- 12 | 1 |
5 | 117 | +/- 17 | 1 |
10 | 80 | +/- 14 | 1 |
15 | 325 | +/- 25 | 10 |
20 | 133 | +/- 12 | 10 |
25 | 47 | +/- 7 | 10 |
30 | 24 | +/- 5 | 10 |

We can work with the results of two or more classes by calculating the *fraction* of
cells that survive. Another class using a different starting suspension of cells
might begin with an average of 170 cells/plate, so all their numbers would differ
from ours. The fraction that survives at each dose, however, can be compared.
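Computing the fractions takes only a few lines (a Python sketch; the row values are the dilution-corrected averages from the table that follows, and each mean and its standard deviation are simply divided by the unexposed mean):

```python
# Rows: (exposure_seconds, dilution-corrected mean, standard deviation).
rows = [
    (0, 125, 12), (5, 117, 17), (10, 80, 14), (15, 32.5, 2.5),
    (20, 13.3, 1.2), (25, 4.7, 0.7), (30, 2.4, 0.5),
]
baseline = rows[0][1]        # 125 colonies per plate with no exposure

for exposure, m, sd in rows:
    print(f"{exposure:2d} s: {m / baseline:.2f} +/- {sd / baseline:.2f}")
```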

We've compared our results with those from other experiments in a final table:

Exposure Time in Seconds | Average Colonies per Plate (Dilution Corrected) | Standard Deviation | Surviving Fraction | Standard Deviation |
---|---|---|---|---|
0 | 125 | +/- 12 | 1.00 | +/- 0.10 |
5 | 117 | +/- 17 | 0.94 | +/- 0.14 |
10 | 80 | +/- 14 | 0.64 | +/- 0.11 |
15 | 32.5 | +/- 2.5 | 0.26 | +/- 0.02 |
20 | 13.3 | +/- 1.2 | 0.11 | +/- 0.01 |
25 | 4.7 | +/- 0.7 | 0.04 | +/- 0.01 |
30 | 2.4 | +/- 0.5 | 0.02 | +/- 0.01 |

The surviving fraction is plotted in Figure 5. The graph starts out with a bit of a
shoulder and then after 10 seconds exposure, becomes a fairly straight line. We
need more data to study the dose dependence between 0 and 10 seconds, although
most likely it is not constant. After 10 seconds a straight line represents the data
well. The line in Figure 5 was drawn by Lotus 1-2-3 using linear regression on the
last five points, although we could have done as well using a ruler and our eyeball.
Many other curves will also fit these data. We are not creating a theory when we
draw a line through the points; we are just fitting data, whether we "eyeball it,"
calculate the "best fit" ourselves, or ask a statistical program to do a "linear
regression." The straight-line portion of the graph after 10 seconds, which
represents our data simply and clearly, has several uses.

One use is to suggest a theory. A downward sloping straight line on a
semi-log graph means "exponential decay." Just as the UV intensity drops
exponentially as it passes through the ozone layer, so do these cells die off
exponentially when exposed to more and more UV. Thus, after the first 10
seconds, only about 0.4 of the cells survive each successive 5 seconds of radiation.
At 10 seconds there are 80 cells; 0.4 of these = 32 survive until 15 seconds; 0.4 of
these = 13 survive until 20 seconds; 0.4 of these = 5 survive until 25 seconds; and
0.4 of these = 2 survive until 30 seconds. The straight line on the semi-log plot
suggests this model of events.
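If you would rather not trust a spreadsheet's black box, the least-squares line can be computed by hand. Below is a Python sketch fitting log10 of the surviving fraction for the last five doses (the fraction values come from our final table; the variable names are ours):

```python
import math

# Least-squares straight line through log10(surviving fraction) versus
# exposure time for the last five doses, mimicking the regression behind
# the Figure 5 fit (done there with Lotus 1-2-3; any tool will do).
points = [(10, 0.64), (15, 0.26), (20, 0.11), (25, 0.04), (30, 0.02)]
xs = [t for t, _ in points]
ys = [math.log10(f) for _, f in points]

n = len(points)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

# On a straight semi-log line, survival drops by a constant factor
# every 5 seconds:
factor_per_5s = 10 ** (slope * 5)
print(round(factor_per_5s, 2))   # close to the 0.4 used in the text
```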

Figure 5: Semi-log plot of the surviving fraction with a straight line fit.

Another use of the line is to economically express our data: after 10 seconds the fraction that survive is

(fraction at 10 seconds) x (0.4)^((time past 10 seconds)/5)

= (0.64) x (0.4)^((t - 10)/5).

The term "exponential decay" derives from this expression. The exposure
time is in the exponent; it is not just a multiplier.

This form also gives us the useful concept "LD10" for "lethal dose for 10
percent survival." The exposure time needed to get a surviving fraction = 0.1 is
about 20 seconds because if we use t = 20 in the above formula the exponent is
(20-10)/5 = 2 and the formula becomes (0.64) x
(0.4)^2 = 0.10.
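Solving the expression for the exposure time gives the LD10 directly (a short Python sketch; the variable names are ours):

```python
import math

# Solve (0.64) * 0.4 ** ((t - 10) / 5) = 0.1 for the exposure time t.
f10, factor, target = 0.64, 0.4, 0.10
ld10 = 10 + 5 * math.log(target / f10) / math.log(factor)
print(round(ld10, 1))   # about 20 seconds, as found above
```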

However we use the straight line fit to our data, the data themselves provide
a valuable monitor of the biologically-important UV intensity where you live.
Both the straight line fit and the graph itself reveal that only 0.1 of the cells
survive an exposure of 20 seconds to this lamp, the "lethal dose for 10 percent
survival," or LD10. When you do this experiment outdoors, you may find that on
one day a 4-minute exposure leaves only 0.1 of the cells surviving. Four minutes in the
sun would be the LD10, providing the same amount of UV exposure as 20 seconds
of the lamp used in this experiment. On the next day the LD10 might be only 3
minutes, suggesting that the UV is more intense than the day before.

There are other statistical notions that are useful in special situations, such as "confidence intervals" or "chi-square," but these are not appropriate for a first experience with statistics. (For an application of chi-square testing, see the notes on statistics in our photoreactivation experiment.) When these notions are combined with curve-fitting, we are usually trying to pin down an underlying theory with several parameters. Here we are just starting down that road; our data suggest that there might be an underlying theory of radiation damage that gives roughly a straight line, but assuming there is such a theory and setting confidence levels for its parameters is premature!

SUMMARY:

1. Observe effects and investigate causes.
2. Describe relationships qualitatively.
3. Plot all results on graphs; experiment with linear and semi-log plots.
4. Make graphs of averages using standard deviations as error bars, and consider other sources of error.
5. Look for the "best fit" between a simple curve (often a straight line on some graph) and the data. First use your eye, then use "least squares."
6. If a theory exists and specifies parameters for the curve in #5, then you may be able to determine a "confidence interval" for some of the parameters. This is not the standard situation in science: a glance through Physical Review or Genetics will reveal lots of error bars and few confidence levels, though when available these are important.


Last updated Wednesday, 04-Dec-2002 20:16:56 UTC