Mary Richardson
Department of Statistics
Grand Valley State University
1 Campus Drive
Allendale, MI 49401-9403
Statistics Teaching and Resource Library, March 17, 2003
© 2003 by Mary Richardson, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Dawson (1995) presented a data set
giving a population at risk and fatalities for an “unusual episode” (the
sinking of the ocean liner Titanic) and discussed the use of the data set
in a first statistics course as an elementary exercise in statistical
thinking, the goal being to deduce the origin of the data. Simonoff (1997)
discussed the use of this data set in a second statistics course to
illustrate logistic regression. Moore (2000) used an abbreviated form of
the data set in a chapter exercise on the chi-square test. This article
describes an activity that illustrates contingency table (two-way table)
analysis. Students use contingency tables to analyze the “unusual episode”
data (from Dawson 1995) and attempt to use their analysis to deduce the
origin of the data. The activity is appropriate for use in an introductory
college statistics course or in a high school AP statistics course.
Key words: contingency table (two-way table), conditional distribution
Objectives
After completing the activity, students will understand:
- How to construct and interpret a contingency table.
- How to construct and interpret conditional distributions.
- The usefulness of contingency tables.
Materials and equipment
The activity can be completed
interactively or as homework. If the activity is completed
interactively, students work in groups of three to five. Each student needs
a copy of the activity.
Time involved
The estimated interactive completion
time is a one-hour class period.
Activity description
Prior to completing the activity,
students should be familiar with the basics of setting up contingency
tables.
To begin the interactive activity, the background of the data is
discussed. The sinking of the ocean liner Titanic on April 15, 1912, after
it collided with an iceberg, is referred to as an “unusual episode.” The
initial data tables (given in the Student’s Version of the activity) give
counts for the population at risk and the deaths for the passengers on the
Titanic. The 2201 people at risk are categorized by economic status (I,
II, III, or Other), age (child or adult), gender (female or male), and
survival status (survived or did not survive). Economic status is
determined based on the class in which the passengers traveled:
first-class (I), second-class (II), third-class (III), or crew member
(Other). The goal of completing the activity is to determine the
historical mortality episode that produced the data.
Through using two-way tables to analyze the data, students discover
interesting characteristics of the data that should help them to determine
the nature of the “unusual episode.” To complete the activity, students
are asked to answer a series of questions based on the data. Each question
is intended to highlight a different characteristic of the data.
Teacher notes
While working on the activity, students
are allowed to ask the instructor questions about the origin of the data.
One question that is commonly asked is “What is the Other group?” The
instructor might answer this question by pointing out that there are no
children in the Other group, only 3 females, and this group does not
completely fit into an economic status characterization. Another question
that is commonly asked is “When did this 'unusual episode' occur?” The
instructor might choose to answer this question by giving the year of the
sinking of the Titanic (which will more than likely give away the answer)
or the instructor might simply say that the “unusual episode” is not a
recent event. Another point that the instructor might want to make is that
the “unusual episode” was an isolated incident and that there were only
2201 people at risk (which might eliminate erroneous guesses for which
many thousands or even millions of people were at risk).
Some interesting characteristics of the data are:
- 68% of the people at risk died
- 92% of the people who died were male
- The death rate was higher for the lower economic status groups (especially
  among females)
- There were no children in the Other economic status group and only 3 females
  (out of 673)
- The only deaths of children were in the third-class.
In a typical class with several groups,
at least one of the groups will usually correctly guess the origin of the
data. Here are some example group responses to the question: “What
'unusual episode' in history do you think this data set describes?”
“This is probably the death stats for
the sinking of the Titanic since rich were put on the lifeboats first and
women and children took precedence over men. The 'other' could/would be
crew explaining why there would be no children in that category.”
“We think this unusual episode is the sinking of the Titanic. We believe
this because the ship did consist of men/women and children. The reason
for the women’s death being so low is due to the fact that they were the
first to be shipped on the safety of the other boats. We also believe that
economic status I, II, III and other is the wealth distribution throughout
the boat, I consisted of the wealthy, II consisted of the middle class,
III consisted of the lower deck which had a hard time escaping because
they were so close to the bottom of the ship, and we believe that the
other class represents the workers on the ship. (They were the closest to
the bottom of the ship as well and last to get off the ship as well.) This
'unusual episode' is the sinking of the Titanic, and that is our educated
guess.”
“The data set could be explaining WWI. The rich could buy their way out of
the war so they wouldn’t have as many people at risk. Women would be found
in hospitals and other non-battle areas so they would be less at risk.
And, children would not be present for the most part of the war.”
“We think the set describes the Civil War. Our reasoning is because men
fought in the wars and the Civil War is when women started to be nurses
for the Army. They were exposed to the battlefield. The children that died
could have been at risk due to their age. If the child was near 18 years
old they would have gone to fight. If they were not 18 years old they
would be considered children still.”
“This unusual episode data could be explaining heart failure. Look at the
data it shows that men at a lower economic status die of it. This holds
true for heart failure. More men die of heart failure than women and
children. Also the lower the economic status you are the less treatment
you are able to receive.”
“We initially figured this data was describing the Black Plague, which
would describe the differences in deaths in the different social classes.
But this wouldn’t support the differences in gender and age. Our best
guess is that this data describes the Nazi persecution of the Jews in the
30’s and early 40’s. Higher educated men and women were likely considered
either useful or desirable and lower income children very undesirable or
useful. The gender differences are probably explained by men being
subjected to more harsh conditions because of physical work ability.”
“We believe the unusual episode that is being described is the sinking of
the Titanic. First of all we see that a high number of male adults perish,
and a formidably smaller amount of adult women and children perished. This
would support the 'women and children first' ideal of 1912. Based on
economic status we can see that a larger number of high-class citizens
(male and female alike) managed to survive. While the highest numerical
amount of deaths occurred in the lower two classes. In fact, the only
children that perished were lower class ones. We also see by sheer number,
there were more men, more lower class citizens, and few children. All of
these factors would have been common place in travel (due to society,
immigration and other factors) during the era of the tragedy. In general
the total number of occupants seems similar to those that would have been
aboard, plus the high mortality rate (68%) is common knowledge of the
event.”
Through completing the activity,
students see an illustration of the usefulness of two-way tables for
summarizing two categorical variables. In addition, constructing
appropriate conditional distributions illustrates how to informally use
two-way tables to determine if two categorical variables may be
associated.
After completion of the activity, the instructor might have a summary
discussion. One possible point for discussion is the fact that, overall,
the data set is hard to interpret. There are many classifications, and
raw counts cannot be compared directly because the subgroups differ in size. However, by
breaking down the data, focusing on two-way tables, and calculating
conditional percentages, more useful information can be obtained. We can
see that women had a much lower likelihood of death than men, and the rich
had a lower likelihood of death than the poor (especially for women). At
this point, students quite often comment that the motion
picture Titanic (released in 1997) portrays the third-class
passengers (whose cabins were in the lower levels of the ship) as being
prevented from moving to the top level of the ship after the collision
with the iceberg (although this portrayal has not been confirmed historically).
A point of caution here is that the activity involves a very informal
analysis. In general, collapsing an initial contingency table over
variables without examining associations between all of the variables at
once leaves open the possibility of Simpson’s paradox occurring. The
instructor should preface completion of the activity by telling students
that a more formal analysis of contingency table data can be completed
with more sophisticated statistical tools.
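The caution about collapsing tables can be illustrated with a minimal Python sketch. The counts below are hypothetical and are not the Titanic data; the group labels (A, B), the strata, and the helper name `death_rate` are all invented for illustration. The sketch simply shows how rates computed within each stratum can point one way while rates computed from the collapsed table point the other.

```python
# HYPOTHETICAL counts (not the Titanic data): (deaths, people at risk)
# for two groups, A and B, within each of two strata.
counts = {
    "stratum 1": {"A": (1, 10), "B": (20, 100)},
    "stratum 2": {"A": (50, 100), "B": (6, 10)},
}


def death_rate(deaths, at_risk):
    """Return the death rate as a percentage of the people at risk."""
    return 100 * deaths / at_risk


# Death rates within each stratum (the un-collapsed table).
for stratum, groups in counts.items():
    for group, (deaths, at_risk) in groups.items():
        print(f"{stratum}, group {group}: {death_rate(deaths, at_risk):.1f}%")

# Collapse over the stratum variable by summing deaths and people at risk.
collapsed = {}
for groups in counts.values():
    for group, (deaths, at_risk) in groups.items():
        d, n = collapsed.get(group, (0, 0))
        collapsed[group] = (d + deaths, n + at_risk)

# Death rates in the collapsed two-way table.
for group, (deaths, at_risk) in collapsed.items():
    print(f"collapsed, group {group}: {death_rate(deaths, at_risk):.1f}%")
```

In this made-up example, group A has the lower death rate within each stratum (10% versus 20%, and 50% versus 60%) but the higher death rate once the strata are combined (about 46% versus 24%), because most of group A falls in the high-risk stratum.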
Assessment
Students should understand how to
construct and interpret a contingency table. In addition, students should
understand how to construct and interpret conditional distributions.
The following test question can be used to assess student understanding.
An insurance company has examined a
large number of claims resulting from low speed collisions of vehicles and
has classified the claims according to type of vehicle and to whether the
claim was for more than $10,000. The data are shown below.
|              | Type of Vehicle |       |               |
| Claim Amount | Car             | Truck | Sport utility |
| >$10,000     | 147             | 120   | 270           |
| ≤$10,000     | 470             | 280   | 330           |
a. The company would like to learn more
   about the relationship between claim amount and type of vehicle. In
   particular, the company would like to compare the claim amounts for
   each type of vehicle. What conditional distributions should the
   company compute?
b. Provide the conditional distributions
   stated in part a.
c. Do you think there is an association
   between the type of vehicle and the claim amount? Explain.
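For instructors who want to check answers to the first two parts with software, the following is a minimal Python sketch; it is not part of the original activity. It assumes the comparison of interest is the distribution of claim amount within each vehicle type (that is, percentages computed within each column of the table). The function name `conditional_distributions` and the dictionary layout are illustrative choices, and the key "<=$10,000" is an ASCII stand-in for the "at most $10,000" row.

```python
# Counts copied from the assessment table: claims[vehicle][claim_amount].
claims = {
    "Car": {">$10,000": 147, "<=$10,000": 470},
    "Truck": {">$10,000": 120, "<=$10,000": 280},
    "Sport utility": {">$10,000": 270, "<=$10,000": 330},
}


def conditional_distributions(table):
    """For each group, return the percentage of its total in each category."""
    result = {}
    for group, counts in table.items():
        total = sum(counts.values())
        result[group] = {cat: 100 * n / total for cat, n in counts.items()}
    return result


dist = conditional_distributions(claims)

# Print one conditional distribution per vehicle type.
for vehicle, percents in dist.items():
    summary = ", ".join(f"{cat}: {pct:.1f}%" for cat, pct in percents.items())
    print(f"{vehicle}: {summary}")
```

Under these assumptions, roughly 24% of car claims, 30% of truck claims, and 45% of sport utility claims exceed $10,000, which is the kind of comparison the final part of the question asks students to interpret.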
References
Dawson, Robert J. M. (1995). The ‘Unusual Episode’ Data Revisited. Journal
of Statistics Education [on-line] 3(3). (http://www.amstat.org/publications/jse/v3n3/datasets.dawson.html).
Moore, David S. (2000). The Basic Practice of Statistics, 2nd
edition. New York: W. H. Freeman and Company.
Simonoff, Jeffrey S. (1997). The ‘Unusual Episode’ and a Second
Statistics Course. Journal of Statistics Education [on-line]
5(1). (http://www.amstat.org/publications/jse/v5n1/simonoff.html).