"The primary question is not 'What dowe know?', but 'How do we know it?' "

Aristotle, to Thales

4.3 RELATIONS IN CATEGORICAL DATA (Pages 215-226)

OVERVIEW: We can see relations between two or more categorical variables by setting up tables. Up to this point, we have studied relations in which at least the response variable was quantitative.

A two-way table of counts describes the relationship between two categorical variables: the row variable and the column variable. The row totals and column totals give the marginal distributions of the two variables separately, but do not give any information about the relationship between the variables. Probabilities, including conditional probabilities, can be calculated from two-way tables.

Simple example:
Notation:
...Prob(X) is the probability that X is true.
...Prob(X|Y) is the probability that X is true, given that Y is true.

Two hundred employees of a company are classified according to the following 2-by-3 table, where A, B, and C are mutually exclusive properties.

                 Have A   Have B   Have C   ROW TOTALS
FEMALE             20       40       60        120
MALE               30       10       40         80
COLUMN TOTALS      50       50      100        200

o What is the probability that a randomly chosen person is female?

Ans. Prob(F) = 120/200 = 60%.

o What is the probability that a randomly chosen person has property A?

Ans. Prob(A) = 50/200 = 25%.

o If a randomly chosen person is female, what is the probability that she has property B?

Ans. Prob(B|F) = 40/120 = 33 1/3% [= Prob(B and F)/Prob(F).]

o If a randomly chosen person has property C, what is the probability that the individual is a male?

Ans. Prob(M|C) = 40/100 = 40% [= Prob(C and M)/Prob(C).]

o If a randomly chosen person has B or C, what is the probability that the person is a male?

Ans. Prob(M|B or C) = 50/150 = 33 1/3%.
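
The answers above can be checked mechanically. The following is a minimal Python sketch, not part of the original notes: the names counts, total, and prob are illustrative assumptions, and the counts are copied from the two-way table. Exact fractions are used so that 33 1/3% appears as 1/3 rather than a rounded decimal.

```python
from fractions import Fraction

# Counts copied from the two-way table (rows: sex, columns: property).
counts = {
    "FEMALE": {"A": 20, "B": 40, "C": 60},
    "MALE": {"A": 30, "B": 10, "C": 40},
}
ALL_ROWS = ("FEMALE", "MALE")
ALL_COLS = ("A", "B", "C")


def total(rows, cols):
    """Number of employees whose row is in `rows` and whose column is in `cols`."""
    return sum(counts[r][c] for r in rows for c in cols)


def prob(rows=ALL_ROWS, cols=ALL_COLS, given_rows=ALL_ROWS, given_cols=ALL_COLS):
    """Prob(event | condition), computed as count(event and condition) / count(condition)."""
    joint_rows = [r for r in rows if r in given_rows]
    joint_cols = [c for c in cols if c in given_cols]
    return Fraction(total(joint_rows, joint_cols), total(given_rows, given_cols))


p_f = prob(rows=["FEMALE"])                                 # Prob(F)
p_a = prob(cols=["A"])                                      # Prob(A)
p_b_given_f = prob(cols=["B"], given_rows=["FEMALE"])       # Prob(B|F)
p_m_given_c = prob(rows=["MALE"], given_cols=["C"])         # Prob(M|C)
p_m_given_b_or_c = prob(rows=["MALE"], given_cols=["B", "C"])  # Prob(M|B or C)

print(p_f, p_a, p_b_given_f, p_m_given_c, p_m_given_b_or_c)
# prints: 3/5 1/4 1/3 2/5 1/3
```

Each conditional probability restricts the table to the rows or columns named in the condition and then divides the matching count by the restricted total, which is the same as Prob(X and Y)/Prob(Y).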

===================================

Here are the batting averages of two baseball players for both halves of a season.
[Batting average is simply the ratio of number of hits to number of times at bat.]

            FIRST HALF-SEASON                          SECOND HALF-SEASON
            Hits   Times at bat   Batting average      Hits   Times at bat   Batting average
Caldwell     60        200             .300             50        200             .250
Wilson       29        100             .290              1          5             .200

Here are the batting averages for the entire season.

Caldwell: 110/400 = .275

Wilson: 30/105 = .286

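Caldwell, despite having a better average than Wilson in each half of the season, ends up with an overall average that is lower than Wilson's. This reversal is an example of Simpson's paradox: Wilson had nearly all of his at-bats in the first half, when both players hit better, so the two halves carry very different weights in his season average. Using percentages, one can construct numerous examples of Simpson's paradox.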

From an algebraic standpoint:
If a/b > c/d and p/q > r/s, then
...it is true that a/b + p/q > c/d + r/s.
...it is not necessarily true that (a+p)/(b+q) > (c+r)/(d+s).
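
The batting data give a concrete instance of this algebra, with a/b = 60/200, c/d = 29/100, p/q = 50/200, and r/s = 1/5. The following is a minimal Python sketch, not part of the original notes; the names pooled, caldwell, and wilson are illustrative assumptions.

```python
from fractions import Fraction

# (hits, times at bat) for each half-season, copied from the batting table.
caldwell = [(60, 200), (50, 200)]
wilson = [(29, 100), (1, 5)]


def pooled(halves):
    """Season average (a+p)/(b+q): total hits divided by total times at bat."""
    hits = sum(h for h, _ in halves)
    at_bats = sum(n for _, n in halves)
    return Fraction(hits, at_bats)


# a/b > c/d and p/q > r/s: Caldwell is ahead in each half.
ahead_in_each_half = all(
    Fraction(*c) > Fraction(*w) for c, w in zip(caldwell, wilson)
)

# a/b + p/q > c/d + r/s: the sums of the ratios keep the same order.
ahead_in_sum = sum(Fraction(*c) for c in caldwell) > sum(Fraction(*w) for w in wilson)

# (a+p)/(b+q) versus (c+r)/(d+s): the pooled ratios reverse the order.
caldwell_season = pooled(caldwell)
wilson_season = pooled(wilson)

print(ahead_in_each_half, ahead_in_sum)   # prints: True True
print(caldwell_season, wilson_season)     # prints: 11/40 2/7
print(caldwell_season > wilson_season)    # prints: False
```

The pooled ratio is a weighted average of the two half-season ratios, with weights proportional to times at bat, so unequal weights (100 versus 5 for Wilson) are what allow the order to reverse.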