Trolley Problem: Motion Fairness Analysis for Huber Debates 2012

Alfred Snider and James Hardy have been extremely kind (and unprecedentedly open) in sending me full tab data for a very recent tournament. The Huber Debates 2012 involved 80 teams in 20 rooms, so it's just large enough that we get valid statistical inference using large-sample approximations. I use the fairness test I describe here, which tests the null hypothesis that every team had an expected team score of 1.5.

Motion fairness tests follow.

motion
THW ban employers from accessing the criminal records of prospective employees.
means for round 1 (OG/OO/CG/CO)
1.4000 1.6000 1.2500 1.7500
p-value for round 1
0.7269
Hypothesis not rejected at the 5% significance level.

motion
THBT politicians should not glorify military service.
means for round 2
1.4000 1.3000 1.7500 1.5500
p-value for round 2
0.6997
Hypothesis not rejected at the 5% significance level.

motion
THW introduce a mandatory maximum working week of 40 hours.
means for round 3
1.1000 1.5500 1.6000 1.7500
p-value for round 3
0.1771
Hypothesis not rejected at the 5% significance level.

motion
THBT juries should not convict criminals who act in the public interest
means for round 4
0.7500 1.8500 1.0500 2.3500
p-value for round 4
1.2640*10^-8
Hypothesis rejected at all reasonable significance levels.

motion
THW defriend Facebook friends who post offensive political views.
means for round 5
0.8000 2.1000 1.7000 1.4000
p-value for round 5
2.8473*10^-5
Hypothesis rejected at all reasonable significance levels.

motion
THBT newly emerged democracies should ban religious parties.
means for round 6
1.1000 1.4000 1.5500 1.9500
p-value for round 6
0.1785
Hypothesis not rejected at the 5% significance level.

variance of team points by position, for all rounds
1.0924 1.1081 1.3443 1.2251
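
As a quick sanity check on the round-by-round output above, the sketch below recomputes the null value of 1.5 from the 3/2/1/0 points awarded per room in the WUDC format, confirms that each round's four position means sum to 6, and reports which position sits farthest from 1.5. It is only an illustration: the names (`ROUND_MEANS`, `largest_gap`, `summarise`) are made up for this example, and it does not reproduce the fairness test or its p-values.

```python
# Points handed out in each room: 3 for first place down to 0 for fourth.
POINTS = (3, 2, 1, 0)
POSITIONS = ("OG", "OO", "CG", "CO")

# Position means copied from the round-by-round output above.
ROUND_MEANS = {
    1: (1.40, 1.60, 1.25, 1.75),
    2: (1.40, 1.30, 1.75, 1.55),
    3: (1.10, 1.55, 1.60, 1.75),
    4: (0.75, 1.85, 1.05, 2.35),
    5: (0.80, 2.10, 1.70, 1.40),
    6: (1.10, 1.40, 1.55, 1.95),
}

# Expected score per position if the motion favours no position.
NULL_MEAN = sum(POINTS) / len(POINTS)


def largest_gap(means):
    """Return (position, deviation) for the mean farthest from NULL_MEAN.

    If two positions tie, the first one in OG/OO/CG/CO order is returned.
    """
    gaps = [m - NULL_MEAN for m in means]
    idx = max(range(len(gaps)), key=lambda i: abs(gaps[i]))
    return POSITIONS[idx], round(gaps[idx], 2)


def summarise():
    """Map each round number to its most advantaged or disadvantaged position."""
    rows = {}
    for rnd, means in ROUND_MEANS.items():
        # Every room hands out 3+2+1+0 = 6 points, so the four means must sum to 6.
        assert abs(sum(means) - sum(POINTS)) < 1e-9
        rows[rnd] = largest_gap(means)
    return rows


if __name__ == "__main__":
    for rnd, (pos, gap) in summarise().items():
        print(f"round {rnd}: {pos} {gap:+.2f}")
```

The size of a gap alone does not determine the p-value, which is why the formal test is still needed to judge whether a deviation is more than noise.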

Several observations:

It's really quite hard to set completely fair motions, and I think that the Huber Debates 2012 have done rather better on motion fairness than the average IV of a comparable size. (Though right now we haven't really enough data to tell whether that's the case.)
There are lots of things that are as important as motion fairness, which we can't quantify. Is the motion interesting? Does it allow better teams to reliably defeat worse teams, or is it fair only in the sense that a die is fair? Let's be careful not to over-weight the quantifiable.
What's particularly striking to me is the variation in p-values, round by round. For rounds 1 and 2, if the motion were fair, we'd see results at least as extreme about 70% of the time. For rounds 3 and 6, we'd observe results at least as extreme about 18% of the time. For rounds 4 and 5, the chance of observing such extreme results is microscopically small. For round 4 in particular, if the motion were fair, we'd have roughly a 1 in 80 million probability of observing results this extreme. (For comparison, that's a lower p-value than it takes to announce a new particle discovery in high-energy physics.)
This seems to imply that, even though the range of variation doesn't look that extreme, the data actually contain rounds that are so close to fair as to be indistinguishable from it, and rounds so far from fair that we can say so with scientific certainty. In turn, this suggests that most of the (alleged?) problems with position fairness in the WUDC format could in principle be fixed by careful motion-setting, which requires the accumulation of experience and the sharing of knowledge.
However, I'd point out that it's easy to find that a motion is unfair when you have hindsight and statistics on your side. It's much harder to do so when you're setting motions, particularly if you're setting new motions that generate worthwhile discussions. We shouldn't demand that CA teams be entirely clairvoyant, and I think that the CA team for Huber did a great job.
Finally, the sample variance of team points by position is consistent with the folk wisdom that teams in OG have less variance in their outcomes than teams in other positions.
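
The "one in N" readings of the p-values in the observations above are simple arithmetic, and can be checked with a few lines. The sketch below converts the reported p-values for rounds 4 and 5 into odds and compares them with a particle-physics benchmark. One assumption is flagged loudly: the benchmark is taken to be the conventional five-sigma rule (one-sided normal tail), which the post does not name explicitly. The function names are invented for this example.

```python
import math

# p-values copied from the output for rounds 4 and 5.
P_ROUND_4 = 1.2640e-8
P_ROUND_5 = 2.8473e-5


def one_in(p):
    """Express a p-value as the N in 'one in N'."""
    return 1.0 / p


def one_sided_normal_tail(sigmas):
    """P(Z > sigmas) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(sigmas / math.sqrt(2.0))


# Assumption: the discovery threshold meant in the text is the five-sigma convention.
FIVE_SIGMA_P = one_sided_normal_tail(5.0)

if __name__ == "__main__":
    print(f"round 4: about 1 in {one_in(P_ROUND_4):,.0f}")
    print(f"round 5: about 1 in {one_in(P_ROUND_5):,.0f}")
    print(f"five-sigma threshold: p = {FIVE_SIGMA_P:.3e}")
    print("round 4 below threshold:", P_ROUND_4 < FIVE_SIGMA_P)
    print("round 5 below threshold:", P_ROUND_5 < FIVE_SIGMA_P)
```

Under that assumption, only round 4 clears the five-sigma bar; round 5 is still far below any conventional significance level, but does not reach the physics threshold.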


Tuesday, November 13, 2012
