Posted by: Paul Hewitt | January 5, 2012

Good Judgment Project Performance

The Good Judgment Team is competing against other teams to see which one is able to “more accurately” predict future events (mainly political, so far).  After the first month of official predictions, the Good Judgment Team released the following statement by email (bold/italics are mine):

“Our forecasters are simply the best!  (That’s not just our opinion:  in the early days of the tournament, the Good Judgment Team’s aggregate forecasts have proven to be more accurate than those of any other research team participating in the IARPA tournament.)”

This got me to thinking.  How is the IARPA determining which Team is more accurate in their predictions?  I’ve posed the question to my team, but haven’t received a response, yet.  So, let’s make a few educated guesses.

Each Team has a large number of participants.  On our Team, there are a number of groups, presumably with some common characteristics, that are each predicting future events.  We took a variety of tests before joining the team, to measure or describe how we make decisions, process information, etc…

Almost all of the questions about future events are binary.  They will either happen or not, by a specific date.  Our Team uses a modified prediction market to generate a likelihood of each event occurring (more information here).  Now, this is where it gets interesting.  I’m guessing that most, if not all, of the Teams predicted the correct outcomes for most of the questions.  If our Team got one or two more correct than the other teams, does that really mean that we are “simply the best”?

Could it be that our collective likelihoods of the events that occurred were higher than those for the other Teams?  In other words, when an event did happen, our Team gave the event a higher likelihood of occurring.  Jeez, I hope not, for a number of reasons.  Remember, these are binary events.  Just because a likelihood is higher doesn’t mean that it is more correct than a lower likelihood prediction!  What we really want to compare is the calibration of the market predictions with market outcomes.  Unfortunately, there isn’t enough data, yet, to determine whether our predictions are better calibrated than any other Team’s.
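To make the calibration idea concrete: once enough questions have resolved, you can bin forecasts by stated probability and check whether each bin's events actually occurred at about that rate. The sketch below is a generic illustration with hypothetical data, not anything IARPA or GJP has described using:

```python
# Sketch: checking calibration by binning forecasts (hypothetical data,
# not GJP's actual method). A well-calibrated forecaster's 70% calls
# should come true roughly 70% of the time.

def calibration_table(forecasts, outcomes, n_bins=10):
    """forecasts: probabilities in [0, 1]; outcomes: 0/1 results.

    Returns (avg_forecast, observed_hit_rate, count) per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(forecasts, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, o))
    table = []
    for b in bins:
        if b:
            avg_forecast = sum(p for p, _ in b) / len(b)
            hit_rate = sum(o for _, o in b) / len(b)
            table.append((avg_forecast, hit_rate, len(b)))
    return table
```

The catch is exactly the one noted above: with only a month of binary questions, most bins are nearly empty, so the observed hit rates are too noisy to prove one team is better calibrated than another.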

These markets are kept open for trading until a day or so before the outcome is revealed, unless the outcome is determined prior to the anticipated closing.  Uncertainty surrounding the outcome decreases as time marches toward the market closing.  Consequently, at the market close, all that should remain is the irreducible uncertainty (random events that affect the outcome).  Accordingly, most markets should converge on a likelihood close to 100% for one of the binary outcomes, and there shouldn’t be very much variability among the Teams.

Could it be that accuracy is being determined at various points in time prior to the market close?  It’s a better basis, but again, we can’t prove calibration.  So, this isn’t likely the answer.  Maybe it’s the speed of adjusting predictions, given new information?  I doubt this one, too.  In some cases, information will lead one forecaster to conclude the event is more likely and another to conclude the opposite.  It would be impossible to determine whether the market was incorporating new information in every case.

Maybe our Team won more money.  Nope.  Basically, with an Automated Market Maker, except for the seed capital, it’s a zero-sum game.  All teams would do equally well, with the same system.
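We don't know which market maker these tournaments actually run, but Hanson's logarithmic market scoring rule (LMSR) is the standard automated market maker for prediction markets, and it illustrates the seed-capital point: the sponsor's worst-case loss is bounded (b·ln 2 for a binary market), and apart from that subsidy, traders' gains and losses net to zero. A minimal sketch:

```python
import math

# Sketch of Hanson's LMSR automated market maker for a binary market.
# This is an illustration, not the tournament's actual mechanism.
# q_yes / q_no: outstanding shares of each outcome; b: liquidity parameter.

def lmsr_cost(q_yes, q_no, b=100.0):
    """Cost function C(q): total money collected for the outstanding shares."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

def lmsr_price(q_yes, q_no, b=100.0):
    """Instantaneous YES price, which doubles as the implied probability."""
    e_yes = math.exp(q_yes / b)
    return e_yes / (e_yes + math.exp(q_no / b))

def trade_cost(q_yes, q_no, dq_yes, b=100.0):
    """What a trader pays to buy dq_yes YES shares at the current state."""
    return lmsr_cost(q_yes + dq_yes, q_no, b) - lmsr_cost(q_yes, q_no, b)

# Sponsor's worst-case loss (the "seed capital" subsidy) is b * ln(2).
```

With no trades the price sits at 50%; each purchase moves it toward the buyer's side, and since every trader's payment comes out of some other trader's eventual payout (plus the bounded subsidy), total winnings can't distinguish one team's system from another's.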


Let’s forget for a minute that these predictions are pretty useless if they’re only “accurate” immediately before the outcome is revealed.  How many times have I spouted on about this issue?  Also, they’re predicting binary events.  There’s no such thing as being almost right in a binary market.  So, even though it isn’t theoretically correct, I’m going to guess that the IARPA thinks a higher likelihood prediction is more accurate than a lower likelihood one, when the event does, in fact, come true.  Maybe that’s the best they can do, until they figure out the calibration issue.



  1. I’m in the GJP tournament as well, and our group uses Brier scoring to calculate accuracy. Brier scoring is essentially the square of the error of the prediction. It’s weighted equally by day in the period up to the question close. I would suspect that the GJP project is using that system as a whole to calculate the “aggregate” accuracy.

  2. Can you elaborate on the scoring mechanism? Specifically, how is the daily error calculated?

  3. The example they gave us goes something like this. Suppose the question is whether there will be a UN security resolution on Syria within the next 3 days:

    Day 1: predict 40% chance of resolution
    Day 2: 30% chance
    Day 3: 20% chance
    Final Result: no resolution

    Error of day 1 = (0-0.40)^2 + (1-0.6)^2 = 0.32
    Error of day 2 = (0-0.30)^2 + (1-0.7)^2 = 0.18
    Error of day 3 = (0-0.20)^2 + (1-0.8)^2 = 0.08

    Total score = (0.32+0.18+0.08)/3 ≈ 0.193

    I think when they say GJP is doing the best, they mean either a. our average Brier score as a group is best, or b. they created an aggregate predictor from all our individual calls, and this aggregate person scores best. Hard to tell which.
