
Calculus is Impossible on Rainy Days

And other spurious correlations...


One of the true joys of being a data scientist is digging into a new data set -- exploring a new field, figuring out how different things interact and discovering correlations. Each field has its own unique quirks -- different factors that end up having enormous influence on what you see in the data. And there’s one particularly enjoyable way to learn about these quirks: drawing the most absurd conclusions you possibly can.

Today, we’ll forget for a moment that correlation doesn’t imply causation, and discover some of the most baffling things that affect how difficult math is.


Disclaimer: None of the things I’m about to say are truly causal -- all of these statements are merely a result of confounding factors and spurious correlations -- studying math on rainy days is excellent for you, I promise.



--------------------------------------------------------------------------------------------------------------------------


We all know that rainy and cold days feel dreary, dark, and more frustrating. But did you know that math is actually more difficult the colder it gets? Yep:



Slide1.png




If you look at accuracy across all math problems on Khan Academy, you’ll see that accuracy is almost 5% lower on the coldest days than the warmest days. This is a mind-bogglingly huge effect. Why does it happen? Is math really more difficult when it’s cold?


Of course not. What we’re really seeing is that seasonality has a huge effect on who is doing math problems. If we look at accuracy throughout the year, we see:



Slide5.png



The reason for these huge shifts is that there are many different motivations for using Khan Academy. Some folks use it for their own enrichment, enthusiastic about learning new things and reviewing things they have learned in the past, and these users are likely to stay active throughout the entire year, including the summer and the holidays. A less motivated user, however, may be less inclined to stay active when they’re not currently in school.
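To make the mechanism concrete, here is a minimal sketch of how a shift in who is active can move overall accuracy even when nobody gets better or worse at math. The two group names, their accuracies, and the problem counts are made-up numbers for illustration, not Khan Academy data.

```python
# Hypothetical per-group accuracy, held constant all year (illustrative values only).
ACCURACY = {"enthusiast": 0.85, "in_school_only": 0.75}

def overall_accuracy(problems_by_group):
    """Problem-weighted accuracy across groups whose own accuracy never changes."""
    total = sum(problems_by_group.values())
    correct = sum(ACCURACY[group] * n for group, n in problems_by_group.items())
    return correct / total

# Assumed activity mix: the in-school-only group does far fewer problems in summer.
winter = {"enthusiast": 1000, "in_school_only": 9000}
summer = {"enthusiast": 1000, "in_school_only": 1000}

print(f"winter: {overall_accuracy(winter):.3f}")
print(f"summer: {overall_accuracy(summer):.3f}")
```

In this toy setup the overall number moves by several points between seasons, purely because the mix of users changed.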


Here’s another fun fact: did you know that people are noticeably more accurate during football games? Afternoons during which there is a nationally-televised NFL game have an almost 1.5% higher accuracy rate:



Slide4.png
Fun Fact: If you zoom in far enough, all two-bar plots look extremely impressive.

Of course, as before, this is just because afternoon NFL games are all on Sunday (or Saturday in January!), and accuracy is far higher on the weekends than on weekdays:



Slide2.png



Similarly, users are more accurate during baseball games than basketball games (summer vs. winter), ice cream is absolutely awesome for your math abilities, ice skating is disastrous, and holidays are fantastic.


This ends up having significant implications for data science -- it’s very easy to reach highly misleading conclusions whenever you do anything that involves time. A new feature that affects more engaged and less engaged users differently can produce wildly different results depending upon the time of day, time of week, or even time of year that you launch it.

This might be obvious in any field when you launch something around the holidays or late at night, but for education in particular, the timing of back-to-school and school breaks is hugely important.
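One common guard against this, sketched here as an assumption rather than as how the analysis above was actually done, is to compare variants within the same time bucket instead of pooling everything. The records below are toy data (the variant names and all counts are hypothetical): the "new" variant was mostly observed on weekends, so it looks better when pooled even though it matches "old" within each bucket.

```python
from collections import defaultdict

def accuracy_by_stratum(records):
    """records: (stratum, variant, correct) tuples -> {stratum: {variant: accuracy}}."""
    counts = defaultdict(lambda: [0, 0])  # (stratum, variant) -> [correct, total]
    for stratum, variant, correct in records:
        counts[(stratum, variant)][0] += int(correct)
        counts[(stratum, variant)][1] += 1
    result = defaultdict(dict)
    for (stratum, variant), (n_correct, n_total) in counts.items():
        result[stratum][variant] = n_correct / n_total
    return {stratum: dict(variants) for stratum, variants in result.items()}

def make_records(stratum, variant, n_correct, n_total):
    """Toy helper: n_total attempts, of which the first n_correct are correct."""
    return [(stratum, variant, i < n_correct) for i in range(n_total)]

# Hypothetical data: identical accuracy within each bucket, different exposure.
records = (
    make_records("weekday", "old", 60, 80)
    + make_records("weekday", "new", 15, 20)
    + make_records("weekend", "old", 17, 20)
    + make_records("weekend", "new", 68, 80)
)

stratified = accuracy_by_stratum(records)
pooled = accuracy_by_stratum([("all", variant, ok) for _, variant, ok in records])

print("stratified:", stratified)
print("pooled:", pooled)
```

Within each bucket the two variants are indistinguishable; only the pooled comparison shows a gap, and that gap comes entirely from when each variant was observed.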


--------------------------------------------------------------------------------------------------------------------------


Quick question: What age group do you think is the most accurate on Khan Academy? The answer is 97-year-olds. In fact, 97-year-olds tend to answer over 85% of questions correctly, which is vastly higher than the average accuracy.


Why is this? It’s the same reason that the ‘best’ and ‘worst’ states in the U.S. are also the smallest ones -- averages computed from smaller samples have far higher variance. Only 17 users claim to be 97 years old, while younger ages typically have hundreds of thousands. Thus, while younger ages tend to be very close to the overall average, higher ages can vary wildly. Incidentally, the least accurate users are 99-year-olds.
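A quick simulation shows the effect. This is a rough sketch under simplifying assumptions: every simulated user has the same true accuracy (0.78, a made-up value), answers the same number of problems, and answers independently. Real users differ from one another, which would only widen the gap.

```python
import random

def observed_accuracy(n_users, rng, true_accuracy=0.78, problems_per_user=20):
    """Observed accuracy of one simulated group with identical true skill."""
    attempts = n_users * problems_per_user
    correct = sum(rng.random() < true_accuracy for _ in range(attempts))
    return correct / attempts

def accuracy_range(n_users, trials=100, seed=0):
    """Lowest and highest observed accuracy across many simulated groups."""
    rng = random.Random(seed)
    results = [observed_accuracy(n_users, rng) for _ in range(trials)]
    return min(results), max(results)

small = accuracy_range(17)    # a group the size of the 97-year-olds
large = accuracy_range(2000)  # a much larger (still hypothetical) group

print(f"17 users:   {small[0]:.3f} to {small[1]:.3f}")
print(f"2000 users: {large[0]:.3f} to {large[1]:.3f}")
```

The 17-user groups land all over the place while the larger groups stay close to the true value, so the extremes of any ranking tend to be claimed by the smallest groups.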


Another fun question: Which city has the highest mission completion rate in the world? You’re probably thinking this is another sample size trick, so let’s change it up slightly and ask: of cities with at least 100 purported users, which city has the highest mission completion rate?  


That would be Antarctic Great Wall Station, Antarctica. The average user from Antarctica has completed a staggering 2.3 entire missions.



BlogPost1.jpg



What’s causing this? Well, we’re all liars. When you select a city on Khan Academy, you choose from a dropdown menu of real cities -- so if you want to pick something ‘fun’, your options are somewhat limited. Antarctica is a pretty great choice.


In fact, 132 users claim to be from Antarctic Great Wall Station, Antarctica, which is pretty interesting when you consider that the fount of all true knowledge, Wikipedia, claims that the summer population is only 40 (winter: 14).


Users who choose this location also happen to be far more engaged, and far more accurate, than the average user. Other cities come pretty close: Nowhere Else, Tasmania, Australia is strangely popular too. In fact, since selecting a city is purely optional (and requires deliberately editing your profile), merely choosing one at all makes you far more accurate.


In conclusion,



  1. Calculus is impossible on rainy days.
  2. Watching football makes you far more accurate.
  3. Antarcticans are math experts.
  4. 97-year-olds are excellent at math, 99-year-olds not as much.

Have any good spurious correlations you’d like to share, or curious about this data and how it was collected? Leave a comment below! 



2 comments:

  1. No to be nitpicking......but while you are calling out statistical fallacies....you might as well point out to the dangers of truncated graphs.......a y axis ranging from 0.77 to 0.8 can be grossly misleading,... ;)

  2. Not nitpicky at all :) It is, indeed, quite misleading! (though in this case of course, intentionally)

    To be fair though, the typical variation in these numbers is tiny, and the difference here is very, very strongly significant and not at all trivial (knowing 5% more math would be awesome!)
