GIGO (Garbage in, garbage out)

A recent paper published by researchers from Princeton predicts that “Facebook will undergo a rapid decline in the coming years, losing 80 percent of its peak user base between 2015 and 2017”. Their modelling approach is based on an epidemic model, the so-called infectious recovery SIR model (irSIR).
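
For context, here is roughly what the irSIR model looks like: in the SIR analogy, susceptible people S adopt the network through contact with infected (i.e. active) users I, and abandonment also spreads through contact with recovered (former) users R. The sketch below is my reading of the equations; the parameter values and the scipy-based integration are purely illustrative, not the authors' actual code.

```python
# Minimal sketch of the irSIR ("infectious recovery" SIR) dynamics, as I read
# them from the paper. All parameter values here are made up for illustration.
import numpy as np
from scipy.integrate import odeint

def irsir(y, t, beta, nu):
    S, I, R = y
    N = S + I + R
    dS = -beta * I * S / N                   # adoption through contact with active users
    dI = beta * I * S / N - nu * I * R / N   # active users "recover" via contact with quitters
    dR = nu * I * R / N                      # abandonment spreads like a second infection
    return [dS, dI, dR]

t = np.linspace(0, 15, 300)                        # years, arbitrary scale
y0 = [0.99, 0.01, 0.0001]                          # initial S, I, R fractions (assumed)
S, I, R = odeint(irsir, y0, t, args=(1.5, 0.6)).T  # beta, nu chosen for illustration
```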


Actually, the modelling approach doesn't really matter, because the real problem with their analysis is the input data they use for this forecast: Google Trends data. It's the same data Google uses to estimate flu activity around the world, quite precisely. They use the number of searches for "Facebook" as a proxy for user activity, because it's publicly available. This is the problem with so many data-sciency analyses: a more or less advanced modelling approach is chosen that hides the most crucial aspect of the whole analysis, the input data. I've seen this behaviour many times: "hey, we only have access to this dubious data, but let's use this fancy algorithm on it so everybody thinks we're really smart and won't question the validity of our analysis"...
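
If you want to poke at the same kind of input data yourself, the third-party pytrends package can pull Google Trends series. It's an unofficial wrapper, so its interface may change, and it's not necessarily how the authors obtained their data; a minimal sketch:

```python
# Hypothetical sketch: pulling weekly "facebook" search interest via the
# unofficial pytrends wrapper. Keyword, date range and settings are assumptions.
from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US', tz=360)
pytrends.build_payload(['facebook'], timeframe='2004-01-01 2014-01-22')
trend = pytrends.interest_over_time()   # pandas DataFrame indexed by date
print(trend['facebook'].tail())
```

Note that the values are relative search interest, normalized to a 0-100 scale, not absolute search volumes; that alone should make you careful about reading them as a measure of user activity.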


The authors of the paper argue that "[t]he public nature of this data is an important feature of the approach used in this paper, as historical OSN [Online Social Network] user activity data is typically proprietary and difficult to obtain", which basically means they chose this data because they didn't have access to better data. Furthermore, "the use of search query data for the study of OSN adoption is advantageous compared to using registration or membership data in that search query data provides a measure of the level of web traffic for a given OSN". They should clarify that it provides a measure of the level of web traffic coming from Google. What they fail to prove or even discuss is whether it's a reliable proxy for Facebook's overall traffic and user activity.


Google Trends data is aggregated Google search data. Now please step back for a moment and think about how well that data describes how active Facebook's users are or will be. Sometimes it helps to think in extremes: how many Facebook-related search queries would be issued if every internet user were on Facebook on a daily basis? How many would there be if no one were on Facebook anymore? When does someone search on Google for anything Facebook-related? When she's an active user, or rather as a new user who isn't that active yet? And if almost half of Facebook's users are mobile-only, how many of them reach Facebook through a Google search at all? These are the kinds of questions one should start with to understand how good the input data is. For other reasons why trend data might be inappropriate, check out Facebook's interesting rebuttal.


Secondly, why are they using MySpace to fit their model? "MySpace is a particularly useful case study for model validation because it represents one of the largest OSNs in history to exhibit the full life cycle of an OSN, from rise to fall." This smells like selection and confirmation bias. Why didn't they also validate against an OSN that hasn't fallen yet, for example LinkedIn, which is even older than MySpace and doesn't seem to be anywhere close to falling?
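
To make concrete what "fit their model" means here: you integrate the irSIR equations and tune the parameters so that the active-user curve I(t) tracks the normalized search-interest series. Below is a rough, self-contained sketch of such a fit, using synthetic stand-in data rather than MySpace's actual trend series:

```python
# Rough sketch of fitting the irSIR active-user curve I(t) to a normalized
# search-interest series. The "observed" data is synthetic noise around a
# known curve, purely a stand-in for real Google Trends data.
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import curve_fit

def irsir(y, t, beta, nu):
    S, I, R = y
    N = S + I + R
    return [-beta * I * S / N,
            beta * I * S / N - nu * I * R / N,
            nu * I * R / N]

def active_users(t, beta, nu, i0):
    # Fraction of active ("infected") users over time for given parameters.
    sol = odeint(irsir, [1.0 - i0, i0, 1e-4], t, args=(beta, nu))
    return sol[:, 1]

t_obs = np.linspace(0, 12, 60)                              # years, arbitrary
y_obs = active_users(t_obs, 2.0, 0.8, 0.02)                 # synthetic "truth"
y_obs = y_obs + np.random.normal(0, 0.01, size=t_obs.size)  # plus some noise

popt, _ = curve_fit(active_users, t_obs, y_obs, p0=[1.0, 0.5, 0.05],
                    bounds=([0.0, 0.0, 1e-4], [10.0, 10.0, 0.5]))
print("fitted beta, nu, i0:", popt)
```

The mechanics of the fit are the easy part; the hard part, and the point of this post, is whether the series you fit against and the network you validate on are reasonable choices.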


So for me this analysis, even though it comes from an institution with quite a reputation, is a classic example of garbage in, garbage out. It doesn't matter how sophisticated or state-of-the-art your modelling algorithm is: if the input data is garbage, your analysis results are garbage. So please cut the BS and start the process by asking what the best data you could possibly get is for your analysis goal. Don't reach for an algorithm whose complexity grows with the BSness of your data.

