What is big data?

How would you define Big Data? Big hype, Big BS? The best definition I heard thus far is from George Dyson, who recently stated at the Strata conference in London that the era of big data started “when the human cost of making the decision of throwing something away became higher than the machine cost of continuing to store it.” What I like about this definition is that it is a more abstract framing of the term and therefore has a broader validity. It implies a relation between two kinds of costs and thus does not depend on absolute values in terms of TBs. Classical definitions you can find on wikipedia talk about big data being “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.” How will “on-hand” databases look like ten years from now and won’t they be capable of easily process what we today consider to be big?

 

There are also the three “Vs” that are suppose to be typical for Big Data: Volume, Variety and Velocity. We talked about volume being a moving target, dozens of tera bytes, peta bytes, whatever. There are sectors that have dealt with what today is considered to be big data long before the term big data has been coined. For example scientific data volumes have always been among the largest, just think of the vast amounts of data that are beign processed at CERN (one petabyte per second) and in the LHC Computing Grid (many petabytes per year).

 

Variety is something that has been managed in the past as well, before big data became such a hype. As a data miner you always look for new and different data sources to increase the quality of your models and the amount of variance the model can explain. When I worked at Otto we looked into integrating messy call-center data and weather data into models which would predict customer responses, i.e. sales, for more efficiently sending out catalogues. At xing we found a way to integrate the non-standardized tags with which a user can describe his profile (wants, haves), which definitely would fall under the data variety aspect (see paper).

 

Thus in my opinion the most noteable aspect of big data is velocity: the aim is to get away from daily batch processes updating your data, analysis and models once a day, to online streaming updates. A good example for such a use case are recommender systems, whose underlying models shouldn’t be updated only once a day but online to reflect the current trends in user behaviour. If users suddenly started buying product z with product a, you don’t want to wait until the next day to update your recommendations. You would lose sales by not recommending product Z to users who already looked at product A but not Z. Another example is the notorious prediction of click-through-rates (CTR) in online marketing environments. This is done to decided which ad to serve or to determine how much to spend for an ad. Here is another interesting use case of real time analytics:http://userguide.socialbro.com/post/16003931427/how-can-real-time-analytics-for-twitter-be-useful-for-yo. One of the more important technologies with regard to stream processing of data is Storm (http://storm-project.net/). The most promiment user of Storm is Twitter, which acquired the company that developed Storm, BackType, in 2011.

Write a comment

Comments: 0