At the end of February, news made the rounds that Google awarded US$750,000 to “the automatic statistician”, a project at the University of Cambridge led by Zoubin Ghahramani, Professor of Engineering. The ultimate aim of the project is to produce an artificially intelligent (AI) system for statistics and data science. Automating the process of statistical modelling would have a huge impact on every field that relies on statistics, machine learning, or data science experts. Even though it is easier than ever to collect, store, and combine all kinds of data, very few people are trained in the data science methods required to derive models and knowledge from this data and produce predictions. The “automatic statistician” produces a 10-15 page “human readable” report describing patterns discovered in the data and returns a statistical model; Bayesian model selection is used to automatically select good models and features.
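The actual system searches a much richer space of Gaussian-process kernel structures, but the core idea behind Bayesian model selection can be sketched with the BIC approximation to the log marginal likelihood: fit each candidate model, then trade goodness of fit against model complexity. The synthetic data and the constant-vs-linear comparison below are invented purely for illustration.

```python
# Sketch: Bayesian model selection via the BIC approximation.
# We compare a constant-mean model against a simple linear model
# on synthetic data whose true generating process is linear.
import math
import random

random.seed(0)
xs = [i / 50 for i in range(50)]
ys = [1.0 + 2.0 * x + random.gauss(0, 0.1) for x in xs]  # true model: linear

def bic(rss, n, k):
    # BIC = n * log(RSS / n) + k * log(n); lower is better.
    # The k * log(n) term penalizes extra parameters.
    return n * math.log(rss / n) + k * math.log(n)

n = len(xs)
mean_y = sum(ys) / n

# Model "constant": predict the mean of y (k = 1 parameter).
rss_const = sum((y - mean_y) ** 2 for y in ys)

# Model "linear": simple linear regression in closed form (k = 2).
mean_x = sum(xs) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
rss_lin = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

scores = {"constant": bic(rss_const, n, 1), "linear": bic(rss_lin, n, 2)}
best = min(scores, key=scores.get)
print(best)  # the linear model fits far better, so it wins despite the penalty
```

The same scoring loop, applied over a grammar of model structures rather than two hand-picked candidates, is the essence of an automated model search.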
Looking at the example reports on http://www.automaticstatistician.com/ it is easy to imagine how this could be part of the future of data science. A data analytics project involves specific steps and craftsmanship, in the sense that certain processes and rules need to be followed. For example, when developing a predictive model one would use hold-out samples, cross-validate, check for multicollinearity, filter outliers, transform categorical data into dummy variables, and impute missing values. It would be revolutionary to have a tool that is fed some cleansed input data and then automatically chooses the best statistical model describing the data. This way, some of the tedious and error-prone work of trying out different statistical models and their parameters would disappear.
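A few of the routine steps listed above can be sketched in plain Python: a hold-out split, mean imputation of missing numeric values, and one-hot (“dummy”) encoding of a categorical field. A real project would use a library such as scikit-learn; the field names and toy records here are invented for illustration.

```python
# Sketch of routine preprocessing for a predictive-modeling task.
import random

random.seed(42)
rows = [
    {"income": 52000, "region": "north", "churned": 0},
    {"income": None,  "region": "south", "churned": 1},
    {"income": 61000, "region": "north", "churned": 0},
    {"income": 48000, "region": "east",  "churned": 1},
    {"income": 75000, "region": "south", "churned": 0},
    {"income": None,  "region": "east",  "churned": 1},
]

# 1. Hold-out split: keep a test set the model never sees during fitting.
random.shuffle(rows)
split = int(0.7 * len(rows))
train, test = rows[:split], rows[split:]

# 2. Impute missing income with the training-set mean.
#    (Statistics are computed on the training split only, to avoid
#    leaking information from the hold-out set.)
known = [r["income"] for r in train if r["income"] is not None]
mean_income = sum(known) / len(known)

# 3. One-hot encode the categorical "region" field, with the set of
#    categories fixed from the training split.
categories = sorted({r["region"] for r in train})

def featurize(row):
    income = row["income"] if row["income"] is not None else mean_income
    dummies = [1.0 if row["region"] == c else 0.0 for c in categories]
    return [income] + dummies

X_train = [featurize(r) for r in train]
X_test = [featurize(r) for r in test]
print(len(X_train), len(X_test), len(X_train[0]))
```

Deriving the imputation mean and the category set from the training split alone is exactly the kind of rule-following craftsmanship an automated system would need to encode.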
But even with the “automatic statistician”, a business aiming to derive concrete actions from data analytics still needs someone who can interpret the 10-15 page report and communicate the insights to management and implementation teams. With systems taking over more of the statistics and machine learning part of data science, communication skills and expertise in the specific vertical become even more important. As John Foreman argued in a blog post stressing the importance of soft skills in data science, we need more translators to embed data science processes and insights as deeply into organizations as possible.
Foreman says data scientists should “push to be viewed as a person worth talking to and not as an extension of some number-crunching machine that problems are thrown at from a distance”. With all the analytics tools and data available, almost every analytical problem is to some degree solvable. The key skill, then, is to ask the right questions and to avoid working on a “poorly posed problem”. Solving the wrong analytics problem can happen when someone outside the analytics team (e.g. management, marketing) frames the problem based on past experience, possibly without analytics skills, and hands the task to the data scientist as if it were set in stone.
Kaggle is a platform that runs data science competitions and invites anyone to contribute algorithms to solve a specific machine learning problem. In February, news broke that Kaggle was cutting a third of its staff and exploring new ways of making money. One could hypothesize that a key reason for Kaggle’s problems lies in treating the development of machine learning algorithms as something separable from core business operations: too much business context risks getting lost when a business problem is abstracted into the “outsourceable” development of an algorithm. In a similar vein, Netflix never implemented the recommender algorithm that won its renowned $1 million Netflix Prize, citing changes in its business model and the engineering effort required to put the costly algorithm into production.
This underlines the importance of having analytics experts with well-rounded soft skills in-house. These kinds of “business scientists” translate between business and data analytics, and they are a prerequisite for efficiently embedding a system like the “automatic statistician” in an organization.