As has been suggested in this summary, the theme of this year's Strata was "data demystified". I really enjoy reading about all the efforts and shifts towards taking the "magic", "scienceness" and "academia" out of data analytics. In my opinion there are still too many people who want to make data analytics look a lot more complex than it is, something to be worked on only by an elite of the brightest of the bright. (Similarly, there are too many people talking about data science or big data who have never fired a single SQL query or done a correlation analysis, but that's another topic.)
I'd like to think of data analytics eventually becoming something that everybody who wants to do it actually can and should do, given that they have a good understanding of numbers, don't hate math and are willing to put in some effort to learn the basics.
I have had enough of reading posts from PhDs and other academics who seem to perpetuate this notion of data analytics being something that is only for "true data analytics experts" and people who have spent years at university. This blog post suggests that people should only use p-values if they have "proven" competency in statistics and data analysis. Every company I have worked for so far has benefitted from having more people talking about statistical significance and p-values rather than fewer. Even if some of the colleagues haven't completely understood p-values and their pitfalls, they at least started thinking about what statistical significance means and stopped looking only at differences between means.
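And statistical significance really isn't reserved for experts. A permutation test, for example, answers "how often would a difference in means this large appear by pure chance?" in a few lines of plain Python. The conversion data below is made up purely for illustration:

```python
import random

def permutation_p_value(a, b, n_permutations=10_000, seed=42):
    """Two-sided permutation test: how often does randomly shuffling the
    pooled data produce a difference in means at least as large as the
    one we actually observed?"""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:len(a)]) / len(a) - sum(pooled[len(a):]) / len(b))
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# made-up split-test data: 1 = converted, 0 = did not convert
control = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0] * 5  # 30% conversion rate
variant = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1] * 5  # 70% conversion rate
p = permutation_p_value(control, variant)  # small p: unlikely to be pure chance
```

That's the whole idea behind a p-value; anyone comfortable with a spreadsheet can follow it.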
Another example of this "data analytics only for experts" sentiment is this article. Here the preparation of the modelling data by separating it into test and training sets is given as an example of one of the hard parts of data analysis, one with many potential pitfalls and only for the true expert. Yes, but so is driving a car at 200 km/h on the autobahn. In my view it's really an artisanal task, with specific rules one has to follow that can be learned through practice, not necessarily only through studying. (I honestly think that someone who has been in a car accident and experienced the immense force of heavy objects colliding is better suited to drive at 200 km/h than someone who hasn't.)
Blog posts like these seem to suggest that academic and detailed knowledge about analytical methods and algorithms is required, and that one has to know linear algebra to be able to work as a data scientist. More often than not they recommend a bunch of books and online courses that one should go through before continuing to work...
Having read a bunch of statistics books and knowing linear algebra is generally a good prerequisite for being a data scientist. But it's not nearly enough, nor is it the most important part. Speaking for myself, I tend to forget the details of specific algorithms a day after I've heard or read about them, at the latest. What exactly does the logit transformation for a logistic regression look like? How does deep learning differ from artificial neural nets? What's the difference between linear and non-linear kernels in SVMs? I've learned linear algebra, but for my daily work as a data scientist it is about as useful as knowing assembler is to a Ruby dev.
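(For what it's worth, the logit in question turns out to be a one-liner, which rather proves the point that forgetting it is cheap; a quick sketch:)

```python
import math

def logit(p):
    """Log-odds: maps a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """The sigmoid, the logit's inverse: maps any real number back to a probability."""
    return 1 / (1 + math.exp(-x))
```

Logistic regression simply fits a linear model on the logit scale and maps predictions back to probabilities with the sigmoid, and you can look that up in thirty seconds when you need it.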
For a data scientist to generate business value it's more important to have a broad but not necessarily deep knowledge of the different fields of data analytics, data mining and machine learning. The key skills of a data scientist are to identify the correct problem to work on, then source the necessary data and think about important variables, find an appropriate (not the best) algorithm, and communicate the results effectively. Without being able to integrate data science results and processes into the existing organization, by understanding the business processes already in place and communicating effectively, data science teams will remain something of an academic ivory tower. They might make no methodological mistakes and know everything about the details of specific ML algorithms, but they will not increase profits.
"Anyone who knows me well knows that I'm not the sharpest knife in the drawer. My quantitative skills are middling, but I've seen folks much smarter than me fail mightily at working as analytics professionals." This is from John Foreman, chief data scientist at MailChimp and someone I would regard as pretty damn smart. In this highly recommended blog article, John argues that soft skills matter quite a lot in data science, especially for really understanding the problem.
Similarly, Kaggle CEO Anthony Goldbloom says the following in an article that discusses how the most successful data scientists are not necessarily PhDs: "In fact, I argue that often Ph.D.s in computer science in statistics spend too much time thinking about what algorithm to apply and not enough thinking about common sense issues like which set of variables (or features) are most likely to be important."
Years ago, during my time at XING, we had a meeting with some PhDs from DFKI (the German Research Center for Artificial Intelligence), as we tried to figure out how best to machine-learn the different tags and their associations that a member enters to describe their profile (e.g. "data mining", "machine learning" etc.). We used ordinary association and lift analysis to determine the relation between any two tags, and discussed ways to calculate the relation between any n tags and another set of tags. For example, what tags should I be interested in if I have the tags "data science" and "big data"? Eventually my team came up with the idea of just adding up the single lift values for the 1:1 relations, which worked very well and bested the job recommender engines that had been in place until that day. Recommendations with this method not only performed better, they were less expensive to calculate and more universally applicable (we were able to calculate associations between any types of items on XING). The only theoretical problem: the calculation, adding up lift values, is mathematically incorrect! Did we care? No, because it proved to generate more value for the business in every single split test that we ran. Here is the link to the paper which describes the early version of the tag-based recommender system.
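The paper has the details, but the core idea fits in a few lines of Python. The profiles and helper names below are made up for illustration, not the production code we ran at XING:

```python
from collections import Counter
from itertools import combinations

def lift_scores(profiles):
    """Pairwise lift between tags: P(a and b) / (P(a) * P(b)),
    estimated from a list of tag sets (one set per member profile)."""
    n = len(profiles)
    tag_counts = Counter(t for p in profiles for t in p)
    pair_counts = Counter()
    for p in profiles:
        for a, b in combinations(sorted(p), 2):
            pair_counts[(a, b)] += 1
    lifts = {}
    for (a, b), c in pair_counts.items():
        lift = (c / n) / ((tag_counts[a] / n) * (tag_counts[b] / n))
        lifts[(a, b)] = lifts[(b, a)] = lift
    return lifts

def recommend(my_tags, lifts, top=3):
    """Score each candidate tag by simply adding up its 1:1 lift values
    towards my tags -- mathematically incorrect, practically useful."""
    scores = Counter()
    for (a, b), lift in lifts.items():
        if a in my_tags and b not in my_tags:
            scores[b] += lift
    return [tag for tag, _ in scores.most_common(top)]

profiles = [
    {"data science", "big data", "machine learning"},
    {"data science", "machine learning"},
    {"big data", "hadoop"},
    {"machine learning", "python"},
]
recs = recommend({"data science"}, lift_scores(profiles))
```

Nothing in there requires a PhD, and because it only looks at co-occurrence counts, the same code works for jobs, groups or any other item type.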
I am not saying that any kind of specialized knowledge about different machine learning algorithms is unnecessary, but I think that data science, and specifically predictive modelling, is being portrayed as way more complex than it actually is. There are really robust and proven ways (from when data science was called data mining or BI and big data was called data) to make sure that mistakes in your predictive modelling don't ruin your business. These include:
- Dividing the data you develop your model on into training and test sets. Model/train on the training data, evaluate on your test data by looking at R-squared, ROC curves, AUC, AIC etc.
- Having other, completely distinct validation data sets, and evaluating model performance on those sets as well
- Testing your model on real data, evaluating performance, and looking at score quantiles (did the users with a high predicted conversion score really convert more often than the others?)
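The first two steps can be sketched in plain Python. The 60/20/20 split ratio and the helper names here are my own choices for illustration, not a standard:

```python
import random

def train_test_validation_split(rows, seed=7, train=0.6, test=0.2):
    """Shuffle the data once, then cut it into three disjoint sets."""
    rows = rows[:]                        # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    i = int(len(rows) * train)
    j = int(len(rows) * (train + test))
    return rows[:i], rows[i:j], rows[j:]

def auc(scores, labels):
    """AUC = probability that a randomly chosen positive example gets a
    higher score than a randomly chosen negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Train on the first set only, pick a model by its AUC (or R-squared, AIC, ...) on the second, and touch the third set exactly once, at the very end.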
There are frameworks which integrate these steps into an industrialized process for developing predictive models, for example SEMMA and CRISP-DM, which both originate from a time when the terms data science and big data didn't really exist. Sometimes it helps to realize that many companies were making millions with what is now called data science long before the hype, but that again is another topic.
Some of the algorithms being applied today are so complex that without these simple validation steps they couldn't be implemented at all. For example, the models produced by SVMs and artificial neural nets are much more of a black box than logistic regression or decision trees. Only after validating their performance on different test and validation sets should one consider implementing them in practice.
On the other hand, methodological errors made while training a model will be uncovered easily, for example when the model performs very well on the test set but turns out to overfit on validation sets from other timeframes.
Having gone through this process of developing robust predictive models many times, with real data in a real business, is in my view far more important than knowing how one would best implement a decision tree with MapReduce, or the latest trends in deep learning. Last summer we interviewed dozens of really intelligent data scientists for a data science job, most of them with a relevant PhD, some of whom had worked on amazing research projects. But asking them to transfer their knowledge to a realistic business problem that we faced often resulted in a really awkward moment of silence. Many solutions were so theoretical that we could never have implemented them. What was even worse, no informed discussion really took place. I'd rather hire someone with a proven track record of having applied data science algorithms and methods to real business problems than someone whose papers have been cited more than a hundred times but who can't transfer her knowledge into a new domain.
In one of the next blog posts in this "data to the people" series I'd like to talk about tools and practical examples of how to democratize data analytics. One of these is teaching our colleagues at Jimdo how to do analytics and query our data warehouse with SQL. This has led to one of our non-technical co-founders writing SQL regularly and even publishing the results internally.