The Two Cultures of Data Analysis

Much of my reading ends up leading me to papers that come out of the machine learning field rather than statistics, and I always ask myself what the difference is. There is a lot of blurring between the two: topic modelling, for example, is based on Bayesian statistics but is worked on primarily by people in the machine learning community. Then there are areas which fall more on the statistics side, like Expectation-Maximization, and others more on the computer science side, like neural networks. This led me to ask why the two fields are so different. So I did what anyone would do and went searching around Google.

To be honest, I don’t have any strong opinions about either; here I’ll just summarize a few views from other people in the field which I found interesting. These views are probably dated by now, as the two fields are changing rapidly. I’ve added a few extra links at the bottom which look at the matter in much more detail.

Max Welling’s remarks, Dec 2015

Prof. Max Welling is a research chair in machine learning at the University of Amsterdam. He gave some remarks on “Data Science in the next 50 years”. I found his views extremely interesting, and it’s also the most recent work I could find on the topic (Dec 2015). He notes that statistics is more focused on “explaining and testing properties of a population from which we see a random sample, machine learning is more concerned with making predictions, even if the prediction can not be explained very well (a.k.a. a black-box prediction)”. This goes a long way towards explaining why the machine learning field has so widely adopted deep learning: it is very hard to beat in terms of prediction, and it is in most cases a black-box prediction.

He then goes on to make a pile of valid points and raise questions about the future. One of the most interesting to me is whether statistics and machine learning will ever converge in the deep learning field (a.k.a. will statisticians ever adopt the deep learning paradigm?). He then notes that the field could greatly benefit from the tools of statistics, for example to improve insight into deep learning predictions. He also tells statisticians that it would be silly to ignore the massive amounts of data we have available today.

A comment he makes which I personally agree with is “It therefore seems reasonable to include computer science classes in a statistics curriculum.” In my undergraduate degree there was very little encouragement to learn anything more than basic R skills. The only programming language used (in statistics modules) was R, and the most complex programming would be a for loop. There were of course optional modules for computing, but these generally just involved slightly more complex R. Areas around memory and parallel computing, which are core to analysing big data, were completely avoided.

He did ask for a fight… Dec 2008

A second resource I found was a machine learning blog post by Brendan O’Connor, who is now (but was not at the time) an assistant professor in Information and Computer Sciences at the University of Massachusetts Amherst. The title is “Machine Learning and Statistics, fight!” and that’s exactly what he got: a very large thread of comments. This was written almost 10 years ago so the

One of my favourite parts of the post is where he says “ML sounds like it’s young, vibrant, interesting to learn, and growing; Stats does not.” This sums up exactly the problem with statistics. The two are almost the same (at least they were in 2008; the deep learning craze now sets a bit of a boundary), yet ML jobs generally pay more and people think ML is cooler and more state of the art. For example, my sister works in marketing and is always going on about how they use machine learning, but she thinks such problems are completely outside the reach of statisticians. There’s a common stereotype that statisticians only do hypothesis testing, while the part about drawing interesting insights from data is usually associated with our machine learning counterparts.

This is mainly a problem with the marketing of statistics. The fields overlap hugely; however, when going for a job, the statistician will most likely be paid less than the machine learning practitioner, even though the skill sets are almost identical.

He also mentions a pile of terms in the post which are quite cool, a few of which I list here:

Computer Science | Statistics
Neural Network | Stack of linear functions paired with a wonky, overhyped training algorithm; the combination of which, many years later, still causes confusion.
Weights | Parameters
Learning | Fitting

Andrew Gelman’s response to Brendan’s post

Prof. Andrew Gelman is a statistician at Columbia University. He posted a response which led to another blast of arguments in the comments, the last two of which I found quite interesting.

Simon Blomberg posted in a comment:

From R’s fortunes package: To paraphrase provocatively, ‘machine learning is statistics minus any checking of models and assumptions’. - Brian D. Ripley (about the difference between machine learning and statistics) useR! 2004, Vienna (May 2004)

which then resulted in a response by Prof. Andrew Gelman:

In that case, maybe we should get rid of checking of models and assumptions more often. Then maybe we’d be able to solve some of the problems that the machine learning people can solve but we can’t!

I find this quite interesting. In statistics we generally focus on models whose fitting we can justify statistically, and we make sure that insights can be drawn once the model is fit. In ML, by contrast, the general philosophy is that if a model predicts well, it’s good.
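A minimal sketch of that contrast, assuming Python and a toy straight-line model. The function names and the simulated data are made up for illustration and come from neither post. The same fitted line is looked at twice: once through the slope and its standard error (the "can we explain it" view), and once through nothing but the error on data it has not seen (the "does it predict" view).

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x. Returns (a, b, se_b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Residual sum of squares; the variance estimate uses n - 2 degrees of freedom.
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = (rss / (n - 2) / sxx) ** 0.5
    return a, b, se_b


def mean_squared_error(a, b, xs, ys):
    """Average squared prediction error of the line a + b*x on (xs, ys)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: a true line y = 1 + 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(250)]
ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
train_x, test_x = xs[:200], xs[200:]
train_y, test_y = ys[:200], ys[200:]

a, b, se_b = fit_line(train_x, train_y)

# "Statistics" view: what is the slope, and how uncertain are we about it?
low, high = b - 1.96 * se_b, b + 1.96 * se_b
print(f"slope = {b:.3f}, approx. 95% interval = ({low:.3f}, {high:.3f})")

# "ML" view: ignore the parameters and only ask how well it predicts unseen data.
test_mse = mean_squared_error(a, b, test_x, test_y)
print(f"held-out MSE = {test_mse:.3f}")
```

For a straight line both views are cheap to get. The arguments above are really about models where the second number is easy to compute and the first kind of statement is not.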

Conclusion

I have only talked about two posts I found online; there are many more debates. Jerome H. Friedman’s data mining summary gives a very detailed view of the matter; however, it’s slightly outdated. There’s also a post named “The Stat’s Handicap” which points out some disadvantages of statistics, such as the number of publications PhD students generally end up with.

Mike

Statistics & Operations Student

STOR-i, Lancaster University http://lancaster.ac.uk/~omalley3