Several articles have been published lately bemoaning the fate of Statistics relative to Data Science. See, for example, “Aren’t we Data Science” and “Let us own Data Science”. The general message is that outsiders have been repackaging Statistics, presenting it as the new and exciting field of Data Science, and then taking all the credit and funding. There is some acceptance that there is more to Data Science than just Statistics, but also a feeling that these other areas, such as data management and big data manipulation, could be incorporated within the standard statistical training. In “50 years of Data Science”, David Donoho points out that this has been going on since John Tukey presciently described how Data Science (as it would later be called) differed from the Statistics of the time. Tukey also told Statisticians how they would need to change to rise to these new challenges. The advice wasn’t entirely ignored, but the existence of fields such as Data Science and Machine Learning is testament to the failure of Statisticians to evolve. So what went wrong?
In “Statistics Losing Ground to Computer Science”, Norman Matloff criticises the Computer Science model of research. Articles appear quickly as conference proceedings with lighter refereeing in contrast to the much slower approach of publishing in Statistics journals. This does indeed lead to sometimes superficial results and much reinvention of the wheel. Nevertheless, one cannot deny its effectiveness in advancing data analytic methods that solve a wide range of practical problems that Statisticians have been reluctant to tackle. Judging by mindshare, the publishing model in Computer Science is superior.
The model of publishing in Statistics goes to the heart of the problem because it determines who has significant influence over the field. Who decides who gets hired, who is awarded grants and what is taught to undergraduate and graduate students in Statistics? Securing a senior position in Statistics requires a strong record of publication in Statistics journals. And what is an almost essential ingredient for such publications? Mathematics.
Mathematics has been good for Statistics. Nothing can beat the power and generality of a good mathematical theory. Mathematical rigour provides a firm foundation to much statistical practice. Mathematics is wonderful but has also been a brake on the development of Statistics. In Computer Science, you can develop a method using mathematical intuition but not rigour, try it out on a few datasets and show that it works. You can publish it and others may apply it to new datasets. If it works well, the method will gain acceptance and will be developed and extended. In Statistics, the editor or referee will reject the same paper on the grounds that it lacks mathematical rigour and they can’t tell whether your method works in full generality.
Now certainly it would be a fine thing to prove a theorem showing that your method works in full generality but the complexity of the problem will typically block you from a genuinely useful mathematical result. You can resort to Mathematistry, a term coined by Rod Little. This entails filling your paper with a result that is technically impressive but says little or nothing about the practical performance of your method. Unfortunately, it seems many of the general and powerful theorems of Statistics have already been proved and the scope for genuine mathematical progress is limited. Meanwhile the more empirical approach of Computer Science is far more effective in generating new and powerful methodology.
There have been many sensible recommendations for how Statistics can respond to the challenge of Data Science but I am not optimistic that meaningful change is possible. The institution of Statistics is dominated by people with a certain mindset that is self-perpetuating. Rather like the Romans viewing the massed barbarians at the gate, they know what is wrong but are incapable of change. But the barbarians were more civilised than the Romans realised and the legacy of Rome lived on. That, I think, is the fate of Statistics: like Latin, it will continue to be influential but will become a dead language.