Statistics authors single no more

While looking at tables of contents for journals and CVs for job applications, it struck me that there are a lot of multi-author papers now. Just to check that I was not imagining it, I downloaded the tables of contents for JASA and JRSS-B for the years 2014, 2004, 1994, 1984 and 1974. I removed the comments, book reviews and corrections to focus just on the research articles. I counted the number of authors for each paper and here’s what I found:

Mean number of authors per article
       1974  1984  1994  2004  2014
JASA   1.48  1.61  1.89  2.50  2.97
JRSS   1.29  1.51  1.70  2.45  2.65

and

Fraction of single author articles
       1974  1984  1994  2004  2014
JASA   0.56  0.50  0.35  0.19  0.06
JRSS   0.71  0.57  0.41  0.12  0.03

I’d like to have done more but Web of Science doesn’t make it easy to get the data. Nevertheless, the trend is very clear. Browse through a few journals, old and new, and you’ll see the same pattern. You can see similar trends in other fields – here’s one from Astronomy. The average number of authors per paper has been increasing steadily over time. Most dramatically, the share of single-author papers has fallen from a majority to single-digit percentages.

Now there are several good reasons not to go it alone. If you write a genuine applications paper, you will be collaborating with a scientist from another field who would be a natural co-author. But most of the articles in these two journals are methodological, with statisticians as authors, so this doesn’t explain the trend. It’s also possible that your co-authors bring different skills to the table, theoretical or computational. And it’s nice to have someone with whom to discuss the paper: you correct each other’s errors and misconceptions, you learn from each other, and the social aspect of the collaboration can make you more productive. But all that was just as true in the past. You can point to the greater ease of long-distance collaboration over the internet, but I suspect there is a stronger cause.

The academic world is increasingly driven by metrics. You will get far more credit for writing four articles, each with four authors, than for producing one single-author paper, so there’s a huge incentive to collaborate. This is good when the co-authors have clearly distinct skills, but often they are just other statisticians, much like you.

But something has been lost. Most good papers have one main idea, and only one person can have that idea. Sure, other people can help refine the idea and turn it into a paper, but somewhere in that list of multiple authors there is the one person who had it. Truly creative and innovative ideas start out sounding a bit crazy, ill-formed and speculative. If you discuss such an idea with your peers, they will back away, since it won’t sound like something that will reliably result in a publishable paper. Don’t even think of submitting a grant proposal. So you work on the idea alone and give it flesh. By that time, you don’t need co-authors. Working as a group is a reliable way to get work published, but for truly creative work, you need to go it alone.

 


The Decline and Fall of Statistics

Several articles have been published lately bemoaning the fate of Statistics relative to Data Science. See, for example, “Aren’t we Data Science?” and “Let us own Data Science”. The general message is that outsiders have been repackaging Statistics, presenting it as the new and exciting field of Data Science, and then taking all the credit and funding. There is some acceptance that there is more to Data Science than just Statistics, but also a feeling that these other areas, such as data management and big-data manipulation, could be incorporated within standard statistical training. In “50 years of Data Science”, David Donoho points out that this has been going on since John Tukey presciently described how Data Science (as it would later be called) differed from the Statistics of the time. Tukey also told Statisticians how they would need to change to rise to these new challenges. The advice wasn’t entirely ignored, but the existence of fields such as Data Science and Machine Learning is testament to the failure of Statisticians to evolve. So what went wrong?

In “Statistics Losing Ground to Computer Science”, Norman Matloff criticises the Computer Science model of research, in which articles appear quickly as conference proceedings with lighter refereeing, in contrast to the much slower publication process of Statistics journals. This does indeed sometimes lead to superficial results and much reinvention of the wheel. Nevertheless, one cannot deny its effectiveness in advancing data-analytic methods that solve a wide range of practical problems that Statisticians have been reluctant to tackle. Judging by mindshare, the publishing model in Computer Science is superior.

The model of publishing in Statistics goes to the heart of the problem because it determines who has significant influence over the field. Who decides who gets hired, who gets grants and what is taught to undergraduate and graduate students in Statistics? Those who hold senior positions – and securing a senior position in Statistics requires a strong record of publication in its journals. And what is an almost essential ingredient for such publications? Mathematics.

Mathematics has been good for Statistics. Nothing can beat the power and generality of a good mathematical theory, and mathematical rigour provides a firm foundation for much statistical practice. Mathematics is wonderful, but it has also been a brake on the development of Statistics. In Computer Science, you can develop a method using mathematical intuition rather than rigour, try it out on a few datasets and show that it works. You can publish it, and others may apply it to new datasets. If it works well, the method will gain acceptance and will be developed and extended. In Statistics, the editor or referee will reject the same paper on the grounds that it lacks mathematical rigour and they can’t tell whether your method works in full generality.

Now certainly it would be a fine thing to prove a theorem showing that your method works in full generality, but the complexity of the problem will typically block you from obtaining a genuinely useful mathematical result. You can resort to Mathematistry – a term coined by George Box and recently revived by Rod Little – which entails filling your paper with a result that is technically impressive but says little or nothing about the practical performance of your method. Unfortunately, it seems that many of the general and powerful theorems of Statistics have already been proved and the scope for genuine mathematical progress is limited. Meanwhile, the more empirical approach of Computer Science is far more effective in generating new and powerful methodology.

There have been many sensible recommendations for how Statistics can respond to the challenge of Data Science, but I am not optimistic that meaningful change is possible. The institution of Statistics is dominated by people with a certain mindset that is self-perpetuating. They are rather like the Romans viewing the massed barbarians at the gates: they knew what was wrong but were incapable of change. But the barbarians were more civilised than the Romans realised, and the legacy of Rome lived on. That, I think, is the fate of Statistics – like Latin, it will continue to be influential but will become a dead language.

Plotting regression datasets

Consider a regression dataset with a response and several predictors. You want a single plot showing the response plotted against each of the predictors. You could use pairs(), but that also shows the plots between pairs of predictors; if there are more than a few predictors, there are too many plots to see any one of them clearly. Here’s a simple solution.

Start with an example dataset:

head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

Now reorganise the data using the tidyr package so that there is one (x,y) pair on each line:

library(tidyr)
rdf <- gather(swiss, variable, value, -Fertility)
head(rdf)
##   Fertility    variable value
## 1      80.2 Agriculture  17.0
## 2      83.1 Agriculture  45.1
## 3      92.5 Agriculture  39.7
## 4      85.8 Agriculture  36.5
## 5      76.9 Agriculture  43.5
## 6      76.1 Agriculture  35.3

Use ggplot2 to plot the response against each of the predictors. The predictors are on different scales, so we allow each facet its own x-axis scale:

library(ggplot2)
ggplot(rdf, aes(x=value,y=Fertility)) + geom_point() + facet_wrap(~ variable, scales="free_x")

(Figure: Fertility plotted against each predictor, one panel per predictor, with free x-axis scales.)

We can elaborate as needed.
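For example, one natural elaboration (a sketch of mine, not part of the original recipe) is to add a smoother to each panel to bring out any trends; geom_smooth’s default loess fit is just one choice:

# add a loess smoother with a pointwise confidence band to each panel
ggplot(rdf, aes(x=value,y=Fertility)) + geom_point() + geom_smooth() + facet_wrap(~ variable, scales="free_x")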

GLM – an overused paradigm

There are a very large number of books, articles and university courses with “Generalised Linear Models” (GLM) in the title. I even wrote such a book myself. But is the GLM paradigm really that useful?

The idea is that we can develop a general theory of estimation, inference and diagnostics that will apply to a wide class of models. We will avoid duplication of effort and synergies will arise across this class of models. The idea goes back to a 1972 paper in JRSS-B by Nelder and Wedderburn.
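To see what this buys in practice, here is a minimal sketch (my illustration, not from the paper) using base R’s glm() on built-in datasets – one fitting function, and one set of summary, predict and anova methods, covers several response distributions:

gfit <- glm(Fertility ~ Education, family = gaussian, data = swiss)          # ordinary linear model
bfit <- glm(I(Fertility > 70) ~ Education, family = binomial, data = swiss)  # logistic regression
pfit <- glm(count ~ spray, family = poisson, data = InsectSprays)            # Poisson regression
summary(bfit)  # the same machinery applies regardless of the family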

But what response distributions belong to the GLM family? The Gaussian (or normal) distribution is the most used in practice, but this is simply the linear model, for which much more general and powerful results exist. The Gaussian gains nothing from the GLM perspective.

Next we have the binomial and Poisson, which do benefit from GLM membership. But beyond that, the family members become progressively more exotic. The gamma GLM is not commonly used because a linear model with a transformed response will often suffice. Venturing further into the outback, you may find an inverse Gaussian or Tweedie GLM, but these are truly rare birds. There are more interesting distributions, such as the negative binomial and beta, but they are excluded from the club because they don’t belong to the “exponential family” of distributions.
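To illustrate the transformed-response alternative (my sketch, using the built-in trees data, where all the variables are positive), the two routes often give very similar answers:

gfit <- glm(Volume ~ Girth + Height, family = Gamma(link = "log"), data = trees)  # gamma GLM with log link
lfit <- lm(log(Volume) ~ Girth + Height, data = trees)                            # log-transformed response
coef(gfit)
coef(lfit)  # typically close to the gamma GLM coefficients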

So there are only two important members of the GLM family: the binomial and the Poisson. Even here we must be careful, because large chunks of the theory and practice for these models are specific to one or the other. The GLM paradigm does tell us one way to estimate and do inference for these models, but there are other ways to do these things.
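For instance, nothing stops you fitting a Poisson regression by maximising the log-likelihood directly, with no reference to the GLM machinery at all – a sketch of mine, not a recommendation from any particular source:

X <- model.matrix(~ spray, data = InsectSprays)
y <- InsectSprays$count
loglik <- function(beta) {
  eta <- X %*% beta
  sum(y * eta - exp(eta))  # Poisson log-likelihood, dropping the constant term
}
fit <- optim(rep(0, ncol(X)), loglik, method = "BFGS",
             control = list(fnscale = -1))  # fnscale = -1 makes optim maximise
fit$par  # should match coef(glm(count ~ spray, family = poisson, data = InsectSprays))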

Statisticians love widely applicable theories – but perhaps a little too much. GLM is a nice, useful theory, but the paradigm has become too dominant in the way people learn or are taught about these kinds of models.