## Confidence Bands for Smoothness

In What’s wrong with simultaneous confidence bands, I discussed the deficiencies of standard confidence bands, but how can we do better? Suppose we have a Gaussian process for the curve with some mean and a covariance kernel:

$k(x,x^\prime) = \sigma^2_f \exp \left( - {1 \over 2l^2} (x-x^\prime)^2 \right) + \sigma^2_n \delta(x - x^\prime)$

where δ is a delta function. The three hyper parameters, $\sigma_f, \sigma_n$ and control the appearance of the curve. We can put priors on these parameters along with a prior on the mean function and calculate the posterior using MCMC methods. The most important parameter l controls the smoothness of the curve. We can view the uncertainty in this smoothness by computing a 95% credible interval for this parameter and plot the two curves corresponding to the endpoints of this interval. The other parameters are set at the posterior means. An example of these bands can be seen in the figure:

The solid line corresponds to the lower end of 95% interval for the smoothing parameter giving a rougher fit while the dashed line is the upper end of the interval which is a smoother fit. We are fairly sure that the correct amount of smoothness lies between these two limits.

Notice this makes a big difference to how many peaks and valleys we see in the function. The uncertainty about these features is more interesting than any uncertainty about the vertical position of the curve expressed by the traditional bands.

You can read all the details in my article: Confidence bands for smoothness in nonparametric regression and also view the R code.

## What’s wrong with simultaneous confidence bands

Here are some 95% confidence bands for a fitted curve shown on the same data. The first uses spline smoothing from the mgcv package while the second uses a loess smoother from the ggplot2 package.

If all you want is a graphical expression of the uncertainty in these two estimates, these are just fine. But suppose you want to check whether some proposed function fits entirely within the bands, then you will need to do more work. The bands above are pointwise meaning that the confidence statement is true at any given point but not true for the entire curve. For that, you will need simultaneous confidence bands. These are more work to produce and involve all sorts of intricate calculations. Hundreds of papers have been written on the topic because of the fascinating theoretical challenges it raises. I’m responsible for a few of these papers myself.

But are these bands really useful? Properly constructed, they may tell us there’s a 95% chance that the bands capture the true curve. But what is the value in that? There will be some users who are interested in replacing the smooth fit with some particular parametric form, say a quadratic. Such users would be better off embedding their search within a family of increasingly complex alternatives and choosing accordingly. SCBs would not be an efficient strategy.

In SCB papers that have a data example, the usual motivation is that the user is interested in whether some particular feature exists, say a secondary maximum, for example. But users looking for particular features are better off with a method designed to look for those features, such as SiZer.

The greatest uncertainty is demonstrated by comparing the two figures above. How much smoothing should be applied? This makes a crucial difference to our interpretation. In the typical SCB construction, the amount of smoothing is chosen by some algorithm and the bands only reflect the uncertainty in the amplitude of the curves.

We need bands that tell us about the uncertainty in the smoothing. I will explain how to do this in the next blog post.

## Status Publishing

In Academic Publishing Is All About Status, the view is expressed that publishing in academic journals is about signalling status rather than in communicating results. After all, many papers have been online in ArXiv or similar repositories, presented at conferences and discussed in blogs for some time so the contents hardly come as a surprise. All the journal publication does is supply varying amounts of credit to the authors and provide a somewhat more durable archiving of the contents. The original article is about the field of Economics but could equally well be about Statistics.

## Statistics authors single no more

While looking at tables of contents for journals and CVs for job applications, it struck me that there are a lot of multi-author papers now. Just to check that I was not imagining it, I downloaded the tables of contents for JASA and JRSS-B for the years 2014, 2004, 1994, 1984 and 1974. I removed the comments, book reviews and corrections to focus just on the research articles. I counted the number of authors for each paper and here’s what I found:

Mean number of authors per article
1974 1984 1994 2004 2014
JASA 1.48 1.61 1.89 2.50 2.97
JRSS 1.29 1.51 1.70 2.45 2.65

and

Fraction of single author articles
1974 1984 1994 2004 2014
JASA 0.56 0.50 0.35 0.19 0.06
JRSS 0.71 0.57 0.41 0.12 0.03

I’d like to have done more but Web of Science doesn’t make it easy to get the data. Nevertheless, the trend is very clear. Browse through a few journals, old and new and you’ll see the same pattern. You can see similar trends in other fields – here’s one from Astronomy. The average number of authors per paper has been increasing significantly over time. Most dramatically, the number of single author papers has gone from a majority down to single digit percentages.

Now there are several good reasons not to go it alone. If you write a genuine applications paper, you will be collaborating with a scientist from another field who would be a natural co-author. But most of the articles in these two journals are methodological with Statistical authors. In any case this doesn’t explain the trend. It’s also possible that your co-authors bring different skills to the table, theoretical or computational. It’s also nice to have someone to discuss the paper. You correct each other’s errors and misconceptions. You learn from each other. The social aspect of the collaboration can make you more productive. But that’s also been true in the past. You can point to the greater ease of long distance collaborations using the internet but I suspect there is a stronger cause.

The academic world is increasingly driven by metrics. You will get far more credit for writing four articles, each with four authors than for producing one single author paper. So there’s a huge incentive to collaborate. This is good when the co-authors have clearly distinct skills but often they are just other statisticians, much like you.

But something has been lost. Most good papers have one main idea and only one person can have that idea. Sure, other people can help refine that idea and turn it into a paper but somewhere in that list of multiple authors there is that one person who had the idea. Truly creative and innovative ideas start out sounding a bit crazy, ill-formed and speculative. If you discuss the idea with your peers, they will back away since it won’t sound like something that will reliably result in a publishable paper. Don’t even think of submitting a grant proposal. So you work on the idea and turn it into flesh. By that time, you don’t need co-authors. Working as a group is a reliable way to get work published but for truly creative work, you need to go it alone.

## The Decline and Fall of Statistics

Several articles have been published lately bemoaning the fate of Statistics relative to Data Science. See, for example, “Aren’t we Data Science” and “Let us own Data Science”. The general message is that outsiders have been repackaging Statistics, presenting it as the new and exciting field of Data Science, and then taking all the credit and funding. There is some acceptance that there is more to Data Science than just Statistics but a feeling that these other areas, such as data management and big data manipulation could be incorporated within the standard statistical training. In “50 years of Data Science”, David Donoho points out that this has been going on since John Tukey presciently described how Data Science (as it would later be called) differed from the Statistics of the time. Tukey also told Statisticians how they would need to change to rise to these new challenges. The advice wasn’t entirely ignored but the existence of fields such as Data Science and Machine Learning is testament to the failure of Statisticians to evolve. So what went wrong?

In “Statistics Losing Ground to Computer Science”, Norman Matloff criticises the Computer Science model of research. Articles appear quickly as conference proceedings with lighter refereeing in contrast to the much slower approach of publishing in Statistics journals. This does indeed lead to sometimes superficial results and much reinvention of the wheel. Nevertheless, one cannot deny its effectiveness in advancing data analytic methods that solve a wide range of practical problems that Statisticians have been reluctant to tackle. Judging by mindshare, the publishing model in Computer Science is superior.

The model of publishing in Statistics goes to the heart of the problem because it determines who has significant influence over the field. Who decides who gets hired, who gets grants awarded and what is taught to undergraduate and graduate students in Statistics? Securing a senior position in Statistics requires a strong record of publication in these journals. And what is an almost essential ingredient for such publications? Mathematics.

Mathematics has been good for Statistics. Nothing can beat the power and generality of a good mathematical theory. Mathematical rigour provides a firm foundation to much statistical practice. Mathematics is wonderful but has also been a brake on the development of Statistics. In Computer Science, you can develop a method using mathematical intuitions but not rigour, you can try it out on a few datasets and show it works. You can publish it and others may apply it to new datasets. If it works well, the method will again acceptance and will be developed and extended. In Statistics, the editor or referee will reject the same paper on the grounds that it lacks mathematical rigour and they can’t tell whether your method works in full generality.

Now certainly it would be a fine thing to prove a theorem showing that your method works in full generality but the complexity of the problem will typically block you from a genuinely useful mathematical result. You can resort to Mathematistry, a term coined by Rod Little. This entails filling your paper with a result that is technically impressive but says little or nothing about the practical performance of your method. Unfortunately, it seems many of the general and powerful theorems of Statistics have already been proved and the scope for genuine mathematical progress is limited.  Meanwhile the more empirical approach of Computer Science is far more effective in generating new and powerful methodology.

There have been many sensible recommendations for how Statistics can respond to the challenge of Data Science but I am not optimistic that meaningful change is possible. The institution of Statistics is dominated by people with a certain mindset that is self-perpetuating. Rather like Romans viewing the massed barbarians at the gate, they knew what was wrong but were incapable of change. But the barbarians were more civilised than they realised and the legacy of Rome lived on. That I think is the fate of Statistics – like Latin it will continue to be influential but will become a dead language.

## Plotting regression datasets

Consider a regression dataset with a response and several predictors. You want a single plot showing the response plotted against each of the predictors. You could use the pairs() but that also shows plots between the predictors. If there are more than a few predictors, there are too many plots to see any one of them clearly. Here’s a simple solution:

Here’s an example dataset:

head(swiss)

##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6


Now reorganise the data using the tidyr package so that there is one (x,y) pair on each line:

library(tidyr)
rdf <- gather(swiss, variable, value, -Fertility)

##   Fertility    variable value
## 1      80.2 Agriculture  17.0
## 2      83.1 Agriculture  45.1
## 3      92.5 Agriculture  39.7
## 4      85.8 Agriculture  36.5
## 5      76.9 Agriculture  43.5
## 6      76.1 Agriculture  35.3


Use ggplot2 to plot the response against each of the predictors (which are on different scales so
we need to allow for that)

library(ggplot2)
ggplot(rdf, aes(x=value,y=Fertility)) + geom_point() + facet_wrap(~ variable, scale="free_x")

We can elaborate as needed.

## GLM – an overused paradigm

There are a very large number of books, articles and university courses with “Generalised Linear Models” (GLM) in the title. I even wrote such a book myself. But is the GLM paradigm really that useful?

The idea is that we can develop a general theory of estimation, inference and diagnostics that will apply to a wide class of models. We will avoid duplication of effort and synergies will arise across this class of models. The idea goes back to a 1972 paper in JRSS-B by Nelder and Wedderburn.

But what response distributions belong to the GLM family? The Gaussian (or normal) distribution is the most used in practice. But this is simply the linear model for which much more general and powerful results exist. The Gaussian gains nothing from the GLM perspective.

Next we have the binomial and Poisson, which do benefit from GLM membership. But beyond that, the family members become progressively more exotic. The gamma GLM is not commonly used because a linear model with a transformed response will often suffice. Venturing further into the outback, you may find an inverse Gaussian or Tweedie GLM but these are truly rare birds. There are more interesting distributions such as the negative binomial and beta but they are excluded from the club as they don’t belong to the “exponential family” of distributions.

So there are only two important members of the GLM family: the binomial and the Poisson. Even here we must be careful because large chunks of the theory and practice for these models is specific to one or the other. The GLM paradigm does tell us one way to estimate and do inference for these models. But there other ways to do these things.

Statisticians love widely applicable theories – but perhaps a little too much. GLM is a nice, useful theory but the paradigm has become too dominant in the way people learn or are taught about these kind of models.