- Quality beats quantity. A high quality small dataset is often more informative than a biased larger dataset. Performance is a tradeoff between bias and variance. As the sample size increases, the variance decreases but the bias remains. You don’t need a huge amount of data to achieve an acceptably small variance so that small dataset with no bias due to careful sampling or a controlled experiment will beat the big garbage dump of a dataset. David beats Goliath with a well-aimed shot.
- Cost. There is no free lunch. More data costs money – what did you think those power studies were for? But it’s not just the acquisition costs of data. Some procedures are computationally expensive and the cost increases at a faster than linear rate with data size. If you need your results now, you might do better with less data. Other costs of data are not financial. People value privacy – we should assign a cost to invading that privacy. We should avoid using more data than necessary to protect privacy.
- Inference works better on small data. Statisticians have spent years developing methods for inference on relatively small datasets. Unfortunately, most of these methods don’t work well with big data because the inference becomes unbelievably sharp. The reason for this is that most statistical methods do not allow for model uncertainty or unknown sampling biases. Machine learners do no better as their methods often fail to tackle the uncertainty problem at all. Until we learn how to express uncertainty in big data models, we might be better off sticking with small data.
- Aggregation. Sometimes we have the option of reducing a big individual level dataset to a smaller grouped dataset. Information may be lost by this aggregation but sometimes it can be beneficial. It reduces variation, needs simpler models and reduces privacy concerns.
- Teaching. Students now need to learn about big and small data methods. But where to start? Small data is easier to work with. It’s much simpler for students to understand both the principles and details of the computation without the technical overhead of big data. Students will need to learn big data methods sooner rather than later but it’s best to come into this with a good understanding of the ideas of uncertainty.
There’s always been statistics in sports. People have kept records of achievements such as the most goals in a season or home runs in a career for many years. These statistics add extra flavour to the mere winning and losing of games and championships. I’ve no complaint about such descriptive statistics. My objection is to the use of more advanced statistical methods to improve the chances of a team to win games. The book, and later the movie, Moneyball, was about recruiting underrated players and changing the strategy of a baseball team to win more games. The underlying methods are statistical and have been very successful. The ideas have been developed and spread throughout the sporting world. Many regard this as a success story for statistics but I beg to differ.
I understand the attraction. As a graduate student in Berkeley, I came across Bill James’ Baseball Abstract and was fascinated. James was not a professional statistician but he was great at assembling the right data to answer a question and backing up his claims with sensible statistical summaries. This was a model for how applied statistics should be done. Nonetheless, I realised that Baseball would not take much notice of a nerdy guy and so it proved for many years until Moneyball. So I’m not one of those people who object to sports sessions at Statistics conferences because sports is a trivial matter (as they feel) that does not deserve such attention at a serious conference.
We all like to feel that our work is improving the world in some way such as improving medical procedures or building more reliable products. American football coach, Vince Lombardi said “Winning isn’t everything, it’s the only thing.” so you might think the purpose of professional sports is winning. If statisticians can help teams win, surely that’s a good thing? But for every winner, there is a loser. The statistician can only help one team win at the expense of other teams losing. It’s a zero-sum game.
The true purpose of professional sports is not winning but entertainment. Lombardi had it all wrong. We watch sports because we find it enjoyable. The winning and losing are all part of the enjoyment. Do statisticians improve the enjoyment of sports? At best the answer is neutral and there is some evidence to suggest that they make it less enjoyable. In baseball, statisticians have discovered that players who foul off a lot of balls improve the chances of winning while stealing bases does not. Yet watching balls being fouled off is boring while base stealing is exciting. In football, statistical advice has sometimes led to boring, park-the-bus, defensive play.
It’s probably too late to turn back the clock as sports teams will not forgo the chance to gain an advantage. Nevertheless, statisticians should realise that, while they may derive some satisfaction and employment in applying statistics to sports, the overall effect of statistics on the professional sporting world has been negative.
Because sometimes the built-in choices are not enough. Here is an example of using a half Cauchy prior.
In a previous blog post I described the construction of Confidence Bands for Smoothness. The idea is to express the uncertainty in the amount of smoothing to be applied when fitting functions to data using Gaussian process regression. Now you can do the same thing with INLA. The previous implementation is in STAN. This is quite fast but not as fast as INLA.
I am writing a book entitled Bayesian Regression with INLA with Xiaofeng Wang and Ryan Yue. INLA stands for integrated nested Laplace approximations. Bayesian computation is not straightforward. In a few simple cases, explicit solutions exist, but in most statistical applications one typically uses simulation – usually based on MCMC (Markov chain Monte Carlo) methods. In some cases, this simulation can take a long time so it would be nice if you could do it faster. INLA is an approximation-based method that can do some Bayesian model fitting computation very quickly compared to simulation-based methods. You can learn more at the R-INLA website. You can also see some preview examples from the book.
Almost every student has a smart phone so it makes sense to format lecture notes so that they can be read on these small screen devices. But this can be difficult to achieve if you use LaTeX to produce your lecture notes or other mathematical/statistical handouts. We also want to maintain a full size version so the smaller version needs to be produced with minimal changes. Here are some tips which require increasing effort to implement
- Use A6 size paper by opening your LaTeX document with:
This is one quarter the size of A4 so it will shrink the page size greatly. I use the default 10pt font since students tend to have good eyesight. I can just about read this if I hold my phone in landscape orientation.
- Use the geometry package to specify minimal margins:
Because who needs margins if you are not printing this out.
- I use the graphicx package for including graphics. When including a plot or diagram use something like this:
The plot will take up 75% of the width of the page which works for both the small and the large screen versions.
- Some mathematical expressions can be quite long and will exceed the page width particularly on the small screen version. This is not a problem with text since LaTeX knows how to set the line breaks. But it’s much harder to do this with mathematical expressions. This is where the breqn package is very useful. At a minimum you can easily replace all your equation environments with dmath and displaymath with dmath*. This will get you automated line breaking in your equations. The breqn package has a lot more functionality if you want to make more of an effort.
It would be nice if LaTeX could produce documents that could dynamically reflow depending on screen size like the epub format used on e-readers. But that’s unlikely to happen so a version formatted for the small screen is the next best thing.
Today, the Guardian published its University Guide 2017. Let’s take a closer look at the components of these league tables. I don’t mean to pick on the Guardian table as the other tables have similar characteristics.
Three of the measures are based on student satisfaction and are drawn from the National Student Survey. When the NSS started out, it received fairly honest responses and had some value in provoking genuine improvements in education. But its value has deteriorated over time as Universities and students have reacted to it. Most students realise that the future value of their degree depends on the esteem in which their university is held. It is rational to rate your own university highly even if you don’t really feel that way. Furthermore, students find it difficult to rate their experience since most have only been to one university. It’s like rating the only restaurant you’ve ever eaten at. The Guardian makes things worse by using three measures from the NSS in its rankings.
Student to staff ratio is a slippery statistic. Many academics have divided responsibilities between teaching and research. It’s difficult to measure how much teaching they do and how much they interact with students. Class sizes can vary a lot according to type of programme and year in the degree – it’s not like primary education. Spend per student is another problematic measure. Expenditure on facilities can vary substantially from year to year and unpicking university budgets is difficult.
Average entry tariff is a solid measure and reflects student preferences. If you reorder the Guardian table on this measure alone, you’ll get a ranking that is closer to something knowledgeable raters would construct. This measure is sensitive to the selection of courses offered by the university since grades vary among A-level subjects.
Value added scores are highly dubious. It’s a measure of the difference between degree output and A-level input. A-levels are national tests and are a reasonable measure of entry ability. Degree qualifications are not comparable across Universities. A 2:1 at a top university is not the same as a 2:1 from a lower level university. If you compare the exams in Mathematics taken in top universities with those given by lower level universities, you will see huge differences in the difficulty and content. A student obtaining a 2:2 in Mathematics at a top university will likely know far more Maths than a student with a first from a post 92 university. This means it is foolish to take the proportion of good degrees (1 and 2:1) as a measure of output performance.
The final component is the percentage holding career jobs after six months. This is a useful statistic but is hard to measure accurately. This will also be affected by the mixture of courses offered by the university.
All these measures are then combined into a single “Guardian score”. There is no one right way to do this. If you consider the set of all convex combinations of the measures, you would generate a huge number of possible rankings, all them just as valid (or invalid) as the Guardian score. It’s a cake baked from mostly dodgy ingredients using an arbitrary recipe.
We might laugh this off as a bit of harmless fun. Unfortunately, some prospective students and members of the public take it seriously. Universities boast of their performance in press releases and on their websites. University administrators set their policies to increase their league table standing. Some of these policies are harmful to education and rarely are they beneficial. Meanwhile, the league tables are not actually useful to a 17 year old deciding on a degree course. The choice is constrained by expected A-level grades, course and location preferences. The statistics in the tables fluctuate from year to year and are an unreliable predictor of the individual student experience.
Suppose you have some data and you want to build a model to predict future observations. Data splitting means dividing your data into two, not necessarily equal, parts. One part is used to build the model and the other half to evaluate it in some way. There are two related, but distinct, reasons why people do this. It’s well known that if you use the same data to both build the model and test the performance of that model, you’ll be overoptimistic about how well your model will do in predicting new data. Analysts have various tricks to avoid overconfidence, such as crossvalidation, but these are not perfect. Furthermore, if you need to prove to someone else how well you’ve done, they will be sceptical of such tricks. The gold standard is to reserve part of your data as a test set and build and fit your model on the remaining training set part of the data. You use this model to predict the observations in the test set. Because the test set has been held back, it’s (almost) like having fresh data. The performance on this test set will be a realistic assessment of how the model will perform on future data. But you lost something in the data splitting – the training data is smaller than the full data so the model you select and fit won’t be as good. If you don’t need to prove how good your model is, this form of data splitting is a bad idea. It’s as if a customer orders a new car and asks that the seller drive it around for 10K miles to prove there’s nothing wrong with it. The customer will receive the assurance that the car is not a lemon but it won’t be as good as getting a brand new car in perfect condition.
By the way, if you find your model doesn’t do as well as you’d hoped on the test set, you might be tempted to go back and change the model. But the performance on this new model cannot be judged cleanly with the test set because you’ve borrowed some information from the test set to help build the model. You only get one shot using the test set.
There’s a second reason why you might use data splitting. The typical model building process involves choosing from a wide range of potential models. Usually there is substantial model uncertainty about the choice but you take your best pick. You then use the data to estimate the parameters of your chosen model. Statistical methods are good at assessing the parametric uncertainty in your estimates, but don’t reflect the model uncertainty at all. This a big reason why statistical models tend to be overconfident about future predictions. That’s where the data splitting can help. You use the the first part of your data to build the model and the second part to estimate the parameters of your chosen model. This results in more realistic estimates of the uncertainty in predictions. Again you pay a price for the data splitting. You use less data to choose the model and less data to fit the model. Your point predictions will not be as good as if you used the full data for everything. But your statements of the uncertainty in your predictions will be better (and probably wider).
So judging the value of data splitting in this context means we have to attach some value to uncertainty assessment as well as point prediction. The method of scoring is a good way to do this. All this and more is discussed in my paper Does data splitting improve prediction? Although it depends on the circumstances, I show that this form of data splitting can improve prediction.
Sky surveys provide a wide-angle view of the universe. Instead of focusing on a single star or galaxy, they provide a snapshot of a wider portion of the sky. Their primary purpose is to search for asteroids, so they take a sequence of images over a short timespan: minutes or hours. By seeing what has changed, the asteroid can be detected. But meanwhile, much further away, other things are changing. Most objects in the night sky do not change much on human timescales, but some of the most interesting stars and galaxies are performing for us. As the survey returns to the same location from time to time, we can see those objects that have changed. But how do we detect and categorise these objects? Here is a plot of four such objects:
The first of the four is an example of an active galactic nucleus. The plot shows how the brightness of the object varies over time. The second plot shows a supernova – most of the time we see nothing because the galaxy is below the detection limit but briefly we see a spark of light for a few weeks. The third object is a flare – it’s mostly quiet but burst into life from time to time. The fourth object is like our sun – it varies a little but enough to be interesting (fortunately for us!).
These are examples of light curves. As technology improves, increasingly large numbers of these curves will be recorded. We need to quickly identify and categorise the interesting ones so that they can be rapidly followed up (before the show is over!). In “Modeling lightcurves for improved classification of astronomical objects”, we describe how this can be done. You can also read about it on ArXiv and examine the software and data.
In What’s wrong with simultaneous confidence bands, I discussed the deficiencies of standard confidence bands, but how can we do better? Suppose we have a Gaussian process for the curve with some mean and a covariance kernel:
where δ is a delta function. The three hyper parameters, and l control the appearance of the curve. We can put priors on these parameters along with a prior on the mean function and calculate the posterior using MCMC methods. The most important parameter l controls the smoothness of the curve. We can view the uncertainty in this smoothness by computing a 95% credible interval for this parameter and plot the two curves corresponding to the endpoints of this interval. The other parameters are set at the posterior means. An example of these bands can be seen in the figure:
The solid line corresponds to the lower end of 95% interval for the smoothing parameter l giving a rougher fit while the dashed line is the upper end of the interval which is a smoother fit. We are fairly sure that the correct amount of smoothness lies between these two limits.
Notice this makes a big difference to how many peaks and valleys we see in the function. The uncertainty about these features is more interesting than any uncertainty about the vertical position of the curve expressed by the traditional bands.
You can read all the details in my article: Confidence bands for smoothness in nonparametric regression and also view the R code.