# “Vite fait, bien fait” – Averaging improves both accuracy and speed of time series classification

Time series classification using k-nearest neighbors and dynamic time warping can be improved in many practical applications in both speed and accuracy using averaging.

**By Francois Petitjean, Dec 2014.**

Literally meaning “Quickly done, well done”, I used to say that as a kid, when I didn’t want to spend too much time on a chore; “Don’t worry mum, it was ‘quickly done, [but] well done’!”

If a ‘but’ was implied with my mum, our most recent work on time series showed that classification can actually be performed better

*and*faster. As part of a team of researchers spanning Australia, France and the United States I have just presented this work at the

*IEEE International Conference on Data Mining*. Before getting to the content, I want to acknowledge their amazing work: Thank you Germain, Geoff, Ann, Yanping, and Eamonn!

**Time series classification**

Classification of time series is one of the hot topics in data mining. It is used to recognize gestures, detect abnormal ECGs, and even to find the species of a mosquito from its “buzz”.

The data mining community has shown two somewhat surprising things:

- Although there are dozens of algorithms for time series classification, the
*Nearest Neighbor*(NN) classifier usually performs best. This is probably because we don’t understand temporal spaces very well. Quick reminder: NN compares the to-be-classified instance to every time series of the training database, and assigns to it the label of the most similar one. - Now if NN works well for time series, the next question is how to compute the similarity between sequences. Here again, a bit surprising, the best performing measure is Dynamic Time Warping (DTW), which was invented way back in 1972!

This is not the place to discuss

*why*this is true, but there is a strong consensus of the community that it is, supported by large scale reproducible experiments (see [a,b,c]).

**So what’s the problem with NN-DTW?**

The main issue is that using it to classify takes time. Even if recent advances have basically made the time complexity of DTW linear with the length of the series, NN cannot classify the query without looking at each time series of the database.

Whether you’re trying to decide if the insect that just flew past your sensor is a potential carrier of the Dengue virus, or recognize a gesture on your Xbox, there are many situations in which you don’t realistically have time to scan the whole database to obtain your classification.

We usually respond to this problem by using sampling: we satisfy our time constraints by comparing the sample to only a subset of the training database. The picture on the right shows the general behaviour of NN on our insect case study: the more examples you use, the better the classification is. Smart sampling techniques have also been studied for NN, for instance exploiting the fact that if an object of the training set is surrounded by objects of the same class, then this object might be ignored, because any query that falls within its neighborhood would be classified correctly even without it [d].

**Summarizing the database with average objects**

Another approach to remedying the shortcomings of NN-DTW is summarizing the database with

*constructed*prototypes (as opposed to prototypes from the database itself). This is the idea exploited by one of NN’s variations called the

*Nearest Centroid*classifier (NC), which replaces the database with the centroids of a few clusters to which the queries are then compared against.

Why is that better than sampling the training set directly? Because the

*mean*very often captures the “essence” of the set it summarizes much better than any object of the set can.

This idea that the

*mean*of a set of objects may be more representative than any

*individual*object from that set dates back at least a century to a famous observation of Francis Galton. Galton noted that the crowd at a county fair accurately guessed the weight of an ox when their individual guesses were averaged [e]. Galton realized that the

*average*was closer to the ox's true weight than the estimates of most crowd members, and also much closer than any of the separate estimates made by cattle experts.

**Averaging time series**

The issue for time series is that averaging might be

*very*complicated. The reason is that defining the mean is actually an optimization problem for which the measure that induces the space is central. DTW induces a very complex space. Broadly speaking, this is because it optimally realigns the time series. This makes the definition of a DTW-consistent average time series challenging to say the least! Computational biologists working on the same problem on sequences even call it the

*Holy grail*[f]. In fact, everything seems to indicate that this problem is NP-complete [g], which means that we need a good approximation.

This is where we intervene with our averaging method:

*DTW Barycentric Averaging*(DBA). In short, DBA is an expectation-maximization scheme that takes inspiration from the work of our colleagues in computational biology, but is specifically designed to tackle the particular needs of time series comparison and classification, such as autocorrelation (the likelihood to find an element similar to at ). We have demonstrated that DBA not only significantly outperforms competing methods, but is also very efficient (in fact, during my work for the French Space Agency, it was common for me to average millions of time series on my desktop computer).

**The results**

Let’s put everything together: take each class of the training set, launch K-means (or your favourite clustering algorithm), construct an average time series per cluster, and use these centroids as the prototypes to compare each query against.

When we started this work, I was hoping to improve a bit on the “smart” sampling techniques, i.e. that we would get faster without losing too much predictive power. In our experiments on 44 datasets, we did get a few that managed to do so (‘Gun point’ for instance – bold lines are based on average time series). However, we often managed to achieve faster

*and*better results, such as when classifying the insects. In fact, you can see there that even with one single prototype per class, we are able to improve on the full 1-NN classifier, with a 100x speedup.

**So, “Vite fait, bien fait”?**

Well, the short answer is Yes: we did demonstrate the statistical significance of the results I have intuited above.

The longer answer is that there is something even more exciting: we are starting to

*learn*models of time series. I’m saying this because in a way, NN does not really

*understand*the data; its

*learning*phase does not do anything but storing the raw data, without trying to make sense of their similarities and differences. To a data scientist, this is quite frustrating, because it is admitting that we don’t really understand how the classes are.

Our results however suggest that

*learning*a model for time series can be useful. I know; computing a

*mean*is a baby step toward what we call

*learning.*But having tried to make the most of time series for a few years, I definitely take it!

**More**

You’ll find more resources on my website www.francois-petitjean.com, including the paper, the source code and the slides. Don’t hesitate to get in touch: francois.petitjean@monash.edu, or on twitter at @LeDataMiner.

**Reference**

[a] A. Bagnalland J. Lines, “An experimental evaluation of nearest neighbour time series classification. Technical report #CMP-C14-01,” Department of Computing Sciences, University of East Anglia, Tech. Rep., 2014.

[b] X. Xi, E. Keogh, C. Shelton, L. Wei, and C.A. Ratanamahatana, “Fast time series classification using numerosity reduction,” in

*Int. Conf. on Machine Learning*, 2006, pp. 1033–1040.

[c] X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, E. Keogh: Experimental comparison of representation methods and distance measures for time series data. Data Min. Knowl. Discov. 26(2): 275-309 (2013)

[d] E. Pekalska, R. P. Duin, and P. Paclìk, “Prototype selection for dissimilarity-based classifiers,” Pattern Recognition, vol. 39, no. 2, pp. 189–208, 2006.

[e] F. Galton, “Vox populi,”

*Nature*, vol. 75, no. 1949, pp. 450–451, 1907.

[f] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, January 1997, ch. 14 Multiple String Comparison – The Holy Grail, pp. 332–367.

[g] L. Wang and T. Jiang, “On the complexity of multiple sequence alignment,” Journal of Computational Biology, vol. 1, no. 4, pp. 337–348, 1994.

**Bio**

Francois Petitjean tries to develop useful solutions for big data. He completed PhD working for the French Space Agency in 2012, and since then was working in Geoff Webb’s team at Monash University’s Centre for Data Science.

**Related:**