SPOTLIGHT: Can Data Science Save Humanity from Mosquitoes and other Deadly Insects?

KDnuggets launches Spotlight initiative to bring attention to academic research. The journey begins with Prof. Eamonn Keogh and his student, Yanping Chen, who are applying data mining to save us all from insect-vectored diseases.

IDEA The KDnuggets team is glad to introduce our SPOTLIGHT initiative - a monthly column dedicated to academic researchers. Living in a world inundated by news and events in the field of Data Science, it is not hard to note that most of the media reporting is focused on industry events such as market launch of new products, startups, acquisitions, investments, and of course, marketing articles poorly disguised as news!

WHY? Far away from the shine and glamour of media attention, we have a distinct group of intellectuals consistently tinkering with new ideas, conducting laborious experiments with scientific vigour and reporting unbiased observations. Along with their peers and students, these intellectuals pursue a totally different style of research from that of industry. They energetically pursue problems whose solutions might not have any tangible financial benefits. In short, academic research is an integral component of scientific advancement. Thus, it deserves a fair share of media attention in order to spread the great ideas generated in university labs and research centers.

HOW? In the pursuit of providing comprehensive reporting of academic research, we interview distinguished researchers and one of their current students. The interviews are designed to focus on novel ideas for significant problems, while also exploring potential applications, interesting trends and personality insights.

SO WHAT? Besides introducing the great hidden talent and research from the world of academia, this initiative will also help our readers expand their horizons of creative thinking for the various challenges Data Science currently poses. Furthermore, we hope that this column will help prospective students learn more about the research activities across universities and serve as an informal introduction to their future peers.

Let’s begin the journey. Our first stop: UC Riverside.
Prof. Eamonn Keogh is a leading researcher (No. 6 among Top Research Leaders in Data Mining, Data Science, and KDD), and his main focus is time series. Recently, Prof. Keogh has been actively pursuing applications of data mining in entomology, an area he calls "Computational Entomology". His student, Yanping Chen, recently won a contest to receive a research grant from the Bill & Melinda Gates Foundation for the project "Using Data to Understand Insect-Vectored Diseases". Let's learn more about their research in this two-part interview (Part-1: Interview with Prof. Keogh; Part-2: Interview with Yanping Chen). Each part will feature the interviewee's short bio before the interview.

Eamonn Keogh was born in Dublin, Ireland, the youngest of nine children. A middling student, he dropped out of school at age 15 to serve an apprenticeship as an automotive refinisher. Looking for something more in life, he immigrated to the US at age 19, and slowly worked his way through college, supporting himself as a metal fabricator, bicycle mechanic and occasional automotive refinisher. He received his Ph.D. in Computer Science in 2001 from the University of California-Irvine, and since then has been at the University of California-Riverside, winning early tenure and early promotion to full professor. He has won best paper awards at virtually all data mining venues (SIGKDD, ICDM, SDM, SIGMOD, etc.) and awards from the Bill and Melinda Gates Foundation, the Vodafone Foundation, IBM and Microsoft. He is a top ten most prolific author in SIGKDD (22 papers, 10th place), ICDM (25, 3rd), SDM (19, 4th) and Data Mining and Knowledge Discovery (11, 1st).

Here is my interview with Prof. Eamonn Keogh:

Anmol Rajpurohit: Q1. What are your areas of interest? What research problems are you currently working on?

Eamonn Keogh: I am probably most known for my work with time series data. I have invented some of the most commonly used representations (SAX [e], PAA [f]), algorithms (LB_Keogh [d]), and definitions (Time series Motifs [a], Shapelets [b], Discords [c]) used by the data mining community.

However, in the last few years, I have become very interested in applying data mining to problems in entomology [g][h]. In fact, I am trying to bootstrap a new area of research I call Computational Entomology [g]. My motivation is simple: insects kill a million people and eat tens of billions of dollars' worth of food each year, but at the same time, insects pollinate about half our food. Data relevant to insects and insect control comes from multiple sources: from real-time sensors in the field, from century-old handwritten archives, from meteorological models. Thus data mining is critically needed to make sense of all this and move entomology into 21st-century science.

AR: Q2. What motivated you towards research on time series classification and the development of UCR Suite (ultrafast subsequence search)? What future applications do you foresee for your research work?

EK: I sort of stumbled into the area of time series as a grad student, because NASA was funding my (then) advisor. However, I realized that I could make a career out of just working on time series, because it is so ubiquitous. On a single day I have looked at time series recorded on Mars, and time series recorded from the brain activity of a mosquito. On a more pragmatic note, the rise of wearable computers and the quantified self movement means that there will be time series problems to solve for a long time.

AR: Q3. From a broader perspective, which current trends in your research area do you find the most interesting? What key developments do you expect in near future?

EK: Early in my career I tried to start two trends. In 2002 I pointed out that the majority of papers on time series data mining tested on a single dataset, and I showed (although it seems obvious then and now) why this is a problem [i]. Now most papers test on 20 to 40 datasets. I don’t think we have solved the cherry-picking problem, but we have mitigated it somewhat.

The other trend I tried to push was the idea that an author had the obligation to make all their code and data available to the reviewers, and to the general public. Note that this was not my original idea; the astronomy community takes this for granted. This push has been partly successful, but once or twice a month I find myself pestering someone (usually unsuccessfully) for data or code (you know who you are!).

AR: Q4. What do you consider the most important skills in students and fellow researchers? How do you build a research team?

EK: One thing I love about data mining is that it is a broad tent. There is room for theoreticians, for system builders, for empirical folk, etc. I have had students with various subsets of these skills, and I love working with all of them.

As it happens, I am not a theoretician, and I have very poor skills in this area. My first Ph.D. advisor, who was theoretically minded, told me “you are not smart enough to get a Ph.D., you should leave with a master's degree”. Fortunately, Mike Pazzani took me on as a student, recognizing that while I could not prove theorems, I was creative and driven. For what it is worth, I now have more papers than my first advisor.

Given the above, the two qualities I look for in a student are a strong work ethic and good communication skills. Actually, these are related: most of my students do not have English as their first language, so writing clear text requires many, many passes, and lots of my “red pencil”. As Samuel Johnson said, “What is written without effort is in general read without pleasure.”

AR: Q5. On a personal note, are there any good books that you have been reading lately and would like to recommend? What keeps you busy when you are not teaching or doing research?

EK: I think perhaps the best science writer alive is Richard Dawkins, and I make my students read his “Blind Watchmaker”. His creative use of analogies, and the extraordinary care he takes in writing make his books a joy to read. I also recommend the books of another Richard, Richard Feynman, whose curiosity about the universe and ability to see connections between apparently unrelated things inspired me. Finally, the great Carl Sagan’s book, The Demon-Haunted World: Science as a Candle in the Dark, is worth reading if only for its “baloney detection kit” (non-American readers, baloney in this context means nonsense that people believe in).

When I am not working, I am reading, swimming, riding my bike, doing metal or woodwork. However, to a first degree approximation, I am always working!

[a] Abdullah Mueen, Eamonn J. Keogh, Qiang Zhu, Sydney Cash, M. Brandon Westover: Exact Discovery of Time Series Motifs. SDM 2009: 473-484
[b] Lexiang Ye, Eamonn J. Keogh: Time series shapelets: a new primitive for data mining. KDD 2009: 947-956
[c] Dragomir Yankov, Eamonn J. Keogh, Umaa Rebbapragada: Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Datasets. ICDM 2007: 381-390.
[d] Thanawin Rakthanmanon, Bilson J. L. Campana, Abdullah Mueen, Gustavo E. A. P. A. Batista, M. Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn J. Keogh: Searching and mining trillions of time series subsequences under dynamic time warping. KDD 2012: 262-270.
[e] Jessica Lin, Eamonn J. Keogh, Li Wei, Stefano Lonardi: Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov. 15(2): 107-144 (2007)
[f] Eamonn J. Keogh, Kaushik Chakrabarti, Michael J. Pazzani, Sharad Mehrotra: Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowl. Inf. Syst. 3(3): 263-286 (2001)
[g] Gustavo E. A. P. A. Batista, Eamonn J. Keogh, Agenor Mafra-Neto, Edgar Rowton: SIGKDD demo: sensors and software to allow computational entomology, an emerging application of data mining. KDD 2011: 761-764
[h] Yanping Chen, Adena Why, Gustavo Batista, Agenor Mafra-Neto, Eamonn Keogh: Flying Insect Classification with Inexpensive Sensors. Journal of Insect Behaviour 2014
[i] Keogh, E. and Kasetty, S. (2002). On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. In the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. July 23 - 26, 2002. Edmonton, Alberta, Canada. pp 102-111.

The second and last part of this article features the interview with Ms. Yanping Chen, PhD student at UCR.