SPOTLIGHT: Can Data Science Save Humanity from Mosquitoes and other Deadly Insects? #2

KDnuggets launches the Spotlight initiative to bring attention to academic research. The journey begins with Prof. Eamonn Keogh of UCR and his talented student, Yanping Chen, who are applying data mining to save us all from insect-vectored diseases.

In case you missed part 1 of this article: Can Data Science Save Humanity from Mosquitoes and other Deadly Insects?

Next, I would like to introduce Prof. Keogh’s current student, Ms. Yanping Chen. Yanping Chen is a Ph.D. candidate at the University of California, Riverside (UCR), specializing in data mining, machine learning, and the application of these techniques to real-world problems. She has two first-author papers on time series data mining published in SIGKDD 2013. She is also doing research on insect detection and classification, and is the lead principal investigator on a $100,000 grant from Grand Challenges Explorations Round 11, funded by the Bill & Melinda Gates Foundation.

Here is my interview with Ms. Chen:

Anmol Rajpurohit: Q1. Currently, what are some of the best solutions for understanding insect-vectored diseases and planning effective interventions? What do they lack? What do you consider as the most important value of your research?

Yanping Chen: The population density of insect vectors is used to predict insect-vectored diseases. Where insect density is currently measured, it is typically measured with sticky traps or the ubiquitous CDC mosquito trap. However, these traps need a trained operator to set them up and check the catch. The greatest problem with these traps is the time lag between the insects’ arrival and the time they are counted. Under realistic conditions in the developing world this lag may be a week or more, yet the adult stage of Anopheles lasts only about two weeks. Thus, by the time an outbreak is detected it may already be waning, with the damage done.

Our research proposes a system to automatically detect and classify flying insects. It can be used to monitor the density of insect vectors in real time, and thus drastically reduce the need for trained personnel. Most importantly, it provides real-time information that allows health officials to plan effective interventions and to better prepare for future epidemics.

AR: Q2. Can you briefly describe your research methodology? What are the biggest challenges in this research project?

YC: Our system has two parts: an optical sensor that records the “sound” of insect flight, and software that uses the sensor data to automatically detect and identify flying insects.

The optical sensors are “pseudo-acoustic” sensors that record the “sound” of insect flight from meters away, with complete invariance to wind noise and ambient sounds. These sensors allow us to record on the order of millions of labeled training instances, far more data than all previous efforts combined. The software is a classification framework based primarily on the sound of insect flight, but it is a principled framework that can incorporate any additional information to improve classification accuracy.

The biggest challenge in this project is developing an accurate and robust insect classification algorithm. It turns out that the feature most commonly used in previous efforts, the wingbeat frequency, is not adequate for classifying different species of insects; hence, we need to use more of the information inherent in and accompanying the flight sounds. Moreover, there are several constraints on the classification algorithm. For example, online classification has to be fast enough to provide real-time information; at the same time, the algorithm has to be undemanding in both CPU and memory, since devices deployed in the field in large quantities will typically be small, with limited memory, CPU power and battery life. Developing an accurate and robust classification algorithm that meets all of these constraints is challenging.
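As a rough illustration of the wingbeat-frequency feature mentioned in this answer, the sketch below estimates the dominant frequency of a recorded flight signal with a plain discrete Fourier transform. It is not the system described in the interview: the sample rate, the 100 to 1000 Hz search band, the synthetic test signal, and the function names `synth_wingbeat` and `dominant_frequency` are all assumptions made for the example. It uses only the Python standard library, in the spirit of the low-resource constraint, and it extracts just the single feature that the interview says is insufficient on its own.

```python
import cmath
import math


def synth_wingbeat(freq_hz, sample_rate, n_samples):
    """Hypothetical test signal: a fundamental plus a weaker second harmonic."""
    return [
        math.sin(2 * math.pi * freq_hz * i / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 2 * freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]


def dominant_frequency(signal, sample_rate, f_min=100.0, f_max=1000.0):
    """Return the DFT bin frequency (Hz) with the largest magnitude in [f_min, f_max].

    The band limits are illustrative assumptions, not values from the interview.
    """
    n = len(signal)
    best_freq, best_mag = 0.0, -1.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if freq < f_min or freq > f_max:
            continue  # only evaluate bins inside the search band
        # Direct evaluation of one DFT coefficient (no FFT library needed).
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal)
        )
        mag = abs(coeff)
        if mag > best_mag:
            best_freq, best_mag = freq, mag
    return best_freq


if __name__ == "__main__":
    sample_rate = 8000  # assumed sampling rate in Hz
    signal = synth_wingbeat(400.0, sample_rate, 1024)
    print(dominant_frequency(signal, sample_rate))
```

With 1024 samples at 8000 Hz the frequency resolution is about 7.8 Hz, so the estimate lands on the bin nearest the true fundamental rather than on the exact value. A classifier along the lines described above would combine this feature with further information rather than rely on it alone.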

AR: Q3. What other research problems are you currently working on?

YC: I am very interested in machine learning and its applications to real-world problems. Currently, I am also working on three other projects: improving the performance of semi-supervised learning on time series data [1]; discovering frequent patterns in a data stream [2]; and developing scalable machine learning algorithms.

AR: Q4. What do you like the most about your current research work, and what do you like the least? 

YC: What I like most is seeing that my work does better than the state of the art in solving real-world problems. It makes me feel that my research matters, and gives me a sense of accomplishment. On the other hand, I sometimes find it very frustrating to do the comparisons, as some published work does not have code and data publicly available, and the comparison can be painful, especially when there are several parameters to be tuned. Therefore, I really like the idea of making all work publicly available and reproducible.

AR: Q5. What do you plan to do in the future? How are you preparing yourself for your long-term goals?

YC: I would like to become a data scientist and an expert in data mining. My long-term goal is to have my own research team working on an interesting research problem whose solution will have a significant impact on improving people’s lives.

AR: Q6. On a personal note, given a chance to meet any Data Science leader, who would you like to meet? Why?

YC: I am very impressed by the achievements of Dr. Andrew Ng, especially his success with Coursera, as well as his research on deep learning, a new machine learning paradigm. Deep learning has been applied to many real-world applications with great success, bringing significant improvements in prediction and recognition. It would be great to have the chance to meet Dr. Andrew Ng and discuss deep learning and future trends in machine learning.

[1] Yanping Chen, Bing Hu, Eamonn Keogh, Gustavo E. A. P. A. Batista. DTW-D: Time Series Semi-Supervised Learning from a Single Example. SIGKDD 2013.
[2] Yuan Hao*, Yanping Chen*, Jesin Zakaria, Bing Hu, Thanawin Rakthanmanon, Eamonn Keogh. Towards Never-Ending Learning from Time Series Streams. SIGKDD 2013.