Interview: Xinghua Lou (Microsoft) on Mining Clinical Notes and Big Data in Healthcare

We discuss data mining of cancer clinical data, the LDA topic model, challenges in mining clinical notes, big data in healthcare, and more.

Xinghua Lou is an applied researcher and software engineer in machine learning and pattern recognition, focusing on social data at Microsoft. Xinghua has published over 25 scientific papers in top-tier machine learning and scientific computing venues such as NIPS, ICML, CVPR, Bioinformatics, IEEE, MIT Press, and Springer, and has won the Best Paper Award at MICCAI Machine Learning in Medical Imaging. Xinghua received his Ph.D. from the University of Heidelberg (Heidelberg, Germany) and his M.Eng and B.Eng from Tsinghua University (Beijing, China). He has also worked for IBM Research (Beijing, China) and Memorial Sloan-Kettering Cancer Center (New York, USA).

Xinghua recently delivered a talk at the Big Data Innovation Summit 2014, held in Santa Clara, on “Mining Cancer Clinical using Topic Modelling”.

Here is my interview with him:

Anmol Rajpurohit: Q1. When you started mining the cancer clinical data, what were your expectations? What insights did you obtain at the end of your research?

Xinghua Lou: We started this project as a simple empirical analysis to evaluate the performance of topic models in understanding clinical text notes. Later the project evolved into more predictive analysis based on the output of the topic model. In the end, we learned that topic modeling of clinical notes is quite helpful for finding special communities of patients, predicting important attributes in the clinical database such as the ICD-9 code, and discovering correlations between patient profiles and genetic mutation tests.

AR: Q2. Why did you select the LDA topic model for your research? Can you please describe the process followed and results obtained?

XL: Among various techniques for understanding a text corpus, we chose LDA topic models (implemented in GraphLab) because of their previous success in understanding scientific literature as well as webpages. We followed a process roughly as follows: data cleaning and standardization, topic modeling, clinical note clustering and visualization, community finding, and cancer-gene correlation analysis. This process was mainly implemented by Katherine Chan under my supervision. We had a few interesting findings, such as a community of patients who care a great deal about the risk of treatment, the ability to predict ICD-9 codes from the topic modeling output, and some interesting correlations between patient profiles and genetic mutation tests (some supported by previously published research).
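
The first two steps of that process (cleaning, then topic modeling) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the project used GraphLab's LDA, whereas this sketch swaps in a toy collapsed Gibbs sampler written with the Python standard library only. The stopword list, the function names `clean_note`, `fit_lda` and `top_words`, the hyperparameter values, and the example notes are all assumptions made up for illustration; they are not real clinical data.

```python
import random
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "to", "with", "for", "about", "was", "is"}

def clean_note(text):
    """Lowercase a note, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def fit_lda(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents.

    Returns (theta, topic_word, vocab): per-document topic proportions,
    per-topic word counts, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return theta, topic_word, vocab

def top_words(topic_word, vocab, topic, n=3):
    """Return the n most frequent words assigned to one topic."""
    ranked = sorted(range(len(vocab)), key=lambda v: topic_word[topic][v], reverse=True)
    return [vocab[v] for v in ranked[:n]]

# Made-up example notes, not real clinical text.
notes = [
    "Patient asked about the risk of surgery and the risk of chemotherapy",
    "Discussed risk and side effects of the treatment with the patient",
    "Mutation test ordered for the tumor sample",
    "Tumor sample sent for gene mutation test",
]
docs = [clean_note(n) for n in notes]
theta, topic_word, vocab = fit_lda(docs, num_topics=2)
for k in range(2):
    print("topic", k, top_words(topic_word, vocab, k))
```

The rows of `theta` are the per-note topic proportions. In a pipeline like the one described, vectors of this kind would be the input to the later clustering, community finding, and prediction steps.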

AR: Q3. What were the major challenges in mining Clinical Data, particularly the Clinical notes? Did you meet any unexpected challenges in your research?

XL: One technical challenge we encountered was data cleaning and standardization, which has to be done specifically for the clinical notes on hand. The major challenge was actually data collection. As mentioned in my presentation, biomedical data is very expensive to expand. Dr. Gunnar Rätsch, professor of machine learning and computational biology at Sloan-Kettering/Cornell and leader of our project, spent much more time finding the appropriate data for this project than actually completing the analysis.

AR: Q4. How did you go about reducing dimensionality so that the results could be visualized well and thereby understood better?

XL: We already have powerful tools and skills to answer quantitative questions, but the difficult part sometimes is finding the right questions. That's when dimensionality reduction and visualization come into play. They provide an easily accessible overview of all our data, which helps us to build intuitions and formulate hypotheses. Once done, follow-up validations are mostly easy and straightforward.
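
As one possible illustration of that kind of overview, the sketch below projects per-note topic vectors onto two dimensions so they could be drawn as a scatter plot. The interview does not say which reduction method the project used; principal component analysis (PCA), computed here by power iteration with the standard library only, is an assumption, as are the function name `pca_2d` and the made-up input vectors.

```python
import math
import random

def pca_2d(rows, iterations=200, seed=0):
    """Project rows (equal-length lists, at least 2 columns) onto 2 principal axes.

    Returns (points, components): the 2-D coordinates of each row and the
    two orthonormal direction vectors found by power iteration.
    """
    rng = random.Random(seed)
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / n for b in range(dim)]
        for a in range(dim)
    ]

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    def orthonormalize(v, basis):
        # Remove directions already found, then scale to unit length.
        for b in basis:
            proj = dot(v, b)
            v = [x - proj * y for x, y in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        return [x / norm for x in v] if norm > 1e-12 else None

    components = []
    for _ in range(2):
        v = orthonormalize([rng.random() + 0.1 for _ in range(dim)], components)
        for _ in range(iterations):
            w = [dot(row, v) for row in cov]  # multiply covariance by v
            w = orthonormalize(w, components)
            if w is None:  # no variance left in this direction
                break
            v = w
        components.append(v)

    points = [[dot(c, comp) for comp in components] for c in centered]
    return points, components

# Made-up topic proportions for four notes over three topics.
topic_vectors = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
]
points, components = pca_2d(topic_vectors)
for p in points:
    print([round(x, 3) for x in p])
```

Each output pair is one note's position in the plane. Notes with similar topic mixes land near each other, which is the kind of picture that can suggest a grouping worth testing quantitatively afterwards.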

AR: Q5. What kind of applications and further research do you foresee for your research work?

XL: For applications, we are looking into using this work for clinical decision support systems, such as predicting or validating ICD-9 codes, as well as for experimental design in cancer-gene correlation finding. As for future research, Dr. Theofanis Karaletsos, co-author of the project, will continue this research in an exciting direction: modeling the temporal dynamics of clinical notes.

AR: Q6. During your presentation, you cited a report saying that "The potential annual value of Big Data to US health care is $300 billion". What do you think about the progress that has been made so far? What do you expect in the next few years?

XL: There have been some exciting developments. For example, Memorial Sloan-Kettering Cancer Center is incorporating IBM's Watson for cancer treatment analysis and recommendation. There are similar stories at other medical institutes such as Kaiser Permanente and Mayo Clinic. I believe Big Data for healthcare will keep growing, which will provide tons of opportunities for hospitals as well as data analytics solution providers.

AR: Q7. What do you think are the most effective ways to learn Big Data skills?

XL: Just read and practice, especially practice, because you will not truly understand the "curse of dimensionality" until your data makes you complain about it!

AR: Q8. What book did you recently read and like?

XL: Nate Silver's The Signal and the Noise was quite interesting.