Interview: Xinghua Lou (Microsoft) on Mining Clinical Notes and Big Data in Healthcare
We discuss data mining of cancer clinical data, the LDA topic model, challenges in mining clinical notes, big data in healthcare, and more.

Xinghua recently delivered a talk at the Big Data Innovation Summit 2014, held in Santa Clara, on “Mining Cancer Clinical using Topic Modelling”.
Here is my interview with him:
Anmol Rajpurohit: Q1. When you started mining the cancer clinical data, what were your expectations? What insights did you obtain at the end of your research?

AR: Q2. Why did you select the LDA topic model for your research? Can you please describe the process followed and results obtained?
XL: Among the various techniques for understanding a text corpus, we chose LDA topic models (implemented in GraphLab) because of their previous success in understanding scientific literature as well as webpages. We followed a process roughly as follows: data cleaning and standardization, topic modeling, clinical note clustering and visualization, community finding, and cancer-gene correlation analysis. This process was mainly implemented by Katherine Chan under my supervision. We had a few interesting findings, such as a community of patients who care greatly about the risk of the treatment, the ability to predict ICD-9 codes from the topic modeling output, and some interesting correlations between patient profiles and genetic mutation tests (some supported by previously published research).
AR: Q3. What were the major challenges in mining clinical data, particularly the clinical notes? Did you meet any unexpected challenges in your research?

AR: Q4. How did you go about reducing dimensionality so that the results could be visualized well and thereby, understood better?
XL: We already have powerful tools and skills to answer quantitative questions, but the difficult part sometimes is finding the right questions. That's when dimensionality reduction and visualization come into play. They provide an easily accessible overview of all our data, which helps us build intuitions and formulate hypotheses. Once that is done, follow-up validations are mostly easy and straightforward.
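As a rough illustration of the kind of overview described here, a document-topic matrix can be projected to two dimensions for plotting. The interview does not say which reduction method was used, so PCA is an assumption, the topic proportions below are synthetic, and `project_2d` is a hypothetical helper name.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for per-note topic proportions: 200 notes, 20 topics.
rng = np.random.default_rng(0)
doc_topics = rng.dirichlet(alpha=np.ones(20) * 0.1, size=200)


def project_2d(matrix):
    """Project rows onto the two directions of greatest variance."""
    pca = PCA(n_components=2, random_state=0)
    coords = pca.fit_transform(matrix)
    return coords, pca.explained_variance_ratio_


coords, var_ratio = project_2d(doc_topics)
# coords[:, 0] and coords[:, 1] can be passed to any scatter-plot tool.
print(coords.shape, var_ratio)
```

The explained-variance ratio gives a quick sense of how much structure the two-dimensional picture keeps, which matters when the plot is used to form hypotheses rather than to confirm them.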
AR: Q5. What kind of applications and further research do you foresee for your research work?
XL: For applications, we are looking into using this work for clinical decision support systems, such as predicting/validating ICD-9 codes, as well as experimental design for cancer-gene correlation finding. As for future research, Dr. Theofanis Karaletsos, co-author of the project, will continue this research in an exciting direction: modeling the temporal dynamics of clinical notes.
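One way to picture predicting a diagnosis code from topic modeling output is to train a simple classifier on per-note topic proportions. The sketch below assumes logistic regression, which the interview does not specify, and uses synthetic data with placeholder labels (`code_A`, `code_B`) instead of real ICD-9 codes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic topic proportions: each group emphasizes a different topic.
group_a = rng.dirichlet([5, 1, 1, 1], size=100)
group_b = rng.dirichlet([1, 1, 1, 5], size=100)
X = np.vstack([group_a, group_b])
# Placeholder labels standing in for diagnosis codes (hypothetical).
y = np.array(["code_A"] * 100 + ["code_B"] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training notes, then measure accuracy on held-out notes.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The same classifier could also support validation: a note whose recorded code disagrees with a confident prediction is a candidate for manual review.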
AR: Q6. During your presentation, you cited a report saying that "The potential annual value of Big Data to US health care is $300 billion". What do you think about the progress that has been made so far? What do you expect in the next few years?

AR: Q7. What do you think are the most effective ways to learn Big Data skills?
XL: Just read and practice, especially practice, because you will not truly understand the "curse of dimensionality" until your data makes you complain about it!
AR: Q8. What book did you recently read and like?
XL: Nate Silver's The Signal and the Noise was quite interesting.