Big Data for Social Good: UC Berkeley and Geisinger Health Collider Project

The Geisinger Health Collider Project gives participating students first-hand experience with various techniques, ideas, and challenges stemming from clinical informatics by using real clinical data to address impactful problems in healthcare.




WHO WE ARE: Geisinger Health System is known as an early adopter of modern paradigms of healthcare and medical informatics. The Data Science team at Geisinger combines the best practices of machine learning to support decision-making in healthcare. In 2014 we completed over 20 research projects using Electronic Medical Records (EMRs) to predict personalized treatment outcomes, to better allocate healthcare resources, and to obtain early warnings in crisis scenarios. The team is focused on research that lies at the interface of clinical medicine and applied mathematics and computer science, but is also interested in all forms of multidisciplinary studies centered on effective utilization of data in healthcare. Academic outreach is a part of our mission: we maintain collaborations with university researchers of all levels. For students and early career professionals, we believe in providing opportunities for immersive training experiences using real-world, unrefined data to address practical, patient-centric problems.

All information for this project is available at

Including the:
PROPOSAL SUMMARY: Through the Collider Project, we invite teams of college students to participate in a competition of projects focused on blending clinical data with additional, novel types and sources of data to improve the quality of patient-centric healthcare analytics. Each team will have 2 participants. A team will select one of several proposed projects aimed at improving some dimension of the quality of healthcare delivery (e.g. improving the well-being, safety, and informed freedom of choice in the interactions between society, the individual, and a healthcare provider). We have selected three challenges from which participants can choose, each of which traditionally has been approached using a core set of clinical data in conjunction with a standard, conservative analytic strategy. Arguably, each of these analyses has fallen short of its maximal potential because it has not adequately addressed traditionally non-clinical (molecular, socioeconomic, behavioral, etc.) dimensions of these complex problems. Our goal is to encourage the participants in this project to improve the quality of these analytics by blending traditional and nontraditional data sources and by adopting an analytic strategy that incorporates novel approaches.

For each challenge we will provide a core set of clinical data derived from the EMR. We challenge the teams to find additional, nonclinical data to supplement our core data set, to resolve fundamental and technical difficulties of data blending, and to demonstrate the effectiveness of multi-disciplinary data use and analysis. This may result in "obtaining a better answer" or "answering a better question," depending on each team's individual take on the question being asked. Teams will have multiple opportunities to interact with Geisinger data scientists and will receive access to appropriate data sets and software tools. The final product will be an academic report accompanied by a body of transformed/blended data.

We expect that for the basic level (BS degrees, independent studies) this will be mostly an effort in data acquisition, cleanup, and blending followed by more traditional analytics. For intermediate students (early graduate studies) and advanced students (MS degree or higher) we expect increasingly complex strategies for data integration and analysis that produce study results at a higher level of scientific rigor. The reports will be evaluated based on creativity, scientific rigor, and on how well the achieved results match the stated objectives. Different teams can address the same question and cite each other's work if they wish: we will evaluate each effort separately. The goal of this exercise is not to determine who is the best coder or the best mathematician. While some coding will be required, some guidance and technical assistance will also be provided by our team. Ultimately our goal is to allow students to apply their individual skills, backgrounds, interests, and talents to the task of data blending, hypothesis generation, and data analysis. The analyses will be intended to answer specific medical questions, and so the results may have immediate tangible impact on this field.

REWARD: The winning team will be offered 3-month summer internships at Geisinger Data Science, where they will work on a theme of their choosing and will be closely supported by the department faculty.

ELIGIBILITY: The competition is open to students of all fields, and is not limited to mathematics and computer science majors. Arguably, every modern profession has elements of data analysis, and we expect the participants to use insights from their fields to look for additional data and suggest new analysis strategies. Seniority/academic standing is also not a limiting factor for participation.

BACKGROUND AND IMPACT: Data integration, or "blending," is defined as combining data from multiple sources resulting in a unified view of the transformed data. Blended data sets can be largely analogous (same patient features, different clinics), or come from very different domains (EMRs, financial transactions, extracts from social media, academic records, criminal records, voting turnouts, etc.). The sets will not overlap at the level of individuals, but they should relate to the same general population or phenomena.

The technical process of data integration can be laborious, but it is relatively straightforward in its implementation. It may involve conversion between formats, filtering, recovery of missing portions of data, and elements of mathematical modelling to formalize mappings between the data sets. However, the questions of what data to look for, and how to best use it, depend on the applied problem and cannot be formalized. Resolving these issues requires a good understanding of the involved disciplines and a creative effort to construct a novel solution. Establishing feedback is also important: in multi-source data studies, one may have to restate the problem and re-acquire data as the analysis evolves.
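As an illustration only, the mechanical steps named above (format conversion, filtering, recovery of missing values, and mapping between data sets) might look like the following minimal Python sketch. Everything in it is an assumption made for the example: the column names, the region codes R1 and R2, the income figures, the adult-only filter, and the median imputation rule are invented and do not describe the Geisinger data or tooling. Because the blended sets are not expected to share individuals, the join is done on a shared region code rather than on a patient identifier.

```python
import csv
import io
from statistics import median

# Hypothetical clinical extract (toy values): one row per patient,
# with one missing BMI measurement.
CLINICAL_CSV = """patient_id,region,age,bmi
p1,R1,64,31.2
p2,R1,58,
p3,R2,71,27.5
p4,R2,17,22.0
"""

# Hypothetical non-clinical data (toy values), keyed by region, not by patient.
AREA_DATA = {
    "R1": {"median_income": 48000},
    "R2": {"median_income": 55000},
}


def load_clinical(text):
    """Format conversion: parse CSV text into typed dictionaries."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "patient_id": raw["patient_id"],
            "region": raw["region"],
            "age": int(raw["age"]),
            "bmi": float(raw["bmi"]) if raw["bmi"] else None,
        })
    return rows


def blend(rows, area_data, min_age=18):
    """Filter to adults, impute missing BMI, and attach region-level features."""
    adults = [r for r in rows if r["age"] >= min_age]
    known = [r["bmi"] for r in adults if r["bmi"] is not None]
    fallback = median(known) if known else None  # simple imputation rule
    blended = []
    for r in adults:
        area = area_data.get(r["region"], {})
        blended.append({
            **r,
            "bmi": r["bmi"] if r["bmi"] is not None else fallback,
            "bmi_imputed": r["bmi"] is None,  # keep track of what was filled in
            "median_income": area.get("median_income"),
        })
    return blended


blended = blend(load_clinical(CLINICAL_CSV), AREA_DATA)
for row in blended:
    print(row)
```

The sketch covers only the mechanical part. Which external source to join, at what granularity, and whether median imputation is defensible for a given question are the judgment calls that the paragraph above says cannot be formalized.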

Modern data science is driven by looking for evidence in new ways and from new sources. But does using a larger volume and variety of data result in an improved understanding of the underlying questions? This question remains to be answered in detail, and establishing better practices of data blending and then studying the downstream consequences will be beneficial for medical informatics and data science in general. The hunt for missing data is still an art rather than a science. Many professionals don't know what is available just outside of their scope of interest, and many academic data science programs have limited access to proprietary and sensitive "real-world" data. Instead, they rely on the instructors' and students' ability to hunt for public data sources. This leads to very uneven performance of educational programs: in fact, many early-career professionals are completely unprepared for dealing with data of industrial size and complexity. The proposed healthcare-academia immersive competition will contribute to the quality of modern data science education and generate a pool of talent in data acquisition and integration for multidisciplinary research.

TIMELINE: The project kicked off on October 15, 2015, and will have two phases: Phase 1 will take place during the Fall 2015 academic semester and Phase 2 will take place during the Spring 2016 term.


1. Integrated data analysis for early warning of heart/lung failure

The advantages of early diagnosis of serious medical conditions are obvious. This is particularly important for common conditions such as congestive heart failure (CHF) and chronic obstructive pulmonary disease (COPD), as they are among the most common causes of death in the US. They may result from multiple causes and exist together with other complications: in the absence of an early warning model, it is difficult to prioritize testing and allocate resources in the diagnostic process. Both conditions affect millions of patients, and are associated with socioeconomic factors: occupational hazards, lifestyle choices, and environmental factors. Using statistical inference on EMRs, we can estimate the chances of a CHF/COPD diagnosis pre-emptively, before it is confirmed by a medical specialist. Can we make our prediction better by using additional information?

2. Indirect data collection to support anti-obesity efforts in healthcare and society

Obesity is a primary population health concern in the US. The contributing factors (diet, inactive lifestyle, role of food in social interactions) are fairly well-described, but their role in creating a successful prevention strategy is not fully understood. Analytic efforts establishing links between isolated social and medical factors and obesity are often inconclusive or ineffective. New hope lies in integrated analysis of complete medical and social histories. Using EMRs, we can see patterns that obese patients have in common, infer risk of obesity from other medical conditions, and also find new ways to characterize patients who could be successfully treated. Can we use even more data in analysis by including non-clinical events? What non-traditional and indirect information about a patient's background and multi-faceted behavior can be collected to contribute to anti-obesity studies?

3. It's not all in your head: multi-disciplinary data analysis of common psychological conditions

Psychological mood disorders are often described as both social and medical phenomena. Recent studies in suicide prevention make connections between mood disorders and patterns in residential power use, logs of phone calls, and purchasing history, while older studies identify certain demographic groups as being more at risk. The problem of screening for and predicting the risk of mood disorders in the general population could have a major impact on population health. Can a combination of EMRs (containing clinical data) and other data sources improve upon current strategies for predicting the individualized risk for developing a mood disorder?
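The three challenges above share one pattern: estimate a risk from EMR-derived features, then ask whether adding non-clinical features improves the estimate. A minimal Python sketch of that comparison follows. It is a sketch under stated assumptions, not the method used at Geisinger: the eight rows are invented toy values, the model is a plain logistic regression fitted by batch gradient descent, and the step count and learning rate are arbitrary choices for the example.

```python
import math

# Toy rows invented for illustration:
# (clinical feature, non-clinical feature, diagnosed?)
# Both features are assumed to be already scaled to [0, 1].
ROWS = [
    (0.2, 0.0, 0), (0.3, 1.0, 1), (0.4, 0.0, 0), (0.5, 1.0, 1),
    (0.6, 0.0, 0), (0.7, 1.0, 1), (0.8, 0.0, 1), (0.9, 1.0, 1),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Estimated probability of diagnosis; weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)


def log_loss(weights, xs, ys):
    """Mean negative log-likelihood (lower is better)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(weights, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(ys)


def fit(xs, ys, steps=2000, lr=0.5):
    """Logistic regression fitted by plain batch gradient descent."""
    weights = [0.0] * (len(xs[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(xs, ys):
            err = predict(weights, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        weights = [w - lr * g / len(ys) for w, g in zip(weights, grad)]
    return weights


ys = [y for _, _, y in ROWS]
clinical_only = [(a,) for a, _, _ in ROWS]   # EMR-style feature alone
blended = [(a, b) for a, b, _ in ROWS]       # EMR-style plus non-clinical

w_clinical = fit(clinical_only, ys)
w_blended = fit(blended, ys)
loss_clinical = log_loss(w_clinical, clinical_only, ys)
loss_blended = log_loss(w_blended, blended, ys)
print(f"clinical only: {loss_clinical:.3f}  blended: {loss_blended:.3f}")
```

On these invented rows the blended model ends with the lower training loss, but only because the toy data were built so that the extra feature is informative. A real study would need held-out patients, calibration checks, and attention to how the non-clinical data were linked before claiming that blending improved a prediction.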

Project Leader: Dr. Nicholas Marko, Geisinger.
Scientific coordinator: Prof. Roberto V. Zicari, UC Berkeley Visiting Scholar and Frankfurt Big Data Lab.
For more information: David Law,