Genetics as a Social Network – A Data Scientist Perspective

You can think about a cell’s genetics as a huge social network. We can then take the DNA sequences of the transcription factor footprints associated with each gene and predict the proteins bound to these regulatory regions, and in this way reconstruct the genetic regulatory networks in every cell type.

By Nikhil Buduma.

Data science and biology have never really mixed well. And in retrospect, it’s pretty understandable why. Biology and medicine have their own lingua franca, which makes for a pretty steep learning curve. People who thrive at this intersection not only have to be in tune with the fundamentals of biochemistry and genetics, but also need to be mathematically adept and strong algorithmic thinkers. For decades, we’ve gotten away with computer scientists sticking with computers and biologists sticking with genetics.

But things are rapidly changing, and there’s a growing need for people who can bring a data-driven approach to medicine. The advent of modern high-throughput biotechnology has brought about a data deluge that has completely changed the field’s landscape. For example, a binary alignment file for a single human genome could easily amount to hundreds of gigabytes or terabytes of raw data. Without data science, we risk missing out on valuable insights that could fundamentally change how we deliver medicine.

Modern genetics is a clear example of where data science is already beginning to make a huge impact on our understanding of biology. Traditional biologists have nearly always approached biological systems from a highly simplified, focused perspective. We’ve tried to analyze single genes at a time, often isolated from the larger context in which they exist: protein A upregulates protein B, which downregulates protein C. That’s all there was to it.

Figure: How biologists used to think about biochemical pathways.

But in reality, genetics is much more complicated than that. A single protein could have its expression modulated by tens of upstream regulators (called transcription factors). And in turn, the same protein could affect the expression of hundreds of other proteins. In a sense, you can think about a cell’s genetics as a huge social network.
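To make the network picture concrete, here is a minimal Python sketch that stores regulatory relationships as a directed graph and counts, for each node, how many regulators act on it and how many targets it regulates. The node names (`TF_A`, `GENE_B`, and so on), the edge list, and the helper `build_network` are hypothetical placeholders for illustration, not real regulatory data.

```python
from collections import defaultdict

# Hypothetical regulatory edges, written as (regulator, target).
# These names are placeholders, not real genes or transcription factors.
edges = [
    ("TF_A", "GENE_B"),
    ("TF_A", "GENE_C"),
    ("TF_D", "GENE_B"),
    ("GENE_B", "GENE_C"),
]


def build_network(edge_list):
    """Build two adjacency maps from (regulator, target) pairs."""
    targets = defaultdict(set)     # regulator -> set of genes it regulates
    regulators = defaultdict(set)  # gene -> set of its upstream regulators
    for regulator, target in edge_list:
        targets[regulator].add(target)
        regulators[target].add(regulator)
    return targets, regulators


targets, regulators = build_network(edges)

# Out-degree: how many genes a node regulates.
out_degree = {node: len(genes) for node, genes in targets.items()}
# In-degree: how many upstream regulators act on a node.
in_degree = {node: len(factors) for node, factors in regulators.items()}
```

In this toy graph, a node with a high in-degree plays the role of a protein with many upstream regulators, and a node with a high out-degree plays the role of a protein that affects many others.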
The fact that protein A directly regulates protein B is analogous to person A following person B on Twitter. So, quite surprisingly, the same techniques you might use to analyze a user’s Twitter network to get them to click an advertisement are also applicable to analyzing a cell’s regulatory network to diagnose disease and design new therapies.

But how exactly do we interrogate these relationships? How do we even know that protein A regulates protein B in a particular cell type? This is where high-throughput biotechnology comes in. Over the past couple of years, researchers have pioneered a technique called DNase hypersensitivity, which helps us infer these key relationships.

In addition to having a region that directly codes for a protein, a gene also has a number of upstream sequences that are bound by regulatory proteins that control its expression. Essentially, the DNase hypersensitivity technique takes advantage of the fact that DNA, for the most part, is packaged very tightly except around these very specific regulatory sequences. As a result, when the DNA is exposed to a DNA-digesting enzyme, it is mostly cut at these loosely packed and exposed regions. The only exception is the small tract of nucleotides that are directly bound to a regulatory protein.
These nucleotides are protected from digestion, resulting in a very clear “transcription factor footprint.”

Figure: Histogram showing frequency of DNase digestion at each location, with a characteristic hypersensitivity site (green) and corresponding transcription factor footprint (red).

We can then take the DNA sequences of the transcription factor footprints associated with each gene and predict the proteins bound to these regulatory regions using a database such as TRANSFAC. This procedure enables us to reconstruct the genetic regulatory networks at play in every cell type in the body:

Figure: Algorithmically generating the network from footprint data. Figure borrowed from Neph et al.

This has a huge number of applications. For example, this data could be used to understand the foundational differences that distinguish different cell types. Concretely, this could significantly inform drug development by allowing researchers to predict how a drug for Alzheimer’s, for example, might have side effects on the patient’s heart or kidney.

Figure: Comparing the regulatory networks in various cell types in the human body. Figure borrowed from Neph et al.

Moreover, my current research involves constructing these networks to compare humans to laboratory model organisms such as mice, rats, and chimpanzees. These comparative models could help us figure out why certain drugs work well in animal studies but fail miserably in clinical trials. Every single year, approximately 95% of drugs fail to obtain approval, and building these models could potentially save billions of dollars in wasted resources.

With petabytes of data being produced every single year, biology and medicine need data science now more than ever before. Undoubtedly, data will shape the future in ways that we can only begin to imagine.
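As a rough illustration of the footprint-to-network procedure described above, the sketch below finds low-cleavage runs flanked by heavily cut bases in a per-base DNase cut count, then checks the protected sequence against a small motif table to emit regulator-to-gene edges. Everything here is an assumption made for illustration: the thresholds, the toy cut counts and sequence, the names `TF_X`, `TF_Y`, and `GENE_B`, and the use of exact consensus-string matching. A real pipeline would typically score footprints statistically and match them against position weight matrices from a database such as TRANSFAC.

```python
def find_footprints(cuts, high=10, low=2, min_len=4):
    """Return half-open (start, end) intervals of low-cleavage runs
    that are flanked on both sides by heavily cut bases.

    The thresholds are arbitrary illustrative values, not published ones.
    """
    footprints = []
    i, n = 0, len(cuts)
    while i < n:
        if cuts[i] <= low:
            j = i
            while j < n and cuts[j] <= low:
                j += 1
            # A footprint needs exposed (heavily cut) DNA on both sides.
            flanked = i > 0 and j < n and cuts[i - 1] >= high and cuts[j] >= high
            if flanked and j - i >= min_len:
                footprints.append((i, j))
            i = j
        else:
            i += 1
    return footprints


def infer_edges(gene, sequence, cuts, motifs):
    """Emit (factor, gene) edges for motifs found inside footprints.

    Exact substring matching is a simplification of real motif scanning.
    """
    edges = []
    for start, end in find_footprints(cuts):
        protected = sequence[start:end]
        for factor, consensus in motifs.items():
            if consensus in protected:
                edges.append((factor, gene))
    return edges


# Toy inputs: hypothetical cut counts and sequence for one regulatory region.
cuts = [0, 1, 12, 15, 14, 1, 0, 1, 0, 2, 13, 16, 12, 1, 0]
sequence = "TTGCATGACTCAGGA"
motifs = {"TF_X": "GACT", "TF_Y": "CCGG"}  # placeholder consensus strings

footprints = find_footprints(cuts)
edges = infer_edges("GENE_B", sequence, cuts, motifs)
```

Repeating this for every gene in a cell type would yield an edge list, which is the regulatory network for that cell type in its simplest form.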
Cross-posted from:

Bio: Nikhil Buduma is a computer science student at MIT with deep interests in machine learning and the biomedical sciences. He is a two-time gold medalist at the International Biology Olympiad, a student researcher, and a “hacker.” He was selected as a finalist in the 2012 International BioGENEius Challenge for his research on the pertussis vaccine. Nikhil also has a passion for education: he regularly writes technical posts on his blog, teaches machine learning tutorials at hackathons, and in 2014 received the Young Innovator Award from the Gordon and Betty Moore Foundation for using augmented reality to re-envision the traditional chemistry set. In his free time, Nikhil loves pick-up basketball, improvising on the guitar, and teaching himself to play mainstream pop music on the piano.