The big data ecosystem for science: Genomics
The field of genomics has undergone a revolution over the past decade as the cost of sequencing has rapidly declined and the practice of sequencing has been commoditized. We review the Big Data ecosystem in genomics.
By Wahid Bhimji, NERSC
This is part 2, a continuation of the post on Big Data Ecosystem for Science.
Primary collaborator: Debbie Bard (LBNL)
Genomics and DNA sequencers
The field of genomics has undergone a revolution over the past decade as the cost of sequencing has rapidly declined and the practice of sequencing has been commoditized. These advances are enabling discoveries in all areas of biology and have broad applications in cancer research, personalized medicine, bio-energy, and genetic engineering, just to name a few. The field of genomics is too broad to quickly capture the range of inquiry and discovery that is being enabled, but I will briefly describe some of the relevant technologies and analysis patterns below.
One of the leading sources of data in the genomic space is a sequencer that can extract the DNA sequence from a physical sample. These sequencers generate files of sequence fragments, typically called “reads,” that contain the genetic code along with information about the errors in those measurements. Sequencing costs have dropped by more than five orders of magnitude in a roughly 10-year period (see Figure 4). This dramatic improvement has been primarily enabled by “short-read” sequencing technologies that generate billions of reads in a single run, with each read typically hundreds of base pairs in length. In contrast, single-molecule-based technologies provide longer reads, but typically have lower throughput. These technologies can be used to sequence DNA to understand the basic “code” of an organism, but short-read sequencers are increasingly being used to measure protein expression, variation, environmental response, and so on.
Figure 4. Graph illustrating the declining cost of DNA sequencing technology. Source: National Human Genome Research Institute, used with permission.
The direct product of DNA sequencing is typically “reads” files, such as FASTA or FASTQ. These files can range in size from kilobytes to terabytes and capture a sequence ID; the sequence itself; and, potentially, error measurements. The smaller file sizes typically correspond to smaller, simple organisms such as microbes or viruses, and the larger file sizes come from complex organisms such as plants and animals or samples that include communities of organisms (e.g., microbial communities from a human gut or from soil). However, the raw sequence data is just one small piece of the data being generated in genomics.
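To make the three parts of a read record concrete (sequence ID, sequence, and error measurements), here is a minimal sketch of a FASTQ reader. It assumes the common unwrapped four-line record layout and Phred+33 quality encoding; the `Read` class and `parse_fastq` function are hypothetical names chosen for illustration, not part of any particular genomics library.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Read:
    read_id: str          # header line without the leading '@'
    sequence: str         # base calls, e.g. "GATTACA"
    qualities: List[int]  # per-base Phred quality scores


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[Read]:
    """Yield Read records from an iterable of FASTQ lines.

    Assumption: each record is exactly four lines (header, sequence,
    '+' separator, quality string) and qualities use Phred+33 encoding.
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header line, got {header!r}")
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError(f"truncated record for {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator, got {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character encodes one integer Phred score.
        yield Read(header[1:], sequence, [ord(c) - offset for c in quality])


# A made-up two-record example, not real sequencing data.
example_lines = [
    "@read1",
    "GATTACA",
    "+",
    "IIIII##",
    "@read2",
    "ACGT",
    "+",
    "IIII",
]
reads = list(parse_fastq(example_lines))
for read in reads:
    print(read.read_id, read.sequence, read.qualities)
```

Because `parse_fastq` is a generator, it can stream over an open file handle one record at a time, which matters when a reads file is far larger than memory.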
Once the data is analyzed, there are a variety of file formats (of varying size) that capture everything from the code and structure of a genome to annotations about the function and role of the genes, variations of an individual compared to a reference (e.g., healthy cells versus cancerous cells), metabolic models, and so on. Although the formats may be standardized, how data is encoded in the formats can vary significantly and can make it difficult to reliably interpret data from various sources. The scale of the data stored can vary from hundreds of gigabytes or terabytes for an individual investigator to petabytes for a major sequencing center. The I/O patterns in genomics often stress storage systems because many analysis pipelines can generate a large number of small files as well as nearly random access patterns. Accommodating these workloads remains an open challenge, and even new technologies like solid state storage are often not sufficient, especially at scale.
Data processing and analysis
Scientists use the data from sequencing instruments to answer a broad range of questions. Consequently, the types of processing and analyses vary greatly, depending on what is being sequenced and the questions being addressed. However, many genomics pipelines follow a common pattern: quality control and filtering, assembly or alignment against a reference, and annotation or comparative analysis. These steps may be followed by modeling, integrated analysis, or statistical analysis, depending on the fundamental questions being asked.
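As an illustration of the first step in that pattern, the sketch below filters reads by length and mean base quality. It assumes quality scores are on the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). The thresholds and function names are hypothetical, and averaging Phred scores is only one simple criterion; production pipelines typically rely on dedicated quality-control tools with more careful rules.

```python
from typing import Iterable, Iterator, List, Tuple

# (read ID, sequence, per-base Phred scores); a hypothetical in-memory layout.
ReadTuple = Tuple[str, str, List[int]]


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def passes_quality_control(
    sequence: str,
    qualities: List[int],
    min_length: int = 50,            # assumed threshold, for illustration only
    min_mean_quality: float = 20.0,  # assumed threshold (about 1% error per base)
) -> bool:
    """Keep a read only if it is long enough and its mean quality is high enough."""
    if len(sequence) < min_length or not qualities:
        return False
    return sum(qualities) / len(qualities) >= min_mean_quality


def filter_reads(
    reads: Iterable[ReadTuple],
    min_length: int = 50,
    min_mean_quality: float = 20.0,
) -> Iterator[ReadTuple]:
    """Lazily yield the reads that pass the quality-control check."""
    for read_id, sequence, qualities in reads:
        if passes_quality_control(sequence, qualities, min_length, min_mean_quality):
            yield read_id, sequence, qualities


# Made-up reads: one good, one too short, one with low quality.
sample_reads = [
    ("good", "ACGT" * 20, [35] * 80),
    ("short", "ACGT", [35] * 4),
    ("noisy", "ACGT" * 20, [10] * 80),
]
kept_ids = [read_id for read_id, _, _ in filter_reads(sample_reads)]
print(kept_ids)
```

Reads that survive this step would then move on to assembly or alignment, so the filter is written as a generator that can be chained with later stages without holding every read in memory.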
The scale and types of computational resources required to conduct these analyses also vary. Some analyses, such as assembly, cannot be easily partitioned and are typically run on single large-memory nodes. For example, some large assemblies are run on terabyte memory systems. Other analyses, such as annotation or comparative analysis, can easily be subdivided and run in an “embarrassingly” parallel manner across hundreds or thousands of compute cores (i.e., without communication between the tasks). Traditionally, this type of analysis has been conducted on large workstations or commodity clusters. However, the increasing scale is leading some researchers to explore large high-performance computing (HPC) platforms for their analyses. For example, HipMer is an assembler written in a high-performance parallel language that performs de novo assembly of complex eukaryotic genomes using more than 10,000 cores. Other applications have been ported to run on accelerators and on integrated circuits known as field-programmable gate arrays (FPGAs) to accelerate specific kernels. The challenge in making this transition from commodity clusters to high-performance platforms is the variety of applications that are used in many workflows and the rapid increase in demand.
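The “embarrassingly” parallel pattern described above can be sketched as a map over independent tasks that never communicate with each other. The example below uses GC content (the fraction of G and C bases) as a stand-in for a real per-sequence analysis; the function names are hypothetical. It uses a thread pool from Python's standard library only to show the structure. For CPU-bound work, one would more likely use separate processes or independent batch jobs on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def gc_fraction(sequence: str) -> float:
    """Fraction of bases that are G or C; a stand-in for a real analysis."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def analyze_independently(
    sequences: Dict[str, str], max_workers: int = 4
) -> Dict[str, float]:
    """Run one task per sequence; no task depends on any other task's result."""
    ids = list(sequences)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        results = pool.map(gc_fraction, (sequences[i] for i in ids))
        return dict(zip(ids, results))


# Made-up sequences, not real genomic data.
example_sequences = {
    "seq_a": "GGCC",
    "seq_b": "ATAT",
    "seq_c": "ATGC",
}
gc_by_sequence = analyze_independently(example_sequences)
print(gc_by_sequence)
```

Because each task needs only its own input, the same structure scales from a few workers on a workstation to thousands of cores. Assembly is harder to partition precisely because its tasks do need to share information.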
Data transfer and access
The raw data or analysis product can be transferred or shared at various stages using a variety of methods. Many web-based tools exist for analysis and sharing, such as KBase, Galaxy, Illumina BaseSpace, and the National Center for Biotechnology Information (NCBI). Transfers are often done using simple web posting, but other methods such as FTP, Globus, and even commercial solutions such as those from Aspera are also commonly used. Many sequencing products are posted to repositories like those run by the NCBI. A large number of sites also exist to support specific domains, organisms, and applications. For example, there are sites that focus on fungi, pathogens, microbes, and even specific organisms like corn or the fruit fly.
Genomic sequencing technologies continue to improve. For example, new single-molecule-based sequencers promise to allow sequencing to be done on a laptop in the field. Although sequencing costs are not declining as rapidly as in the past, they continue to decrease. Many fields are looking at how they can capitalize on these improvements to enable new applications—e.g., some precision medicine techniques aim to use sequencing data on a patient to customize therapies or detect changes in the body over time. However, these new techniques require innovations in the data space, particularly in storage and analysis methods.
Bio: Wahid Bhimji is a big data architect in the Data and Analytics Services group at NERSC. His current interests include machine learning, data management, and high-performance computing. He was previously heavily involved in data management for the Large Hadron Collider. Wahid has worked for 15 years in scientific data management and analysis in academia and government and has a Ph.D. in high-energy particle physics.
This post originally appeared on oreilly.com, organizers of Strata Hadoop World. Republished with permission.