The big data ecosystem for science: Physics, LHC, and Cosmology
Big data management is essential for experimental science, and the technologies used in various science communities often predate those in the big data industry and, in many cases, continue to develop independently. This post highlights some of these technologies, focusing on those used by several projects supported by the National Energy Research Scientific Computing Center (NERSC).
By Wahid Bhimji, NERSC
Large-scale data management is essential for experimental science and has been for many years. Telescopes, particle accelerators and detectors, and gene sequencers, for example, generate hundreds of petabytes of data that must be processed to extract secrets and patterns in life and in the universe.
The data technologies used in these various science communities often predate those in the rapidly growing big data industry and, in many cases, continue to develop independently, occupying a parallel big data ecosystem for science (see Figure 1). This post highlights some of these technologies, focusing on those used by several projects supported by the National Energy Research Scientific Computing Center (NERSC).
This post originally appeared on oreilly.com,
organizers of Strata Hadoop World. Republished with permission.
Across these projects we see a common theme: data volumes are growing, and there is an increasing need for tools that can effectively store and process data at such a scale. In some cases, the projects could benefit from big data technologies being developed in industry, and in some other projects, the research itself will lead to new capabilities.
Figure 1. Highlighted tools in the big data ecosystem for science used at NERSC. Source: Wahid Bhimji.
Particle physics and the Large Hadron Collider
The Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) in Geneva is the world’s largest scientific instrument, designed to collide protons at the highest energies ever achieved. The resulting spray of particles is observed in detectors the size of buildings, in an attempt to discover one-in-a-billion events that have the potential to uncover new fundamental particles and, ultimately, secrets of the universe. The extreme rate of data produced, together with the overall volume of data and the rarity of interesting events, has made research at the LHC one of the original examples of big data. LHC experiments require smart data ingestion, efficient data storage formats that allow for fast extraction of relevant data, powerful tools for transfer to collaborators around the world, and sophisticated statistical analysis.
The LHC enables protons to collide 40 million times per second in detectors packed with instruments that take hundreds of millions of measurements during each collision (Figure 2). The only way to deal with the resulting petabyte-per-second data ingest rate is to have multiple levels of data analysis, with the first level, the trigger, running on dedicated hardware at the detector, and the “online” data pipelines running custom analysis software on bespoke data formats for maximum performance. Most initial data is discarded, but around 100 PB per year of raw data is stored for further, more interactive, analysis by physicists at computing centers like NERSC.
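The multi-level reduction described above can be sketched as a chain of increasingly selective filters, each stage cheap enough to keep up with the rate it receives. The field names, thresholds, and event counts below are invented for illustration; the real trigger menus are far more elaborate.

```python
import random

def hardware_trigger(event):
    """Level-1 analogue: a single crude, fast cut (here, on deposited energy)."""
    return event["energy"] > 50.0

def software_trigger(event):
    """High-level-trigger analogue: a more refined selection on surviving events."""
    return event["energy"] > 50.0 and event["n_tracks"] >= 2

random.seed(42)
# Toy "collisions": random energy and track multiplicity per event.
events = [
    {"energy": random.uniform(0, 100), "n_tracks": random.randint(0, 10)}
    for _ in range(100_000)
]

after_l1 = [e for e in events if hardware_trigger(e)]     # first reduction
after_hlt = [e for e in after_l1 if software_trigger(e)]  # second reduction

print(len(events), len(after_l1), len(after_hlt))
```

Each stage only ever sees what the previous one accepted, which is what makes the petabyte-per-second input rate tractable: the expensive processing runs on an already heavily reduced stream.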
Figure 2. A collision in the ATLAS detector at the LHC showing signals recorded in various parts of the detector. Source: ATLAS experiment, copyright CERN, used with permission.
The particle physics community was one of the first to deal with storing and querying large amounts of data, introducing the world’s first petabyte-sized database in 2005 with the BaBar experiment. Around that time, due partly to limitations found by BaBar, researchers at the LHC decided to predominantly make use of the file-based format ROOT (closely related to the dominant analytics framework described below). ROOT offers a self-describing binary file format with huge flexibility for serialization of complex objects and column-wise data access. Today, hundreds of petabytes of data worldwide are stored in this way.
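Column-wise access matters because a typical analysis cuts on only a handful of variables: laid out by column, those variables can be read without deserializing every full event. A toy illustration in plain Python (real ROOT I/O would go through PyROOT or a library such as uproot; the field names and values here are made up):

```python
# Row-wise layout: each event is a complete record, so reading one
# field still means touching every event object.
events_rowwise = [
    {"pt": 12.3, "eta": 0.5,  "phi": 1.1},
    {"pt": 45.6, "eta": -1.2, "phi": 2.8},
    {"pt": 7.8,  "eta": 2.0,  "phi": -0.4},
]

# Column-wise layout (in the spirit of branches in a ROOT tree):
# one contiguous array per variable.
events_columnwise = {
    "pt":  [12.3, 45.6, 7.8],
    "eta": [0.5, -1.2, 2.0],
    "phi": [1.1, 2.8, -0.4],
}

# An analysis cutting only on pt reads a single column and
# never touches eta or phi.
high_pt = [pt for pt in events_columnwise["pt"] if pt > 10.0]
print(high_pt)  # [12.3, 45.6]
```

On disk the win is the same but larger: a selective read pulls in only the bytes of the branches the analysis actually uses.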
Data transfer and access
Physics at the LHC involves collaborations of thousands of physicists at hundreds of institutions from dozens of countries. Similarly, computing for the LHC involves hundreds of thousands of computers at hundreds of sites across the world, all combined via grid computing (a precursor to cloud technology). Widespread resources also mean considerable movement of data—totaling several petabytes per week and requiring tools like the File Transfer Service (FTS). Another more recent development in this community has been to build data federations, based on, for example, the XRootD data access protocol, which allow all of this data to be accessed in a single global namespace and served up in a mechanism that is both fault-tolerant and high-performance.
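The fault tolerance of such a federation comes from the fact that a file in the global namespace may have replicas at several sites, so a failed read can transparently fall back to another copy. A minimal sketch of that idea, with hypothetical site names and a stand-in fetch function rather than the real XRootD protocol:

```python
def read_from_federation(global_path, sites, fetch):
    """Try each site holding a replica until one read succeeds."""
    for site in sites:
        try:
            return fetch(site, global_path)
        except IOError:
            continue  # that replica failed; fall through to the next one
    raise IOError(f"no working replica for {global_path}")

# Toy replica catalog: one global name, two hypothetical sites.
replicas = {"/store/data/run1.root": ["site-a", "site-b"]}

def fetch(site, path):
    """Stand-in for a remote read; site-a is simulated as unreachable."""
    if site == "site-a":
        raise IOError("site-a unreachable")
    return f"bytes of {path} from {site}"

result = read_from_federation(
    "/store/data/run1.root", replicas["/store/data/run1.root"], fetch
)
print(result)
```

The caller only ever names the file once, in the global namespace; which site ultimately serves the bytes is the federation's problem, not the physicist's.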
Data processing and analysis
To coordinate this transfer and access of data across the world, physicists have built data management tools such as Big PanDA, which allow thousands of collaborators to run hundreds of thousands of processing steps on exabytes of data, as well as monitor and catalog that activity. For the analyses themselves, the rare signals under investigation require sophisticated data analysis and statistics, a discipline that also has a long history in the scientific community (the CERN Program Library was one of the first open source scientific statistical software libraries). Once again, the particle physics community has developed its own tools, primarily through the ROOT framework, which provides visualization and plugins for machine learning, rather than using alternatives common in the wider data science community, such as R or Python’s scikit-learn.
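A flavor of the statistics involved: a common back-of-the-envelope figure of merit in searches for rare signals is the significance s/√b of an observed excess s over an expected background b. A minimal sketch, with invented counts (real analyses use far more careful likelihood-based methods):

```python
import math

def significance(n_signal, n_background):
    """Approximate significance of an excess, valid for large background counts."""
    return n_signal / math.sqrt(n_background)

# Invented numbers: 50 candidate signal events over 400 expected background.
z = significance(50, 400)
print(round(z, 2))  # 2.5
```

A result near 5 is the conventional discovery threshold in particle physics; squeezing such an excess out of billions of background events is what drives the selective triggers and column-wise formats described above.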
By 2020, collisions in the high-luminosity LHC will be 10 times more intense than those in the current LHC. This will mean initial streaming data analysis on orders of magnitude more data and shipping the resulting 400 PB of data per year across the world. Managing this extreme data challenge will require the community to build new tools on the best of their own and others’ technologies.