The big data ecosystem for science: Physics, LHC, and Cosmology

Big Data management is essential for experimental science and technologies used in various science communities often predate those in Big Data industry and in many cases continue to develop independently. This post highlights some of these technologies, focusing on those used by several projects supported by the National Energy Research Scientific Computing Centre (NERSC).

Primary Collaborator: Wahid Bhimji (LBNL)
Cosmology and astronomical image surveys


Cosmologists require big data solutions to answer the big questions they ask about the nature of the universe — what is it made of, and how old is it? The theoretical side of cosmology involves running large simulations of the universe—a traditional high-performance computing problem. But analyzing the output of those simulations is a data problem. Modern experiments that simulate large volumes of the universe can output terabytes of data at each step of the simulation, resulting in petabyte-scale simulated data sets. In addition, to be able to understand how these simulations relate to our universe, we need to compare the outputs of those simulations to real data collected by the telescopes that take images of the night sky. Processing the nightly output of terabytes of image data from astronomical surveys (such as the Dark Energy Survey) and analyzing the resulting measurements of billions of stars and galaxies is a data challenge in itself.

Data ingestion and storage

Large volumes of data are created in cosmological simulations—as time passes in the theoretical universe, snapshots are taken to characterize the distribution of matter at discrete timesteps. These snapshots need to be processed to look more like the observations we make from Earth: galaxies need to be added appropriately, light paths calculated, and statistics calculated from the resulting simulated observations. This is most easily done using an in-situanalysis, where the processing of the simulation data is done at the same time (and inside the same supercomputer) as the simulation itself. The resulting reduced output may range in the tens of terabytes rather than a petabyte of raw simulation data, which is more easily shared between research sites. Data is usually stored in a custom binary format, but increasingly, cosmologists are taking advantage of the parallel input/output (I/O) capabilities offered by the HDF5 format to enable more efficient processing of simulation output. There are effectively no standard tools for ingesting this data—each group uses its own formats and methods, which poses a challenge for the community.

Data transfer and access

Astronomy has always dealt with the problem of data movement: images of the sky taken with telescopes (often in remote locations) need to be transferred to a scientist’s home institution. In the past, this may have been done physically, using a photographic plate or a hard drive of data, but today, large volumes of image data are transferred around the world electronically. However, the issue of data movement and access do not end with simply shipping the data to a host datacenter. Today’s large-scale imaging surveys provide data to collaborations of hundreds of astronomers and cosmologists, and potentially hundreds of thousands of members of the public. (See the Sloan Digital Sky Survey for one of the most successful examples of a publically available astronomical data set.) These images, and measurements of the stars and galaxies detected in them, are typically hosted on a data portal by a single institution. Challenges remain for how to serve up future petabyte-scale data sets to both scientists and the public.

Cosmological simulations also face data access issues. A recent effort demonstrated how to orchestrate cosmological simulations across multiple sites and serve the data to collaborators around the world. The Portal for Data Analysis Services for Cosmological Simulations (PDACS) provides access to shared repositories for data sets, analytical tools, cosmological workflows, and the infrastructure required to perform a wide variety of analyses on different compute facilities. This was demonstrated at the Department of Energy’s SC14 conference using data from the Dark Energy Survey. The data pipeline consisted of Docker containersencapsulating different stages of the analysis code that could be pushed out from the data host (the National Center for Supercomputing Applications, or NCSA) to Department of Energy supercomputers at Berkeley, Argonne, Brookhaven, and Oak Ridge national labs. The analysis code could be fired up on the various systems, which automatically transferred the data they needed for processing. The results were then pushed back to NCSA over ESnet.

Data processing and analysis

Astronomers and cosmologists have long relied on image-processing software like Source Extractor to remove noise from images generated by telescopes and to detect astronomical objects (such as stars and galaxies) in the pixel data. However, it is difficult to spot many interesting features of this image data using simple peak-detection algorithms, and astronomers are increasingly turning to more sophisticated methods to identify and measure properties of astronomical objects. One of the most interesting recent developments in pattern recognition in astronomy has been crowd-sourcing measurements of complex image data that algorithms struggle to interpret. For example, galaxy morphology (i.e., whether a galaxy is a spiral, elliptical, or irregular) can be more easily determined by the human eye than simplistic algorithms, but it is impossible for professional astronomers to examine every one of the millions of galaxy images captured in today’s astronomical sky surveys. The Galaxy Zoo project, which started in 2007 and has more than 150,000 participants, has resulted in the human-led identification of new classes of astronomical objects—features that would otherwise be impossible for a machine to label—and has, in turn, influenced the next generation of machine learning techniques. This set of crowd-identified features has provided a training data set for machine learning methods (see Kaggle challenge).

Another example of modern analysis on astronomical data is the identification of gravitational lenses, as done in the Space Warpsproject. These distorted images of galaxies (Figure 3) are hard to identify with traditional image-processing techniques, but the identification of lensed images by human volunteers has provided valuable input to developers of pattern-recognition algorithms.
tools in the big data ecosystem

Figure 3. Image taken with NASA/ESA Hubble Space Telescope showing gravitational lensing and a galaxy cluster. Source: NASA/ESA, via Wikimedia Commons.

The future

The forthcoming Large Synoptic Survey Telescope (LSST) will change the way astronomers handle data. It will produce 15 TB per night of imaging data, which must be processed to produce alerts for exciting transient events (like evidence of a supernova explosion, or a moving object like an asteroid) within 60 seconds. The LSST Data Management stack is being developed to process this data quickly and efficiently as it comes off the telescope in Chile and transfer it to a central facility in the United States, where it will be further analyzed and made available to scientists. The measured properties of the stars and galaxies observed by LSST will be stored in Qserv, a custom distributed database query system that can scale to handle the trillions of rows and petabytes of data that will be generated. The LSST Data Management stack consists of three layers: an underlying infrastructure layer containing the system, storage, networking, and computing software; a middleware layer consisting of software for distributed computing, data access, and user interface; and an applications layer containing the data analysis software and data archives. It is based on Python with C++, used for computationally intensive tasks, and comprises more than 60 packages managed by Extended Unix Product System (EUPS).

Article image: Pipe grid. (source: Pixabay).

Bio: Wahid Bhimji is a big data architect in the Data and Analytics Services group at NERSC. His current interests include machine learning, data management, and high-performance computing. He was previously heavily involved in data management for the Large Hadron Collider. Wahid has worked for 15 years in scientific data management and analysis in academia and government and has a Ph.D. in high-energy particle physics.