Will Apache Spark Finally Advance Genomic Data Analysis?

Spark has been useful in mapping out genetic traits that can be associated with certain diseases and the genetic makeup of microorganisms that live in our bodies.

Genomics header
Source: University of Exeter Medicine School.

For most, the thought of big data is often accompanied with that of big companies like Twitter, YouTube or Google. The last decade or so has shown massive advancements in the capabilities of storage and data processing like Apache Hadoop and as these big data technologies have grown, so have their uses. More and more industries are using big data to gain perspective in their markets and access their customers more efficiently through holding personal data at scales never before imagined. Companies can keep track of their employees and customers better, offering a more personalized approach to shopping and also keeping their workers up to snuff (think current licenses and things like ACLS certification).

Currently, YouTube and Twitter are the leaders in the amount of data that is collected and analyzed each year. YouTube has about 1-2 exabytes of video data uploaded per year. Twitter has estimated growth equalling 1.36 petabytes of data per year by 2025. But the need for big data developments has never been greater than with genomics. The first human genome cost nearly $3 billion to sequence and took about a decade to do so. Shortly thereafter, the available methods to sequence improved and genomic data analysis research skyrocketed with a lot of the work aimed at finding new and better ways to treat diseases. However, all of this new growth has created a data crunch with estimates that will far out grow both Twitter and YouTube. By 2025, we could reach one zettabase of sequence per year with the need of 2 - 40 exabytes of storage capacity for genomic data.

Before, research teams were using Hadoop clusters to run their analyses, attaching on other scripting and analysis platforms. Unfortunately, processing times were slow and compatibilities across platforms such as Hadoop and Apache Pig were problematic. Other tools like the 1000 Genomes dataset can cope with up to a few thousand genomes, but are unable to handle datasets any larger than that. With multiple sequencing projects taking place around the world, though, these large datasets are more and more common.

Thankfully, we have seen a solution crop up that could be the answer. Apache Spark has shown burgeoning success that many might have and still do doubt. Apache Spark’s processing engine has caught the eye of researchers who are needing new data architectures to mine, process and analyze their work. Cotton Seed, senior principal software engineer at Broad Institute said that he and his team use genomic research platform they built on Spark, leveraging its SQL querying function and its library of machine learning algorithms. Seed says that Spark has been useful in mapping out genetic traits that can be associated with certain diseases and the genetic makeup of microorganisms that live in our bodies. Spark has the capability to use different query languages--Python, SQL or Scala--and connect different data stores together that before couldn’t talk to each other.

Spark’s speed and scalability for data mining and genomic data analysis have made it the number one player in genomics. Additionally, Spark offers researchers to write their own tools into the platform a huge convenience for developers. In the past, platforms like Hadoop required sophisticated algorithms to gain the most from it, but only a few companies could create them. Spark, on the other hand, is extremely easy to use and can be paired with several other Hadoop platforms and languages. Moreover, it is able to be deployed and supported within already existing bioinformatics IT infrastructures. Even better is Spark’s technology that can handle real-time and more interactive workloads through add-on components including a graph-processing interface, machine learning library and a stream-processing module and it can also run in standalone mode. Overall, Spark has given genomic data analysis the answer it’s needed in running real time data seamlessly through a system and offering batch processing applications that are faster than any other platform out there.