ASE International Conference on Big Data Science 2014: Highlights from Workshops

Highlights from presentations by data science leaders from MIT, Georgia Tech, Microsoft Research, and CUHK during workshops at the ASE Conference on Big Data Science 2014, held at Stanford University.

The Second ASE International Conference on Big Data Science provided a great opportunity for students, scientists, engineers, data analysts, and marketing professionals to learn more about the applications of Big Data. Session topics included “Enabling Science from Big Image Data,” “Engineering Cyber Security and Resilience,” “Cloud Forensics,” and “Exploiting Big Data in Commerce and Finance.”

Held at the Tresidder Memorial Union at Stanford University, the ASE International Conference on Big Data Science took place from Tuesday, May 27 – Friday, May 31, 2014. The conference kicked off with four workshops held in parallel. Here are highlights of selected talks from two of those workshops, which were held together:
  • International Workshop on Big Data Analytics for Predictive Organization and Big Transformations
  • International Workshop on Distributed Storage Systems and Coding for Big Data

Dr. Kalyan Veeramachaneni, Research Scientist and Leader at the AnyScale Learning For All (ALFA) Group, MIT, gave a keynote titled “From JSON Logs to Latent Variable Models: Knowledge Mining Massive Open Online Courses (MOOC) Data”. MOOCs generate data on hundreds of thousands of students as they navigate the web-based platform, submit assignments, and participate in forums. The data comes with the promise of revealing patterns that could help MOOC instructors teach more effectively and improve student engagement. It also presents a unique opportunity to learn how students learn, which can help improve on-campus instruction as well.

He briefly discussed the predictive models his group is building to predict student stop-out, that is, the point at which a student stops engaging with a course. His group works on building platforms that enable data science at scale. He shared details on how they compiled good predictors for MOOC outcome variables. While explaining his data strategy, he mentioned: “We care more about getting the variables right than we care about the models themselves.”

Dr. Srinivas Aluru, Professor, Georgia Institute of Technology, delivered a keynote on “Big Data Challenges in the Biosciences”. He started by noting that the rapid decline in the cost of sequencing a complex organism’s DNA (from $100 million down to $1,000) has enabled many new applications in human health, agricultural biotechnology, and other fields.

This decline was brought about by the advent of a number of high-throughput sequencing technologies, collectively known as next-generation sequencing. It is leading to explosive growth in the number of organisms sequenced, and in the number of individuals sequenced in search of important genetic variations. Next-gen sequencers enable diverse applications, each requiring its own class of supporting algorithms. He discussed the big data challenges arising from these developments in the context of microbial communities, agricultural biotechnology, and human health. Here are a few of the open problems he shared:
  • How do we archive and query petascale short read archives?
    • Sequencing centers throwing away raw data
    • What if you don’t like one’s bioinformatics analysis?
    • Scientific reproducibility?
  • Compression and decompression algorithms
    • Inherent redundancy
    • Understanding error profiles can be helpful
    • Don’t forget quality scores
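
To make the compression bullet concrete, here is a minimal Python sketch of one simple baseline: packing each base into 2 bits, so four bases fit in a byte. This is an illustrative assumption, not a method described in the talk. The function names (`pack_bases`, `unpack_bases`) are hypothetical, and the sketch handles only A/C/G/T, ignoring ambiguous bases and the quality scores that the list above warns not to forget.

```python
# Hypothetical 2-bit code for the four unambiguous bases (an assumption).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_bases(read):
    """Pack a string over A/C/G/T into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(read), 4):
        chunk = read[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking reads it in order.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def unpack_bases(packed, length):
    """Recover the first `length` bases from bytes made by pack_bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


read = "ACGTACGGTTCA"
packed = pack_bases(read)
assert unpack_bases(packed, len(read)) == read
print(len(read), "bases stored in", len(packed), "bytes")
```

The original length has to be stored alongside the packed bytes, because the final byte may be padded. Practical tools would go further, for example by exploiting the overlap between reads and by compressing quality scores separately.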

Quick interesting fact:

How Big are Genomes?

Viruses - 50 kb+
Microbes - 1 Mb+
Human/Mouse/Chimp - ~3 Gb
Rice - 450 Mb
Maize - 2.5 Gb
Pine - 19 Gb

Dr. Parikshit Gopalan, Researcher, Microsoft Research, delivered a talk on “Code with Locality for Data Storage”. Talking about erasure codes for Big Data storage, he stressed the importance of ensuring that frequently occurring error patterns (such as a single disk failure) can be corrected efficiently. This has motivated the study of codes with good locality, where any data symbol can be reconstructed from a few other codeword symbols. He summarized what is known about rate-distance trade-offs for such codes. The main challenge in constructing such codes is to maximize their reliability while keeping the field size small.
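
The idea of locality can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not a construction from the talk: data symbols are split into small groups, and each group gets one XOR parity symbol, so a single lost symbol is rebuilt from its own group rather than from the whole stripe. The function names and the group size are hypothetical. Practical codes with locality also add global parities to tolerate multiple failures, which this sketch omits.

```python
from functools import reduce
from operator import xor


def encode_with_local_parity(data, group_size):
    """Split data symbols into groups and append one XOR parity per group."""
    groups = [data[i:i + group_size] for i in range(0, len(data), group_size)]
    return [group + [reduce(xor, group, 0)] for group in groups]


def repair_symbol(coded_group, lost_index):
    """Rebuild one erased symbol from the other symbols in its local group."""
    survivors = [s for i, s in enumerate(coded_group) if i != lost_index]
    return reduce(xor, survivors, 0)


data = [0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC]
coded = encode_with_local_parity(data, group_size=3)

# Pretend the disk holding symbol 1 of group 0 failed.
recovered = repair_symbol(coded[0], lost_index=1)
assert recovered == data[1]
```

Here the repair reads three surviving symbols from one group instead of touching the other group at all, which is the efficiency that locality is meant to provide for the common single-failure case.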

Dr. Kenneth Shum, Associate Professor, The Chinese University of Hong Kong (CUHK), talked about “Polynomial Construction of Sector-disk Code”. Shum described two types of disk errors: entire disk failures and disk sector failures. He noted that conventional storage codes treat a disk sector failure as an entire disk failure, even though most of the remaining sectors on the disk are intact.

The Sector-Disk (SD) code, proposed by Plank et al., recovers from a mixture of these two types of failure. He presented a detailed construction of SD codes that specifies the generator matrix using bivariate polynomials. This new construction can repair any number of disk failures with up to three sector failures.

Highlights from conference proceedings will be published soon.