A Community Event for Innovative Spark Apps: A Datapalooza Dispatch
Datapalooza, which is holding its inaugural event this week in San Francisco, is proving to be a seedbed for innovative apps in the Spark community. James Kobielus describes the highlights.
Data science is all about creativity: identifying the hidden patterns in the data around us and building new data-driven apps that establish new patterns for living in the 21st century.
In one of my recent blogs, I focused on the fact that many of the most productive data-science teams blend various modes of pattern thinking. In other words, they combine people with different aptitudes for data visualization, contextual analysis, cognitive exploration, and so on. In the presentations and demos at Datapalooza, we’ve seen all of these modes of data-driven exploration at work.
Sponsored by the Spark Technology Center that IBM established in June, Datapalooza is helping data scientists of all types to boost their skills to the next level. STC-based mentors are actively engaged with Datapalooza attendees who have converged on the downtown San Francisco campus of Galvanize, which is training the next generation of working data scientists. Some, but by no means all, of the presentations at Datapalooza are by STC-based data scientists, who are presenting their most innovative projects. Other presentations cover Spark-based app development by Silicon Valley Data Science, Typesafe, Nitro, Cake Solutions, Facebook, SportsPhotos.com, zData, and others.
Even though Datapalooza is not technically a “coming-out party” for the STC, it is an opportune occasion for showcasing the many Spark-based STC projects that are already underway in San Francisco and elsewhere throughout IBM’s partner ecosystem. Here, from the STC website, are current summaries of the principal app-development projects in progress:
- AMBER Alert Aid: This Spark app enables broadcasting of the most serious missing-children cases through AMBER Alert. The project uses the analytic capabilities of Spark to find vehicles described in AMBER Alert reports in traffic video feeds. Live video feeds from traffic cameras are piped through Spark Streaming into a Spark cluster. To extract images of individual cars, the live feed is processed with SIFT from OpenCV, which includes edge detection and key feature detection algorithms. The extracted images are then used to train a machine-learning model to recognize car model and color. To do this, H2O, ML software built on top of MLlib for Spark, allows the app to match vehicle descriptions to images of vehicles processed from the live feed. With this Spark-powered AMBER Alert Aid, AMBER Alert adds the power of data to the eyes of the community to find missing children.
- Ask Spark: This is a real-time Spark search app that runs powerful queries and algorithms on massive amounts of data in parallel processing environments. The user enters a Twitter hashtag into a browser or mobile app, and the app shows the live Twitter feed along with the general public opinion of that hashtag on a sentiment meter. It enables real-time identification of trends evident in the data. It leverages Spark (Spark Streaming, Spark SQL, and MLlib) to run K-means, DecisionTree, and linear regression machine-learning algorithms against live data to create dynamic visualizations. Development is in SparkBench. Further details on the project are available here. The contacts for this app are Jesse Chen, Stewart Tate, and Samuel Wong.
- Bluemix Genomics: This Spark app enables scientists to understand how genetics contribute to complex disease. It enables processing and analysis of massive amounts of genome data. It runs on IBM Bluemix and Spark in the SoftLayer cloud, running on YARN and HDFS, with programming in Data Scientist Workbench, R, and RStudio. The contacts for this app are Eric Li, Connie Lam, and Xiaoyang Gao.
- RedRock: This is a Spark app that lets the user act on real-time, data-driven insights discovered from Twitter. It transforms a huge volume of Twitter data into an easy-to-digest set of visualizations accessible to a general audience. The app uses several Spark components (Spark Streaming, Spark SQL, MLlib, DataFrame). It leverages two algorithms, Word2Vec and K-means, to filter raw user tweets and look for patterns indicative of influential individuals, social sentiment, key topics, and location of conversations anywhere in the world. Word2Vec uses a shallow neural network to assign a numerical vector to each of the words in the Twitter data, identify similarity among them, and form a feature matrix, while the K-means algorithm clusters related words. The output of these algorithms drives the screens in the app. Data is acquired through the DecaHose and PowerTrack tools, loaded into HDFS, and then put into an Elasticsearch database for fast, powerful indexing. Development is in Scala on Bluemix, and outputs of the Spark engines are exposed through a REST API. The app’s runtime capabilities are hosted on SoftLayer, with 6 nodes for Elasticsearch and 20 nodes for Spark. At Datapalooza, the RedRock demo involved real-time searching through 6 million tweets. The contacts for this app are Jon Alter, Raphael Bouchard, Kellyn Carpenter, Joel Colon Figueroa, Barbara Gomes, Rosstin Murphy, Zoe Symon, and Hao Wang.
- RockPaperScissors: This Spark app enables users to compete against Spark in 3 rounds of the classic childhood game, Rock Paper Scissors. The player decides whether to be rock, paper, or scissors and, after an excitement-generating countdown, sees whether they beat Spark or Spark beat them. When the game is played in front of a large audience, the crowd can see Spark’s selection first on one screen while the player is still choosing on an iPad, which helps show that Spark does not know what the player has chosen. It uses Spark Streaming and machine-learning algorithms to learn about human patterns and predict which option the player will choose. The contacts for this app are Dillon Eversman, Virginia Honig, Joe Meersman, Brad Noble, David Taieb, and Tim White.
- Search by Selfie: This is a Spark app for real-time facial detection, recognition, and intelligence in customer engagement scenarios. It puts instant and continual facial recognition within reach of business users outside large-scale enterprises, such as retailers, event planners, or security teams, with potential applications for missing persons as well. It enables capture of a photo, extraction of key features, transformation of those features to normalize the face data, and training of facial-recognition models in Spark. Matching the input against MLlib models, the app can quickly identify individuals by their face, body, and even articles of clothing. Now deployed in the SportsPhotos.com platform, Search by Selfie can process photographs from large-scale events like marathons and help individuals find photographs of themselves even in huge, largely anonymous crowds. The app uses Spark’s MLlib on an RDD Faces Dataset. Development is on Data Scientist Workbench and in Scala. The contacts for this app are Brandon Schatz and Ray Sikka.
- SETI + Spark Explore Space: This is a Spark app for analyzing 100 million radio events that have been collected over several years in order to identify faint signals indicative of intelligent extraterrestrial life. It uses sophisticated mathematical models and machine-learning algorithms to separate terrestrial interference from signals truly of interest. The app uses the IPython Notebook service on Apache Spark and is deployed on IBM Cloud Data Services (CDS). It loads data into a CDS object store for digital signal processing and experimentation. Data scientists from NASA, Penn State, and IBM build and refine analytic methodologies, using IPython notebooks to create a self-documenting repository of signal processing research that is collaboratively searched, referenced, and improved. More information on the app is available here and here. The contact for this app is Graham Mackintosh.
- SFPD Loves Spark: This Spark app supports predictive crime prevention. It takes San Francisco crime data from 2003-2006 and overlays it on a map of the city to highlight locations where crimes occurred. Crime incidents are divided into low, medium, and high severity, and each category is assigned a color. The map is divided into a series of small squares, each lit up with one of those three colors to represent crime severity, producing a heatmap of San Francisco’s criminal activity. Using a decision tree algorithm, the app achieves an average precision of about 67% and an average recall of about 57%. Further optimization of the app is planned using random forest or boosting methods. The contact for this app is Nimish Kulkarni.
- Tone Analyzer with Watson + Spark + Twitter: This is a Spark app for sifting in real time through Twitter data to gauge customer emotions on multiple tone dimensions, ranging from anger to cheerfulness to openness. It uses Spark Streaming, Watson Cognitive Services on IBM Bluemix, and visualizations in IPython notebooks. It enables users to see the distribution of emotions represented in the Twitter data. It also supports narrowing of views to the top hashtags and associated sentiment scores. Furthermore, it enables organizations to customize searches to find hashtags that relate specifically and uniquely to them, while also enabling comprehensive rollups of how customers actually feel. The contact for this app is David Taieb.
- Warren Buffett: This Spark app monitors Twitter feeds to capture the words that are most associated with a specific stock, tracks current stock prices, and sends buy, sell, or hold recommendations to the user. In its next iteration, the app will calculate returns over time, free of risk, and build the user’s confidence in its recommendations. It is developed in Scala and Node.js, uses Spark Streaming and MLlib, and provides a Web-based mobile-first UI built on Bootstrap. The contacts for this app are Erin Gieseke, Sriram Moorthy, and Anindita Mahapatra.
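The SFPD Loves Spark entry above reports an average precision and an average recall across three severity classes. As a rough illustration of how such averages can be computed, the Python sketch below takes an unweighted mean of per-class precision and recall. The averaging method is an assumption, since the project summary does not say how its averages were taken. The function name `macro_precision_recall` and the eight sample labels are hypothetical and are not project code or data.

```python
from collections import Counter

# Severity classes named in the project summary.
CLASSES = ("low", "medium", "high")


def macro_precision_recall(actual, predicted, classes=CLASSES):
    """Return (average precision, average recall) as unweighted means over classes.

    Assumption: "average" means a simple mean of the per-class scores.
    """
    true_pos = Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            true_pos[a] += 1

    predicted_count = Counter(predicted)  # how often each class was predicted
    actual_count = Counter(actual)        # how often each class really occurred

    precisions, recalls = [], []
    for c in classes:
        # A class that was never predicted (or never occurred) scores 0.0 here.
        precisions.append(true_pos[c] / predicted_count[c] if predicted_count[c] else 0.0)
        recalls.append(true_pos[c] / actual_count[c] if actual_count[c] else 0.0)

    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Hypothetical severity labels for eight map squares; NOT data from the project.
actual = ["low", "low", "low", "low", "medium", "medium", "high", "high"]
predicted = ["low", "low", "low", "medium", "medium", "high", "high", "high"]

avg_precision, avg_recall = macro_precision_recall(actual, predicted)
print(f"average precision: {avg_precision:.2f}, average recall: {avg_recall:.2f}")
```

With these made-up labels, precision averages 13/18 (about 0.72) and recall averages 0.75. The two figures differ because precision asks how many predictions for a class were right, while recall asks how many true members of a class were found.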
Want to see more? Click here to become part of the STC community and contribute projects, design, and code to Apache Spark.
Also, Datapalooza may soon be coming to a city near you. Stay tuned here for updates. We hope to engage the world’s brightest data scientists wherever and whenever makes sense for you.