SDSC: Supercomputer Data Mining in San Diego
I talk with Natasha Balac, Director of Predictive Analytics at San Diego Supercomputer Center about supercomputer data mining, Gordon, Hadoop, Data Mining Boot Camps, distinction between Data Science and Data Mining, Big Data hype, and more.
By Gregory Piatetsky, Sep 24, 2013.
Supercomputers used to be limited to advanced physics and military calculations, but now, facing competition from clusters of cheap computers, many supercomputer centers are also becoming active in other areas, and a major application is Big Data and Data Mining. SDSC - San Diego Supercomputer Center - is a leader in that area and home to PACE, the Predictive Analytics Center of Excellence, which has started offering Data Mining Boot Camps (the next one is Oct 17-18). In cooperation with PACE, University of California San Diego (UCSD) Extension is also offering a certificate in data mining, with online, web-based instruction.
UCSD Extension and SDSC PACE offer workshops to the public that allow students to use SDSC supercomputer facilities while being guided by industry and academic experts. Successfully completing most of the workshops earns students continuing education units. The first two workshops focus on PMML (Oct 24-25) and Hadoop (Nov 7-8, 2013), but suggestions for new topics are welcome!
To discuss supercomputers and data mining I talked with Natasha Balac, Director, Predictive Analytics Center of Excellence at San Diego Supercomputer Center. This group also supports DataCentral, the first national program of its kind to host and make available significant research and community data collections and databases. Natasha received her Bachelor's degree in Computer Science from Middle Tennessee State University as well as her Master's and Ph.D. in Computer Science from Vanderbilt University. She has been with SDSC since 2003.
Gregory Piatetsky: 1. Tell us about Data Mining Boot Camps - what are students able to do during and after?
Natasha Balac: The Data Mining Boot Camps (DMBCs) are unique in that we teach a multi-level curriculum that includes both conceptual and hands-on training. We cover basic data mining, data analysis, pattern recognition concepts and predictive modeling algorithms so that students can explore and implement analyses on real-life case studies.
During the hands-on labs, students will use Weka and R machine learning algorithms, including tools for data pre-processing, classification, regression, clustering, association rules and visualization. After successfully completing series 1 and 2 of the Boot Camp, our "graduates" are better prepared to apply their newly learned data analysis techniques to their own projects and interpret the results.
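As an illustration of the kind of end-to-end exercise the labs cover - load a dataset, fit a classifier, evaluate it on held-out data - here is a minimal sketch in Python with scikit-learn. Note that the boot camps themselves use Weka and R; the library, toy dataset and parameter choices below are illustrative stand-ins, not the actual course material.

```python
# Minimal classification workflow sketch (illustrative only; the boot camps use Weka and R).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy dataset standing in for a real-life case study
X, y = load_iris(return_X_y=True)

# Hold out 30% of the data to estimate how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a simple, interpretable classifier and report held-out accuracy
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```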
GP: 2. One of the great things about these Boot Camps is the ability to use Gordon, the newest high-performance computing asset at the San Diego Supercomputer Center (SDSC). What special advantages do Gordon and supercomputing bring to Data Mining? What types of problems are they especially good for?
NB: Notably, Gordon is the first large deployment of flash storage in a high-performance computer. Today's Big Data challenges involve big datasets that need big memory. With 300 Terabytes of flash storage - an enormous amount of shared memory - we expect Gordon to be an incubator of new areas of research. Users can request dedicated, long-term use of Gordon's I/O nodes for data-intensive applications like databases and data mining environments.
Virtually every enterprise wants to be able to reach better decisions faster. With Gordon we have the "complete package," an agile high performance computing system with the high-intensity flash memory for real-time, accurate solutions. Gordon's capabilities can advance insight in many fields from severe weather prediction to high frequency trading on Wall Street.
GP: 3. What is the relationship between the University of California, San Diego (UCSD), SDSC and the Predictive Analytics Center of Excellence (PACE)?
NB: SDSC is an organized research unit (ORU) of UCSD. SDSC provides high-performance computing and data infrastructure to support leading-edge science and engineering research for all of UCSD's ORUs as well as researchers throughout the University of California System. Shortly after deploying Gordon, SDSC launched PACE to bridge new, data-intensive partnerships with science, health and business enterprises.
GP: 4. UCSD is offering an online Data Mining Certificate - who is it for, how long does it take, and what skills does it give to students?
NB: The Data Mining Certificate is one of the most sought-after certifications offered through UCSD Extension. It is designed to provide individuals in business and scientific communities with the skills necessary to design, build, verify and test predictive data models - the skills that give job candidates an edge when competing for very technical, high-paying jobs. Typically, students complete the certification within one to two years. SDSC's Boot Camps are an outgrowth of UCSD Extension's Data Mining course and, importantly, offer students the essential hands-on experience to quickly apply their newly learned techniques to their specific projects.
GP: 5. I noticed that SDSC and UCSD also offer courses on Hadoop. Hadoop was designed to run on many commodity computers and was developed as a low-cost alternative to supercomputers like Gordon. What is the role of supercomputers & Gordon in the era of Hadoop?
NB: SDSC has a number of different computing resources. The Hadoop framework is extensively used for scalable distributed processing of large datasets. While SDSC has a 128-node "traditional" Hadoop cluster, the SDSC Gordon Compute Cluster is ideally suited to running Hadoop as well, with fast SSD drives enabling HDFS performance and the high-speed InfiniBand interconnect providing scalability. Hadoop can be set up on Gordon in two ways:
1) using the myHadoop framework through the regular batch queue, and
2) utilizing dedicated I/O nodes with associated compute nodes.
Users can run Hadoop on Gordon using the myHadoop infrastructure, which integrates configuration and Hadoop cluster setup within Gordon's normal job-scheduling environment. The Hadoop Distributed File System (HDFS) is built using the high-performance flash drives (SSDs) mounted on each compute node.
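As a concrete sketch of the kind of job such a Hadoop setup runs, here is a classic word count written for Hadoop Streaming in Python. This is an illustration only, not SDSC's material: the script name, its layout and the submission step mentioned afterwards are assumptions.

```python
#!/usr/bin/env python3
# wordcount.py - a minimal Hadoop Streaming word-count sketch (illustrative only).
# Run as the mapper with "wordcount.py map" and as the reducer with "wordcount.py reduce".
import sys

def mapper():
    # Emit (word, 1) for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so identical words arrive consecutively.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Once HDFS is up - whether on the traditional cluster or on Gordon's SSDs - such a script would typically be submitted via the Hadoop Streaming jar, passing the input and output HDFS paths and the mapper and reducer commands; the exact invocation depends on the local Hadoop installation.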
GP: 6. Do you see "Data Science" on Big Data as qualitatively different from "Data Mining" or is it mostly a rebranding?
NB: No, I don't think it's rebranding, but more of an evolution and expansion. I think there is a distinction in that data science is a research field advancing our understanding and development of data systems, new technologies and analytic tools, bringing business, communication, intuition, creativity and presentation skills into the otherwise very technical discipline of Machine Learning. Data mining, on the other hand, is the application of those tools and techniques to rapidly achieve new insight and, ultimately, lead to groundbreaking discoveries in science, health and business.
GP: 7. What is your opinion about Big Data - has it reached the hype peak (see KDnuggets poll Data Scientists split on whether Big Data reached the Hype Peak) or is it still growing?
NB: We've only seen the tip of the Big Data iceberg. I'd venture to say that there are many sectors that don't yet realize the potential value hidden in their unstructured data. Even though Big Data has its own hype curve, it's really not about the hype; it's about the best solution to meet needs most effectively. Regardless of whether we believe it is hype or not, data is growing at a very fast rate - it is doubling every nine months. This is not going to change!
GP: 8. San Diego has a very strong and growing analytics/data mining/data science community. How do you work with local companies and what are the prospects for employment?
NB: Since SDSC is a national leader in data-intensive computing and all aspects of Big Data, we have developed partnerships with many organizations in the data and analytics community - in data integration, performance modeling, data mining software development and workflow automation. SDSC also has an Industry Partner Program (IPP) that provides member companies with a framework for interacting with SDSC researchers and staff, exchanging information, receiving education & training, and developing collaborations. Joining IPP is an ideal way for companies to get started collaborating on Big Data initiatives with SDSC researchers and to stay abreast of new developments and opportunities on an ongoing basis. The expertise of SDSC researchers spans many domains including computer science, cybersecurity, data management, data mining & analytics, engineering, geosciences, health IT, high performance computing, life sciences & genomics, networking, physics, and many others.
Well, with regard to employment opportunities, the Harvard Business Review and Fortune magazine recently noted that a career as a data scientist is "the" job to have in the 21st century. So, I'd say the future is bright for those seeking a job in this field.
GP: In another recent interview, Natasha was asked: "What do you do for fun?"
NB: First and foremost, I like to spend time with my family. In the past year I have enjoyed raising and nurturing two babies - DataCentral and our 10-month-old son Luke. My husband and I are having a blast watching him grow. We are fortunate to have some of our extended family living nearby. Some of my happiest moments are playing with Luke and my niece, Sophie, who is just nine days younger than Luke.
In my few spare moments I love to travel and walk/run on the beach. I enjoy sewing as a creative outlet and I love to play tennis. I am originally from Yugoslavia and came to the United States on a tennis scholarship. In my 'previous life' I used to be a tennis pro. I love playing for fun and also love competing in both singles and doubles matches. It's a great way to have fun, de-stress, get exercise and make great friends!