Interview: Daqing Zhao, Macys.com on Advanced Analytics for Marketing in the Big Data era
We discuss analytics at Macys.com, how advanced analytics compares with traditional BI, building data models for scalability, the problem of data models quickly becoming obsolete, and challenges in customer targeting.
Daqing is Director of Advanced Analytics at Macys.com, leading the predictive analytics, test and experimentation, and data science teams. He previously held senior management and technical leadership positions at Ask.com, the University of Phoenix, Tribal Fusion, Yahoo, Digital Impact, and Bank of America. He also worked on client analytics projects for Intel, HP, Wells Fargo Bank, SBC, Dell, T-Mobile, MSN Search and Travel, Intrawest, PayPal, wine.com, MasterCard and others.
Daqing received his Ph.D. from Stanford University, specializing in scientific data processing, simulations and optimization.
Here is my interview with him:
Anmol Rajpurohit: Q1. What role does Analytics play at Macys.com? What are the top priority tasks for the Advanced Analytics team at Macy's?
Daqing Zhao: First, let me thank KDnuggets for an interview regarding my talk at the Big Data & Analytics for Retail Summit in Chicago this year. I have been a user of KDnuggets for data mining related information since the late 1990s, and have great respect for the site and the team.
Analytics play an important role in Macy's customer-centric strategies, and our senior management puts great emphasis on competing on analytics. Macy's omni-channel customer strategies aim to provide a superior shopping experience, in store or online, with interactions through web sites and emails, on desktop, tablet and mobile devices. A customer may buy some products online and exchange or return them in a store. Or she may go to a store, where an associate can help her order an out-of-stock item online, with free shipping.
From improving customer experience to optimizing our business processes, data and analytics are very important assets. We use data to build powerful tools to improve the experience of our customers, and we strive to innovate in how we apply data science in these areas. There are two types of data scientists. One type makes sure data are collected and accessible in time, and provides the tools for us to process the data and take action. The other type are domain experts, focused on solving specific data-driven business problems, such as customer or supply chain analytics and modeling, using those data, analysis and modeling tools. To use an example from the medical field, there is a need for MRI technologists (physicists, chemists and electrical engineers), but there is also a need for medical professionals who use MRI to treat patients. We are more like the latter. We need both types, and senior management needs to differentiate between them: being an expert in one does not imply having the know-how for the other.
We have many high-priority projects, such as real-time site personalization, email personalization and ad spend optimization, as well as supporting other company-wide initiatives: supply chain optimization, customer valuation, and providing insights to business decision makers.
AR: Q2. How do you differentiate Advanced Analytics & Big Data from the traditional BI processes?
DZ: In Advanced Analytics at Macy's, we work on all projects beyond reporting and ad hoc database analysis. Big Data is a natural outgrowth of traditional analytics. Traditional BI has a set of processes running from schema design, ETL (Extract, Transform and Load), standard reporting and multidimensional reporting to simple analysis, predictive modeling and knowledge discovery. There is a maturity process for companies adopting BI, and most companies never go beyond simple analysis. Due to technology limitations, we could start querying and analyzing data only after loading it into a database, so the value of the data, per gigabyte, had to be very high. For example, only transaction data, limited customer and product information, and financial data were stored and analyzed.
Now in the Big Data era, with Hadoop and NoSQL, we have a paradigm shift. To store and read the data efficiently, we use distributed clusters of commodity computers and distributed computing software. The key difference is that data are not only big but also raw, including free-format data. The value of data, in dollars per gigabyte, can be much lower than before. We can now include web analytics data, raw logs, ad serving and conversion data, and also free text and image data. How to transform free-format data into structured form is often subjective, so data transformation is now part of the analysis and modeling.
With increasing volume and the disk I/O bottleneck, persistent data are stored raw, as read- and append-only. We read and often analyze the raw data directly, using distributed tools like MapReduce. Data storage is cheap and computation power is abundant. In the new paradigm, we should analyze the raw data first and do some insight discovery before using domain knowledge to decide on the best ETL design. We can also run transformation and load processes in a much more iterative and agile fashion. These are agile software development principles applied in the Big Data context.
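The schema-on-read idea can be sketched in a few lines: rather than forcing raw logs through a fixed ETL schema up front, parsing happens at analysis time, so the transformation can evolve with the question being asked. This is an illustrative sketch only; the log format, field names and `parse` helper are hypothetical, not Macy's actual data or tooling.

```python
from collections import Counter

# Hypothetical raw, append-only log lines: kept as-is on disk,
# parsed only when a question is asked (schema-on-read).
raw_log = [
    "2014-06-01T10:02:11 user=123 action=view sku=A1",
    "2014-06-01T10:02:45 user=123 action=add_to_cart sku=A1",
    "2014-06-01T10:05:02 user=456 action=view sku=B7",
]

def parse(line):
    """Interpret one raw line as a dict. The 'schema' lives here, in
    analysis code, and can be changed without reloading the data."""
    ts, *pairs = line.split()
    record = dict(p.split("=", 1) for p in pairs)
    record["ts"] = ts
    return record

# A MapReduce-style aggregation in miniature: map each line to its
# action, then reduce by counting occurrences per key.
action_counts = Counter(parse(line)["action"] for line in raw_log)
print(action_counts)  # Counter({'view': 2, 'add_to_cart': 1})
```

The same `raw_log` could later be re-analyzed with a different `parse` (say, sessionizing by user) without any reload, which is the iterative, agile property described above.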
AR: Q3. How does one design a data model for scalability? What factors need to be considered?
DZ: We need to work with the customers of the data models and understand their objectives, how the models will be used and how success is measured. It is crucially important for our analysts to understand the data and the context in which they were collected, as well as their strengths and weaknesses. If we see a pattern, we need to know whether it is real. We need to understand what data can and cannot be collected, and consider the costs and benefits of collecting them. Some data simply cannot be collected, whether due to privacy policies or customer behavior; for example, customers may pay cash.
We often try to find the optimal modeling algorithm, such as random forests or logistic regression, but we do not pay enough attention to how new data can improve model performance. As we scale up in the Big Data era, the impact of the data itself on model performance becomes more and more important.
With scalable models, we have to be very careful about leakage in the model target definitions. We need to rely on testing and experimentation to identify champions, and to optimize and learn over time. We may have some assumptions, but we cannot be sure they are correct unless we can prove in the market that the models work, in that they improve customer engagement or generate lifts in conversions. This is another art and science, and a constant challenge.
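Proving in the market that a model works typically comes down to a champion/challenger comparison. A minimal sketch, assuming made-up conversion counts: given a control arm and a model-driven arm, compute the relative lift and a two-proportion z-test p-value to judge whether the observed lift is more than luck.

```python
import math

def conversion_lift(conv_a, n_a, conv_b, n_b):
    """Compare control (a) against a model-driven challenger (b).
    Returns the relative lift and a two-sided p-value from a
    two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = (p_b - p_a) / p_a
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the normal CDF, via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return lift, p_value

# Hypothetical experiment: 10,000 visitors per arm.
lift, p = conversion_lift(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"lift = {lift:.1%}, p = {p:.3f}")
```

In this invented example the challenger shows a 12% relative lift, but the p-value is near the conventional 0.05 threshold, illustrating why one run rarely settles a champion and why the optimize-and-learn-over-time loop matters.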
AR: Q4. What would be the ideal approach to deal with the problem of data models getting outdated quickly due to rapid changes in the underlying data?
DZ: From an analyst's perspective, we want access to more data and quick turnaround on transformations. We also want the freedom and capability to conveniently interpret and transform free-format data, and a more interactive environment with quick turnaround for model building and comparisons.
On model building, the old way is very human- and time-intensive. Data sets were small and expensive to collect. It could take months or longer to build a model, so the patterns captured in the models had to be long-term. Now, with the low cost of storage and computing, and fast collection and processing of data, we can build many more models, including models of short-term behaviors. Building so many models, each with so many predictors and with some categorical variables taking a large number of values, an analyst simply cannot inspect every variable individually. The danger of leakage and other problems is much higher. We have to rely on automated modeling techniques such as out-of-sample testing, cross-validation, and robust modeling tools that can handle missing values and outliers.
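The automated safeguards mentioned here, out-of-sample testing and cross-validation, amount to one discipline: never score a model on rows it was trained on. A minimal k-fold sketch, with a trivial mean-predictor standing in for a real model (the data and function names are illustrative, not any production system):

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

def cross_validated_error(y, k=5):
    """Mean absolute error of a trivial mean-predictor, estimated out
    of sample: the fold being scored never informs the 'fit'."""
    errors = []
    for train, test in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in train) / len(train)  # fit on train
        errors += [abs(y[i] - prediction) for i in test]    # score on test
    return sum(errors) / len(errors)

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(cross_validated_error(y, k=5))
```

With hundreds of models and thousands of predictors, this kind of mechanical holdout discipline, rather than manual inspection of each variable, is what catches leakage and overfitting at scale.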
AR: Q5. What are the biggest challenges in customer targeting (i.e. delivering the right message to the right customer at the right time)?
DZ: Different companies may have different challenges. Senior management has to be sold on the idea. More than ten years ago, Yahoo had an Idea Factory where employees proposed ways to improve customer engagement and monetization. Drawing on my earlier experience with 1:1 emails at Digital Impact (now part of Acxiom), I proposed a 1:1 Yahoo front page to reduce clutter. But the Yahoo front page was the most valuable page on the Internet, and getting the right information to the right Yahoo user at the right time wasn't getting any attention. A personalized front page didn't become a reality until recent years, after several changes in management. Right now, most companies are convinced that the only way to help customers with relevance of messages and information at scale is to use automated, data-driven personalization solutions.
Data are always a challenge. The most important data, the smoking gun, may not even be collected. Data availability and data quality are hugely important. In addition, the right transformation may not have been invented yet; there may always be a better way to transform the data.
The modeling platform needs to be scalable. We need fast enough turnaround to capture and influence customers' real-time behavior. Model deployment has to be fast and easy enough for the business to utilize the models. Data visualization tools can help the analyst explore and explain the models.
It takes more than either programming or the standard data mining algorithms. Google still offers a long list of search results. I hope that one day, when we search for something, we get the one item we want, in the context of what we are doing or searching for, and not by luck. We are getting there, but not there yet. Applying data mining and artificial intelligence methodologies to targeting is still far from mature. We need more innovations.
The second and last part of the interview is here.