Interview: Thanigai Vellore, Art.com on Delivering Contextually Relevant Search Experience

We discuss the role of analytics at Art.com, its polyglot data architecture, the use cases for Hadoop, vendor selection, supporting semantic search, and experience with Avro.



By Anmol Rajpurohit (@hey_anmol)

Thanigai Vellore is an enterprise architect, technologist, and innovator with over 15 years of progressive experience specializing in building large, highly scalable software systems. At Art.com, Thanigai is the lead architect responsible for defining and driving the technology roadmap initiatives behind the company's next-generation technology vision and platform. His interests and specialties include Hadoop/Big Data, NoSQL, distributed systems, enterprise architecture, and scalability. Prior to joining Art.com, Thanigai worked in engineering roles at Sanmina and Flextronics.

Here is my interview with him:

Anmol Rajpurohit: Q1. What does Art.com do? How does it leverage Analytics to achieve its mission?

Thanigai Vellore: Art.com is a leading online retailer for wall art. We have the world's largest selection of hand-curated wall art, with over 3 million images from different publishers. We have a portfolio of five global brands with a strong international presence.

Our mission is to forever lead the way art is experienced and consumed online – and we rely on a data-driven culture for understanding our users and all aspects of our business. So, data science and analytics play a vital role in key decisions and how we shape our business in fulfilling that mission.

AR: Q2. What are the major components of the Polyglot Data Architecture at Art.com?

TV: At Art.com, we have a heterogeneous technology stack that is seamlessly integrated thanks to the adoption of a Service Oriented Architecture. In addition, we are constantly evolving and upgrading the stack to leverage the latest technologies. As a result, we have to deal with multiple technologies across the stack. In the data tier, we use different database technologies to support different workloads and use cases. We use SQL Server for transactional applications, and we use MongoDB for document-oriented storage where we need to store hierarchical objects. We also use HBase to store search-related metadata and other application data. Also, our ERP runs on Oracle. These are some of the major components of our polyglot data architecture.
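To make the HBase piece of this stack concrete, here is a minimal sketch of writing one piece of search metadata with the standard HBase Java client. The table name, column family, row-key scheme, and demand-score column are hypothetical illustrations, not Art.com's actual schema:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SearchMetadataWriter {
      public static void main(String[] args) throws Exception {
          // Connects using the hbase-site.xml found on the classpath.
          try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
               Table table = conn.getTable(TableName.valueOf("search_metadata"))) {
              // Hypothetical row key: one row per normalized search query.
              Put put = new Put(Bytes.toBytes("query:wall-art"));
              // Column family "m" holds metadata facets, e.g. a demand score.
              put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("demand_score"), Bytes.toBytes(0.87d));
              table.put(put);
          }
      }
  }

The row-per-query layout is just one plausible design; the broader point is that HBase's sparse column-family model suits metadata whose shape varies from query to query.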

AR: Q3. What business and/or technical requirements propelled the need to implement Hadoop? How did you select a Hadoop distribution vendor?

TV: We had a couple of use cases that drove us towards Hadoop. The first was to implement a next-generation search/discovery engine that had to be updated in near real time and powered by machine learning algorithms. The second was to build a large-scale "clickstream analytics" platform for our websites. These two use cases demanded a distributed computing platform (like Hadoop) to handle and process large volumes of data.

We went with Cloudera's distribution as they had a proven track record and also offered support for components like the Lily HBase Indexer and Search (SOLR). In addition, Cloudera's EDH provides data governance capabilities such as complete metadata management, audit logging, and reporting.

AR: Q4. Semantic search has become increasingly relevant in recent times in order to ensure that user discovery is not limited to a particular category, but rather includes alternative discovery of items from contextually relevant categories. How did you go about the process of designing data architecture to support semantic search?

TV: Conventional discovery experiences on ecommerce sites are typically taxonomy-based, and at Art.com we have a state-of-the-art catalog search engine that supports poly-hierarchy and targeted ranking algorithms. However, we wanted to build a visually engaging experience that is contextually and semantically relevant, not tied to a typical parametric search experience. So, the goal was to build semantically related clusters for all searches and categories.

Another important aspect of the engine requirements was that it should be "demand-based": it needs to weight the different facets of search metadata based on search demand. In addition, we also wanted to make sure that the search indexes are updated in near real time as we collect new metadata on searches. So, we analyzed these key aspects of the engine and logically divided it into the following sub-components:
  1. Clustering Engine – to automatically organize collections of documents into thematic categories. We used the Carrot2 clustering engine (a minimal code sketch follows this list).
  2. Search Metadata Store – to store all search-related metadata in a distributed columnar data store, which was HBase in our case.
  3. Search Indexing Pipeline – to automatically replicate the mutations that happen on the metadata store into live search indexes in SOLR. Here, we used the Lily HBase Indexer.
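To give a feel for the clustering sub-component, below is a minimal sketch using the open-source Carrot2 Java API (the 3.x Controller interface) with its Lingo algorithm. The sample documents and query are invented for illustration, and this is not Art.com's actual pipeline:

  import java.util.ArrayList;
  import java.util.List;

  import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
  import org.carrot2.core.Cluster;
  import org.carrot2.core.Controller;
  import org.carrot2.core.ControllerFactory;
  import org.carrot2.core.Document;
  import org.carrot2.core.ProcessingResult;

  public class ClusteringSketch {
      public static void main(String[] args) {
          // Toy documents standing in for search-result titles and snippets.
          List<Document> docs = new ArrayList<>();
          docs.add(new Document("Abstract canvas print", "Modern abstract wall art on canvas"));
          docs.add(new Document("Vintage travel poster", "Retro travel posters for the living room"));
          docs.add(new Document("Abstract watercolor", "Hand-painted abstract watercolor artwork"));

          // Run the Lingo clustering algorithm over the documents for a query.
          Controller controller = ControllerFactory.createSimple();
          ProcessingResult result = controller.process(docs, "wall art", LingoClusteringAlgorithm.class);

          // Each cluster carries a generated, human-readable thematic label.
          for (Cluster cluster : result.getClusters()) {
              System.out.println(cluster.getLabel() + ": " + cluster.getAllDocuments().size() + " docs");
          }
      }
  }

The appeal of this approach is that cluster labels emerge from the documents themselves, so thematic categories do not have to come from a predefined taxonomy.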

AR: Q5. How has your experience been with using Avro for clickstream analytics? What do you consider the top pros and cons of Avro?

TV: Avro provides a data format designed to support data-intensive applications and is widely supported throughout the Hadoop ecosystem. Avro also ships with support for reading and writing data in a variety of languages like Java, C, and Python. In addition, we wanted a format that works well with .NET and Node.js, and we found good Avro library support for both. A key reason for choosing Avro is the ability to do seamless schema evolution: the Avro format is self-describing, as the schema is stored with the data, which allows schemas to evolve in a scalable manner, and this was very important for us. Avro also compresses really well and can lead to significant storage savings for data such as clickstream or traffic data.
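To illustrate the schema-evolution point, here is a minimal, self-contained sketch using Avro's Java GenericRecord API: a record serialized with an old schema is read back with a newer schema that adds a defaulted field. The click-event schema and its fields are invented for illustration:

  import java.io.ByteArrayOutputStream;

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.BinaryDecoder;
  import org.apache.avro.io.BinaryEncoder;
  import org.apache.avro.io.DecoderFactory;
  import org.apache.avro.io.EncoderFactory;

  public class AvroEvolutionSketch {
      public static void main(String[] args) throws Exception {
          // v1: the schema the clickstream event was originally written with.
          Schema v1 = new Schema.Parser().parse(
              "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
            + "{\"name\":\"url\",\"type\":\"string\"},"
            + "{\"name\":\"ts\",\"type\":\"long\"}]}");
          // v2: a later schema that adds a field with a default value.
          Schema v2 = new Schema.Parser().parse(
              "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
            + "{\"name\":\"url\",\"type\":\"string\"},"
            + "{\"name\":\"ts\",\"type\":\"long\"},"
            + "{\"name\":\"referrer\",\"type\":\"string\",\"default\":\"\"}]}");

          // Serialize a record under the old (v1) schema.
          GenericRecord click = new GenericData.Record(v1);
          click.put("url", "/gallery/abstract");
          click.put("ts", 1422000000000L);
          ByteArrayOutputStream out = new ByteArrayOutputStream();
          BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
          new GenericDatumWriter<GenericRecord>(v1).write(click, encoder);
          encoder.flush();

          // Deserialize with the new (v2) schema; "referrer" gets its default.
          BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
          GenericRecord evolved = new GenericDatumReader<GenericRecord>(v1, v2).read(null, decoder);
          System.out.println(evolved.get("referrer")); // prints the empty-string default
      }
  }

Because the reader resolves the writer's schema against its own, older clickstream files stay readable as the event schema grows, without rewriting historical data.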

Regarding cons, Avro is a row-based storage format; this can be inefficient when you need to retrieve only specific columns or fields. In those cases, a column-based storage format (like Parquet) is a better choice.

Second part of the interview

Anmol Rajpurohit is a software development intern at Salesforce. He is a former MDP Fellow and a graduate mentor for IoT-SURF at UCI-Calit2. He has presented his research work at various conferences including IEEE Big Data 2013. He is currently a graduate student (MS, Computer Science) at UC Irvine.
