How GOAT Taught a Machine to Love Sneakers

Embeddings are a fantastic tool to create reusable value with inherent properties similar to how humans interpret objects. GOAT uses deep learning to generate these for their entire sneaker catalogue.



By Emmanuel Fuentes, GOAT.

Mission

At GOAT, we’ve created the largest marketplace for buyers and sellers to safely exchange sneakers. Helping people express their individual style and navigate the sneaker universe is a major motivator for GOAT’s data team. The data team builds tools and services, leveraging data science and machine learning, to reduce friction in this community whenever and wherever possible.

When I joined GOAT, I was not a sneakerhead. Every day, while learning about new sneakers, I gravitated towards the visual characteristics that made each one unique. I started to wonder about the different ways people new to this culture would naturally enter the space. I came away feeling that, regardless of your sneaker IQ, we can all communicate about a sneaker's visual appeal. Inspired by my experience, I decided to build a tool with the hope that others would find it helpful.


The first place to start is developing a common language to describe all sneakers. However, this is not a simple task. With over 30,000 sneakers (and growing) in our product catalogue, spanning unique styles, silhouettes, materials, colors, and more, attributing the entire catalogue manually becomes intractable. In addition, every shoe release can change how we talk about sneakers, meaning we have to update the common language. Instead of trying to fight this reality, we need to embrace the variation and innovation by including them in our language from the beginning.

One way to address this is to use machine learning. To keep up with the changing sneaker landscape, we use models that find relationships among objects without being told explicitly what to look for. In practice, these models tend to learn features similar to those humans pick up on. I detail in this post how we use this technology to build visual attributes as the base of our common sneaker language.

Latent Variable Models

At GOAT, we use artificial neural networks to approximate the most telling visual features of our product catalogue, i.e. the latent factors of variation. In machine learning, this falls under the umbrella of manifold learning. The assumption behind manifold learning is that the data distribution, e.g. images of sneakers, can often be expressed in a lower-dimensional representation that locally resembles a Euclidean space while preserving most of the useful information. The result is a transformation of millions of image pixels into interpretable, nuanced characteristics encapsulated in a list of a few numbers.

Manifold WHAT?

Think about how you would tell your friend the directions to your home. You would never describe how to get from their house to yours in a series of raw GPS coordinates. GPS, in this metaphor, represents a high dimensional, wide-domain random variable. Instead, you would more than likely use an approximation of those coordinates in the form of a series of street names and turn directions, i.e. our manifold, to encode their drive.

Modeling

We leverage unsupervised models such as Variational Autoencoders (VAE) [1], Generative Adversarial Networks (GAN) [7], and various hybrids [4] to learn this manifold without expensive ground truth labels. These models give us a way to transform our primary sneaker photos into aesthetic latent factors, also referred to as embeddings.

In many cases, these models leverage the autoencoder framework in some shape or form for their inference over the latent space. The model’s encoder decomposes an image into its latent vector, and the model’s decoder then rebuilds the image from that vector. Following this process, we measure the model’s ability to reconstruct the input and calculate the incorrectness, i.e. the loss. The model iteratively compresses and decompresses many more images, using the loss value as a signal of where to improve. The reconstruction task pushes this bowtie-shaped model to learn the embeddings that are most helpful to the task. As with other dimensionality reduction techniques such as PCA, this often results in encoding the variability in the dataset.
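The encode, decode, measure, and update loop described above can be sketched with a deliberately tiny example. This is a toy under stated assumptions, not GOAT's model: it uses a linear encoder and decoder, 2-D points instead of images, a single latent number, and plain Python so it has no dependencies. The function name `train_linear_autoencoder` and all parameter values are hypothetical.

```python
import random


def train_linear_autoencoder(data, epochs=200, lr=0.05, seed=0):
    """Fit a linear autoencoder with a 1-D latent to points using per-sample SGD.

    Returns (encoder_weights, decoder_weights, loss_history), where
    loss_history holds the mean squared reconstruction error of each epoch.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    enc = [rng.uniform(-0.5, 0.5) for _ in range(dim)]  # encoder: x -> z
    dec = [rng.uniform(-0.5, 0.5) for _ in range(dim)]  # decoder: z -> x_hat
    history = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            z = sum(e * xi for e, xi in zip(enc, x))     # compress to one number
            x_hat = [z * d for d in dec]                 # decompress
            err = [xh - xi for xh, xi in zip(x_hat, x)]  # reconstruction error
            total += sum(r * r for r in err)
            # Gradients of the squared error w.r.t. decoder and encoder weights.
            err_dot_dec = sum(r * d for r, d in zip(err, dec))
            grad_dec = [2.0 * z * r for r in err]
            grad_enc = [2.0 * err_dot_dec * xi for xi in x]
            dec = [d - lr * g for d, g in zip(dec, grad_dec)]
            enc = [e - lr * g for e, g in zip(enc, grad_enc)]
        history.append(total / len(data))
    return enc, dec, history


# Toy data: 2-D points scattered around a line, so one latent factor suffices.
rng = random.Random(1)
points = []
for _ in range(50):
    x = rng.uniform(-1.0, 1.0)
    points.append([x, 0.5 * x + rng.gauss(0.0, 0.05)])

enc, dec, history = train_linear_autoencoder(points)
print(f"loss after first epoch: {history[0]:.4f}, after last: {history[-1]:.4f}")
```

The single latent value `z` plays the role of the embedding here. A real image model would swap the two weight lists for deep networks, but the training signal is the same reconstruction loss.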

Prototypic Autoencoder

Gotchas and Design Choices

Simply being able to reconstruct an image is often not enough. Traditional autoencoders end up being fancy lookup tables of a dataset [1] with minimal generalization capabilities. This is the result of a poorly learned manifold with “chasms” or “cliffs” in the space between samples. Modern models solve this problem in a variety of ways. Some, such as the famous VAE [1], add a divergence regularization term to the loss function in order to constrain the latent space with some theoretical backing. More specifically, most of these models penalize latent spaces that do not match a chosen prior, such as a Gaussian or uniform distribution, and measure the mismatch with a choice of divergence metric. In many cases, choosing the appropriate model comes down to the design choices of divergence measurement, reconstruction error function, and imposed prior. Examples of such design choices are the β-VAE [2, 3] and the Wasserstein Autoencoder [4], which leverage the Kullback-Leibler divergence and an adversarial loss, respectively. Depending on your use case for the learned embeddings, you may favor one over the other, as there is commonly a tradeoff between output quality and diversity.
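For reference, the β-VAE objective from [2] can be written as a loss to minimize, with a reconstruction term and a weighted divergence term. The notation follows [1, 2] and is an assumption here rather than GOAT's own: an encoder q_φ(z|x), a decoder p_θ(x|z), and a prior p(z).

```latex
\mathcal{L}(\theta, \phi; x, \beta)
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  + \beta \, D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)

% Closed form of the divergence when the encoder outputs a diagonal Gaussian
% with mean \mu and variance \sigma^2, and the prior is a standard normal:
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  = \frac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)
```

Setting β = 1 recovers the standard VAE of [1]. Larger values of β put more weight on matching the prior, which [2] associates with more disentangled factors at some cost in reconstruction quality.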

β-VAE Loss Function, reconstruction and weighted divergence terms

In the case of aesthetic sneaker embeddings for our visual sneaker language, we prefer latent factors that encourage a robust and diverse latent space covering the majority of our product catalogue. In other words, we want to be able to represent the widest range of sneakers, at the cost of not being so great with the really unique styles like the JS Wings.

“Looks Like” Case Study

Generated Photos through Decoder, each image is a fixed latent vector at progressively increasing epochs of training

This model tends to create more independent, human-interpretable factors per dimension [1, 2, 3], a property referred to as disentanglement. First, the model focuses on recreating the most appropriate silhouette, with attention to the contrast between the sole and upper. From there, it constructs the notion of grayscale gradients across the silhouette until it starts to learn basic colors. After understanding silhouette types, e.g. boot vs. athletic or high vs. low, the network begins to tackle the more complex design patterns and colors, which are the final differentiators.

To showcase the learned manifold and inspect the “smoothness” of the learned surface, we can visualize it further through interpolations [6]. We choose two seemingly different sneakers as anchors, then judge the transitions between them in latent space. Each latent vector along the interpolation is decoded back into image space for visual inspection and matched with its closest real product in our entire catalogue. The animation illustrates both these concepts to map the learned representation.
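The two steps above, walking between anchors in latent space and matching each step to a real product, can be sketched as follows. This is a minimal illustration under assumptions: the latent vectors are plain lists, the interpolation is linear (reference [6] also discusses spherical interpolation), the match uses Euclidean distance, and the product ids and 2-D vectors are made up. Decoding each step back into an image needs a trained decoder and is left out.

```python
import math


def interpolate(z_start, z_end, steps):
    """Return `steps` latent vectors evenly spaced on the line from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the first anchor, 1.0 at the second
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


def closest_product(z, catalogue):
    """Return the id of the catalogue embedding nearest to z (Euclidean distance)."""
    return min(catalogue, key=lambda pid: math.dist(z, catalogue[pid]))


# Hypothetical catalogue: product id -> 2-D embedding.
catalogue = {
    "runner": [0.0, 0.0],
    "mid": [1.0, 1.2],
    "boot": [2.0, 2.0],
}

path = interpolate(catalogue["runner"], catalogue["boot"], steps=5)
matches = [closest_product(z, catalogue) for z in path]
print(matches)  # ['runner', 'runner', 'mid', 'mid', 'boot']
```

A smooth manifold should show matches that change gradually along the path, as they do in this toy catalogue.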

Interpolations and Matches between Anchor Sneakers

Exploring the latent space further, we take a single sneaker and modify one factor at a time in every direction to observe how it changes. Factors representing “mid” to “boot”-ness and sole color are just a few of the visually perceivable characteristics learned by the network. Depending on the model, the number of latent factors and their independence from each other vary. This disentanglement property is an active area of research for us, one that we hope will improve our embeddings.
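A traversal like the one described above is simple to set up once an anchor embedding exists. The sketch below only builds the grid of modified latent vectors. The function names, the three-factor anchor, and the offsets are hypothetical, and each resulting vector would still need to go through a trained decoder to become an image.

```python
def traverse_factor(z_anchor, factor, offsets):
    """Return copies of z_anchor with dimension `factor` shifted by each offset."""
    variants = []
    for delta in offsets:
        z = list(z_anchor)  # copy so the anchor itself is never modified
        z[factor] += delta
        variants.append(z)
    return variants


def traversal_grid(z_anchor, offsets):
    """Build one row per latent factor and one column per offset."""
    return [traverse_factor(z_anchor, k, offsets) for k in range(len(z_anchor))]


# Hypothetical anchor with three latent factors; offsets are in units of the
# prior's standard deviation, assuming a standard normal prior.
anchor = [0.5, -1.0, 0.0]
grid = traversal_grid(anchor, offsets=[-2.0, 0.0, 2.0])
for row in grid:
    print(row)
```

Because only one dimension changes per row, any visible change in the decoded images of that row can be attributed to that factor alone.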

Latent Factors Exploration, varying one factor at a time per row with the same anchor sneaker, each column is the reconstructed latent vector at the amount of modification, prior is a standard normal distribution

Furthermore, we can look at our entire product catalogue in terms of latent vectors in a dimensionally reduced 2D/3D plot to look for macro trends. We use tools such as t-SNE [5] to map our latent space into visualizations for spot checking and mass annotation.

 

t-SNE Latent Space Exploration

Logically, if each sneaker is nothing but an aggregation of latent factors, then adding or subtracting these factors relative to each other becomes possible. Here is an example of adding two sneakers together. Notice how the result maintains the wide ankle loop and branding from the first sneaker, while the sole, overall silhouette, and material can be attributed to the second.
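In code, this kind of arithmetic is an element-wise operation on the two latent vectors. The sketch below assumes a plain weighted sum. The source does not say whether the sneakers were summed, averaged, or combined some other way, so the function `combine` and its weights are illustrative only.

```python
def combine(z_a, z_b, weight_a=1.0, weight_b=1.0):
    """Return the element-wise weighted sum of two latent vectors."""
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same length")
    return [weight_a * a + weight_b * b for a, b in zip(z_a, z_b)]


# Hypothetical 2-D embeddings of two sneakers.
first = [1.0, 2.0]
second = [0.5, -1.0]

added = combine(first, second)                  # [1.5, 1.0]
averaged = combine(first, second, 0.5, 0.5)     # [0.75, 0.5]
difference = combine(first, second, 1.0, -1.0)  # [0.5, 3.0]
```

Passing the combined vector through the decoder then yields the blended sneaker image.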

Sneaker Image Latent Space Arithmetic

Takeaways

Embeddings are a fantastic tool to create reusable value with inherent properties similar to how humans interpret objects. They can remove the need for constant catalogue upkeep and attribution over changing variables, and lend themselves to a wide variety of applications. By leveraging embeddings, one can find clusters to execute bulk-annotations, calculate nearest neighbors for recommendations and search, perform missing data imputation, and reuse networks for warm starting other machine learning problems.
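As one illustration of these applications, the sketch below uses nearest neighbors in embedding space to fill in a missing attribute by majority vote. Everything in it is an assumption made for the example: the product ids, the 2-D embeddings, the silhouette labels, the Euclidean distance, and the choice of k.

```python
import math
from collections import Counter


def nearest_neighbors(query_id, embeddings, k):
    """Return the ids of the k products closest to query_id, excluding itself."""
    query = embeddings[query_id]
    others = [pid for pid in embeddings if pid != query_id]
    others.sort(key=lambda pid: math.dist(query, embeddings[pid]))
    return others[:k]


def impute_attribute(query_id, embeddings, known_labels, k=3):
    """Guess a missing label by majority vote among the k nearest labeled products."""
    # Restrict the search to labeled products, plus the query itself.
    labeled = {pid: embeddings[pid] for pid in known_labels}
    labeled[query_id] = embeddings[query_id]
    neighbors = nearest_neighbors(query_id, labeled, k)
    votes = Counter(known_labels[pid] for pid in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical catalogue: "new" has an embedding but no silhouette label yet.
embeddings = {
    "a": [0.0, 0.0], "b": [0.1, 0.2], "c": [0.2, 0.1],
    "d": [3.0, 3.0], "e": [3.1, 2.9], "new": [0.1, 0.1],
}
silhouettes = {"a": "low", "b": "low", "c": "low", "d": "boot", "e": "boot"}

print(impute_attribute("new", embeddings, silhouettes, k=3))  # low
```

The same `nearest_neighbors` lookup, applied to the full catalogue, is also the core of a "looks like" recommendation or visual search.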

References

  1. Auto-Encoding Variational Bayes
  2. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework
  3. Understanding disentangling in β-VAE
  4. Wasserstein Auto-Encoders
  5. Visualizing Data using t-SNE
  6. Sampling Generative Networks
  7. Generative Adversarial Networks

Original. Reposted with permission.
