Clustering in Crowdsourcing: Methodology and Applications

As a result of the efforts outlined in this article, we confirmed that clustering through crowdsourcing is indeed possible and works impressively well.

By Daniil Likhobaba, Analyst at Toloka.

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. In most cases, clustering calls for knowing the distances between objects, which normally takes the form of a distance matrix. However, the function of distance is often unknown, or clustering rules cannot be defined. Consequently, it makes more sense to describe the process in the labeling instructions and utilize crowdsourcing to cluster the data. Alas, no ready pipeline currently exists for this purpose, while most papers on the subject are limited to theory. We wanted to see if we could come up with a more practical solution. This is what we ended up with.


Paired object labeling


In the beginning, we recreated a simple algorithm from Larsen et al. (WWW'20), where all of the labeled data is broken down into two clusters. To do this, we needed to compare the two subsets of objects against each other twice and then aggregate the data using a specialized algorithm. Having said that, two-cluster tasks are quite rare. To top it off, doing a paired subset comparison that comprises all of the objects is very expensive. Lastly, the algorithm gives out a single set with a complement to it, which isn’t proper clustering, technically speaking. But we had to start somewhere nonetheless.

The data was taken from Leipzig Corpora Collection – 360 English and German words each.

The crowd performers had to indicate whether the two words on display belonged to the same language. The results were surprisingly good: 98% of the words in each language group were correctly identified. This was the first promising step. But since we wanted a more efficient solution, we continued to look for viable alternatives.


Multiple comparisons


In the Gomes et al. paper (NIPS'11), the authors suggest an approach that allows for any number of image clusters. It works in the following way: we start with a page that contains images and a color palette. The performer selects a color and labels similar-looking images with it. Then the person selects a different color and groups together a different subset of images according to another shared characteristic. This goes on until all of the images have been color-labeled.

The tasks are created at random. The main parameter is the average number of pages on which each image is displayed. This number has to be chosen for each case, or it has to be defined in terms of its relation to the dataset size. The labeled data gets aggregated with a probabilistic model provided in the article, and the algorithm subsequently produces the clusters. What’s interesting is that this approach doesn’t call for known distances between the objects.


Clustering of shoes by style


In order to test the aforementioned method, we took a dataset from the SIGMOD/PODS'20 Toloka tutorial consisting of 100 images with people wearing clothes. In order to skip what Machine Learning has been trained to do already, we asked the crowd performers to group the objects by style regardless of the color. The logic behind it was that most of us can tell one fashion style of clothes from another, even if we can’t always verbalize the defining feature of any particular style. The instructions contained certain features and details that could be taken into account to help the performers, for example, the high heel.

In the end, we received seven easily identifiable shoe clusters. Among them were men’s shoes, women’s boots, sports shoes, and four others. We were glad to see that women’s shoes and women’s sandals formed two separate clusters. Sample images from each cluster can be seen below.

One can browse through all of the subsets and see firsthand how neatly the clustering was done even with the naked eye. This methodology proved valid to us, but we decided to explore it further.


Clustering of dresses by style


Now, we wanted to see this method in action on a larger dataset. We took 2,000 images of dresses from the Feidegger dataset and again decided to cluster them by style. However, unlike shoes, dresses tend to contain more features, which is why we needed to come up with a precise but concise set of instructions for the crowd performers completing this assignment.

The main difficulty was that we couldn't describe all of the possibilities and special cases to the performers since we didn’t actually know the structure of this dataset. We approached our colleagues from the Crowd Solutions Architect (CSA) team at Toloka with this dilemma. They helped us configure quality control, create a practice pool with a training program, and write up appropriate instructions. As a result, the task was made clear to the performers, which, in turn, boosted the labeling quality.

The instructions did not contain any worst-case scenarios or all of the grouping criteria. The performers were advised to focus on pinpointing the style of a dress, level of detail (for example, the size of the neckline but not its shape), and identify images with the overall similar vs. different looking dresses. We wanted to explain what logic could be used to compare the dresses, outline a basic approach to clustering, but not list any specific characteristics for the performers to memorize and cling to during the actual labeling.

The training was conducted with the difficulty level rising incrementally, starting with a two-dress comparison and eventually going all the way up to eight dresses. The key rules were explained en-route to the final exam that consisted of multiple images with the clearly visible differences.

After that, Russian was added as an interface language, and a maximum of eight images per page was set for the subsequent assignment. This was done to simplify the task and reduce the average processing time per page down to 40 seconds.

During the assignment, one aspect was of particular interest – the parameter that was responsible for the expected value of the number of pages containing the images. This parameter’s relation to the dataset size was defined, which told us how many object comparisons were needed for successful aggregation. We ended up with 12 easily interpretable clusters: among them were fluffy sleeveless dresses, long dresses with sleeves, knitted dresses, and so on. The picture below shows some of the objects from each cluster we acquired.




As a result of these efforts, we were able to confirm that clustering through crowdsourcing is indeed possible and works impressively well. The quality we obtained met our expectations, and we came up with a replicable pipeline. Moreover, we found a way to make the instructions clear to the performers and created a suitable interface for the task. Crucially, all this was possible without the paired comparison we started with. The labeling did not take a long time and was cheaper than previously anticipated.

We are currently working on evaluating the quality of the resulting markup using the crowdsourced-based evaluation approach by Chang et al. (NIPS '11). They suggest using intruders, i.e., adding a deliberately incorrect object to the cluster and asking performers to find it. The more often performers choose this obviously incorrect object and not some other object in the cluster, the higher we consider the quality.

We believe that using crowdsourcing for clustering may be useful in e-commerce, search result ranking, and when investigating the structure of an unknown dataset in any field.