Measuring Topic Interpretability with Crowdsourcing

Topic modelling is an important statistical modelling technique for discovering abstract topics in a collection of documents. This article describes a new measure for assessing the semantic properties of statistical topics and explains how to use it.

By Fred Morstatter and Huan Liu.

Machine learning algorithms can help produce models that are capable of revealing summaries of the dataset. Topic modelling algorithms, which are often trained on text corpora containing billions of documents, can reveal “themes”, or “topics”, in the dataset. As the amount of text available continues to grow rapidly, humans increasingly rely on algorithms to help them understand their data. It is, therefore, essential that the topics learned by these algorithms can be understood by humans.

Gauging the interpretability of the learned topics is a challenge. The interpretability of a topic is naturally best measured by experts who possess the required linguistic expertise. However, machine-learned topics are massive in number, and human experts are not a scalable solution. One common approach is to employ workers on crowdsourcing platforms and ask them questions about the topics. However, crowdsourcing workers do not have the required expertise.


Figure 1: In Model Precision, an intruded word (red) is mixed in with the top five topic words (blue). The human workers’ ability to identify the intruded word is used to measure the topic’s interpretability.

The challenge, then, is to find a scalable solution that does not require professionally acquired expertise, so that crowdsourcing can be employed. The “Model Precision”[1] measure is one crowdsourcing-based solution for quantifying the interpretability of a topic. Instead of directly measuring the interpretability of a topic, its underlying idea is twofold: (1) if the top words of a topic are truly interpretable, then they should all share some semantic similarity, and (2) if these words do share this similarity, then a word randomly inserted into the list should be easily detected by a non-expert such as a crowdsourcing worker. An example of Model Precision for Topic i is shown in Figure 1: the top five words from the topic are shown to a worker, recruited from a crowdsourcing site such as Amazon’s Mechanical Turk, along with a randomly selected “intruder” word, “truck”. The worker is asked to choose the intruder among these six words. The quality of the topic is then computed as the number of workers who correctly identified the intruder divided by the number of times the question was asked. If people can consistently identify the intruded word, then Model Precision suggests that the topic is interpretable; otherwise, it is not.
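The ratio described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' released code: the function name `model_precision` and the sample answers are hypothetical, and only the intruder word “truck” comes from Figure 1.

```python
from typing import Sequence


def model_precision(answers: Sequence[str], intruder: str) -> float:
    """Fraction of worker answers that picked the true intruder word.

    `answers` holds one selected word per worker response (hypothetical
    input format chosen for this sketch).
    """
    if not answers:
        raise ValueError("at least one worker answer is required")
    correct = sum(1 for answer in answers if answer == intruder)
    return correct / len(answers)


# Hypothetical responses from eight workers; "topic_word_3" stands in for
# any non-intruder word that a worker selected by mistake.
answers = ["truck"] * 7 + ["topic_word_3"]
score = model_precision(answers, intruder="truck")
print(f"Model Precision: {score:.3f}")
```

A score near 1 means workers almost always spot the intruder; a score near 1/6 is what random guessing among the six displayed words would produce.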


Figure 2: Example of a coherent topic and a less-coherent topic under Model Precision and Model Precision Choose Two. In both cases the workers easily identify the intruded word, passing the Model Precision task. In the Model Precision Choose Two task, workers consistently identify “century” in the less-coherent topic, indicating that it has less semantic coherence than the other topic.

Model Precision is an ingenious measure and an important step toward measuring topic interpretability. A closer look, however, shows that it measures only one aspect of topic interpretability: the distance from each topic to its intruded word, or intruder. It does not measure the within-topic distance of the top five words. Since the intruded word is usually semantically very far from the top words in the topic, a high Model Precision score says nothing about how close the top words are to one another.

To measure the within-topic distance, we propose a new measure, “Model Precision Choose Two”[2]. Its procedure is the same as that of Model Precision, with five topic words and one intruded word. The key difference is that a worker is asked to choose two intruded words instead of one. This is illustrated in Figure 2. In Fig 2(g), workers have difficulty agreeing on a second word, so the five topic words have a shorter within-topic distance and the topic is more interpretable. In Fig 2(h), workers consistently identify “century” as the second intruder, so the five topic words contain a word that obviously does not belong; the within-topic distance is longer and the topic is less interpretable.

The intuition is that the first word the worker chooses will be the intruded word, so it is the second choice that reveals the within-topic coherence. In an interpretable topic, the second choice should be difficult to identify, causing the workers to guess randomly. In a less-interpretable topic, the workers' answers will begin to coalesce around a single second-choice contender.
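One way to turn “guessing randomly” versus “coalescing around a contender” into a number is to look at how spread out the non-intruder selections are. The sketch below uses Shannon entropy for this purpose; that choice is an assumption made for illustration, since this article does not state the exact formula, and the function name, input format and placeholder words (other than “century”) are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple


def second_choice_entropy(
    choices: Iterable[Tuple[str, str]],
    topic_words: Sequence[str],
    intruder: str,
) -> float:
    """Shannon entropy (in bits) of the non-intruder words workers selected.

    Each element of `choices` is the pair of words one worker picked.
    Every selected word other than the true intruder is counted.
    High entropy: selections are spread out (workers are guessing).
    Low entropy: selections concentrate on one word (a weak topic word).
    """
    counts = Counter(
        word
        for pair in choices
        for word in pair
        if word != intruder and word in topic_words
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no non-intruder selections to analyse")
    entropy = 0.0
    for word in topic_words:
        p = counts[word] / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy


# Placeholder topic words; only "century" is taken from Figure 2.
topic = ["w1", "w2", "w3", "w4", "century"]

# Hypothetical coherent topic: second choices are spread evenly.
spread = [("intruder", w) for w in topic]
# Hypothetical less-coherent topic: everyone picks "century" second.
focused = [("intruder", "century")] * 5

print(second_choice_entropy(spread, topic, "intruder"))
print(second_choice_entropy(focused, topic, "intruder"))
```

With five topic words the entropy ranges from 0 bits (all workers agree on the same second word) up to log2(5) bits (second choices are uniformly spread), so under this sketch a higher value points to a more coherent topic.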

These two measures are just initial steps for assessing the interpretability of machine learned statistical topics. Code and data for Model Precision Choose Two are available at

[1] Proposed in: Chang, Jonathan, et al. “Reading tea leaves: How humans interpret topic models.” Advances in Neural Information Processing Systems. 2009.

[2] Morstatter, Fred, and Huan Liu. “A Novel Measure for Coherence in Statistical Topic Models.” Association for Computational Linguistics. 2016. Available at

Bio: Fred Morstatter is a Graduate Research Assistant at Arizona State University, and Huan Liu is a professor at the School of Computing, Informatics, and Decision Systems Engineering, Ira A. Fulton Schools of Engineering, Arizona State University.