What Data Scientists Can Learn From Qualitative Research

Learn what data scientists can learn from qualitative researchers when it comes to analysing text, and how this relates to writing quality code.

By Alyona Medelyan, Thematic.

Open-ended survey questions often provide the most useful insights, but if you are dealing with hundreds or thousands of people’s answers, summarising them will give you the biggest headache. If you are a data scientist, you may try using an NLP library or an API, but tuning them is hard and the results are often difficult to interpret. If you don’t have a qualitative research background, this article will help you learn the best practices from people who have been working with text, also referred to as qualitative data, for decades.


From text to codes to analysis

What is coding and why does it matter?

When terms like ‘big data’ are thrown around they almost always refer to quantitative data: data that can be easily expressed as numbers or categories. Statistical and machine learning techniques “love” numbers. Text on the other hand is hard, but it is important! Qualitative researchers believe that numbers won’t get you far. They believe that by interviewing people and asking them to answer open-ended questions you can learn more than by simply looking at hard data.

Let’s take for example NPS surveys. The NPS score, calculated from numeric answers to ‘How likely on a scale from 0 to 9 are you to recommend us to friend or family?’ will give you a single measure of company’s performance. But it’s the open-ended answers to the question ‘Why did you give us that score?’ that will teach you how to improve that measure in the future.

A lot of text is produced during qualitative research, and in order to draw conclusions, a technique called coding is used. Survey questions where respondents can write whatever they like are also called open-ended questions. Each of the responses is known as a verbatim.‘Coding’ or ‘tagging’ each response with one or more codes helps capture what the response is about, and in turn summarise the results of the entire survey effectively.

If we compare coding to NLP methods for analysing text, in some cases coding can be similar to text categorisation and in others to keyword extraction. Next, we explore what is involved in coding and the different methodologies that can be used. We often refer to how to perform the task manually, but if you are looking at using an automatic solution, this knowledge will help you understand what matters and how to choose an approach that’s effective.

What is a Coding Frame?

When creating codes they are put into a coding frame. This frame is important because it represents the organizational structure and influences how the coded results can be used. There are two types of frame: ‘flat’ and ‘hierarchical’:

  • Flat frame means that all codes are treated as being of the same level of specificity and importance. While this frame is easy to understand, organizing and navigating it is difficult if it gets large.
  • Hierarchical frames capture a taxonomy of how the codes relate to one another. They allow you to apply a different level of granularity during coding and analysis of the results.

One interesting application of a hierarchical frame is to support sentiment differences. If the top level code describes what the response is about, a mid level code can describe if it is positive or negative and a third level the attribute or specific theme. An example of this type of coding frame is shown below.


Using Sentiment in a Hierarchical coding frame

Advantages and disadvantages of code frames

Advantages and disadvantages of code frames

Coverage and Flexibility of a Coding Frame

One very important thing to consider is the size and the coverage of the coding frame. When coding it is important that responses containing the same themes, even if they are expressed differently, are grouped under the same code. For example, a code ‘cleanliness’ could cover responses mentioning words like ‘clean’, ‘tidy’, ‘dirty’, ‘dusty’ and phrases like ‘looked like a dump’, ‘could eat of the floor’. This requires the coder to have a good understanding of each code and its coverage. Having few codes and a fixed frame makes the decision easier. Having many codes, particularly in a flat frame, makes it harder as there can be ambiguity and sometimes it isn’t clear what exactly a response mean. Manual coding also requires the coder to remember or be able to find all the relevant codes which is harder with a large coding frame.

Finally, coding frames should be flexible. Coding a survey is a costly task, especially if done manually, and so the results should be useable in different contexts. Imagine this: You are trying to answer the question ‘what do people think about customer service’ and create codes capturing key answers. Then you find that the same survey responses also have many comments about your company’s products.   In order to answer ‘what do people say about our products’, you may find yourself having to code from scratch! Creating a coding frame that is flexible and has good coverage (see the Inductive Style below) is a good way to get value in the future.

Deductive and Inductive Coding Styles

What are the two approaches to manual coding of open-ended questions, and which one is best?

Deductive Coding Using pre-existing frame

With deductive coding you start with a predefined set of codes. These might come from an existing taxonomy, codes that cover departments in a business or industry specific terms. This often means that the codes are driven by a project objective and are intended to report back on specific questions. For example, if the survey is about customer experience and you already know that you are interested in problems that arise from call wait times then this would be one of the codes. This has the benefit that you can guarantee the items you are interested in will be covered but you need to be careful of bias.When you use a pre-existing coding frame, you are starting with a bias as to what the answers could be and might miss themes that would emerge naturally from people’s responses.

When you use a pre-existing coding frame, you are starting with a bias as to what the answers could be and might miss themes that would emerge naturally from people’s responses.

Inductive Coding Using Sampling and Re-coding

The alternative coding style is inductive, which is often called ‘grounded’. Here you start with no codes at all, and all codes arise directly from the survey responses. The process for this is iterative:

  1. You read a sample of the data
  2. Create codes that will cover the sample
  3. Reread the sample and apply the codes
  4. Read a new sample of data applying the codes and noting where codes didn’t match
  5. Create new codes
  6. Go back and recode ALL responses that have been coded
  7. Repeat from step 4.

If you happen to add a new code, split an existing code into two, or change its description, make sure to  review codes of all responses that could be affected. Otherwise, the same response near the beginning and near the end of the survey could be given different codes!

How to choose good Codes

When deciding what codes to create several things should be considered.

  1. Ensure Coverage. Codes should cover as many survey responses as relevant. This means the code should be more generic than the comment itself to allow it to cover other responses. Of course this needs to be balanced with the usefulness for analysis. For example ‘Product’ is a very broad code that will have high coverage but limited usefulness. On the other hand ‘Product stops working after using it for 3 hours’ is very specific and is unlikely to cover many responses.
  2. Avoid Commonality. While it is ok to have codes that are similar there should be an obvious difference between them. In maths this is referred to as orthogonality and captures how independent two things are. ‘Customer Service’ and ‘Product’ would be orthogonal while ‘Customer service’ and ‘Customer support’ may have subtle differences but are not orthogonal and may work better as the same code.
  3. Create contrast. Try to create codes that contrast with each other. This allows for both the positive and negative elements of the same thing to be extracted separately. For example ‘Useful product features’ and ‘Unnecessary product features’ would have contrast.
  4. Reduce data. Let’s look at the two extremes: There are as many codes as comments, or each code applies to all responses. In both cases, the coding exercise is pointless. So, try to think about how to reduce the number of data points so that analysis can be done effectively. The example above of ‘Product stops working after using it for 3 hours’ is an example that fails this test. More effective would be a code like ‘Product stops after use’.

Accuracy of Coding

It is very difficult to ensure consistency regardless of whether a deductive or inductive process is used. This is because a coder’s mental frame and past experiences color how they interpret things. This means that different people given the same task are very likely to disagree on what the proper codes should be. In fact, one study has shown that the same person coding the same survey on a different day will produce different results.

Mitigate this by logging all decisions and thoughts that went into coding. Review them when applying existing codes or deciding if a new one is necessary. This process will also mean that the choice of codes can be backed up with evidence.

A different, more expensive, approach to ensure accuracy is through deliberate testing of the coding reliability. The ‘test-retest’ method involves the same person coding the data twice without looking at the results. The ‘independent-coder’ method uses a second coder on the same survey. In both cases the results are then compared for consistency and amended as needed.

Summary / TLDR

  • Data scientists can learn from qualitative researchers when it comes to analysing text
  • Coding is the process of assigning codes to open-ended answers, or other type of text data, after which text can be analysed just like numerical data
  • Code frames are sets of codes and can be flat (easier and faster to use) and hierarchical (more powerful)
  • Code frames need to have good coverage and flexible to allow for a complete and a varied analysis of open-ended answers.
  • Inductive coding (without a pre-defined code frame) is more difficult, but less prone to bias
  • When creating codes make sure they contrast each other and reduce the data
  • Accuracy means consistent coding which can be achieved by logging and reviewing decisions during coding.

Bio: Alyona Medelyan, PhD, specialises in extracting meaning from text. She is a consultant in Natural Language Processing and Machine Learning, and speaks internationally on these subjects. At Thematic, she helps businesses understand customer feedback and customer sentiment. Follow her on Twitter @zelandiya or drop a line via medelyan.com.

Original. Reposted with permission.