We Sized Washington’s Edible Marijuana Market Using AI
As Canada legalizes marijuana, it might look to Washington to understand how pot sells. With the RAND Corp., we used machine learning to estimate how much THC- in pot is sold in Washington.
By Anhvinh Doanvo, Deloitte
Image sourced from Wikimedia Commons
As Canada becomes the first G7 country to legalize recreational marijuana, it might look to the US’s Washington State (WA) for hints on it might play out. In 2012, WA along with Colorado became the first US states to do it, albeit in the face of an ongoing federal prohibition. Yet even as America has had the chance to observe, our nation’s understanding of the industry remains imperfect. Just how much THC is sold in Washington — the chemical that gets users high — remains unknown.
This figure is crucial for answering American and Canadian policymakers’ questions: Can we correlate THC consumption across society with public-health outcomes? Marijuana’s psychoactive and narcotic effects are primarily due to THC, and since plant potency varies widely, it makes sense to understand the industry’s scale in the context of THC, rather than just plant mass.
I was part of a team of students at Carnegie Mellon University’s Heinz College supporting the RAND Corporation’s analysis of this industry, and in our work, we used artificial intelligence to estimate the amount of THC legally sold in edible marijuana products. Our team generated the first such estimate of its kind for WA.
Our Data is Messy with Missing Information
We were given a 60-gigabyte dataset on seed-to-sale transactions, including 10 million sales of edible marijuana products. We needed to estimate the quantity of THC sold in all these products, but all the values for THC content among edibles were missing. The information we received looked a bit like this :
Data provided includes some numeric variables (sale time, price, potency, etc.), categorical variables, and unstructured text (product names).
We needed to get these THC values somehow.
RAND Researcher Steven Davenport noticed that some products, such as the “PButter_TRIPLES_10mg” above, mentioned weights in their names. He judged that these weights referred to the quantities of THC in the product . These weights might appear as “10mg”, “10 mg”, “0.01 grams” or in numerous other forms, but to the human eye, the implication is clear: products with such labels contain 10 milligrams of THC.
Using this information, he wrote a script to recognize these patterns  and pull these weights from 44% of transactions.
First Methods to Calculate THC Sales Perform Badly: Averages and Price Regression
For the other 5.6 million sales, the product names didn’t give us a crutch. “Master Cookie NINES”, for example, makes no mention of THC weights. The simplest way to deal with this would be to assume that these other transactions have the same amount of THC per sale as the former 44% had. If we could scrape THC estimates from 2 million solid edibles, each of which contained 1 mg of THC, and there were 1 million solid edibles left, then we might say that solid edibles were linked to 3 million mg of THC (2 million x 1 mg + 1 million x 1 mg).
This was our first “model”.
But of course, this is a terrible assumption. What if these other transactions are a bit more expensive because people are buying more THC in them? Or if the product types in the latter 56% are different from the 44% that gave us helpful product names? We can think of a million ways for this to go wrong, so we’ve got to do better than calculations based on this assumption.
You can test how good your models are by using them on data that they haven’t seen. We judged how well our methods worked via cross-validation, which is a bit more of a robust application of this basic idea. And with an average error of 40 mg THC per transaction, our first model wasn’t great.
So, we thought we could do better by just exploiting price and other numbers in the data. Products that are more expensive probably have more of the main active chemicals, including THC. And by taking sale times and household incomes surrounding each store into account, we controlled for two key trends: how prices for THC decline over time and how high-income consumers inflate them. This regression-based model did better —it had an error of 20 mg THC per transaction — but there was still plenty of uncertainty.
Machine Learning/AI Dramatically Improves Predictions
This drove us to go one step further: exploiting unstructured text data to predict quantities of THC.
In the data table above, it is not unreasonable to guess that “PButter_Triples_10mg” and “PButter_Triples” are likely similar products, potentially with similar THC content. They share much of the same name. Likewise, “PButter_Triples” is probably pretty different from “Tincture Wintermint” — they don’t have similar names. To exploit this information, we calculated the frequencies of every unique word for every word across all transactions.  With this massive table of frequencies, we had converted text into quantitative information to train an AI making predictions on THC content.
Converting text into a “bag of words” model gives us usable data for machine learning.
Our machine learning algorithm of choice was the “Random Forest”, which uses predictions from hundreds of decision trees. Each of these trees continually partitions the data into many categories to make predictions. One abbreviated tree for our data might look something like this:
Decision trees partition data into many categories according to rules. Here, in this toy, if a transaction’s price is less than $50 and its product name contains “PButter”, then we believe it has 10 mg of THC.
Using all the available information — price, sale times, local incomes, product names, potency, and all other variables — our AI created a “random forest” that tried to minimize error in predicting THC amounts per sale. And with a cross-validated error of just 5 mg THC per transaction, its performance far-exceeded that of all previous models. It proved to be the best way to estimate the THC content in each of the remaining 5.6 million transactions.
Cross-Validation Performance Across Models
New Capabilities, New Answers
Though our algorithms are familiar, their application here unlocks wholly new opportunities in the drug policy field.
Our estimates of THC content sold suggest a continually expanding licit market for marijuana in Washington. We’re not sure whether this is because more people are consuming pot or if the licit market is just eating into the black market.
But with our estimates’ granularity, they give policymakers the capability to answer deeper questions on how THC legally gets to consumers. Exactly which products are consumers using to get high? We see that solid edibles dominate liquid edibles. However, with just 0.3 metric tons of THC legally sold through edibles from July 2016 through June 2017, they represent a small portion of the 24 metric tons of THC sold in WA’s licit market in the same time frame.
Are edibles a cheap way to get THC? Apparently not — their price per unit of THC is about nine times as high as that of usable marijuana products. Is the demand for pot entirely fulfilled by sales of licit products? Nope — there’s a big gap.  If not, is there a black market satisfying it? Maybe.
And how does this impact consumers’ health? Well, our work is just getting started.
We’ve illuminated just another piece of the puzzle on the industry’s scale and its character. Going abroad, the picture that we’ve painted here just might foretell how Canada’s market will change over the years. And I am sure data scientists will be watching it too.
 This is not an exact replication of the data, as multiple database joins were involved in linking matching records in separate tables.
 Separate procedures were applied to CBD products. These were omitted here for brevity.
 RAND researcher Steven Davenport used regular expressions to pull these values.
 In actuality, we used a TF-IDF matrix to train our models. These matrices are particularly useful as they reduce the importance of words that aren’t informative like “of” and “the”. The size of our TF-IDF matrix was also reduced by several hundred dimensions using principal components analysis, which accelerated computation times dramatically.
 We developed our estimates for demand — defined as consumer consumption of pot regardless of whether it was obtained in the regulated market — in a separate project through consumer surveys re-weighted according to their sampling bias.
Bio: Anhvinh Doanvo (@AnhvinhDoanvo) is an incoming analytics consultant with Deloitte's Strategy & Operations practice. As a recent graduate from Carnegie Mellon University, he has written for the Huffington Post and supported executive-level decision-making within multiple public sector organizations.
Original. Reposted with permission.
- Semantic Segmentation: Wiki, Applications and Resources
- Data Science Governance - Why does it matter? Why now?
- Best Masters in Data Science and Analytics in US/Canada