Spotting Controversy with NLP
In this article, I’ll introduce you to a hot topic in financial services and describe how a leading data provider is using data science and NLP to streamline the way it finds insights in unstructured data.
By Jo Stichbury, Freelance Technical Writer
Environmental, social, and governance (ESG) metrics measure the sustainability and societal impact of an investment in a company or business. Before committing to a company, investors want to know if there are any potential controversies brewing, or if the company shows particular leadership in an area of ESG, such as diversity in the workforce.
Refinitiv is a global provider of financial market data and infrastructure, and this article describes how their Labs team is exploring the use of NLP to give their clients a competitive edge in global financial markets. Currently, Refinitiv analysts search for news stories about a specific company using a set of ESG-related keywords, and if there’s a positive match, the story is subject to further scrutiny. For example, a keyword scan would identify a potential governance controversy in the following snippet, prompting an analyst to read the story and determine whether it indicates an ESG controversy:
CHICAGO (Reuters) — The agricultural unit of German chemicals company Bayer AG will halt future U.S. sales of an insecticide that can be used on more than 200 crops after losing a fight with the U.S. Environment Protection Agency, the company said on Friday.
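A keyword scan of this kind can be sketched in a few lines of Python. This is only an illustration: the keyword list and the `scan_story` function are hypothetical, since the article does not publish the keywords Refinitiv analysts actually use.

```python
import re

# Hypothetical ESG-related keywords, for illustration only.
ESG_KEYWORDS = ["insecticide", "pollution", "bribery", "fraud", "discrimination"]

def scan_story(text, keywords=ESG_KEYWORDS):
    """Return the keywords that appear in the text as whole words (case-insensitive)."""
    return [
        kw for kw in keywords
        if re.search(r"\b" + re.escape(kw) + r"\b", text, re.IGNORECASE)
    ]

snippet = (
    "The agricultural unit of German chemicals company Bayer AG will halt "
    "future U.S. sales of an insecticide that can be used on more than 200 crops"
)

matches = scan_story(snippet)
print(matches)  # ['insecticide']

# Any match flags the story for an analyst to read in full.
needs_review = bool(matches)
```

A match only says that a keyword is present. It says nothing about whether the story really describes a controversy, which is why every flagged story still needs a human reader.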
This manual review can take an analyst considerable time. As Tim Nugent, Senior Research Scientist at Refinitiv Labs, explains, “the problem we need to solve is that it’s time-consuming to search and read news articles.” One option to increase throughput and coverage is to hire more analysts to cover more stories, but why not optimize the process with AI to build a more efficient workflow?
Using machine learning and natural language processing (NLP), Tim Nugent’s team has trained a model to review a news stream and triage news stories for potential ESG controversies.
When the Refinitiv analysts review an article manually, they look for controversies in 20 ESG topics defined in-house, many of which align with the UN Sustainable Development Goals. For a specific company, the analysts examine each ESG topic in turn and decide whether or not the article suggests a controversy for that topic. In essence, they perform document classification, which can be re-framed as a supervised machine learning task. An algorithm can be trained to make the same decision and output a probability score for each of the ESG controversy topics. Where the probability sits above a confidence threshold, the prediction proceeds directly through the ESG pipeline, while low-confidence predictions are sent to human analysts for further review.
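The routing step can be sketched as follows. This is a minimal illustration rather than Refinitiv’s implementation: the topic names, the 0.8 threshold, and the `triage` function are all assumptions, and the probabilities are taken as given rather than produced by a real model.

```python
# Hypothetical threshold; the article does not state the value used in practice.
CONFIDENCE_THRESHOLD = 0.8

def triage(topic_probs, threshold=CONFIDENCE_THRESHOLD):
    """Split per-topic probabilities into those that pass straight through
    the pipeline and those that are sent to a human analyst."""
    routed = {"pipeline": [], "analyst_review": []}
    for topic, prob in topic_probs.items():
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability for {topic!r} out of range: {prob}")
        key = "pipeline" if prob >= threshold else "analyst_review"
        routed[key].append(topic)
    # Sort for stable, readable output.
    return {key: sorted(topics) for key, topics in routed.items()}

# Hypothetical model output for one article (three of the 20 topics shown).
scores = {
    "environmental_impact": 0.93,
    "workforce_diversity": 0.41,
    "business_ethics": 0.80,
}

result = triage(scores)
print(result)
# {'pipeline': ['business_ethics', 'environmental_impact'], 'analyst_review': ['workforce_diversity']}
```

Each topic is scored independently, so a single article can be routed automatically for one topic and still land on an analyst’s desk for another.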
The Refinitiv Labs team uses Google’s open-source NLP model, BERT, which has demonstrated state-of-the-art performance in a range of classification tasks. BERT is pre-trained on 3.3 billion words from a general-domain corpus made up of English Wikipedia and the BookCorpus dataset, so it has a good, native understanding of the English language. The team further pre-trained BERT on a business and finance-specific corpus, the Reuters News Archive, which added a further 715 million words from about 2 million articles. The extra training gives the model a better understanding of the domain-specific terminology of business and financial news and improves its prediction confidence downstream. Once this step was complete, they “fine-tuned” the domain-specific model to deal with the ESG controversy classification task.
“The field is highly adversarial and giving customers an edge can be profoundly impactful,” says Tim Nugent. BERT is a state-of-the-art model for language processing, but pre-training it with additional data from Reuters News has made it smarter still. BERT-RNA, as Nugent styles the adapted model, shows an improvement in confidence over generic BERT (82% vs 78%) because of its adaptation to the nuances of financially focussed language. While four percentage points may not appear significant on the surface, the gain has the potential to translate to a huge competitive advantage.
High-quality data is crucial for supervised machine learning tasks. The ESG controversy model was trained using approximately 30,000 “positive” articles that Refinitiv analysts had already annotated, alongside a corresponding set of negative examples. Further work will focus on training the model with additional sources of ESG data, such as a company’s self-reported data, which are typically less structured than traditional market and index data.
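To make the training data concrete, here is a hedged sketch of how annotated positives and negatives might be combined into one labelled set with a yes/no label per topic. The topic names, the example texts, and the `encode_labels` and `build_dataset` helpers are all hypothetical; the article does not describe the format Refinitiv uses.

```python
import random

# Hypothetical subset of the 20 in-house ESG topics.
TOPICS = ["environmental_impact", "workforce_diversity", "business_ethics"]

def encode_labels(flagged_topics, topics=TOPICS):
    """Return a multi-hot vector: 1.0 for each topic an analyst flagged, else 0.0."""
    flagged = set(flagged_topics)
    unknown = flagged - set(topics)
    if unknown:
        raise ValueError(f"unknown topics: {sorted(unknown)}")
    return [1.0 if topic in flagged else 0.0 for topic in topics]

def build_dataset(positives, negatives, seed=0):
    """positives: list of (text, flagged_topics); negatives: list of text.
    Returns a shuffled list of (text, label_vector) pairs."""
    rows = [(text, encode_labels(flagged)) for text, flagged in positives]
    rows += [(text, encode_labels([])) for text in negatives]
    random.Random(seed).shuffle(rows)  # fixed seed keeps the order reproducible
    return rows

# Invented example articles, for illustration only.
positives = [("Regulator bans pesticide sales", ["environmental_impact"])]
negatives = ["Company reports quarterly earnings"]

dataset = build_dataset(positives, negatives)
```

Negative examples get an all-zero label vector, so the classifier sees ordinary business news as well as controversies.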
The Refinitiv Labs team has used machine learning and NLP to positive effect, allowing the company’s ESG analysts to be more productive and efficient. The BERT-RNA model allows machine learning to work alongside human expertise and domain-specific knowledge. Analysts can now focus on what they know best: offering their company’s client base insightful information about ESG controversies surrounding the companies those clients are interested in.
A version of this article appeared on Refinitiv Perspectives in March 2020.
Bio: Jo Stichbury is a Freelance Technical Writer.
Original. Reposted with permission.