Alternative Data, Text Analytics, and Sentiment Analysis in Trading and Investing

Different types of data beyond your typical dollars and cents have been used in the finance industry for many years. By leveraging machine learning, sentiment data is expected to play an increasingly dominant role in the investment industry, and this article highlights some special challenges of its use in trading models.

By Lars Hamberg, Analyst, Investor, co-founder of Gavagai.

In the Finance Industry, Alternative Data is used to give investors an information advantage.

Quantitative Hedge Funds have used trading models based on Alternative Data for many years.

The most common Alternative Data signal used in quantitative trading and quantitative investing is based on text data from the Internet, and the trading models can broadly be classified as algorithmic trading models or statistical arbitrage models.

It has been suggested that text analysis is the key to success for the most successful money manager of all time.

The trading model can use text data and sentiment data as the only or as one of several inputs, and it can be the main strategy, or one of several strategies, in a hedge fund.

Some traditional funds use text-based signals to build the models they use as an overlay to other strategies and as a risk indicator for tactical asset allocation.

Machine Learning is applied to data sets that combine historical price data, for one or more risk assets, with historical alternative data in the form of time series derived from text analysis. These are typically ratios of sentiment data for target concepts, as opposed to the raw frequency of co-occurrence between keywords. Target concepts can cover anything, including risk assets.
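
As a minimal sketch of what a sentiment ratio time series could look like, the snippet below turns individual mentions of one target concept into a daily ratio of positive to polar mentions. The function name, the polarity labels, and the choice of pos / (pos + neg) as the ratio are assumptions made for illustration; the article does not specify how the ratios are defined.

```python
from collections import defaultdict


def daily_sentiment_ratio(mentions):
    """Build a daily sentiment ratio for one target concept.

    mentions: iterable of (day, polarity) pairs, where polarity is
    "pos", "neg" or "neu" (hypothetical labels from a text-analytics step).
    Returns {day: pos / (pos + neg)}; days without polar mentions are omitted.
    """
    counts = defaultdict(lambda: {"pos": 0, "neg": 0})
    for day, polarity in mentions:
        if polarity in ("pos", "neg"):  # neutral mentions carry no direction
            counts[day][polarity] += 1
    return {
        day: c["pos"] / (c["pos"] + c["neg"])
        for day, c in sorted(counts.items())
    }


# Hypothetical mentions of a single target concept over two days.
mentions = [
    ("2020-01-01", "pos"), ("2020-01-01", "pos"),
    ("2020-01-01", "neg"), ("2020-01-01", "neu"),
    ("2020-01-02", "neg"), ("2020-01-02", "neg"),
    ("2020-01-02", "pos"), ("2020-01-02", "neg"),
]
ratios = daily_sentiment_ratio(mentions)
print(ratios)
```

A ratio of this kind is normalized by the number of polar mentions, so it is less sensitive to swings in overall text volume than a raw keyword count would be.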

Risk assets covered by concepts are typically broader asset classes, as opposed to single stocks or instruments. Single stocks are typically traded as a derivative on broader asset classes, depending on their factor attributes, or on arbitrary indices of sentiment data, which aggregate mentions of many single stocks.

There is a trade-off between the volume and frequency of the sentiment data and the robustness of the signal. As a result, there is generally not enough sentiment data on any single stock to create a robust signal at the frequency required for a viable trading model based on that sentiment data in isolation.
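
One rough way to see this trade-off is to treat a sentiment ratio as an estimated proportion and look at its standard error for different numbers of mentions per observation window. This is only a sketch under a strong simplifying assumption, namely that mentions are independent draws, which real text data rarely satisfies; the function name and the sample sizes are made up for illustration.

```python
import math


def ratio_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent mentions."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical mention counts per window, from a thinly covered single stock
# to a heavily discussed broad asset class.
for n in (25, 100, 400, 10000):
    print(n, round(ratio_standard_error(0.55, n), 4))
```

Under this assumption the noise shrinks only with the square root of the number of mentions, so halving the observation window to get a more frequent signal makes each reading noticeably less reliable unless the volume of text is large.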

Some strategies trade across all asset classes and have several trading models generating signals autonomously, for directional trades or pair trades, across a range of instruments across asset classes.

The Machine Learning module defines its rules for trading the risk asset, based on pre-defined parameters. Each trading model continuously generates a LONG or SHORT signal and continuously learns over time. Models are monitored and assessed on performance, and tweaks are simulated and tested on an ongoing basis.
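
The loop described above, emit a signal, observe the outcome, update the model, can be sketched as follows. The article does not say which learning algorithm is used; the online logistic regression on a single sentiment feature, the class name, and the learning rate below are all assumptions chosen to keep the example small.

```python
import math


class OnlineSignalModel:
    """Toy online learner: one sentiment feature in, LONG or SHORT out."""

    def __init__(self, learning_rate=0.1):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate  # pre-defined parameter

    def _prob_up(self, x):
        # Logistic estimate of the probability that the asset goes up.
        return 1.0 / (1.0 + math.exp(-(self.weight * x + self.bias)))

    def signal(self, x):
        return "LONG" if self._prob_up(x) >= 0.5 else "SHORT"

    def update(self, x, went_up):
        # One stochastic gradient step on the log-loss for this observation.
        error = (1.0 if went_up else 0.0) - self._prob_up(x)
        self.weight += self.learning_rate * error * x
        self.bias += self.learning_rate * error


# Hypothetical history: x is a centered sentiment ratio (ratio - 0.5),
# went_up says whether the asset rose over the following period.
history = [(0.3, True), (-0.3, False)] * 200

model = OnlineSignalModel()
signals = []
for x, went_up in history:
    signals.append(model.signal(x))  # trade on what is known so far
    model.update(x, went_up)         # then learn from the realized outcome

print(model.signal(0.3), model.signal(-0.3))
```

Generating the signal before the update matters: it keeps each decision free of information from the period it is trying to predict, which is the property a walk-forward assessment of a model relies on.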

There are many experts in machine learning and computational linguistics, and many in quantitative trading, but very few people have both the domain knowledge and the practical experience in both fields. The overlap between these groups is very small.

As a result, there have been many false starts and false claims regarding the predictive power of sentiment data.

There are many good examples of predictions based on sentiment data, but – for understandable reasons – there are few public displays of alpha-generating signals in sentiment data.

One multi-year public prospective study had to be halted as some participants started trading on the signals.

Market prediction based on sentiment is a field of study with some special challenges. There are, for instance, significant differences between using sentiment data to predict the outcome of a parliamentary election and using it to predict the direction of a financial asset.

One such difference is that there is, generally, not enough sentiment data on a single stock to build a predictive model that supports a viable trading model for that stock. Another difference is that there are significant confounders in the price formation of a single stock.

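Another major challenge is that price formation in financial assets is capital weighted: opinions move the price in proportion to the capital behind them. Sentiment data therefore has to be treated differently than in an election forecast, where every vote counts equally.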
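
A small illustration of why capital weighting changes the picture: the same set of opinions can look bullish when every voice counts equally and bearish when each voice is weighted by the capital behind it. The stances and capital figures below are invented for the example, and the weighting scheme is an assumption, not a description of any particular model.

```python
def net_sentiment(views, capital_weighted):
    """views: list of (stance, capital), stance +1 for bullish, -1 for bearish.

    Returns the average stance, either per head or weighted by capital.
    """
    if capital_weighted:
        total_capital = sum(capital for _, capital in views)
        return sum(stance * capital for stance, capital in views) / total_capital
    return sum(stance for stance, _ in views) / len(views)


# Hypothetical market: nine small bullish participants, one large bearish one.
views = [(+1, 1.0)] * 9 + [(-1, 100.0)]

per_head = net_sentiment(views, capital_weighted=False)
by_capital = net_sentiment(views, capital_weighted=True)
print(per_head, round(by_capital, 3))
```

In an election the per-head figure is the one that decides the outcome; for a financial asset it is closer to the capital-weighted figure, which text data alone usually does not reveal.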

Another major challenge is that models require a lot of high-quality training data, and there is a scarcity of data sets with large volumes of texts collected and analyzed with high-quality text analytics over many years.

Another major challenge is reading, understanding, and analyzing unstructured language data at scale. This is a non-trivial task, which is why sentiment signals from many traditional NLP tools do not work for trading purposes.

It is becoming widely recognized that, in order to solve many – if not most – real-world natural language processing tasks, a machine must handle the extreme variability in natural language usage.

Another challenge is that the prediction accuracy of a signal is not the same thing as a viable trading model. There are models that – statistically – predict direction but consistently lose money or are impossible to trade profitably.
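
The gap between accuracy and profitability is easy to reproduce with a toy example: a signal can be right most of the time and still lose money if the misses are larger than the hits. The signals and returns below are invented for illustration, and the calculation ignores transaction costs and position sizing, both of which would widen the gap in practice.

```python
def hit_rate_and_pnl(signals, returns):
    """Directional hit rate and summed return of trading +1/-1 on each signal."""
    hits = 0
    pnl = 0.0
    for sig, period_return in zip(signals, returns):
        position = 1 if sig == "LONG" else -1
        trade_return = position * period_return
        if trade_return > 0:
            hits += 1
        pnl += trade_return
    return hits / len(signals), pnl


# Hypothetical signals and the asset's return over each following period.
signals = ["LONG", "LONG", "SHORT", "LONG", "SHORT"]
returns = [0.01, 0.01, -0.01, -0.05, -0.01]

hit_rate, pnl = hit_rate_and_pnl(signals, returns)
print(hit_rate, round(pnl, 4))
```

Here four out of five directional calls are correct, yet the single wrong call costs more than the four correct ones earn, so the strategy ends with a loss.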

Prediction accuracy is only one part of a trading model, and it is often misunderstood. The most quoted academic paper on this subject has more than 4000 academic citations but shows a lack of understanding of what a prediction is.

This field is attracting more interest and resources. It is expected that the use of Alternative Data will continue to grow.

Language AI has developed rapidly. Advanced text analysis and machine learning are becoming increasingly affordable and available.

Sentiment data is expected to play an increasingly dominant role in the investment industry. As with all other sources of alpha, this trend will be pervasive across all capital markets, all forms of investing, and all asset classes.


Bio: Lars Hamberg is a Swedish investor and Financial Industry Expert. He founded several businesses and held senior roles in the financial industry. He is a frequent speaker on AI and advanced analytics.