Beyond news contents: the role of social context for fake news detection

Today we’re looking at a more general fake news problem: detecting fake news that is being spread on a social network. This is a summary of a recent paper which demonstrates why we should also look at the social context: the publishers and the users spreading the information!

By Adrian Colyer, Venture Partner, Accel on March 7, 2019 in Fake News, NLP, Social Media


Beyond news contents: the role of social context for fake news detection Shu et al., WSDM’19

Today we’re looking at a more general fake news problem: detecting fake news that is being spread on a social network. Forgetting the computer science angle for a minute, it seems intuitive to me that some important factors here might be:

what is being said (the content of the news), and perhaps how it is being said (although fake news can be deliberately written to mislead users by mimicking true news)
where it was published (the credibility / authority of the source publication). For example, something in the Financial Times is more likely to be true than something in The Onion!
who is spreading the news (the credibility of the user accounts retweeting it for example – are they bots??)

Therefore I’m a little surprised to read in the introduction that:

The majority of existing detection algorithms focus on finding clues from the news content, which are generally not effective because fake news is often intentionally written to mislead users by mimicking true news.

(The related work section does, however, discuss several works that include social context.)

So instead of just looking at the content, we should also look at the social context: the publishers and the users spreading the information! The fake news detection system developed in this paper, TriFN, considers tri-relationships between news pieces, publishers, and social network users.

… we are to our best knowledge the first to classify fake news by learning the effective news features through the tri-relationship embedding among publishers, news contents, and social engagements.

And guess what, considering publishers and users does indeed turn out to improve fake news detection!

Inputs

We have publishers, social network users, and news articles. For the n news articles and a vocabulary of t words, we can compute a bag-of-words feature matrix $\mathbf{X} \in \mathbb{R}^{n \times t}$.

For the m users, we can have an m x m adjacency matrix $\mathbf{A} \in \{0,1\}^{m \times m}$ , where $\mathbf{A}_{ij}$ is 1 if i and j are friends, and 0 otherwise.

We also know which users have shared which news pieces; this is encoded in a matrix $\mathbf{W} \in \{0,1\}^{m \times n}$.

For the l publishers, the matrix $\mathbf{B} \in \{0,1\}^{l \times n}$ similarly encodes which publishers have published which news pieces.

For some publishers, we can know their partisan bias. In this work, bias ratings from mediabiasfactcheck.com are used, taking just the ‘Left-Bias’, ‘Least-Bias’ (neutral) and ‘Right-Bias’ values (ignoring the intermediate left-center and right-center values) and encoding these as -1, 0, and 1 respectively in a publisher partisan label vector, $\mathbf{o}$. Not every publisher will have a bias rating available. We’d like to put ‘-’ in the entry for that publisher in $\mathbf{o}$ but since we can’t do that, the separate vector $\mathbf{e} \in \{0,1\}^l$ encodes, for each of the l publishers, whether or not we have a bias rating available.

There’s one last thing at our disposal: a labelled dataset for news articles telling us whether they are fake or not. (Here we have just the news article content, not the social context).
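To make the shapes concrete, here is a minimal NumPy sketch of these inputs. The sizes, word ids, friendships, shares, and ratings are all made-up toy values, and storing 0 in $\mathbf{o}$ for an unrated publisher is just a placeholder choice that $\mathbf{e}$ masks out.

```python
import numpy as np

# Toy sizes (hypothetical): n news articles, t vocabulary words,
# m users, l publishers.
n, t, m, l = 4, 6, 5, 3

# Bag-of-words counts: X[i, k] = occurrences of word k in article i.
X = np.zeros((n, t))
docs = [[0, 1, 1], [2, 3], [0, 4, 5, 5], [1, 2]]  # word ids per article
for i, words in enumerate(docs):
    for k in words:
        X[i, k] += 1

# Symmetric friendship adjacency: A[i, j] = 1 if users i and j are friends.
A = np.zeros((m, m), dtype=int)
for i, j in [(0, 1), (1, 2), (3, 4)]:
    A[i, j] = A[j, i] = 1

# Sharing matrix: W[u, a] = 1 if user u shared article a.
W = np.zeros((m, n), dtype=int)
for u, a in [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3)]:
    W[u, a] = 1

# Publishing matrix: B[p, a] = 1 if publisher p published article a.
B = np.zeros((l, n), dtype=int)
for p, a in [(0, 0), (0, 1), (1, 2), (2, 3)]:
    B[p, a] = 1

# Partisan labels: -1 left, 0 least-biased, +1 right; None = no rating.
ratings = [-1, None, 1]
e = np.array([0 if r is None else 1 for r in ratings])  # rating available?
o = np.array([0 if r is None else r for r in ratings])  # 0 is a placeholder when e is 0
```

Note that a 0 in $\mathbf{o}$ is ambiguous on its own (neutral or missing), which is exactly why the mask $\mathbf{e}$ is needed.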

The Tri-relationship embedding framework

TriFN takes all of those inputs and combines them with a fake news binary classifier. Given lots of users and lots of news articles, we can expect some of the raw inputs to be pretty big, so the authors make heavy use of dimensionality reduction using non-negative matrix factorisation to learn latent space embeddings (more on that in a minute!). TriFN combines:

A news content embedding
A user embedding
A user-news interaction embedding
A publisher-news interaction embedding, and
The prediction made by a linear classifier trained on the labelled fake news dataset

Pictorially it looks like this (with apologies for the poor resolution, which is an artefact of the original):

News content embedding

Let’s take a closer look at non-negative matrix factorisation (NMF) to see how this works to reduce dimensionality. Remember the bag-of-words matrix for news articles? That’s an n x t matrix where n is the number of news articles and t is the number of words in the vocabulary. NMF tries to learn a latent embedding that captures the information in the matrix in a much smaller space. In the general form NMF seeks to factor a (non-negative) matrix M into the product of two (non-negative) matrices W and H (or D and V as used in this paper). How does that help us? We can pick some dimension d (controlling the size of the latent space) and break down the $\mathbf{X} \in \mathbb{R}^{n \times t}$ matrix into a d-dimension representation of news articles $\mathbf{D} \in \mathbb{R}^{n \times d}$, and a d-dimension representation of words in the vocabulary, $\mathbf{V} \in \mathbb{R}^{t \times d}$. That means that $\mathbf{V}^T$ has shape $d \times t$ and so the product $\mathbf{DV}^T$ ends up with the desired shape $n \times t$. Once we’ve learned a good representation of news articles, $\mathbf{D}$, we can use its rows as the news content embeddings within TriFN.

We’d like to get $\mathbf{DV}^T$ as close to $\mathbf{X}$ as we can, and at the same time keep $\mathbf{D}$ and $\mathbf{V}$ ‘sensible’ to avoid over-fitting. We can do that with a regularisation term. So the overall optimisation problem looks like this:
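One form of this objective consistent with the description above is $\min_{\mathbf{D},\mathbf{V} \ge 0} \|\mathbf{X} - \mathbf{DV}^T\|_F^2 + \lambda(\|\mathbf{D}\|_F^2 + \|\mathbf{V}\|_F^2)$, where the Frobenius norm and the weight $\lambda$ are my assumptions. The sketch below minimises it with standard multiplicative updates, a common NMF technique that I have chosen for illustration and not necessarily the optimiser used in the paper. The function names, `lam`, and the toy data are hypothetical.

```python
import numpy as np

def regularised_nmf(X, d, lam=0.1, iters=200, seed=0, eps=1e-9):
    """Factor non-negative X (n x t) as D @ V.T with D (n x d), V (t x d)."""
    rng = np.random.default_rng(seed)
    n, t = X.shape
    D = rng.random((n, d))
    V = rng.random((t, d))
    for _ in range(iters):
        # Multiplicative updates: non-negative factors stay non-negative.
        D *= (X @ V) / (D @ (V.T @ V) + lam * D + eps)
        V *= (X.T @ D) / (V @ (D.T @ D) + lam * V + eps)
    return D, V

def nmf_objective(X, D, V, lam=0.1):
    """Squared reconstruction error plus the regularisation penalty."""
    return np.linalg.norm(X - D @ V.T) ** 2 + lam * (
        np.linalg.norm(D) ** 2 + np.linalg.norm(V) ** 2)

# Toy bag-of-words matrix: 8 articles, 12 vocabulary words.
X = np.random.default_rng(1).integers(0, 3, size=(8, 12)).astype(float)
D, V = regularised_nmf(X, d=3)   # each row of D is a 3-dimensional article embedding
```

Calling `regularised_nmf(X, d=3, iters=0)` returns the random starting point, which is handy for checking that the objective really does go down.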

User embedding

For the user embedding there’s a similar application of NMF, but in this case we’re splitting the adjacency matrix $\mathbf{A}$ into a user latent matrix $\mathbf{U} \in \mathbb{R}^{m \times d}$, and a user correlation matrix $\mathbf{T} \in \mathbb{R}^{d \times d}$. So in this case we’re using NMF to learn $\mathbf{UTU^T}$, a product of shapes $(m \times d)(d \times d)(d \times m)$, resulting in the desired $m \times m$ shape. There’s also a user-user relation matrix $\mathbf{Y}$ which controls the contribution of $\mathbf{A}$. The basic idea is that any given user will only share a small fraction of news articles, so a positive case (having shared an article) should have more weight than a negative case (not having shared).
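As a rough sketch of how such a weighted factorisation could be scored, the snippet below computes $\|\mathbf{Y} \odot (\mathbf{A} - \mathbf{UTU}^T)\|_F^2$, with $\odot$ the element-wise product. That exact form, the function name, and the 1.0 / 0.1 weighting used for $\mathbf{Y}$ are assumptions for illustration only.

```python
import numpy as np

def user_embedding_loss(A, U, T, Y):
    """Weighted squared error between A and its reconstruction U T U^T."""
    recon = U @ T @ U.T              # (m x d)(d x d)(d x m) -> (m x m)
    return float(np.sum((Y * (A - recon)) ** 2))

rng = np.random.default_rng(0)
m, d = 5, 2

# Toy symmetric friendship graph.
A = np.zeros((m, m))
for i, j in [(0, 1), (1, 2), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

U = rng.random((m, d))               # user latent matrix
T = rng.random((d, d))               # user correlation matrix

Y_uniform = np.ones((m, m))                  # every entry counts equally
Y_weighted = np.where(A == 1, 1.0, 0.1)      # hypothetical: positives weigh more

loss_uniform = user_embedding_loss(A, U, T, Y_uniform)
loss_weighted = user_embedding_loss(A, U, T, Y_weighted)
```

Down-weighting the zeros means the factorisation is penalised less for predicting a link that simply has not been observed.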

User-news interaction embedding

For the user-news interaction embedding we want to capture the relationship between user features and the labels of news items. The intuition is that users with low credibility are more likely to spread fake news. So how do we get user credibility? Following ‘Measuring user credibility in social media’, the authors base this on similarity to other users. First, users are clustered into groups such that members of the same cluster all tend to share the same news stories. Then each cluster is given a credibility score based on its relative size. Users take on the credibility score of the cluster they belong to. It all seems rather vulnerable to the creation of large numbers of fake bot accounts that collaborate to spread fake news if you ask me. Nevertheless, assuming we have reliable credibility scores, we want to set things up such that the latent features of high-credibility users are close to true news, and the latent features of low-credibility users are close to fake news.
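The scoring step can be sketched as follows, assuming the clustering has already been done and each user carries a cluster id. Normalising by the size of the largest cluster is my own simplification of ‘relative size’, and `credibility_from_clusters` is a hypothetical helper, not the scheme from the cited work.

```python
import numpy as np

def credibility_from_clusters(cluster_ids):
    """Score each user by the size of their cluster relative to the largest."""
    cluster_ids = np.asarray(cluster_ids)
    sizes = np.bincount(cluster_ids)        # number of users in each cluster
    cluster_score = sizes / sizes.max()     # largest cluster scores 1.0
    return cluster_score[cluster_ids]       # each user inherits its cluster's score

# Six users: three in cluster 0, two in cluster 1, one in cluster 2.
cluster_ids = [0, 0, 0, 1, 1, 2]
credibility = credibility_from_clusters(cluster_ids)
```

The sketch also makes the weakness visible: a bot farm that forms one big cluster would inherit the top score.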

Publisher-news embeddings

Recall we have the matrix $\mathbf{B}$ encoding which publishers have published which news pieces. Let $\bar{\mathbf{B}}$ be the normalised version of the same. We want to find $\mathbf{q}$, a weighting matrix mapping news publishers’ latent features to the corresponding partisan label vector $\mathbf{o}$. It looks like this:
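A form consistent with the definitions above would be $\min \|\mathbf{e} \odot (\bar{\mathbf{B}}\mathbf{D}\mathbf{q} - \mathbf{o})\|_2^2$: a publisher’s latent features are the average of its articles’ embeddings, $\mathbf{q}$ maps them to a bias score, and $\mathbf{e}$ masks out publishers with no rating. Treat that formula, the row-wise normalisation of $\mathbf{B}$, and the function name below as assumptions.

```python
import numpy as np

def publisher_loss(B, D, q, o, e):
    """Masked squared error between predicted and known partisan labels."""
    # Row-normalise B so each publisher averages over its own articles.
    row_sums = B.sum(axis=1, keepdims=True)
    B_bar = B / np.maximum(row_sums, 1)     # guard against empty publishers
    pred = B_bar @ D @ q                    # (l x n)(n x d)(d,) -> (l,)
    return float(np.sum((e * (pred - o)) ** 2))

# Two publishers, three articles, two latent dimensions (toy values).
B = np.array([[1, 1, 0],
              [0, 0, 1]])
D = np.array([[1.0, 0.0],
              [3.0, 0.0],
              [0.0, 2.0]])
q = np.array([0.5, 0.5])
o = np.array([1, -1])

loss_all_rated = publisher_loss(B, D, q, o, e=np.array([1, 1]))
loss_one_rated = publisher_loss(B, D, q, o, e=np.array([1, 0]))
```

With both publishers rated, only the second contributes error; masking it out with `e` drops the loss to zero.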

Semi-supervised linear classifier

Using the labelled data available, we also learn a weighting matrix $\mathbf{p}$ mapping news latent features to fake news labels.
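In isolation, this term is ordinary least squares on the embeddings of the labelled articles. The sketch below assumes labels of +1 for fake and -1 for true and fits $\mathbf{p}$ on its own; in TriFN $\mathbf{p}$ would be learned jointly with the embeddings, and the helper names here are hypothetical.

```python
import numpy as np

def fit_linear_classifier(D_labelled, y_labelled):
    """Least squares: find p minimising ||D_L p - y_L||^2."""
    p, *_ = np.linalg.lstsq(D_labelled, y_labelled, rcond=None)
    return p

def predict(D, p):
    """Assumed convention: +1 means fake, -1 means true."""
    return np.where(D @ p >= 0, 1, -1)

# Five article embeddings (d = 2); only the first four have labels.
D = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [0.1, 0.9],
              [0.8, 0.2]])
y_labelled = np.array([1.0, 1.0, -1.0, -1.0])

p = fit_linear_classifier(D[:4], y_labelled)
labels = predict(D, p)   # includes a prediction for the unlabelled fifth article
```

The semi-supervised flavour comes from the fact that the embeddings are learned from all articles, while this term only sees the labelled ones.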

Putting it all together

The overall objective becomes to find matrices $\mathbf{D,U,V,T,p,q}$ using a weighted combination of each of the above embedding formulae, and a regularisation term combining all of the learned matrices.
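Schematically, and using placeholder weight names of my own ($\alpha, \beta, \gamma, \eta, \lambda$) with $\mathcal{L}$ standing for each component loss described above, the combined problem has the shape:

$$\min_{\mathbf{D},\mathbf{U},\mathbf{V},\mathbf{T} \ge 0,\ \mathbf{p},\mathbf{q}} \; \mathcal{L}_{\text{content}} + \alpha\,\mathcal{L}_{\text{user}} + \beta\,\mathcal{L}_{\text{user-news}} + \gamma\,\mathcal{L}_{\text{publisher}} + \eta\,\mathcal{L}_{\text{classifier}} + \lambda\,\mathcal{R}(\mathbf{D},\mathbf{U},\mathbf{V},\mathbf{T},\mathbf{p},\mathbf{q})$$

Here $\mathcal{R}$ is the regularisation term over all of the learned matrices. Setting one of the weights to zero switches off the corresponding source of social context.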