Age homophily for predicting age of mobile phone customers

Homophily (the tendency of people to associate with others like them) is ubiquitous in the real world and in social networks. We show the existence of age homophily in a mobile phone communication network and exploit it to predict the age group of all users in the network.

Guest blog by Jorge Brea (Grandata), September 2014.

Social networks generated by mobile phone users provide a rich platform for the study of human interactions. Understanding communication affinities can help describe the semantic content of the social network and, together with the network topology, predict users' attributes and preferences.

Here we address the following questions:

  1. How strong is age homophily among users in a mobile phone network?
  2. Can we efficiently harness this homophily, together with the network topology, to infer age groups for all users in the network?
  3. How does the accuracy of our methodology depend on a user's topological properties in the mobile network?

Age homophily in the mobile phone communication network

To look at age homophily, we had access to the ages of approximately 500,000 clients of the mobile phone company (the seed nodes). We first look at the age population pyramid for both female and male seeds (see Figure 1). The distribution has a double peak, at ages around 30 and 42 years, for both genders.

Population and Age of Mobile Phone Customers
Figure 1

In Figure 2 we plot the histogram of age pairs for clients in the seed set that communicated with each other at least once within a three-month period. We observe a strong increase in caller-callee pairs for users of similar age, which suggests strong age homophily across the whole network. In this figure we also see two weaker off-diagonal peaks, indicating communications between users a generational gap apart, plausibly due to parent-child communications.

Links between different age groups
Figure 2
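A histogram like the one in Figure 2 can be sketched as follows. This is a minimal illustration in Python, not the code used for the study: the function names, the toy `age` and `edges` data, and the five-year window used to summarize the diagonal are all assumptions made for the example.

```python
import numpy as np

def age_pair_histogram(edges, age, max_age=100):
    """Count caller-callee pairs by (caller age, callee age).

    Only pairs where both users have a known age (seed nodes) are counted.
    """
    hist = np.zeros((max_age + 1, max_age + 1), dtype=int)
    for caller, callee in edges:
        if caller in age and callee in age:
            hist[age[caller], age[callee]] += 1
    return hist

def fraction_within(hist, max_gap):
    """Fraction of counted pairs whose ages differ by at most max_gap years."""
    rows, cols = np.indices(hist.shape)
    near_diagonal = np.abs(rows - cols) <= max_gap
    return hist[near_diagonal].sum() / hist.sum()

# Toy data (hypothetical): user "x" has no known age, so the pair (b, x) is skipped.
age = {"a": 30, "b": 31, "c": 58, "d": 30}
edges = [("a", "b"), ("a", "d"), ("a", "c"), ("b", "x")]

hist = age_pair_histogram(edges, age)
near_share = fraction_within(hist, max_gap=5)
```

A high share of pairs close to the diagonal is one simple way to summarize homophily; the off-diagonal peaks mentioned above would show up as extra mass at an age gap of roughly one generation.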

Algorithm for age prediction

Given the strong age homophily observed among the seed nodes, we propose a graph-based algorithm to infer the age group of the remaining users in the mobile network, which is composed of over 70 million users with over 125 million connections. Our algorithm is a probability diffusion algorithm with memory of each node's initial state. Each node diffuses a vector of probabilities of belonging to each of four age categories (<25, 25-34, 35-50, 50+), while remembering its initial vector at every iteration. The algorithm's performance over all nodes in the network was over 46%, whereas random guessing among the four categories would give a performance of 25%.
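To make the idea of diffusion with memory concrete, here is a minimal sketch in Python. The exact update rule is not given in this post, so the one below is an assumption: at every iteration each node mixes its own initial vector (weight `memory`) with the average of its neighbors' current vectors. The names `diffuse_age_groups`, `memory`, and `n_iter`, the uniform initial vector for non-seed nodes, and the toy graph are all hypothetical.

```python
import numpy as np

AGE_GROUPS = ["<25", "25-34", "35-50", "50+"]

def diffuse_age_groups(neighbors, seed_group, n_iter=20, memory=0.5):
    """Diffuse age-group probabilities over a graph, keeping memory of the initial state.

    neighbors:  dict mapping each node to a list of its neighbors
    seed_group: dict mapping seed nodes to an index into AGE_GROUPS
    """
    nodes = list(neighbors)
    index = {n: i for i, n in enumerate(nodes)}
    k = len(AGE_GROUPS)

    # Initial state (assumed): one-hot vector for seeds, uniform for all other nodes.
    x0 = np.full((len(nodes), k), 1.0 / k)
    for n, g in seed_group.items():
        x0[index[n]] = 0.0
        x0[index[n], g] = 1.0

    x = x0.copy()
    for _ in range(n_iter):
        new_x = np.empty_like(x)
        for n in nodes:
            i = index[n]
            if neighbors[n]:
                incoming = np.mean([x[index[m]] for m in neighbors[n]], axis=0)
            else:
                incoming = x[i]  # isolated node: nothing to receive
            # Convex combination, so each row stays a probability vector.
            new_x[i] = memory * x0[i] + (1.0 - memory) * incoming
        x = new_x
    return {n: x[index[n]] for n in nodes}

# Toy path graph (hypothetical): s1 - u - v - s2, with seeds at both ends.
neighbors = {"s1": ["u"], "u": ["s1", "v"], "v": ["u", "s2"], "s2": ["v"]}
seed_group = {"s1": 1, "s2": 2}  # s1 is "25-34", s2 is "35-50"

probs = diffuse_age_groups(neighbors, seed_group)
predicted = {n: int(np.argmax(p)) for n, p in probs.items()}
```

In this toy graph each unlabeled node ends up leaning toward the age group of the seed it is closest to. On a network with tens of millions of nodes one would more plausibly express the same update with sparse matrix operations rather than Python loops.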

Performance based on nodes' topological properties

Next, we looked at how performance depended on a node's topological relation to the seed set. We measured performance as a function of 1) the number of seeds in a node's neighborhood and 2) the node's distance to the seed set (see Figure 3). We observed a significant increase in performance (with respect to the overall performance) for nodes closer to a seed, and up to 62% performance for nodes with three or more seeds in their neighborhood.

Topology performance
Figure 3
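The two topological quantities used in Figure 3 can be computed with a short sketch like the one below, again in Python and under stated assumptions. Distance to the seed set is obtained with a breadth-first search started from all seeds at once. The helper names, the toy graph, and the `actual` ground-truth labels (which would have to come from something like a held-out part of the seed set) are hypothetical.

```python
from collections import defaultdict, deque

def distance_to_seeds(neighbors, seeds):
    """Hop distance from every reachable node to its nearest seed (multi-source BFS)."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        n = queue.popleft()
        for m in neighbors[n]:
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return dist

def seeds_in_neighborhood(neighbors, seeds):
    """Number of seed nodes among each node's direct neighbors."""
    return {n: sum(m in seeds for m in nbrs) for n, nbrs in neighbors.items()}

def accuracy_by(key, predicted, actual):
    """Fraction of correct predictions, grouped by a per-node key such as distance."""
    hits, totals = defaultdict(int), defaultdict(int)
    for n, truth in actual.items():
        if n in key and n in predicted:
            totals[key[n]] += 1
            hits[key[n]] += int(predicted[n] == truth)
    return {k: hits[k] / totals[k] for k in totals}

# Toy path graph (hypothetical): s1 - u - v - w, with a single seed s1.
neighbors = {"s1": ["u"], "u": ["s1", "v"], "v": ["u", "w"], "w": ["v"]}
seeds = {"s1"}

dist = distance_to_seeds(neighbors, seeds)
seed_counts = seeds_in_neighborhood(neighbors, seeds)

# Hypothetical predicted and true age-group indices for the non-seed nodes.
predicted = {"u": 1, "v": 1, "w": 2}
actual = {"u": 1, "v": 2, "w": 2}
accuracy_by_distance = accuracy_by(dist, predicted, actual)
```

The same `accuracy_by` helper can be called with `seed_counts` instead of `dist` to group performance by the number of seeds in a node's neighborhood.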

Our methodology has allowed us to significantly increase the quality of our age predictions for users in the mobile phone network. We expect our method to work well for other user attributes and network topologies that show significant homophily for the given attribute.

For more details, see "Harnessing The Mobile Phone Social Network Topology to Infer Users Demographic Attributes" by Jorge Brea, Javier Burroni, Martin Minnoni and Carlos Sarraute, SNA-KDD Workshop 2014.

Jorge Brea, PhD is a Data scientist at Grandata, a company that integrates first-party and telco partner data to understand key market trends, predict customer behavior, and deliver impressive business results.