Age homophily for predicting age of mobile phone customers
Homophily (the tendency of people to associate with others like them) is ubiquitous in the real world and in social networks. We show the existence of age homophily in a mobile phone communication network and exploit it to predict the age group of all users in the network.
Guest blog by Jorge Brea (Grandata), September, 2014.
Social networks generated by mobile phone users provide a rich platform for the study of human interactions. Understanding communication affinities can help describe the semantic content of the social network and, together with the network topology, predict users' attributes and preferences.
Here we address the following questions:
- How strong is age homophily among users in a mobile phone network?
- Can we efficiently harness this homophily, together with the network topology, to infer age groups for all users in the network?
- How does the accuracy of our methodology depend on a user's topological properties in the mobile network?
Age homophily in the mobile phone communication network
To study age homophily, we had access to the ages of approximately 500,000 clients of the mobile phone company (the seed nodes). We first look at the age population pyramid for female and male seeds (see Figure 1). For both genders, the distribution has two peaks, at around 30 and 42 years of age.
Figure 1
In Figure 2 we plot the histogram of age pairs for clients in the seed set that communicated with each other at least once within a three-month period. We observe a strong increase in caller-callee pairs for users of similar age, which suggests strong age homophily across the whole network. The figure also shows two weaker off-diagonal peaks, indicating communication between users a generation apart, plausibly due to parent-child communication.
Figure 2
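To make the age-pair histogram concrete, here is a minimal Python sketch of how such a histogram could be built from a list of caller-callee pairs. It is an illustration, not the code behind Figure 2: the function names `age_pair_histogram` and `fraction_within`, the `max_age` cap, and the toy data are all assumptions. Counts concentrated near the diagonal are what we read as homophily.

```python
import numpy as np

def age_pair_histogram(edges, age, max_age=100):
    """Count caller-callee pairs by (caller age, callee age).

    edges: iterable of (caller, callee) pairs.
    age:   dict mapping a user to an integer age (known for seeds only).
    """
    hist = np.zeros((max_age + 1, max_age + 1), dtype=np.int64)
    for caller, callee in edges:
        # Only pairs in which both ages are known (both users are seeds) count.
        if caller in age and callee in age:
            a = min(age[caller], max_age)
            b = min(age[callee], max_age)
            hist[a, b] += 1
    return hist

def fraction_within(hist, max_gap):
    """Fraction of counted pairs whose age difference is at most max_gap years."""
    ages = np.arange(hist.shape[0])
    gap = np.abs(ages[:, None] - ages[None, :])
    total = hist.sum()
    return hist[gap <= max_gap].sum() / total if total else 0.0

# Toy data (hypothetical): user "x" has no known age, so the pair is skipped.
age = {"a": 30, "b": 31, "c": 58, "d": 29}
edges = [("a", "b"), ("a", "d"), ("b", "c"), ("a", "x")]
hist = age_pair_histogram(edges, age)
print(fraction_within(hist, max_gap=5))
```

A summary such as `fraction_within` is only one possible way to quantify the diagonal band; the off-diagonal generational peaks are easier to see in the full two-dimensional histogram.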
Algorithm for age prediction
Given the strong age homophily observed among the seed nodes, we propose a graph-based algorithm to infer the age group of the remaining users in the mobile network, which comprises over 70 million users and over 125 million connections. Our algorithm is a probability diffusion algorithm with memory of each node's initial state. In each iteration, every node diffuses a probability vector over four age categories (<25, 25-34, 35-50, 50+) while remembering its own initial vector. The algorithm's performance over all nodes in the network was over 46%, whereas random guessing would give 25%.
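The following Python sketch shows one way a diffusion with memory could look. It is not the exact update rule from the paper: the `memory` weight, the unweighted neighbor average, the uniform starting vector for non-seed nodes, and the fixed iteration count are all assumptions made for illustration, as is the function name `diffuse_age_groups`.

```python
import numpy as np

AGE_GROUPS = ["<25", "25-34", "35-50", "50+"]

def diffuse_age_groups(neighbors, seed_group, memory=0.5, iterations=20):
    """Diffuse age-group probabilities over a graph, with memory.

    neighbors:  dict node -> list of neighboring nodes (every neighbor
                must itself be a key of the dict).
    seed_group: dict seed node -> index into AGE_GROUPS.
    memory:     assumed weight of a node's initial vector in each update.
    """
    n_groups = len(AGE_GROUPS)
    index = {node: i for i, node in enumerate(neighbors)}

    # Initial state: one-hot for seeds, uniform for everyone else (assumption).
    initial = np.full((len(index), n_groups), 1.0 / n_groups)
    for node, group in seed_group.items():
        initial[index[node]] = 0.0
        initial[index[node], group] = 1.0

    state = initial.copy()
    for _ in range(iterations):
        new_state = np.empty_like(state)
        for node, i in index.items():
            nbrs = [index[m] for m in neighbors[node]]
            # Average the neighbors' current vectors; isolated nodes keep theirs.
            incoming = state[nbrs].mean(axis=0) if nbrs else state[i]
            # Blend what diffuses in with the node's remembered initial state.
            new_state[i] = memory * initial[i] + (1.0 - memory) * incoming
        state = new_state
    return {node: state[i] for node, i in index.items()}

# Toy graph (hypothetical): user "u" talks to two seeds aged 25-34 and one aged 35-50.
neighbors = {"s1": ["u"], "s2": ["u"], "s3": ["u"], "u": ["s1", "s2", "s3"]}
result = diffuse_age_groups(neighbors, {"s1": 1, "s2": 1, "s3": 2})
print(AGE_GROUPS[int(np.argmax(result["u"]))])
```

Because every update is a convex combination of probability vectors, each node's vector stays normalized. The memory term keeps seeds anchored near their known age group instead of being washed out by their neighbors.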
Performance based on nodes' topological properties
Next, we looked at how performance depends on a node's topological relation to the seed set. We measured performance as a function of 1) the number of seeds in a node's neighborhood and 2) the node's distance to the seed set (see Figure 3). Relative to the overall performance, we observed a significant increase for nodes closer to a seed, with performance reaching up to 62% for nodes with three or more seeds in their neighborhood.
Figure 3
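The two topological quantities above can be computed with standard graph routines. The sketch below is an illustration under assumptions: the helper names (`distance_to_seeds`, `seeds_in_neighborhood`, `accuracy_by`) are hypothetical, the graph is treated as unweighted and undirected, and the nodes in `actual` are assumed to be held-out users with known age groups that are not part of the seed set.

```python
from collections import defaultdict, deque

def distance_to_seeds(neighbors, seeds):
    """Multi-source BFS: hop distance from each reachable node to its nearest seed."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for m in neighbors[node]:
            if m not in dist:
                dist[m] = dist[node] + 1
                queue.append(m)
    return dist

def seeds_in_neighborhood(neighbors, seeds):
    """Number of seed nodes among each node's direct neighbors."""
    return {node: sum(1 for m in nbrs if m in seeds)
            for node, nbrs in neighbors.items()}

def accuracy_by(key, predicted, actual):
    """Fraction of correct predictions, grouped by a per-node key."""
    hits, totals = defaultdict(int), defaultdict(int)
    for node, truth in actual.items():
        if node not in key or node not in predicted:
            continue
        totals[key[node]] += 1
        hits[key[node]] += int(predicted[node] == truth)
    return {k: hits[k] / totals[k] for k in totals}

# Toy graph and labels (hypothetical).
neighbors = {"s1": ["a", "b"], "s2": ["a"], "a": ["s1", "s2", "c"],
             "b": ["s1"], "c": ["a", "d"], "d": ["c"]}
seeds = {"s1", "s2"}
actual = {"a": 1, "b": 2, "c": 0, "d": 3}      # held-out true age groups
predicted = {"a": 1, "b": 2, "c": 0, "d": 1}   # output of some predictor

print(accuracy_by(distance_to_seeds(neighbors, seeds), predicted, actual))
print(accuracy_by(seeds_in_neighborhood(neighbors, seeds), predicted, actual))
```

Grouping the same predictions by two different keys keeps the evaluation separate from the predictor, so any age-inference method can be analyzed this way.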
Our methodology has allowed us to significantly increase the quality of our age predictions for users in the mobile phone network. We expect our method to work well for other user attributes and network topologies that show significant homophily for the given attribute.
For more details, see Harnessing The Mobile Phone Social Network Topology to Infer Users Demographic Attributes by Jorge Brea, Javier Burroni, Martin Minnoni and Carlos Sarraute, SNA-KDD Workshop 2014.
Jorge Brea, PhD, is a data scientist at Grandata, a company that integrates first-party and telco partner data to understand key market trends, predict customer behavior, and deliver impressive business results.