Interview: Josh Hemann, Activision on Taming the Beast of Gaming Big Data
We discuss analytics challenges at Activision, event data from games such as Call of Duty, balancing aesthetics and inference in visualization, the problem with stacked charts, and more.
Josh Hemann is the Director of Analytic Services at Activision where his team builds data tools to support video game studios and embed analytics within the games they create.
Prior to this, his industry experience spanned diverse settings such as air pollution research, aerospace, retail loyalty programs, and recommendation systems for grocers.
Josh has an MS in Applied Mathematics from the University of Colorado at Boulder.
Here is my interview with him:
Anmol Rajpurohit: Q1. What analytics challenges take up the majority of your work time? What are the most prominent use cases of analytics at Activision?
JH: There are a lot of areas where analytics touches the game, everything from algorithms that determine which players around the world you play with to ones that determine the optimal light refraction to make a
AR: Q2. Video games are instrumented to provide a great amount of data about various events. What are the unique challenges of deriving insights from such data?
JH: I am not sure it is a unique challenge by today’s standards, but certainly the amount of data we deal with makes analytics life hard. For a single game like Call of Duty: Advanced Warfare, we have to process many TBs a day of binary event data that then gets persisted as many TBs a day of structured data. Even heavily down-sampled datasets can still be huge.
Beyond just data size, one challenge is its complexity. Many thousands of game events (e.g. where you are on a map, what weapon you are holding, who you are fighting) are generated for each player in a given match, each match having a dozen or more players. Each event can have hundreds of attributes. So weaving all of this social, temporal, and spatial data together to make inferences is especially challenging.
AR: Q3. What measures should be taken to ensure that we do not get lost in the aesthetics of visualization and jeopardize the inference aspect? While designing a visualization, how can we keep the focus on decisions, not just stories?
JH: Well, aesthetics really do matter. But I think it is easier and easier to generate graphics with beautiful colors that are increasingly interactive and displayed in a browser. Unfortunately, it is not getting any easier to make data visualization informative.
The types of visualizations I choose to publish are shaped by a couple of questions I always have in my head when doing this work:
- What would I have to believe in order for some-important-hypothesis to be likely?
- Well, did you check some-thing-that-most-people-do-not-account-for?
The first question helps me think about the viewer and what supporting information they need in order to be comfortable with what a model is suggesting. The second question is one I imagine the viewer saying to me: what data issues, easy biases, confounders etc. are obvious to a domain expert but that I might be missing in my analysis and presentation of results? Recalling these questions helps me choose what data visualizations are the most useful in getting a model to affect a decision.
AR: Q4. Why is it important to pay attention to the default values of various configuration parameters in a visualization tool? What are the parameters one should be most careful about?
JH: Default settings can be just fine, but it is important to remind yourself that they are there and can sometimes vary widely between software tools. Thus, a viewer’s impression from a given data visualization could vary widely simply as an artifact of what software was chosen to create it.
What parameters are important could be quite different across settings. For example, almost all of the visualization work I do is with 2D graphics but I have friends working in scientific settings where nearly all visualizations are 3D. Some common defaults to be mindful of are:
- Axes range
- Axes scaling
- Axes aspect ratio
- Color maps
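To make the first item concrete, here is a minimal sketch of how the axis range alone can change a viewer's impression of the same two numbers. It is plain arithmetic, not tied to any particular plotting tool; the function name `drawn_height_ratio` and the sample values are hypothetical and chosen only for illustration.

```python
def drawn_height_ratio(smaller, larger, axis_min=0.0):
    """Ratio of the drawn bar heights when the value axis starts at axis_min.

    Bars are drawn from axis_min up to their value, so the visible height
    of a bar is (value - axis_min), not the value itself.
    """
    if not axis_min < smaller <= larger:
        raise ValueError("expected axis_min < smaller <= larger")
    return (smaller - axis_min) / (larger - axis_min)


# Hypothetical values: two categories measured at 98 and 100.
zero_based = drawn_height_ratio(98, 100, axis_min=0)   # axis starts at 0
truncated = drawn_height_ratio(98, 100, axis_min=97)   # axis starts at 97

print(f"axis from 0:  shorter bar is {zero_based:.0%} of the taller one")
print(f"axis from 97: shorter bar is {truncated:.0%} of the taller one")
```

With a zero-based axis the two bars look nearly identical, while an axis that starts just below the data makes the same 2% difference look like a threefold gap. Neither choice is automatically wrong, but it helps to know which one a tool picked by default.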
AR: Q5. Why is it a good idea to report sample sizes for all observation categories during visualization?
JH: It is simple but crucial context for the viewer and yourself. Sometimes it becomes
AR: Q6. Why do you dislike stacked charts? What alternatives do you recommend?
JH: Small multiples are often the better way to inform, but it will probably take more mouse clicking or more code to generate them. As professionals though, I think we owe our audience that extra effort.
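As a rough illustration of one commonly cited issue with stacked charts (not necessarily the one JH had in mind), the sketch below computes the edges a stacked chart would draw for each series. Every series above the first sits on a moving baseline, so its drawn outline reflects the series beneath it as well as its own values. In small multiples each series gets its own panel with a zero baseline. The function names and the data are hypothetical.

```python
def stacked_edges(series_by_name):
    """Return {name: (bottoms, tops)} as a stacked chart would draw them.

    Each series is stacked on top of the running total of the series
    that came before it (dict insertion order).
    """
    length = len(next(iter(series_by_name.values())))
    running = [0.0] * length
    edges = {}
    for name, values in series_by_name.items():
        bottoms = list(running)
        tops = [b + v for b, v in zip(bottoms, values)]
        edges[name] = (bottoms, tops)
        running = tops
    return edges


def small_multiple_edges(series_by_name):
    """Return {name: (bottoms, tops)} with every series on a zero baseline."""
    return {
        name: ([0.0] * len(values), [float(v) for v in values])
        for name, values in series_by_name.items()
    }


# Hypothetical weekly counts for two categories; mode_b is perfectly flat.
data = {
    "mode_a": [10, 20, 30, 40],
    "mode_b": [5, 5, 5, 5],
}

stacked = stacked_edges(data)
panels = small_multiple_edges(data)

print("stacked top edge of mode_b:", stacked["mode_b"][1])
print("own-panel top edge of mode_b:", panels["mode_b"][1])
```

In the stacked layout the top edge of the flat series climbs along with the series under it, so a viewer has to subtract by eye to see that it never changed. In its own panel the flat series is drawn flat.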
Second part of the interview