How to Balance the Load on a Data Team
This post will help you to better understand a data team’s workflow and allocate their resources to business users.
By Alex Yeskov, Statsbot
In the previous chapter, we discussed a complex process of defining business metrics. Now we’ll focus on the aspects of interaction between the data team and business users in their ongoing activities beyond metric definition. This chapter will help you to better understand a data team’s workflow and allocate their resources to business users.
Data team workflow
The resources of a data team are typically very constrained, while demand for data services grows exponentially with the business, as does the quantity and complexity of the data. The high-level structure of a traditional BI team workload can be outlined as follows:
- Developing routines that bring data to a data warehouse.
- Developing routines that reshape data into an analytical data model. Raw data is typically not well suited to analysis in its original form.
- Supporting the analytical data model. This means accommodating changes in the backend systems, such as adding new data attributes, removing retired attributes, and adjusting to structural changes.
- Defining and developing metrics.
- Developing data products (reports, visualizations, tools).
- Creating custom data extracts that can be used to answer specific questions by business users (ad-hoc exports).
- Working to answer specific questions (ad-hoc analysis).
- Creating documentation.
If the team does not proactively invest time in the analytical data model, it will be overwhelmed by the volume of requests for metrics, reports, and extracts from business users. This is a hard-learned lesson. If you don’t have a proper analytical data model, you can expect:
- Each request to take much more time than it would with clean and well-structured data.
- Many requests that would be unnecessary if the data were in the right shape and accessible to end users.
- A very steep learning curve for newly hired analysts. They would have to learn the underlying raw data structure and use complex queries to do the most basic things. This results in low productivity and a high degree of frustration for new team members.
- Inconsistent results over time, because the underlying data changes while its extraction and transformation are done on demand.
At the early stages of a data initiative, solving every task coming from business users seems like low-hanging fruit. But a solid data foundation is what makes the difference between a successful and a failed BI initiative.
So, a good rule of thumb is a 50/50 split, and preferably 60/40 (60% of the time on modeling and cleaning, 40% on business user requests). Once the data model is mature enough, the proportion can shift to 20/20/60 (20% for developing new data models, 20% for supporting the existing ones, and 60% for working with business users).
Everyone who interacts with a data team should understand that. This is a great investment that will reduce the pressure on a data team in the future. Learn about the right way to do this in a modern data environment using ELT.
As a simple example, suppose some business users need data on a specific subset of customers with certain properties. On a raw production database model, this would require joining lots of tables and decoding a few encoded columns that are optimized for transactional storage.
Typically this task will also require groupings and rankings, such as finding each customer’s first purchase or most expensive product. That can be a challenge for someone who isn’t proficient in SQL and doesn’t know the production data model well.
As a result, data is locked and it takes time and skill to unlock it.
The data team needs to transform the data so that tables reflect business entities, with all their myriad properties in a single place.
It’s much easier to complete the request above when you have a “customer” table where all the necessary tables are joined, encoded columns are decoded, and all trivial summaries are precalculated. The same goes for metrics: SQL definitions will be very complex on a raw data model and very simple on an analytical data model.
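To make the contrast concrete, here is a minimal sketch using SQLite in Python. The schema, table names, and values are all hypothetical, invented for illustration: the `raw_query` shows the joins and decoding a business user would face on the raw model, while materializing it once as a `customer` table reduces the same question to a trivial lookup.

```python
import sqlite3

# Hypothetical raw transactional schema (illustrative, not from the post).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, status_code INTEGER);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                     product TEXT, price REAL, ordered_at TEXT);
INSERT INTO customers VALUES (1, 'Ada', 1), (2, 'Grace', 2);
INSERT INTO orders VALUES
  (1, 1, 'scarf',  20.0, '2017-01-05'),
  (2, 1, 'coat',  150.0, '2017-02-01'),
  (3, 2, 'boots',  90.0, '2017-01-20');
""")

# On the raw model, "first purchase and most expensive product per customer"
# needs joins, decoding of status codes, and correlated subqueries:
raw_query = """
SELECT c.id,
       c.name,
       CASE c.status_code WHEN 1 THEN 'active' ELSE 'churned' END AS status,
       (SELECT o.product FROM orders o
        WHERE o.customer_id = c.id
        ORDER BY o.ordered_at LIMIT 1)  AS first_product,
       (SELECT o.product FROM orders o
        WHERE o.customer_id = c.id
        ORDER BY o.price DESC LIMIT 1)  AS most_expensive_product
FROM customers c
"""

# The data team materializes this once as an analytical "customer" table...
con.execute("CREATE TABLE customer AS " + raw_query)

# ...after which the business question becomes a simple SELECT:
rows = con.execute(
    "SELECT name, first_product, most_expensive_product "
    "FROM customer ORDER BY name"
).fetchall()
print(rows)  # [('Ada', 'scarf', 'coat'), ('Grace', 'boots', 'boots')]
```

In a real warehouse the materialization step would be part of the scheduled transformation routines rather than an inline `CREATE TABLE AS`, but the division of labor is the same.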
So, investment in data modeling is key at the early stages of a data initiative. Business users can’t expect immediate turnaround before the groundwork is done.
Other ways to balance the load on a data team are setting up a self-service environment and selecting lightweight (ad-hoc) deliverables when possible.
The opportunities for self-service provided by modern BI tools are constantly increasing, as is the level of data literacy among business users. The key points for building a successful self-service culture are:
- Invest in a proper data model. If data is heavily normalized, like in production systems, non-data people can’t make an effective analysis even with great UI tools at their disposal.
- Use modern BI tools like Statsbot that have rich UI functionality and let users pick dimensions and measures from a predefined analytical data model and get results at the speed of light.
- Increase the level of data literacy among business users. SQL’s learning curve has a very gentle slope, so starting to hack on data themselves can be great fun with immediate benefits.
They just need a data-savvy person to back them up. An hour a week dedicated to data training will be one of the most useful hours you spend.
If your data is modeled well and business users adopt the right tools, they will be able to interact with data directly in the most basic scenarios, reducing the pressure on the data team.
Metrics vs ad-hocs
Defining a metric is a more complex process than ad-hoc analysis. A data team has to support metrics, making sure they work well and are updated on time with correct values. This is particularly true when someone changes the underlying data or finds an issue in the data pipeline. Also, a metric is simply not the right kind of answer in certain cases, such as:
- When you measure the impact of a significant one-time change in the business process. It’s enough to measure it in an ad-hoc fashion before and after changes in the business process are made.
For example, with a weekly metric that would be 4 weeks before the change and 4 weeks after, and these samples would differ (or not, if the change didn’t have the desired effect). If 8 numbers can be distilled into one number before the change and another number after, so much the better.
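The distillation above is just two averages and a comparison. A minimal sketch, with hypothetical weekly values invented for illustration:

```python
# Hypothetical weekly values of a metric: four weeks before and four
# weeks after a one-time process change (numbers are illustrative).
before = [102, 98, 101, 99]
after = [110, 113, 108, 111]

# Instead of maintaining a permanent weekly metric, the eight numbers
# are distilled into one value per period for an ad-hoc comparison.
avg_before = sum(before) / len(before)
avg_after = sum(after) / len(after)
lift = (avg_after - avg_before) / avg_before

print(f"before: {avg_before:.1f}, after: {avg_after:.1f}, lift: {lift:.1%}")
```

For a change this small-scale, a one-off script or spreadsheet like this is cheaper than building, scheduling, and supporting a permanent metric.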
- When the size of the data is too small. In this case, you’ll see statistically insignificant fluctuations of the metric, which reduce transparency and actionability. At a small scale, it’s better to look at the worst cases individually to understand what happened in each particular situation. Once the process scales, it makes sense to build the metric.
- When the definition is too complex and the numbers are revisited rarely. A good example is financial models. There is a great temptation to automate them, but there are better tools for financial modeling than SQL. You can use Python or a good old spreadsheet.
In such situations, ad-hoc analysis is a good way to get a more concise response and reduce the turnaround time. It’s nice to have some repository where you can review and rerun ad-hoc solutions. This can be a ticketing system with links to solutions in the BI tool, a special directory in the BI tool itself, or a git repo.
Documenting the analysis and the code in such a way that it can be reproduced by another member of the data analyst team is the best practice here.
Proper data capturing
It’s common for business software to start generating questions that have no easy answer given the current software design.
Imagine you’re running an online fashion shop with the following process: a delivery man visits a certain number of customers, and every customer has 20 minutes to try on the clothes. The order is then finalized and the payment happens via a mobile POS terminal. Delivery men end their day at 6 PM.
A business user noticed that the last customers on a route are sometimes not visited, so they cancel their orders. She came up with hypotheses for what could be behind this and passed them to the data team:
- Routes are too optimistic. Given the traffic, it’s hard to visit the scheduled number of people within a fixed amount of time.
- Customers spend more than 20 minutes trying on clothes, leaving less time than planned for the last person.
- Everything is fine with routes and timing; it’s just that some delivery men arrive late to work and go home on time, leaving the last person unattended.
To investigate any of these hypotheses, the analyst needs the arrival and departure time for every customer. While the departure time can be tied to the time of terminal transaction, the arrival time is not clear.
A data team can get GPS data from terminals, mobile phones, or vehicles of the delivery men, geocode the addresses of the customers, and find out the approximate arrival times. This is possible to do, but it takes a while.
On the other hand, if the arrival time is captured on the software side, it becomes much easier. For example, customer coordinates are stored in the database, and when the GPS tracker of the delivery man approaches a customer, an event is recorded in the database. There could also be a “checkin” button, which the delivery man has to press every time before ringing a customer’s doorbell. Then, the data team can take these timestamps and provide a quick turnaround on the analytics question.
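With both timestamps captured, the analysis reduces to subtracting one from the other per customer. A minimal sketch, where the event records, field names, and timestamps are all hypothetical, invented for illustration:

```python
from datetime import datetime

# Hypothetical captured events: a "checkin" fired when the delivery man
# arrives, and the POS transaction marking departure (illustrative data,
# not from the original post).
events = [
    {"customer": "A", "checkin": "2017-03-01 10:02", "payment": "2017-03-01 10:31"},
    {"customer": "B", "checkin": "2017-03-01 11:05", "payment": "2017-03-01 11:20"},
]

fmt = "%Y-%m-%d %H:%M"
durations = {}
for e in events:
    arrived = datetime.strptime(e["checkin"], fmt)
    left = datetime.strptime(e["payment"], fmt)
    durations[e["customer"]] = (left - arrived).total_seconds() / 60

# Flag visits that exceed the 20-minute fitting window.
for customer, minutes in durations.items():
    print(customer, minutes, "over limit" if minutes > 20 else "ok")
```

Compare this one loop with the alternative path of geocoding addresses and matching them against GPS traces: capturing the right event in the software makes the analytics question almost trivial.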
The above use case demonstrates how business application logic differs from data analytics logic.
Let’s draw a general boundary between business application tasks and data analytics questions. The former include:
- “what/when” types of questions (like the arrival time)
- tasks that are aimed at reducing the amount of manual labor
- tasks that require regular uptime/fault tolerance
while the latter are related to:
- data aggregation (sums, counts, percentages, etc.)
- finding out the relationship between different things
When analyst resources are concentrated on business application questions and tasks, it is much harder to get to actually doing the analysis. The best practice here is to formulate software requirements in as much detail as possible in the context of the business processes.
If some important piece of information is missing, ask the software engineering department to adjust the software rather than trying to cure it with data wrangling.
This is what will make the difference between being data driven and data challenged.
Proper data capturing and modeling, combining metrics and ad-hocs, and building a self-service environment are the key points in balancing the load on a data team and using their resources most productively.
You can learn more about proper data capturing in the next part of this series. It will be more technical, but still useful for managers who want to understand the bigger picture.
Original. Reposted with permission.