Data Science Data Architecture
Data scientists are a rare breed: they juggle data science, business and IT. Yet they understand less of IT than an IT person and less of the business than a business person. This calls for a specific workflow and data architecture.
The data flow
Figure 2 shows the data flow for analytical applications. The left side of the picture describes the database, the right side the analytics stack: red dots mark the scheduling instances, blue dots the actual analytical processes. The top part, “Dev”, indicates the model development environment, while the lower box, “Prod”, indicates the model production environment.
In the model development environment, the database is divided up into three parts (or schemas):
- A staging area or (for the data scientists) read-only environment where IT can make data available. This has to be read-only in order for IT to guarantee there is no confusion on what has been delivered (from a quality and a quantity point of view).
- The data science playground (or sandbox). This is the free area where model experimentation takes place, where ad-hoc questions are answered and where reports and insights are developed.
- The lower part of the model development environment indicates the pre-production stage. This is an area where the data scientist works closely with the IT department. The discussions between the data scientist and the IT operator revolve around the hand-over process of the model. In addition, the IT operator needs to understand the data requirements of the model and needs to prepare the operational environment for it. The hand-over of a model to an operational team needs to come with an audit structure.
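The division into three areas can be sketched as a small permission matrix. This is a minimal illustration only: the role names, action names and most of the matrix are assumptions made for the example. The article itself only states that staging is read-only for data scientists, that the playground is their free area, and that pre-production is shared with IT. In a real database the same rules would typically be expressed as schema-level grants.

```python
from enum import Enum


class Area(Enum):
    STAGING = "staging"                # IT delivers data here
    PLAYGROUND = "playground"          # free experimentation area
    PRE_PRODUCTION = "pre_production"  # hand-over area shared with IT


# Hypothetical permission matrix: role -> area -> allowed actions.
# Only "staging is read-only for data scientists" comes from the text;
# the remaining entries are illustrative assumptions.
PERMISSIONS = {
    "it": {
        Area.STAGING: {"read", "write"},
        Area.PLAYGROUND: set(),
        Area.PRE_PRODUCTION: {"read", "write", "create_table"},
    },
    "data_scientist": {
        Area.STAGING: {"read"},
        Area.PLAYGROUND: {"read", "write", "create_table"},
        Area.PRE_PRODUCTION: {"read", "write"},
    },
}


def is_allowed(role, area, action):
    """Return True if the role may perform the action in the given area."""
    return action in PERMISSIONS.get(role, {}).get(area, set())
```

For example, `is_allowed("data_scientist", Area.STAGING, "write")` is false, which is what lets IT guarantee what was delivered.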
The data flow described below supports the full workflow of data scientists, from ad-hoc reports to models supporting multiple departments. As mentioned, the journey starts with data being made available in the read-only (staging) area. The data available here is a mix of first-time deliveries (a data scientist is curious by nature and always on the lookout for new data sources) and regularly scheduled deliveries (e.g. monthly new customers, usage, transactions). Initially, the data comes in raw and is explored as such. Further collaboration between IT and the data scientists may lead to requests for certain aggregations or selections of data.
The regular data delivery is picked up by scheduled tasks that prepare the data for the data science data mart. Ideally, this is a change-history based data mart that contains the data to answer 90% of the ad-hoc questions and from which the modeling data can be generated. The alternative to change history is the storage of monthly snapshots; however, that makes time-based selections and models much more difficult. Note that the playground is intended to contain transformed staging area data, not a copy of the original. Moreover, the playground should ideally only contain data that originates from the staging area, in order to prevent non-replicable models.
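To illustrate why a change-history layout makes time-based selections easier than monthly snapshots, here is a minimal sketch using an in-memory SQLite database. The table name, the columns (`valid_from`, `valid_to`) and the sample rows are assumptions made for the example, not the layout used by the author.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_history (
        customer_id INTEGER,
        plan        TEXT,
        valid_from  TEXT,  -- ISO date on which this value became active
        valid_to    TEXT   -- ISO date on which it was superseded, NULL if current
    )
    """
)
conn.executemany(
    "INSERT INTO customer_history VALUES (?, ?, ?, ?)",
    [
        (1, "basic", "2023-01-10", "2023-03-05"),
        (1, "premium", "2023-03-05", None),
        (2, "basic", "2023-02-20", None),
    ],
)


def state_as_of(connection, as_of):
    """Return (customer_id, plan) pairs as they were on the given ISO date."""
    rows = connection.execute(
        """
        SELECT customer_id, plan
        FROM customer_history
        WHERE valid_from <= ?
          AND (valid_to IS NULL OR valid_to > ?)
        ORDER BY customer_id
        """,
        (as_of, as_of),
    )
    return rows.fetchall()
```

Any reference date can be queried with the same function, for instance `state_as_of(conn, "2023-02-01")`, whereas monthly snapshots only answer the question for the dates on which a snapshot happened to be taken.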
From the data mart, the data scientist creates two types of data for the modeling: analytical data and operational data. Analytical data refers to the data used to build the model. It is historic data and is properly split into train/test/validate sets. The operational data refers to the data that is needed for scoring. Note that, since the playground only contains historic data, the operational data refers to the format of the data only, not to its recency. This is an important point, as I’ve encountered multiple situations where the data scientists imagined that they needed to have the most recent data in order to score the model (in the development environment). This placed unreasonable pressure on IT to deliver data with a high frequency in a development environment, with all the undesirable consequences that come from not separating model development from model production.
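The distinction between analytical and operational data can be sketched as follows. The feature names, the target name, the 60/20/20 fractions and the random split are all assumptions made for illustration. The point is only that operational data shares the format of the model inputs, without the target, and says nothing about recency.

```python
import random

FEATURES = ["tenure_months", "monthly_usage"]  # hypothetical feature names
TARGET = "churned"                             # hypothetical target name


def split_analytical(rows, seed=42, fractions=(0.6, 0.2)):
    """Shuffle historic rows reproducibly and split into train/test/validate.

    `fractions` gives the train and test shares; the rest goes to validate.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_test = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validate = shuffled[n_train + n_test:]
    return train, test, validate


def to_operational(row):
    """Keep only the model inputs: the format needed for scoring, no target."""
    return {name: row[name] for name in FEATURES}
```

Because the seed is fixed, the split can be regenerated from the data mart at any time, which supports the replicability the playground is meant to protect.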
Once the model is built, that is, trained, tested, validated and confirmed to score on the operational data, it can be placed in pre-production. Rather than making this a separate environment, it turns out to be more practical to reserve an area in the development environment specifically for this. For the storage of the model, this can be a folder in the model repository. For the database, it is best practice not to let the data scientists create the required tables themselves, but to have them provide the table create statements to IT, so that naming conventions and the like can be discussed. After IT has created the tables, the data scientists can insert the operational test data to show that the model scores in pre-production. This is important in order to identify any overlooked dependencies.
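A pre-production check of this kind might look like the sketch below, again using an in-memory SQLite database as a stand-in. The table name, the column names and the toy scoring rule are hypothetical. The create statement plays the role of the DDL handed over to IT, and the smoke test confirms that the table offers every column the model depends on before scoring the test rows.

```python
import sqlite3

# Create statement as it might be handed over to IT (hypothetical names).
CREATE_SCORING_INPUT = """
CREATE TABLE scoring_input (
    customer_id   INTEGER PRIMARY KEY,
    tenure_months INTEGER NOT NULL,
    monthly_usage REAL NOT NULL
)
"""

REQUIRED_COLUMNS = {"customer_id", "tenure_months", "monthly_usage"}


def missing_columns(conn, table, required):
    """Return the required columns that the table does not provide."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    present = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    return set(required) - present


def score(tenure_months, monthly_usage):
    """Stand-in for the real model: a toy linear rule, for illustration only."""
    return 0.5 - 0.01 * tenure_months + 0.002 * monthly_usage


def smoke_test(conn):
    """Fail loudly on overlooked dependencies, otherwise score every test row."""
    missing = missing_columns(conn, "scoring_input", REQUIRED_COLUMNS)
    if missing:
        raise RuntimeError(f"scoring_input lacks columns: {sorted(missing)}")
    rows = conn.execute(
        "SELECT customer_id, tenure_months, monthly_usage "
        "FROM scoring_input ORDER BY customer_id"
    ).fetchall()
    return {cid: score(tenure, usage) for cid, tenure, usage in rows}
```

In use, IT would run the create statement, the data scientists would insert a handful of operational test rows, and `smoke_test` would either return one score per customer or name the missing columns.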
In order to have the model run in production, IT needs to make the operational data available in the production environment. There are two routes for that. In the first, since IT knows exactly what they placed in the read-only staging area, they make that same data available in the production environment. All the data preparation for the model then comes down to the data scientists, who build it as part of the model scoring job. In the second, IT takes over: when the first scenario is explained to them, they invariably want to provide the exact data needed for the model using their preferred ETL tooling. The data scientists are then tasked with documenting the data transformations in such a way that IT can rebuild them. Typically this is not without challenge, as data scientists come up with very creative ways to transform the data, which might not be easy to achieve in ETL tools. In practice, it comes down to choosing the middle road: IT provides semi-manufactured data, upon which the data science work stream completes the remaining transformations and subsequently scores the model.
It is best practice not to have the data scientists migrate the model to production. It may be that one member of the data science team, with a strong IT background and awareness of the IT policies, becomes the IT-data science liaison and is able to assist with the migration.
Data science requires a close interplay between IT and the data scientists. It’s a bottom-up process and it’s agile. That means that the work cannot be written out, prior to doing the analyses, as a list of specifications to be followed to the letter. Typically, data scientists start by investigating samples of data in combination with understanding the business, after which requirements for model building and data delivery follow. This happens iteratively, and with advancing insights come new or altered data requirements. An IT department that understands this process and can play this game can greatly contribute to the success of data science and the enhancements it brings to the business.
This article discussed how IT architecture can support the workflow of data scientists. I’ve found this architecture to hold for many companies that do not have data science as their core business (most industries, such as financial institutions, retail, telecoms and manufacturing, as opposed to companies that, say, specialize in deep learning). I’m also aware of rapidly changing technology, with traditional databases being exchanged for Hadoop and the like. In those cases I’ve found that modeling against a database in a training environment (or at least having a database as part of the model development environment) often offers the greatest flexibility. The value of data science comes from the ability to play with data to determine the next steps in the analysis. Any architecture that enhances that ability will result in better outcomes for data science and, hence, better decision making.
Bio: Olav Laudy is Chief Data Scientist, IBM Analytics, Asia-Pacific.