Data Science Data Architecture
Data scientists are a rare breed who juggle data science, business and IT. Yet they understand less IT than an IT person and less business than a business person, which calls for a specific workflow and data architecture.
By Dr. Olav Laudy (Chief Data Scientist, IBM Analytics, Asia Pacific).
This article describes the data architecture that allows data scientists to do what they do best: “drive the widespread use of data in decision-making”. It is intended for various audiences: for IT admins to better understand the needs of data scientists, for data scientists to better articulate their needs, and in general for companies that are looking to set up a data science work stream. Data scientists are a rare breed. Apart from data science, they need to understand business and they need IT hacking skills (i.e. the ability to get things working in an IT landscape; not to be confused with a penetration/exploit type of hacker). The data scientist understands more business than an IT person and more IT than a business person. The flip side: the data scientist understands less IT than an IT person and less business than a business person. With this set of skills comes the request for a specific workflow and data architecture.
IT versus Data Science terminology
IT landscapes can be as extensive as DTAP (Development, Testing, Acceptance and Production environments), but more often IT architectures follow a subset of those. From a data science perspective, there is a model development environment and a model production environment (i.e. a model scoring environment). In both worlds, production environment means the same thing: a stable, auditable environment that interfaces with the business under known conditions (workload, response time, escalation routes, etc.). Model development environment, however, means something different to IT than to data scientists. Table 1 spells out the criteria for the different environments and shows that the data science model development environment is neither an IT development environment nor an IT production environment. Note that not all companies have such a strict set of requirements as outlined below, but it is a good starting point for an inventory.
A model development environment needs to have production-grade availability in multiple aspects:
- The daily business of the data scientists takes place on this platform; if it is unavailable, all model development stops.
- The model development environment will, over time, contain a great deal of (analytical) assets. In that sense, its lifetime cannot be restricted, nor can it simply be reinstalled and started from scratch.
- A model development environment may have its own backup or testing environment to test the application of bug fixes and patches.
- Number crunching requires a lot of computational power and storage, so the environment needs to be sized specifically for the expected data and model requirements.
- The model development environment needs formal backup and escalation routes in case of disruptions.
- The model development environment comes with production-level requirements regarding data availability. It is unfortunate that this needs to be pointed out: a data scientist does not build models on test data. It is amazing how often I’m asked to build a model on 2000 rows of artificially created data with the same column names as the real data. Such a strategy works when one writes an API that returns a specific data request. In data science, however, one learns from data, and artificially created data does not contain any interesting structure.
A model development environment needs to have development status in the following aspects:
- A data scientist needs to work against a database with the ability to create, fill and drop tables. A data scientist is able to create queries that hang the system. That is part of experimentation and may happen once in a while. It will become a lesson learned.
- A data scientist is not a DBA. Creating tables happens on the fly, with complete disregard for proper database management such as naming conventions, indexing, partitioning and database normalization. Restricting a data scientist to work along those lines will kill productivity.
- The DBA companion may help out by doing the proper thing to the database, such as writing clean-up scripts, indexing, etc. In addition, the data scientist may ask a DBA to set up database schemas, users, archiving, etc.
- The data scientist needs to have fairly unrestricted access to a command prompt and OS-level capabilities. It will not be the first time that data is delivered in the shape of 100,000 zip files or that a job needs to be set up to scrape some data from the (intra)web. Although source data and temporary files preferably go in the database, sometimes it’s just simpler to be able to store data in a csv on disk.
- Unrestricted installation of software doesn’t have to be among the requirements; however, not having to go through a three-month approval process helps productivity a lot.
- A data scientist should not need to have access to privacy sensitive data. The data repository containing the historic data can be created under referential integrity (i.e. you can still join tables) with hashed or encrypted sensitive fields.
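The last point, hashed sensitive fields that still join, can be illustrated with a minimal sketch. It assumes a keyed hash (HMAC-SHA-256) as the masking technique, which is one possible choice and not necessarily what a given organisation would use. The table contents, the field name `customer_id` and the key are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret key; the assumption is that it is held by the data
# owner and never handed to the data scientist.
SECRET_KEY = b"replace-with-a-secret-held-outside-the-analytics-team"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest; the same input always gives the same digest."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask(rows, field="customer_id"):
    """Copy the rows, replacing the sensitive field with its pseudonym."""
    return [{**row, field: pseudonymize(row[field])} for row in rows]


# Hypothetical source tables that share a sensitive identifier.
customers = [
    {"customer_id": "NL-1001", "segment": "retail"},
    {"customer_id": "NL-1002", "segment": "business"},
]
transactions = [
    {"customer_id": "NL-1001", "amount": 25.0},
    {"customer_id": "NL-1001", "amount": 40.0},
    {"customer_id": "NL-1002", "amount": 310.0},
]

masked_customers = mask(customers)
masked_transactions = mask(transactions)

# The join still works, because equal identifiers map to equal pseudonyms.
segment_by_key = {c["customer_id"]: c["segment"] for c in masked_customers}
total_by_segment = {}
for t in masked_transactions:
    segment = segment_by_key[t["customer_id"]]
    total_by_segment[segment] = total_by_segment.get(segment, 0.0) + t["amount"]

print(total_by_segment)  # {'retail': 65.0, 'business': 310.0}
```

Because the digest is deterministic for a given key, the same masking has to be applied with the same key to every table that shares the identifier. Otherwise the joins break.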
The need for a separate model development and production environment
Not all analytical models are intended to make it to a production environment. However, the most valuable models are not one-time executions but embedded, repeatable scoring generators whose output the business can act upon. Model development takes place in a relatively unstructured environment that makes it possible to play with data and experiment with modeling approaches. Embedding an analytical model in the business means it migrates from this loosely defined environment to a location of rigor and structure. Not separating the environments leads to a series of issues:
- An ad-hoc query for a model under development can disrupt the scoring of a production model.
- A data scientist can manually alter scores (e.g. credit scores).
- There’s privacy sensitive data available for the eyes of the data scientist (as production data is not censored).
- The model development cycle is likely required to align with the production scoring cycle.
- Archiving needs are different for model generated scores and models.
Figure 1 shows the difference between the cycles for model development and model scoring. In the development environment, the data scientist comes up with an idea and slowly works towards a ready model. Once it has taken the right shape, it is placed in the pre-production environment (more on this later), where it is thoroughly inspected. Upon approval, and with the proper controls in place, the model is moved to production, where it scores at a set interval. Note that developing the model in the same environment as the scoring frequently implies that a new version of the model needs to be ready for the upcoming scoring moment, i.e. the new model needs to be developed in between the scoring moments. This rushes the process and is error-prone, because auditability and a formal model migration process are lacking. In separate environments, as shown in Figure 1, the data scientist has, after some time, a new idea to improve the model. The current approved model is taken from the pre-production environment and worked on. Once ready, it is placed back into pre-production, but as the figure shows, it cannot be approved due to lacking functionality. The data scientist repairs the defect, after which, upon approval, the new model can be placed in production.
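The migration path described above can be sketched as a small state machine with an audit trail. This is only an illustration of the workflow, not a prescribed implementation: the class `ModelVersion`, its method names and the model name `churn` are hypothetical.

```python
from dataclasses import dataclass, field

# Stage names follow the workflow described for Figure 1.
DEVELOPMENT = "development"
PRE_PRODUCTION = "pre-production"
PRODUCTION = "production"


@dataclass
class ModelVersion:
    """Hypothetical record of one model version and where it currently lives."""
    name: str
    version: int
    stage: str = DEVELOPMENT
    # Audit trail of (action, from_stage, to_stage, note) tuples.
    history: list = field(default_factory=list)

    def _move(self, action, to_stage, note=""):
        self.history.append((action, self.stage, to_stage, note))
        self.stage = to_stage

    def submit(self):
        """Place a finished model in pre-production for inspection."""
        if self.stage != DEVELOPMENT:
            raise ValueError("only a model in development can be submitted")
        self._move("submit", PRE_PRODUCTION)

    def review(self, approved, note=""):
        """Approve the model into production, or send it back to development."""
        if self.stage != PRE_PRODUCTION:
            raise ValueError("only a model in pre-production can be reviewed")
        if approved:
            self._move("approve", PRODUCTION, note)
        else:
            self._move("reject", DEVELOPMENT, note)


# Second cycle from the text: the improved model is rejected once, then approved.
model = ModelVersion("churn", version=2)
model.submit()
model.review(approved=False, note="lacking functionality")
model.submit()
model.review(approved=True)

print(model.stage)  # production
for entry in model.history:
    print(entry)
```

The point of the design is that a model can only reach production through an explicit review step, and that every move is recorded. That is the auditability the combined environment lacks.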