In Loving Memory of Strictly-Typed Schemas
This article addresses one very peculiar manifestation of marketing propaganda in the big data industry that has crippled data engineers across the board — a resolute and methodical undermining of the sanctity of strictly-typed schemas.
We live in a strange time when critical technological decisions are driven by marketing propaganda. There is arguably no industry more afflicted with this syndrome than ‘big data’. In this article, I will address one very peculiar manifestation of this phenomenon in the big data industry that has crippled data engineers across the board — a resolute and methodical undermining of the sanctity of strictly-typed schemas.
While the ideological war over the OLTP empire raged on, the analytics and data science ecosystem, which had thus far been dominated by strictly-typed and schematic data warehouses, started to see an infiltration of the new and disruptive ‘schemaless’ ideas. As marketing propaganda across the world started to hail big data as the new oil, corporations of all sizes became obsessed with hoarding data. If AI is data-driven, then surely more data implies more AI magic, and who in their right mind doesn’t want more AI magic? In times like this, schemas became nothing but a hindrance, like a bouncer at a nightclub which isn’t quite cramped enough and really wants to be. To fix this problem, we turned to the idea of a ‘data lake’ — essentially a file system on steroids which can store anything that can be put in a file. If this wasn’t disruptive enough, our industry went a step further with the notion of ‘schema-on-read’. The idea of ‘schema-on-read’ is to hoard all the data you like, and ‘infer’ a schema when the data needs to be used. In theory, this is a swell idea, and companies with skin in the game wasted no time in marketing it to the moon. However, just like with the former ‘NoSQL’ movement, holes in the logic started to emerge quickly. How does one infer a schema? Do you write an algorithm to make an educated guess based on a subset of homogenous data? Does this algorithm depend purely on deterministic heuristics, or non-deterministic machine learning models, or a patchwork combination of the two? How do you know that your subset isn’t full of anomalies? How do you guard against a sampling bias? What happens when your schema inference algorithm inevitably gets it wrong? And possibly the biggest question of them all, what happens when the schema changes? There are many schema inferring tools out there, but none of them have found comprehensive answers to these problems. As a result, a lot of data engineers working on data lakes today spend unnecessarily large amounts of time trying to fix bugs caused by schema mismatches, type inference failures, and data corruption due to poorly assigned types. Problems that may never have arisen had we paid attention to these matters before abandoning the notion of strictly-typed schemas. Let us now consider a few practical implications of this issue.
Suboptimal Type Inference
Here’s a little fun fact that has caught many a senior dev off-guard.
1.1 + 2.2 quite obviously equals
3.3 in math, but try running
1.1 + 2.2 == 3.3 in your favorite programming language and you might be surprised to find that it evaluates to
false. Veteran engineers will probably know why, but just in case you didn’t, by default, programming languages will cast the values
2.2 to a
float type. A float uses a binary (base-2) system and typically offers as much precision as 32 bits of memory can afford. This creates limitations on how precisely a floating point number can be represented, leading to a situation wherein
1.1 + 2.2 actually evaluates to
3.3000000000000003. For most use-cases, this very slight imprecision is insignificant, but there are important exceptions, in particular monetary values. If let’s say you are working with a foreign exchange dataset, even a small imprecision in a currency’s conversion rate can lead to massive discrepancies when you use that imprecise rate in a computation applied to millions of data points, and then perform aggregations on those data points. Of course this is not a new problem, and most languages and databases offer a much more precise 128 bit
decimal type which is widely accepted as the best way to store monetary data.
Now let’s consider this in the context of data lakes with ‘schema-on-read’ inference architecture. Let’s say that your data lake is receiving a table with decimal fields. Some of these fields don’t need absolute precision (for example, weight of a purchased item), while others do (for example, exchange rate at the time of purchase). Today’s schema inference engines are not brave enough to try and infer the required precision based on the field name, and rightfully so, because naming conventions are another messy can of worms. So what options does your schema inference engine have? It could either cast every field with a floating point as a decimal type with very generous precision and scale. Or it could cast everything as a float or a double (64 bit precision with a binary system). In the former case, we have the problem of unnecessary performance penalties, because decimal type objects are bulkier and slower to work with. In the latter case, we run into the problem of imprecise financial data which introduces all the aforementioned liabilities. In my experience, all the schema inference engines I have worked with so far default to using floats. This leaves us with no choice but to hack our own schema-on-read system and enforce a schema for certain fields. It actually takes a fair bit of work to ensure that decimal type objects are maintained with their full precision through an entire ETL pipeline. It has to be enforced on the file format you are using on disk, and on your application layer’s code.
Loss of Type Resolution
On January 29, 2020, Pandas released v1.0.0, a major version release. If you have a look at its changelog, you’ll notice that an overwhelming majority of the updates are related to its handling of data types. This includes a dedicated
NA type for missing values, a
BooleanDtype that can hold missing values, and more. The implications of this release are huge. It is a sign that Pandas has acknowledged the core problem discussed in this article, and has therefore decided to steer its shipping in the direction of well-planned type systems. For some context, Pandas is considered the gold standard for single-server data processing applications written in Python, and for companies with small-to-medium sized data, it is a far more practical option than Spark.
In the context of this update, let’s dig a little deeper to see what happens to data as it moves between type systems of varying type resolutions. Traditional relational databases such as Postgres and MySQL offer fairly high-resolution type systems, with various options for strings, integers, etc. Let’s consider the case of integer data. Let’s say your data starts in a MySQL table, where it can be set to one of five different integer types depending on its size range (min-max values), and each of these types can be set to either
Signed which can hold both negative and non-negative values, or
Unsigned which can hold only non-negative values. From here let’s say you extract some data and dump it into a CSV file. You are now left entirely at the mercy of your schema inference engine to get this right. Let’s say the engine selects 1,000 random rows and assigns data types to the fields based on the data in those rows. The accuracy of the dataset once its loaded into memory now relies entirely on the notion that these 1,000 rows sampled all possible edge cases. For example, if a field contained mostly 3 digit integers but a few valid 19 digit integers that did not show up in the random sample, the schema inference engine might misallocate this data to an integer type that isn’t large enough to hold the outliers, resulting in an integer overflow, which could lead to anything from runtime errors — if you are lucky — to corrupt data in production datasets. Similar problems can and often do occur when an inference engine misallocates signed integers as unsigned integers or vice-versa, as was once acknowledged by the founder of Pandas.
Problems around type resolution are by no means limited to numbers. The exact same problem described above regarding the length of an integer applies to strings as well. If your schema-inference tool is casting types for fields in a data store that casts text data differently depending on its size, if the random sample of data selected for inference does not account for grossly oversized but valid outliers, they may simply get truncated in the final dataset. These are all problems we deal with on a regular basis. With regards to string fields, the recent Pandas update is a very welcome development because Pandas previously used the
object type for strings, which is the parent type for all its other types. The problem here is that until this recent release, a Pandas
DataFrame’s schema offered no way to distinguish between a string field and other fields containing complex objects, such as decimal fields which use Python’s
Decimal type. This becomes a big problem if you are trying to use a Pandas schema to automatically enforce a schema on a destination data store like a Parquet file.
Unfortunately, there is no conclusion as of now. We, the data engineering community at large, have dug ourselves into this hole. While I am complaining as loudly as I can, I too drank the schema-on-read Kool-Aid with gusto. Today, as I lead a team of highly skilled data engineers, it pains me to watch our sprints get crowded with firefighting tickets and tech debt produced by poorly inferred schemas. We are slowly but surely rewiring our system to enforce schemas wherever possible. We are also working on a
SchemaPorter class which has been designed specifically to manage the safe movement of schemas across various type systems. Our solution for now is in its nascency, but if and when it reaches a certain point of maturity, we will share more details about it. My hope is that over time, the many in-house solutions built by teams like ours will eventually converge. Meanwhile, we continue to put our faith in the champions of open source who build and maintain Pandas, Spark, and other such majestic feats of human engineering. There is no doubt in my mind that the hardest, most important work in the data engineering ecosystem is done in the pull requests made to these open source repositories, and developments like the recent Pandas release could nip most of our schema-related problems in the bud. Moreover, it would probably serve us well to treat new trends and lofty promises in marketing campaigns with a healthy dose of skepticism. Before we celebrate a new feat of disruptive innovation, we should carefully consider what is being disrupted. There are reasons why we spent decades laboring over database schema designs and migration scripts. A lot of those reasons haven’t disappeared quite yet.
P.S. If you haven’t already, do read my colleague Hussein’s article on schema evolution. Managing schema evolution is possibly the most difficult challenge posed by the schema-on-read paradigm. I decided not to dive into it because I would just be repeating a lot of what Hussein already wrote.
Want to work with us? Click here to see all open positions at SSENSE!
Original. Reposted with permission.
- Recommender System Metrics: Comparing Apples, Oranges and Bananas
- Schema Evolution in Data Lakes
- 7 Steps to Mastering SQL for Data Science — 2019 Edition