Dimensions of Data Quality
"Bad data" is like obscenity - we know it when we see it. But how do we define it?
Everyone wants “good” data. Almost as universal is the sense that the data you’re working with is…not good. Being able to objectively measure data quality is important for ensuring downstream modeling and decision making are built on reliable data, but it’s hard to measure and report on data quality without a framework for identifying which features of the data are good or bad.
The framework I like to use is to consider dimensions of data quality - data features with quality expectations that can be defined and measured against. Depending on who’s counting, there are somewhere between five and ten generally accepted dimensions. I prefer to keep the list on the shorter side, both because it’s easier to remember and because things start getting really confusing when there’s too much room for overlap between features.
Completeness, for me, is one of the easier dimensions to grok. For our data to be complete, there should be no nulls where nulls are not allowed - incompleteness may mean a record exists with a non-nullable field left blank, or an expected record may be missing entirely. Uniqueness is sometimes included in lists of data quality dimensions; I prefer to think of it as an element of completeness - “overcompleteness,” in a sense.
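As a rough sketch, a completeness check over a list of dict records might count blank required fields and flag duplicate keys (the “overcompleteness” case). The function, field names, and sample data here are illustrative, not from any particular library:

```python
# Minimal completeness check: flag missing required fields and
# duplicate keys ("overcompleteness"). Field names are illustrative.

def completeness_report(records, required_fields, key_field):
    missing = []     # (record index, field) pairs where a required value is absent
    seen = {}        # key -> first index observed
    duplicates = []  # keys that appear more than once
    for i, rec in enumerate(records):
        for field in required_fields:
            value = rec.get(field)
            if value is None or value == "":
                missing.append((i, field))
        key = rec.get(key_field)
        if key in seen:
            duplicates.append(key)
        else:
            seen[key] = i
    return {"missing": missing, "duplicates": duplicates}

records = [
    {"id": 1, "name": "Ana", "email": "ana@example.com"},
    {"id": 2, "name": "", "email": "bo@example.com"},      # blank required field
    {"id": 1, "name": "Ana", "email": "ana@example.com"},  # duplicate key
]
report = completeness_report(records, ["name", "email"], key_field="id")
```

Both failure modes land in one report, which makes it easy to track nulls and duplicates as a single completeness metric.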
When multiple data sources provide values describing the same thing, or a datapoint is expected to move through our data pipeline unchanged (or mutated by a fixed rule), we can measure consistency. We need to check whether things that should be the same are the same (or within some tolerance). Spike or outlier checks also fall under the consistency umbrella, as do correlation checks between variables whose relationship you expect to be governed by a fixed formula. If we have someone’s date of birth, age, and date of record, we can check whether they’ve answered the age and DOB questions consistently - but if the consistency check fails, we’re left with another question: which datapoint is right, and which is wrong (or are they both wrong)?
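The DOB/age example can be sketched as a small consistency check. The field names and the zero-year default tolerance are assumptions for illustration; a real pipeline might allow a one-year tolerance for records taken near a birthday:

```python
from datetime import date

# Consistency check between related fields: does the reported age match
# the age implied by date of birth and the record date?

def age_from_dob(dob, as_of):
    years = as_of.year - dob.year
    # Subtract one if the birthday hasn't happened yet in the as_of year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years

def dob_age_consistent(dob, reported_age, record_date, tolerance=0):
    return abs(age_from_dob(dob, record_date) - reported_age) <= tolerance

ok = dob_age_consistent(date(1990, 6, 15), 34, date(2025, 1, 10))
bad = dob_age_consistent(date(1990, 6, 15), 40, date(2025, 1, 10))
```

Note that a failure here tells you the fields disagree, not which one is wrong - that's the accuracy question.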
Accuracy, or correctness, is exactly what it sounds like - is the data right? Seems simple - but it can be surprisingly difficult to measure! As hinted above, there’s a fair amount of overlap between consistency and correctness in the way we tend to think about data. Consistency checks look for two or more datapoints to be in agreement with each other, but they can’t catch consistently wrong data, and they can’t tell us which datapoint in a set that triggers a consistency check is the inaccurate one. Correctness requires establishing a source of truth. That could take the form of a “more trusted source” among our data sources, it could be calculated from more reliable, related datapoints, or it may require human intervention to independently verify and fill in the correct data.
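One way to sketch a correctness check, assuming a designated “more trusted source” is available as a lookup table (all names and values here are illustrative):

```python
# Accuracy check against a designated source of truth: compare each
# record's value to the trusted value and report mismatches, along with
# records that can't be verified at all.

def accuracy_report(records, trusted, key_field, value_field):
    mismatches = []     # (key, observed value, trusted value)
    unverifiable = []   # keys absent from the source of truth
    for rec in records:
        key = rec[key_field]
        if key not in trusted:
            unverifiable.append(key)
        elif rec[value_field] != trusted[key]:
            mismatches.append((key, rec[value_field], trusted[key]))
    return {"mismatches": mismatches, "unverifiable": unverifiable}

trusted_prices = {"A": 10.0, "B": 12.5}
records = [
    {"sku": "A", "price": 10.0},
    {"sku": "B", "price": 11.0},  # disagrees with the trusted source
    {"sku": "C", "price": 9.0},   # no trusted value to compare against
]
report = accuracy_report(records, trusted_prices, "sku", "price")
```

The “unverifiable” bucket is worth tracking separately: it tells you how much of the dataset your source of truth can’t vouch for at all.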
If there’s a time component to a dataset, we usually expect that data to arrive by some fixed point in order to be available downstream. Even without an explicit time component, we may expect updates to occur at least somewhat regularly. The “timeliness” dimension of data quality measures whether data population occurs by the time it’s needed, and can be used to check for staleness in the dataset. This may not be an important consideration for all types of data (e.g., historical/legacy data that is not meant to be updated on an ongoing basis), but when it matters, it really matters - so it earns its place on this list.
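A minimal staleness check might compare a dataset’s last-update timestamp against an allowed maximum age. The six-hour threshold here is an arbitrary assumption for illustration:

```python
from datetime import datetime, timedelta

# Timeliness check: flag a dataset as stale when its most recent update
# is older than an allowed maximum age.

def is_stale(last_updated, now, max_age):
    return (now - last_updated) > max_age

now = datetime(2025, 1, 10, 9, 0)
fresh = is_stale(datetime(2025, 1, 10, 6, 0), now, timedelta(hours=6))  # 3h old
stale = is_stale(datetime(2025, 1, 8, 6, 0), now, timedelta(hours=6))   # ~51h old
```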
If we have clear expectations for whether a datapoint is the correct type, format, and/or within an accepted set or range, we can assess its validity. This could be as simple as making sure date fields contain dates, ints contain integers, and so on. We often have low-level checks built in that will catch or block population of the most egregious instances of invalid data, so explicit checks that need to be added tend to focus on more nuanced definitions of what is a valid entry for a field - correlations will be limited to the range -1 to 1, age will not be negative, call direction is either “inbound” or “outbound” (but what about “Inbound” or “in bound”?), etc.
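The rules above - correlations in [-1, 1], non-negative age, a normalized call direction - can be sketched as a simple per-record validator. The field names and rules are illustrative:

```python
# Validity checks: type, range, and set membership. Lowercasing handles
# "Inbound", though a variant like "in bound" would still need an
# explicit alias map.

def validity_errors(record):
    errors = []
    corr = record.get("correlation")
    if not isinstance(corr, (int, float)) or not -1 <= corr <= 1:
        errors.append("correlation must be a number in [-1, 1]")
    age = record.get("age")
    if not isinstance(age, int) or age < 0:
        errors.append("age must be a non-negative integer")
    direction = str(record.get("call_direction", "")).strip().lower()
    if direction not in {"inbound", "outbound"}:
        errors.append("call_direction must be inbound or outbound")
    return errors

ok = validity_errors({"correlation": 0.3, "age": 41, "call_direction": "Inbound"})
bad = validity_errors({"correlation": 1.7, "age": -2, "call_direction": "sideways"})
```

Returning a list of errors rather than a single boolean makes it easy to report validity failures per field rather than just per record.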
Spending time during data product (and/or project) planning to map out data expectations along a set of quality dimensions (completeness, consistency, correctness, timeliness, validity) is a useful exercise that can help guide robust data model definitions. It produces an understanding of data quality that can be measured and reported on, and that can guide the creation of checks to maintain high quality data.
Hmmm. This is useful. And thanks for taking the plunge into Substack (and writing a guest post for Randy, which is how I found you, even though I haven't read it yet - I just had to click on a link called "Data and Tacos"!).
I'm having a problem with a dataset now and I'm not sure where it fits into your framework. I am counting climate tech companies by SIC codes and one company can be classified under multiple codes, so there's a lot of double counting (and triple, quadruple, etc.). Promiscuous categorization!
I don't have company names, just counts, so I really don't know what I'm looking at except in the general sense of "There's a lot of activity in Sector X."
Far be it from me to criticize our data overlords at the Bureau of Labor Statistics, but what kind of bad data is this, do you think?