Dimensions of Data Quality

Jan 9, 2023

"Bad data" is like obscenity - we know it when we see it. But how do we define it?

3 Comments

Jul 12, 2023

One of my favorite validity checks is for numeric values disguised as character data, particularly currencies such as $1,234, or even just a numeric matrix that has been mated with a character object and had to undergo type conversion.
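
A check like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`looks_numeric`, `to_number`, `disguised_numeric_share`) and the set of currency symbols are hypothetical, and the pattern assumes US-style formatting (comma as thousands separator, period as decimal point).

```python
import re

# Assumed format: optional minus sign, optional currency symbol, then either
# comma-grouped digits (1,234) or plain digits (1234), then optional decimals.
NUMERIC_LIKE = re.compile(r"\s*-?[$€£]?\s*(\d{1,3}(,\d{3})+|\d+)(\.\d+)?\s*")


def looks_numeric(text):
    """Return True if a string is really a number wearing a string costume."""
    return NUMERIC_LIKE.fullmatch(text) is not None


def to_number(text):
    """Strip currency symbols, commas, and whitespace, then convert to float."""
    return float(re.sub(r"[$€£,\s]", "", text))


def disguised_numeric_share(values):
    """Fraction of the non-empty strings in a column that look numeric."""
    strings = [v for v in values if isinstance(v, str) and v.strip()]
    if not strings:
        return 0.0
    return sum(looks_numeric(v) for v in strings) / len(strings)


# Hypothetical column that was read in as text.
column = ["$1,234", "56", "N/A", "7.5"]
print(disguised_numeric_share(column))
print([to_number(v) for v in column if looks_numeric(v)])
```

A high share suggests the column should have been numeric in the first place; the leftover non-matching values (here "N/A") are the ones to inspect by hand before converting.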

The 435

Jan 12, 2023

Hmmm. This is useful. And thanks for taking the plunge into Substack (and writing a guest post for Randy, which is how I found you, even though I haven't read it yet - I just had to click on a link called "Data and Tacos"!).

I'm having a problem with a dataset now and I'm not sure where it fits into your framework. I am counting climate tech companies by SIC codes and one company can be classified under multiple codes, so there's a lot of double counting (and triple, quadruple, etc.). Promiscuous categorization!

I don't have company names, just counts, so I really don't know what I'm looking at except in the general sense of "There's a lot of activity in Sector X."

Far be it from me to criticize our data overlords at the Bureau of Labor Statistics, but what kind of bad data is this, do you think?

Amanda Alvarez

Jan 13, 2023

That's an interesting question, and it brings up an important point that I didn't really spell out: measuring against this set of dimensions relies, to some extent, on the assumption that the data model is appropriate for your use case.

So with your example (as I understand it), if it's expected that one company should only have a single code, but has multiples, that'd be a completeness (overcompleteness) failure in my framework. But if it *is* expected/allowed for multiple codes to be associated with a single company (but you don't have any way to uniquely identify companies, by name/ID/etc), I'd characterize the problem as "the data model itself is incomplete for your use case".
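
To make the distinction concrete, here is a minimal sketch of what the check could look like if a company identifier were available. Everything in it is hypothetical: the rows, the `company_id` values, and the helper names are invented for illustration, and the SIC codes are just placeholder values. With only per-code counts and no identifier, none of this can be computed, which is the sense in which the data model is incomplete for the use case.

```python
from collections import defaultdict

# Hypothetical rows of (company_id, sic_code). The dataset described in the
# question has no company_id column, so this is an assumed, richer data model.
rows = [
    ("A", "4911"),
    ("A", "3674"),
    ("B", "4911"),
    ("C", "3674"),
    ("C", "4911"),
    ("C", "8711"),
]


def codes_per_company(rows):
    """Map each company to the set of codes it is classified under."""
    codes = defaultdict(set)
    for company_id, sic_code in rows:
        codes[company_id].add(sic_code)
    return dict(codes)


def multi_code_companies(rows):
    """Companies with more than one code (a failure only if one is expected)."""
    return {c: sorted(s) for c, s in codes_per_company(rows).items() if len(s) > 1}


def count_by_code(rows):
    """Distinct companies per code: the only view the count-only data offers."""
    counts = defaultdict(int)
    for _company_id, sic_code in set(rows):
        counts[sic_code] += 1
    return dict(counts)


naive_total = sum(count_by_code(rows).values())   # sums the per-code counts
distinct_total = len(codes_per_company(rows))     # needs a company identifier
print(multi_code_companies(rows))
print(naive_total, distinct_total, naive_total - distinct_total)
```

The gap between the naive total and the distinct total is the double counting; whether it counts as a quality failure or an expected feature depends on whether the model allows multiple codes per company.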
