May 26, 2020

As interest in machine learning and artificial intelligence grows, companies regularly find themselves confronted with the dissatisfying quality of their data. This discovery is made either early on, through a structured approach, or much later, when poor data quality is identified as the root cause of poorly performing models. In either case, the next step should be a methodical exploration of the available data, followed by a series of steps to remedy the identified issues.

In this article, I will give you an overview of common data quality issues and strategies to deal with them.

“Dirty Data”: Definition and Causes

Dirty data can take just as many forms as it has possible causes. In most cases, data is already tainted when it is collected and entered into database systems. Let’s say, for example, that John and Bob are tasked with taking stock in a screw storage facility. John counts the screws and identifies them by measuring with his metric calipers. Bob, however, just weighs the screws instead of counting them, and uses his trusty imperial ruler to identify them. Both enter their data into the same entry form in their apps, without specifying the method or unit of their data points. The resulting data will have uniformity issues and, as a consequence, accuracy issues. If they both happen to process the same box of screws independently, we get contradictory data and can add inconsistency issues to our list of problems. Finally, a box of screws might have fallen behind a shelf and been overlooked. Now our data is not only flawed but also incomplete.

In addition to these systematic data collection errors, there are a number of ways in which accurately collected data can end up tainted. These include simple spelling mistakes, inconsistent abbreviations, or entry errors such as negative values where only positive values make sense.

Another cause of data quality deterioration is fundamental alterations of the data storage systems. Take, for example, the merging of two previously independent systems into one overarching database. If we think back to John and Bob, any existing inconsistency issues are now compounded. Additional issues might arise from differing data formatting conventions or even character encodings.

Lastly, data quality may deteriorate on its own over time, simply because records become outdated. This regularly happens in CRM systems as people move, switch phone numbers, and so on.

Perils of dirty data

As previously mentioned, data quality issues often become evident once the data is put to use in a data science context. The vast majority of machine learning methods learn by continuously evaluating themselves against reality, and for many companies this is the first time data quality is actually measured, albeit implicitly.

Unfortunately, this means that companies that haven’t reached this level of data literacy yet are often unaware of the impact their dirty data has on their business. In the best-case scenario, the faulty data simply isn’t used; in the worst case, it influences data-driven decisions that are already being made.

It is estimated that, on average, at least 25% of the data in CRM systems is in some way inaccurate or outdated (based on studies showing 40% of people changing phone numbers each year and 37% changing mail addresses).

Not only does faulty data offer no informational value, it also devalues the intact data it is embedded in. Returning to the machine learning context: the number of accurate records needs to severely outweigh the errors in order for an AI model to be able to weed out the errors as outliers.

Data Cleaning Methods

The methods of data cleaning range from (close to) fully automated to predominantly manual, closely correlating with the amount of knowledge required for each operation.

The methods described below can be employed to perform a one-time cleaning of a database, but they can also be repurposed as continuous mechanisms for data integrity monitoring.

Column Level: Data Profiling

Data profiling is the elementary method of analyzing single data fields. As the name indicates, data profiling generates a model for “intact” data and highlights entries that do not conform to this model. While this may sound complicated, it can be as intuitive as counting the occurrence of each value in a column. Let’s say we count 4421 occurrences of “Mercedes”, 4823 occurrences of “Audi”, 3 occurrences of “Cermedes” and 1 “Aidu”. Obviously, in this case of sporadic typing mistakes, the problematic entries are readily apparent.
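As a minimal sketch of this counting approach, the following Python snippet (using pandas, with a hypothetical brand column and an arbitrarily chosen rarity threshold) surfaces rare values as candidates for review:

```python
import pandas as pd

# Hypothetical column of car brands with a few sporadic typos
brands = pd.Series(
    ["Mercedes"] * 4421 + ["Audi"] * 4823 + ["Cermedes"] * 3 + ["Aidu"],
    name="brand",
)

counts = brands.value_counts()

# Values occurring in fewer than 0.1% of all records are flagged as suspicious
threshold = 0.001 * len(brands)
suspicious = counts[counts < threshold]
print(suspicious)  # Cermedes: 3, Aidu: 1
```

In practice, the flagged values would then be reviewed and, where appropriate, mapped back to their correct counterparts rather than deleted outright.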

This approach also works on non-categorical data. It may, for example, spot negative prices in product databases, or dates in the past where only future dates make sense.

The big advantage data profiling offers is that it does not require any knowledge beyond the column it is applied to. It is, however, beneficial to specify additional, explicit rules such as valid value ranges or regular expression patterns.
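To illustrate such explicit rules, here is a small sketch, again using pandas; the column names and the article number pattern are made up for the example:

```python
import pandas as pd

# Hypothetical product table
products = pd.DataFrame({
    "article_no": ["A-1001", "A-1002", "1003", "A-10X4"],
    "price":      [19.99, -4.50, 7.25, 12.00],
})

# Rule 1: prices must lie in a valid range (here: strictly positive, at most 10,000)
bad_price = ~products["price"].between(0, 10_000, inclusive="right")

# Rule 2: article numbers must match an assumed pattern "A-" followed by four digits
bad_article = ~products["article_no"].str.fullmatch(r"A-\d{4}")

# Records violating either rule are candidates for review
print(products[bad_price | bad_article])
```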

Caveat: Obviously, data profiling is heavily reliant on most of the records in the dataset actually being correct in order to deduce an accurate representation.

Table Level: Integrity

Going beyond data profiling, we can use our knowledge of processes and workflows to formulate more sophisticated constraints. Given a table of fulfilled orders for example, we can specify that the delivery date cannot be earlier than the order date.
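A minimal sketch of such a constraint check, assuming a pandas DataFrame with hypothetical order_date and delivery_date columns, might look like this:

```python
import pandas as pd

# Hypothetical table of fulfilled orders
orders = pd.DataFrame({
    "order_id":      [1, 2, 3],
    "order_date":    pd.to_datetime(["2020-01-05", "2020-02-10", "2020-03-01"]),
    "delivery_date": pd.to_datetime(["2020-01-08", "2020-02-07", "2020-03-04"]),
})

# Constraint: an order cannot be delivered before it was placed
violations = orders[orders["delivery_date"] < orders["order_date"]]
print(violations)  # order 2 violates the constraint
```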

Rules like this have the big benefit of being continuously applicable: once the initial cleaning is performed, they can remain in place and prevent future violations.

Amassing a comprehensive set of constraints to guarantee data integrity requires equally comprehensive knowledge of the data. As such, implementing this method entails significantly more work than largely automated data profiling.

Database Level: Relations

As a third “level”, after ensuring our tables are in an adequate state of quality, we can take a look at the overarching database model. At its core, this is an extension of the integrity rules mentioned in the previous section, in the sense that we can now apply these rules across multiple tables.

To stick with our product-order table example, we would expect any product numbers in the orders table to correspond to a product in the products table. If this rule is violated, we either have incomplete product data, or erroneous order data.
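Such a cross-table check can be sketched as follows, again with hypothetical tables and column names:

```python
import pandas as pd

# Hypothetical product and order tables
products = pd.DataFrame({"product_no": ["P-01", "P-02", "P-03"]})
orders = pd.DataFrame({
    "order_id":   [1, 2, 3],
    "product_no": ["P-01", "P-04", "P-02"],
})

# Orders referencing a product number that does not exist in the product table
orphaned = orders[~orders["product_no"].isin(products["product_no"])]
print(orphaned)  # order 2 points to the unknown product "P-04"
```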

Tools of the trade

Data cleaning solutions can be divided into three major groups. Future blog posts will shed additional light on their particular advantages and use cases, but I would like to give a little preview here.

The first group comprises stand-alone software solutions that import data from nearly arbitrary sources. These tools usually offer file import functionality as well as connectors for several database systems or even cloud platforms. The cleaning process itself is performed on a copy of the data within the tool, requiring additional effort to be put into applying the findings to the original data. Unfortunately, these products are mostly unable to construct analyses across multiple tables, thereby limiting their usefulness in extensive data-cleaning operations.

An example of such a tool is OpenRefine, formerly known as Google Refine.

The second group consists of tools that are embedded within a data ecosystem such as Informatica or SAP. While the deep integration with the data is advantageous in the sense that data import and export are trivial, it is also limiting in terms of feature set and customization. One advantage over the aforementioned standalone solutions is the ability to perform cross-table cleaning operations.

Custom project-based solutions make up the third group. Similarly to readily available stand-alone software, utilizing custom solutions requires an export of the original data, a detailed analysis, and a strategy to reintegrate the findings into the source system. However, the ability to incorporate domain and process knowledge into the analysis distinguishes custom solutions from out-of-the-box software like OpenRefine. As an additional benefit, extensive data quality projects tend to uncover weaknesses in data collection and maintenance processes, as well as identify possible improvements.

Closing Thoughts

Achieving and maintaining high data quality is in the best interest of any company looking to extract value from its data. Faulty data can have a disproportionate negative effect on business insights, predictive models, or immediate economic performance. The methods I described in this article are the stepping stones to a one-time consolidated data basis; maintaining this state, however, also requires improving data collection and entry methods. In future posts, we will share some insight into how these techniques are applied in practice, what the attached processes can look like, and how to make data quality a design element in your data strategy.


AUTHOR

Max Uppenkamp

Max Uppenkamp has been a Data Scientist at INFORM since 2019. After previously working in Natural Language Processing and Text Mining, he is now engaged in the machine-learning-supported optimization of processes. In addition to supporting customer projects, he translates the knowledge gained into practice-oriented products and solutions.