My last post discussed some of the ways that data errors weaken the reliability of conclusions drawn from noisy data. Even measures of the uncertainty in those conclusions (such as statistical confidence intervals) can be inaccurate. This raises the question of how to assess data for such errors, and at what cost.
The Data Management Association International (DAMA) proposes several dimensions of data quality, some of which may be feasible to review as a means of detecting data errors. Two of these dimensions, Missing data and Duplicate data, are likely the easiest to detect. Sometimes Missing and Duplicate data arise from the same underlying errors. When two or more data sources are combined based on matches between common factors, bad data in some of those factors can result in partial duplicates (for the factors that do match) and missing values for the unique factors from each source. To take a simple example:
Source 1:

| Last | First | Position |
| --- | --- | --- |
| Smith | Sally | Sales |

Source 2:

| Last | First | Wage |
| --- | --- | --- |
| Smith | Sally | $35.50 |
Correct matching on Last and First leads to one record with both Position and Wage:
| Last | First | Position | Wage |
| --- | --- | --- | --- |
| Smith | Sally | Sales | $35.50 |
If Source 2 is corrupted in First Name:
| Last | First | Wage |
| --- | --- | --- |
| Smith | Saly | $35.50 |
The failed match caused by the error in First Name duplicates the Last Name and leaves missing values in Position and Wage:
| Last | First | Position | Wage |
| --- | --- | --- | --- |
| Smith | Sally | Sales | - |
| Smith | Saly | - | $35.50 |
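This failure mode is easy to reproduce. As a minimal sketch using pandas, with the two sources hard-coded for illustration, an outer merge on the name fields yields exactly the partial-duplicate pattern above:

```python
import pandas as pd

# Source 1 holds positions; Source 2 holds wages, with a corrupted First Name ("Saly").
source1 = pd.DataFrame({"Last": ["Smith"], "First": ["Sally"], "Position": ["Sales"]})
source2 = pd.DataFrame({"Last": ["Smith"], "First": ["Saly"], "Wage": [35.50]})

# An outer merge keeps unmatched rows from both sides, so the failed match
# produces two partial records: Last is duplicated, and Position and Wage
# are each missing once.
merged = source1.merge(source2, on=["Last", "First"], how="outer")
print(merged)
```

A run of this sketch returns two rows instead of the single correct record, which is the signature worth scanning for in your own merged data.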
Evidence of such Missing data may lie outside the subset of data relied upon for the analysis. One reason some data were excluded from the analysis may be the prevalence of Missing values. So Missing values in excluded data should be treated as potential indicators of duplicates in the data you do rely upon.
Data Validity is a third data quality dimension. Like Missing and Duplicate values, Validity is often relatively straightforward to assess: are the data values in acceptable ranges? These ranges can be defined by information from other sources that reflects minimum and maximum values (for wages, prices, quantities, dates) or lists of recognizable values (locations, zip codes, departments, product codes, seniority levels, etc.). It can also be helpful to consider validity across sets of variables. For instance, if certain departments exist at specific locations but not others, one could check that department/location pairings are consistent.
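Both kinds of validity checks are simple to script. A minimal sketch, where the wage bounds and the set of allowed department/location pairs are purely hypothetical stand-ins for whatever your outside reference sources specify:

```python
import pandas as pd

# Hypothetical records to validate; the bounds and pairings below are illustrative.
df = pd.DataFrame({
    "Wage": [35.50, 12.00, 250.00],
    "Department": ["Sales", "Sales", "Engineering"],
    "Location": ["Austin", "Austin", "Austin"],
})

# Range check: flag wages outside an assumed plausible band.
wage_ok = df["Wage"].between(15.00, 150.00)

# Cross-variable check: only certain department/location pairs are valid.
valid_pairs = {("Sales", "Austin"), ("Engineering", "Boston")}
pair_ok = [
    (dept, loc) in valid_pairs
    for dept, loc in zip(df["Department"], df["Location"])
]

print(wage_ok.tolist())  # which rows pass the range check
print(pair_ok)           # which rows pass the pairing check
```

Rows failing either check are candidates for closer review rather than automatic deletion, since the reference ranges themselves can be wrong or stale.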
A fourth data quality dimension is Timeliness. Timeliness asks how relevant the time frame over which the data are defined is to the time frame for which conclusions and predictions are being made. The value of data regarding sales leads likely depreciates quickly. Conversely, for retrospective analyses, some data can be too recent: some data are updated to reflect current conditions with no readily accessible evidence of past values or of when those values were last updated. As these examples suggest, while Missing, Duplicate, and Invalid data are typically identified through an explicit review of the data themselves, Timeliness can often be assessed by reviewing data definitions and sources relative to the question or decision those data are being used to address.
These data quality dimensions can be summarized as Data Consistency. Data are consistent if they are comparable across different sources. Missing and Duplicate records can arise when some data are inconsistent across combined sources. Invalid records arise when they are inconsistent relative to outside sources specifying acceptable values. Data suffer from inconsistency in timing or relevance when there is a mismatch between the time periods that records refer to and the time periods that are intended to be predicted.
Data Consistency, though, also has another, perhaps more direct interpretation when data are processed or cleaned in multiple stages. In this context, Data Consistency can be further assessed by comparing original sources with the final, cleaned data that form the inputs to an analysis. Here, Data Consistency asks whether values were altered in unanticipated ways along the data cleaning process.
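One way to operationalize this comparison is to align the raw and cleaned data on a shared key and report rows whose values differ. A minimal sketch, with hypothetical column names and a deliberately planted alteration (a wage whose digits were transposed during cleaning):

```python
import pandas as pd

# Illustrative raw extract and its cleaned counterpart; "id" and "Wage"
# are assumed column names, not from any particular system.
raw = pd.DataFrame({"id": [1, 2, 3], "Wage": [35.5, 40.0, 12.0]})
cleaned = pd.DataFrame({"id": [1, 2, 3], "Wage": [35.5, 40.0, 21.0]})  # 12.0 became 21.0

# Align on the shared key and surface rows whose values changed in cleaning.
comparison = raw.merge(cleaned, on="id", suffixes=("_raw", "_cleaned"))
changed = comparison[comparison["Wage_raw"] != comparison["Wage_cleaned"]]
print(changed)
```

Some of the flagged changes will be intended corrections; the point of the report is to confirm that each change was anticipated, not to forbid changes.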
While all of these data quality dimensions are valuable, they are likely costly to explore. Exhaustively checking for errors throughout all of the data that your analysis relies upon may not be feasible, let alone an efficient use of resources. Further, since checking for data errors takes time, checking data for Missing values, Duplicates, Invalid entries, Timeliness, and data cleaning Consistency can itself degrade the Timeliness of the data.
It makes sense then to look for clues as to where data may be compromised. The results of your analysis can provide clues. Previously, I noted that when the measured impact of an input is implausibly weak, this could be a sign that measurement error affects that input. Conversely, if the measured impact of an input is implausibly strong, measurement error may affect another input in the analysis.
Before committing all your resources to an expensive search for errors, first review your results. As a general rule, it makes sense to invest in alternative, additional analyses. Conducting additional analyses allows you to assess the robustness of your conclusions in ways that are not captured by standard measures of statistical precision (e.g., confidence intervals). For example, if conducting a multiple regression analysis, consider additional regressions based on different assumptions (e.g., predicting quantitative outcomes, like customer spend, vs. qualitative outcomes, like repeat purchase). Examine relationships on subsets of the data to see whether conclusions hold up (e.g., within the same region or establishment, for the same job type or product, or from one source data system vs. another). If conclusions continue to hold across different regressions and/or various subsets of your data, then it is less likely that data errors are distorting those conclusions. If conclusions do vary across analyses, having multiple results can help identify the subsets of observations that play a disproportionate role in changing those results. Answering such questions can help focus resources for more efficient data error checking.
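As a sketch of the subset idea, the following simulates data in which the true relationship is identical across two hypothetical regions, then fits the same simple regression overall and within each region. In real data, a slope that shifts sharply in one subset would flag that subset for closer error checking:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the true slope is 2 in both (hypothetical) regions.
n = 200
region = np.repeat(["North", "South"], n // 2)
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

def slope(x, y):
    # Least-squares slope of y on x (degree-1 fit includes an intercept).
    return np.polyfit(x, y, 1)[0]

# Compare the overall estimate with the estimate on each subset.
overall = slope(x, y)
by_region = {r: slope(x[region == r], y[region == r]) for r in ["North", "South"]}
print(round(overall, 2), {r: round(s, 2) for r, s in by_region.items()})
```

Here the subset slopes agree with the overall slope up to sampling noise, which is the pattern one hopes to see; a subset whose estimate diverges well beyond that noise is where data errors may be concentrated.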
Like checking for data errors, conducting additional analyses costs time and money. The takeaway, then, is that the costs of the more direct assessments for data errors (along the dimensions described above) should be weighed against the cost of such additional analyses. Some diversification of resources across these two avenues is likely prudent. Of course, both of these costs (of additional analysis and/or additional error checking) also need to be weighed against the stakes involved in the ultimate business decision.