Data quality dimensions are a critical tool in producing accurate, reliable data that organizations can trust. If you are wondering how much confidence you can place in the conclusions of your data analysis, applying these five dimensions of data quality may provide you with greater confidence or a migraine headache.
Assessing data quality starts with completeness. Has all of the necessary information been collected? Let’s say, for example, the parents of new students at a school were asked to complete an emergency contact information sheet that asks for details like name, address, date of birth, and an emergency contact telephone number. Following the collection process, a data analysis was performed on “Emergency Contact Telephone Numbers.” The analysis shows that 300 new students provided information using the form, but only 294 emergency contact telephone numbers were collected. In other words, that field is only 98% complete (294 of 300).
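The completeness figure above is simple arithmetic, and a minimal sketch of it in Python might look like the following. The function name `completeness` and the sample phone numbers are made up for illustration; the sketch also assumes that blank or whitespace-only entries should count as missing, which your own rules may or may not agree with.

```python
def completeness(values):
    """Return the share of entries that are actually filled in (0.0 to 1.0)."""
    if not values:
        return 0.0
    # Assumption: None, empty strings, and whitespace-only strings are "missing".
    filled = sum(1 for v in values if v is not None and str(v).strip() != "")
    return filled / len(values)


# Hypothetical data mirroring the example: 300 forms, 6 without a usable number.
phone_numbers = ["555-0100"] * 294 + [None] * 4 + ["", "  "]
print(f"{completeness(phone_numbers):.0%}")  # 98%
```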
Next is validity. Does the data conform to the requirements of its schema? If a particular data field calls for a date of birth, and there is a long string of numbers in the month field but nothing in the day or year fields, the data lacks validity. Another example is a field that asks for a social security number but does not contain exactly 9 digits; that data also lacks validity.
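Both validity examples can be checked mechanically. The sketch below is one possible way to do it, not a complete validator: the function names are hypothetical, the social security check only counts digits after stripping dashes, and the date check only confirms that the three parts form a real calendar date.

```python
import re
from datetime import date


def is_valid_ssn(value):
    """True when the string holds exactly nine digits, ignoring dashes."""
    return re.fullmatch(r"\d{9}", value.replace("-", "")) is not None


def is_valid_birth_date(year, month, day):
    """True when the three parts form a real calendar date."""
    try:
        date(int(year), int(month), int(day))
        return True
    except (TypeError, ValueError):
        # Empty or missing parts, or out-of-range numbers, all land here.
        return False


print(is_valid_ssn("123-45-6789"))               # True
print(is_valid_ssn("12345678"))                  # False
print(is_valid_birth_date("", "20150412", ""))   # False
```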
Third is consistency of data. If a unique key ties back to multiple, conflicting values of a particular data field, the data is not consistent. For example, a social security number should not be tied to two different individuals. If it is, then the data is not consistent.
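One way to surface this kind of inconsistency is to group records by the key and look for keys that map to more than one value. The following is a minimal sketch; the helper `find_conflicting_keys`, the field names, and the sample records are all invented for illustration.

```python
from collections import defaultdict


def find_conflicting_keys(records, key_field, value_field):
    """Return each key that is tied to more than one distinct value."""
    seen = defaultdict(set)
    for record in records:
        seen[record[key_field]].add(record[value_field])
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical records: the first two share a social security number.
records = [
    {"ssn": "123456789", "name": "Ana Ruiz"},
    {"ssn": "123456789", "name": "Ben Cole"},
    {"ssn": "987654321", "name": "Cy Dunn"},
]
conflicts = find_conflicting_keys(records, "ssn", "name")
print(sorted(conflicts))  # ['123456789']
```

A real dataset would likely need some normalization first (for example, trimming whitespace or unifying name formats), or the same person spelled two ways would be flagged as a conflict.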
Fourth, and closely related, is consistency of information. Do the data fields make sense in the context of other data fields? For instance, an employee’s termination date can’t be before that employee’s hire date. This particular data quality dimension is usually domain specific and will require domain experts to define the rules for an algorithmic flagging process.
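The hire and termination date rule is an example of a cross-field check that can be written down once a domain expert has stated it. Here is a small sketch of that single rule; the function name is hypothetical, and it assumes a missing termination date simply means the person is still employed.

```python
from datetime import date


def termination_precedes_hire(hire_date, termination_date):
    """Flag a record whose termination date falls before its hire date."""
    if termination_date is None:
        # Assumption: no termination date means the person is still employed.
        return False
    return termination_date < hire_date


print(termination_precedes_hire(date(2020, 3, 1), date(2019, 12, 31)))  # True
print(termination_precedes_hire(date(2020, 3, 1), None))                # False
```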
Finally, there is anomaly detection. There are two types here: unlikely or infrequent occurrences, and impossible values. The first type makes perfect sense from a data and information consistency standpoint, but is rare and warrants a second look. In a payroll dataset, for example, one employee receives an extra paycheck every other month, but no other employee does. The second and more extreme type is a value that is impossible. For instance, an employee has logged 800 hours of service in a given month, when a 30-day month contains only 720 hours and even a 31-day month contains only 744.
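The impossible-value case lends itself to a hard limit that can be computed rather than guessed. The sketch below derives the number of hours in a month from the calendar and flags anything outside that range; the function names are hypothetical, and spotting the merely unlikely cases (such as the extra paycheck) would need a different, more statistical approach.

```python
import calendar


def hours_in_month(year, month):
    """Total clock hours in the given month (days in the month times 24)."""
    days = calendar.monthrange(year, month)[1]
    return days * 24


def is_impossible_hours(logged_hours, year, month):
    """True when logged hours are negative or exceed the hours in the month."""
    return logged_hours < 0 or logged_hours > hours_in_month(year, month)


print(hours_in_month(2023, 4))            # 720
print(is_impossible_hours(800, 2023, 4))  # True
```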
These five data quality dimensions are critical to the accuracy of your data. When performing an analysis, your organization should consider these dimensions and adjust its conclusions accordingly. It should also consider the financial impact that poor data quality may have.