Data Quality is as Important as Data Quantity

Stephen DeAngelis

December 10, 2018

Although big data is old news, it continues to make headlines. Mike Wood (@Legalmorning), founder of Legalmorning.com, writes, “Big data is here to stay and we need to get used to it. In fact, so many companies are now using it that they gain an advantage over those who don’t. … Yes, we are all sick of hearing these words, but let’s stop kidding ourselves. It’s not just the latest buzz phrase, it’s here to stay.”[1] Pat Sullivan, Managing Director at the Accenture Oracle Business Group, adds, “More than ever before, organizations of all sizes are turning to data to inform strategy and drive growth. In fact, 82 percent of business leaders say their organizations are increasingly using data to drive critical and automated decision-making at scale.”[2] Like many buzzwords, the topic of big data is too broad to discuss in toto. Subject matter experts often write about all of the “Vs” of big data; namely, Volume (there is lots of it), Velocity (it is generated at an amazing pace), Variety (it comes in structured and unstructured forms), Veracity (not all data is accurate), Value (data is the new gold), Vulnerability (database breaches occur on a regular basis), and Virtue (the ethical use of big data). In this article, I want to discuss an area that falls under the veracity heading: data quality. Sullivan notes, “[As the amount of data being generated increases], the extent of the damage that can be caused by inaccurate or manipulated information grows. Incorrect or falsified data threatens to compromise the insights that companies rely on to plan, operate, and grow. This data veracity challenge is one that most businesses have yet to come to grips with.”

The importance of data quality

Adam Robinson (@AdamRobinsonCDM), Director of Marketing at Cerasis, notes, “High-quality data will ensure more efficiency in driving a company’s success because of the dependence on fact-based decisions, instead of habitual or human intuition.”[3] The most important part of that assertion is Robinson’s emphasis on high-quality data. “One of the most important aspects of data management,” writes Bob Violino (@BobViolino), “is ensuring data quality. Without this, capabilities such as machine learning and advanced analytics might yield faulty results.”[4] The World Economic Forum has declared data a resource on the level of gold or oil. But data is only valuable if the insights it contains are accurate and actionable. The old saying “garbage in, garbage out” sums up the challenge. If that logic is not clear enough, Robinson offers five reasons you should be concerned about data quality. They are:

  • Completeness: “Ensuring there are no gaps in the data from what was supposed to be collected and what was actually collected.” Obviously, incomplete data can result in skewed or misleading results.
  • Consistency: “The types of data must align with the expected versions of the data being collected.” We’ve all heard people complain when comparisons aren’t consistent (e.g., “You’re trying to compare apples to oranges”). Robinson explains it is important to collect the right data in the expected format. It makes things easier for everybody.
  • Accuracy: “Data collected is correct, relevant and accurately represents what it should.” The importance of data accuracy can’t be overemphasized. Because today’s data comes in both structured and unstructured formats, data accuracy is a big challenge.
  • Validity: “Validity is derived from the process instead of the final result.” Invalid data, like inaccurate data, leads to faulty results and potentially disastrous consequences. Robinson explains, “When there is a need to fix invalid data, more often than not, there is an issue with the process rather than the results. This makes it a little trickier to resolve.”
  • Timeliness: “The data should be received at the expected time in order for the information to be utilized efficiently.” Not all data needs to be real-time; but, understanding how timely data must be in order to achieve desired results is important.
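Four of Robinson’s five dimensions can be checked mechanically against a dataset; validity, as he notes, lives in the collection process rather than the final result, so it is not scored here. The sketch below is purely illustrative: the record layout, field names, and plausibility thresholds are my own assumptions, not anything prescribed by Robinson or the article.

```python
from datetime import datetime, timedelta

# Hypothetical sensor readings; the schema and thresholds are illustrative only.
records = [
    {"id": 1, "temp_c": 21.5, "recorded_at": datetime(2018, 12, 10, 9, 0)},
    {"id": 2, "temp_c": None, "recorded_at": datetime(2018, 12, 10, 9, 5)},    # gap in collection
    {"id": 3, "temp_c": "22.1", "recorded_at": datetime(2018, 12, 10, 9, 10)}, # wrong type
    {"id": 4, "temp_c": 350.0, "recorded_at": datetime(2018, 12, 10, 9, 15)},  # implausible value
]

def quality_report(rows, now, max_age=timedelta(hours=1)):
    """Score four of Robinson's dimensions as pass ratios between 0.0 and 1.0.

    Validity is omitted: it concerns the collection process itself and
    cannot be judged from the resulting records alone.
    """
    total = len(rows)
    # Completeness: no gaps between what should have been collected and what was.
    complete = sum(r["temp_c"] is not None for r in rows)
    # Consistency: the value arrives in the expected format (a float here).
    consistent = sum(isinstance(r["temp_c"], float) for r in rows)
    # Accuracy: the value is plausible for what it claims to represent.
    accurate = sum(
        isinstance(r["temp_c"], float) and -50.0 <= r["temp_c"] <= 60.0
        for r in rows
    )
    # Timeliness: the data arrived recently enough to be acted on.
    timely = sum(now - r["recorded_at"] <= max_age for r in rows)
    return {
        "completeness": complete / total,
        "consistency": consistent / total,
        "accuracy": accurate / total,
        "timeliness": timely / total,
    }

report = quality_report(records, now=datetime(2018, 12, 10, 10, 0))
```

In this toy run, three of four records have a value at all (completeness 0.75), two are in the expected format (consistency 0.5), only one is both well-formed and plausible (accuracy 0.25), and all arrived on time (timeliness 1.0) — a reminder that the dimensions fail independently and must be checked separately.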

Wilfried Lemahieu, Seppe vanden Broucke, and Bart Baesens, academics from KU Leuven (Belgium), note, “Data quality is often defined as ‘fitness for use,’ which implies the relative nature of the concept.”[5] Nevertheless, fitness for use is a good way of focusing on data quality and doing so from the outset will enhance the chances of successful results in the end.

Improving data quality

Lemahieu, vanden Broucke, and Baesens insist data quality begins with people rather than data sources. Hiring the right person (or team) is essential to making quality data a priority. They assert a data quality team should contain the expertise of an information architect, a database designer, a data owner, a data steward, a database administrator, and a data scientist. While that may sound daunting and expensive, they note, “Depending upon the size of the database and the company, multiple profiles may be merged into one job description.” Turning to the data, Michele Goetz (@Mgoetz_FORR), a principal analyst at Forrester Research, offers four best practices to enhance data quality.[6] They are:

Focus on data value. Goetz explains, “It’s all too easy to equate data quality with data cleansing, which has focused efforts on the wrong thing — data elements over data value. … If rules, services, and processes are too granularly focused on data elements and records, data quality lacks relevancy.” Obviously, data value is subjective and unique to each business. That’s why understanding what you want to obtain from the data is so critical to assessing what data needs to be collected and analyzed.

Link to processes. “Don’t assume data quality only happens through data management processes and services running at the database and integration layers,” Goetz states. “Every action, interaction, and consumption of data improves data relevancy and value. Operational processes are not only there to inform data quality relevancy and governance, they are also the points at which data quality services run.” Linking data and processes often relies on cognitive technologies that can gather, integrate, and analyze both structured and unstructured data. Goetz adds, “We don’t always think of analytics as having processes, but the way data is searched, gathered, aggregated, and prepared is a process in context of finding insights.”

Responsibility over ownership. The adage “knowledge is power” has resulted in siloed data and information turf wars. Goetz notes, “Ownership is a loaded term when the subject of data governance comes up, because it leads to finger pointing and centralization.” An organization will never achieve alignment if different parts of the organization are working from different versions of the truth. Goetz adds, “In a business and system landscape that requires sharing, combining and recombining data to get the most value from it, ownership gives way to responsibility. Data quality is a team sport. Data quality is a cultural imperative.”

Trust and confidence. Data quality and trust enjoy a symbiotic relationship. According to Goetz, those qualities go beyond data and affect people dealing with the data as well (i.e., the data has to be collected, stored, and used ethically and legally). In the end, Goetz concludes, “What matters to the business is that data just works. Metrics, including reports and dashboards, need to demonstrate that data is working for the business.”

Concluding thoughts

Lemahieu, vanden Broucke, and Baesens conclude, “The magnitude of data quality problems is continuously being exacerbated by the exponential increase in the size of databases. This certainly qualifies data quality management as one of the most important business challenges in today’s data based economy. … Bad data results in bad decisions, even with the best technology available. Decisions made based on useless data have cost companies billions of dollars.” Most analysts agree tomorrow’s successful companies will be data-driven; however, get the data wrong and companies will be driven out of business. That’s why data quality is as important as data quantity.

Footnotes
[1] Mike Wood, “Data Delirium: Why and How You Should Embrace Big Data,” Business.com, 22 February 2017.
[2] Pat Sullivan, “Data veracity challenge puts spotlight on trust,” Information Management, 10 August 201.
[3] Adam Robinson, “The 5 Key Reasons Why Data Quality Is So Important,” Cerasis, 29 June 2017.
[4] Bob Violino, “4 steps organizations can take to enhance the quality of data,” Information Management, 30 October 2018.
[5] Wilfried Lemahieu, Seppe vanden Broucke, and Bart Baesens, “Building the ideal data quality team starts with these roles,” Information Management, 25 September 2018.