Data Quality: The Good, The Bad, and The Ugly

Incorrect or unclean data leads to false conclusions. The time you take to understand and clean the data is vital to the quality of the results. Data quality always wins out over complex, fancy algorithms.




 

Band-aid solutions do not deal with the cause of the problem. Creating data visualisations to make the data look pretty, or applying a decision tree to unclean data, is just a waste of your time. You can create all the models in the world, but it’s no use if errors pop up one by one when you present your findings. What if your findings were taken as gospel, and the company made important decisions based on them? None of us wants to be in that uncomfortable position.


 

So what is Data Quality?

 
Data quality is the measure of how fit a data set is to serve its specific purpose and how far it can be trusted when making decisions. It is made up of characteristics such as accuracy, completeness, consistency, validity, and timeliness. Let’s briefly break these down further.

  1. Accuracy: This refers to how well the data reflects the real world. Data is only of use if it does.
  2. Completeness: A dataset with too many gaps or blanks cannot be analysed properly, so it cannot answer specific questions.
  3. Consistency: Data that is stored in one location should match, and not conflict with, the same data stored in another location.
  4. Validity: This refers to whether the data conforms to the defined business rules and regulations. It should be in the right format and fall within the right range.
  5. Timeliness: Data should be up to date and available when it is needed. Data becomes less useful and less accurate to a company as time goes on.
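
As a rough illustration, several of the characteristics above can be turned into simple checks. The sketch below uses plain Python on a made-up list of customer records. The field names, the age range of 0 to 120, and the 365-day freshness window are all assumptions for the example, not standard thresholds.

```python
from datetime import date

# Hypothetical records; every field name and value here is made up for illustration.
records = [
    {"id": 1, "email": "ana@example.com", "age": 34, "updated": date(2024, 5, 1)},
    {"id": 2, "email": None, "age": 29, "updated": date(2021, 1, 15)},
    {"id": 3, "email": "not-an-email", "age": 210, "updated": date(2024, 4, 20)},
]

# The same customers as held in a second (hypothetical) system.
billing_emails = {1: "ana@example.com", 2: "bo@example.com", 3: "not-an-email"}


def completeness(rows, field):
    """Completeness: share of rows where the field is filled in."""
    filled = sum(1 for row in rows if row[field] is not None)
    return filled / len(rows)


def is_valid(row):
    """Validity (assumed rules): email contains '@', age is between 0 and 120."""
    email_ok = row["email"] is not None and "@" in row["email"]
    age_ok = 0 <= row["age"] <= 120
    return email_ok and age_ok


def inconsistent_ids(rows, other_emails):
    """Consistency: ids whose email differs between the two systems."""
    return [row["id"] for row in rows if row["email"] != other_emails.get(row["id"])]


def stale_ids(rows, today, max_age_days=365):
    """Timeliness: ids not updated within the assumed freshness window."""
    return [row["id"] for row in rows if (today - row["updated"]).days > max_age_days]


today = date(2024, 6, 1)  # fixed date so the example is reproducible
print("Completeness of email:", completeness(records, "email"))
print("Invalid ids:", [r["id"] for r in records if not is_valid(r)])
print("Inconsistent ids:", inconsistent_ids(records, billing_emails))
print("Stale ids:", stale_ids(records, today))
```

Accuracy is left out on purpose: checking it needs a trusted reference to compare the data against, which a self-contained example cannot provide.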

 

What ensures Data Quality?

 
Specific data quality tools can be used to assess and improve the quality of the data. For example:

  1. Data Profiling: This means examining the source of the data and understanding its structure and potential use.
  2. Data Standardisation: This is the process of bringing data into a common format that allows analysts to utilise it.
  3. Monitoring: Frequent checks on the quality of data are vital. Specific tools can be put in place to detect and correct data errors.
  4. Historical and Real-time: Once data has been cleaned, analysts can apply that same data quality framework across other areas of data and applications.
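
To make the first three steps a little more concrete, here is a minimal sketch in plain Python. The column values, the list of accepted date formats, and the 10% alert threshold are assumptions chosen for the example. Dedicated data quality tools do far more than this.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw column: the same kind of date arriving in different formats.
raw_dates = ["2024-01-31", "31/01/2024", "", "2024-02-15", "Feb 20 2024", "n/a"]

# Formats we assume the sources use; extend this list for real data.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]


def profile(values):
    """Profiling: count the values, count the blanks, and tally value lengths."""
    blanks = sum(1 for v in values if v.strip() == "")
    lengths = Counter(len(v) for v in values)
    return {"count": len(values), "blank": blanks, "lengths": dict(lengths)}


def standardise(value):
    """Standardisation: return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def monitor(values, max_failure_rate=0.10):
    """Monitoring: raise an alert if too many values fail to standardise."""
    failures = sum(1 for v in values if standardise(v) is None)
    rate = failures / len(values)
    return {"failure_rate": rate, "alert": rate > max_failure_rate}


print(profile(raw_dates))
print([standardise(v) for v in raw_dates])
print(monitor(raw_dates))
```

Running a check like `monitor` on every new batch is one simple way to notice when a source starts sending data in an unexpected format.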

An example of real-time data quality in the healthcare sector is ensuring that patient data is accurate and valid. This is essential for documentation, payments, risk management, and the protection of patient data.

 

Positive Impacts of Data Quality

 

  1. Decision Making: The higher the quality of the data, the more companies and users will trust the outputs produced when making important decisions. This, in turn, lowers the risk of the company making the wrong decision.
  2. Productivity: Nobody wants to be sitting there for hours on end fixing data errors. If the correct measures are taken in the initial step, staff can focus on the next steps and other responsibilities.
  3. Targets: Quality data helps companies set accurate current and future goals. For example, the Marketing team gets a better understanding of what works and what doesn’t.
  4. Compliance: Many industries have specific guidelines to keep data private and safe from breaches or potential attacks. Failing to maintain good data quality in the finance sector can result in millions of dollars in fines, or in money laundering going undetected.

 

Negative Impacts of Bad Data Quality

 

  1. Losing to your competitors: If your competitors have better data than you, it gives them further insight, which can result in missed opportunities and potential damage to your company. Don’t let your competitors have one over on you!
  2. Revenue: Basing decisions on incorrect data can cause a loss in revenue. For example, making political decisions based on demographic data that is wrong could cause social and financial issues.
  3. Reputation: Everybody wants to improve and maintain their reputation, especially when money is involved. Decisions based on poor data can be so detrimental to a company that it could lose investors or even the business itself. People tend to remember the bad over the good.

 
 

Conclusion

 
When looking at data, ask yourself these questions:
 
1. How was the data collected?

The source of the data matters. For example, was the data collated through a government census, or was it created manually by somebody for their personal needs and uploaded to Kaggle? Collecting data from people on their commute to work, who may not be very interested, is different from sending them a web link to a survey that they can fill out in their own time.

2. What does the data represent?

Does the data represent what you or the company are looking for? Making concrete statements about the demographics of France using data collected only in Paris is inaccurate.

3. What does the Data Cleaning process look like?

There are different methods to clean data. Choosing one that suits the particular dataset or data type is important.
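
As one small illustration of matching the method to the data type, the sketch below fills gaps in a numeric column with the median and gaps in a categorical column with the most common value. Both choices, and the sample values, are assumptions made for the example. Whether filling, dropping, or flagging the gaps is right depends on the dataset.

```python
from statistics import median, mode

# Hypothetical columns with gaps (None); the values are made up.
ages = [23, None, 31, 45, None, 31]
cities = ["Paris", "Lyon", None, "Paris", "Paris", None]


def fill_numeric(values):
    """Fill gaps in a numeric column with the median of the known values."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def fill_categorical(values):
    """Fill gaps in a categorical column with the most frequent known value."""
    known = [v for v in values if v is not None]
    fill = mode(known)
    return [fill if v is None else v for v in values]


print(fill_numeric(ages))
print(fill_categorical(cities))
```

The median is used for the numeric column because it is less affected by extreme values than the mean. An average makes no sense for city names, so the categorical column needs a different rule.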

4. What are you doing to maintain Data Quality?

Investing in the correct people and infrastructure to maintain and continuously improve the quality of your data is critical in technology. 

It is always better to guard yourself against an avoidable problem than to walk right into it and spend time and effort coming up with a solution. I always say, do it properly once and you won't have to keep going back to it.

 
 
Nisha Arya is a Data Scientist and freelance Technical writer. She is particularly interested in providing Data Science career advice, tutorials, and theory-based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is benefiting, or can benefit, the longevity of human life. She is a keen learner, seeking to broaden her tech knowledge and writing skills whilst helping guide others.