Avoid the Adventurist Trap: Build an Automated Data Quality Testing System
This is the final post of our blog series exploring data quality management and the problems that arise from taking data quality for granted. In this post, I want to provide some practical examples of how to avoid data quality issues and some advice on how you can approach building your own automated data quality assurance (QA) system.
If you’re landing on this blog post, I encourage you to read my three previous posts: “Avoid the Adventurist Trap: Don’t Take Data Quality for Granted,” “Avoid the Adventurist Trap: The Cost of Poor Data Quality,” and “Avoid the Adventurist Trap: ‘Set It and Check It’ Culture.”
How to Avoid Data Quality Issues
Now that you understand the cost of a “set it and forget it” culture and the value of a “set it and check it” culture, it’s time to review some practical examples of avoiding data quality issues that’ll help you develop data quality management within your organization.
Benchmark Against Multiple Other Systems
It’s really important to have something to benchmark against, such as another digital analytics system (e.g., standard Google Analytics) or the output from your e-commerce platform. Some of our clients have both Google Analytics 360 and Adobe Analytics, and despite what everyone will tell you, it’s possible to get these systems to agree on some metrics, such as transactions and revenue. Across multiple brands, each with its own report suites and Google Analytics properties, we maintain less than 1% variance in transactions and revenue, which is extremely valuable to us!
We regularly run reports comparing the metrics between the two analytics platforms, an activity that most in our industry (including Adobe and Google) will tell you is a fool’s errand. It’s important to be able to spot issues as soon as they arise, and performing regular comparisons is a great way to do this. There’s an excellent R repository written by Tim Wilson that produces a Google Analytics vs. Adobe Analytics comparison report “out of the box” using the two tools’ analytics APIs.
Even if you can’t get transactions and revenue within the variance we’ve achieved, don’t be deterred, and don’t fall into the trap of skipping the comparison because everyone says “two tools will never give the same results.” While it’s true that digital analytics systems process data differently, they should agree on the overall performance of your website, and comparing them helps you spot data quality issues. Knowing whether an issue is isolated to one of your systems or is affecting all of them focuses your investigation and gets it fixed quickly.
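As a sketch of what such a comparison can look like in practice: the totals below are hypothetical hard-coded values (in reality you would pull them from each platform's reporting API), but the flagging logic mirrors the idea, computing the per-metric variance and alerting on anything over your tolerance (1% here).

```javascript
// Hypothetical daily totals; in practice, pull these from the Google
// Analytics and Adobe Analytics reporting APIs.
const gaTotals = { transactions: 1042, revenue: 85210.5 };
const adobeTotals = { transactions: 1050, revenue: 85990.25 };

// Percent variance between two values, relative to the first system.
function percentVariance(a, b) {
  return (Math.abs(a - b) / a) * 100;
}

// Return the metrics whose variance exceeds the tolerance
// (1% by default, the threshold described above).
function compareTotals(a, b, thresholdPct = 1) {
  return Object.keys(a)
    .map((metric) => ({
      metric,
      variance: percentVariance(a[metric], b[metric]),
    }))
    .filter((m) => m.variance > thresholdPct);
}

const flagged = compareTotals(gaTotals, adobeTotals);
console.log(flagged.length === 0 ? "Systems agree within tolerance" : flagged);
```

Running a report like this on a schedule, and alerting when anything is flagged, is what turns an occasional sanity check into an early-warning system.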
You should also bring the data in from your website platform or e-commerce platform and compare it against Google Analytics and Adobe Analytics. It’s almost certain that your analytics systems won’t agree with your e-commerce platform. There are many reasons for this, such as ad blockers, but as I already mentioned, they should agree on the overall performance and trends of your website.
The only disadvantage of this approach is that you now have to maintain two analytics systems, but each also serves as a backup in case of data disruption to the other.
Develop Processes for Implementation, QA, and Release and Make Them a Priority
If you don’t take data quality assurance seriously, others in your organization won’t either. Lead by example and prioritize data quality testing. Develop a structured QA process that must be followed every time you implement changes. This should include:
- Code review
- Regression testing
- Comparing against a solution design reference
- Analyzing test data
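Several of these steps lend themselves to automation; comparing captured hits against a solution design reference (SDR), in particular, reduces to a lookup-and-check. Here's a minimal sketch, assuming your SDR can be expressed as a set of required parameters per event. The parameter names follow Universal Analytics conventions, but the SDR shape and event names are hypothetical.

```javascript
// Hypothetical solution design reference: for each tracked event, the
// beacon parameters that must be present, plus the expected hit type.
// Parameter names follow Universal Analytics ("tid" = property ID,
// "t" = hit type, "ti" = transaction ID, "tr" = transaction revenue).
const sdr = {
  purchase: { required: ["tid", "t", "ti", "tr"], hitType: "transaction" },
  pageview: { required: ["tid", "t", "dl"], hitType: "pageview" },
};

// Validate one captured hit's parameters against the SDR entry for the
// event; returns a list of human-readable problems (empty = pass).
function validateHit(eventName, params, spec = sdr) {
  const rule = spec[eventName];
  if (!rule) return [`no SDR entry for event "${eventName}"`];
  const errors = rule.required
    .filter((key) => !(key in params))
    .map((key) => `missing required parameter "${key}"`);
  if ("t" in params && params.t !== rule.hitType) {
    errors.push(`hit type is "${params.t}", expected "${rule.hitType}"`);
  }
  return errors;
}
```

A check like this runs identically before and after every release, which is exactly what makes it useful for the regression testing step as well.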
Develop a Structured Release Process
It’s not good enough just to publish something “willy-nilly.” You need to notify other teams that it’s happening, schedule it, and make sure there aren’t any other releases going on at the same time.
You should always avoid publishing anything on a Friday — a great rule of thumb we should all adopt.
You should always have a rollback plan. Rolling back a version in Google Tag Manager might seem trivial to you, but it isn’t obvious to everyone in the business. Rolling back your website’s code, for example, is far more involved, so your development team may not realize how simple a tag management system rollback is.
The Future of Data Governance: Automated Data Quality Audits
In the previous posts, we’ve covered a lot of problems with maintaining high data quality, including:
- It’s manual and takes a long time.
- It requires high effort.
- There’s often a high cost.
- It’s not enjoyable.
- Implementation is overtaken by marketing tag requests.
- You need to develop processes and that requires time and effort.
Of course, I’m not going to pretend that our industry hasn’t already made great strides in automating data governance. You’ve probably heard of ObservePoint and may even use it already. But we still have a long way to go before data quality assurance automation is fully developed, easy to implement and low cost.
An automated data quality testing solution can really help eliminate a lot of the problems we’ve covered in these posts and capitalize on all the benefits of upholding high data quality. When considering the best route to take, you’ll come across the “build versus buy” argument, and you need to balance the pros and cons of each. Whether you develop your own automated data quality testing or buy a tool off the shelf, there’s going to be an investment. But investment can be a good thing, and as long as you can show an ROI, you should get buy-in.
But Isn’t ROI Difficult to Estimate?
This is where your digital analytics bug log comes in: a record of all the things that went wrong because you didn’t prioritize your data quality, your QA, and your audits. If you are prioritizing it and doing it manually, bingo: calculate the number of hours you’re spending on regular audits and factor that into your ROI estimate.
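The arithmetic itself is simple. Every figure in the sketch below is a placeholder; substitute numbers from your own bug log and timesheets.

```javascript
// Back-of-the-envelope ROI estimate; every figure is a placeholder.
const hoursPerManualAudit = 16;    // time one full manual audit takes
const auditsPerYear = 12;          // monthly audits
const hourlyRate = 75;             // blended cost of an analyst hour
const incidentCostPerYear = 20000; // losses recorded in the bug log
const toolCostPerYear = 15000;     // license fee, or build + maintenance

// What manual auditing costs you today, in labor alone.
const manualLaborCost = hoursPerManualAudit * auditsPerYear * hourlyRate;

// Savings if automation replaces the manual audits and prevents the
// logged incidents, net of what the tool costs.
const netSavings = manualLaborCost + incidentCostPerYear - toolCostPerYear;
const roi = netSavings / toolCostPerYear;
```

Even with conservative placeholder numbers, laying the calculation out like this makes the buy-in conversation far easier than arguing about data quality in the abstract.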
Eliminate Human Error by Getting a Machine To Do It
Let’s face it, you could be the greatest developer or analyst, but we all make mistakes. Human error is part of life, but it can be reduced in digital analytics. Also, the machine doesn’t care about enjoyment, so you’re free to go off, slice data, impress your company with your killer insights, and ask for more budget to build that data lake or set up that Customer Data Platform (CDP).
If you’re interested in learning more about CDPs, you should view my colleague Paul’s blog post on Leveraging Your CDP to Make Data Driven Decisions. Also, see Blast’s Customer Data Platform Quick Start Solution if you’re interested in accelerating your CDP implementation with our industry best practices.
Can You Build Your Own Automated Data Quality Testing System?
In short, yes, you can. There’s loads of automation software out there, and while you do need technical expertise, it’s a problem that your development team can help with because testing automation has been part of web development for a long time now.
At Blast, we’ve experimented with data quality testing automation quite a lot and have some great tips. If you want to develop a system yourself, you have two main choices in terms of tech:
- Web driver-based (like Selenium)
- Programmatic browser-based (like Puppeteer)
Web Driver-Based
Web driver-based solutions have one clear advantage over programmatic browser-based solutions: cross-browser compatibility, which is a big deal in digital analytics (an advantage that held until the development of Playwright; see below). It’s imperative that we’re always testing our analytics across multiple browsers such as Chrome, Firefox, and Safari because they can behave so differently and cause inconsistencies in our data. Each browser also has its own release schedule, and when investigating data quality issues, the browser report in your analytics system should be one of the first places you look!
The main disadvantage of these systems is that they don’t have native access to the network requests being sent to Google Analytics, Adobe Analytics, marketing providers, or any other endpoint. This matters when developing your own automated data quality testing system because capturing and validating the analytics beacons is the one thing you absolutely need to be able to do. It can be accomplished with a web driver-based solution, for example by routing traffic through a proxy, but as you’ll see in the next section, programmatic browsers are far superior here.
They also tend to struggle with asynchronous operations, such as waiting for elements to appear on the page, so you usually have to insert a lot of manual wait times, which makes the performance of your tests slow and inconsistent.
Programmatic Browser-Based
This is the most exciting technology in the space right now and Blast’s preferred choice for data quality assurance automation. When we first started exploring digital analytics QA automation, the only choice was Puppeteer from Google, which only works with Chrome. Hence, a web driver-based solution had the advantage of cross-browser compatibility. However, since the release of Playwright, it’s now possible to programmatically control Chromium, Firefox, and WebKit (the engine behind Safari) in the same way. Playwright is also developed by the same team that originally built Puppeteer, so it has a very similar syntax.
Back when we started prototyping data quality assurance (QA) automation technology at Blast, we had a competition between Nightwatch and Puppeteer. Puppeteer emerged the victor based on a single feature: the native access to the network panel within Chrome and, therefore, the ability to capture network requests easily and store them in a database for later analysis. Crucially, with a little bit of coding, it has the ability to associate network requests with events that happened on the page, giving us a level of granularity that just wasn’t possible with a web driver-based solution.
Furthermore, there are built-in methods for handling asynchronous operations, such as waiting for elements to be visible on the page or waiting for network requests to occur, so you don’t have to insert manual wait times, which drastically improves the performance and reliability of the solution.
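To make the capture side concrete, here is a minimal sketch. The two helpers are plain functions; the commented-out section shows how they would hook into Playwright's request event. This assumes playwright is installed, and the URL patterns shown are the classic Universal Analytics and Adobe Analytics beacon endpoints; adjust both for your own stack.

```javascript
// Decide whether a request URL looks like an analytics beacon we care
// about (classic Universal Analytics and Adobe Analytics patterns).
function isAnalyticsBeacon(url) {
  return (
    url.includes("google-analytics.com/collect") ||
    url.includes("/b/ss/") // Adobe Analytics image request path
  );
}

// Parse a beacon's query string into a plain object for validation.
function parseBeacon(url) {
  const params = {};
  for (const [key, value] of new URL(url).searchParams) params[key] = value;
  return params;
}

/* Wiring the helpers into Playwright (assumes playwright is installed):

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const beacons = [];
  page.on("request", (request) => {
    if (isAnalyticsBeacon(request.url())) {
      beacons.push(parseBeacon(request.url()));
    }
  });
  await page.goto("https://www.example.com");
  // Built-in waiting instead of manual sleeps:
  await page.waitForRequest((r) => isAnalyticsBeacon(r.url()));
  await browser.close();
  // `beacons` can now be validated against your SDR or stored
  // in a database for later analysis.
})();
*/
```

Because the captured parameters come back as plain objects, validating them, associating them with page events, and persisting them for trend analysis all become ordinary application code rather than browser wrangling.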
Avoid Taking Data for Granted
For those of you who’ve read this whole data quality management series and are now reading this, thank you. I hope it’s been useful and you’ve learned something. I hope you’re excited to put the advice contained within into practice and further your culture of data governance in your organization — and maybe even build your own data quality assurance testing automation.
If you do nothing else, please start conducting regular data quality audits, and focus on proving the value of doing them rather than dwelling on their cost. Start a digital analytics bug log, embed it within your team, and cure your blame culture! These small steps alone will help you avoid taking your data for granted.
I’d love to hear your thoughts and comments on this post and series.