Put a Stop to Data Swamps With Event-Driven Data Testing
Ensure data quality in your S3 data lake using Python, AWS Lambda, SNS, and Great Expectations

Data lakes have long had a bad reputation when it comes to data quality. In contrast to data warehouses, data doesn’t need to conform to any predefined schema before we can load it in. Without proper testing and governance, your data lake can easily turn into a data swamp.
In this article, we’ll look at how to build automated data tests that are executed any time new data is loaded into a data lake. We’ll also configure SNS-based alerting to get notified about data that deviates from our expectations.
Table of Contents
· Python libraries for data quality
· Using Great Expectations
· Using Great Expectations for event-driven data testing
· Demo: generating time series data for testing
· Implementing data tests using Great Expectations
∘ Which tests can we run for this data?
∘ How to implement data tests?
∘ How to run data tests locally?
∘ How to run data tests on AWS Lambda?
∘ Testing the AWS process by uploading new files to a data lake
· How to monitor this process?
· Conclusion
Python Libraries for Data Quality
There are many tools for data profiling and data testing out there. To name just a few (a short code sketch of two of them follows this list):
- Pandas Profiling allows us to generate an HTML report showing quantile statistics, histograms, correlations, NULL value distribution, text analysis, categorical variables with high cardinality, and more.
- dbt tests let us validate uniqueness, accepted values, and NULL values, and build custom data tests that detect anomalies using SQL queries.
- Bulwark provides decorators for functions that return pandas DataFrames (e.g. @dc.HasNoNans()).
- mobyDQ is a tool from Ubisoft to generate a GraphQL-based web application for data…
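To make the first and third options more concrete, here is a minimal sketch of generating a profiling report with pandas-profiling (the package has since been renamed to ydata-profiling) and guarding a DataFrame-returning function with a Bulwark decorator. The orders.csv file and the load_orders function are hypothetical placeholders, not part of this article’s demo.

```python
import pandas as pd
import bulwark.decorators as dc
from pandas_profiling import ProfileReport

# Profiling: generate a standalone HTML report with quantile statistics,
# histograms, correlations, missing-value distribution, and more.
df = pd.read_csv("orders.csv")  # hypothetical sample file
ProfileReport(df, title="Orders profiling report").to_file("orders_profile.html")


# Testing: Bulwark decorators validate the DataFrame a function returns.
@dc.HasNoNans()  # raises an error if the returned DataFrame contains NaN values
def load_orders(path: str) -> pd.DataFrame:
    return pd.read_csv(path)
```

Decorator-based checks like these live right next to the code that produces the data; the rest of this article instead takes the framework route with Great Expectations.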