Put a Stop to Data Swamps With Event-Driven Data Testing
Ensure data quality in your S3 data lake using Python, AWS Lambda, SNS, and Great Expectations

Data lakes have long had a bad reputation when it comes to data quality. In contrast to data warehouses, data doesn’t need to conform to any predefined schema before we can load it in. Without proper testing and governance, your data lake can easily turn into a data swamp.
In this article, we’ll look at how to build automated data tests that are executed any time new data is loaded into a data lake. We’ll also configure SNS-based alerting to get notified about data that deviates from our expectations.
Table of Contents
· Python libraries for data quality
· Using Great Expectations
· Using Great Expectations for event-driven data testing
· Demo: generating time series data for testing
· Implementing data tests using Great Expectations
∘ Which tests can we run for this data?
∘ How to implement data tests?
∘ How to run data tests locally?
∘ How to run data tests on AWS Lambda?
∘ Testing the AWS process by uploading new files to a data lake
· How to monitor this process?
· Conclusion
Python Libraries for Data Quality
There are many tools for data profiling and data testing out there. To name just a few (a short code sketch of two of them follows this list):
- Pandas Profiling allows us to generate an HTML report showing quantile statistics, histograms, correlations, NULL value distribution, text analysis, categorical variables with high cardinality, and more.
- dbt tests let us validate uniqueness, accepted values, and NULL values, and build custom data tests that detect anomalies using SQL queries.
- Bulwark provides decorators for functions that return pandas DataFrames (e.g. @dc.HasNoNans()).
- mobyDQ is a tool from Ubisoft to generate a GraphQL-based web application for data…
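To make the first and third options more concrete, here is a minimal sketch of generating a profiling report with pandas-profiling (the package has since been renamed to ydata-profiling) and guarding a DataFrame-returning function with a Bulwark decorator. The orders.csv file and the load_orders function are hypothetical placeholders, not part of this article’s demo.

```python
import pandas as pd
import bulwark.decorators as dc
from pandas_profiling import ProfileReport

# Profiling: generate a standalone HTML report with quantile statistics,
# histograms, correlations, missing-value distribution, and more.
df = pd.read_csv("orders.csv")  # hypothetical sample file
ProfileReport(df, title="Orders profiling report").to_file("orders_profile.html")


# Testing: Bulwark decorators validate the DataFrame a function returns.
@dc.HasNoNans()  # raises an error if the returned DataFrame contains NaN values
def load_orders(path: str) -> pd.DataFrame:
    return pd.read_csv(path)
```

Decorator-based checks like these live right next to the code that produces the data; the rest of this article instead takes the framework route with Great Expectations.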