EuroPython 2017

Data Unit Testing with Python

Speaker(s) Katharine Jarmul
Sub Community: PyData

Data validation is commonly used by data science teams to ensure their data “smells right.” Often, however, these teams evaluate the criteria by hand (or by eye), or only in end-of-term reports (i.e. monthly, quarterly or yearly numbers). In addition, these numbers may only be reviewed in aggregate, and the functions or pipelines used to create the reports might never be tested. Schema changes and erroneous or duplicate data are then left to be fixed, perhaps only after algorithms or other decision-making approaches have already been applied to the datasets, leading to negative business impact.

In this workshop, we will examine why unit testing approaches are important even on a data science or data engineering team. We’ll introduce libraries that can help incorporate data validation, schema constraints and unit tests into your data workflows. Finally, we’ll look at a real-world use case and implement some data unit tests, with alerting on failure.
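To make the idea of a data unit test concrete, here is a minimal sketch in plain Python. It is not taken from the workshop materials and does not use any of the libraries covered there; the schema, the column names and the helper names (`EXPECTED_SCHEMA`, `validate_rows`, `assert_valid`) are all hypothetical, chosen only to illustrate schema, type and duplicate checks on a small batch of records.

```python
from collections import Counter

# Hypothetical schema for illustration: column name -> expected Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "country": str}


def validate_rows(rows, schema=EXPECTED_SCHEMA, key="order_id"):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    for i, row in enumerate(rows):
        # Schema check: each row must have exactly the expected columns.
        if set(row) != set(schema):
            problems.append(f"row {i}: columns {sorted(row)} do not match schema")
            continue
        # Type check: each value must be an instance of the expected type.
        for column, expected_type in schema.items():
            if not isinstance(row[column], expected_type):
                problems.append(
                    f"row {i}: {column}={row[column]!r} is not {expected_type.__name__}"
                )
    # Duplicate check: the key column should be unique across rows that have it.
    counts = Counter(row[key] for row in rows if key in row)
    for value, count in counts.items():
        if count > 1:
            problems.append(f"duplicate {key}={value!r} appears {count} times")
    return problems


def assert_valid(rows):
    """Fail loudly so a pipeline run (or a test runner) stops on bad data."""
    problems = validate_rows(rows)
    if problems:
        raise ValueError("data validation failed:\n" + "\n".join(problems))
```

Calling `assert_valid(batch)` at the start of a pipeline step, or inside a test function, turns silent data problems into an explicit failure that can then be hooked up to whatever alerting the team already uses.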

Attendees should have a working understanding of common data science libraries (like Pandas and NumPy) and have the repository installed before they attend. The repository, which includes the requirements, is at https://github.com/kjam/data-cleaning-101. We will be focusing on the validation notebooks, so you will only need some of the libraries; if you install them all, you can go through the other notebooks as well!

On Friday 14 July at 10:15
