Automating Data Cleaning Processes with Pandas


Few data science projects can skip data cleaning. Data cleaning covers the initial steps of preparing data, and its purpose is to retain only the relevant and useful information in the data, whether for later analysis, for use as input to an AI or machine learning model, and so on. Typical data cleaning tasks include unifying or converting data types, handling missing values, eliminating noisy values that stem from erroneous measurements, and removing duplicates.

As you might think, the more complex the data, the more intricate, tedious, and time-consuming the data cleaning can become, especially when implementing it manually.

This article delves into the functionalities offered by the Pandas library to automate the process of cleaning data. Off we go!

Cleaning Data with Pandas: Common Functions

Automating data cleaning with pandas boils down to applying several data cleaning functions in a fixed sequence and encapsulating that sequence in a single data cleaning pipeline. Before doing this, let’s introduce some commonly used pandas functions for different data cleaning steps. In what follows, we assume a Python variable df that holds a dataset as a pandas DataFrame object.

  • Filling missing values: pandas provides methods for dealing with missing values in a dataset, either by replacing them with a “default” value using the df.fillna() method, or by removing any rows or columns that contain missing values through the df.dropna() method.
  • Removing duplicated instances: the df.drop_duplicates() method removes duplicate entries (rows) from a dataset. By default, a row counts as a duplicate when all of its values match those of an earlier row; the subset argument restricts the comparison to specific attributes.
  • Manipulating strings: some pandas functions help make the format of string attributes uniform. For instance, if a 'column' attribute mixes lowercase, sentence-case, and uppercase values and we want them all in lowercase, the df['column'].str.lower() method does the job. To remove accidentally introduced leading and trailing whitespace, try the df['column'].str.strip() method. Both return a new Series, so assign the result back to the column.
  • Manipulating date and time: the pd.to_datetime(df['column']) function converts string columns containing date-time information into pandas datetime values, thereby easing their further manipulation. For day-first strings such as dd/mm/yyyy, pass dayfirst=True or an explicit format (e.g. format='%d/%m/%Y') so that day and month are not swapped.
  • Column renaming: automating the renaming of columns can be particularly useful when there are multiple datasets segregated by city, region, project, etc., and we want to add prefixes or suffixes to all or some of their columns to make them easier to identify. The df.rename(columns={old_name: new_name}) method makes this possible.
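
The functions listed above can be tried one by one on a toy DataFrame. The following is only a minimal sketch: the column names (city, visited, score) and all values are hypothetical and chosen purely to trigger each step.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "city": [" Paris", "paris ", "ROME", None],
    "visited": ["01/02/2023", "01/02/2023", "15/03/2023", "20/04/2023"],
    "score": [1.0, 1.0, None, 4.0],
})

# Manipulating strings: trim whitespace and lowercase (assign the result back).
df["city"] = df["city"].str.strip().str.lower()

# Manipulating dates: parse dd/mm/yyyy strings with an explicit format.
df["visited"] = pd.to_datetime(df["visited"], format="%d/%m/%Y")

# Removing duplicates: the first two rows are identical once strings are normalized.
df = df.drop_duplicates()

# Missing values: fill one column with a default, then drop rows still incomplete.
df = df.fillna({"score": 0.0})
df = df.dropna()

# Column renaming: add a prefix to selected columns.
df = df.rename(columns={"city": "trip_city", "score": "trip_score"})

print(df)
```

Each call returns a new object rather than modifying df in place, which is why every result is assigned back.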

Putting it all Together: Automated Data Cleaning Pipeline

Time to put the above example methods together into a reusable pipeline that automates the data cleaning process for future batches of data. Consider a small dataset of personal transactions with three columns: name of the person (name), date of purchase (date), and amount spent (value).

This dataset has been stored in a pandas DataFrame, df.
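
As a minimal sketch, a dataset of this shape could be built as shown here. Only the three column names come from the description above; the records themselves are invented for illustration and deliberately include inconsistent casing, stray whitespace, a duplicate, and missing entries.

```python
import pandas as pd

# Hypothetical records, invented to exhibit typical data quality problems.
data = {
    "name": ["  Ana ", "BOB", "bob", "Carla", "ana"],
    "date": ["05/01/2024", "17/02/2024", "17/02/2024", None, "09/03/2024"],
    "value": [120.5, 80.0, 80.0, None, 42.0],
}
df = pd.DataFrame(data)

# Quick look at what needs cleaning.
print(df)
print(df.isna().sum())  # number of missing values per column
```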

To create a simple yet encapsulated data-cleaning pipeline, we create a custom class called DataCleaner, with one custom method for each of the data cleaning steps outlined above.

Note: ffill and bfill are two strategies for dealing with missing values, traditionally passed as the method argument of fillna. In particular, ffill applies a “forward fill” that imputes each missing value from the previous row’s value. A “backward fill” is then applied with bfill to fill any remaining missing values using the next row’s value, so that no missing values are left unless a column is entirely empty. Recent pandas versions deprecate fillna(method=...) in favor of the dedicated df.ffill() and df.bfill() methods.
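
The effect of the two fills can be checked on a short Series. This is a small sketch with made-up numbers, using the ffill() and bfill() methods directly.

```python
import pandas as pd

# Hypothetical column with gaps at the start, in the middle, and at the end.
s = pd.Series([None, 10.0, None, 30.0, None])

forward = s.ffill()       # the first value stays missing: nothing precedes it
both = s.ffill().bfill()  # the backward fill then covers the leading gap

print(forward.tolist())  # [nan, 10.0, 10.0, 30.0, 30.0]
print(both.tolist())     # [10.0, 10.0, 10.0, 30.0, 30.0]
```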

Then comes the “central” method of this class, which ties all the cleaning steps together into a single pipeline. Remember that, as in any data manipulation process, order matters: it is up to you to determine the most logical order in which to apply the steps, depending on the specific problem addressed.
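
To see why order matters, consider string normalization and duplicate removal. The sketch below uses invented values and a hypothetical normalize helper; it applies the same two steps in both orders and compares the number of rows left.

```python
import pandas as pd

# Hypothetical names that differ only in case and surrounding whitespace.
df = pd.DataFrame({"name": ["Bob", " bob ", "BOB"], "value": [80.0, 80.0, 80.0]})


def normalize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the name column stripped and lowercased."""
    return frame.assign(name=frame["name"].str.strip().str.lower())


# Order A: deduplicate first, then normalize. The rows still look different
# when duplicates are checked, so all of them survive.
dedupe_first = normalize(df.drop_duplicates())

# Order B: normalize first, then deduplicate. Only one row remains.
normalize_first = normalize(df).drop_duplicates()

print(len(dedupe_first), len(normalize_first))  # 3 1
```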

Finally, we use the newly created class to apply the entire cleaning process in one shot and display the result.
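
The class described above might look roughly like the following sketch. Only the class name DataCleaner and the column names come from the text; the method names, the clean_data entry point, the chosen step order, the new column names, and the sample records are all assumptions made for illustration.

```python
import pandas as pd


class DataCleaner:
    """Illustrative cleaner; method names and defaults are assumptions."""

    def fill_missing_values(self, df: pd.DataFrame) -> pd.DataFrame:
        # Forward fill, then backward fill whatever is still missing.
        return df.ffill().bfill()

    def drop_duplicates(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.drop_duplicates().reset_index(drop=True)

    def clean_strings(self, df: pd.DataFrame, column: str) -> pd.DataFrame:
        return df.assign(**{column: df[column].str.strip().str.lower()})

    def convert_dates(self, df: pd.DataFrame, column: str) -> pd.DataFrame:
        # Assumes dd/mm/yyyy strings, as in the sample data below.
        parsed = pd.to_datetime(df[column], format="%d/%m/%Y")
        return df.assign(**{column: parsed})

    def rename_columns(self, df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
        return df.rename(columns=mapping)

    def clean_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Central method: apply every step in a fixed order."""
        df = self.clean_strings(df, "name")
        df = self.convert_dates(df, "date")
        df = self.fill_missing_values(df)
        df = self.drop_duplicates(df)
        df = self.rename_columns(df, {"name": "customer", "value": "amount"})
        return df


# Hypothetical transactions (invented values).
raw = pd.DataFrame({
    "name": ["  Ana ", "BOB", "bob", "Carla", "ana"],
    "date": ["05/01/2024", "17/02/2024", "17/02/2024", None, "09/03/2024"],
    "value": [120.5, 80.0, 80.0, None, 42.0],
})

cleaned = DataCleaner().clean_data(raw)
print(cleaned)
```

Each step returns a new DataFrame instead of mutating its input, so raw stays untouched and the steps can be reordered or reused individually. Strings are normalized before duplicates are removed so that "BOB" and "bob" are recognized as the same person.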

And that’s it! We have a much nicer and more uniform version of our original data after applying some touches to it.

This encapsulated pipeline is designed to facilitate and greatly simplify the overall data cleaning process on any new batches of data you get from now on.
