1. Data Wrangling Introduction

Open In Colab

Why should we care about data wrangling?

Data scientists devote the bulk of their time to the routine work of collecting and preparing messy digital data before it can be analyzed. New libraries have emerged to help with data handling, but the data itself has grown more complex and voluminous, so data preparation remains as demanding as ever.

Here is a brief illustration of how data wrangling can be used to clean and understand data. The first table below has several data quality issues, including non-descriptive or inadequate column labels and inconsistent labels. Data wrangling fixes these issues and produces a corrected table with appropriate column names and labels. The corrected table is then used to build a summary aggregate table that reports the average population size per neighborhood.

Figure: Table with various data quality issues, including non-descriptive or inadequate column labels and label inconsistencies.
Figure: The corrected table that results from fixing the issues identified in the previous table.
Figure: Summary aggregate table derived from the corrected table.
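
The three steps illustrated above (rename columns, fix inconsistent labels, aggregate) can be sketched in Pandas roughly as follows. This is a minimal sketch, not the lesson's own code: the table contents, the column names `col1` and `col2`, and the neighborhood names are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical raw table: vague column names and inconsistent labels.
raw = pd.DataFrame({
    "col1": ["Downtown", "downtown ", "Riverside", "RIVERSIDE", "Hillcrest"],
    "col2": [1200, 1400, 800, 1000, 500],
})

# Step 1: give the columns descriptive names.
tidy = raw.rename(columns={"col1": "neighborhood", "col2": "population"})

# Step 2: normalize the labels (trim whitespace, unify capitalization).
tidy["neighborhood"] = tidy["neighborhood"].str.strip().str.title()

# Step 3: aggregate to the average population per neighborhood.
summary = (
    tidy.groupby("neighborhood", as_index=False)["population"]
    .mean()
    .rename(columns={"population": "avg_population"})
)
print(summary)
```

Without step 2, "Downtown" and "downtown " would be treated as two different neighborhoods, and the averages in the summary would be wrong.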

If you’ve ever worked with data in Excel, Google Sheets, or a CSV file, you’ve likely done data wrangling similar to the scenario above. Maybe you’ve added or removed columns and rows, or carried out summations or other arithmetic operations. These tools work well for simple tasks. But what happens when you need to apply more complex functions or operations? Spreadsheet tools fall short because they lack more sophisticated functions, and reusing existing libraries is difficult because most libraries are not designed for such platforms. Similarly, manually repeating the same steps on multiple datasets is neither scalable nor reproducible.

Python addresses these limitations: it is a powerful language that can handle most of the intricacies of working with data. In particular, the Pandas library is widely regarded as the standard for data processing in Python, providing tools for managing large datasets, applying complex functions, and creating reusable scripts. These capabilities go well beyond the basic functionality of conventional tools like Excel or Google Sheets.
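
To make the point about reusable scripts concrete, here is a small sketch of one cleaning step written once as a function and then applied to several tables. The function name `standardize_labels` and both example datasets are hypothetical and exist only for this illustration.

```python
import pandas as pd


def standardize_labels(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with whitespace trimmed and capitalization unified in one column."""
    out = df.copy()  # leave the caller's table untouched
    out[column] = out[column].str.strip().str.title()
    return out


# Two hypothetical datasets that share the same labeling problem.
survey_2022 = pd.DataFrame(
    {"neighborhood": [" downtown", "RIVERSIDE"], "population": [1200, 800]}
)
survey_2023 = pd.DataFrame(
    {"neighborhood": ["Downtown ", "riverside"], "population": [1250, 820]}
)

# The same step runs identically on every dataset.
cleaned = [standardize_labels(df, "neighborhood") for df in (survey_2022, survey_2023)]
combined = pd.concat(cleaned, ignore_index=True)
print(combined)
```

Because the cleaning logic lives in one function, adding a third dataset means adding one item to the tuple rather than repeating manual edits in a spreadsheet.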

While many tools are available for coding and data wrangling, Jupyter Notebooks have become the go-to option for these tasks. Jupyter Notebook is an open-source, web-based application that allows you to create and share documents containing code, equations, visualizations, and narrative text. Its popularity stems from its interactive nature: you can write brief code snippets, examine the results, and iterate as needed. Because code, visuals, and text live in one place, a workflow is easy to follow and document, which is ideal for collaboration and dissemination. Furthermore, Jupyter supports multiple languages and connects well with data science tools, making it a versatile choice for different projects.

In this first workshop, you will learn the basics of using Jupyter Notebooks and Pandas. The Pandas library provides a wide range of features for both cleaning and analyzing your data. However, since it is built on top of Python, a basic understanding of the Python programming language is required. With these tools, you will learn how to take a raw dataset and clean and format it so that it can be used for further analysis or other workflows.

What this lesson will not teach you

Notebook Programming Languages

Jupyter notebooks can run code in several programming languages, including R and Julia. This lesson uses only Python and does not cover those other languages.

Lesson Structure

This lesson requires participants to run a Jupyter Notebook. To reduce the time needed to set up and run Jupyter Notebooks, we will use Google Colab, an online Jupyter Notebook environment from Google. At the beginning of each module, you’ll find a link to the Colab notebook for that lesson. Look for the “Open in Colab” icon at the top of this page.