7. Real Example Cleanup

Open In Colab

Cleaning Real Data

The previous episodes have focused on key concepts with small datasets and/or made-up data. However, these examples can only take you so far. For this reason, the remaining lessons and examples will be focused on an actual dataset from the Hawaiian Ocean Time-Series (HOT) data website (Link to HOT Data Website). This website allows you to query data generated by the time series through various modules.

The Hawaiian Ocean Time Series has been collecting samples from station ALOHA located just North of Oahu since 1988. The map below shows the exact location where the samples we will be using originate from.

HOT Location

(Original image from: https://www.soest.hawaii.edu/HOT_WOCE/bath_HOT_Hawaii.html)

Going forward we are going to be using data from HOT between the 1st of January 2010 to the 1st of January 2020. This particular data we are going to be utilizing comes from bottle extractions between depths from 0 to 500m. The environmental variables that we will be looking at include:

Column name Environmental Variable
botid # Bottle ID
date mmddyy Date
press dbar Pressure
temp ITS-90 Temperature
csal PSS-78 Salinity
coxy umol/kg Oxygen concentration
ph pH
phos umol/kg Phosphate concentration
nit umol/kg Nitrate + Nitrite concentration
no2 nmol/kg Nitrite concentration
doc umol/kg Dissolved Organic Carbon concentration
hbact # 1e5/ml Heterotrophic Bacteria concentration
pbact # 1e5/ml Prochlorococcus numbers
sbact # 1e5/ml Synechococcus numbers

The data contains over 20000 individual samples. To analyze the data we are going to clean it up. Then, in the next episode we will analyze and visualize it. To do this, we will be using some of the tricks we have already learned while also introducing some new concepts. Note: The dataset we are using has been modified from its original format to reduce the amount of cleaning up we need to do.

Cleaning Up

DataFrame Content Cleanup

During our initial clean up we will only load in the first few rows of our datasets entire DataFrame. This will make it easier to work with and less daunting.

This will show us the DataFrame seen below:

/Initial%20DataFrame

From this we can see a few things:

Both of these issues can easily be fixed using Pandas and things we’ve learned previously.

To start off let’s fix the first problem we saw which was the large number of -9 values in the dataset. These are especially strange for some of the columns e.g. how can there be a negative concentration of hbact i.e. heterotrophic bacteria? This is a stand-in for places where no measurement was obtained.

With this, we have fixed the problematic -9 values from our initial DataFrame.

The second problem we identified was that there was an extra column (with no header) made up of only NaN values. This is probably an issue with the original file and if we were to take a look at the raw .csv file we find that each row ends with a ‘,’. This causes read_csv to assume that there is another column with no data since it looks for a new line character (\n) to denote when to start a new row.

There are various methods to deal with this. However, we are going to use a relatively simple method that we’ve already learned. As we discussed in a previous episode pandas DataFrames has a method to drop columns (or indexes) called drop. When provided with the correct arguments, it can drop a column based on its name. Knowing this we can chain our read_csv method with the drop method so that we load in the blank column and then immediately remove it.

This then gives us the output DataFrame seen below:

/No%20Blank%20Column%20DataFrame

We now have fixed some of the initial issues related to our dataset. It should be noted that there might still exist other issues with our dataset since we have only relied on the first few rows.

One final thing that we are going to do that is not quite “clean up” but nonetheless important is to set our index column when we load the data. The column we are going to use for this is the ‘botid #’ column. We can also remove the nrows=5 parameter since we want to load the whole DataFrame starting in the next section.

This gives us a somewhat cleaned up DataFrame that looks like the image below:

/Cleaned%20up%20Dataframe

With our initial cleanup done we can now save the current version of our DataFrame to the df variable. This df variable will be used for the next two sections.

DataFrame Column Types

Now that we have fixed the initial issues we could glean from an initial look at the data we can take a look at the types that Pandas assumed for each of our columns. To do this we can access the .dtypes attribute.

Most of the columns have the correct type with the exception of the ‘date mmddyy’ column that has the int64 type. Pandas has a built in type to format date and time columns and conversion of the date column to this datetime type will help us later on.

To change the type of a column from an int64 to a datetime type is a bit more difficult than e.g. a int64 to float64 conversion. This is because we both need to tell Pandas the type that we want it to convert the column’s data to and the format that it is in. For our data, this is MMDDYY which we can give to Pandas using format='%m%d%y'. This format parameter can be very complicated but is based on native Python more information can be found on the to_datetime method docs (Link to datetime method docs).

The code bit below creates a new column called ‘date’ that contains the same data for each row as is found in the ‘date mmddyy’ column but instead with the datetime64 type. It will not delete the original ‘date mmddyy’ column.

We can see from the output that we have all of our previous columns with the addition of a ‘date’ column with the type datetime64. If we take a look at the new column we can see that it has a different formatting compared to the ‘date mmddyy’ column

Now that we’ve added this column (which contains the same data as found in ‘date mmddyy’ just in a different format) there is no need for the original date mmddyy column so we can drop it.

DataFrame Overview

A final thing we will want to do before moving on to the analysis episode is to get an overview of our DataFrame as a final way of checking to see if anything is wrong. To do this we can use the describe() method which we discussed earlier.

A look at the output shows us that the ‘no2 nmol/kg’ column does not contain any useable data based on its count value being 0. Since this column doesn’t contain any information of interest we can drop it to clean up our DataFrame.

With this done, our data is reasonably cleaned up and we have the DataFrame seen in the image below:

/Cleaned%20Dataframe

We can now move on to the analysis and visualization of the data in our DataFrame.

Summary

With this we’ve cleaned up our initial dataset. To summarize we have:

  1. Replaced the -9 placeholder for Null values with NaN values
  2. Fixed an issue with an extra column containing no data
  3. Added a custom row index
  4. Converted the data in ‘date mmddyy’ to a Pandas-supported datetime type
  5. Dropped two columns:
    • ‘date mmddyy’ column since the new ‘date’ column contains the same data but in a better type
    • ‘no2 nmol/kg’ column since it contained no data