9. Real Example Analysis

Analyzing Real Data

This episode continues from the previous one and utilizes the final DataFrame described there.

Analysis

Before we start our analysis, let us take stock of how much data we have for the various columns. To do this we can use a DataFrame attribute and a DataFrame method that we’ve previously used.

First, we can check how many rows of data we have in total. We can check this easily through the shape attribute of the DataFrame.
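As a minimal sketch of this check, the snippet below uses a small made-up DataFrame in place of the real one. The name `df` and the columns `temp` and `sal` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame from the previous episode
df = pd.DataFrame({
    "temp": [20.1, np.nan, 18.4, 19.0],
    "sal": [36.5, 36.6, np.nan, 36.4],
})

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Note that shape counts every row, including rows that contain NaN values.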

From this, we can see that we have 21222 rows in our data and 13 columns. At most, we can have 21222 rows of data for each column. However, as we saw during the cleaning phase, there are NaN values in our dataset, so many of our columns won’t contain data in every row.

To check how many rows of data we have for each column we can again use the describe() method. It will count how many rows of data are not NaN for each column. To reduce the size of the output we will use loc to only view the counts for each column.
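The idea can be sketched on a small made-up DataFrame as follows. The name `df` and the columns `temp` and `sal` are placeholders, not the real column names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame
df = pd.DataFrame({
    "temp": [20.1, np.nan, 18.4, 19.0],
    "sal": [36.5, 36.6, np.nan, np.nan],
})

# describe() returns a table of summary statistics;
# loc["count"] keeps only the row with the non-NaN counts
counts = df.describe().loc["count"]
print(counts)
```

For numeric columns, `df.count()` should give the same numbers directly, so it is a reasonable shortcut when only the counts are needed.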

As we can see, the amount of data varies widely between columns. We will ignore pressure since it is roughly a depth estimate. The remaining columns fit fairly neatly into three groups:

  1. Data that is found in almost all rows
    • Temperature
    • Salinity
  2. Data that is found in around 2000-4000 samples
    • Oxygen
    • Phosphorus
    • Nitrate+Nitrite
  3. Data that is found in fewer than 1000 samples
    • pH
    • Dissolved organic carbon
    • Heterotrophic bacteria
    • Prochlorococcus
    • Synechococcus

Note here that pressure is roughly akin to “depth” so we won’t be using it.

GroupBy and Visualization

To start off we can focus on the measurements that we have plenty of data for.
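A first look at temperature over time could be sketched as below. This uses synthetic monthly data rather than the real samples, and the column names `date` and `temp` are assumptions. The real data has many samples per date, so its plot will look far busier than this one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: four years of monthly temperatures with a seasonal cycle
rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", periods=48, freq="MS")
seasonal = 20 + 5 * np.sin(2 * np.pi * dates.month.to_numpy() / 12)
df = pd.DataFrame({
    "date": dates,
    "temp": seasonal + rng.normal(0, 1, len(dates)),
})

# DataFrame.plot draws a line plot by default and returns a matplotlib Axes
ax = df.plot(x="date", y="temp", figsize=(10, 4))
ax.set_ylabel("Temperature (°C)")
```

In a notebook the figure is rendered automatically. In a plain script you would typically call `matplotlib.pyplot.show()` afterwards.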

The resulting plot appears cluttered. Notably, temperatures fluctuate between 7°C and 25°C each year, and the lines are densely packed.
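One way to reduce the clutter is to keep only samples taken near the surface. The sketch below assumes a `depth` column in metres and uses the 100 m cutoff that this episode later associates with `surface_samples`. The data values are made up.

```python
import pandas as pd

# Hypothetical samples at different depths (metres)
df = pd.DataFrame({
    "depth": [5, 50, 150, 1000],
    "temp": [24.0, 22.5, 18.9, 7.2],
})

# Boolean indexing: keep only rows from the upper 100 m of the water column
surface_samples = df[df["depth"] < 100]
print(surface_samples)
```

Dropping the deep samples removes the coldest readings, which narrows the temperature range in the plot.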

Now we can see that we have removed some of the variation we saw in the previous figure. However, it is still somewhat difficult to make out any trends in the data. One way of dealing with this is to calculate the average temperature for each year and then plot those results.

To this end, we will introduce a new method called groupby, which allows us to run calculations like mean() on groups we specify. We want to get the mean temperature for each year. Thanks to our previous work in setting up the date column type, this is very easy. We can also reuse surface_samples to only get samples from the upper 100 m of the water column.
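A minimal sketch of the yearly grouping is shown below. It assumes `date` is a datetime column and uses hypothetical `depth` and `temp` columns with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2000-01-15", "2000-07-15", "2001-01-15", "2001-07-15", "2001-07-20"]
    ),
    "depth": [5.0, 10.0, 20.0, 50.0, 500.0],
    "temp": [19.0, 27.0, 21.0, 27.0, 16.0],
})

# Keep only samples from the upper 100 m
surface_samples = df[df["depth"] < 100]

# Because date is a datetime column, .dt.year extracts the year of each sample;
# grouping by that Series and calling mean() averages every numeric column per year
years = surface_samples["date"].dt.year.rename("year")
yearly_mean = surface_samples.groupby(years).mean(numeric_only=True)
print(yearly_mean)
```

Passing `numeric_only=True` keeps the datetime column itself out of the averaging.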

We see now that the new DataFrame generated by groupby() and mean() contains the mean for each year for each of our columns.

The plot now looks a lot smoother, but we have another issue: we’ve smoothed out any month-to-month variations that are present in the data. To fix this, we can instead use the groupby method to group by both year and month.
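Grouping by two keys can be sketched by passing a list to groupby. As before, the data is made up and the column names `date` and `temp` are assumptions.

```python
import pandas as pd

# Hypothetical surface samples
surface_samples = pd.DataFrame({
    "date": pd.to_datetime(
        ["2000-01-05", "2000-01-25", "2000-07-10", "2001-01-12"]
    ),
    "temp": [19.0, 21.0, 27.0, 20.0],
})

# Two grouping keys: one group per (year, month) combination
year = surface_samples["date"].dt.year.rename("year")
month = surface_samples["date"].dt.month.rename("month")
monthly_mean = surface_samples.groupby([year, month]).mean(numeric_only=True)
print(monthly_mean)
```

The result is indexed by (year, month) pairs, so a single value can be looked up with a tuple such as `monthly_mean.loc[(2000, 1), "temp"]`.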

While we have been focusing on temperature, there is no reason we can’t make the same plots for other measurements. We can also plot multiple measurements at the same time.
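As a rough sketch of plotting two measurements together, the snippet below draws yearly means of temperature and salinity in separate panels. The column names `date`, `temp` and `sal` and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical surface samples with two measurements
surface_samples = pd.DataFrame({
    "date": pd.to_datetime(
        ["2000-02-01", "2000-08-01", "2001-02-01", "2001-08-01"]
    ),
    "temp": [19.5, 27.5, 20.0, 28.0],
    "sal": [36.6, 36.4, 36.7, 36.5],
})

# Yearly means of every numeric column
years = surface_samples["date"].dt.year.rename("year")
yearly_mean = surface_samples.groupby(years).mean(numeric_only=True)

# subplots=True gives each column its own panel and y-axis
axes = yearly_mean.plot(y=["temp", "sal"], subplots=True, marker="o", figsize=(8, 6))
```

Separate panels are used here because temperature and salinity sit on very different scales, and a shared y-axis would flatten one of the two lines.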

We have explored various methods of plotting our data and learned how to group measurements by when they were taken. If you are interested, you can keep testing different methods of grouping the data or plot some of the measurements that we did not use, e.g. pH or dissolved organic carbon (doc umol/kg).