4. Wrangling DataFrames

Binder

Selection, Subsetting and Sorting a DataFrame

When exploring our data we will often want to focus our attention to specific rows, columns, and entries that satisfy certain conditions. We may want to either look at a single column of the data or work with a subset of the original data. This situation may occur because the source of your data was probably not using the information for the exact same purpose.

Furthermore, it is often helpful to sort our data set using a particular relation to identify patterns and to understand the data’s structure. For example, suppose the original data set we acquire and want to analyze describes a sport team’s performance for each game during a season and it is original ordered in chronological order. It may be interesting to sort the game play statistics using a different relation such as number of points scored to easily identify high and low scoring games.

As with previous episodes you should follow along in the notebook starting with ‘04’ that can be found at the following link Link to Binder.

Selection

Selecting data from a DataFrame is very easy and builds on concepts we discussed in the previous episode. As mentioned previously, in a DataFrame we have column names called headers and row names called indexes. Depending on how we loaded the data these might be integers or some other ID.

If we have the example DataFrame seen in the image below we can select and/or subset various combinations of rows and columns by using Pandas.

/archive-2022-2023/Selection%20DataFrame

Selecting Columns

To start off lets focus on subsetting columns. If we have a DataFrame called df and a column called ‘ph’ we can subset the DataFrame using df['ph'].

Selecting Rows

If on the other hand we want to access a subset rows we have to use a slightly different approach. While previously we could just pass the name of the column inside brackets if we want to access one or more rows we need to use the .loc or .iloc methods.

So if we again have a DataFrame called df and a row called ‘Sample-1’ we can access it using df.loc['Sample-1', df.columns].

The key difference between .loc and .iloc is that .loc relies on the names of the indexes and headers while .iloc relies instead on the index and header number. Here df.columns is providing the all the header names in the DataFrame called df and letting pandas know that we want a single row, but all of the header in the DataFrame.

If we were to use .iloc instead of .loc for the previous example and we know that ‘Sample-1’ is at index position 0 we would use df.iloc[0, :] and get the same result. You might notice that we also had to change df.columns to : when we used .iloc this is because df.columns provides the names of all the headers which is fine to do with .loc but not .iloc.

The : operator

Selecting Columns and Rows Simultaneously

As we saw in the previous section we can select one or more rows and/or columns to view. For example if we wanted to view the ‘ph’ entry of ‘Sample-1’ from the previous example we could use .loc in the following manner.

If we wanted to select multiple columns e.g. both the ‘ph’ and ‘Longitude’ columns we can change the code bit to fit our needs.

Using .iloc

You can just as well use .iloc for the two examples above, but you will need to change the index and headers to their respective integer values i.e. the row number(s) and the header number(s).

Subsetting

Comparison operations (“<” , “>” , “==” , “>=” , “<=” , “!=”) can be applied to pandas Series and DataFrames in the same vectorized fashion as arithmetic operations except the returned object is a Series or DataFrame of booleans (either True or False).

Within a Single DataFrame

To start let focus on a single DataFrame to better understand how comparison operations work in Pandas. As an example lets say that we have a DataFrame like the one below stored in df:

If we wanted to identify which of the above samples come from a depth above 380 we start by finding column ‘press dbar’ using a less than condition of 380.

We can see that the first row and the last row are both True while the remaining rows are False and a quick look at the original data confirms that this is correct based on our condition. However, looking back at the original DataFrame is very tedious. If we instead want to view/save the rows that were found to have a depth < 380 we can either save the output Series to a variable and use that or directly place the previously used code within a bracket. Both methods are equivalent and shown below.

It might seem strange that we don’t need to use .loc or .iloc despite the fact that we are selecting rows. This is due to the fact that the output of df['press dbar'] < 380 is a Pandas Series that contains information on the row and Pandas inherently assumes that when it is passed a boolean list like this that we want to select those rows that are True. A graphic example of this is shown below.

/archive-2022-2023/Filter%20DataFrame

From previous Pandas Data Wrangling workshop (Link to Github).

Column Specification

If you do not specify a column to compare and instead do e.g. df.loc[:] < df2.loc[:] you will get a comparison of every column and a DataFrame containing those values as an output:

Sorting

There may also come a time when we want to see the data sorted by some criterion in order to explore potential patterns and order statistics of the entries based on a relation. For example, we might want to order a DataFrame by the depth that a sample was recovered from. Note: we can also sort the order of the columns based on their names

There are two primary methods to sort a DataFrame either by the index or by a value. The index is rather straight forward, you use the method sort_index() to sort the DataFrame and provide an axis (0=row, 1=columns). We can reuse df and since the index is already sorted we can sort the order of the columns instead.

Here we can see that we’ve ordered the columns alphabetically.

If we want to instead order the DataFrame based on the value in a particular column or row we instead use sort_values(). We can again use the pressure i.e. ‘press dbar’ column as an example and sort the rows by greatest to smallest pressure value.