Aggregations aggregate the data in each group, i.e., they reduce the data to a single value. This includes, for instance, computing group sums, means, maximums, minimums, etc. Some of the interesting/important summary aggregation methods of GroupBy
objects are:
Methods | Decription |
---|---|
mean , median |
Computes the mean and the median in each group |
min , max |
computes the min and max in each group |
size |
computes the number of values in each group |
When one of these methods is called by the GroupBy
object, it is applied to each group individually and then the group is combined into a new DataFrame
.
For example, suppose we wanted to group df
by Region
, apply the sum()
method to calculate the aggregate Total Revenue
by Region
and aggregate Total Profit
by Region
. Then we can combine the results into a new DataFrame
which holds the aggregate Total Revenue
by Region
and the aggregate Total Profit
by Region
. We could achieve this by first splitting the data using the groupby()
DataFrame
method to obtain a new GroupBy
object, we will call it grouped_by_region
. Then we could apply and combine using the GroupBy
object’s sum()
method.
Code:
grouped_by_region = df.loc[:, ['Region', 'Total Revenue', 'Total Profit']].groupby("Region")
grouped_by_region.sum()
Output:
Total Revenue | Total Profit | |
---|---|---|
Region | ||
Asia | 6921531.31 | 2580972.04 |
Australia and Oceania | 3292856.72 | 1236498.14 |
Central America and the Caribbean | 6573837.78 | 1735667.38 |
Europe | 1341328.03 | 347463.87 |
Sub-Saharan Africa | 9980589.83 | 2990051.28 |
We see from the above example that the GroupBy
sum()
method returns a DataFrame
with an index labeling the group that the row entry corresponds to and entries telling us the aggregate Total Revenue
and aggregate Total Profit
by Region
.
As discussed in the previous cell, pandas
has implemented the most common aggregate methods for us, like sum()
and mean()
, but sometimes our data requires unique processing. The GroupBy
method agg()
can be used where complex or custom aggregation logic is required. The method agg()
will take a function and use it to aggregate the group in the same way that we saw sum()
do in the previous cell. The function passed must take a DataFrame
as an argument, and that passed DataFrame
will be each group of the calling GroupBy
object.
For example, suppose we wanted to find the aggregate Total Revenue
by Region in Canadian dollars. We can define a function called sum_total_revenue_CAD()
to return the sum of the Total Revenue
of a group in Canadian Dollars. Then we can create a new GroupBy
object, call it grouped_by_region
, using a subset of the df
DataFrame only containing the Region
and Total Revenue
columns. Lastly, we can call agg()
with the grouped_by_region
GroupBy object and pass it the sum_total_revenue_CAD
function.
Code:
def sum_total_revenue_CAD(x):
return x.sum() * 1.33
grouped_by_region = df.loc[:, ["Region", "Total Revenue"]].groupby("Region")
grouped_by_region.agg(sum_total_revenue_CAD)
Output:
Total Revenue | |
---|---|
Region | |
Asia | 9.205637e+06 |
Australia and Oceania | 4.379499e+06 |
Central America and the Caribbean | 8.743204e+06 |
Europe | 1.783966e+06 |
Sub-Saharan Africa | 1.327418e+07 |
We see in the above example that the result is a new DataFrame
with the unique Region
values as the index and values corresponding to the sum of the Total Revenue
by Region
in Canadian dollars.
To customize group-specific processing even further agg()
can also take a dictionary of functions to aggregate on. The dictionary should be the name of a column of the group and the value of a callable function that will take a Series
.
For example, suppose we wanted to create a new DataFrame
that tells us the sum of Total Revenue
and the max Total Profit
by Region from df
. To do this we would first groupby()
Region
and then call agg()
with the new GroupBy
object, passing it the dictionary: {'Total Revenue' :sum,'Total Profit' : max}
, which specifies that we want to sum the Total Revenue
column and find the max of the Total Profit
column.
Code:
grouped_by_region = df.groupby("Region")
grouped_by_region.agg({'Total Revenue' :sum,'Total Profit' : max})
Output:
Total Revenue | Total Profit | |
---|---|---|
Region | ||
Asia | 6921531.31 | 1208744.24 |
Australia and Oceania | 3292856.72 | 951410.50 |
Central America and the Caribbean | 6573837.78 | 1487261.02 |
Europe | 1341328.03 | 224598.75 |
Sub-Saharan Africa | 9980589.83 | 693911.51 |
Transformations change the data in a group-specific way. As opposed to aggregations, which reduce the data into a single value, transformations modify the data but don’t change the shape of the groups.
Applying a transformation is done using the transform()
GroupBy
method. The transform()
method takes as input a function name, which it calls on each group of the GroupBy
object. The function passed to transform()
must take a DataFrame
, which will be a group of the calling GroupBy
object.
For example, suppose we wanted to transform the Total Revenue
column of the df
DataFrame to hold the percentage of the Total Revenue by Region that rows make up. First, we would define a function that will take a DataFrame
and calculate the percentage of the total each entry takes up. Then we will create a new GroupBy
object grouped by specialty
from a subset of df
that only has the columns Total Revenue
and Region
. Then we will call transform passing it the name of our defined function.
Code:
def my_function(x):
return (x / x.sum() ) * 100
grouped_by_region = df.loc[:, ["Region", "Total Revenue"]].groupby("Region")
grouped_by_region.transform(my_function)
Output:
Total Revenue | |
---|---|
0 | 76.943949 |
1 | 8.773913 |
2 | 86.369819 |
3 | 0.757387 |
4 | 33.028359 |
5 | 23.056051 |
6 | 28.034881 |
7 | 12.475344 |
8 | 4.970659 |
9 | 13.588176 |
10 | 0.276000 |
11 | 4.563649 |
12 | 13.045966 |
13 | 91.226087 |
14 | 5.787140 |
15 | 13.630181 |
16 | 43.912456 |
17 | 2.581546 |
18 | 36.978437 |
We see that the result is a new DataFrame
with an index matching that of the original DataFrame
used to initialize the GroupBy
object. This is different than the aggregation example because aggregation reduces the group to a single value, while transformation maintains the shape of the calling DataFrame
.
Let us save these results into a new column in df
called Total_Revenue_Percentage
.
Code:
df["Total_Revenue_Percentage"] = grouped_by_region.transform(my_function)
Suppose we wanted to see the percent Total Revenue by Order Priority
and Region
. One solution to achieve this would be to group on both the Region
and the Order Priority
columns and then sum the Total_Revenue_Percentage
that was computed previously.
Code:
order_priority_pct = df.loc[:, ["Region", "Order Priority", "Total_Revenue_Percentage"]].groupby(["Region", "Order Priority"]).sum()
order_priority_pct.head(n=4)
Output:
Total_Revenue_Percentage | ||
---|---|---|
Region | Order Priority | |
Asia | C | 5.787140 |
H | 0.276000 | |
L | 50.024403 | |
M | 43.912456 |
Notice that since we are grouping on two columns, the resulting index of order_priority_pct
also contains two columns. Now we can sort first on the values in the Region
column so that all the entries with a common Region
are clustered together, and then on the values in the Total_Revenue_Percentage
columns.
Code:
order_priority_pct = df.loc[:, ["Region", "Order Priority", "Total_Revenue_Percentage"]].groupby(["Region", "Order Priority"]).sum()
order_priority_pct.sort_values(["Region", "Total_Revenue_Percentage"], ascending=[True, False]).head(n=6)
Output:
Total_Revenue_Percentage | ||
---|---|---|
Region | Order Priority | |
Asia | L | 50.024403 |
M | 43.912456 | |
C | 5.787140 | |
H | 0.276000 | |
Australia and Oceania | H | 76.943949 |
C | 23.056051 |
Filtering a group consists of dropping or retaining groups in a way that depends on a group-specific computation that returns True
or False
. Groups that are retained will be left unmodified. For instance, we can filter specialties from spending_df
that don’t have enough entries or for which the mean spending
is below a certain threshold.
Filtering a group is done using the GroupBy
method filter()
. The method filter()
takes as input a function name, which it calls on each group of the GroupBy
object. The function must return either True
or False
and groups for which the function returns False
are dropped. The resulting DataFrame
will have entries in the same order as the original DataFrame
.
Suppose we want to filter out the Regions that having low Total Revenue, i.e. we want to filter out the Regions for which the aggregate Total Revenue
is less than some defined threshold, let say $4,000,000. To do this we can define a function named filter_on_total_revenue()
. The defined function will take a DataFrame
, determine whether the sum of the Total Revenue
column in that DataFrame
is less than 5000000, and then return True
if it is or False
if not.
Then, to apply the filter on df
, we first subset the DataFrame
so that only the columns Region
and Total Revenue
are remaining and then group by Region
. Then the GroupBy filter()
method can be called with the filter_on_total_revenue()
function passed as an argument. We can save the results into a new DataFrame
named low_revenue_df
. Then to see which Regions are less than the $4,000,000 Total Revenue threshold we can print the unique values in the Region
column of low_revenue_df
.
Code:
def filter_on_total_revenue(x):
return x['Total Revenue'].sum() < 4000000
low_revenue_df = df[["Region", "Total Revenue"]].groupby("Region").filter(filter_on_total_revenue)
low_revenue_df["Region"].unique()
Output:
array(['Australia and Oceania', 'Europe'], dtype=object)
We see that only two Regions are under the threshold of Total Revenue.
apply()
GroupBy Methodpandas
provides a few built-in GroupBy
methods for thinning the data including nlargest()
, nsmallest()
, and more. An example usage of nlargest()
, a thinning method, would be grouping a subset of df
which contains only the Total Revenue
and Region
columns by Region
and then obtaining the 2 largest of each Region. The result will be a new DataFrame
with only the top 2 spenders from each unique specialty.
Code:
grouped_by_region = df.loc[:, ["Region", "Total Revenue"]].groupby("Region")
grouped_by_region["Total Revenue"].nlargest(2).head(n=4)
Output:
Region
Asia 16 3039414.40
18 2559474.10
Australia and Oceania 0 2533654.00
5 759202.72
Name: Total Revenue, dtype: float64
Though pandas
has the more common and basic aggregation, transformation, and thinning methods implemented for us, they could not possibly cover all cases. Therefore cases that do not fit into any one of these categories may be carried out by using the more flexible apply() GroupBy
method. apply()
takes as input a function name, which it calls on each group of the calling GroupBy
object.
For example, suppose we wanted to thin our dataset so that there are only 50% of each specialty represented. To do this we can define a new function, we will call it, sample_50p
, and this function will utilize the sample()
DataFrame
method. The sample()
DataFrame
method will take a parameter frac
that specifies the fraction of the original DataFrame
that is to be returned. We can then use the apply()
method and pass it the name of our newly defined function to obtain a new DataFrame
that is filtered at the group-specific level.
Code:
def sample_50p(x):
return x.sample(frac=0.5)
grouped_by_region = df.loc[:, ["Region", "Total Revenue", "Order Priority"]].groupby("Region")
grouped_by_region.apply(sample_50p).head(n=5)
Output:
Region | Total Revenue | Order Priority | ||
---|---|---|---|---|
Region | ||||
Asia | 10 | Asia | 19103.44 | H |
18 | Asia | 2559474.10 | L | |
Australia and Oceania | 0 | Australia and Oceania | 2533654.00 | H |
Central America and the Caribbean | 1 | Central America and the Caribbean | 576782.80 | C |
Europe | 15 | Europe | 182825.44 | M |
Key Points
Function Application and Mapping
DataFrame
we can use the apply()
methodSeries
we can use the map()
methodDataFrame
we can use the applymap()
methodGroup Specific Processing
The groupby()
method is used to group the data using values on one or more columns
groupby()
is often applied in the context of the data processing paradigm called “split-apply-combine”
DataFrame
There are 3 common classes of split-apply-combine operations that can be applied to group data.
Aggregations generate a single value for each group
Transformations convert the data and generate a group of the same size as the original group.
Filters retain or discard a group based on group-specific boolean computations.
For more information
You can read more about groupby
from the official pandas documentation of groupby
Bio Break! We will start again at 3:05 PM.