Introduction
The GroupBy object in pandas is a powerful tool for grouping and analyzing data. With methods like aggregate(), filter(), transform(), and apply(), you can efficiently perform operations on subsets of your dataset.
aggregate()
The aggregate() method applies aggregation functions like sum, mean, or count to groups of data.
Real-World Example: Calculate the total and average sales by category.
import pandas as pd
# Sales data
data = {'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
        'Sales': [1500, 1200, 1700, 800]}
df = pd.DataFrame(data)
# Grouping by category
grouped = df.groupby('Category')
result = grouped.aggregate(['sum', 'mean'])
print(result)
Output:
             Sales
               sum    mean
Category
Clothing      2000  1000.0
Electronics   3200  1600.0
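The list form above produces two-level column labels. As a minimal sketch, pandas' named aggregation gives flat, readable column names instead; the names `total` and `average` are illustrative choices, not part of the original example.

```python
import pandas as pd

# Same sales data as in the example above
df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

# Named aggregation: output_column=(input_column, function).
# 'total' and 'average' are hypothetical names chosen for illustration.
summary = df.groupby('Category').agg(
    total=('Sales', 'sum'),
    average=('Sales', 'mean'),
)
print(summary)
```

The result has one row per category and single-level columns, which is usually easier to merge or export than a MultiIndex header.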
filter()
Use filter() to include or exclude entire groups based on a condition.
Real-World Example: Display categories with total sales greater than 2500.
# Filter categories with total sales > 2500
filtered = grouped.filter(lambda x: x['Sales'].sum() > 2500)
print(filtered)
Output:
      Category  Sales
0  Electronics   1500
2  Electronics   1700
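For a simple threshold like this one, the same rows can also be selected with a boolean mask built from transform('sum'). The sketch below compares both approaches on the same data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

# Broadcast each category's total back onto its rows, then compare row by row
group_totals = df.groupby('Category')['Sales'].transform('sum')
mask_result = df[group_totals > 2500]

# Reference result using filter(), as in the example above
filter_result = df.groupby('Category').filter(lambda g: g['Sales'].sum() > 2500)

# Both approaches keep the same rows with the same index
print(mask_result.equals(filter_result))
```

The mask version avoids calling a Python function once per group, which may matter when there are many groups.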
transform()
The transform() method applies a function to each group and returns a Series or DataFrame with the same index and length as the original, so the result can be assigned straight back as a new column.
Real-World Example: Calculate each sale's percentage contribution to total sales in its category.
# Calculate sales percentage
df['Percent'] = grouped['Sales'].transform(lambda x: x / x.sum() * 100)
print(df)
Output:
      Category  Sales  Percent
0  Electronics   1500   46.875
1     Clothing   1200   60.000
2  Electronics   1700   53.125
3     Clothing    800   40.000
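The lambda above runs once per group. A sketch of an equivalent approach passes the built-in name 'sum' to transform() and does the division on whole columns; `category_total` is an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

# transform('sum') repeats each category's total on every row of that category
category_total = df.groupby('Category')['Sales'].transform('sum')

# Column-wise arithmetic on two aligned Series
df['Percent'] = df['Sales'] / category_total * 100
print(df)
```

Because `category_total` shares the index of `df`, the division lines up row by row without any explicit merge.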
apply()
The apply() method calls a custom function once per group and combines the results. The shape of the output depends on what the function returns.
Real-World Example: Find the largest sales difference within each category.
# Calculate the largest sales difference in each category
# Select the Sales column first so the function receives only the values it needs
result = grouped['Sales'].apply(lambda x: x.max() - x.min())
print(result)
Output:
Category
Clothing       400
Electronics    200
Name: Sales, dtype: int64
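When the custom function is just a combination of built-in reductions, apply() is not strictly needed. The following sketch computes the same per-category range both ways; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

sales_by_category = df.groupby('Category')['Sales']

# apply(): the lambda is called once per group
range_apply = sales_by_category.apply(lambda s: s.max() - s.min())

# Built-in reductions: two grouped passes, then a Series subtraction
range_builtin = sales_by_category.max() - sales_by_category.min()

print(range_apply.equals(range_builtin))
```

Reaching for apply() makes most sense when the per-group logic cannot be expressed with built-in aggregations or transform().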
Feature | aggregate() | filter() | transform() | apply() |
---|---|---|---|---|
Purpose | Reduces each group to summary values using one or more functions. | Keeps or drops whole groups based on a condition. | Applies a function to each group and returns a result aligned with the original rows. | Applies a custom function to each group as a whole. |
Typical Use Case | Summing, averaging, or finding min/max values for groups. | Keeping only groups that meet a condition. | Standardizing or normalizing data within groups. | Custom per-group logic that the other three methods cannot express. |
Returns | A Series or DataFrame with one row per group. | A subset of the original rows. | A Series or DataFrame with the same index as the input. | A Series or DataFrame, depending on what the function returns. |
Shape Preservation | Reduces to one row per group. | Keeps only rows from matching groups. | Maintains the original shape. | Can change the shape depending on the function. |
Supports GroupBy? | Yes | Yes | Yes | Yes |
Function Behavior | Returns a scalar for each group. | Returns True to keep a group, False to drop it. | Returns a Series of the same length as the group, or a scalar that is broadcast. | Can return a scalar, Series, or DataFrame. |
Example Usage | df.groupby('Category').agg({'Sales': ['sum', 'mean']}) | df.groupby('Category').filter(lambda x: x['Sales'].sum() > 1000) | df.groupby('Category')['Sales'].transform(lambda x: x / x.mean()) | df.groupby('Category')['Sales'].apply(lambda x: x.max() - x.min()) |
Performance | Fast for built-in aggregation functions. | Calls the condition once per group, so it can be slow with many groups or costly conditions. | Typically faster than apply() for same-shape results, especially with built-in function names. | Usually the slowest, since the function runs in Python for each group. |
Use with Multiple Functions | Yes | No | No | No (single callable) |
Keeps Original Row Index? | No (indexed by group key) | Yes (for the rows kept) | Yes | Depends on the function |
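To make the shape differences in the table concrete, here is a small sketch that runs all four methods on the same sales data and prints the shape and index of each result. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})
sales = df.groupby('Category')['Sales']

# One value per group, indexed by category
agg_out = sales.agg('sum')

# Only the rows of groups whose total exceeds 2500, original index kept
filter_out = df.groupby('Category').filter(lambda g: g['Sales'].sum() > 2500)

# One value per original row, original index kept
transform_out = sales.transform('mean')

# Shape depends on the function; a scalar per group gives one row per group
apply_out = sales.apply(lambda s: s.max() - s.min())

results = {
    'aggregate': agg_out,
    'filter': filter_out,
    'transform': transform_out,
    'apply': apply_out,
}
for name, out in results.items():
    print(name, out.shape, list(out.index))
```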