Introduction
A pandas GroupBy object, created by calling groupby() on a DataFrame or
Series, splits the data into groups so that each group can be processed
separately. With methods like aggregate(), filter(), transform(), and
apply(), you can efficiently perform operations on subsets of your dataset.
aggregate()
The aggregate() method (alias agg()) applies aggregation functions like
sum, mean, or count to groups of data and returns one row per group.
Real-World Example: Calculate the total and average sales by category.
import pandas as pd
# Sales data
data = {'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
        'Sales': [1500, 1200, 1700, 800]}
df = pd.DataFrame(data)
# Grouping by category
grouped = df.groupby('Category')
result = grouped.aggregate(['sum', 'mean'])
print(result)
Output:
            Sales
              sum    mean
Category
Clothing     2000  1000.0
Electronics  3200  1600.0
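Passing a list of functions produces two-level column labels, as in the output above. If flat column names are easier to work with, pandas also supports named aggregation. The following is a minimal sketch on the same sales data; the output column names total_sales and average_sales are illustrative choices, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

# Named aggregation: each keyword is an output column name (chosen here
# for illustration) mapped to a (source column, function) pair.
summary = df.groupby('Category').agg(
    total_sales=('Sales', 'sum'),
    average_sales=('Sales', 'mean'),
)
print(summary)
```

The result has one row per category and ordinary single-level columns, which is usually more convenient for merging or exporting.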
filter()
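Use filter() to keep or drop entire groups based on a condition. The
function receives each group as a DataFrame and must return True or False;
the result contains the original rows of every group that passed.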
Real-World Example: Display categories with total sales greater than 2500.
# Filter categories with total sales > 2500
filtered = grouped.filter(lambda x: x['Sales'].sum() > 2500)
print(filtered)
Output:
      Category  Sales
0  Electronics   1500
2  Electronics   1700
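The same selection can often be written without a Python lambda by broadcasting each group's total back onto its rows and using it as a boolean mask. This is a sketch of that alternative, not a replacement for filter(); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

# Broadcast each category's total onto its own rows, then mask the frame.
category_total = df.groupby('Category')['Sales'].transform('sum')
filtered_by_mask = df[category_total > 2500]

# The filter() version from the example above, for comparison.
filtered_by_filter = df.groupby('Category').filter(
    lambda g: g['Sales'].sum() > 2500
)

print(filtered_by_mask.equals(filtered_by_filter))
```

On large datasets with many groups the mask version tends to be faster, because the built-in 'sum' avoids calling a Python function once per group.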
transform()
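The transform() method applies a function to each group and returns a
Series or DataFrame with the same index and length as the input, so the
result can be assigned straight back to a column of the original DataFrame.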
Real-World Example: Calculate each sale's percentage contribution to total sales in its category.
# Calculate sales percentage
df['Percent'] = grouped['Sales'].transform(lambda x: x / x.sum() * 100)
print(df)
Output:
      Category  Sales  Percent
0  Electronics   1500   46.875
1     Clothing   1200   60.000
2  Electronics   1700   53.125
3     Clothing    800   40.000
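Because the result of transform() lines up with the original rows, it also works well for comparing each row with a group-level statistic. The sketch below uses the built-in 'mean' alias instead of a lambda; the DiffFromMean column name is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

# One mean per category, repeated for every row of that category.
category_mean = df.groupby('Category')['Sales'].transform('mean')

# Row-by-row difference from the category average.
df['DiffFromMean'] = df['Sales'] - category_mean
print(df)
```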
apply()
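The apply() method passes each group to a custom function and combines
the results. It is the most flexible of the four methods, because the
function may return a scalar, a Series, or a DataFrame.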
Real-World Example: Find the largest sales difference within each category.
# Calculate the largest sales difference in each category.
# Selecting the Sales column first passes only those values to the function.
result = grouped['Sales'].apply(lambda s: s.max() - s.min())
print(result)
Output:
Category
Clothing 400
Electronics 200
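When the custom logic can be expressed with built-in aggregations, apply() is not strictly needed. As a sketch of that alternative, the same per-category range can be computed from the group maximum and minimum; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

# Built-in aggregations produce one 'max' and one 'min' column per category.
stats = df.groupby('Category')['Sales'].agg(['max', 'min'])

# Subtracting the columns gives the range within each category.
sales_range = stats['max'] - stats['min']
print(sales_range)
```

Reserving apply() for logic that has no built-in equivalent keeps the code faster and easier to read.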
| Feature | aggregate() | filter() | transform() | apply() |
|---|---|---|---|---|
| Purpose | Reduces each group to one value per aggregation function. | Filters groups based on a condition. | Applies a function to each group and returns a result aligned with the original rows. | Applies an arbitrary function to each group and combines the results. |
| Typical Use Case | Summing, averaging, or finding min/max values for groups. | Keeping only groups that meet a condition. | Standardizing or normalizing data within groups. | Custom per-group logic that the other three methods cannot express. |
| Returns | A Series or DataFrame with one row per group. | A subset of the original rows. | A Series or DataFrame with the same index as the input. | A Series or DataFrame, depending on what the function returns. |
| Shape Preservation | Reduces dimensions. | Keeps only matching groups. | Maintains the original shape. | Can change the shape depending on the function. |
| Supports GroupBy? | Yes | Yes | Yes | Yes |
| Function Behavior | Returns a scalar or a reduced object. | Returns True to keep groups, False to drop them. | Returns a Series of the same length as the input. | Can return any type of object. |
| Example Usage | df.groupby('Category').agg({'Sales': ['sum', 'mean']}) | df.groupby('Category').filter(lambda x: x['Sales'].sum() > 1000) | df.groupby('Category')['Sales'].transform(lambda x: x / x.mean()) | df.groupby('Category')['Sales'].apply(lambda s: s.max() - s.min()) |
| Performance | Fast for built-in aggregation functions. | Efficient but can be slow for complex conditions. | Faster than apply() for element-wise operations. | Can be slow if used inefficiently. |
| Use with Multiple Functions | Yes | No | No | No (takes a single callable) |
| Result Index | Group keys | Original index (surviving rows only) | Original index | Depends on what the function returns |
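To make the shape differences in the table concrete, the following sketch runs all four methods on the sales data from the earlier examples and prints the shape of each result. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})
grouped = df.groupby('Category')

# aggregate(): one value per category
aggregated = grouped['Sales'].agg('sum')

# filter(): original rows of the categories that pass the condition
kept = grouped.filter(lambda g: g['Sales'].sum() > 2500)

# transform(): one value per original row, same index as df
shares = grouped['Sales'].transform(lambda s: s / s.sum())

# apply(): shape depends on what the function returns (a scalar here)
spread = grouped['Sales'].apply(lambda s: s.max() - s.min())

for name, obj in [('aggregate', aggregated), ('filter', kept),
                  ('transform', shares), ('apply', spread)]:
    print(name, obj.shape)
```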