Understanding GroupBy in Python

Introduction

The GroupBy object in pandas is a powerful tool for grouping and analyzing data. With methods like aggregate(), filter(), transform(), and apply(), you can efficiently perform operations on subsets of your dataset.

aggregate()

The aggregate() method (alias agg()) applies one or more aggregation functions, such as sum, mean, or count, to each group and returns one row per group.

Real-World Example: Calculate the total and average sales by category.


      import pandas as pd
      
      # Sales data
      data = {'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
              'Sales': [1500, 1200, 1700, 800]}
      
      df = pd.DataFrame(data)
      
      # Grouping by category
      grouped = df.groupby('Category')
      result = grouped.aggregate(['sum', 'mean'])
      
      print(result)

Output:

                   Sales
                     sum    mean
      Category
      Clothing      2000  1000.0
      Electronics   3200  1600.0
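
Passing a list of functions produces two-level column labels (Sales / sum, Sales / mean). If flat, descriptive column names are preferred, named aggregation is one option. The sketch below assumes the same sample data; the output names total_sales and average_sales are illustrative choices, not part of the original example.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

# Named aggregation: new_column_name=(source_column, function)
summary = df.groupby('Category').agg(
    total_sales=('Sales', 'sum'),
    average_sales=('Sales', 'mean'),
)

print(summary)
```

The result still has one row per category, but each column carries a single, readable name.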

filter()

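Use filter() to keep or drop entire groups based on a condition. The function receives each group and must return True (keep all of the group's rows) or False (drop them).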

Real-World Example: Display categories with total sales greater than 2500.


      # Filter categories with total sales > 2500
      filtered = grouped.filter(lambda x: x['Sales'].sum() > 2500)
      print(filtered)

Output:

            Category  Sales
      0  Electronics   1500
      2  Electronics   1700
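
The same rows can usually be selected without a lambda by broadcasting each group's total back onto its rows and using it as a boolean mask. This is a sketch of that alternative on the same sample data, not a replacement for filter(); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

# Total sales of each row's category, aligned with the original rows
category_total = df.groupby('Category')['Sales'].transform('sum')

# Keep only the rows whose category total exceeds 2500
filtered = df[category_total > 2500]

print(filtered)
```

On large datasets this form can be faster, because the group totals are computed by a built-in aggregation instead of a Python function called once per group.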

transform()

The transform() method applies a function to each group and returns a Series or DataFrame with the same index and length as the original data, so the result can be assigned back as a new column.

Real-World Example: Calculate each sale's percentage contribution to total sales in its category.


      # Calculate sales percentage
      df['Percent'] = grouped['Sales'].transform(lambda x: x / x.sum() * 100)
      print(df)

Output:

            Category  Sales  Percent
      0  Electronics   1500   46.875
      1     Clothing   1200   60.000
      2  Electronics   1700   53.125
      3     Clothing    800   40.000
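
When the per-group value is a standard aggregate, transform() also accepts the function name as a string, which avoids calling a Python lambda for every group. The sketch below recomputes the percentage that way and adds each sale's difference from its category mean; the DiffFromMean column is an illustrative addition, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

sales_by_category = df.groupby('Category')['Sales']

# Group totals and means, broadcast to one value per original row
category_total = sales_by_category.transform('sum')
category_mean = sales_by_category.transform('mean')

df['Percent'] = df['Sales'] / category_total * 100
df['DiffFromMean'] = df['Sales'] - category_mean  # hypothetical extra column

print(df)
```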

apply()

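The apply() method calls a custom function on each group and combines the results. The function may return a scalar, a Series, or a DataFrame, so the shape of the output depends on what it returns.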

Real-World Example: Find the largest sales difference within each category.



      # Calculate the largest sales difference in each category
      result = grouped.apply(lambda x: x['Sales'].max() - x['Sales'].min())
      print(result)

Output:

      Category
      Clothing       400
      Electronics    200
      dtype: int64
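
Because the function here reduces each group to a single number, the same result can also be obtained by selecting the Sales column first and passing the function to agg(). The sketch below assumes the same sample data. Selecting the column first also avoids the deprecation warning that pandas 2.2 emits when apply() operates on the grouping column.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

# Range (max - min) of Sales within each category
sales_range = df.groupby('Category')['Sales'].agg(lambda s: s.max() - s.min())

print(sales_range)
```

apply() remains the more general tool: it is the one to reach for when the function needs several columns at once or returns something other than one value per group.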

Feature	aggregate()	filter()	transform()	apply()
Purpose	Reduces each group to summary values using one or more functions.	Keeps or drops whole groups based on a condition.	Applies a function to each group and returns a result aligned with the original rows.	Applies an arbitrary function to each group and combines the results.
Typical Use Case	Summing, averaging, or finding min/max values for groups.	Keeping only groups that meet a condition.	Standardizing or normalizing data within groups.	Custom per-group logic that does not fit the other three methods.
Returns	A Series or DataFrame with one row per group.	A subset of the original rows.	A Series or DataFrame with the same index as the input.	A Series or DataFrame, depending on what the function returns.
Shape Preservation	Reduces to one row per group.	Keeps only the rows of matching groups.	Maintains the original shape.	Can change the shape depending on the function.
Supports GroupBy?	Yes	Yes	Yes	Yes
Function Behavior	Returns a scalar or a reduced object for each group.	Returns True to keep a group, False to drop it.	Returns an object of the same length as the group, or a scalar that is broadcast to the group.	Can return a scalar, Series, or DataFrame.
Example Usage	`df.groupby('Category').agg({'Sales': ['sum', 'mean']})`	`df.groupby('Category').filter(lambda x: x['Sales'].sum() > 1000)`	`df.groupby('Category')['Sales'].transform(lambda x: x / x.mean())`	`df.groupby('Category')['Sales'].apply(lambda s: s.max() - s.min())`
Performance	Fast for built-in aggregation functions.	Efficient but can be slow for complex conditions.	Fast with built-in function names; usually faster than apply().	Can be slow, since the function runs once per group in Python.
Use with Multiple Functions	Yes	No	No	No (one function, which may compute several results)
Result Index	Group keys	Original index (matching rows only)	Original index	Depends on the function
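
The shape differences summarized in the table can be checked directly. The following is a small sketch on the same four-row sample data; the result variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales': [1500, 1200, 1700, 800],
})

grouped = df.groupby('Category')

# aggregate: one value per group
agg_result = grouped['Sales'].agg('sum')

# filter: only the rows of groups that pass the condition
filter_result = grouped.filter(lambda g: g['Sales'].sum() > 2500)

# transform: one value per original row
transform_result = grouped['Sales'].transform('sum')

# apply: shape depends on the function (here, one scalar per group)
apply_result = grouped['Sales'].apply(lambda s: s.max() - s.min())

for name, result in [('aggregate', agg_result), ('filter', filter_result),
                     ('transform', transform_result), ('apply', apply_result)]:
    print(name, result.shape)
```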
