
Pandas Primer

Published at 01:15 AM · 5 min read

Learn by example!


Indexing and Slicing

Pandas provides multiple ways to index and slice DataFrames, each with its own advantages. Understanding these methods is crucial for efficient data manipulation. Here are the main approaches you should know.

Let’s start with an example DataFrame:

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 60, 10),
    'C': ['a', 'b', 'c', 'd', 'e']
}, index=['row1', 'row2', 'row3', 'row4', 'row5'])

print("Original DataFrame:")
print(df)
>>>
Original DataFrame:
      A   B  C
row1  1  10  a
row2  2  20  b
row3  3  30  c
row4  4  40  d
row5  5  50  e

Square Brackets []:

Single brackets (df['column_name']) return a single column as a Series; a list of labels (df[['col1', 'col2']]) returns a DataFrame.

# 1. Square Brackets
print("\nSelecting column 'A':")
print(df['A'])

>>>
Selecting column 'A':
row1    1
row2    2
row3    3
row4    4
row5    5
Name: A, dtype: int64
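
Passing a list of labels instead returns a DataFrame with just those columns:

# Double brackets: a list of labels returns a DataFrame
print(df[['A', 'C']])
>>>
      A  C
row1  1  a
row2  2  b
row3  3  c
row4  4  d
row5  5  e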

.loc[]:

Label-based indexing. Unlike ordinary Python slices, label slices include both endpoints, so 'row2':'row4' below selects row4 as well.

# 2. .loc[]
print("\nSelecting rows 'row2' to 'row4' and columns 'B' and 'C':")
print(df.loc['row2':'row4', ['B', 'C']])
>>>
Selecting rows 'row2' to 'row4' and columns 'B' and 'C':
       B  C
row2  20  b
row3  30  c
row4  40  d

.iloc[]:

Integer position-based indexing. These slices follow normal Python semantics, so the stop position is excluded.

# 3. .iloc[]
print("\nSelecting first 3 rows and first 2 columns:")
print(df.iloc[:3, :2])
>>>
Selecting first 3 rows and first 2 columns:
      A   B
row1  1  10
row2  2  20
row3  3  30

Boolean Indexing:

Select rows with a boolean mask, typically built from a comparison on a column.

# 4. Boolean Indexing
print("\nRows where 'A' is greater than 3:")
print(df[df['A'] > 3])
>>>
Rows where 'A' is greater than 3:
      A   B  C
row4  4  40  d
row5  5  50  e
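
Conditions can be combined with & (and) and | (or); wrap each comparison in parentheses, since these operators bind tighter than comparisons:

# Combine conditions with & / |
print(df[(df['A'] > 2) & (df['B'] < 50)])
>>>
      A   B  C
row3  3  30  c
row4  4  40  d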

.query():

Filter rows with a boolean expression written as a string.

# 5. .query()
print("\nUsing query to select rows where 'B' is greater than 30:")
print(df.query('B > 30'))
>>>
Using query to select rows where 'B' is greater than 30:
      A   B  C
row4  4  40  d
row5  5  50  e
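
Inside query, column names are referenced directly, and local Python variables can be pulled in with @:

# Reference a local variable inside the query string
threshold = 30
print(df.query('B > @threshold'))  # same result as above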

Data Preprocessing and Transformation

Data preprocessing is a crucial step in any data analysis pipeline. Pandas provides numerous tools to clean, transform, and prepare your data for analysis. Let’s explore some common preprocessing tasks:

  1. Handle Missing Data: Always start by understanding and addressing missing values
  2. Remove Duplicates: Ensure data integrity by handling duplicate entries
  3. Fix Data Types: Convert columns to appropriate data types (see the sketch after this list)
  4. Feature Engineering: Create new features from existing ones
  5. Data Normalization: Scale or normalize numerical features as needed

Remember that the order of preprocessing steps can affect your results, so maintain a consistent preprocessing pipeline across your projects.
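
Steps 1, 2, 4, and 5 all come up in the examples below; type fixing doesn't, so here is a minimal sketch (the column names are hypothetical, purely for illustration):

# Hypothetical columns -- illustration only
df['price'] = df['price'].astype(float)               # strings -> floats
df['date'] = pd.to_datetime(df['date'])               # strings -> datetimes
df['category'] = df['category'].astype('category')    # cuts memory, speeds up groupby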

Let’s work with some more sophisticated data:

import numpy as np

# Create sample DataFrames
df1 = pd.DataFrame({
    'ID': range(1, 6),
    'Name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, np.nan, 35, 28],
    'Salary': [50000, 60000, 55000, np.nan, 62000],
    'Department': ['IT', 'HR', 'Sales', 'IT', 'Finance']
})

df2 = pd.DataFrame({
    'ID': range(3, 8),
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'JoinDate': pd.date_range(start='2023-01-01', periods=5)
})

print("Original df1:")
print(df1)
print("\nOriginal df2:")
print(df2)
>>>
Original df1:
   ID     Name   Age   Salary Department
0   1     John  25.0  50000.0         IT
1   2     Jane  30.0  60000.0         HR
2   3      Bob   NaN  55000.0      Sales
3   4    Alice  35.0      NaN         IT
4   5  Charlie  28.0  62000.0    Finance

Original df2:
   ID      City   JoinDate
0   3  New York 2023-01-01
1   4    London 2023-01-02
2   5     Paris 2023-01-03
3   6     Tokyo 2023-01-04
4   7    Sydney 2023-01-05

Handling Missing Values:

df.dropna() # Remove missing values
df.fillna(value) # Replace missing values
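
Before choosing a strategy, check how much is actually missing:

# Count missing values per column
print(df1.isna().sum())
>>>
ID            0
Name          0
Age           1
Salary        1
Department    0
dtype: int64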

You can, for example, impute numeric columns with a summary statistic such as the mean or median:

df1['Age'] = df1['Age'].fillna(df1['Age'].mean())
df1['Salary'] = df1['Salary'].fillna(df1['Salary'].median())

Dropping Duplicates:

# Remove duplicates (if any)
df1.drop_duplicates(inplace=True)
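
By default a row must match on every column to count as a duplicate; subset and keep narrow that down:

# Treat rows sharing an ID as duplicates, keeping the first occurrence
df1.drop_duplicates(subset=['ID'], keep='first', inplace=True)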

Filtering

The general syntax for filtering rows based on a condition is:

df[df['column'] > threshold] # Filter rows based on condition

In this case:

print("\nRows where 'Age' is greater than 25:")
print(df1[df1['Age'] > 25])
>>>
Rows where 'Age' is greater than 25:
   ID     Name   Age   Salary Department
1   2     Jane  30.0  60000.0         HR
2   3      Bob  29.5  55000.0      Sales
3   4    Alice  35.0  57500.0         IT
4   5  Charlie  28.0  62000.0    Finance
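
For membership tests, isin is cleaner than chaining | conditions:

# Rows whose Department is IT or HR
print(df1[df1['Department'].isin(['IT', 'HR'])])
>>>
   ID   Name   Age   Salary Department
0   1   John  25.0  50000.0         IT
1   2   Jane  30.0  60000.0         HR
3   4  Alice  35.0  57500.0         IT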

Transformation

# Apply a Python function element-wise
df1['Salary'] = df1['Salary'].apply(lambda x: x * 1.1)
# OR vectorized arithmetic on the whole column
df1['Salary'] = df1['Salary'] * 1.1
# OR a NumPy ufunc
df1['Salary'] = np.log(df1['Salary'])
# OR scale relative to the column maximum
df1["NormalizedSalary"] = df1["Salary"] / df1["Salary"].max()

Grouping and Aggregating

Grouping and aggregating operations are fundamental to data analysis, allowing you to summarize and analyze data at different levels of granularity. These operations are particularly useful when you need to understand patterns and trends within categories of your data.

Binning Data

# Categorize 'Age'
df1["AgeGroup"] = pd.cut(
    df1["Age"], bins=[0, 25, 35, 100], labels=["Young", "Middle", "Senior"]
)
print("\nAfter preprocessing:")
print(df1)
>>>
After preprocessing:
   ID     Name   Age   Salary Department  NormalizedSalary AgeGroup
0   1     John  25.0  50000.0         IT          0.806452    Young
1   2     Jane  30.0  60000.0         HR          0.967742   Middle
2   3      Bob  29.5  55000.0      Sales          0.887097   Middle
3   4    Alice  35.0  57500.0         IT          0.927419   Middle
4   5  Charlie  28.0  62000.0    Finance          1.000000   Middle
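
pd.cut uses fixed bin edges; if you want bins with roughly equal numbers of rows instead, pd.qcut splits on quantiles (SalaryTier is just an illustrative name):

# Quantile-based binning: ~equal-sized groups rather than fixed edges
df1["SalaryTier"] = pd.qcut(df1["Salary"], q=2, labels=["Lower", "Upper"])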

GroupBy Operations

Getting a dictionary of groups:

groups = {name: group for name, group in df1.groupby("AgeGroup")}
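
Each value is an ordinary DataFrame, so a single group can be pulled out directly, e.g. groups["Middle"].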

Grouping and aggregating:
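
The remaining examples group a df of employees with Department, Salary, Age, Projects, and AgeGroup columns. The post doesn't show its definition, so below is one plausible reconstruction, chosen so the aggregates that follow line up; the names, values, and bin edges are assumptions, not the original data:

# Assumed employee data -- reconstructed to match the outputs below
df = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cara", "Dan", "Eve", "Finn", "Gia", "Hal"],
    "Department": ["Finance", "Finance", "HR", "HR", "HR", "IT", "IT", "IT"],
    "Salary": [65000, 68000, 55000, 58000, 59000, 60000, 70000, 72000],
    "Age": [36, 40, 35, 38, 42, 28, 30, 32],
    "Projects": [3, 3, 1, 2, 3, 5, 3, 4],
})
# Bin edges here differ from the earlier example; they are inferred
# from the grouped output further down
df["AgeGroup"] = pd.cut(
    df["Age"], bins=[0, 29, 35, 100], labels=["Young", "Middle", "Senior"]
)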

Basic groupby

print(df.groupby("Department")["Salary"].mean())
>>>
Department
Finance    66500.000000
HR         57333.333333
IT         67333.333333
Name: Salary, dtype: float64

Multiple aggregations

print(
    df.groupby("Department").agg(
        {"Salary": ["mean", "min", "max"], "Age": "mean", "Projects": "sum"}
    )
)
>>>
                  Salary                      Age Projects
                    mean    min    max       mean      sum
Department
Finance     66500.000000  65000  68000  38.000000        6
HR          57333.333333  55000  59000  38.333333        6
IT          67333.333333  60000  72000  30.000000       12
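
The same summary can be produced with named aggregation, which yields flat, readable column names instead of a MultiIndex:

# Named aggregation: each keyword becomes an output column
print(
    df.groupby("Department").agg(
        avg_salary=("Salary", "mean"),
        min_salary=("Salary", "min"),
        max_salary=("Salary", "max"),
        avg_age=("Age", "mean"),
        total_projects=("Projects", "sum"),
    )
)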

Custom aggregation function

# Custom aggregation function
def salary_range(x):
    return x.max() - x.min()


print(
    df.groupby("Department").agg({"Salary": ["mean", salary_range], "Projects": "sum"})
)
>>>
                  Salary              Projects
                    mean salary_range      sum
Department
Finance     66500.000000         3000        6
HR          57333.333333         4000        6
IT          67333.333333        12000       12

Multiple groupby columns

# Multiple groupby columns
agg_result = df.groupby(["Department", "AgeGroup"]).agg(
    {"Salary": ["mean", "count"], "Age": "mean"}
)
print(agg_result)
>>>
                      Salary         Age
                        mean count  mean
Department AgeGroup
Finance    Young         NaN     0   NaN
           Middle        NaN     0   NaN
           Senior    66500.0     2  38.0
HR         Young         NaN     0   NaN
           Middle    55000.0     1  35.0
           Senior    58500.0     2  40.0
IT         Young     60000.0     1  28.0
           Middle    71000.0     2  31.0
           Senior        NaN     0   NaN
