Learn by example!
Indexing and Slicing
Pandas provides multiple ways to index and slice DataFrames, each with its own advantages. Understanding these methods is crucial for efficient data manipulation. Here are the main approaches you should know.
Let’s start with an example DataFrame:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 60, 10),
    'C': ['a', 'b', 'c', 'd', 'e']
}, index=['row1', 'row2', 'row3', 'row4', 'row5'])

print("Original DataFrame:")
print(df)
>>>
Original DataFrame:
A B C
row1 1 10 a
row2 2 20 b
row3 3 30 c
row4 4 40 d
row5 5 50 e
Square Brackets []:
- Single bracket df['column_name'] returns a Series; double brackets df[['col1', 'col2']] return a DataFrame
- Used for selecting columns
- Similar to NumPy’s array indexing, but with column names
# 1. Square Brackets
print("\nSelecting column 'A':")
print(df['A'])
>>>
Selecting column 'A':
row1 1
row2 2
row3 3
row4 4
row5 5
Name: A, dtype: int64
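Passing a list of names in double brackets keeps the result as a DataFrame instead of a Series (a small extra example beyond the original output):
# Double brackets return a DataFrame with just the listed columns
print(df[['A', 'C']])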
.loc[]:
- Label-based indexing
- df.loc['row_label', 'column_label']
- df.loc['row_label1':'row_label2', 'col1':'col3'] (label slices include both endpoints)
- Can use boolean arrays for filtering
# 2. .loc[]
print("\nSelecting rows 'row2' to 'row4' and columns 'B' and 'C':")
print(df.loc['row2':'row4', ['B', 'C']])
>>>
Selecting rows 'row2' to 'row4' and columns 'B' and 'C':
B C
row2 20 b
row3 30 c
row4 40 d
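Because .loc also accepts boolean arrays (as noted above), a row filter and a column selection can be combined in one call; a short sketch:
# Boolean mask selects the rows, labels select the columns
print(df.loc[df['B'] >= 30, ['A', 'C']])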
.iloc[]:
- Integer-based (positional) indexing
- df.iloc[0, 0]      # first row, first column
- df.iloc[0:5, 0:3]  # first 5 rows, first 3 columns (slice ends are exclusive)
- Most similar to NumPy array indexing
# 3. .iloc[]
print("\nSelecting first 3 rows and first 2 columns:")
print(df.iloc[:3, :2])
>>>
Selecting first 3 rows and first 2 columns:
A B
row1 1 10
row2 2 20
row3 3 30
Boolean Indexing:
- df[df['column'] > value]
- Similar to NumPy’s boolean indexing
# 4. Boolean Indexing
print("\nRows where 'A' is greater than 3:")
print(df[df['A'] > 3])
>>>
Rows where 'A' is greater than 3:
A B C
row4 4 40 d
row5 5 50 e
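Multiple conditions can be combined with & and |, with each condition wrapped in parentheses (an extra example beyond the original):
# Combine conditions with & (and) / | (or); the parentheses are required
print(df[(df['A'] > 2) & (df['C'] != 'e')])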
.query():
- String expression filtering: df.query('column > value')
- Can be more readable for complex conditions
# 5. .query()
print("\nUsing query to select rows where 'B' is greater than 30:")
print(df.query('B > 30'))
>>>
Using query to select rows where 'B' is greater than 30:
A B C
row4 4 40 d
row5 5 50 e
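query can also reference local Python variables with the @ prefix, which is handy when a threshold is computed elsewhere (a small additional example):
# Reference a local variable inside the query string with @
threshold = 30
print(df.query('B > @threshold'))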
Data Preprocessing and Transformation
Data preprocessing is a crucial step in any data analysis pipeline. Pandas provides numerous tools to clean, transform, and prepare your data for analysis. Let’s explore some common preprocessing tasks:
- Handle Missing Data: Always start by understanding and addressing missing values
- Remove Duplicates: Ensure data integrity by handling duplicate entries
- Fix Data Types: Convert columns to appropriate data types (see the sketch below)
- Feature Engineering: Create new features from existing ones
- Data Normalization: Scale or normalize numerical features as needed
Remember that the order of preprocessing steps can affect your results, so maintain a consistent preprocessing pipeline across your projects.
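For the "Fix Data Types" step above, here is a minimal sketch; the raw DataFrame and its column names are hypothetical, not part of the examples that follow:
# Hypothetical raw data where every column arrived as strings
raw = pd.DataFrame({
    'price': ['19.99', '5.49', '12.00'],
    'signup': ['2023-01-05', '2023-02-11', '2023-03-20'],
    'tier': ['basic', 'pro', 'basic']
})

raw['price'] = raw['price'].astype(float)       # numeric type for arithmetic
raw['signup'] = pd.to_datetime(raw['signup'])   # proper datetime type
raw['tier'] = raw['tier'].astype('category')    # memory-efficient categorical
print(raw.dtypes)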
Let’s work with some more sophisticated data:
import numpy as np

# Create sample DataFrames
df1 = pd.DataFrame({
    'ID': range(1, 6),
    'Name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, np.nan, 35, 28],
    'Salary': [50000, 60000, 55000, np.nan, 62000],
    'Department': ['IT', 'HR', 'Sales', 'IT', 'Finance']
})

df2 = pd.DataFrame({
    'ID': range(3, 8),
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'JoinDate': pd.date_range(start='2023-01-01', periods=5)
})
print("Original df1:")
print(df1)
print("\nOriginal df2:")
print(df2)
>>>
Original df1:
ID Name Age Salary Department
0 1 John 25.0 50000.0 IT
1 2 Jane 30.0 60000.0 HR
2 3 Bob NaN 55000.0 Sales
3 4 Alice 35.0 NaN IT
4 5 Charlie 28.0 62000.0 Finance
Original df2:
   ID      City   JoinDate
0   3  New York 2023-01-01
1   4    London 2023-01-02
2   5     Paris 2023-01-03
3   6     Tokyo 2023-01-04
4   7    Sydney 2023-01-05
Handling Missing Values:
df.dropna() # Remove missing values
df.fillna(value) # Replace missing values
You can fill missing numeric values with a statistic such as the column mean or median:
df1['Age'] = df1['Age'].fillna(df1['Age'].mean())
df1['Salary'] = df1['Salary'].fillna(df1['Salary'].median())
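Dropping is the alternative to filling: on the original df1 (before the fillna calls above), dropna would remove Bob's and Alice's rows because each still has a NaN. A minimal sketch:
# Drop any row that still contains a missing value (returns a new DataFrame)
complete_rows = df1.dropna()

# Keep rows that have at least 4 non-null values
mostly_complete = df1.dropna(thresh=4)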
Dropping Duplicates:
# Remove duplicates (if any)
df1.drop_duplicates(inplace=True)
Filtering
The general syntax for filtering rows based on a condition is:
df[df['column'] > threshold] # Filter rows based on condition
For example, selecting employees older than 25:
print("\nRows where 'Age' is greater than 25:")
print(df1[df1['Age'] > 25])
>>>
Rows where 'Age' is greater than 25:
ID Name Age Salary Department
1 2 Jane 30.0 60000.0 HR
2 3 Bob 29.5 55000.0 Sales
3 4 Alice 36.0 57500.0 IT
4 5 Charlie 28.0 62000.0 Finance
Transformation
The variants below are alternative ways to transform a column, not sequential steps:
# Apply a function to a column
df1['Salary'] = df1['Salary'].apply(lambda x: x * 1.1)
# OR
df1['Salary'] = df1['Salary'] * 1.1
# OR
df1['Salary'] = np.log(df1['Salary'])
# OR
df1["NormalizedSalary"] = df1["Salary"] / df1["Salary"].max()
Grouping and Aggregating
Grouping and aggregating operations are fundamental to data analysis, allowing you to summarize and analyze data at different levels of granularity. These operations are particularly useful when you need to understand patterns and trends within categories of your data.
Binning Data
# Categorize 'Age'
df1["AgeGroup"] = pd.cut(
    df1["Age"], bins=[0, 25, 35, 100], labels=["Young", "Middle", "Senior"]
)
>>>
After preprocessing:
ID Name Age Salary Department NormalizedSalary AgeGroup
0 1 John 25.00 50000.0 IT 0.806452 Young
1 2 Jane 30.00 60000.0 HR 0.967742 Middle
2 3 Bob 29.75 55000.0 Sales 0.887097 Middle
3 4 Alice 36.00 57500.0 IT 0.927419 Senior
4 5 Charlie 28.00 62000.0 Finance 1.000000 Middle
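To see how many rows fell into each bin, value_counts on the new column works (a small follow-up check, not in the original output):
# Count rows per age bin
print(df1["AgeGroup"].value_counts())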
GroupBy Operations
Getting a dictionary of groups:
groups = {name: group for name, group in df1.groupby("AgeGroup")}
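Each value in groups is the sub-DataFrame for that AgeGroup label; pandas can also return a single group directly with get_group (a quick usage sketch, not in the original):
# Look up one group from the dictionary built above
print(groups["Middle"])

# The same group, without building the dictionary first
print(df1.groupby("AgeGroup").get_group("Middle"))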
Grouping and aggregating:
(Note: the aggregation examples below appear to use a larger employee DataFrame df, with Department, Salary, Age, AgeGroup, and Projects columns, rather than the small df1 defined above, which is why the numbers differ.)
Basic groupby
print(df.groupby("Department")["Salary"].mean())
>>>
Department
Finance 66500.000000
HR 57333.333333
IT 67333.333333
Name: Salary, dtype: float64
Multiple aggregations
print(
    df.groupby("Department").agg(
        {"Salary": ["mean", "min", "max"], "Age": "mean", "Projects": "sum"}
    )
)
>>>
                  Salary                     Age Projects
                    mean    min    max     mean      sum
Department
Finance     66500.000000  65000  68000  38.000000        6
HR          57333.333333  55000  59000  38.333333        6
IT          67333.333333  60000  72000  30.000000       12
Custom aggregation function
# Custom aggregation function
def salary_range(x):
    return x.max() - x.min()

print(
    df.groupby("Department").agg({"Salary": ["mean", salary_range], "Projects": "sum"})
)
>>>
                  Salary              Projects
                    mean salary_range      sum
Department
Finance     66500.000000         3000        6
HR          57333.333333         4000        6
IT          67333.333333        12000       12
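An alternative to the nested-dict syntax is named aggregation, which yields flat, readable column names instead of a MultiIndex (a sketch against the same assumed df):
# Named aggregation: each keyword becomes an output column name
summary = df.groupby("Department").agg(
    avg_salary=("Salary", "mean"),
    salary_spread=("Salary", salary_range),
    total_projects=("Projects", "sum"),
)
print(summary)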
Multiple groupby columns
# Multiple groupby columns
agg_result = df.groupby(["Department", "AgeGroup"]).agg(
    {"Salary": ["mean", "count"], "Age": "mean"}
)
print(agg_result)
>>>
                      Salary         Age
                        mean count  mean
Department AgeGroup
Finance    Young         NaN     0   NaN
           Middle        NaN     0   NaN
           Senior    66500.0     2  38.0
HR         Young         NaN     0   NaN
           Middle    55000.0     1  35.0
           Senior    58500.0     2  40.0
IT         Young     60000.0     1  28.0
           Middle    71000.0     2  31.0
           Senior        NaN     0   NaN