Pandas for Web Analytics: A Practical Guide
Published Jan 8, 2025 · 2 minute read
Having worked with web analytics data extensively, I've found Pandas to be an invaluable tool for data manipulation and analysis. Let me share some practical patterns I use regularly when analyzing web traffic and user behavior data.

Basic Data Loading and Cleaning
First, let's look at how to load and clean typical web analytics data:
import pandas as pd

# Load web analytics data
df = pd.read_csv('web_analytics.csv', parse_dates=['timestamp'])

# Basic cleaning and preparation
df = df.assign(
    date=df['timestamp'].dt.date,
    hour=df['timestamp'].dt.hour,
    weekday=df['timestamp'].dt.day_name(),
    session_duration=pd.to_numeric(df['session_duration'], errors='coerce')
).dropna(subset=['session_duration'])
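Since the column names above depend on your particular export, here's a minimal self-contained sketch of the same cleaning step on a synthetic three-row dataset (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature stand-in for web_analytics.csv
raw = pd.DataFrame({
    'timestamp': pd.to_datetime(['2025-01-06 09:15',
                                 '2025-01-06 22:40',
                                 '2025-01-07 14:05']),
    'session_duration': ['120', 'n/a', '305'],  # strings, one unparseable
})

clean = raw.assign(
    date=raw['timestamp'].dt.date,
    hour=raw['timestamp'].dt.hour,
    weekday=raw['timestamp'].dt.day_name(),
    # errors='coerce' turns 'n/a' into NaN so dropna can remove the row
    session_duration=pd.to_numeric(raw['session_duration'], errors='coerce'),
).dropna(subset=['session_duration'])

print(clean[['date', 'hour', 'weekday', 'session_duration']])
```

The `'n/a'` row is coerced to NaN and dropped, leaving two clean rows with derived date, hour, and weekday columns.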
Session Analysis
Here's how I typically analyze user sessions:
# Average session duration by day
daily_sessions = df.groupby('date').agg({
    'session_duration': 'mean',
    'visitor_id': 'nunique',
    'page_views': 'sum'
}).round(2)

# Calculate bounce rate (share of single-page sessions)
bounce_rate = df.groupby('date').apply(
    lambda x: (x['page_views'] == 1).mean() * 100
).round(2)

# Identify peak traffic hours
peak_hours = df.groupby('hour').agg({
    'visitor_id': 'count'
}).sort_values('visitor_id', ascending=False)
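To make the bounce-rate arithmetic concrete, here's the same groupby/apply on a toy frame (dates and counts invented for the example):

```python
import pandas as pd

# Two sessions on day 1 (one single-page), one multi-page session on day 2
toy = pd.DataFrame({
    'date': ['2025-01-06', '2025-01-06', '2025-01-07'],
    'page_views': [1, 4, 3],
})

# Fraction of sessions with exactly one page view, per day
bounce = toy.groupby('date').apply(
    lambda x: (x['page_views'] == 1).mean() * 100
).round(2)

print(bounce)
```

One of the two day-one sessions bounced, giving 50.0 for that day and 0.0 for day two.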
User Journey Analysis
Understanding user paths through your website:
# Create user journey sequences
user_journeys = (df.sort_values(['visitor_id', 'timestamp'])
                 .groupby('visitor_id')
                 .agg({
                     'page_path': lambda x: ' > '.join(x),
                     'page_views': 'count'
                 }))

# Identify the most common multi-page paths
common_paths = (user_journeys[user_journeys['page_views'] > 1]
                .groupby('page_path')
                .size()
                .sort_values(ascending=False)
                .head(10))
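A quick self-contained illustration of the journey-building step, using three invented page hits for two visitors:

```python
import pandas as pd

hits = pd.DataFrame({
    'visitor_id': ['a', 'a', 'b'],
    'timestamp': pd.to_datetime(['2025-01-06 09:00',
                                 '2025-01-06 09:05',
                                 '2025-01-06 10:00']),
    'page_path': ['/home', '/products', '/home'],
})

# Sort by time within each visitor, then join pages into one path string
journeys = (hits.sort_values(['visitor_id', 'timestamp'])
            .groupby('visitor_id')
            .agg(page_path=('page_path', ' > '.join),
                 page_views=('page_path', 'count')))

print(journeys)
```

Visitor `a` yields the sequence `/home > /products`; visitor `b`, a single-page journey, would be filtered out by the `page_views > 1` condition above.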
Conversion Analysis
Analyzing conversion funnel:
def calculate_conversion_rate(df, steps):
    """Calculate conversion rates through defined funnel steps."""
    funnel_data = []
    total_users = df['visitor_id'].nunique()
    for i, step in enumerate(steps):
        users_at_step = df[df['page_path'] == step]['visitor_id'].nunique()
        conversion_rate = users_at_step / total_users * 100
        funnel_data.append({
            'step': i + 1,
            'page': step,
            'users': users_at_step,
            'conversion_rate': round(conversion_rate, 2)
        })
    return pd.DataFrame(funnel_data)

# Define conversion funnel steps
funnel_steps = [
    '/home',
    '/products',
    '/cart',
    '/checkout',
    '/thank-you'
]
conversion_funnel = calculate_conversion_rate(df, funnel_steps)
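The per-step arithmetic is easy to sanity-check by hand. Here it is inlined on a small synthetic set of hits (three visitors, all invented), without the DataFrame wrapper:

```python
import pandas as pd

# 3 visitors reach /home, 2 reach /products, 1 reaches /cart
hits = pd.DataFrame({
    'visitor_id': ['u1', 'u1', 'u1', 'u2', 'u2', 'u3'],
    'page_path': ['/home', '/products', '/cart', '/home', '/products', '/home'],
})

steps = ['/home', '/products', '/cart']
total = hits['visitor_id'].nunique()

# Unique visitors at each step, as a percentage of all visitors
rates = {
    s: round(hits.loc[hits['page_path'] == s, 'visitor_id'].nunique() / total * 100, 2)
    for s in steps
}
print(rates)
```

With three visitors, the funnel reads 100.0 → 66.67 → 33.33, matching what `calculate_conversion_rate` would report in its `conversion_rate` column.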
Time-Based Analysis
Analyzing trends and patterns over time:
# Weekly trends
weekly_trends = df.resample('W', on='timestamp').agg({
    'visitor_id': 'nunique',
    'page_views': 'sum',
    'session_duration': 'mean'
}).round(2)

# Month-over-month growth
mom_growth = (df.set_index('timestamp')
              .resample('M')['page_views']
              .sum()
              .pct_change() * 100).round(2)

# Rolling 7-day average for smoothing daily trends
rolling_avg = (df.set_index('timestamp')
               .resample('D')['page_views']
               .sum()
               .rolling(window=7)
               .mean())
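One detail worth remembering about the rolling average: the first `window - 1` values are NaN because a full window isn't available yet. A small synthetic daily series makes this visible:

```python
import pandas as pd

# Eight days of made-up daily page views
daily = pd.Series(
    [10, 20, 30, 40, 50, 60, 70, 80],
    index=pd.date_range('2025-01-01', periods=8, freq='D'),
)

# 7-day rolling mean: NaN until day 7, then the window slides forward
smooth = daily.rolling(window=7).mean()
print(smooth)
```

Day 7 averages 10 through 70 (= 40.0) and day 8 averages 20 through 80 (= 50.0); pass `min_periods=1` if you'd rather get partial-window averages at the start.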
Geographic Analysis
Analyzing traffic by location:
# Traffic by country
geo_analysis = df.groupby('country').agg({
    'visitor_id': 'nunique',
    'session_duration': 'mean',
    'page_views': 'sum'
}).sort_values('visitor_id', ascending=False)

# Calculate engagement metrics by country
geo_engagement = df.groupby('country').apply(
    lambda x: pd.Series({
        'avg_pages_per_session': x['page_views'].mean(),
        'avg_session_duration': x['session_duration'].mean(),
        'bounce_rate': (x['page_views'] == 1).mean() * 100
    })
).round(2)
Advanced Pattern: Cohort Analysis
Here's a more complex example for cohort analysis:
def create_cohort_analysis(df):
    # Get the first visit date for each visitor
    first_visit = df.groupby('visitor_id')['date'].min().reset_index()
    first_visit.columns = ['visitor_id', 'cohort_date']

    # Merge with the original dataframe
    df_cohort = df.merge(first_visit, on='visitor_id')

    # Approximate months between first and current visit.
    # Convert to datetime64 first: 'date' holds plain date objects,
    # and their difference has no .dt accessor.
    df_cohort['months_since_first'] = (
        (pd.to_datetime(df_cohort['date']) - pd.to_datetime(df_cohort['cohort_date'])).dt.days // 30
    )

    # Create cohort table
    cohort_data = df_cohort.groupby(['cohort_date', 'months_since_first'])['visitor_id'].nunique()
    cohort_table = cohort_data.unstack(fill_value=0)

    # Calculate retention rates relative to each cohort's month-0 size
    cohort_sizes = cohort_table.iloc[:, 0]
    retention_table = cohort_table.div(cohort_sizes, axis=0) * 100
    return retention_table.round(2)

retention_analysis = create_cohort_analysis(df)
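The same cohort logic can be condensed and checked on a tiny invented dataset: two visitors, one of whom returns about a month after their first visit.

```python
import pandas as pd

visits = pd.DataFrame({
    'visitor_id': ['a', 'a', 'b'],
    'date': pd.to_datetime(['2025-01-01', '2025-02-05', '2025-01-15']),
})

# Each row's cohort is the visitor's first visit date
first = visits.groupby('visitor_id')['date'].transform('min')
visits['months_since_first'] = (visits['date'] - first).dt.days // 30

# Unique visitors per (cohort, month offset), as a retention percentage
cohort = (visits.assign(cohort_date=first)
          .groupby(['cohort_date', 'months_since_first'])['visitor_id']
          .nunique()
          .unstack(fill_value=0))
retention = cohort.div(cohort.iloc[:, 0], axis=0) * 100
print(retention)
```

Visitor `a`'s return 35 days later lands in month 1 (100% retention for the Jan 1 cohort), while the Jan 15 cohort shows 0% in month 1.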
Performance Tips
From my experience working with large web analytics datasets:
- Use efficient data types
# Optimize memory usage
df['visitor_id'] = df['visitor_id'].astype('category')
df['page_path'] = df['page_path'].astype('category')
- Leverage vectorized operations
# Instead of apply/lambda for simple operations
df['session_duration_minutes'] = df['session_duration'] / 60
- Use chunking for large files
# Read large files in chunks; filter or aggregate each chunk before
# combining, otherwise the concat rebuilds the full file in memory anyway
chunks = pd.read_csv('large_web_analytics.csv', chunksize=10000)
result = pd.concat(
    (chunk.dropna(subset=['session_duration']) for chunk in chunks),
    ignore_index=True
)
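When the end goal is an aggregate rather than a filtered frame, you can reduce each chunk and never materialize the data at all. A sketch using an in-memory CSV (via `io.StringIO`, standing in for a real file; the column names are invented):

```python
import io
import pandas as pd

csv = io.StringIO("visitor_id,page_views\nu1,3\nu2,1\nu3,5\nu4,2\n")

# Accumulate a running total, two rows at a time
total = 0
for chunk in pd.read_csv(csv, chunksize=2):
    total += chunk['page_views'].sum()

print(total)
```

Only one two-row chunk is in memory at a time; the same pattern extends to per-chunk groupbys whose partial results are combined at the end.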
Remember, when working with web analytics data, it's important to:
- Handle missing values appropriately
- Account for timezone differences
- Consider data privacy requirements
- Document your analysis steps
- Validate results against other tools (like Google Analytics)