
Pandas for Web Analytics: A Practical Guide

Published Jan 8, 2025
2 minute read

Having worked with web analytics data extensively, I've found Pandas to be an invaluable tool for data manipulation and analysis. Let me share some practical patterns I use regularly when analyzing web traffic and user behavior data.

[Image: Pandas doing the heavy lifting]

Basic Data Loading and Cleaning

First, let's look at how to load and clean typical web analytics data:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
 
# Load web analytics data
df = pd.read_csv('web_analytics.csv', parse_dates=['timestamp'])
 
# Basic cleaning and preparation
df = df.assign(
    date=df['timestamp'].dt.date,
    hour=df['timestamp'].dt.hour,
    weekday=df['timestamp'].dt.day_name(),
    session_duration=pd.to_numeric(df['session_duration'], errors='coerce')
).dropna(subset=['session_duration'])
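If you want to follow along without a CSV on hand, a small synthetic frame works just as well. The column names here (`visitor_id`, `page_path`, `page_views`, `session_duration`, `country`) are assumptions taken from the snippets in this post, and the data itself is random:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for web_analytics.csv with the columns used below
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'timestamp': pd.date_range('2025-01-01', periods=n, freq='h'),
    'visitor_id': rng.integers(1, 40, size=n).astype(str),
    'page_path': rng.choice(
        ['/home', '/products', '/cart', '/checkout', '/thank-you'], size=n),
    'page_views': rng.integers(1, 8, size=n),
    'session_duration': rng.integers(10, 600, size=n).astype(float),
    'country': rng.choice(['US', 'DE', 'IN'], size=n),
})

# Same derived columns as in the cleaning step above
df = df.assign(
    date=df['timestamp'].dt.date,
    hour=df['timestamp'].dt.hour,
    weekday=df['timestamp'].dt.day_name(),
)
```

Every later snippet in the post should run against this frame unchanged.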

Session Analysis

Here's how I typically analyze user sessions:

# Average session duration by day
daily_sessions = df.groupby('date').agg({
    'session_duration': 'mean',
    'visitor_id': 'nunique',
    'page_views': 'sum'
}).round(2)
 
# Calculate bounce rate (share of single-page-view sessions per day);
# selecting the column first keeps the apply cheap
bounce_rate = (df.groupby('date')['page_views']
                 .apply(lambda x: (x == 1).mean() * 100)
                 .round(2))
 
# Identify peak traffic hours
peak_hours = df.groupby('hour').agg({
    'visitor_id': 'count'
}).sort_values('visitor_id', ascending=False)

User Journey Analysis

Understanding user paths through your website:

# Create user journey sequences
user_journeys = (df.sort_values(['visitor_id', 'timestamp'])
                 .groupby('visitor_id')
                 .agg({
                     'page_path': lambda x: ' > '.join(x),
                     'page_views': 'count'
                 }))
 
# Identify common paths
common_paths = (user_journeys[user_journeys['page_views'] > 1]
               .groupby('page_path')
               .size()
               .sort_values(ascending=False)
               .head(10))
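To make the `' > '.join` step concrete, here is what it produces on a toy click stream (the visitor IDs and pages below are made up for illustration):

```python
import pandas as pd

# Toy click stream: one visitor with three page views, one with a single view
toy = pd.DataFrame({
    'visitor_id': ['v1', 'v1', 'v1', 'v2'],
    'timestamp': pd.to_datetime(['2025-01-01 09:00', '2025-01-01 09:05',
                                 '2025-01-01 09:10', '2025-01-01 10:00']),
    'page_path': ['/home', '/products', '/cart', '/home'],
})

# Sorting by timestamp first guarantees the journey string follows click order
journeys = (toy.sort_values(['visitor_id', 'timestamp'])
               .groupby('visitor_id')['page_path']
               .agg(' > '.join))
# journeys['v1'] == '/home > /products > /cart'
```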

Conversion Analysis

Analyzing the conversion funnel:

def calculate_conversion_rate(df, steps):
    """Calculate conversion rates through defined funnel steps"""
    funnel_data = []
    total_users = len(df['visitor_id'].unique())
    
    for i, step in enumerate(steps):
        users_at_step = df[df['page_path'] == step]['visitor_id'].nunique()
        conversion_rate = (users_at_step / total_users * 100)
        
        funnel_data.append({
            'step': i + 1,
            'page': step,
            'users': users_at_step,
            'conversion_rate': round(conversion_rate, 2)
        })
    
    return pd.DataFrame(funnel_data)
 
# Define conversion funnel steps
funnel_steps = [
    '/home',
    '/products',
    '/cart',
    '/checkout',
    '/thank-you'
]
 
conversion_funnel = calculate_conversion_rate(df, funnel_steps)
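The per-step arithmetic inside the function reduces to a `nunique` ratio, which is easy to sanity-check on a toy frame (the visitors and paths here are invented):

```python
import pandas as pd

# Toy frame: three unique visitors, two of whom reached /home
toy = pd.DataFrame({
    'visitor_id': ['a', 'a', 'b', 'c'],
    'page_path':  ['/home', '/cart', '/home', '/products'],
})

total_users = toy['visitor_id'].nunique()  # 3
home_users = toy.loc[toy['page_path'] == '/home', 'visitor_id'].nunique()  # 2
rate = round(home_users / total_users * 100, 2)  # 66.67
```

Note that this counts a visitor at a step if they ever hit that page, regardless of order; a stricter funnel would also require the steps to occur in sequence.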

Time-Based Analysis

Analyzing trends and patterns over time:

# Weekly trends
weekly_trends = df.resample('W', on='timestamp').agg({
    'visitor_id': 'nunique',
    'page_views': 'sum',
    'session_duration': 'mean'
}).round(2)
 
# Month-over-Month growth
mom_growth = (df.set_index('timestamp')
             .resample('M')['page_views']
             .sum()
             .pct_change() * 100).round(2)
 
# Rolling averages for smoothing trends
rolling_avg = (df.set_index('timestamp')
              .resample('D')['page_views']
              .sum()
              .rolling(window=7)
              .mean())

Geographic Analysis

Analyzing traffic by location:

# Traffic by country
geo_analysis = df.groupby('country').agg({
    'visitor_id': 'nunique',
    'session_duration': 'mean',
    'page_views': 'sum'
}).sort_values('visitor_id', ascending=False)
 
# Calculate engagement metrics by region
geo_engagement = df.groupby('country').agg(
    avg_pages_per_session=('page_views', 'mean'),
    avg_session_duration=('session_duration', 'mean'),
    bounce_rate=('page_views', lambda x: (x == 1).mean() * 100)
).round(2)

Advanced Pattern: Cohort Analysis

Here's a more complex example for cohort analysis:

def create_cohort_analysis(df):
    # Get the first visit date for each visitor
    first_visit = df.groupby('visitor_id')['date'].min().reset_index()
    first_visit.columns = ['visitor_id', 'cohort_date']
    
    # Merge with original dataframe
    df_cohort = df.merge(first_visit, on='visitor_id')
    
    # Calculate months between visits. The 'date' columns hold python date
    # objects (from .dt.date), so convert to datetime64 before subtracting,
    # otherwise the .dt accessor fails on the object-dtype result
    df_cohort['months_since_first'] = (
        (pd.to_datetime(df_cohort['date'])
         - pd.to_datetime(df_cohort['cohort_date'])).dt.days // 30
    )
    
    # Create cohort table
    cohort_data = df_cohort.groupby(['cohort_date', 'months_since_first'])['visitor_id'].nunique()
    cohort_table = cohort_data.unstack(fill_value=0)
    
    # Calculate retention rates
    cohort_sizes = cohort_table.iloc[:, 0]
    retention_table = cohort_table.div(cohort_sizes, axis=0) * 100
    
    return retention_table.round(2)
 
retention_analysis = create_cohort_analysis(df)

Performance Tips

From my experience working with large web analytics datasets:

  1. Use efficient data types
# Optimize memory usage: low-cardinality string columns shrink dramatically
# as categories, since each unique value is stored only once
df['visitor_id'] = df['visitor_id'].astype('category')
df['page_path'] = df['page_path'].astype('category')
  2. Leverage vectorized operations
# Instead of apply/lambda for simple element-wise operations
df['session_duration_minutes'] = df['session_duration'] / 60
  3. Use chunking for large files
# Aggregate chunk by chunk instead of concatenating everything back into
# memory, which would defeat the purpose of chunked reading
chunks = pd.read_csv('large_web_analytics.csv', chunksize=10000)
total_views = sum(chunk['page_views'].sum() for chunk in chunks)
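To convince yourself the first tip actually pays off, compare memory usage directly on a repetitive string column (the data below is synthetic):

```python
import pandas as pd

# Repetitive string column: the typical case where 'category' saves memory
paths = pd.Series(['/home', '/products', '/cart'] * 10_000)

as_object = paths.memory_usage(deep=True)
as_category = paths.astype('category').memory_usage(deep=True)

# Categories store each unique string once plus small integer codes
assert as_category < as_object
```

`deep=True` matters here: without it, object columns report only pointer sizes and the comparison is misleading.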

Remember, when working with web analytics data, it's important to: