GeoLift-SDID Dataset Setup Guide
Overview
This guide specifies how to prepare datasets for the GeoLift-SDID package. Improperly formatted data tends to surface as cryptic matrix calculation errors that are difficult to diagnose. Following these specifications avoids the large majority of such failures.
Core Dataset Requirements
Required Columns
Every GeoLift-SDID dataset must contain these fundamental columns:
unit: Unique identifier for each geographic unit (e.g., DMA code, region ID, store ID)
time: Time period for each observation, typically as datetime or numeric timestamp
outcome: The metric being measured (e.g., sales, conversion rate, website visits)
treatment: Binary indicator (0/1) marking treatment periods for treatment units
Data Structure
The dataset must be in long format (panel data), where:
Each row represents a unique combination of unit and time period
All units must have observations for all time periods (balanced panel)
Pre-treatment and post-treatment periods must be clearly demarcated
Example structure:
| unit | time | outcome | treatment |
|------|------------|---------|-----------|
| 501 | 2024-01-01 | 123.45 | 0 |
| 501 | 2024-03-01 | 124.56 | 1 |
| 502 | 2024-01-01 | 234.56 | 0 |
| 502 | 2024-03-01 | 235.67 | 0 |
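The long-format requirement above can be checked mechanically. The following is a minimal sketch on a toy panel that mirrors the example table; the variable names (`panel`, `has_duplicates`, `is_balanced`) are illustrative and not part of the GeoLift-SDID package.

```python
import pandas as pd

# Toy panel mirroring the example table: two units, two periods.
panel = pd.DataFrame({
    "unit": [501, 501, 502, 502],
    "time": pd.to_datetime(
        ["2024-01-01", "2024-03-01", "2024-01-01", "2024-03-01"]
    ),
    "outcome": [123.45, 124.56, 234.56, 235.67],
    "treatment": [0, 1, 0, 0],
})

# Each (unit, time) pair should appear exactly once.
has_duplicates = panel.duplicated(subset=["unit", "time"]).any()

# A balanced panel without duplicates has n_units * n_periods rows.
expected_rows = panel["unit"].nunique() * panel["time"].nunique()
is_balanced = (not has_duplicates) and len(panel) == expected_rows

print(f"duplicates: {has_duplicates}, balanced: {is_balanced}")
```

Checking for duplicates before counting rows matters: a unit with one repeated period and one missing period would otherwise have the expected row count and pass unnoticed.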
Data Types and Formatting
unit column:
Can be numeric (integer/float) or string
Must be consistent throughout the dataset
Will be automatically converted to appropriate type for analysis
If treatment units are specified via CLI or config, they must match this type
time column:
Can be datetime string (YYYY-MM-DD) or numeric timestamp
Will be automatically converted to numeric format for matrix operations
Must be chronologically sorted in ascending order
Must be consistent across all units
outcome column:
Must be numeric (float/integer)
Non-numeric values will cause matrix calculation errors
Should not contain NaN/None/NULL values
treatment column:
Must be binary (0/1)
0 = no treatment, 1 = treatment
Only treatment units in post-intervention period should have value 1
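One way to enforce these column types up front is sketched below with plain pandas. It assumes string unit IDs and ISO-formatted dates; the toy frame `raw` and the deliberately bad `"n/a"` value are made up for illustration, and the package may perform its own conversions differently.

```python
import pandas as pd

raw = pd.DataFrame({
    "unit": [501, 501, 502, 502],
    "time": ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"],
    "outcome": ["123.45", "124.56", "n/a", "235.67"],  # one bad value on purpose
    "treatment": [0, 1, 0, 0],
})

clean = raw.copy()
# Pick one unit type and apply it everywhere (here: string).
clean["unit"] = clean["unit"].astype(str)
# Parse dates with an explicit format so ambiguous strings fail loudly.
clean["time"] = pd.to_datetime(clean["time"], format="%Y-%m-%d")
# Non-numeric outcomes become NaN so they can be located and fixed.
clean["outcome"] = pd.to_numeric(clean["outcome"], errors="coerce")

bad_outcome_rows = clean[clean["outcome"].isna()]
treatment_is_binary = clean["treatment"].isin([0, 1]).all()

print(f"{len(bad_outcome_rows)} row(s) with unusable outcome values")
print(f"treatment is binary: {treatment_is_binary}")
```

Using `errors="coerce"` surfaces every offending row at once, rather than stopping at the first failure.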
Technical Structural Requirements
Dimension Requirements:
Pre-intervention periods: At least 30 recommended for robust synthetic control (20 is the absolute minimum; 60+ preferred)
Post-intervention periods: At least 14 recommended for statistical validity (7 is the absolute minimum; 30+ preferred)
Control units: At least 10 recommended for donor pool diversity (5 is the absolute minimum for single-cell; 30+ preferred for multi-cell)
Matrix Structure Considerations:
The implementation handles matrices of dimensions:
Y_pre: (n_control, n_pre) - Control units × pre-treatment periods
Y_post: (n_control, n_post) - Control units × post-treatment periods
Y1_pre: (1, n_pre) - Treatment unit × pre-treatment periods
Y1_post: (1, n_post) - Treatment unit × post-treatment periods
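To make these shapes concrete, the sketch below pivots a small long-format panel to wide form and slices out the four blocks. The matrix names follow the list above. The toy data, `treated_unit`, and `intervention_date` are assumptions for illustration, and the package's internal construction may differ.

```python
import pandas as pd

# Toy panel: 3 units x 5 daily periods; unit 501 is the treated unit.
dates = pd.date_range("2024-01-01", periods=5, freq="D")
units = [501, 502, 503]
df = pd.DataFrame(
    [(u, t, float(u) + i) for u in units for i, t in enumerate(dates)],
    columns=["unit", "time", "outcome"],
)
treated_unit = 501
intervention_date = pd.Timestamp("2024-01-04")

# Rows are units, columns are time periods.
wide = df.pivot(index="unit", columns="time", values="outcome")

pre = wide.columns < intervention_date   # boolean mask over periods
controls = wide.index != treated_unit    # boolean mask over units

Y_pre = wide.loc[controls, pre].to_numpy()             # (n_control, n_pre)
Y_post = wide.loc[controls, ~pre].to_numpy()           # (n_control, n_post)
Y1_pre = wide.loc[[treated_unit], pre].to_numpy()      # (1, n_pre)
Y1_post = wide.loc[[treated_unit], ~pre].to_numpy()    # (1, n_post)

print(Y_pre.shape, Y_post.shape, Y1_pre.shape, Y1_post.shape)
```

`DataFrame.pivot` raises an error on duplicate (unit, time) pairs and leaves NaN where a period is missing, so an unbalanced panel shows up here as NaN entries in the matrices.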
Handling Dimension Mismatches:
If n_pre ≠ n_post, the implementation includes fallback mechanisms
Bootstrap resampling is used for standard error calculation when matrix dimensions are incompatible
Single-Cell vs. Multi-Cell Dataset Structure
Single-Cell Dataset
For analyzing one treatment unit against multiple controls:
Data Structure:
One unit designated as treatment
All other units as controls
Clear delineation of pre/post intervention periods
Treatment Assignment:
‘treatment’ = 0 for all units during pre-intervention period
‘treatment’ = 1 ONLY for treatment unit during post-intervention period
Control units always have ‘treatment’ = 0
Example:
| unit | time | outcome | treatment |
|------|------------|---------|-----------|
| 501 | 2024-01-01 | 123.45 | 0 | <- Pre-intervention
| 501 | 2024-03-01 | 125.67 | 1 | <- Post-intervention starts
| 502 | 2024-01-01 | 234.56 | 0 | <- Control (always 0)
| 502 | 2024-03-01 | 236.78 | 0 | <- Control (always 0)
Multi-Cell Dataset
For analyzing multiple treatment units simultaneously:
Data Structure:
Multiple units can be designated as treatment
Treatment start date typically the same across all units
Treatment units can have different start dates if specified
Treatment Assignment:
Same principle: ‘treatment’ = 0 pre-intervention, ‘treatment’ = 1 post-intervention
Each treatment unit’s ‘treatment’ column changes from 0 to 1 at its intervention date
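When treatment units start on different dates, the indicator can be built from a per-unit start date. This is a minimal sketch; the `start_dates` mapping is a hypothetical structure chosen for illustration, not a GeoLift-SDID configuration format.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=4, freq="D")
df = pd.DataFrame(
    [(u, t) for u in [501, 502, 503] for t in dates],
    columns=["unit", "time"],
)

# Hypothetical mapping: treated unit -> its own intervention date.
# Units missing from the mapping are controls.
start_dates = {
    501: pd.Timestamp("2024-01-03"),
    502: pd.Timestamp("2024-01-04"),
}

# Controls map to NaT; any comparison against NaT is False,
# so their treatment value stays 0.
start = pd.to_datetime(df["unit"].map(start_dates))
df["treatment"] = (df["time"] >= start).astype(int)

print(df.groupby("unit")["treatment"].sum())
```

If every unit shares one start date, the mapping collapses to a single timestamp and the same comparison applies.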
Critical Technical Pitfalls
Misaligned Treatment Indicator:
COMMON ERROR: Setting treatment=1 for all rows of treatment unit
CORRECT: Only post-intervention periods of treatment units should have treatment=1
Matrix Dimension Issues:
Pre/post period counts must be consistent across all units
Matrix operations require specific dimensional alignment
Error: “Input operand has a mismatch in its core dimension” indicates this problem
Type Inconsistencies:
Treatment unit IDs must match the type in your dataframe
Mixing string and numeric IDs without proper conversion causes errors
Command-line treatment unit specifications must match dataframe type
Date Format Inconsistencies:
Inconsistent formats between intervention date and dataframe dates
Strings vs. datetime objects causing comparison errors
Different regional formats (MM/DD/YYYY vs. DD/MM/YYYY)
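Three of these pitfalls are easy to reproduce in isolation. The sketch below uses only pandas and NumPy on made-up values. It shows how a string ID silently matches nothing in an integer column, how the same date string parses differently under two regional formats, and how mismatched shapes trigger the core-dimension error.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"unit": [501, 502], "time": ["03/01/2024", "03/01/2024"]})

# Type mismatch: a string ID matches nothing in an integer column, with no error.
matches_str = int(df["unit"].isin(["501"]).sum())
matches_int = int(df["unit"].isin([501]).sum())

# Regional formats: the same string is 1 March or 3 January.
as_us = pd.to_datetime(df["time"], format="%m/%d/%Y")
as_eu = pd.to_datetime(df["time"], format="%d/%m/%Y")

# Dimension mismatch: 60 columns cannot be multiplied against 30 rows.
try:
    np.ones((1, 60)) @ np.ones((30, 1))
    matmul_error = ""
except ValueError as exc:
    matmul_error = str(exc)

print(matches_str, matches_int)
print(as_us.iloc[0].date(), as_eu.iloc[0].date())
print(matmul_error)
```

The first two failures raise nothing, which is why an explicit type check and an explicit `format=` argument are worth the extra line.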
Best Practices for Data Preparation
Preprocessing Steps:
Ensure balanced panel (all units have all time periods)
Remove outliers that could distort synthetic control calculation
Normalize/standardize outcome if using units with different scales
Consider seasonality adjustment if strong cyclic patterns exist
Validation Checks:
Verify treatment assignment pattern (0→1 only at intervention point)
Confirm pre/post period counts meet minimum recommendations
Check for parallel trends in pre-intervention period
Validate unit type consistency between data and command parameters
Performance Optimization:
Limit unnecessary columns to reduce memory usage
Pre-sort data by unit and time for faster processing
Consider data aggregation for very high-frequency data
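The first validation check (treatment switches 0→1 only at the intervention point) can be written as a small helper. This is a sketch under the assumption that treatment never switches back off; `treatment_is_monotone` is an illustrative name, not a package function.

```python
import pandas as pd

def treatment_is_monotone(df: pd.DataFrame) -> bool:
    """Return True if, within every unit, treatment never drops from 1 back to 0."""
    ordered = df.sort_values(["unit", "time"])
    # Per-unit first difference: +1 at the switch, 0 elsewhere, -1 on a reversal.
    steps = ordered.groupby("unit")["treatment"].diff()
    return bool((steps.dropna() >= 0).all())

good = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "treatment": [0, 0, 1, 0, 0, 0],
})
bad = good.assign(treatment=[0, 1, 0, 0, 0, 0])  # unit 1 switches back off

print(treatment_is_monotone(good), treatment_is_monotone(bad))
```

The helper does not catch a unit marked treatment=1 in every row, since its differences are all zero. That case needs a separate check that pre-intervention rows sum to zero.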
Data Preparation Template Code
import pandas as pd

# 1. Load and standardize column names
df = pd.read_csv('raw_data.csv')
df = df.rename(columns={
    'dma': 'unit',       # Geographic identifier
    'date': 'time',      # Time period
    'sales': 'outcome',  # Measured metric
})

# 2. Handle data types - CRITICAL FOR MATRIX COMPATIBILITY
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(['unit', 'time']).reset_index(drop=True)  # Sort by unit and time
# Check for non-numeric or missing outcome values
if not pd.api.types.is_numeric_dtype(df['outcome']):
    raise ValueError("'outcome' column must be numeric for matrix operations")
if df['outcome'].isna().any():
    raise ValueError("'outcome' column must not contain missing values")

# 3. Define intervention parameters
treatment_units = [501]  # Must match type in dataframe
intervention_date = pd.to_datetime('2024-03-01')

# 4. Create treatment indicator (critical for matrices)
df['treatment'] = 0  # Initialize all as untreated
mask = (df['unit'].isin(treatment_units)) & (df['time'] >= intervention_date)
df.loc[mask, 'treatment'] = 1

# 5. Verify balanced panel
# Duplicate (unit, time) rows would let a unit with missing periods pass the count check
if df.duplicated(subset=['unit', 'time']).any():
    raise ValueError("Duplicate (unit, time) rows found; each pair must appear exactly once")
expected_periods = df['time'].nunique()
unit_periods = df.groupby('unit').size()
imbalanced_units = unit_periods[unit_periods != expected_periods]
if not imbalanced_units.empty:
    raise ValueError(f"Imbalanced panel will cause matrix errors. Units missing periods: {imbalanced_units.index.tolist()}")

# 6. Verify treatment pattern
pre_treatment = df[(df['unit'].isin(treatment_units)) & (df['time'] < intervention_date)]
if pre_treatment['treatment'].sum() > 0:
    raise ValueError("Invalid treatment pattern: Pre-intervention periods incorrectly marked as treatment=1")
post_treatment = df[(df['unit'].isin(treatment_units)) & (df['time'] >= intervention_date)]
if post_treatment['treatment'].sum() != len(post_treatment):
    raise ValueError("Invalid treatment pattern: Post-intervention treatment periods not all marked as 1")

# 7. Verify sufficient data for statistical validity
n_pre = df[df['time'] < intervention_date]['time'].nunique()
n_post = df[df['time'] >= intervention_date]['time'].nunique()
n_control = df[~df['unit'].isin(treatment_units)]['unit'].nunique()
if n_pre < 20:
    print(f"WARNING: Only {n_pre} pre-periods. Absolute minimum is 20; 30+ recommended.")
if n_post < 7:
    print(f"WARNING: Only {n_post} post-periods. Absolute minimum is 7; 14+ recommended.")
if n_control < 5:
    print(f"WARNING: Only {n_control} control units. Absolute minimum is 5; 10+ recommended.")

# 8. Save processed data
df.to_csv('geolift_ready_data.csv', index=False)
print(f"Data preparation complete: {len(df)} rows across {df['unit'].nunique()} units and {df['time'].nunique()} time periods.")
Advanced Data Setup Considerations
Handling Structural Constraints
The GeoLift-SDID implementation includes failsafe mechanisms for handling structural matrix constraints. When pre-intervention and post-intervention periods have different counts (e.g., 60 pre vs. 30 post), direct matrix multiplication fails. The implementation automatically:
Detects dimension mismatches
Falls back to direct statistical estimation
Uses bootstrap resampling for standard error calculation (n=500 by default)
Logs warnings when fallback mechanisms are activated
These failsafes prevent execution errors but may reduce statistical precision. Balanced panel data remains strongly preferred for optimal results.
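For intuition about bootstrap standard errors, the sketch below resamples control units with replacement and recomputes a plain difference-in-differences estimate each time. It is not the package's fallback procedure. `simple_did` and `bootstrap_se` are hypothetical helpers, the data are simulated, and only the default of 500 replications is taken from the text above.

```python
import numpy as np

def simple_did(Y_control, Y_treated, n_pre):
    """Plain difference-in-differences on unit x period matrices."""
    treated_change = Y_treated[:, n_pre:].mean() - Y_treated[:, :n_pre].mean()
    control_change = Y_control[:, n_pre:].mean() - Y_control[:, :n_pre].mean()
    return treated_change - control_change

def bootstrap_se(estimator, Y_control, Y_treated, n_pre, n_boot=500, seed=0):
    """Standard error from resampling control units (rows) with replacement."""
    rng = np.random.default_rng(seed)
    n_control = Y_control.shape[0]
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_control, size=n_control)
        estimates[b] = estimator(Y_control[idx], Y_treated, n_pre)
    return estimates.std(ddof=1)

# Simulated data: 20 controls, 60 pre-periods, 30 post-periods, true lift of 10.
rng = np.random.default_rng(42)
Y_control = rng.normal(100, 5, size=(20, 90))
Y_treated = rng.normal(100, 5, size=(1, 90))
Y_treated[:, 60:] += 10

estimate = simple_did(Y_control, Y_treated, n_pre=60)
se = bootstrap_se(simple_did, Y_control, Y_treated, n_pre=60)
print(f"estimate={estimate:.2f}, bootstrap SE={se:.2f}")
```

Because the estimator only takes means over each block, it runs unchanged when the pre- and post-period counts differ, as in the 60 vs. 30 case above.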
Absolute Minimum Requirements vs. Optimal Parameters
The table below lists the absolute minimum, recommended minimum, and optimal values for a valid analysis:
| Parameter | Absolute Minimum | Recommended Minimum | Optimal |
|---|---|---|---|
| Pre-treatment periods | 20 | 30 | 60+ |
| Post-treatment periods | 7 | 14 | 30+ |
| Control units (single-cell) | 5 | 10 | 20+ |
| Control units (multi-cell) | 10 | 20 | 30+ |
| Treatment units (multi-cell) | 2 | 3 | 5+ |
Additional critical requirements:
Balanced Panel: Every unit must have data for every time period
Type Consistency: Unit identifiers must maintain consistent type throughout
Treatment Assignment: Perfect binary pattern (0→1 at intervention only for treatment units)
Data Quality: No missing values or extreme outliers in outcome metric
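The thresholds in the table above can be encoded as a lookup so that a dataset's counts are classified automatically. This is a minimal sketch. The `THRESHOLDS` keys and the `classify` helper are illustrative names, and only the numbers come from the table.

```python
# (absolute minimum, recommended minimum, optimal) per parameter, from the table.
THRESHOLDS = {
    "pre_periods": (20, 30, 60),
    "post_periods": (7, 14, 30),
    "control_units_single_cell": (5, 10, 20),
    "control_units_multi_cell": (10, 20, 30),
    "treatment_units_multi_cell": (2, 3, 5),
}

def classify(parameter: str, value: int) -> str:
    """Return the tier a count falls into for the given parameter."""
    absolute, recommended, optimal = THRESHOLDS[parameter]
    if value < absolute:
        return "below absolute minimum"
    if value < recommended:
        return "meets absolute minimum"
    if value < optimal:
        return "meets recommended minimum"
    return "optimal"

for name, value in [("pre_periods", 45), ("post_periods", 6)]:
    print(f"{name}={value}: {classify(name, value)}")
```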