GeoLift-SDID Technical Best Practices¶
This document provides first-principles implementation guidance for conducting rigorous Synthetic Difference-in-Differences analysis using the GeoLift-SDID library.
Core Implementation Framework¶
The GeoLift-SDID workflow follows a sequential process:
1. Data Preparation & Validation
2. Donor Pool Optimization
3. Power Analysis & Sensitivity Thresholds
4. SDID Model Implementation
5. Bootstrap Standard Error Estimation
6. AI-Powered Result Interpretation
Each component addresses specific technical challenges in causal inference for marketing applications.
Single-Cell vs Multi-Cell Experiment Designs¶
GeoLift-SDID supports two fundamental experimental designs with distinct statistical properties:
Single-Cell Design¶
Definition: Treatment applied to exactly one location (treatment unit).
Configuration: See configs/power_analysis_config_singlecell.yaml and configs/donor_eval_config_singlecell.yaml
Statistical Properties:
Simpler inference procedure
Higher power for detecting effects in the single treated location
Unable to estimate effect heterogeneity across locations
Multi-Cell Design¶
Definition: Treatment applied to multiple locations (treatment units).
Configuration: See configs/power_analysis_config_multicell.yaml and configs/donor_eval_config_multicell.yaml
Statistical Properties:
Allows estimation of average treatment effects across locations
Can detect heterogeneous treatment effects
Requires more control locations for reliable inference
Lower power per location but higher overall power
Data Requirements¶
| Analysis Type | Min Pre-Treatment Periods | Min Treatment Locations | Min Control Locations |
|---|---|---|---|
| Single-Cell | 20 | 1 | 5 |
| Multi-Cell | 20 | 3 | 10 |
Insufficient data will result in poor synthetic controls and unreliable inference.
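As a quick guard, these thresholds can be checked programmatically before any modeling. The sketch below is illustrative only; the threshold values mirror the table above, and the column names (unit, time) and intervention_date variable follow the validation example in the next section:

import pandas as pd

# Minimum-requirement thresholds from the table above
REQUIREMENTS = {
    'single_cell': {'pre_periods': 20, 'treatment_locs': 1, 'control_locs': 5},
    'multi_cell':  {'pre_periods': 20, 'treatment_locs': 3, 'control_locs': 10},
}

def check_data_requirements(df, treatment_units, intervention_date, design='single_cell'):
    """Raise if the panel does not meet the minimum requirements for the chosen design."""
    req = REQUIREMENTS[design]
    pre_periods = df.loc[df['time'] < intervention_date, 'time'].nunique()
    n_treat = len(treatment_units)
    n_control = df.loc[~df['unit'].isin(treatment_units), 'unit'].nunique()

    if design == 'single_cell' and n_treat != 1:
        raise ValueError("Single-cell design requires exactly one treatment location")
    if pre_periods < req['pre_periods']:
        raise ValueError(f"Only {pre_periods} pre-treatment periods; need >= {req['pre_periods']}")
    if n_treat < req['treatment_locs']:
        raise ValueError(f"Only {n_treat} treatment locations; need >= {req['treatment_locs']}")
    if n_control < req['control_locs']:
        raise ValueError(f"Only {n_control} control locations; need >= {req['control_locs']}")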
Technical Implementation Guide¶
1. Data Preparation & Validation¶
Input: Raw time series data
Output: Matrix-compatible panel dataset
Critical Technical Requirements¶
Matrix Dimensionality: Ensure balanced panel structure for matrix operations
Type Consistency: Maintain consistent types for unit identifiers between data and parameters
Treatment Pattern: Binary indicator correctly marking only post-intervention periods for treatment units
Variable Types: Numeric outcome variables without missing values
Validation Checks¶
import numpy as np
import pandas as pd

# Essential validation checks
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(['unit', 'time'])

# Verify correct treatment pattern for each treated unit
correct_pattern = np.concatenate([
    np.zeros(pre_periods),   # All zeros before intervention
    np.ones(post_periods)    # All ones after intervention
])
for unit in treatment_units:
    treat_pattern = df[df['unit'] == unit]['treatment'].values
    if not np.array_equal(treat_pattern, correct_pattern):
        raise ValueError(
            f"Treatment pattern for unit {unit} is incorrect: "
            "should be all 0s pre-intervention, all 1s post-intervention"
        )

# Check dimension balance
unit_counts = df.groupby('unit').size()
if unit_counts.nunique() > 1:
    imbalanced = unit_counts[unit_counts != unit_counts.max()]
    raise ValueError(f"Imbalanced panel: units with missing periods: {imbalanced.index.tolist()}")

# Verify numeric outcome
if not np.issubdtype(df['outcome'].dtype, np.number):
    raise ValueError("Outcome variable must be numeric")
2. Donor Pool Optimization¶
Input: Validated panel data
Output: Optimal control unit ranking
Technical Implementation¶
The donor evaluator uses multiple distance metrics to identify optimal synthetic control candidates:
import numpy as np
from fastdtw import fastdtw

# Key similarity metrics for donor quality
metrics = {
    'correlation': lambda x, y: np.corrcoef(x, y)[0, 1],   # Trend similarity
    'rmse': lambda x, y: np.sqrt(np.mean((x - y)**2)),      # Absolute difference
    'mape': lambda x, y: np.mean(np.abs((x - y) / x)),      # Relative difference
    'dtw': lambda x, y: fastdtw(x, y)[0]                     # Pattern matching with time warping
}

# Composite scoring function
def composite_score(metrics_dict, weights=None):
    """Combine per-location metric values ({metric: {location: value}}) into one score per location."""
    if weights is None:
        weights = {'correlation': 0.4, 'rmse': 0.3, 'mape': 0.2, 'dtw': 0.1}
    # Normalize metrics to [0, 1] range (1 = best)
    normalized = {}
    for m in metrics_dict:
        if m == 'correlation':  # Higher is better
            normalized[m] = {k: (v + 1) / 2 for k, v in metrics_dict[m].items()}  # Map [-1, 1] to [0, 1]
        else:  # Lower is better
            values = np.array(list(metrics_dict[m].values()))
            min_val, max_val = np.min(values), np.max(values)
            # Add epsilon to prevent division by zero
            epsilon = 1e-10
            normalized[m] = {k: 1 - (v - min_val) / max(max_val - min_val, epsilon)
                             for k, v in metrics_dict[m].items()}
    # Calculate weighted score for each candidate location
    scores = {}
    for location in normalized[list(normalized.keys())[0]]:
        scores[location] = sum(weights[m] * normalized[m][location] for m in normalized)
    return scores
Critical Considerations¶
Correlation captures trend similarity but ignores magnitude differences
RMSE captures absolute differences but is sensitive to scale
MAPE captures relative differences but fails for near-zero values
Dynamic Time Warping accounts for time shifts but has higher computational cost
Optimal donor pools balance these metrics based on business context.
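For concreteness, one way to wire the metric functions and composite_score together is sketched below. The names pre_pivot (a pre-period DataFrame pivoted to one column per location) and treatment_series (the treated unit's pre-period outcomes) are illustrative placeholders, not library objects:

# Hypothetical usage: score every candidate donor against the treated series
candidate_units = [u for u in pre_pivot.columns if u not in treatment_units]

metrics_dict = {name: {} for name in metrics}
for unit in candidate_units:
    candidate_series = pre_pivot[unit].values
    for name, fn in metrics.items():
        metrics_dict[name][unit] = fn(treatment_series, candidate_series)

scores = composite_score(metrics_dict)
ranked_donors = sorted(scores, key=scores.get, reverse=True)  # best candidates first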
3. Power Analysis & Sensitivity Thresholds¶
Input: Validated data with optimized donor pool
Output: Minimum detectable effect (MDE) and optimal test duration
Technical Implementation¶
The power calculator uses t-test based approximations for efficient computation:
import numpy as np
from scipy import stats

def power_calculation(data, treatment_units, control_units, pre_period, effect_sizes, durations, alpha=0.05):
    # Extract pre-period data for estimating baseline variance
    pre_data = data[data['time'].isin(pre_period)]
    treat_pre = pre_data[pre_data['unit'].isin(treatment_units)]['outcome']
    control_pre = pre_data[pre_data['unit'].isin(control_units)]['outcome']
    # Calculate pooled standard deviation
    n1, n2 = len(treat_pre), len(control_pre)
    s1, s2 = treat_pre.std(), control_pre.std()
    pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    # Calculate statistical power for each effect size and duration
    results = {}
    for effect in effect_sizes:
        for duration in durations:
            # Approximate power of a two-sided t-test
            effect_size = effect / pooled_sd  # Cohen's d
            power = 1 - stats.t.cdf(
                stats.t.ppf(1 - alpha/2, df=duration - 1) - effect_size * np.sqrt(duration),
                df=duration - 1
            )
            results[(effect, duration)] = power
    return results
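A common follow-up is to read off the shortest tested duration that reaches the desired power for a given effect size. The helper below is not part of the library; it is a small sketch over the results dictionary returned by power_calculation above:

def minimum_duration(results, effect, desired_power=0.8):
    """Return the shortest tested duration whose estimated power meets the target, or None."""
    durations = sorted(d for (e, d) in results if e == effect)
    for duration in durations:
        if results[(effect, duration)] >= desired_power:
            return duration
    return None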
Minimum Detectable Effect Estimation¶
For a given test duration and desired power (typically 0.8):
import numpy as np
from scipy import stats

def calculate_mde(duration, desired_power=0.8, alpha=0.05, sd=None):
    """Calculate minimum detectable effect size for given parameters"""
    # Critical value for two-sided test
    t_crit = stats.t.ppf(1 - alpha/2, df=duration - 1)
    # Critical value for desired power
    t_power = stats.t.ppf(desired_power, df=duration - 1)
    # MDE in standardized units (Cohen's d)
    mde_standardized = (t_crit + t_power) / np.sqrt(duration)
    # Convert to absolute units if standard deviation provided
    if sd is not None:
        return mde_standardized * sd
    return mde_standardized
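As a usage sketch (hypothetical numbers; treat_pre is the treated unit's pre-period outcome series from the power calculation above), the absolute MDE can be expressed as a relative lift against the pre-period baseline:

baseline_mean = treat_pre.mean()   # average pre-period outcome for the treated unit(s)
baseline_sd = treat_pre.std()

mde_abs = calculate_mde(duration=30, desired_power=0.8, alpha=0.05, sd=baseline_sd)
mde_pct = mde_abs / baseline_mean * 100
print(f"With 30 post-periods, the minimum detectable effect is roughly {mde_pct:.1f}% of baseline")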
4. SDID Model Implementation¶
Input: Validated data with optimized donors and sufficient power
Output: Treatment effect estimates and standard errors
Technical Implementation¶
The core SDID implementation contains matrix operations for weight optimization:
from synthdid.synthdid import SyntheticDifferenceInDifferences
# Preferred implementation with proper import path
sdid = SyntheticDifferenceInDifferences(
    data,
    unit="unit",                 # Geographic identifier
    time="time",                 # Time period identifier
    outcome="outcome",           # Measured metric
    treatment="treatment",       # Binary treatment indicator
    covariates=covariates_list   # Optional additional predictors
)
# Fit model and extract primary results
sdid.fit()
att = sdid.att # Average treatment effect on treated
se = sdid.se # Standard error of estimate
p_value = sdid.p_value # Statistical significance
ci_lower, ci_upper = sdid.confidence_interval(alpha=0.05) # 95% confidence interval
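To hand these estimates to later steps (the bootstrap fallback and the AI interpreter), it can help to collect them in a plain dictionary. The key names below mirror those consumed by generate_ai_interpretation further down; baseline_mean and post_periods are assumed values you compute from your own data:

results = {
    'att': att,
    'att_pct': att / baseline_mean,   # relative lift vs. pre-period baseline
    'se': se,
    'pvalue': p_value,
    'ci_lower': ci_lower,
    'ci_upper': ci_upper,
    'post_periods': post_periods      # number of post-intervention periods
}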
Single-Cell Implementation Considerations¶
For single-cell analysis, special considerations apply:
from recipes.geolift_single_cell import GeoLiftSingleCell
# Initialize single-cell analysis
single_cell = GeoLiftSingleCell(
    data_path="path/to/data.csv",
    treatment_unit="501",             # Single treatment unit
    intervention_date="2024-03-01",
    pre_periods=60,
    post_periods=30,
    # Additional parameters as needed
)
# Run analysis
results = single_cell.run()
# Extract results with fallback calculations
att = results.get('att') # Average treatment effect
se = results.get('se') # Standard error (may use bootstrap)
p_value = results.get('pvalue') # Statistical significance
5. Bootstrap Standard Error Estimation¶
Input: SDID model results with potential matrix dimension challenges
Output: Robust standard error estimates
Technical Implementation¶
When standard matrix-based standard error calculations fail due to dimension constraints, bootstrap resampling provides an alternative:
import logging
import numpy as np

def bootstrap_standard_error(Y1_pre, Y1_post, Y0_pre, Y0_post, n_bootstrap=500):
    """Calculate standard error via bootstrap when matrix dimensions prevent direct calculation"""
    # Original dimensions may not be compatible for standard SE calculation
    bootstrap_estimates = []
    valid_iterations = 0
    # Track failed iterations to analyze failure modes
    failure_types = {'matrix_error': 0, 'convergence': 0, 'other': 0}
    for _ in range(n_bootstrap):
        try:
            # Resample control units with replacement
            n_control = Y0_pre.shape[1]
            control_indices = np.random.choice(n_control, n_control, replace=True)
            Y0_pre_boot = Y0_pre[:, control_indices]
            Y0_post_boot = Y0_post[:, control_indices]
            # Compute synthetic control weights with validation
            omega = compute_omega(Y1_pre, Y0_pre_boot)
            if np.any(np.isnan(omega)) or np.any(np.isinf(omega)):
                failure_types['convergence'] += 1
                continue
            # Apply weights to get synthetic control
            Y_synth_pre = Y0_pre_boot @ omega
            Y_synth_post = Y0_post_boot @ omega
            # Calculate the ATT for this bootstrap replicate
            att_boot = np.mean(Y1_post) - np.mean(Y_synth_post)
            bootstrap_estimates.append(att_boot)
            valid_iterations += 1
        except np.linalg.LinAlgError:
            failure_types['matrix_error'] += 1
            continue
        except Exception:
            failure_types['other'] += 1
            continue
    # Log bootstrap performance statistics
    logging.info(f"Bootstrap completed: {valid_iterations}/{n_bootstrap} valid iterations")
    if valid_iterations < n_bootstrap * 0.8:
        logging.warning(f"Bootstrap failures: {failure_types}")
    # Standard error is the sample standard deviation of the bootstrap estimates
    if bootstrap_estimates:
        return np.std(bootstrap_estimates, ddof=1)
    else:
        return np.nan
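The bootstrap function operates on outcome matrices rather than the long-format panel. One possible way to construct those matrices with a pandas pivot is sketched below (assuming a single treated unit, the column names from the validation example, and an intervention_date variable you define):

# Wide format: one row per time period, one column per unit
wide = df.pivot(index='time', columns='unit', values='outcome').sort_index()

control_units = [u for u in wide.columns if u not in treatment_units]
pre_mask = wide.index < intervention_date

Y1_pre = wide.loc[pre_mask, treatment_units[0]].values     # shape (T_pre,)
Y1_post = wide.loc[~pre_mask, treatment_units[0]].values   # shape (T_post,)
Y0_pre = wide.loc[pre_mask, control_units].values          # shape (T_pre, N_controls)
Y0_post = wide.loc[~pre_mask, control_units].values        # shape (T_post, N_controls)

se = bootstrap_standard_error(Y1_pre, Y1_post, Y0_pre, Y0_post, n_bootstrap=500)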
Implementation in Production Environment¶
The implementation automatically detects matrix dimension issues and falls back to bootstrap:
def calculate_standard_error(model, Y1_pre, Y1_post, Y0_pre, Y0_post):
    """Calculate standard error with fallback to bootstrap if needed"""
    try:
        # Try standard matrix-based calculation
        se = model.se
        if np.isnan(se) or se <= 0:
            raise ValueError("Invalid standard error from matrix calculation")
        return se
    except Exception as e:
        print(f"Matrix-based SE calculation failed: {e}")
        print("Falling back to bootstrap standard error calculation")
        return bootstrap_standard_error(Y1_pre, Y1_post, Y0_pre, Y0_post)
6. AI-Powered Result Interpretation¶
Input: SDID analysis results
Output: Business-focused interpretation and recommendations
Technical Implementation¶
The AI interpreter converts statistical results to business insights:
import requests

def generate_ai_interpretation(results, api_key, prompt_template=None):
    """Generate AI interpretation of statistical results"""
    # Default prompt template focuses on business implications
    if prompt_template is None:
        prompt_template = """
        You are an expert marketing analyst interpreting causal experiment results.

        EXPERIMENT RESULTS:
        - Test Periods: {test_duration} days
        - Average Treatment Effect: {att:.4f} ({att_pct:.2f}% change)
        - Standard Error: {se:.4f}
        - p-value: {pvalue:.4f}
        - 95% Confidence Interval: [{ci_lower:.4f}, {ci_upper:.4f}]

        Provide a concise, business-focused interpretation of these results in 3-5 paragraphs,
        addressing:
        1. Statistical significance and what it means in plain language
        2. Practical significance of the effect size for business decisions
        3. Recommendations for next steps based on these findings
        4. Any limitations or considerations in interpreting these results

        Keep your language clear, precise, and actionable for marketing stakeholders.
        """
    # Format the prompt with actual results
    formatted_prompt = prompt_template.format(
        test_duration=results['post_periods'],
        att=results['att'],
        att_pct=results['att_pct'] * 100 if 'att_pct' in results else results['att'] * 100,
        se=results['se'],
        pvalue=results['pvalue'],
        ci_lower=results['ci_lower'],
        ci_upper=results['ci_upper']
    )
    # Call the DeepSeek API
    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {"role": "system", "content": "You are an expert marketing analyst."},
                {"role": "user", "content": formatted_prompt}
            ],
            "temperature": 0.4,  # Lower temperature for more precise responses
            "max_tokens": 1000
        }
    )
    # Parse and return the interpretation
    try:
        interpretation = response.json()['choices'][0]['message']['content']
        return interpretation
    except Exception as e:
        return f"Failed to generate interpretation: {str(e)}"
Critical Technical Pitfalls¶
Matrix Dimension Mismatches:
SYMPTOM: "Input operand has a mismatch in its core dimension"
CAUSE: Pre/post period counts don't align for matrix multiplication
SOLUTION: Verify matching pre-intervention and post-intervention shapes
REFERENCE: See error handling in recipes/power_calculator.py or ensure balanced dimensions
Singular Matrix Errors:
SYMPTOM: "Singular matrix" or "LinAlgError"
CAUSE: Linear dependence between control units
SOLUTION: Add ridge penalty or remove highly correlated units
REFERENCE: See implementation in synthdid/synthdid.py
Imbalanced Panel Issues:
SYMPTOM: "ValueError: Found input variables with inconsistent numbers of samples"
CAUSE: Locations have different numbers of time periods
SOLUTION: Verify complete panel structure before analysis
REFERENCE: See data validation in recipes/donor_evaluator.py
Incorrect Package Import:
SYMPTOM: "ModuleNotFoundError: No module named 'src'"
CAUSE: Package structure is synthdid, not src
SOLUTION: Use correct import from synthdid.synthdid
REFERENCE: See import statements in example scripts
Visualization Style Error:
SYMPTOM: "OSError: 'seaborn-whitegrid' is not a valid package style"
CAUSE: Matplotlib version compatibility issue
SOLUTION: Use version-appropriate style name (seaborn-v0_8-whitegrid)
REFERENCE: See style handling in recipes/power_calculator.py
Incorrect Single-Cell Configuration:
SYMPTOM: "Multiple treatment locations detected in single-cell configuration"
CAUSE: Single-cell analysis requires exactly one treatment location
SOLUTION: Modify power_analysis_config_singlecell.yaml to include only one treatment location
REFERENCE: See example in configs/power_analysis_config_singlecell.yaml
Validation Heuristics¶
Pre-Treatment Fit Quality¶
The synthetic control should closely match the treatment location during the pre-intervention period:
import numpy as np

def assess_pretreatment_fit(Y1_pre, Y_synth_pre):
    """Assess quality of pre-treatment fit"""
    # Prevent division by zero in MAPE calculation
    epsilon = 1e-10
    denom = np.maximum(np.abs(Y1_pre), epsilon)
    rmse = np.sqrt(np.mean((Y1_pre - Y_synth_pre) ** 2))
    correlation = np.corrcoef(Y1_pre.flatten(), Y_synth_pre.flatten())[0, 1]
    mape = np.mean(np.abs((Y1_pre - Y_synth_pre) / denom))
    # Interpretation heuristics - calibrated from empirical results
    if correlation > 0.9 and mape < 0.1:
        quality = "Excellent"
        recommendation = "Proceed with analysis"
    elif correlation > 0.8 and mape < 0.2:
        quality = "Good"
        recommendation = "Proceed with analysis"
    elif correlation > 0.7 and mape < 0.3:
        quality = "Acceptable"
        recommendation = "Proceed with caution, interpret results carefully"
    else:
        quality = "Poor"
        recommendation = "Consider alternative donor pools or analysis methods"
    return {
        "rmse": rmse,
        "correlation": correlation,
        "mape": mape,
        "quality": quality,
        "recommendation": recommendation
    }
Statistical Significance Assessment¶
Interpreting statistical significance requires both p-values and effect sizes:
from scipy import stats

def assess_significance(att, se, baseline, alpha=0.05):
    """Assess statistical and practical significance"""
    # Statistical significance (normal approximation)
    z_score = att / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
    # Practical significance (percent change)
    pct_change = att / baseline * 100
    # Confidence interval
    ci_lower = att - stats.norm.ppf(1 - alpha/2) * se
    ci_upper = att + stats.norm.ppf(1 - alpha/2) * se
    # Interpretation
    if p_value < alpha:
        stat_interp = "Statistically significant"
    else:
        stat_interp = "Not statistically significant"
    # Practical interpretation based on percent change
    if abs(pct_change) < 1:
        pract_interp = "Negligible effect"
    elif abs(pct_change) < 5:
        pract_interp = "Small effect"
    elif abs(pct_change) < 10:
        pract_interp = "Moderate effect"
    else:
        pract_interp = "Large effect"
    return {
        "p_value": p_value,
        "percent_change": pct_change,
        "confidence_interval": (ci_lower, ci_upper),
        "statistical_interpretation": stat_interp,
        "practical_interpretation": pract_interp
    }
Advanced Technical Considerations¶
1. Optimization Algorithm Selection¶
The core SDID implementation uses:
Quadratic programming for unit weights (omega)
Ridge regression with cross-validation for time weights (lambda)
These algorithms balance fit quality with robustness to overfitting.
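The production solver lives in synthdid/synthdid.py and should be used in practice. Purely as an illustration of the idea behind the unit-weight step, the sketch below solves a simplex-constrained least-squares problem with a small ridge penalty; it omits the intercept shift and the specific regularization used by the full SDID estimator:

import numpy as np
from scipy.optimize import minimize

def compute_omega_sketch(Y1_pre, Y0_pre, ridge=1e-6):
    """Illustrative unit-weight solver: minimize ||Y1_pre - Y0_pre @ w||^2 + ridge * ||w||^2
    subject to w >= 0 and sum(w) == 1 (simplex constraint)."""
    n_controls = Y0_pre.shape[1]
    y = np.asarray(Y1_pre).ravel()

    def objective(w):
        resid = y - Y0_pre @ w
        return resid @ resid + ridge * (w @ w)

    constraints = [{'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}]
    bounds = [(0.0, 1.0)] * n_controls
    w0 = np.full(n_controls, 1.0 / n_controls)   # start from uniform weights

    result = minimize(objective, w0, bounds=bounds, constraints=constraints, method='SLSQP')
    return result.x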
2. Bootstrapping vs. Jackknife Variance¶
Bootstrap resampling offers advantages over jackknife for standard errors:
Less sensitive to outliers in small donor pools
More robust when pre/post periods have different lengths
Better handling of non-linear relationships
3. Balanced vs. Unbalanced Panels¶
While balanced panels are ideal, strategies for unbalanced panels include the following (the first two are sketched after this list):
Imputation of missing values (linear interpolation recommended)
Restriction to common time periods (reducing data but maintaining balance)
Weighted estimation that accounts for missing observations
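A minimal pandas sketch of the first two strategies, assuming the long-format panel and column names from the validation example:

# Strategy 1: linear interpolation of missing outcome values within each unit
df_imputed = df.sort_values(['unit', 'time']).copy()
df_imputed['outcome'] = (
    df_imputed.groupby('unit')['outcome']
    .transform(lambda s: s.interpolate(method='linear', limit_direction='both'))
)

# Strategy 2: restrict the panel to time periods observed for every unit
periods_per_unit = df.dropna(subset=['outcome']).groupby('time')['unit'].nunique()
common_periods = periods_per_unit[periods_per_unit == df['unit'].nunique()].index
df_common = df[df['time'].isin(common_periods)]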
Treatment Effect Interpretation¶
Interpreting Heterogeneous Effects¶
For multi-cell experiments, we need to interpret variation in treatment effects across locations:
import numpy as np

def interpret_heterogeneous_effects(location_effects):
    """Convert raw location-specific effects to actionable insight"""
    # Coefficient of variation across locations
    mean_effect = np.mean(location_effects)
    cv = np.std(location_effects) / mean_effect if mean_effect != 0 else np.inf
    if cv < 0.1:
        return "Homogeneous effect across locations"
    elif cv < 0.3:
        return "Moderate heterogeneity, segment by location characteristics"
    else:
        return "High heterogeneity, analyze each location individually"
From Statistical to Business Metrics¶
To convert ATT to meaningful business metrics:
import numpy as np

def convert_to_business_metrics(att, baseline, time_periods, cost_per_unit=None):
    """Convert statistical ATT to business metrics"""
    metrics = {
        'absolute_lift': att,
        'percent_lift': att / baseline * 100 if baseline != 0 else np.inf,
        'total_incremental': att * time_periods
    }
    if cost_per_unit is not None:
        metrics['roi'] = (att * time_periods) / cost_per_unit
    return metrics
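A quick worked example with hypothetical numbers (a daily lift of 5 units on a baseline of 100, over a 30-day test with total spend of 1,000 passed as cost_per_unit):

metrics = convert_to_business_metrics(att=5.0, baseline=100.0, time_periods=30, cost_per_unit=1000.0)
# -> {'absolute_lift': 5.0, 'percent_lift': 5.0, 'total_incremental': 150.0, 'roi': 0.15}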
Further Technical Resources¶
SDID Repository - Original R implementation
Matrix Operation Documentation - Core matrix calculation code
Bootstrap Implementation - Bootstrap standard error implementation
Unit Test Suite - Technical validation tests
Config Examples - Reference configuration files for all analyses