# GeoLift-SDID Technical Best Practices

This document provides first-principles implementation guidance for conducting rigorous Synthetic Difference-in-Differences (SDID) analysis using the GeoLift-SDID library.

## Core Implementation Framework

The GeoLift-SDID workflow follows a sequential process:

1. **Data Preparation & Validation**
2. **Donor Pool Optimization**
3. **Power Analysis & Sensitivity Thresholds**
4. **SDID Model Implementation**
5. **Bootstrap Standard Error Estimation**
6. **AI-Powered Result Interpretation**

Each component addresses specific technical challenges in causal inference for marketing applications.

## Single-Cell vs Multi-Cell Experiment Designs

GeoLift-SDID supports two fundamental experimental designs with distinct statistical properties:

### Single-Cell Design

**Definition**: Treatment applied to exactly one location (treatment unit).

**Configuration**: See `configs/power_analysis_config_singlecell.yaml` and `configs/donor_eval_config_singlecell.yaml`

**Statistical Properties**:
- Simpler inference procedure
- Higher power for detecting effects in the single treated location
- Unable to estimate effect heterogeneity across locations

### Multi-Cell Design

**Definition**: Treatment applied to multiple locations (treatment units).

**Configuration**: See `configs/power_analysis_config_multicell.yaml` and `configs/donor_eval_config_multicell.yaml`

**Statistical Properties**:
- Allows estimation of average treatment effects across locations
- Can detect heterogeneous treatment effects
- Requires more control locations for reliable inference
- Lower power per location but higher overall power

### Data Requirements

| Analysis Type | Min Pre-Treatment Periods | Min Treatment Locations | Min Control Locations |
|---------------|---------------------------|-------------------------|-----------------------|
| Single-Cell   | 20                        | 1                       | 5                     |
| Multi-Cell    | 20                        | 3                       | 10                    |

Insufficient data will result in poor synthetic controls and unreliable inference.

## Technical Implementation Guide

### 1. Data Preparation & Validation

**Input**: Raw time series data

**Output**: Matrix-compatible panel dataset

#### Critical Technical Requirements

- **Matrix Dimensionality**: Ensure a balanced panel structure for matrix operations
- **Type Consistency**: Maintain consistent types for unit identifiers between data and parameters
- **Treatment Pattern**: Binary indicator correctly marking only post-intervention periods for treatment units
- **Variable Types**: Numeric outcome variables without missing values

#### Validation Checks

```python
import numpy as np
import pandas as pd

# Essential validation checks
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(['unit', 'time'])

# Verify the correct treatment pattern for each treated unit
expected_pattern = np.concatenate([
    np.zeros(pre_periods),  # All zeros before intervention
    np.ones(post_periods)   # All ones after intervention
])
for unit in treatment_units:
    treat_pattern = df.loc[df['unit'] == unit, 'treatment'].values
    if not np.array_equal(treat_pattern, expected_pattern):
        raise ValueError(
            f"Treatment pattern for unit {unit} is incorrect: "
            "should be all 0s pre-intervention, all 1s post-intervention"
        )

# Check dimension balance
unit_counts = df.groupby('unit').size()
if unit_counts.nunique() > 1:
    imbalanced = unit_counts[unit_counts != unit_counts.max()]
    raise ValueError(f"Imbalanced panel: units with missing periods: {imbalanced.index.tolist()}")

# Verify numeric outcome with no missing values
if not np.issubdtype(df['outcome'].dtype, np.number):
    raise ValueError("Outcome variable must be numeric")
if df['outcome'].isna().any():
    raise ValueError("Outcome variable contains missing values")
```
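Once validation passes, the long-format panel must be pivoted into the block matrices that downstream matrix operations expect. The sketch below shows one way to do this under stated assumptions; the helper name `build_matrices` is illustrative, but the `Y1_pre`/`Y1_post`/`Y0_pre`/`Y0_post` naming follows the conventions used in the bootstrap section later in this document.

```python
import numpy as np
import pandas as pd

def build_matrices(df, treatment_units, intervention_date):
    """Illustrative helper: pivot a validated long-format panel into
    the block matrices used by SDID-style estimators. Rows are time
    periods, columns are units."""
    wide = df.pivot(index='time', columns='unit', values='outcome')
    pre = wide.index < pd.Timestamp(intervention_date)

    treated = [u for u in wide.columns if u in set(treatment_units)]
    controls = [u for u in wide.columns if u not in set(treatment_units)]

    Y1_pre = wide.loc[pre, treated].to_numpy()     # T_pre x N_treated
    Y1_post = wide.loc[~pre, treated].to_numpy()   # T_post x N_treated
    Y0_pre = wide.loc[pre, controls].to_numpy()    # T_pre x N_control
    Y0_post = wide.loc[~pre, controls].to_numpy()  # T_post x N_control
    return Y1_pre, Y1_post, Y0_pre, Y0_post
```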
### 2. Donor Pool Optimization

**Input**: Validated panel data

**Output**: Optimal control unit ranking

#### Technical Implementation

The donor evaluator uses multiple distance metrics to identify optimal synthetic control candidates:

```python
import numpy as np
from fastdtw import fastdtw

# Key similarity metrics for donor quality
metrics = {
    'correlation': lambda x, y: np.corrcoef(x, y)[0, 1],  # Trend similarity
    'rmse': lambda x, y: np.sqrt(np.mean((x - y) ** 2)),  # Absolute difference
    'mape': lambda x, y: np.mean(np.abs((x - y) / x)),    # Relative difference
    'dtw': lambda x, y: fastdtw(x, y)[0]                  # Pattern matching with time warping
}

# Composite scoring function
def composite_score(metrics_dict, weights=None):
    """metrics_dict maps each metric name to a {location: value} dict."""
    if weights is None:
        weights = {'correlation': 0.4, 'rmse': 0.3, 'mape': 0.2, 'dtw': 0.1}

    # Normalize all metrics to [0, 1], oriented so that higher is better
    normalized = {}
    for m in metrics_dict:
        if m == 'correlation':  # Higher is better
            # Convert [-1, 1] to [0, 1]
            normalized[m] = {k: (v + 1) / 2 for k, v in metrics_dict[m].items()}
        else:  # Lower is better
            values = np.array(list(metrics_dict[m].values()))
            min_val, max_val = np.min(values), np.max(values)
            # Add epsilon to prevent division by zero when all values are equal
            epsilon = 1e-10
            normalized[m] = {k: 1 - (v - min_val) / max(max_val - min_val, epsilon)
                             for k, v in metrics_dict[m].items()}

    # Calculate the weighted score for each candidate location
    scores = {}
    for location in next(iter(normalized.values())):
        scores[location] = sum(weights[m] * normalized[m][location] for m in normalized)
    return scores
```

#### Critical Considerations

- Correlation captures trend similarity but ignores magnitude differences
- RMSE captures absolute differences but is sensitive to scale
- MAPE captures relative differences but fails for near-zero values
- Dynamic Time Warping accounts for time shifts but has higher computational cost

Optimal donor pools balance these metrics based on business context.
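A typical use of these pieces is to score every candidate donor against the treated unit's pre-period series and rank by composite score. The snippet below is a hypothetical usage sketch: `y_treat` (the treated pre-period series) and `candidate_series` (a `{location: array}` dict of control pre-period series) are assumed inputs, not library objects.

```python
# Hypothetical usage: score and rank candidate donors against the
# treated unit's pre-period series.
metrics_by_location = {
    name: {loc: fn(y_treat, y) for loc, y in candidate_series.items()}
    for name, fn in metrics.items()
}
scores = composite_score(metrics_by_location)
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranking[:10])  # Ten most promising donors
```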
### 3. Power Analysis & Sensitivity Thresholds

**Input**: Validated data with optimized donor pool

**Output**: Minimum detectable effect (MDE) and optimal test duration

#### Technical Implementation

The power calculator uses t-test based approximations for efficient computation:

```python
import numpy as np
from scipy import stats

def power_calculation(data, treatment_units, control_units, pre_period,
                      effect_sizes, durations, alpha=0.05):
    # Extract pre-period data for estimating baseline variance
    pre_data = data[data['time'].isin(pre_period)]
    treat_pre = pre_data[pre_data['unit'].isin(treatment_units)]['outcome']
    control_pre = pre_data[pre_data['unit'].isin(control_units)]['outcome']

    # Calculate pooled standard deviation
    n1, n2 = len(treat_pre), len(control_pre)
    s1, s2 = treat_pre.std(), control_pre.std()
    pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

    # Calculate statistical power for each effect size and duration
    results = {}
    for effect in effect_sizes:
        for duration in durations:
            # Standard approximation for two-sided t-test power
            effect_size = effect / pooled_sd  # Cohen's d
            power = 1 - stats.t.cdf(
                stats.t.ppf(1 - alpha / 2, df=duration - 1) - effect_size * np.sqrt(duration),
                df=duration - 1
            )
            results[(effect, duration)] = power
    return results
```

#### Minimum Detectable Effect Estimation

For a given test duration and desired power (typically 0.8):

```python
import numpy as np
from scipy import stats

def calculate_mde(duration, desired_power=0.8, alpha=0.05, sd=None):
    """Calculate minimum detectable effect size for given parameters"""
    # Critical value for two-sided test
    t_crit = stats.t.ppf(1 - alpha / 2, df=duration - 1)
    # Critical value for desired power
    t_power = stats.t.ppf(desired_power, df=duration - 1)
    # MDE in standardized units (Cohen's d)
    mde_standardized = (t_crit + t_power) / np.sqrt(duration)
    # Convert to absolute units if a standard deviation is provided
    if sd is not None:
        return mde_standardized * sd
    return mde_standardized
```

### 4. SDID Model Implementation

**Input**: Validated data with optimized donors and sufficient power

**Output**: Treatment effect estimates and standard errors

#### Technical Implementation

The core SDID implementation contains matrix operations for weight optimization:

```python
from synthdid.synthdid import SyntheticDifferenceInDifferences

# Preferred implementation with the proper import path
sdid = SyntheticDifferenceInDifferences(
    data,
    unit="unit",                # Geographic identifier
    time="time",                # Time period identifier
    outcome="outcome",          # Measured metric
    treatment="treatment",      # Binary treatment indicator
    covariates=covariates_list  # Optional additional predictors
)

# Fit model and extract primary results
sdid.fit()
att = sdid.att          # Average treatment effect on treated
se = sdid.se            # Standard error of estimate
p_value = sdid.p_value  # Statistical significance
ci_lower, ci_upper = sdid.confidence_interval(alpha=0.05)  # 95% confidence interval
```

#### Single-Cell Implementation Considerations

For single-cell analysis, special considerations apply:

```python
from recipes.geolift_single_cell import GeoLiftSingleCell

# Initialize single-cell analysis
single_cell = GeoLiftSingleCell(
    data_path="path/to/data.csv",
    treatment_unit="501",            # Single treatment unit
    intervention_date="2024-03-01",
    pre_periods=60,
    post_periods=30,
    # Additional parameters as needed
)

# Run analysis
results = single_cell.run()

# Extract results with fallback calculations
att = results.get('att')         # Average treatment effect
se = results.get('se')           # Standard error (may use bootstrap)
p_value = results.get('pvalue')  # Statistical significance
```
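For intuition, the SDID point estimate is a doubly weighted difference-in-differences: a post-vs-pre comparison for the treated units against the same comparison for a weighted combination of controls. The minimal sketch below shows that calculation, assuming unit weights `omega` (one per control, summing to 1) and time weights `lam` (one per pre-period, summing to 1) have already been fit; the helper name `sdid_att` is illustrative, not part of the library API.

```python
import numpy as np

def sdid_att(Y1_pre, Y1_post, Y0_pre, Y0_post, omega, lam):
    """Minimal SDID point-estimate sketch. Rows are time periods,
    columns are units; omega weights controls, lam weights pre-periods."""
    treated_post = Y1_post.mean()             # treated average, post period
    treated_pre = lam @ Y1_pre.mean(axis=1)   # treated lambda-weighted pre average
    synth_post = (Y0_post @ omega).mean()     # synthetic control, post average
    synth_pre = lam @ (Y0_pre @ omega)        # synthetic control, weighted pre average
    return (treated_post - treated_pre) - (synth_post - synth_pre)
```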
### 5. Bootstrap Standard Error Estimation

**Input**: SDID model results with potential matrix dimension challenges

**Output**: Robust standard error estimates

#### Technical Implementation

When standard matrix-based standard error calculations fail due to dimension constraints, bootstrap resampling provides an alternative:

```python
import logging

import numpy as np

def bootstrap_standard_error(Y1_pre, Y1_post, Y0_pre, Y0_post, n_bootstrap=500):
    """Calculate standard error via bootstrap when matrix dimensions prevent direct calculation"""
    # Original dimensions may not be compatible for the standard SE calculation
    bootstrap_estimates = []
    valid_iterations = 0

    # Track failed iterations to analyze failure modes
    failure_types = {'matrix_error': 0, 'convergence': 0, 'other': 0}

    for _ in range(n_bootstrap):
        try:
            # Resample control units with replacement
            n_control = Y0_pre.shape[1]
            control_indices = np.random.choice(n_control, n_control, replace=True)
            Y0_pre_boot = Y0_pre[:, control_indices]
            Y0_post_boot = Y0_post[:, control_indices]

            # Compute synthetic control weights with validation
            omega = compute_omega(Y1_pre, Y0_pre_boot)
            if np.any(np.isnan(omega)) or np.any(np.isinf(omega)):
                failure_types['convergence'] += 1
                continue

            # Apply weights to get the synthetic control
            Y_synth_pre = Y0_pre_boot @ omega
            Y_synth_post = Y0_post_boot @ omega

            # Calculate ATT as the difference in post-period means
            att_boot = np.mean(Y1_post) - np.mean(Y_synth_post)
            bootstrap_estimates.append(att_boot)
            valid_iterations += 1
        except np.linalg.LinAlgError:
            failure_types['matrix_error'] += 1
            continue
        except Exception:
            failure_types['other'] += 1
            continue

    # Log bootstrap performance statistics
    logging.info(f"Bootstrap completed: {valid_iterations}/{n_bootstrap} valid iterations")
    if valid_iterations < n_bootstrap * 0.8:
        logging.warning(f"Bootstrap failures: {failure_types}")

    # Standard error is the standard deviation of the bootstrap estimates
    if bootstrap_estimates:
        return np.std(bootstrap_estimates)
    return np.nan
```

#### Implementation in Production Environment

The implementation automatically detects matrix dimension issues and falls back to bootstrap:

```python
def calculate_standard_error(model, Y1_pre, Y1_post, Y0_pre, Y0_post):
    """Calculate standard error with fallback to bootstrap if needed"""
    try:
        # Try the standard matrix-based calculation first
        se = model.se
        if np.isnan(se) or se <= 0:
            raise ValueError("Invalid standard error from matrix calculation")
        return se
    except Exception as e:
        logging.warning(f"Matrix-based SE calculation failed: {e}; "
                        "falling back to bootstrap standard error calculation")
        return bootstrap_standard_error(Y1_pre, Y1_post, Y0_pre, Y0_post)
```
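Whichever standard error survives the fallback logic can be turned into a normal-approximation confidence interval. The snippet below is a hypothetical usage sketch reusing the fitted `sdid` model from section 4 and the block matrices defined earlier.

```python
from scipy import stats

# Hypothetical usage: normal-approximation 95% CI around the ATT,
# using whichever standard error the fallback logic returns.
se = calculate_standard_error(sdid, Y1_pre, Y1_post, Y0_pre, Y0_post)
z = stats.norm.ppf(0.975)
ci_lower, ci_upper = sdid.att - z * se, sdid.att + z * se
print(f"ATT = {sdid.att:.4f}, 95% CI = [{ci_lower:.4f}, {ci_upper:.4f}]")
```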
### 6. AI-Powered Result Interpretation

**Input**: SDID analysis results

**Output**: Business-focused interpretation and recommendations

#### Technical Implementation

The AI interpreter converts statistical results to business insights:

```python
import requests

def generate_ai_interpretation(results, api_key, prompt_template=None):
    """Generate AI interpretation of statistical results"""
    # Default prompt template focuses on business implications
    if prompt_template is None:
        prompt_template = """
        You are an expert marketing analyst interpreting causal experiment results.

        EXPERIMENT RESULTS:
        - Test Periods: {test_duration} days
        - Average Treatment Effect: {att:.4f} ({att_pct:.2f}% change)
        - Standard Error: {se:.4f}
        - p-value: {pvalue:.4f}
        - 95% Confidence Interval: [{ci_lower:.4f}, {ci_upper:.4f}]

        Provide a concise, business-focused interpretation of these results in 3-5 paragraphs, addressing:
        1. Statistical significance and what it means in plain language
        2. Practical significance of the effect size for business decisions
        3. Recommendations for next steps based on these findings
        4. Any limitations or considerations in interpreting these results

        Keep your language clear, precise, and actionable for marketing stakeholders.
        """

    # Format the prompt with actual results
    formatted_prompt = prompt_template.format(
        test_duration=results['post_periods'],
        att=results['att'],
        att_pct=results['att_pct'] * 100 if 'att_pct' in results else results['att'] * 100,
        se=results['se'],
        pvalue=results['pvalue'],
        ci_lower=results['ci_lower'],
        ci_upper=results['ci_upper']
    )

    # Call the DeepSeek API
    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {"role": "system", "content": "You are an expert marketing analyst."},
                {"role": "user", "content": formatted_prompt}
            ],
            "temperature": 0.4,  # Lower temperature for more precise responses
            "max_tokens": 1000
        },
        timeout=60
    )

    # Parse and return the interpretation
    try:
        interpretation = response.json()['choices'][0]['message']['content']
        return interpretation
    except Exception as e:
        return f"Failed to generate interpretation: {str(e)}"
```
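A minimal invocation might look like the following. The `results` dict simply mirrors the fields the prompt template expects; the numbers shown are placeholders, not outputs from a real analysis, and the environment-variable name is an assumption.

```python
import os

# Hypothetical usage with placeholder values shaped like the
# fields the prompt template expects.
results = {
    'post_periods': 30, 'att': 0.042, 'att_pct': 0.037, 'se': 0.015,
    'pvalue': 0.006, 'ci_lower': 0.013, 'ci_upper': 0.071,
}
print(generate_ai_interpretation(results, api_key=os.environ["DEEPSEEK_API_KEY"]))
```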
## Critical Technical Pitfalls

1. **Matrix Dimension Mismatches**:
   - SYMPTOM: "Input operand has a mismatch in its core dimension"
   - CAUSE: Pre/post period counts don't align for matrix multiplication
   - SOLUTION: Verify matching pre-intervention and post-intervention shapes
   - REFERENCE: See error handling in `recipes/power_calculator.py` or ensure balanced dimensions

2. **Singular Matrix Errors**:
   - SYMPTOM: "Singular matrix" or "LinAlgError"
   - CAUSE: Linear dependence between control units
   - SOLUTION: Add a ridge penalty or remove highly correlated units
   - REFERENCE: See implementation in `synthdid/synthdid.py`

3. **Imbalanced Panel Issues**:
   - SYMPTOM: "ValueError: Found input variables with inconsistent numbers of samples"
   - CAUSE: Locations have different numbers of time periods
   - SOLUTION: Verify complete panel structure before analysis
   - REFERENCE: See data validation in `recipes/donor_evaluator.py`

4. **Incorrect Package Import**:
   - SYMPTOM: "ModuleNotFoundError: No module named 'src'"
   - CAUSE: Package structure is `synthdid`, not `src`
   - SOLUTION: Use the correct import from `synthdid.synthdid`
   - REFERENCE: See import statements in example scripts

5. **Visualization Style Error**:
   - SYMPTOM: "OSError: 'seaborn-whitegrid' is not a valid package style"
   - CAUSE: Matplotlib version compatibility issue
   - SOLUTION: Use the version-appropriate style name (`seaborn-v0_8-whitegrid`)
   - REFERENCE: See style handling in `recipes/power_calculator.py`

6. **Incorrect Single-Cell Configuration**:
   - SYMPTOM: "Multiple treatment locations detected in single-cell configuration"
   - CAUSE: Single-cell analysis requires exactly one treatment location
   - SOLUTION: Modify `power_analysis_config_singlecell.yaml` to include only one treatment location
   - REFERENCE: See example in `configs/power_analysis_config_singlecell.yaml`

## Validation Heuristics

### Pre-Treatment Fit Quality

The synthetic control should closely match the treatment location during the pre-intervention period:

```python
import numpy as np

def assess_pretreatment_fit(Y1_pre, Y_synth_pre):
    """Assess quality of pre-treatment fit"""
    # Prevent division by zero in the MAPE calculation
    epsilon = 1e-10
    denom = np.maximum(np.abs(Y1_pre), epsilon)

    rmse = np.sqrt(np.mean((Y1_pre - Y_synth_pre) ** 2))
    correlation = np.corrcoef(Y1_pre.flatten(), Y_synth_pre.flatten())[0, 1]
    mape = np.mean(np.abs((Y1_pre - Y_synth_pre) / denom))

    # Interpretation heuristics - calibrated from empirical results
    if correlation > 0.9 and mape < 0.1:
        quality = "Excellent"
        recommendation = "Proceed with analysis"
    elif correlation > 0.8 and mape < 0.2:
        quality = "Good"
        recommendation = "Proceed with analysis"
    elif correlation > 0.7 and mape < 0.3:
        quality = "Acceptable"
        recommendation = "Proceed with caution, interpret results carefully"
    else:
        quality = "Poor"
        recommendation = "Consider alternative donor pools or analysis methods"

    return {
        "rmse": rmse,
        "correlation": correlation,
        "mape": mape,
        "quality": quality,
        "recommendation": recommendation
    }
```

### Statistical Significance Assessment

Interpreting statistical significance requires both p-values and effect sizes:

```python
import numpy as np
from scipy import stats

def assess_significance(att, se, baseline, alpha=0.05):
    """Assess statistical and practical significance"""
    # Statistical significance
    z_score = att / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

    # Practical significance (percent change)
    pct_change = att / baseline * 100

    # Confidence interval
    ci_lower = att - stats.norm.ppf(1 - alpha / 2) * se
    ci_upper = att + stats.norm.ppf(1 - alpha / 2) * se

    # Statistical interpretation
    if p_value < alpha:
        stat_interp = "Statistically significant"
    else:
        stat_interp = "Not statistically significant"

    # Practical interpretation based on percent change
    if abs(pct_change) < 1:
        pract_interp = "Negligible effect"
    elif abs(pct_change) < 5:
        pract_interp = "Small effect"
    elif abs(pct_change) < 10:
        pract_interp = "Moderate effect"
    else:
        pract_interp = "Large effect"

    return {
        "p_value": p_value,
        "percent_change": pct_change,
        "confidence_interval": (ci_lower, ci_upper),
        "statistical_interpretation": stat_interp,
        "practical_interpretation": pract_interp
    }
```

## Advanced Technical Considerations

### 1. Optimization Algorithm Selection

The core SDID implementation uses:
- Quadratic programming for unit weights (omega)
- Ridge regression with cross-validation for time weights (lambda)

These algorithms balance fit quality with robustness to overfitting.

### 2. Bootstrapping vs. Jackknife Variance

Bootstrap resampling offers advantages over the jackknife for standard errors:
- Less sensitive to outliers in small donor pools
- More robust when pre/post periods have different lengths
- Better handling of non-linear relationships

For contrast, a leave-one-donor-out jackknife is sketched below.
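This sketch assumes a hypothetical `att_fn` callable that refits the estimator on a reduced control matrix and returns the ATT; it is illustrative rather than the library's implementation.

```python
import numpy as np

def jackknife_standard_error(att_fn, Y0_pre, Y0_post):
    """Leave-one-donor-out jackknife SE (illustrative sketch)."""
    n = Y0_pre.shape[1]
    # Re-estimate the ATT with each control unit dropped in turn
    estimates = np.array([
        att_fn(np.delete(Y0_pre, j, axis=1), np.delete(Y0_post, j, axis=1))
        for j in range(n)
    ])
    # Jackknife variance: (n - 1) / n * sum of squared deviations
    return np.sqrt((n - 1) / n * np.sum((estimates - estimates.mean()) ** 2))
```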
### 3. Balanced vs. Unbalanced Panels

While balanced panels are ideal, strategies for unbalanced panels include:
- Imputation of missing values (linear interpolation recommended)
- Restriction to common time periods (reducing data but maintaining balance)
- Weighted estimation that accounts for missing observations

## Treatment Effect Interpretation

### Interpreting Heterogeneous Effects

For multi-cell experiments, we need to interpret variation in treatment effects across locations:

```python
import numpy as np

def interpret_heterogeneous_effects(location_effects):
    """Convert raw location-specific effects to actionable insight"""
    # Coefficient of variation: dispersion relative to the magnitude of the mean effect
    mean_effect = np.mean(location_effects)
    cv = np.std(location_effects) / abs(mean_effect) if mean_effect != 0 else np.inf

    if cv < 0.1:
        return "Homogeneous effect across locations"
    elif cv < 0.3:
        return "Moderate heterogeneity, segment by location characteristics"
    else:
        return "High heterogeneity, analyze each location individually"
```

### From Statistical to Business Metrics

To convert the ATT into meaningful business metrics:

```python
import numpy as np

def convert_to_business_metrics(att, baseline, time_periods, cost_per_unit=None):
    """Convert statistical ATT to business metrics"""
    metrics = {
        'absolute_lift': att,
        'percent_lift': att / baseline * 100 if baseline != 0 else np.inf,
        'total_incremental': att * time_periods
    }
    if cost_per_unit is not None:
        metrics['roi'] = (att * time_periods) / cost_per_unit
    return metrics
```

## Further Technical Resources

- [SDID Repository](https://github.com/synth-inference/synthdid) - Original R implementation
- [Matrix Operation Documentation](../synthdid/sdid.py) - Core matrix calculation code
- [Bootstrap Implementation](../recipes/geolift_single_cell.py) - Bootstrap standard error implementation
- [Unit Test Suite](../tests/) - Technical validation tests
- [Config Examples](../configs/) - Reference configuration files for all analyses