# Analyze and Interpret A/B Test Results with Statistical Rigor
Get a complete A/B test analysis with statistical significance, power analysis, validity checks, and a clear ship decision.
## 📋 The Prompt
You are a senior data scientist specializing in experimentation and causal inference. I need you to perform a rigorous analysis of my A/B test results and provide clear, actionable recommendations.
**Experiment Details:**
- Experiment name: [EXPERIMENT_NAME, e.g., New checkout flow vs. existing checkout flow]
- Hypothesis: [YOUR_HYPOTHESIS, e.g., The simplified checkout flow will increase conversion rate by at least 5%]
- Primary metric: [PRIMARY_METRIC, e.g., purchase conversion rate]
- Secondary metrics: [SECONDARY_METRICS, e.g., average order value, cart abandonment rate, time to purchase]
- Test duration: [DURATION, e.g., 14 days]
- Traffic split: [SPLIT_RATIO, e.g., 50/50]
**Observed Results:**
- Control group: [CONTROL_SAMPLE_SIZE] users, [CONTROL_CONVERSIONS] conversions (or [CONTROL_METRIC_VALUE])
- Treatment group: [TREATMENT_SAMPLE_SIZE] users, [TREATMENT_CONVERSIONS] conversions (or [TREATMENT_METRIC_VALUE])
- Any secondary metric observations: [SECONDARY_METRIC_RESULTS]
**Please provide the following analysis:**
1. **Statistical Significance Testing**: Calculate the p-value, confidence interval (95%), and determine if the result is statistically significant. Specify whether a z-test, t-test, or chi-squared test is most appropriate and why. Show your calculations step by step.
2. **Effect Size & Practical Significance**: Compute the relative and absolute lift, Cohen's d or equivalent effect size measure, and assess whether the observed effect is practically meaningful for the business.
3. **Power Analysis**: Evaluate whether the sample size was sufficient to detect the hypothesized minimum detectable effect (MDE). If underpowered, calculate the required sample size and duration.
4. **Validity Checks**: Assess potential threats including sample ratio mismatch (SRM), novelty/primacy effects, selection bias, and Simpson's paradox. Suggest diagnostic checks for each.
5. **Segmentation Analysis**: Recommend 3-5 meaningful segments to analyze (e.g., by device type, user tenure, geography) and explain what heterogeneous treatment effects to look for.
6. **Decision Recommendation**: Based on all evidence, provide a clear ship/no-ship/extend recommendation with reasoning. Include risk considerations and suggest any follow-up experiments.
7. **Executive Summary**: Write a 3-4 sentence non-technical summary suitable for sharing with product leadership.
Present all numerical results in a clean table format where applicable.
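For conversion-rate experiments like the checkout example, item 1 typically resolves to a pooled two-proportion z-test. As a sanity check on whatever the model produces, the calculation can be sketched in plain Python (standard library only; the sample numbers below are hypothetical):

```python
import math

def two_proportion_ztest(conv_c, n_c, conv_t, n_t):
    """Pooled two-proportion z-test with a 95% CI on the absolute lift."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    # Pooled standard error under H0: p_c == p_t
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se_pool
    # Two-sided p-value via the standard normal CDF,
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2
    p_value = 2 * (1 - (1 + math.erf(abs(z) / math.sqrt(2))) / 2)
    # Unpooled standard error for the CI on the absolute lift
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    ci = (p_t - p_c - 1.96 * se, p_t - p_c + 1.96 * se)
    return z, p_value, ci

# Hypothetical results: 10,000 users per arm, 500 vs. 560 conversions
z, p, ci = two_proportion_ztest(500, 10_000, 560, 10_000)
# A 12% relative lift, yet p sits just above 0.05 and the CI crosses zero
```

This example also illustrates why item 2 matters: a lift can look large in relative terms while the evidence for it remains inconclusive.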
## 💡 Tips for Better Results
Provide exact numbers for sample sizes and conversions rather than percentages alone; exact counts enable accurate statistical calculations and power analysis.
Include any known issues during the test period (e.g., site outages, marketing campaigns, holidays) so the analysis can account for confounding factors.
If you have results for multiple metrics, flag which one is the primary decision metric to avoid multiple comparison pitfalls in the analysis.
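When secondary metrics are tested alongside the primary one, the multiple-comparison risk can be blunted with a p-value adjustment. One common choice is the Holm–Bonferroni step-down procedure, sketched here (the raw p-values in the example are hypothetical):

```python
def holm_adjust(p_values):
    """Holm–Bonferroni step-down adjustment of raw two-sided p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending by p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply by the number of hypotheses still in play, cap at 1,
        # and enforce monotonicity with a running maximum
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical raw p-values: primary metric plus two secondary metrics
holm_adjust([0.01, 0.04, 0.03])  # the 0.03 and 0.04 both adjust to 0.06
```

Holm is uniformly more powerful than plain Bonferroni while still controlling the family-wise error rate, which makes it a reasonable default when only a handful of metrics are involved.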
## 🎯 Use Cases
Product managers, growth analysts, and data scientists can use this prompt after an A/B test concludes to ensure a statistically rigorous interpretation and a confident go/no-go decision.