Beyond A/B Testing: Part 1 – Reducing Long Test Durations with Sequential Testing
Traditional A/B tests require large sample sizes, leading to long test durations and delayed decisions. In this post, we explore sequential testing as a faster, statistically reliable alternative.
A/B testing has become a staple in product development and experimentation. It’s clean, simple, and effective—until it isn’t. One of the biggest challenges with traditional A/B testing is its requirement for large sample sizes and long run times to detect statistically significant results, especially when measuring small effects.
In this post, we explore Sequential Testing—a powerful alternative that allows for smarter, more adaptive decision-making. This is the first in a multi-part series where we’ll unpack advanced experimentation techniques designed to overcome the real-world challenges of A/B testing at scale.
⏱️ Why Do A/B Tests Take So Long?
Let’s simulate a scenario where the baseline e-commerce checkout conversion rate is 0.5%, and we aim to detect a relative lift of 5%. Using Evan Miller’s online sample size calculator and applying industry-standard benchmarks—80% statistical power and a 5% significance level—the estimated sample size required for both the test and control groups is approximately 1.2M each.
Assuming the number of unique users per week is approximately 700k, the test would need to run for around 4 weeks (2.4M total sample size ÷ 700k/week) to achieve statistical significance.
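To make the arithmetic concrete, here is a minimal sketch of this calculation using the standard two-proportion normal approximation. It is the same formula most online calculators are built on, so the exact output of Evan Miller's tool may differ slightly in the last digits:

```python
from scipy.stats import norm

# Scenario from above (illustrative values)
p1 = 0.005            # baseline checkout conversion rate (0.5%)
p2 = p1 * 1.05        # 5% relative lift -> 0.525%
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance
z_beta = norm.ppf(power)            # desired power

# Per-group sample size for a two-proportion z-test (normal approximation)
n_per_group = ((z_alpha + z_beta) ** 2
               * (p1 * (1 - p1) + p2 * (1 - p2))
               / (p2 - p1) ** 2)

weekly_users = 700_000
print(f"Per group: {n_per_group:,.0f}")                          # ~1.3M
print(f"Total:     {2 * n_per_group:,.0f}")                      # ~2.6M
print(f"Duration:  {2 * n_per_group / weekly_users:.1f} weeks")  # ~3.7 weeks, i.e. a ~4-week run
```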
While this 4-week timeline for data collection and decision-making may be acceptable in 90% of e-commerce cases—especially when there’s stakeholder alignment and planned experimentation cycles—the remaining 10% often involve tighter deadlines, higher visibility with leadership, or higher-stakes decisions. In these situations, the traditional A/B test setup, though statistically rigorous, can fall short.
Here are some of the common challenges that arise:
1. Delayed Decision-Making
In a frequentist A/B testing framework, you commit to a fixed sample size and test duration upfront. This means the experiment must run to completion before drawing any conclusions—even if it becomes clear early on that one variant is better.
✅ Result: Slower iteration cycles and reduced agility in product development.
2. Exposure to Underperforming Variants
If you’re testing a brand-new feature or experience—such as introducing a marketing banner to acquire new customers—and it distracts users from the checkout funnel by diverting them to the sign-up flow, it could cause a regression in revenue metrics. Yet, you might still continue showing this underperforming variant to users just to fulfill the sample size requirement.
❌ Result: Suboptimal user experience and potentially negative business impact during the test window.
3. Risk of Inconclusive Results
Imagine running a test for 4 weeks, only to find out that your results aren’t statistically significant (p-value > 0.05).
🔁 Result: The test is labeled inconclusive, and you’ve lost valuable time, momentum, and engineering effort—with no clear decision.
One approach that addresses these challenges is described below:
Sequential Testing
✨ What is Sequential Testing (In Layman Terms)?
Sequential testing is like checking your experiment results at regular intervals instead of waiting until the very end. Imagine running an A/B test, but instead of fixing a sample size upfront and only analyzing results once everything is collected, you peek periodically to see if there’s already enough evidence to declare a winner—or to stop early if it’s clearly not working.
⚠️ But Isn’t “Peeking” Bad in A/B Testing?
Yes—and it comes down to the risk of Type I error, which is one of the most important concepts in experimentation.
In simple terms, a Type I error happens when you incorrectly reject the null hypothesis, concluding that there is a significant effect (i.e., that your test variant is better) when in reality, there’s no real difference. This is also called a false positive.
In traditional A/B testing, you’re supposed to analyze the data only once, after all the data has been collected. If you peek at the results too early—or repeatedly check p-values before the test ends—you increase your chances of stumbling upon a result that looks statistically significant by pure chance, not because your variant actually worked.
Each “look” at the data is essentially another chance to make a false discovery. And if you’re checking multiple times without adjusting for that, your originally intended 5% significance level (α = 0.05) could inflate dramatically, leading to misleading conclusions.
This chart highlights how the false positive rate (Type I error) increases as we peek at the A/B test data more frequently during the experiment, treating each look as an independent test at α = 0.05:
With no peeking (1 look), the false positive rate is 5%, aligned with the typical α = 0.05.
With 5 looks, it rises to ~23% (i.e., 1 - (1 - 0.05)^5).
With 10 looks, it increases further to ~40% (i.e., 1 - (1 - 0.05)^10).
By 20 looks, the rate jumps to ~64% (i.e., 1 - (1 - 0.05)^20). In practice, repeated looks re-analyze the same accumulating data and are therefore correlated, so the true inflation is somewhat lower than this independence-based bound, but it is still far above the intended 5%.
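A few lines of Python make the inflation tangible. The first loop reproduces the independence-based formula from the chart; the Monte Carlo after it is a sketch, with assumed simulation settings, of what happens when every look re-analyzes the same accumulating data:

```python
import numpy as np
from scipy import stats

alpha = 0.05

# Upper-bound approximation: treat every look as an independent test
for k in (1, 5, 10, 20):
    print(f"{k:>2} independent looks -> {1 - (1 - alpha) ** k:.0%}")

# In a real test every look re-analyzes the same accumulating data, so the
# z-statistics are correlated. Under H0, the z-statistic at look k equals the
# cumulative sum of k iid standard-normal increments divided by sqrt(k).
rng = np.random.default_rng(0)
n_sims, max_looks = 100_000, 20
z_paths = (np.cumsum(rng.standard_normal((n_sims, max_looks)), axis=1)
           / np.sqrt(np.arange(1, max_looks + 1)))
z_crit = stats.norm.ppf(1 - alpha / 2)   # naive fixed threshold (p < 0.05)
for k in (5, 10, 20):
    fpr = np.mean(np.any(np.abs(z_paths[:, :k]) > z_crit, axis=1))
    print(f"{k:>2} correlated looks  -> {fpr:.0%}")   # lower than the bound, still well above 5%
```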
That’s why sequential testing methods like Pocock or O’Brien-Fleming, which will be explained in the following section, are used to keep the overall false positive rate under control.
🔍 Why Is Peeking Okay in Sequential Testing?
Sequential testing addresses the issue of inflated false positive rates by using adjusted statistical thresholds—such as Pocock or O’Brien-Fleming boundaries. These methods mathematically correct for multiple looks at the data, ensuring that the overall confidence level (e.g., 95%) is preserved across the test.
Among these, the Pocock approach is one of the simplest to implement. It applies a constant, slightly stricter threshold at every checkpoint—for example, p ≈ 0.022 instead of 0.05 when three interim looks are planned—making it easy to operationalize in experimentation pipelines. However, this threshold isn’t one-size-fits-all; it tightens as the number of planned looks grows:
For 2–3 looks, the Pocock threshold is roughly ~0.029–0.022
For 5 looks, it tightens to ~0.016
For 10 looks, it drops to around ~0.011
By contrast, O’Brien-Fleming starts off extremely conservative (e.g., p ≈ 0.005 early on) and relaxes toward the traditional 0.05 threshold as more data is collected. Pocock is especially useful when:
✅ You plan to peek at results frequently during the test
✅ You want a balanced trade-off between early stopping and statistical rigor
✅ You prefer operational simplicity in your experimentation pipelines
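For intuition on where these thresholds come from, here is a rough Monte Carlo sketch that calibrates a Pocock-style (constant) and an O’Brien-Fleming-style (shrinking) z boundary so that the overall Type I error across five equally spaced looks stays near 5%. The five-look design and simulation settings are illustrative assumptions, not a production recipe:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
K, alpha, n_sims = 5, 0.05, 200_000

# Correlated null z-paths at K equally spaced looks (same construction as the previous sketch)
z_paths = (np.cumsum(rng.standard_normal((n_sims, K)), axis=1)
           / np.sqrt(np.arange(1, K + 1)))

def overall_error(boundaries):
    """Probability that |z| exceeds its boundary at any of the K looks."""
    return np.mean(np.any(np.abs(z_paths) > boundaries, axis=1))

def to_p(z):
    """Two-sided nominal p-value corresponding to a z boundary."""
    return 2 * (1 - stats.norm.cdf(z))

# Pocock: the same z threshold c at every look
c_pocock = optimize.brentq(lambda c: overall_error(np.full(K, c)) - alpha, 1.9, 4.0)

# O'Brien-Fleming: boundaries shrink as information accrues, z_k = c * sqrt(K / k)
shape = np.sqrt(K / np.arange(1, K + 1))
c_obf = optimize.brentq(lambda c: overall_error(c * shape) - alpha, 1.5, 4.0)

print("Pocock per-look p threshold:  ", np.round(to_p(c_pocock), 4))      # ~0.016 for K = 5
print("O'Brien-Fleming p thresholds: ", np.round(to_p(c_obf * shape), 6))  # very strict early, ~0.04 at the last look
```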
🔬 A Bit of History
Sequential testing has its roots in wartime innovation. It was first developed by Abraham Wald during World War II while working with the U.S. military to optimize quality control in munitions manufacturing. The goal was to reduce costs and save time by inspecting batches of ammunition more efficiently—stopping early when defects were clearly present or clearly absent.
Wald’s groundbreaking work laid the foundation for Sequential Probability Ratio Tests (SPRT), which became one of the first formal methods for making early, data-driven decisions without compromising accuracy.
Over the decades, sequential testing evolved and became a gold standard in clinical trials, where early stopping can save lives and resources. In recent years, it has made its way into tech and digital product experimentation, powering smarter A/B tests for companies like Walmart, Amazon, Google, Meta, and Netflix. These companies now rely on advanced group sequential designs (like Pocock and O’Brien-Fleming) to make rapid, statistically sound product decisions.
🤖 Types of Sequential Testing
Naive Sequential Testing (with Bonferroni Correction)
How it works: Analyze data at predefined checkpoints, either reusing the fixed significance threshold (e.g., 0.05) at every look (the truly naive version) or applying a simple correction such as the Bonferroni method, where the threshold is divided by the number of planned looks.
Intuition: Guard against false positives by tightening the threshold for each additional look at the data.
Best Applicable for: Scenarios with only a few interim looks and minimal infrastructure.
Drawback: Overly conservative; it often misses true effects because the repeated correction diminishes statistical power.
📉 p-value handling: Threshold is simply divided by the number of interim checks, which can be too conservative for frequent monitoring.
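As a quick illustration, the entire adjustment is a single division; the 12 planned looks here are an assumed example (a check every 100k samples up to 1.2M, mirroring the simulation later in this post):

```python
alpha, planned_looks = 0.05, 12        # e.g., one check per 100k samples up to ~1.2M
per_look_threshold = alpha / planned_looks
print(f"Bonferroni per-look p threshold: {per_look_threshold:.4f}")   # 0.0042
```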
Group Sequential Design (GSD)
How it works: Analyze data at predefined checkpoints using methods like Pocock or O’Brien-Fleming.
Intuition: To balance early stopping with strong Type I error control.
Best Applicable for: Structured organizational testing, such as product experiments or clinical trials with planned review points.
Drawback: Stopping rules and analysis schedules must be specified in advance.
📉 P-value handling:
Pocock: Same threshold across all looks (e.g., p ≈ 0.022).
O’Brien-Fleming: Very low p-values early (e.g., p ≈ 0.005), more lenient later.
Alpha Spending Functions
How it works: Gradually “spends” the total allowable Type I error (e.g., α = 0.05) across multiple planned analyses, based on a spending function.
Intuition: Allocate small amounts of alpha to early looks, reserving more for later checks as evidence accumulates.
Best Applicable for: Adaptive clinical trials and sophisticated experimentation platforms that require flexible stopping rules.
Drawback: Requires careful planning and rigorous tracking of how much alpha has been used at each stage.
📉 p-value handling: Dynamically adjusted based on the proportion of alpha already used—each look has a different threshold guided by the spending function.
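The sketch below works through a linear spending function α(t) = α·t over five equally spaced looks, deriving per-look z thresholds by Monte Carlo. The look count and simulation settings are assumptions for illustration; production systems typically use the Lan–DeMets implementation of the same idea:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
K, alpha, n_sims = 5, 0.05, 200_000

# Correlated null z-paths at K equally spaced looks
z_paths = (np.cumsum(rng.standard_normal((n_sims, K)), axis=1)
           / np.sqrt(np.arange(1, K + 1)))

spent = 0.0
rejected = np.zeros(n_sims, dtype=bool)
for k in range(K):
    target = alpha * (k + 1) / K      # linear spending: alpha * (information fraction)
    budget = target - spent           # alpha available to spend at this look
    z_k = np.abs(z_paths[:, k])
    alive = z_k[~rejected]            # null paths that have not yet stopped
    # Pick the boundary so the *overall* probability of a new rejection equals the budget
    c_k = np.quantile(alive, 1 - budget * n_sims / alive.size)
    newly = (~rejected) & (z_k > c_k)
    spent += newly.mean()
    rejected |= newly
    p_k = 2 * (1 - stats.norm.cdf(c_k))
    print(f"look {k + 1}: reject if z > {c_k:.2f} (nominal p < {p_k:.4f}); alpha spent so far: {spent:.3f}")
```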
SPRT (Sequential Probability Ratio Test)
How it works: Developed by Abraham Wald, this test evaluates the likelihood ratio of observed data under H₀ vs. H₁ and stops early when enough evidence is gathered.
Intuition: Accumulate evidence step-by-step and stop as soon as enough evidence favors one hypothesis.
Best Applicable for: Controlled, low-noise environments like manufacturing or binary decisions with stable metrics.
Drawback: Highly sensitive to noisy early data, which can cause premature or misleading conclusions.
📉 p-value handling: Not explicitly used. Instead, decisions are made by comparing the log-likelihood ratio to thresholds A and B, derived from α (Type I error) and β (Type II error).
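Here is a minimal sketch of Wald’s SPRT for a stream of Bernoulli conversions, testing a 0.5% baseline against an assumed ~6% lift. The rates, error levels, and simulated data stream are illustrative choices, not a prescription:

```python
import numpy as np

def sprt_bernoulli(stream, p0=0.005, p1=0.0053, alpha=0.05, beta=0.20):
    """Wald's SPRT for Bernoulli data: H0: p = p0 vs H1: p = p1 (p1 > p0).
    Stops as soon as the cumulative log-likelihood ratio exits (log B, log A)."""
    log_a = np.log((1 - beta) / alpha)       # upper boundary -> accept H1
    log_b = np.log(beta / (1 - alpha))       # lower boundary -> accept H0
    log_win = np.log(p1 / p0)                # contribution of a conversion
    log_loss = np.log((1 - p1) / (1 - p0))   # contribution of a non-conversion
    llr = 0.0
    for n, x in enumerate(stream, start=1):
        llr += log_win if x == 1 else log_loss
        if llr >= log_a:
            return "accept H1 (lift detected)", n
        if llr <= log_b:
            return "accept H0 (no lift)", n
    return "no decision yet", n

# Illustrative stream: data actually generated under the lifted rate p1 (6% lift on 0.5%)
rng = np.random.default_rng(3)
decision, n_used = sprt_bernoulli(rng.binomial(1, 0.0053, size=5_000_000))
print(decision, f"after {n_used:,} observations")
```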
mSPRT (Mixture Sequential Probability Ratio Test)
How it works: Extends SPRT by incorporating a prior distribution over possible effect sizes, averaging likelihood ratios across multiple plausible outcomes.
Intuition: Avoid committing to a fixed effect size too early; instead, adapt based on a distribution of likely effects.
Best Applicable for: Online A/B testing with uncertainty in behavior or effect size, where early decisions are useful but risky.
Drawback: More complex modeling and computation; requires assumptions about effect size distribution.
📉 p-value handling: Rather than being compared against a single fixed-sample threshold, decisions are based on a mixture likelihood ratio that can be monitored continuously (often framed as “always-valid” p-values), making the approach less conventional but more flexible in dynamic environments.
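For completeness, here is a hedged sketch of the core mSPRT quantity for approximately normal observations: with a normal prior over the effect, the mixture likelihood ratio has a closed form, and the test stops once it exceeds 1/α. Treating per-user conversion differences as normal, and the particular prior scale τ, are simplifying assumptions for illustration:

```python
import numpy as np

def msprt_normal(x, sigma2, tau2, theta0=0.0, alpha=0.05):
    """Mixture SPRT for the mean of (approximately) normal observations.
    H0: mean = theta0; under H1 the mean is drawn from N(theta0, tau2).
    Returns the first n at which the mixture likelihood ratio exceeds 1/alpha,
    or None if the evidence never crosses that boundary in the data provided."""
    n = np.arange(1, len(x) + 1)
    xbar = np.cumsum(x) / n
    lam = (np.sqrt(sigma2 / (sigma2 + n * tau2))
           * np.exp(n ** 2 * tau2 * (xbar - theta0) ** 2
                    / (2 * sigma2 * (sigma2 + n * tau2))))
    crossed = np.nonzero(lam >= 1 / alpha)[0]
    return int(crossed[0]) + 1 if crossed.size else None

# Illustrative use: per-user difference in conversion indicators (test minus control),
# simulated with a ~6% lift on a 0.5% baseline. sigma2 approximates Var(test - control)
# for one matched pair of users; tau2 is a prior guess at plausible effects (~0.1 pp).
rng = np.random.default_rng(4)
p_c, p_t = 0.005, 0.0053
diffs = (rng.binomial(1, p_t, 3_000_000) - rng.binomial(1, p_c, 3_000_000)).astype(float)
stop_at = msprt_normal(diffs, sigma2=2 * p_c * (1 - p_c), tau2=1e-6)
print("mSPRT boundary (1/alpha) first crossed at n =", stop_at)
```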
All the above sequential testing methods aim to make early decisions while controlling Type I error through repeated checks. They differ in how frequently data is evaluated (continuous vs. batched), the assumptions they make (frequentist vs. Bayesian), and how they allocate statistical error (fixed threshold vs. adaptive spending).
⚠️ When to Avoid Sequential Testing
Sequential testing isn't always the best choice. Avoid it when:
You expect minimal difference between variants and cannot risk early stopping errors.
Your metrics are high-variance or noisy, making early decisions unreliable.
You're working in a highly regulated environment (e.g., pharma, finance) where pre-specifying the test design is required.
Your experiment is tied to a short-term campaign or seasonal traffic spike, where stopping early might lead to non-representative conclusions.
🚀 Exploring Sequential Testing: How We Brought Stakeholders Along
When we started looking into sequential testing to improve Walmart’s experimentation framework, the real question wasn’t just “Can we do it?”—it was:
“How do we build alignment around its value and operational impact?”
Sequential testing isn’t just a statistical shift—it changes how fast we can learn, how often we look at data, and how we interpret results.
🔍 Framing the Opportunity
We didn’t lead with math—we led with impact:
💬 “What if we could make confident decisions sooner—without waiting 4 full weeks?”
💬 “What if we could reduce exposure to poor-performing variants and accelerate learning?”
💬 “This approach powers online experimentation at Spotify, Booking.com, Netflix—how could it work here?”
This framing helped spark interest and open the door to practical conversations.
🧰 Key Building Blocks We Explored
1. Real-Time Data Collection
📌 “Sequential testing relies on near real-time metrics. How close are we today?”
🧠 We first assessed our existing data pipelines to understand refresh rates and latency, then explored lightweight extensions (like Kafka/Spark) to support more frequent metric ingestion—without a full rebuild.
2. Embedded Statistical Logic
📌 “Instead of calculating one p-value at the end, we recalculate at multiple checkpoints using rules like Pocock or likelihood ratios.”
🧠 We explored adding a modular statistical layer to our existing experimentation engine. This would handle p-value recalculations, stopping rules, and logging—without disrupting the core system.
3. Dashboards with Guardrails
📌 “Teams need visibility—how is the test progressing, are we near a stopping point?”
🧠 We proposed to design mock dashboards using past tests to show what early stopping would’ve looked like.
4. Automated Stopping Rules
📌 “If a test hits a win/loss boundary, it should trigger an alert or action—no need to wait for manual checks.”
🧠 We explored developing rules for auto-stopping experiments when statistical boundaries are crossed, capturing when, why, and how a test was stopped.
5. Cross-Functional Alignment
📌 “This isn’t just a stats change—it affects how results are acted on. We need cross-team clarity.”
🧠 We brought product, analytics, and engineering into early conversations to understand what might need updating.
⏳ Pilot Timeline
We scoped a phased rollout:
Start with low-risk experiments (e.g., banner tests, funnel UI tweaks)
If infrastructure aligns, a pilot could be ready in 6–8 weeks
This wasn’t about pushing a new tool—it was about showing how a smarter testing approach could drive faster decisions, safer experimentation, and more efficient learning—with buy-in from the people who make it happen.
🏢 Bonus Real-World Examples:
Booking.com Link
Implementation: Booking.com has integrated sequential testing methods into its experimentation framework to assess changes on their website or mobile application. This approach allows for continuous monitoring and the potential to reach conclusions earlier than traditional fixed-sample tests.
Spotify Link
Implementation: Spotify’s Experimentation Platform utilizes group sequential tests (GSTs). They chose GSTs because their data infrastructure processes information in batches, aligning well with the requirements of this testing method.
📊 Python Simulation (Traditional A/B vs. Sequential Testing)
For an initial MVP and to align with key stakeholders, it’s helpful to demonstrate algorithm performance by simulating a comparison between traditional A/B testing and naive sequential testing. Based on the initial discussion, we assume a checkout baseline conversion rate of 0.5% and simulate a 6% lift for the test experience, with incremental checks conducted every 100k samples.
The graph below illustrates that the test reached the significance threshold at just 33% of the total sample size, with the p-value stabilizing well before the full duration of the experiment. In such cases, it opens up the opportunity to discuss ending the test early—especially if the observed lift is negative and statistically significant, which makes an early decision even more critical.
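For reference, a condensed sketch of the naive version of that simulation is below; the full comparison across all methods lives in the linked GitHub code, and the exact stopping point depends on the random seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
p_control, lift, batch, n_total = 0.005, 0.06, 100_000, 1_200_000
p_test = p_control * (1 + lift)

c_conv = t_conv = 0
for n in range(batch, n_total + 1, batch):
    # Accumulate another batch of users in each arm
    c_conv += rng.binomial(batch, p_control)
    t_conv += rng.binomial(batch, p_test)
    # Two-proportion z-test on all data collected so far
    p_pool = (c_conv + t_conv) / (2 * n)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    z = (t_conv / n - c_conv / n) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    print(f"n per arm = {n:>9,}  p-value = {p_value:.4f}")
    if p_value < 0.05:          # naive: unadjusted threshold at every look
        print("Naive sequential test would stop here.")
        break
```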
While naive sequential testing offers early insights, it comes at the cost of increased false positives due to unadjusted repeated peeking. To address this, we explored and compared more statistically rigorous methods using the same simulated data.
The graph below begins with Bonferroni correction, which addresses the issue of repeated testing by dividing the significance level across all interim looks—yielding a very strict threshold (≈0.0042). While this effectively reduces false positives, it’s often too conservative, potentially causing teams to miss valid early wins.
Pocock’s method offers a more balanced alternative, using a constant threshold (≈0.022) at every checkpoint. This makes it easier to implement and interpret while still maintaining Type I error control. However, its uniform threshold doesn’t account for changes in test power over time. O’Brien-Fleming improves upon this by starting with stricter thresholds (e.g., p ≈ 0.005) that become more lenient later, making it a strong choice when early stopping is less critical but decision flexibility is desired toward the end.
Alpha Spending Functions add another layer of nuance by dynamically allocating the total Type I error across looks—effectively loosening the threshold as more data accumulates. In this implementation, we use a linear alpha spending function, where the allowable error is distributed evenly across all checkpoints. This means early looks receive a smaller portion of the alpha budget, and the threshold gradually becomes more permissive over time.
Meanwhile, SPRT and mSPRT move away from p-values entirely, relying on likelihood ratios to make decisions. In the SPRT panel, we observe the log-likelihood ratio crossing the upper threshold relatively early, indicating strong evidence in favor of the test variant. However, its sensitivity to early noise can occasionally be misleading. mSPRT improves on this by incorporating a distribution of possible effect sizes, resulting in a more stable and robust decision boundary that crosses the critical threshold (≈3.84) in a smoother fashion—ideal for noisy, real-world environments.
In this simulation, the observed lift in conversion was so strong and consistent that all methods—regardless of their thresholds or statistical frameworks—eventually reached their respective stopping criteria. This level of convergence across methods provides high confidence that the test variant genuinely outperformed the control, making it safe to conclude that the new feature had a meaningful and positive impact on conversions.
Github Code link
🧠 Conclusion: Faster, Smarter Experimentation Starts Here
Sequential testing offers a powerful alternative to traditional A/B testing—especially in fast-paced product environments where time, resources, and user exposure matter. By allowing for interim analyses and statistically sound early stopping, it addresses some of the biggest challenges of experimentation: long run times, inconclusive results, and exposure to underperforming variants.
While naive sequential testing gives a taste of speed, it’s prone to inflated false positives. That’s why rigorously defined methods like Pocock, O’Brien-Fleming, SPRT, and Alpha Spending Functions are critical for maintaining validity while enabling agility. With the right guardrails and implementation, sequential testing transforms how data-driven decisions are made—bringing both efficiency and confidence to the table.
🚀 What’s Next?
In the next post of this series, we’ll explore another powerful technique that steps in when traditional A/B testing falls short—Interleaving.
Standard A/B tests assign each user to one experience or model (A or B) and measure outcomes over time. While this works well for static features, it struggles when comparing ranking algorithms or recommendation models. Why? Because ranking problems are sensitive to user context, and waiting for statistical significance across separate user groups can be slow and inefficient.
Interleaving addresses this by blending the outputs of both models into a single ranked list in real time—within the same user session. This lets you directly observe which model drives better engagement, without waiting weeks for results or exposing users to entirely subpar experiences.
Stay tuned as we continue our journey beyond traditional A/B testing—exploring smarter, faster experimentation methods designed for real-world complexity. Subscribe on Substack to get the next post delivered straight to your inbox!
About the Writers:
Banani Mohapatra: Banani is a seasoned data science product leader with over 12 years of experience across e-commerce, payments, and real estate domains. She currently leads a data science team for Walmart’s subscription product, driving growth while supporting fraud prevention and payment optimization. Over the years, Banani has designed and analyzed 100+ experiments, enabling data-driven decisions at scale and fostering a culture of rapid, evidence-based iteration.
Bhavnish Walia: Bhavnish is a product risk manager with over 12 years of experience in finance and e-commerce. He currently leads risk management at Amazon, where he builds and scales data-driven frameworks to ensure safe, compliant product deployment. Previously at Citibank, Bhavnish led multiple experimentation initiatives, designing data science solutions that enhanced customer experience and mitigated risk.