Challenge towards accurately testing QEC at scale

We proposed ScaLER, a new method to test quantum error correction (QEC) at scale. Craig Gidney, the creator of Stim, challenged our estimated logical error rate by running 634 billion brute-force Monte Carlo shots. In this blog, we present extended experiments showing that ScaLER’s estimate converges toward the Stim result while using orders-of-magnitude fewer samples.

Context

ScaLER is based on weighted stratified sampling and S-curve model extrapolation. Instead of brute-force Monte Carlo sampling – which requires an astronomical number of shots to observe even a single logical error at low error rates – ScaLER decomposes the error space into weight subspaces, samples each subspace with adaptive allocation, and fits an S-curve model to extrapolate the logical error rate.
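In outline, and under the standard stratification identity for i.i.d. fault locations (our notation here; the precise Log-S model and allocation policy are defined in the paper), the quantity being assembled is

\[p_L \;=\; \sum_{w=0}^{n} \Pr[W = w]\; p_L(w), \qquad \Pr[W = w] = \binom{n}{w}\, p^{w} (1-p)^{\,n-w},\]

where $n$ is the number of fault locations in the circuit, $W$ is the number of faults in a shot, and $p_L(w)$ is the logical error rate conditioned on exactly $w$ faults. The fitted S-curve supplies $p_L(w)$ for weights whose logical failures are too rare to estimate directly.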

After our paper appeared on arXiv, a public debate emerged on Scirate with Craig regarding the accuracy of our estimated logical error rate for the $d = 17$ surface code. We present the full experimental data from our extended runs and provide a detailed analysis below.

The Challenge from Stim

In our paper, the original ScaLER estimate for the $d = 17$ rotated surface code (with single-qubit depolarizing noise at $p = 0.0005$ and $3d = 51$ rounds of syndrome measurement) was:

\[p_L^{\text{ScaLER}} \approx 1.51 \times 10^{-11}\]

Craig ran Stim with PyMatching and reported:

634,202,872,968 shots sampled, 1 error seen.

This gives a brute-force Monte Carlo estimate of:

\[p_L^{\text{Stim}} = \frac{1}{634{,}202{,}872{,}968} \approx 1.57 \times 10^{-12}\]

This is roughly an order of magnitude lower than our original 2-hour ScaLER estimate, raising a legitimate question about the accuracy of our extrapolation.
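For reference, the brute-force baseline Craig ran is the standard Stim + PyMatching sampling loop. The sketch below shows that loop on a comparable circuit; the task string, noise flag, and batch size are illustrative assumptions rather than a reproduction of his exact configuration.

```python
import numpy as np
import stim
import pymatching

# Illustrative d = 17 memory experiment with 51 rounds; the specific
# noise flag below is an assumption, not Craig's exact noise model.
circuit = stim.Circuit.generated(
    "surface_code:rotated_memory_z",
    distance=17,
    rounds=51,
    before_round_data_depolarization=0.0005,
)

matcher = pymatching.Matching.from_detector_error_model(
    circuit.detector_error_model(decompose_errors=True)
)
sampler = circuit.compile_detector_sampler()

shots = 10_000_000  # one batch; Craig's total was ~6.34e11 shots
dets, obs = sampler.sample(shots, separate_observables=True)
predictions = matcher.decode_batch(dets)
errors = np.count_nonzero(np.any(predictions != obs, axis=1))
print(errors / shots)  # brute-force Monte Carlo estimate of p_L
```

At $p_L \sim 10^{-12}$, a batch like this will almost always report zero errors, which is exactly the regime where the estimator's variance becomes the problem.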

Putting the Challenge in Perspective

Before presenting our new experiments, we believe several points are worth noting.

The sample cost asymmetry is extreme. Craig’s Stim verification required $6.34 \times 10^{11}$ total shots to observe a single logical error. ScaLER’s original 2-hour estimate used only $3.91 \times 10^{7}$ samples – over $16{,}000\times$ fewer. Even our most expensive 96-hour run used $3.41 \times 10^{9}$ samples, still $\sim 186\times$ fewer than what Stim needed. Even if ScaLER’s estimate carries a positive bias of $5$-$10\times$, obtaining an order-of-magnitude estimate with orders-of-magnitude fewer samples has clear practical value.

Stim’s result itself carries substantial uncertainty. With only 1 logical error observed in $6.34 \times 10^{11}$ shots, the Stim estimate is also subject to large statistical fluctuation. The 95% confidence interval for a Poisson process with 1 observed event spans roughly $[0.025, 5.57]$ events, which translates to:

\[p_L^{\text{Stim}} \in [3.9 \times 10^{-14},\; 8.8 \times 10^{-12}]\]

In other words, the true logical error rate could plausibly be anywhere from $\sim 4 \times 10^{-14}$ to $\sim 9 \times 10^{-12}$ based on the Stim data alone. A single observed error does not constitute a precise ground truth – it is itself a noisy estimate.
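The interval above is the exact (Garwood) Poisson confidence interval; it can be reproduced from chi-square quantiles, for example with SciPy:

```python
from scipy.stats import chi2

shots, k = 634_202_872_968, 1  # Craig's run: one logical error observed

# Exact (Garwood) 95% CI for a Poisson mean given k observed events.
lo_events = chi2.ppf(0.025, 2 * k) / 2        # ~0.025 events
hi_events = chi2.ppf(0.975, 2 * (k + 1)) / 2  # ~5.57 events

print(lo_events / shots, hi_events / shots)   # ~4e-14 to ~8.8e-12
```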

Monte Carlo is unbiased but impractical at scale. We fully agree with Craig’s point that Monte Carlo sampling is “bullet proof” in the sense that it is an unbiased estimator with no hyperparameters to misconfigure. However, being unbiased does not mean being useful under a limited budget. When Monte Carlo observes 0 or 1 errors, the resulting estimate has enormous variance. As we pointed out in our paper, at $d = 11$ with a 2-hour budget, Monte Carlo observed only 3 logical errors and produced an estimate of $(2.22 \pm 1.26) \times 10^{-8}$ – where the $3\sigma$ interval extends below zero. Such an estimate, while unbiased, is practically uninformative. ScaLER, by contrast, produces consistent estimates with low variance across repeated runs, even though it carries a systematic positive bias.
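To make the variance point concrete, here is the arithmetic behind that $d = 11$ example, assuming Poisson statistics for the error count (the paper’s $\sigma$ is measured across repeated runs, so the exact value differs slightly):

```python
import math

p_hat, k = 2.22e-8, 3         # d = 11, 2-hour Monte Carlo budget: 3 errors seen
shots = k / p_hat             # ~1.35e8 shots implied by the estimate

sigma = math.sqrt(k) / shots  # Poisson standard error, ~1.3e-8
print(p_hat - 3 * sigma)      # the 3-sigma lower bound is negative
```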

New Experiments: Convergence with Increasing Budget

To directly address the challenge, we ran ScaLER with increasing time budgets: 2 hours, 12 hours, 24 hours, 48 hours, and 96 hours on the same $d = 17$ surface code circuit. The goal is to examine whether ScaLER’s extrapolation-based estimate converges toward the Stim result as we allocate more computational resources.

Below, we show the Log-S curve fitting results and the detailed subspace sampling data for each time budget.

2-Hour Time Budget

This is the original time budget used in the paper. The estimated logical error rate is:

\[p_L^{(2\text{h})} = 1.487 \times 10^{-11}, \quad R^2 = 0.9902\]

Total samples: $39{,}084{,}155$. Total logical errors detected: $2{,}756$.

| Weight $w$ | Samples | Logical Errors | Subspace LER |
|---:|---:|---:|---:|
| 245 | 13,154,615 | 11 | $8.36 \times 10^{-7}$ |
| 271 | 13,154,615 | 26 | $1.98 \times 10^{-6}$ |
| 300 | 7,979,615 | 30 | $3.76 \times 10^{-6}$ |
| 329 | 2,546,499 | 30 | $1.18 \times 10^{-5}$ |
| 358 | 1,033,999 | 30 | $2.90 \times 10^{-5}$ |
| 387 | 603,999 | 30 | $4.97 \times 10^{-5}$ |
| 413 | 254,013 | 30 | $1.18 \times 10^{-4}$ |
| 442 | 154,025 | 30 | $1.95 \times 10^{-4}$ |
| 471 | 72,775 | 32 | $4.40 \times 10^{-4}$ |
| 500 | 45,000 | 30 | $6.67 \times 10^{-4}$ |
| 529 | 30,000 | 36 | $1.20 \times 10^{-3}$ |
| 558 | 15,000 | 32 | $2.13 \times 10^{-3}$ |
| 650 | 10,000 | 74 | $7.40 \times 10^{-3}$ |
| 742 | 10,000 | 249 | $2.49 \times 10^{-2}$ |
| 834 | 10,000 | 661 | $6.61 \times 10^{-2}$ |
| 934 | 10,000 | 1,452 | $1.45 \times 10^{-1}$ |
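As an illustration of the fitting step, the sketch below fits a generic logistic S-curve to the (weight, subspace LER) pairs from the 2-hour table, in log space. The functional form and initial guess are stand-in assumptions; ScaLER’s actual Log-S model and the final extrapolation to $p_L$ follow the definitions in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# (weight, samples, logical errors) from the 2-hour table above.
data = [
    (245, 13_154_615, 11), (271, 13_154_615, 26), (300, 7_979_615, 30),
    (329, 2_546_499, 30),  (358, 1_033_999, 30),  (387, 603_999, 30),
    (413, 254_013, 30),    (442, 154_025, 30),    (471, 72_775, 32),
    (500, 45_000, 30),     (529, 30_000, 36),     (558, 15_000, 32),
    (650, 10_000, 74),     (742, 10_000, 249),    (834, 10_000, 661),
    (934, 10_000, 1_452),
]
w = np.array([row[0] for row in data], dtype=float)
ler = np.array([errs / n for _, n, errs in data])

def log_s(w, lo, hi, k, w0):
    """Logistic model for log10(subspace LER) as a function of weight."""
    return lo + (hi - lo) / (1.0 + np.exp(-k * (w - w0)))

params, _ = curve_fit(log_s, w, np.log10(ler),
                      p0=[-8.0, 0.0, 0.01, 600.0], maxfev=20_000)
resid = np.log10(ler) - log_s(w, *params)
r2 = 1.0 - resid.var() / np.log10(ler).var()
print(params, r2)
```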

12-Hour Time Budget

With a $6\times$ increase in budget, the estimate drops noticeably:

\[p_L^{(12\text{h})} = 1.106 \times 10^{-11}, \quad R^2 = 0.9857\]

Total samples: $355{,}170{,}053$. Total logical errors detected: $2{,}862$.

*Figure: Log-S curve fit ($d = 17$, $p = 0.0005$, 12-hour budget).*

| Weight $w$ | Samples | Logical Errors | Subspace LER |
|---:|---:|---:|---:|
| 189 | 158,571,572 | 6 | $3.78 \times 10^{-8}$ |
| 220 | 138,421,572 | 30 | $2.17 \times 10^{-7}$ |
| 255 | 44,571,572 | 30 | $6.73 \times 10^{-7}$ |
| 290 | 7,956,370 | 30 | $3.77 \times 10^{-6}$ |
| 325 | 3,531,370 | 30 | $8.50 \times 10^{-6}$ |
| 360 | 961,370 | 30 | $3.12 \times 10^{-5}$ |
| 392 | 703,047 | 31 | $4.41 \times 10^{-5}$ |
| 427 | 195,965 | 31 | $1.58 \times 10^{-4}$ |
| 462 | 127,215 | 31 | $2.44 \times 10^{-4}$ |
| 497 | 55,000 | 31 | $5.64 \times 10^{-4}$ |
| 532 | 20,000 | 34 | $1.70 \times 10^{-3}$ |
| 568 | 15,000 | 39 | $2.60 \times 10^{-3}$ |
| 658 | 10,000 | 83 | $8.30 \times 10^{-3}$ |
| 749 | 10,000 | 258 | $2.58 \times 10^{-2}$ |
| 839 | 10,000 | 705 | $7.05 \times 10^{-2}$ |
| 938 | 10,000 | 1,463 | $1.46 \times 10^{-1}$ |

24-Hour Time Budget

\[p_L^{(24\text{h})} = 9.555 \times 10^{-12}, \quad R^2 = 0.9880\]

Total samples: $711{,}079{,}317$. Total logical errors detected: $2{,}782$.

*Figure: Log-S curve fit ($d = 17$, $p = 0.0005$, 24-hour budget).*

| Weight $w$ | Samples | Logical Errors | Subspace LER |
|---:|---:|---:|---:|
| 168 | 288,817,786 | 6 | $2.08 \times 10^{-8}$ |
| 200 | 288,817,786 | 21 | $7.27 \times 10^{-8}$ |
| 236 | 104,867,786 | 30 | $2.86 \times 10^{-7}$ |
| 272 | 21,121,136 | 30 | $1.42 \times 10^{-6}$ |
| 308 | 4,333,636 | 30 | $6.92 \times 10^{-6}$ |
| 344 | 1,723,636 | 30 | $1.74 \times 10^{-5}$ |
| 376 | 765,341 | 30 | $3.92 \times 10^{-5}$ |
| 412 | 351,105 | 31 | $8.83 \times 10^{-5}$ |
| 448 | 126,105 | 33 | $2.62 \times 10^{-4}$ |
| 484 | 65,000 | 30 | $4.62 \times 10^{-4}$ |
| 520 | 35,000 | 35 | $1.00 \times 10^{-3}$ |
| 556 | 15,000 | 31 | $2.07 \times 10^{-3}$ |
| 647 | 10,000 | 93 | $9.30 \times 10^{-3}$ |
| 739 | 10,000 | 263 | $2.63 \times 10^{-2}$ |
| 830 | 10,000 | 620 | $6.20 \times 10^{-2}$ |
| 930 | 10,000 | 1,469 | $1.47 \times 10^{-1}$ |

48-Hour Time Budget

\[p_L^{(48\text{h})} = 8.938 \times 10^{-12}, \quad R^2 = 0.9879\]

Total samples: $1{,}444{,}921{,}952$. Total logical errors detected: $2{,}822$.

*Figure: Log-S curve fit ($d = 17$, $p = 0.0005$, 48-hour budget).*

| Weight $w$ | Samples | Logical Errors | Subspace LER |
|---:|---:|---:|---:|
| 168 | 873,007,177 | 14 | $1.60 \times 10^{-8}$ |
| 203 | 512,057,177 | 30 | $5.86 \times 10^{-8}$ |
| 241 | 37,782,177 | 30 | $7.94 \times 10^{-7}$ |
| 280 | 15,916,385 | 30 | $1.88 \times 10^{-6}$ |
| 319 | 4,153,885 | 30 | $7.22 \times 10^{-6}$ |
| 357 | 1,113,885 | 30 | $2.69 \times 10^{-5}$ |
| 392 | 505,576 | 30 | $5.93 \times 10^{-5}$ |
| 431 | 198,470 | 30 | $1.51 \times 10^{-4}$ |
| 470 | 67,220 | 31 | $4.61 \times 10^{-4}$ |
| 508 | 45,000 | 31 | $6.89 \times 10^{-4}$ |
| 547 | 20,000 | 34 | $1.70 \times 10^{-3}$ |
| 586 | 15,000 | 41 | $2.73 \times 10^{-3}$ |
| 670 | 10,000 | 86 | $8.60 \times 10^{-3}$ |
| 755 | 10,000 | 290 | $2.90 \times 10^{-2}$ |
| 840 | 10,000 | 662 | $6.62 \times 10^{-2}$ |
| 933 | 10,000 | 1,423 | $1.42 \times 10^{-1}$ |

96-Hour Time Budget

With the largest budget, the estimate continues to decrease:

\[p_L^{(96\text{h})} = 7.843 \times 10^{-12}, \quad R^2 = 0.9902\]

Total samples: $3{,}408{,}739{,}244$. Total logical errors detected: $2{,}780$.

*Figure: Log-S curve fit ($d = 17$, $p = 0.0005$, 96-hour budget).*

| Weight $w$ | Samples | Logical Errors | Subspace LER |
|---:|---:|---:|---:|
| 155 | 2,537,896,770 | 15 | $5.91 \times 10^{-9}$ |
| 189 | 734,246,770 | 30 | $4.09 \times 10^{-8}$ |
| 227 | 104,471,770 | 30 | $2.87 \times 10^{-7}$ |
| 264 | 21,041,774 | 30 | $1.43 \times 10^{-6}$ |
| 302 | 7,766,774 | 30 | $3.86 \times 10^{-6}$ |
| 340 | 1,886,774 | 30 | $1.59 \times 10^{-5}$ |
| 374 | 853,482 | 30 | $3.52 \times 10^{-5}$ |
| 412 | 310,690 | 30 | $9.66 \times 10^{-5}$ |
| 450 | 104,440 | 31 | $2.97 \times 10^{-4}$ |
| 488 | 60,000 | 31 | $5.17 \times 10^{-4}$ |
| 526 | 40,000 | 34 | $8.50 \times 10^{-4}$ |
| 564 | 20,000 | 36 | $1.80 \times 10^{-3}$ |
| 654 | 10,000 | 77 | $7.70 \times 10^{-3}$ |
| 744 | 10,000 | 249 | $2.49 \times 10^{-2}$ |
| 834 | 10,000 | 647 | $6.47 \times 10^{-2}$ |
| 932 | 10,000 | 1,450 | $1.45 \times 10^{-1}$ |

Summary

The following table summarizes how the ScaLER estimate evolves as the time budget increases, alongside Stim’s brute-force Monte Carlo result for comparison:

| Method | Time Budget | Total Samples | Min. Weight $w_{\min}$ | Estimated $p_L$ |
|---|---|---:|---:|---:|
| ScaLER | 2h | $3.91 \times 10^{7}$ | 245 | $1.487 \times 10^{-11}$ |
| ScaLER | 12h | $3.55 \times 10^{8}$ | 189 | $1.106 \times 10^{-11}$ |
| ScaLER | 24h | $7.11 \times 10^{8}$ | 168 | $9.555 \times 10^{-12}$ |
| ScaLER | 48h | $1.44 \times 10^{9}$ | 168 | $8.938 \times 10^{-12}$ |
| ScaLER | 96h | $3.41 \times 10^{9}$ | 155 | $7.843 \times 10^{-12}$ |
| Stim (MC) | ~10 days | $6.34 \times 10^{11}$ | – | $1.57 \times 10^{-12}$ |

Each column has the following meaning:

  • Method: The estimation method used. ScaLER uses weighted stratified sampling with S-curve extrapolation; Stim (MC) uses brute-force Monte Carlo sampling.
  • Time Budget: The wall-clock time allocated for the experiment.
  • Total Samples: The total number of syndrome samples drawn by each method within its time budget.
  • Min. Weight $w_{\min}$: The minimum noise weight subspace that was tested. With more time, ScaLER can afford to sample lower-weight subspaces, which have exponentially smaller error rates and require far more samples to observe logical errors. Reaching lower weights means the S-curve fit is informed by data closer to the fault-tolerant regime, leading to more reliable extrapolation. This column is not applicable to Stim’s uniform Monte Carlo sampling.
  • Estimated $p_L$: The logical error rate estimated by each method.

The trend is clear: as the time budget increases from 2h to 96h, the estimated logical error rate decreases monotonically from $1.487 \times 10^{-11}$ down to $7.843 \times 10^{-12}$. Correspondingly, the minimum tested weight decreases from $w_{\min} = 245$ to $w_{\min} = 155$, meaning the S-curve model is being anchored by data from increasingly lower-weight (and rarer) error events.

Note also the sample efficiency: our 96-hour run used $3.41 \times 10^9$ total samples to arrive at an estimate of $7.84 \times 10^{-12}$, while Stim required $6.34 \times 10^{11}$ shots – approximately $186\times$ more samples – and even then only observed a single logical error. This dramatic difference in sample efficiency is the core advantage of weighted stratified sampling over uniform Monte Carlo.

Discussion

What we acknowledge

We do not claim that ScaLER produces an unbiased estimate of the logical error rate. The S-curve extrapolation introduces a systematic positive bias, as the model must extrapolate from the measured weight subspaces down to the full error space. This bias is inherent in any extrapolation-based approach. As we discussed in Section 8.2 of the paper, this tradeoff between bias and computational cost is fundamental.

We also agree with Craig’s criticism that the notation $p_L \pm \sigma$ in our paper can be misleading. In our paper, $\sigma$ refers to the standard deviation across repeated runs (reflecting the estimator’s variance), not a confidence interval that accounts for systematic error. We will improve this notation in the next version of the paper to avoid confusion.

What the data shows

Despite the systematic bias, the experimental data presented in this blog demonstrates several encouraging properties of ScaLER:

  1. Monotonic convergence. The estimated $p_L$ decreases consistently as the time budget increases. This is not guaranteed for an arbitrary heuristic method – it reflects the fact that with more budget, ScaLER reaches lower-weight subspaces and the S-curve fit becomes better constrained.

  2. The estimate is in the right ballpark. Even the 2-hour estimate of $1.49 \times 10^{-11}$ is within one order of magnitude of the Stim result. At 96 hours, the ScaLER estimate of $7.84 \times 10^{-12}$ is within $5\times$ of Stim’s $1.57 \times 10^{-12}$, and it falls inside the upper range of Stim’s own confidence interval (which extends up to $\sim 8.8 \times 10^{-12}$ for the 95% Poisson CI).

  3. Massive sample efficiency. ScaLER reached an estimate within $5\times$ of the brute-force result using $3.41 \times 10^9$ samples, compared to Stim’s $6.34 \times 10^{11}$ shots – a $186\times$ reduction in sample count. This sample efficiency is the fundamental advantage of weighted stratified sampling over uniform Monte Carlo: by directing samples to informative weight subspaces, ScaLER extracts far more information per sample.

  4. Information-rich output. Unlike Monte Carlo, which produced exactly 1 logical error in $6.34 \times 10^{11}$ shots, ScaLER produces thousands of logical errors distributed across weight subspaces (see the tables above). This structured output provides insight into the error landscape of the code – which weight subspaces contribute most to logical failure – information that pure Monte Carlo sampling at this scale simply cannot provide.

The broader point

The goal of ScaLER is not to replace Monte Carlo sampling in regimes where it works well. When the logical error rate is above $\sim 10^{-8}$ and sufficient computational budget is available, brute-force Monte Carlo remains the gold standard. But at scale – for large-distance codes in the low-noise regime where logical error rates plummet to $10^{-10}$ or below – Monte Carlo hits a wall. As Craig’s experiment vividly demonstrates, it took $6.34 \times 10^{11}$ shots to observe a single error event.

ScaLER offers a practical alternative: a biased but convergent estimator that can provide useful order-of-magnitude estimates with orders-of-magnitude fewer samples. With $3.41 \times 10^9$ samples ($186\times$ fewer than Stim), ScaLER produced an estimate within $5\times$ of the brute-force result. The bias is a known limitation, and as the data in this blog shows, it diminishes with increased budget. Further research on understanding and reducing this systematic error – particularly in regimes where no ground truth is available – remains an important open problem that we are actively pursuing.
