Most AI survey tools sell you a coin flip and call it research. Ask 100 AI personas the same question twice and you'll see why. The answers move. Sometimes a 47 percent yes minority flips to a 53 percent yes majority. The persona definitions are fixed. The language model voicing them is not. Even with the prompt held fixed, sampling temperature alone produces run-to-run variance, and that variance is real and large.

This is the dirty secret of AI survey research. If you read “62 percent of indie founders would pay $20 a month for this” off a single run, you are reading noise as signal. The 62 might be 58 in another run. It might be 67. You do not know which version of reality you are looking at.

The fix is not better prompts. The fix is more runs and the right math to combine them.

Here is the approach we take at neverboringnow. Two pieces. Wilson confidence intervals on each run. Rubin's rules to pool runs together. Both are old statistics with strong theory behind them. Both are what survey methodologists already do for human survey panels. The translation to AI personas is mostly direct, with one important caveat we'll come back to.

The single run problem in numbers

Take a question where the true population probability of yes is 60 percent. Run 100 personas. You will not get exactly 60 yes responses. You will get a count that scatters around 60, following a binomial distribution. Even if you happen to land on exactly 60, the Wilson 95 percent interval is roughly 50 to 69. That is a 19 point spread on a question where the true answer is fixed.
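To make that scatter concrete, here is a minimal sketch that computes the central 95 percent range of yes counts directly from the binomial distribution. It treats the 100 personas as independent draws with a fixed 60 percent yes probability, which is the idealized setup of this paragraph, not a claim about how any real panel behaves. The function name binomial_central_range is made up for illustration.

```python
from math import comb

def binomial_central_range(n: int, p: float, coverage: float = 0.95) -> tuple[int, int]:
    """Range of yes counts left after trimming at most (1 - coverage) / 2 from each tail."""
    tail = (1.0 - coverage) / 2.0
    pmf = [comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]

    # Walk up from 0 until adding the next count would push the lower tail past `tail`.
    cumulative, low = 0.0, 0
    for k in range(n + 1):
        if cumulative + pmf[k] > tail:
            low = k
            break
        cumulative += pmf[k]

    # Walk down from n the same way for the upper tail.
    cumulative, high = 0.0, n
    for k in range(n, -1, -1):
        if cumulative + pmf[k] > tail:
            high = k
            break
        cumulative += pmf[k]
    return low, high

# Idealized setup from the text: 100 personas, true yes probability 0.60.
low, high = binomial_central_range(n=100, p=0.60)
print(f"At least 95 percent of single runs land between {low} and {high} yes answers")
```

The range it prints describes where repeated runs fall when the truth is known. The Wilson interval goes the other direction and describes what a single observed count lets you say about the unknown truth.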

Now picture a founder making a pricing decision. They see 62 yes. They feel decisive. They ship. The next run shows 56. They lose confidence. They stall. Neither number is wrong individually. The problem is treating one sample as if it were the population.

A single run of 100 AI personas gives you a noisy estimate. Its Wilson 95 percent interval is roughly 10 to 20 points wide for typical questions, and widest when responses split near 50/50. If your decision threshold is “more than half” or “at least 60 percent”, that interval often straddles your decision line. You are flipping a coin and calling it research.

Wilson, not the textbook one

Most introductory stats courses teach the normal approximation interval, p plus or minus 1.96 times the square root of p times one minus p over n. It works fine near p of 0.5 with large n. It breaks down badly when p is near 0 or 1.

Wilson's score interval handles edge cases without breaking. The formula is uglier but the behavior is right. For 100 personas where 98 say yes, the normal approximation produces an upper bound of about 100.7 percent, which is nonsense. Wilson does not do that. It gives roughly 93 to 99 percent for the same data.
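A short sketch of both intervals side by side, using only the standard formulas. The function names are ours for illustration. It evaluates 60 of 100, and 98 of 100, a count extreme enough for the normal approximation's upper bound to exceed 1.

```python
from math import sqrt

Z_95 = 1.959964  # two-sided 95 percent critical value of the standard normal

def wald_interval(yes: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Textbook normal-approximation interval: p +/- z * sqrt(p * (1 - p) / n)."""
    p = yes / n
    half = z * sqrt(p * (1.0 - p) / n)
    return p - half, p + half

def wilson_interval(yes: int, n: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval; stays inside [0, 1] up to floating-point rounding."""
    p = yes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half = z * sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half

for yes in (60, 98):
    wald_lo, wald_hi = wald_interval(yes, 100)
    wil_lo, wil_hi = wilson_interval(yes, 100)
    print(f"{yes}/100  normal: {wald_lo:.3f} to {wald_hi:.3f}   Wilson: {wil_lo:.3f} to {wil_hi:.3f}")
```

Near the middle the two intervals nearly coincide. Near the edge the Wilson interval shifts its center toward 50 percent and becomes asymmetric, which is what keeps it inside the valid range.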

For yes-no-maybe surveys where many answers cluster near the edges of the distribution, Wilson is the right choice for marginal intervals on each option. Joint distributions over the three options need different machinery, which we cover in the post on Dirichlet posteriors. For now, Wilson per option per run is the foundation.

One important caveat before moving on. The 100 personas are not a sample from the target population of real humans. They are a sample from the language model's representation of that population. Wilson intervals quantify within-run sampling variance across the 100 personas. Rubin pooling handles between-run variance, which we cover next. Neither addresses the gap between the model's representation and reality. That gap exists. It needs separate validation, which lives in the eval layer. Treat these intervals as honesty about run noise, not honesty about the world.

Rubin's rules for pooling runs

If one run gives you noise, two runs give you less noise. Three runs give you less still. The naive thing is to average percentages across runs and report the average. The point estimate from naive averaging is fine. The uncertainty estimate is wrong, because naive averaging ignores how much the runs disagreed with each other.

Donald Rubin's pooling method was originally designed for multiple imputation in missing data problems. It gives you the right way to combine estimates that each have their own uncertainty. The pooled point estimate is the simple mean across runs. The pooled variance has two parts. The first is within-run variance, which is how uncertain each run was on its own. The second is between-run variance, which is how much the runs disagreed with each other. The between-run part is inflated by a small-sample factor of 1 + 1/m for m runs, which gets larger as you have fewer runs.
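The pooling step is only a few lines. This is a minimal sketch, not our production code. It assumes each run's variance is the plain binomial variance of a proportion, p(1 - p)/n, which is one reasonable choice but not the only one. The three run shares are hypothetical numbers chosen for illustration.

```python
from statistics import mean, variance

def rubin_pool(estimates: list[float], variances: list[float]) -> dict[str, float]:
    """Pool per-run estimates and their variances with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two runs, with one variance per estimate")
    pooled = mean(estimates)                      # pooled point estimate
    within = mean(variances)                      # W: average within-run variance
    between = variance(estimates)                 # B: sample variance across runs (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between    # T = W + (1 + 1/m) * B
    return {"estimate": pooled, "within": within, "between": between, "total": total}

# Hypothetical yes shares from three runs of n = 100 personas each.
n = 100
shares = [0.55, 0.60, 0.62]
per_run_var = [p * (1.0 - p) / n for p in shares]  # assumed binomial variance per run
pooled = rubin_pool(shares, per_run_var)
print(pooled)
```

The total variance can never be smaller than the within-run variance. Disagreement between runs only ever adds to it.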

In plain words. Each run has its own confidence interval. Some runs agree with each other. Some runs disagree. If they all agree closely, the pooled interval is only slightly wider than a single run. If they disagree substantially, the pooled interval correctly widens to reflect that disagreement. The math punishes you when runs are inconsistent and rewards you when they agree.

A note on degrees of freedom. With three runs, the effective degrees of freedom for the pooled t distribution end up in the single digits, not approaching infinity like a normal approximation would assume. For three runs at the kind of variances we typically see, df comes out around 6 or 7 and the t critical value is about 2.4 instead of 1.96. We use Barnard and Rubin's small-sample df correction.
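For readers who want to see where those degrees of freedom come from, here is a sketch of the classical Rubin formula and the Barnard and Rubin adjustment. Two things in it are assumptions on our part for illustration: the input variances are round illustrative values, and the complete-data degrees of freedom are taken as n - 1 for a 100-persona run. A different choice of complete-data df shifts the adjusted result.

```python
def rubin_df(within: float, between: float, m: int) -> float:
    """Classical Rubin degrees of freedom for m pooled runs."""
    inflated_between = (1.0 + 1.0 / m) * between
    if inflated_between == 0.0:
        return float("inf")  # runs agree exactly, so there is no between-run uncertainty
    return (m - 1) * (1.0 + within / inflated_between) ** 2

def barnard_rubin_df(within: float, between: float, m: int, complete_df: float) -> float:
    """Barnard and Rubin small-sample adjustment; assumes within > 0."""
    inflated_between = (1.0 + 1.0 / m) * between
    total = within + inflated_between
    lam = inflated_between / total  # share of total variance due to run disagreement
    observed = (complete_df + 1.0) / (complete_df + 3.0) * complete_df * (1.0 - lam)
    classical = rubin_df(within, between, m)
    if classical == float("inf"):
        return observed
    return 1.0 / (1.0 / classical + 1.0 / observed)

# Illustrative values only: variances of the size 100-persona runs tend to produce.
within, between, runs = 0.0024, 0.0022, 3
classical = rubin_df(within, between, runs)
adjusted = barnard_rubin_df(within, between, runs, complete_df=99.0)  # assumed n - 1
print(f"classical df: {classical:.2f}, adjusted df: {adjusted:.2f}")
```

The adjusted value is a harmonic combination of two terms, so it can never exceed the classical one. Fewer degrees of freedom mean a larger t critical value and a wider interval.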

Why this matters for builders. A pooled estimate with a tight interval means the question has a stable answer in the model's persona space. A pooled estimate with a wide interval means the question is ambiguous or polarizing or both, and you should treat the result accordingly.

A worked example

We ran the question “Would you pay $20 a month for a tool that helps you find product market fit faster” three times against the same 100-persona founder panel (Builder tier flow, three runs by default).

Run one. 41 yes, Wilson 95 percent interval 32 to 51.
Run two. 48 yes, Wilson 95 percent interval 38 to 58.
Run three. 39 yes, Wilson 95 percent interval 30 to 49.

Naive averaging gives 42.7 percent yes with a hand-wavy interval of maybe a few points around it. That hand-wavy interval is what most tools report. Rubin's rules give the same pooled point estimate of 42.7 percent yes, and a pooled 95 percent interval of roughly 25 to 60. That is notably wider than any single Wilson interval, by design.

That extra width comes from two places: the between-run variance that naive averaging ignores, and the heavy-tailed t distribution you are stuck with when you only have three runs. Run two was nine points higher than run three. The runs are not interchangeable. The correct response is not to suppress that signal by reporting the simple mean. The correct response is to widen the band and report what the data actually supports.
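The pooled interval for these three runs can be reproduced with the standard library alone. This is a sketch under stated assumptions: per-run variance is taken as the binomial p(1 - p)/n, the degrees of freedom use the classical Rubin formula for simplicity (a small-sample adjustment would change them somewhat), and the t critical value comes from a small hand-rolled numerical routine, since Python's standard library has no t quantile function. The helper names are ours.

```python
from math import gamma, pi, sqrt
from statistics import mean, variance

def t_cdf(x: float, df: float, steps: int = 2000) -> float:
    """Student t CDF for x >= 0, via Simpson's rule on the density from 0 to x."""
    coef = gamma((df + 1.0) / 2.0) / (sqrt(df * pi) * gamma(df / 2.0))

    def pdf(u: float) -> float:
        return coef * (1.0 + u * u / df) ** (-(df + 1.0) / 2.0)

    h = x / steps
    area = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        area += (4.0 if i % 2 else 2.0) * pdf(i * h)
    return 0.5 + area * h / 3.0

def t_critical(df: float, confidence: float = 0.95) -> float:
    """Two-sided critical value, found by bisection on the CDF."""
    target = 1.0 - (1.0 - confidence) / 2.0
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# The three runs from the worked example: yes counts out of 100 personas.
n = 100
shares = [count / n for count in (41, 48, 39)]
m = len(shares)

estimate = mean(shares)
within = mean([p * (1.0 - p) / n for p in shares])   # assumed binomial variance per run
between = variance(shares)                            # sample variance across runs
total = within + (1.0 + 1.0 / m) * between
df = (m - 1) * (1.0 + within / ((1.0 + 1.0 / m) * between)) ** 2  # classical Rubin df

t_crit = t_critical(df)
half_width = t_crit * sqrt(total)
low, high = estimate - half_width, estimate + half_width
print(f"pooled {estimate:.3f}, df {df:.1f}, t {t_crit:.2f}, interval {low:.3f} to {high:.3f}")
```

With these inputs the between-run variance is about the same size as the within-run variance, so the total variance roughly doubles before the t critical value is even applied.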

Now consider what a founder does with this. The single run reports might have driven a decision in either direction. The pooled report says the answer is somewhere between 25 and 60 percent yes, and it is unstable enough that you should not bet on the precise number. That is honest. That is what good research feels like.

The reliability dimension

Wilson and Rubin handle one part of the problem. They tell you how much your aggregate number can be trusted given the runs you have. They do not tell you how much the personas agree with each other within a single run.

This is where ICC enters. The intraclass correlation coefficient measures how much variance in responses is between persona groups versus within them. High ICC on a question means the persona panel partitions cleanly into groups that disagree with each other but agree internally. Low ICC means the responses are noisy regardless of who answered. Both are useful signals. Both feed into a per-topic reliability score that improves as more surveys come through the system.
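As a rough illustration of the idea, here is the simplest textbook version: a one-way ICC computed from between-group and within-group mean squares. Treat it as a sketch only. This post does not specify which ICC variant feeds the reliability score, the version below assumes every persona group has the same size, and the two toy panels are invented to show the extremes.

```python
from statistics import mean

def icc_oneway(groups: list[list[float]]) -> float:
    """One-way ICC from ANOVA mean squares; assumes equal-sized groups."""
    g = len(groups)
    k = len(groups[0]) if groups else 0
    if g < 2 or k < 2 or any(len(grp) != k for grp in groups):
        raise ValueError("need at least two groups of equal size, two or more members each")

    grand = mean([x for grp in groups for x in grp])
    group_means = [mean(grp) for grp in groups]

    # Mean square between groups and mean square within groups.
    ms_between = k * sum((gm - grand) ** 2 for gm in group_means) / (g - 1)
    ms_within = sum(
        (x - gm) ** 2 for grp, gm in zip(groups, group_means) for x in grp
    ) / (g * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    return (ms_between - ms_within) / denom if denom else 0.0

# Toy panels, 1 = yes and 0 = no. Group labels are hypothetical.
clean_split = [[1, 1, 1, 1], [0, 0, 0, 0]]   # groups disagree with each other, agree internally
pure_noise = [[1, 0, 1, 0], [0, 1, 0, 1]]    # group membership explains nothing
print(icc_oneway(clean_split), icc_oneway(pure_noise))
```

Note that this estimator can go negative when groups are more mixed internally than chance would predict. That is usually read as no group structure at all.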

The reliability score is the data flywheel we are building. Each new survey makes the next survey's pooling and weighting more accurate, because the system learns which question types in which domains produce stable signal versus drift. That is the kind of moat an LLM provider cannot replicate without serving real survey traffic at scale, and one we expect to compound as the data accumulates. We'll cover ICC and the reliability flywheel in the next post.

What this changes

Most AI survey tools today report a single number. That number is wrong in the sense that it is a sample treated as a population. The right response is not to apologize for AI persona surveys being uncertain. The right response is to make the uncertainty visible, quantified, and decision-grade.

To be precise about what this post covers. Wilson per run plus Rubin pooling gives you uncertainty quantification. It tells you how confident you should be in your aggregate number given the runs you observed. It does not give you calibration in the strict sense, where you would compare predicted probabilities against external ground truth using a Brier-style scoring loop. That is a separate layer, built on top of this one, and it is what the next two posts cover.

Concretely on this layer. On consensus questions where the runs all land within a few points of each other, pooled half-width should be under 10 points at 95 percent confidence. On polarizing or ambiguous questions, the half-width should visibly widen. That is the falsifiable success criterion we evaluate the framework against.

A note on who builds this. neverboringnow is small. One founder shipping with Claude as a thinking partner, with the calibration math written and tested over the last few weeks. Wilson and Rubin are old, well-tested statistics, which is why we trust them. The moat is not the code. The moat is accumulating real survey data on which question types calibrate well over time, and only operators can earn that.

If you are building anything that uses LLM personas to estimate audience preferences, please at least run twice. Compare the answers. If they disagree by more than a few points, your number is not a number. It is a mood.

Why one run of 100 AI personas isn't enough