Methodology
Version 1.0 · Published March 2026 · Last updated March 7, 2026
This document explains how Rak's synthetic user testing methodology works, how personas are validated, how statistical confidence is calculated, and what the limitations are.
How personas are built
Every persona used in Rak testing is constructed from real human data, not invented. The process:
Step 1: Data Collection
We extract persona characteristics from:
- User interviews — transcripts from customer development, JTBD research, or usability studies
- Support tickets — language patterns, pain points, goals
- G2/Capterra reviews — how real users describe products in their own words
- Sales call transcripts — objections, decision criteria, evaluation process
- Public social media — Reddit, Twitter, LinkedIn posts about product categories
Step 2: Persona Extraction
We use large language models (LLMs) to identify consistent patterns across data sources:
- Role/title clusters (e.g., "VP Product" language patterns)
- Pain point frequency (what problems appear repeatedly?)
- Goals and motivations (what are they trying to achieve?)
- Decision criteria (what matters when evaluating tools?)
- Language style (formal/casual, technical/non-technical)
Output: Structured persona YAML files with background, goals, pain points, objections, language patterns, and decision criteria.
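Exactly what those YAML files contain isn't published; as a rough sketch, a persona file could be loaded and typed along these lines (field names mirror the list above and are illustrative, not Rak's exact schema):

```python
# Illustrative persona schema and loader; field names are assumptions.
# Requires PyYAML (pip install pyyaml).
from dataclasses import dataclass

import yaml


@dataclass
class Persona:
    name: str                     # e.g. "Skeptical VP Product"
    background: str               # role, company stage, team size
    goals: list[str]              # what they are trying to achieve
    pain_points: list[str]        # recurring problems seen in the source data
    objections: list[str]         # reasons they hesitate or say no
    language_patterns: list[str]  # phrases lifted from real transcripts
    decision_criteria: list[str]  # what matters when evaluating tools


def load_persona(path: str) -> Persona:
    """Load a versioned persona YAML file into a structured object."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    return Persona(**raw)
```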
Step 3: Validation
Each persona is validated through:
- Language pattern matching: Does the persona's language match real human transcripts?
- Consistency checks: Are goals, pain points, and objections internally coherent?
- Expert review: Does the persona reflect real buyer behavior? (Validated by a founder with 15+ years of product and research experience)
- Test runs: Does the persona behave realistically when dropped into test scenarios?
Example: VP Product Persona Construction
For a "Skeptical VP Product" persona, we analyzed:
- 50+ VP Product interviews from B2B SaaS companies
- 200+ G2 reviews written by VP Product titles
- 30+ Reddit r/ProductManagement posts about research tools
Patterns identified:
- Pain: slow research cycles
- Skepticism: burned by overpromising tools
- Decision criteria: methodology rigor + pricing transparency
- Language style: direct, evidence-focused
Validation: Persona language patterns matched 87% of real VP Product transcripts (n=20 holdout set).
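The matching procedure behind that 87% figure isn't specified here; one plausible shape for it, using a deliberately crude lexical similarity as a stand-in for whatever measure is actually used:

```python
# Hedged sketch of a language-pattern check against a holdout set of real
# transcripts. Jaccard overlap is a placeholder similarity measure.
def jaccard(a: str, b: str) -> float:
    """Crude lexical similarity between two texts (0.0-1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def match_rate(persona_samples: list[str], holdout_transcripts: list[str],
               threshold: float = 0.3) -> float:
    """Fraction of holdout transcripts that at least one persona sample resembles."""
    matched = sum(
        1 for t in holdout_transcripts
        if any(jaccard(s, t) >= threshold for s in persona_samples)
    )
    return matched / len(holdout_transcripts)
```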
How simulations work
Once personas are constructed and validated, we run simulated user sessions:
Session Setup
Each session includes:
- Persona context: Background, goals, pain points, objections, language patterns
- Test scenario: What task is the user trying to complete? (e.g., "evaluate this landing page in 5 minutes")
- Product snapshot: Text representation of the page/flow being tested
- Success criteria: What defines completion? (e.g., "can explain what the product does")
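A minimal sketch of how those four pieces might be assembled into a simulation prompt (the prompt wording and field names are assumptions, not Rak's exact setup):

```python
# Illustrative session setup; persona dict comes from the YAML sketch above.
from dataclasses import dataclass


@dataclass
class SessionConfig:
    persona: dict          # loaded persona: background, goals, pain points, etc.
    scenario: str          # e.g. "Evaluate this landing page in 5 minutes"
    product_snapshot: str  # text representation of the page or flow being tested
    success_criteria: str  # e.g. "Can explain what the product does"


def build_prompt(cfg: SessionConfig) -> str:
    """Assemble the instruction handed to the simulation model."""
    p = cfg.persona
    return (
        f"You are {p['name']}. Background: {p['background']}\n"
        f"Goals: {'; '.join(p['goals'])}\n"
        f"Pain points: {'; '.join(p['pain_points'])}\n"
        f"Objections so far: {'; '.join(p['objections'])}\n"
        f"Task: {cfg.scenario}\n"
        f"Page content:\n{cfg.product_snapshot}\n"
        f"Success is defined as: {cfg.success_criteria}"
    )
```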
Session Execution
The LLM simulates the persona interacting with the product:
- Scan phase: Initial impression (5-10 seconds)
- Exploration phase: Read sections, evaluate fit
- Decision phase: Would they take the next step? Why or why not?
- Objection capture: What questions remain? What's unclear?
Output: Structured JSON with comprehension score, conversion intent, objections raised, friction points, and verbatim persona responses.
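The exact output schema isn't published; a sketch of the shape described above, with illustrative field names:

```python
# Illustrative per-session output schema; field names are assumptions.
import json
from typing import TypedDict


class SessionResult(TypedDict):
    persona: str                # which persona ran the session
    comprehension_score: float  # 0-1: could they explain what the product does?
    conversion_intent: bool     # would they take the next step?
    objections: list[str]       # open questions and doubts
    friction_points: list[str]  # places where the flow was unclear
    verbatim: list[str]         # quoted persona responses


def parse_session(raw_json: str) -> SessionResult:
    """Parse the model's structured JSON output for one session."""
    return json.loads(raw_json)
```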
Scale
Typical test structure:
- 10-20 personas per test (e.g., VP Product, Startup Founder, Agency Lead)
- 2-4 runs per persona (to measure consistency)
- Total: 30-50 sessions per test
This sample size keeps 95% confidence intervals usefully narrow and supports persona-level segmentation analysis.
Model Selection
We use Claude Sonnet 4.5 (Anthropic) for simulation runs. Selected because:
- Strong instruction-following (stays in persona)
- Nuanced reasoning (can balance competing priorities)
- Consistent outputs (reproducible results)
- Long context window (can process full landing pages)
95% Confidence Intervals
We calculate 95% confidence intervals using standard statistical methods for proportions:
Example: If the persona understands what the product does in 30 out of 40 sessions (75%):
We report: "75% comprehension rate (95% CI: 62%-89%, n=40)"
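The specific interval method (Wald, Wilson, etc.) isn't stated, so treat the following as one plausible implementation; the normal-approximation (Wald) interval gives numbers in this range:

```python
import math


def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


low, high = proportion_ci(30, 40)
print(f"75% comprehension rate (95% CI: {low:.0%}-{high:.0%}, n=40)")
# -> roughly 62%-88% with the Wald interval; small differences from the figures
#    above come down to the exact interval method and rounding.
```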
Persona-Level Segmentation
We analyze results at the persona level to identify differential responses:
- Per-persona conversion rates with 95% CI
- Comparison across personas (e.g., VP Product 75% vs. Startup Founder 50%)
- Objection frequency by persona type
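A compact sketch of that segmentation step, reusing the same Wald-style interval (field names are illustrative):

```python
# Sketch: group session results by persona and compute a 95% CI per group.
import math
from collections import defaultdict


def segment_by_persona(results: list[dict]) -> dict[str, tuple[float, float, float]]:
    """results: [{"persona": "VP Product", "converted": True}, ...]
    Returns {persona: (rate, ci_low, ci_high)}."""
    groups: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        groups[r["persona"]].append(r["converted"])
    out = {}
    for persona, outcomes in groups.items():
        n = len(outcomes)
        p = sum(outcomes) / n
        half = 1.96 * math.sqrt(p * (1 - p) / n)
        out[persona] = (p, max(0.0, p - half), min(1.0, p + half))
    return out
```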
Statistical Significance Testing
When comparing two variants (A/B testing), we use:
- Two-proportion z-test for conversion rate differences
- Chi-square test for categorical outcomes
- Effect size calculation (Cohen's d) for practical significance
We only claim a difference is real if p < 0.05 AND the effect size is meaningful (>10 percentage points).
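As a sketch of the first and last of those checks, here is a pooled two-proportion z-test plus the raw percentage-point difference used as the practical-significance bar (the chi-square and effect-size steps are omitted, and the exact tooling Rak uses isn't specified):

```python
# Sketch of the A/B comparison: two-proportion z-test and the raw
# percentage-point difference; both bars must clear before claiming a win.
import math


def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) using the pooled-proportion z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value


# Example: variant A converts 30/40 sessions, variant B converts 20/40.
z, p = two_proportion_ztest(30, 40, 20, 40)
diff_pp = (30 / 40 - 20 / 40) * 100
significant = p < 0.05 and diff_pp > 10
```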
Cross-Model Review
To reduce bias and increase rigor, every test includes adversarial validation:
Process
- Run simulations with Claude Sonnet 4.5
- Submit findings to a different model (e.g., OpenAI GPT-4)
- Challenge methodology: Does the test design have bias? Are personas realistic? Are conclusions supported?
- Flag issues: Identify threats to validity, alternative explanations, overgeneralizations
- Revise if needed: If adversarial review identifies issues, re-run or adjust conclusions
Example adversarial challenge: "The 'VP Product' persona shows 100% comprehension, but only 75% conversion. This suggests the test scenario may be too easy. Consider adding time pressure or competing priorities."
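A sketch of how the review step could be wired up; the model identifier, prompt wording, and client setup are placeholders, not Rak's production pipeline:

```python
# Sketch of the adversarial-review step: findings from the Claude runs are
# handed to a second model to look for bias and overgeneralization.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REVIEW_PROMPT = (
    "You are an adversarial reviewer. Given this test design and these findings, "
    "identify threats to validity, alternative explanations, unrealistic persona "
    "behavior, and any conclusions not supported by the data."
)


def adversarial_review(test_design: str, findings: str, model: str = "gpt-4") -> str:
    """Ask a second model to challenge the test design and conclusions."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user",
             "content": f"Test design:\n{test_design}\n\nFindings:\n{findings}"},
        ],
    )
    return response.choices[0].message.content
```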
Bias Controls
We actively control for:
- Confirmation bias: Personas are not designed to validate a hypothesis—they're designed to behave realistically
- Sampling bias: Multiple runs per persona ensure consistency
- Order effects: Sections are presented in order, but we track where users stop reading
- Demand characteristics: Personas are not told what to find—they're told to evaluate naturally
What synthetic user testing is good for
- Landing page optimization (messaging, CTAs, layout)
- UX pattern validation (navigation, friction points)
- A/B/C/D/E variant testing (parallel comparison)
- Persona-level segmentation (how do different buyers respond?)
- Statistical validation (95% CI for claims)
What synthetic user testing is NOT good for
- Discovery research (unknown unknowns)
- Ethnographic studies (in-context observation)
- Emotional responses (facial expressions, tone)
- Serendipitous insights (unexpected behaviors)
- Stakeholder buy-in (no video clips of real users to share)
- New product categories (where personas don't yet exist)
Threats to Validity
Internal validity concerns:
- Personas may not perfectly represent real users
- LLMs can hallucinate or be inconsistent
- Test scenarios may not match real-world conditions
External validity concerns:
- Synthetic results may not generalize to real users
- Personas are based on past behavior, not future behavior
- Missing contextual factors (device, environment, mood)
Construct validity concerns:
- "Comprehension" measured by self-report, not actual understanding
- "Conversion intent" is stated preference, not revealed preference
Mitigations
- Validate personas against real human data
- Run multiple sessions per persona (consistency check)
- Adversarial review challenges findings
- Honest reporting of limitations in every report
- Recommend traditional research when appropriate
Comparison to Traditional Research
Synthetic user testing vs. traditional:
- Speed: 2-3 days vs. 3 weeks to 6 months (up to 100x faster)
- Sample size: 30-50 runs vs. 10-15 participants (3x larger)
- Cost: $6K-15K vs. $40K/year or $300K+ agency (cheaper)
- Statistical rigor: 95% CI vs. qualitative themes (more rigorous)
- Emotional depth: Low vs. High (less rich)
- Serendipity: Low vs. High (less discovery)
Bottom line: Synthetic is faster and more scalable. Traditional is deeper and more exploratory. Use both.
Can results be reproduced?
Yes, with caveats:
Reproducible Components
- Persona definitions are stored as YAML files (versioned)
- Test scenarios are documented in plain text
- Statistical calculations are deterministic (same data → same CI)
- Adversarial validation can be re-run
Non-Reproducible Components
- LLM outputs vary slightly run-to-run (temperature > 0)
- Model updates (the Claude Sonnet 4.5 available in March 2026 may differ from the version available in June 2026)
- Persona interpretation may shift over time
Reproducibility standard: Re-running the same test with the same personas should produce similar findings (within 10 percentage points) 90% of the time.
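A small sketch of how that standard could be checked mechanically (metric names are illustrative):

```python
# Sketch of the reproducibility check: compare the headline metrics of a re-run
# against the original and flag anything that drifted more than 10 points.
def reproducibility_check(original: dict[str, float], rerun: dict[str, float],
                          tolerance_pp: float = 10.0) -> dict[str, bool]:
    """Metrics are percentages, e.g. {"comprehension": 75.0, "conversion": 55.0}."""
    return {
        metric: abs(original[metric] - rerun.get(metric, float("nan"))) <= tolerance_pp
        for metric in original
    }


# e.g. reproducibility_check({"comprehension": 75.0}, {"comprehension": 82.0})
# -> {"comprehension": True}  (a 7-point drift is within the 10-point standard)
```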
Methodology Evolution
This methodology will evolve as we learn. Changes will be documented here:
Version 1.0 (March 2026)
- Initial methodology published
- Persona construction from real human data
- Claude Sonnet 4.5 for simulations
- 95% CI calculation for all metrics
- Adversarial validation with GPT-4
Future improvements: We're exploring persona validation through real user A/B tests, multi-model ensemble simulations, and automated bias detection.
Foundational Research
This methodology builds on established practices in:
- User research: Steve Krug (Don't Make Me Think), Jakob Nielsen (usability testing)
- Persona development: Alan Cooper (The Inmates Are Running the Asylum)
- Jobs-to-be-Done: Tony Ulwick (Outcome-Driven Innovation), Bob Moesta
- Statistical inference: Standard confidence interval calculations for proportions
- Adversarial validation: Machine learning best practices (cross-validation, holdout sets)
Contact
Questions about methodology? research@rak.lab