00 / OVERVIEW

Methodology

Version 1.0 · Published March 2026 · Last updated March 7, 2026

This document explains how Rak's synthetic user testing methodology works, how personas are validated, how statistical confidence is calculated, and what the limitations are.

→ No black boxes. This methodology is transparent, auditable, and grounded in established research practices.

01 / PERSONA CONSTRUCTION

How personas are built

Every persona used in Rak testing is constructed from real human data, not invented. The process:

Step 1: Data Collection

We extract persona characteristics from:

  • User interviews — transcripts from customer development, JTBD research, or usability studies
  • Support tickets — language patterns, pain points, goals
  • G2/Capterra reviews — how real users describe products in their own words
  • Sales call transcripts — objections, decision criteria, evaluation process
  • Public social media — Reddit, Twitter, LinkedIn posts about product categories

Step 2: Persona Extraction

We use large language models (LLMs) to identify consistent patterns across data sources:

  • Role/title clusters (e.g., "VP Product" language patterns)
  • Pain point frequency (what problems appear repeatedly?)
  • Goals and motivations (what are they trying to achieve?)
  • Decision criteria (what matters when evaluating tools?)
  • Language style (formal/casual, technical/non-technical)

Output: Structured persona YAML files with background, goals, pain points, objections, language patterns, and decision criteria.
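
For illustration only, the output of this step can be pictured as a record like the one below. This is a minimal sketch in Python using PyYAML; the field names and values are examples drawn from the VP Product persona described later, not Rak's actual schema.

  import yaml  # PyYAML; assumes `pip install pyyaml`

  # Illustrative persona record. Field names mirror the output described above,
  # but this is a sketch, not Rak's actual schema.
  persona = {
      "id": "vp-product-skeptical",
      "background": "VP Product at a B2B SaaS company",
      "goals": ["Shorten research cycles", "De-risk roadmap decisions"],
      "pain_points": ["Research takes weeks", "Burned by tools that overpromised"],
      "objections": ["How do I know synthetic users reflect my real buyers?"],
      "language_patterns": {"tone": "direct", "style": "evidence-focused"},
      "decision_criteria": ["Methodology rigor", "Pricing transparency"],
  }

  with open("vp_product_skeptical.yaml", "w") as f:
      yaml.safe_dump(persona, f, sort_keys=False)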

Step 3: Validation

Each persona is validated through:

  1. Language pattern matching: Does the persona's language match real human transcripts?
  2. Consistency checks: Are goals, pain points, and objections internally coherent?
  3. Expert review: Does the persona reflect real buyer behavior? (Validated by a founder with 15+ years of product/research experience)
  4. Test runs: Does the persona behave realistically when dropped into test scenarios?

Limitation: Personas are representations of real users, not real users themselves. They capture aggregate patterns but miss individual idiosyncrasies and emotional nuance.

Example: VP Product Persona Construction

For a "Skeptical VP Product" persona, we analyzed:

  • 50+ VP Product interviews from B2B SaaS companies
  • 200+ G2 reviews written by VP Product titles
  • 30+ Reddit r/ProductManagement posts about research tools

Patterns identified:

  • Pain: slow research cycles
  • Skepticism: burned by overpromising tools
  • Decision criteria: methodology rigor + pricing transparency
  • Language style: direct, evidence-focused

Validation: Persona language patterns matched 87% of real VP Product transcripts (n=20 holdout set).
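
The methodology does not prescribe a particular matching algorithm for this figure. One plausible way to compute such a rate is sketched below, using sentence embeddings and a cosine-similarity threshold; the library, model name, and threshold are assumptions, not part of the published methodology.

  from sentence_transformers import SentenceTransformer, util  # assumption: embedding-based matching

  model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

  def match_rate(persona_lines: list[str], holdout_transcripts: list[str],
                 threshold: float = 0.6) -> float:
      """Fraction of holdout transcripts whose best-matching persona line clears the threshold."""
      p_emb = model.encode(persona_lines, convert_to_tensor=True)
      h_emb = model.encode(holdout_transcripts, convert_to_tensor=True)
      sims = util.cos_sim(h_emb, p_emb)        # one row of similarities per holdout transcript
      best = sims.max(dim=1).values            # best persona match for each transcript
      return float((best >= threshold).float().mean())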

02 / SIMULATION PROCESS

How simulations work

Once personas are constructed and validated, we run simulated user sessions:

Session Setup

Each session includes:

  • Persona context: Background, goals, pain points, objections, language patterns
  • Test scenario: What task is the user trying to complete? (e.g., "evaluate this landing page in 5 minutes")
  • Product snapshot: Text representation of the page/flow being tested
  • Success criteria: What defines completion? (e.g., "can explain what the product does")

Session Execution

The LLM simulates the persona interacting with the product:

  1. Scan phase: Initial impression (5-10 seconds)
  2. Exploration phase: Read sections, evaluate fit
  3. Decision phase: Would they take next step? Why/why not?
  4. Objection capture: What questions remain? What's unclear?

Output: Structured JSON with comprehension score, conversion intent, objections raised, friction points, and verbatim persona responses.
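
For readers who want to see the mechanics, the sketch below shows what a single session could look like in code, assuming the official Anthropic Python SDK. The prompt structure, JSON shape, and model identifier are illustrative stand-ins, not Rak's production harness.

  import json

  import anthropic  # official Anthropic Python SDK; assumes ANTHROPIC_API_KEY is set

  client = anthropic.Anthropic()

  def run_session(persona: dict, scenario: str, page_text: str) -> dict:
      """Simulate one persona session and return its structured result."""
      system = (
          "You are role-playing the user described below. Stay in character throughout.\n"
          f"Persona: {json.dumps(persona)}\n"
          "Work through four phases: scan (first impression), explore, decide, "
          "then list any remaining objections."
      )
      user = (
          f"Scenario: {scenario}\n\nPage:\n{page_text}\n\n"
          "Respond with JSON only, shaped like: "
          '{"comprehension": 0.0, "conversion_intent": 0.0, '
          '"objections": [], "friction_points": [], "verbatim": ""}'
      )
      message = client.messages.create(
          model="claude-sonnet-4-5",  # illustrative model identifier
          max_tokens=1024,
          system=system,
          messages=[{"role": "user", "content": user}],
      )
      return json.loads(message.content[0].text)  # a production harness would validate this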

Scale

Typical test structure:

  • 10-20 personas per test (e.g., VP Product, Startup Founder, Agency Lead)
  • 2-4 runs per persona (to measure consistency)
  • Total: 30-50 sessions per test

This sample size is large enough to report 95% confidence intervals of reasonable width and to segment results by persona.

Model Selection

We use Claude Sonnet 4.5 (Anthropic) for simulation runs. Selected because:

  • Strong instruction-following (stays in persona)
  • Nuanced reasoning (can balance competing priorities)
  • Consistent outputs (reproducible results)
  • Long context window (can process full landing pages)

03 / STATISTICAL VALIDATION

95% Confidence Intervals

We calculate 95% confidence intervals using standard statistical methods for proportions:

CI = p ± 1.96 × √(p(1-p)/n)

Where:

  • p = observed proportion (e.g., 0.75 = 75% conversion rate)
  • n = sample size (e.g., 40 runs)
  • 1.96 = z-score for the 95% confidence level

Example: If 30 out of 40 personas (75%) understand what the product does:

CI = 0.75 ± 1.96 × √(0.75 × 0.25 / 40)
CI = 0.75 ± 0.134
CI = [61.6%, 88.4%]

We report: "75% comprehension rate (95% CI: 62%-88%, n=40)"
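
A minimal Python sketch of the same calculation, matching the formula above:

  from math import sqrt

  def ci_95(successes: int, n: int) -> tuple[float, float]:
      """95% confidence interval for a proportion (normal approximation)."""
      p = successes / n
      margin = 1.96 * sqrt(p * (1 - p) / n)
      return p - margin, p + margin

  print(ci_95(30, 40))  # approximately (0.616, 0.884), the interval reported above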

Persona-Level Segmentation

We analyze results at the persona level to identify differential responses:

  • Per-persona conversion rates with 95% CI
  • Comparison across personas (e.g., VP Product 75% vs. Startup Founder 50%)
  • Objection frequency by persona type
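
Segmentation is the same interval computed per persona group. A small sketch, reusing ci_95 from above (the session record shape is hypothetical):

  from collections import defaultdict

  def per_persona_report(sessions: list[dict]) -> dict:
      """Group sessions by persona and report each group's rate, 95% CI, and n."""
      by_persona = defaultdict(list)
      for s in sessions:                      # e.g., {"persona": "vp_product", "converted": True}
          by_persona[s["persona"]].append(s["converted"])
      return {
          persona: {
              "rate": sum(outcomes) / len(outcomes),
              "ci_95": ci_95(sum(outcomes), len(outcomes)),
              "n": len(outcomes),
          }
          for persona, outcomes in by_persona.items()
      }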

Statistical Significance Testing

When comparing two variants (A/B testing), we use:

  • Two-proportion z-test for conversion rate differences
  • Chi-square test for categorical outcomes
  • Effect size calculation (e.g., Cohen's h for proportion differences) for practical significance

We only claim a difference is real if p < 0.05 AND the effect size is meaningful (>10 percentage points).
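
A minimal sketch of this decision rule in Python, assuming a pooled two-proportion z-test; the helper names are illustrative, not Rak's analysis code.

  from math import erf, sqrt

  def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
      """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
      p1, p2 = x1 / n1, x2 / n2
      pooled = (x1 + x2) / (n1 + n2)
      se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
      z = (p1 - p2) / se
      p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail probability
      return z, p_value

  def difference_is_real(x1, n1, x2, n2, alpha=0.05, min_gap=0.10) -> bool:
      """Apply the rule above: p < 0.05 AND a gap of more than 10 percentage points."""
      _, p = two_proportion_z(x1, n1, x2, n2)
      return p < alpha and abs(x1 / n1 - x2 / n2) > min_gap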

Limitation: Statistical significance does not equal real-world validity. Synthetic personas may show statistically significant differences that don't appear with real users.

04 / ADVERSARIAL VALIDATION

Cross-Model Review

To reduce bias and increase rigor, every test includes adversarial validation:

Process

  1. Run simulations with Claude Sonnet 4.5
  2. Submit findings to a different model (e.g., OpenAI GPT-4)
  3. Challenge methodology: Does the test design have bias? Are personas realistic? Are conclusions supported?
  4. Flag issues: Identify threats to validity, alternative explanations, overgeneralizations
  5. Revise if needed: If adversarial review identifies issues, re-run or adjust conclusions

Example adversarial challenge: "The 'VP Product' persona shows 100% comprehension, but only 75% conversion. This suggests the test scenario may be too easy. Consider adding time pressure or competing priorities."
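
As a sketch of how the cross-model review step can be automated, assuming the official OpenAI Python SDK (prompt wording and model name are illustrative):

  from openai import OpenAI  # official OpenAI Python SDK; assumes OPENAI_API_KEY is set

  client = OpenAI()

  def adversarial_review(findings: str) -> str:
      """Ask a second model to challenge the test design, personas, and conclusions."""
      prompt = (
          "You are reviewing a synthetic user test. Challenge it:\n"
          "1) Does the test design have bias? 2) Are the personas realistic?\n"
          "3) Are the conclusions supported? Flag threats to validity, "
          "alternative explanations, and overgeneralizations.\n\n"
          f"Findings:\n{findings}"
      )
      response = client.chat.completions.create(
          model="gpt-4",  # illustrative; any capable second model works
          messages=[{"role": "user", "content": prompt}],
      )
      return response.choices[0].message.content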

Bias Controls

We actively control for:

  • Confirmation bias: Personas are not designed to validate a hypothesis—they're designed to behave realistically
  • Sampling bias: Multiple runs per persona ensure consistency
  • Order effects: Sections are presented in order, but we track where users stop reading
  • Demand characteristics: Personas are not told what to find—they're told to evaluate naturally

Limitation: LLMs can exhibit biases present in training data. Adversarial validation reduces this but doesn't eliminate it.

05 / VALIDITY & LIMITATIONS

What synthetic user testing is good for

Best use cases:
  • Landing page optimization (messaging, CTAs, layout)
  • UX pattern validation (navigation, friction points)
  • A/B/C/D/E variant testing (parallel comparison)
  • Persona-level segmentation (how do different buyers respond?)
  • Statistical validation (95% CI for claims)

What synthetic user testing is NOT good for

Use traditional research for:
  • Discovery research (unknown unknowns)
  • Ethnographic studies (in-context observation)
  • Emotional responses (facial expressions, tone)
  • Serendipitous insights (unexpected behaviors)
  • Stakeholder buy-in (video clips from real users)
  • New product categories (where personas don't yet exist)

Threats to Validity

Internal validity concerns:

  • Personas may not perfectly represent real users
  • LLMs can hallucinate or be inconsistent
  • Test scenarios may not match real-world conditions

External validity concerns:

  • Synthetic results may not generalize to real users
  • Personas are based on past behavior, not future behavior
  • Missing contextual factors (device, environment, mood)

Construct validity concerns:

  • "Comprehension" measured by self-report, not actual understanding
  • "Conversion intent" is stated preference, not revealed preference

Mitigations

  • Validate personas against real human data
  • Run multiple sessions per persona (consistency check)
  • Adversarial review challenges findings
  • Honest reporting of limitations in every report
  • Recommend traditional research when appropriate

Comparison to Traditional Research

Synthetic user testing vs. traditional:

  • Speed: 2-3 days vs. 3 weeks to 6 months (roughly 10-90x faster)
  • Sample size: 30-50 runs vs. 10-15 participants (3x larger)
  • Cost: $6K-15K vs. $40K/year or $300K+ agency (cheaper)
  • Statistical rigor: 95% CI vs. qualitative themes (more rigorous)
  • Emotional depth: Low vs. High (less rich)
  • Serendipity: Low vs. High (less discovery)

Bottom line: Synthetic is faster and more scalable. Traditional is deeper and more exploratory. Use both.

06 / REPRODUCIBILITY

Can results be reproduced?

Yes, with caveats:

Reproducible Components

  • Persona definitions are stored as YAML files (versioned)
  • Test scenarios are documented in plain text
  • Statistical calculations are deterministic (same data → same CI)
  • Adversarial validation can be re-run

Non-Reproducible Components

  • LLM outputs vary slightly run-to-run (temperature > 0)
  • Model updates (Claude Sonnet 4.5 in March 2026 ≠ June 2026)
  • Persona interpretation may shift over time

Reproducibility standard: Re-running the same test with the same personas should produce similar findings (within 10 percentage points) 90% of the time.
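
A sketch of how that standard can be checked mechanically between two reruns (metric names are illustrative):

  def within_standard(run_a: dict, run_b: dict, tolerance: float = 0.10) -> bool:
      """True if every shared metric from two reruns differs by at most 10 points."""
      shared = run_a.keys() & run_b.keys()
      return all(abs(run_a[m] - run_b[m]) <= tolerance for m in shared)

  within_standard({"comprehension": 0.75, "conversion": 0.40},
                  {"comprehension": 0.70, "conversion": 0.45})  # -> True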

Transparency commitment: We include persona YAML files, test scenarios, and raw results in every report. You can audit our work.

07 / UPDATES & VERSIONING

Methodology Evolution

This methodology will evolve as we learn. Changes will be documented here:

Version 1.0 (March 2026)

  • Initial methodology published
  • Persona construction from real human data
  • Claude Sonnet 4.5 for simulations
  • 95% CI calculation for all metrics
  • Adversarial validation with GPT-4

Future improvements: We're exploring persona validation through real user A/B tests, multi-model ensemble simulations, and automated bias detection.

08 / REFERENCES

Foundational Research

This methodology builds on established practices in:

  • User research: Steve Krug (Don't Make Me Think), Jakob Nielsen (usability testing)
  • Persona development: Alan Cooper (The Inmates Are Running the Asylum)
  • Jobs-to-be-Done: Tony Ulwick (Outcome-Driven Innovation), Bob Moesta
  • Statistical inference: Standard confidence interval calculations for proportions
  • Adversarial validation: Machine learning best practices (cross-validation, holdout sets)

Contact

Questions about methodology? research@rak.lab