Methodology
Version 1.0 · Published March 2026 · Last updated March 7, 2026
This document explains how Rak's synthetic user testing methodology works, how personas are validated, how statistical confidence is calculated, and what the limitations are.
How personas are built
Every persona used in Rak testing is constructed from real human data, not invented. The process:
Step 1: Data Collection
We extract persona characteristics from:
- User interviews — transcripts from customer development, JTBD research, or usability studies
- Support tickets — language patterns, pain points, goals
- G2/Capterra reviews — how real users describe products in their own words
- Sales call transcripts — objections, decision criteria, evaluation process
- Public social media — Reddit, Twitter, LinkedIn posts about product categories
Step 2: Persona Extraction
We use large language models (LLMs) to identify consistent patterns across data sources:
- Role/title clusters (e.g., "VP Product" language patterns)
- Pain point frequency (what problems appear repeatedly?)
- Goals and motivations (what are they trying to achieve?)
- Decision criteria (what matters when evaluating tools?)
- Language style (formal/casual, technical/non-technical)
Output: Structured persona YAML files with background, goals, pain points, objections, language patterns, and decision criteria.
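Exactly what those YAML files contain isn't published; as a rough sketch, a persona file could be loaded and typed along these lines (field names mirror the list above and are illustrative, not Rak's exact schema):

```python
# Illustrative persona schema and loader; field names are assumptions.
# Requires PyYAML (pip install pyyaml).
from dataclasses import dataclass

import yaml


@dataclass
class Persona:
    name: str                     # e.g. "Skeptical VP Product"
    background: str               # role, company stage, team size
    goals: list[str]              # what they are trying to achieve
    pain_points: list[str]        # recurring problems seen in the source data
    objections: list[str]         # reasons they hesitate or say no
    language_patterns: list[str]  # phrases lifted from real transcripts
    decision_criteria: list[str]  # what matters when evaluating tools


def load_persona(path: str) -> Persona:
    """Load a versioned persona YAML file into a structured object."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    return Persona(**raw)
```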
Step 3: Validation
Each persona is validated through:
- Language pattern matching: Does the persona's language match real human transcripts?
- Consistency checks: Are goals, pain points, and objections internally coherent?
- Expert review: Does the persona reflect real buyer behavior? (Validated by a founder with 15+ years of product and research experience)
- Test runs: Does the persona behave realistically when dropped into test scenarios?
Example: VP Product Persona Construction
For a "Skeptical VP Product" persona, we analyzed:
- 50+ VP Product interviews from B2B SaaS companies
- 200+ G2 reviews written by VP Product titles
- 30+ Reddit r/ProductManagement posts about research tools
Patterns identified:
- Pain: slow research cycles
- Skepticism: burned by overpromising tools
- Decision criteria: methodology rigor + pricing transparency
- Language style: direct, evidence-focused
Validation: Persona language patterns matched 87% of real VP Product transcripts (n=20 holdout set).
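The matching procedure behind that 87% figure isn't specified here; one plausible shape for it, using a deliberately crude lexical similarity as a stand-in for whatever measure is actually used:

```python
# Hedged sketch of a language-pattern check against a holdout set of real
# transcripts. Jaccard overlap is a placeholder similarity measure.
def jaccard(a: str, b: str) -> float:
    """Crude lexical similarity between two texts (0.0-1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def match_rate(persona_samples: list[str], holdout_transcripts: list[str],
               threshold: float = 0.3) -> float:
    """Fraction of holdout transcripts that at least one persona sample resembles."""
    matched = sum(
        1 for t in holdout_transcripts
        if any(jaccard(s, t) >= threshold for s in persona_samples)
    )
    return matched / len(holdout_transcripts)
```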
How simulations work
Once personas are constructed and validated, we run simulated user sessions:
Session Setup
Each session includes:
- Persona context: Background, goals, pain points, objections, language patterns
- Test scenario: What task is the user trying to complete? (e.g., "evaluate this landing page in 5 minutes")
- Product snapshot: Text representation of the page/flow being tested
- Success criteria: What defines completion? (e.g., "can explain what the product does")
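A minimal sketch of how those four pieces might be assembled into a simulation prompt (the prompt wording and field names are assumptions, not Rak's exact setup):

```python
# Illustrative session setup; persona dict comes from the YAML sketch above.
from dataclasses import dataclass


@dataclass
class SessionConfig:
    persona: dict          # loaded persona: background, goals, pain points, etc.
    scenario: str          # e.g. "Evaluate this landing page in 5 minutes"
    product_snapshot: str  # text representation of the page or flow being tested
    success_criteria: str  # e.g. "Can explain what the product does"


def build_prompt(cfg: SessionConfig) -> str:
    """Assemble the instruction handed to the simulation model."""
    p = cfg.persona
    return (
        f"You are {p['name']}. Background: {p['background']}\n"
        f"Goals: {'; '.join(p['goals'])}\n"
        f"Pain points: {'; '.join(p['pain_points'])}\n"
        f"Objections so far: {'; '.join(p['objections'])}\n"
        f"Task: {cfg.scenario}\n"
        f"Page content:\n{cfg.product_snapshot}\n"
        f"Success is defined as: {cfg.success_criteria}"
    )
```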
Session Execution
The LLM simulates the persona interacting with the product:
- Scan phase: Initial impression (5-10 seconds)
- Exploration phase: Read sections, evaluate fit
- Decision phase: Would they take the next step? Why or why not?
- Objection capture: What questions remain? What's unclear?
Output: Structured JSON with comprehension score, conversion intent, objections raised, friction points, and verbatim persona responses.
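The exact output schema isn't published; a sketch of the shape described above, with illustrative field names:

```python
# Illustrative per-session output schema; field names are assumptions.
import json
from typing import TypedDict


class SessionResult(TypedDict):
    persona: str                # which persona ran the session
    comprehension_score: float  # 0-1: could they explain what the product does?
    conversion_intent: bool     # would they take the next step?
    objections: list[str]       # open questions and doubts
    friction_points: list[str]  # places where the flow was unclear
    verbatim: list[str]         # quoted persona responses


def parse_session(raw_json: str) -> SessionResult:
    """Parse the model's structured JSON output for one session."""
    return json.loads(raw_json)
```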
Scale
Typical test structure:
- 10-20 personas per test (e.g., VP Product, Startup Founder, Agency Lead)
- 2-4 runs per persona (to measure consistency)
- Total: 30-50 sessions per test
This sample size keeps 95% confidence intervals usefully narrow and supports persona-level segmentation analysis.
Model Selection
We use Claude Sonnet 4.5 (Anthropic) for simulation runs. Selected because:
- Strong instruction-following (stays in persona)
- Nuanced reasoning (can balance competing priorities)
- Consistent outputs (reproducible results)
- Long context window (can process full landing pages)
95% Confidence Intervals
We calculate 95% confidence intervals using standard statistical methods for proportions:
Example: If the persona understands what the product does in 30 out of 40 sessions (75%):
We report: "75% comprehension rate (95% CI: 62%-89%, n=40)"
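The specific interval method (Wald, Wilson, etc.) isn't stated, so treat the following as one plausible implementation; the normal-approximation (Wald) interval gives numbers in this range:

```python
import math


def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


low, high = proportion_ci(30, 40)
print(f"75% comprehension rate (95% CI: {low:.0%}-{high:.0%}, n=40)")
# -> roughly 62%-88% with the Wald interval; small differences from the figures
#    above come down to the exact interval method and rounding.
```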
Persona-Level Segmentation
We analyze results at the persona level to identify differential responses:
- Per-persona conversion rates with 95% CI
- Comparison across personas (e.g., VP Product 75% vs. Startup Founder 50%)
- Objection frequency by persona type
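A compact sketch of that segmentation step, reusing the same Wald-style interval (field names are illustrative):

```python
# Sketch: group session results by persona and compute a 95% CI per group.
import math
from collections import defaultdict


def segment_by_persona(results: list[dict]) -> dict[str, tuple[float, float, float]]:
    """results: [{"persona": "VP Product", "converted": True}, ...]
    Returns {persona: (rate, ci_low, ci_high)}."""
    groups: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        groups[r["persona"]].append(r["converted"])
    out = {}
    for persona, outcomes in groups.items():
        n = len(outcomes)
        p = sum(outcomes) / n
        half = 1.96 * math.sqrt(p * (1 - p) / n)
        out[persona] = (p, max(0.0, p - half), min(1.0, p + half))
    return out
```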
Statistical Significance Testing
When comparing two variants (A/B testing), we use:
- Two-proportion z-test for conversion rate differences
- Chi-square test for categorical outcomes
- Effect size calculation (Cohen's d) for practical significance
We only claim a difference is real if p < 0.05 AND the effect size is meaningful (>10 percentage points).
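As a sketch of the first and last of those checks, here is a pooled two-proportion z-test plus the raw percentage-point difference used as the practical-significance bar (the chi-square and effect-size steps are omitted, and the exact tooling Rak uses isn't specified):

```python
# Sketch of the A/B comparison: two-proportion z-test and the raw
# percentage-point difference; both bars must clear before claiming a win.
import math


def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) using the pooled-proportion z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value


# Example: variant A converts 30/40 sessions, variant B converts 20/40.
z, p = two_proportion_ztest(30, 40, 20, 40)
diff_pp = (30 / 40 - 20 / 40) * 100
significant = p < 0.05 and diff_pp > 10
```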
Cross-Model Review
To reduce bias and increase rigor, every test includes adversarial validation:
Process
- Run simulations with Claude Sonnet 4.5
- Submit findings to a different model (e.g., OpenAI GPT-4)
- Challenge methodology: Does the test design have bias? Are personas realistic? Are conclusions supported?
- Flag issues: Identify threats to validity, alternative explanations, overgeneralizations
- Revise if needed: If adversarial review identifies issues, re-run or adjust conclusions
Example adversarial challenge: "The 'VP Product' persona shows 100% comprehension, but only 75% conversion. This suggests the test scenario may be too easy. Consider adding time pressure or competing priorities."
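A sketch of how the review step could be wired up; the model identifier, prompt wording, and client setup are placeholders, not Rak's production pipeline:

```python
# Sketch of the adversarial-review step: findings from the Claude runs are
# handed to a second model to look for bias and overgeneralization.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REVIEW_PROMPT = (
    "You are an adversarial reviewer. Given this test design and these findings, "
    "identify threats to validity, alternative explanations, unrealistic persona "
    "behavior, and any conclusions not supported by the data."
)


def adversarial_review(test_design: str, findings: str, model: str = "gpt-4") -> str:
    """Ask a second model to challenge the test design and conclusions."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user",
             "content": f"Test design:\n{test_design}\n\nFindings:\n{findings}"},
        ],
    )
    return response.choices[0].message.content
```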
Bias Controls
We actively control for:
- Confirmation bias: Personas are not designed to validate a hypothesis—they're designed to behave realistically
- Sampling bias: Multiple runs per persona ensure consistency
- Order effects: Sections are presented in order, but we track where users stop reading
- Demand characteristics: Personas are not told what to find—they're told to evaluate naturally
What synthetic user testing is good for
- Landing page optimization (messaging, CTAs, layout)
- UX pattern validation (navigation, friction points)
- A/B/C/D/E variant testing (parallel comparison)
- Persona-level segmentation (how do different buyers respond?)
- Statistical validation (95% CI for claims)
What synthetic user testing is NOT good for
- Discovery research (unknown unknowns)
- Ethnographic studies (in-context observation)
- Emotional responses (facial expressions, tone)
- Serendipitous insights (unexpected behaviors)
- Stakeholder buy-in (no video clips of real users to share)
- New product categories (where personas don't yet exist)
Threats to Validity
Internal validity concerns:
- Personas may not perfectly represent real users
- LLMs can hallucinate or be inconsistent
- Test scenarios may not match real-world conditions
External validity concerns:
- Synthetic results may not generalize to real users
- Personas are based on past behavior, not future behavior
- Missing contextual factors (device, environment, mood)
Construct validity concerns:
- "Comprehension" measured by self-report, not actual understanding
- "Conversion intent" is stated preference, not revealed preference
Mitigations
- Validate personas against real human data
- Run multiple sessions per persona (consistency check)
- Adversarial review challenges findings
- Honest reporting of limitations in every report
- Recommend traditional research when appropriate
Comparison to Traditional Research
Synthetic user testing vs. traditional:
- Speed: 2-3 days vs. 3 weeks to 6 months (up to 100x faster)
- Sample size: 30-50 runs vs. 10-15 participants (3x larger)
- Cost: $6K-15K vs. $40K/year or $300K+ agency (cheaper)
- Statistical rigor: 95% CI vs. qualitative themes (more rigorous)
- Emotional depth: Low vs. High (less rich)
- Serendipity: Low vs. High (less discovery)
Bottom line: Synthetic is faster and more scalable. Traditional is deeper and more exploratory. Use both.
Can results be reproduced?
Yes, with caveats:
Reproducible Components
- Persona definitions are stored as YAML files (versioned)
- Test scenarios are documented in plain text
- Statistical calculations are deterministic (same data → same CI)
- Adversarial validation can be re-run
Non-Reproducible Components
- LLM outputs vary slightly run-to-run (temperature > 0)
- Model updates (the Claude Sonnet 4.5 available in March 2026 may differ from the version available in June 2026)
- Persona interpretation may shift over time
Reproducibility standard: Re-running the same test with the same personas should produce similar findings (within 10 percentage points) 90% of the time.
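A small sketch of how that standard could be checked mechanically (metric names are illustrative):

```python
# Sketch of the reproducibility check: compare the headline metrics of a re-run
# against the original and flag anything that drifted more than 10 points.
def reproducibility_check(original: dict[str, float], rerun: dict[str, float],
                          tolerance_pp: float = 10.0) -> dict[str, bool]:
    """Metrics are percentages, e.g. {"comprehension": 75.0, "conversion": 55.0}."""
    return {
        metric: abs(original[metric] - rerun.get(metric, float("nan"))) <= tolerance_pp
        for metric in original
    }


# e.g. reproducibility_check({"comprehension": 75.0}, {"comprehension": 82.0})
# -> {"comprehension": True}  (a 7-point drift is within the 10-point standard)
```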
Methodology Evolution
This methodology will evolve as we learn. Changes will be documented here:
Version 1.0 (March 2026)
- Initial methodology published
- Persona construction from real human data
- Claude Sonnet 4.5 for simulations
- 95% CI calculation for all metrics
- Adversarial validation with GPT-4
Future improvements: We're exploring persona validation through real user A/B tests, multi-model ensemble simulations, and automated bias detection.
Foundational Research
This methodology builds on established practices in:
- User research: Steve Krug (Don't Make Me Think), Jakob Nielsen (usability testing)
- Persona development: Alan Cooper (The Inmates Are Running the Asylum)
- Jobs-to-be-Done: Tony Ulwick (Outcome-Driven Innovation), Bob Moesta
- Statistical inference: Standard confidence interval calculations for proportions
- Adversarial validation: Machine learning best practices (cross-validation, holdout sets)
Contact
Questions about methodology? research@rak.lab