Multi-Model Variance Benchmark Report

Date: 2025-12-30
Model: Simulated variance patterns
Temperature: 0.0
Runs per question: 20
Questions: 10

Executive Summary

Metric                        Raw Prompts    Structured    Change
Mean Agreement Rate (TARa)    80.0%          98.5%         +18.5 pp
Inconsistency Rate            20.0%          1.5%          -18.5 pp
Mean Variance Reduction       -              -             38.0%

Results by Category

Logic (2 questions)

Raw Agreement    Structured Agreement    Improvement
67.5%            95.0%                   +27.5 pp

Factual (2 questions)

Raw Agreement    Structured Agreement    Improvement
92.5%            100.0%                  +7.5 pp

Math (3 questions)

Raw Agreement    Structured Agreement    Improvement
98.3%            100.0%                  +1.7 pp

Decision (2 questions)

Raw Agreement    Structured Agreement    Improvement
65.0%            97.5%                   +32.5 pp

Complex (1 question)

Raw Agreement    Structured Agreement    Improvement
55.0%            100.0%                  +45.0 pp
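
As a consistency check, the category figures above can be combined into the overall means reported in the Executive Summary by weighting each category by its question count. The following is a minimal Python sketch; the dictionary layout and the function name weighted_means are illustrative assumptions, not part of the benchmark code.

```python
# Category results copied from the tables above:
# name -> (question count, raw agreement %, structured agreement %)
CATEGORIES = {
    "Logic": (2, 67.5, 95.0),
    "Factual": (2, 92.5, 100.0),
    "Math": (3, 98.3, 100.0),
    "Decision": (2, 65.0, 97.5),
    "Complex": (1, 55.0, 100.0),
}


def weighted_means(categories):
    """Return (raw, structured) mean agreement, weighted by question count."""
    total = sum(n for n, _, _ in categories.values())
    raw = sum(n * r for n, r, _ in categories.values()) / total
    structured = sum(n * s for n, _, s in categories.values()) / total
    return raw, structured


raw, structured = weighted_means(CATEGORIES)
print(f"raw={raw:.1f}% structured={structured:.1f}% change={structured - raw:+.1f} pp")
```

Rounded to one decimal place, the weighted means match the summary values of 80.0% and 98.5%. The Math figure of 98.3% is itself rounded, so the raw mean comes out marginally below 80 before rounding.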

Methodology

Based on academic literature:

Protocol

  1. Each question is run 20 times with identical parameters
  2. Temperature: 0.0
  3. Two conditions: Raw prompts vs 5-step structured reasoning
  4. Metric: TARa (Total Agreement Rate for parsed answers)
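
The report does not spell out how the agreement rate is computed per question. Category values such as 67.5% over two questions are consistent with a modal-share definition: the fraction of the 20 runs whose parsed answer matches the most common answer. The Python sketch below assumes that definition; the function names and the example answers are hypothetical and are not benchmark data.

```python
from collections import Counter


def agreement_rate(parsed_answers):
    """Share of runs whose parsed answer equals the most common answer.

    Assumption: agreement is measured against the modal answer.
    """
    if not parsed_answers:
        raise ValueError("need at least one run")
    _, modal_count = Counter(parsed_answers).most_common(1)[0]
    return modal_count / len(parsed_answers)


def mean_agreement(runs_by_question):
    """Average the per-question agreement rates."""
    rates = [agreement_rate(answers) for answers in runs_by_question.values()]
    return sum(rates) / len(rates)


# Made-up parsed answers for two questions, 20 runs each (not benchmark data).
example_runs = {
    "q1": ["A"] * 13 + ["B"] * 7,
    "q2": ["yes"] * 14 + ["no"] * 6,
}

agreement = mean_agreement(example_runs)
print(f"agreement={agreement:.1%} inconsistency={1 - agreement:.1%}")
```

Under this definition the inconsistency rate is simply the complement of the agreement rate, which matches how the two rows of the Executive Summary sum to 100%.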

Structured prompting reduced output inconsistency from 20.0% to 1.5%, a drop of 18.5 percentage points, with a mean variance reduction of 38.0%.