Case Study
One of the world's biggest token consumers improved quality by removing context bloat
Processing 193 billion tokens per month, Pax Historia set out to cut costs with bear-1.1 compression. Instead, they discovered it also improved user preference — and lifted purchase amounts by 5%.
The story
Pax Historia is a conversational AI platform where users interact with historical figures. As one of the biggest token consumers on OpenRouter — processing 193 billion tokens per month — they came to us with a straightforward goal: reduce their LLM costs. Long conversation histories were driving up token counts, and they wanted to compress the context before sending it to their models.
Pax Historia
193B tokens/month on OpenRouter
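The integration point is simple: compress the accumulated conversation history before it reaches the model. Here is a minimal sketch of that pipeline shape — `compress_context` is a toy stand-in (it only collapses whitespace), not bear-1.1's real interface, and the commented-out call marks where the actual OpenRouter request would go:

```python
import re

def compress_context(messages: list[dict], ratio: float) -> list[dict]:
    # Toy stand-in for the compressor: collapses runs of whitespace.
    # bear-1.1 itself is a learned compressor; `ratio` (e.g. 0.2) would
    # control how aggressively low-information tokens are removed.
    return [
        {**m, "content": re.sub(r"\s+", " ", m["content"]).strip()}
        for m in messages
    ]

def chat_with_compression(messages: list[dict], model: str, ratio: float = 0.2):
    compact = compress_context(messages, ratio)
    # In production, forward `compact` to the model, e.g.:
    # return client.chat.completions.create(model=model, messages=compact)
    return compact  # returned directly so the sketch runs standalone

history = [
    {"role": "system", "content": "You are  Napoleon   Bonaparte."},
    {"role": "user", "content": "Tell me about   Austerlitz.\n\n"},
]
print(chat_with_compression(history, model="anthropic/claude-sonnet-4.5"))
```

The key design property is that compression is a drop-in preprocessing step: the rest of the serving stack is unchanged.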
We integrated bear-1.1 compression into their pipeline. The cost savings were immediate and expected. What we didn't expect was what happened next.
Pax Historia runs a model arena — a blind comparison where users vote on which response they prefer across different models. After accumulating over 268,000 votes, the results showed something remarkable: models running with bear-1.1 compression were consistently preferred over their uncompressed counterparts.
Arena results
The arena tested 27 models head-to-head with 268,327 total votes. Two models were tested with bear-1.1 compression: Claude Sonnet 4.5 at 0.2 compression and Gemini 3 Flash at 0.05 compression. Both ranked higher than their uncompressed versions.
| Rank | Model | Score | Wins | Votes |
|---|---|---|---|---|
| #1 | Claude Sonnet 4.6 | +0.45 | 9,178 | 15,819 |
| #2 | Claude Sonnet 4.5 @ 0.2 bear-1.1 | +0.43 | 12,848 | 22,179 |
| #3 | Claude Sonnet 4.5 | +0.41 | 13,013 | 22,744 |
| #4 | Claude Opus 4.5 | +0.36 | 11,561 | 20,416 |
| #5 | GLM 5 | +0.36 | 8,415 | 14,872 |
| #6 | Claude Opus 4.6 | +0.35 | 11,463 | 20,263 |
| #7 | Gemini 3 Flash @ 0.05 bear-1.1 | +0.23 | 12,250 | 22,792 |
| #8 | Gemini 3 Flash | +0.20 | 12,122 | 22,819 |
| #9 | Qwen3.5 Plus | +0.18 | 8,050 | 15,424 |
| #10 | Gemini 3.1 Pro Preview | +0.15 | 7,634 | 14,611 |
| #11 | Gemini 3 Pro Preview | +0.15 | 11,847 | 22,699 |
| #12 | Gemini 2.5 Flash | +0.09 | 11,703 | 22,909 |
Compressed vs uncompressed
The most direct comparison: the same base model, with and without bear-1.1 compression. In both cases, the compressed version scored higher and ranked higher.
Claude Sonnet 4.5 went from #3 uncompressed (+0.41) to #2 with 0.2 compression (+0.43). That puts it above Claude Opus 4.5 and just 0.02 points behind the top model.
Gemini 3 Flash moved from #8 uncompressed (+0.20) to #7 with 0.05 compression (+0.23).
Compression improved user preference, not just efficiency
In blind comparisons across 268,327 votes, users consistently preferred responses from compressed models. Removing context bloat appears to help models generate more focused, higher-quality responses.
Win rates
Looking at raw win rates across all arena matchups reinforces the finding. Compressed versions won a slightly higher proportion of their matches than their uncompressed counterparts.
Business impact
The arena results motivated Pax Historia to run A/B tests on their production traffic, comparing compressed and uncompressed pipelines on actual user behavior — not just preference votes.
The result: purchase amounts went up 5% when using compressed context. Users who interacted with compressed-context responses spent more. The quality improvement measured in the arena translated directly to a business metric.
+5%
Purchase amount lift
A/B tested on production traffic
4.7–22%
Saved per request
Original goal: achieved alongside quality gains
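At 193 billion tokens per month, even the low end of that savings range is substantial. A back-of-envelope estimate, assuming the per-request percentage applies uniformly to token volume:

```python
monthly_tokens = 193_000_000_000   # Pax Historia's monthly OpenRouter volume
low, high = 0.047, 0.22            # per-request savings range above

print(f"{monthly_tokens * low / 1e9:.1f}B to "
      f"{monthly_tokens * high / 1e9:.1f}B tokens saved per month")
```

That works out to roughly 9 to 42 billion tokens per month that never need to be processed at all.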
Why does compression improve quality?
This isn't the first time we've seen compression improve output quality. Our FinanceBench evaluation showed a 2.7 percentage point accuracy improvement on financial QA when using bear-1.2 compression.
The mechanism is the same: LLM attention is a finite resource. When the context window is filled with boilerplate, formatting artifacts, and low-information tokens, the model spreads its attention across irrelevant content. Compression removes this noise, letting the model focus on what matters.
In Pax Historia's case, conversation histories accumulate system prompts, formatting, and repetitive phrasing. bear-1.1 compression strips these while preserving the semantic content that drives good responses. The result is a model that appears more coherent and focused to users — because its context actually is more coherent and focused.
Key findings
Compression improves user preference
In a 268K-vote blind arena, the bear-1.1-compressed versions of both Claude Sonnet 4.5 and Gemini 3 Flash ranked higher than their uncompressed counterparts.
Quality gains translate to business metrics
A/B testing on production traffic showed a 5% increase in purchase amounts when using compressed context. Better responses drive better outcomes.
Cost reduction and quality improvement are not a tradeoff
Pax Historia achieved both goals simultaneously: lower costs per request from fewer tokens, and higher quality from cleaner context. Compression is not a compromise — it's an optimization on both axes.
Ready to try it?
Create an account to get your API key and start compressing.