Backed by Y Combinator

Optimize LLM context
by removing input bloat

Bear-1.2 compression removes low signal tokens from your prompts before they hit your LLM.

Backed by people behind

Hugging Face
Silo
Wolt
Y Combinator
Supercell
SVA
PDF

Save tokens and improve accuracy on your agent's background knowledge

Bear-1.2 compresses your agent's background knowledge before it enters the context window.


Intelligent semantic processing

The bear-1 and bear-1.2 models process tokens based on context and semantic intent. Compression is deterministic and low-latency.

In its most fundamental sense, compression is the process of encoding
information using fewer bits or resources than the original representation
by identifying and eliminating statistical redundancies or irrelevant data
within a dataset. Whether applied to digital media, text, or the high-
dimensional vector spaces of Large Language Models, compression relies on
the principle that most raw information contains noise or repeating patterns
that do not contribute new meaning. By applying an algorithm—or in your
case, an ML-based model—to map the input data into a more compact form,
you essentially distil the signal from the noise. In the context of ML
inputs, this means transforming long-form text into a dense, mathematically
efficient representation that preserves the original semantic intent and
logical relationships while significantly reducing the physical token count,
thereby allowing a system to process more information within the same fixed
computational window or budget.

One API call

Send text in, get compressed text back. Drop it in before your LLM call. That's the entire integration.

POST api.thetokencompany.com/v1/compress
{
  "model": "bear-1.1",
  "input": "Your long text to compress..."
}
Response
{
  "output": "Compressed text...",
  "original_input_tokens": 1284,
  "output_tokens": 436
}
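
The call above could look roughly like this in Python, using only the standard library. This is a sketch under assumptions: the https:// scheme and the bearer-token Authorization header are not shown in the example above, so check the docs for the actual authentication scheme.

```python
import json
import urllib.request

# Endpoint path from the example above; the https:// scheme is an assumption.
COMPRESS_URL = "https://api.thetokencompany.com/v1/compress"


def build_compress_request(text, model="bear-1.1", api_key=""):
    """Build (but do not send) the POST request for the compression endpoint."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: bearer-token auth; the example above shows no auth header.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        COMPRESS_URL, data=body, headers=headers, method="POST"
    )


def compress(text, model="bear-1.1", api_key=""):
    """Send text for compression and return the parsed JSON response."""
    request = build_compress_request(text, model, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling compress("Your long text to compress...", api_key=...) should return a dict with the output, original_input_tokens, and output_tokens fields shown in the response example; pass result["output"] to your LLM call in place of the original text.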
Read the docs

Use cases

LLM Entertainment & Gaming

Longer memories, richer worlds, same budget.

Meeting Transcription

Distill hours of calls into signal-dense context.

Web Scraping

Strip boilerplate from crawled pages before ingest.

Document Analysis

Fit more PDFs and reports into one context window.

Frequently asked questions

How compression works, pricing, integration, accuracy, security, and deployment. The questions teams ask most.

How does it actually work? Is it generative, or does it just drop tokens?

We sit as a middleware layer between your prompt and your LLM. A small ML classifier scores every token in your input and removes the ones least likely to affect the model's output.

Nothing is summarized, paraphrased, or generated. We only ever delete.

That's why we're faster, cheaper, and 100% deterministic. A compression step built on a small LLM is none of those.
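
To make the delete-only idea concrete, here is a toy Python sketch. It is not the bear model: the scorer is a hypothetical filler-word lookup standing in for the ML classifier, and whitespace splitting stands in for real tokenization. What it does illustrate is the shape of the technique: score each token, drop the low scorers, and keep everything else verbatim and in order.

```python
# Toy stand-in for the token classifier. A real system would use a trained
# model; here a token scores low only if it is a common filler word.
FILLER = {"the", "a", "an", "of", "to", "that", "is", "very", "really", "just"}


def score(token):
    """Return a hypothetical importance score in [0, 1] for one token."""
    return 0.1 if token.lower() in FILLER else 0.9


def compress_by_deletion(text, threshold=0.5):
    """Keep tokens scoring at or above threshold, verbatim and in order."""
    tokens = text.split()  # assumption: whitespace tokens, not model tokens
    return " ".join(t for t in tokens if score(t) >= threshold)
```

Because the function only filters, every surviving word appears exactly as it did in the input, and the same input and threshold always give the same output.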

Sign up and try it

How much will it actually compress my input?

Typically 10–40% while maintaining full accuracy, depending on how dense your input is.

Clean, information-rich text compresses less. Noisy web scrapes, long chat histories, and verbose documents compress more.

You control how much we remove with the aggressiveness parameter.

What compression aggressiveness should I use?

Use 0.05–0.2 for inputs the model reads directly: files, documents, anything it's answering questions about.

Use 0.5–0.8 for compacting long conversation histories or files used as background context, where exact wording matters less.

If in doubt, start low and dial up while you watch your eval.
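
One way to "start low and dial up" is a small sweep, sketched here in Python. Both callables are hypothetical placeholders: compress_fn would wrap your compression call at a given aggressiveness, and eval_fn would run your own eval on the compressed input and return a score.

```python
def pick_aggressiveness(compress_fn, eval_fn, text, levels, min_score):
    """Return the highest level, tried in ascending order, whose eval score
    stays at or above min_score. Returns None if even the lowest level fails.

    compress_fn(text, level) -> compressed text   (hypothetical callable)
    eval_fn(compressed_text) -> float score       (hypothetical callable)
    """
    best = None
    for level in sorted(levels):
        if eval_fn(compress_fn(text, level)) < min_score:
            break  # stop at the first level that hurts the eval
        best = level
    return best
```

For example, pick_aggressiveness(my_compress, my_eval, doc, [0.05, 0.1, 0.2], min_score=0.95) would walk the recommended range for directly read inputs and stop as soon as the eval dips below the bar you set.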

How fast is the compression API?

Latency depends on input size, but we're built for real-time use, with a p95 latency of 150 ms.

For most workflows, the shorter prompt cuts the downstream LLM's time-to-first-token by more than the compression step adds. End-to-end latency goes down with compression in the loop, not up.

See the latency numbers

Won't I lose information I need?

You stay in control of the trade-off. You can dial aggressiveness up or down, and you can wrap critical content (IDs, table cells, exact quotes, code identifiers) in safe labels so we never touch it.

We've benchmarked on needle-in-a-haystack and exact-quote retrieval. If you have a specific eval, we'll run it.

See the benchmarks

How is this different from summarization?

Summarization rewrites your input. It changes wording, introduces hallucinations, and loses structure.

We only delete. The remaining text is verbatim, in the original order, which keeps citations, code, numbers, and JSON intact.

Does it work on code?

For understanding, yes. Running it on a large codebase so an LLM can navigate architecture, find the right file, or answer questions about the repo works well. The model still understands what the code does.

Not recommended for code editing or syntax fixing. Compression strips tokens the LLM doesn't need for understanding, but the compressed output is no longer compilable. Don't feed it into a loop where the LLM edits the code and the result has to run.

How does pricing work?

You only pay for tokens saved: the difference between the input we received and the compressed output we returned.

If a 10M-token prompt comes out at 7M after compression, you pay for the 3M we removed, not the 10M you sent in.

That way we're always net cheaper than running the same input through your LLM. If we don't save you money, you don't pay.
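
The billed quantity can be computed directly from the two token counts in the API response. A minimal Python sketch follows; the per-token price is not stated here, so it is left out, and clamping at zero is an assumption for the case where nothing is removed.

```python
def billable_tokens(original_input_tokens, output_tokens):
    """Tokens removed by compression, the quantity billed as described above."""
    # Assumption: never negative, even if the output is not shorter.
    return max(original_input_tokens - output_tokens, 0)


def reduction_ratio(original_input_tokens, output_tokens):
    """Fraction of the input removed (0.0 when the input is empty)."""
    if original_input_tokens == 0:
        return 0.0
    return billable_tokens(original_input_tokens, output_tokens) / original_input_tokens
```

With the numbers above, billable_tokens(10_000_000, 7_000_000) gives 3,000,000 tokens, a 30% reduction.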

See pricing

Is there a free tier?

Yes. Every new account gets 60M free input tokens of compression to test with.

If you need more credits to finish a proper eval, just ask.

Start free

Will it break my LLM provider's prompt caching?

No. Our output is deterministic for a given input and setting, so caches (yours or your LLM provider's) stay valid.

If you change the aggressiveness setting, that's a new cache key.
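
If you keep your own cache of compressed outputs, the key should cover everything that determines the output. A minimal Python sketch, assuming the model name also belongs in the key alongside the input and the aggressiveness setting (the function and field names are illustrative):

```python
import hashlib
import json


def cache_key(text, model, aggressiveness):
    """Derive a stable key from everything that determines the compressed output."""
    payload = json.dumps(
        {"input": text, "model": model, "aggressiveness": aggressiveness},
        sort_keys=True,  # fixed field order so equal inputs hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The same text, model, and setting always map to the same key, and changing the aggressiveness produces a different one.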

Can I integrate it without rewriting my pipeline?

Yes. Most customers drop us in as middleware: one API call before your existing LLM call.

We're also building a Stripe-AI-Gateway-style endpoint swap, so you can change a base URL and keep your provider SDKs (OpenAI, Anthropic, Gemini, Azure, OpenRouter).

Read the docs

Can you fine-tune to my data?

Yes. For higher-volume customers we train a model variant on your domain: legal, financial, code in a specific language, and so on.

We typically need a few million tokens of representative input. Fine-tuned models can be used alongside zero data retention if needed.

Talk to us

What about data retention and compliance?

By default, we retain inputs to improve the service.

Zero data retention is available on request and can be set at the account level.

SOC 2 compliance is in progress, and we're HIPAA-ready with a BAA.

Can I run this on-prem or in my own VPC?

On-prem and AWS VPC / Marketplace deployments are on the roadmap and are our most-requested enterprise feature.

If you have a hard requirement, reach out and we'll share timing.

Get in touch

What are the rate limits?

10 requests per minute on the free plan, 60 requests per minute on Pro.

For custom enterprise deals we can size limits higher. Get in touch if you need production-scale throughput.
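
To stay under a per-minute limit on the client side, a small rolling-window limiter is one option. This Python sketch assumes the limit applies over a rolling 60-second window; how the server actually counts requests is not specified here.

```python
import time
from collections import deque


class MinuteRateLimiter:
    """Client-side limiter allowing at most max_requests per rolling 60 seconds."""

    def __init__(self, max_requests, clock=time.monotonic):
        self.max_requests = max_requests
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.sent = deque()  # timestamps of requests inside the current window

    def wait_time(self):
        """Seconds to wait before the next request is allowed (0.0 if none)."""
        now = self.clock()
        # Forget requests that have aged out of the 60-second window.
        while self.sent and now - self.sent[0] >= 60.0:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            return 0.0
        return 60.0 - (now - self.sent[0])

    def record(self):
        """Record that a request was just sent."""
        self.sent.append(self.clock())
```

Before each compression call, sleep for limiter.wait_time() seconds, send the request, then call limiter.record(). Use MinuteRateLimiter(10) on the free plan or MinuteRateLimiter(60) on Pro.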

Talk to us

Who is this for?

Teams running LLMs over lots of long inputs.

Agent frameworks, web-research and enrichment pipelines, RAG systems, chat apps with long histories, document workflows in legal / financial / healthcare, and coding agents.

Try it on your stack

Ready to compress?

Access the compression API.