Streaming with Token Budget Pattern
Demonstrates cost control by enforcing token limits during streaming responses.
Stop paying for tokens you don’t need — enforce real-time streaming budgets and cancel LLM requests the moment your limit is reached.
Edition
Community: runs on the OSS / Community SDK edition.
What this demonstrates
Difficulty: Starter 🟢 · LLM
- Summary: Token budget management and cost estimation
- Scenario: Track and limit token usage across requests
- `tech_tags` in manifest: `LLM`; example id `token-budget` in `conformance/examples_manifest.json`.
Prerequisites
- SDK: Use an installed SDK tree (`NXUSKIT_SDK_DIR`, `NXUSKIT_LIB_PATH` as needed); `test-examples.sh` resolves Go/Rust/Python deps from that tree only. See `README.md`, `scripts/setup-sdk.sh`, and `scripts/test-examples.sh`.
- Languages in this example: go, python, rust, bash (paths under this directory; Python may live under a sibling `python/` or shared reference per Language Implementations).
- Models: CLI/Bash defaults to the `loopback` provider for credential-free smoke tests. Set cloud provider API keys or run Ollama locally when using live providers in the Run steps.
Key nxusKit Features Demonstrated
| Feature | Description |
|---|---|
| Unified Streaming | Same streaming interface across all providers (Stream in Rust, channels in Go) |
| Stream Cancellation | Graceful cancellation supported by all provider implementations |
| Token Tracking | Normalized token usage in final chunk regardless of provider |
Provider Compatibility: Any provider supporting streaming (Claude, OpenAI, Ollama)
Pattern Overview
When streaming LLM responses, you may want to stop generation early to control costs or enforce response length limits. This pattern monitors token usage during streaming and cancels the request when a budget is reached.
Key Features
- Real-time token estimation during streaming
- Graceful stream cancellation when budget exceeded
- Returns partial content and budget status
- Works with any streaming-capable provider
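The features above can be sketched in a few lines. This is a minimal illustration, not the nxusKit API: `fake_stream`, `BudgetResult`, and this Python `stream_with_budget` are hypothetical names, the stream is a plain Python generator standing in for a provider's streaming response, and closing the generator stands in for cancelling the request. Rounding the estimate up is also an assumption.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Iterator

CHARS_PER_TOKEN = 4  # heuristic ratio; an estimate, not an exact token count


@dataclass
class BudgetResult:
    content: str           # partial or full text received
    estimated_tokens: int  # heuristic estimate of tokens consumed
    budget_reached: bool   # True if streaming stopped early


def fake_stream(text: str, chunk_size: int = 8) -> Iterator[str]:
    """Hypothetical stand-in for a provider stream: yields fixed-size chunks."""
    for i in range(0, len(text), chunk_size):
        yield text[i:i + chunk_size]


def stream_with_budget(chunks: Iterable[str], max_tokens: int) -> BudgetResult:
    """Consume chunks until the estimated token count reaches max_tokens."""
    parts = []
    chars = 0
    budget_reached = False
    stream = iter(chunks)
    for chunk in stream:
        parts.append(chunk)
        chars += len(chunk)
        if math.ceil(chars / CHARS_PER_TOKEN) >= max_tokens:
            budget_reached = True
            # Stop pulling chunks; closing a generator stands in for cancellation.
            close = getattr(stream, "close", None)
            if close is not None:
                close()
            break
    estimated = math.ceil(chars / CHARS_PER_TOKEN)
    return BudgetResult("".join(parts), estimated, budget_reached)
```

Because the check runs after each chunk arrives, the result can overshoot the budget by up to one chunk. A caller that needs a hard cap would have to trim the last chunk or set the budget slightly lower.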
Real-World Application
Usage metering, per-user quota enforcement.
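As one possible shape for per-user quota enforcement, the sketch below keeps a running token total per user and derives the budget for the next streamed request from what remains. `QuotaTracker` and `next_budget` are hypothetical names introduced for illustration; nothing here is part of nxusKit, and a real deployment would likely persist the counters rather than hold them in memory.

```python
from collections import defaultdict


class QuotaTracker:
    """Hypothetical in-memory per-user token quota (illustration only)."""

    def __init__(self, quota_per_user: int) -> None:
        self.quota_per_user = quota_per_user
        self.used = defaultdict(int)  # user_id -> tokens consumed so far

    def remaining(self, user_id: str) -> int:
        """Tokens the user may still consume; never negative."""
        return max(0, self.quota_per_user - self.used[user_id])

    def record(self, user_id: str, tokens: int) -> None:
        """Add the (estimated or reported) tokens of a finished request."""
        self.used[user_id] += tokens


def next_budget(tracker: QuotaTracker, user_id: str, per_request_cap: int) -> int:
    """Budget for the next request: the smaller of the cap and the remaining quota."""
    return min(per_request_cap, tracker.remaining(user_id))
```

The value returned by `next_budget` would be passed as the streaming budget, and the token count from the result would be fed back through `record` once the request finishes.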
Technologies
LLM
Language Implementations
| Language | Path | Status |
|---|---|---|
| Rust | rust/ | Available |
| Go | go/ | Available |
| Python | python/ | Available |
| CLI/Bash | bash/ | Available |
Attach an installed SDK (NXUSKIT_SDK_DIR). See the repository README.md and scripts/test-examples.sh.
    # From `/examples/patterns/token-budget`:
    cd rust && cargo build
    cd go && make build
    cd python && python3 main.py --help
    cd bash && make build

Token Estimation
Since exact token counts aren’t available during streaming, we use a simple heuristic:
- ~4 characters per token (works well for English text)
- Adjust this ratio for other languages or specialized content
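The heuristic amounts to a single division. The sketch below is an assumption about how such an estimator might look, not the example's actual code: `estimate_tokens` and its `chars_per_token` parameter are hypothetical names, and rounding up (so a partial token counts as one) is a design choice made here.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate token count from character length, rounding up."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


sample = "Streaming budgets keep costs predictable."
print(estimate_tokens(sample))                       # default ~4 chars/token
print(estimate_tokens(sample, chars_per_token=2.5))  # assumed denser tokenization
```

A lower `chars_per_token` yields a higher estimate for the same text, so the budget is reached sooner. That is the safer direction when the true ratio is unknown.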
Library usage
Rust:

    use token_budget::stream_with_budget;

    // Stream with a 100 token budget
    let result = stream_with_budget(&provider, &request, 100).await?;

    println!("Content: {}", result.content);
    println!("Tokens used: {}", result.estimated_tokens);
    if result.budget_reached {
        println!("Budget limit reached - response truncated");
    }

Go:

    // Stream with a 100 token budget
    result, err := StreamWithBudget(ctx, provider, req, 100)
    if err != nil {
        log.Fatal(err) // handle the error as appropriate for your application
    }

    fmt.Println("Content:", result.Content)
    fmt.Println("Tokens used:", result.EstimatedTokens)
    if result.BudgetReached {
        fmt.Println("Budget limit reached - response truncated")
    }

Rust:

    cd rust
    cargo run

Go:

    cd go
    go run .

Python
    cd python
    python3 main.py

CLI/Bash

    cd bash
    make run
    TOKEN_BUDGET_MAX=40 make run ARGS="ollama"

Interactive Modes
All examples support debugging flags:

    # Verbose mode - show raw HTTP request/response data
    cargo run -- --verbose        # Rust
    go run . --verbose            # Go
    make run ARGS="--verbose"     # CLI/Bash

    # Step mode - pause at each step with explanations
    cargo run -- --step           # Rust
    go run . --step               # Go
    make run ARGS="--step"        # CLI/Bash

    # Combined mode
    cargo run -- --verbose --step

Or use environment variables:

    export NXUSKIT_VERBOSE=1
    export NXUSKIT_STEP=1

Testing
    # Rust
    cd rust && cargo test

    # Go
    cd go && go test -v

    # Python smoke
    cd python && python3 main.py --help

    # CLI/Bash
    cd bash && make test

Production Considerations
- Calibrate token ratio: Adjust the 4 chars/token estimate for your specific use case
- Buffer margin: Set budget slightly below hard limits to account for estimation error
- User feedback: Indicate when responses are truncated due to budget
- Combine with max_tokens: Use both streaming budget and API max_tokens for defense in depth
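The first two considerations can be made concrete with a little arithmetic. The sketch below is illustrative only: `effective_budget` and `calibrate_chars_per_token` are hypothetical helpers. It assumes you record, for each completed request, the character count of the response together with the token usage the provider reports at the end of the stream.

```python
from typing import Sequence, Tuple


def effective_budget(hard_limit: int, margin_percent: int = 10) -> int:
    """Streaming budget set below the hard limit by margin_percent."""
    if not 0 <= margin_percent < 100:
        raise ValueError("margin_percent must be in [0, 100)")
    # Integer arithmetic avoids floating-point rounding surprises.
    return hard_limit * (100 - margin_percent) // 100


def calibrate_chars_per_token(
    samples: Sequence[Tuple[int, int]], default: float = 4.0
) -> float:
    """Average chars per token over (char_count, reported_tokens) samples."""
    total_chars = sum(chars for chars, _ in samples)
    total_tokens = sum(tokens for _, tokens in samples)
    # Fall back to the default ratio when there is no usable data yet.
    return total_chars / total_tokens if total_tokens else default
```

For example, a hard limit of 1000 tokens with a 10% margin gives a streaming budget of 900, and two samples of (400 chars, 100 tokens) and (350 chars, 100 tokens) give a calibrated ratio of 3.75 chars per token.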