OpenAI Responses API with Prompt Caching
OpenAIResponsesProvider uses OpenAI’s Responses API for stateless conversation management.
Use previous_response_id to continue a conversation without resending full history,
and prompt_cache_key to route requests to the same machine for consistent cache hits.
import asyncio
import uuid
from llm_async.models import Tool
from llm_async.models.message import Message
from llm_async.providers import OpenAIResponsesProvider
calculator_tool = Tool(
name="calculator",
description="Perform basic arithmetic operations",
parameters={
"type": "object",
"properties": {
"operation": {"type": "string", "enum": ["add", "subtract", "multiply", "divide"]},
"a": {"type": "number"},
"b": {"type": "number"}
},
"required": ["operation", "a", "b"]
}
)
def calculator(operation: str, a: float, b: float) -> float:
if operation == "add":
return a + b
elif operation == "subtract":
return a - b
elif operation == "multiply":
return a * b
elif operation == "divide":
return a / b
return 0
async def main():
provider = OpenAIResponsesProvider(api_key="your-openai-api-key")
session_id = uuid.uuid4().hex
response = await provider.acomplete(
model="gpt-4.1",
messages=[Message("user", "What is 15 + 27? Use the calculator tool.")],
tools=[calculator_tool],
tool_choice="required",
prompt_cache_key=session_id,
)
tool_call = response.main_response.tool_calls[0]
tool_result = await provider.execute_tool(tool_call, {"calculator": calculator})
final_response = await provider.acomplete(
model="gpt-4.1",
messages=[tool_result],
tools=[calculator_tool],
previous_response_id=response.original["id"],
prompt_cache_key=session_id,
)
print(final_response.main_response.content)
asyncio.run(main())
Key benefits:
No history overhead: reference previous turns via
previous_response_idinstead of resending messages.Prompt caching:
prompt_cache_keyroutes requests to the same machine for cache hits.Reduced costs: cached prefixes consume 90% fewer tokens.
Lower latency: cached prefixes are processed faster.
How it works:
First request establishes a response context and caches the prompt prefix (≥1024 tokens).
Subsequent requests reference the first response via
previous_response_id.Using the same
prompt_cache_keyroutes requests to the same machine.Only new content (tool outputs, user messages) needs to be sent.
Cached prefixes remain active for 5–10 minutes of inactivity (up to 1 hour off-peak).
See examples/openai_responses_tool_call_with_previous_id.py for a complete working example.
Interactive REPL Example (HTTP/2 + Prompt Cache + Tool Calls)
For a full interactive example, see examples/openai_responses_repl_http2_prompt_cache.py.
It demonstrates:
OpenAIResponsesProviderover HTTP/2Prompt caching with
prompt_cache_keySending latest
previous_response_idbetween turnsCalculator function tool call round-trips
Configurable state strategy via CLI (full history resend vs stateless chaining)
Default behavior:
Model:
gpt-5-miniReasoning effort:
mediumState mode:
previous_response_id
Run with defaults:
poetry run python examples/openai_responses_repl_http2_prompt_cache.py
Use stateless chaining explicitly:
poetry run python examples/openai_responses_repl_http2_prompt_cache.py --state-mode previous_response_id
Resend full conversation history each turn:
poetry run python examples/openai_responses_repl_http2_prompt_cache.py --state-mode full
Disable HTTP/2:
poetry run python examples/openai_responses_repl_http2_prompt_cache.py --no-http2
Set prompt cache retention (model support varies):
poetry run python examples/openai_responses_repl_http2_prompt_cache.py --prompt-cache-retention in-memory
Note: some models may reject prompt_cache_retention. If unsupported, omit the flag.