OpenAI Responses API with Prompt Caching
=========================================

``OpenAIResponsesProvider`` uses OpenAI's Responses API, which stores conversation state
server-side so the client can stay stateless. Pass ``previous_response_id`` to continue a
conversation without resending the full history, and ``prompt_cache_key`` to help route
related requests to the same machine so that prompt cache hits are more likely.

.. code-block:: python

   import asyncio
   import uuid
   from llm_async.models import Tool
   from llm_async.models.message import Message
   from llm_async.providers import OpenAIResponsesProvider

   calculator_tool = Tool(
       name="calculator",
       description="Perform basic arithmetic operations",
       parameters={
           "type": "object",
           "properties": {
               "operation": {"type": "string", "enum": ["add", "subtract", "multiply", "divide"]},
               "a": {"type": "number"},
               "b": {"type": "number"}
           },
           "required": ["operation", "a", "b"]
       }
   )

   def calculator(operation: str, a: float, b: float) -> float:
       if operation == "add":
           return a + b
       elif operation == "subtract":
           return a - b
       elif operation == "multiply":
           return a * b
       elif operation == "divide":
           return a / b
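       # Fail loudly instead of silently returning 0 for an unknown operation.
       raise ValueError(f"Unsupported operation: {operation}")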

   async def main():
       provider = OpenAIResponsesProvider(api_key="your-openai-api-key")
       session_id = uuid.uuid4().hex

       response = await provider.acomplete(
           model="gpt-4.1",
           messages=[Message("user", "What is 15 + 27? Use the calculator tool.")],
           tools=[calculator_tool],
           tool_choice="required",
           prompt_cache_key=session_id,
       )

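       # tool_choice="required" forces the model to call a tool, so index 0 exists.
       tool_call = response.main_response.tool_calls[0]
       # Run the local function and wrap its result as a message for the next turn.
       tool_result = await provider.execute_tool(tool_call, {"calculator": calculator})

       # Send only the tool output; earlier turns are referenced by response ID.
       final_response = await provider.acomplete(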
           model="gpt-4.1",
           messages=[tool_result],
           tools=[calculator_tool],
           previous_response_id=response.original["id"],
           prompt_cache_key=session_id,
       )
       print(final_response.main_response.content)

   asyncio.run(main())

Key benefits:

- **No history overhead**: reference previous turns via ``previous_response_id`` instead of resending messages.
- **Prompt caching**: ``prompt_cache_key`` helps route requests that share a prefix to the same machine, which improves cache hit rates.
- **Reduced costs**: cached input tokens are billed at a discounted rate that depends on the model (up to 90% lower on some models). The token count itself does not shrink.
- **Lower latency**: cached prefixes are processed faster.
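
The cost effect of a cache hit is simple arithmetic, sketched below. The function name, the price per million tokens, and the 75% discount are illustrative assumptions rather than published rates; check current pricing for the model in use.

.. code-block:: python

   def estimate_input_cost(
       input_tokens: int,
       cached_tokens: int,
       price_per_million: float,
       cached_discount: float,
   ) -> float:
       """Estimate input cost in dollars when part of the prompt hits the cache."""
       if not 0 <= cached_tokens <= input_tokens:
           raise ValueError("cached_tokens must be between 0 and input_tokens")
       uncached_tokens = input_tokens - cached_tokens
       per_token = price_per_million / 1_000_000
       # Cached tokens are still counted, but billed at a reduced rate.
       cached_cost = cached_tokens * per_token * (1 - cached_discount)
       return uncached_tokens * per_token + cached_cost

   # Hypothetical numbers: 2000 input tokens, 1536 of them cached,
   # $2.00 per million input tokens, 75% discount on cached tokens.
   with_cache = estimate_input_cost(2000, 1536, 2.00, 0.75)
   without_cache = estimate_input_cost(2000, 0, 2.00, 0.75)
   print(f"with cache: ${with_cache:.6f}, without: ${without_cache:.6f}")

The saving applies only to the cached portion of the input; output tokens and uncached input tokens are billed as usual.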

How it works:

1. The first request establishes a response context. If the prompt is at least 1024 tokens long, its prefix becomes eligible for caching.
2. Subsequent requests reference the previous response via ``previous_response_id``.
3. Reusing the same ``prompt_cache_key`` helps route those requests to the same machine.
4. Only new content (tool outputs, user messages) needs to be sent. Earlier turns in the chain still count as input tokens for billing.
5. Cached prefixes typically remain active for 5–10 minutes of inactivity (up to 1 hour during off-peak periods).
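
To check whether a request actually hit the cache, the usage section of the raw response can be inspected. The sketch below assumes that ``response.original`` is the raw Responses API payload as a dictionary (as the ``response.original["id"]`` access above suggests) and that it carries OpenAI's ``usage.input_tokens_details.cached_tokens`` field. The helper name and the sample payload are hypothetical.

.. code-block:: python

   from typing import Any, Mapping

   def cached_token_stats(raw_response: Mapping[str, Any]) -> tuple[int, int]:
       """Return (input_tokens, cached_tokens) from a raw response payload."""
       usage = raw_response.get("usage") or {}
       details = usage.get("input_tokens_details") or {}
       # Missing fields are treated as zero so the helper never raises.
       return int(usage.get("input_tokens", 0)), int(details.get("cached_tokens", 0))

   # Hypothetical payload shaped like a Responses API result.
   sample = {
       "id": "resp_123",
       "usage": {
           "input_tokens": 2048,
           "input_tokens_details": {"cached_tokens": 1920},
           "output_tokens": 50,
       },
   }
   input_tokens, cached_tokens = cached_token_stats(sample)
   print(f"{cached_tokens} of {input_tokens} input tokens were served from cache")

In the example above this would be called as ``cached_token_stats(final_response.original)``. A value of zero on the first request is expected, since the prefix has not been cached yet.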

See ``examples/openai_responses_tool_call_with_previous_id.py`` for a complete working example.

Interactive REPL Example (HTTP/2 + Prompt Cache + Tool Calls)
--------------------------------------------------------------

For a full interactive example, see ``examples/openai_responses_repl_http2_prompt_cache.py``.
It demonstrates:

- ``OpenAIResponsesProvider`` over HTTP/2
- Prompt caching with ``prompt_cache_key``
- Passing the latest ``previous_response_id`` between turns
- Calculator function tool call round-trips
- Configurable state strategy via CLI (full history resend vs stateless chaining)

Default behavior:

- Model: ``gpt-5-mini``
- Reasoning effort: ``medium``
- State mode: ``previous_response_id``

Run with defaults:

.. code-block:: bash

   poetry run python examples/openai_responses_repl_http2_prompt_cache.py

Select ``previous_response_id`` chaining explicitly (this matches the default):

.. code-block:: bash

   poetry run python examples/openai_responses_repl_http2_prompt_cache.py --state-mode previous_response_id

Resend full conversation history each turn:

.. code-block:: bash

   poetry run python examples/openai_responses_repl_http2_prompt_cache.py --state-mode full
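
The two state modes differ only in what each turn sends. The following is a minimal sketch of that difference, not the example script's actual code: ``build_turn_request`` is a hypothetical helper, and plain strings stand in for ``Message`` objects.

.. code-block:: python

   from typing import Any, Optional

   def build_turn_request(
       state_mode: str,
       history: list[str],
       new_messages: list[str],
       last_response_id: Optional[str],
   ) -> dict[str, Any]:
       """Build the per-turn keyword arguments for the chosen state mode."""
       if state_mode == "full":
           # Resend everything: prior turns plus the new messages.
           return {"messages": [*history, *new_messages]}
       if state_mode == "previous_response_id":
           # Send only the new messages and point at the previous response.
           kwargs: dict[str, Any] = {"messages": list(new_messages)}
           if last_response_id is not None:
               kwargs["previous_response_id"] = last_response_id
           return kwargs
       raise ValueError(f"Unknown state mode: {state_mode}")

   history = ["user: hi", "assistant: hello"]
   new = ["user: what is 2 + 2?"]
   full = build_turn_request("full", history, new, "resp_123")
   chained = build_turn_request("previous_response_id", history, new, "resp_123")
   print(len(full["messages"]), len(chained["messages"]))

With ``full``, the request grows with every turn; with ``previous_response_id``, it stays the size of the new content.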

Disable HTTP/2:

.. code-block:: bash

   poetry run python examples/openai_responses_repl_http2_prompt_cache.py --no-http2

Set prompt cache retention (model support varies):

.. code-block:: bash

   poetry run python examples/openai_responses_repl_http2_prompt_cache.py --prompt-cache-retention in-memory

Note: some models may reject ``prompt_cache_retention``. If a request fails for that reason, omit the ``--prompt-cache-retention`` flag.