CompletionService Call Pipeline

CompletionService is the core facade for LLM calls. It turns completion requests from the frontend into a complete API call pipeline. This document describes its six-step request pipeline, streaming handling, thinking budget calculation, max_tokens resolution, retry mechanism, tool call loop, and vision fallback.

File Locations

File	Path
CompletionService	`packages/desktop/app/main/services/capabilities/llm/completion/CompletionService.ts`
DirectApiHandler	`packages/desktop/app/main/services/capabilities/llm/completion/DirectApiHandler.ts`
StreamHandler	`packages/desktop/app/main/services/capabilities/llm/completion/StreamHandler.ts`
ToolHandler	`packages/desktop/app/main/services/capabilities/llm/completion/ToolHandler.ts`
TransformerHandler	`packages/desktop/app/main/services/capabilities/llm/completion/TransformerHandler.ts`
ThinkingResolver	`packages/desktop/app/main/services/capabilities/llm/completion/ThinkingResolver.ts`
URL Builder	`packages/desktop/app/main/services/capabilities/llm/completion/url-builder.ts`
Header Builder	`packages/desktop/app/main/services/capabilities/llm/completion/header-builder.ts`
Message Converter	`packages/desktop/app/main/services/capabilities/llm/completion/message-converter.ts`
Types	`packages/desktop/app/main/services/capabilities/llm/completion/types.ts`
NativeSearchInjector	`packages/desktop/app/main/services/capabilities/llm/completion/NativeSearchInjector.ts`
ProviderSearchInjector	`packages/desktop/app/main/services/capabilities/llm/completion/ProviderSearchInjector.ts`

Architectural Context

graph TB
    subgraph CompletionService ["CompletionService (Facade)"]
        direction TB
        Complete[complete]
        Stream[completeStream]
        WithTransformers[completeWithTransformers]
        StreamTransformers[completeStreamWithTransformers]
        WithTools[streamWithTools]
        TestModel[testModel]
    end

    subgraph PipelineSteps ["Pipeline Steps"]
        direction TB
        S1["(1) Route resolution<br/>resolveRoutedModel"]
        S2["(2) Provider lookup<br/>getProvider + enabled check"]
        S3["(3) API Key resolution<br/>codingPlan → pool → legacy"]
        S4["(4) API format resolution<br/>resolveApiFormat"]
        S5["(5) Handler dispatch<br/>callDirectHandler / callStreamHandler"]
        S6["(6) Retry + success report"]
    end

    subgraph Handlers
        DAH[DirectApiHandler<br/>non-streaming]
        SH[StreamHandler<br/>SSE streaming]
        TH[ToolHandler<br/>tool loop]
        THR[TransformerHandler<br/>transformer chain]
    end

    subgraph AuxiliaryServices ["Auxiliary Services"]
        TR2[ThinkingResolver]
        NSI[NativeSearchInjector]
        PSI[ProviderSearchInjector]
        MC[Message Converter]
        UB[URL Builder]
        HB[Header Builder]
    end

    Complete --> S1 --> S2 --> S3 --> S4 --> S5 --> S6
    S5 --> DAH
    S5 --> SH

    WithTools --> TH
    WithTransformers --> THR
    StreamTransformers --> THR

    DAH --> UB
    DAH --> HB
    DAH --> MC
    SH --> UB
    SH --> HB
    SH --> MC
    SH --> TR2
    TH --> UB
    TH --> HB

Data Structures

Request and Response Types

// Completion request options
interface CompletionOptions {
  providerId: string;              // Provider ID
  model: string;                   // Model ID (v89+: bare SDK id, no `<backend>:` prefix or `[1m]` suffix)
  messages: SimpleChatMessage[];   // Conversation messages
  maxTokens?: number;              // Max tokens to generate
  temperature?: number;            // Temperature
  stream?: boolean;                // Whether streaming
  thinkLevel?: ThinkLevel;         // Thinking level: 'none' | 'low' | 'medium' | 'high'
  nativeSearchAugmentation?: NativeSearchAugmentation; // SDK native search augmentation
  sessionId?: string;              // Session ID (API Key Pool affinity)
  /**
   * 1M-context flag (v89+). When true and the model is on the 1M-capable whitelist
   * (`claude-opus-4-7` / `claude-opus-4-6` / `claude-sonnet-4-6`),
   * `TransformerHandler` calls `injectExtendedContextBeta()` at the transformer chain
   * exit, merging `'context-1m-2025-08-07'` into the outbound request's
   * `anthropic-beta` HTTP header
   * (NOT a body field; `/v1/messages` rejects unknown body fields).
   */
  useExtendedContext?: boolean;
}

// Completion result
interface CompletionResult {
  success: boolean;
  message?: SimpleChatMessage;      // Generated message
  error?: string;                   // Error message
  usage?: {
    promptTokens: number;
    completionTokens: number;
    totalTokens: number;
  };
  finishReason?: string;            // 'stop' | 'tool_use' | 'max_tokens' etc.
}

// Streaming callbacks
interface StreamCallbacks {
  onStart?: (messageId: string) => void;
  onDelta?: (content: string) => void;
  onReasoning?: (reasoning: string) => void;
  onAudio?: (audio: SimpleChatAudio) => void;
  onVideo?: (video: SimpleChatVideo) => void;
  onBlock?: (block: MessageBlock) => void;  // Content blocks: thinking/text/tool_use/tool_result
  onDone?: (message, usage?, metrics?) => void;
  onError?: (error: string) => void;
}

// API format
type ApiFormat = 'openai' | 'anthropic' | 'google' | 'azure-openai' | 'openai-response';

Algorithms and Logic

Six-Step Request Pipeline

Step 1: Route Resolution

routedInfo = llmConfig.resolveRoutedModel(providerId, model)
actualProviderId = routedInfo?.actualProviderId || providerId
actualModel = routedInfo?.actualModelId || model

Route resolution handles Chat → Code and Code → Chat model routing. If the requested model matches a routing rule, the provider and model are replaced with the routed targets; otherwise the original values are used unchanged.

Step 2: Provider Lookup

provider = getProvider(actualProviderId)
if (!provider) → return error "Provider not found"
if (!provider.enabled) → return error "Provider is disabled"

The lookup uses LLMConfigService's Provider Index, so it runs in O(1) regardless of how many providers are configured.
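A minimal sketch of Step 2, assuming the Provider Index behaves like a `Map` keyed by provider ID. The `ProviderEntry` and `LookupOutcome` types and the `lookupProvider` helper are hypothetical names for illustration; only the two error strings come from the pseudocode above.

```typescript
// Hypothetical minimal provider record; only the fields Step 2 inspects.
interface ProviderEntry {
  id: string;
  enabled: boolean;
}

type LookupOutcome =
  | { ok: true; provider: ProviderEntry }
  | { ok: false; error: string };

// Assumption: the index is a Map keyed by provider ID, giving constant-time lookup.
const providerIndex = new Map<string, ProviderEntry>([
  ["alpha", { id: "alpha", enabled: true }],
  ["beta", { id: "beta", enabled: false }],
]);

function lookupProvider(providerId: string): LookupOutcome {
  const provider = providerIndex.get(providerId);
  if (!provider) return { ok: false, error: "Provider not found" };
  if (!provider.enabled) return { ok: false, error: "Provider is disabled" };
  return { ok: true, provider };
}
```

Returning a discriminated union keeps the "not found" and "disabled" cases distinguishable for the caller without throwing.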

Step 3: API Key Resolution

resolveApiKeyForRequest(provider, providerId, sessionId):
  // Priority 1: Coding Plan override
  if provider.codingPlan?.enabled && provider.codingPlan.apiKey:
    return resolveApiKey(codingPlan.apiKey)

  // Priority 2: API Key Pool (session-affinity weighted round-robin)
  if apiKeyPool available:
    poolKey = sessionId
      ? apiKeyPool.getKeyForSession(providerId, sessionId)
      : apiKeyPool.getKey(providerId)
    if poolKey: return poolKey

  // Priority 3: Legacy single key
  return resolveApiKey(provider.api_key)

resolveApiKey() handles environment variable expansion ($ENV_VAR → process.env.ENV_VAR).
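The three priorities can be sketched in TypeScript as follows. The `KeyedProvider` and `KeyPool` shapes are assumptions that only mirror the field and method names in the pseudocode, and the environment is passed in as a plain object so the sketch runs without `process.env`.

```typescript
// Hypothetical shapes; field names follow the pseudocode above.
interface KeyedProvider {
  api_key?: string;
  codingPlan?: { enabled: boolean; apiKey?: string };
}

interface KeyPool {
  getKey(providerId: string): string | undefined;
  getKeyForSession(providerId: string, sessionId: string): string | undefined;
}

type EnvVars = Record<string, string | undefined>;

// Expand "$ENV_VAR" to its value in `env`; any other string passes through unchanged.
function resolveApiKey(raw: string | undefined, env: EnvVars): string | undefined {
  if (raw && raw.startsWith("$")) return env[raw.slice(1)];
  return raw;
}

function resolveApiKeyForRequest(
  provider: KeyedProvider,
  providerId: string,
  sessionId: string | undefined,
  pool: KeyPool | undefined,
  env: EnvVars,
): string | undefined {
  // Priority 1: Coding Plan override
  if (provider.codingPlan?.enabled && provider.codingPlan.apiKey) {
    return resolveApiKey(provider.codingPlan.apiKey, env);
  }
  // Priority 2: pool key, with session affinity when a session ID is present
  if (pool) {
    const poolKey = sessionId
      ? pool.getKeyForSession(providerId, sessionId)
      : pool.getKey(providerId);
    if (poolKey) return poolKey;
  }
  // Priority 3: legacy single key
  return resolveApiKey(provider.api_key, env);
}
```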

Step 4: API Format Resolution

resolveApiFormat(provider):
  // Priority 1: apiFormat field (v3 preferred)
  if provider.apiFormat: return provider.apiFormat

  // Priority 2: chatApiFormat field (legacy v3)
  if provider.chatApiFormat: return provider.chatApiFormat

  // Priority 3: implicit apiType conversion
  if provider.apiType === 'claudecode' || provider.apiType === 'anthropic': return 'anthropic'
  if provider.apiType === 'google': return 'google'

  // Default: OpenAI format
  return 'openai'
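A typed version of the same priority chain, as a sketch. `FormatSource` is a hypothetical subset of the provider type that carries only the three fields consulted here.

```typescript
type ApiFormat = "openai" | "anthropic" | "google" | "azure-openai" | "openai-response";

// Hypothetical subset of the provider type; the full LLMProvider has more fields.
interface FormatSource {
  apiFormat?: ApiFormat;
  chatApiFormat?: ApiFormat;
  apiType?: string;
}

function resolveApiFormat(provider: FormatSource): ApiFormat {
  // Priority 1 and 2: explicit format fields
  if (provider.apiFormat) return provider.apiFormat;
  if (provider.chatApiFormat) return provider.chatApiFormat;
  // Priority 3: implicit conversion, comparing apiType against each value explicitly
  if (provider.apiType === "claudecode" || provider.apiType === "anthropic") return "anthropic";
  if (provider.apiType === "google") return "google";
  // Default: OpenAI format
  return "openai";
}
```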

Step 5: Handler Dispatch

Select the appropriate handler based on apiFormat:

apiFormat	Non-streaming Handler	Streaming Handler
`openai`	`callOpenAICompletion`	`streamOpenAICompletion`
`openai-response`	`callOpenAIResponseCompletion`	`streamOpenAIResponseCompletion`
`anthropic`	`callAnthropicCompletion`	`streamAnthropicCompletion`
`google`	`callGeminiCompletion`	`streamGeminiCompletion`
`azure-openai`	`callOpenAICompletion`	`streamOpenAICompletion`
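The table maps naturally onto an exhaustive `switch`. The sketch below returns the handler names as strings rather than calling them, since the real handler signatures are not shown here; `WireFormat` and `pickHandlerNames` are hypothetical names.

```typescript
type WireFormat = "openai" | "anthropic" | "google" | "azure-openai" | "openai-response";

// Handler names come from the table above; returned as strings for illustration only.
function pickHandlerNames(format: WireFormat): { direct: string; stream: string } {
  switch (format) {
    case "openai":
    case "azure-openai": // shares the OpenAI handlers
      return { direct: "callOpenAICompletion", stream: "streamOpenAICompletion" };
    case "openai-response":
      return { direct: "callOpenAIResponseCompletion", stream: "streamOpenAIResponseCompletion" };
    case "anthropic":
      return { direct: "callAnthropicCompletion", stream: "streamAnthropicCompletion" };
    case "google":
      return { direct: "callGeminiCompletion", stream: "streamGeminiCompletion" };
    default: {
      // Exhaustiveness check: a new format without a case fails to compile here.
      const unreachable: never = format;
      throw new Error(`Unsupported API format: ${String(unreachable)}`);
    }
  }
}
```

The `never` assignment in the default branch is one way to make the compiler flag a format that was added to the union but not to the dispatch.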

Step 6: Retry + Success Report

result = callHandler(...)
if result failed && apiKeyPool available && sessionId present:
  status = extractHttpStatus(result.error)
  if status in [429, 529, 401, 403]:
    newKey = apiKeyPool.reportError(providerId, sessionId, status)
    if newKey:
      result = callHandler(..., newKey)  // retry once with new key

if result succeeded && apiKeyPool available:
  apiKeyPool.reportSuccess(sessionId)  // reset cooldown counter
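A sketch of the single-retry flow, under stated assumptions: the handler call is abstracted as a `call(apiKey)` function, `RetryPool` mirrors only the two pool methods named above, and `statusFromError` is a local stand-in for `extractHttpStatus`. Unlike the pseudocode, this sketch skips the success report when no session ID is present, because `reportSuccess` takes one.

```typescript
interface CallOutcome {
  success: boolean;
  error?: string;
}

// Hypothetical subset of the pool interface, limited to the methods used in Step 6.
interface RetryPool {
  reportError(providerId: string, sessionId: string, status: number): string | undefined;
  reportSuccess(sessionId: string): void;
}

const RETRYABLE_STATUSES = new Set<number>([429, 529, 401, 403]);

// Local stand-in for extractHttpStatus(): reads "(429):" style codes from the message.
function statusFromError(error: string | undefined): number | null {
  const match = error ? /\((\d{3})\):/.exec(error) : null;
  return match ? Number(match[1]) : null;
}

async function completeWithRetry(
  call: (apiKey: string) => Promise<CallOutcome>,
  firstKey: string,
  providerId: string,
  sessionId: string | undefined,
  pool: RetryPool | undefined,
): Promise<CallOutcome> {
  let outcome = await call(firstKey);
  if (!outcome.success && pool && sessionId) {
    const code = statusFromError(outcome.error);
    if (code !== null && RETRYABLE_STATUSES.has(code)) {
      const newKey = pool.reportError(providerId, sessionId, code);
      if (newKey) outcome = await call(newKey); // retry once with the rotated key
    }
  }
  if (outcome.success && pool && sessionId) pool.reportSuccess(sessionId);
  return outcome;
}
```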

Streaming

SSE Stream Parsing

All streaming handlers use the streamSSEResponse() utility, based on the standard SSE (Server-Sent Events) protocol:

sequenceDiagram
    participant CS as CompletionService
    participant SH as StreamHandler
    participant API as Provider API

    CS->>SH: callStreamHandler(format, provider, key, options)
    SH->>API: POST request (stream: true)
    API-->>SH: SSE stream

    loop Each SSE event
        SH->>SH: Parse event data
        alt content delta
            SH->>CS: callbacks.onDelta(content)
        else reasoning delta
            SH->>CS: callbacks.onReasoning(reasoning)
        else block event
            SH->>CS: callbacks.onBlock(block)
        else [DONE]
            SH->>CS: callbacks.onDone(message, usage, metrics)
        else error
            SH->>CS: callbacks.onError(error)
        end
    end
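The event loop in the diagram depends on splitting the byte stream into SSE events. The internals of `streamSSEResponse()` are not shown here, so the following is a generic sketch of standard SSE framing rather than the project's parser: events end at a blank line, each `data:` line contributes one line of payload, and an incomplete trailing event is kept for the next chunk. `parseSseBuffer` is a hypothetical name, and bare `\r` line endings are not handled.

```typescript
// Parse all complete SSE events in `buffer`; return their data payloads and the unconsumed tail.
function parseSseBuffer(buffer: string): { payloads: string[]; rest: string } {
  const payloads: string[] = [];
  const parts = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = parts.pop() ?? ""; // the last piece may be an incomplete event
  for (const part of parts) {
    const dataLines = part
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).replace(/^ /, "")); // drop "data:" and one optional space
    if (dataLines.length > 0) payloads.push(dataLines.join("\n"));
  }
  return { payloads, rest };
}
```

A caller would append each network chunk to `rest`, parse again, and treat a payload of `[DONE]` as the end-of-stream marker used by OpenAI-style APIs.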

Stream Retry (Pool Mode)

flowchart TD
    Start[completeStream] --> HasPool{apiKeyPool available?}
    HasPool -->|No| DirectCall[Call callStreamHandler directly]
    HasPool -->|Yes| InterceptCall[Call with intercepting callbacks]

    InterceptCall --> StreamDone{Stream completed successfully?}
    StreamDone -->|Yes| ReportSuccess[reportSuccess]
    StreamDone -->|No| CheckStatus{Is 429/529/401/403?}

    CheckStatus -->|Yes| GetNewKey[reportError → get new key]
    CheckStatus -->|No| PropagateError[Propagate error to frontend]

    GetNewKey --> HasNewKey{New key available?}
    HasNewKey -->|Yes| RetryStream[Retry stream with new key]
    HasNewKey -->|No| PropagateError

Key design points for stream retry:

Retryable errors (429/529/401/403) are intercepted by wrapping callbacks.onError
A retryState object reference is used to track error state across closures
On retry, onStart is not triggered again (it was already fired once)
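The design points above can be sketched as a callback wrapper. The `StreamHooks` subset, the field names of `RetryState`, and the `wrapForRetry` helper are all assumptions for illustration; the source only states that `onError` is wrapped, that a shared `retryState` reference is used, and that `onStart` fires once.

```typescript
// Hypothetical subset of StreamCallbacks, limited to what the wrapper touches.
interface StreamHooks {
  onStart?: (messageId: string) => void;
  onDelta?: (content: string) => void;
  onError?: (error: string) => void;
}

// Shared by reference so the wrapper closures and the caller observe the same state.
interface RetryState {
  started: boolean;
  retryableCode: number | null;
  pendingError: string | null;
}

const RETRYABLE_CODES = new Set<number>([429, 529, 401, 403]);

function wrapForRetry(hooks: StreamHooks, state: RetryState): StreamHooks {
  return {
    ...hooks,
    onStart: (messageId) => {
      if (state.started) return; // already fired on the first attempt
      state.started = true;
      hooks.onStart?.(messageId);
    },
    onError: (error) => {
      const match = /\((\d{3})\):/.exec(error);
      const code = match ? Number(match[1]) : null;
      if (code !== null && RETRYABLE_CODES.has(code)) {
        // Hold the error back; the caller retries or forwards pendingError if no key is left.
        state.retryableCode = code;
        state.pendingError = error;
        return;
      }
      hooks.onError?.(error);
    },
  };
}
```

After the first attempt, the caller would check `state.retryableCode`: if the pool returns a new key, it reruns the stream with the same wrapped callbacks; otherwise it passes `state.pendingError` to the original `onError`.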

Anthropic Thinking Budget Calculation

flowchart TD
    Start[resolveThinkingBudget] --> CheckLevel{thinkLevel === 'none'?}
    CheckLevel -->|Yes| NoThinking[Return raw maxTokens<br/>no thinking config]
    CheckLevel -->|No| CheckModel{isReasoningModel?}
    CheckModel -->|No| NoThinking
    CheckModel -->|Yes| CalcBudget[calculateThinkingBudget<br/>model, thinkLevel, maxTokens]

    CalcBudget --> CheckFormat{API format is Anthropic?}
    CheckFormat -->|Yes| AdjustTokens[adjustedMaxTokens = getClaudeMaxTokens<br/>maxTokens - thinkingBudget]
    CheckFormat -->|No| KeepTokens[adjustedMaxTokens = maxTokens]

    AdjustTokens --> BuildConfig[buildAnthropicThinking<br/>generate thinking config object]
    BuildConfig --> Return[Return adjustedMaxTokens + thinkingConfig]
    KeepTokens --> Return

Anthropic special handling: For Claude models, max_tokens includes thinking tokens, so the following is required:

Calculate the thinking budget (thinkingBudget)
Subtract the thinking budget from max_tokens to get adjustedMaxTokens
Generate a thinking config object to include in the request body
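A worked sketch of the three steps. The per-level ratios in `BUDGET_RATIO` are hypothetical placeholders (the real values belong to `shared/thinking-config.ts` and are not given here), and the sketch leaves non-Anthropic formats without a thinking object, which is an assumption. The `{ type: "enabled", budget_tokens }` shape is the thinking field of the Anthropic Messages API.

```typescript
type ThinkLevel = "none" | "low" | "medium" | "high";

// Hypothetical ratios for illustration only.
const BUDGET_RATIO: Record<Exclude<ThinkLevel, "none">, number> = {
  low: 0.25,
  medium: 0.5,
  high: 0.75,
};

interface ThinkingResolution {
  adjustedMaxTokens: number;
  thinking?: { type: "enabled"; budget_tokens: number };
}

function resolveThinkingBudget(
  level: ThinkLevel,
  isReasoningModel: boolean,
  isAnthropicFormat: boolean,
  maxTokens: number,
): ThinkingResolution {
  // No thinking requested, or the model cannot reason: pass maxTokens through.
  if (level === "none" || !isReasoningModel) return { adjustedMaxTokens: maxTokens };
  // Other formats keep maxTokens as-is in this sketch.
  if (!isAnthropicFormat) return { adjustedMaxTokens: maxTokens };
  // Step 1: thinking budget; Step 2: subtract it; Step 3: build the config object.
  const thinkingBudget = Math.floor(maxTokens * BUDGET_RATIO[level]);
  return {
    adjustedMaxTokens: maxTokens - thinkingBudget,
    thinking: { type: "enabled", budget_tokens: thinkingBudget },
  };
}
```

With these placeholder ratios, `maxTokens = 16000` at level `medium` gives a thinking budget of 8000 and an adjusted value of 8000.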

OpenAI Reasoning Models

For OpenAI o-series models (o1, o3, etc.), use the reasoning_effort parameter instead of adjusting the token budget:

if thinkLevel !== 'none':
  effort = getOpenAIReasoningEffort(thinkLevel)
  // 'low' | 'medium' | 'high'
  request.reasoning_effort = effort

max_tokens Resolution Priority

resolveEffectiveMaxTokens(providerId, modelId, sessionMaxTokens):
  // 1. Session-level setting (highest priority, user-set manually, no cap applied)
  if sessionMaxTokens > 0: return sessionMaxTokens

  // 2. Global model parameters (admin-set, no cap applied)
  globalParams = llmConfig.getGlobalModelParameters()
  if globalParams.maxTokens.enabled && value > 0: return value

  // -------- Values below are auto-resolved and capped at MAX_TOKENS_CAP=65536 --------

  // 3. maxTokens from model config
  modelConfig = provider.modelConfigs.find(id === modelId)
  if modelConfig.maxTokens > 0: return min(modelConfig.maxTokens, 65536)

  // 4. maxTokens from model group
  modelGroup = provider.modelGroups.find(models.id === modelId)
  if model.maxTokens > 0: return min(model.maxTokens, 65536)

  // 5. Model discovery cache
  discovered = llmConfig.getDiscoveredModelMaxTokens(providerId, modelId)
  if discovered > 0: return min(discovered, 65536)

  // 6. undefined (let the API use its default)
  return undefined

// For providers requiring max_tokens (e.g. Anthropic):
getRequiredMaxTokens():
  resolved = resolveEffectiveMaxTokens(...)
  return resolved ?? DEFAULT_MAX_TOKENS  // typically 4096
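The priority chain reduces to two groups: manual values returned as-is, and auto-resolved values capped at 65536. In the sketch below the lookups are assumed to have been done already and are passed in through a hypothetical `MaxTokenSources` object; `DEFAULT_MAX_TOKENS` uses the "typically 4096" value mentioned above.

```typescript
const MAX_TOKENS_CAP = 65536;
const DEFAULT_MAX_TOKENS = 4096; // "typically 4096" per the pseudocode above

// Hypothetical container for values the real code reads from config services.
interface MaxTokenSources {
  session?: number;     // 1. user-set, uncapped
  global?: number;      // 2. admin-set, uncapped (present only when enabled)
  modelConfig?: number; // 3. capped
  modelGroup?: number;  // 4. capped
  discovered?: number;  // 5. capped
}

function resolveEffectiveMaxTokens(src: MaxTokenSources): number | undefined {
  // Manual settings win and are never capped.
  for (const manual of [src.session, src.global]) {
    if (manual !== undefined && manual > 0) return manual;
  }
  // Auto-resolved values are capped at MAX_TOKENS_CAP.
  for (const auto of [src.modelConfig, src.modelGroup, src.discovered]) {
    if (auto !== undefined && auto > 0) return Math.min(auto, MAX_TOKENS_CAP);
  }
  return undefined; // let the API use its default
}

// For providers that require max_tokens (e.g. Anthropic).
function getRequiredMaxTokens(src: MaxTokenSources): number {
  return resolveEffectiveMaxTokens(src) ?? DEFAULT_MAX_TOKENS;
}
```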

Tool Call Loop (ToolHandler)

sequenceDiagram
    participant TH as ToolHandler
    participant LLM as LLM API
    participant MCP as MCP Service

    TH->>TH: MAX_ITERATIONS = globalParams.toolMaxTurns ?? 5
    TH->>TH: iteration = 0

    loop iteration < MAX_ITERATIONS
        TH->>TH: iteration++
        TH->>TH: buildToolRequest(format, messages, model, options)
        TH->>LLM: POST request with tools
        LLM-->>TH: SSE stream response

        TH->>TH: extractToolCalls(response)

        alt No tool calls
            TH->>TH: break (LLM finished answering)
        else Has tool calls
            loop Each tool call
                TH->>TH: callbacks.onToolCall(toolCall)
                TH->>MCP: executeToolCalls(toolCalls, mcpService)
                MCP-->>TH: tool results
                TH->>TH: callbacks.onToolResult(id, result)
            end
            TH->>TH: Append tool calls and results to messages
            TH->>TH: buildIterationBlocks(tool call blocks)
        end
    end

    TH->>TH: callbacks.onDone(finalContent, usage)
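The bounded loop in the diagram can be sketched as follows. The model call and tool execution are injected as functions so the sketch stays self-contained; `LoopToolCall`, `LoopTurn`, and `runToolLoop` are hypothetical names, and the transcript is simplified to strings instead of real message objects.

```typescript
interface LoopToolCall {
  id: string;
  name: string;
}

interface LoopTurn {
  content: string;
  toolCalls: LoopToolCall[];
}

async function runToolLoop(
  askModel: (transcript: string[]) => Promise<LoopTurn>,
  runTool: (call: LoopToolCall) => Promise<string>,
  maxIterations = 5, // mirrors globalParams.toolMaxTurns ?? 5
): Promise<{ finalContent: string; iterations: number }> {
  const transcript: string[] = [];
  let finalContent = "";
  let iterations = 0;
  while (iterations < maxIterations) {
    iterations++;
    const turn = await askModel(transcript);
    finalContent = turn.content;
    if (turn.toolCalls.length === 0) break; // the model finished answering
    for (const call of turn.toolCalls) {
      const output = await runTool(call);
      // Append the call and its result so the next request sees them.
      transcript.push(`tool_call:${call.name}`, `tool_result:${call.id}:${output}`);
    }
  }
  return { finalContent, iterations };
}
```

The iteration cap is what prevents a model that keeps requesting tools from looping indefinitely; when the cap is hit, the content of the last turn is returned.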

Key behaviors:

Parameter	Default	Description
`MAX_ITERATIONS`	`globalParams.toolMaxTurns ?? 5`	Maximum iteration count
Tool format	Auto-detected from apiFormat	Tool definitions in OpenAI/Anthropic/Gemini format
Termination condition	No tool calls, or limit reached	Ends naturally when LLM stops requesting tools

Tool calls support three formats, auto-detected by logToolFormat():

OpenAI format: { type: 'function', function: { name, parameters } }
Anthropic format: { name, input_schema }
Gemini format: { functionDeclarations: [...] }
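To make the three shapes concrete, the sketch below converts one neutral tool definition into each format. `NeutralTool` and the three converter functions are hypothetical; the output shapes follow the public OpenAI, Anthropic, and Gemini tool schemas listed above.

```typescript
// Hypothetical neutral definition; `schema` is a JSON Schema object for the arguments.
interface NeutralTool {
  name: string;
  description: string;
  schema: Record<string, unknown>;
}

function toOpenAITool(tool: NeutralTool) {
  return {
    type: "function" as const,
    function: { name: tool.name, description: tool.description, parameters: tool.schema },
  };
}

function toAnthropicTool(tool: NeutralTool) {
  return { name: tool.name, description: tool.description, input_schema: tool.schema };
}

// Gemini groups all declarations under a single functionDeclarations array.
function toGeminiTools(tools: NeutralTool[]) {
  return {
    functionDeclarations: tools.map((tool) => ({
      name: tool.name,
      description: tool.description,
      parameters: tool.schema,
    })),
  };
}
```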

Vision Fallback

When messages contain images but the model does not support vision, an auxiliary vision model is used automatically:

applyVisionFallback(options):
  if no images in messages: return
  if model supports vision: return

  visionModel = llmConfig.resolveEffectiveModels().vision
  if no vision model:
    // Strip images
    for msg in messages:
      msg.images = undefined
    return

  // Use VisionDescriptionService to describe images
  for msg in messages with images:
    description = visionService.describeImages(images, msg.content, visionModel)
    msg.content += "\n\n[Image Description]\n" + description
    msg.images = undefined
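A synchronous sketch of the same fallback. The real `visionService.describeImages` is presumably asynchronous and takes a vision model argument; here it is reduced to an injected `describe` function, where `undefined` stands for "no vision model configured". `VisionMessage` and the string-based image representation are assumptions.

```typescript
// Hypothetical message shape; images are represented as opaque strings.
interface VisionMessage {
  content: string;
  images?: string[];
}

function applyVisionFallback(
  messages: VisionMessage[],
  modelSupportsVision: boolean,
  describe: ((images: string[], prompt: string) => string) | undefined,
): void {
  if (modelSupportsVision) return;
  for (const msg of messages) {
    if (!msg.images || msg.images.length === 0) continue;
    if (describe) {
      // Replace the images with a text description from the auxiliary vision model.
      msg.content += "\n\n[Image Description]\n" + describe(msg.images, msg.content);
    }
    // Strip the images in both cases so a text-only model can accept the message.
    delete msg.images;
  }
}
```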

Error Interception Pattern

extractHttpStatus() extracts the HTTP status code from an error message string:

extractHttpStatus(error: string):
  match = error.match(/\((\d{3})\):/)
  return match ? parseInt(match[1], 10) : null

// Example: "API error (429): Rate limit exceeded" → 429
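A typed, runnable form of the same extraction, as a sketch. It relies on the error string keeping the `(<status>):` pattern shown in the example; messages in any other shape yield `null` and are therefore never retried.

```typescript
// Extract the status from messages shaped like "API error (429): Rate limit exceeded".
function extractHttpStatus(error: string): number | null {
  const match = /\((\d{3})\):/.exec(error);
  return match ? parseInt(match[1] ?? "", 10) : null;
}
```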

IPC Integration Table

IPC Channel	Direction	Router	Description
`completion:complete`	R → M	CompletionRouter	Non-streaming completion
`completion:getModels`	R → M	CompletionRouter	Get available models
`completion:testModel`	R → M	CompletionRouter	Test model connection
Streaming completion	R → M	ChatStreamHandler	SSE stream delivered via IPC messages
Regenerate	R → M	RegenerateHandler	Regenerate a reply

Extension Points

Adding a New API Format

Add the new value to the ApiFormat type in types.ts
Add a URL builder function in url-builder.ts
Add header-building logic in header-builder.ts
Add message format conversion in message-converter.ts
Add a callXxxCompletion function in DirectApiHandler.ts
Add a streamXxxCompletion function in StreamHandler.ts
Add the case to the switch in CompletionService.callDirectHandler() and callStreamHandler()

Custom Retry Strategy

Currently only one retry is attempted. For more complex retry behavior (e.g., multiple retries, varying wait times), modify the retry logic in complete() and completeStream().

Adding New Search Injection

SDK native search (e.g., Anthropic): inject via NativeSearchInjector.applyAugmentation()
Provider-specific search (e.g., model-param / builtin-tool): inject via ProviderSearchInjector

Related Files

File	Relationship
`capabilities/llm/config-service/LLMConfigService.ts`	Provides provider lookup and route resolution
`capabilities/llm/completion/ApiKeyPoolService.ts`	Key selection and load balancing
`infra/utils/sse-parser.ts`	SSE stream parsing utilities
`shared/completion-types.ts`	Shared types such as SimpleChatMessage
`shared/thinking-config.ts`	Thinking budget calculation and reasoning model detection
`shared/llm-config.ts`	LLMProvider type definition
`capabilities/tools/mcp-users/McpService.ts`	Tool execution (called by ToolHandler)
`capabilities/llm/api-converter/openai-to-anthropic.ts`	OpenAI → Anthropic format conversion
`routers/CompletionRouter.ts`	IPC entry point
