ACP Protocol

The Agent Control Protocol specification

Overview

The Agent Control Protocol (ACP) v1 is an open, bidirectional WebSocket protocol that connects application SDKs to the Vocall Engine. All messages are JSON objects identified by a type field. The main channel operates at the /connect endpoint. A separate /ws/stream endpoint carries voice audio.
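Because every message is a JSON object carrying a type field, an SDK can route incoming frames through a small lookup table. The following is a minimal sketch of that idea, not the Vocall SDK's actual implementation: the `Dispatcher` class and its `on`/`dispatch` methods are hypothetical names, and silently ignoring unknown types is an assumption.

```python
import json
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], None]


class Dispatcher:
    """Routes decoded ACP messages to handlers keyed by their "type" field.

    Hypothetical helper for illustration; not part of any published SDK.
    """

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, msg_type: str, handler: Handler) -> None:
        """Register the handler for one message type."""
        self._handlers[msg_type] = handler

    def dispatch(self, raw: str) -> bool:
        """Decode one JSON text frame; return True if a handler ran."""
        msg = json.loads(raw)
        if not isinstance(msg, dict) or not isinstance(msg.get("type"), str):
            raise ValueError("ACP messages must be JSON objects with a string 'type'")
        handler = self._handlers.get(msg["type"])
        if handler is None:
            return False  # assumption: unknown types are skipped, not fatal
        handler(msg)
        return True
```

A handler for status messages would be registered with `dispatcher.on("status", update_indicator)`, where `update_indicator` is whatever callback the application provides.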

Connection Lifecycle

SDK (Client)                              Engine (Server)
    |                                          |
    |-------- WebSocket /connect ------------->|
    |                                          |
    |-- manifest (app, screens, fields) ------>|
    |<---------------------- config (sessionId)|
    |                                          |
    |-- text ("fill the invoice") ------------>|
    |<-------------------- status ("thinking") |
    |<--------------------------- chat_token   |  (streaming)
    |<--------------------------- chat_token   |  (streaming)
    |<--------------------- chat (final: true) |
    |<----- command (seq: 1, actions: [...])   |
    |                                          |
    |   [SDK executes fill, select, click...]  |
    |                                          |
    |-- result (seq: 1, results: [...]) ------>|
    |<---------------------- status ("idle")   |
    |                                          |

Client-to-Server Messages

These messages are sent from the SDK to the Vocall Engine.

manifest

Sent once on connection. Describes the entire application UI.

{
  "type": "manifest",
  "app": "invoicing",
  "version": "1.0.0",
  "currentScreen": "create_invoice",
  "screens": {
    "create_invoice": {
      "id": "create_invoice",
      "label": "Create Invoice",
      "route": "/invoices/new",
      "fields": [
        { "id": "client_name", "type": "autocomplete", "label": "Client", "required": true },
        { "id": "amount", "type": "currency", "label": "Amount" }
      ],
      "actions": [
        { "id": "submit", "label": "Submit", "requiresConfirmation": true }
      ],
      "modals": [
        { "id": "client_picker", "label": "Select Client", "searchable": true }
      ]
    }
  },
  "user": { "name": "Jane", "email": "[email protected]", "org": "Acme" },
  "persona": { "name": "Assistant", "role": "Invoice helper", "instructions": "Be concise." },
  "context": {}
}

text

User chat message.

{ "type": "text", "message": "Create an invoice for Acme Corp, $5000" }

state

Reports the current field state for a screen. Sent proactively or in response to engine requests.

{
  "type": "state",
  "screen": "create_invoice",
  "fields": {
    "client_name": { "value": "Acme Corp", "valid": true, "dirty": true },
    "amount": { "value": "", "valid": false, "error": "Required", "dirty": false }
  },
  "canSubmit": false
}

result

Reports the outcome of executing a command's actions. Includes the sequence number from the original command.

{
  "type": "result",
  "seq": 1,
  "results": [
    { "index": 0, "success": true },
    { "index": 1, "success": true },
    { "index": 2, "success": false, "error": "Field not found: unknown_field" }
  ],
  "state": {
    "screen": "create_invoice",
    "fields": { "client_name": { "value": "Acme Corp", "valid": true } },
    "canSubmit": true
  }
}

confirm

User response to an ask_confirm action.

{ "type": "confirm", "seq": 1, "confirmed": true }

tts_state

Reports whether TTS audio is currently playing on the client.

{ "type": "tts_state", "active": true }

stt_config

Switches the speech-to-text profile.

{ "type": "stt_config", "profile": "pt-BR-medical" }

llm_config

Switches the LLM provider at runtime without reconnecting.

{ "type": "llm_config", "provider": "deepseek" }

tts_config

Switches the text-to-speech voice.

{ "type": "tts_config", "voice": "alloy" }

response_lang_config

Changes the language used for agent responses.

{ "type": "response_lang_config", "language": "pt-BR" }

Server-to-Client Messages

These messages are sent from the Vocall Engine to the SDK.

config

Sent by the engine at the start of the session; in the lifecycle diagram it follows the client's manifest. Contains the session ID, feature flags, and available LLM providers.

{
  "type": "config",
  "sessionId": "sess_abc123",
  "features": { "voice": true, "chat": true },
  "providers": [
    { "id": "openai", "name": "OpenAI", "model": "gpt-4o" },
    { "id": "deepseek", "name": "DeepSeek", "model": "deepseek-chat" }
  ],
  "current_provider": "openai"
}

command

Contains an array of UI actions to execute. The SDK echoes the seq value in its result message so the engine can match each result to the command that produced it.

{
  "type": "command",
  "seq": 1,
  "actions": [
    { "do": "fill", "field": "client_name", "value": "Acme Corp", "animate": "typewriter", "speed": 30 },
    { "do": "fill", "field": "amount", "value": "5000", "animate": "count_up", "duration": 800 },
    { "do": "select", "field": "tax_rate", "value": "10" },
    { "do": "click", "action": "submit" }
  ]
}
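One way an SDK might turn a command into the matching result message is sketched below. It assumes a hypothetical `execute` callback that raises an exception when an action fails, runs the actions one at a time for simplicity, and omits the state object that a result may carry.

```python
from typing import Any, Callable, Dict, List


def build_result(
    command: Dict[str, Any],
    execute: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    """Run each action and collect one outcome entry per action index.

    `execute` is a hypothetical application callback: it performs one
    action and raises an exception if the action cannot be completed.
    """
    results: List[Dict[str, Any]] = []
    for index, action in enumerate(command["actions"]):
        try:
            execute(action)
            results.append({"index": index, "success": True})
        except Exception as exc:
            # Report the failure but keep going with the remaining actions.
            results.append({"index": index, "success": False, "error": str(exc)})
    # Echo the command's seq so the engine can correlate the two messages.
    return {"type": "result", "seq": command["seq"], "results": results}
```

Continuing after a failed action, rather than aborting the whole command, is a design choice of this sketch; the source does not say which behavior the engine expects.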

chat

A complete chat message from the agent or the system, or an echo of the user's own message.

{ "type": "chat", "from": "agent", "message": "I have filled in the invoice. Ready to submit?", "final": true }

The from field is one of: agent, user, or system.

chat_token

Streaming token for incremental chat display.

{ "type": "chat_token", "token": "I have" }
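A client can show streamed text by buffering chat_token messages and replacing the buffer when the complete chat message arrives. This sketch assumes tokens are concatenated verbatim (no separator is inserted) and that a final agent chat message supersedes the buffered tokens; `ChatStream` is a hypothetical name.

```python
from typing import Any, Dict, List, Tuple


class ChatStream:
    """Accumulates streamed tokens and records completed chat messages."""

    def __init__(self) -> None:
        self.partial: str = ""  # text streamed so far for the in-progress reply
        self.messages: List[Tuple[str, str]] = []  # (from, message) pairs

    def handle(self, msg: Dict[str, Any]) -> None:
        if msg["type"] == "chat_token":
            # Assumption: tokens already contain any whitespace they need.
            self.partial += msg["token"]
        elif msg["type"] == "chat":
            self.messages.append((msg["from"], msg["message"]))
            if msg["from"] == "agent" and msg.get("final"):
                # The complete message replaces the streamed preview.
                self.partial = ""
```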

status

Agent status update. Used to drive UI indicators (spinners, recording icons, etc.).

{ "type": "status", "status": "thinking" }

Valid statuses: idle, listening, recording, thinking, speaking, executing.
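Modeling the status values as an enumeration lets a client reject unknown values early. The sketch below uses the six statuses listed above; the indicator names in `INDICATORS` are purely illustrative assumptions and do not come from the protocol.

```python
from enum import Enum
from typing import Any, Dict


class AgentStatus(str, Enum):
    IDLE = "idle"
    LISTENING = "listening"
    RECORDING = "recording"
    THINKING = "thinking"
    SPEAKING = "speaking"
    EXECUTING = "executing"


# Hypothetical mapping from status to a UI indicator name.
INDICATORS = {
    AgentStatus.IDLE: "none",
    AgentStatus.LISTENING: "microphone",
    AgentStatus.RECORDING: "recording_icon",
    AgentStatus.THINKING: "spinner",
    AgentStatus.SPEAKING: "speaker",
    AgentStatus.EXECUTING: "spinner",
}


def parse_status(msg: Dict[str, Any]) -> AgentStatus:
    """Return the status enum member; raises ValueError for unknown values."""
    return AgentStatus(msg["status"])
```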

error

Error notification with an optional code.

{ "type": "error", "code": "MANIFEST_INVALID", "message": "Screen 'checkout' has no fields" }

tts_start

Indicates TTS audio playback has started for a sentence.

{ "type": "tts_start", "sentence": "I have filled in the invoice details." }

tts_end

Indicates TTS audio playback has ended for a sentence.

{ "type": "tts_end", "sentence": "I have filled in the invoice details." }

UI Actions Reference

Each action in a command message is an object with a do field that identifies the action type.

Sequential Actions

These execute one at a time, in order, because each may change application state.

| Action | Required Fields | Optional Fields | Description |
|--------|----------------|-----------------|-------------|
| navigate | screen | -- | Navigate to a different screen |
| click | action | -- | Trigger a registered action callback |
| show_toast | message | level, duration | Display a toast notification |
| ask_confirm | message | -- | Prompt user for yes/no confirmation |
| open_modal | modal | -- | Open a named modal dialog |
| close_modal | modal | -- | Close a named modal dialog |

Parallel Actions

These execute concurrently for faster UI updates.

| Action | Required Fields | Optional Fields | Description |
|--------|----------------|-----------------|-------------|
| fill | field, value | animate, speed, duration | Set a field's value |
| clear | field | -- | Clear a field's value |
| select | field, value | -- | Set a select/dropdown value |
| highlight | field | duration | Scroll to and visually highlight a field |
| focus | field | -- | Focus a field for input |
| scroll_to | field | -- | Scroll a field into the viewport |
| enable | field | -- | Enable a previously disabled field |
| disable | field | -- | Disable a field |
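The split between sequential and parallel actions can be sketched with asyncio. How a command that mixes both kinds is scheduled is not stated in the source; this sketch assumes that runs of consecutive parallel actions are executed together and that each sequential action runs alone, in list order. `plan_batches`, `run_actions`, and the `execute` coroutine are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Action = Dict[str, Any]

# Action types listed in the Sequential Actions table.
SEQUENTIAL = {"navigate", "click", "show_toast", "ask_confirm", "open_modal", "close_modal"}


def plan_batches(actions: List[Action]) -> List[List[Action]]:
    """Group consecutive parallel actions; keep each sequential action alone."""
    batches: List[List[Action]] = []
    for action in actions:
        is_parallel = action["do"] not in SEQUENTIAL
        last_is_parallel = bool(batches) and batches[-1][0]["do"] not in SEQUENTIAL
        if is_parallel and last_is_parallel:
            batches[-1].append(action)
        else:
            batches.append([action])
    return batches


async def run_actions(
    actions: List[Action],
    execute: Callable[[Action], Awaitable[None]],
) -> None:
    """Await batches in order; actions inside one batch run concurrently."""
    for batch in plan_batches(actions):
        await asyncio.gather(*(execute(action) for action in batch))
```

Under this assumption, a command containing fill, fill, select, click yields two batches: the three field updates run concurrently, then the click runs once they have finished.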

Every action is retried automatically: up to 3 attempts, with a 300 ms delay between attempts.
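The retry policy can be expressed as a small wrapper. This is a sketch under the stated numbers (3 attempts, 300 ms between them); `with_retry` is a hypothetical helper, and treating any raised exception as a retryable failure is an assumption.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retry(
    action: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 0.3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:  # assumption: every failure is retryable
            last_error = exc
            if attempt < attempts:
                sleep(delay_s)  # wait only between attempts, not after the last
    assert last_error is not None
    raise last_error
```

The `sleep` parameter exists so the delay can be replaced in tests; production code would leave it at the default.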

Animation Options

The fill action supports an animate property to control how values appear:

| Animation | Behavior | Typical Use |
|-----------|----------|-------------|
| typewriter | Characters appear one at a time at speed ms per character | Text fields, names, descriptions |
| count_up | Number counts from 0 to target over duration ms | Currency and number fields |
| fade_in | Value fades in over duration ms | Any field type |
| none | Value appears instantly (default) | Hidden fields, programmatic fills |
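To make the timing parameters concrete, the sketch below computes the intermediate frames a client might render for typewriter and count_up. The frame representation (elapsed milliseconds paired with the value to display), the linear interpolation, and the `steps` parameter are all assumptions for illustration; the protocol only specifies speed and duration.

```python
from typing import List, Tuple


def typewriter_frames(value: str, speed_ms: int) -> List[Tuple[int, str]]:
    """One frame per character: (elapsed ms, text shown so far)."""
    return [(i * speed_ms, value[:i]) for i in range(1, len(value) + 1)]


def count_up_frames(target: float, duration_ms: int, steps: int = 4) -> List[Tuple[int, float]]:
    """Linear count from 0 to target in `steps` equal increments."""
    return [(duration_ms * i // steps, target * i / steps) for i in range(steps + 1)]


print(typewriter_frames("Acme", 30))
# [(30, 'A'), (60, 'Ac'), (90, 'Acm'), (120, 'Acme')]
print(count_up_frames(5000, 800))
# [(0, 0.0), (200, 1250.0), (400, 2500.0), (600, 3750.0), (800, 5000.0)]
```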

Toast Levels

The show_toast action accepts a level property:

| Level | Appearance |
|-------|------------|
| info | Informational (default) |
| success | Green / positive |
| warning | Yellow / caution |
| error | Red / failure |

Field Types

The manifest supports 15 field types. Each field in a screen's fields array requires id, type, and label.

| Type | Description | Notable Properties |
|------|-------------|-------------------|
| text | Standard text input | maxLength, placeholder |
| number | Numeric input | min, max |
| currency | Monetary value input | min, max |
| date | Date picker | -- |
| datetime | Date and time picker | -- |
| email | Email with validation | placeholder |
| phone | Phone number input | mask |
| masked | Input with mask pattern (e.g. ###.###.###-##) | mask (required) |
| select | Dropdown select | options (array of {value, label}) |
| autocomplete | Typeahead input | options, source |
| checkbox | Boolean checkbox | -- |
| radio | Radio button group | options |
| textarea | Multi-line text area | maxLength, placeholder |
| file | File upload | -- |
| hidden | Hidden field (not rendered) | -- |

Fields also support required (boolean), readOnly (boolean), and placeholder (string).
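The field rules above (id, type, and label are required; type is one of the 15 listed; masked needs a mask) lend themselves to a check before the manifest is sent. This is a minimal sketch; `validate_field` and its error strings are hypothetical, and the engine's own validation may differ.

```python
from typing import Any, Dict, List

# The 15 field types from the table above.
FIELD_TYPES = {
    "text", "number", "currency", "date", "datetime", "email", "phone",
    "masked", "select", "autocomplete", "checkbox", "radio", "textarea",
    "file", "hidden",
}


def validate_field(field: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the field looks valid."""
    errors: List[str] = []
    for key in ("id", "type", "label"):
        if not field.get(key):
            errors.append(f"missing required property: {key}")
    field_type = field.get("type")
    if field_type and field_type not in FIELD_TYPES:
        errors.append(f"unknown field type: {field_type}")
    if field_type == "masked" and not field.get("mask"):
        errors.append("masked fields require a mask")
    return errors
```

Returning a list of messages instead of raising on the first problem lets the caller report every issue in a manifest at once.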

Voice Channel

The voice channel operates on a separate WebSocket at /ws/stream. It carries binary audio frames (PCM) and JSON control messages on the same connection.

Client-to-server voice messages:

| Type | Description |
|------|-------------|
| config | Audio configuration (sample rate, encoding mode) |
| eof | End of audio stream (user stopped speaking) |
| tts_state | Whether TTS is currently playing |
| interrupt | Interrupt current TTS playback |

Server-to-client voice messages:

| Type | Description |
|------|-------------|
| wake_word | Wake word was detected in the audio stream |
| partial | Interim (non-final) transcription |
| transcription | Final transcription of user speech |
| llm_start | LLM processing has started |
| llm_token | Streaming LLM token |
| llm_end | LLM processing is complete |
| tts_start | TTS audio generation started |
| tts_end | TTS audio generation complete |

Binary audio frames from the server (TTS audio) are interleaved with JSON control messages on the same WebSocket. SDKs distinguish between them by checking the WebSocket frame type (binary vs text).
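In Python, WebSocket client libraries commonly deliver text frames as `str` and binary frames as `bytes`, so the frame-type check reduces to a type test. The sketch below assumes that convention; `route_frame` and its two callbacks are hypothetical names.

```python
import json
from typing import Any, Callable, Dict, Union


def route_frame(
    frame: Union[str, bytes, bytearray],
    on_audio: Callable[[bytes], None],
    on_control: Callable[[Dict[str, Any]], None],
) -> None:
    """Send binary frames to the audio path and text frames to the JSON path."""
    if isinstance(frame, (bytes, bytearray)):
        on_audio(bytes(frame))  # raw TTS audio, passed through undecoded
    else:
        on_control(json.loads(frame))  # JSON control message such as tts_start
```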
