ACP Protocol
The Agent Control Protocol specification
Overview
The Agent Control Protocol (ACP) v1 is an open, bidirectional WebSocket protocol that connects application SDKs to the Vocall Engine. All messages are JSON objects identified by a type field. The main channel operates at the /connect endpoint. A separate /ws/stream endpoint carries voice audio.
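Because every message carries a type field, an SDK can route incoming frames with a small lookup table. The sketch below is illustrative only: `Dispatcher`, `on`, and `dispatch` are hypothetical names, not part of any published Vocall SDK, and collecting unknown types in a list is an assumption rather than specified behavior.

```python
import json
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class Dispatcher:
    """Routes decoded ACP messages to handlers keyed by their "type" field."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        # Messages with no registered handler are kept here (assumed policy).
        self.unhandled: List[Dict[str, Any]] = []

    def on(self, msg_type: str, handler: Handler) -> None:
        self._handlers[msg_type] = handler

    def dispatch(self, raw: str) -> None:
        msg = json.loads(raw)
        if not isinstance(msg, dict) or not isinstance(msg.get("type"), str):
            raise ValueError("ACP message must be a JSON object with a string 'type'")
        handler = self._handlers.get(msg["type"])
        if handler is None:
            self.unhandled.append(msg)
        else:
            handler(msg)
```

A caller would register one handler per message type it cares about, for example `dispatcher.on("status", update_spinner)`, and pass each text frame from /connect to `dispatch`.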
Connection Lifecycle
SDK (Client)                 Engine (Server)
|                                          |
|-------- WebSocket /connect ------------->|
|                                          |
|-- manifest (app, screens, fields) ------>|
|<---------------------- config (sessionId)|
|                                          |
|-- text ("fill the invoice") ------------>|
|<-------------------- status ("thinking") |
|<--------------------------- chat_token   |  (streaming)
|<--------------------------- chat_token   |  (streaming)
|<--------------------- chat (final: true) |
|<----- command (seq: 1, actions: [...])   |
|                                          |
|  [SDK executes fill, select, click...]   |
|                                          |
|-- result (seq: 1, results: [...]) ------>|
|<---------------------- status ("idle")   |
|                                          |
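The opening handshake in the diagram can be sketched as a small session object. This is a minimal illustration, not the SDK's implementation: `Session`, `outbox`, and the rule that text is refused until config arrives are assumptions made for the example, and `outbox` stands in for a real WebSocket send call.

```python
import json
from typing import Any, Dict, List, Optional


class Session:
    """Tracks the order shown in the diagram: manifest out, config in, then text."""

    def __init__(self, manifest: Dict[str, Any]) -> None:
        self.manifest = manifest
        self.session_id: Optional[str] = None
        self.outbox: List[str] = []  # stand-in for websocket.send()

    def on_open(self) -> None:
        # The first frame after the socket opens is the manifest.
        self.outbox.append(json.dumps({**self.manifest, "type": "manifest"}))

    def on_message(self, raw: str) -> None:
        msg = json.loads(raw)
        if msg.get("type") == "config":
            self.session_id = msg["sessionId"]

    def send_text(self, message: str) -> None:
        # Assumed guard: wait for config before sending user text.
        if self.session_id is None:
            raise RuntimeError("config not received yet")
        self.outbox.append(json.dumps({"type": "text", "message": message}))
```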
Client-to-Server Messages
These messages are sent from the SDK to the Vocall Engine.
manifest
Sent once on connection. Describes the entire application UI.
{
  "type": "manifest",
  "app": "invoicing",
  "version": "1.0.0",
  "currentScreen": "create_invoice",
  "screens": {
    "create_invoice": {
      "id": "create_invoice",
      "label": "Create Invoice",
      "route": "/invoices/new",
      "fields": [
        { "id": "client_name", "type": "autocomplete", "label": "Client", "required": true },
        { "id": "amount", "type": "currency", "label": "Amount" }
      ],
      "actions": [
        { "id": "submit", "label": "Submit", "requiresConfirmation": true }
      ],
      "modals": [
        { "id": "client_picker", "label": "Select Client", "searchable": true }
      ]
    }
  },
  "user": { "name": "Jane", "email": "[email protected]", "org": "Acme" },
  "persona": { "name": "Assistant", "role": "Invoice helper", "instructions": "Be concise." },
  "context": {}
}
text
User chat message.
{ "type": "text", "message": "Create an invoice for Acme Corp, $5000" }
state
Reports the current field state for a screen. Sent proactively or in response to engine requests.
{
  "type": "state",
  "screen": "create_invoice",
  "fields": {
    "client_name": { "value": "Acme Corp", "valid": true, "dirty": true },
    "amount": { "value": "", "valid": false, "error": "Required", "dirty": false }
  },
  "canSubmit": false
}
result
Reports the outcome of executing a command's actions. Includes the sequence number from the original command.
{
  "type": "result",
  "seq": 1,
  "results": [
    { "index": 0, "success": true },
    { "index": 1, "success": true },
    { "index": 2, "success": false, "error": "Field not found: unknown_field" }
  ],
  "state": {
    "screen": "create_invoice",
    "fields": { "client_name": { "value": "Acme Corp", "valid": true } },
    "canSubmit": true
  }
}
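One way to assemble a result message is to run each action, record its outcome by index, and echo the command's seq. The helper below is a sketch under assumptions: `build_result` and the `run_action` callback are hypothetical names, a raised exception is treated as a failed action, and the optional state snapshot is left out for brevity.

```python
from typing import Any, Callable, Dict, List


def build_result(
    command: Dict[str, Any],
    run_action: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    """Runs every action in a command and returns the matching result message."""
    results: List[Dict[str, Any]] = []
    for index, action in enumerate(command["actions"]):
        try:
            run_action(action)
            results.append({"index": index, "success": True})
        except Exception as exc:
            # Assumed convention: the exception text becomes the "error" string.
            results.append({"index": index, "success": False, "error": str(exc)})
    # seq is copied from the command so the engine can correlate the two.
    return {"type": "result", "seq": command["seq"], "results": results}
```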
confirm
User response to an ask_confirm action.
{ "type": "confirm", "seq": 1, "confirmed": true }
tts_state
Reports whether TTS audio is currently playing on the client.
{ "type": "tts_state", "active": true }
stt_config
Switches the speech-to-text profile.
{ "type": "stt_config", "profile": "pt-BR-medical" }
llm_config
Switches the LLM provider at runtime without reconnecting.
{ "type": "llm_config", "provider": "deepseek" }
tts_config
Switches the text-to-speech voice.
{ "type": "tts_config", "voice": "alloy" }
response_lang_config
Changes the language used for agent responses.
{ "type": "response_lang_config", "language": "pt-BR" }
Server-to-Client Messages
These messages are sent from the Vocall Engine to the SDK.
config
Sent once the connection is established; in the lifecycle diagram it follows the client's manifest. Contains the session ID, feature flags, and available LLM providers.
{
  "type": "config",
  "sessionId": "sess_abc123",
  "features": { "voice": true, "chat": true },
  "providers": [
    { "id": "openai", "name": "OpenAI", "model": "gpt-4o" },
    { "id": "deepseek", "name": "DeepSeek", "model": "deepseek-chat" }
  ],
  "current_provider": "openai"
}
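The providers list in config pairs naturally with the client's llm_config message. As a hedged sketch, a client might check a requested provider against that list before asking the engine to switch; `switch_provider` is a hypothetical helper, and validating on the client side is an assumption, not a documented requirement.

```python
from typing import Any, Dict


def switch_provider(config: Dict[str, Any], provider_id: str) -> Dict[str, Any]:
    """Builds an llm_config message if provider_id was advertised in config."""
    available = {p["id"] for p in config.get("providers", [])}
    if provider_id not in available:
        raise ValueError(
            f"unknown provider {provider_id!r}; available: {sorted(available)}"
        )
    return {"type": "llm_config", "provider": provider_id}
```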
command
Contains an array of UI actions to execute. The SDK echoes the seq field in its result message so the engine can match each result to the command that produced it.
{
  "type": "command",
  "seq": 1,
  "actions": [
    { "do": "fill", "field": "client_name", "value": "Acme Corp", "animate": "typewriter", "speed": 30 },
    { "do": "fill", "field": "amount", "value": "5000", "animate": "count_up", "duration": 800 },
    { "do": "select", "field": "tax_rate", "value": "10" },
    { "do": "click", "action": "submit" }
  ]
}
chat
Complete chat message from the agent, user echo, or system.
{ "type": "chat", "from": "agent", "message": "I have filled in the invoice. Ready to submit?", "final": true }
The from field is one of: agent, user, or system.
chat_token
Streaming token for incremental chat display.
{ "type": "chat_token", "token": "I have" }
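A client can show streamed text by appending each chat_token and then swapping in the complete chat message when it arrives. The sketch below assumes tokens are concatenated verbatim and that a chat message with final set to true supersedes whatever was accumulated; `ChatStream` and `feed` are hypothetical names.

```python
from typing import Any, Dict, Optional


class ChatStream:
    """Accumulates chat_token messages until the final chat message arrives."""

    def __init__(self) -> None:
        self.partial = ""  # text to display while streaming

    def feed(self, msg: Dict[str, Any]) -> Optional[str]:
        """Returns the completed message on a final chat, otherwise None."""
        if msg["type"] == "chat_token":
            self.partial += msg["token"]
            return None
        if msg["type"] == "chat" and msg.get("final"):
            # Assumed: the final message replaces the streamed preview.
            self.partial = ""
            return msg["message"]
        return None
```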
status
Agent status update. Used to drive UI indicators (spinners, recording icons, etc.).
{ "type": "status", "status": "thinking" }
Valid statuses: idle, listening, recording, thinking, speaking, executing.
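The six statuses map cleanly onto an enum, which makes it easy to reject or ignore values the client does not recognize. In this sketch the indicator names ("spinner", "mic", and so on) are invented for illustration, and treating an unknown status as "no indicator" is an assumption.

```python
from enum import Enum
from typing import Any, Dict, Optional


class Status(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    RECORDING = "recording"
    THINKING = "thinking"
    SPEAKING = "speaking"
    EXECUTING = "executing"


# Hypothetical indicator names; the protocol only says statuses drive UI indicators.
INDICATORS: Dict[Status, Optional[str]] = {
    Status.IDLE: None,
    Status.LISTENING: "mic",
    Status.RECORDING: "recording-icon",
    Status.THINKING: "spinner",
    Status.SPEAKING: "speaker",
    Status.EXECUTING: "spinner",
}


def indicator_for(msg: Dict[str, Any]) -> Optional[str]:
    """Maps a status message to an indicator name, or None if nothing should show."""
    try:
        status = Status(msg["status"])
    except ValueError:
        return None  # unknown status: show nothing (assumed policy)
    return INDICATORS[status]
```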
error
Error notification with an optional code.
{ "type": "error", "code": "MANIFEST_INVALID", "message": "Screen 'checkout' has no fields" }
tts_start
Indicates TTS audio playback has started for a sentence.
{ "type": "tts_start", "sentence": "I have filled in the invoice details." }
tts_end
Indicates TTS audio playback has ended for a sentence.
{ "type": "tts_end", "sentence": "I have filled in the invoice details." }
UI Actions Reference
Each action in a command message is an object with a do field that identifies the action type.
Sequential Actions
These execute one at a time, in order, because each may change application state.
| Action | Required Fields | Optional Fields | Description |
|--------|----------------|-----------------|-------------|
| navigate | screen | -- | Navigate to a different screen |
| click | action | -- | Trigger a registered action callback |
| show_toast | message | level, duration | Display a toast notification |
| ask_confirm | message | -- | Prompt user for yes/no confirmation |
| open_modal | modal | -- | Open a named modal dialog |
| close_modal | modal | -- | Close a named modal dialog |
Parallel Actions
These execute concurrently for faster UI updates.
| Action | Required Fields | Optional Fields | Description |
|--------|----------------|-----------------|-------------|
| fill | field, value | animate, speed, duration | Set a field's value |
| clear | field | -- | Clear a field's value |
| select | field, value | -- | Set a select/dropdown value |
| highlight | field | duration | Scroll to and visually highlight a field |
| focus | field | -- | Focus a field for input |
| scroll_to | field | -- | Scroll a field into the viewport |
| enable | field | -- | Enable a previously disabled field |
| disable | field | -- | Disable a field |
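The split between sequential and parallel actions can be sketched with asyncio. The protocol text does not say how the two kinds interleave within one command, so this example assumes that runs of consecutive parallel actions are batched and awaited together, and that each sequential action acts as a barrier. `run_actions` and the `execute` callback are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Action = Dict[str, Any]

# Action types listed in the Sequential Actions table.
SEQUENTIAL = {"navigate", "click", "show_toast", "ask_confirm", "open_modal", "close_modal"}


async def run_actions(
    actions: List[Action],
    execute: Callable[[Action], Awaitable[None]],
) -> None:
    """Runs parallel actions in concurrent batches and sequential ones one by one."""
    batch: List[Action] = []
    for action in actions:
        if action["do"] in SEQUENTIAL:
            # Flush pending parallel work first so command order is preserved.
            if batch:
                await asyncio.gather(*(execute(a) for a in batch))
                batch = []
            await execute(action)
        else:
            batch.append(action)
    if batch:
        await asyncio.gather(*(execute(a) for a in batch))
```

With the command example from earlier, both fill actions and the select would start together, and the click on submit would begin only after all three finish.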
All actions are retried automatically: up to 3 attempts, with a 300 ms delay between attempts.
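The retry policy (3 attempts, 300 ms apart) might look like the wrapper below. This is a sketch, not the SDK's code: `with_retry` is a hypothetical name, and treating any raised exception as a retryable failure is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retry(
    op: Callable[[], Awaitable[T]],
    attempts: int = 3,
    delay_s: float = 0.3,
) -> T:
    """Awaits op(), retrying on failure up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await op()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay_s)
    raise AssertionError("unreachable")
```

The error raised after the last attempt is what a caller would report in the corresponding entry of the result message.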
Animation Options
The fill action supports an animate property to control how values appear:
| Animation | Behavior | Typical Use |
|-----------|----------|-------------|
| typewriter | Characters appear one at a time at speed ms per character | Text fields, names, descriptions |
| count_up | Number counts from 0 to target over duration ms | Currency and number fields |
| fade_in | Value fades in over duration ms | Any field type |
| none | Value appears instantly (default) | Hidden fields, programmatic fills |
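To make the typewriter and count_up behaviors concrete, the functions below compute the intermediate values a client could render. They are illustrative only: the function names are hypothetical, and `frame_ms` (how often count_up redraws) is an assumed parameter that the protocol does not define.

```python
from typing import List


def typewriter_frames(value: str) -> List[str]:
    """Successive prefixes of value; a client would show one every `speed` ms."""
    return [value[:i] for i in range(1, len(value) + 1)]


def count_up_frames(target: float, duration_ms: int, frame_ms: int = 16) -> List[float]:
    """Evenly spaced values from 0 to target, one per frame_ms, spanning duration_ms."""
    steps = max(1, duration_ms // frame_ms)
    frames = [target * i / steps for i in range(steps)]
    frames.append(target)  # end exactly on the target value
    return frames
```

For the amount field in the command example, `count_up_frames(5000, 800)` yields the sequence a client would step through over the 800 ms duration.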
Toast Levels
The show_toast action accepts a level property:
| Level | Appearance |
|-------|------------|
| info | Informational (default) |
| success | Green / positive |
| warning | Yellow / caution |
| error | Red / failure |
Field Types
The manifest supports 15 field types. Each field in a screen's fields array requires id, type, and label.
| Type | Description | Notable Properties |
|------|-------------|-------------------|
| text | Standard text input | maxLength, placeholder |
| number | Numeric input | min, max |
| currency | Monetary value input | min, max |
| date | Date picker | -- |
| datetime | Date and time picker | -- |
| email | Email with validation | placeholder |
| phone | Phone number input | mask |
| masked | Input with mask pattern (e.g. ###.###.###-##) | mask (required) |
| select | Dropdown select | options (array of {value, label}) |
| autocomplete | Typeahead input | options, source |
| checkbox | Boolean checkbox | -- |
| radio | Radio button group | options |
| textarea | Multi-line text area | maxLength, placeholder |
| file | File upload | -- |
| hidden | Hidden field (not rendered) | -- |
Fields also support required (boolean), readOnly (boolean), and placeholder (string).
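The field rules above (id, type, and label are required; type must be one of the 15 listed; masked needs a mask) can be checked before a manifest is sent. `validate_field` is a hypothetical helper, and the error strings are invented for the example; the engine's own validation messages may differ.

```python
from typing import Any, Dict, List

# The 15 field types from the table above.
FIELD_TYPES = {
    "text", "number", "currency", "date", "datetime",
    "email", "phone", "masked", "select", "autocomplete",
    "checkbox", "radio", "textarea", "file", "hidden",
}


def validate_field(field: Dict[str, Any]) -> List[str]:
    """Returns a list of problems with a manifest field; empty means it looks valid."""
    errors: List[str] = []
    for key in ("id", "type", "label"):
        if not field.get(key):
            errors.append(f"missing required property: {key}")
    ftype = field.get("type")
    if ftype and ftype not in FIELD_TYPES:
        errors.append(f"unknown field type: {ftype}")
    if ftype == "masked" and not field.get("mask"):
        errors.append("masked fields require a mask")
    return errors
```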
Voice Channel
The voice channel operates on a separate WebSocket at /ws/stream. It carries binary audio frames (PCM) and JSON control messages on the same connection.
Client-to-server voice messages:
| Type | Description |
|------|-------------|
| config | Audio configuration (sample rate, encoding mode) |
| eof | End of audio stream (user stopped speaking) |
| tts_state | Whether TTS is currently playing |
| interrupt | Interrupt current TTS playback |
Server-to-client voice messages:
| Type | Description |
|------|-------------|
| wake_word | Wake word was detected in the audio stream |
| partial | Interim (non-final) transcription |
| transcription | Final transcription of user speech |
| llm_start | LLM processing has started |
| llm_token | Streaming LLM token |
| llm_end | LLM processing is complete |
| tts_start | TTS audio generation started |
| tts_end | TTS audio generation complete |
Binary audio frames from the server (TTS audio) are interleaved with JSON control messages on the same WebSocket. SDKs distinguish between them by checking the WebSocket frame type (binary vs text).
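Separating audio from control messages on /ws/stream comes down to a type check on each received frame. The sketch below assumes a WebSocket library that surfaces binary frames as `bytes` and text frames as `str`; `route_voice_frame` and its two callbacks are hypothetical names.

```python
import json
from typing import Any, Callable, Dict, Union


def route_voice_frame(
    frame: Union[bytes, bytearray, memoryview, str],
    on_audio: Callable[[bytes], None],
    on_control: Callable[[Dict[str, Any]], None],
) -> None:
    """Sends binary frames to the audio path and text frames to the JSON path."""
    if isinstance(frame, (bytes, bytearray, memoryview)):
        on_audio(bytes(frame))  # TTS audio
    else:
        on_control(json.loads(frame))  # e.g. partial, transcription, tts_start
```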