Getting Started

Introduction

What is Primoia and how it works

What is Primoia?

Primoia is an AI platform that lets intelligent agents see and control application user interfaces. Instead of building separate AI backends for every app, Primoia provides a universal bridge: your application describes its UI through a structured manifest, and an AI agent can read, understand, and operate that UI on behalf of the user.

Think of it as giving AI eyes and hands inside your application. The user speaks or types a request, the agent figures out which screens to navigate, which fields to fill, which buttons to click -- and the SDK executes those actions in real time.

The ACP Protocol

At the core of Primoia is the Agent Control Protocol (ACP) -- an open, bidirectional WebSocket protocol that connects your app to the Vocall Engine.

The protocol operates over a single WebSocket connection at the /connect endpoint. Every message is a JSON object with a type field that identifies its purpose. The SDK sends manifests, user messages, and execution results upstream. The Engine sends commands, chat responses, and status updates downstream.
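To make the message envelope concrete, here is a minimal sketch of what a few ACP message shapes might look like as a TypeScript discriminated union. Only the `type` discriminant is taken from the text above; every other field name (`screens`, `commandId`, `actions`, and so on) is an illustrative assumption, not the official schema.

```typescript
// Hypothetical ACP message shapes. The `type` field is the discriminant
// described in the docs; all other field names are assumptions.
type AcpMessage =
  | { type: "manifest"; screens: unknown[] }
  | { type: "user_message"; text: string }
  | { type: "command"; actions: { action: string; target?: string; value?: string }[] }
  | { type: "result"; commandId: string; ok: boolean };

// Route an incoming message by its `type` discriminant, as a WebSocket
// message handler might.
function describe(msg: AcpMessage): string {
  switch (msg.type) {
    case "manifest":
      return `manifest with ${msg.screens.length} screens`;
    case "user_message":
      return `user said: ${msg.text}`;
    case "command":
      return `${msg.actions.length} action(s) to execute`;
    case "result":
      return `command ${msg.commandId} ${msg.ok ? "succeeded" : "failed"}`;
  }
}
```

Dispatching on a single `type` field keeps the handler exhaustive: the compiler flags any message variant the switch does not cover.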

A separate /ws/stream endpoint handles the voice channel, carrying raw audio for speech-to-text and text-to-speech in real time.

How It Works

The interaction model follows three steps:

  1. Describe -- Your app registers a manifest that lists every screen, field, action, and modal the agent is allowed to see and control.
  2. Converse -- The user sends a natural-language message (text or voice). The Engine passes it to an LLM along with the manifest context.
  3. Execute -- The LLM decides what UI actions to perform. The Engine sends a command containing those actions, and the SDK executes them against the live DOM or native UI.
  Your App           SDK            WebSocket          Vocall Engine          LLM
     |                |                 |                    |                  |
     |-- manifest --> |-- manifest --> |--- manifest ----->  |                  |
     |                |                 |                    |                  |
     |   user types   |                 |                    |                  |
     |-- "fill the    |-- text ------> |--- text --------->  |-- prompt -----> |
     |    invoice"    |                 |                    |                  |
     |                |                 |                    |  <-- actions --- |
     |                | <-- command --- | <--- command ----- |                  |
     |  fill fields   |                 |                    |                  |
     | <-- fill/click |-- result ----> |--- result -------> |                  |
     |                |                 |                    |                  |
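Step 1 of the flow above, registering a manifest, might look like the following in a TypeScript app. The builder function and the field/screen shapes are illustrative assumptions, not the official SDK API; only the idea of a manifest listing screens, fields, and actions comes from the text.

```typescript
// Hypothetical manifest shapes -- the real SDK APIs differ per platform.
interface ManifestField {
  id: string;
  label: string;
  kind: "text" | "number" | "select";
}

interface ManifestScreen {
  id: string;
  fields: ManifestField[]; // fields the agent may fill, clear, or select
  actions: string[];       // named actions the agent may click
}

// Assemble the manifest message the SDK would send upstream on connect.
function buildManifest(appId: string, screens: ManifestScreen[]) {
  return { type: "manifest" as const, appId, screens };
}
```

A screen left out of the manifest is invisible to the agent, which is also how you scope what the agent is allowed to touch.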

The SDK handles retry logic, animation (typewriter effects, fade-ins), sequential vs parallel action execution, and reports results back so the agent can verify success or recover from errors.
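The result reporting mentioned above might be assembled like this. The payload fields beyond `type` (`commandId`, `ok`, `failed`) are assumptions for illustration, not the official ACP schema.

```typescript
// Hypothetical per-action outcome recorded by the SDK.
interface ActionResult {
  action: string;
  ok: boolean;
  error?: string;
}

// Build the result message the SDK reports back after executing a command,
// so the agent can verify success or recover from errors.
function buildResult(commandId: string, results: ActionResult[]) {
  return {
    type: "result" as const,
    commandId,
    ok: results.every(r => r.ok),                           // succeeds only if every action did
    failed: results.filter(r => !r.ok).map(r => r.action),  // tells the agent what to retry
  };
}
```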

Platform SDKs

Primoia ships with six official SDKs covering web, mobile, and desktop:

| Platform | Package | Language |
|----------|---------|----------|
| React / Next.js | @anthropic/vocall-react | TypeScript |
| Angular | @anthropic/vocall-angular | TypeScript |
| Vue | @anthropic/vocall-vue | TypeScript |
| Flutter | vocall_flutter | Dart |
| Kotlin | com.primoia:vocall-sdk | Kotlin |
| Swift | VocallSDK | Swift |

Each SDK provides the same core capabilities: WebSocket connection management, manifest registration, field binding, action execution, voice support, and chat UI primitives. The API surface is adapted to each platform's idioms -- hooks for React, composables for Vue, standalone components for Angular, coroutines for Kotlin, and async/await for Swift.

The 14 UI Actions

The Engine can instruct your app to perform any of these actions:

Navigation and flow: navigate, click, open_modal, close_modal

Field manipulation: fill, clear, select, enable, disable

Visual feedback: highlight, focus, scroll_to

User interaction: show_toast, ask_confirm

Actions within a single command are split into two execution modes. Sequential actions (navigate, click, show_toast, ask_confirm, open_modal, close_modal) run one at a time in order, because each may change the page state. Parallel actions (fill, clear, select, highlight, focus, scroll_to, enable, disable) run concurrently for speed -- filling ten fields at once rather than one by one.
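The sequential/parallel split can be sketched as a simple partition over the action list. The action names come from the lists above; the partition function itself is a hypothetical stand-in for what the SDK does internally.

```typescript
// Actions that may change page state must run one at a time, in order.
const SEQUENTIAL = new Set([
  "navigate", "click", "show_toast", "ask_confirm", "open_modal", "close_modal",
]);

interface UiAction {
  action: string;
  target?: string;
  value?: string;
}

// Split a command's actions into the two execution modes.
function partition(actions: UiAction[]) {
  const sequential = actions.filter(a => SEQUENTIAL.has(a.action));
  const parallel = actions.filter(a => !SEQUENTIAL.has(a.action));
  return { sequential, parallel };
}
```

An executor would then run the parallel group concurrently (for example via Promise.all) and await each sequential action in turn.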

Architecture Overview

+------------------+       +------------------+       +------------------+
|                  |       |                  |       |                  |
|   Your App UI    | <---> |   Vocall SDK     | <---> |  Vocall Engine   |
|                  |       |                  |       |                  |
|  - Screens       |       |  - Manifest      |       |  - Session mgmt  |
|  - Fields        |       |  - Field binding |       |  - LLM routing   |
|  - Actions       |       |  - Action exec   |       |  - Voice (STT)   |
|  - Modals        |       |  - Voice capture |       |  - Voice (TTS)   |
|                  |       |  - Chat UI       |       |  - Command gen   |
+------------------+       +------------------+       +------------------+
                                    |                         |
                              WebSocket /connect         LLM Provider
                              WebSocket /ws/stream    (OpenAI, DeepSeek,
                                                       Anthropic, etc.)

The Engine keeps no state beyond the session: the manifest lives in memory only for the duration of the WebSocket connection. The Engine forwards user messages to the configured LLM provider, parses the LLM response into structured commands, and relays them back to the SDK. Swapping LLM providers at runtime is supported via the llm_config message -- no reconnection needed.
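A runtime provider swap might be expressed as an llm_config message like the one below. Only the message type comes from the text; the `provider`, `model`, and `apiKey` field names are assumptions for illustration.

```typescript
// Hypothetical llm_config payload for swapping providers mid-session,
// sent over the existing /connect WebSocket -- no reconnect required.
function buildLlmConfig(provider: string, model: string, apiKey?: string) {
  return {
    type: "llm_config" as const,
    provider,                            // e.g. "openai", "deepseek", "anthropic"
    model,
    ...(apiKey ? { apiKey } : {}),       // omit the key entirely when not provided
  };
}
```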

Next Steps