Your device, your data, your agent
A personal AI assistant that runs 100% on-device. No cloud. No Python. Your data never leaves the machine.
The problem
Siri sends your voice to Apple. Google Assistant sends your queries to Google. Alexa sends everything to Amazon. Every "smart" assistant requires a cloud hop. Your calendar, contacts, messages, and location data all leave the device.
What if you could have a capable assistant that never sends a single byte off your machine?
Config
device_agent.toml defines a complete on-device assistant with local LLM inference and access to all your Apple data through iMCP - all in 82 lines of TOML.
Local LLM inference
```toml
[agent]
provider = "llama_cpp"
model = "unsloth/Qwen3-VL-8B-Instruct-GGUF:UD-Q6_K_XL"
assume_mutating = false
tools = [
  "create_task",
  "todowrite",
  "todoread",
  "question",
  "mdq",
  # All Apple data tools come from iMCP
  "iMCP.*",
]
```
"iMCP.*" is a wildcard that matches every tool exposed by the iMCP MCP server. As iMCP adds new services (Reminders, Notes, Shortcuts), the agent automatically gains access without config changes.
GPU model parameters
```toml
[agent.parameters]
n_ctx = 160000               # 160K context window
max_tokens = 8192            # Max response length
top_p = 0.95
top_k = 20
temperature = 0.9            # Creative but grounded
flash_attention = "enabled"  # Faster inference
```
QueryMT auto-detects your GPU (Metal on Apple Silicon, CUDA on NVIDIA, Vulkan elsewhere) and pulls the correct OCI image variant. No driver config. No CUDA toolkit installation.
Three-layer compaction for long conversations
```toml
# Layer 1: Tool output truncation
[agent.execution.tool_output]
max_lines = 2000
max_bytes = 51200

# Layer 2: Pruning after every turn
[agent.execution.pruning]
protect_tokens = 40000

# Layer 3: AI summary on context overflow
[agent.execution.compaction]
auto = true
```
Calendar queries and contact lookups return structured data. But long conversations with many tool calls can still fill the 160K context window. The three-layer compaction system handles this automatically: truncate large tool outputs, prune old messages, and summarize when needed.
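To put the numbers together with a hypothetical oversized turn: a 300 KB calendar dump would first be cut to 2000 lines / 50 KB (51200 bytes) by layer 1; as the transcript grows, layer 2 prunes older messages while leaving the most recent 40000 tokens intact; and only if the conversation still nears the 160K window does layer 3 replace old turns with an AI-written summary. The 300 KB figure is illustrative; the thresholds are the ones configured above.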
MCP server: iMCP
```toml
# iMCP - local Apple data access
[[mcp]]
name = "iMCP"
transport = "stdio"
command = "/Applications/iMCP.app/Contents/MacOS/imcp-server"
```
This tells QueryMT to launch the iMCP MCP server locally. The agent communicates with it over stdio - no network connection involved. iMCP bridges to native Apple APIs directly.
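Under the hood, MCP's stdio transport is plain JSON-RPC 2.0: the client writes one JSON object per line to the server's stdin and reads responses from its stdout. A sketch of one round trip follows; the tool name, arguments, and returned event are hypothetical:

```
# request written to imcp-server's stdin (tool name is hypothetical)
{"jsonrpc":"2.0","id":7,"method":"tools/call","params":{"name":"fetchCalendarEvents","arguments":{"start":"2025-06-02","end":"2025-06-03"}}}

# response read from its stdout; the payload never touches a socket
{"jsonrpc":"2.0","id":7,"result":{"content":[{"type":"text","text":"[{\"title\":\"Team standup\",\"start\":\"09:30\"}]"}]}}
```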
Architecture
Everything stays on the device. Data never leaves your machine at any stage of the pipeline:
The LLM runs locally via llama.cpp. Tool calls go to iMCP over stdio. iMCP calls Apple's local APIs. At no point does your data traverse a network connection. The agent works offline.
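As a rough sketch (the Apple framework names on the right are my assumption about what an iMCP-style bridge sits on):

```
prompt ──▶ QueryMT agent ──▶ llama.cpp (Metal / CUDA / Vulkan, local)
               │
               └── tool call ──stdio──▶ imcp-server ──▶ Apple frameworks
                                                        (EventKit, Contacts, MapKit, WeatherKit)
```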
Example interaction
Key features
- 100% on-device — local LLM inference, local data access, no cloud
- iMCP integration — Calendar, Contacts, Messages, Maps, Weather via native APIs
- Wildcard tool patterns — "iMCP.*" auto-includes new services
- GPU auto-detection — Metal/CUDA/Vulkan, no manual config
- Works offline — stdio transport, no network needed
- Three-layer compaction — handles long tool-heavy conversations
Try it yourself
```bash
# 1. Install iMCP
brew install --cask mattt/tap/iMCP
# Or download from https://iMCP.app/download

# 2. Open iMCP.app and enable services (Calendar, Contacts, etc.)

# 3. Run the agent
cargo run --example qmtcode --features dashboard -- confs/device_agent.toml --dashboard

# 4. Ask questions in the dashboard:
#    "What does my schedule look like today?"
#    "Do I have any messages from Sarah this week?"
#    "What's the weather like for my commute home?"
```