आपका Device, आपका Data, आपका Agent
एक personal AI assistant जो 100% device पर चलता है। cloud नहीं। Python नहीं। आपका data machine से बाहर नहीं जाता।
समस्या
Siri आपकी voice Apple को भेजता है। Google Assistant आपकी queries Google को भेजता है। Alexa सब कुछ Amazon को भेजता है। हर "smart" assistant को cloud hop चाहिए। आपका calendar, contacts, messages और location data सब device छोड़ते हैं।
क्या हो अगर आपके पास ऐसा capable assistant हो जो आपकी machine से कभी भी एक byte भी बाहर न भेजे?
Config
device_agent.toml defines a complete on-device assistant with local LLM inference and access to all your Apple data through iMCP - all in 82 lines of TOML.
Local LLM inference
[agent] provider = "llama_cpp" model = "unsloth/Qwen3-VL-8B-Instruct-GGUF:UD-Q6_K_XL" assume_mutating = false tools = [ "create_task", "todowrite", "todoread", "question", "mdq", # All Apple data tools come from iMCP "iMCP.*", ]
"iMCP.*" is a wildcard that matches every tool exposed by the iMCP MCP server. As iMCP adds new services (Reminders, Notes, Shortcuts), the agent automatically gains access without config changes.
GPU model parameters
[agent.parameters] n_ctx = 160000 # 160K context window max_tokens = 8192 # Max response length top_p = 0.95 top_k = 20 temperature = 0.9 # Creative but grounded flash_attention = "enabled" # Faster inference
QueryMT auto-detects your GPU (Metal on Apple Silicon, CUDA on NVIDIA, Vulkan elsewhere) and pulls the correct OCI image variant. No driver config. No CUDA toolkit installation.
Three-layer compaction for long conversations
# Layer 1: Tool output truncation [agent.execution.tool_output] max_lines = 2000 max_bytes = 51200 # Layer 2: Pruning after every turn [agent.execution.pruning] protect_tokens = 40000 # Layer 3: AI summary on context overflow [agent.execution.compaction] auto = true
Calendar queries and contact lookups return structured data. But long conversations with many tool calls can still fill the 160K context window. The three-layer compaction system handles this automatically: truncate large tool outputs, prune old messages, and summarize when needed.
MCP server: iMCP
# iMCP - local Apple data access [[mcp]] name = "iMCP" transport = "stdio" command = "/Applications/iMCP.app/Contents/MacOS/imcp-server"
This tells QueryMT to launch the iMCP MCP server locally. The agent communicates with it over stdio - no network connection involved. iMCP bridges to native Apple APIs directly.
Architecture
सब कुछ device पर रहता है। pipeline के किसी भी step पर data आपकी machine से बाहर नहीं जाता:
The LLM runs locally via llama.cpp. Tool calls go to iMCP over stdio. iMCP calls Apple's local APIs. At no point does your data traverse a network connection. The agent works offline.
Example Interaction
Key Features
- 100% on-device — local LLM inference, local data access, no cloud
- iMCP integration — Calendar, Contacts, Messages, Maps, Weather via native APIs
- Wildcard tool patterns —
"iMCP.*"auto-includes new services - GPU auto-detection — Metal/CUDA/Vulkan, no manual config
- Works offline —
stdiotransport, no network needed - Three-layer compaction — handles long tool-heavy conversations
खुद try करें
# 1. Install iMCP brew install --cask mattt/tap/iMCP # Or download from https://iMCP.app/download # 2. Open iMCP.app and enable services (Calendar, Contacts, etc.) # 3. Run the agent cargo run --example qmtcode --features dashboard -- confs/device_agent.toml --dashboard # 4. Ask questions in the dashboard: # "What does my schedule look like today?" # "Do I have any messages from Sarah this week?" # "What's the weather like for my commute home?"