📎 Device Context
Shared context extraction layer that powers every way agents interact with device screens. Single source of truth — MCP tools, REST endpoints, and the Skill Creator all call the same functions.
Screen Understanding
| Tool | What it does | Speed |
|---|---|---|
| get_phone_state | Current app, activity, keyboard, focused element | Fast |
| get_screen_tree | LLM-readable indented UI hierarchy with element indices | Fast |
| get_interactive_elements | Clickable/text elements as JSON with bounds + centers | Fast |
| get_screen_xml | Raw uiautomator XML dump | Fast |
| classify_screen | Detect screen type, e.g. launcher, search, dialog, error, loading | Fast |
| find_on_screen | Find specific text, return location (XML first, OCR fallback) | Fast-Medium |
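
The "XML first, OCR fallback" order of find_on_screen can be sketched as below. This is an illustration under stated assumptions, not the actual implementation: the helper names `find_in_xml`, `find_in_ocr`, and `locate_text` are hypothetical, the returned dict shape is invented for the example, and the OCR rows are assumed to carry `x`/`y` as the top-left corner of each text box.

```python
import re
import xml.etree.ElementTree as ET

# uiautomator encodes bounds as "[x1,y1][x2,y2]" in device pixels.
BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def find_in_xml(xml_dump, query):
    """Return the center of the first node whose text contains `query`, or None."""
    needle = query.lower()
    for node in ET.fromstring(xml_dump).iter("node"):
        bounds = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if bounds and needle in node.get("text", "").lower():
            x1, y1, x2, y2 = map(int, bounds.groups())
            return {"x": (x1 + x2) // 2, "y": (y1 + y2) // 2, "source": "xml"}
    return None


def find_in_ocr(ocr_rows, query):
    """Return the center of the first OCR row containing `query`, or None.

    Assumption: x/y mark the top-left corner of the recognized text box.
    """
    needle = query.lower()
    for row in ocr_rows:
        if needle in row["text"].lower():
            return {
                "x": row["x"] + row["w"] // 2,
                "y": row["y"] + row["h"] // 2,
                "source": "ocr",
            }
    return None


def locate_text(xml_dump, query, run_ocr):
    """Try the XML dump first; call `run_ocr()` only when the XML has no match."""
    hit = find_in_xml(xml_dump, query)
    if hit is not None:
        return hit
    return find_in_ocr(run_ocr(), query)
```

Passing OCR as a callable keeps the slower path off the common case, which would be consistent with the "Fast-Medium" speed rating in the table.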
get_screen_tree — LLM-optimized
The most important tool for LLM agents. Converts the raw XML dump into an indented, readable format:

    [1] FrameLayout [0,0][1080,2340]
      [2] ViewGroup "ehr" [0,80][1080,2205]
        [3] FrameLayout "Bottom sheet" [clickable] [0,861][1080,2205]
        [4] TextView "Following" [clickable] [40,200][200,240]
        [5] TextView "10" [100,200][160,240]
        [6] ImageView "profile_image" [clickable] [440,80][640,280]

Agents can read this and decide “I need to tap element [4] to see Following list”, with no vision model needed.
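When an agent answers with an element index, the caller still has to turn that index into tap coordinates. A minimal sketch of parsing one tree line follows. It assumes every line has the shape shown in the sample (index, class name, optional quoted label, optional bracketed flags such as `[clickable]`, then bounds); `parse_tree_line` is a hypothetical helper, not part of the documented API.

```python
import re

# Matches lines like: [4] TextView "Following" [clickable] [40,200][200,240]
LINE_RE = re.compile(
    r'^\s*\[(?P<index>\d+)\]\s+(?P<cls>\w+)'
    r'(?:\s+"(?P<label>[^"]*)")?'
    r'(?P<flags>(?:\s+\[[a-z_]+\])*)'
    r'\s+\[(?P<x1>\d+),(?P<y1>\d+)\]\[(?P<x2>\d+),(?P<y2>\d+)\]\s*$'
)


def parse_tree_line(line):
    """Parse one screen-tree line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    x1, y1, x2, y2 = (int(m[key]) for key in ("x1", "y1", "x2", "y2"))
    return {
        "index": int(m["index"]),
        "class": m["cls"],
        "label": m["label"],  # None when the line has no quoted label
        "clickable": "[clickable]" in m["flags"],
        "center": ((x1 + x2) // 2, (y1 + y2) // 2),
    }


element = parse_tree_line('    [4] TextView "Following" [clickable] [40,200][200,240]')
print(element["center"])  # (120, 220)
```

Leading whitespace is ignored here, so the sketch recovers individual elements but not the parent/child nesting.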
classify_screen — Quick state check
Heuristic screen type detection without an LLM:

    {
      "app": "TikTok",
      "package": "com.zhiliaoapp.musically",
      "screen_type": "profile",
      "has_keyboard": false,
      "activity": "X.0sWc"
    }

Screen types: launcher, feed, search, profile, settings, dialog, error, loading, unknown.
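One way an agent loop might use this result is as a cheap gate before doing anything expensive. The sketch below is illustrative only: `plan_next_step` and the step names it returns (`wait`, `recover`, `proceed`, `inspect`) are hypothetical, and only the `screen_type` field and the list of screen types come from the text above.

```python
SCREEN_TYPES = {
    "launcher", "feed", "search", "profile", "settings",
    "dialog", "error", "loading", "unknown",
}


def plan_next_step(state, expected_type):
    """Map a classify_screen result to a coarse next step for the agent."""
    screen_type = state.get("screen_type", "unknown")
    if screen_type not in SCREEN_TYPES:
        screen_type = "unknown"  # treat unrecognized values defensively
    if screen_type == "loading":
        return "wait"      # poll again before reading the screen
    if screen_type in ("dialog", "error"):
        return "recover"   # dismiss or handle the interruption first
    if screen_type == expected_type:
        return "proceed"   # the screen is where the workflow expects to be
    return "inspect"       # fall back to a fuller read, e.g. the screen tree


state = {"app": "TikTok", "screen_type": "profile", "has_keyboard": False}
print(plan_next_step(state, expected_type="profile"))  # proceed
```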
Visual Capture
| Tool | What it does |
|---|---|
| screenshot | Standard screencap (half-res JPEG, ~200ms) |
| screenshot_annotated | Screenshot with Portal’s numbered element overlay, where each interactive element gets a visible number badge |
| screenshot_cropped | Crop a specific region (x1, y1, x2, y2 in device pixels) |
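
Because screenshot returns a half-resolution image while screenshot_cropped takes device pixels, a box picked on the screenshot has to be scaled before cropping. The sketch below assumes "half-res" means both dimensions are halved (a scale factor of 2) and uses the 1080x2340 screen size from the sample tree; `to_device_region` is a hypothetical helper.

```python
def to_device_region(box, scale=2, screen=(1080, 2340), pad=0):
    """Convert a box in screenshot pixels to a clamped crop region in device pixels.

    box    -- (x1, y1, x2, y2) measured on the downscaled screenshot
    scale  -- device pixels per screenshot pixel (assumed 2 for half-res)
    screen -- device resolution as (width, height)
    pad    -- extra margin in device pixels added on every side
    """
    x1, y1, x2, y2 = box
    width, height = screen
    return (
        max(0, x1 * scale - pad),
        max(0, y1 * scale - pad),
        min(width, x2 * scale + pad),
        min(height, y2 * scale + pad),
    )


print(to_device_region((20, 100, 100, 120)))          # (40, 200, 200, 240)
print(to_device_region((20, 100, 100, 120), pad=50))  # (0, 150, 250, 290)
```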
Annotated screenshots
When screenshot_annotated is called, the Portal app draws numbered labels on every interactive element. The numbers match get_interactive_elements() indices. This is ideal for vision-capable LLMs that can look at the image and say “tap element 5.”
OCR covers content rendered as images (analytics dashboards, games, WebViews, canvas-drawn text), where get_elements() returns nothing.
| Tool | What it does |
|---|---|
| ocr_screen | Full screen OCR via RapidOCR (CPU, no GPU) |
| ocr_region(x1, y1, x2, y2) | OCR a cropped region, which is more accurate for targeted extraction |
Returns [{text, conf, x, y, w, h}] sorted top-to-bottom.
Used in production for TikTok analytics scraping (post views, likes, shares from the Insights screen) and Instagram reel engagement stats.
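For scraping engagement numbers, the OCR rows usually need some post-processing. The following is a sketch under several assumptions rather than the production code: `conf` is taken to be a score between 0 and 1, the 0.8 threshold is arbitrary, counts are assumed to appear as plain numbers or with K/M/B suffixes, and both helper names are hypothetical.

```python
import re

COUNT_RE = re.compile(r"^(\d+(?:\.\d+)?)([KMB]?)$")
MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def confident_rows(rows, min_conf=0.8):
    """Keep only OCR rows whose confidence meets the threshold, preserving order."""
    return [row for row in rows if row["conf"] >= min_conf]


def parse_count(text):
    """Parse strings such as '3,405', '1.2K' or '2M' into an int; None if not a count."""
    cleaned = text.replace(",", "").strip().upper()
    match = COUNT_RE.match(cleaned)
    if match is None:
        return None
    number, suffix = match.groups()
    return round(float(number) * MULTIPLIERS[suffix])


rows = [
    {"text": "1.2K", "conf": 0.98, "x": 80, "y": 400, "w": 120, "h": 48},
    {"text": "Likes", "conf": 0.95, "x": 80, "y": 460, "w": 110, "h": 36},
    {"text": "3,4O5", "conf": 0.41, "x": 400, "y": 400, "w": 140, "h": 48},
]
print([parse_count(row["text"]) for row in confident_rows(rows)])  # [1200, None]
```

Dropping low-confidence rows before parsing avoids turning misread digits, like the letter O in the third row, into plausible-looking numbers.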
Device Interaction Tools
| Tool | What it does | Example |
|---|---|---|
| clipboard_get | Read clipboard | Check copied text |
| clipboard_set | Set clipboard | Prepare text for paste |
| get_notifications | List active notifications | Check for new messages |
| open_notifications | Pull down notification shade | Access notification panel |
| clear_notifications | Dismiss all | Clean up |
| launch_intent | Full Android intent API | Open URLs, share text, launch specific activities |
| toggle_overlay | Portal numbered element overlay | Visual debugging |
| find_on_screen | Find text → get location | “Is the Login button visible?” |
launch_intent — Full Android Intent API
More powerful than launch_app():

    # Open a URL in browser
    launch_intent(device, action="android.intent.action.VIEW", data="https://google.com")

    # Open Settings
    launch_intent(device, package="com.android.settings")

    # Share text to any app
    launch_intent(device, action="android.intent.action.SEND", extras={"android.intent.extra.TEXT": "Check this out!"})

Agent Convenience
build_llm_context(device) returns everything an agent needs in one call:

    {
      "phone_state": {...},   # app, activity, keyboard
      "screen_type": {...},   # classification
      "elements": [...],      # interactive elements (max 40)
      "screen_tree": "...",   # indented hierarchy
      "screenshot": {...},    # base64 JPEG (optional)
      "ocr": [...]            # OCR results (optional)
    }

Architecture
All 19 functions live in one file: gitd/services/device_context.py. Three consumers, zero duplication:
- MCP Server — exposes as tools for Claude Code, Cursor, Codex CLI, OpenClaw
- FastAPI Router — REST endpoints for the dashboard and external integrations
- Skill Creator — builds LLM system prompts with live device context
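
To illustrate the third consumer, here is a rough sketch of how a build_llm_context result could be flattened into text for a system prompt. The function `render_device_context` and the prompt wording are hypothetical; the sketch relies only on the top-level keys listed above and on the `app` and `screen_type` fields shown in the classify_screen example.

```python
def render_device_context(ctx):
    """Flatten a build_llm_context-style dict into plain text for a system prompt."""
    screen = ctx.get("screen_type", {})
    lines = [
        f"Current app: {screen.get('app', 'unknown')}",
        f"Screen type: {screen.get('screen_type', 'unknown')}",
        "Screen tree:",
        ctx.get("screen_tree", "").rstrip(),
    ]
    ocr_rows = ctx.get("ocr")  # optional key, may be missing or empty
    if ocr_rows:
        lines.append("OCR text:")
        lines.extend(f"- {row['text']}" for row in ocr_rows)
    return "\n".join(lines)


context = {
    "screen_type": {"app": "TikTok", "screen_type": "profile"},
    "screen_tree": '[1] FrameLayout [0,0][1080,2340]\n  [2] ViewGroup "ehr" [0,80][1080,2205]\n',
}
print(render_device_context(context))
```

The screenshot is left out of the text on purpose, since a base64 JPEG would normally be sent to a vision-capable model as an image attachment instead of inline prompt text.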