
📎 Device Context

Shared context extraction layer that powers every way agents interact with device screens. Single source of truth — MCP tools, REST endpoints, and the Skill Creator all call the same functions.

| Tool | What it does | Speed |
| --- | --- | --- |
| get_phone_state | Current app, activity, keyboard, focused element | Fast |
| get_screen_tree | LLM-readable indented UI hierarchy with element indices | Fast |
| get_interactive_elements | Clickable/text elements as JSON with bounds + centers | Fast |
| get_screen_xml | Raw uiautomator XML dump | Fast |
| classify_screen | Detect screen type: home, search, dialog, error, loading | Fast |
| find_on_screen | Find specific text, return location (XML first, OCR fallback) | Fast-Medium |

get_screen_tree is the most important tool for LLM agents. It converts the raw uiautomator XML dump into an indented, readable format with one numbered line per element:

[1] FrameLayout [0,0][1080,2340]
[2] ViewGroup "ehr" [0,80][1080,2205]
[3] FrameLayout "Bottom sheet" [clickable] [0,861][1080,2205]
[4] TextView "Following" [clickable] [40,200][200,240]
[5] TextView "10" [100,200][160,240]
[6] ImageView "profile_image" [clickable] [440,80][640,280]

Agents can read this and decide “I need to tap element [4] to see Following list” — no vision model needed.
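To make the conversion concrete, here is a minimal sketch of how a uiautomator-style XML dump could be turned into indexed, indented lines like the ones above. It is not the actual get_screen_tree implementation: the function name render_tree, the sample XML, and the exact formatting rules are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample in the shape of a uiautomator dump.
SAMPLE_XML = """
<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" text="" content-desc="" clickable="false" bounds="[0,0][1080,2340]">
    <node class="android.widget.TextView" text="Following" content-desc="" clickable="true" bounds="[40,200][200,240]" />
    <node class="android.widget.ImageView" text="" content-desc="profile_image" clickable="true" bounds="[440,80][640,280]" />
  </node>
</hierarchy>
"""


def render_tree(xml_text: str) -> str:
    """Render a uiautomator-style dump as indexed, indented lines (illustrative only)."""
    root = ET.fromstring(xml_text)
    lines = []
    counter = 0

    def visit(node, depth):
        nonlocal counter
        counter += 1
        # Keep only the short class name, e.g. "TextView".
        short_class = node.get("class", "").rsplit(".", 1)[-1]
        # Prefer visible text, fall back to the content description.
        label = node.get("text") or node.get("content-desc") or ""
        parts = [f"[{counter}]", short_class]
        if label:
            parts.append(f'"{label}"')
        if node.get("clickable") == "true":
            parts.append("[clickable]")
        parts.append(node.get("bounds", ""))
        lines.append("  " * depth + " ".join(parts))
        for child in node.findall("node"):
            visit(child, depth + 1)

    for top in root.findall("node"):
        visit(top, 0)
    return "\n".join(lines)


tree_text = render_tree(SAMPLE_XML)
print(tree_text)
```

Indices are assigned in depth-first order, so a parent always has a smaller number than its children and an agent can refer to any element by its bracketed index.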

classify_screen detects the screen type heuristically, without calling an LLM:

{
  "app": "TikTok",
  "package": "com.zhiliaoapp.musically",
  "screen_type": "profile",
  "has_keyboard": false,
  "activity": "X.0sWc"
}

Screen types: launcher, feed, search, profile, settings, dialog, error, loading, unknown.
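The source does not describe the rules classify_screen applies. As a rough sketch of what heuristic classification without an LLM can look like, the example below maps a package name and the visible texts to one of the screen types listed above. The function name classify_texts and every rule, keyword list, and threshold are hypothetical.

```python
# All rules below are illustrative assumptions, not the project's actual heuristics.
LAUNCHER_PACKAGES = {"com.android.launcher3"}  # assumption: AOSP launcher only
ERROR_HINTS = ("something went wrong", "no internet", "try again")
DIALOG_BUTTONS = {"ok", "cancel", "allow", "deny"}


def classify_texts(package: str, texts: list) -> str:
    """Guess a screen type from the foreground package and the visible texts."""
    lowered = [t.strip().lower() for t in texts if t.strip()]
    if package in LAUNCHER_PACKAGES:
        return "launcher"
    if not lowered:
        # Nothing readable on screen yet: treat it as still loading.
        return "loading"
    if any(hint in t for t in lowered for hint in ERROR_HINTS):
        return "error"
    if len(DIALOG_BUTTONS.intersection(lowered)) >= 2:
        # Two or more typical dialog buttons visible at once.
        return "dialog"
    if any(t.startswith("search") for t in lowered):
        return "search"
    return "unknown"


print(classify_texts("com.example.app", ["Delete post?", "Cancel", "OK"]))
```

The checks run from most to least specific and fall through to unknown, so an agent can always branch on the result even when no rule matches.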

| Tool | What it does |
| --- | --- |
| screenshot | Standard screencap (half-res JPEG, ~200ms) |
| screenshot_annotated | Screenshot with Portal’s numbered element overlay; each interactive element gets a visible number badge |
| screenshot_cropped | Crop a specific region (x1, y1, x2, y2 in device pixels) |

When screenshot_annotated is called, the Portal app draws numbered labels on every interactive element. The numbers match get_interactive_elements() indices. This is ideal for vision-capable LLMs that can look at the image and say “tap element 5.”
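Because the badge numbers match the element indices, a model's reply such as "tap element 5" can be resolved to coordinates with a plain lookup. The sketch below assumes each element dict has index and center fields; the real JSON schema of get_interactive_elements is not shown in this document, and resolve_tap is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Assumed element shape; the actual field names may differ.
elements = [
    {"index": 4, "text": "Following", "center": [120, 220]},
    {"index": 6, "text": "profile_image", "center": [540, 180]},
]


def resolve_tap(reply: str, elements: list) -> Optional[Tuple[int, int]]:
    """Extract 'element N' from a model reply and return that element's center."""
    match = re.search(r"element\s+\[?(\d+)\]?", reply, re.IGNORECASE)
    if not match:
        return None
    wanted = int(match.group(1))
    for el in elements:
        if el["index"] == wanted:
            x, y = el["center"]
            return (x, y)
    # The model named an index that is not on screen.
    return None


print(resolve_tap("tap element 6", elements))
```

Returning None for an unknown index lets the caller re-capture the screen instead of tapping a stale location.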

The OCR tools cover content rendered as images (analytics dashboards, games, WebViews, canvas-drawn text), where get_elements() returns nothing.

| Tool | What it does |
| --- | --- |
| ocr_screen | Full screen OCR via RapidOCR (CPU, no GPU) |
| ocr_region(x1, y1, x2, y2) | OCR a cropped region; more accurate for targeted extraction |

Returns [{text, conf, x, y, w, h}] sorted top-to-bottom.
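A short sketch of consuming results in that shape: drop low-confidence entries, then return the center of the first match reading top-to-bottom. It assumes x and y are the top-left corner of each box and that conf is in the range 0 to 1. The helper find_text, the threshold, and the sample values are made up for illustration.

```python
from typing import Optional, Tuple

# Made-up sample data in the [{text, conf, x, y, w, h}] shape.
results = [
    {"text": "Likes", "conf": 0.97, "x": 100, "y": 900, "w": 120, "h": 40},
    {"text": "Post views", "conf": 0.95, "x": 100, "y": 600, "w": 200, "h": 40},
    {"text": "Shares", "conf": 0.20, "x": 100, "y": 1200, "w": 140, "h": 40},
]


def find_text(results: list, query: str, min_conf: float = 0.6) -> Optional[Tuple[int, int]]:
    """Return the center of the first confident box whose text contains query."""
    needle = query.lower()
    # Re-sort defensively so the topmost match wins even if input order changes.
    for r in sorted(results, key=lambda r: (r["y"], r["x"])):
        if r["conf"] >= min_conf and needle in r["text"].lower():
            # Assumption: (x, y) is the top-left corner of the box.
            return (r["x"] + r["w"] // 2, r["y"] + r["h"] // 2)
    return None


print(find_text(results, "post views"))
```

Filtering on confidence before matching avoids tapping a location derived from a misread label.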

Used in production for TikTok analytics scraping (post views, likes, shares from the Insights screen) and Instagram reel engagement stats.

| Tool | What it does | Example |
| --- | --- | --- |
| clipboard_get | Read clipboard | Check copied text |
| clipboard_set | Set clipboard | Prepare text for paste |
| get_notifications | List active notifications | Check for new messages |
| open_notifications | Pull down notification shade | Access notification panel |
| clear_notifications | Dismiss all | Clean up |
| launch_intent | Full Android intent API | Open URLs, share text, launch specific activities |
| toggle_overlay | Portal numbered element overlay | Visual debugging |
| find_on_screen | Find text → get location | "Is the Login button visible?" |

launch_intent is more powerful than launch_app():

# Open a URL in browser
launch_intent(device, action="android.intent.action.VIEW", data="https://google.com")
# Open Settings
launch_intent(device, package="com.android.settings")
# Share text to any app
launch_intent(device, action="android.intent.action.SEND",
              extras={"android.intent.extra.TEXT": "Check this out!"})

build_llm_context(device) returns everything an agent needs in one call:

{
    "phone_state": {...},   # app, activity, keyboard
    "screen_type": {...},   # classification
    "elements": [...],      # interactive elements (max 40)
    "screen_tree": "...",   # indented hierarchy
    "screenshot": {...},    # base64 JPEG (optional)
    "ocr": [...]            # OCR results (optional)
}
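As an illustration of how a consumer such as the Skill Creator might turn this dict into prompt text, here is a minimal sketch. The helper render_prompt is hypothetical, the sample context is invented, and the keys read inside phone_state and screen_type are assumptions based on the field comments above.

```python
import json


def render_prompt(context: dict, goal: str) -> str:
    """Flatten a build_llm_context-style dict into plain prompt text (illustrative)."""
    state = context["phone_state"]
    lines = [
        f"Goal: {goal}",
        f"App: {state.get('app', 'unknown')}",
        f"Screen type: {context['screen_type'].get('screen_type', 'unknown')}",
        "Screen tree:",
        context["screen_tree"],
        "Interactive elements (JSON):",
        json.dumps(context["elements"]),
    ]
    # OCR is optional, so only mention it when results are present.
    if context.get("ocr"):
        lines.append("OCR text: " + " | ".join(r["text"] for r in context["ocr"]))
    return "\n".join(lines)


# Invented sample context with the optional screenshot and ocr keys omitted.
sample_context = {
    "phone_state": {"app": "TikTok", "activity": "X.0sWc", "has_keyboard": False},
    "screen_type": {"screen_type": "profile"},
    "elements": [{"index": 4, "text": "Following", "center": [120, 220]}],
    "screen_tree": '[4] TextView "Following" [clickable] [40,200][200,240]',
}

prompt = render_prompt(sample_context, "Open the Following list")
print(prompt)
```

The optional keys are read with .get so the same function works whether or not a screenshot or OCR pass was requested.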

All 19 functions live in one file: gitd/services/device_context.py. Three consumers, zero duplication:

  • MCP Server — exposes as tools for Claude Code, Cursor, Codex CLI, OpenClaw
  • FastAPI Router — REST endpoints for the dashboard and external integrations
  • Skill Creator — builds LLM system prompts with live device context