XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Guanyu Jiang*1,2, Zhaochen Su*1, Xiaoye Qu3, Yi R. (May) Fung1
1Hong Kong University of Science and Technology, 2Zhejiang University, 3Huazhong University of Science and Technology
*Equal contribution

Comparison of reasoning trajectories on a multimodal task with and without XSkill. The baseline agent (left) fails due to visual-semantic gaps, while XSkill (right) recalls relevant experiences and structured skill fragments, generating a grounded execution plan that leads to successful identification.

Abstract

Multimodal agents demonstrate impressive problem-solving capabilities but typically operate in isolated episodes without leveraging past experiences. Recent methods address this through dynamic retrieval of textual insights or predefined skill documents, yet face critical challenges: visual modalities are neglected during knowledge extraction, stored insights lack executable structure, and manually crafted skills fail to scale. We propose XSkill, a framework combining task-level Skills (structured workflows and tool templates) with action-level Experiences (context-specific tactical insights) through automated accumulation from agent trajectories. Our approach employs visually-grounded summarization to extract knowledge integrating visual observations and textual reasoning, hierarchical consolidation to maintain quality and diversity, and context-aware adaptation to tailor knowledge to current visual contexts. Evaluated on diverse benchmarks requiring visual tool use and multimodal search, XSkill achieves average gains of 2.58 to 6.71 points over strong baselines across different backbone models, with superior zero-shot transferability and strategic improvements in tool selection and execution accuracy. These results demonstrate that our framework enables transferable continual learning for multimodal agents in real-world scenarios without parametric training, offering broad application for practical deployment.

Framework Overview

XSkill operates in two phases. Phase I (Accumulation): the agent distills structured skill documents and experience items from multi-path trajectories via rollout summary, cross-rollout critique, and hierarchical consolidation, addressing the bottlenecks of inefficient tool use and inflexible orchestration. Phase II (Inference): the system decomposes the test task, retrieves relevant knowledge, adapts it to the current visual context, and injects it into the agent's prompt.


Overview of the XSkill framework. Phase I (left): Knowledge accumulation via rollout summary (A), cross-rollout critique (B), and hierarchical consolidation into the Skill Library and Experience Bank. Phase II (right): Task decomposition and retrieval (C), context-aware adaptation (D), and injection into the agent's prompt.

 Skill Library 𝒦

A Skill k = (ℳ, 𝒲, 𝒫) is a task-level guidance document stored as structured Markdown, containing metadata ℳ, a workflow sequence 𝒲, and reusable tool templates 𝒫. Each skill captures structured guidance for a specific class of problems.

Workflow Sequences · Tool Templates · Markdown Format
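A Skill entry of this shape can be sketched as a small Python dataclass; the field names and the Markdown rendering below are our own illustration of the (ℳ, 𝒲, 𝒫) structure, not the paper's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """Sketch of a task-level Skill k = (metadata M, workflow W, templates P)."""
    metadata: dict                                   # M: title, task class, tags
    workflow: list = field(default_factory=list)     # W: ordered step descriptions
    templates: dict = field(default_factory=dict)    # P: name -> reusable tool code

    def to_markdown(self) -> str:
        """Render the skill as a structured Markdown guidance document."""
        lines = [f"# {self.metadata.get('title', 'Untitled Skill')}", "", "## Workflow"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.workflow, 1)]
        lines += ["", "## Tool Templates"]
        for name, code in self.templates.items():
            # Indent template code so it renders as a Markdown code block.
            lines += [f"### {name}", "", "    " + code.replace("\n", "\n    ")]
        return "\n".join(lines)
```

Rendering to Markdown keeps the skill human-readable and directly injectable into a prompt, mirroring the injected skill document shown in the qualitative example.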

 Experience Bank

An Experience e = (c, a, v_e) captures an action-level tactical insight: a triggering condition c, a recommended action a, and a semantic embedding v_e for retrieval. Experiences provide context-specific tips about execution contexts and failure modes.

Triggering Conditions · Action Recommendations · JSON Format
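Retrieval over the Experience Bank can be sketched as follows. The bag-of-words encoder here stands in for a real semantic embedding model, and the sample entries and function names are illustrative assumptions, not the paper's implementation:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding', standing in for a semantic encoder."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Each experience e = (condition c, action a, embedding v_e), stored as JSON-like dicts.
experience_bank = [
    {"condition": "color looks ambiguous under lighting",
     "action": "sample pixel RGB with code_interpreter instead of guessing"},
    {"condition": "web search returns unrelated pages",
     "action": "switch to image search with a cropped query region"},
]
for e in experience_bank:
    e["v_e"] = embed(e["condition"])

def retrieve(query: str, k: int = 1):
    """Return the top-k experiences whose triggering condition best matches the query."""
    ranked = sorted(experience_bank,
                    key=lambda e: cosine(embed(query), e["v_e"]), reverse=True)
    return ranked[:k]
```

A query describing the current execution context, e.g. `retrieve("the color of the region is ambiguous")`, surfaces the pixel-sampling tip rather than the search tip.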

Main Results

We evaluate XSkill on five benchmarks across three domains: Visual Agentic Tool Use (VisualToolBench, TIR-Bench), Multimodal Search (MMSearch-Plus, MMBrowseComp), and Comprehensive (AgentVista). We compare against three state-of-the-art learning-from-experience baselines: Agent Workflow Memory (AWM), Dynamic CheatSheet (DC), and Agent-KB.

| Model | Method | VisualToolBench | TIR-Bench | MMSearch-Plus | AgentVista | Average |
|---|---|---|---|---|---|---|
| Gemini-2.5-Pro | No Tools | 20.91 / 28.97 | 21.37 / 40.00 | 10.43 / 19.43 | 17.89 / 28.44 | 17.65 / 29.21 |
| | w/ Tools | 25.35 / 40.65 | 28.38 / 54.00 | 21.56 / 35.55 | 20.18 / 33.94 | 23.87 / 41.04 |
| | AWM | 25.93 / 39.25 | 29.75 / 53.50 | 20.85 / 36.97 | 19.72 / 32.11 | 24.06 / 40.46 |
| | Dynamic CheatSheet | 24.77 / 37.38 | 27.62 / 51.00 | 24.64 / 40.76 | 21.79 / 35.78 | 24.71 / 41.23 |
| | Agent-KB | 26.75 / 41.12 | 29.13 / 52.50 | 23.22 / 37.91 | 20.87 / 33.94 | 24.99 / 41.37 |
| | XSkill (Ours) | 30.49 / 46.73 | 33.12 / 58.00 | 27.96 / 44.08 | 22.94 / 36.70 | 28.63 / 46.38 |
| Gemini-3-Flash | No Tools | 25.12 / 36.92 | 28.50 / 54.00 | 16.47 / 24.64 | 18.35 / 29.36 | 22.11 / 36.23 |
| | w/ Tools | 41.94 / 60.75 | 32.37 / 58.50 | 39.57 / 53.55 | 20.64 / 39.45 | 33.63 / 53.06 |
| | AWM | 41.94 / 59.35 | 34.25 / 62.50 | 43.36 / 54.98 | 19.50 / 37.61 | 34.76 / 53.61 |
| | Dynamic CheatSheet | 41.70 / 59.81 | 33.75 / 59.00 | 40.28 / 55.45 | 20.18 / 36.70 | 33.98 / 52.74 |
| | Agent-KB | 41.75 / 61.21 | 36.62 / 62.00 | 39.81 / 53.08 | 21.33 / 38.53 | 34.88 / 53.71 |
| | XSkill (Ours) | 46.50 / 64.02 | 47.75 / 75.00 | 43.72 / 56.40 | 23.39 / 40.37 | 40.34 / 58.95 |
| GPT-5-mini | No Tools | 13.90 / 22.90 | 20.00 / 46.50 | 3.08 / 6.64 | 18.58 / 28.44 | 13.89 / 26.12 |
| | w/ Tools | 24.30 / 37.85 | 23.50 / 50.50 | 14.22 / 20.38 | 20.41 / 35.78 | 20.61 / 36.13 |
| | AWM | 23.25 / 34.58 | 24.25 / 53.00 | 14.81 / 20.85 | 19.04 / 34.86 | 20.34 / 35.82 |
| | Dynamic CheatSheet | 23.83 / 35.05 | 24.50 / 53.50 | 13.74 / 18.48 | 21.10 / 36.70 | 20.79 / 35.93 |
| | Agent-KB | 24.77 / 38.32 | 25.13 / 53.00 | 13.63 / 19.91 | 19.95 / 34.86 | 20.87 / 36.52 |
| | XSkill (Ours) | 24.53 / 37.85 | 28.25 / 56.00 | 16.11 / 23.22 | 23.85 / 38.53 | 23.19 / 38.90 |
| o4-mini | No Tools | 14.72 / 27.57 | 20.13 / 45.00 | 5.92 / 12.32 | 19.04 / 25.69 | 14.95 / 27.65 |
| | w/ Tools | 19.63 / 34.11 | 24.62 / 49.50 | 15.88 / 21.80 | 18.12 / 29.36 | 19.56 / 33.69 |
| | AWM | 21.14 / 36.92 | 25.25 / 51.00 | 16.94 / 22.75 | 19.95 / 29.36 | 20.82 / 35.01 |
| | Dynamic CheatSheet | 20.44 / 33.64 | 25.50 / 52.00 | 14.57 / 22.27 | 18.81 / 28.44 | 19.83 / 34.09 |
| | Agent-KB | 22.78 / 36.92 | 24.75 / 52.50 | 15.17 / 20.38 | 19.50 / 31.19 | 20.55 / 35.25 |
| | XSkill (Ours) | 25.00 / 41.12 | 30.25 / 57.50 | 17.30 / 23.70 | 22.32 / 33.94 | 23.72 / 39.07 |

Performance comparison (%). Each cell reports Avg@4 / Pass@4 over 4 independent rollouts.
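For clarity, the two metrics can be computed per task as below, assuming binary per-rollout success; this is a sketch of the standard definitions, not code from the paper:

```python
def avg_at_n(scores):
    """Avg@N: mean score over N independent rollouts of the same task."""
    return sum(scores) / len(scores)

def pass_at_n(successes):
    """Pass@N: 1 if any of the N rollouts solved the task, else 0."""
    return 1.0 if any(successes) else 0.0

# Example: a task solved in exactly 1 of 4 rollouts.
rollout_scores = [0.0, 1.0, 0.0, 0.0]
avg4 = avg_at_n(rollout_scores)                    # 0.25
pass4 = pass_at_n(s > 0 for s in rollout_scores)   # 1.0
```

Benchmark-level numbers in the table are these per-task values averaged over all tasks, which is why Pass@4 always upper-bounds Avg@4.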

Analysis

Ablation Study

Systematic ablation on VisualToolBench with Gemini-2.5-Pro. Removing Experiences or Skills leads to drops of 3.04 and 3.85 points, respectively. Phase I components (Experience Manager and Skill Manager) contribute more than Phase II components (Task Decomposition and Adaptation), indicating that the quality of accumulated knowledge matters more than the retrieval mechanism.

| Setting | Avg@4 | Pass@4 | Δ Avg@4 |
|---|---|---|---|
| Full Pipeline (Ours) | 30.49 | 46.73 | — |
| *Experience & Skill Ablation* | | | |
| w/o Experience | 27.45 | 42.52 | −3.04 |
| w/o Skill | 26.64 | 41.12 | −3.85 |
| *Phase I Ablation* | | | |
| w/o Experience Manager | 26.40 | 42.06 | −4.09 |
| w/o Skill Manager | 26.87 | 42.99 | −3.62 |
| *Phase II Ablation* | | | |
| w/o Task Decomposition | 29.21 | 44.86 | −1.28 |
| w/o Task Adaptation | 28.97 | 44.39 | −1.52 |
| *Baselines* | | | |
| w/ Tools | 25.35 | 40.65 | −5.14 |
| No Tools | 20.91 | 28.97 | −9.58 |

Skills Improve Tool Use Quality

Transitioning from Experience Only to Skill Only reduces overall error rate from 29.9% (168 errors) to 15.3% (95 errors). Syntax errors drop from 114 (20.3%) to 71 (11.4%), tool name errors from 16 (2.85%) to just 2 (0.32%), and runtime errors from 38 (6.8%) to 22 (3.5%). These results show that skills provide a strong foundation for reliable tool use, directly addressing the problem of inefficient tool use.

Error analysis on VisualToolBench

Experiences Shape Strategic Tool Selection

Experiences substantially shift tool selection toward more targeted strategies. On VisualToolBench, code interpreter usage increases from 66.63% to 76.97% with the full pipeline. On MMSearch-Plus, image search grows from 15.43% to 23.89% while web search declines from 71.07% to 55.08%, reflecting learned preferences for specialized visual tools. This context-aware adaptation gives XSkill the flexible tool orchestration needed for complex multimodal tasks.

| Setting | Code | Search-W | Search-I | Visit |
|---|---|---|---|---|
| *VisualToolBench* | | | | |
| w/ Tools | 66.63 | 31.70 | — | 1.66 |
| Skill Only | 65.96 | 31.10 | — | 2.82 |
| Exp Only | 74.49 ↑ | 22.03 ↓ | — | 2.56 |
| Skill & Exp | 76.97 ↑ | 21.12 ↓ | — | 1.58 |
| *MMSearch-Plus* | | | | |
| w/ Tools | 6.18 | 71.07 | 15.43 | 7.32 |
| Skill Only | 7.94 | 67.07 | 17.87 | 7.12 |
| Exp Only | 13.21 ↑ | 56.12 ↓ | 24.63 ↑ | 6.04 |
| Skill & Exp | 14.37 ↑ | 55.08 ↓ | 23.89 ↑ | 5.66 |

Impact of Rollout Count

Performance consistently improves as rollout count N increases from 1 to 4. Pass@4 exhibits steeper gains — more rollouts provide richer trajectory diversity for the cross-rollout critique mechanism to extract higher-quality experiences.
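The steeper Pass@4 curve is consistent with the simple independence approximation Pass@N ≈ 1 − (1 − p)^N, where p is the single-rollout success rate. This is our own simplification for intuition, not a model from the paper:

```python
def expected_pass_at_n(p: float, n: int) -> float:
    """Expected Pass@N if the N rollouts were independent with success rate p."""
    return 1.0 - (1.0 - p) ** n

# At p = 0.30, Pass@N climbs steeply with N while Avg@N stays at p:
for n in (1, 2, 3, 4):
    print(n, round(expected_pass_at_n(0.30, n), 4))
```

In practice rollouts are correlated, so real Pass@N gains are smaller than this bound, but the qualitative trend (diminishing yet substantial returns per extra rollout) matches the observed curves.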

Performance across rollout values

Zero-Shot Cross-Task Transferability

Knowledge accumulated on one benchmark generalizes to unseen benchmarks. Using VisualToolBench knowledge for TIR-Bench tasks, and MMSearch-Plus knowledge for MMBrowseComp, XSkill consistently outperforms all baselines with average improvements of 2–3 points over Agent-KB.

Out-of-distribution transfer comparison

Qualitative Example

We walk through a real reasoning trajectory to illustrate how XSkill works end-to-end. The agent is given a car image and asked: "In what color area is the second 'GOOD' text on the car body located?" The correct answer is Purple. XSkill injects an accumulated skill document before execution, guiding the agent to use forensic pixel sampling rather than a subjective visual guess.

Question: In what color area is the second "GOOD" text on the car body located?
Ground Truth: Purple
📘 Injected Skill Document
Visual Auditing and Color Forensics for Text on Surfaces
Workflow 1 — Systematic Text Instance Location
  1. Full Context Scan: Read the entire text block to understand the full sequence.
  2. Establish Positional Anchor: Locate the first instance of the target word as a clear starting point.
  3. Sequential Scan & Enumeration: Count each occurrence from the anchor until reaching the target instance.
  4. Neighbor Verification: Check adjacent words (e.g., "VALUES" follows the second "GOOD") to validate position.
Workflow 2 — Objective Color Identification
  1. Isolate ROI: Define a precise bounding box around the background behind the target text.
  2. Forensic Pixel Sampling: Use code_interpreter to extract pixel RGB data — an objective 'Hard Signature' avoiding lighting bias.
  3. Color Determination: Analyze sampled RGB / HSV values to determine the dominant color.
Tool Template
import PIL.Image
def forensic_crop(img_path, box):
    # box: [ymin, xmin, ymax, xmax] (0-1000)
    with PIL.Image.open(img_path) as img:
        w, h = img.size
        l,t,r,b = (box[1]*w/1000, box[0]*h/1000,
                   box[3]*w/1000, box[2]*h/1000)
        img.crop((l,t,r,b)).save("cropped_roi.png")
Common Pitfalls
  • Counting Error: No positional anchor → wrong instance selected.
  • Subjective Color Assessment: Guessing by eye is skewed by lighting & screen calibration.
  • Context Blindness: Searching in isolation without reading surrounding text.
Input Original Image
Original car image
Turn 1 Reasoning & Plan

To determine the color of the area behind the second "GOOD" text, I will follow a systematic approach guided by the skill.

  1. Locate the text "GOOD GIFTING GOOD VALUES" on the car.
  2. Identify the second "GOOD" (the one before "VALUES").
  3. Isolate the Region of Interest (ROI) with a bounding box.
  4. Use code_interpreter to extract average pixel color — objective and lighting-invariant.
Tool Call 1 code_interpreter — Crop & Average RGB
box = [865, 800, 895, 870]  # estimated ROI for 2nd "GOOD"
cropped_roi = original_image.crop(...)
average_rgb = np.mean(roi_array, axis=(0, 1))
# → (84, 59, 72)  → Undetermined by simple threshold
Output
Cropped ROI — Tool call 1

Average RGB: (84, 59, 72) — classified as Undetermined by the simple threshold; red is highest but not conclusively "red".

Turn 2 Reflection & Refinement

RGB (84, 59, 72) — red highest, blue second — is characteristic of a purple / magenta hue. Simple thresholding failed. Switching to HSV color space for a more reliable hue measurement. Also refining the bounding box to avoid reflections.

Tool Call 2 code_interpreter — HSV Color Analysis
box = [860, 800, 880, 870]  # refined bounding box
h_hsv, s_hsv, v_hsv = colorsys.rgb_to_hsv(r/255, g/255, b/255)
# → Hue = 258.1°, Sat = 0.58, Val = 0.36
# 240° ≤ 258.1° < 300°  →  Purple/Violet
Output
Cropped ROI — Tool call 2

RGB: (54, 38, 91)  |  HSV Hue: 258.1° → falls in the Purple/Violet range (240°–300°).

Answer Final Answer

The second "GOOD" text on the car body is located in a purple area.

Confirmed via forensic pixel sampling: RGB (54, 38, 91), HSV Hue 258.1° — Purple/Violet. ✓ Matches ground truth.
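The forensic pixel sampling in the trajectory above can be reproduced end-to-end. The snippet below uses a synthetic solid patch at the trajectory's reported RGB (54, 38, 91) in place of the actual car crop, and the helper names and hue bands (240°–300° for purple/violet) are taken from the walkthrough; everything else is our illustration:

```python
import colorsys
from PIL import Image

def average_rgb(img, box):
    """Average RGB over a pixel-coordinate box (left, top, right, bottom)."""
    pixels = list(img.crop(box).getdata())
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))

def hue_name(r, g, b):
    """Classify the dominant hue via HSV, using the trajectory's 240-300 deg purple band."""
    h, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    deg = h * 360
    if 240 <= deg < 300:
        return "purple/violet"
    if deg < 30 or deg >= 330:
        return "red"
    return f"other ({deg:.1f} deg)"

# Synthetic stand-in for the cropped ROI: a solid patch at RGB (54, 38, 91).
roi = Image.new("RGB", (40, 20), (54, 38, 91))
r, g, b = average_rgb(roi, (0, 0, 40, 20))
label = hue_name(r, g, b)   # RGB (54, 38, 91) maps to hue ~258 deg, i.e. purple/violet
```

This reproduces the trajectory's conclusion: RGB (54, 38, 91) has hue 258.1°, squarely inside the purple/violet band, which is why HSV analysis succeeds where simple RGB thresholding was undetermined.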

BibTeX

@misc{jiang2026xskillcontinuallearningexperience,
  title         = {XSkill: Continual Learning from Experience and Skills in Multimodal Agents},
  author        = {Guanyu Jiang and Zhaochen Su and Xiaoye Qu and Yi R. Fung},
  year          = {2026},
  eprint        = {2603.12056},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2603.12056},
}