<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Evan Lin</title>
    <description>The latest articles on DEV Community by Evan Lin (@evanlin).</description>
    <link>https://dev.to/evanlin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F409957%2Fc150d4a7-cb20-469d-a230-bac27232c577.jpeg</url>
      <title>DEV Community: Evan Lin</title>
      <link>https://dev.to/evanlin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/evanlin"/>
    <language>en</language>
    <item>
      <title>[Gemini] Building a LINE E-commerce Chatbot That Can "Tell Stories from Images"</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sun, 29 Mar 2026 02:08:30 +0000</pubDate>
      <link>https://dev.to/evanlin/gemini-building-a-line-e-commerce-chatbot-that-can-tell-stories-from-images-41i0</link>
      <guid>https://dev.to/evanlin/gemini-building-a-line-e-commerce-chatbot-that-can-tell-stories-from-images-41i0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc7ulj3k2ehr5j0fwdch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc7ulj3k2ehr5j0fwdch.png" alt="image-20260225234804185" width="800" height="860"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxox886v85qv8909apuo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxox886v85qv8909apuo.png" alt="image-20260225234701217" width="800" height="858"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reference articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemini-api/docs/function-calling?hl=zh-tw#multimodal" rel="noopener noreferrer"&gt;Gemini API - Function Calling with Multimodal&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kkdai/linebot-gemini-multimodel-funcal" rel="noopener noreferrer"&gt;GitHub: linebot-gemini-multimodel-funcal&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/function-calling#mm-fr" rel="noopener noreferrer"&gt;Vertex AI - Multimodal Function Response&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Complete code &lt;a href="https://github.com/kkdai/linebot-gemini-multimodel-funcal" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;I believe many people have used the combination of LINE Bot + Function Calling. When a user asks "What clothes did I buy last month?", the Bot calls the database query function, retrieves the order data, and then Gemini answers based on that JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Traditional&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;process&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;designed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;developers:&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;User:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Help me see the jacket I bought before"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Bot:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;Call&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;get_order_history()&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Function&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;returns:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"product_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Brown pilot jacket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"order_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-01-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Gemini:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You bought a brown pilot jacket on January 15th for NT$1,890."&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The answer is completely correct, but something is missing: the user is asking about "that jacket," while Gemini is merely restating the text in the JSON, with no way to "confirm" what the jacket actually looks like. If the database happens to contain three jackets, the AI cannot even tell which one the user remembers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI can read text, but it cannot see pictures&lt;/strong&gt;: this limitation has always been a blind spot in the traditional Function Calling architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkluxi9r5zkhj1vys100.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkluxi9r5zkhj1vys100.png" alt="Google Chrome 2026-02-26 10.34.51" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3gpn1tkbj80ifh65vsr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3gpn1tkbj80ifh65vsr.png" alt="Google Chrome 2026-02-26 10.34.58" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This problem was truly solved only after Gemini introduced &lt;strong&gt;Multimodal Function Response&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Multimodal Function Response?
&lt;/h2&gt;

&lt;p&gt;The traditional Function Calling process is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[User message] → Gemini → [function_call] → [Execute function] → [Return JSON] → Gemini → [Text answer]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Multimodal Function Response&lt;/strong&gt; changes that middle step. The function can not only return JSON, but also include images (JPEG/PNG/WebP) or documents (PDF) in the same response:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge58c6ayrjas18sl2qjz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge58c6ayrjas18sl2qjz.png" alt="Google Chrome 2026-02-25 23.04.28" width="800" height="439"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[User message] → Gemini → [function_call] → [Execute function] → [Return JSON + image bytes] → Gemini → [Text answer that has seen the image]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Gemini generates the next round of answers, it can "see" both the structured data and the image returned by the function, thereby generating richer and more accurate responses.&lt;/p&gt;

&lt;p&gt;The officially supported media formats are currently:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Supported formats&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;image/jpeg&lt;/code&gt;, &lt;code&gt;image/png&lt;/code&gt;, &lt;code&gt;image/webp&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;application/pdf&lt;/code&gt;, &lt;code&gt;text/plain&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
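&lt;p&gt;Since the function response must declare a &lt;code&gt;mime_type&lt;/code&gt; drawn from this table, a small guard helps catch unsupported files early. A minimal sketch (the helper name and mapping are my own, not from the repo):&lt;/p&gt;

```python
from pathlib import Path

# Formats accepted by Multimodal Function Response, per the table above.
SUPPORTED_MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".pdf": "application/pdf",
    ".txt": "text/plain",
}

def mime_type_for(path: str) -> str:
    """Return the MIME type to declare for a file, or raise if unsupported."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"Unsupported media type for function response: {ext!r}")
    return SUPPORTED_MIME_TYPES[ext]
```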

&lt;p&gt;The application scenarios for this feature are very broad: e-commerce customer service (identifying product images), medical consultation (analyzing PDF test reports), design review (giving feedback on screenshots)... almost any scenario where a function needs to return visual data for the AI to analyze is a fit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Goal
&lt;/h2&gt;

&lt;p&gt;This time, I used Multimodal Function Response to build a &lt;strong&gt;LINE e-commerce customer service chatbot&lt;/strong&gt; that demonstrates the following scenario:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;User: "Help me see the jacket I bought before"&lt;br&gt;
Bot (traditional): "You bought a brown pilot jacket."&lt;br&gt;
Bot (Multimodal): "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon, with metal zipper pockets on the sides. This is your January 15th order ORD-2026-0115, for a total of NT$1,890, and it has been delivered." + &lt;strong&gt;Product photo&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The difference is obvious: Gemini really "saw" the jacket, rather than just restating the text in the database.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why not use Google ADK?
&lt;/h3&gt;

&lt;p&gt;Originally, this repo used Google ADK (Agent Development Kit) to manage the Agent. ADK's &lt;code&gt;Runner&lt;/code&gt; and &lt;code&gt;Agent&lt;/code&gt; encapsulate the entire Function Calling flow, which is very convenient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But Multimodal Function Response requires manually placing the image bytes into the &lt;code&gt;parts&lt;/code&gt; of the function response, and ADK fully encapsulates that layer, leaving no way to intervene.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So this time, I used &lt;code&gt;google.genai.Client&lt;/code&gt; directly and implemented the function-calling loop myself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Old architecture (ADK)
&lt;/span&gt;&lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_async&lt;/span&gt;&lt;span class="p"&gt;(...):&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="c1"&gt;# ADK handles all function calls for you, but you can't control the response content
&lt;/span&gt;
&lt;span class="c1"&gt;# New architecture (directly use google.genai)
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ECOMMERCE_TOOLS&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Handle function calls yourself, include images yourself
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Overall architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LINE User
    │
    ▼ POST /
FastAPI Webhook Handler
    │
    ▼
EcommerceAgent.process_message(text, line_user_id)
    │
    ├─ ① Call Gemini (with conversation history)
    │
    ├─ ② Gemini decides to call a tool → function_call
    │
    ├─ ③ _execute_tool()
    │ ├─ Execute query function (search_products / get_order_history / get_product_details)
    │ └─ Read real product photos in the img/ directory (Unsplash JPEG)
    │
    ├─ ④ Construct Multimodal Function Response
    │ └─ FunctionResponsePart(inline_data=FunctionResponseBlob(data=image_bytes))
    │
    ├─ ⑤ Call Gemini again (Gemini sees the image + data)
    │
    └─ ⑥ Return (ai_text, image_bytes)
    │
    ▼
LINE Reply:
  TextSendMessage(text=ai_text)
  ImageSendMessage(url=BOT_HOST_URL/images/{uuid}) ← FastAPI /images endpoint provides

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How to get product images?
&lt;/h3&gt;

&lt;p&gt;This demo uses real &lt;strong&gt;clothing photos from Unsplash&lt;/strong&gt;. Each of the five products corresponds to an actual photo of the item, stored in the &lt;code&gt;img/&lt;/code&gt; directory. The reading logic is very simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_product_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Read the product image and return JPEG bytes.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
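&lt;p&gt;One optional hardening step (my addition, not in the repo): before declaring the bytes as &lt;code&gt;image/jpeg&lt;/code&gt; in the function response, verify that the file really is a JPEG by checking its magic bytes:&lt;/p&gt;

```python
def is_jpeg(data: bytes) -> bool:
    """A JPEG stream starts with the SOI marker FF D8 and ends with EOI FF D9."""
    return len(data) >= 4 and data[:2] == b"\xff\xd8" and data[-2:] == b"\xff\xd9"
```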



&lt;p&gt;Each product in &lt;code&gt;PRODUCTS_DB&lt;/code&gt; has an &lt;code&gt;image_path&lt;/code&gt; field pointing to the corresponding image file:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product ID&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P001&lt;/td&gt;
&lt;td&gt;Brown pilot jacket&lt;/td&gt;
&lt;td&gt;tobias-tullius-…-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P002&lt;/td&gt;
&lt;td&gt;White cotton university T&lt;/td&gt;
&lt;td&gt;mediamodifier-…-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P003&lt;/td&gt;
&lt;td&gt;Dark blue denim jacket&lt;/td&gt;
&lt;td&gt;caio-coelho-…-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P004&lt;/td&gt;
&lt;td&gt;Beige knitted shawl&lt;/td&gt;
&lt;td&gt;milada-vigerova-…-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P005&lt;/td&gt;
&lt;td&gt;Light blue simple T-shirt&lt;/td&gt;
&lt;td&gt;cristofer-maximilian-…-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The image bytes that are read serve two purposes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Included as a &lt;code&gt;FunctionResponseBlob&lt;/code&gt; for Gemini to analyze: real photos let Gemini describe the actual fabric texture and tailoring details&lt;/li&gt;
&lt;li&gt; Cached in the &lt;code&gt;image_cache&lt;/code&gt; dict and served to the LINE Bot for display via the FastAPI &lt;code&gt;/images/{uuid}&lt;/code&gt; endpoint&lt;/li&gt;
&lt;/ol&gt;
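&lt;p&gt;The second use can be sketched in a few lines. This is an assumption about how the cache works, not the repo's exact code: the bytes go into an in-memory dict keyed by a fresh UUID, and the FastAPI route resolves that UUID back to the bytes:&lt;/p&gt;

```python
import uuid

# Hypothetical in-memory store mirroring the image_cache dict described above.
image_cache: dict[str, bytes] = {}
BOT_HOST_URL = "https://bot.example.com"  # placeholder for the real host

def cache_image(image_bytes: bytes) -> str:
    """Store the bytes under a fresh UUID and return the public image URL."""
    image_id = uuid.uuid4().hex
    image_cache[image_id] = image_bytes
    return f"{BOT_HOST_URL}/images/{image_id}"
```

&lt;p&gt;A FastAPI &lt;code&gt;GET /images/{image_id}&lt;/code&gt; route serving this cache would simply look the ID up and return &lt;code&gt;Response(content=image_cache[image_id], media_type="image/jpeg")&lt;/code&gt;.&lt;/p&gt;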




&lt;h2&gt;
  
  
  Core Code Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Define tools (FunctionDeclaration)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;ECOMMERCE_TOOLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionDeclaration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_order_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query the current user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s order history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Time range: all / last_month / last_3_months&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;enum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_3_months&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="n"&gt;required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# ... search_products, get_product_details
&lt;/span&gt;    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Function call cycle (up to 5 iterations)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line_user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line_user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_iteration&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="c1"&gt;# Up to 5 times, to prevent infinite loops
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;system_instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_SYSTEM_INSTRUCTION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ECOMMERCE_TOOLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;model_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Find all function_call parts
&lt;/span&gt;        &lt;span class="n"&gt;fc_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;fc_parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# No function call → final text response
&lt;/span&gt;            &lt;span class="n"&gt;final_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="c1"&gt;# Has function call → execute tool, include image
&lt;/span&gt;        &lt;span class="n"&gt;tool_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fc_part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fc_parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;fc_part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc_part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;line_user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;tool_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_multimodal_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fc_part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_parts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Construct Multimodal Function Response (the most critical step)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_build_multimodal_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;multimodal_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# ⚠️ Note: Use FunctionResponseBlob here, not types.Blob!
&lt;/span&gt;        &lt;span class="n"&gt;multimodal_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponsePart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponseBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# raw bytes, SDK handles base64 internally
&lt;/span&gt;                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_function_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;func_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result_dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Structured JSON data
&lt;/span&gt;        &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;multimodal_parts&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← Image is here! Gemini can "see" it after receiving it
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini will receive both &lt;code&gt;result_dict&lt;/code&gt; (order JSON) and &lt;code&gt;image_bytes&lt;/code&gt; (product image) in the next &lt;code&gt;generate_content&lt;/code&gt; call, and the generated answer can therefore describe the visual content of the image.&lt;/p&gt;
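&lt;p&gt;For reference, here is a schematic of the conversation history at this point, sketched with plain dicts instead of the SDK's &lt;code&gt;types.Content&lt;/code&gt;/&lt;code&gt;types.Part&lt;/code&gt; objects (the function and argument names are just the ones used in this article); the roles and their ordering are what matter:&lt;/p&gt;

```python
# Schematic only: plain dicts stand in for types.Content / types.Part.
# The real payload uses the SDK objects, but the ordering mirrors the
# user -> model(function_call) -> tool(function_response) turns.

def build_history(user_text, func_name, args, result_dict, has_image):
    """Assemble the three-turn history sent to the second generate_content call."""
    function_response = {"name": func_name, "response": result_dict}
    if has_image:
        # In the real code this is a FunctionResponsePart carrying a
        # FunctionResponseBlob with the raw image bytes.
        function_response["parts"] = ["<image bytes>"]
    return [
        {"role": "user", "parts": [user_text]},
        {"role": "model", "parts": [{"function_call": {"name": func_name, "args": args}}]},
        {"role": "tool", "parts": [function_response]},
    ]

history = build_history(
    "Help me see the jacket I bought before",
    "get_order_history", {"time_range": "all"},
    {"orders": ["..."], "order_count": 2},
    has_image=True,
)
```

&lt;p&gt;The third turn is the one Step 3 constructs; everything before it is already in &lt;code&gt;contents&lt;/code&gt;.&lt;/p&gt;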

&lt;h3&gt;
  
  
  Step 4: LINE Bot returns text + image in one reply
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py
&lt;/span&gt;
&lt;span class="n"&gt;ai_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ecommerce_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line_user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;reply_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ai_text&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;image_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;image_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="c1"&gt;# Temporary storage
&lt;/span&gt;    &lt;span class="n"&gt;image_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;BOT_HOST_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/images/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# FastAPI provides service
&lt;/span&gt;    &lt;span class="n"&gt;reply_messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;ImageSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;original_content_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;preview_image_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;get_line_bot_api&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reply_messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LINE Bot's &lt;code&gt;reply_message&lt;/code&gt; supports returning multiple messages at once (up to 5), so text and images can be sent simultaneously.&lt;/p&gt;
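&lt;p&gt;Once you start appending images (and possibly splitting long answers), the limits are easy to hit. Here is a small helper as a sketch; the 5,000-character text limit and the dict-shaped messages are my assumptions for illustration, not code from the repo:&lt;/p&gt;

```python
TEXT_LIMIT = 5000   # assumed cap on characters in a single LINE text message
REPLY_LIMIT = 5     # reply_message accepts at most 5 message objects

def build_reply(ai_text, image_url=None):
    """Split long text into chunks and append an optional image message,
    staying within LINE's per-reply limits. Plain dicts stand in for
    TextSendMessage / ImageSendMessage objects."""
    messages = [
        {"type": "text", "text": ai_text[i:i + TEXT_LIMIT]}
        for i in range(0, len(ai_text), TEXT_LIMIT)
    ]
    if image_url:
        messages.append({
            "type": "image",
            "original_content_url": image_url,
            "preview_image_url": image_url,
        })
    return messages[:REPLY_LIMIT]  # overflow beyond 5 messages is dropped
```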




&lt;h2&gt;
  
  
  Pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 1: &lt;code&gt;FunctionResponseBlob&lt;/code&gt; is not &lt;code&gt;Blob&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The most common pitfall: when constructing the multimodal image part, &lt;strong&gt;you cannot use &lt;code&gt;types.Blob&lt;/code&gt;&lt;/strong&gt;; you must use &lt;code&gt;types.FunctionResponseBlob&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Error (will TypeError)
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponsePart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Correct
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponsePart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponseBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although both expose &lt;code&gt;mime_type&lt;/code&gt; and &lt;code&gt;data&lt;/code&gt; fields, &lt;code&gt;FunctionResponsePart&lt;/code&gt; declares its &lt;code&gt;inline_data&lt;/code&gt; field as &lt;code&gt;FunctionResponseBlob&lt;/code&gt;, so Pydantic validation rejects a plain &lt;code&gt;Blob&lt;/code&gt; outright. You can confirm this with &lt;code&gt;python -c "from google.genai import types; print(types.FunctionResponsePart.model_fields)"&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 2: &lt;code&gt;aiohttp.ClientSession&lt;/code&gt; cannot be created at the module level
&lt;/h3&gt;

&lt;p&gt;The original code directly created &lt;code&gt;aiohttp.ClientSession()&lt;/code&gt; at the module level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Old method: module level
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Will warn or error if there is no running event loop
&lt;/span&gt;&lt;span class="n"&gt;async_http_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AiohttpAsyncHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Importing &lt;code&gt;main.py&lt;/code&gt; under pytest raises &lt;code&gt;RuntimeError: no running event loop&lt;/code&gt;, because no event loop is running at import time. The fix is lazy initialization: create the session only when it is first needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ✅ New method: lazy init
&lt;/span&gt;&lt;span class="n"&gt;_line_bot_api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_line_bot_api&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_line_bot_api&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_line_bot_api&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aiohttp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Called within the async route handler, guaranteeing an event loop
&lt;/span&gt;        &lt;span class="n"&gt;_line_bot_api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncLineBotApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel_access_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;AiohttpAsyncHttpClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_line_bot_api&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
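&lt;p&gt;One caveat with lazy init: the session created inside the handler is never closed. The same pattern with explicit cleanup can be sketched using only the standard library (a dict stands in for &lt;code&gt;aiohttp.ClientSession&lt;/code&gt;; in FastAPI you would call the cleanup from a shutdown/lifespan hook):&lt;/p&gt;

```python
import asyncio

_resource = None  # module-level handle, but created lazily

def get_resource():
    """Lazily create the shared resource; must be called inside a running loop."""
    global _resource
    if _resource is None:
        # Raises RuntimeError outside a running loop -- the exact failure
        # mode that module-level ClientSession() creation hits under pytest.
        asyncio.get_running_loop()
        _resource = {"closed": False}  # stands in for aiohttp.ClientSession()
    return _resource

async def close_resources():
    """Call from the app's shutdown/lifespan hook to release the session."""
    global _resource
    if _resource is not None:
        _resource["closed"] = True     # stands in for: await session.close()
        _resource = None

async def main():
    first = get_resource()
    assert first is get_resource()     # same instance within the loop
    await close_resources()

asyncio.run(main())
```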



&lt;h3&gt;
  
  
  ❌ Pitfall 3: LINE Bot needs HTTPS URL to send images
&lt;/h3&gt;

&lt;p&gt;Gemini receives raw bytes, but LINE Bot's &lt;code&gt;ImageSendMessage&lt;/code&gt; requires a &lt;strong&gt;publicly accessible HTTPS URL&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The solution is to add an &lt;code&gt;/images/{image_id}&lt;/code&gt; endpoint in FastAPI and stash the image bytes in an &lt;code&gt;image_cache&lt;/code&gt; dict; LINE then fetches the image through this endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/images/{image_id}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;serve_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Image not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For local development, expose port 8000 with &lt;code&gt;ngrok&lt;/code&gt;; after deploying to Cloud Run, the service URL works directly.&lt;/p&gt;
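&lt;p&gt;One thing to watch: a plain &lt;code&gt;image_cache&lt;/code&gt; dict grows without bound as replies accumulate. A sketch of a minimal TTL cache (my own addition, not in the original repo) that lazily evicts stale entries:&lt;/p&gt;

```python
import time

class TTLImageCache:
    """Dict-like store that forgets entries after ttl_seconds.
    Illustrative replacement for the unbounded image_cache dict."""

    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self._store = {}  # image_id -> (expires_at, image_bytes)

    def __setitem__(self, image_id, image_bytes):
        self._store[image_id] = (time.monotonic() + self.ttl, image_bytes)

    def get(self, image_id):
        entry = self._store.get(image_id)
        if entry is None:
            return None
        expires_at, image_bytes = entry
        if time.monotonic() > expires_at:
            del self._store[image_id]  # lazy eviction on read
            return None
        return image_bytes
```

&lt;p&gt;&lt;code&gt;serve_image&lt;/code&gt; only needs &lt;code&gt;__setitem__&lt;/code&gt; and &lt;code&gt;get&lt;/code&gt;, so this can swap in without touching the endpoint.&lt;/p&gt;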




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mock database (default data for demo)
&lt;/h3&gt;

&lt;p&gt;The system has 5 built-in products (all with real Unsplash photos), and each LINE user is automatically bound to two demo orders the first time they query their orders:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Order number&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ORD-2026-0115&lt;/td&gt;
&lt;td&gt;2026-01-15&lt;/td&gt;
&lt;td&gt;P001 Brown pilot jacket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ORD-2026-0108&lt;/td&gt;
&lt;td&gt;2026-01-08&lt;/td&gt;
&lt;td&gt;P003 Dark blue denim jacket&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
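&lt;p&gt;The first-query auto-binding can be sketched like this (the field names and function shape are my guesses for illustration, not copied from the repo):&lt;/p&gt;

```python
# Illustrative mock "database": field names are guesses, not the repo's schema.
DEMO_ORDERS = [
    {"order_id": "ORD-2026-0115", "date": "2026-01-15", "product_id": "P001"},
    {"order_id": "ORD-2026-0108", "date": "2026-01-08", "product_id": "P003"},
]

_user_orders = {}  # line_user_id -> list of that user's orders

def get_order_history(line_user_id):
    """Bind the two demo orders to a user on their first query, then return them."""
    if line_user_id not in _user_orders:
        # copy so later per-user mutations (e.g. status updates) stay isolated
        _user_orders[line_user_id] = [dict(o) for o in DEMO_ORDERS]
    orders = _user_orders[line_user_id]
    return {"orders": orders, "order_count": len(orders)}
```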

&lt;h3&gt;
  
  
  Scenario 1: "Help me see the jacket I bought before"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends: "Help me see the jacket I bought before"

[Gemini → function_call]
  get_order_history(time_range="all")

[_execute_tool execution]
  - get_order_history() returns two orders (P001, P003)
  - Read img/tobias-tullius-...-unsplash.jpg → Brown pilot jacket real photo bytes

[Multimodal Function Response]
  Part.from_function_response(
    name="get_order_history",
    response={"orders": [...], "order_count": 2},
    parts=[FunctionResponsePart(inline_data=FunctionResponseBlob(data=&amp;lt;photo&amp;gt;))]
  )

[Gemini responds after seeing the real photo]
  "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon with
   a glossy feel, and a metal zipper pocket on the left sleeve. This is your January 15, 2026
   order ORD-2026-0115, for a total of NT$1,890, status: delivered."

LINE displays: [Text] + [Brown pilot jacket real photo]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: "Are there any dark blue jackets?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Gemini → function_call]
  search_products(description="dark blue jacket", color="dark blue")

[Gemini sees the real photo of the P003 dark blue denim jacket]
  "Yes! This dark blue denim jacket (P003) in the photo features a retro stitching design,
   a lapel with metal buttons, and a very complete garment feel, priced at NT$1,490, with 8 in stock."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 3: "What are the features of the P004 knitted shawl?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Gemini → function_call]
  get_product_details(product_id="P004")

[Gemini sees the real photo of the beige knitted shawl]
  "The photo shows a beige handmade crochet shawl, with a V-neck design and tassels at the bottom,
   you can see the light lace-like mesh weave, elegant texture, priced at NT$1,290."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Traditional Function Response vs Multimodal Function Response
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Multimodal&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Function return&lt;/td&gt;
&lt;td&gt;Pure JSON&lt;/td&gt;
&lt;td&gt;JSON + image/PDF bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini perception&lt;/td&gt;
&lt;td&gt;Text data&lt;/td&gt;
&lt;td&gt;Text + visual content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer quality&lt;/td&gt;
&lt;td&gt;"You bought a brown pilot jacket"&lt;/td&gt;
&lt;td&gt;"You can see the nylon texture in the photo, with a zipper pocket on the left sleeve..."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API difference&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Part.from_function_response(name, response)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Part.from_function_response(name, response, parts=[...])&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Applicable scenarios&lt;/td&gt;
&lt;td&gt;Pure text data queries&lt;/td&gt;
&lt;td&gt;Scenarios that require visual recognition/confirmation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Analysis and Outlook
&lt;/h2&gt;

&lt;p&gt;This implementation gave me a new understanding of Gemini's Function Calling capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Multimodal Function Response truly solves&lt;/strong&gt; is letting an AI agent carry visual information inside the very act of "calling an external system", instead of looking up text first and uploading images in a separate step. This will be an important foundational capability in visually driven domains such as e-commerce, medicine, and design.&lt;/p&gt;

&lt;p&gt;However, there are still a few limitations worth noting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Image URLs cannot be used directly&lt;/strong&gt;: Gemini's &lt;code&gt;FunctionResponseBlob&lt;/code&gt; requires raw bytes; you cannot pass a URL (unlike images supplied directly in the prompt). If the image lives at a URL, download it to bytes first, e.g. with &lt;code&gt;requests.get()&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;display_name&lt;/code&gt; is optional&lt;/strong&gt;: The official documentation examples include &lt;code&gt;display_name&lt;/code&gt; and a &lt;code&gt;$ref&lt;/code&gt; JSON reference, but in my testing with google-genai 1.49.0 everything works without &lt;code&gt;display_name&lt;/code&gt;, and Gemini can still see and analyze the image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model support&lt;/strong&gt;: Officially only the Gemini 3 series is marked as supported, but in my testing &lt;code&gt;gemini-2.0-flash&lt;/code&gt; also handles it fine, and the API structure is identical.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
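&lt;p&gt;For the first limitation, downloading to bytes can be done with the standard library alone. A sketch (using &lt;code&gt;urllib&lt;/code&gt; instead of &lt;code&gt;requests&lt;/code&gt;) that also sniffs the MIME type from magic bytes, so the &lt;code&gt;mime_type&lt;/code&gt; handed to &lt;code&gt;FunctionResponseBlob&lt;/code&gt; matches the actual data:&lt;/p&gt;

```python
from urllib.request import urlopen

def sniff_mime(data):
    """Guess an image MIME type from magic bytes (JPEG / PNG / WebP only)."""
    if data[:3] == b"\xff\xd8\xff":
        return "image/jpeg"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "image/png"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return "application/octet-stream"

def image_bytes_from_url(url, timeout=10.0):
    """Download an image URL to the raw bytes FunctionResponseBlob requires."""
    with urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    return data, sniff_mime(data)
```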

&lt;p&gt;There are many directions to extend this: let users send their own product photos for the Bot to compare, include PDF catalogs in the function response for Gemini to read directly, or, in medical scenarios, let the Bot analyze report images converted from DICOM. As long as an external system can supply visual data, Multimodal Function Response can make the AI's answers more grounded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The takeaway from this LINE Bot build fits in one sentence: &lt;strong&gt;let the function response carry the image, and Gemini's answer is upgraded from "restating data" to "telling a story based on the picture"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The core API is just these few lines, but it takes a lot of details to get the whole process working:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The complete way for Gemini to see the image returned by the function
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_function_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_order_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...]},&lt;/span&gt;
    &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponsePart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponseBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="c1"&gt;# ← Not types.Blob!
&lt;/span&gt;                &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complete code is on &lt;a href="https://github.com/kkdai/linebot-gemini-multimodel-funcal" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, feel free to clone and play with it.&lt;/p&gt;

&lt;p&gt;See you next time!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>gemini</category>
      <category>llm</category>
    </item>
    <item>
      <title>Gemini Tool Combo: Building a LINE Meetup Helper with Maps Grounding and Places API in a Single API Call</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sun, 29 Mar 2026 02:07:59 +0000</pubDate>
      <link>https://dev.to/gde/gemini-tool-combo-building-a-line-meetup-helper-with-maps-grounding-and-places-api-in-a-single-api-3ppd</link>
      <guid>https://dev.to/gde/gemini-tool-combo-building-a-line-meetup-helper-with-maps-grounding-and-places-api-in-a-single-api-3ppd</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ljj7q6yd4dju6v6uxg2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ljj7q6yd4dju6v6uxg2.png" alt="image-20260327164715459" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reference articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemini-api-tooling-updates/" rel="noopener noreferrer"&gt;Gemini API tooling updates: context circulation, tool combos and Maps grounding for Gemini 3&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://developers.google.com/maps/documentation/places/web-service/nearby-search" rel="noopener noreferrer"&gt;Google Places API (New) - searchNearby&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/kkdai/linebot-spot-finder" rel="noopener noreferrer"&gt;GitHub: linebot-spot-finder&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Complete code &lt;a href="https://github.com/kkdai/linebot-spot-finder" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; (Meeting Helper LINE Bot Spot Finder)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;The combination of LINE Bot + Gemini is already very common. Whether it's using Google Search Grounding to let the model look up real-time information or using Function Calling to let the model call custom logic, they are both mature when used alone.&lt;/p&gt;

&lt;p&gt;But what if you want to achieve both "map location context" and "query real ratings" &lt;strong&gt;in the same question&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;Taking restaurant search as an example, the traditional approach usually looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Help me find a hot pot restaurant nearby with a rating of 4 stars or above"

Solution A (using only Maps Grounding):
Gemini has map context, but the rating information is described by AI itself, and accuracy is not guaranteed.

Solution B (using only Places API):
You can get real ratings, but there is no map context, and Gemini doesn't know where the user is.

To have both, you usually need to make two API calls, or manually connect them yourself.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Letting AI both search maps and call external APIs within a single call&lt;/strong&gt; has always been an awkward gap in the old Gemini API architecture.&lt;/p&gt;

&lt;p&gt;That changed on March 17, 2026, when Google released the &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemini-api-tooling-updates/" rel="noopener noreferrer"&gt;Gemini API Tooling Updates&lt;/a&gt; (by Mariano Cocirio), which give this problem an official solution.&lt;/p&gt;




&lt;h2&gt;
  
  
  What are Tool Combinations?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl69w6em7cc4jzvdxmdiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl69w6em7cc4jzvdxmdiu.png" alt="image-20260327163136077" width="800" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google announced three core features in this &lt;a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemini-api-tooling-updates/" rel="noopener noreferrer"&gt;update&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tool Combinations.&lt;/strong&gt; Developers can now attach built-in tools (such as Google Search and Google Maps) and custom Function Declarations in a &lt;strong&gt;single Gemini API call&lt;/strong&gt;. The model decides which tool to call and when, then integrates the results into the final answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Maps Grounding.&lt;/strong&gt; Gemini can now perceive map data directly: not just a text description of a "location", but genuine spatial context, knowing where the user is and what is nearby.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Context Circulation.&lt;/strong&gt; Context flows naturally across multi-turn tool calls, so the model fully remembers the results of the first tool call when making the second.&lt;/p&gt;

&lt;p&gt;The key to this change is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Old approach (two tools cannot coexist)
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;google_search&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GoogleSearch&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MY_FN&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# New approach (the same Tool object, both coexist)
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;google_maps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GoogleMaps&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;MY_FN&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A one-line change opens up an entirely new way of combining tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Goal
&lt;/h2&gt;

&lt;p&gt;This time, I used Tool Combinations to upgrade the existing &lt;strong&gt;linebot-spot-finder&lt;/strong&gt; from "Maps Grounding only, with rough answers" to "Google Maps context + real Places API data":&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;After the user sends their GPS location, they enter: "Please find a hot pot restaurant with a rating of 4 stars or above, suitable for group dining, and list the name, address, and review summary."&lt;/p&gt;

&lt;p&gt;Bot (old Maps Grounding): "There are several hot pot restaurants nearby, and the ratings are good." (AI describes it itself, which may not be accurate)&lt;/p&gt;

&lt;p&gt;Bot (new Tool Combo): "Lao Wang Hot Pot | 100 Shimin Avenue, Xinyi District, Taipei City | Rating 4.6 (312) | Reviews: Large portions, great value for money, suitable for group dining; efficient service, fast serving."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The difference is: Gemini now receives both the map context (where you are) and the &lt;strong&gt;real structured data&lt;/strong&gt; (rating numbers, review text) from the Places API, so the answer changes from a "vague description" to "concrete, verifiable information".&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overall Message Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LINE User sends GPS location
    │
    ▼
handle_location() → session.metadata stores lat/lng
    │
    └──► Returns Quick Reply (restaurant / gas station / parking lot)

LINE User sends text question (e.g. "Find a hot pot restaurant with a rating of 4 stars or above")
    │
    ▼
handle_text()
    │
    ├── session has lat/lng?
    │ Yes → tool_combo_search(query, lat, lng) ← Focus of this article
    │ No → fallback: Gemini Chat + Google Search
    │
    └──► Returns natural language answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
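&lt;p&gt;The dispatch in the flow above can be sketched in plain Python. This is a minimal sketch only: &lt;code&gt;session&lt;/code&gt; is a bare dict, and &lt;code&gt;tool_combo_search&lt;/code&gt; / &lt;code&gt;fallback_search&lt;/code&gt; are stubs standing in for the real handlers:&lt;/p&gt;

```python
# Minimal sketch of handle_text(): session metadata gates the tool-combo path.
# "session", "tool_combo_search", and "fallback_search" are simplified stand-ins
# for the real LINE bot handlers, not the actual implementation.

def tool_combo_search(query: str, lat: float, lng: float) -> str:
    return f"tool_combo:{query}@{lat},{lng}"

def fallback_search(query: str) -> str:
    return f"fallback:{query}"

def handle_text(query: str, session: dict) -> str:
    meta = session.get("metadata", {})
    lat, lng = meta.get("lat"), meta.get("lng")
    if lat is not None and lng is not None:
        # GPS known: Maps grounding + Places function calling
        return tool_combo_search(query, lat, lng)
    # No location yet: plain Gemini chat with Google Search grounding
    return fallback_search(query)
```

&lt;p&gt;The point of the branch is that the tool-combo path only makes sense once &lt;code&gt;handle_location()&lt;/code&gt; has stored real GPS coordinates; everything else falls back to search grounding.&lt;/p&gt;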



&lt;h3&gt;
  
  
  Tool Combo Agentic Loop
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tool_combo_search(query, lat, lng)
         │
         ▼
  Step 1: generate_content()
  tools = [google_maps + search_nearby_restaurants]
         │
         ▼
  response.candidates[0].content.parts has function_call?
       ╱ ╲
      Yes   No
      │     │
      ▼     ▼
  _execute_function()  Directly returns response.text
  → _call_places_api()
    (Places API searchNearby)
    Returns rating, address, reviews
      │
      ▼
  Collect into a single Content(role="user")
  Add to history
      │
      ▼
  Step 3: generate_content(contents=history)
  Gemini integrates map context + Places data
      │
      ▼
  Returns final.text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
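&lt;p&gt;The loop in the diagram boils down to a two-call pattern. Here is a stripped-down sketch with a stub model in place of the Gemini client; the dict shapes and &lt;code&gt;FakeModel&lt;/code&gt; are illustrative only (the real code uses &lt;code&gt;google.genai&lt;/code&gt; types such as &lt;code&gt;Content&lt;/code&gt; and &lt;code&gt;Part&lt;/code&gt;):&lt;/p&gt;

```python
# Skeleton of the agentic loop from the diagram, using a stub model in place of
# the real Gemini client. FakeModel and the dict shapes are illustrative only.

def execute_function(name: str, args: dict) -> dict:
    # Stand-in for the real dispatcher that calls the Places API.
    if name == "search_nearby_restaurants":
        return {"restaurants": [{"name": "Demo Hot Pot", "rating": 4.6}]}
    return {}

def agentic_loop(model, query: str) -> str:
    history = [{"role": "user", "text": query}]
    first = model.generate(history)                 # Step 1: call with tools attached
    call = first.get("function_call")
    if call is None:                                # no tool requested: answer directly
        return first["text"]
    result = execute_function(call["name"], call["args"])  # Step 2: run the tool
    history.append({"role": "model", "function_call": call})
    history.append({"role": "user", "function_response": result})
    return model.generate(history)["text"]          # Step 3: model integrates tool output

class FakeModel:
    """Requests a tool on the first call, answers from its result on the second."""
    def __init__(self):
        self.calls = 0

    def generate(self, history):
        self.calls += 1
        if self.calls == 1:
            return {"function_call": {"name": "search_nearby_restaurants", "args": {}}}
        data = history[-1]["function_response"]
        return {"text": "Found " + data["restaurants"][0]["name"]}
```

&lt;p&gt;The essential invariant: the second &lt;code&gt;generate&lt;/code&gt; call sees the full history, including the tool result, so the final answer is grounded in real data rather than the model's guess.&lt;/p&gt;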



&lt;h3&gt;
  
  
  Why not put lat/lng in Function Declaration?
&lt;/h3&gt;

&lt;p&gt;This is an important design decision.&lt;/p&gt;

&lt;p&gt;If you add &lt;code&gt;lat&lt;/code&gt;/&lt;code&gt;lng&lt;/code&gt; to the parameters of &lt;code&gt;SEARCH_NEARBY_RESTAURANTS_FN&lt;/code&gt;, Gemini will fill in the coordinates itself—but it fills in the "approximate location" inferred from the conversation, not the user's actual GPS coordinates, and the error can be as high as several kilometers.&lt;/p&gt;

&lt;p&gt;The correct approach is to let the Python dispatcher extract the precise coordinates from &lt;code&gt;session.metadata&lt;/code&gt; and &lt;strong&gt;inject&lt;/strong&gt; them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_execute_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_nearby_restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_call_places_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← Inject from session, don't let Gemini guess
&lt;/span&gt;            &lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;min_rating&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Core Code Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Define Function Declaration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;SEARCH_NEARBY_RESTAURANTS_FN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionDeclaration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_nearby_restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search for nearby restaurants using Google Places API, and return the rating, address, and user reviews.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lat/lng is automatically included by the system and does not need to be provided.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBJECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keyword&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Restaurant type or keyword, such as: hot pot, hot pot, Italian&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Minimum rating threshold (1–5), default 4.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;radius_m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search radius (meters), default 1000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The description clearly tells the model "lat/lng is included by the system", avoiding the model filling in the coordinates itself in the args.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Places API Call
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;PLACES_API_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://places.googleapis.com/v1/places:searchNearby&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;PLACES_FIELD_MASK&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places.displayName,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places.rating,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places.userRatingCount,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places.formattedAddress,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places.reviews&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_call_places_api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_rating&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;4.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;radius_m&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;includedTypes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restaurant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxResultCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;locationRestriction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;circle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;center&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;longitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;radiusMeters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;radius_m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;PLACES_API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Goog-Api-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_MAPS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-Goog-FieldMask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PLACES_FIELD_MASK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;restaurants&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;places&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;min_rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;reviews&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviews&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])[:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;restaurants&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;displayName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;formattedAddress&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rating&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rating_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;place&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;userRatingCount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviews&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reviews&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restaurants&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;restaurants&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
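&lt;p&gt;The extraction chain above is easy to exercise on a hand-written sample of the &lt;code&gt;searchNearby&lt;/code&gt; response shape (the sample data is illustrative; names are taken from the earlier example):&lt;/p&gt;

```python
# Exercise the same field-extraction chain _call_places_api uses, against a
# hand-written sample of the Places API (New) searchNearby response shape.
sample = {
    "places": [
        {
            "displayName": {"text": "Lao Wang Hot Pot"},
            "formattedAddress": "100 Shimin Avenue, Taipei",
            "rating": 4.6,
            "userRatingCount": 312,
            "reviews": [{"text": {"text": "Great value"}}, {"text": {}}],
        },
        {"displayName": {"text": "Low Rated"}, "rating": 3.2},
    ]
}

def extract(data: dict, min_rating: float = 4.0) -> list:
    out = []
    for place in data.get("places", []):
        rating = place.get("rating", 0)
        if rating < min_rating:
            continue  # drop places below the threshold, same as the bot
        reviews = [
            r["text"]["text"]
            for r in place.get("reviews", [])[:3]
            if r.get("text", {}).get("text")  # skip reviews with no text body
        ]
        out.append({
            "name": place["displayName"]["text"],
            "address": place.get("formattedAddress", ""),
            "rating": rating,
            "rating_count": place.get("userRatingCount", 0),
            "reviews": reviews,
        })
    return out

result = extract(sample)
```

&lt;p&gt;Defensive &lt;code&gt;.get()&lt;/code&gt; chains matter here: the Places API omits fields like &lt;code&gt;rating&lt;/code&gt; or &lt;code&gt;reviews&lt;/code&gt; when a place has none, so direct indexing would raise &lt;code&gt;KeyError&lt;/code&gt; on real responses.&lt;/p&gt;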



&lt;h3&gt;
  
  
  Step 3: Tool Combo Main Function (Agentic Loop)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tool_combo_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_LOCATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;http_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HttpOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;enriched_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s current location: latitude &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, longitude &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please answer in traditional Chinese using Taiwanese terminology, and do not use markdown format.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tool_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;google_maps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GoogleMaps&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;# ← Maps grounding
&lt;/span&gt;                &lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SEARCH_NEARBY_RESTAURANTS_FN&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# ← Places API
&lt;/span&gt;            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ── Step 1 ──────────────────────────────────────────────────────
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOOL_COMBO_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;enriched_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;（Unable to get a reply）&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;enriched_query&lt;/span&gt;&lt;span class="p"&gt;)]),&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# ── Step 2：Processing function_call ──────────────────────────────────
&lt;/span&gt;    &lt;span class="n"&gt;function_response_parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;fn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function_call&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_execute_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}),&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lng&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;function_response_parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;function_response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;function_response_parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;function_response_parts&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# ── Step 3 ────────────────────────────────────────────────────
&lt;/span&gt;        &lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOOL_COMBO_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;（Unable to get a reply）&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;（Unable to get a reply）&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pitfalls Encountered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 1: &lt;code&gt;Part.from_function_response()&lt;/code&gt; does not accept the &lt;code&gt;id&lt;/code&gt; parameter
&lt;/h3&gt;

&lt;p&gt;This was the easiest pitfall to fall into this time: the error only surfaces when a &lt;strong&gt;real model call&lt;/strong&gt; actually triggers a function_call, so unit tests almost never catch it.&lt;/p&gt;

&lt;p&gt;Originally, I wrote it like this, referring to the official example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Error——TypeError occurs at runtime
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_function_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← This parameter does not exist!
&lt;/span&gt;    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The actual signature of &lt;code&gt;from_function_response&lt;/code&gt; is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(*, name: str, response: dict, parts: Optional[list] = None) -&amp;gt; Part
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no &lt;code&gt;id&lt;/code&gt; parameter at all. Every time the model actually triggered a function_call, this line raised a &lt;code&gt;TypeError&lt;/code&gt;, which was silently swallowed by the except block around Step 3 and turned into an error message, so the Places API results were never actually returned to Gemini.&lt;/p&gt;

&lt;p&gt;The correct way is to directly construct &lt;code&gt;types.FunctionResponse&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ✅ Correct
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;function_response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;FunctionResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;fn_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can immediately confirm the parameter list with &lt;code&gt;python -c "from google.genai import types; help(types.Part.from_function_response)"&lt;/code&gt;.&lt;/p&gt;
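
&lt;p&gt;The same kind of check can also be automated in a pre-flight test. A minimal stdlib sketch (the helper names here are mine, and &lt;code&gt;from_function_response&lt;/code&gt; below is a stand-in with the signature reported by &lt;code&gt;help()&lt;/code&gt;; in real code you would inspect &lt;code&gt;types.Part.from_function_response&lt;/code&gt; itself):&lt;/p&gt;

```python
import inspect

def accepts_kwarg(func, name: str) -> bool:
    # True if `func` can take `name` as a keyword argument,
    # either explicitly or via a **kwargs catch-all.
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    p = params.get(name)
    return p is not None and p.kind is not inspect.Parameter.VAR_POSITIONAL

def from_function_response(*, name, response, parts=None):
    # Stand-in mirroring the signature shown by help(); illustration only.
    return {"name": name, "response": response, "parts": parts}

print(accepts_kwarg(from_function_response, "id"))    # False: passing id raises TypeError
print(accepts_kwarg(from_function_response, "name"))  # True
```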

&lt;h3&gt;
  
  
  ❌ Pitfall 2: &lt;code&gt;include_server_side_tool_invocations=True&lt;/code&gt; causes Pydantic to explode
&lt;/h3&gt;

&lt;p&gt;After seeing it in an official documentation example, I assumed I should add this parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Error
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="n"&gt;include_server_side_tool_invocations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← The installed SDK version does not support it
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;google-genai 1.49.0&lt;/code&gt;, this field is not among the model fields of &lt;code&gt;GenerateContentConfig&lt;/code&gt;, so Pydantic immediately raises an &lt;code&gt;extra_forbidden&lt;/code&gt; validation error. Simply removing the field restores normal behavior.&lt;/p&gt;
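
&lt;p&gt;If your code has to run across multiple SDK versions, one defensive pattern is to drop config keys the installed version does not know about before constructing the config. A sketch of the filtering step (in real code &lt;code&gt;supported&lt;/code&gt; would come from &lt;code&gt;set(types.GenerateContentConfig.model_fields)&lt;/code&gt;; here it is hard-coded for illustration):&lt;/p&gt;

```python
def filter_config_kwargs(kwargs: dict, supported: set) -> dict:
    # Keep only keys the installed SDK's GenerateContentConfig accepts,
    # so newer fields degrade gracefully instead of triggering
    # Pydantic's extra_forbidden error.
    dropped = set(kwargs) - supported
    if dropped:
        print(f"dropping unsupported config fields: {sorted(dropped)}")
    return {k: v for k, v in kwargs.items() if k in supported}

supported_fields = {"tools", "temperature", "system_instruction"}  # illustration only
cfg = filter_config_kwargs(
    {"tools": [], "include_server_side_tool_invocations": True},
    supported_fields,
)
print(cfg)  # {'tools': []}
```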

&lt;h3&gt;
  
  
  ❌ Pitfall 3: &lt;code&gt;textQuery&lt;/code&gt; is a parameter of &lt;code&gt;searchText&lt;/code&gt;, not &lt;code&gt;searchNearby&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;I thought "if there is a keyword, then bring it into the Places API", and intuitively added it to the request body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Error——Invalid field for searchNearby endpoint
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;textQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;searchNearby&lt;/code&gt; only accepts fields such as &lt;code&gt;includedTypes&lt;/code&gt; and &lt;code&gt;locationRestriction&lt;/code&gt;; &lt;code&gt;textQuery&lt;/code&gt; belongs to the &lt;code&gt;searchText&lt;/code&gt; endpoint. In some versions adding the field does not even raise an error, but the keyword simply never takes effect.&lt;/p&gt;

&lt;p&gt;The correct approach is to keep the keyword in the Function Declaration's description for Gemini to refer to: the model translates the intent into &lt;code&gt;enriched_query&lt;/code&gt;, Maps Grounding handles the keyword semantics, and the Places API is only responsible for returning real rating data.&lt;/p&gt;
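
&lt;p&gt;Since &lt;code&gt;searchNearby&lt;/code&gt; has no keyword or rating field, any &lt;code&gt;min_rating&lt;/code&gt; filtering has to happen on the Python side after the response comes back. A sketch of the request body plus the client-side filter (field names follow the Places API searchNearby schema; the helper names are mine):&lt;/p&gt;

```python
def build_search_nearby_body(lat: float, lng: float, radius_m: int = 1000) -> dict:
    # Only fields that searchNearby actually accepts; no textQuery here.
    return {
        "includedTypes": ["restaurant"],
        "maxResultCount": 10,
        "locationRestriction": {
            "circle": {
                "center": {"latitude": lat, "longitude": lng},
                "radius": radius_m,
            }
        },
    }

def filter_by_rating(places: list, min_rating: float) -> list:
    # searchNearby cannot filter on rating, so do it after the fact.
    return [p for p in places if p.get("rating", 0) >= min_rating]

body = build_search_nearby_body(25.0441, 121.5598)
places = [{"displayName": "A", "rating": 4.6}, {"displayName": "B", "rating": 3.8}]
print([p["displayName"] for p in filter_by_rating(places, 4.0)])  # ['A']
```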

&lt;h3&gt;
  
  
  ❌ Pitfall 4: No guard for &lt;code&gt;response.candidates[0]&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;When the model hits safety filtering, RECITATION, or some other abnormal termination, &lt;code&gt;candidates&lt;/code&gt; may be an empty list, and indexing &lt;code&gt;response.candidates[0]&lt;/code&gt; directly raises an &lt;code&gt;IndexError&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ No guard
&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;enriched_query&lt;/span&gt;&lt;span class="p"&gt;)]),&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← If candidates is empty, it will explode
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Add guard
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;（Unable to get a reply）&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Demo Display
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvionmd6lsyr2srm5gg87.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvionmd6lsyr2srm5gg87.png" alt="image-20260327163200329" width="800" height="1739"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1: "Find a hot pot restaurant with a rating of 4 stars or above for group dining"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends: GPS location (Xinyi District, Taipei City, 25.0441, 121.5598)

User enters: "Please find a hot pot restaurant with a rating of 4 stars or above, suitable for group dining, and list the name, address, and review summary."

[Step 1: Gemini receives query + map context]
  → Detects the need for restaurant data, emit function_call:
    search_nearby_restaurants(keyword="hot pot", min_rating=4.0)

[Step 2: Python calls Places API]
  → lat=25.0441, lng=121.5598 injected from session
  → Returns 3 restaurants with a rating ≥ 4.0, including review text

[Step 3: Gemini integrates Maps context + Places data]
  → "Lao Wang Hot Pot｜100 Shimin Avenue, Xinyi District｜⭐ 4.6 (312)
      Review summary: Large portions, great value for money, a top choice for friends to dine; fast service, fresh dishes.
     ... (3 restaurants in total)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: "Are there any high-value Japanese restaurants?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User enters: "Are there any high-value Japanese restaurants nearby?"

[Step 1: Gemini]
  → function_call: search_nearby_restaurants(keyword="Japanese cuisine", min_rating=4.0)

[Step 2: Places API]
  → Returns 2 Japanese restaurants that meet the rating criteria

[Step 3: Gemini]
  → "There are two recommendations:
      Washoku ○○｜...｜⭐ 4.4｜Reviews: Weekday lunch set is only 280 yuan, very fresh.
      ..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Demo Script Quick Test
&lt;/h3&gt;

&lt;p&gt;No LINE Bot needed; you can run it directly on your local machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Only test Tool Combo (main function)&lt;/span&gt;
python demo.py combo

&lt;span class="c"&gt;# Run all three functions&lt;/span&gt;
python demo.py all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Old Architecture vs. New Architecture
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Old Architecture (Maps Grounding only)&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;New Architecture (Tool Combo)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;google_maps&lt;/code&gt; (built-in)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;google_maps&lt;/code&gt; + &lt;code&gt;search_nearby_restaurants&lt;/code&gt; (custom)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rating Data&lt;/td&gt;
&lt;td&gt;Gemini describes it itself (may not be accurate)&lt;/td&gt;
&lt;td&gt;Places API real numbers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reviews&lt;/td&gt;
&lt;td&gt;AI generated&lt;/td&gt;
&lt;td&gt;Real user reviews (up to 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Call Count&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2 (one in Step 1, one in Step 3), but transparent to the user&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Filtering&lt;/td&gt;
&lt;td&gt;Rely on prompt&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;min_rating&lt;/code&gt;, &lt;code&gt;radius_m&lt;/code&gt; precise control&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Analysis and Outlook
&lt;/h2&gt;

&lt;p&gt;This implementation has given me a clearer understanding of the potential of Gemini Tool Combinations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem that Tool Combinations truly solves&lt;/strong&gt; is that Grounding and Function Calling are no longer mutually exclusive. Previously, to achieve "map context + real external data", you could only manually connect two APIs yourself at the application layer, or use Gemini's text generation to "simulate" external data (unreliable). Now the model itself knows when to use map context and when to call the Places API, and developers only need to attach the tools.&lt;/p&gt;

&lt;p&gt;However, there are also a few things to note about this implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;lat/lng&lt;/code&gt; injection pattern is essential&lt;/strong&gt;: you cannot let the model guess the coordinates; they must be injected from the session, or positioning accuracy will be terrible. The same pattern applies to any function-calling scenario that carries session state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The cost of two &lt;code&gt;generate_content&lt;/code&gt; calls&lt;/strong&gt;: The agentic loop of Tool Combo requires two model calls, and the token consumption is about 1.5–2 times that of a single call. This needs to be especially considered for scenarios with high latency requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SDK version differences&lt;/strong&gt;: Different versions of &lt;code&gt;google-genai&lt;/code&gt; have different support for the fields of &lt;code&gt;GenerateContentConfig&lt;/code&gt;, and new fields like &lt;code&gt;include_server_side_tool_invocations&lt;/code&gt; should be used after confirming the version number, otherwise Pydantic validation errors are hard to track.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
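
&lt;p&gt;Point 1 above can be made concrete as a dispatch helper that overwrites any model-supplied coordinates with the trusted session values. A self-contained sketch (&lt;code&gt;search_nearby_restaurants&lt;/code&gt; is stubbed here for illustration):&lt;/p&gt;

```python
def search_nearby_restaurants(lat: float, lng: float, keyword: str = "", min_rating: float = 0.0) -> dict:
    # Stub standing in for the real Places API wrapper.
    return {"lat": lat, "lng": lng, "keyword": keyword, "min_rating": min_rating}

def execute_function(name: str, args: dict, lat: float, lng: float) -> dict:
    # Overwrite any coordinates the model may have guessed with the
    # trusted values stored in the session.
    if name == "search_nearby_restaurants":
        args = {**args, "lat": lat, "lng": lng}
        return search_nearby_restaurants(**args)
    return {"error": f"unknown function: {name}"}

result = execute_function(
    "search_nearby_restaurants",
    {"keyword": "hot pot", "lat": 0.0, "lng": 0.0},  # model-guessed junk coordinates
    25.0441, 121.5598,                               # session coordinates win
)
print(result["lat"], result["lng"])  # 25.0441 121.5598
```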

&lt;p&gt;Future directions that can be extended:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Connect the Postback quick replies (click the "Find Restaurant" button) to Tool Combo, so that each entry can get real ratings&lt;/li&gt;
&lt;li&gt;  Add the &lt;code&gt;searchText&lt;/code&gt; endpoint to support more complex keyword searches (e.g. Michelin recommendations)&lt;/li&gt;
&lt;li&gt;  Tool Combo combined with other built-in tools (such as &lt;code&gt;google_search&lt;/code&gt;) to achieve more complex multi-tool chaining&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The core concept of this modification is only one sentence: &lt;strong&gt;Put Google Maps grounding and the Places API function tool in the same &lt;code&gt;types.Tool&lt;/code&gt;, and Gemini will coordinate the two in a single conversation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The key code is only these few lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is all the magic of Tool Combo
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;google_maps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GoogleMaps&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;# ← Maps context
&lt;/span&gt;    &lt;span class="n"&gt;function_declarations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SEARCH_NEARBY_RESTAURANTS_FN&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# ← Places API
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But to make it actually work, you also need to get four details right: how &lt;code&gt;FunctionResponse&lt;/code&gt; is constructed, the guard on &lt;code&gt;candidates&lt;/code&gt;, the correct fields for each Places API endpoint, and injecting &lt;code&gt;lat/lng&lt;/code&gt; from the session instead of letting the model guess.&lt;/p&gt;

&lt;p&gt;The complete code is on &lt;a href="https://github.com/kkdai/linebot-spot-finder" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, feel free to clone and play with it.&lt;/p&gt;

&lt;p&gt;See you next time!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Gemini 3.1: Real-World Voice Recognition with Flash Live: Making Your LINE Bot Understand You</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sun, 29 Mar 2026 02:07:26 +0000</pubDate>
      <link>https://dev.to/gde/gemini-31-real-world-voice-recognition-with-flash-live-making-your-line-bot-understand-you-560o</link>
      <guid>https://dev.to/gde/gemini-31-real-world-voice-recognition-with-flash-live-making-your-line-bot-understand-you-560o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl6ycarc8j4uczflmtoj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl6ycarc8j4uczflmtoj.png" alt="image-20260328203306501" width="800" height="1739"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;Google released &lt;strong&gt;Gemini 3.1 Flash Live&lt;/strong&gt; at the end of &lt;a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-live/" rel="noopener noreferrer"&gt;March 2026&lt;/a&gt;, focusing on "making audio AI more natural and reliable." The model is designed specifically for real-time two-way voice conversations, with low latency, interruptibility, and multi-language support.&lt;/p&gt;

&lt;p&gt;I happened to have a LINE Bot project (&lt;a href="https://github.com/kkdai/linebot-helper-python" rel="noopener noreferrer"&gt;linebot-helper-python&lt;/a&gt;) on hand, which already handles text, images, URLs, PDFs, and YouTube, but completely ignores voice messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends a voice message
Bot: (Silence)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time, I'll add voice support and share a few pitfalls I encountered.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Decision: Flash Live or Standard Gemini API?
&lt;/h2&gt;

&lt;p&gt;The first question: Gemini 3.1 Flash Live is designed for &lt;strong&gt;real-time streaming&lt;/strong&gt;, but LINE's voice messages are &lt;strong&gt;pre-recorded m4a files&lt;/strong&gt;, not real-time audio streams.&lt;/p&gt;

&lt;p&gt;Using Flash Live to process pre-recorded files is like using a live streaming camera to take photos – technically feasible, but the wrong tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I decided to use the standard Gemini API&lt;/strong&gt; – passing the audio bytes directly as inline data and getting the transcribed text back in a single call. It is simpler and better suited to this scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdx9agqz7jujs89xky8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdx9agqz7jujs89xky8x.png" alt="image-20260328203340798" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Integration Approach
&lt;/h3&gt;

&lt;p&gt;This repo already has a complete Orchestrator architecture, which automatically routes to different Agents (Chat, Content, Location, Vision, GitHub) based on the message content. The goal for voice messages is clear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Convert voice to text, and then treat it as a regular text message and pass it into the Orchestrator – allowing all existing features to automatically support voice input.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;User says "Help me search for nearby gas stations" → transcribed into text → Orchestrator determines it's a location query → LocationAgent processes it. No need to implement separate logic for voice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Complete Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends AudioMessage (m4a)
    │
    ▼ handle_audio_message()
    │
    ├─ ① LINE SDK downloads audio bytes
    │ get_message_content(message_id) → iter_content()
    │
    ├─ ② Gemini transcription
    │ tools/audio_tool.py → transcribe_audio()
    │ model: gemini-3.1-flash-lite-preview
    │
    ├─ ③ Reply #1: "You said: {transcription}"
    │ reply_message() (consumes reply token)
    │
    └─ ④ Reply #2: Orchestrator routing
            handle_text_message_via_orchestrator(push_user_id=user_id)
            ↓
            push_message() (reply token already used, use push instead)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why two replies?
&lt;/h3&gt;

&lt;p&gt;The reply is split into two parts so the user &lt;strong&gt;sees the transcription result immediately&lt;/strong&gt;, instead of waiting for the Orchestrator to finish before knowing whether the Bot understood them.&lt;/p&gt;
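
&lt;p&gt;The constraint behind this split is that a LINE reply token is single-use. The pattern can be sketched with a fake client (the client interface here is hypothetical, not the line-bot-sdk API):&lt;/p&gt;

```python
class FakeLineClient:
    # Minimal stand-in: a LINE reply token is single-use, pushes are not.
    def __init__(self):
        self.sent = []
        self.used_tokens = set()

    def reply(self, token: str, text: str):
        if token in self.used_tokens:
            raise RuntimeError("reply token already consumed")
        self.used_tokens.add(token)
        self.sent.append(("reply", text))

    def push(self, user_id: str, text: str):
        self.sent.append(("push", text))

def answer_audio(client, reply_token, user_id, transcription, orchestrate):
    # 1) Spend the single-use token on the fast transcription echo.
    client.reply(reply_token, f"You said: {transcription}")
    # 2) The slower Orchestrator answer must go out as a push.
    client.push(user_id, orchestrate(transcription))

client = FakeLineClient()
answer_audio(client, "tok-1", "user-1", "find gas stations", lambda t: f"routed: {t}")
print(client.sent)
# [('reply', 'You said: find gas stations'), ('push', 'routed: find gas stations')]
```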




&lt;h2&gt;
  
  
  Core Code Explanation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Audio Transcription Tool (tools/audio_tool.py)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;TRANSCRIPTION_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-flash-lite-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/mp4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Transcribe audio bytes to text using Gemini.
    LINE voice messages are always m4a, MIME type is always audio/mp4.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_PROJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_LOCATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;audio_part&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TRANSCRIPTION_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="n"&gt;audio_part&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Design principle: the function itself does not catch exceptions, so the upper-level handler can handle error responses uniformly.&lt;/p&gt;
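
&lt;p&gt;A minimal sketch of this principle (the &lt;code&gt;format_error_message&lt;/code&gt; stand-in and the toy handler below are illustrative, not the project's actual code): the tool raises freely, and the handler is the only place that turns exceptions into user-facing text.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

def format_error_message(exc: Exception, context: str) -&amp;gt; str:
    # Stand-in for LineService.format_error_message
    return f"Sorry, something went wrong while {context}. ({type(exc).__name__})"

async def transcribe_audio(audio_bytes: bytes) -&amp;gt; str:
    # The tool never catches exceptions; it simply raises.
    if not audio_bytes:
        raise ValueError("no audio data")
    return "hello"

async def handle(audio_bytes: bytes) -&amp;gt; str:
    # One uniform place to format errors for the user.
    try:
        return await transcribe_audio(audio_bytes)
    except Exception as e:
        return format_error_message(e, "processing voice message")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
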

&lt;h3&gt;
  
  
  Step 2: Handler Main Flow (main.py)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_audio_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessageEvent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle audio (voice) messages — transcribe and route through Orchestrator.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
    &lt;span class="n"&gt;replied&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt; &lt;span class="c1"&gt;# Track if the reply token has been used
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Download audio
&lt;/span&gt;        &lt;span class="n"&gt;message_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_message_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;audio_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iter_content&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;audio_bytes&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;

        &lt;span class="c1"&gt;# Transcription
&lt;/span&gt;        &lt;span class="n"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;transcribe_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Empty transcription (silent or too short)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unable to recognize voice content, please re-record.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="c1"&gt;# Reply #1: Let the user confirm the transcription result (consumes reply token)
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You said: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;replied&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

        &lt;span class="c1"&gt;# Reply #2: Send to Orchestrator, using push_message (token already used)
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handle_text_message_via_orchestrator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error handling audio for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc_info&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;error_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LineService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format_error_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing voice message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;error_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;replied&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# reply token has been consumed, use push instead
&lt;/span&gt;            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Enabling Orchestrator to Support External Text Input
&lt;/h3&gt;

&lt;p&gt;The original &lt;code&gt;handle_text_message_via_orchestrator&lt;/code&gt; directly reads &lt;code&gt;event.message.text&lt;/code&gt;. AudioMessage doesn't have &lt;code&gt;.text&lt;/code&gt;, so add two optional parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_text_message_via_orchestrator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessageEvent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← External text input (voice transcription)
&lt;/span&gt;    &lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# ← Use push_message when set
&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;format_orchestrator_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;reply_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;reply_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;reply_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextSendMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LineService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format_error_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processing your question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;push_user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;line_bot_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reply_token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;text is not None&lt;/code&gt; (rather than &lt;code&gt;text or ...&lt;/code&gt;) is intentional: if the voice transcription comes back as an empty string, it should pass through as-is (to be intercepted upstream by &lt;code&gt;if not transcription.strip()&lt;/code&gt;) instead of silently falling back to &lt;code&gt;event.message.text&lt;/code&gt;.&lt;/p&gt;
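
&lt;p&gt;The difference is easy to see in isolation. A hypothetical sketch (&lt;code&gt;pick_message&lt;/code&gt; mirrors the handler's fallback logic; it is not part of the project):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def pick_message(text, event_text):
    # Sentinel check: fall back only when no external text was supplied at all.
    return text if text is not None else event_text.strip()

def pick_message_falsy(text, event_text):
    # `or` treats "" as missing and silently falls back - wrong for empty transcriptions.
    return text or event_text.strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the falsy version, an empty transcription would leak through to &lt;code&gt;event.message.text&lt;/code&gt; — an attribute an &lt;code&gt;AudioMessage&lt;/code&gt; does not even have.&lt;/p&gt;
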




&lt;h2&gt;
  
  
  Pitfalls Encountered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 1: &lt;code&gt;Part.from_text()&lt;/code&gt; does not accept positional arguments
&lt;/h3&gt;

&lt;p&gt;The first TypeError encountered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Error (TypeError: Part.from_text() takes 1 positional argument but 2 were given)
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Correct
&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please transcribe the above audio content into text completely, preserving the original language, and do not add any explanations or prefixes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this version of the SDK, &lt;code&gt;Part.from_text()&lt;/code&gt; only accepts &lt;code&gt;text&lt;/code&gt; as a keyword argument; alternatively, the &lt;code&gt;Part(text=...)&lt;/code&gt; constructor works directly and is safer across SDK versions.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 2: LINE reply token can only be used once
&lt;/h3&gt;

&lt;p&gt;LINE's reply token is &lt;strong&gt;one-time use&lt;/strong&gt;. Once &lt;code&gt;reply_message()&lt;/code&gt; is called, the token is invalidated.&lt;/p&gt;

&lt;p&gt;The voice flow in this project replies twice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Reply #1 (display transcription text) → &lt;strong&gt;consumes token&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt; Reply #2 (Orchestrator result) → &lt;strong&gt;token is invalid, will receive LINE 400 error&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The solution is to have the Orchestrator handler support &lt;code&gt;push_message&lt;/code&gt; mode (via the &lt;code&gt;push_user_id&lt;/code&gt; parameter), and Reply #2 changes to &lt;code&gt;push_message&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Error handling needs the same care: if the Orchestrator throws an exception after Reply #1 succeeds, &lt;code&gt;reply_message&lt;/code&gt; can no longer be used in the except block; it too must switch to &lt;code&gt;push_message&lt;/code&gt;. That is exactly what the &lt;code&gt;replied&lt;/code&gt; flag in the code is for.&lt;/p&gt;
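
&lt;p&gt;The pattern distills into a small helper. This is a sketch with the send functions injected so it can be exercised without the LINE SDK; the real handler calls &lt;code&gt;line_bot_api&lt;/code&gt; directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

async def reply_or_push(replied: bool, reply_token: str, user_id: str,
                        messages, reply_fn, push_fn):
    """Use reply_message while the one-time token is fresh, push_message after."""
    if replied:
        await push_fn(user_id, messages)
    else:
        await reply_fn(reply_token, messages)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
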

&lt;h3&gt;
  
  
  ❌ Pitfall 3: Gemini Flash Live is not suitable for pre-recorded files
&lt;/h3&gt;

&lt;p&gt;Not a real "pitfall", but worth clarifying:&lt;/p&gt;

&lt;p&gt;Gemini 3.1 Flash Live is designed for &lt;strong&gt;real-time two-way streaming&lt;/strong&gt;, which carries the overhead of connection establishment and a streaming protocol. LINE voice messages are complete, pre-recorded m4a files that can be processed in a single request.&lt;/p&gt;

&lt;p&gt;Passing the audio bytes inline to &lt;code&gt;client.aio.models.generate_content()&lt;/code&gt; is simpler, and the latency is perfectly acceptable. Save Flash Live for scenarios that genuinely need real-time conversation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Effect Demonstration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario 1: Voice Command Query
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends: [Voice] Help me search for cafes near Taipei Main Station

Bot Reply #1: You said: Help me search for cafes near Taipei Main Station
Bot Reply #2: [LocationAgent replies with a list of nearby cafes]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: Voice Question
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends: [Voice] What's the difference between Gemini and GPT-4

Bot Reply #1: You said: What's the difference between Gemini and GPT-4
Bot Reply #2: [ChatAgent with Google Search Grounding replies with comparison results]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 3: Voice Send URL
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends: [Voice] Help me summarize this article https://example.com/article

Bot Reply #1: You said: Help me summarize this article https://example.com/article
Bot Reply #2: [ContentAgent fetches and summarizes the article]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The text transcribed from voice goes directly into the Orchestrator, and all existing URL detection and intent determination work as usual, with zero extra logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Traditional Text Input vs. Voice Input
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Text Input&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Voice Input&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input Format&lt;/td&gt;
&lt;td&gt;TextMessage&lt;/td&gt;
&lt;td&gt;AudioMessage (m4a)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-processing&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Gemini transcription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;reply token&lt;/td&gt;
&lt;td&gt;Direct use&lt;/td&gt;
&lt;td&gt;Reply #1 consumes, Reply #2 changes to push&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestrator&lt;/td&gt;
&lt;td&gt;Direct routing&lt;/td&gt;
&lt;td&gt;Route after transcription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supported Functions&lt;/td&gt;
&lt;td&gt;All&lt;/td&gt;
&lt;td&gt;All (no additional settings required)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error Handling&lt;/td&gt;
&lt;td&gt;reply_message&lt;/td&gt;
&lt;td&gt;replied flag determines reply/push&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Analysis and Outlook
&lt;/h2&gt;

&lt;p&gt;What I am most satisfied with in this integration is that &lt;strong&gt;the Orchestrator itself barely had to change&lt;/strong&gt;. Once voice is converted to text at the input end, all the routing logic, Agent calls, and error handling are inherited automatically.&lt;/p&gt;

&lt;p&gt;Gemini's multimodal audio understanding has been very stable in this scenario – Traditional Chinese, Taiwanese accents, and sentences that mix in English are transcribed accurately almost every time.&lt;/p&gt;

&lt;p&gt;Future directions for extension:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Multi-language automatic detection&lt;/strong&gt;: Tell Gemini to preserve the original language during transcription, Japanese voice → Japanese transcription, and then the Orchestrator decides whether to translate&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Group voice support&lt;/strong&gt;: Currently limited to 1:1, voice messages in groups are temporarily ignored&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Long recording summary&lt;/strong&gt;: Recordings exceeding a certain length go directly to ContentAgent for summarization, instead of being treated as commands&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Extension: 🔊 Read Summary Aloud – Make the Bot Speak
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghuiv4o4wq6yt1jyupmu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghuiv4o4wq6yt1jyupmu.png" alt="Preview Program 2026-03-28 20.33.53" width="800" height="1739"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Voice recognition allows the Bot to "understand" what the user is saying. After this is done, the next question naturally arises:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can the Bot respond by speaking?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Gemini Live API has a setting &lt;code&gt;response_modalities: ["AUDIO"]&lt;/code&gt;, which can directly output an audio PCM stream. I connected it to another scenario – &lt;strong&gt;reading summaries aloud&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Function Design
&lt;/h3&gt;

&lt;p&gt;Each time the Bot summarizes a URL, YouTube, or PDF, a "🔊 Read Aloud" QuickReply button will appear below the message. When the user presses it, the Bot sends the summary text into Gemini Live TTS, converts the PCM audio to m4a, and then sends it back using &lt;code&gt;AudioSendMessage&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
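
&lt;p&gt;One implementation detail: LINE caps postback &lt;code&gt;data&lt;/code&gt; at 300 characters, so the button should carry only a lookup key into &lt;code&gt;summary_store&lt;/code&gt;, never the summary text itself. A sketch of the encoding (the helper names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def build_read_aloud_data(summary_id: str) -&amp;gt; str:
    # Carry only a key; the summary text itself stays in summary_store.
    return f"read_aloud:{summary_id}"

def parse_postback_data(data: str) -&amp;gt; tuple[str, str]:
    action, _, summary_id = data.partition(":")
    return action, summary_id

# Attached to the summary reply roughly like:
#   QuickReply(items=[QuickReplyButton(
#       action=PostbackAction(label="🔊 Read Aloud",
#                             data=build_read_aloud_data(summary_id)))])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
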

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;URL summary complete
    │
    ▼ [🔊 Read Aloud] QuickReply button
    │
User presses the button → PostbackEvent
    │
    ▼ handle_read_aloud_postback()
    │
    ├─ ① Retrieve the summary text from summary_store (10-minute TTL)
    │
    ├─ ② Gemini Live API → PCM audio
    │ model: gemini-live-2.5-flash-native-audio
    │ response_modalities: ["AUDIO"]
    │
    ├─ ③ ffmpeg transcoding: PCM → m4a
    │ s16le, 16kHz, mono → AAC
    │
    └─ ④ AudioSendMessage sent to the user
            original_content_url: /audio/{uuid}
            duration: {ms}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
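
&lt;p&gt;Steps ③ and ④ can be sketched as follows, using the 16 kHz / s16le / mono figures from the flow above (the helper names are illustrative; the real transcoding lives in the project's TTS path):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess

SAMPLE_RATE = 16_000      # s16le, 16 kHz, mono per the pipeline above
BYTES_PER_SAMPLE = 2      # 16-bit samples

def pcm_duration_ms(pcm: bytes) -&amp;gt; int:
    # AudioSendMessage needs the duration in milliseconds.
    return len(pcm) * 1000 // (SAMPLE_RATE * BYTES_PER_SAMPLE)

def ffmpeg_args(out_path: str) -&amp;gt; list[str]:
    # Raw PCM from stdin, AAC-encoded into an .m4a container.
    return ["ffmpeg", "-y", "-f", "s16le", "-ar", str(SAMPLE_RATE),
            "-ac", "1", "-i", "pipe:0", "-c:a", "aac", out_path]

def pcm_to_m4a(pcm: bytes, out_path: str) -&amp;gt; int:
    subprocess.run(ffmpeg_args(out_path), input=pcm, check=True)
    return pcm_duration_ms(pcm)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
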



&lt;h3&gt;
  
  
  Core Code (tools/tts_tool.py)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;LIVE_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-live-2.5-flash-native-audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;text_to_speech&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VERTEX_PROJECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_modalities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AUDIO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;live&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LIVE_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_client_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;turns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)]),&lt;/span&gt;
            &lt;span class="n"&gt;turn_complete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;pcm_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_content&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_turn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_turn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;pcm_chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_content&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;turn_complete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="n"&gt;pcm_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcm_chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcm_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;32000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 16kHz × 16-bit mono
&lt;/span&gt;
    &lt;span class="c1"&gt;# PCM → m4a (temp file mode, avoid moov atom problem)
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tempfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NamedTemporaryFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.pcm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pcm_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;pcm_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
    &lt;span class="n"&gt;m4a_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pcm_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.pcm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.m4a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ffmpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s16le&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-ac&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-i&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pcm_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c:a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aac&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m4a_path&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m4a_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;duration_ms&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
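&lt;p&gt;The constant &lt;code&gt;32000&lt;/code&gt; in the &lt;code&gt;duration_ms&lt;/code&gt; line comes from the PCM format: 16,000 samples per second × 2 bytes per sample (s16le) × 1 channel. A quick sanity check of that arithmetic (&lt;code&gt;pcm_duration_ms&lt;/code&gt; is just an illustrative helper):&lt;/p&gt;

```python
# 16,000 samples/sec × 2 bytes/sample (s16le) × 1 channel = 32,000 bytes/sec
BYTES_PER_SECOND = 16000 * 2 * 1

def pcm_duration_ms(num_bytes):
    # Same formula as duration_ms in text_to_speech above.
    return int(num_bytes / BYTES_PER_SECOND * 1000)

print(pcm_duration_ms(32000))  # exactly one second of audio: 1000 ms
print(pcm_duration_ms(48000))  # 1.5 seconds: 1500 ms
```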






&lt;h2&gt;
  
  
  Pitfalls of the Read-Aloud Function
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 4: Completely Different Model Name
&lt;/h3&gt;

&lt;p&gt;The first attempt at Gemini Live TTS was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;LIVE_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-flash-live-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The name was inferred by analogy with &lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt;, the model used for voice recognition. The result was an immediate 1008 policy-violation disconnect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Publisher Model `projects/line-vertex/locations/global/publishers/google/
models/gemini-3.1-flash-live-preview` was not found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Listing the available models on Vertex AI revealed that the model naming rules for Live/native audio are completely different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ✅ Correct
&lt;/span&gt;&lt;span class="n"&gt;LIVE_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-live-2.5-flash-native-audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is &lt;strong&gt;no Live version&lt;/strong&gt; of Gemini 3.1 on Vertex AI. The Live/native audio feature is currently the 2.5 generation, and the naming format is &lt;code&gt;gemini-live-{version}-{variant}-native-audio&lt;/code&gt;, which is completely separate from the general model &lt;code&gt;gemini-{version}-flash-{variant}&lt;/code&gt;.&lt;/p&gt;
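&lt;p&gt;The "list the available models" step itself needs Vertex AI credentials, so the sketch below only shows the filtering half. &lt;code&gt;find_live_models&lt;/code&gt; is an illustrative helper, not an SDK function; the two sample names are the ones discussed in this article:&lt;/p&gt;

```python
# Given model names from a Vertex AI model listing, keep only the
# Live / native-audio variants (they all carry "live" in the name).
def find_live_models(model_names):
    return [name for name in model_names if "live" in name]

names = [
    "gemini-3.1-flash-lite-preview",        # standard model, no Live variant
    "gemini-live-2.5-flash-native-audio",   # follows gemini-live-{version}-...
]
print(find_live_models(names))  # only the native-audio model matches
```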

&lt;h3&gt;
  
  
  ❌ Pitfall 5: &lt;code&gt;GOOGLE_CLOUD_LOCATION=global&lt;/code&gt; Causes Live API to Disconnect
&lt;/h3&gt;

&lt;p&gt;After changing to the correct model name, the error message was still the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Publisher Model `projects/line-vertex/locations/global/...` was not found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time the model name was correct, but &lt;code&gt;locations/global&lt;/code&gt; was suspicious: we had explicitly set &lt;code&gt;us-central1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Investigating the source code of the Google GenAI SDK revealed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# _api_client.py
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;env_location&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;global&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="c1"&gt;# ← here
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;location or env_location&lt;/code&gt;: an empty &lt;code&gt;location&lt;/code&gt; argument silently falls back to the environment variable, and if both are empty (and no API key is set) the SDK defaults to &lt;code&gt;global&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The root cause of the problem is the environment variable of Cloud Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GOOGLE_CLOUD_LOCATION"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"global"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;GOOGLE_CLOUD_LOCATION&lt;/code&gt; was set to the string &lt;code&gt;"global"&lt;/code&gt;, so &lt;code&gt;os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")&lt;/code&gt; returned &lt;code&gt;"global"&lt;/code&gt; rather than &lt;code&gt;"us-central1"&lt;/code&gt;. The SDK then dutifully connected to the global endpoint, where &lt;code&gt;gemini-live-2.5-flash-native-audio&lt;/code&gt; has no &lt;code&gt;BidiGenerateContent&lt;/code&gt; support.&lt;/p&gt;
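&lt;p&gt;The effect of the falsy fallback can be reproduced locally. &lt;code&gt;resolve_location&lt;/code&gt; is an illustrative stand-in mirroring the SDK snippet quoted above, not a real SDK function:&lt;/p&gt;

```python
import os

# Mirrors the resolution order in _api_client.py: explicit argument,
# then the environment variable, then a final fallback to "global".
def resolve_location(explicit, env):
    loc = explicit or env
    if not loc:
        loc = "global"
    return loc

# With GOOGLE_CLOUD_LOCATION=global in the environment, a default such
# as os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1") never applies:
os.environ["GOOGLE_CLOUD_LOCATION"] = "global"
env_value = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")
print(env_value)                                   # "global", not "us-central1"
print(resolve_location("", env_value))             # empty arg falls back to env
print(resolve_location("us-central1", env_value))  # explicit value wins
```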

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Standard API&lt;/th&gt;
&lt;th&gt;Live API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;global&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Available&lt;/td&gt;
&lt;td&gt;❌ Model not here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;us-central1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Available&lt;/td&gt;
&lt;td&gt;✅ &lt;code&gt;gemini-live-2.5-flash-native-audio&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Solution: hardcode the location for the Live API instead of reading it from the environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ❌ Affected by GOOGLE_CLOUD_LOCATION=global
&lt;/span&gt;&lt;span class="n"&gt;VERTEX_LOCATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GOOGLE_CLOUD_LOCATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ✅ Hardcoded, not affected by env var
&lt;/span&gt;&lt;span class="n"&gt;VERTEX_LOCATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# Live API needs a regional endpoint
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Voice Recognition vs. Read Summary Aloud
&lt;/h2&gt;

&lt;p&gt;The two functions use completely different Gemini APIs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Voice Recognition&lt;/th&gt;
&lt;th&gt;Read Summary Aloud&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Direction&lt;/td&gt;
&lt;td&gt;Audio → Text&lt;/td&gt;
&lt;td&gt;Text → Audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;Standard &lt;code&gt;generate_content&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Live API &lt;code&gt;BidiGenerateContent&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-3.1-flash-lite-preview&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-live-2.5-flash-native-audio&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Location&lt;/td&gt;
&lt;td&gt;Follows env var&lt;/td&gt;
&lt;td&gt;Hardcoded &lt;code&gt;us-central1&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output Format&lt;/td&gt;
&lt;td&gt;text&lt;/td&gt;
&lt;td&gt;PCM → ffmpeg → m4a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LINE Message Type&lt;/td&gt;
&lt;td&gt;Input: &lt;code&gt;AudioMessage&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Output: &lt;code&gt;AudioSendMessage&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With Gemini 3.1 Flash handling transcription and the Gemini Live 2.5 native-audio model handling TTS, audio AI is now worth taking seriously. This time, both voice recognition and summary read-aloud were integrated into the LINE Bot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Voice Recognition&lt;/strong&gt;: standard Gemini API, one-shot transcription of the recorded m4a, wired into the existing Orchestrator&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Read Summary Aloud&lt;/strong&gt;: Gemini Live TTS turns the summary text into PCM, ffmpeg transcodes it to m4a, and the result is returned via &lt;code&gt;AudioSendMessage&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most troublesome part was not the feature itself but &lt;strong&gt;finding the correct model name&lt;/strong&gt; and &lt;strong&gt;tracking down the SDK's location-resolution logic&lt;/strong&gt;. Neither is documented in any prominent place; the answers only came from listing the available models and reading the SDK source code.&lt;/p&gt;

&lt;p&gt;The full code is on &lt;a href="https://github.com/kkdai/linebot-helper-python" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, feel free to refer to it.&lt;/p&gt;

&lt;p&gt;See you next time!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building an Agent Skill Hub: From Skill Development to Automated Multilingual Documentation Deployment on GitHub Pages</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Fri, 27 Mar 2026 01:45:20 +0000</pubDate>
      <link>https://dev.to/evanlin/building-an-agent-skill-hub-from-skill-development-to-automated-multilingual-documentation-5ae7</link>
      <guid>https://dev.to/evanlin/building-an-agent-skill-hub-from-skill-development-to-automated-multilingual-documentation-5ae7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cu37bxccsz7i7k6wl0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cu37bxccsz7i7k6wl0n.png" alt="image-20260322225856161" width="800" height="692"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reference links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kkdai/agent-skill-hub" rel="noopener noreferrer"&gt;Agent Skill Hub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://plateaukao.github.io/whisperASR/" rel="noopener noreferrer"&gt;whisperASR Reference Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/pages" rel="noopener noreferrer"&gt;GitHub Pages Official Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article documents how, while developing the &lt;strong&gt;Agent Skill Hub (2026 Skill Library)&lt;/strong&gt;, I built a skill description specification from scratch and created a GitHub Pages documentation site that supports both Chinese and English, drawing inspiration from minimalist aesthetics.&lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;With the rise of AI Agents (such as OpenClaw or Gemini CLI), the key question has become how to let an Agent quickly understand and execute specific tasks. Instead of writing a long prompt every time, it is better to package common operations into standardized &lt;strong&gt;Skills&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To facilitate community sharing and Agent consumption, I created &lt;code&gt;agent-skill-hub&lt;/code&gt;. But code alone is not enough; the project also needs a decent "facade": a documentation site that is both aesthetically pleasing and technically detailed.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Step 1: Standardize Skill Descriptions (SKILL.md)
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;agent-skill-hub&lt;/code&gt;, each skill (such as &lt;code&gt;gcp-helper&lt;/code&gt; or &lt;code&gt;n8n-executor&lt;/code&gt;) has a &lt;code&gt;SKILL.md&lt;/code&gt;. The structure of this file is crucial because it's not just for humans to read, but also for LLMs to read:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name &amp;amp; Description&lt;/strong&gt;: Let the Agent know what this is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When to Use&lt;/strong&gt;: Define trigger scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Pattern&lt;/strong&gt;: Provide standard instruction examples.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Common Mistakes&lt;/strong&gt;: Reduce errors caused by Agent hallucinations.&lt;/li&gt;
&lt;/ul&gt;
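&lt;p&gt;Put together, a minimal &lt;code&gt;SKILL.md&lt;/code&gt; following this structure might look like the sketch below. The contents are illustrative, not an actual skill from the repository:&lt;/p&gt;

```markdown
# Skill: gcp-helper

## Description
Run common Google Cloud operations (deploy, logs, IAM checks) on request.

## When to Use
The user mentions Cloud Run, deployment, or GCP logs.

## Core Pattern
"Deploy the current directory to Cloud Run service X in region Y."

## Common Mistakes
Do not invent project IDs; always ask if the project is ambiguous.
```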




&lt;h2&gt;
  
  
  🎨 Step 2: Design Style — Tribute to Minimalist Aesthetics
&lt;/h2&gt;

&lt;p&gt;When designing the pages under the &lt;code&gt;docs&lt;/code&gt; directory, I referenced the style of &lt;strong&gt;whisperASR&lt;/strong&gt;. Its dark background with bright teal accents fits modern developer aesthetics well:&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual Element Highlights:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Gradient Title&lt;/strong&gt;: Use &lt;code&gt;linear-gradient&lt;/code&gt; to give the title a premium feel.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Teal Accent Color&lt;/strong&gt;: Use &lt;code&gt;#14b8a6&lt;/code&gt; as the highlight color for key buttons and titles.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Card-style Layout&lt;/strong&gt;: Clearly present the icons and introductions of each skill, with good responsive design.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🌐 Step 3: Multilingual Support and Automatic Switching
&lt;/h2&gt;

&lt;p&gt;To make it available to developers worldwide, I adopted a directory-structured language management method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docs/
├── index.html (Language detection and redirection)
├── en/ (English version)
│ └── skills/
└── zh/ (Traditional Chinese version)
    └── skills/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I added a simple JavaScript snippet to the root directory's &lt;code&gt;index.html&lt;/code&gt;, which automatically redirects to the correct language based on the user's browser settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lang&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;language&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userLanguage&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;lang&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;zh&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./zh/index.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./en/index.html&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🚀 Step 4: GitHub Pages Deployment Process
&lt;/h2&gt;

&lt;p&gt;In 2026, the most recommended deployment method is to serve content from the &lt;code&gt;docs/&lt;/code&gt; directory of the &lt;code&gt;main&lt;/code&gt; branch, which keeps the repository tidy while keeping development and documentation in sync.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Prepare the Directory Structure
&lt;/h3&gt;

&lt;p&gt;Create all the necessary directories at once using the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; docs/en/skills docs/zh/skills docs/assets/css

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Git Commit and Push
&lt;/h3&gt;

&lt;p&gt;After completing HTML/CSS development, execute the standard Git process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git add docs/
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"docs: add GitHub Pages documentation in English and Chinese"&lt;/span&gt;
git push origin main

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Enable GitHub Pages Settings
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; Go to &lt;strong&gt;Settings &amp;gt; Pages&lt;/strong&gt; in the GitHub repository.&lt;/li&gt;
&lt;li&gt; Under &lt;strong&gt;Build and deployment&lt;/strong&gt;, in &lt;strong&gt;Branch&lt;/strong&gt;, select the &lt;code&gt;main&lt;/code&gt; branch and the &lt;code&gt;/docs&lt;/code&gt; folder.&lt;/li&gt;
&lt;li&gt; Click &lt;strong&gt;Save&lt;/strong&gt;, and the website will be online in a few minutes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj277v30sramun4c3ae5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj277v30sramun4c3ae5.png" alt="image-20260322225932252" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Common Pitfalls and Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❓ Why can't the webpage style (CSS) be loaded?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reason:&lt;/strong&gt; HTML files in subdirectories (such as &lt;code&gt;en/skills/&lt;/code&gt;) must reference assets with the correct relative paths. &lt;strong&gt;Correction:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- In the home page index.html --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"stylesheet"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"../assets/css/style.css"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="c"&gt;&amp;lt;!-- In the skill detail page --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;link&lt;/span&gt; &lt;span class="na"&gt;rel=&lt;/span&gt;&lt;span class="s"&gt;"stylesheet"&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"../../assets/css/style.css"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
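The rule generalizes: the number of ../ segments a page needs equals its folder depth below docs/. A minimal Python sketch of that rule (the file paths are illustrative, not taken from the actual repo):

```python
from pathlib import PurePosixPath

def css_href(page: str) -> str:
    """Build the relative stylesheet href for a page inside docs/."""
    # Folders between docs/ and the page determine how many "../" we need.
    depth = len(PurePosixPath(page).parent.parts)
    return "../" * depth + "assets/css/style.css"

print(css_href("index.html"))           # assets/css/style.css
print(css_href("en/index.html"))        # ../assets/css/style.css
print(css_href("en/skills/page.html"))  # ../../assets/css/style.css
```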



&lt;h3&gt;
  
  
  ❓ How do we ensure the Agent can read the documentation correctly?
&lt;/h3&gt;

&lt;p&gt;We have retained a large number of semantic tags (&lt;code&gt;article&lt;/code&gt;, &lt;code&gt;h2&lt;/code&gt;, &lt;code&gt;pre&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;) in the HTML, so that the Agent can more accurately capture the core logic when performing RAG (Retrieval-Augmented Generation) or directly reading the webpage.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏁 Conclusion
&lt;/h2&gt;

&lt;p&gt;Through this development, I came to appreciate the importance of "documentation as product". A good AI skill library needs not only solid program logic but also a clear, intuitive, multilingual-friendly navigation system.&lt;/p&gt;

&lt;p&gt;If you also want to create a professional facade for your AI project, you might as well refer to the &lt;code&gt;docs/&lt;/code&gt; structure layout. Happy Coding! 🦞&lt;/p&gt;




</description>
      <category>agents</category>
      <category>automation</category>
      <category>documentation</category>
      <category>github</category>
    </item>
    <item>
      <title>Security Declaration for AI Agents: Deep Dive into A2AS (Agent-to-Agent Security) Certification Mechanism</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Fri, 27 Mar 2026 01:45:10 +0000</pubDate>
      <link>https://dev.to/evanlin/security-declaration-for-ai-agents-deep-dive-into-a2as-agent-to-agent-security-certification-2okf</link>
      <guid>https://dev.to/evanlin/security-declaration-for-ai-agents-deep-dive-into-a2as-agent-to-agent-security-certification-2okf</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2FA2AS-CERTIFIED-f3af80" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimg.shields.io%2Fbadge%2FA2AS-CERTIFIED-f3af80" alt="A2AS-CERTIFIED" width="110" height="20"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reference links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://a2as.org" rel="noopener noreferrer"&gt;A2AS.org Official Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.a2as.org/certified/agents/kkdai/linebot-adk" rel="noopener noreferrer"&gt;linebot-adk Project Certification Page&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article documents an interesting Pull Request I received while maintaining &lt;strong&gt;linebot-adk (LINE Bot Agent Development Kit)&lt;/strong&gt;: adding the &lt;strong&gt;A2AS security certificate&lt;/strong&gt; to the project. This is not just a YAML file, but a significant milestone for AI Agents to move towards "industrial-grade security" in 2026.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa0eydihb3vmxsrfh8k20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa0eydihb3vmxsrfh8k20.png" alt="Google Chrome 2026-03-26 22.45.44" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;When we develop Agents like &lt;code&gt;linebot-adk&lt;/code&gt; that have Tool Use (Function Calling) capabilities, the biggest concern for users is often: "Will this Agent issue commands without my permission?" or "What data can it access?".&lt;/p&gt;

&lt;p&gt;Traditionally, we could only write explanations in &lt;code&gt;README.md&lt;/code&gt;, but that's for humans to read, not for systems to verify. This is why &lt;strong&gt;A2AS (Agent-to-Agent Security)&lt;/strong&gt; emerged: it has been hailed as the "HTTPS of the AI world".&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Step 1: Understanding the BASIC Model of A2AS
&lt;/h2&gt;

&lt;p&gt;A2AS is not just a name; it has a complete &lt;strong&gt;BASIC security model&lt;/strong&gt; behind it, designed to solve the trust issue between AI Agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;(B)ehavior Certificates&lt;/strong&gt;: Declarative certificates that clearly define the behavior boundaries of the Agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;(A)uthenticated Prompts&lt;/strong&gt;: Ensures that the source of prompts is trustworthy and traceable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;(S)ecurity Boundaries&lt;/strong&gt;: Uses structured tags (such as &lt;code&gt;&amp;lt;a2as:user&amp;gt;&lt;/code&gt;) to isolate untrusted input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;(I)n-Context Defenses&lt;/strong&gt;: Embeds defense logic in prompts to reject malicious injections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;(C)odified Policies&lt;/strong&gt;: Writes business rules into code and enforces them during inference.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎨 Step 2: Deconstructing a2as.yaml – The Agent's ID Card
&lt;/h2&gt;

&lt;p&gt;In PR #1 received by &lt;code&gt;linebot-adk&lt;/code&gt;, the most important change was the addition of &lt;code&gt;a2as.yaml&lt;/code&gt;. This file acts as the Agent's "digital signature", making the code's logic explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;manifest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;subject&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kkdai/linebot-adk&lt;/span&gt;
    &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main.py&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;multi_tool_agent/agent.py&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;issued&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A2AS.org&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://a2as.org/certified/agents/kkdai/linebot-adk&lt;/span&gt;

&lt;span class="na"&gt;agents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;root_agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;instance&lt;/span&gt;
    &lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;get_weather&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;get_current_time&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why is this important?
&lt;/h3&gt;

&lt;p&gt;This certificate is directly linked to the content of our &lt;code&gt;main.py&lt;/code&gt;. When the certificate declares &lt;code&gt;tools: [get_weather, get_current_time]&lt;/code&gt;, it means this is a &lt;strong&gt;limited-authorization&lt;/strong&gt; Agent. If it tries to execute &lt;code&gt;delete_database&lt;/code&gt;, the security monitoring system can immediately detect that it is outside the certificate scope.&lt;/p&gt;
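The enforcement idea can be sketched in a few lines. This is a minimal illustration of my own, not official A2AS tooling; the manifest dict mirrors the a2as.yaml shown above:

```python
# Parsed form of the certificate's agents section (mirrors a2as.yaml above).
manifest = {
    "agents": {
        "root_agent": {
            "models": ["gemini-2.5-flash"],
            "tools": ["get_weather", "get_current_time"],
        }
    }
}

def is_tool_allowed(agent: str, tool: str) -> bool:
    """Return True only if the certificate declares this tool for this agent."""
    declared = manifest["agents"].get(agent, {}).get("tools", [])
    return tool in declared

print(is_tool_allowed("root_agent", "get_weather"))      # True
print(is_tool_allowed("root_agent", "delete_database"))  # False
```

A monitoring layer that gates every tool call through a check like this can reject out-of-scope actions before they execute.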




&lt;h2&gt;
  
  
  🌐 Step 3: Combining Code Logic
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;linebot-adk&lt;/code&gt;, we used Google's &lt;strong&gt;ADK (Agent Development Kit)&lt;/strong&gt; to build the Agent. The A2AS certificate maps accurately onto our program architecture:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Tool Declaration and Implementation
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;multi_tool_agent/agent.py&lt;/code&gt;, we defined two tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Implement the logic to get the weather
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_current_time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Implement the logic to get the time
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The A2AS certificate will register these &lt;code&gt;function&lt;/code&gt;s in the &lt;code&gt;tools&lt;/code&gt; block, ensuring that the Agent's capability boundaries are transparent and auditable.&lt;/p&gt;
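One way to make that guarantee concrete is a small consistency audit between the certificate and the code. The sketch below is illustrative only: the stub bodies and the audit logic are mine, not part of ADK or A2AS:

```python
import inspect

# Stubs standing in for the real tools in multi_tool_agent/agent.py.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "sunny"}   # stub for illustration

def get_current_time(city: str) -> dict:
    return {"city": city, "time": "12:00"}       # stub for illustration

# Tool names as declared in the certificate's agents.root_agent.tools block.
declared_tools = ["get_weather", "get_current_time"]
implemented = {"get_weather": get_weather, "get_current_time": get_current_time}

# Flag any declared tool that is missing or has an unexpected parameter list.
missing = [t for t in declared_tools if t not in implemented]
bad_signature = [t for t in declared_tools
                 if t in implemented
                 and list(inspect.signature(implemented[t]).parameters) != ["city"]]
print("missing:", missing, "bad signature:", bad_signature)
```

Running such a check in CI would catch the "code changed but certificate didn't" drift described later in this article.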

&lt;h3&gt;
  
  
  2. Runner and Execution Loop
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;main.py&lt;/code&gt;, we start the Agent through &lt;code&gt;Runner&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;root_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;APP_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;manifest.subject.scope&lt;/code&gt; in the certificate marks &lt;code&gt;main.py&lt;/code&gt;, which means the entire startup process (including FastAPI's Webhook processing) is within the A2AS compliant scope.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Step 4: Why is this the "HTTPS of the AI world"?
&lt;/h2&gt;

&lt;p&gt;Imagine if you want a "travel agent Agent" to talk to a "hotel reservation Agent".&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without A2AS&lt;/strong&gt;: The travel Agent can only "blindly trust" the hotel Agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With A2AS&lt;/strong&gt;: The travel Agent can first check the other party's &lt;code&gt;a2as.yaml&lt;/code&gt; certificate. If the other party claims to have the right to "modify orders" but the certificate doesn't say so, the travel Agent can refuse the transaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This &lt;strong&gt;"verify first, then execute"&lt;/strong&gt; model is the trust network that A2AS wants to build.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Common Pitfalls and Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❓ What if the certificate expires or the Commit Hash doesn't match?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reason:&lt;/strong&gt; A2AS certificates are bound to a specific Git Commit. When you modify the logic of &lt;code&gt;agent.py&lt;/code&gt; but don't update the certificate, the verification will fail. &lt;strong&gt;Correction:&lt;/strong&gt; Every time you modify the core functions of the Agent (such as adding a Tool or changing the Model), you must regenerate and sign &lt;code&gt;a2as.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❓ Does using A2AS increase latency?
&lt;/h3&gt;

&lt;p&gt;No. A2AS is primarily a declarative, structured specification. During inference, its structured tags (the "S" in the BASIC model) help the LLM distinguish instructions from data, which reduces hallucinations caused by that confusion and can even improve execution reliability.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏁 Conclusion
&lt;/h2&gt;

&lt;p&gt;Through the introduction of this A2AS certificate, &lt;code&gt;linebot-adk&lt;/code&gt; is no longer just a simple LINE Bot example; it has become a transparent Agent that meets the 2026 security standards. In an era where AI agents are gradually penetrating our lives, "transparency" is the best defense.&lt;/p&gt;

&lt;p&gt;If you are also developing AI Agents, you might as well go to &lt;a href="https://a2as.org" rel="noopener noreferrer"&gt;A2AS.org&lt;/a&gt; and add that badge of trust to your project. Happy Coding! 🦞&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>Deploying OpenClaw on Google Cloud VM: Avoiding Sudo and NVM Pitfalls</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sun, 01 Mar 2026 14:04:54 +0000</pubDate>
      <link>https://dev.to/gde/deploying-openclaw-on-google-cloud-vm-avoiding-sudo-and-nvm-pitfalls-92k</link>
      <guid>https://dev.to/gde/deploying-openclaw-on-google-cloud-vm-avoiding-sudo-and-nvm-pitfalls-92k</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatem7u193qqdcdo7bfox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatem7u193qqdcdo7bfox.png" alt="OpenClaw on GCP" width="800" height="436"&gt;&lt;/a&gt;&lt;em&gt;(Image generated by &lt;a href="https://github.com/kkdai/nanobanana" rel="noopener noreferrer"&gt;Nano Banana&lt;/a&gt; - Gemini Image Generation)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://openclaw.ai/" rel="noopener noreferrer"&gt;OpenClaw Official Website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://yu-wenhao.com/zh-TW/blog/openclaw-tools-skills-tutorial/" rel="noopener noreferrer"&gt;OpenClaw Practical Tutorial: Chinese FAQ and Recommended Skills&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://yu-wenhao.com/zh-TW/blog/2026-02-04-is-openclaw-safe-security-guide/" rel="noopener noreferrer"&gt;OpenClaw Security Guide: Security Enhancement Recommendations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://youtu.be/FC3Wo3ew130" rel="noopener noreferrer"&gt;YouTube Tutorial: Deploying OpenClaw on GCP&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article documents the complete solution process for the permission, environment variable, and process persistence issues encountered when installing &lt;strong&gt;OpenClaw (2026 Latest Version)&lt;/strong&gt; in a Debian/Ubuntu environment on Google Cloud Platform (GCP).&lt;/p&gt;

&lt;h1&gt;
  
  
  Preface
&lt;/h1&gt;

&lt;p&gt;The AI Agent field has been booming recently. &lt;strong&gt;OpenClaw&lt;/strong&gt;, an open-source AI agent that can operate 24 hours a day, has impressed people with its powerful system access and browsing capabilities. For security reasons, deploying it on a cloud VM (such as a GCP GCE instance) is the ideal approach: it ensures 24/7 availability and isolates the agent from sensitive local data.&lt;/p&gt;

&lt;p&gt;However, in GCP's default Debian/Ubuntu environment, the permission mechanism differs slightly from a typical desktop Linux, so following the official installation script often leads to pitfalls.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Basic Installation Process of OpenClaw on GCP
&lt;/h2&gt;

&lt;p&gt;Before we get into troubleshooting, let's quickly go through the standard installation logic:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Create a VM Instance
&lt;/h3&gt;

&lt;p&gt;Create a new VM in the GCP Console:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Machine type&lt;/strong&gt;: Recommended &lt;code&gt;e2-small&lt;/code&gt; or &lt;code&gt;e2-medium&lt;/code&gt; (depending on your Agent load).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operating system&lt;/strong&gt;: Recommended to choose &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt; or &lt;strong&gt;Debian 12&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard disk&lt;/strong&gt;: Recommended 20GB or more.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Connect and Basic Updates
&lt;/h3&gt;

&lt;p&gt;After entering the VM via SSH, first perform a system update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update &amp;amp;&amp;amp; sudo apt upgrade -y
sudo apt install -y git curl build-essential

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Officially Install OpenClaw
&lt;/h3&gt;

&lt;p&gt;The official website provides a one-click installation script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://openclaw.ai/install.sh | bash

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;But!&lt;/strong&gt; If you directly execute the above script, you will usually encounter the following two serious permission and path problems on GCP.&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠️ Problem 1: "HAL 9000" Style Denial of sudo-rs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; When executing the official installation script, the following error is encountered with &lt;code&gt;sudo-rs&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;sudo-rs: I'm sorry evanslin. I'm afraid I can't do that&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Reason:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Interaction Restriction&lt;/strong&gt;: The script executed via &lt;code&gt;curl ... | bash&lt;/code&gt; cannot obtain password input from the terminal when &lt;code&gt;sudo&lt;/code&gt; is required.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;No Password Account&lt;/strong&gt;: GCP defaults to using SSH Key login, and the user account usually does not have a physical password set, leading to &lt;code&gt;sudo&lt;/code&gt; authentication failure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Use &lt;strong&gt;NVM (Node Version Manager)&lt;/strong&gt; to install Node.js, and build the environment under the user directory, completely avoiding the &lt;code&gt;sudo&lt;/code&gt; requirement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1. Install NVM
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash

# Reload shell configuration
source ~/.bashrc

# 2. Install Node.js
nvm install node # Recommended version v25.7.0+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🛠️ Problem 2: NVM Path and Environment Variables
&lt;/h2&gt;

&lt;p&gt;After using NVM, although &lt;code&gt;sudo&lt;/code&gt; is avoided, a new problem arises: when you log in again or execute commands using a non-interactive shell, the system may not be able to find the &lt;code&gt;node&lt;/code&gt; or &lt;code&gt;openclaw&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;This is because the NVM path is dynamically loaded. It is recommended to ensure that the following content exists in &lt;code&gt;~/.bashrc&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] &amp;amp;&amp;amp; \. "$NVM_DIR/nvm.sh"
[ -s "$NVM_DIR/bash_completion" ] &amp;amp;&amp;amp; \. "$NVM_DIR/bash_completion"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
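Note that the spaces inside the brackets are mandatory in shell: [-s is parsed as a (nonexistent) command name, while [ -s FILE ] invokes the test builtin. A quick self-contained demonstration (the temp file name is arbitrary):

```shell
# "[ -s FILE ]" is true only when FILE exists and is non-empty; the spaces
# around the brackets are required because "[" is itself a command name.
touch /tmp/nvm_demo_empty            # create an empty file
if [ -s /tmp/nvm_demo_empty ]; then
  echo "non-empty"
else
  echo "empty or missing"
fi
```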






&lt;h2&gt;
  
  
  🛠️ Problem 3: How to Make OpenClaw Run 24/7 Stably?
&lt;/h2&gt;

&lt;p&gt;After installation, to keep the Agent running after closing the SSH window, I switched from GCP's Web SSH to the local &lt;code&gt;gcloud&lt;/code&gt; CLI, and promptly hit another small pitfall.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Why Can't gcloud ssh Find openclaw?
&lt;/h3&gt;

&lt;p&gt;This is usually because GCP's &lt;code&gt;gcloud compute ssh&lt;/code&gt; may create a new username based on your &lt;strong&gt;local account name&lt;/strong&gt;, instead of using the account you used when installing on the VM (e.g., &lt;code&gt;evanslin&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verification method:&lt;/strong&gt; Please enter the following in the "Web SSH" and "Local gcloud SSH" windows respectively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;whoami

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; If the web version shows &lt;code&gt;evanslin&lt;/code&gt;, but the gcloud version shows a name like &lt;code&gt;evan_lin_yourdomain_com&lt;/code&gt;, then the home directory paths of the two are completely different, and your NVM and OpenClaw settings will of course "disappear".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; When executing the &lt;code&gt;gcloud&lt;/code&gt; command, &lt;strong&gt;explicitly specify&lt;/strong&gt; the account to log in to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gcloud compute ssh evanslin@openclaw-evanlin

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will ensure that you return to the correct environment!&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Use tmux and Startup Script to Achieve Perfect Execution
&lt;/h3&gt;

&lt;p&gt;In order to ensure that environment variables can be loaded correctly in any SSH session (web version or gcloud version), and to keep OpenClaw running stably in the background, it is recommended to use the following "scripted" startup method.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Create a Startup Script
&lt;/h4&gt;

&lt;p&gt;In a window where you can normally execute &lt;code&gt;openclaw&lt;/code&gt; (usually Web SSH), create a startup script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;lt;&amp;lt; 'EOF' &amp;gt; ~/start_openclaw.sh
#!/bin/bash
# 1. Force loading NVM path
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] &amp;amp;&amp;amp; \. "$NVM_DIR/nvm.sh"

# 2. Automatically correct PATH (please adjust the path according to your Node version)
export PATH="$HOME/.nvm/versions/node/v25.7.0/bin:$PATH"

# 3. Execute command
openclaw "$@"
EOF

# Grant execution permission
chmod +x ~/start_openclaw.sh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 2: Verify the Script
&lt;/h4&gt;

&lt;p&gt;From now on, no matter where you log in from, please use this script uniformly. Test in the &lt;code&gt;gcloud ssh&lt;/code&gt; window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/start_openclaw.sh gateway

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it runs successfully, the PATH has been wired up correctly!&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Combine tmux to Solve the Disconnection Problem
&lt;/h4&gt;

&lt;p&gt;Now we combine the script with &lt;code&gt;tmux&lt;/code&gt; to achieve true 24/7 background operation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Open a new session&lt;/strong&gt;: &lt;code&gt;tmux new -s openclaw&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Execute the script inside&lt;/strong&gt;: &lt;code&gt;~/start_openclaw.sh gateway&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Perfectly detach&lt;/strong&gt;: Press &lt;code&gt;Ctrl + B&lt;/code&gt; and release, then press &lt;code&gt;D&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Reconnect at any time&lt;/strong&gt;: Next time you log in, execute &lt;code&gt;tmux a -t openclaw&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The key to deploying OpenClaw on GCP is &lt;strong&gt;"user directory first"&lt;/strong&gt;. Using NVM to sidestep the system-level &lt;code&gt;sudo-rs&lt;/code&gt; restriction not only makes installation smoother, but also makes it easier to switch Node.js versions to meet OpenClaw's latest requirements.&lt;/p&gt;

&lt;p&gt;After successful deployment, don't forget to use &lt;code&gt;openclaw onboard&lt;/code&gt; to start configuring your API Keys and communication channels (such as Telegram or Discord).&lt;/p&gt;

&lt;p&gt;I hope this note can help developers who are also working hard on GCP. See you next time!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>cloud</category>
      <category>google</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Sharing Good Books: Secrets to Successful WFH</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sat, 28 Feb 2026 16:38:19 +0000</pubDate>
      <link>https://dev.to/evanlin/sharing-good-books-secrets-to-successful-wfh-5g8j</link>
      <guid>https://dev.to/evanlin/sharing-good-books-secrets-to-successful-wfh-5g8j</guid>
      <description>&lt;p&gt;&lt;a href="http://moo.im/a/7opqFR" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3pw9cfbmn02qn16w62a1.jpg" width="210" height="298"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WFH在家工作的成功祕訣
美國中小企業最佳CEO教你高效、彈性、具團隊精神的企業競爭新優勢
How to Thrive in the Virtual Workplace : Simple and Effective Tips for Successful， Productive and Empowered Remote Work

作者： 羅伯特・格雷瑟 米克・史隆 原文作者： Robert Glazer Mick Sloan 譯者： 孟令函 出版社：遠流出版

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Purchase Recommendation Website:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://moo.im/a/7opqFR" rel="noopener noreferrer"&gt;Readmoo Online Book Purchase&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Preface:
&lt;/h1&gt;

&lt;p&gt;This is the eleventh book I've read this year. Since the global outbreak of the pandemic in 2020, my company has been accelerating its transformation into a Hybrid Office: flexible office seating plus flexible remote work. At the height of the pandemic, we even moved to full-time WFH.&lt;/p&gt;

&lt;p&gt;Whether you are an employee or a supervisor, do you dread WFH or enjoy it? Perhaps you like skipping the commute, yet worry that your home lacks proper equipment and that you miss real interaction with colleagues. This book caught my eye, so I bought it and read it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Content Introduction:
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build a virtual office and enhance future competitiveness!
"If you are still struggling with remote work, Robert Glazer can provide you with some immediately actionable advice." - Adam Grant (Professor at Wharton Business School, author of "Give and Take")

When millions of office workers around the world were suddenly forced to work from home (WFH, Work From Home) to prevent the pandemic, business owners found that employees were more willing to accept it than they had previously understood, and most of the work content could still operate normally. However, not every company and every office worker can smoothly transition overnight, and it's not enough to simply apply the work procedures and strategies commonly used in physical offices. In the future, as remote or hybrid work models become more and more common, companies that do well will have a clear competitive advantage and attract the best talent.

As the founder and CEO of "Acceleration Partners," a 100% remote-work organization with 170 employees working from home, Robert Glazer has drawn on more than ten years of valuable experience to distill the right principles, strategies, and tools for managing remote employees, allowing companies to excel in both the virtual and physical worlds.

Office workers will from now on:
✔ Don't have to commute, stay away from the pressure of high housing prices and high rents in the city
✔ Not be disturbed, create their own work schedule and environment
✔ Enjoy the ideal life of balancing family, interests, and work

Companies can even:
✔ Save costs, or can invest more resources in employees and customers
✔ Improve efficiency, and can achieve excellent performance and work results globally
✔ Create an equal and cohesive work environment, retaining outstanding talent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Chapter Outline
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Part 1: The Winning Mindset for Remote Workers
&lt;/h2&gt;

&lt;p&gt;What exactly is remote work? It's not a product of the pandemic. Before the pandemic, many companies needed businesses or customer service marketing personnel around the world. But they couldn't afford to set up physical offices in every region. The result was that employees came from all over the world and could work from their own homes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Winning Mindset for Remote Workers&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Recruit diligent, responsible, and self-disciplined employees&lt;/li&gt;
&lt;li&gt;Give them enough trust&lt;/li&gt;
&lt;li&gt;Perfect work procedures&lt;/li&gt;
&lt;li&gt;Excellent company culture&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Basics for Remote Workers&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Develop a work plan and execute it effectively&lt;/li&gt;
&lt;li&gt;Create a suitable work environment&lt;/li&gt;
&lt;li&gt;Establish a clear boundary between work and personal life&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Properly Manage Your Email
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;With remote work, the volume of email exchanges increases.&lt;/li&gt;
&lt;li&gt;Letting others know your expected reply cadence is very important.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Methods to Improve Work Efficiency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Allocate energy well&lt;/li&gt;
&lt;li&gt;Create a buffer before and after work&lt;/li&gt;
&lt;li&gt;Prioritize and allocate time.&lt;/li&gt;
&lt;li&gt;Establish expectations&lt;/li&gt;
&lt;li&gt;Stay focused

&lt;ul&gt;
&lt;li&gt;Try to focus on one thing for at least 15 to 20 minutes a day.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Take care of yourself

&lt;ul&gt;
&lt;li&gt;Physical and mental health is very important, don't ruin your health because of WFH.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Establish communication between people

&lt;ul&gt;
&lt;li&gt;Create some chat channels&lt;/li&gt;
&lt;li&gt;Allow more participants to speak in meetings.&lt;/li&gt;
&lt;li&gt;Make good use of asynchronous video (use videos instead of emails or announcements)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Changing Work Location
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Make sure the company has an office in that region (country); otherwise there may be problems with salary remittances.&lt;/li&gt;
&lt;li&gt;Because labor laws and tax rates differ between countries, employee benefits and labor regulations differ as well.&lt;/li&gt;
&lt;li&gt;Changing countries may result in salary differences, since pay is adjusted to the cost of living in each location.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Part 2: The Success Rules for Remote Work Companies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Starting from Organizational Culture
&lt;/h3&gt;

&lt;p&gt;Since remote work companies care a lot about employees' autonomous motivation, every colleague needs an in-depth understanding of the organizational culture (and must be able to deeply identify with it).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Company Culture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vision&lt;/li&gt;
&lt;li&gt;Values&lt;/li&gt;
&lt;li&gt;Goals&lt;/li&gt;
&lt;li&gt;Consistency&lt;/li&gt;
&lt;li&gt;Clear and explicit&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  How to Describe the Core Concept of the Company:
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pick a specific point in the future and, in the tone of describing present facts, detail as much as possible what the company and its employees will be like at that time, and how they will feel.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  When to Use the Core Concept:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Recruiting employees&lt;/li&gt;
&lt;li&gt;Major policy decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to Recruit Suitable Remote Employees
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ask the other party if they have remote work experience&lt;/li&gt;
&lt;li&gt;Whether they agree with the core concept&lt;/li&gt;
&lt;li&gt;Look at the other party's concept and handling methods for remote work&lt;/li&gt;
&lt;li&gt;You can ask detailed questions

&lt;ul&gt;
&lt;li&gt;Do you like remote work -&amp;gt; Why do you like it -&amp;gt; How do you arrange it -&amp;gt; Self-adjustment&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to Conduct Remote Interviews
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fact-based interview questions&lt;/li&gt;
&lt;li&gt;What changes does remote work bring&lt;/li&gt;
&lt;li&gt;Are you troubled because you can't work face-to-face?&lt;/li&gt;
&lt;li&gt;How to communicate effectively without meeting&lt;/li&gt;
&lt;li&gt;How to avoid feeling isolated while working from home
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Don't waste training resources on someone who only meets the average standard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Notes for Remote Work Colleagues
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Complete onboarding process

&lt;ul&gt;
&lt;li&gt;1 on 1 with each supervisor&lt;/li&gt;
&lt;li&gt;Ice-breaking with colleagues&lt;/li&gt;
&lt;li&gt;Setting up equipment&lt;/li&gt;
&lt;li&gt;Related pre-onboarding education&lt;/li&gt;
&lt;li&gt;More distinctive items:&lt;/li&gt;
&lt;li&gt;Introduction to company regulations (especially those related to remote work)&lt;/li&gt;
&lt;li&gt;Introduction to company culture (to keep everyone constantly aligned on the same core concept)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Reduce meetings, especially regular ones; switch to irregular, fast, and concise discussions with a small number of participants

&lt;ul&gt;
&lt;li&gt;Have participants rate how much they need to attend; if a meeting scores less than six points, cancel it.&lt;/li&gt;
&lt;li&gt;Everyone participating in the meeting must speak&lt;/li&gt;
&lt;li&gt;Meeting summaries are very important (to avoid someone not being able to participate)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Etiquette in different time zones

&lt;ul&gt;
&lt;li&gt;Emails and messages should clearly indicate the relevant time zone (if possible, convert times to the recipient's time zone).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Travel strategy

&lt;ul&gt;
&lt;li&gt;Business trips are expected to decrease after the pandemic&lt;/li&gt;
&lt;li&gt;Travel becomes more individual: more face-to-face meetings with fewer people&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Team camaraderie

&lt;ul&gt;
&lt;li&gt;Deepen camaraderie through regular meetings and casual chats after meetings.&lt;/li&gt;
&lt;li&gt;Play some online games&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Performance management

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Compared to on-site work, remote work requires more feedback and suggestions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Don't save your opinions for performance evaluations.&lt;/li&gt;
&lt;li&gt;Frequent feedback increases the sense of trust between colleagues.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Praise, praise immediately!&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Responsible culture

&lt;ul&gt;
&lt;li&gt;Avoid the strategy of close monitoring&lt;/li&gt;
&lt;li&gt;Track progress through weekly reports or regular daily reports instead.&lt;/li&gt;
&lt;li&gt;Give more trust and care appropriately.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Trust crisis:

&lt;ul&gt;
&lt;li&gt;If any violations occur, they need to be handled immediately.&lt;/li&gt;
&lt;li&gt;Announce the incident (withholding the name, describing only the violation) as a reminder to colleagues&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Physical employee conference

&lt;ul&gt;
&lt;li&gt;Working remotely does not mean never meeting; you can arrange for everyone to gather in the same place once a year.&lt;/li&gt;
&lt;li&gt;Connect emotionally and synchronize company culture&lt;/li&gt;
&lt;li&gt;This helps people work together more smoothly&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h1&gt;
  
  
  Thoughts
&lt;/h1&gt;

&lt;p&gt;This book was written by the founder of a startup accelerator whose company has long been fully remote. It clearly explains remote work, before and during the pandemic, from two major perspectives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As a remote worker, how should you adjust?&lt;/li&gt;
&lt;li&gt;As a manager, how should you manage your all-remote team? (or even a full-remote company)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This book gives remote workers the psychological preparation they need. After all, remote work is not just about saving commuting time; it is a wholesale transformation of the work model. Remote work demands a higher degree of self-discipline and proactivity, which is what lets supervisors and colleagues trust you and feel at ease. It is even more necessary to balance life and work, to avoid blurring the lines between them because you work from home, which can lead to early burnout.&lt;/p&gt;

&lt;p&gt;As a manager, you need to pay even more attention to company culture and core concepts. Because employees are scattered everywhere, they cannot absorb the banners and slogans that decorate an office, so you need to communicate that information frequently, and be especially careful when recruiting: not every employee can understand and properly use the benefits remote work brings. The book spends a lot of time teaching how to build corporate culture and core concepts remotely, which gave me a much deeper understanding.&lt;/p&gt;

&lt;p&gt;Finally, whether you are a prospective remote worker or a manager who may lead one, this book can help you.&lt;/p&gt;

</description>
      <category>career</category>
      <category>management</category>
      <category>productivity</category>
      <category>resources</category>
    </item>
    <item>
      <title>LINE Bot with Long Memory: Firebase Database, Gemini Pro, and Cloud Functions</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sat, 28 Feb 2026 16:38:04 +0000</pubDate>
      <link>https://dev.to/evanlin/line-bot-with-long-memory-firebase-database-gemini-pro-and-cloud-functions-455j</link>
      <guid>https://dev.to/evanlin/line-bot-with-long-memory-firebase-database-gemini-pro-and-cloud-functions-455j</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0ewdqrz4j8ctmd78pjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0ewdqrz4j8ctmd78pjh.png" alt="image-20240413210750427" width="800" height="1731"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Preface:
&lt;/h1&gt;

&lt;p&gt;This is the second in a series of articles for the BUILD WITH AI (BWAI) WORKSHOP, held in collaboration with the Google Developer Group on 04/18 (I don't yet know how many more articles will be needed).&lt;/p&gt;

&lt;p&gt;This article will focus on the following aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firebase Database setup&lt;/li&gt;
&lt;li&gt;How to access Firebase through the official Golang on Cloud Function&lt;/li&gt;
&lt;li&gt;Using the Firebase Database to make Gemini remember everything that has been said, optimizing the LINE Bot built &lt;a href="https://dev.to/evanlin/bwai-workshopgolang-line-oa-cloudfunction-geminipro-firebase-lu-xing-xiao-bang-shou-line-liao-tian-ji-qi-ren-23j9-temp-slug-2266421"&gt;last time&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Article List:
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/evanlin/bwai-workshopgolang-line-oa-cloudfunction-geminipro-firebase-lu-xing-xiao-bang-shou-line-liao-tian-ji-qi-ren-23j9-temp-slug-2266421"&gt;[BwAI workshop][Golang] LINE OA + CloudFunction + GeminiPro + Firebase = Travel Assistant LINE Chatbot (1): Scenery Recognition Assistant&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;[BwAI workshop][Golang] LINE OA + CloudFunction + GeminiPro + Firebase = Travel Assistant LINE Chatbot (2): Firebase Database gives LINEBot a super long memory&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Preparation
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://developers.line.biz/en/" rel="noopener noreferrer"&gt;LINE Developer Account&lt;/a&gt;&lt;/strong&gt;: You only need a LINE account to apply for a developer account.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/functions?hl=zh_cn" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Cloud Functions&lt;/strong&gt;&lt;/a&gt;: The &lt;strong&gt;deployment platform&lt;/strong&gt; for Go code, generating the webhook address for LINEBot.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://firebase.google.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Firebase&lt;/strong&gt;&lt;/a&gt;: Create a &lt;strong&gt;Realtime database&lt;/strong&gt;, LINE Bot can remember your previous conversations and even answer many interesting questions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://aistudio.google.com/" rel="noopener noreferrer"&gt;Google AI Studio&lt;/a&gt;&lt;/strong&gt;: You can get the Gemini Key here.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Applying for Firebase Database Service
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Remember to go to &lt;a href="https://console.firebase.google.com/" rel="noopener noreferrer"&gt;Firebase Console&lt;/a&gt; and create a project.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a Firebase Realtime Database, which will be used later&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the US region&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start in “locked mode”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For ease of development, set the “Rules” to allow read and write. Pay close attention:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
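A minimal sketch of what that permissive development rule set looks like in the Realtime Database “Rules” tab (again: remember to lock this down before going live):

```json
{
  "rules": {
    ".read": true,
    ".write": true
  }
}
```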

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2umoeo1n59unkzyx7cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2umoeo1n59unkzyx7cf.png" alt="image-20240413213202354" width="633" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remember the URL (Note! &lt;strong&gt;You need to change the permissions back before going live&lt;/strong&gt;), and add an item: “&lt;strong&gt;BwAI&lt;/strong&gt;”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fouj8ilfmmmoah9bt50u9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fouj8ilfmmmoah9bt50u9.png" alt="image-20240413213802313" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Applying for a Service Account Credential to Connect Cloud Functions to Google Services
&lt;/h2&gt;

&lt;p&gt;You can actually refer to another article of mine for this part of the tutorial. &lt;a href="https://www.evanlin.com/til-heroku-gcp-key/" rel="noopener noreferrer"&gt;[Learning Document] How to use Golang to access Google Cloud services on Heroku&lt;/a&gt;, but I'll quickly go through it here.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enter the Google Cloud Console, go to IAM &amp;amp; Admin, and select Create Service Account&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq73jo7sv5juu64eome4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq73jo7sv5juu64eome4o.png" alt="image-20240413221505536" width="444" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose the Service Account name yourself, but pay attention: the project and the Firebase &lt;strong&gt;project names must be consistent&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ys6mumbx82iiwf69zq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ys6mumbx82iiwf69zq.png" alt="image-20240413222847247" width="651" height="684"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grant this service account access to the project. For the role, it is easiest to start with Editor (a broad role, so use it with caution)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F541gygfqb74m3zava6ed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F541gygfqb74m3zava6ed.png" alt="image-20240413223055288" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“Grant users access to this service account” does not need any specific settings&lt;/li&gt;
&lt;li&gt;Press “Manage Keys” to prepare to download the credential&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vgtbig2kdgql9sxhwt8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vgtbig2kdgql9sxhwt8.png" alt="image-20240413223225404" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select Add Key -&amp;gt; Create new Key -&amp;gt; Download JSON&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftiutm3is5eitmz93w6a9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftiutm3is5eitmz93w6a9.png" alt="image-20240413223613244" width="555" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Things to note when using Golang Google Options package:
&lt;/h2&gt;

&lt;p&gt;Although the Firebase Realtime Database has been set to allow everyone to read and write, accessing it through Golang may still produce an Unauthorized request error. This happens because the project in your credential JSON file differs from your Firebase project. Just recreate a Service Account under the correct project and update the JSON content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fao9k6eymdu1cienk09hb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fao9k6eymdu1cienk09hb.png" alt="image-20240413220630196" width="800" height="62"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  How to import a Service Account Credential in Google Cloud Functions?
&lt;/h1&gt;

&lt;p&gt;Next, I will share how to use the credential correctly within Cloud Functions. If you try to have the Cloud Function open the credential JSON file directly, you will always get an error saying the credential cannot be obtained.&lt;/p&gt;

&lt;p&gt;At this time, you need to add it through environment variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy all the content in the JSON file&lt;/li&gt;
&lt;li&gt;Set the &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; environment variable, then paste all the content into it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbhayrqqfametdn4v8w2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbhayrqqfametdn4v8w2.png" alt="image-20240413225710980" width="247" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Next, let's look at how to modify the relevant code:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    // Requires imports: context, log, os,
    // firebase "firebase.google.com/go/v4", "google.golang.org/api/option"
    // Init firebase related variables
    ctx := context.Background()
    opt := option.WithCredentialsJSON([]byte(os.Getenv("GOOGLE_APPLICATION_CREDENTIALS")))
    config := &amp;amp;firebase.Config{DatabaseURL: os.Getenv("FIREBASE_URL")}
    app, err := firebase.NewApp(ctx, config, opt)
    if err != nil {
        log.Fatalf("error initializing app: %v", err)
    }
    client, err := app.Database(ctx)
    if err != nil {
        log.Fatalf("error initializing database: %v", err)
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;First, &lt;code&gt;option.WithCredentialsJSON([]byte(os.Getenv("GOOGLE_APPLICATION_CREDENTIALS")))&lt;/code&gt; allows you to read the credential from the environment variable.&lt;/li&gt;
&lt;li&gt;Next, &lt;code&gt;&amp;amp;firebase.Config{DatabaseURL: os.Getenv("FIREBASE_URL")}&lt;/code&gt; sets the FIREBASE_URL content.&lt;/li&gt;
&lt;li&gt;With this, the code executes correctly; next we will look at handling the Gemini chat history.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  How to correctly process Gemini Pro Chat History?
&lt;/h1&gt;

&lt;h1&gt;
  
  
  Full Source Code
&lt;/h1&gt;

&lt;p&gt;You can find the relevant open source code here: &lt;a href="https://github.com/kkdai/linebot-cf-firebase" rel="noopener noreferrer"&gt;https://github.com/kkdai/linebot-cf-firebase&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>go</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Gemini: Building a LINE E-commerce Chatbot That Can "Tell Stories" from Images</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Thu, 26 Feb 2026 02:44:27 +0000</pubDate>
      <link>https://dev.to/gde/gemini-building-a-line-e-commerce-chatbot-that-can-tell-stories-from-images-5dd9</link>
      <guid>https://dev.to/gde/gemini-building-a-line-e-commerce-chatbot-that-can-tell-stories-from-images-5dd9</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc7ulj3k2ehr5j0fwdch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc7ulj3k2ehr5j0fwdch.png" alt="image-20260225234804185" width="800" height="860"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxox886v85qv8909apuo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxxox886v85qv8909apuo.png" alt="image-20260225234701217" width="800" height="858"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://ai.google.dev/gemini-api/docs/function-calling?hl=zh-tw#multimodal" rel="noopener noreferrer"&gt;Gemini API - Function Calling with Multimodal&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/kkdai/linebot-gemini-multimodel-funcal" rel="noopener noreferrer"&gt;GitHub: linebot-gemini-multimodel-funcal&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/function-calling#mm-fr" rel="noopener noreferrer"&gt;Vertex AI - Multimodal Function Response&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Complete code &lt;a href="https://github.com/kkdai/linebot-gemini-multimodel-funcal" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;I believe many people have used the combination of LINE Bot + Function Calling. When a user asks "What clothes did I buy last month?", the Bot calls the database query function, retrieves the order data, and then Gemini answers based on that JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional process designed by developers:

User: "Help me take a look at the jacket I bought before"
Bot: [Call get_order_history()]
Function returns: {"product_name": "Brown pilot jacket", "order_date": "2026-01-15", ...}
Gemini: "You bought a brown pilot jacket on January 15th for NT$1,890."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The answer is completely correct, but it always feels like something is missing: the user is talking about "that jacket", while Gemini is just restating the text in the JSON and has no way to "confirm" what that piece of clothing looks like. If there happen to be three jackets in the database, the AI simply cannot determine which one the user remembers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI can read text, but cannot see images&lt;/strong&gt;: this limitation has always been a blind spot of the traditional Function Calling architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uneu8jv8mnj4t3576l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uneu8jv8mnj4t3576l3.png" alt="image-20260225230645814" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This problem was not truly solved until Gemini launched &lt;strong&gt;Multimodal Function Response&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Multimodal Function Response?
&lt;/h2&gt;

&lt;p&gt;The traditional Function Calling process is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[User message] → Gemini → [function_call] → [Execute function] → [Return JSON] → Gemini → [Text answer]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Multimodal Function Response&lt;/strong&gt; changed that middle step. The function can not only return JSON, but also include images (JPEG/PNG/WebP) or documents (PDF) in the same response:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge58c6ayrjas18sl2qjz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fge58c6ayrjas18sl2qjz.png" alt="Google Chrome 2026-02-25 23.04.28" width="800" height="439"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[User message] → Gemini → [function_call] → [Execute function] → [Return JSON + image bytes] → Gemini → [Text answer after seeing the image]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemini can "see" the structured data and images returned by the function at the same time when generating the next round of answers, thereby generating richer and more accurate responses.&lt;/p&gt;
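As a rough sketch of the payload shape (plain dictionaries standing in for the SDK's types, so the field names here are illustrative, not the official API), the function response now carries two kinds of parts: the structured JSON and the raw image bytes:

```python
import base64

def build_multimodal_function_response(name, data, image_bytes, mime_type="image/jpeg"):
    """Illustrative shape of a function response that carries both
    structured JSON data and raw image bytes back to the model."""
    return {
        "function_response": {
            "name": name,
            "response": data,
            "parts": [
                {
                    "inline_data": {
                        "mime_type": mime_type,
                        # binary payloads are base64-encoded on the wire
                        "data": base64.b64encode(image_bytes).decode("ascii"),
                    }
                }
            ],
        }
    }

resp = build_multimodal_function_response(
    "get_order_history",
    {"product_name": "Brown pilot jacket", "order_date": "2026-01-15"},
    b"\xff\xd8\xff",  # stand-in for real JPEG bytes
)
```

In the real SDK the same idea is expressed with the typed `FunctionResponsePart` / inline-data objects shown later in this article.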

&lt;p&gt;The media formats currently supported officially:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Supported format&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;image/jpeg&lt;/code&gt;, &lt;code&gt;image/png&lt;/code&gt;, &lt;code&gt;image/webp&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;application/pdf&lt;/code&gt;, &lt;code&gt;text/plain&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The application scenarios of this function are very broad: e-commerce customer service (identifying product images), medical consultation (analyzing PDF of inspection reports), design review (giving suggestions based on screenshots)... almost all scenarios that require "functions to return visual data for AI analysis" are applicable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Goals
&lt;/h2&gt;

&lt;p&gt;This time, I used Multimodal Function Response to create a &lt;strong&gt;LINE e-commerce customer service robot&lt;/strong&gt;, demonstrating the following scenario:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;User: "Help me take a look at the jacket I bought before"&lt;br&gt;
Bot (traditional): "You bought a brown pilot jacket."&lt;br&gt;
Bot (Multimodal): "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon, with metal zipper decorative pockets on the sides. This is your January 15th order ORD-2026-0115, a total of NT$1,890, and it has been delivered." + &lt;strong&gt;Product photo&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The difference is obvious: Gemini really "saw" that piece of clothing, rather than just restating the text in the database.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why not use Google ADK?
&lt;/h3&gt;

&lt;p&gt;Originally, this repo used Google ADK (Agent Development Kit) to manage the Agent. The &lt;code&gt;Runner&lt;/code&gt; and &lt;code&gt;Agent&lt;/code&gt; of ADK encapsulated the entire process of Function Calling, which was very convenient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But Multimodal Function Response needs the image bytes manually included in the &lt;code&gt;parts&lt;/code&gt; of the function response, and ADK completely encapsulates this layer, so you cannot intervene.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So this time, I directly used &lt;code&gt;google.genai.Client&lt;/code&gt; to implement the iterative loop of function calls myself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Old architecture (ADK)
runner = Runner(agent=root_agent, ...)
async for event in runner.run_async(...):
    ... # ADK handles all function calls for you, but you cannot control the response content

# New architecture (directly use google.genai)
response = await client.aio.models.generate_content(
    model=model,
    contents=contents,
    config=types.GenerateContentConfig(tools=ECOMMERCE_TOOLS),
)
# Handle function calls yourself, include images yourself

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Overall Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LINE User
    │
    ▼ POST /
FastAPI Webhook Handler
    │
    ▼
EcommerceAgent.process_message(text, line_user_id)
    │
    ├─ ① Call Gemini (with conversation history)
    │
    ├─ ② Gemini decides to call the tool → function_call
    │
    ├─ ③ _execute_tool()
    │ ├─ Execute query function (search_products / get_order_history / get_product_details)
    │ └─ Read real product photos in the img/ directory (Unsplash JPEG)
    │
    ├─ ④ Construct Multimodal Function Response
    │ └─ FunctionResponsePart(inline_data=FunctionResponseBlob(data=image_bytes))
    │
    ├─ ⑤ Call Gemini again (Gemini sees the image + data)
    │
    └─ ⑥ Return (ai_text, image_bytes)
    │
    ▼
LINE Reply:
  TextSendMessage(text=ai_text)
  ImageSendMessage(url=BOT_HOST_URL/images/{uuid}) ← FastAPI /images endpoint provided

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where do the product images come from?
&lt;/h3&gt;

&lt;p&gt;This demo uses real &lt;strong&gt;Unsplash clothing photos&lt;/strong&gt;. Each of the five products corresponds to an actual photo stored in the &lt;code&gt;img/&lt;/code&gt; directory, and the reading logic is trivial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_product_image(product: dict) -&amp;gt; bytes:
    """Read the product image and return JPEG bytes."""
    with open(product["image_path"], "rb") as f:
        return f.read()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each product in &lt;code&gt;PRODUCTS_DB&lt;/code&gt; has an &lt;code&gt;image_path&lt;/code&gt; field pointing to the corresponding image file:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Product ID&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P001&lt;/td&gt;
&lt;td&gt;Brown pilot jacket&lt;/td&gt;
&lt;td&gt;tobias-tullius-...-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P002&lt;/td&gt;
&lt;td&gt;White cotton T-shirt&lt;/td&gt;
&lt;td&gt;mediamodifier-...-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P003&lt;/td&gt;
&lt;td&gt;Dark blue denim jacket&lt;/td&gt;
&lt;td&gt;caio-coelho-...-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P004&lt;/td&gt;
&lt;td&gt;Beige knit shawl&lt;/td&gt;
&lt;td&gt;milada-vigerova-...-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P005&lt;/td&gt;
&lt;td&gt;Light blue simple T-shirt&lt;/td&gt;
&lt;td&gt;cristofer-maximilian-...-unsplash.jpg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The image bytes are used in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Attached as a &lt;code&gt;FunctionResponseBlob&lt;/code&gt; for Gemini to analyze - real photos let Gemini describe the actual fabric texture and tailoring details&lt;/li&gt;
&lt;li&gt; Temporarily stored in the &lt;code&gt;image_cache&lt;/code&gt; dict and served to the LINE Bot for display through the FastAPI &lt;code&gt;/images/{uuid}&lt;/code&gt; endpoint&lt;/li&gt;
&lt;/ol&gt;
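&lt;p&gt;The second use can be sketched as a tiny cache helper. This is a hypothetical sketch, not the repo's actual code; &lt;code&gt;BOT_HOST_URL&lt;/code&gt; and the helper name are placeholders:&lt;/p&gt;

```python
import uuid

# In-memory store mapping image ids to JPEG bytes (demo only, not persistent)
image_cache: dict[str, bytes] = {}
BOT_HOST_URL = "https://example.ngrok.io"  # placeholder host for illustration

def cache_image(image_bytes: bytes) -> str:
    """Store the bytes under a fresh UUID and return the public URL LINE will fetch."""
    image_id = str(uuid.uuid4())
    image_cache[image_id] = image_bytes
    return f"{BOT_HOST_URL}/images/{image_id}"
```

&lt;p&gt;The FastAPI &lt;code&gt;/images/{image_id}&lt;/code&gt; endpoint then looks the bytes up in &lt;code&gt;image_cache&lt;/code&gt; and returns them with &lt;code&gt;media_type="image/jpeg"&lt;/code&gt;.&lt;/p&gt;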




&lt;h2&gt;
  
  
  Core Code Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Define Tools (FunctionDeclaration)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.genai import types

ECOMMERCE_TOOLS = [
    types.Tool(function_declarations=[
        types.FunctionDeclaration(
            name="get_order_history",
            description="Query the current user's order history",
            parameters=types.Schema(
                type=types.Type.OBJECT,
                properties={
                    "time_range": types.Schema(
                        type=types.Type.STRING,
                        description="Time range: all / last_month / last_3_months",
                        enum=["all", "last_month", "last_3_months"],
                    ),
                },
                required=[],
            ),
        ),
        # ... search_products, get_product_details
    ])
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Function Call Loop (up to 5 iterations)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def process_message(self, text: str, line_user_id: str):
    contents = self._get_history(line_user_id) + [
        types.Content(role="user", parts=[types.Part(text=text)])
    ]

    for _iteration in range(5): # Up to 5 times, to prevent infinite loops
        response = await self._client.aio.models.generate_content(
            model=self._model,
            contents=contents,
            config=types.GenerateContentConfig(
                system_instruction=_SYSTEM_INSTRUCTION,
                tools=ECOMMERCE_TOOLS,
            ),
        )

        model_content = response.candidates[0].content
        contents.append(model_content)

        # Find all function_call parts
        fc_parts = [p for p in model_content.parts if p.function_call and p.function_call.name]

        if not fc_parts:
            # No function call → final text response
            final_text = "".join(p.text for p in model_content.parts if p.text)
            break

        # Has function call → execute tool, include image
        tool_parts = []
        for fc_part in fc_parts:
            result_dict, image_bytes = _execute_tool(
                fc_part.function_call.name,
                dict(fc_part.function_call.args),
                line_user_id,
            )
            tool_parts.append(
                self._build_multimodal_response(fc_part.function_call.name, result_dict, image_bytes)
            )

        contents.append(types.Content(role="tool", parts=tool_parts))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Construct Multimodal Function Response (the most critical step)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def _build_multimodal_response(self, func_name, result_dict, image_bytes):
    multimodal_parts = []

    if image_bytes:
        # ⚠️ Note: Here you need to use FunctionResponseBlob, not types.Blob!
        multimodal_parts.append(
            types.FunctionResponsePart(
                inline_data=types.FunctionResponseBlob(
                    mime_type="image/jpeg",
                    data=image_bytes, # raw bytes, SDK handles base64 internally
                )
            )
        )

    return types.Part.from_function_response(
        name=func_name,
        response=result_dict, # Structured JSON data
        parts=multimodal_parts or None, # ← Image is here! Gemini can "see" it after receiving it
    )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the next &lt;code&gt;generate_content&lt;/code&gt; call, Gemini receives &lt;code&gt;result_dict&lt;/code&gt; (the order JSON) and &lt;code&gt;image_bytes&lt;/code&gt; (the product photo) together, so its answer can describe what the image actually shows.&lt;/p&gt;
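&lt;p&gt;Note that you hand the SDK raw bytes, not a base64 string; the base64 encoding for transport happens inside the SDK. A minimal stdlib illustration of that round trip:&lt;/p&gt;

```python
import base64

image_bytes = b"\xff\xd8\xff\xe0"  # sample bytes (a JPEG header prefix)

# What the SDK does internally: encode the inline_data to base64 for transport...
wire_payload = base64.b64encode(image_bytes).decode("ascii")
print(wire_payload)  # → /9j/4A==

# ...and the receiving side decodes back to the identical bytes.
assert base64.b64decode(wire_payload) == image_bytes
```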

&lt;h3&gt;
  
  
  Step 4: LINE Bot simultaneously returns text + image
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.py

ai_text, image_bytes = await ecommerce_agent.process_message(msg_text, line_user_id)

reply_messages = [TextSendMessage(text=ai_text)]

if image_bytes:
    image_id = str(uuid.uuid4())
    image_cache[image_id] = image_bytes # Temporary storage
    image_url = f"{BOT_HOST_URL}/images/{image_id}" # FastAPI provides service
    reply_messages.append(
        ImageSendMessage(
            original_content_url=image_url,
            preview_image_url=image_url,
        )
    )

await get_line_bot_api().reply_message(event.reply_token, reply_messages)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LINE Bot's &lt;code&gt;reply_message&lt;/code&gt; supports replying with multiple messages at once (up to 5), so the text and the image can be sent together.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pitfalls
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 1: &lt;code&gt;FunctionResponseBlob&lt;/code&gt; is not &lt;code&gt;Blob&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The easiest pitfall to hit: when constructing the multimodal image part, &lt;strong&gt;you cannot use &lt;code&gt;types.Blob&lt;/code&gt;&lt;/strong&gt;; you must use &lt;code&gt;types.FunctionResponseBlob&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ❌ Error (will TypeError)
types.FunctionResponsePart(
    inline_data=types.Blob(mime_type="image/jpeg", data=image_bytes)
)

# ✅ Correct
types.FunctionResponsePart(
    inline_data=types.FunctionResponseBlob(mime_type="image/jpeg", data=image_bytes)
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although both types have &lt;code&gt;mime_type&lt;/code&gt; and &lt;code&gt;data&lt;/code&gt; fields, the &lt;code&gt;inline_data&lt;/code&gt; field of &lt;code&gt;FunctionResponsePart&lt;/code&gt; is typed as &lt;code&gt;FunctionResponseBlob&lt;/code&gt;, so Pydantic validation rejects a &lt;code&gt;Blob&lt;/code&gt; outright. You can confirm this with &lt;code&gt;python -c "from google.genai import types; print(types.FunctionResponsePart.model_fields)"&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  ❌ Pitfall 2: &lt;code&gt;aiohttp.ClientSession&lt;/code&gt; cannot be created at the module level
&lt;/h3&gt;

&lt;p&gt;The original code directly created &lt;code&gt;aiohttp.ClientSession()&lt;/code&gt; at the module level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ❌ Old method: module level
session = aiohttp.ClientSession() # If there is no running event loop, there will be a warning or error
async_http_client = AiohttpAsyncHttpClient(session)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When pytest imports &lt;code&gt;main.py&lt;/code&gt;, there is no running event loop, so &lt;code&gt;RuntimeError: no running event loop&lt;/code&gt; appears. The fix is lazy initialization: create the session only when it is first actually needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ✅ New method: lazy init
_line_bot_api = None

def get_line_bot_api():
    global _line_bot_api
    if _line_bot_api is None:
        session = aiohttp.ClientSession() # Called within the async route handler, ensuring there is an event loop
        _line_bot_api = AsyncLineBotApi(channel_access_token, AiohttpAsyncHttpClient(session))
    return _line_bot_api

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
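&lt;p&gt;The root cause is easy to reproduce with the standard library alone (this is an illustration, not code from the repo): outside a running event loop, &lt;code&gt;asyncio.get_running_loop()&lt;/code&gt; raises the same &lt;code&gt;RuntimeError&lt;/code&gt; that trips up a module-level &lt;code&gt;ClientSession()&lt;/code&gt;:&lt;/p&gt;

```python
import asyncio

# At import time (module level) no event loop is running yet,
# so anything that needs one, like aiohttp.ClientSession, fails here.
try:
    asyncio.get_running_loop()
except RuntimeError as exc:
    print(exc)  # "no running event loop"

async def main() -> None:
    # Inside a coroutine the loop exists, so lazy creation succeeds.
    loop = asyncio.get_running_loop()
    print(loop.is_running())  # True

asyncio.run(main())
```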



&lt;h3&gt;
  
  
  ❌ Pitfall 3: LINE Bot needs HTTPS URL to send images
&lt;/h3&gt;

&lt;p&gt;Gemini receives raw bytes, but LINE Bot's &lt;code&gt;ImageSendMessage&lt;/code&gt; needs a &lt;strong&gt;publicly accessible HTTPS URL&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The solution is to add a &lt;code&gt;/images/{image_id}&lt;/code&gt; endpoint in FastAPI, temporarily store the read image bytes in the &lt;code&gt;image_cache&lt;/code&gt; dict, and LINE retrieves the image through this endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@app.get("/images/{image_id}")
async def serve_image(image_id: str):
    image_bytes = image_cache.get(image_id)
    if image_bytes is None:
        raise HTTPException(status_code=404, detail="Image not found")
    return Response(content=image_bytes, media_type="image/jpeg")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For local development, expose port 8000 with &lt;code&gt;ngrok&lt;/code&gt;; after deploying to Cloud Run, use the service URL directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo Display
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mock Database (default data for Demo)
&lt;/h3&gt;

&lt;p&gt;The system has 5 built-in products (all with real Unsplash photos), and two demo orders are automatically bound to each LINE user the first time they query their orders:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Order Number&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Product&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ORD-2026-0115&lt;/td&gt;
&lt;td&gt;2026-01-15&lt;/td&gt;
&lt;td&gt;P001 Brown pilot jacket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ORD-2026-0108&lt;/td&gt;
&lt;td&gt;2026-01-08&lt;/td&gt;
&lt;td&gt;P003 Dark blue denim jacket&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Scenario 1: "Help me take a look at the jacket I bought before"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User sends: "Help me take a look at the jacket I bought before"

[Gemini → function_call]
  get_order_history(time_range="all")

[_execute_tool execution]
  - get_order_history() returns two orders (P001, P003)
  - Read img/tobias-tullius-...-unsplash.jpg → Brown pilot jacket real photo bytes

[Multimodal Function Response]
  Part.from_function_response(
    name="get_order_history",
    response={"orders": [...], "order_count": 2},
    parts=[FunctionResponsePart(inline_data=FunctionResponseBlob(data=&amp;lt;photo&amp;gt;))]
  )

[Gemini responds after seeing the real photo]
  "From the photo, you can see that this is a brown pilot jacket, made of lightweight nylon
   material with a glossy feel, and a metal zipper decorative pocket on the left sleeve. This is your January 15, 2026
   order ORD-2026-0115, a total of NT$1,890, status: delivered."

LINE displays: [Text] + [Brown pilot jacket real photo]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: "Are there any dark blue jackets?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Gemini → function_call]
  search_products(description="dark blue jacket", color="dark blue")

[Gemini sees the real photo of P003 dark blue denim jacket]
  "Yes! This dark blue denim jacket (P003) in the photo adopts a retro stitching design,
   lapel with metal buttons, the overall garment feel is full, priced at NT$1,490, 8 in stock."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 3: "What are the features of the P004 knit shawl?"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Gemini → function_call]
  get_product_details(product_id="P004")

[Gemini sees the real photo of the beige knit shawl]
  "The photo shows a beige handmade crochet shawl, V-neck design with tassels at the bottom,
   you can see the light lace-like mesh weave, elegant texture, priced at NT$1,290."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Traditional Function Response vs Multimodal Function Response
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Multimodal&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Function return&lt;/td&gt;
&lt;td&gt;Pure JSON&lt;/td&gt;
&lt;td&gt;JSON + image/PDF bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini perception&lt;/td&gt;
&lt;td&gt;Text data&lt;/td&gt;
&lt;td&gt;Text + visual content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Answer quality&lt;/td&gt;
&lt;td&gt;"You bought a brown pilot jacket"&lt;/td&gt;
&lt;td&gt;"You can see the nylon material gloss, zipper pocket on the left sleeve..."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API difference&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Part.from_function_response(name, response)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Part.from_function_response(name, response, parts=[...])&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Applicable scenarios&lt;/td&gt;
&lt;td&gt;Pure text data query&lt;/td&gt;
&lt;td&gt;Scenarios that require visual recognition/confirmation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Analysis and Outlook
&lt;/h2&gt;

&lt;p&gt;This implementation gave me a new understanding of Gemini's Function Calling capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Multimodal Function Response truly solves&lt;/strong&gt; is letting the AI agent carry visual information within the very act of "calling an external system", instead of fetching the text first and uploading the image separately. This will become a foundational capability in visually driven domains such as e-commerce, healthcare, and design.&lt;/p&gt;

&lt;p&gt;However, there are still a few limitations worth noting:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Image URLs cannot be used directly&lt;/strong&gt;: Gemini's &lt;code&gt;FunctionResponseBlob&lt;/code&gt; takes raw bytes; you cannot pass in a URL (unlike including an image directly in the prompt). If the image lives at a URL, download it (e.g. with &lt;code&gt;requests.get()&lt;/code&gt;) into bytes first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;display_name&lt;/code&gt; is optional&lt;/strong&gt;: The official documentation examples include &lt;code&gt;display_name&lt;/code&gt; and a &lt;code&gt;$ref&lt;/code&gt; JSON reference, but in my tests with google-genai 1.49.0 everything works without &lt;code&gt;display_name&lt;/code&gt;, and Gemini can still see and analyze the image.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model limitations&lt;/strong&gt;: The feature is officially marked as supported on the Gemini 3 series, but in my tests &lt;code&gt;gemini-2.0-flash&lt;/code&gt; also handles it fine, with the same API structure.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
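&lt;p&gt;For the first limitation, the standard library works just as well as &lt;code&gt;requests&lt;/code&gt;. A hedged sketch (the function name is mine, not from the repo) that turns a URL into the raw bytes &lt;code&gt;FunctionResponseBlob&lt;/code&gt; expects:&lt;/p&gt;

```python
import urllib.request

def image_url_to_bytes(url: str) -> bytes:
    """Fetch an image URL and return its raw bytes,
    ready to pass as FunctionResponseBlob(data=...)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# data: URLs resolve without a network round trip, handy for a quick check.
sample = image_url_to_bytes("data:image/jpeg;base64,/9j/4A==")
print(sample)  # b'\xff\xd8\xff\xe0'
```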

&lt;p&gt;There are many directions that can be extended in the future: let users send their own product photos for the Bot to compare, include PDF catalogs in the function response for Gemini to read directly, or let the Bot analyze the report images converted from DICOM in medical scenarios... As long as visual data can be obtained from external systems, Multimodal Function Response can make the AI's answers more in-depth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The takeaway from this LINE Bot implementation fits in one sentence: &lt;strong&gt;Let the function response carry images, and Gemini's answers upgrade from "restating data" to "telling stories based on images"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The core API is just these few lines, but getting the whole pipeline working takes attention to many details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Gemini sees the complete writing of the image returned by the function
types.Part.from_function_response(
    name="get_order_history",
    response={"orders": [...]},
    parts=[
        types.FunctionResponsePart(
            inline_data=types.FunctionResponseBlob( # ← Not types.Blob!
                mime_type="image/jpeg",
                data=image_bytes,
            )
        )
    ],
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The complete code is on &lt;a href="https://github.com/kkdai/linebot-gemini-multimodel-funcal" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, feel free to clone and play with it.&lt;/p&gt;

&lt;p&gt;See you next time!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Google Developer Year-end 2025 Recap: Gemini 2025 New Features and Perfect Integration with LINE Bot</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sun, 08 Feb 2026 05:03:50 +0000</pubDate>
      <link>https://dev.to/gde/google-developer-year-end-2025-recap-gemini-2025-new-features-and-perfect-integration-with-line-bot-3n2m</link>
      <guid>https://dev.to/gde/google-developer-year-end-2025-recap-gemini-2025-new-features-and-perfect-integration-with-line-bot-3n2m</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jslymx707xu3ny3gwkq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jslymx707xu3ny3gwkq.png" alt="image-20260207202439911" width="800" height="600"&gt;&lt;/a&gt;s&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Yesterday, I attended the Google Developer Year-end 2025 event hosted by Google and also visited the Google Banqiao office. In my capacity as LINE Taiwan Developer Relations, I was delighted to share my observations on how Gemini technology evolved throughout 2025.&lt;/p&gt;

&lt;p&gt;In the popular anime "Frieren: Beyond Journey's End," I really like the character "Übel" from the First-Class Mage Exam arc. She has a unique ability concept: "If you can imagine cutting it, you can definitely cut it."&lt;/p&gt;

&lt;p&gt;This sentence perfectly echoes the current AI era - &lt;strong&gt;imagination and comprehension have become more important than ever before&lt;/strong&gt;. How to "precisely imagine how to solve a problem" has become the key to enabling AI to assist you accurately. This article will summarize the key Gemini 2025 features shared that day, as well as my views on the core capabilities of "software engineers" in the AI wave.&lt;/p&gt;

&lt;h3&gt;
  
  
  2025 Gemini Feature Evolution Review
&lt;/h3&gt;

&lt;p&gt;Looking back at 2025, the integration of Gemini and LINE Bot saw groundbreaking updates at several points in time. Here is a review of this year's technical milestones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time Point&lt;/th&gt;
&lt;th&gt;Feature Update&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2025.04&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google ADK&lt;/td&gt;
&lt;td&gt;Initial integration of Agent and Messaging API, demonstrating basic Agent applications such as weather inquiries.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2025.06&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;Major upgrade to the developer experience, directly collaborating with AI in the terminal to perform file operations and code writing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2025.08&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Video Understanding&lt;/td&gt;
&lt;td&gt;Support for YouTube video understanding. Gemini 2.5 directly grabs subtitles and video content for summarization and interaction.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2025.11&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File Search&lt;/td&gt;
&lt;td&gt;Enhanced file search capabilities, supporting RAG applications for various formats such as JSON, JS, PDF, and Python.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2025.12&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Map Grounding&lt;/td&gt;
&lt;td&gt;Combined with Google Maps Platform, allowing the Bot to answer geographical information questions such as "recent earthquake information" or "nearby restaurants."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Detailed Explanation of Technical Highlights
&lt;/h4&gt;

&lt;h5&gt;
  
  
  1. Gemini CLI and Vibe Coding
&lt;/h5&gt;

&lt;p&gt;The Gemini CLI, launched in June, changed the habits of many developers. It's not just for printing "Hello World"; it integrates tools like Git and gcloud. This brings forth a new development concept: &lt;strong&gt;Vibe Coding&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Definition&lt;/strong&gt;: This is not just writing code, but allowing the development process to enter a "flow" state through tools like Gemini CLI, Vertex AI Studio, and Antigravity.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Key&lt;/strong&gt;: The focus is on how developers orchestrate the connection of these tools, rather than manually writing every line of code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  2. Integration of Visual and Geographic Information (Video &amp;amp; Map)
&lt;/h5&gt;

&lt;p&gt;Video Understanding in August allowed us to directly input YouTube links, and Gemini could generate summaries and even answer video details. At the end of the year, Map Grounding filled the biggest gap in LLMs: "real-time geographic information."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Application Scenario&lt;/strong&gt;: When users ask "find restaurants," the Bot uses Map Grounding to find nearby restaurants like "CHILLAX" or "博感情" and provides addresses and types.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Data Source&lt;/strong&gt;: Combines World Knowledge (Google Search) and Private Knowledge (Your Data/RAG) to make the answers more grounded.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Re-examining: The Three Pillars of Outstanding Talent
&lt;/h3&gt;

&lt;p&gt;While technology tools are constantly evolving, I'm also thinking about what kind of abilities AI cannot replace. Just like "Übel's" imagination mentioned earlier, I believe that outstanding talent needs to possess three pillars:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. AI Collaboration
&lt;/h4&gt;

&lt;p&gt;This is not just knowing how to use tools, but also having the ability of &lt;strong&gt;Prompt Engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Difference&lt;/strong&gt;: Those who know how to converse with AI and guide AI to produce results can increase their production speed by 10 times.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Key&lt;/strong&gt;: AI is your Copilot, but you are the captain.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Domain Depth
&lt;/h4&gt;

&lt;p&gt;In an era where AI is rampant, &lt;strong&gt;Domain Knowledge&lt;/strong&gt; is your strongest moat.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Value&lt;/strong&gt;: AI can write syntactically correct code, but "experience in solving complex problems" and "a deep understanding of business logic" are difficult for AI to imitate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Empathy &amp;amp; Creativity
&lt;/h4&gt;

&lt;p&gt;Transforming "passion" into human-specific &lt;strong&gt;critical thinking&lt;/strong&gt; and &lt;strong&gt;empathy&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Core&lt;/strong&gt;: This is the core of building relationships and managing decisions. AI can process data, but it cannot understand the true motivations behind people's emotions and needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;This Google Developer Year-end event reaffirmed for me that understanding systems and recognizing problems are the most important abilities for software engineers.&lt;/p&gt;

&lt;p&gt;Development in the AI era is no longer just simple coding; like designing the AP2 protocol, it requires thinking about the overall architecture and security. If we only Vibe Code quickly and ignore the underlying principles (such as token expiry or data correctness), it is easy to produce a flawed system.&lt;/p&gt;

&lt;p&gt;Therefore, staying curious about technology while deepening your domain knowledge is how we prove that AI cannot replace us.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Slides Download&lt;/strong&gt;: Friends who are interested can refer to the slides from that day: &lt;a href="https://speakerdeck.com/line_developers_tw/2025-features-recap-perfect-integration-linebot" rel="noopener noreferrer"&gt;https://speakerdeck.com/line_developers_tw/2025-features-recap-perfect-integration-linebot&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;(If you find this helpful, feel free to share this article!)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gemini</category>
      <category>google</category>
      <category>news</category>
    </item>
    <item>
      <title>[Gemini CLI] Google Developer Knowledge API and MCP Server: Equipping Your AI Assistant with an Official Knowledge Base</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sun, 08 Feb 2026 05:03:37 +0000</pubDate>
      <link>https://dev.to/gde/gemini-cli-google-developer-knowledge-api-and-mcp-server-equipping-your-ai-assistant-with-an-3gee</link>
      <guid>https://dev.to/gde/gemini-cli-google-developer-knowledge-api-and-mcp-server-equipping-your-ai-assistant-with-an-3gee</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flimezcggfj3qlw3khdw5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flimezcggfj3qlw3khdw5.png" alt="iTerm2 2026-02-08 01.24.09" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/" rel="noopener noreferrer"&gt;Introducing the Developer Knowledge API and MCP server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.google.com/knowledge/mcp#claude-code" rel="noopener noreferrer"&gt;Google Knowledge MCP Server&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.google.com/knowledge/reference/corpus-reference" rel="noopener noreferrer"&gt;Developer Knowledge API Corpus Reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Background
&lt;/h1&gt;

&lt;p&gt;Just last week, I was integrating the Gemini API using the Gemini CLI, and it confidently told me, "This is how you use this API parameter." But when I ran it, I got a pile of errors. It turned out Google had changed the API format three months earlier. This isn't the AI's fault; its training data cutoff is what it is. Faced with ever-changing technical documentation, even the strongest models go stale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typical scenarios we've encountered in the past:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer: "Gemini, help me write an example of Gemini Function Calling"
AI: "Okay, you can write it like this..." [Generates code based on June 2024 documentation]
Developer: [Copy and paste, execute]
Terminal: ❌ Error: Parameter 'tools' format has changed in v2
Developer: 😤 "I have to go look at the official documentation again..."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sound familiar? Even Gemini 1.5 Pro sometimes gives outdated suggestions because its own API changes so quickly. &lt;strong&gt;AI's knowledge is static, but technical documentation is dynamic&lt;/strong&gt;, and this contradiction has troubled us for a long time.&lt;/p&gt;

&lt;p&gt;To solve this problem once and for all, Google released two major tools in early 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Developer Knowledge API&lt;/strong&gt; - A machine-readable official documentation API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge MCP Server&lt;/strong&gt; - A real-time document query service based on the Model Context Protocol&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means your AI assistant no longer just "remembers" how to write code; it can actively "consult the latest official documentation" when needed, becoming a development expert backed by official sources and never out of date.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Developer Knowledge API?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How AI used to learn documents: The dilemma of web crawlers
&lt;/h3&gt;

&lt;p&gt;Traditionally, AI models learn documentation by crawling web pages. But this approach has several serious problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;❌ Noise Interference&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!-- Actual content seen by AI --&amp;gt;
&amp;lt;nav&amp;gt;...&amp;lt;/nav&amp;gt; &amp;lt;!-- Navigation bar --&amp;gt;
&amp;lt;ad&amp;gt;...&amp;lt;/ad&amp;gt; &amp;lt;!-- Advertisement --&amp;gt;
&amp;lt;cookie-banner&amp;gt;...&amp;lt;/cookie-banner&amp;gt; &amp;lt;!-- Cookie prompt --&amp;gt;
&amp;lt;div class="content"&amp;gt;
  &amp;lt;!-- The real document content only accounts for 30% --&amp;gt;
  This is how to use the Gemini API...
&amp;lt;/div&amp;gt;
&amp;lt;footer&amp;gt;...&amp;lt;/footer&amp;gt; &amp;lt;!-- Footer --&amp;gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI has to "guess" which parts of this pile of HTML are the actual document content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;❌ Inconsistent Formatting&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some use &lt;code&gt;&amp;lt;code&amp;gt;&lt;/code&gt; tags, some use &lt;code&gt;&amp;lt;pre&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Some use Markdown rendering, some use custom syntax&lt;/li&gt;
&lt;li&gt;Image descriptions may be in &lt;code&gt;alt&lt;/code&gt;, &lt;code&gt;title&lt;/code&gt;, or &lt;code&gt;figcaption&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ Update Delay&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Crawlers may only crawl every few months&lt;/li&gt;
&lt;li&gt;New API parameters have to wait for the next training to know&lt;/li&gt;
&lt;li&gt;The training data cutoff date becomes a perpetual pain&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Developer Knowledge API: A machine-first document system
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Developer Knowledge API&lt;/strong&gt; completely changes the game. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;✅ Machine-readable source of truth&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Directly provides pure Markdown format&lt;/li&gt;
&lt;li&gt;No noise, no ads, no navigation bar&lt;/li&gt;
&lt;li&gt;Structured metadata (author, update time, version)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;✅ Real-time&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Synchronized updates&lt;/strong&gt; with Google's official documentation (delay &amp;lt; 1 hour)&lt;/li&gt;
&lt;li&gt;When the API changes, the AI can immediately read the new documents&lt;/li&gt;
&lt;li&gt;There will never be the problem of "outdated training data"&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;✅ Comprehensive&lt;/strong&gt;: It retrieves documents directly from the following official Google domains. If your work touches any of these, enabling this MCP is strongly recommended:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ai.google.dev&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;developer.android.com&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;developer.chrome.com&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;developers.home.google.com&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;developers.google.com&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs.cloud.google.com&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docs.apigee.com&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;firebase.google.com&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fuchsia.dev&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;web.dev&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;www.tensorflow.org&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  MCP Server: Making AI more "knowledgeable"
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) is an open standard that works like an "add-on slot" for AI tools. Google's newly launched &lt;strong&gt;Knowledge MCP Server&lt;/strong&gt; lets any tool that supports MCP (such as Claude Code, Cursor, or our favorite Gemini CLI) integrate it easily.&lt;/p&gt;

&lt;p&gt;Through this MCP Server, the AI no longer writes code purely from memory; it can "consult the books" for specific questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation guidance&lt;/strong&gt;: Ask for the best implementation method for a new feature.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Troubleshooting&lt;/strong&gt;: Diagnose directly based on the latest Error Code documentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version comparison&lt;/strong&gt;: Understand the differences between different versions of the API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are interested in MCP applications for a specific domain, I covered another powerful one in a previous article, &lt;a href="https://dev.to/gde/geminigoogle-maps-building-location-aware-ai-apps-with-the-google-maps-grounding-api-4l36"&gt;Google Maps Platform Assist MCP: Let AI help you write more accurate map applications&lt;/a&gt;, which gives AI assistants an edge when developing map features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hands-on: Letting AI assistants import the official knowledge base
&lt;/h2&gt;

&lt;p&gt;To enable AI assistants to read official documentation, we need to complete some simple preparations in Google Cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Enable the Developer Knowledge API
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Go to the &lt;a href="https://console.cloud.google.com/apis/library/knowledge.googleapis.com" rel="noopener noreferrer"&gt;Developer Knowledge API page&lt;/a&gt; in the Google API Library.&lt;/li&gt;
&lt;li&gt;Make sure you have selected the correct project.&lt;/li&gt;
&lt;li&gt;Click "&lt;strong&gt;Enable&lt;/strong&gt;". This API does not require special IAM permissions to use.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 2: Create and protect your API key
&lt;/h3&gt;

&lt;p&gt;For security, it is recommended to restrict the key:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the Google Cloud console, navigate to the "&lt;strong&gt;Credentials&lt;/strong&gt;" page.&lt;/li&gt;
&lt;li&gt;Click "&lt;strong&gt;Create Credentials&lt;/strong&gt;", then select "&lt;strong&gt;API key&lt;/strong&gt;".&lt;/li&gt;
&lt;li&gt;Click "&lt;strong&gt;Edit API key&lt;/strong&gt;".&lt;/li&gt;
&lt;li&gt;Enter a recognizable name in the name field (e.g., &lt;code&gt;Dev-Knowledge-Key&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Under "API restrictions", select "&lt;strong&gt;Restrict key&lt;/strong&gt;".&lt;/li&gt;
&lt;li&gt;Select "&lt;strong&gt;Developer Knowledge API&lt;/strong&gt;" from the API list, and then click OK.&lt;/li&gt;
&lt;li&gt;Click "&lt;strong&gt;Save&lt;/strong&gt;".&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After creation, click "Show Key" and note it down; this is the credential we will use next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm909ocurb0ukzxup163i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm909ocurb0ukzxup163i.png" alt="Google Chrome 2026-02-07 20.52.15" width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are using &lt;strong&gt;Claude Code&lt;/strong&gt; or the &lt;strong&gt;Gemini CLI&lt;/strong&gt;, a simple configuration is all it takes to make it more capable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration Example (using Gemini CLI as an example)
&lt;/h3&gt;

&lt;p&gt;You only need to add Google's MCP Server address to the settings and include your API Key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Add Google Developer Knowledge MCP Server
gemini mcp add -t http -H "X-Goog-Api-Key: YOUR_API_KEY" google-developer-knowledge https://developerknowledge.googleapis.com/mcp --scope user

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
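&lt;p&gt;If you use Claude Code instead, the equivalent configuration can go into a &lt;code&gt;.mcp.json&lt;/code&gt; file in your project root. This is a sketch based on Claude Code's MCP configuration format; check the current Claude Code documentation for the exact schema before relying on it:&lt;/p&gt;

```json
{
  "mcpServers": {
    "google-developer-knowledge": {
      "type": "http",
      "url": "https://developerknowledge.googleapis.com/mcp",
      "headers": {
        "X-Goog-Api-Key": "YOUR_API_KEY"
      }
    }
  }
}
```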



&lt;p&gt;Once the configuration is complete, when you ask "how to use the latest Gemini API for Function Calling", the AI will actively call the MCP Server to retrieve the most accurate and up-to-date document content from the official website to answer you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Analysis and Outlook: Why is this important?
&lt;/h2&gt;

&lt;p&gt;The launch of this technology marks two major shifts in the development process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;From "relying on memory" to "real-time query"&lt;/strong&gt;: In the past, we made models bigger so they could remember more. Now, we let the model learn to "look things up" through MCP. This not only greatly reduces hallucinations, but also eases the pressure to retrain models frequently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;More powerful development agents (AI Agents)&lt;/strong&gt;: When AI assistants can read documents, execute instructions, and perform version control, they truly evolve into "digital colleagues" that can handle tasks independently. The structured information provided by the Developer Knowledge API is the fuel AI Agents need for complex reasoning.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This time, Google delivers not only powerful models but also an excellent "data interface." For developers who care about efficiency, configuring the Developer Knowledge MCP Server is well worth the five-minute investment.&lt;/p&gt;

&lt;p&gt;In the future, your AI assistant will no longer be just a machine that writes code, but a technical consultant that always checks the latest official documentation and gives you accurate advice. Why not apply for an API key and try it out?&lt;/p&gt;

&lt;p&gt;See you next time!&lt;/p&gt;

</description>
      <category>api</category>
      <category>gemini</category>
      <category>google</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Building Interoperable AI Business Agents with UCP: DevBooks Agent Implementation Analysis</title>
      <dc:creator>Evan Lin</dc:creator>
      <pubDate>Sat, 31 Jan 2026 17:23:33 +0000</pubDate>
      <link>https://dev.to/gde/building-interoperable-ai-business-agents-devbooks-agent-implementation-analysis-1gfp</link>
      <guid>https://dev.to/gde/building-interoperable-ai-business-agents-devbooks-agent-implementation-analysis-1gfp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7h70a7nmuj3qpc5fzlq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa7h70a7nmuj3qpc5fzlq.png" alt="Google Chrome 2026-01-31 21.09.06" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Previous Article Recap
&lt;/h1&gt;

&lt;p&gt;In the previous article, we explored how to implement Agentic Vision using LINE Bot. Today, we'll shift our focus to another important area of AI Agents: &lt;strong&gt;E-commerce and Interoperability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Most current AI Agents are "islands." If you want to buy a book, you might need a dedicated bookstore Bot; to buy groceries, you'll need another grocery Bot. These Agents don't communicate with each other, and the user experience is fragmented.&lt;/p&gt;

&lt;p&gt;To solve this problem, the &lt;strong&gt;Universal Commerce Protocol (UCP)&lt;/strong&gt; was born. It's like HTML for the commerce world, defining a set of standard languages that allow different AI Agents (buyer agents and seller agents) to "communicate" with each other and complete complex commercial transactions.&lt;/p&gt;

&lt;p&gt;In this article, I will take you deep into the code of &lt;code&gt;devbooks_agent&lt;/code&gt;, a UCP-based technical bookstore Agent, to demonstrate how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is UCP and A2A?
&lt;/h2&gt;

&lt;p&gt;Before diving into the code, let's briefly understand two core concepts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;UCP (Universal Commerce Protocol)&lt;/strong&gt;: A standardized commerce protocol. It defines data structures (Schemas) such as "Product", "Order", and "Checkout", ensuring that the "Product" your Agent says is the same as the "Product" my Agent understands.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;A2A (Agent-to-Agent)&lt;/strong&gt;: The communication model between Agents. Here, we will have two roles:

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;User Agent (Client)&lt;/strong&gt;: Represents the user, responsible for sending requests (e.g., "I want to buy a React book").&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Business Agent (Merchant)&lt;/strong&gt;: Represents the merchant (like DevBooks), responsible for providing product information and processing orders.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our focus today is this &lt;strong&gt;Business Agent&lt;/strong&gt; — &lt;code&gt;devbooks_agent&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure Overview
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;devbooks_agent&lt;/code&gt; is a standard Python project that uses Google's Agent Development Kit (ADK).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;devbooks_agent/
├── src/devbooks_agent/
│ ├── agent.py # Agent's brain: defines tools and behaviors
│ ├── ucp_profile_resolver.py # UCP handshake protocol: confirms each other's capabilities
│ ├── store.py # Simulated database and business logic
│ ├── data/
│ │ ├── ucp.json # UCP capability declaration
│ │ ├── products.json # Book catalog
│ │ └── agent_card.json # Agent's business card
│ └── main.py # Program entry point
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1. Defining Agent Capabilities (&lt;code&gt;ucp.json&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;First, the Agent needs to tell the world what it "can do." This is defined through &lt;code&gt;ucp.json&lt;/code&gt;. This is like the Agent's resume.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "ucp": {
    "version": "2026-01-11",
    "capabilities": [
      {
        "name": "dev.ucp.shopping.checkout",
        "version": "2026-01-11",
        "spec": "https://ucp.dev/specs/shopping/checkout"
      },
      {
        "name": "dev.ucp.shopping.fulfillment",
        "version": "2026-01-11",
        "extends": "dev.ucp.shopping.checkout"
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration declares that the DevBooks Agent supports the 2026 version of the UCP protocol and has the capabilities of "shopping checkout" and "logistics delivery."&lt;/p&gt;
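&lt;p&gt;To make this concrete, here is a small self-contained check (the JSON is inlined from the &lt;code&gt;ucp.json&lt;/code&gt; above; the &lt;code&gt;supports&lt;/code&gt; helper is an illustration, not part of the UCP spec) showing how a client might verify that a merchant declares a given capability:&lt;/p&gt;

```python
import json

# The merchant profile from ucp.json above, inlined for illustration.
PROFILE = json.loads("""
{
  "ucp": {
    "version": "2026-01-11",
    "capabilities": [
      {"name": "dev.ucp.shopping.checkout", "version": "2026-01-11",
       "spec": "https://ucp.dev/specs/shopping/checkout"},
      {"name": "dev.ucp.shopping.fulfillment", "version": "2026-01-11",
       "extends": "dev.ucp.shopping.checkout"}
    ]
  }
}
""")

def supports(profile: dict, capability: str) -> bool:
    """Return True if the merchant profile declares the given UCP capability."""
    return any(c["name"] == capability for c in profile["ucp"]["capabilities"])

print(supports(PROFILE, "dev.ucp.shopping.checkout"))  # → True
print(supports(PROFILE, "dev.ucp.payments"))           # → False
```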

&lt;h3&gt;
  
  
  2. Agent's Brain and Tools (&lt;code&gt;agent.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;This is the heart of the project. We use &lt;code&gt;google.adk.agents.Agent&lt;/code&gt; to define the Agent and equip it with various tools (Tools).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# src/devbooks_agent/agent.py

root_agent = Agent(
    name="devbooks_agent",
    model="gemini-2.5-flash", # Uses the latest Gemini model
    description="Agent to help with shopping for technical books",
    instruction=(
        "You are a helpful agent who assists developers in finding and purchasing"
        " technical books..."
        # ... Detailed Prompt instructions ...
    ),
    tools=[
        search_shopping_catalog, # Search for books
        preview_book, # Preview (DevBooks exclusive feature)
        add_to_checkout, # Add to cart
        start_payment, # Start checkout
        complete_checkout, # Complete order
        # ... Other tools
    ],
    # ... callback settings
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Featured Tool: &lt;code&gt;preview_book&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unlike a general grocery store, selling books usually calls for a "preview." This is where the benefit of exposing Agent capabilities as tools shows: we can easily add custom features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def preview_book(tool_context: ToolContext, book_id: str) -&amp;gt; dict:
  """Gets a preview/sample chapter of a book."""
  try:
    preview = store.get_book_preview(book_id)
    if preview is None:
        # Handle the case where there is no preview
        return _create_error_response(...)

    return {
        "preview": preview.model_dump(mode="json"),
        "status": "success"
    }
  except Exception:
    # Error handling
    return _create_error_response(...)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. UCP Handshake Protocol (&lt;code&gt;ucp_profile_resolver.py&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;When the User Agent connects to the Business Agent, both parties need to "tune in" first to confirm the UCP versions they support. This is handled by &lt;code&gt;ProfileResolver&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# src/devbooks_agent/ucp_profile_resolver.py

def resolve_profile(self, client_profile_url: str, user_id: str | None = None) -&amp;gt; dict:
    # 1. Get the Client's Profile
    profile = self._fetch_profile(client_profile_url, headers=headers)

    # 2. Check version compatibility
    client_version = profile.get("ucp").get("version")
    merchant_version = self.merchant_profile.get("ucp").get("version")

    # If the Client version is too new, and the Merchant doesn't support it, then report an error
    if client_version &amp;gt; merchant_version:
      raise ServerError(...)

    return profile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that both parties in the transaction are on the same channel and that there is no miscommunication.&lt;/p&gt;
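&lt;p&gt;One subtle detail: the resolver above compares versions with a plain string operator. That works here only because UCP versions are ISO 8601 dates (&lt;code&gt;YYYY-MM-DD&lt;/code&gt;), which sort correctly as strings; it would not be safe for dotted semantic versions:&lt;/p&gt;

```python
# ISO 8601 dates compare correctly as plain strings, because every field
# is fixed-width and ordered most-significant first.
client_version = "2026-02-01"
merchant_version = "2026-01-11"

print(client_version > merchant_version)  # → True: the client is newer

# The same trick breaks for dotted semantic versions, where lexicographic
# comparison disagrees with numeric ordering.
print("10.0" < "9.0")  # → True lexicographically, wrong numerically
```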

&lt;h2&gt;
  
  
  Practical Demo
&lt;/h2&gt;

&lt;p&gt;After understanding the architecture, let's run a complete UCP test flow. This Demo will simulate a developer purchasing a technical book.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment Preparation
&lt;/h3&gt;

&lt;p&gt;Make sure you have started the following services:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Business Agent (DevBooks)&lt;/strong&gt;: &lt;code&gt;http://localhost:11000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Chat Client&lt;/strong&gt;: &lt;code&gt;http://localhost:3000&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Test Script
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1vrrucmm9027bigpf6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1vrrucmm9027bigpf6t.png" alt="Google Chrome 2026-01-31 21.19.00" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please open the Chat Client (&lt;code&gt;http://localhost:3000&lt;/code&gt;) in your browser and follow these steps:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Book Search
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;User Input:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I looking for some books about React to learn."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Behind the Scenes:&lt;/strong&gt; The Agent will call the &lt;code&gt;search_shopping_catalog&lt;/code&gt; tool and return a list of matching books (e.g., "Learning React", "React Design Patterns").&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected Result:&lt;/strong&gt; You will see book cards with cover images and prices.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Preview Content
&lt;/h4&gt;

&lt;p&gt;This is a DevBooks exclusive feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Input:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Can I see a preview of the first one?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Behind the Scenes:&lt;/strong&gt; The Agent identifies the user's intent and calls the &lt;code&gt;preview_book&lt;/code&gt; tool to get the preview chapter content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected Result:&lt;/strong&gt; The Agent returns the first chapter excerpt or a preview link of the book.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Add to Cart
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;User Action:&lt;/strong&gt; Click the &lt;strong&gt;"Add to Checkout"&lt;/strong&gt; button on the card, or enter "Add Learning React to my cart".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behind the Scenes:&lt;/strong&gt; Calls &lt;code&gt;add_to_checkout&lt;/code&gt;. At this point, the Agent creates a UCP Checkout Session (&lt;code&gt;ADK_USER_CHECKOUT_ID&lt;/code&gt;) in the background.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Checkout Info
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;User Input:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"My email is &lt;a href="mailto:dev@example.com"&gt;dev@example.com&lt;/a&gt;, ship to 456 Tech Blvd, San Francisco, CA 94107"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Behind the Scenes:&lt;/strong&gt; The Agent parses the address information and calls &lt;code&gt;update_customer_details&lt;/code&gt; to fill in the information in the UCP Checkout object.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Payment
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;User Action:&lt;/strong&gt; Click &lt;strong&gt;"Complete Payment"&lt;/strong&gt; -&amp;gt; Select a payment method (Mock Pay) -&amp;gt; &lt;strong&gt;"Confirm Purchase"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behind the Scenes:&lt;/strong&gt; Calls &lt;code&gt;complete_checkout&lt;/code&gt;. The Agent interacts with &lt;code&gt;MockPaymentProcessor&lt;/code&gt;, verifies the payment, and finally calls &lt;code&gt;store.place_order&lt;/code&gt; to complete the order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected Result:&lt;/strong&gt; Receive order confirmation message: "Order Confirmed! Order ID: ORDER-12345".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j9kpok9w26h5t3cg5tr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j9kpok9w26h5t3cg5tr.png" alt="Google Chrome 2026-01-31 21.19.48" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Development Experience
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Importance of State Management
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;agent.py&lt;/code&gt;, you can see a lot of &lt;code&gt;tool_context.state&lt;/code&gt; usage. Because the Agent's interaction is a multi-turn conversation, we must preserve &lt;code&gt;checkout_id&lt;/code&gt; between conversations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def add_to_checkout(tool_context: ToolContext, ...):
    # Read or create Checkout ID from Context
    checkout_id = tool_context.state.get(ADK_USER_CHECKOUT_ID)
    if not checkout_id:
        # ... Create new Checkout
        tool_context.state[ADK_USER_CHECKOUT_ID] = checkout.id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is similar to the concept of Session in traditional Web development, but in Agent development, this is maintained jointly by the LLM's Context Window and an external State Store.&lt;/p&gt;
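&lt;p&gt;To illustrate the get-or-create pattern outside the ADK, here is a toy stand-in where a plain per-session dict plays the role of &lt;code&gt;tool_context.state&lt;/code&gt; (this is not the real ADK API, just the pattern):&lt;/p&gt;

```python
import uuid

ADK_USER_CHECKOUT_ID = "user:checkout_id"

def get_or_create_checkout(state: dict) -> str:
    """Reuse the checkout stored in session state, or create one on first use."""
    checkout_id = state.get(ADK_USER_CHECKOUT_ID)
    if not checkout_id:
        checkout_id = f"CHECKOUT-{uuid.uuid4().hex[:8]}"
        state[ADK_USER_CHECKOUT_ID] = checkout_id
    return checkout_id

session_state: dict = {}
first = get_or_create_checkout(session_state)   # turn 1: creates the checkout
second = get_or_create_checkout(session_state)  # turn 2: reuses it
print(first == second)  # → True: the same checkout persists across turns
```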

&lt;h3&gt;
  
  
  2. The Power of UCP: Structured Data
&lt;/h3&gt;

&lt;p&gt;Pay attention to the &lt;code&gt;after_tool_modifier&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def after_tool_modifier(..., tool_response: Dict) -&amp;gt; Optional[Dict]:
    # ...
    # Inject structured UCP data into the response
    if UcpExtension.URI in extensions:
        tool_context.state[ADK_LATEST_TOOL_RESULT] = tool_response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows the Agent to return not just a piece of text ("Okay, added to cart"), but a complete, machine-readable JSON object. The Client side (front-end UI) receives this JSON and can render beautiful product cards or checkout buttons, instead of just displaying plain text. This is the essence of A2A: &lt;strong&gt;Not just chatting, but data exchange.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Through the implementation of &lt;code&gt;devbooks_agent&lt;/code&gt;, we see how UCP elevates AI Agents from simple chatbots to "digital clerks" capable of handling complex business logic.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Standardization&lt;/strong&gt;: UCP allows Agents written by different developers to communicate with each other.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Modularity&lt;/strong&gt;: Through ADK Tools, we can easily extend functionality (such as previews).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Interoperability&lt;/strong&gt;: The front-end UI can automatically generate interfaces based on standard protocols, without customizing the screen for each Agent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of AI Agents is definitely not a solo fight, but an interconnected world.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://ucp.dev" rel="noopener noreferrer"&gt;Universal Commerce Protocol (UCP) Specification&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://github.com/google/adk" rel="noopener noreferrer"&gt;Google Agent Development Kit (ADK)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="//../business_agent/"&gt;DevBooks Agent Source Code&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>api</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
