commit	542bda3968c6dd5b79392dd63e2955e04520401a
author	Philip Zeyliger <philip@bold.dev>	Wed Jun 11 18:31:03 2025 -0700
committer	Autoformatter <bot@sketch.dev>	Thu Jun 12 01:31:34 2025 +0000
tree	ea1a0743849495ca2489c6363d2dc689dd0a56a7
parent	225e9668aeebc0cae667872dd45222d69ac3cbd8

browser: rename browser_read_image to read_image and auto-send screenshots to LLM

Rename browser_read_image tool to read_image and modify browser_take_screenshot
to automatically send image content to the LLM instead of requiring a separate
read_image tool call, streamlining the screenshot workflow.

Problem Analysis:
The previous browser screenshot workflow required two separate tool calls:
1. browser_take_screenshot - saves screenshot and returns file path
2. browser_read_image - reads saved screenshot and sends to LLM

This two-step process was inefficient and added an extra round trip for
every screenshot. Additionally, the browser_read_image name tied the tool to
browser automation, although reading and encoding images is general purpose.

Implementation Changes:

1. Screenshot Tool Behavior (claudetool/browse/browse.go):
   - Modified browser_take_screenshot to automatically return image content
   - Removed screenshotOutput struct, as the ID-only response is no longer needed
   - Added base64 encoding of screenshot data directly in screenshotRun
   - Returns []llm.Content with both text description and image data
   - Still saves screenshot file for potential future reference
   - Uses same image encoding format as existing read_image tool

2. Tool Rename (claudetool/browse/browse.go):
   - Renamed browser_read_image tool to read_image
   - Updated tool name in NewReadImageTool from 'browser_read_image' to 'read_image'
   - Maintained all existing functionality and input/output format
   - Tool description and schema remain unchanged

3. UI Updates (termui/termui.go):
   - Updated template condition from 'browser_read_image' to 'read_image'
   - Maintains existing emoji and display format for read_image tool calls

4. WebUI Updates (webui/src/web-components/):
   - Updated sketch-tool-calls.ts to reference 'read_image' instead of 'browser_read_image'
   - Renamed sketch-tool-card-browser-read-image.ts to sketch-tool-card-read-image.ts
   - Updated component class name from SketchToolCardBrowserReadImage to SketchToolCardReadImage
   - Updated custom element name from 'sketch-tool-card-browser-read-image' to 'sketch-tool-card-read-image'
   - Updated import statement to reference new component file name
   - Removed old component file and updated TypeScript declarations

5. Test Updates (claudetool/browse/browse_test.go):
   - Modified TestGetTools to allow read_image tool without 'browser_' prefix
   - Added special case handling for read_image in tool naming convention check
   - All existing tests continue to pass with updated tool name
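
The naming-convention exception described in item 5 can be sketched as
follows. This is a minimal illustration, not the code in browse_test.go:
followsNamingConvention is a hypothetical helper, and the real test may
structure the check differently.

```go
package main

import (
	"fmt"
	"strings"
)

// followsNamingConvention is a hypothetical helper (not from the repo).
// It accepts names with the "browser_" prefix and, as a special case,
// the general-purpose "read_image" tool.
func followsNamingConvention(name string) bool {
	return name == "read_image" || strings.HasPrefix(name, "browser_")
}

func main() {
	for _, name := range []string{"browser_take_screenshot", "read_image", "take_screenshot"} {
		fmt.Printf("%s: %v\n", name, followsNamingConvention(name))
	}
}
```

Keeping the exception as an explicit name comparison, rather than loosening
the prefix rule, means any other unprefixed tool would still fail the check.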

Technical Details:
- Screenshot auto-send uses same base64 encoding as existing read_image tool
- Content structure matches browser_read_image output format for consistency
- File saving still occurs for potential debugging or future reference
- Error handling preserves existing behavior with proper fallbacks
- Tool count remains the same (12 tools with screenshots, 10 without)
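
A rough sketch of the two-part result described above, assuming a simplified
stand-in for llm.Content. The Content struct, its Kind field, and
screenshotContent are hypothetical names used for illustration only; the
actual types and constants are those in the sketch repository.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// Content is a hypothetical stand-in for llm.Content, reduced to the
// fields needed for this illustration.
type Content struct {
	Kind      string // "text" or "image"; placeholder labels, not repo constants
	Text      string
	MediaType string
	Data      string // base64-encoded payload
}

// screenshotContent returns a text description followed by the PNG bytes
// encoded with standard base64, so both reach the model in one tool result.
func screenshotContent(path string, png []byte) []Content {
	return []Content{
		{Kind: "text", Text: fmt.Sprintf("Screenshot taken (saved as %s)", path)},
		{Kind: "image", MediaType: "image/png", Data: base64.StdEncoding.EncodeToString(png)},
	}
}

func main() {
	// The 8-byte PNG file signature stands in for real screenshot data.
	png := []byte{0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'}
	for _, c := range screenshotContent("/tmp/example.png", png) {
		fmt.Printf("%s %q %s %s\n", c.Kind, c.Text, c.MediaType, c.Data)
	}
}
```

The saved path stays in the text part so the file can still be located later,
while the image part removes the need for a follow-up read_image call.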

Benefits:
- Eliminates need for two-step screenshot workflow
- Reduces round trips and simplifies user experience
- More intuitive tool naming (read_image is general purpose)
- Preserves read_image behavior and input/output format (only the tool name changes)
- Consistent image encoding across all browser tools
- Automatic screenshot viewing improves debugging and validation workflows

Testing:
- All existing browser tool tests pass with updated expectations
- TestReadImageTool verifies renamed tool functionality
- Tool naming convention test updated to handle read_image exception
- TypeScript compilation successful with no type errors
- Web component functionality preserved across rename

This enhancement streamlines screenshot workflows while maintaining the
general-purpose read_image tool for reading arbitrary image files, creating
a more efficient and intuitive browser automation experience.

Co-Authored-By: sketch <hello@sketch.dev>
Change-ID: se3e81f997f30f01ek

diff --git a/claudetool/browse/browse.go b/claudetool/browse/browse.go
index dfb963e..5cd28cd 100644
--- a/claudetool/browse/browse.go
+++ b/claudetool/browse/browse.go

@@ -541,10 +541,6 @@
 	Timeout  string `json:"timeout,omitempty"`
 }
 
-type screenshotOutput struct {
-	ID string `json:"id"`
-}
-
 // NewScreenshotTool creates a tool for taking screenshots
 func (b *BrowseTools) NewScreenshotTool() *llm.Tool {
 	return &llm.Tool{
@@ -606,7 +602,7 @@
 		return llm.TextContent(errorResponse(err)), nil
 	}
 
-	// Save the screenshot and get its ID
+	// Save the screenshot and get its ID for potential future reference
 	id := b.SaveScreenshot(buf)
 	if id == "" {
 		return llm.TextContent(errorResponse(fmt.Errorf("failed to save screenshot"))), nil
@@ -615,14 +611,21 @@
 	// Get the full path to the screenshot
 	screenshotPath := GetScreenshotPath(id)
 
-	// Return the ID and instructions on how to view the screenshot
-	result := fmt.Sprintf(`{
-  "id": "%s",
-  "path": "%s",
-  "message": "Screenshot saved. To view this screenshot in the conversation, use the read_image tool with the path provided."
-}`, id, screenshotPath)
+	// Encode the image as base64
+	base64Data := base64.StdEncoding.EncodeToString(buf)
 
-	return llm.TextContent(result), nil
+	// Return the screenshot directly to the LLM
+	return []llm.Content{
+		{
+			Type: llm.ContentTypeText,
+			Text: fmt.Sprintf("Screenshot taken (saved as %s)", screenshotPath),
+		},
+		{
+			Type:      llm.ContentTypeText, // Will be mapped to image in content array
+			MediaType: "image/png",
+			Data:      base64Data,
+		},
+	}, nil
 }
 
 // ScrollIntoViewTool definition
@@ -817,7 +820,7 @@
 // NewReadImageTool creates a tool for reading images and returning them as base64 encoded data
 func (b *BrowseTools) NewReadImageTool() *llm.Tool {
 	return &llm.Tool{
-		Name:        "browser_read_image",
+		Name:        "read_image",
 		Description: "Read an image file (such as a screenshot) and encode it for sending to the LLM",
 		InputSchema: json.RawMessage(`{
 			"type": "object",