Diff - f1e517d64fe4a726552f5d240a1ecb3d418f16b6^! - sketch

commit	f1e517d64fe4a726552f5d240a1ecb3d418f16b6
author	Marc-Antoine Ruel <maruel@gmail.com>	Sun Jun 08 17:30:37 2025 +0000
committer	Philip Zeyliger <philip.zeyliger@gmail.com>	Sun Jun 08 13:08:59 2025 -0700
tree	ace5b25920201d79054a8b3f7cc17cf884fae8c5
parent	de19aca257ab21956f2fba828d9265ef218687da

claudetool/onstart: add comprehensive tests for non-ASCII filename handling

Add test cases validating that AnalyzeCodebase() correctly processes files
with Unicode characters in their names and categorizes them properly.

Problem Analysis:
The AnalyzeCodebase function uses git ls-files to enumerate repository files
and categorize them by type. The implementation should handle Unicode
filenames in principle, but no existing tests verified this for
international characters, emoji, combining characters, or right-to-left
scripts.

Implementation Changes:

1. Test Data Creation:
   - Created testdata directory with files containing non-ASCII characters
   - Included Chinese (测试文件.go), French (café.js), Russian (русский.py)
   - Added emoji (🚀rocket.md), German umlauts (Übung.html)
   - Included Japanese (Makefile-日本語), Spanish (readme-español.md)
   - Added Korean guidance file (subdir/claude.한국어.md) in subdirectory

2. Comprehensive Test Cases:
   - TestAnalyzeCodebase validates file counting and extension tracking
   - Verifies proper categorization of build, documentation, and guidance files
   - Tests git ls-files integration with Unicode filenames
   - Confirms extension counting works with non-ASCII characters
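
A minimal sketch of the kind of extension counting these tests exercise.
countExtensions is a hypothetical helper, not the code under test, and
lowercasing the extension is an assumption. The point is that
filepath.Ext looks for the last '.' byte, and no byte of a multi-byte
UTF-8 sequence can equal '.', so non-ASCII names need no special handling:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// countExtensions is a hypothetical helper (not the function from this
// commit). It tallies lowercased extensions for a list of repository paths.
func countExtensions(paths []string) map[string]int {
	counts := make(map[string]int)
	for _, p := range paths {
		ext := strings.ToLower(filepath.Ext(p))
		if ext == "" {
			continue // names such as "Makefile-日本語" have no extension
		}
		counts[ext]++
	}
	return counts
}

func main() {
	paths := []string{
		"测试文件.go", "café.js", "русский.py",
		"🚀rocket.md", "readme-español.md", "Makefile-日本語",
	}
	fmt.Println(countExtensions(paths))
}
```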

3. Edge Case Testing:
   - Added combining characters test (file\u0301\u0302.go)
   - Arabic right-to-left script test (\u0645\u0631\u062d\u0628\u0627.py)
   - Mixed Unicode with emoji test (test\u4e2d\u6587\ud83d\ude80.txt)
   - Validates categorizeFile function handles Unicode paths correctly
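
The edge-case names above can be written in Go source with escapes, as
sketched below. describe is a hypothetical helper used only for
illustration. Go string literals reject surrogate halves, so a code
point outside the Basic Multilingual Plane such as the rocket emoji is
written as \U0001F680:

```go
package main

import (
	"fmt"
	"path/filepath"
	"unicode/utf8"
)

// describe is a hypothetical helper: it reports the extension, the rune
// count, and UTF-8 validity of a filename.
func describe(name string) (ext string, runes int, valid bool) {
	return filepath.Ext(name), utf8.RuneCountInString(name), utf8.ValidString(name)
}

func main() {
	names := []string{
		"file\u0301\u0302.go",               // "file" plus two combining marks
		"\u0645\u0631\u062d\u0628\u0627.py", // Arabic, right-to-left script
		"test\u4e2d\u6587\U0001F680.txt",    // CJK plus an emoji outside the BMP
	}
	for _, n := range names {
		ext, runes, valid := describe(n)
		fmt.Printf("%q ext=%s bytes=%d runes=%d valid=%v\n", n, ext, len(n), runes, valid)
	}
}
```

Byte length and rune count differ for every one of these names, which is
why code that slices filenames by byte offset deserves this kind of test.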

4. File Categorization Validation:
   - Japanese Makefile correctly identified as build file
   - Spanish README properly categorized as documentation
   - Korean Claude file in subdirectory marked as guidance file
   - Extension counting accurate across all Unicode filenames
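
For illustration only, the categorization outcomes listed above are
consistent with prefix rules like the following. categorize is a
hypothetical stand-in for categorizeFile; the actual rules are not shown
in this commit, so the prefixes and category names here are assumptions:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// categorize is a hypothetical stand-in for categorizeFile. The prefix
// rules are assumptions chosen to match the outcomes described above.
func categorize(p string) string {
	// git reports paths with forward slashes, so package path is enough.
	base := strings.ToLower(path.Base(p))
	switch {
	case strings.HasPrefix(base, "makefile"):
		return "build"
	case strings.HasPrefix(base, "claude.") && strings.HasSuffix(base, ".md"):
		return "guidance"
	case strings.HasPrefix(base, "readme"):
		return "documentation"
	default:
		return "other"
	}
}

func main() {
	for _, p := range []string{
		"Makefile-日本語",
		"readme-español.md",
		"subdir/claude.한국어.md",
		"русский.py",
	} {
		fmt.Printf("%s -> %s\n", p, categorize(p))
	}
}
```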

Technical Details:
- Uses git ls-files -z, whose NUL-separated output emits Unicode paths verbatim
- Test files cover several scripts: CJK, Latin-1 Supplement, Cyrillic, Arabic
- Exercises combining characters and emoji
- Validates both filename parsing and categorization logic paths
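
A rough sketch of parsing NUL-separated output, assuming the bytes come
from something like exec.Command("git", "ls-files", "-z").Output().
splitNUL is a hypothetical name, not the function in this package.
Without -z, git's default core.quotePath setting would print Übung.html
as "\303\234bung.html", the quoted form visible in the diff header
below:

```go
package main

import (
	"bytes"
	"fmt"
)

// splitNUL is a hypothetical helper that splits NUL-terminated path
// output (for example from `git ls-files -z`) into individual paths.
func splitNUL(out []byte) []string {
	var paths []string
	for _, field := range bytes.Split(out, []byte{0}) {
		if len(field) == 0 {
			continue // the trailing NUL produces one empty field
		}
		paths = append(paths, string(field))
	}
	return paths
}

func main() {
	// Simulated command output so the example runs without a repository.
	out := []byte("café.js\x00Übung.html\x00subdir/claude.한국어.md\x00")
	for _, p := range splitNUL(out) {
		fmt.Println(p)
	}
}
```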

Benefits:
- Ensures international users can use non-ASCII filenames
- Validates Unicode safety in codebase analysis pipeline
- Prevents regressions in Unicode filename handling
- Comprehensive coverage of real-world filename scenarios

Testing:
- All tests pass with current implementation
- Verified git ls-files correctly enumerates Unicode filenames
- Confirmed extension extraction works with international characters
- Validated categorization logic handles Unicode paths properly

This test suite checks that AnalyzeCodebase handles international
codebases with diverse filename conventions and scripts.

Co-Authored-By: sketch <hello@sketch.dev>
Change-ID: s2431e70f6f23ec83k

diff --git "a/claudetool/onstart/testdata/\303\234bung.html" "b/claudetool/onstart/testdata/\303\234bung.html"
new file mode 100644
index 0000000..afc66ab
--- /dev/null
+++ "b/claudetool/onstart/testdata/\303\234bung.html"

@@ -0,0 +1,10 @@
+<!DOCTYPE html>
+<html>
+<head>
+    <title>German Umlaut Test</title>
+</head>
+<body>
+    <h1>Übung HTML File</h1>
+    <p>This HTML file has German umlaut characters in the filename.</p>
+</body>
+</html>