fix(csv): strip leading byte-order mark in CsvParseStream #7183
parse() already strips a leading BOM from its input string, but CsvParseStream left it intact. When a UTF-8 CSV file starts with a BOM (common output of tools like Excel), the first field name would arrive as "\ufeffname" instead of "name", corrupting headers and key lookups. StreamLineReader now strips the BOM from the first line it reads, matching the existing parse() behaviour exactly.
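To make the failure mode concrete, here is a minimal sketch of how a leading BOM ends up inside the first header key. It does not use the real parser: `toRecords`, `stripBom`, and `firstValue` are hypothetical helpers written for this illustration, and the naive comma split stands in for actual CSV parsing.

```typescript
// Illustration only: these helpers are hypothetical, not part of @std/csv.
const BYTE_ORDER_MARK = "\ufeff";

// Build one record per data row, keyed by the header row (naive comma split).
function toRecords(csv: string): Record<string, string>[] {
  const lines = csv.split("\n").filter((line) => line.length > 0);
  const headers = (lines[0] ?? "").split(",");
  return lines.slice(1).map((row) => {
    const cells = row.split(",");
    const record: Record<string, string> = {};
    headers.forEach((header, i) => {
      record[header] = cells[i] ?? "";
    });
    return record;
  });
}

// Remove a single leading BOM, if present.
function stripBom(text: string): string {
  return text.charAt(0) === BYTE_ORDER_MARK ? text.slice(1) : text;
}

// Look up a key in the first record, if there is one.
function firstValue(
  records: Record<string, string>[],
  key: string,
): string | undefined {
  const first = records[0];
  return first === undefined ? undefined : first[key];
}

const input = BYTE_ORDER_MARK + "name,age\nAlice,30\n";
const corrupted = toRecords(input); // first key is "\ufeffname"
const fixed = toRecords(stripBom(input)); // first key is "name"

console.log(firstValue(corrupted, "name")); // undefined
console.log(firstValue(fixed, "name")); // "Alice"
```

The lookup fails without any error being raised, which is why the bug is easy to miss: the key prints identically in most terminals and editors.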
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #7183 +/- ##
==========================================
- Coverage 94.57% 94.57% -0.01%
==========================================
Files 636 637 +1
Lines 52142 52159 +17
Branches 9401 9403 +2
==========================================
+ Hits 49315 49328 +13
- Misses 2249 2254 +5
+ Partials 578 577 -1
Add four greymoth PRs and one cited upstream PR, all verified open via the GitHub API:
- BasedHardware/omi#8601: onboarding answer gate counts a spaceless CJK answer as one word (segmentation)
- emdash-cms/emdash#1661: editor footer word count / reading time splits on spaces, so a CJK paragraph reads as 1 word (segmentation)
- validatorjs/validator.js#2789: isAlphanumeric el-GR range omits accented Greek that isAlpha accepts (unicode-range)
- date-fns/date-fns#4231: Galician formats June as the wide form but cannot parse it back; the pattern stops before the tilde (locale-data)
- denoland/std#7183: CsvParseStream leaves a leading BOM on the first header key while sync parse() strips it (encoding; cited, not greymoth-authored)

Three new categories: segmentation, unicode-range, encoding. Status re-sync against the API: zag color-picker channel IME guard merged.
bartlomieju left a comment
Nice catch — this is a real correctness bug, and putting the strip in StreamLineReader.readLine() is the right place since line-splitting happens upstream, so the BOM is always the first char of the first line. Tests pass locally and the scope (first line only) matches parse().
One consistency nit inline: every other spot in csv/ defines this as "\ufeff" rather than the raw glyph. Worth aligning so the source stays free of invisible characters.
columns?: readonly string[];
}

const BYTE_ORDER_MARK = "";
Prefer the escape sequence "\ufeff" over the literal BOM glyph here. Every other definition in csv/ uses it — parse.ts:16, stringify.ts:128, parse_test.ts:11, stringify_test.ts:16 — and a raw, invisible U+FEFF in source is unreviewable and copy-paste-hostile.
- const BYTE_ORDER_MARK = "";
+ const BYTE_ORDER_MARK = "\ufeff";
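As a small sketch of why the escape is the safer spelling: the escaped form denotes exactly the same one-character string as the raw glyph, but it stays visible in source and in diffs. The `showInvisible` helper below is hypothetical and exists only to make invisible characters printable for debugging.

```typescript
// The escape sequence is a single UTF-16 code unit, U+FEFF.
const escapedBom = "\ufeff";
const fromCharCode = String.fromCharCode(0xfeff);

const sameString = escapedBom === fromCharCode; // true
const bomLength = escapedBom.length; // 1

// Hypothetical debugging helper: render control and non-ASCII characters
// as \uXXXX escapes so they show up in logs and test failure messages.
function showInvisible(text: string): string {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    out += code < 0x20 || code > 0x7e
      ? "\\u" + ("0000" + code.toString(16)).slice(-4)
      : text.charAt(i);
  }
  return out;
}

console.log(showInvisible("\ufeffname")); // \ufeffname
console.log(showInvisible("name")); // name
```

A helper like this can also be useful in test assertions, since a failing comparison between "\ufeffname" and "name" otherwise prints two strings that look the same.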
name: "CsvParseStream strips leading byte-order mark from first field",
permissions: "none",
fn: async () => {
  const BOM = "";
Same here — use "\ufeff" instead of the literal glyph, and consider naming it BYTE_ORDER_MARK to match the convention used in parse_test.ts / stringify_test.ts.
- const BOM = "";
+ const BOM = "\ufeff";
name: "CsvParseStream strips leading byte-order mark when using skipFirstRow",
permissions: "none",
fn: async () => {
  const BOM = "";
Same "\ufeff" suggestion as above.
- const BOM = "";
+ const BOM = "\ufeff";
The synchronous parse() function already strips a leading UTF-8 byte-order mark (U+FEFF) from its input, but CsvParseStream did not.

When a CSV file begins with a BOM (common output from Excel and other Windows tools), the first field name arrives as "\ufeffname" instead of "name". That silently corrupts header-based lookups.

The fix adds a #firstLine flag to StreamLineReader and strips the BOM from the first line it reads, exactly matching what parse() does via its BYTE_ORDER_MARK constant.

Two new tests cover the regression: one for plain string[][] output and one for skipFirstRow: true (object output, where the BOM corrupts the header key).
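The first-line-only stripping described above can be sketched roughly as follows. This is not the actual StreamLineReader: `ArrayLineReader` and `readAll` are hypothetical stand-ins that read from an in-memory array rather than a stream, and a TypeScript `private` field plays the role of the `#firstLine` flag.

```typescript
// Sketch under stated assumptions; the real reader consumes a stream.
const BOM = "\ufeff";

class ArrayLineReader {
  private lines: string[];
  private index = 0;
  // True until the first line has been handed out, like the #firstLine flag.
  private firstLine = true;

  constructor(lines: string[]) {
    this.lines = lines;
  }

  // Return the next line, or null once the input is exhausted.
  readLine(): string | null {
    if (this.index >= this.lines.length) return null;
    let line = this.lines[this.index] ?? "";
    this.index += 1;
    if (this.firstLine) {
      this.firstLine = false;
      // Strip at most one BOM, and only from the very first line.
      if (line.charAt(0) === BOM) line = line.slice(1);
    }
    return line;
  }
}

// Drain a reader into an array for inspection.
function readAll(lines: string[]): string[] {
  const reader = new ArrayLineReader(lines);
  const out: string[] = [];
  for (let line = reader.readLine(); line !== null; line = reader.readLine()) {
    out.push(line);
  }
  return out;
}

const stripped = readAll([BOM + "name,age", "Alice,30"]);
console.log(stripped[0] === "name,age"); // true
```

Keeping the check behind a one-shot flag means a U+FEFF at the start of any later line is left alone, which matches the "first line only" scope the review calls out.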