[SPARK-50110][SQL] Fix CSV parsing when numeric values have surrounding whitespace by aviyehuda · Pull Request #56787 · apache/spark

aviyehuda · 2026-06-25T22:16:13Z

What changes were proposed in this pull request?

Add retryWithTrim in UnivocityParser for integral, boolean, and decimal value converters. When the first parse attempt fails with NumberFormatException or IllegalArgumentException, the parser retries after trimming leading/trailing whitespace.

This aligns behavior with float/double parsing (which already accepts surrounding whitespace via Java's parsers) for both from_csv and spark.read.csv, since they share UnivocityParser.

Reproduction example:

-- before: b is null
SELECT from_csv('1, 1', 'a INT, b INT');
-- {"a":1,"b":null}

-- after: b is parsed correctly
SELECT from_csv('1, 1', 'a INT, b INT');
-- {"a":1,"b":1}


### Why are the changes needed?
CSV fields often include a space after the delimiter (e.g. "1, 1"). With default ignoreLeadingWhiteSpace=false, integral columns fail to parse while float/double columns succeed, producing inconsistent and surprising results:

SELECT from_csv('1, 1', 'a INT, b INT');    -- {"a":1,"b":null}
SELECT from_csv('1, 1', 'a INT, b DOUBLE'); -- {"a":1,"b":1.0}
This is because Scala's toInt/toLong/etc. reject leading/trailing whitespace, while toDouble/toFloat delegate to Java parsers that accept it.

Related JIRA: https://issues.apache.org/jira/browse/SPARK-50110

### Does this PR introduce any user-facing change?
Yes. Numeric CSV values with leading or trailing whitespace are now parsed correctly for integral, boolean, and decimal types when reading CSV files and using from_csv, without requiring ignoreLeadingWhiteSpace=true.

Example:

SELECT from_csv('1, 1', 'a INT, b INT');
Previously returned {"a":1,"b":null}; now returns {"a":1,"b":1}.

### How was this patch tested?
Added unit tests:

CsvFunctionsSuite: SPARK-50110: parse numeric CSV values with surrounding whitespace (covers from_csv for integral types, decimal, and boolean)
CSVSuite: SPARK-50110: read CSV numeric values with surrounding whitespace (covers spark.read.csv)
Updated existing corrupt-input tests (SPARK-25387, SPARK-31261, SPARK-32968) so malformed control-character inputs remain invalid after the trim-retry path.

Locally:

build/sbt 'sql/testOnly *CsvFunctionsSuite -- -z "SPARK-50110"'
build/sbt 'sql/testOnly *CSVv1Suite -- -z "SPARK-50110"'

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor

---
To open the PR from the CLI after pushing to your fork:
```bash
gh pr create --repo apache/spark \
  --head aviyehuda:SPARK-50110 \
  --base master \
  --title "[SPARK-50110][SQL] Fix CSV parsing when numeric values have surrounding whitespace" \
  --body "$(cat <<'EOF'
<paste the body above>
EOF
)"
Adjust the Generated-by line if you prefer to answer No after your own review.

…ng whitespace Retry integral, boolean, and decimal parsing after trim when the first parse attempt fails, aligning behavior with float/double and fixing from_csv/read.csv for common inputs like "1, 1".

HyukjinKwon

0 blocking, 3 non-blocking, 1 nit.
Behaviorally correct for the default locale and a clean alignment with float/double parsing; the substantive follow-ups are the non-US-locale decimal gap and the exception-on-hot-path approach.

Design / architecture (1)

UnivocityParser.scala:182: whitespace tolerance is a per-value exception-retry; the motivating "1, 1" throws+catches a NumberFormatException per integral/boolean/US-decimal cell — an up-front value.trim is behaviorally identical without it — see inline

Correctness (1)

UnivocityParser.scala:232: non-US-locale decimals aren't fixed — cannotParseDecimalError (a SparkRuntimeException) isn't caught by retryWithTrim — see inline

Suggestions (1)

Tests cover only leading whitespace under the default US locale; add a trailing-whitespace case ("1 ,1") and a non-US-locale decimal case — the latter would have caught the Correctness finding above.

Nits: 1 minor item (see inline comments).

…erics Fix code review feedback on the whitespace parsing change: - Replace retryWithTrim with up-front trim for integral, boolean, and decimal converters to avoid per-cell exceptions on the common "1, 1" path - Trim before decimalParser so non-US locales (e.g. de-DE) also handle surrounding whitespace, since DecimalFormat throws SparkRuntimeException rather than NumberFormatException Add UnivocityParserSuite coverage for spaced decimals under multiple locales.

[SPARK-50110][SQL] Fix CSV parsing when numeric values have surroundi…

86cb062

…ng whitespace Retry integral, boolean, and decimal parsing after trim when the first parse attempt fails, aligning behavior with float/double and fixing from_csv/read.csv for common inputs like "1, 1".

HyukjinKwon reviewed Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-50110][SQL] Fix CSV parsing when numeric values have surrounding whitespace#56787

[SPARK-50110][SQL] Fix CSV parsing when numeric values have surrounding whitespace#56787
aviyehuda wants to merge 2 commits into
apache:masterfrom
aviyehuda:SPARK-50110

aviyehuda commented Jun 25, 2026

Uh oh!

HyukjinKwon left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

aviyehuda commented Jun 25, 2026

What changes were proposed in this pull request?

Uh oh!

HyukjinKwon left a comment

Choose a reason for hiding this comment

Design / architecture (1)

Correctness (1)

Suggestions (1)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants