Skip to content

[SPARK-50110][SQL] Fix CSV parsing when numeric values have surrounding whitespace#56787

Open
aviyehuda wants to merge 2 commits into
apache:masterfrom
aviyehuda:SPARK-50110
Open

[SPARK-50110][SQL] Fix CSV parsing when numeric values have surrounding whitespace#56787
aviyehuda wants to merge 2 commits into
apache:masterfrom
aviyehuda:SPARK-50110

Conversation

@aviyehuda

Copy link
Copy Markdown

What changes were proposed in this pull request?

Add retryWithTrim in UnivocityParser for integral, boolean, and decimal value converters. When the first parse attempt fails with NumberFormatException or IllegalArgumentException, the parser retries after trimming leading/trailing whitespace.

This aligns behavior with float/double parsing (which already accepts surrounding whitespace via Java's parsers) for both from_csv and spark.read.csv, since they share UnivocityParser.

Reproduction example:

-- before: b is null
SELECT from_csv('1, 1', 'a INT, b INT');
-- {"a":1,"b":null}

-- after: b is parsed correctly
SELECT from_csv('1, 1', 'a INT, b INT');
-- {"a":1,"b":1}


### Why are the changes needed?
CSV fields often include a space after the delimiter (e.g. "1, 1"). With default ignoreLeadingWhiteSpace=false, integral columns fail to parse while float/double columns succeed, producing inconsistent and surprising results:

SELECT from_csv('1, 1', 'a INT, b INT');    -- {"a":1,"b":null}
SELECT from_csv('1, 1', 'a INT, b DOUBLE'); -- {"a":1,"b":1.0}
This is because Scala's toInt/toLong/etc. reject leading/trailing whitespace, while toDouble/toFloat delegate to Java parsers that accept it.

Related JIRA: https://issues.apache.org/jira/browse/SPARK-50110

### Does this PR introduce any user-facing change?
Yes. Numeric CSV values with leading or trailing whitespace are now parsed correctly for integral, boolean, and decimal types when reading CSV files and using from_csv, without requiring ignoreLeadingWhiteSpace=true.

Example:

SELECT from_csv('1, 1', 'a INT, b INT');
Previously returned {"a":1,"b":null}; now returns {"a":1,"b":1}.

### How was this patch tested?
Added unit tests:

CsvFunctionsSuite: SPARK-50110: parse numeric CSV values with surrounding whitespace (covers from_csv for integral types, decimal, and boolean)
CSVSuite: SPARK-50110: read CSV numeric values with surrounding whitespace (covers spark.read.csv)
Updated existing corrupt-input tests (SPARK-25387, SPARK-31261, SPARK-32968) so malformed control-character inputs remain invalid after the trim-retry path.

Locally:

build/sbt 'sql/testOnly *CsvFunctionsSuite -- -z "SPARK-50110"'
build/sbt 'sql/testOnly *CSVv1Suite -- -z "SPARK-50110"'

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor

---
To open the PR from the CLI after pushing to your fork:
```bash
gh pr create --repo apache/spark \
  --head aviyehuda:SPARK-50110 \
  --base master \
  --title "[SPARK-50110][SQL] Fix CSV parsing when numeric values have surrounding whitespace" \
  --body "$(cat <<'EOF'
<paste the body above>
EOF
)"
Adjust the Generated-by line if you prefer to answer No after your own review.

…ng whitespace

Retry integral, boolean, and decimal parsing after trim when the first parse
attempt fails, aligning behavior with float/double and fixing from_csv/read.csv
for common inputs like "1, 1".

@HyukjinKwon HyukjinKwon left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 blocking, 3 non-blocking, 1 nit.
Behaviorally correct for the default locale and a clean alignment with float/double parsing; the substantive follow-ups are the non-US-locale decimal gap and the exception-on-hot-path approach.

Design / architecture (1)

  • UnivocityParser.scala:182: whitespace tolerance is a per-value exception-retry; the motivating "1, 1" throws+catches a NumberFormatException per integral/boolean/US-decimal cell — an up-front value.trim is behaviorally identical without it — see inline

Correctness (1)

  • UnivocityParser.scala:232: non-US-locale decimals aren't fixed — cannotParseDecimalError (a SparkRuntimeException) isn't caught by retryWithTrim — see inline

Suggestions (1)

  • Tests cover only leading whitespace under the default US locale; add a trailing-whitespace case ("1 ,1") and a non-US-locale decimal case — the latter would have caught the Correctness finding above.

Nits: 1 minor item (see inline comments).

…erics

Fix code review feedback on the whitespace parsing change:
- Replace retryWithTrim with up-front trim for integral, boolean, and decimal
  converters to avoid per-cell exceptions on the common "1, 1" path
- Trim before decimalParser so non-US locales (e.g. de-DE) also handle
  surrounding whitespace, since DecimalFormat throws SparkRuntimeException
  rather than NumberFormatException

Add UnivocityParserSuite coverage for spaced decimals under multiple locales.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants