Skip to content

[SPARK-57562][SQL] Add benchmarks for the TIME data type#56809

Open
shrirangmhalgi wants to merge 3 commits into
apache:masterfrom
shrirangmhalgi:SPARK-57562-time-benchmarks
Open

[SPARK-57562][SQL] Add benchmarks for the TIME data type#56809
shrirangmhalgi wants to merge 3 commits into
apache:masterfrom
shrirangmhalgi:SPARK-57562-time-benchmarks

Conversation

@shrirangmhalgi

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Add TimeBenchmark covering TIME data type functions: current_time, make_time, to_time, hour/minute/second extraction, time_trunc, time_diff, TIME +/- interval, and LocalTime conversion/collection.

Why are the changes needed?

The SPIP (SPARK-51162, Q6) flagged performance regressions as the main risk. There was no TIME-specific benchmark until now.

Does this PR introduce any user-facing change?

No. Benchmark only.

How was this patch tested?

Benchmark runs successfully and results file are generated.

Was this patch authored or co-authored using generative AI tooling?

Co-Authored using Claude Opus 4.6

@shrirangmhalgi shrirangmhalgi changed the title Add benchmarks for the TIME data type [SPARK-57562][SQL] Add benchmarks for the TIME data type Jun 26, 2026
@shrirangmhalgi shrirangmhalgi marked this pull request as ready for review June 27, 2026 04:47

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 blocking, 2 non-blocking, 0 nits.
Clean, well-modeled benchmark mirroring DateTimeBenchmark. One blocking process issue: the committed results must come from the standard CI hardware, not a local Mac.

Design / architecture (1)

  • *-results.txt: results were regenerated on Apple M3 Pro / macOS instead of the standard CI hardware (Linux / Intel Xeon Platinum 8370C, used by the existing committed results). This rewrites every number in the three modified files, not just the new TIME cases (ExtractBenchmark +104/-86, CSVBenchmark +56/-51, JsonBenchmark +82/-77), so the numbers are no longer comparable against the rest of the repo and the unrelated churn obscures the PR's actual contribution. Please regenerate all four results files via the "Benchmarks" GitHub Actions workflow and drop the local Mac-generated output.

Correctness (1)

  • CSVBenchmark.scala:197 (same pattern at JsonBenchmark.scala:414): times generator (t % 86400000000L) * 1000000L — see inline.

Suggestions (1)

  • ExtractBenchmark.scala:72: the case other message "Valid column types are 'timestamp' and 'date'" is now stale — it omits the newly-added 'time' (and the pre-existing 'interval'). Only reachable on a programming error, so optional, but worth updating while you're here.


def times = {
spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
iter.map(t => LocalTime.ofNanoOfDay((t % 86400000000L) * 1000000L))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The (t % 86400000000L) * 1000000L formula is inconsistent and latently unsafe:

  • The modulus is micros/day (86_400_000_000) but the scale factor is *1e6, so the two don't compose into nanos-of-day.
  • Since t < rowsNum (1e7) is far below the modulus, the % 86400000000L is a dead no-op, and the generated times only span the first ~2.8h of the day rather than the full day.
  • It would overflow LocalTime.ofNanoOfDay (max 86399999999999) if rowsNum is ever raised above ~8.64e7.

It doesn't fail at the current rowsNum, but consider aligning with TimeBenchmark's full-day wrap (nanos % 86400000000000L):

Suggested change
iter.map(t => LocalTime.ofNanoOfDay((t % 86400000000L) * 1000000L))
iter.map(t => LocalTime.ofNanoOfDay(t % 86400000000000L))


def times = {
spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
iter.map(t => LocalTime.ofNanoOfDay((t % 86400000000L) * 1000000L))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same generator issue as CSVBenchmark.scala:197: (t % 86400000000L) * 1000000L has an inconsistent modulus/scale (micros/day mod, *1e6 scale), the % 86400000000L is a dead no-op at this rowsNum, the times don't span a full day, and it would overflow ofNanoOfDay if rowsNum exceeded ~8.64e7. Suggest the same full-day wrap as TimeBenchmark:

Suggested change
iter.map(t => LocalTime.ofNanoOfDay((t % 86400000000L) * 1000000L))
iter.map(t => LocalTime.ofNanoOfDay(t % 86400000000000L))

### What changes were proposed in this pull request?

Add TimeBenchmark covering TIME data type functions: current_time, make_time, to_time, hour/minute/second extraction, time_trunc, time_diff, TIME +/- interval, and LocalTime conversion/collection.

### Why are the changes needed?

The SPIP (SPARK-51162, Q6) flagged performance regressions as the main risk. There was no TIME-specific benchmark until now.

### Does this PR introduce _any_ user-facing change?

No. Benchmark only.

### How was this patch tested?

Benchmark runs successfully and results file is generated.

### Was this patch authored or co-authored using generative AI tooling?

Co-Authored using Claude Opus 4.6
…e error message

- Fix LocalTime generator: use t % 86400000000000L (nanos/day) instead of
  inconsistent (t % 86400000000L) * 1000000L in CSV and JSON benchmarks
- Revert ExtractBenchmark/CSV/JSON/Hash results to master (drop Mac output;
  will regenerate via GitHub Actions Benchmarks workflow)
- Update ExtractBenchmark error message to list all valid types
@shrirangmhalgi shrirangmhalgi force-pushed the SPARK-57562-time-benchmarks branch from c620316 to ad5c3c5 Compare June 27, 2026 08:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants