Parquet: Skip parquet conversion for blocks with too many labels #7524
siddarth2810 wants to merge 10 commits into
friedrichg left a comment:
Just one minor nit on the metrics that Copilot suggested. Pre-approving!
// We don't convert blocks again if they already have a valid converter mark.
if cortex_parquet.ValidConverterMarkVersion(marker.Version) {
	level.Debug(logger).Log("msg", "skipping block, no-convert marker already exists", "block", b.ULID.String())
Is this the right log here?
Apologies. It's supposed to be cortex_parquet.ValidNoConvertMarkVersion. I'll fix it.
Version int `json:"version"`
Reason string `json:"reason"`
LabelNamesCount int `json:"label_names_count"`
Threshold int `json:"threshold"`
Do we need details like LabelNamesCount and Threshold in this file? The no-convert marker can also be uploaded manually. Those details can be embedded in Reason, or carried in another string field.
Got it. I'll just embed them into Reason
	continue
}
labelNamesCount := len(labelNames)
if labelNamesCount > maxBlockLabelNames {
A note: today the max column limit in parquet-go is 32767, IIRC. Since our parquet file has additional system columns, we need to keep some buffer when configuring the max block label names.
I see. Is it 32767 + ColIndexesColumn + SeriesHashColumn + N DataColumns?
Signed-off-by: Siddarth Gundu <siddarthg0910@gmail.com>
- Add max-block-label-names limit; blocks exceeding it get a no-convert marker instead of being converted.
Signed-off-by: Siddarth Gundu <siddarthg0910@gmail.com>
…correctly Signed-off-by: Siddarth Gundu <siddarthg0910@gmail.com>
…test Signed-off-by: Siddarth Gundu <siddarthg0910@gmail.com>
- Add a new cortex_parquet_converter_blocks_skipped_total counter with user and reason labels
- Extract "too_many_labels" to a constant to avoid string duplication
Signed-off-by: Siddarth Gundu <siddarthg0910@gmail.com>
…ert marker exists Signed-off-by: Siddarth Gundu <siddarthg0910@gmail.com>
ebeecdc to 2929125
What this PR does:
If a TSDB block exceeds a configurable threshold of distinct label names, the converter writes a parquet-no-convert-mark.json marker and skips the block. The threshold is set by the parquet-converter.max-block-label-names limit.
Which issue(s) this PR fixes:
Fixes #7195
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
- docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags
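The skip behaviour summarized under "What this PR does", together with the per-user, per-reason counter from the commit list, can be sketched roughly as follows. This is a standard-library stand-in, not the converter's real code: shouldSkip, skipKey, the map-based counter, and the rule that a limit of 0 disables the check are all assumptions made for illustration.

```go
package main

import "fmt"

// reasonTooManyLabels mirrors the constant mentioned in the commit message.
const reasonTooManyLabels = "too_many_labels"

// skipKey stands in for the (user, reason) label pair of the counter.
type skipKey struct{ user, reason string }

// block is a hypothetical, simplified view of a TSDB block.
type block struct {
	id         string
	labelNames []string
}

// shouldSkip reports whether a block has more distinct label names than allowed.
// Assumption: a limit of 0 or less disables the check.
func shouldSkip(labelNames []string, maxBlockLabelNames int) bool {
	return maxBlockLabelNames > 0 && len(labelNames) > maxBlockLabelNames
}

func main() {
	skipped := map[skipKey]int{} // stand-in for a real metrics counter
	blocks := []block{
		{"block-a", []string{"__name__", "job"}},
		{"block-b", []string{"__name__", "job", "instance", "pod"}},
	}
	for _, b := range blocks {
		if shouldSkip(b.labelNames, 3) {
			// A real converter would also write the no-convert marker here.
			skipped[skipKey{"tenant-1", reasonTooManyLabels}]++
			fmt.Println("skip", b.id)
			continue
		}
		fmt.Println("convert", b.id)
	}
	fmt.Println("skipped:", skipped[skipKey{"tenant-1", reasonTooManyLabels}])
}
```

Keeping the reason in a single constant means the marker text and the counter label cannot drift apart.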