feat(compression): add MemorySlice comparator and support LookupStoreFactory for SST file#172
Open
lxy-9602 wants to merge 4 commits intoalibaba:mainfrom
Open
feat(compression): add MemorySlice comparator and support LookupStoreFactory for SST file#172lxy-9602 wants to merge 4 commits intoalibaba:mainfrom
lxy-9602 wants to merge 4 commits intoalibaba:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds ordered binary comparison support (via MemorySlice/variant comparators) and introduces a LookupStoreFactory implementation backed by SST files, along with new lookup-cache related options (bloom filter, compression, and page size).
Changes:
- Added Java-compatible floating-point ordering and variant-based comparators to support consistent ordering across components.
- Introduced SST-backed lookup store factory (reader/writer) and updated SST reader/iterator APIs to return
Result<>for error propagation. - Added new core options for lookup cache bloom filter, spill compression, and cache page size (plus corresponding tests).
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/paimon/core/utils/fields_comparator.h | Adds floating-point ordering helper and variant comparator API. |
| src/paimon/core/utils/fields_comparator.cpp | Implements variant comparator selection by Arrow type. |
| src/paimon/core/options/compress_options.h | Introduces compression option struct used by lookup spill compression. |
| src/paimon/core/core_options_test.cpp | Adds assertions and parsing coverage for new lookup/cache options. |
| src/paimon/core/core_options.h | Exposes new lookup/cache option getters. |
| src/paimon/core/core_options.cpp | Parses and stores lookup/cache bloom filter, compression, and page size options. |
| src/paimon/common/utils/bloom_filter.cpp | Tightens integer types for consistency and avoids implicit narrowing. |
| src/paimon/common/sst/sst_file_writer.h | Documents null bloom filter handle behavior when disabled. |
| src/paimon/common/sst/sst_file_writer.cpp | Returns null bloom filter handle instead of failing when bloom filter is disabled. |
| src/paimon/common/sst/sst_file_utils.h | Makes compression-type parsing fallible (Result<>) with clear error. |
| src/paimon/common/sst/sst_file_reader.h | Refactors construction to use InputStream, upgrades Lookup/Seek APIs to Result<>, adds Close(). |
| src/paimon/common/sst/sst_file_reader.cpp | Propagates errors via Result<>, supports optional bloom filter handle, adds reader close path. |
| src/paimon/common/sst/sst_file_io_test.cpp | Updates tests for new Result<> APIs and new reader creation signature. |
| src/paimon/common/sst/block_reader.h | Makes BlockReader creation fallible and uses MemorySlice::SliceComparator. |
| src/paimon/common/sst/block_reader.cpp | Updates creation path to return Result<> and modernizes comparator exposure. |
| src/paimon/common/sst/block_iterator.h | Changes SeekTo to return Result<bool> and updates ctor signature. |
| src/paimon/common/sst/block_iterator.cpp | Propagates comparator/positioning errors with Result<>. |
| src/paimon/common/sst/block_entry.h | Renames fields and adjusts constructor to const-ref shared_ptrs. |
| src/paimon/common/sst/block_cache.h | Adds explicit Close() to invalidate cached pages. |
| src/paimon/common/memory/memory_slice.h | Adds SliceComparator alias and exposes underlying segment via GetSegment(). |
| src/paimon/common/memory/memory_slice.cpp | Implements GetSegment() and makes accessors const. |
| src/paimon/common/lookup/sort/sort_lookup_store_test.cpp | Adds test coverage for SST-backed lookup store factory and new options. |
| src/paimon/common/lookup/sort/sort_lookup_store_factory.h | Introduces SST-backed lookup store reader/writer and factory. |
| src/paimon/common/lookup/sort/sort_lookup_store_factory.cpp | Implements reader/writer creation and close/flush flows. |
| src/paimon/common/lookup/lookup_store_factory.h | Adds lookup store factory interfaces and bloom filter generator. |
| src/paimon/common/lookup/lookup_store_factory.cpp | Creates SST-backed factory based on options; generates bloom filters per options. |
| src/paimon/common/io/cache/cache_manager.h | Makes InvalidPage accept const-ref shared_ptr. |
| src/paimon/common/io/cache/cache_manager.cpp | Updates InvalidPage implementation signature accordingly. |
| src/paimon/common/file_index/rangebitmap/dictionary/key_factory.cpp | Reuses centralized floating-point ordering from FieldsComparator. |
| src/paimon/common/defs.cpp | Registers new option keys for lookup cache and page size. |
| src/paimon/common/data/serializer/row_compacted_serializer_test.cpp | Adds slice comparator tests across types, nulls, and float/double edge cases. |
| src/paimon/common/data/serializer/row_compacted_serializer.h | Adds API to create a MemorySlice comparator for compacted rows. |
| src/paimon/common/data/serializer/row_compacted_serializer.cpp | Implements slice comparator creation using field readers + variant comparators. |
| src/paimon/common/compression/block_compression_factory.h | Adds Create(const CompressOptions&) overload. |
| src/paimon/common/compression/block_compression_factory.cpp | Implements string-driven compression selection and improves error message formatting. |
| src/paimon/CMakeLists.txt | Wires new lookup factory sources and test into build. |
| include/paimon/defs.h | Documents new lookup cache and cache page size options in public API. |
Comments suppressed due to low confidence (4)
src/paimon/core/utils/fields_comparator.cpp:1
arrow::Type::INT8is being extracted aschar, whose signedness is implementation-defined. This can break ordering for negative values on platforms wherecharis unsigned. Useint8_tconsistently (and ensure the underlying VariantType stores INT8 asint8_t).
src/paimon/core/utils/fields_comparator.cpp:1FieldsComparatornow providesCompareFloatingPoint()with Java-compatible ordering, butCompareField()for FLOAT/DOUBLE still documents (and likely implements) non-Java behavior. This can produce inconsistent ordering betweenInternalRowcomparators and the new slice/variant comparators. Consider switching FLOAT/DOUBLE inCompareField()to useCompareFloatingPoint()and update/remove the TODO/comment accordingly.
src/paimon/core/core_options.cpp:1cache_page_sizeis stored asint64_tbut exposed asint32_tand narrowed viastatic_cast. This risks overflow for legitimate configurations (e.g., cache sizes > 2GB) and makes the API misleading. Prefer returningint64_t(orsize_t) fromGetCachePageSize()and keep the type consistent end-to-end.
src/paimon/core/core_options.cpp:1cache_page_sizeis stored asint64_tbut exposed asint32_tand narrowed viastatic_cast. This risks overflow for legitimate configurations (e.g., cache sizes > 2GB) and makes the API misleading. Prefer returningint64_t(orsize_t) fromGetCachePageSize()and keep the type consistent end-to-end.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Linked issue: #93
MemorySlicecomparator - Implements a comparator forMemorySliceto enable ordered comparison of binary data.SortLookupStoreFactoryfor creating SST readers/writers.Tests
RowCompactedSerializerTest.TestSliceComparator
SortLookupStoreTest
API and Format
Add options for lookup in
def.h.