PERF: Speed up np.isin table method with clipped indexing #30828
mdrdope wants to merge 3 commits into numpy:main
Conversation
Added a quick data-overlap test for improved performance in array operations by checking if a portion of ar1 falls within the specified range of ar2. This change optimizes the indexing path based on empirical overlap thresholds.
np.isin table method with clipped indexing
Can you please confirm that this PR was not created with AI?
Removed unnecessary blank line in the function.
To clarify: this PR was not automatically generated by AI.
Unfortunately I'm still doubtful. To me the PR is suspicious because the original post is word-perfect, follows the formatting used by AI, has the emojis used by AI, the heading title used by AI, etc. (all without the post being edited). That would be very unusual for a first-time contributor.
Thanks for raising the concern, I understand why the style might look unusual. Just to clarify, this isn't my first contribution; a previous performance PR was merged earlier this week. For this patch, I take full responsibility for the implementation and the benchmarks. The sampling heuristic and threshold choice were iterated on and validated by me, and I verified correctness locally and through CI. If there are particular concerns about the clipped-indexing path, the sampling strategy, or the performance trade-offs at different overlap densities, I'd be very happy to go through them in detail. If the formatting or emoji use is distracting, I'm happy to adjust the PR description to better match the usual NumPy style.
I won't be the one reviewing the PR, just triaging in this particular case. Open source projects are currently experiencing a lot of upheaval and extra work caused by (mis)use of AI. Several projects are actively figuring out stances on how to deal with contributions where AI is involved. The scipy sister project is in the process of drafting guidelines for AI-assisted contributions, https://github.com/scipy/scipy/pull/24583/changes, which I don't think would be too dissimilar to numpy's point of view. The important factors here:
These are the motivations behind my queries. Unfortunately, maintainers raising these kinds of questions does make us appear less welcoming. That is not our intention.
@andyfaff That is an extreme position; if we do that we will end up with no maintainers, because everyone coming up will be using AI. That is not my understanding of where the mailing list discussion was going. We don't want to be the Spanish Inquisition.
I don't mind winding back my comments. That's a side effect of a rapidly changing landscape.
If you have time, I’d really appreciate technical feedback on this PR.
Summary
When `np.isin` uses the integer table method, the current implementation creates a boolean mask to filter out-of-range elements, then gathers the in-range subset, performs a lookup, and scatters the results back. This PR adds a faster clipped indexing path that avoids the mask / gather / scatter overhead by extending the lookup table with boundary sentinels and using `np.clip` to map all out-of-range values to the sentinel slots.
Local tests: `pytest numpy/lib/tests` all passed. Full test suite covered by CI.
A lightweight strided-sampling heuristic (~384 elements, O(1) view) selects which path to use at runtime: clipped indexing when ≥ 5 % of `ar1` falls inside `[ar2_min, ar2_max]`, and the original masking path otherwise.

Motivation
The table method in `_isin` currently does:

This involves 5 memory passes over `ar1` (two comparisons, AND, gather, scatter) plus non-contiguous memory access for the gather/scatter.

The clipped indexing path replaces this with:

3 dense passes instead of 5.
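To make the two strategies concrete, here is a minimal sketch of both paths. This is a simplified illustration, not the actual patch: the function names are invented, and details such as dtype handling, overflow checks, and the exact sentinel layout in the real `_isin` will differ.

```python
import numpy as np

def isin_table_masked(ar1, ar2):
    # Current masking path (simplified): build a boolean lookup table over
    # [ar2_min, ar2_max], mask out-of-range elements of ar1, then
    # gather/lookup/scatter only the in-range subset.
    ar2_min, ar2_max = ar2.min(), ar2.max()
    table = np.zeros(ar2_max - ar2_min + 1, dtype=bool)
    table[ar2 - ar2_min] = True
    in_range = (ar1 >= ar2_min) & (ar1 <= ar2_max)   # two comparisons + AND
    out = np.zeros(ar1.shape, dtype=bool)
    out[in_range] = table[ar1[in_range] - ar2_min]   # gather + scatter
    return out

def isin_table_clipped(ar1, ar2):
    # Proposed clipped path (simplified): extend the table with a sentinel
    # slot at each end, then clip every index into the valid range so all
    # out-of-range values land on an always-False sentinel.
    ar2_min, ar2_max = ar2.min(), ar2.max()
    table = np.zeros(ar2_max - ar2_min + 3, dtype=bool)  # +2 sentinels
    table[ar2 - ar2_min + 1] = True
    idx = np.clip(ar1 - ar2_min + 1, 0, table.size - 1)  # dense pass
    return table[idx]                                    # dense lookup
```

Both versions return the same result as `np.isin`; the clipped variant touches memory in three dense, sequential passes (subtract/clip, lookup, write) instead of five.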
Why a heuristic is needed
When very few elements of `ar1` fall within `[ar2_min, ar2_max]` (< 5 %), masking skips nearly all elements in the gather/scatter step, making it cheaper than clipped indexing (which always processes every element). A quick data-overlap test selects the faster path at runtime.
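A hypothetical version of that runtime test is sketched below. The function names, the ~384-element sample size, and the 5 % threshold mirror the description above, but this is illustrative, not the committed code.

```python
import numpy as np

def estimate_overlap(ar1, ar2_min, ar2_max, n_samples=384):
    # Look at a strided sample of roughly n_samples elements (a view,
    # no copy) and estimate the fraction inside [ar2_min, ar2_max].
    stride = max(1, ar1.size // n_samples)
    sample = ar1[::stride]                       # O(1) strided view
    in_range = (sample >= ar2_min) & (sample <= ar2_max)
    return in_range.mean()

def choose_path(ar1, ar2_min, ar2_max, threshold=0.05):
    # Clipped indexing when the sampled in-range fraction is >= 5%,
    # otherwise the original masking path.
    if estimate_overlap(ar1, ar2_min, ar2_max) >= threshold:
        return "clipped"
    return "masked"
```

Because the sample is a strided view, the test costs a few hundred comparisons regardless of `N`, which is why its overhead only shows up at small input sizes.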
How the sampling works
- `ar1[::stride]` is a strided view: O(1), zero allocation.
- Strided sampling (stride ≈ `N // 384`) reduces sensitivity to periodic or block-sorted input data (cf. `std::sort`'s median-of-three pivot, database optimizer sample scans).

Experimental evidence
Benchmarked across 210 parameter combinations
(N ∈ {1K, 10K, 100K, 1M, 5M},
ar2_range ∈ {10, 50, 200, 2K, 20K},
p_in ∈ {1 %, 3 %, 5 %, 7 %, 10 %, 15 %, 30 %, 50 %, 80 %, 100 %}).

- N: `ar1` length (number of membership queries)
- ar2_range: `max(ar2) - min(ar2)` (lookup table domain size)
- p_in: fraction of `ar1` elements within `[ar2_min, ar2_max]`

Key finding: p_in determines the crossover
The heatmap plot below shows T_clip / T_mask as a function of the sampled
in-range fraction across all N and ar2_range values. The crossover
(ratio = 1.0) sits near p_in ≈ 0.05, justifying the 5 %
threshold as the sole decision variable.
When overlap is very small (≤ 5 %), the masking path is cheaper because it only does the expensive work on a tiny subset of elements. It first builds a boolean mask over all of `ar1`, but the actual table lookup and write-back (`outgoing[basic_mask] = isin_helper_ar[in_range_ar1 - ar2_min]`) only happen for the few values that fall inside `[ar2_min, ar2_max]`. If only 1 % of `ar1` matches, 99 % of the array is effectively skipped after the mask check.

The clipped path, however, always processes every element. It creates and clips an index array for all of `ar1`, then performs a lookup for every element, even those out of range; this fixed cost produces a small regression when very few values match.

Once overlap grows beyond ~5 %, masking loses its advantage. It must now gather and write back a large fraction of elements, and those writes are scattered (non-contiguous), which is slower. The clipped path remains a simple, sequential pass over memory, so it becomes faster as overlap increases.
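The asymmetry can be made concrete with a quick count of how many elements the gather/scatter step actually touches at different overlap levels. The script below is illustrative (the data construction is mine, not the benchmark's): the masking path's expensive step scales with the number of in-range elements, while the clipped path always indexes all N.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1_000_000
ar2_min, ar2_max = 0, 199

counts = {}
for p_in in (0.01, 0.80):
    # Build ar1 with an exact in-range fraction p_in, then shuffle.
    n_in = int(N * p_in)
    ar1 = np.concatenate([
        rng.integers(ar2_min, ar2_max + 1, size=n_in),        # in range
        rng.integers(ar2_max + 1, ar2_max + 10_000, size=N - n_in),
    ])
    rng.shuffle(ar1)
    mask = (ar1 >= ar2_min) & (ar1 <= ar2_max)
    counts[p_in] = int(mask.sum())   # elements the masking path gathers
    print(f"p_in={p_in:.0%}: masking gathers {counts[p_in]} of {N}; "
          f"clipped always indexes {N}")
```

At 1 % overlap the gather/scatter step touches ~10 000 elements versus the clipped path's fixed 1 000 000; at 80 % it touches 800 000, and those writes are non-contiguous.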
Experiment script (not part of the commit)
ASV benchmarks
The benchmark below is included for transparency and to document the measurement
methodology; it is not proposed to be added in-tree unless maintainers think it would be useful.
Click to expand: ASV benchmark code used
Add the following class at the end of `/numpy/benchmarks/benchmarks/bench_lib.py`, apply the new `_isin` function, commit, and run:

```
cd benchmarks
asv continuous main HEAD -b IsinTableOverlap
```

It covers three overlap regimes:
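The actual class is collapsed in the PR; a hypothetical reconstruction of its shape might look like the following. The parameter names, sizes, and regime fractions here are guesses based on the surrounding description (ASV benchmark classes in NumPy's `bench_lib.py` also subclass a shared `Benchmark` base, omitted here for self-containment).

```python
import numpy as np

class IsinTableOverlap:
    # Hypothetical sketch of the collapsed ASV benchmark class.
    params = [[10_000, 1_000_000], ["low", "high", "all"]]
    param_names = ["size", "overlap"]

    def setup(self, size, overlap):
        rng = np.random.default_rng(1)
        self.ar2 = rng.integers(0, 200, size=200)
        # Fractions for the three overlap regimes are illustrative.
        frac = {"low": 0.01, "high": 0.5, "all": 1.0}[overlap]
        n_in = int(size * frac)
        self.ar1 = np.concatenate([
            rng.integers(0, 200, size=n_in),            # in range
            rng.integers(10_000, 20_000, size=size - n_in),
        ])
        rng.shuffle(self.ar1)

    def time_isin(self, size, overlap):
        np.isin(self.ar1, self.ar2)
```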
Results (Apple M-series)
At N = 1 000 000 the clipped path is 2–6× faster with no regression. At N = 10 000 with high/all overlap it is 1.5–2.3× faster; with low overlap a small regression of ~3 μs (~23 %) is observed, which is negligible in absolute terms (15 → 18 μs).
Output report (`asv continuous main HEAD -b IsinTableOverlap`, Apple M-series):

The two "low overlap, N = 10 000" cases show a ~3.5 μs regression (sampling overhead dominates when the total operation is only ~15 μs). At N = 1 000 000 the sampling cost is negligible and every overlap regime is faster or neutral.