
PERF: Speed up np.isin table method with clipped indexing#30828

Open
mdrdope wants to merge 3 commits into numpy:main from mdrdope:patch-4

Conversation

@mdrdope
Contributor

@mdrdope mdrdope commented Feb 13, 2026

Summary

When np.isin uses the integer table method, the current implementation creates
a boolean mask to filter out-of-range elements, then gathers the in-range subset,
performs a lookup, and scatters results back. This PR adds a faster clipped-indexing
path that avoids the mask / gather / scatter overhead by extending the lookup
table with boundary sentinels and using np.clip to map all out-of-range values
to the sentinel slots.

Local tests: pytest numpy/lib/tests all passed; the full test suite is covered by CI.

A lightweight strided-sampling heuristic (~384 elements, O(1) view) selects
which path to use at runtime: clipped indexing when ≥ 5 % of ar1 falls inside
[ar2_min, ar2_max], and the original masking path otherwise.

Motivation

The table method in _isin currently does:

basic_mask = (ar1 <= ar2_max) & (ar1 >= ar2_min)                  # 2 comparisons + AND
in_range_ar1 = ar1[basic_mask]                                    # gather
outgoing[basic_mask] = isin_helper_ar[in_range_ar1 - ar2_min]     # lookup + scatter

This involves 5 memory passes over ar1 (two comparisons, AND, gather,
scatter) plus non-contiguous memory access for the gather/scatter.

The clipped indexing path replaces this with:

ar1_indices = ar1 - (ar2_min - 1)                        # shift
np.clip(ar1_indices, 0, ar2_range + 2, out=ar1_indices)  # clip (in-place)
return extended_table[ar1_indices]                       # lookup

3 dense passes instead of 5.
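As a self-contained illustration of the sentinel layout (a sketch following the snippets above, not the exact patch code):

```python
import numpy as np

# Sketch of the clipped-indexing path described above. Slot 0 and slot
# ar2_range + 2 are out-of-range sentinels (always False); the real table
# entries occupy offsets 1 .. ar2_range + 1.
ar1 = np.array([3, 7, 10, 42, -5, 8])
ar2 = np.array([7, 8, 10])

ar2_min, ar2_max = ar2.min(), ar2.max()
ar2_range = ar2_max - ar2_min

extended_table = np.zeros(ar2_range + 3, dtype=bool)
extended_table[ar2 - ar2_min + 1] = True          # shift entries past sentinel 0

ar1_indices = ar1 - (ar2_min - 1)                 # in-range values -> 1 .. range + 1
np.clip(ar1_indices, 0, ar2_range + 2, out=ar1_indices)  # out-of-range -> sentinels
result = extended_table[ar1_indices]

assert np.array_equal(result, np.isin(ar1, ar2))  # matches the reference result
```

Every out-of-range value is clipped onto a sentinel slot that is always False, so no mask, gather, or scatter is needed.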

Why a heuristic is needed

When very few elements of ar1 fall within [ar2_min, ar2_max] (< 5 %),
masking skips nearly all elements in the gather/scatter step, making it cheaper
than clipped indexing (which always processes every element). A quick
data-overlap test selects the faster path at runtime.

How the sampling works

_stride = max(1, ar1.size // 384)
_sample = ar1[::_stride]                                          # O(1) view
_n_in = np.count_nonzero((_sample >= ar2_min) & (_sample <= ar2_max))
  • ar1[::stride] is a strided view — O(1), zero allocation.
  • Touches ~384 elements (~3 KB), negligible vs the O(N) main work.
  • The dense stride (N // 384) reduces sensitivity to periodic or
    block-sorted input data.
  • Algorithm-selection via data sampling is a standard technique
    (cf. std::sort's median-of-three pivot, database optimizer sample scans).
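Putting the pieces together, the path-selection logic described in this PR looks roughly like this (the function name and the hard-coded 5 % threshold are illustrative, mirroring the description rather than the final patch):

```python
import numpy as np

def choose_path(ar1, ar2_min, ar2_max, n_sample=384, threshold=0.05):
    """Pick 'clip' or 'mask' from a ~384-element strided sample of ar1."""
    stride = max(1, ar1.size // n_sample)
    sample = ar1[::stride]                      # strided view: O(1), no copy
    n_in = np.count_nonzero((sample >= ar2_min) & (sample <= ar2_max))
    return "clip" if n_in / sample.size >= threshold else "mask"

rng = np.random.default_rng(0)
mostly_out = rng.integers(0, 1_000_000, size=100_000)   # ~0.01 % in [100, 200]
print(choose_path(mostly_out, 100, 200))                # mask
mostly_in = rng.integers(100, 201, size=100_000)        # 100 % in range
print(choose_path(mostly_in, 100, 200))                 # clip
```

Because the sample is a strided view, the heuristic's cost is bounded regardless of N, and it only has to be accurate near the 5 % crossover, not exact.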

Experimental evidence

Benchmarked across 250 parameter combinations
(N ∈ {1K, 10K, 100K, 1M, 5M},
ar2_range ∈ {10, 50, 200, 2K, 20K},
p_in ∈ {1 %, 3 %, 5 %, 7 %, 10 %, 15 %, 30 %, 50 %, 80 %, 100 %}).

Variable Meaning
N ar1 length (number of membership queries)
ar2_range max(ar2) - min(ar2) (lookup table domain size)
p_in fraction of ar1 elements within [ar2_min, ar2_max]

Key finding: p_in determines the crossover

The heatmap plot below shows T_clip / T_mask as a function of the sampled
in-range fraction across all N and ar2_range values. The crossover
(ratio = 1.0) sits near p_in ≈ 0.05, justifying the 5 %
threshold as the sole decision variable.

[Figure: heatmap_clip_vs_mask — T_clip / T_mask per N across ar2_range and p_in]

When overlap is very small (below ~5 %), the masking path is cheaper because it does the expensive work on only a tiny subset of elements. It still builds a boolean mask over all of ar1, but the table lookup and write-back (outgoing[basic_mask] = isin_helper_ar[in_range_ar1 - ar2_min]) touch only the values that fall inside [ar2_min, ar2_max]. If only 1 % of ar1 is in range, 99 % of the array is effectively skipped after the mask check.

The clipped path, by contrast, always processes every element: it creates and clips an index array over all of ar1 and performs a lookup for every element, even those out of range, so its cost is fixed regardless of overlap. As a result there is a small regression when very few values match.

Once overlap grows beyond ~5%, masking loses its advantage. Now it must gather and write back a large fraction of elements, and those writes are scattered (non-contiguous), which is slower. The clipped path remains a simple, sequential pass over memory, so it becomes faster as overlap increases.

Experiment script (not part of the commit)
"""
Experiment: masking vs clipping raw comparison + sampled in-range fraction.

Goal: find the best threshold for the sampling heuristic by plotting
      clip/mask ratio as a function of the sampled in-range fraction.

Sweeps: N (ar1 size), ar2_range, p_in (fraction of ar1 in [ar2_min, ar2_max]).
"""
import numpy as np
import timeit
import json
import os

# ===== Two raw implementations =====

def isin_masking(ar1, ar2, ar2_min, ar2_max, ar2_range):
    """Original NumPy masking approach."""
    outgoing_array = np.zeros(ar1.shape, dtype=bool)

    isin_helper_ar = np.zeros(ar2_range + 1, dtype=bool)
    isin_helper_ar[ar2 - ar2_min] = True

    basic_mask = (ar1 <= ar2_max) & (ar1 >= ar2_min)
    in_range_ar1 = ar1[basic_mask]
    if in_range_ar1.size == 0:
        return outgoing_array

    ar2_min_arr = np.array(ar2_min, dtype=np.intp)
    out = np.empty_like(in_range_ar1, dtype=np.intp)
    outgoing_array[basic_mask] = isin_helper_ar[
        np.subtract(in_range_ar1, ar2_min_arr, dtype=np.intp,
                    out=out, casting="unsafe")]

    return outgoing_array


def isin_clipping(ar1, ar2, ar2_min, ar2_max, ar2_range):
    """Clipped indexing: subtract + clip + fancy index."""
    shift_val = ar2_min - 1

    extended_table = np.zeros(ar2_range + 3, dtype=bool)
    extended_table[ar2 - ar2_min + 1] = True

    ar1_indices = ar1.astype(np.intp, copy=False) - shift_val
    np.clip(ar1_indices, 0, ar2_range + 2, out=ar1_indices)

    return extended_table[ar1_indices]


# ===== Data generation with controlled p_in =====

def generate_data(N, ar2_range, p_in, ar2_size=1000, seed=42):
    """Generate test data with precisely controlled in-range proportion."""
    rng = np.random.RandomState(seed)

    ar2_min = 100000
    ar2_max = ar2_min + ar2_range

    ar2 = rng.randint(ar2_min, ar2_max + 1, size=ar2_size)

    n_in = int(p_in * N)
    n_out = N - n_in

    parts = []
    if n_in > 0:
        parts.append(rng.randint(ar2_min, ar2_max + 1, size=n_in))
    if n_out > 0:
        n_below = n_out // 2
        n_above = n_out - n_below
        if n_below > 0:
            parts.append(rng.randint(0, ar2_min, size=n_below))
        if n_above > 0:
            parts.append(rng.randint(ar2_max + 1, ar2_max + 1 + 100000,
                                     size=n_above))

    ar1 = np.concatenate(parts)
    rng.shuffle(ar1)

    return ar1, ar2, ar2_min, ar2_max, ar2_range


def compute_sampled_fraction(ar1, ar2_min, ar2_max, n_sample=256):
    """Compute the sampled in-range fraction (what the heuristic would see)."""
    stride = max(1, len(ar1) // n_sample)
    sample = ar1[::stride]  # view, O(1)
    n_in = np.count_nonzero((sample >= ar2_min) & (sample <= ar2_max))
    return n_in / len(sample)


# ===== Parameters =====

Ns = [1000, 10_000, 100_000, 1_000_000, 5_000_000]
ar2_ranges = [10, 50, 200, 2_000, 20_000]
p_ins = [0.01, 0.03, 0.05, 0.07, 0.1, 0.15, 0.3, 0.5, 0.8, 1.0]

total = len(Ns) * len(ar2_ranges) * len(p_ins)
results = {}
count = 0

print(f"Total experiments: {total}")
print("=" * 80)

for N in Ns:
    n_repeat, n_number = 6, 4

    for ar2_range in ar2_ranges:
        for p_in in p_ins:
            count += 1
            print(f"[{count:3d}/{total}] N={N:>10,}  "
                  f"ar2_range={ar2_range:>7,}  p_in={p_in:.3f}  ", end="")

            ar1, ar2, ar2_min, ar2_max, ar2_rng = generate_data(
                N, ar2_range, p_in)

            # Verify correctness
            result_mask = isin_masking(ar1, ar2, ar2_min, ar2_max, ar2_rng)
            result_clip = isin_clipping(ar1, ar2, ar2_min, ar2_max, ar2_rng)
            assert np.array_equal(result_mask, result_clip), \
                f"mask vs clip differ! N={N}, range={ar2_range}, p_in={p_in}"

            # Compute sampled in-range fraction
            sampled_frac = compute_sampled_fraction(ar1, ar2_min, ar2_max)

            # Time both
            t_mask = min(timeit.repeat(
                lambda: isin_masking(ar1, ar2, ar2_min, ar2_max, ar2_rng),
                repeat=n_repeat, number=n_number
            )) / n_number

            t_clip = min(timeit.repeat(
                lambda: isin_clipping(ar1, ar2, ar2_min, ar2_max, ar2_rng),
                repeat=n_repeat, number=n_number
            )) / n_number

            ratio = t_clip / t_mask

            results[(N, ar2_range, p_in)] = {
                't_mask': t_mask,
                't_clip': t_clip,
                'ratio': ratio,
                'sampled_frac': sampled_frac,
            }

            print(f"mask={t_mask:.6f}s  clip={t_clip:.6f}s  "
                  f"clip/mask={ratio:.3f}  sampled_frac={sampled_frac:.4f}")

# ===== Save raw results to JSON =====
save_path = os.path.dirname(os.path.abspath(__file__))
json_results = {}
for (N, ar2_range, p_in), v in results.items():
    key = f"{N}_{ar2_range}_{p_in}"
    json_results[key] = {
        'N': N, 'ar2_range': ar2_range, 'p_in': p_in,
        't_mask': v['t_mask'], 't_clip': v['t_clip'],
        'ratio': v['ratio'], 'sampled_frac': v['sampled_frac'],
    }

json_path = os.path.join(save_path, 'sweep_results.json')
with open(json_path, 'w') as f:
    json.dump(json_results, f, indent=2)
print(f"\nRaw results saved to {json_path}")

# ===== Summary table =====
print("\n" + "=" * 95)
print("SUMMARY: clipping vs original masking  (+ sampled in-range fraction)")
print("=" * 95)
print(f"{'N':>12}  {'ar2_range':>10}  {'p_in':>6}  "
      f"{'sampled%':>9}  {'clip/mask':>10}  {'winner':>8}")
print("-" * 95)

for N in Ns:
    for ar2_range in ar2_ranges:
        for p_in in p_ins:
            v = results[(N, ar2_range, p_in)]
            winner = "CLIP" if v['ratio'] < 1.0 else "MASK"
            print(f"{N:>12,}  {ar2_range:>10,}  {p_in:>6.3f}  "
                  f"{v['sampled_frac']:>8.4f}  {v['ratio']:>10.3f}  "
                  f"{winner:>8}")

print("-" * 95)

# ===== Visualization =====
try:
    import matplotlib
    matplotlib.use('Agg')
    import matplotlib.pyplot as plt

    # --- Key plot: clip/mask ratio vs sampled_frac, grouped by N ---
    fig, axes = plt.subplots(1, len(Ns), figsize=(5 * len(Ns), 5),
                             sharey=True)
    if len(Ns) == 1:
        axes = [axes]

    for ax, N in zip(axes, Ns):
        fracs = []
        ratios = []
        for ar2_range in ar2_ranges:
            for p_in in p_ins:
                v = results[(N, ar2_range, p_in)]
                fracs.append(v['sampled_frac'])
                ratios.append(v['ratio'])

        ax.scatter(fracs, ratios, alpha=0.6, s=20, edgecolors='k',
                   linewidths=0.3)
        ax.axhline(y=1.0, color='red', linestyle='--', alpha=0.7,
                   linewidth=2, label='breakeven')
        ax.set_xlabel('Sampled in-range fraction')
        ax.set_ylabel('T_clip / T_mask')
        ax.set_title(f'N = {N:,}')
        ax.set_xlim(-0.05, 1.05)
        ax.set_ylim(0, 3.0)
        ax.legend(fontsize=8)
        ax.grid(True, alpha=0.3)

    plt.suptitle('Clip/Mask ratio vs Sampled in-range fraction\n'
                 '(below red line = clipping wins)',
                 fontsize=13, y=1.04)
    plt.tight_layout()
    scatter_path = os.path.join(save_path, 'scatter_frac_vs_ratio.png')
    plt.savefig(scatter_path, dpi=150, bbox_inches='tight')
    plt.close()
    print(f"Saved {scatter_path}")

    # --- Heatmap: clip/mask per N ---
    fig, axes = plt.subplots(1, len(Ns), figsize=(5 * len(Ns), 5))
    if len(Ns) == 1:
        axes = [axes]

    for col, N in enumerate(Ns):
        ax = axes[col]
        ratio_matrix = np.zeros((len(p_ins), len(ar2_ranges)))
        for i, p_in in enumerate(p_ins):
            for j, ar2_range in enumerate(ar2_ranges):
                ratio_matrix[i, j] = results[(N, ar2_range, p_in)]['ratio']

        im = ax.imshow(ratio_matrix, aspect='auto', cmap='RdYlGn_r',
                       vmin=0.3, vmax=2.0, origin='lower')
        ax.set_xticks(range(len(ar2_ranges)))
        ax.set_xticklabels([str(r) for r in ar2_ranges], rotation=45)
        ax.set_yticks(range(len(p_ins)))
        ax.set_yticklabels([str(p) for p in p_ins])
        ax.set_xlabel('ar2_range')
        ax.set_ylabel('p_in')
        ax.set_title(f'T_clip / T_mask  (N={N:,})')

        for i in range(len(p_ins)):
            for j in range(len(ar2_ranges)):
                val = ratio_matrix[i, j]
                color = 'white' if val > 1.5 or val < 0.5 else 'black'
                ax.text(j, i, f'{val:.2f}', ha='center', va='center',
                        fontsize=6, color=color)

        plt.colorbar(im, ax=ax, shrink=0.8)

    plt.suptitle('Clip vs Mask heatmap', fontsize=14, y=1.02)
    plt.tight_layout()
    heatmap_path = os.path.join(save_path, 'heatmap_clip_vs_mask.png')
    plt.savefig(heatmap_path, dpi=150, bbox_inches='tight')
    plt.close()
    print(f"Saved {heatmap_path}")

    # --- All-N overlay: clip/mask vs sampled_frac ---
    fig, ax = plt.subplots(figsize=(10, 6))
    colors_N = plt.cm.tab10(np.linspace(0, 0.5, len(Ns)))

    for idx, N in enumerate(Ns):
        fracs = []
        ratios = []
        for ar2_range in ar2_ranges:
            for p_in in p_ins:
                v = results[(N, ar2_range, p_in)]
                fracs.append(v['sampled_frac'])
                ratios.append(v['ratio'])
        ax.scatter(fracs, ratios, alpha=0.5, s=15, color=colors_N[idx],
                   label=f'N={N:,}')

    ax.axhline(y=1.0, color='red', linestyle='--', linewidth=2,
               label='breakeven')
    ax.set_xlabel('Sampled in-range fraction', fontsize=12)
    ax.set_ylabel('T_clip / T_mask', fontsize=12)
    ax.set_title('All N: where does clipping beat masking?', fontsize=14)
    ax.set_xlim(-0.05, 1.05)
    ax.set_ylim(0, 3.5)
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    overlay_path = os.path.join(save_path, 'overlay_frac_vs_ratio.png')
    plt.savefig(overlay_path, dpi=150, bbox_inches='tight')
    plt.close()
    print(f"Saved {overlay_path}")

except ImportError:
    print("matplotlib not installed, skipping visualization.")

print("\nDone!")

ASV benchmarks

The benchmark below is included for transparency and to document the measurement
methodology; it is not proposed to be added in-tree unless maintainers think it would be useful.

ASV benchmark code used:

Add the following class at the end of numpy/benchmarks/benchmarks/bench_lib.py:


class IsinTableOverlap(Benchmark):
    """Benchmarks for `numpy.isin` integer table-method paths.

    Tests the clipped-indexing optimisation across three overlap regimes
    that determine whether the new fast path or the original masking path
    is taken:

    - "high"  (p_in ~ 50 %): most ar1 values inside ar2's range
    - "low"   (p_in ~ 0.1 %): almost no ar1 values inside ar2's range
    - "all"   (p_in = 100 %): every ar1 value inside ar2's range
    """

    params = (
        [10_000, 1_000_000],            # N  (ar1 size)
        ["high", "low", "all"],         # overlap regime
        [False, True],                  # invert
    )
    param_names = ["N", "overlap", "invert"]

    def setup(self, N, overlap, invert):
        rng = np.random.default_rng(42)

        ar2_min = 100_000
        ar2_max = 110_000                       # ar2_range = 10_000
        self.b = rng.integers(ar2_min, ar2_max + 1, size=1000)

        if overlap == "high":
            # ~50 % of ar1 inside [ar2_min, ar2_max]
            n_in = N // 2
            n_out = N - n_in
            parts = [
                rng.integers(ar2_min, ar2_max + 1, size=n_in),
                rng.integers(0, ar2_min, size=n_out // 2),
                rng.integers(ar2_max + 1, ar2_max + 100_000,
                             size=n_out - n_out // 2),
            ]
            self.a = np.concatenate(parts)
            rng.shuffle(self.a)
        elif overlap == "low":
            # ~0.1 % of ar1 inside range
            n_in = max(1, N // 1000)
            n_out = N - n_in
            parts = [
                rng.integers(ar2_min, ar2_max + 1, size=n_in),
                rng.integers(0, ar2_min, size=n_out // 2),
                rng.integers(ar2_max + 1, ar2_max + 100_000,
                             size=n_out - n_out // 2),
            ]
            self.a = np.concatenate(parts)
            rng.shuffle(self.a)
        else:  # "all"
            self.a = rng.integers(ar2_min, ar2_max + 1, size=N)

        self.invert = invert

    def time_isin(self, N, overlap, invert):
        np.isin(self.a, self.b, invert=self.invert)

Apply the new _isin function, commit, and run:

cd benchmarks
asv continuous main HEAD -b IsinTableOverlap

It covers three overlap regimes:

  • High overlap (p_in ≈ 50 %): largest speedup from clipped indexing
  • Low overlap (p_in ≈ 0.1 %): heuristic correctly routes to masking
  • All in range (p_in = 100 %): clipped indexing avoids unnecessary masking

Results (Apple M-series)

At N = 1 000 000 the clipped path is 2–6× faster with no regression.
At N = 10 000 with high/all overlap it is 1.5–2.3× faster; with low
overlap a small regression of ~3.5 μs (~23 %) is observed, which is negligible in absolute terms (≈15 → ≈18.5 μs).

Output report (asv continuous main HEAD -b IsinTableOverlap, Apple M-series):

N          Overlap  Invert  Before        After         Ratio
1 000 000  high     False   7.51±0.08 ms  1.27±0.08 ms  0.17
1 000 000  high     True    7.58±0.2 ms   1.31±0.07 ms  0.17
1 000 000  all      False   2.67±0.09 ms  1.30±0.06 ms  0.49
1 000 000  all      True    2.54±0.1 ms   1.32±0.03 ms  0.52
10 000     high     False   47.2±0.7 μs   20.6±0.2 μs   0.44
10 000     high     True    47.8±1 μs     21.1±0.7 μs   0.44
10 000     all      False   32.3±2 μs     20.6±0.5 μs   0.64
10 000     all      True    33.2±1 μs     21.0±0.5 μs   0.63
10 000     low      False   14.9±0.2 μs   18.4±0.3 μs   1.23
10 000     low      True    15.1±0.5 μs   18.6±1 μs     1.23

The two "low overlap, N = 10 000" cases show a ~3.5 μs regression (sampling
overhead dominates when the total operation is only ~15 μs). At N = 1 000 000
the sampling cost is negligible and every overlap regime is faster or neutral.

Added a quick data-overlap test for improved performance in array operations by checking if a portion of ar1 falls within the specified range of ar2. This change optimizes the indexing path based on empirical overlap thresholds.
@mdrdope mdrdope changed the title # PERF: Speed up np.isin table method with clipped indexing PERF: Speed up np.isin table method with clipped indexing Feb 13, 2026
@andyfaff
Member

Can you please confirm that this PR was not created with AI?

@mdrdope
Contributor Author

mdrdope commented Feb 13, 2026

Can you please confirm that this PR was not created with AI?

To clarify: this PR was not automatically generated by AI.
Parts of this patch were initially explored using an AI system, but I reviewed, refined, and validated the final implementation myself. I designed and ran the empirical experiments that determined the sampling heuristic and threshold, and I verified correctness locally and through the full test suite (CI is green). The PR and final code are my own work.

@andyfaff
Member

Unfortunately I'm still doubtful. To me the PR is suspicious because the original post is word perfect, follows the formatting used by AI, has the emojis used by AI, the heading title used by AI, etc (all without the post being edited). That would be very unusual for a first time contributor.

@mdrdope
Contributor Author

mdrdope commented Feb 14, 2026

Unfortunately I'm still doubtful. To me the PR is suspicious because the original post is word perfect, follows the formatting used by AI, has the emojis used by AI, the heading title used by AI, etc (all without the post being edited). That would be very unusual for a first time contributor.

Thanks for raising the concern; I understand why the style might look unusual.

Just to clarify, this isn’t my first contribution; a previous performance PR was merged earlier this week.

For this patch, I take full responsibility for the implementation and the benchmarks. The sampling heuristic and threshold choice were iterated on and validated by me, and I verified correctness locally and through CI.

If there are particular concerns about the clipped-indexing path, the sampling strategy, or the performance trade-offs at different overlap densities, I’d be very happy to go through them in detail.

If the formatting or emoji use is distracting, I’m happy to adjust the PR description to better match the usual NumPy style.

@andyfaff
Member

I won't be the one reviewing the PR; I'm just triaging in this particular case.

Open source projects are currently experiencing a lot of upheaval and extra work caused by (mis)use of AI. Several projects are actively figuring out stances w.r.t how we deal with project contributions where AI is involved.
The discussion from the numpy community is in the numpy-discussion mail archive.

The scipy sister project is in the process of drafting guidelines for AI assisted contributions, https://github.com/scipy/scipy/pull/24583/changes, which I don't think would be too dissimilar to numpy's point of view.

The important factors here:

  • the maintainers typically want to be communicating with a human, community is important. When there's a possibility of a post/PR being generated by AI, our guard becomes raised. Figuring that out is not an exact science.
  • We want to be scrupulous about the licensing of the numpy code-base. For example, we cannot include GPL based code with our BSD-3 clause licence. We also can't use things like a Python translation of code from Numerical Recipes. Therefore, if any of the code included in the PR was generated by AI, and will eventually be used in a numpy wheel/sdist, then we can't merge it. This is because the AI would've originally been trained using code with a non-compatible licence. So if the final code was not written by a person from the high level algorithm, there's a high probability it wouldn't get merged.

These are the motivations behind my queries. Unfortunately maintainers raising these kinds of questions does make us appear less welcoming. That is not our intention.

@charris
Member

charris commented Feb 18, 2026

Therefore, if any of the code included in the PR was generated by AI,

@andyfaff That is an extreme position; if we do that we will end up with no maintainers, because everyone coming up will be using AI. That is not my understanding of where the mailing list discussion was going. We don't want to be the Spanish Inquisition.

@andyfaff
Member

I don't mind winding back my comments. That's a side effect of a rapidly changing landscape.

@mdrdope
Contributor Author

mdrdope commented Feb 23, 2026

If you have time, I’d really appreciate technical feedback on this PR.
In particular:

  1. correctness of the clipped-indexing sentinel logic for all integer edge cases,
  2. the current sampling threshold and its performance trade-offs across regimes.

Thank you! @charris
