PySpark Benchmarks

This directory contains microbenchmarks for PySpark using ASV (Airspeed Velocity).

Prerequisites

Install ASV:

pip install asv

To run benchmarks in isolated environments (that is, without --python=same), you need an environment manager. The default configuration uses virtualenv, but ASV also supports others, including conda, mamba, and uv. See the official ASV documentation for details.
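Before a first run it can help to see which of these tools are present. The following is a minimal sketch, assuming that being importable (asv, virtualenv) or on PATH (conda, mamba, uv) is a good enough signal. The helper name available_tools is hypothetical, and the check does not confirm that ASV can actually use a given tool.

```python
import importlib.util
import shutil


def available_tools():
    """Report which tools appear to be installed (presence only, no version check)."""
    return {
        # Python packages: checked by looking for an importable module.
        "asv": importlib.util.find_spec("asv") is not None,
        "virtualenv": importlib.util.find_spec("virtualenv") is not None,
        # Command-line tools: checked by looking for an executable on PATH.
        "conda": shutil.which("conda") is not None,
        "mamba": shutil.which("mamba") is not None,
        "uv": shutil.which("uv") is not None,
    }


tools = available_tools()
for name, found in tools.items():
    print(f"{name}: {'found' if found else 'not found'}")
```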

Running Benchmarks

Quick run (current environment)

Run benchmarks using your current Python environment (fastest for development):

cd python/benchmarks
asv run --python=same --quick

To run a single benchmark class, pass its name with -b:

cd python/benchmarks
asv run --python=same --quick -b 'bench_arrow.LongArrowToPandasBenchmark'
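ASV treats the -b argument as a regular expression matched against benchmark names, so naming a class selects every benchmark method in it. The sketch below imitates that selection with Python's re module. The benchmark names in the list (including OtherBenchmark and the method names) are hypothetical, and the matching is an approximation of ASV's behavior, not its actual code.

```python
import re

# Hypothetical benchmark names in module.Class.method form.
benchmark_names = [
    "bench_arrow.LongArrowToPandasBenchmark.time_convert",
    "bench_arrow.LongArrowToPandasBenchmark.peakmem_convert",
    "bench_arrow.OtherBenchmark.time_convert",
]


def select(pattern, names):
    """Return the names in which the pattern matches anywhere."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


selected = select("bench_arrow.LongArrowToPandasBenchmark", benchmark_names)
for name in selected:
    print(name)
```

Because the pattern is a regular expression, a shorter pattern such as LongArrow would select the same two names here.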

Full run against a commit

Run benchmarks in an isolated virtualenv (builds PySpark from source). The ^! suffix is Git revision syntax that limits the run to that single commit:

cd python/benchmarks
asv run master^!          # Run on latest master commit
asv run v3.5.0^!          # Run on a specific tag
asv run abc123^!          # Run on a specific commit

Compare two commits

Compare the current branch (HEAD) against upstream/main, reporting benchmarks whose results change by more than 10% (factor 1.1):

asv continuous -f 1.1 upstream/main HEAD
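To make the factor concrete, here is a minimal sketch of how a threshold of 1.1 separates meaningful changes from small ones. The function name classify and the timings are hypothetical, and the sketch compares plain ratios only; ASV's own comparison also accounts for measurement noise.

```python
def classify(base_time, new_time, factor=1.1):
    """Label a timing change relative to a threshold factor."""
    ratio = new_time / base_time
    if ratio > factor:
        return "slower"
    if ratio < 1 / factor:
        return "faster"
    return "unchanged"


print(classify(1.00, 1.15))  # 15% slower, beyond the 10% threshold
print(classify(1.00, 1.05))  # 5% slower, within the threshold
print(classify(1.00, 0.80))  # 20% faster, beyond the threshold
```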

Other useful commands

asv check          # Validate benchmark syntax

Writing Benchmarks

Benchmarks are Python classes with methods prefixed by:

time_* - Measure execution time
peakmem_* - Measure peak memory usage
mem_* - Measure memory usage of returned object
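Of the three prefixes, mem_* is the one whose return value matters: ASV measures the size of the object the method returns. A minimal sketch, using a hypothetical class that is not part of this directory:

```python
class MemExampleBenchmark:
    # Hypothetical benchmark class used only for illustration.

    def setup(self):
        # Called before each benchmark method.
        self.n_rows = 1000

    def mem_row_list(self):
        # The returned list is the object whose memory usage is measured.
        return list(range(self.n_rows))
```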

Example:

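def create_test_data(n_rows, option):
    # Placeholder helper: build the input for the benchmark
    return [option] * n_rows


def process(data):
    # Placeholder for the operation being measured
    return len(data)


class MyBenchmark:
    # Each benchmark runs once per combination of these values
    params = [[1000, 10000], ["option1", "option2"]]
    param_names = ["n_rows", "option"]

    def setup(self, n_rows, option):
        # Called before each benchmark method; not included in the measurement
        self.data = create_test_data(n_rows, option)

    def time_my_operation(self, n_rows, option):
        # Benchmark timing
        process(self.data)

    def peakmem_my_operation(self, n_rows, option):
        # Benchmark peak memory
        process(self.data)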
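With two parameter lists, ASV runs each benchmark method once for every combination of values, passing the same values to setup first. The sketch below enumerates those combinations with itertools.product; it illustrates which cases are generated and is not ASV's internal code.

```python
import itertools

params = [[1000, 10000], ["option1", "option2"]]
param_names = ["n_rows", "option"]

# Pair each parameter name with one value from its list, for every combination.
combinations = [
    dict(zip(param_names, values)) for values in itertools.product(*params)
]

for combo in combinations:
    print(combo)
```

Two values for n_rows and two for option give four cases per benchmark method, so adding values to either list multiplies the total run time.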

See the ASV documentation for more details.
