
apple metal support #690

Open
bung87 wants to merge 1 commit into mratsim:master from bung87:Silicon

Conversation

@bung87 (Contributor) commented Feb 8, 2026

Summary by CodeRabbit

  • New Features
    • Added Metal GPU backend for tensor acceleration on macOS devices.
    • Implemented GPU-accelerated matrix operations (GEMM, GEMV).
    • Enabled elementwise arithmetic, scalar operations, and BLAS Level 1 operations on GPU.
    • Added tensor conversion utilities between CPU and GPU memory.
    • Included GPU tensor reshaping, transposition, and broadcasting support.
    • Comprehensive test suite validating all Metal backend functionality.

coderabbitai bot commented Feb 8, 2026

📝 Walkthrough

This pull request introduces a complete Metal GPU backend for Arraymancer, enabling GPU-accelerated tensor operations on Apple devices. It adds Metal device/buffer management, GPU kernels for BLAS operations, tensor data types with CPU/Metal conversion routines, operator overloads, and a comprehensive test suite validating Metal tensor functionality.

Changes

  • Build Configuration (arraymancer.nimble, src/arraymancer/tensor/backend/global_config.nim): New Metal test task and Metal-specific configuration constants (METAL_THREADGROUP_SIZE=256, METAL_BUFFER_POOL_MAX_SIZE=1GB).
  • Core Metal Backend Infrastructure (src/arraymancer/tensor/backend/metal.nim, src/arraymancer/tensor/backend/metal/metal_backend.nim, src/arraymancer/tensor/backend/metal/metal_buffer.nim): Foundation layer providing Metal device initialization, command queue management, buffer pooling with LRU eviction, kernel caching, and comprehensive GPU memory and compute primitives.
  • Metal Kernels (src/arraymancer/tensor/backend/metal/metal_kernels.metal): Float32 GPU kernels for elementwise operations, GEMM variants (naive, transposed, batched), GEMV, vector dot, activations (relu, sigmoid, tanh), and contiguous copy with column-major layout support.
  • Tensor Type Definitions (src/arraymancer/tensor/tensor.nim, src/arraymancer/tensor/data_structure.nim): Introduction of MetalTensor, MetalStorage, and MetalBuffer types; integration into the AnyTensor union across all backend combinations.
  • Tensor Initialization & Conversion (src/arraymancer/tensor/init_metal.nim, src/arraymancer/tensor/init_copy_metal.nim, src/arraymancer/tensor/private/p_init_metal.nim): CPU↔Metal tensor conversion (metal(), cpu(), toMetal(), toCpu()), deep cloning, storage allocation, and data pointer accessors.
  • BLAS Operations Level 1 (src/arraymancer/tensor/operators_blas_l1_metal.nim): Vector dot product, elementwise addition/subtraction with optional broadcasting, scalar multiplication, in both in-place and out-of-place variants.
  • BLAS Operations Level 2/3 (src/arraymancer/tensor/operators_blas_l2l3_metal.nim): Matrix-vector (GEMV) and matrix-matrix (GEMM) multiplication via metalGemv/metalGemm, with contiguity validation and matrix/vector operator overloads.
  • Broadcasted Operators (src/arraymancer/tensor/operators_broadcasted_metal.nim): Elementwise broadcasted arithmetic (+., -., *., /.) for tensor-tensor, tensor-scalar, and scalar-tensor operations with both standard and in-place variants.
  • Tensor Utilities (src/arraymancer/tensor/shapeshifting_metal.nim, src/arraymancer/tensor/private/p_checks_metal.nim): Transpose, reshape, broadcast, contiguity checks, and GPU-accelerated asContiguous copy; runtime validation helpers for elementwise, dot product, matmat, and matvec operations.
  • Module Aggregation (src/arraymancer/tensor/tensor_metal.nim): Public re-export of all Metal backend modules (metal_backend, metal_buffer, init_metal, operators, shapeshifting) for convenient external access.
  • Test Suite (tests/tests_metal.nim, tests/tensor/test_*.nim for init, blas, accessors, broadcasting, shapeshifting): Comprehensive Metal backend validation covering tensor initialization, BLAS operations, broadcasting, slicing, transpose, reshape, contiguity, and round-trip CPU↔Metal conversions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Metal gleams! GPU dreams now flow, with kernels swift and buffers bright,
Tensors dance on Apple's stage, from CPU's hands to Metal's might.
Pool them, cache them, multiply with grace—
Arraymancer leaps to new GPU space! ✨

🚥 Pre-merge checks: 3 passed
  • Description Check ✅ Passed: Check skipped, CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title 'apple metal support' is directly related to the changeset, which comprehensively implements Metal backend support for GPU-accelerated tensor operations in Arraymancer.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



@gemini-code-assist

Summary of Changes

Hello @bung87, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands Arraymancer's capabilities by adding native support for Apple's Metal graphics API. This integration allows users on macOS with Metal-compatible hardware to perform tensor computations directly on their GPU, promising substantial performance improvements for machine learning and scientific computing workloads. The changes encompass a new backend, custom Metal kernels, and updates to the core tensor data structures and operations to seamlessly incorporate Metal-accelerated functionality.

Highlights

  • Apple Metal Backend Integration: Introduced a new Apple Metal backend for GPU-accelerated tensor operations, enabling Arraymancer to leverage Apple Silicon and other Metal-compatible GPUs on macOS.
  • Nim-Objective-C Bridge for Metal API: Implemented a bridge using Nim's .emit pragma to directly call Objective-C Metal APIs, facilitating low-level GPU control and resource management from Nim.
  • Metal Shading Language (MSL) Kernels: Added a comprehensive set of Metal Shading Language (MSL) kernels for various tensor operations, including element-wise arithmetic, scalar operations, BLAS Level 1 (dot product), BLAS Level 2/3 (GEMM, GEMV), activation functions (ReLU, Sigmoid, Tanh), and specialized copy operations.
  • MetalTensor Data Structure and Operations: Defined a new MetalTensor type and associated storage mechanisms (MetalStorage, MetalBuffer, MetalBufferPool) to manage GPU memory and tensor metadata. This includes functions for CPU-GPU data transfer, cloning, shapeshifting (transpose, reshape, broadcast), and contiguous memory management.
  • Build System and Type System Updates: Updated the arraymancer.nimble file to include a test_metal task and modified src/arraymancer/tensor.nim to conditionally import the Metal backend and extend the AnyTensor type to support MetalTensor alongside existing CudaTensor and ClTensor.
  • Float32 Support Only: The initial Metal backend implementation is limited to float32 data types for all GPU operations, with explicit checks raising ValueError for float64.


Changelog
  • arraymancer.nimble
    • Added a new test_metal task to run Metal backend specific tests.
  • src/arraymancer/tensor.nim
    • Conditionally imports tensor_metal when the metal define is present.
    • Updated AnyTensor type definition to include MetalTensor for various backend combinations (CUDA, OpenCL, Metal).
  • src/arraymancer/tensor/backend/global_config.nim
    • Added METAL_THREADGROUP_SIZE and METAL_BUFFER_POOL_MAX_SIZE constants for Metal-specific configuration.
  • src/arraymancer/tensor/backend/metal.nim
    • New file: Implements the core Metal backend logic, including Objective-C interoperability for Metal API calls.
    • Provides Nim wrappers for Metal device, command queue, buffer, library, and pipeline management.
    • Includes procedures for buffer creation, data transfer (upload/download), kernel execution (element-wise, GEMM, GEMV, scalar ops, contiguous copy, dot product).
    • Loads Metal kernels from metal_kernels.metal and manages a kernel cache.
    • Initializes a global MetalContext for resource management.
  • src/arraymancer/tensor/backend/metal/metal_buffer.nim
    • New file: Defines MetalBuffer and MetalBufferPool types for efficient GPU memory management.
    • Includes procedures for initializing, acquiring, releasing, and cleaning up Metal buffers.
  • src/arraymancer/tensor/backend/metal/metal_kernels.metal
    • New file: Contains Metal Shading Language (MSL) source code for various GPU kernels.
    • Includes kernels for element-wise operations (add, sub, mul, div), scalar operations (mul, add), GEMM (naive, transpose variants), batched GEMM (placeholder), vector operations (dot, scale, axpy), activation functions (relu, sigmoid, tanh), contiguous copy, and matrix-vector multiplication (GEMV).
  • src/arraymancer/tensor/data_structure.nim
    • Added MetalTensorRefTrackerObj, MetalTensorRefTracker, MetalStorage, and MetalTensor types to support Metal backend tensors.
    • Updated AnyTensor type to include MetalTensor in its definition, allowing for polymorphic tensor handling across backends.
  • src/arraymancer/tensor/init_copy_metal.nim
    • New file: Implements clone functionality for MetalTensor, performing a deep copy via CPU transfer.
  • src/arraymancer/tensor/init_metal.nim
    • New file: Provides conversion procedures between CPU Tensor and MetalTensor (metal, cpu, toMetal, toCpu).
    • Includes zeros_like and ones_like for MetalTensor.
  • src/arraymancer/tensor/operators_blas_l1_metal.nim
    • New file: Implements BLAS Level 1 operations (dot product, element-wise add/sub, scalar mul) for MetalTensor using Metal kernels.
  • src/arraymancer/tensor/operators_blas_l2l3_metal.nim
    • New file: Implements BLAS Level 2/3 operations (matrix-vector and matrix-matrix multiplication) for MetalTensor using Metal kernels.
  • src/arraymancer/tensor/operators_broadcasted_metal.nim
    • New file: Implements broadcasted element-wise operations (+., -., *., /.) and their in-place variants for MetalTensor.
  • src/arraymancer/tensor/private/p_checks_metal.nim
    • New file: Adds runtime checks for MetalTensor operations, ensuring shape and rank compatibility.
  • src/arraymancer/tensor/private/p_init_metal.nim
    • New file: Internal procedures for initializing MetalStorage and MetalTensor.
  • src/arraymancer/tensor/shapeshifting_metal.nim
    • New file: Implements transpose, isContiguous, asContiguous, reshape, and broadcast for MetalTensor.
  • src/arraymancer/tensor/tensor_metal.nim
    • New file: Central module that imports and exports all Metal-related functionalities.
  • tests/tensor/test_accessors_slicer_metal.nim
    • New file: Adds unit tests for MetalTensor accessors and slicing operations.
  • tests/tensor/test_broadcasting_metal.nim
    • New file: Adds unit tests for MetalTensor broadcasting and element-wise operations.
  • tests/tensor/test_init_metal.nim
    • New file: Adds unit tests for MetalTensor initialization and cloning.
  • tests/tensor/test_operators_blas_metal.nim
    • New file: Adds unit tests for MetalTensor BLAS Level 1, 2, and 3 operations.
  • tests/tensor/test_shapeshifting_metal.nim
    • New file: Adds unit tests for MetalTensor shapeshifting operations (transpose, reshape, contiguous checks).
  • tests/tests_metal.nim
    • New file: Aggregates all Metal-specific unit tests.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant new feature: a backend for Apple's Metal framework, enabling GPU acceleration on macOS and iOS devices. While the overall structure is sound, several security-critical issues were identified, including missing bounds checks in buffer transfer functions, potential integer overflows in GPU kernels for large tensors, and a GPU memory leak due to missing destructors. Additionally, the implementation has bugs in the buffer pooling logic and areas for performance and maintainability improvements. Addressing these points is crucial for the stability, robustness, and security of the new Metal backend.

Comment on lines +26 to +31
MetalBufferObj* = object
  buffer*: MTLBuffer
  length*: int
  devicePtr*: pointer

MetalBuffer* = ref MetalBufferObj


critical

CRITICAL: There is a memory leak. MetalBufferObj wraps an MTLBuffer pointer which is a reference-counted Objective-C object. The MTLBuffer is created with a +1 retain count but is never released. This will cause memory to leak every time a MetalBuffer is created and garbage collected.

You need to add a finalizer (=destroy) to MetalBufferObj that releases the underlying MTLBuffer object. This requires a helper function to call the Objective-C release method.

  MetalBuffer* = ref MetalBufferObj

{.emit: """
#import <Foundation/Foundation.h>
void arraymancer_release(void* obj) { if (obj) [(id)obj release]; }
""".}
proc release_object(obj: pointer) {.importc: "arraymancer_release", nodecl.}

when NimMajor == 1:
  proc `=destroy`*(p: var MetalBufferObj) {.noSideEffect.}=
    if p.buffer != nil:
      release_object(p.buffer)
      p.buffer = nil
else:
  proc `=destroy`*(p: MetalBufferObj) {.noSideEffect.}=
    if p.buffer != nil:
      release_object(p.buffer)
      p.buffer = nil

Comment on lines +236 to +240
proc uploadToBuffer*(mb: MetalBuffer, data: pointer, length: int) =
  copyToBuffer(mb.buffer, data, csize_t(length))

proc downloadFromBuffer*(mb: MetalBuffer, data: pointer, length: int) =
  copyFromBuffer(mb.buffer, data, csize_t(length))


security-high

The uploadToBuffer and downloadFromBuffer functions lack bounds checking on the length parameter. Since these functions are exported and pass the length directly to a memcpy operation in the underlying Objective-C code, providing a length greater than the buffer's allocated size will result in a heap buffer overflow.

Recommendation: Add a check to ensure that the provided length does not exceed mb.length before performing the copy.

proc uploadToBuffer*(mb: MetalBuffer, data: pointer, length: int) =
  if length > mb.length:
    raise newException(ValueError, "Length exceeds buffer size")
  copyToBuffer(mb.buffer, data, csize_t(length))

proc downloadFromBuffer*(mb: MetalBuffer, data: pointer, length: int) =
  if length > mb.length:
    raise newException(ValueError, "Length exceeds buffer size")
  copyFromBuffer(mb.buffer, data, csize_t(length))

Comment on lines +328 to +334
int idx = (int)gid;
int src_offset = 0;

for (int dim = rank - 1; dim >= 0; dim--) {
    int coord = idx % src_shape[dim];
    src_offset += coord * src_strides[dim];
    idx /= src_shape[dim];


security-high

The Metal kernels use 32-bit signed integers (int) for calculating memory offsets and indices. For large tensors (e.g., those with more than 2^31 elements), these calculations will overflow, leading to incorrect memory access and potential out-of-bounds reads or writes on the GPU. This is particularly evident in the contiguous_copy_f32 kernel where src_offset is calculated using 32-bit arithmetic.

Recommendation: Use 64-bit integers (e.g., uint64_t or size_t) for all index and offset calculations within the Metal kernels to support large tensors safely.

Comment on lines +56 to +60
for i in 0 ..< pool.entries.len:
  if pool.entries[i].size >= size and pool.entries[i].buffer != nil:
    result = pool.entries[i].buffer
    pool.entries[i].lastUsed = pool.accessCounter
    return


high

There's a bug in the buffer pooling logic. When a buffer is acquired from the pool, it is not removed from the entries sequence. This means the same buffer can be acquired multiple times, leading to data corruption if used concurrently.

Additionally, the device parameter is unused, and if no suitable buffer is found, the procedure returns nil, which will likely cause a crash downstream. The procedure should probably create a new buffer if one cannot be reused.

I'll suggest fixing the most immediate issue of not removing the buffer from the pool.

  for i in 0 ..< pool.entries.len:
    if pool.entries[i].size >= size and pool.entries[i].buffer != nil:
      result = pool.entries[i].buffer
      pool.currentSize -= pool.entries[i].size
      pool.entries.del(i)
      return

Comment on lines +97 to +100
MetalTensorRefTrackerObj*[T: SomeFloat] = object
value*: MetalBuffer

MetalTensorRefTracker*[T] = ref MetalTensorRefTrackerObj[T]


security-medium

The MetalTensor implementation lacks a destructor to release the underlying GPU resources. While CudaTensor has a =destroy hook to call cudaFree, MetalTensor does not have a corresponding hook to release the MTLBuffer Objective-C object. This leads to a GPU memory leak every time a MetalTensor is deallocated by the Nim garbage collector, which can eventually lead to a Denial of Service (DoS) due to resource exhaustion.

Recommendation: Implement a =destroy hook for MetalTensorRefTrackerObj or MetalBufferObj that calls an Objective-C function to release the MTLBuffer using [buffer release].

task test_metal, "Run all Metal backend tests":
  var switches = " -d:metal"
  switches.add " -d:blas=cblas" # Archlinux, comment out on Debian/Ubuntu
  test "tests_metal", switches, split = false


medium

For consistency with other GPU backends (CUDA, OpenCL) and to ensure compatibility with Objective-C code via C++, it's recommended to specify "cpp" as the compilation language for this test task.

  test "tests_metal", switches, split = false, "cpp"

var sizeVar = size

# Create a buffer for partial results (one per threadgroup)
let threadgroupSize = 256


medium

The threadgroupSize is hardcoded to 256. This is fragile because the corresponding Metal kernel vector_dot_f32 also has a hardcoded size for its threadgroup memory. If these values get out of sync, it will lead to crashes or incorrect behavior.

It's better to use the METAL_THREADGROUP_SIZE constant defined in src/arraymancer/tensor/backend/global_config.nim to ensure consistency. You'll need to import ../global_config.

    let threadgroupSize = METAL_THREADGROUP_SIZE

Comment on lines +209 to +211
for (int k = 0; k < K; k++) {
    sum += A[batchOffsetA + row * K + k] * B[batchOffsetB + k * N + col];
}


medium

The indexing used here (row * K + k and k * N + col) assumes a row-major memory layout for the matrices within each batch. However, other GEMM kernels in this file (gemm_naive_f32, etc.) use a column-major layout, which is standard for BLAS compatibility.

This inconsistency can lead to bugs in the future when this kernel is used. It should be changed to use column-major indexing to match the other kernels.

        for (int k = 0; k < K; k++) {
            // Column-major indexing: A[row, k] is at k*M+row, B[k, col] is at col*K+k
            sum += A[batchOffsetA + k * M + row] * B[batchOffsetB + col * K + k];
        }

uint lid [[thread_position_in_threadgroup]],
uint groupSize [[threads_per_threadgroup]])
{
    threadgroup float localSum[256];


medium

The size of the localSum array for threadgroup memory is hardcoded to 256. This is fragile. If the dispatching code in Nim ever uses a different threadgroup size, this will cause a buffer overflow within the threadgroup or incorrect results.

While Metal requires this size to be a compile-time constant, you should ensure it's tied to a single source of truth. The Nim code also hardcodes 256. Consider adding a comment here to note the dependency on the dispatch size being exactly 256.

Comment on lines +135 to +151
proc `-.`*[T: SomeFloat](val: T, t: MetalTensor[T]): MetalTensor[T] {.noinit.} =
  ## Broadcasted subtraction for scalar - tensor.
  ## Implemented as -(tensor - scalar)
  result = newMetalTensor[T](t.shape)
  metalScalarAdd[T](
    t.storage.Fbuffer,
    result.storage.Fbuffer,
    -val,
    t.storage.Flen
  )
  # Negate the result
  metalScalarMul[T](
    result.storage.Fbuffer,
    result.storage.Fbuffer,
    T(-1),
    t.storage.Flen
  )


medium

The implementation of val - t (scalar-tensor subtraction) currently uses two separate kernel launches: one for scalar_add (with a negative value) and another for scalar_mul to negate the result. This is inefficient.

A more performant approach would be to create a dedicated Metal kernel scalar_sub_from_f32 that computes C[gid] = scalar - A[gid] in a single pass. This would halve the number of kernel launches and GPU synchronization for this operation.

@coderabbitai bot left a comment


Actionable comments posted: 15

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim`:
- Around line 418-419: The local variable sizeVar is declared as Nim int (64-bit
on x86_64) but the Metal kernel expects a 4-byte integer; change sizeVar to an
explicit int32 and initialize it from size (e.g. sizeVar: int32 = int32(size)),
and pass sizeof(int32) to executeElementwiseKernel so the pointer/size match
what the kernel expects; follow the same pattern used in executeGemmKernel to
ensure all kernel constant parameters use int32 consistently.
- Around line 356-398: The metalGemv proc currently ignores trans, incx and incy
and also passes host ints to the GPU with the wrong size; update metalGemv to
validate/handle these parameters: if trans is true either select/implement a
transposed kernel (e.g. getKernel("gemv_f32_trans") and dispatch with swapped
M/N and appropriate index mapping) or explicitly raise a ValueError when
trans==true; likewise raise a ValueError when incx != 1 or incy != 1 (or
implement strided support in the kernel) so callers aren’t silently wrong;
finally fix the int size mismatch by converting m and n to int32 (or Int32)
before calling setBytes and use csize_t(sizeof(int32)) when passing their sizes
(same for any other ints sent to the GPU like lda), and update references to
addr m/addr n to point at the int32 variables.
- Around line 198-202: The try/except block around initMetalContext currently
uses a bare except which catches non-recoverable Defect errors; change it to
catch only recoverable exceptions by using "except CatchableError:" (optionally
"except CatchableError as e:" if you want to log e) so initMetalContext()
failures are handled safely without swallowing defects; update the block in
metal_backend.nim where initMetalContext() is called to use CatchableError and
add an optional log of the exception before returning false.
- Around line 344-351: The current dispatch logic in the block setting
kernelName using transA/transB incorrectly falls through to "gemm_naive_f32"
when both transA and transB are true; update the logic in the kernelName
assignment (the variables transA, transB and kernelName) to explicitly handle
the both-true case: either set kernelName to "gemm_ab_transpose_f32" if you
add/implement that kernel, or throw/raise a clear runtime error indicating the
transA+transB combination is unsupported so it doesn't silently use the
non-transposed kernel.
- Around line 423-450: The metalScalarMul proc currently unconditionally picks
the "scalar_mul_f32" kernel and casts scalar with float32(scalar), which will
silently truncate float64 callers; wrap the body of metalScalarMul in a
compile-time type guard like "when T is float32: <existing body> else: raise
newException(ValueError, \"metalScalarMul only supports float32\")" so calls
with T=float64 fail early; apply the same pattern to metalScalarAdd (guarding
against using the wrong kernel and the float32(scalar) cast) and ensure the
raised error type/message matches project conventions.
- Around line 532-572: The MetalBuffer ref wrapper leaks GPU memory because
MetalBufferObj lacks a destructor; add a destructor/finalize proc to
MetalBufferObj in metal_buffer.nim that releases the underlying ObjC MTLBuffer
(the field used by createMetalBuffer/createBuffer, e.g. the mtlBuffer or buffer
field) when the MetalBuffer ref is garbage-collected; ensure the destructor
calls the ObjC release/`release()` (or sets the Swift/ObjC reference to nil as
appropriate) and is attached to the MetalBufferObj type so temporary buffers
like partialBuffer in metalDot and shapeBuffer/stridesBuffer in
metalContiguousCopy are freed.
- Around line 309-319: The code is sending 64-bit Nim ints to a Metal kernel
that expects 32-bit ints (gemm_naive_f32 parameters M/N/K); change the local
vars and setBytes size to use 32-bit integers: declare m, n, k as int32 (e.g.
var m: int32 = int32(M), n: int32 = int32(N), k: int32 = int32(K)) and call
setBytes(encoder, addr m, csize_t(sizeof(int32)), 3) (and similarly for n/k),
leaving the float32 alpha/beta handling unchanged; update the three setBytes
calls that reference sizeof(int) to sizeof(int32) so the Metal kernel receives
4-byte integers.

In `@src/arraymancer/tensor/backend/metal/metal_buffer.nim`:
- Around line 53-62: The acquireBuffer procedure returns a matching MetalBuffer
but leaves the entry in MetalBufferPool.entries, allowing the same buffer to be
handed out multiple times; modify proc acquireBuffer*(pool: var MetalBufferPool,
size: int, device: MTLDevice) to remove the chosen entry from pool.entries (and
decrement pool.currentSize) when you take its buffer: capture
pool.entries[i].buffer into result, update
pool.entries[i].lastUsed/accessCounter as needed, then delete that entry from
pool.entries (or swap-remove) and decrement pool.currentSize before returning;
ensure you still return nil when none found and preserve accessCounter
semantics.

In `@src/arraymancer/tensor/backend/metal/metal_kernels.metal`:
- Around line 187-214: The batched_gemm_f32 kernel uses row-major indexing but
the library expects column-major; update A, B and C accesses to column-major
using the same pattern as gemm_naive_f32: replace A[batchOffsetA + row * K + k]
with A[batchOffsetA + k * M + row], replace B[batchOffsetB + k * N + col] with
B[batchOffsetB + col * K + k], and replace C[batchOffsetC + row * N + col] with
C[batchOffsetC + col * M + row] (keep batchOffsetA/B/C, loop over k, and
alpha/beta logic unchanged).
- Around line 220-249: The kernel vector_dot_f32 uses gid
(thread_position_in_grid) but strides by groupSize (threads_per_threadgroup),
causing overlapping work across threadgroups and races; change the loop to
stride by the total grid size (totalThreads) instead of groupSize and write
partial sums to a per-threadgroup reduction slot using an array sized to the
actual threadgroup size (replace hardcoded localSum[256] with a size based on
METAL_THREADGROUP_SIZE or the runtime groupSize), then perform the threadgroup
reduction (using localSum, lid and threadgroup_barrier) and have the lane 0
write the group's result to result[gid / groupSize] (or use group index to
compute output slot) so each threadgroup reduces a distinct chunk. Ensure
references in code: vector_dot_f32, localSum, gid, lid, groupSize,
threadgroup_barrier and align the local buffer size with METAL_THREADGROUP_SIZE.

In `@src/arraymancer/tensor/operators_blas_l2l3_metal.nim`:
- Around line 26-42: The function metalMV_y_eq_aAx_p_by currently only verifies
a.isContiguous but always passes incx = 1 to metalGemv which is unsafe for
non-contiguous x; update metalMV_y_eq_aAx_p_by to validate x contiguity (e.g.,
check x.isContiguous) and if not contiguous either raise a ValueError similar to
the a check or create a contiguous temporary buffer (copy x.storage.Fbuffer into
a contiguous buffer) and pass that buffer with incx = 1 to metalGemv; ensure
references to x.storage.Fbuffer and the chosen incx value are adjusted and that
any temporary buffer is kept alive for the call.

In `@src/arraymancer/tensor/operators_broadcasted_metal.nim`:
- Around line 153-158: The file is missing an import of init_metal which
provides the toCpu() and metal() procs used in the broadcasted scalar-by-tensor
division proc `/.`*[T: SomeFloat](val: T, t: MetalTensor[T]); add `import
init_metal` (or the module that re-exports toCpu/metal) at the top of
operators_broadcasted_metal.nim so the calls to t.toCpu() and cpuResult.metal()
resolve when this file is compiled independently, and add a TODO comment inside
`/.`* to implement a native Metal kernel for scalar-tensor division to avoid the
CPU fallback performance penalty.

In `@src/arraymancer/tensor/private/p_init_metal.nim`:
- Around line 19-23: Add a destructor for MetalTensorRefTrackerObj analogous to
the CUDA backend so Metal buffers are freed when the tracker is GC'd: implement
a =destroy proc for MetalTensorRefTrackerObj that retrieves the stored buffer
from the ref tracker (the same field used by newMetalStorage:
MetalStorage.Fref_tracking.value) and releases it via the Metal buffer release
routine (or call the existing releaseBuffer/cleanupPool helper). Place the
=destroy alongside other data_structure destructors so garbage collection
invokes it and prevents GPU memory leaks.

In `@src/arraymancer/tensor/shapeshifting_metal.nim`:
- Around line 84-91: The MetalTensor.reshape* implementation currently calls
reshape_no_copy directly and copies storage/cpuData without verifying
contiguity; add the same contiguity guard used by the CPU tensor: check
t.is_C_contiguous before calling reshape_no_copy
and if not contiguous perform a safe copy/reshape path (i.e., allocate a
contiguous result and copy data) so reshape_no_copy is only invoked for
rowMajor-contiguous tensors; update reshape*(t: MetalTensor, new_shape:
varargs[int]) to mirror the CPU path around reshape_no_copy, preserving
result.storage and result.cpuData only after the contiguity-safe branch.
- Around line 60-82: The GPU kernel dispatch length is using t.storage.Flen
which is the underlying buffer length and can exceed the logical tensor size; in
asContiguous (MetalTensor) pass the logical output element count instead (use
result.size or the product of shapeSeq) to metalContiguousCopy so threads match
the allocated contiguous buffer; update the metalContiguousCopy call to replace
t.storage.Flen with result.size (or product(shapeSeq)) ensuring safe GPU thread
dispatch.
🟡 Minor comments (6)
tests/tensor/test_init_metal.nim-25-28 (1)

25-28: ⚠️ Potential issue | 🟡 Minor

Minor: comment says type "int" but the tensor was converted to float on line 23.

The inline comment describes the tensor as type "int" but asType(float) was applied before toMetal. This is cosmetic but could confuse future readers.

src/arraymancer/tensor/init_copy_metal.nim-15-15 (1)

15-15: ⚠️ Potential issue | 🟡 Minor

Remove unused import ./private/p_init_metal.

The imported symbols from p_init_metal (newMetalTensor, newMetalStorage, tensorMetal, get_data_ptr) are not used in this file. The clone proc only calls toCpu(), init_copy_cpu.clone(), and metal(), which are provided by other imports.

arraymancer.nimble-222-225 (1)

222-225: ⚠️ Potential issue | 🟡 Minor

The -d:blas=cblas comment is misleading for macOS-only Metal backend.

The comment "Archlinux, comment out on Debian/Ubuntu" is copied from the CUDA/OpenCL tasks but is confusing for Metal, which only runs on macOS. On macOS, the standard BLAS implementation is the Accelerate framework, so -d:blas=cblas may not be appropriate for this platform. Update the comment to be macOS-specific or remove the flag if Accelerate should be used instead.

Regarding the missing lang parameter: unlike test_cuda and test_opencl which use lang="cpp", the Metal task has no language parameter. This is correct because Metal's compilation mode is handled via the {.passC: "-x objective-c".} pragma in metal_backend.nim, which specifies Objective-C compilation directly. The missing parameter does not indicate a problem.

src/arraymancer/tensor/operators_blas_l1_metal.nim-58-58 (1)

58-58: ⚠️ Potential issue | 🟡 Minor

Typo: "substraction" → "subtraction".

Lines 58 and 71 both have "substraction" in the doc comments.

Also applies to: 71-71

src/arraymancer/tensor/operators_blas_l1_metal.nim-94-103 (1)

94-103: ⚠️ Potential issue | 🟡 Minor

Missing MetalTensor * scalar overload.

Only scalar * tensor is defined. Users will expect commutativity (tensor * 2.0'f32), which will fail to compile. The CPU backend likely provides both directions.

Proposed fix
+proc `*`*[T: SomeFloat](t: MetalTensor[T], a: T): MetalTensor[T] {.noinit, inline.} =
+  ## MetalTensor multiplication by a scalar (scalar on the right)
+  result = a * t
src/arraymancer/tensor/backend/metal/metal_buffer.nim-64-87 (1)

64-87: ⚠️ Potential issue | 🟡 Minor

releaseBuffer may still exceed maxSize after evicting only one entry.

If the single eviction isn't enough to bring currentSize + buffer.length under maxSize, the buffer is added anyway. Consider evicting in a loop until there's room, or rejecting the release.

Proposed fix — evict in a loop
 proc releaseBuffer*(pool: var MetalBufferPool, buffer: MetalBuffer) =
   if buffer == nil:
     return
 
   pool.accessCounter.inc
 
-  if pool.currentSize + buffer.length > pool.maxSize:
+  while pool.currentSize + buffer.length > pool.maxSize and pool.entries.len > 0:
     var oldestIdx = -1
     var oldestAccess = int64.high
     for i in 0 ..< pool.entries.len:
       if pool.entries[i].lastUsed < oldestAccess:
         oldestAccess = pool.entries[i].lastUsed
         oldestIdx = i
 
     if oldestIdx >= 0:
       pool.currentSize -= pool.entries[oldestIdx].size
       pool.entries.del(oldestIdx)
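To make the evict-until-it-fits semantics concrete, here is a small Python model of the proposed loop (field names loosely mirror the Nim pool, but this is a sketch, not the Arraymancer code):

```python
# CPU model of the evict-in-a-loop behavior proposed for releaseBuffer.
# Each pool entry is (size, last_used); oldest entries are evicted
# until the incoming buffer fits under max_size.
def release(pool, max_size, new_size, counter):
    while pool["current"] + new_size > max_size and pool["entries"]:
        oldest = min(range(len(pool["entries"])),
                     key=lambda i: pool["entries"][i][1])
        pool["current"] -= pool["entries"][oldest][0]
        del pool["entries"][oldest]
    pool["entries"].append((new_size, counter))
    pool["current"] += new_size

pool = {"entries": [(400, 1), (300, 2)], "current": 700}
release(pool, max_size=800, new_size=600, counter=3)
print(pool["current"], len(pool["entries"]))  # → 600 1
```

With a single `if` instead of the `while`, only the (400, 1) entry would be evicted and the pool would end at 900 bytes, over its 800-byte cap — exactly the overflow described above.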
🧹 Nitpick comments (9)
src/arraymancer/tensor/backend/global_config.nim (1)

35-37: Metal constants are unconditionally defined, consistent with existing CUDA/OMP pattern — looks good.

One concern: METAL_THREADGROUP_SIZE = 256 must stay in sync with the hardcoded threadgroup float localSum[256] in metal_kernels.metal (line 229). Since the .metal file is compiled by the Metal shader compiler, this Nim constant can't flow into it automatically. Consider adding a comment here noting the coupling, or generating the kernel source with the constant substituted.

src/arraymancer/tensor/tensor_metal.nim (1)

16-32: Consider adding a when defined(metal) guard for consistency.

backend/metal.nim wraps its body in when defined(metal), but this aggregator module does not. If someone imports tensor_metal directly without -d:metal, the Metal-specific sub-module imports will fail at compile time with potentially confusing errors. Adding the guard here would make the failure message clearer and stay consistent with the rest of the backend.

tests/tensor/test_operators_blas_metal.nim (1)

86-93: Consider adding a correctness check for the large matrix test.

Currently this only verifies shape and no crash. Since randomTensor is seeded deterministically, you could compare against the CPU result to catch numerical issues in the Metal GEMM kernel at scale:

let cpu_result = a.toCpu() * b.toCpu()
check: result.toCpu() == cpu_result

This would catch precision or tiling bugs that only manifest with larger matrices.

src/arraymancer/tensor/backend/metal.nim (1)

23-29: Remove redundant explicit re-exports.

The export metal_buffer, metal_backend statements already re-export all public symbols from those modules. The explicit export MetalBuffer, MetalBufferPool, MTLDevice, MTLCommandQueue, MTLCommandBuffer, MTLBuffer, MetalContext, metalContext, initMetalContext, isMetalAvailable lines are redundant and can be removed.

src/arraymancer/tensor/private/p_init_metal.nim (1)

19-21: No validation on length parameter.

If length <= 0, createMetalBuffer receives a zero or negative size. Consider guarding against this.

Proposed fix
 proc newMetalStorage*[T: SomeFloat](length: int): MetalStorage[T] =
+  assert length > 0, "MetalStorage length must be positive"
   result.Flen = length
   result.Fbuffer = createMetalBuffer(length * sizeof(T))
src/arraymancer/tensor/init_metal.nim (1)

59-65: zeros_like / ones_like round-trip through CPU unnecessarily.

Both procs allocate a CPU tensor, fill it, then upload to Metal. For zeros_like, a more efficient approach would be to allocate a Metal buffer and zero it directly on the GPU (e.g., via a blit encoder fill command). Fine for an initial implementation but worth a TODO.

src/arraymancer/tensor/backend/metal/metal_buffer.nim (1)

15-25: All Metal Objective-C types are opaque pointer aliases — no type safety.

Using pointer for MTLDevice, MTLBuffer, etc. means any pointer can be passed where a specific Metal type is expected. Consider using Nim distinct pointer types to get compile-time safety.

src/arraymancer/tensor/backend/metal/metal_backend.nim (2)

26-34: Prefer MTLCreateSystemDefaultDevice() over MTLCopyAllDevices().

MTLCopyAllDevices() is macOS-only and unnecessary when targeting Apple Silicon (single GPU). Using MTLCreateSystemDefaultDevice() is simpler, avoids the array memory management, and is also available on iOS/iPadOS if you ever want to expand platform support.

Additionally, the current code has a subtle memory-management concern: after [devices release], the returned device reference is only valid because Metal devices are system singletons. Using MTLCreateSystemDefaultDevice() eliminates this concern entirely.

Suggested fix
 id<MTLDevice> arraymancer_getDefaultDevice() {
-  NSArray<id<MTLDevice>> *devices = MTLCopyAllDevices();
-  if (devices && [devices count] > 0) {
-    id<MTLDevice> device = [devices objectAtIndex:0];
-    [devices release];
-    return device;
-  }
-  return nil;
+  return MTLCreateSystemDefaultDevice();
 }

255-286: Consider extracting the repeated command-buffer boilerplate into a shared helper.

Every kernel dispatch (elementwise, GEMM, GEMV, scalar, contiguous copy, dot) repeats the same init-check → getKernel → createCommandBuffer → createComputeEncoder → setComputePipeline → … → endEncoding → commit → waitForCompletion pattern. A withComputeEncoder or similar higher-order helper would reduce duplication and centralize error handling.

Comment on lines +198 to +202
try:
  initMetalContext()
  return true
except:
  return false

🛠️ Refactor suggestion | 🟠 Major

Bare except: catches Defect (non-recoverable errors); use except CatchableError: instead.

In Nim, bare except: catches everything including Defect types (e.g., AssertionDefect, IndexDefect), which represent programming errors that should not be silently swallowed. Use except CatchableError: to only catch recoverable errors.

Suggested fix
   try:
     initMetalContext()
     return true
-  except:
+  except CatchableError:
     return false
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim` around lines 198 -
202, The try/except block around initMetalContext currently uses a bare except
which catches non-recoverable Defect errors; change it to catch only recoverable
exceptions by using "except CatchableError:" (optionally "except CatchableError
as e:" if you want to log e) so initMetalContext() failures are handled safely
without swallowing defects; update the block in metal_backend.nim where
initMetalContext() is called to use CatchableError and add an optional log of
the exception before returning false.

Comment on lines +309 to +319
# Set constants
var m = M
var n = N
var k = K
var a = alpha
var b = beta
setBytes(encoder, addr m, csize_t(sizeof(int)), 3)
setBytes(encoder, addr n, csize_t(sizeof(int)), 4)
setBytes(encoder, addr k, csize_t(sizeof(int)), 5)
setBytes(encoder, addr a, csize_t(sizeof(float32)), 6)
setBytes(encoder, addr b, csize_t(sizeof(float32)), 7)

⚠️ Potential issue | 🔴 Critical



Critical: Type size mismatch — Nim int is 8 bytes but Metal kernels expect 4-byte integers.

The Metal kernel gemm_naive_f32 declares parameters as constant int& M, constant int& N, and constant int& K (32-bit in Metal Shading Language). However, the Nim code sends 8-byte values via setBytes(encoder, addr m, csize_t(sizeof(int)), 3) because Nim's int is 64-bit on 64-bit platforms. Writing 8 bytes into a 4-byte parameter slot misaligns the argument data the kernel reads, corrupting these and subsequent parameters and yielding wrong results.

Suggested fix
   # Set constants
-  var m = M
-  var n = N
-  var k = K
+  var m = int32(M)
+  var n = int32(N)
+  var k = int32(K)
   var a = alpha
   var b = beta
-  setBytes(encoder, addr m, csize_t(sizeof(int)), 3)
-  setBytes(encoder, addr n, csize_t(sizeof(int)), 4)
-  setBytes(encoder, addr k, csize_t(sizeof(int)), 5)
+  setBytes(encoder, addr m, csize_t(sizeof(int32)), 3)
+  setBytes(encoder, addr n, csize_t(sizeof(int32)), 4)
+  setBytes(encoder, addr k, csize_t(sizeof(int32)), 5)
   setBytes(encoder, addr a, csize_t(sizeof(float32)), 6)
   setBytes(encoder, addr b, csize_t(sizeof(float32)), 7)
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim` around lines 309 -
319, The code is sending 64-bit Nim ints to a Metal kernel that expects 32-bit
ints (gemm_naive_f32 parameters M/N/K); change the local vars and setBytes size
to use 32-bit integers: declare m, n, k as int32 (e.g. var m: int32 = int32(M),
n: int32 = int32(N), k: int32 = int32(K)) and call setBytes(encoder, addr m,
csize_t(sizeof(int32)), 3) (and similarly for n/k), leaving the float32
alpha/beta handling unchanged; update the three setBytes calls that reference
sizeof(int) to sizeof(int32) so the Metal kernel receives 4-byte integers.
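The byte-level consequence of the mismatch can be demonstrated on the host with Python's `struct` module (values here are illustrative; the actual corruption happens in the Metal argument table):

```python
import struct

# Why sending 8-byte host integers into 4-byte kernel slots is wrong.
M, N = 3, 5

# What the Nim code currently sends for M: 8 little-endian bytes.
eight = struct.pack("<q", M)
# What a `constant int&` slot holds: only the first 4 bytes.
m_seen = struct.unpack("<i", eight[:4])[0]
print(m_seen)  # → 3 (looks fine for small positive values...)

# ...but the overall layouts diverge: two int32 values occupy the same
# 8 bytes as a single int64, so every following parameter is shifted.
print(struct.pack("<ii", M, N) == struct.pack("<qq", M, N)[:8])  # → False
```

This is why the fix converts every scalar kernel argument to `int32` and passes `sizeof(int32)` to `setBytes` rather than relying on small values happening to decode correctly.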

Comment on lines +344 to +351
var kernelName: string
if transA and not transB:
  kernelName = "gemm_a_transpose_f32"
elif not transA and transB:
  kernelName = "gemm_b_transpose_f32"
else:
  kernelName = "gemm_naive_f32"


⚠️ Potential issue | 🔴 Critical

Bug: Both-transposed GEMM case silently falls through to the no-transpose kernel.

When both transA and transB are true, the else branch dispatches gemm_naive_f32 (the non-transposed kernel), producing incorrect results. Either add a dedicated gemm_ab_transpose_f32 kernel or raise an error for the unsupported combination.

Suggested fix (error for unsupported case)
     if transA and not transB:
       kernelName = "gemm_a_transpose_f32"
     elif not transA and transB:
       kernelName = "gemm_b_transpose_f32"
-    else:
+    elif not transA and not transB:
       kernelName = "gemm_naive_f32"
+    else:
+      raise newException(ValueError, "Metal GEMM with both A and B transposed is not yet supported")
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim` around lines 344 -
351, The current dispatch logic in the block setting kernelName using
transA/transB incorrectly falls through to "gemm_naive_f32" when both transA and
transB are true; update the logic in the kernelName assignment (the variables
transA, transB and kernelName) to explicitly handle the both-true case: either
set kernelName to "gemm_ab_transpose_f32" if you add/implement that kernel, or
throw/raise a clear runtime error indicating the transA+transB combination is
unsupported so it doesn't silently use the non-transposed kernel.

Comment on lines +356 to +398
proc metalGemv*[T: SomeFloat](
    trans: bool,
    M, N: int,
    alpha: T,
    A: MetalBuffer, lda: int,
    x: MetalBuffer, incx: int,
    beta: T,
    y: MetalBuffer, incy: int
  ) =
  ## Matrix-Vector multiplication: y = alpha * A * x + beta * y
  ## Column-major layout (Fortran style)
  if not metalContext.initialized:
    initMetalContext()

  # Metal only supports float32, not float64
  when T is float32:
    var m = M
    var n = N
    var a = float32(alpha)
    var b = float32(beta)

    let pipeline = getKernel("gemv_f32")
    let commandBuffer = createCommandBuffer(metalContext.commandQueue)
    let encoder = createComputeEncoder(commandBuffer)

    setComputePipeline(encoder, pipeline)
    setBuffer(encoder, A.buffer, 0)
    setBuffer(encoder, x.buffer, 1)
    setBuffer(encoder, y.buffer, 2)
    setBytes(encoder, addr m, csize_t(sizeof(int)), 3)
    setBytes(encoder, addr n, csize_t(sizeof(int)), 4)
    setBytes(encoder, addr a, csize_t(sizeof(float32)), 5)
    setBytes(encoder, addr b, csize_t(sizeof(float32)), 6)

    # Dispatch threads - one thread per output element (M rows)
    let threadgroupSize = 256
    dispatchThreads1D(encoder, csize_t(M), csize_t(threadgroupSize))

    endEncoding(encoder)
    commitCommandBuffer(commandBuffer)
    waitForCompletion(commandBuffer)
  else:
    raise newException(ValueError, "Metal only supports float32, not float64")

⚠️ Potential issue | 🔴 Critical

trans, incx, and incy parameters are accepted but completely ignored.

  1. The trans flag is never used — the kernel always computes y = α·A·x + β·y regardless of whether a transposed multiply was requested. Callers expecting A^T * x will get wrong results silently.
  2. incx and incy (stride parameters) are also ignored — non-unit increments will produce incorrect results.
  3. Same int vs int32 type-size mismatch as in executeGemmKernel (lines 385-386).

At minimum, either implement the transposed path (separate kernel or index remapping) or raise an error when trans=true / incx != 1 / incy != 1 so callers don't silently get wrong results.

🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim` around lines 356 -
398, The metalGemv proc currently ignores trans, incx and incy and also passes
host ints to the GPU with the wrong size; update metalGemv to validate/handle
these parameters: if trans is true either select/implement a transposed kernel
(e.g. getKernel("gemv_f32_trans") and dispatch with swapped M/N and appropriate
index mapping) or explicitly raise a ValueError when trans==true; likewise raise
a ValueError when incx != 1 or incy != 1 (or implement strided support in the
kernel) so callers aren’t silently wrong; finally fix the int size mismatch by
converting m and n to int32 (or Int32) before calling setBytes and use
csize_t(sizeof(int32)) when passing their sizes (same for any other ints sent to
the GPU like lda), and update references to addr m/addr n to point at the int32
variables.
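As a reference for what the full GEMV contract requires, here is a Python model of column-major gemv honoring `trans`, `incx`, and `incy` (a sketch of the BLAS semantics, not the fix itself — the real change belongs in the kernel or in input validation):

```python
# Reference model of column-major GEMV: y = alpha * op(A) @ x + beta * y
# where op(A) = A^T when trans is true. incx/incy are element strides.
def gemv(trans, M, N, alpha, A, lda, x, incx, beta, y, incy):
    rows, cols = (N, M) if trans else (M, N)
    for r in range(rows):
        acc = 0.0
        for c in range(cols):
            # Column-major: element (i, j) lives at A[i + j * lda]
            a = A[c + r * lda] if trans else A[r + c * lda]
            acc += a * x[c * incx]
        y[r * incy] = alpha * acc + beta * y[r * incy]

# A = [[1, 2], [3, 4]] stored column-major with lda = 2
A = [1.0, 3.0, 2.0, 4.0]
x = [1.0, 1.0]
y = [0.0, 0.0]
gemv(False, 2, 2, 1.0, A, 2, x, 1, 0.0, y, 1)
print(y)  # → [3.0, 7.0]

yt = [0.0, 0.0]
gemv(True, 2, 2, 1.0, A, 2, x, 1, 0.0, yt, 1)
print(yt)  # → [4.0, 6.0] (A^T @ x)
```

Note that the transposed and non-transposed results differ even for this tiny 2×2 case — which is exactly why silently ignoring `trans` produces wrong answers rather than a crash.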

Comment on lines +418 to +419
var sizeVar = size
executeElementwiseKernel(kernelName, @[A, B, C], @[cast[pointer](addr sizeVar)], @[sizeof(int)], size)

⚠️ Potential issue | 🔴 Critical

Same int vs int32 type-size issue for sizeVar.

sizeVar is Nim int (8 bytes on 64-bit) but the Metal kernel almost certainly expects a 4-byte integer. This is the same root cause as flagged in executeGemmKernel — all kernel constant parameters should use int32 consistently.

🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim` around lines 418 -
419, The local variable sizeVar is declared as Nim int (64-bit on x86_64) but
the Metal kernel expects a 4-byte integer; change sizeVar to an explicit int32
and initialize it from size (e.g. sizeVar: int32 = int32(size)), and pass
sizeof(int32) to executeElementwiseKernel so the pointer/size match what the
kernel expects; follow the same pattern used in executeGemmKernel to ensure all
kernel constant parameters use int32 consistently.

Comment on lines +26 to +42
# TODO: remove this contiguous layout constraint
if not a.isContiguous:
  raise newException(ValueError, "NotImplemented: for now matrix should be contiguous")

let M = a.shape[0]
let N = a.shape[1]

# Use GPU kernel for matrix-vector multiplication
metalGemv[T](
  false,                # trans
  M, N,
  alpha,
  a.storage.Fbuffer, M, # lda = rows of A
  x.storage.Fbuffer, 1, # incx = 1
  beta,
  y.storage.Fbuffer, 1  # incy = 1
)

⚠️ Potential issue | 🟠 Major

Vector x is not checked for contiguity.

metalMV_y_eq_aAx_p_by checks that a is contiguous but passes incx = 1 for x unconditionally. If x is a non-contiguous slice, the kernel will read wrong data.

Proposed fix
   if not a.isContiguous:
     raise newException(ValueError, "NotImplemented: for now matrix should be contiguous")
+  if not x.isContiguous:
+    raise newException(ValueError, "NotImplemented: for now vector should be contiguous")
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/operators_blas_l2l3_metal.nim` around lines 26 - 42,
The function metalMV_y_eq_aAx_p_by currently only verifies a.isContiguous but
always passes incx = 1 to metalGemv which is unsafe for non-contiguous x; update
metalMV_y_eq_aAx_p_by to validate x contiguity (e.g., check x.isContiguous) and
if not contiguous either raise a ValueError similar to the a check or create a
contiguous temporary buffer (copy x.storage.Fbuffer into a contiguous buffer)
and pass that buffer with incx = 1 to metalGemv; ensure references to
x.storage.Fbuffer and the chosen incx value are adjusted and that any temporary
buffer is kept alive for the call.

Comment on lines +153 to +158
proc `/.`*[T: SomeFloat](val: T, t: MetalTensor[T]): MetalTensor[T] {.noinit.} =
  ## Broadcasted division of a float by a tensor of floats.
  ## For now, fall back to CPU
  let cpuT = t.toCpu()
  let cpuResult = val /. cpuT
  result = cpuResult.metal()

⚠️ Potential issue | 🟠 Major


Add missing import of init_metal for toCpu() and metal() procs.

Lines 153–158 call t.toCpu() and cpuResult.metal(), but init_metal is not imported in this file. While tensor_metal.nim composes both modules, operators_broadcasted_metal.nim must declare its own dependencies. Without the import, the file cannot be used independently. Additionally, the CPU fallback carries a significant performance penalty — consider adding a TODO to implement a native Metal kernel for scalar-tensor division.

🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/operators_broadcasted_metal.nim` around lines 153 -
158, The file is missing an import of init_metal which provides the toCpu() and
metal() procs used in the broadcasted scalar-by-tensor division proc `/.`*[T:
SomeFloat](val: T, t: MetalTensor[T]); add `import init_metal` (or the module
that re-exports toCpu/metal) at the top of operators_broadcasted_metal.nim so
the calls to t.toCpu() and cpuResult.metal() resolve when this file is compiled
independently, and add a TODO comment inside `/.`* to implement a native Metal
kernel for scalar-tensor division to avoid the CPU fallback performance penalty.

Comment on lines +19 to +23
proc newMetalStorage*[T: SomeFloat](length: int): MetalStorage[T] =
  result.Flen = length
  result.Fbuffer = createMetalBuffer(length * sizeof(T))
  new result.Fref_tracking
  result.Fref_tracking.value = result.Fbuffer

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n '=destroy' --type nim | grep -i 'metal\|cuda'

Repository: mratsim/Arraymancer

Length of output: 700


🏁 Script executed:

rg -n 'MetalTensorRefTrackerObj\|CudaTensorRefTrackerObj' --type nim -A 5

Repository: mratsim/Arraymancer

Length of output: 45


🏁 Script executed:

fd -e nim -path '*data_structure*' | head -5

Repository: mratsim/Arraymancer

Length of output: 231


🏁 Script executed:

rg -n 'proc createMetalBuffer' --type nim -A 10

Repository: mratsim/Arraymancer

Length of output: 1113


🏁 Script executed:

rg -n 'MetalTensorRefTrackerObj' --type nim

Repository: mratsim/Arraymancer

Length of output: 270


🏁 Script executed:

sed -n '55,75p' src/arraymancer/tensor/data_structure.nim

Repository: mratsim/Arraymancer

Length of output: 788


🏁 Script executed:

rg -n 'type.*MetalBuffer\|type.*MetalTensor' --type nim | head -20

Repository: mratsim/Arraymancer

Length of output: 45


🏁 Script executed:

rg -n 'metalBuffer.*=\|freeMetalBuffer\|releaseMetalBuffer' --type nim

Repository: mratsim/Arraymancer

Length of output: 45


🏁 Script executed:

sed -n '95,115p' src/arraymancer/tensor/data_structure.nim

Repository: mratsim/Arraymancer

Length of output: 930


🏁 Script executed:

rg -n 'Metal.*destroy\|releaseMetal\|deleteBuffer\|freeBuffer' --type nim

Repository: mratsim/Arraymancer

Length of output: 45


🏁 Script executed:

rg -n 'type MetalBuffer' --type nim -A 8

Repository: mratsim/Arraymancer

Length of output: 45


🏁 Script executed:

fd -name 'metal_backend.nim' -exec head -100 {} \;

Repository: mratsim/Arraymancer

Length of output: 293


🏁 Script executed:

rg -n 'type MetalBuffer' --type nim -B 2 -A 8

Repository: mratsim/Arraymancer

Length of output: 45


🏁 Script executed:

rg -n 'proc.*=destroy' src/arraymancer/tensor/data_structure.nim

Repository: mratsim/Arraymancer

Length of output: 218


🏁 Script executed:

grep -n 'MetalBuffer' src/arraymancer/tensor/backend/metal/metal_backend.nim | head -20

Repository: mratsim/Arraymancer

Length of output: 871


🏁 Script executed:

sed -n '1,250p' src/arraymancer/tensor/backend/metal/metal_backend.nim | grep -n -A 10 'type.*MetalBuffer'

Repository: mratsim/Arraymancer

Length of output: 45


🏁 Script executed:

rg -n 'type MetalBuffer' -A 8 src/arraymancer/tensor/backend/metal/

Repository: mratsim/Arraymancer

Length of output: 45


🏁 Script executed:

rg -n 'MetalBuffer.*=\|MetalBuffer.*object' src/arraymancer/tensor/backend/metal/ -A 5

Repository: mratsim/Arraymancer

Length of output: 45


🏁 Script executed:

head -50 src/arraymancer/tensor/backend/metal/metal_backend.nim

Repository: mratsim/Arraymancer

Length of output: 1690


🏁 Script executed:

rg 'MetalBuffer' src/arraymancer/tensor/backend/metal/ -l

Repository: mratsim/Arraymancer

Length of output: 171


🏁 Script executed:

rg -n 'from\|import.*Metal' src/arraymancer/tensor/backend/metal/metal_backend.nim | head -10

Repository: mratsim/Arraymancer

Length of output: 45


🏁 Script executed:

cat src/arraymancer/tensor/backend/metal/metal_buffer.nim

Repository: mratsim/Arraymancer

Length of output: 2816


Missing =destroy hook for MetalTensorRefTrackerObj causes GPU memory leaks.

The CUDA backend defines a =destroy proc on CudaTensorRefTrackerObj (data_structure.nim lines 61–67) that calls cudaFree when the ref tracker is garbage collected. MetalTensorRefTrackerObj lacks an equivalent destructor, so Metal buffer resources are never released when tensors are collected. The buffer pool functions (releaseBuffer, cleanupPool) exist but are not automatically invoked—they must be called manually by user code, which is not a reliable cleanup mechanism.

🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/private/p_init_metal.nim` around lines 19 - 23, Add a
destructor for MetalTensorRefTrackerObj analogous to the CUDA backend so Metal
buffers are freed when the tracker is GC'd: implement a =destroy proc for
MetalTensorRefTrackerObj that retrieves the stored buffer from the ref tracker
(the same field used by newMetalStorage: MetalStorage.Fref_tracking.value) and
releases it via the Metal buffer release routine (or call the existing
releaseBuffer/cleanupPool helper). Place the =destroy alongside other
data_structure destructors so garbage collection invokes it and prevents GPU
memory leaks.
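The tracker-plus-destructor pattern the review asks for can be illustrated in Python with `weakref.finalize` standing in for Nim's `=destroy` hook. This is a sketch of the ownership design only, not the actual Nim implementation; `release_buffer` is a hypothetical stand-in for the Metal buffer pool's release routine:

```python
import weakref

# Hypothetical stand-in for the Metal buffer pool's release routine.
released = []

def release_buffer(buffer_id: int) -> None:
    released.append(buffer_id)

class RefTracker:
    """Mirrors MetalTensorRefTrackerObj: owns the buffer handle and
    guarantees release exactly once, when the last reference dies."""
    def __init__(self, buffer_id: int):
        self.value = buffer_id
        # The finalizer plays the role of the Nim `=destroy` hook.
        self._finalizer = weakref.finalize(self, release_buffer, buffer_id)

class Storage:
    """Mirrors MetalStorage: shallow copies share one tracker."""
    def __init__(self, length: int, buffer_id: int):
        self.Flen = length
        self.Fbuffer = buffer_id
        self.Fref_tracking = RefTracker(buffer_id)

a = Storage(16, buffer_id=42)
b = a          # shallow copies share the tracker, so no double-free
del a
assert released == []    # still referenced through b
del b
assert released == [42]  # last reference gone -> buffer released once
```

The key property is that cleanup is attached to the tracker object's lifetime rather than left to manual `releaseBuffer` calls, which is exactly what the missing `=destroy` would provide in the Nim code.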

Comment on lines +60 to +82
proc asContiguous*[T: SomeFloat](t: MetalTensor[T], layout: OrderType = rowMajor, force: bool = false): MetalTensor[T] {.noinit.} =
  ## Return a contiguous copy of the MetalTensor using GPU kernel
  if not force and t.isContiguous:
    return t

  # Create new contiguous tensor
  result = newMetalTensor[T](t.shape)

  # Convert shape and strides to seq for GPU kernel
  var shapeSeq = newSeq[int](t.rank)
  var stridesSeq = newSeq[int](t.rank)
  for i in 0 ..< t.rank:
    shapeSeq[i] = t.shape[i]
    stridesSeq[i] = t.strides[i]

  # Use GPU kernel to copy data with arbitrary strides to contiguous buffer
  metalContiguousCopy[T](
    t.storage.Fbuffer,
    result.storage.Fbuffer,
    shapeSeq,
    stridesSeq,
    t.storage.Flen
  )

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find MetalBuffer/storage definitions and how Flen/Fbuffer are used
echo "=== MetalBuffer / MetalStorage definitions ==="
rg -n "Flen|Fbuffer" --type nim -A 3 -B 3 | head -80

echo ""
echo "=== MetalTensor size proc ==="
rg -n "proc size" --type nim -A 5 | grep -A 5 "MetalTensor"

echo ""
echo "=== metalContiguousCopy dispatch usage ==="
rg -n "metalContiguousCopy" --type nim -B 3 -A 10

Repository: mratsim/Arraymancer

Length of output: 8303


🏁 Script executed:

#!/bin/bash
echo "=== MetalStorage/MetalTensor type definitions ==="
rg -n "type MetalTensor|type MetalStorage|type MetalBuffer" --type nim -A 10

echo ""
echo "=== Looking for Flen field definition ==="
rg -n "Flen:" --type nim -B 3 -A 1

echo ""
echo "=== MetalTensor.size definition ==="
rg -n "proc size" --type nim -A 5 -B 1 | head -50

Repository: mratsim/Arraymancer

Length of output: 192


🏁 Script executed:

#!/bin/bash
echo "=== Find all files mentioning MetalTensor or MetalStorage ==="
fd "\.nim" | xargs grep -l "MetalTensor\|MetalStorage" | head -20

echo ""
echo "=== MetalBuffer definition search ==="
rg -n "MetalBuffer" --type nim -B 2 -A 5 | head -60

echo ""
echo "=== Full metalContiguousCopy implementation ==="
rg -n "proc metalContiguousCopy" --type nim -A 60

Repository: mratsim/Arraymancer

Length of output: 11357


🏁 Script executed:

#!/bin/bash
echo "=== Search for MetalTensor size or numel proc ==="
rg -n "proc.*size|proc.*numel" --type nim | grep -i "metaltensor"

echo ""
echo "=== Full MetalTensor definition ==="
rg -n "MetalTensor\*\[" --type nim -A 15 | head -50

echo ""
echo "=== Look for slicing/indexing operations on MetalTensor ==="
fd "\.nim" | xargs grep -l "MetalTensor.*\[\|slice.*MetalTensor" | head -5

Repository: mratsim/Arraymancer

Length of output: 4207


🏁 Script executed:

#!/bin/bash
echo "=== Check how CPU/CUDA backends calculate tensor size ==="
rg -n "\.size\|product.*shape" --type nim | grep -v "Flen\|result.size" | head -30

echo ""
echo "=== Check other asContiguous implementations ==="
rg -n "proc asContiguous" --type nim -A 15

echo ""
echo "=== Verify Flen is set only once at tensor creation ==="
rg -n "Flen =" --type nim

Repository: mratsim/Arraymancer

Length of output: 4833


🏁 Script executed:

#!/bin/bash
echo "=== Search for numel or size procedures in CPU tensor backend ==="
rg -n "proc.*size|proc.*numel" --type nim -B 1 -A 3 | grep -A 3 "Tensor\["

echo ""
echo "=== Check if MetalTensor has size property in data_structure ==="
cat src/arraymancer/tensor/data_structure.nim | grep -A 30 "MetalTensor\*\[T"

echo ""
echo "=== Look at how CPU asContiguous calculates element count ==="
rg -n "asContiguous" src/arraymancer/tensor/shapeshifting.nim -A 30 | head -60

Repository: mratsim/Arraymancer

Length of output: 4843


🏁 Script executed:

#!/bin/bash
echo "=== Search for how shape product is computed in MetalTensor context ==="
rg -n "proc.*rank\|for.*shape\|product" --type nim | grep -i metal -A 2 -B 2 | head -30

echo ""
echo "=== Check if there's a proc to compute element count from shape ==="
rg -n "product\|numel\|totalElements" --type nim -B 1 -A 3

echo ""
echo "=== Final check: what is actually passed to metalElementwise ==="
rg -n "metalElementwise" --type nim -B 2 -A 5 | head -50

Repository: mratsim/Arraymancer

Length of output: 3913


🏁 Script executed:

#!/bin/bash
echo "=== Verify the key insight: result has correct Flen ==="
rg -n "result = newMetalTensor" --type nim -B 2 -A 5 | head -40

echo ""
echo "=== Check if broadcasting or other ops handle this correctly ==="
cat src/arraymancer/tensor/operators_broadcasted_metal.nim | head -60

Repository: mratsim/Arraymancer

Length of output: 5423


🏁 Script executed:

#!/bin/bash
echo "=== Trace newMetalTensor and tensorMetal macro ==="
rg -n "tensorMetal|newMetalStorage" --type nim -B 2 -A 10 | head -80

echo ""
echo "=== Check if there's documentation of what happens with sliced/strided MetalTensor ==="
cat src/arraymancer/tensor/data_structure.nim | grep -A 20 "Warning.*Assignment"

Repository: mratsim/Arraymancer

Length of output: 3811


🏁 Script executed:

#!/bin/bash
echo "=== Find the size property/proc for MetalTensor ==="
rg -n "\.size|proc size" --type nim -B 1 -A 3 | grep -B 2 -A 3 "MetalTensor\|shape"

echo ""
echo "=== Look for size proc in generic tensor code ==="
fd "\.nim" | xargs grep -l "\.size" | xargs grep -n "proc.*size\|\.size.*=" | head -20

Repository: mratsim/Arraymancer

Length of output: 23212


🏁 Script executed:

#!/bin/bash
echo "=== Confirm: find .size property definition ==="
rg -n "\.size" --type nim | grep -i "property\|func\|proc" | head -20

echo ""
echo "=== Final check: metalElementwise uses totalSize same way as metalContiguousCopy ==="
rg -n "proc metalElementwise" --type nim -A 40 | head -50

Repository: mratsim/Arraymancer

Length of output: 4421


Pass result.size (product of output shape) instead of t.storage.Flen to control GPU thread dispatch.

In asContiguous, t.storage.Flen represents the underlying buffer's total element count, which can exceed the logical size of a sliced or strided view. The kernel dispatches one thread per element, so passing t.storage.Flen to metalContiguousCopy causes out-of-bounds writes to the output buffer (which is allocated for only product(t.shape) elements). Use result.size (or equivalently, the product of shapeSeq) to dispatch the correct number of threads.

Proposed fix
   metalContiguousCopy[T](
     t.storage.Fbuffer,
     result.storage.Fbuffer,
     shapeSeq,
     stridesSeq,
-    t.storage.Flen
+    result.size
   )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
proc asContiguous*[T: SomeFloat](t: MetalTensor[T], layout: OrderType = rowMajor, force: bool = false): MetalTensor[T] {.noinit.} =
  ## Return a contiguous copy of the MetalTensor using GPU kernel
  if not force and t.isContiguous:
    return t
  # Create new contiguous tensor
  result = newMetalTensor[T](t.shape)
  # Convert shape and strides to seq for GPU kernel
  var shapeSeq = newSeq[int](t.rank)
  var stridesSeq = newSeq[int](t.rank)
  for i in 0 ..< t.rank:
    shapeSeq[i] = t.shape[i]
    stridesSeq[i] = t.strides[i]
  # Use GPU kernel to copy data with arbitrary strides to contiguous buffer
  metalContiguousCopy[T](
    t.storage.Fbuffer,
    result.storage.Fbuffer,
    shapeSeq,
    stridesSeq,
    t.storage.Flen
  )
proc asContiguous*[T: SomeFloat](t: MetalTensor[T], layout: OrderType = rowMajor, force: bool = false): MetalTensor[T] {.noinit.} =
  ## Return a contiguous copy of the MetalTensor using GPU kernel
  if not force and t.isContiguous:
    return t
  # Create new contiguous tensor
  result = newMetalTensor[T](t.shape)
  # Convert shape and strides to seq for GPU kernel
  var shapeSeq = newSeq[int](t.rank)
  var stridesSeq = newSeq[int](t.rank)
  for i in 0 ..< t.rank:
    shapeSeq[i] = t.shape[i]
    stridesSeq[i] = t.strides[i]
  # Use GPU kernel to copy data with arbitrary strides to contiguous buffer
  metalContiguousCopy[T](
    t.storage.Fbuffer,
    result.storage.Fbuffer,
    shapeSeq,
    stridesSeq,
    result.size
  )
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/shapeshifting_metal.nim` around lines 60 - 82, The GPU
kernel dispatch length is using t.storage.Flen which is the underlying buffer
length and can exceed the logical tensor size; in asContiguous (MetalTensor)
pass the logical output element count instead (use result.size or the product of
shapeSeq) to metalContiguousCopy so threads match the allocated contiguous
buffer; update the metalContiguousCopy call to replace t.storage.Flen with
result.size (or product(shapeSeq)) ensuring safe GPU thread dispatch.
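Why dispatching `Flen` threads is unsafe can be shown with a small Python model of the strided-to-contiguous copy. The shapes, strides, and buffer values below are hypothetical; the loop plays the role of one GPU thread per logical output element:

```python
from math import prod

# Underlying buffer for a 4x4 row-major tensor (16 elements).
buffer = list(range(16))
# A 2x2 strided view of that buffer: every other row and column.
shape, strides, offset = [2, 2], [8, 2], 0

logical_size = prod(shape)        # what the kernel should dispatch: 4
buffer_len = len(buffer)          # what t.storage.Flen reports: 16

out = [0] * logical_size          # output allocated for the logical size only

# One "GPU thread" per logical element: decompose the flat contiguous
# index into multi-dimensional coordinates, then apply the view strides.
for tid in range(logical_size):
    rem, src = tid, offset
    for dim in range(len(shape) - 1, -1, -1):
        coord = rem % shape[dim]
        rem //= shape[dim]
        src += coord * strides[dim]
    out[tid] = buffer[src]

assert out == [0, 2, 8, 10]
# Dispatching buffer_len (16) threads here would index past `out`,
# which is the out-of-bounds write the review describes.
assert buffer_len > logical_size
```

With `result.size` as the dispatch count, every thread maps to exactly one slot of the freshly allocated contiguous output buffer.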

Comment on lines +84 to +91
proc reshape*(t: MetalTensor, new_shape: varargs[int]): MetalTensor =
  ## Reshape a MetalTensor without copy.
  ##
  ## ⚠ Reshaping without copy is only possible on contiguous rowMajor Tensors

  reshape_no_copy(t, new_shape, result, rowMajor)
  result.storage = t.storage
  result.cpuData = t.cpuData

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n "reshape_no_copy" --type nim -C 5

Repository: mratsim/Arraymancer

Length of output: 4857


🏁 Script executed:

rg -n "is_C_contiguous" --type nim -C 3

Repository: mratsim/Arraymancer

Length of output: 23784


🏁 Script executed:

rg -n "type MetalTensor" --type nim -C 5

Repository: mratsim/Arraymancer

Length of output: 45


🏁 Script executed:

grep -n "MetalTensor" src/arraymancer/tensor/data_structure.nim | head -20

Repository: mratsim/Arraymancer

Length of output: 963


🏁 Script executed:

sed -n '90,220p' src/arraymancer/tensor/data_structure.nim

Repository: mratsim/Arraymancer

Length of output: 4609


🏁 Script executed:

sed -n '84,95p' src/arraymancer/tensor/shapeshifting_metal.nim

Repository: mratsim/Arraymancer

Length of output: 502


Add contiguity validation to MetalTensor.reshape before calling reshape_no_copy.

The MetalTensor reshape implementation lacks the contiguity check present in the CPU tensor version. The CPU implementation validates is_C_contiguous before calling reshape_no_copy and falls back to copying on non-contiguous tensors. MetalTensor.reshape bypasses this validation entirely, risking incorrect reshaping of non-contiguous tensors. Add the same contiguity guard as the CPU version at line 89.

🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/shapeshifting_metal.nim` around lines 84 - 91, The
MetalTensor.reshape* implementation currently calls reshape_no_copy directly and
copies storage/cpuData without verifying contiguity; add the same contiguity
guard used by the CPU tensor: check t.is_C_contiguous (or equivalent
is_C_contiguous) before calling reshape_no_copy and if not contiguous perform a
safe copy/reshape path (i.e., allocate a contiguous result and copy data) so
reshape_no_copy is only invoked for rowMajor-contiguous tensors; update
reshape*(t: MetalTensor, new_shape: varargs[int]) to mirror the CPU path around
reshape_no_copy, preserving result.storage and result.cpuData only after the
contiguity-safe branch.
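The guard the review asks for boils down to a strides check. A Python sketch of the decision (helper names `is_c_contiguous` and `reshape` are illustrative, not the Nim API):

```python
from math import prod

def is_c_contiguous(shape, strides):
    """Row-major contiguity: each stride must equal the running product
    of the trailing dimensions (size-1 dimensions impose no constraint)."""
    expected = 1
    for dim in range(len(shape) - 1, -1, -1):
        if shape[dim] != 1 and strides[dim] != expected:
            return False
        expected *= shape[dim]
    return True

def reshape(shape, strides, new_shape):
    """Mirror of the guarded reshape path: zero-copy is only valid on
    contiguous row-major data; otherwise a contiguous copy must come first."""
    assert prod(new_shape) == prod(shape), "element count must match"
    if is_c_contiguous(shape, strides):
        return "no-copy"    # safe: reuse the same storage with new strides
    return "copy-first"     # non-contiguous view: materialize, then reshape

assert reshape([4, 4], [4, 1], [2, 8]) == "no-copy"
assert reshape([2, 2], [8, 2], [4]) == "copy-first"  # strided slice of a 4x4
```

Without the `copy-first` branch, reinterpreting a strided view's storage under new row-major strides silently reads the wrong elements, which is the bug the missing guard permits.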
