📝 Walkthrough

This pull request introduces a complete Metal GPU backend for Arraymancer, enabling GPU-accelerated tensor operations on Apple devices. It adds Metal device/buffer management, GPU kernels for BLAS operations, tensor data types with CPU/Metal conversion routines, operator overloads, and a comprehensive test suite validating Metal tensor functionality.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Summary of Changes (Gemini Code Assist)

This pull request significantly expands Arraymancer's capabilities by adding native support for Apple's Metal graphics API. This integration allows users on macOS with Metal-compatible hardware to perform tensor computations directly on their GPU, promising substantial performance improvements for machine learning and scientific computing workloads. The changes encompass a new backend, custom Metal kernels, and updates to the core tensor data structures and operations to seamlessly incorporate Metal-accelerated functionality.
Code Review
This pull request introduces a significant new feature: a backend for Apple's Metal framework, enabling GPU acceleration on macOS and iOS devices. While the overall structure is sound, several security-critical issues were identified, including missing bounds checks in buffer transfer functions, potential integer overflows in GPU kernels for large tensors, and a GPU memory leak due to missing destructors. Additionally, the implementation has bugs in the buffer pooling logic and areas for performance and maintainability improvements. Addressing these points is crucial for the stability, robustness, and security of the new Metal backend.
  MetalBufferObj* = object
    buffer*: MTLBuffer
    length*: int
    devicePtr*: pointer

  MetalBuffer* = ref MetalBufferObj
CRITICAL: There is a memory leak. MetalBufferObj wraps an MTLBuffer pointer which is a reference-counted Objective-C object. The MTLBuffer is created with a +1 retain count but is never released. This will cause memory to leak every time a MetalBuffer is created and garbage collected.
You need to add a finalizer (=destroy) to MetalBufferObj that releases the underlying MTLBuffer object. This requires a helper function to call the Objective-C release method.
MetalBuffer* = ref MetalBufferObj

{.emit: """
#import <Foundation/Foundation.h>
void arraymancer_release(void* obj) { if (obj) [(id)obj release]; }
""".}
proc release_object(obj: pointer) {.importc: "arraymancer_release", nodecl.}

when NimMajor == 1:
  proc `=destroy`*(p: var MetalBufferObj) {.noSideEffect.} =
    if p.buffer != nil:
      release_object(p.buffer)
      p.buffer = nil
else:
  proc `=destroy`*(p: MetalBufferObj) {.noSideEffect.} =
    if p.buffer != nil:
      release_object(p.buffer)
      p.buffer = nil
proc uploadToBuffer*(mb: MetalBuffer, data: pointer, length: int) =
  copyToBuffer(mb.buffer, data, csize_t(length))

proc downloadFromBuffer*(mb: MetalBuffer, data: pointer, length: int) =
  copyFromBuffer(mb.buffer, data, csize_t(length))
The uploadToBuffer and downloadFromBuffer functions lack bounds checking on the length parameter. Since these functions are exported and pass the length directly to a memcpy operation in the underlying Objective-C code, providing a length greater than the buffer's allocated size will result in a heap buffer overflow.
Recommendation: Add a check to ensure that the provided length does not exceed mb.length before performing the copy.
proc uploadToBuffer*(mb: MetalBuffer, data: pointer, length: int) =
  if length > mb.length:
    raise newException(ValueError, "Length exceeds buffer size")
  copyToBuffer(mb.buffer, data, csize_t(length))

proc downloadFromBuffer*(mb: MetalBuffer, data: pointer, length: int) =
  if length > mb.length:
    raise newException(ValueError, "Length exceeds buffer size")
  copyFromBuffer(mb.buffer, data, csize_t(length))
int idx = (int)gid;
int src_offset = 0;

for (int dim = rank - 1; dim >= 0; dim--) {
    int coord = idx % src_shape[dim];
    src_offset += coord * src_strides[dim];
    idx /= src_shape[dim];
}
The Metal kernels use 32-bit signed integers (int) for calculating memory offsets and indices. For large tensors (e.g., those with more than 2^31 elements), these calculations will overflow, leading to incorrect memory access and potential out-of-bounds reads or writes on the GPU. This is particularly evident in the contiguous_copy_f32 kernel where src_offset is calculated using 32-bit arithmetic.
Recommendation: Use 64-bit integers (e.g., uint64_t or size_t) for all index and offset calculations within the Metal kernels to support large tensors safely.
for i in 0 ..< pool.entries.len:
  if pool.entries[i].size >= size and pool.entries[i].buffer != nil:
    result = pool.entries[i].buffer
    pool.entries[i].lastUsed = pool.accessCounter
    return
There's a bug in the buffer pooling logic. When a buffer is acquired from the pool, it is not removed from the entries sequence. This means the same buffer can be acquired multiple times, leading to data corruption if used concurrently.
Additionally, the device parameter is unused, and if no suitable buffer is found, the procedure returns nil, which will likely cause a crash downstream. The procedure should probably create a new buffer if one cannot be reused.
I'll suggest fixing the most immediate issue of not removing the buffer from the pool.
for i in 0 ..< pool.entries.len:
  if pool.entries[i].size >= size and pool.entries[i].buffer != nil:
    result = pool.entries[i].buffer
    pool.currentSize -= pool.entries[i].size
    pool.entries.del(i)
    return
  MetalTensorRefTrackerObj*[T: SomeFloat] = object
    value*: MetalBuffer

  MetalTensorRefTracker*[T] = ref MetalTensorRefTrackerObj[T]
The MetalTensor implementation lacks a destructor to release the underlying GPU resources. While CudaTensor has a =destroy hook to call cudaFree, MetalTensor does not have a corresponding hook to release the MTLBuffer Objective-C object. This leads to a GPU memory leak every time a MetalTensor is deallocated by the Nim garbage collector, which can eventually lead to a Denial of Service (DoS) due to resource exhaustion.
Recommendation: Implement a =destroy hook for MetalTensorRefTrackerObj or MetalBufferObj that calls an Objective-C function to release the MTLBuffer using [buffer release].
task test_metal, "Run all Metal backend tests":
  var switches = " -d:metal"
  switches.add " -d:blas=cblas" # Archlinux, comment out on Debian/Ubuntu
  test "tests_metal", switches, split = false
var sizeVar = size

# Create a buffer for partial results (one per threadgroup)
let threadgroupSize = 256
The threadgroupSize is hardcoded to 256. This is fragile because the corresponding Metal kernel vector_dot_f32 also has a hardcoded size for its threadgroup memory. If these values get out of sync, it will lead to crashes or incorrect behavior.
It's better to use the METAL_THREADGROUP_SIZE constant defined in src/arraymancer/tensor/backend/global_config.nim to ensure consistency. You'll need to import ../global_config.
let threadgroupSize = METAL_THREADGROUP_SIZE
for (int k = 0; k < K; k++) {
    sum += A[batchOffsetA + row * K + k] * B[batchOffsetB + k * N + col];
}
The indexing used here (row * K + k and k * N + col) assumes a row-major memory layout for the matrices within each batch. However, other GEMM kernels in this file (gemm_naive_f32, etc.) use a column-major layout, which is standard for BLAS compatibility.
This inconsistency can lead to bugs in the future when this kernel is used. It should be changed to use column-major indexing to match the other kernels.
for (int k = 0; k < K; k++) {
    // Column-major indexing: A[row, k] is at k*M+row, B[k, col] is at col*K+k
    sum += A[batchOffsetA + k * M + row] * B[batchOffsetB + col * K + k];
}
             uint lid [[thread_position_in_threadgroup]],
             uint groupSize [[threads_per_threadgroup]])
{
    threadgroup float localSum[256];
The size of the localSum array for threadgroup memory is hardcoded to 256. This is fragile. If the dispatching code in Nim ever uses a different threadgroup size, this will cause a buffer overflow within the threadgroup or incorrect results.
While Metal requires this size to be a compile-time constant, you should ensure it's tied to a single source of truth. The Nim code also hardcodes 256. Consider adding a comment here to note the dependency on the dispatch size being exactly 256.
proc `-.`*[T: SomeFloat](val: T, t: MetalTensor[T]): MetalTensor[T] {.noinit.} =
  ## Broadcasted subtraction for scalar - tensor.
  ## Implemented as -(tensor - scalar)
  result = newMetalTensor[T](t.shape)
  metalScalarAdd[T](
    t.storage.Fbuffer,
    result.storage.Fbuffer,
    -val,
    t.storage.Flen
  )
  # Negate the result
  metalScalarMul[T](
    result.storage.Fbuffer,
    result.storage.Fbuffer,
    T(-1),
    t.storage.Flen
  )
The implementation of val - t (scalar-tensor subtraction) currently uses two separate kernel launches: one for scalar_add (with a negative value) and another for scalar_mul to negate the result. This is inefficient.
A more performant approach would be to create a dedicated Metal kernel scalar_sub_from_f32 that computes C[gid] = scalar - A[gid] in a single pass. This would halve the number of kernel launches and GPU synchronization for this operation.
Actionable comments posted: 15
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim`:
- Around line 418-419: The local variable sizeVar is declared as Nim int (64-bit
on x86_64) but the Metal kernel expects a 4-byte integer; change sizeVar to an
explicit int32 and initialize it from size (e.g. sizeVar: int32 = int32(size)),
and pass sizeof(int32) to executeElementwiseKernel so the pointer/size match
what the kernel expects; follow the same pattern used in executeGemmKernel to
ensure all kernel constant parameters use int32 consistently.
- Around line 356-398: The metalGemv proc currently ignores trans, incx and incy
and also passes host ints to the GPU with the wrong size; update metalGemv to
validate/handle these parameters: if trans is true either select/implement a
transposed kernel (e.g. getKernel("gemv_f32_trans") and dispatch with swapped
M/N and appropriate index mapping) or explicitly raise a ValueError when
trans==true; likewise raise a ValueError when incx != 1 or incy != 1 (or
implement strided support in the kernel) so callers aren’t silently wrong;
finally fix the int size mismatch by converting m and n to int32 (or Int32)
before calling setBytes and use csize_t(sizeof(int32)) when passing their sizes
(same for any other ints sent to the GPU like lda), and update references to
addr m/addr n to point at the int32 variables.
- Around line 198-202: The try/except block around initMetalContext currently
uses a bare except which catches non-recoverable Defect errors; change it to
catch only recoverable exceptions by using "except CatchableError:" (optionally
"except CatchableError as e:" if you want to log e) so initMetalContext()
failures are handled safely without swallowing defects; update the block in
metal_backend.nim where initMetalContext() is called to use CatchableError and
add an optional log of the exception before returning false.
- Around line 344-351: The current dispatch logic in the block setting
kernelName using transA/transB incorrectly falls through to "gemm_naive_f32"
when both transA and transB are true; update the logic in the kernelName
assignment (the variables transA, transB and kernelName) to explicitly handle
the both-true case: either set kernelName to "gemm_ab_transpose_f32" if you
add/implement that kernel, or throw/raise a clear runtime error indicating the
transA+transB combination is unsupported so it doesn't silently use the
non-transposed kernel.
- Around line 423-450: The metalScalarMul proc currently unconditionally picks
the "scalar_mul_f32" kernel and casts scalar with float32(scalar), which will
silently truncate float64 callers; wrap the body of metalScalarMul in a
compile-time type guard like "when T is float32: <existing body> else: raise
newException(ValueError, \"metalScalarMul only supports float32\")" so calls
with T=float64 fail early; apply the same pattern to metalScalarAdd (guarding
against using the wrong kernel and the float32(scalar) cast) and ensure the
raised error type/message matches project conventions.
- Around line 532-572: The MetalBuffer ref wrapper leaks GPU memory because
MetalBufferObj lacks a destructor; add a destructor/finalize proc to
MetalBufferObj in metal_buffer.nim that releases the underlying ObjC MTLBuffer
(the field used by createMetalBuffer/createBuffer, e.g. the mtlBuffer or buffer
field) when the MetalBuffer ref is garbage-collected; ensure the destructor
calls the ObjC release/`release()` (or sets the Swift/ObjC reference to nil as
appropriate) and is attached to the MetalBufferObj type so temporary buffers
like partialBuffer in metalDot and shapeBuffer/stridesBuffer in
metalContiguousCopy are freed.
- Around line 309-319: The code is sending 64-bit Nim ints to a Metal kernel
that expects 32-bit ints (gemm_naive_f32 parameters M/N/K); change the local
vars and setBytes size to use 32-bit integers: declare m, n, k as int32 (e.g.
var m: int32 = int32(M), n: int32 = int32(N), k: int32 = int32(K)) and call
setBytes(encoder, addr m, csize_t(sizeof(int32)), 3) (and similarly for n/k),
leaving the float32 alpha/beta handling unchanged; update the three setBytes
calls that reference sizeof(int) to sizeof(int32) so the Metal kernel receives
4-byte integers.
In `@src/arraymancer/tensor/backend/metal/metal_buffer.nim`:
- Around line 53-62: The acquireBuffer procedure returns a matching MetalBuffer
but leaves the entry in MetalBufferPool.entries, allowing the same buffer to be
handed out multiple times; modify proc acquireBuffer*(pool: var MetalBufferPool,
size: int, device: MTLDevice) to remove the chosen entry from pool.entries (and
decrement pool.currentSize) when you take its buffer: capture
pool.entries[i].buffer into result, update
pool.entries[i].lastUsed/accessCounter as needed, then delete that entry from
pool.entries (or swap-remove) and decrement pool.currentSize before returning;
ensure you still return nil when none found and preserve accessCounter
semantics.
In `@src/arraymancer/tensor/backend/metal/metal_kernels.metal`:
- Around line 187-214: The batched_gemm_f32 kernel uses row-major indexing but
the library expects column-major; update A, B and C accesses to column-major
using the same pattern as gemm_naive_f32: replace A[batchOffsetA + row * K + k]
with A[batchOffsetA + k * M + row], replace B[batchOffsetB + k * N + col] with
B[batchOffsetB + col * K + k], and replace C[batchOffsetC + row * N + col] with
C[batchOffsetC + col * M + row] (keep batchOffsetA/B/C, loop over k, and
alpha/beta logic unchanged).
- Around line 220-249: The kernel vector_dot_f32 uses gid
(thread_position_in_grid) but strides by groupSize (threads_per_threadgroup),
causing overlapping work across threadgroups and races; change the loop to
stride by the total grid size (totalThreads) instead of groupSize and write
partial sums to a per-threadgroup reduction slot using an array sized to the
actual threadgroup size (replace hardcoded localSum[256] with a size based on
METAL_THREADGROUP_SIZE or the runtime groupSize), then perform the threadgroup
reduction (using localSum, lid and threadgroup_barrier) and have the lane 0
write the group's result to result[gid / groupSize] (or use group index to
compute output slot) so each threadgroup reduces a distinct chunk. Ensure
references in code: vector_dot_f32, localSum, gid, lid, groupSize,
threadgroup_barrier and align the local buffer size with METAL_THREADGROUP_SIZE.
In `@src/arraymancer/tensor/operators_blas_l2l3_metal.nim`:
- Around line 26-42: The function metalMV_y_eq_aAx_p_by currently only verifies
a.isContiguous but always passes incx = 1 to metalGemv which is unsafe for
non-contiguous x; update metalMV_y_eq_aAx_p_by to validate x contiguity (e.g.,
check x.isContiguous) and if not contiguous either raise a ValueError similar to
the a check or create a contiguous temporary buffer (copy x.storage.Fbuffer into
a contiguous buffer) and pass that buffer with incx = 1 to metalGemv; ensure
references to x.storage.Fbuffer and the chosen incx value are adjusted and that
any temporary buffer is kept alive for the call.
In `@src/arraymancer/tensor/operators_broadcasted_metal.nim`:
- Around line 153-158: The file is missing an import of init_metal which
provides the toCpu() and metal() procs used in the broadcasted scalar-by-tensor
division proc `/.`*[T: SomeFloat](val: T, t: MetalTensor[T]); add `import
init_metal` (or the module that re-exports toCpu/metal) at the top of
operators_broadcasted_metal.nim so the calls to t.toCpu() and cpuResult.metal()
resolve when this file is compiled independently, and add a TODO comment inside
`/.`* to implement a native Metal kernel for scalar-tensor division to avoid the
CPU fallback performance penalty.
In `@src/arraymancer/tensor/private/p_init_metal.nim`:
- Around line 19-23: Add a destructor for MetalTensorRefTrackerObj analogous to
the CUDA backend so Metal buffers are freed when the tracker is GC'd: implement
a =destroy proc for MetalTensorRefTrackerObj that retrieves the stored buffer
from the ref tracker (the same field used by newMetalStorage:
MetalStorage.Fref_tracking.value) and releases it via the Metal buffer release
routine (or call the existing releaseBuffer/cleanupPool helper). Place the
=destroy alongside other data_structure destructors so garbage collection
invokes it and prevents GPU memory leaks.
In `@src/arraymancer/tensor/shapeshifting_metal.nim`:
- Around line 84-91: The MetalTensor.reshape* implementation currently calls
reshape_no_copy directly and copies storage/cpuData without verifying
contiguity; add the same contiguity guard used by the CPU tensor: check
t.is_C_contiguous (or equivalent is_C_contiguous) before calling reshape_no_copy
and if not contiguous perform a safe copy/reshape path (i.e., allocate a
contiguous result and copy data) so reshape_no_copy is only invoked for
rowMajor-contiguous tensors; update reshape*(t: MetalTensor, new_shape:
varargs[int]) to mirror the CPU path around reshape_no_copy, preserving
result.storage and result.cpuData only after the contiguity-safe branch.
- Around line 60-82: The GPU kernel dispatch length is using t.storage.Flen
which is the underlying buffer length and can exceed the logical tensor size; in
asContiguous (MetalTensor) pass the logical output element count instead (use
result.size or the product of shapeSeq) to metalContiguousCopy so threads match
the allocated contiguous buffer; update the metalContiguousCopy call to replace
t.storage.Flen with result.size (or product(shapeSeq)) ensuring safe GPU thread
dispatch.
🟡 Minor comments (6)
tests/tensor/test_init_metal.nim (lines 25-28)

25-28: ⚠️ Potential issue | 🟡 Minor — Comment says type "int" but the tensor was converted to float on line 23.
The inline comment describes the tensor as type "int" but asType(float) was applied before toMetal. This is cosmetic but could confuse future readers.

src/arraymancer/tensor/init_copy_metal.nim (line 15)

15: ⚠️ Potential issue | 🟡 Minor — Remove unused import ./private/p_init_metal.
The imported symbols from p_init_metal (newMetalTensor, newMetalStorage, tensorMetal, get_data_ptr) are not used in this file. The clone proc only calls toCpu(), init_copy_cpu.clone(), and metal(), which are provided by other imports.

arraymancer.nimble (lines 222-225)

222-225: ⚠️ Potential issue | 🟡 Minor — The -d:blas=cblas comment is misleading for the macOS-only Metal backend.
The comment "Archlinux, comment out on Debian/Ubuntu" is copied from the CUDA/OpenCL tasks but is confusing for Metal, which only runs on macOS. On macOS, the standard BLAS implementation is the Accelerate framework, so -d:blas=cblas may not be appropriate for this platform. Update the comment to be macOS-specific or remove the flag if Accelerate should be used instead.
Regarding the missing lang parameter: unlike test_cuda and test_opencl which use lang="cpp", the Metal task has no language parameter. This is correct because Metal's compilation mode is handled via the {.passC: "-x objective-c".} pragma in metal_backend.nim, which specifies Objective-C compilation directly. The missing parameter does not indicate a problem.

src/arraymancer/tensor/operators_blas_l1_metal.nim (line 58)

58: ⚠️ Potential issue | 🟡 Minor — Typo: "substraction" → "subtraction".
Lines 58 and 71 both have "substraction" in the doc comments. Also applies to: line 71.

src/arraymancer/tensor/operators_blas_l1_metal.nim (lines 94-103)

94-103: ⚠️ Potential issue | 🟡 Minor — Missing MetalTensor * scalar overload.
Only scalar * tensor is defined. Users will expect commutativity (tensor * 2.0'f32), which will fail to compile. The CPU backend likely provides both directions.
Proposed fix:

+proc `*`*[T: SomeFloat](t: MetalTensor[T], a: T): MetalTensor[T] {.noinit, inline.} =
+  ## MetalTensor multiplication by a scalar (scalar on the right)
+  result = a * t

src/arraymancer/tensor/backend/metal/metal_buffer.nim (lines 64-87)

64-87: ⚠️ Potential issue | 🟡 Minor — releaseBuffer may still exceed maxSize after evicting only one entry.
If the single eviction isn't enough to bring currentSize + buffer.length under maxSize, the buffer is added anyway. Consider evicting in a loop until there's room, or rejecting the release.
Proposed fix — evict in a loop:

 proc releaseBuffer*(pool: var MetalBufferPool, buffer: MetalBuffer) =
   if buffer == nil:
     return
   pool.accessCounter.inc
-  if pool.currentSize + buffer.length > pool.maxSize:
+  while pool.currentSize + buffer.length > pool.maxSize and pool.entries.len > 0:
     var oldestIdx = -1
     var oldestAccess = int64.high
     for i in 0 ..< pool.entries.len:
       if pool.entries[i].lastUsed < oldestAccess:
         oldestAccess = pool.entries[i].lastUsed
         oldestIdx = i
     if oldestIdx >= 0:
       pool.currentSize -= pool.entries[oldestIdx].size
       pool.entries.del(oldestIdx)
🧹 Nitpick comments (9)
src/arraymancer/tensor/backend/global_config.nim (1)

35-37: Metal constants are unconditionally defined, consistent with the existing CUDA/OMP pattern — looks good.
One concern: METAL_THREADGROUP_SIZE = 256 must stay in sync with the hardcoded threadgroup float localSum[256] in metal_kernels.metal (line 229). Since the .metal file is compiled by the Metal shader compiler, this Nim constant can't flow into it automatically. Consider adding a comment here noting the coupling, or generating the kernel source with the constant substituted.

src/arraymancer/tensor/tensor_metal.nim (1)

16-32: Consider adding a when defined(metal) guard for consistency.
backend/metal.nim wraps its body in when defined(metal), but this aggregator module does not. If someone imports tensor_metal directly without -d:metal, the Metal-specific sub-module imports will fail at compile time with potentially confusing errors. Adding the guard here would make the failure message clearer and stay consistent with the rest of the backend.

tests/tensor/test_operators_blas_metal.nim (1)

86-93: Consider adding a correctness check for the large matrix test.
Currently this only verifies shape and no crash. Since randomTensor is seeded deterministically, you could compare against the CPU result to catch numerical issues in the Metal GEMM kernel at scale:

let cpu_result = a.toCpu() * b.toCpu()
check:
  result.toCpu() == cpu_result

This would catch precision or tiling bugs that only manifest with larger matrices.

src/arraymancer/tensor/backend/metal.nim (1)

23-29: Remove redundant explicit re-exports.
The export metal_buffer, metal_backend statements already re-export all public symbols from those modules. The explicit export MetalBuffer, MetalBufferPool, MTLDevice, MTLCommandQueue, MTLCommandBuffer, MTLBuffer, MetalContext, metalContext, initMetalContext, isMetalAvailable lines are redundant and can be removed.

src/arraymancer/tensor/private/p_init_metal.nim (1)

19-21: No validation on the length parameter.
If length <= 0, createMetalBuffer receives a zero or negative size. Consider guarding against this.
Proposed fix:

 proc newMetalStorage*[T: SomeFloat](length: int): MetalStorage[T] =
+  assert length > 0, "MetalStorage length must be positive"
   result.Flen = length
   result.Fbuffer = createMetalBuffer(length * sizeof(T))

src/arraymancer/tensor/init_metal.nim (1)

59-65: zeros_like/ones_like round-trip through CPU unnecessarily.
Both procs allocate a CPU tensor, fill it, then upload to Metal. For zeros_like, a more efficient approach would be to allocate a Metal buffer and zero it directly on the GPU (e.g., via a blit encoder fill command). Fine for an initial implementation but worth a TODO.

src/arraymancer/tensor/backend/metal/metal_buffer.nim (1)

15-25: All Metal Objective-C types are opaque pointer aliases — no type safety.
Using pointer for MTLDevice, MTLBuffer, etc. means any pointer can be passed where a specific Metal type is expected. Consider using Nim distinct pointer types to get compile-time safety.

src/arraymancer/tensor/backend/metal/metal_backend.nim (2)

26-34: Prefer MTLCreateSystemDefaultDevice() over MTLCopyAllDevices().
MTLCopyAllDevices() is macOS-only and unnecessary when targeting Apple Silicon (single GPU). Using MTLCreateSystemDefaultDevice() is simpler, avoids the array memory management, and is also available on iOS/iPadOS if you ever want to expand platform support.
Additionally, the current code has a subtle memory-management concern: after [devices release], the returned device reference is only valid because Metal devices are system singletons. Using MTLCreateSystemDefaultDevice() eliminates this concern entirely.
Suggested fix:

 id<MTLDevice> arraymancer_getDefaultDevice() {
-    NSArray<id<MTLDevice>> *devices = MTLCopyAllDevices();
-    if (devices && [devices count] > 0) {
-        id<MTLDevice> device = [devices objectAtIndex:0];
-        [devices release];
-        return device;
-    }
-    return nil;
+    return MTLCreateSystemDefaultDevice();
 }

255-286: Consider extracting the repeated command-buffer boilerplate into a shared helper.
Every kernel dispatch (elementwise, GEMM, GEMV, scalar, contiguous copy, dot) repeats the same init-check → getKernel → createCommandBuffer → createComputeEncoder → setComputePipeline → … → endEncoding → commit → waitForCompletion pattern. A withComputeEncoder or similar higher-order helper would reduce duplication and centralize error handling.
try:
  initMetalContext()
  return true
except:
  return false
🛠️ Refactor suggestion | 🟠 Major
Bare except: catches Defect (non-recoverable errors); use except CatchableError: instead.
In Nim, bare except: catches everything including Defect types (e.g., AssertionDefect, IndexDefect), which represent programming errors that should not be silently swallowed. Use except CatchableError: to only catch recoverable errors.
Suggested fix
 try:
   initMetalContext()
   return true
-except:
+except CatchableError:
   return false

🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim` around lines 198 -
202, The try/except block around initMetalContext currently uses a bare except
which catches non-recoverable Defect errors; change it to catch only recoverable
exceptions by using "except CatchableError:" (optionally "except CatchableError
as e:" if you want to log e) so initMetalContext() failures are handled safely
without swallowing defects; update the block in metal_backend.nim where
initMetalContext() is called to use CatchableError and add an optional log of
the exception before returning false.
# Set constants
var m = M
var n = N
var k = K
var a = alpha
var b = beta
setBytes(encoder, addr m, csize_t(sizeof(int)), 3)
setBytes(encoder, addr n, csize_t(sizeof(int)), 4)
setBytes(encoder, addr k, csize_t(sizeof(int)), 5)
setBytes(encoder, addr a, csize_t(sizeof(float32)), 6)
setBytes(encoder, addr b, csize_t(sizeof(float32)), 7)
🧩 Analysis chain: verification scripts were run against the mratsim/Arraymancer repository to inspect metal_kernels.metal and the relevant lines of metal_backend.nim.
Critical: Type size mismatch — Nim int is 8 bytes but Metal kernels expect 4-byte integers.
The Metal kernel gemm_naive_f32 declares parameters as constant int& M, constant int& N, and constant int& K (32-bit in Metal Shading Language). However, the Nim code sends 8-byte values via setBytes(encoder, addr m, csize_t(sizeof(int)), 3) because Nim's int is 64-bit on 64-bit platforms. This causes the 4-byte buffer to be corrupted when receiving 8 bytes, corrupting subsequent parameters and yielding wrong results.
Suggested fix
 # Set constants
-var m = M
-var n = N
-var k = K
+var m = int32(M)
+var n = int32(N)
+var k = int32(K)
 var a = alpha
 var b = beta
-setBytes(encoder, addr m, csize_t(sizeof(int)), 3)
-setBytes(encoder, addr n, csize_t(sizeof(int)), 4)
-setBytes(encoder, addr k, csize_t(sizeof(int)), 5)
+setBytes(encoder, addr m, csize_t(sizeof(int32)), 3)
+setBytes(encoder, addr n, csize_t(sizeof(int32)), 4)
+setBytes(encoder, addr k, csize_t(sizeof(int32)), 5)
 setBytes(encoder, addr a, csize_t(sizeof(float32)), 6)
 setBytes(encoder, addr b, csize_t(sizeof(float32)), 7)
setBytes(encoder, addr b, csize_t(sizeof(float32)), 7)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
  # Set constants
- var m = M
- var n = N
- var k = K
+ var m = int32(M)
+ var n = int32(N)
+ var k = int32(K)
  var a = alpha
  var b = beta
- setBytes(encoder, addr m, csize_t(sizeof(int)), 3)
- setBytes(encoder, addr n, csize_t(sizeof(int)), 4)
- setBytes(encoder, addr k, csize_t(sizeof(int)), 5)
+ setBytes(encoder, addr m, csize_t(sizeof(int32)), 3)
+ setBytes(encoder, addr n, csize_t(sizeof(int32)), 4)
+ setBytes(encoder, addr k, csize_t(sizeof(int32)), 5)
  setBytes(encoder, addr a, csize_t(sizeof(float32)), 6)
  setBytes(encoder, addr b, csize_t(sizeof(float32)), 7)
```
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim` around lines 309 -
319, The code is sending 64-bit Nim ints to a Metal kernel that expects 32-bit
ints (gemm_naive_f32 parameters M/N/K); change the local vars and setBytes size
to use 32-bit integers: declare m, n, k as int32 (e.g. var m: int32 = int32(M),
n: int32 = int32(N), k: int32 = int32(K)) and call setBytes(encoder, addr m,
csize_t(sizeof(int32)), 3) (and similarly for n/k), leaving the float32
alpha/beta handling unchanged; update the three setBytes calls that reference
sizeof(int) to sizeof(int32) so the Metal kernel receives 4-byte integers.
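The corruption pattern described above can be reproduced outside Metal. The following Python sketch (illustrative only, not Arraymancer code) lays out three GEMM constants the way a 64-bit Nim `int` sends them, then reads them back with the 4-byte layout the kernel declared:

```python
import struct

# Hypothetical host/device illustration: three GEMM constants are packed
# as 8-byte integers (what 64-bit Nim `int` sends), then read back with
# the 4-byte layout the Metal kernel declared (`constant int&`).
M, N, K = 128, 256, 64

sent = struct.pack("<qqq", M, N, K)          # 8 bytes per constant
m, n, k = struct.unpack_from("<iii", sent)   # 4 bytes per constant

# Only m survives (its value fits in the low 4 bytes on little-endian);
# n and k no longer match the values that were sent.
print(m, n, k)
```

The same shift happens in the Metal argument table: each oversized write spills into the next parameter's slot.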
```nim
var kernelName: string
if transA and not transB:
  kernelName = "gemm_a_transpose_f32"
elif not transA and transB:
  kernelName = "gemm_b_transpose_f32"
else:
  kernelName = "gemm_naive_f32"
```
Bug: Both-transposed GEMM case silently falls through to the no-transpose kernel.
When both transA and transB are true, the else branch dispatches gemm_naive_f32 (the non-transposed kernel), producing incorrect results. Either add a dedicated gemm_ab_transpose_f32 kernel or raise an error for the unsupported combination.
Suggested fix (error for unsupported case)

```diff
  if transA and not transB:
    kernelName = "gemm_a_transpose_f32"
  elif not transA and transB:
    kernelName = "gemm_b_transpose_f32"
- else:
+ elif not transA and not transB:
    kernelName = "gemm_naive_f32"
+ else:
+   raise newException(ValueError, "Metal GEMM with both A and B transposed is not yet supported")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
  var kernelName: string
  if transA and not transB:
    kernelName = "gemm_a_transpose_f32"
  elif not transA and transB:
    kernelName = "gemm_b_transpose_f32"
- else:
+ elif not transA and not transB:
    kernelName = "gemm_naive_f32"
+ else:
+   raise newException(ValueError, "Metal GEMM with both A and B transposed is not yet supported")
```
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim` around lines 344 -
351, The current dispatch logic in the block setting kernelName using
transA/transB incorrectly falls through to "gemm_naive_f32" when both transA and
transB are true; update the logic in the kernelName assignment (the variables
transA, transB and kernelName) to explicitly handle the both-true case: either
set kernelName to "gemm_ab_transpose_f32" if you add/implement that kernel, or
throw/raise a clear runtime error indicating the transA+transB combination is
unsupported so it doesn't silently use the non-transposed kernel.
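A quick NumPy reference check (not Metal code) makes the severity of the fall-through concrete — the naive kernel's result bears no relation to what the caller asked for:

```python
import numpy as np

# Reference check: when both operands are flagged as transposed but the
# plain kernel is dispatched, the backend computes A @ B instead of the
# requested A.T @ B.T.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float32)
B = rng.standard_normal((4, 4)).astype(np.float32)

requested = A.T @ B.T     # what the caller asked for
fallthrough = A @ B       # what the current dispatch computes

print(np.allclose(requested, fallthrough))
```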
```nim
proc metalGemv*[T: SomeFloat](
    trans: bool,
    M, N: int,
    alpha: T,
    A: MetalBuffer, lda: int,
    x: MetalBuffer, incx: int,
    beta: T,
    y: MetalBuffer, incy: int
  ) =
  ## Matrix-Vector multiplication: y = alpha * A * x + beta * y
  ## Column-major layout (Fortran style)
  if not metalContext.initialized:
    initMetalContext()

  # Metal only supports float32, not float64
  when T is float32:
    var m = M
    var n = N
    var a = float32(alpha)
    var b = float32(beta)

    let pipeline = getKernel("gemv_f32")
    let commandBuffer = createCommandBuffer(metalContext.commandQueue)
    let encoder = createComputeEncoder(commandBuffer)

    setComputePipeline(encoder, pipeline)
    setBuffer(encoder, A.buffer, 0)
    setBuffer(encoder, x.buffer, 1)
    setBuffer(encoder, y.buffer, 2)
    setBytes(encoder, addr m, csize_t(sizeof(int)), 3)
    setBytes(encoder, addr n, csize_t(sizeof(int)), 4)
    setBytes(encoder, addr a, csize_t(sizeof(float32)), 5)
    setBytes(encoder, addr b, csize_t(sizeof(float32)), 6)

    # Dispatch threads - one thread per output element (M rows)
    let threadgroupSize = 256
    dispatchThreads1D(encoder, csize_t(M), csize_t(threadgroupSize))

    endEncoding(encoder)
    commitCommandBuffer(commandBuffer)
    waitForCompletion(commandBuffer)
  else:
    raise newException(ValueError, "Metal only supports float32, not float64")
```
`trans`, `incx`, and `incy` parameters are accepted but completely ignored.

- The `trans` flag is never used — the kernel always computes `y = α·A·x + β·y` regardless of whether a transposed multiply was requested. Callers expecting `A^T * x` will silently get wrong results.
- `incx` and `incy` (stride parameters) are also ignored — non-unit increments will produce incorrect results.
- Same `int` vs `int32` type-size mismatch as in `executeGemmKernel` (lines 385-386).

At minimum, either implement the transposed path (separate kernel or index remapping) or raise an error when `trans == true` / `incx != 1` / `incy != 1` so callers don't silently get wrong results.
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim` around lines 356 -
398, The metalGemv proc currently ignores trans, incx and incy and also passes
host ints to the GPU with the wrong size; update metalGemv to validate/handle
these parameters: if trans is true either select/implement a transposed kernel
(e.g. getKernel("gemv_f32_trans") and dispatch with swapped M/N and appropriate
index mapping) or explicitly raise a ValueError when trans==true; likewise raise
a ValueError when incx != 1 or incy != 1 (or implement strided support in the
kernel) so callers aren’t silently wrong; finally fix the int size mismatch by
converting m and n to int32 (or Int32) before calling setBytes and use
csize_t(sizeof(int32)) when passing their sizes (same for any other ints sent to
the GPU like lda), and update references to addr m/addr n to point at the int32
variables.
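For reference, the BLAS `sgemv` semantics the kernel would need to honour can be sketched in NumPy (illustrative names, not Arraymancer API): `trans` selects `A` vs `A.T`, and `incx`/`incy` stride through the vectors.

```python
import numpy as np

# NumPy sketch of BLAS-style gemv: y = alpha * op(A) @ x + beta * y,
# where op(A) is A or A.T and incx/incy are vector strides.
def gemv_ref(trans, alpha, A, x, incx, beta, y, incy):
    op = A.T if trans else A
    m, n = op.shape
    xs = x[: n * incx : incx]            # n logical elements of x
    idx = np.arange(0, m * incy, incy)   # m logical slots of y
    y[idx] = alpha * (op @ xs) + beta * y[idx]
    return y

A = np.arange(6, dtype=np.float32).reshape(2, 3)
x = np.ones(3, dtype=np.float32)
y = np.zeros(2, dtype=np.float32)
gemv_ref(False, 1.0, A, x, 1, 0.0, y, 1)
print(y)  # row sums of A: [3., 12.]
```

Any combination the Metal path cannot reproduce (transposed op, non-unit strides) should raise rather than fall through to the plain kernel.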
```nim
var sizeVar = size
executeElementwiseKernel(kernelName, @[A, B, C], @[cast[pointer](addr sizeVar)], @[sizeof(int)], size)
```
Same int vs int32 type-size issue for sizeVar.
sizeVar is Nim int (8 bytes on 64-bit) but the Metal kernel almost certainly expects a 4-byte integer. This is the same root cause as flagged in executeGemmKernel — all kernel constant parameters should use int32 consistently.
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/backend/metal/metal_backend.nim` around lines 418 -
419, The local variable sizeVar is declared as Nim int (64-bit on x86_64) but
the Metal kernel expects a 4-byte integer; change sizeVar to an explicit int32
and initialize it from size (e.g. sizeVar: int32 = int32(size)), and pass
sizeof(int32) to executeElementwiseKernel so the pointer/size match what the
kernel expects; follow the same pattern used in executeGemmKernel to ensure all
kernel constant parameters use int32 consistently.
```nim
# TODO: remove this contiguous layout constraint
if not a.isContiguous:
  raise newException(ValueError, "NotImplemented: for now matrix should be contiguous")

let M = a.shape[0]
let N = a.shape[1]

# Use GPU kernel for matrix-vector multiplication
metalGemv[T](
  false, # trans
  M, N,
  alpha,
  a.storage.Fbuffer, M, # lda = rows of A
  x.storage.Fbuffer, 1, # incx = 1
  beta,
  y.storage.Fbuffer, 1  # incy = 1
)
```
Vector x is not checked for contiguity.
metalMV_y_eq_aAx_p_by checks that a is contiguous but passes incx = 1 for x unconditionally. If x is a non-contiguous slice, the kernel will read wrong data.
Proposed fix

```diff
  if not a.isContiguous:
    raise newException(ValueError, "NotImplemented: for now matrix should be contiguous")
+ if not x.isContiguous:
+   raise newException(ValueError, "NotImplemented: for now vector should be contiguous")
```

🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/operators_blas_l2l3_metal.nim` around lines 26 - 42,
The function metalMV_y_eq_aAx_p_by currently only verifies a.isContiguous but
always passes incx = 1 to metalGemv which is unsafe for non-contiguous x; update
metalMV_y_eq_aAx_p_by to validate x contiguity (e.g., check x.isContiguous) and
if not contiguous either raise a ValueError similar to the a check or create a
contiguous temporary buffer (copy x.storage.Fbuffer into a contiguous buffer)
and pass that buffer with incx = 1 to metalGemv; ensure references to
x.storage.Fbuffer and the chosen incx value are adjusted and that any temporary
buffer is kept alive for the call.
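The failure mode for a non-contiguous `x` is easy to demonstrate in NumPy terms: a kernel that assumes unit stride reads the backing buffer linearly and picks up elements that are not part of the logical vector.

```python
import numpy as np

# Illustration: a strided view over a buffer vs. what a kernel assuming
# incx == 1 would actually read from the same backing storage.
base = np.arange(8, dtype=np.float32)
x = base[::2]                     # logical vector [0, 2, 4, 6], stride 2

flat = np.frombuffer(base.tobytes(), dtype=np.float32)
wrong = flat[: x.size]            # unit-stride read of the raw buffer

print(wrong.tolist())  # first 4 stored elements, not the slice's values
print(x.tolist())
```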
```nim
proc `/.`*[T: SomeFloat](val: T, t: MetalTensor[T]): MetalTensor[T] {.noinit.} =
  ## Broadcasted division of a float by a tensor of floats.
  ## For now, fall back to CPU
  let cpuT = t.toCpu()
  let cpuResult = val /. cpuT
  result = cpuResult.metal()
```
Add missing import of init_metal for toCpu() and metal() procs.
Lines 153–158 call t.toCpu() and cpuResult.metal(), but init_metal is not imported in this file. While tensor_metal.nim composes both modules, operators_broadcasted_metal.nim must declare its own dependencies. Without the import, the file cannot be used independently. Additionally, the CPU fallback carries a significant performance penalty — consider adding a TODO to implement a native Metal kernel for scalar-tensor division.
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/operators_broadcasted_metal.nim` around lines 153 -
158, The file is missing an import of init_metal which provides the toCpu() and
metal() procs used in the broadcasted scalar-by-tensor division proc `/.`*[T:
SomeFloat](val: T, t: MetalTensor[T]); add `import init_metal` (or the module
that re-exports toCpu/metal) at the top of operators_broadcasted_metal.nim so
the calls to t.toCpu() and cpuResult.metal() resolve when this file is compiled
independently, and add a TODO comment inside `/.`* to implement a native Metal
kernel for scalar-tensor division to avoid the CPU fallback performance penalty.
```nim
proc newMetalStorage*[T: SomeFloat](length: int): MetalStorage[T] =
  result.Flen = length
  result.Fbuffer = createMetalBuffer(length * sizeof(T))
  new result.Fref_tracking
  result.Fref_tracking.value = result.Fbuffer
```
Missing =destroy hook for MetalTensorRefTrackerObj causes GPU memory leaks.
The CUDA backend defines a =destroy proc on CudaTensorRefTrackerObj (data_structure.nim lines 61–67) that calls cudaFree when the ref tracker is garbage collected. MetalTensorRefTrackerObj lacks an equivalent destructor, so Metal buffer resources are never released when tensors are collected. The buffer pool functions (releaseBuffer, cleanupPool) exist but are not automatically invoked—they must be called manually by user code, which is not a reliable cleanup mechanism.
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/private/p_init_metal.nim` around lines 19 - 23, Add a
destructor for MetalTensorRefTrackerObj analogous to the CUDA backend so Metal
buffers are freed when the tracker is GC'd: implement a =destroy proc for
MetalTensorRefTrackerObj that retrieves the stored buffer from the ref tracker
(the same field used by newMetalStorage: MetalStorage.Fref_tracking.value) and
releases it via the Metal buffer release routine (or call the existing
releaseBuffer/cleanupPool helper). Place the =destroy alongside other
data_structure destructors so garbage collection invokes it and prevents GPU
memory leaks.
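The ownership pattern the review asks for can be sketched by analogy in Python (illustrative only — Nim's `=destroy` hook is the real mechanism): the ref tracker owns the buffer, and a finalizer releases it when the tracker itself is collected, rather than relying on user code to call a cleanup routine.

```python
import weakref

# Analogy for the =destroy hook: cleanup runs automatically when the
# tracker object becomes unreachable, not when a user remembers to call
# releaseBuffer/cleanupPool.
released = []

class Tracker:
    def __init__(self, buffer_id):
        self.value = buffer_id
        # finalizer fires when the Tracker is garbage collected
        self._fin = weakref.finalize(self, released.append, buffer_id)

t = Tracker(42)
del t            # tracker unreachable -> buffer released
print(released)  # [42]
```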
```nim
proc asContiguous*[T: SomeFloat](t: MetalTensor[T], layout: OrderType = rowMajor, force: bool = false): MetalTensor[T] {.noinit.} =
  ## Return a contiguous copy of the MetalTensor using GPU kernel
  if not force and t.isContiguous:
    return t

  # Create new contiguous tensor
  result = newMetalTensor[T](t.shape)

  # Convert shape and strides to seq for GPU kernel
  var shapeSeq = newSeq[int](t.rank)
  var stridesSeq = newSeq[int](t.rank)
  for i in 0 ..< t.rank:
    shapeSeq[i] = t.shape[i]
    stridesSeq[i] = t.strides[i]

  # Use GPU kernel to copy data with arbitrary strides to contiguous buffer
  metalContiguousCopy[T](
    t.storage.Fbuffer,
    result.storage.Fbuffer,
    shapeSeq,
    stridesSeq,
    t.storage.Flen
  )
```
Pass result.size (product of output shape) instead of t.storage.Flen to control GPU thread dispatch.
In asContiguous, t.storage.Flen represents the underlying buffer's total element count, which can exceed the logical size of a sliced or strided view. The kernel dispatches one thread per element, so passing t.storage.Flen to metalContiguousCopy causes out-of-bounds writes to the output buffer (which is allocated for only product(t.shape) elements). Use result.size (or equivalently, the product of shapeSeq) to dispatch the correct number of threads.
Proposed fix
metalContiguousCopy[T](
t.storage.Fbuffer,
result.storage.Fbuffer,
shapeSeq,
stridesSeq,
- t.storage.Flen
+ result.size
)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
proc asContiguous*[T: SomeFloat](t: MetalTensor[T], layout: OrderType = rowMajor, force: bool = false): MetalTensor[T] {.noinit.} =
  ## Return a contiguous copy of the MetalTensor using GPU kernel
  if not force and t.isContiguous:
    return t
  # Create new contiguous tensor
  result = newMetalTensor[T](t.shape)
  # Convert shape and strides to seq for GPU kernel
  var shapeSeq = newSeq[int](t.rank)
  var stridesSeq = newSeq[int](t.rank)
  for i in 0 ..< t.rank:
    shapeSeq[i] = t.shape[i]
    stridesSeq[i] = t.strides[i]
  # Use GPU kernel to copy data with arbitrary strides to contiguous buffer
  metalContiguousCopy[T](
    t.storage.Fbuffer,
    result.storage.Fbuffer,
    shapeSeq,
    stridesSeq,
    t.storage.Flen
  )

proc asContiguous*[T: SomeFloat](t: MetalTensor[T], layout: OrderType = rowMajor, force: bool = false): MetalTensor[T] {.noinit.} =
  ## Return a contiguous copy of the MetalTensor using GPU kernel
  if not force and t.isContiguous:
    return t
  # Create new contiguous tensor
  result = newMetalTensor[T](t.shape)
  # Convert shape and strides to seq for GPU kernel
  var shapeSeq = newSeq[int](t.rank)
  var stridesSeq = newSeq[int](t.rank)
  for i in 0 ..< t.rank:
    shapeSeq[i] = t.shape[i]
    stridesSeq[i] = t.strides[i]
  # Use GPU kernel to copy data with arbitrary strides to contiguous buffer
  metalContiguousCopy[T](
    t.storage.Fbuffer,
    result.storage.Fbuffer,
    shapeSeq,
    stridesSeq,
    result.size
  )
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/shapeshifting_metal.nim` around lines 60 - 82, The GPU
kernel dispatch length is using t.storage.Flen which is the underlying buffer
length and can exceed the logical tensor size; in asContiguous (MetalTensor)
pass the logical output element count instead (use result.size or the product of
shapeSeq) to metalContiguousCopy so threads match the allocated contiguous
buffer; update the metalContiguousCopy call to replace t.storage.Flen with
result.size (or product(shapeSeq)) ensuring safe GPU thread dispatch.
proc reshape*(t: MetalTensor, new_shape: varargs[int]): MetalTensor =
  ## Reshape a MetalTensor without copy.
  ##
  ## ⚠ Reshaping without copy is only possible on contiguous rowMajor Tensors
  reshape_no_copy(t, new_shape, result, rowMajor)
  result.storage = t.storage
  result.cpuData = t.cpuData
🧩 Analysis chain
🏁 Script executed:
rg -n "reshape_no_copy" --type nim -C 5
Length of output: 4857
🏁 Script executed:
rg -n "is_C_contiguous" --type nim -C 3
Length of output: 23784
🏁 Script executed:
rg -n "type MetalTensor" --type nim -C 5
Length of output: 45
🏁 Script executed:
grep -n "MetalTensor" src/arraymancer/tensor/data_structure.nim | head -20
Length of output: 963
🏁 Script executed:
sed -n '90,220p' src/arraymancer/tensor/data_structure.nim
Length of output: 4609
🏁 Script executed:
sed -n '84,95p' src/arraymancer/tensor/shapeshifting_metal.nim
Length of output: 502
Add contiguity validation to MetalTensor.reshape before calling reshape_no_copy.
The MetalTensor reshape implementation lacks the contiguity check present in the CPU tensor version. The CPU implementation validates is_C_contiguous before calling reshape_no_copy and falls back to copying on non-contiguous tensors. MetalTensor.reshape bypasses this validation entirely, risking incorrect reshaping of non-contiguous tensors. Add the same contiguity guard as the CPU version at line 89.
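For reference, the contiguity test the CPU path relies on can be modeled in a few lines. This is a sketch following the usual row-major convention (as in NumPy and Arraymancer's CPU tensors); the function names are illustrative, not the library's API. A tensor is C-contiguous iff, walking dimensions from innermost to outermost, each stride equals the product of the trailing dimension sizes, so a flat row-major traversal visits the buffer without gaps:

```python
# Sketch of a row-major (C) contiguity check guarding a no-copy reshape.
# Names are illustrative; the real guard in Arraymancer is is_C_contiguous.
def is_c_contiguous(shape, strides):
    expected = 1
    for dim in reversed(range(len(shape))):
        # Size-1 dimensions are contiguous regardless of their stride.
        if shape[dim] != 1 and strides[dim] != expected:
            return False
        expected *= shape[dim]
    return True

def reshape_path(shape, strides):
    # Mirrors the CPU tensor's branch: metadata-only rebind vs. copy fallback.
    if is_c_contiguous(shape, strides):
        return "no-copy"   # safe: rebind storage with the new shape
    return "copy"          # view with gaps: materialize contiguous buffer first

print(reshape_path([2, 3], [3, 1]))  # no-copy — plain row-major tensor
print(reshape_path([2, 2], [4, 1]))  # copy    — 2x2 view of a 4x4 buffer
```

Applying the same branch before `reshape_no_copy` in the Metal path would make a strided view take the copy fallback (e.g. via `asContiguous`) instead of silently rebinding the wrong layout.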
🤖 Prompt for AI Agents
In `@src/arraymancer/tensor/shapeshifting_metal.nim` around lines 84 - 91, The
MetalTensor.reshape* implementation currently calls reshape_no_copy directly and
copies storage/cpuData without verifying contiguity; add the same contiguity
guard used by the CPU tensor: check t.is_C_contiguous (or equivalent
is_C_contiguous) before calling reshape_no_copy and if not contiguous perform a
safe copy/reshape path (i.e., allocate a contiguous result and copy data) so
reshape_no_copy is only invoked for rowMajor-contiguous tensors; update
reshape*(t: MetalTensor, new_shape: varargs[int]) to mirror the CPU path around
reshape_no_copy, preserving result.storage and result.cpuData only after the
contiguity-safe branch.
Summary by CodeRabbit