[WIP] src: implement a prototype for uv coroutines #62494

jasnell wants to merge 1 commit into nodejs:main
@addaleax ... very curious in your thoughts around this. good idea? bad idea? really terrible idea? feasible? etc
(comment on diff)

```cpp
// ---------------------------------------------------------------------------
template <typename Fn, typename... Args>
class UvFsAwaitable {
```
This initial POC uses a different Uv*Awaitable impl for each kind of libuv req handle. It's likely possible to make it a bit more generic but this works well enough for the prototype.
Force-pushed 4bf88eb to 6e6b092
Asked opencode/opus to generate a description of the relative profiles to help with evaluation... The details here really break down why this would be a benefit. To be clear, the callback case is still the fastest and most efficient; that wouldn't change. However, using the coroutine approach here instead of the …

Performance characteristics: C++ coroutine libuv integration

Allocation profile

Single operation (e.g., …)
| | Callback (FSReqCallback) | Promise (FSReqPromise) | Coroutine, no hooks | Coroutine, hooks active |
|---|---|---|---|---|
| C++ heap allocs | 1 (500 bytes) | 3 (~970 bytes) | 0 (free-list hit) | 0 (free-list hit) |
| V8 heap objects | 1 (JS wrapper) | 7 (wrapper + Resolver + Promise + 2 ArrayBuffer + 2 TypedArray) | 2 (Resolver + Promise) | 3 (+ resource Object) |
| Total allocs | 2 | 10 | 2 | 3 |
The FSReqPromise path unconditionally allocates two AliasedBufferT typed array pairs
(for stat and statfs fields) even for operations that never use them. The coroutine path
has no wasted allocations.
In steady state the coroutine frame is served from a thread-local free-list with 256-byte
size class granularity (up to 4096 bytes, 32 frames per bucket). No malloc/free calls
occur after the first coroutine of each size class completes.
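The free-list just described can be sketched roughly as below. This is a standalone approximation: the constants mirror the text, but `frame_alloc`/`frame_free` are illustrative names, and the prototype wires the equivalent logic into `promise_type::operator new/delete`.

```cpp
// Hedged sketch of the thread-local size-class free-list described above.
// Constants mirror the text; the real implementation may differ in detail.
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

constexpr std::size_t kGranularity = 256;   // size-class step
constexpr std::size_t kMaxSize = 4096;      // largest cached size class
constexpr std::size_t kBuckets = kMaxSize / kGranularity;
constexpr std::size_t kMaxCachedPerBucket = 32;

// One free-list per size class, per thread.
thread_local std::vector<void*> buckets[kBuckets];

std::size_t bucket_index(std::size_t size) {
  if (size == 0) size = 1;                  // guard the 0-byte edge case
  return (size + kGranularity - 1) / kGranularity - 1;  // round up to class
}

void* frame_alloc(std::size_t size) {
  if (size > kMaxSize) return std::malloc(size);  // too large: plain malloc
  auto& bucket = buckets[bucket_index(size)];
  if (!bucket.empty()) {                    // free-list hit: no malloc
    void* p = bucket.back();
    bucket.pop_back();
    return p;
  }
  // Allocate the rounded-up class size so the block is reusable later.
  return std::malloc((bucket_index(size) + 1) * kGranularity);
}

void frame_free(void* p, std::size_t size) {
  if (size > kMaxSize) { std::free(p); return; }
  auto& bucket = buckets[bucket_index(size)];
  if (bucket.size() < kMaxCachedPerBucket)
    bucket.push_back(p);                    // cache for the next coroutine
  else
    std::free(p);                           // bucket full: really free
}
```

Two frame sizes that round up to the same 256-byte class share a bucket, so the second allocation is served without touching `malloc`.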
Multi-step operation (open + stat + read + close)

| | 4x Callback | 4x Promise | 1 Coroutine, no hooks | 1 Coroutine, hooks active |
|---|---|---|---|---|
| C++ heap allocs | 4 | 12 | 0 (free-list) | 0 (free-list) |
| V8 heap objects | 4 (+ 3 if async/await) | 28 | 2 | 3 |
| Total allocs | 7-8 | 40 | 2 | 3 |
| malloc/free pairs | 4 | ~12 | 0 (steady state) | 0 (steady state) |
The multi-step case is where the coroutine pattern provides the largest benefit. A single
coroutine frame serves all four libuv operations. The intermediate steps (open result
dispatches stat, stat result dispatches read, etc.) are pure C++ with no V8 involvement.
Per-operation overhead

Dispatch path (JS binding call to libuv dispatch)

| | Callback | Promise | Coroutine (no hooks) |
|---|---|---|---|
| HandleScopes | 1 | 1 | 2 (init + on_resume) |
| V8 API calls (req setup) | 2 | ~8 | 3 |
| InternalCallbackScope | 0 | 0 | 1 (initial segment) |
| Object::New | 0 | 1 | 0 |
| malloc | 0 | 1 | 0 (free-list) |
The coroutine pays one extra InternalCallbackScope cycle for the initial segment (from
Start() to the first co_await). When no hooks are registered and no ticks are
scheduled, this is roughly 10 cheap operations (integer increments, pointer swaps,
branch-not-taken checks).
Completion path (libuv callback to JS notification)

| | Callback | Promise | Coroutine |
|---|---|---|---|
| HandleScopes | 3-4 | 3 | 2 |
| InternalCallbackScope | 1 | 1 | 1 |
| V8 property lookups | 1 ("oncomplete") | 1 ("promise") | 0 |
| MakeCallback chain | 3 levels | N/A | N/A |
| BaseObjectPtr ref counting | ~3 ops | ~3 ops | 0 |
| Weak ref cycles | 1 | 1 | 0 |
| free | 1 | 1 | 0 (returned to free-list) |
The coroutine avoids the MakeCallback indirection chain (AsyncWrap::MakeCallback
string overload -> Name overload -> Function overload -> InternalMakeCallback), the V8
property lookup for the callback function, all BaseObjectPtr reference counting, weak
reference management, and the BaseObject destructor chain.
Multi-step totals

| | 4x Callback | 4x Promise | 1 Coroutine |
|---|---|---|---|
| InternalCallbackScope entries | 4 | 4 | 5 |
| Microtask drains (non-trivial) | 4 | 7 | 1 |
| JS/C++ boundary crossings | 8 | 4 | 1 |
| MakeCallback chains | 4 | 0 | 0 |
| V8 property lookups | 4 | 4 | 0 |
| Ref counting ops | ~12 | ~12 | 0 |
| Weak ref cycles | 4 | 4 | 0 |
| Promise .then() reactions | 3 (if async/await) | 3 | 0 |
For the 4-step readFile coroutine, only the final resolve crosses the JS/C++ boundary.
The three intermediate steps are handle_.resume() -> coroutine body -> dispatch next
libuv call -> suspend. Each intermediate resume segment goes through an
InternalCallbackScope for correctness (async_hooks, context frame, draining), but the
drain finds empty queues and returns after a few comparisons.
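A toy RAII sketch of that per-segment scope, under stated assumptions: this is not Node's actual `InternalCallbackScope`; the queue and counter are stand-ins used only to show why an empty drain is cheap.

```cpp
// Toy RAII sketch (not Node's InternalCallbackScope): each resume-to-suspend
// segment is wrapped in a scope whose destructor drains the pending queue.
// With nothing queued, the drain is just an empty-check, which is what keeps
// the intermediate segments inexpensive.
#include <cassert>
#include <functional>
#include <queue>

static std::queue<std::function<void()>> microtasks;  // stand-in queue
static int drained = 0;                               // tasks actually run

struct SegmentScope {
  // The constructor would push the async context; omitted in this sketch.
  ~SegmentScope() {       // plays the role of InternalCallbackScope::Close()
    while (!microtasks.empty()) {
      auto task = std::move(microtasks.front());
      microtasks.pop();
      ++drained;
      task();
    }
  }
};
```

An empty-queue segment runs zero tasks on close, while a queued microtask is drained exactly when the segment's scope ends.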
Per-resume segment cost (intermediate steps, no hooks)

Each intermediate step (e.g., "open completed, dispatch stat") costs:

| Operation | Cost |
|---|---|
| handle_.resume() | indirect jump |
| DecreaseWaitingRequestCounter | integer decrement |
| HandleScope (persistent across segment) | V8 handle limit push |
| InternalCallbackScope constructor | ~8 ops (SetIdle, exchange context frame, push async context) |
| Coroutine body: process result, dispatch next | application-specific |
| InternalCallbackScope::Close | ~6 ops (EmitAfter skipped, pop context, PerformCheckpoint no-op, tick check no-op) |
| HandleScope destroy | V8 handle limit pop |
| IncreaseWaitingRequestCounter | integer increment |
| Inner await_suspend: dispatch libuv call | the actual I/O |
Total: roughly 20-25 cheap operations per intermediate step. No heap allocations, no JS
calls, no string operations, no object creation.
Optimization details

Four specific optimizations reduce overhead compared to a naive coroutine implementation:

- Lazy resource object: Object::New(isolate) (the most expensive single operation in init_tracking) is only called when async_hooks listeners are active (kInit > 0 or kUsesExecutionAsyncResource > 0). InternalCallbackScope was updated to accept a null Global<Object>* for the resource, with push_async_context already handling null correctly by skipping the native_execution_async_resources_ store.
- Thread-local frame allocator: Coroutine frames are allocated from a thread-local free-list with 256-byte size class granularity via promise_type::operator new/delete. Each bucket holds up to 32 frames (bounded by kMaxCachedPerBucket). After the first coroutine of a given size class completes, subsequent ones have zero malloc overhead.
- Cached type name string: The V8 string for the async resource type name is cached per Isolate in IsolateData::static_str_map (the same mechanism used by HTTP/2 header name caching). Only the first coroutine of a given type per Isolate pays the String::NewFromUtf8 cost. Safe with Worker threads since each Worker has its own IsolateData.
- No AdjustAmountOfExternalAllocatedMemory: Removed. The inaccurate fixed estimate (1024 bytes) was giving V8 misleading GC pressure signals, the API is deprecated in favor of ExternalMemoryAccounter, and short-lived coroutine frames don't benefit from GC heuristic adjustments.
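The lazy-resource optimization can be sketched minimally as follows. The names here are illustrative only: in the real code the expensive call is Object::New(isolate) and the flags are the async_hooks kInit / kUsesExecutionAsyncResource counters.

```cpp
// Minimal sketch of the lazy-resource idea (illustrative names only): the
// expensive allocation runs only when an async_hooks counter says someone
// can actually observe the resource.
#include <cassert>
#include <memory>

struct Resource {};                       // stands in for the v8 object

static int kInit = 0;                     // async_hooks init listener count
static int kUsesExecutionAsyncResource = 0;
static int expensive_allocs = 0;          // times "Object::New" actually ran

// Callers (like a null-tolerant InternalCallbackScope) must accept nullptr.
std::unique_ptr<Resource> maybe_make_resource() {
  if (kInit == 0 && kUsesExecutionAsyncResource == 0)
    return nullptr;                       // no listeners: skip the alloc
  ++expensive_allocs;                     // the "Object::New(isolate)" cost
  return std::make_unique<Resource>();
}
```

With no listeners registered the function returns null and allocates nothing; flipping a counter makes the very next call pay the allocation.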
Node.js integration

The coroutine implementation provides the same integration as the existing ReqWrap pattern:

| Feature | Status |
|---|---|
| async_hooks init/before/after/destroy | Full support via InternalCallbackScope + EmitAsyncInit/EmitDestroy |
| AsyncLocalStorage | Full support via async_context_frame save/restore in InternalCallbackScope |
| executionAsyncResource() | Supported (resource object created when hooks are active) |
| Microtask draining | Every resume-to-suspend segment drains via InternalCallbackScope::Close |
| process.nextTick draining | Same mechanism as microtask draining |
| Event loop liveness | request_waiting_ incremented on dispatch, decremented on completion |
| Environment teardown | Coroutine task list iterated in CleanupHandles(), uv_cancel() called on each |
| can_call_into_js() guard | Checked in resolve/reject helpers |
| Trace events | TRACE_EVENT_NESTABLE_ASYNC_BEGIN/END emitted in init/destroy |
| Unhandled exceptions | Detached coroutines with captured exceptions call std::terminate() |
Force-pushed 6e6b092 to 3859c94
/cc @nodejs/libuv
Signed-off-by: James M Snell <jasnell@gmail.com>
Assisted-by: Opencode/Open 4.6
Force-pushed 3859c94 to 7cdb3fb
It's an interesting idea. My main questions/concerns are:
- How do existing HandleScopes (and other scopes) interact with this if we were to want to call into some coroutines thing when we're already in a scope?
- What do the benchmarks say? It'd be great to have concrete numbers for memory use at load.
- Could we perhaps design the dispatch to also support carrying callbacks through to completion, so we could have symmetrical support for callbacks and promises everywhere? It'd be nice if we could get both free promise support for everything and also eliminate the unnecessary overhead from wrapping promises around callbacks.
- Could this enable writing everything as coroutines, with a translation layer to promises/callbacks, such that sequencing of native-to-native things could skip the JS transition but still be operated through the same interfaces? It'd be super nice if we could compose calls together and have their linkage avoid barrier crossings when not necessary. You have the FS example of that as a distinct function, but I'm wondering about having the individual steps usable as their own calls from JS, so that we can put all the business logic in composable pieces on the C++ side that translate trivially to JS.
- Presumably the C++20 coroutines are just their own separately managed stacks and don't influence the other non-coroutines as fiber systems would? Just want to be sure we aren't getting into situations like swapping out libuv memory and then blowing up when a signal handler triggers and is not in the state it expects.
It really Just Works. If you take a look at the example uses in node_file.c++, specifically the two v8 callback functions that accept the …
I haven't run any yet. Before I went through the trouble of crafting benchmarks I wanted to make sure there weren't any obvious reasons why we shouldn't do this at all.
Potentially, yes. The callback version currently is still going to be a bit faster because the coroutine setup itself has some overhead. Coalescing is worth exploring, however, we'd just need to make sure there's not too much of a perf cost.
That's something we need to fully test but I don't believe it'll be a problem.
The only concern I would really have in that regard is the significant chance of behaviour changes. This is interacting with quite a bit of scheduling machinery, so there's a lot of potential for edge cases. I haven't spotted anything on a cursory glance, but obviously that'd need a whole lot more review before such a change should be landed. I think what we need for this is a lot more research/validation than the actual build itself. Confidence of correctness is hard on something like this.
This is just a proof-of-concept, NOT MEANT TO BE CONSIDERED FOR MERGE YET. I'm opening this to get early feedback on the ideas / approach.
What this does is implement coroutines for libuv calls. There are tests and example bindings that demonstrate the feasibility. The intent here is to demonstrate viability, solicit feedback on the implementation approach, determine if this is something we'd want to do, allow experimentation, etc.
This proves that coroutine support is possible. The next consideration is whether it would be desirable. There are a few notable gaps in the POC currently:
- No async context tracking (push_async_context/pop_async_context). This also means async context frames are not propagated.
- The typical microtask scheduling is not fully enabled.
- V8 idle time notifications would need to be worked through.

This is not meant to land as-is, so "Request Changes" or "Approvals" are unnecessary at this point. I'm looking solely for feedback on the idea.