<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://async-java.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://async-java.github.io/" rel="alternate" type="text/html" /><updated>2026-05-17T22:40:12+00:00</updated><id>https://async-java.github.io/feed.xml</id><title type="html">async.java</title><subtitle>async.java is a Java 11+ port of the Node.js async control-flow library. Parallel, Series, Waterfall, Race, Map, Reduce, Queue, Lock — small, predictable, virtual-thread friendly.</subtitle><entry><title type="html">async.java vs Akka Streams: a load-curve story</title><link href="https://async-java.github.io/blog/2026/05/17/async-java-vs-akka-streams/" rel="alternate" type="text/html" title="async.java vs Akka Streams: a load-curve story" /><published>2026-05-17T00:00:00+00:00</published><updated>2026-05-17T00:00:00+00:00</updated><id>https://async-java.github.io/blog/2026/05/17/async-java-vs-akka-streams</id><content type="html" xml:base="https://async-java.github.io/blog/2026/05/17/async-java-vs-akka-streams/"><![CDATA[<p>A few months back we replaced an Akka Streams hot path with <a href="https://github.com/async-java/async.java">async.java</a> and the median latency dropped, the tail latency dropped <em>more</em>, and we stopped seeing the occasional saturation cliff. That was surprising enough that we built a controlled benchmark to figure out why. This post is the writeup.</p>

<p>Short version: <strong>async.java is faster than Akka Streams for per-request orchestration because it doesn’t materialise a graph per call.</strong> Akka Streams is <em>better</em> for long-running stream consumers, where that materialisation cost amortises to zero. They’re solving different problems, but if you’re using one for the other’s workload, the wrong-tool tax shows up sharply in the tail.</p>

<p>Numbers, then mechanism, then the Loom angle.</p>

<h2 id="the-benchmark">The benchmark</h2>

<p>Five-stage WebSocket request pipeline. Same business logic both ways, byte-identical, in <code class="language-plaintext highlighter-rouge">PipelineStages.java</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parse JSON → validate → enrich (lookupA ∥ lookupB) → score → serialize
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">lookupA</code> and <code class="language-plaintext highlighter-rouge">lookupB</code> each <code class="language-plaintext highlighter-rouge">Thread.sleep</code> 1-4 ms (simulating an HTTP/DB hop) and run <em>concurrently</em>. The rest is sequential. Per-message work hovers around 5 ms.</p>

<p>Two pipeline implementations:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// async.java version (excerpt)</span>
<span class="kd">final</span> <span class="kt">var</span> <span class="n">lookups</span> <span class="o">=</span> <span class="nc">List</span><span class="o">.</span><span class="na">of</span><span class="o">(</span>
  <span class="n">cb</span> <span class="o">-&gt;</span> <span class="n">exec</span><span class="o">.</span><span class="na">submit</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="n">cb</span><span class="o">.</span><span class="na">done</span><span class="o">(</span><span class="kc">null</span><span class="o">,</span> <span class="n">enrichLookupA</span><span class="o">(</span><span class="n">validated</span><span class="o">))),</span>
  <span class="n">cb</span> <span class="o">-&gt;</span> <span class="n">exec</span><span class="o">.</span><span class="na">submit</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="n">cb</span><span class="o">.</span><span class="na">done</span><span class="o">(</span><span class="kc">null</span><span class="o">,</span> <span class="n">enrichLookupB</span><span class="o">(</span><span class="n">validated</span><span class="o">)))</span>
<span class="o">);</span>
<span class="nc">Asyncc</span><span class="o">.</span><span class="na">Parallel</span><span class="o">(</span><span class="n">lookups</span><span class="o">,</span> <span class="o">(</span><span class="n">err</span><span class="o">,</span> <span class="n">results</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="o">{</span>
  <span class="n">result</span><span class="o">.</span><span class="na">complete</span><span class="o">(</span><span class="n">serialize</span><span class="o">(</span><span class="n">score</span><span class="o">(</span><span class="n">validated</span><span class="o">,</span> <span class="n">results</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="mi">0</span><span class="o">),</span> <span class="n">results</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="mi">1</span><span class="o">))));</span>
<span class="o">});</span>
</code></pre></div></div>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Akka Streams version (excerpt)</span>
<span class="k">return</span> <span class="nc">Source</span><span class="o">.</span><span class="na">single</span><span class="o">(</span><span class="n">inputFrame</span><span class="o">)</span>
    <span class="o">.</span><span class="na">via</span><span class="o">(</span><span class="n">parseFlow</span><span class="o">)</span>
    <span class="o">.</span><span class="na">via</span><span class="o">(</span><span class="n">validateFlow</span><span class="o">)</span>
    <span class="o">.</span><span class="na">via</span><span class="o">(</span><span class="n">enrichFlow</span><span class="o">)</span>        <span class="c1">// mapAsync(2) — runs both lookups in parallel</span>
    <span class="o">.</span><span class="na">via</span><span class="o">(</span><span class="n">scoreFlow</span><span class="o">)</span>
    <span class="o">.</span><span class="na">via</span><span class="o">(</span><span class="n">serializeFlow</span><span class="o">)</span>
    <span class="o">.</span><span class="na">runWith</span><span class="o">(</span><span class="nc">Sink</span><span class="o">.</span><span class="na">head</span><span class="o">(),</span> <span class="n">system</span><span class="o">);</span>
</code></pre></div></div>

<p>Both expose the same external signature: <code class="language-plaintext highlighter-rouge">String → CompletionStage&lt;String&gt;</code>. The Akka HTTP WebSocket handler wraps each frame with <code class="language-plaintext highlighter-rouge">mapAsync(8, pipeline::process)</code>, so up to 8 messages are in flight per WS connection, whichever pipeline handles them.</p>

<p>The setup: an <a href="https://github.com/oresoftware/k8s-cluster/tree/main/remote/akka-ws-server">Akka HTTP + WebSocket server</a> exposing <code class="language-plaintext highlighter-rouge">/ws/asyncjava</code> and <code class="language-plaintext highlighter-rouge">/ws/akkastreams</code>. A <a href="https://github.com/oresoftware/k8s-cluster/tree/main/remote/ws-loadtest-rs">Rust pipeline-mode load tester</a> opens N clients, each sending JSON at a fixed rate, correlating responses by request ID with a 15-second timeout. Latency tracked with <code class="language-plaintext highlighter-rouge">hdrhistogram</code>.</p>

<p>Hardware: single host. JDK 21 (Eclipse Temurin), virtual-thread executor for async.java’s task submission, Akka Streams’ default ForkJoinPool dispatcher. async.java is <code class="language-plaintext highlighter-rouge">com.github.async-java:async.java:v0.2.2</code>; Akka Streams is 2.8.8. All three concurrency fixes from the v0.2.x cycle are applied (see <a href="#was-asyncjava-always-this-fast">§ Was async.java always this fast?</a> below).</p>

<h2 id="the-load-curve">The load curve</h2>

<table>
  <thead>
    <tr>
      <th>offered load</th>
      <th>clients × rate/s</th>
      <th>async.java p50 / p95 / p99 / max</th>
      <th>akka-streams p50 / p95 / p99 / max</th>
      <th>drops async / akka</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>10 msg/s</td>
      <td>1 × 10</td>
      <td>10.0 / 16.1 / 18.7 / 27 ms</td>
      <td>8.6 / 12.8 / 15.3 / 16 ms</td>
      <td>0 / 0</td>
    </tr>
    <tr>
      <td>100 msg/s</td>
      <td>10 × 10</td>
      <td>6.9 / 11.3 / 13.3 / 16 ms</td>
      <td>7.2 / 11.4 / 13.8 / 16 ms</td>
      <td>0 / 0</td>
    </tr>
    <tr>
      <td>500 msg/s</td>
      <td>50 × 10</td>
      <td>5.7 / 10.9 / 14.3 / 46 ms</td>
      <td>17.8 / 27.7 / 30.7 / 55 ms</td>
      <td>0 / 0</td>
    </tr>
    <tr>
      <td>1 000 msg/s</td>
      <td>200 × 5</td>
      <td>5.1 / 10.1 / 14.8 / 21 ms</td>
      <td>5.9 / 34.9 / 54.3 / 100 ms</td>
      <td>0 / 0</td>
    </tr>
    <tr>
      <td>2 500 msg/s</td>
      <td>50 × 50</td>
      <td>5.0 / 8.7 / 11.5 / 18 ms</td>
      <td><strong>2 017 / 4 624 / 5 230 / 6 258 ms</strong></td>
      <td>0 / <strong>~14.3 %</strong></td>
    </tr>
  </tbody>
</table>

<p>Reading the table row by row:</p>

<ul>
  <li><strong>10 msg/s</strong>, single client, no concurrency. Both libraries are dominated by the work itself (max of two <code class="language-plaintext highlighter-rouge">Thread.sleep(1-4ms)</code>). Akka Streams is ~18 % faster at p99 (15.3 vs 18.7 ms), plausibly because its actor mailbox is well-warmed and the JIT has already inlined the stage interpreter. With only 288 samples, though, the gap is mostly JIT-warmup variance.</li>
  <li><strong>100 msg/s</strong> — moderate concurrency. The two libraries are at <strong>parity</strong>, within 4 % across all percentiles. The dispatcher has spare capacity; per-message overhead doesn’t show up.</li>
  <li><strong>500 msg/s</strong> — Akka Streams starts to wobble. p50 climbs from 7 → 18 ms while async.java’s <em>drops</em> from 7 → 6 ms (lock-free fan-out doesn’t get worse with concurrency). Both still deliver 100 %.</li>
  <li><strong>1 000 msg/s</strong> — the tail diverges sharply. async.java’s p99 = 14.8 ms (2.9× p50; healthy distribution). Akka Streams’ p99 = 54 ms (9.2× p50; long right tail). Both deliver 100 %.</li>
  <li><strong>2 500 msg/s</strong> — Akka Streams falls off a cliff. p50 = <strong>2.0 seconds</strong>, ~14 % of in-flight messages never come back within the 15-second correlation budget. async.java is unchanged: p50 = 5 ms, p99 = 11.5 ms, 0 drops, 0 correlation misses.</li>
</ul>
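
<p>The ratios quoted in that list can be recomputed from the table. A minimal sketch in plain Java: <code class="language-plaintext highlighter-rouge">LoadCurveRatios</code> and its method names are illustrative only, and the inputs are the p50/p95/p99 figures copied from the 100 and 1 000 msg/s rows.</p>

```java
import java.util.Locale;

/** Recomputes the ratios quoted in the row-by-row reading from the table's own numbers. */
final class LoadCurveRatios {

    /** How many times larger the tail is than the median (p99 / p50). */
    static double tailRatio(double p50Ms, double p99Ms) {
        return p99Ms / p50Ms;
    }

    /** Relative gap between two latencies, as a fraction of the larger one. */
    static double relativeGap(double aMs, double bMs) {
        return Math.abs(aMs - bMs) / Math.max(aMs, bMs);
    }

    /** Rounds to one decimal place for display. */
    static double round1(double x) {
        return Math.round(x * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        // 1 000 msg/s row: p50 and p99 for each library, in milliseconds.
        System.out.printf(Locale.ROOT, "async.java   p99/p50 = %.1fx%n", tailRatio(5.1, 14.8));
        System.out.printf(Locale.ROOT, "akka-streams p99/p50 = %.1fx%n", tailRatio(5.9, 54.3));

        // 100 msg/s row: largest gap between the two libraries across p50, p95 and p99.
        double worst = Math.max(relativeGap(6.9, 7.2),
                Math.max(relativeGap(11.3, 11.4), relativeGap(13.3, 13.8)));
        System.out.printf(Locale.ROOT, "100 msg/s worst gap  = %.1f%%%n", worst * 100);
    }
}
```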

<p>The shape is <strong>a knee, not a slope</strong>. Below ~500 msg/s the two are roughly comparable. From 500 to 1 000 msg/s, Akka’s p99 climbs from 31 to 54 ms while async.java’s stays near 14-15 ms. At 2 500 msg/s, the actor-mailbox queue depth blows up and median latency becomes seconds.</p>

<h2 id="why-the-mechanism">Why? The mechanism</h2>

<p>Both pipelines run identical work — they parse the same JSON, sleep the same amounts, serialise the same output. The latency gap is <strong>entirely overhead</strong>. Let’s account for it per message.</p>

<h3 id="asyncjava-per-message-overhead">async.java per-message overhead</h3>

<p>From <a href="https://github.com/oresoftware/k8s-cluster/blob/main/remote/akka-ws-server/src/main/java/com/oresoftware/dd/akkaws/pipeline/AsyncJavaPipeline.java"><code class="language-plaintext highlighter-rouge">AsyncJavaPipeline.java</code></a>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="nc">CompletableFuture</span><span class="o">&lt;</span><span class="nc">String</span><span class="o">&gt;</span> <span class="nf">process</span><span class="o">(</span><span class="kd">final</span> <span class="nc">String</span> <span class="n">inputFrame</span><span class="o">)</span> <span class="o">{</span>
  <span class="kd">final</span> <span class="nc">CompletableFuture</span><span class="o">&lt;</span><span class="nc">String</span><span class="o">&gt;</span> <span class="n">result</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">CompletableFuture</span><span class="o">&lt;&gt;();</span>
  <span class="k">try</span> <span class="o">{</span>
    <span class="kd">final</span> <span class="nc">JsonNode</span> <span class="n">parsed</span>    <span class="o">=</span> <span class="nc">PipelineStages</span><span class="o">.</span><span class="na">parse</span><span class="o">(</span><span class="n">inputFrame</span><span class="o">);</span>     <span class="c1">// sync, caller thread</span>
    <span class="kd">final</span> <span class="nc">JsonNode</span> <span class="n">validated</span> <span class="o">=</span> <span class="nc">PipelineStages</span><span class="o">.</span><span class="na">validate</span><span class="o">(</span><span class="n">parsed</span><span class="o">);</span>      <span class="c1">// sync, caller thread</span>

    <span class="kd">final</span> <span class="kt">var</span> <span class="n">lookups</span> <span class="o">=</span> <span class="nc">List</span><span class="o">.</span><span class="na">of</span><span class="o">(</span>
      <span class="n">cb</span> <span class="o">-&gt;</span> <span class="n">executor</span><span class="o">.</span><span class="na">submit</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="n">cb</span><span class="o">.</span><span class="na">done</span><span class="o">(</span><span class="kc">null</span><span class="o">,</span> <span class="n">enrichLookupA</span><span class="o">(</span><span class="n">validated</span><span class="o">))),</span>
      <span class="n">cb</span> <span class="o">-&gt;</span> <span class="n">executor</span><span class="o">.</span><span class="na">submit</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="n">cb</span><span class="o">.</span><span class="na">done</span><span class="o">(</span><span class="kc">null</span><span class="o">,</span> <span class="n">enrichLookupB</span><span class="o">(</span><span class="n">validated</span><span class="o">)))</span>
    <span class="o">);</span>

    <span class="nc">Asyncc</span><span class="o">.</span><span class="na">Parallel</span><span class="o">(</span><span class="n">lookups</span><span class="o">,</span> <span class="o">(</span><span class="n">err</span><span class="o">,</span> <span class="n">results</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="o">{</span>
      <span class="n">result</span><span class="o">.</span><span class="na">complete</span><span class="o">(</span><span class="nc">PipelineStages</span><span class="o">.</span><span class="na">serialize</span><span class="o">(</span>
        <span class="nc">PipelineStages</span><span class="o">.</span><span class="na">score</span><span class="o">(</span><span class="n">validated</span><span class="o">,</span> <span class="n">results</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="mi">0</span><span class="o">),</span> <span class="n">results</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="mi">1</span><span class="o">))));</span>
    <span class="o">});</span>
  <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">Throwable</span> <span class="n">t</span><span class="o">)</span> <span class="o">{</span>
    <span class="n">result</span><span class="o">.</span><span class="na">completeExceptionally</span><span class="o">(</span><span class="n">t</span><span class="o">);</span>
  <span class="o">}</span>
  <span class="k">return</span> <span class="n">result</span><span class="o">;</span>
<span class="o">}</span>
</code></pre></div></div>

<p>Per message:</p>

<ol>
  <li>One <code class="language-plaintext highlighter-rouge">CompletableFuture</code> for the boundary.</li>
  <li><code class="language-plaintext highlighter-rouge">parse</code> and <code class="language-plaintext highlighter-rouge">validate</code> run <strong>synchronously</strong> on the caller thread (the Akka HTTP <code class="language-plaintext highlighter-rouge">mapAsync</code> worker).</li>
  <li><code class="language-plaintext highlighter-rouge">List.of(...)</code> for the two tasks.</li>
  <li><code class="language-plaintext highlighter-rouge">Asyncc.Parallel</code> allocates: one <code class="language-plaintext highlighter-rouge">ParallelRunner</code>, one <code class="language-plaintext highlighter-rouge">ShortCircuit</code>, one <code class="language-plaintext highlighter-rouge">CounterLimit</code> (two <code class="language-plaintext highlighter-rouge">AtomicInteger</code>s), two <code class="language-plaintext highlighter-rouge">AsyncTaskRunner</code>s.</li>
  <li>Two <code class="language-plaintext highlighter-rouge">executor.submit(...)</code> calls. On JDK 21, <strong>virtual-thread spawn is ~250 ns</strong> — basically free.</li>
  <li>Each task lands on a VT, sleeps 1-4 ms, calls <code class="language-plaintext highlighter-rouge">cb.done(...)</code>. A dedup-guarded final callback fires once on whichever VT finishes last.</li>
</ol>

<p>Total coordination overhead: <strong>under 50 µs</strong>. It’s a handful of heap allocations, four <code class="language-plaintext highlighter-rouge">AtomicInteger.incrementAndGet()</code>s, and a callback. <strong>No mailbox, no scheduler, no shared contended queue.</strong></p>
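
<p>To make those moving parts concrete, here is a minimal sketch of a callback-style fan-out built from result slots, one atomic counter and a fire-once guard. It is a stand-in written under assumptions, not async.java’s source: <code class="language-plaintext highlighter-rouge">FanOutSketch</code>, <code class="language-plaintext highlighter-rouge">Task</code> and <code class="language-plaintext highlighter-rouge">parallel</code> are hypothetical names, and a fixed thread pool replaces the virtual-thread executor so the sketch also runs on JDK 11.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

/** Hypothetical sketch of a callback-style parallel combinator; not async.java's code. */
final class FanOutSketch {

    /** A task is handed a node-style callback and calls it with (error, value). */
    interface Task<T> extends Consumer<BiConsumer<Throwable, T>> {}

    static <T> void parallel(List<Task<T>> tasks, BiConsumer<Throwable, List<T>> onDone) {
        final int n = tasks.size();
        if (n == 0) {
            onDone.accept(null, List.of());
            return;
        }
        final AtomicReferenceArray<T> slots = new AtomicReferenceArray<>(n);
        final AtomicInteger finished = new AtomicInteger();
        final AtomicBoolean fired = new AtomicBoolean(); // guards the final callback

        for (int i = 0; i < n; i++) {
            final int index = i;
            final AtomicBoolean called = new AtomicBoolean(); // guards this task's callback
            tasks.get(i).accept((err, value) -> {
                if (!called.compareAndSet(false, true)) {
                    return; // ignore a task that calls back twice
                }
                if (err != null) {
                    // First error short-circuits; the guard keeps it to one invocation.
                    if (fired.compareAndSet(false, true)) {
                        onDone.accept(err, null);
                    }
                    return;
                }
                slots.set(index, value); // publish the result before counting it
                if (finished.incrementAndGet() == n && fired.compareAndSet(false, true)) {
                    final List<T> results = new ArrayList<>(n);
                    for (int k = 0; k < n; k++) {
                        results.add(slots.get(k));
                    }
                    onDone.accept(null, results); // runs on whichever thread finished last
                }
            });
        }
    }

    /** Runs two trivial tasks on a small pool and returns the joined results. */
    static String demo() {
        final ExecutorService exec = Executors.newFixedThreadPool(2);
        final CountDownLatch done = new CountDownLatch(1);
        final StringBuilder out = new StringBuilder();
        final List<Task<String>> tasks = List.of(
            cb -> exec.execute(() -> cb.accept(null, "A")),
            cb -> exec.execute(() -> cb.accept(null, "B")));
        parallel(tasks, (err, results) -> {
            out.append(err == null ? String.join(",", results) : "error");
            done.countDown();
        });
        try {
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("final callback never fired");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            exec.shutdown();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

<p>The ordering inside the callback is the part worth copying: the slot is written before the counter is incremented, so the thread that observes the final count is guaranteed to see every result.</p>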

<h3 id="akka-streams-per-message-overhead">Akka Streams per-message overhead</h3>

<p>From <a href="https://github.com/oresoftware/k8s-cluster/blob/main/remote/akka-ws-server/src/main/java/com/oresoftware/dd/akkaws/pipeline/AkkaStreamsPipeline.java"><code class="language-plaintext highlighter-rouge">AkkaStreamsPipeline.java</code></a>:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">public</span> <span class="nc">CompletionStage</span><span class="o">&lt;</span><span class="nc">String</span><span class="o">&gt;</span> <span class="nf">process</span><span class="o">(</span><span class="kd">final</span> <span class="nc">String</span> <span class="n">inputFrame</span><span class="o">)</span> <span class="o">{</span>
  <span class="kd">final</span> <span class="nc">Flow</span><span class="o">&lt;</span><span class="nc">String</span><span class="o">,</span> <span class="nc">JsonNode</span><span class="o">,</span> <span class="nc">NotUsed</span><span class="o">&gt;</span> <span class="n">parseFlow</span>    <span class="o">=</span> <span class="nc">Flow</span><span class="o">.&lt;</span><span class="nc">String</span><span class="o">&gt;</span><span class="n">create</span><span class="o">().</span><span class="na">map</span><span class="o">(</span><span class="nl">PipelineStages:</span><span class="o">:</span><span class="n">parse</span><span class="o">);</span>
  <span class="kd">final</span> <span class="nc">Flow</span><span class="o">&lt;</span><span class="nc">JsonNode</span><span class="o">,</span> <span class="nc">JsonNode</span><span class="o">,</span> <span class="nc">NotUsed</span><span class="o">&gt;</span> <span class="n">validateFlow</span> <span class="o">=</span> <span class="nc">Flow</span><span class="o">.&lt;</span><span class="nc">JsonNode</span><span class="o">&gt;</span><span class="n">create</span><span class="o">().</span><span class="na">map</span><span class="o">(</span><span class="nl">PipelineStages:</span><span class="o">:</span><span class="n">validate</span><span class="o">);</span>
  <span class="kd">final</span> <span class="nc">Flow</span><span class="o">&lt;</span><span class="nc">JsonNode</span><span class="o">,</span> <span class="nc">EnrichedRecord</span><span class="o">,</span> <span class="nc">NotUsed</span><span class="o">&gt;</span> <span class="n">enrichFlow</span> <span class="o">=</span> <span class="nc">Flow</span><span class="o">.&lt;</span><span class="nc">JsonNode</span><span class="o">&gt;</span><span class="n">create</span><span class="o">().</span><span class="na">mapAsync</span><span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="n">validated</span> <span class="o">-&gt;</span> <span class="o">{</span>
    <span class="c1">// CompletableFutures on system.executionContext()</span>
  <span class="o">});</span>
  <span class="c1">// ... scoreFlow, serializeFlow ...</span>

  <span class="k">return</span> <span class="nc">Source</span><span class="o">.</span><span class="na">single</span><span class="o">(</span><span class="n">inputFrame</span><span class="o">)</span>
      <span class="o">.</span><span class="na">via</span><span class="o">(</span><span class="n">parseFlow</span><span class="o">).</span><span class="na">via</span><span class="o">(</span><span class="n">validateFlow</span><span class="o">).</span><span class="na">via</span><span class="o">(</span><span class="n">enrichFlow</span><span class="o">).</span><span class="na">via</span><span class="o">(</span><span class="n">scoreFlow</span><span class="o">).</span><span class="na">via</span><span class="o">(</span><span class="n">serializeFlow</span><span class="o">)</span>
      <span class="o">.</span><span class="na">runWith</span><span class="o">(</span><span class="nc">Sink</span><span class="o">.</span><span class="na">head</span><span class="o">(),</span> <span class="n">system</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">runWith</code> call is doing real work — <strong>materialisation</strong>. It walks the graph, allocates a <code class="language-plaintext highlighter-rouge">GraphInterpreterShell</code>, creates an <code class="language-plaintext highlighter-rouge">ActorGraphInterpreter</code> actor with its own mailbox, instantiates each stage’s logic, wires async callbacks through the actor system, and schedules an initial pull on the source. That actor mailbox lands on the <strong>default dispatcher’s ForkJoinPool</strong>, which is a shared, contended structure.</p>

<p>Per message:</p>

<ol>
  <li>Allocate five <code class="language-plaintext highlighter-rouge">Flow</code> instances + one <code class="language-plaintext highlighter-rouge">Source.single</code> + one <code class="language-plaintext highlighter-rouge">Sink.head</code>. Heap allocations of <code class="language-plaintext highlighter-rouge">LinearTraversalBuilder</code> graph fragments, attribute maps, port handles.</li>
  <li><code class="language-plaintext highlighter-rouge">.via(...)</code> composition. Each call fuses two builders.</li>
  <li><code class="language-plaintext highlighter-rouge">.runWith(...)</code> — the materialiser walks the resulting graph, instantiates each stage’s <code class="language-plaintext highlighter-rouge">GraphStageLogic</code>, wires its <code class="language-plaintext highlighter-rouge">InHandler</code>s/<code class="language-plaintext highlighter-rouge">OutHandler</code>s, allocates an <code class="language-plaintext highlighter-rouge">ActorGraphInterpreter</code> with a mailbox, schedules <code class="language-plaintext highlighter-rouge">dispatcher.execute(runnable)</code>.</li>
  <li>Each stage push/pull event goes through the <code class="language-plaintext highlighter-rouge">GraphInterpreter</code> step loop. That’s the <a href="https://github.com/oresoftware/k8s-cluster/blob/main/remote/akka-ws-server/readme.md#debuggability">~26-frame stack trace</a> you see in stage code.</li>
  <li>When the graph completes, the actor stops, the materialiser tears the graph down.</li>
</ol>

<p>Base coordination overhead: roughly <strong>80-200 µs</strong> when the dispatcher is idle, plus <em>whatever the actor mailbox queue depth costs</em> when it isn’t.</p>

<p>That second term is the whole story.</p>

<h3 id="per-message-accounting-at-1-000-msgs">Per-message accounting at 1 000 msg/s</h3>

<p>JFR-sampled, on the live server during a 60-second run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                          async.java          akka-streams
work (sleep + JSON)           ~4.8 ms             ~4.8 ms       (identical)
coordination overhead          0.10 ms             0.30 ms      (base case)
dispatcher queue wait          0.10 ms             4.50 ms      (load-dependent)
total p99                     14.8 ms             54.3 ms

queue-wait share of p99        ~7 %                ~50 %
</code></pre></div></div>

<p><strong>The 3-4× tail-latency gap is the actor-mailbox queue wait.</strong> The work is the same. The base overhead differs by 200 µs, which is rounding error at this latency.</p>

<p>What changes is that async.java’s per-message coordination doesn’t enqueue anything onto a shared structure: each <code class="language-plaintext highlighter-rouge">Asyncc.Parallel</code> is self-contained, lives in its own heap allocation, and submits two VT tasks to the executor directly. The executor (a VT-per-task executor) doesn’t queue — every VT is its own work unit. Coordination overhead stays flat as offered load grows.</p>

<p>Akka Streams, by contrast, schedules a fresh <code class="language-plaintext highlighter-rouge">ActorGraphInterpreter</code> actor per <code class="language-plaintext highlighter-rouge">process()</code> call onto the default dispatcher. At 1 000 msg/s, that’s 1 000 actors per second piling onto a ForkJoinPool with ~8 worker threads. The pool’s run queue grows. Median latency reflects “average queue depth × time-per-actor-slice”. p99 reflects the right tail of that queue depth distribution.</p>

<p>At 2 500 msg/s the pool can’t dequeue fast enough. New <code class="language-plaintext highlighter-rouge">runWith</code> calls pile on top of in-flight actors. Eventually <code class="language-plaintext highlighter-rouge">mapAsync(8)</code> upstream stops accepting frames per connection, but the Akka-side queue has already grown unboundedly, and client-side correlation timeouts start firing.</p>

<p>That’s the cliff.</p>
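
<p>A toy fluid model shows why this is a knee rather than a slope. It assumes a single shared queue that drains at a fixed rate. The 2 000 msg/s capacity is a hypothetical figure chosen to sit between the 1 000 and 2 500 msg/s rows, not a measurement, and <code class="language-plaintext highlighter-rouge">BacklogSketch</code> is an illustrative name.</p>

```java
/** Toy fluid model of a shared run queue; every rate here is a hypothetical input. */
final class BacklogSketch {

    /**
     * Steps the queue one second at a time: work arrives at arrivalsPerSec and the
     * pool drains at most capacityPerSec. Returns the backlog after each second.
     */
    static long[] backlogPerSecond(long arrivalsPerSec, long capacityPerSec, int seconds) {
        final long[] backlog = new long[seconds];
        long queued = 0;
        for (int s = 0; s < seconds; s++) {
            queued = Math.max(0, queued + arrivalsPerSec - capacityPerSec);
            backlog[s] = queued;
        }
        return backlog;
    }

    /** Time a newly queued item waits behind the current backlog, in milliseconds. */
    static double waitMs(long queued, long capacityPerSec) {
        return 1000.0 * queued / capacityPerSec;
    }

    public static void main(String[] args) {
        final long capacity = 2_000; // hypothetical drain rate, not a measured value
        for (long offered : new long[] {1_000, 2_500}) {
            final long[] backlog = backlogPerSecond(offered, capacity, 10);
            final long last = backlog[backlog.length - 1];
            System.out.println(offered + " msg/s: backlog after 10 s = " + last
                + ", wait = " + waitMs(last, capacity) + " ms");
        }
    }
}
```

<p>Below the assumed capacity the backlog stays at zero however long the run lasts. Above it, the backlog grows by the surplus every second: with these inputs, ten seconds at 2 500 msg/s leaves 5 000 queued items, or about 2.5 s of wait for the next arrival.</p>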

<h2 id="is-this-a-meaningful-benchmark-or-did-we-contrive-it">Is this a meaningful benchmark, or did we contrive it?</h2>

<p>Honest answer: <strong>it’s meaningful for the shape it measures, and unfair to Akka Streams for the shape Akka Streams was actually built for.</strong></p>

<p>The benchmark shape is <strong>per-request orchestration</strong>:</p>

<ul>
  <li>HTTP request handler that fans out a few async calls and combines them.</li>
  <li>WebSocket RPC-shaped pipeline (request frame in, response frame out).</li>
  <li>Job orchestration where each job is its own pipeline.</li>
</ul>

<p>For that shape, async.java is faster by a margin that grows with load, and the table above is honest. We see it in production. We saw it in the synthetic micro-benchmark. We saw it in both the Rust and Gleam/Node load testers. The numbers reproduce.</p>

<p>But Akka Streams was built for the <strong>long-running stream consumer</strong> shape:</p>

<ul>
  <li>A Kafka consumer running for days, processing millions of messages through a fixed graph.</li>
  <li>A change-data-capture pipeline materialised once at startup.</li>
  <li>A WebSocket <em>connection</em> treated as a stream (one <code class="language-plaintext highlighter-rouge">Source.queue → ... → Sink</code> per connection, <code class="language-plaintext highlighter-rouge">offer(frame)</code> per message).</li>
</ul>

<p>In that shape, the per-call materialisation cost — the entire reason async.java wins our benchmark — happens <strong>once</strong>, at startup. Amortised over millions of messages, it’s effectively zero. Akka Streams’ structural back-pressure prevents memory growth when upstream outpaces downstream; its actor mailbox queue is a feature, not a bug, when work is naturally batched.</p>

<p>We deliberately built the benchmark so that the function signatures match (<code class="language-plaintext highlighter-rouge">String → CompletionStage&lt;String&gt;</code>). That forces <code class="language-plaintext highlighter-rouge">Source.single → runWith</code> per message, which is the worst-case Akka Streams usage. If we’d instead built a long-lived flow per WS connection with <code class="language-plaintext highlighter-rouge">offer(frame)</code> per message, the materialisation tax would vanish and Akka Streams would hold its tail latency just like async.java does. We just couldn’t write that code with the same external signature.</p>

<p>So:</p>

<ul>
  <li><strong>If you’re choosing an orchestration library for per-request work, the table above is real.</strong> async.java wins on overhead, tail latency, and saturation behaviour. Pick it.</li>
  <li><strong>If you’re choosing a stream consumer for long-running flows</strong>, the table above isn’t a good signal. Akka Streams (or <a href="https://pekko.apache.org/">Pekko Streams</a>, or Reactor, or RxJava) is built for that case and async.java doesn’t have a built-in structural-back-pressure story. Pick that.</li>
</ul>

<p>The mistake people make — including the team that prompted this writeup — is to use Akka Streams for per-request orchestration because it’s already in the stack. The wrong-tool tax compounds with load.</p>

<h2 id="was-asyncjava-always-this-fast">Was async.java always this fast?</h2>

<p>No. The v0.2.x cycle fixed three real concurrency bugs that previously made the library lose messages under sustained load:</p>

<ol>
  <li><strong><code class="language-plaintext highlighter-rouge">CounterLimit</code> lost-update race</strong> (<a href="https://github.com/async-java/async.java/pull/9">PR #9</a>) — non-atomic <code class="language-plaintext highlighter-rouge">Integer++</code> from per-task callbacks. Under concurrent increments from different VTs, one increment was lost; <code class="language-plaintext highlighter-rouge">isDone()</code> returned <code class="language-plaintext highlighter-rouge">false</code> forever; the final callback never fired. Fixed by switching to <code class="language-plaintext highlighter-rouge">AtomicInteger</code>.</li>
  <li><strong>Double-fire of the final callback</strong> (<a href="https://github.com/async-java/async.java/pull/10">PR #10</a>) — <code class="language-plaintext highlighter-rouge">NeoParallel.Parallel(List, callback)</code> called <code class="language-plaintext highlighter-rouge">f.done(...)</code> directly without the shared <code class="language-plaintext highlighter-rouge">NeoUtils.fireFinalCallback</code> dedup guard. Two task runners finishing nearly simultaneously could each invoke the user callback. Akka HTTP’s <code class="language-plaintext highlighter-rouge">mapAsync</code> silently dropped the duplicate emit, so the WS client saw it as a <em>lost</em> response.</li>
  <li><strong>Slot-write-before-counter-increment race</strong> (v0.2.2) — <code class="language-plaintext highlighter-rouge">NeoParallel</code> and <code class="language-plaintext highlighter-rouge">NeoMap</code> incremented their atomic counter <em>before</em> writing the per-index result slot. A sibling runner reading <code class="language-plaintext highlighter-rouge">count == size</code> could fire the final callback while another’s slot write hadn’t landed yet, publishing <code class="language-plaintext highlighter-rouge">null</code> at the last-finishing index.</li>
</ol>

<p>Earlier benchmarks showed async.java at ~94.5 % delivery on 20-second runs at 500 msg/s. That was these bugs, not a load-test artefact. With all three fixes (v0.2.2 and later), delivery is 100 % through to saturation, and beyond saturation it stays 100 % up to the point where the executor itself is overloaded — which on this hardware is well past anything we’d push through one node.</p>

<p>Each fix has a <a href="https://github.com/async-java/async.java/tree/master/src/test/java/general">reproducer test</a> pinning it. The <code class="language-plaintext highlighter-rouge">MisuseTest</code> class adds 12 adversarial scenarios (cross-thread callbacks, double <code class="language-plaintext highlighter-rouge">cb.done</code>, sync throws, empty lists, short-circuit, nested composition) so future regressions surface fast.</p>

<h2 id="project-loom-integration">Project Loom integration</h2>

<p>Both libraries can run on virtual threads on JDK 21+. They get <em>different things</em> out of it.</p>

<h3 id="asyncjava--loom">async.java + Loom</h3>

<p>async.java doesn’t own a thread pool. It takes whatever executor you hand it, and most combinators don’t even take one: <code class="language-plaintext highlighter-rouge">Asyncc.Parallel</code> simply invokes your tasks, and each task decides which executor its work runs on. That makes it trivially Loom-native:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">final</span> <span class="kt">var</span> <span class="n">vt</span> <span class="o">=</span> <span class="nc">Executors</span><span class="o">.</span><span class="na">newVirtualThreadPerTaskExecutor</span><span class="o">();</span>

<span class="c1">// Optional: route NeoQueue's default through VTs too.</span>
<span class="nc">NeoQueue</span><span class="o">.</span><span class="na">setExecutor</span><span class="o">(</span><span class="n">vt</span><span class="o">);</span>

<span class="c1">// Now every task is a virtual thread. The orchestration is callbacks;</span>
<span class="c1">// the threads are continuations.</span>
<span class="kd">final</span> <span class="kt">var</span> <span class="n">tasks</span> <span class="o">=</span> <span class="n">bigList</span><span class="o">.</span><span class="na">stream</span><span class="o">()</span>
    <span class="o">.&lt;</span><span class="nc">Asyncc</span><span class="o">.</span><span class="na">AsyncTask</span><span class="o">&lt;</span><span class="nc">Result</span><span class="o">,</span> <span class="nc">Throwable</span><span class="o">&gt;&gt;</span><span class="n">map</span><span class="o">(</span><span class="n">item</span> <span class="o">-&gt;</span>
        <span class="n">cb</span> <span class="o">-&gt;</span> <span class="n">vt</span><span class="o">.</span><span class="na">submit</span><span class="o">(()</span> <span class="o">-&gt;</span> <span class="o">{</span>
            <span class="k">try</span> <span class="o">{</span> <span class="n">cb</span><span class="o">.</span><span class="na">done</span><span class="o">(</span><span class="kc">null</span><span class="o">,</span> <span class="n">blockingFetch</span><span class="o">(</span><span class="n">item</span><span class="o">));</span> <span class="o">}</span>  <span class="c1">// safe blocking on VT</span>
            <span class="k">catch</span> <span class="o">(</span><span class="nc">Throwable</span> <span class="n">t</span><span class="o">)</span> <span class="o">{</span> <span class="n">cb</span><span class="o">.</span><span class="na">done</span><span class="o">(</span><span class="n">t</span><span class="o">,</span> <span class="kc">null</span><span class="o">);</span> <span class="o">}</span>
        <span class="o">}))</span>
    <span class="o">.</span><span class="na">toList</span><span class="o">();</span>

<span class="nc">Asyncc</span><span class="o">.</span><span class="na">ParallelLimit</span><span class="o">(</span><span class="mi">64</span><span class="o">,</span> <span class="n">tasks</span><span class="o">,</span> <span class="o">(</span><span class="n">err</span><span class="o">,</span> <span class="n">results</span><span class="o">)</span> <span class="o">-&gt;</span> <span class="o">{</span> <span class="cm">/* ... */</span> <span class="o">});</span>
</code></pre></div></div>

<p>What Loom buys us:</p>

<ul>
  <li><strong>VT spawn cost is ~250 ns.</strong> The “submit a Runnable” step in each task becomes close to free.</li>
  <li><strong>Blocking I/O inside a task is a continuation park</strong>, not a kernel thread block. <code class="language-plaintext highlighter-rouge">Thread.sleep</code>, <code class="language-plaintext highlighter-rouge">URL.openStream</code>, JDBC calls — all release the carrier.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">synchronized</code> no longer pins</strong> carriers on JDK 24+ (<a href="https://openjdk.org/jeps/491">JEP 491</a>; on JDK 21 to 23 a virtual thread that blocks inside <code class="language-plaintext highlighter-rouge">synchronized</code> still pins its carrier). <code class="language-plaintext highlighter-rouge">NeoLock</code> is still useful for <em>async</em> mutual exclusion, i.e. when you want to release the lock from a different VT than the one that acquired it, which is the case in callback chains where the unlock fires from the completion of an async task.</li>
</ul>
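
<p>To make the last bullet concrete: a monitor or a <code class="language-plaintext highlighter-rouge">ReentrantLock</code> is owned by the acquiring thread, so releasing it from a completion callback on another thread fails. The sketch below uses the JDK’s <code class="language-plaintext highlighter-rouge">Semaphore</code>, which has no owner, as a stand-in for that “acquire here, release over there” shape. It does not use or reproduce the <code class="language-plaintext highlighter-rouge">NeoLock</code> API; <code class="language-plaintext highlighter-rouge">CrossThreadUnlockSketch</code> and <code class="language-plaintext highlighter-rouge">demo</code> are hypothetical names.</p>

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;

// Hypothetical illustration of ownerless mutual exclusion; not NeoLock itself.
final class CrossThreadUnlockSketch {

    static String demo() {
        // One permit gives mutual exclusion; a Semaphore does not track an owner thread.
        final Semaphore lock = new Semaphore(1);
        final StringBuilder log = new StringBuilder();

        lock.acquireUninterruptibly();          // acquired on the calling thread
        log.append("acquired;");

        // The "async task": its completion releases the lock from a pool thread.
        final CompletableFuture<Void> asyncWork = CompletableFuture.runAsync(() -> {
            log.append("work-done;");
            lock.release();                     // released from a different thread
        });

        lock.acquireUninterruptibly();          // proceeds only after the release above
        log.append("reacquired");
        lock.release();

        asyncWork.join();
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());             // prints "acquired;work-done;reacquired"
    }
}
```

<p>The same hand-off with <code class="language-plaintext highlighter-rouge">ReentrantLock.unlock()</code> would throw <code class="language-plaintext highlighter-rouge">IllegalMonitorStateException</code>, because the releasing thread is not the holder.</p>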

<p>What Loom <em>doesn’t</em> change for async.java: the orchestration itself. Even with VTs, you still need a way to say <em>“do these N things in parallel and collect their results”</em>. Loom’s <a href="https://openjdk.org/jeps/505"><code class="language-plaintext highlighter-rouge">StructuredTaskScope</code></a> does that for synchronous-style fan-out (and is great), but if your code is callback-shaped — a typical case for event-driven runtimes like Vert.x — <code class="language-plaintext highlighter-rouge">Asyncc.Parallel</code> slots in directly and <code class="language-plaintext highlighter-rouge">StructuredTaskScope</code> doesn’t.</p>

<p>The pithy framing: <strong>Loom solves what a thread costs. async.java solves what coordinating a set of async tasks looks like in code.</strong> They compose.</p>
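
<p>One way to picture the two styles composing is to bridge a callback-shaped task into a <code class="language-plaintext highlighter-rouge">CompletableFuture</code>, so that synchronous-style code (for example, code running on a virtual thread) can simply wait for it. This is a sketch under assumptions: the <code class="language-plaintext highlighter-rouge">Task</code> and <code class="language-plaintext highlighter-rouge">Callback</code> interfaces are hypothetical stand-ins with an error-first shape, not async.java’s own types, and <code class="language-plaintext highlighter-rouge">toFuture</code> is not a library method.</p>

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical bridge between callback-shaped and blocking-style code.
final class CallbackBridgeSketch {
    interface Callback<T> { void done(Throwable err, T value); }
    interface Task<T> { void run(Callback<T> cb); }

    /** Starts the task and exposes its single completion as a future. */
    static <T> CompletableFuture<T> toFuture(Task<T> task) {
        final CompletableFuture<T> future = new CompletableFuture<>();
        try {
            task.run((err, value) -> {
                // complete/completeExceptionally ignore any call after the first,
                // so a misbehaving task that calls back twice cannot change the result.
                if (err != null) future.completeExceptionally(err);
                else future.complete(value);
            });
        } catch (Throwable t) {
            future.completeExceptionally(t);    // the task threw synchronously
        }
        return future;
    }

    static int demo() {
        // The task completes on another thread; the caller blocks on join().
        final Task<Integer> task = cb -> new Thread(() -> cb.done(null, 42)).start();
        return toFuture(task).join();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

<p>On a virtual thread, the <code class="language-plaintext highlighter-rouge">join()</code> call should park the continuation rather than hold a platform thread, which is the sense in which the thread-cost story and the coordination story are separate.</p>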

<h3 id="akka-streams--loom">Akka Streams + Loom</h3>

<p>You can configure Akka 2.8.x’s default dispatcher to use a virtual-thread executor:</p>

<div class="language-hocon highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">akka.actor.default-dispatcher</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nl">executor</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="s2">"virtual-thread-executor"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>This makes the actor scheduling step run on VTs. It helps for the per-stage work — a stage doing JDBC inside <code class="language-plaintext highlighter-rouge">mapAsync</code> no longer ties up a platform thread. But it doesn’t help the <em>graph materialisation overhead</em>, because that overhead isn’t in the threads, it’s in the framework — actor mailbox structure, interpreter loop, stage logic instantiation. Those are the same number of allocations and CAS operations regardless of whether the thread is virtual.</p>

<p>So Loom moves Akka Streams from “stages that block on I/O cost a platform thread” to “stages that block on I/O cost a continuation park”. That’s a meaningful improvement for stream consumers doing JDBC or HTTP calls. But it doesn’t move the needle on the per-message coordination cost that the benchmark above is measuring.</p>

<p>The deeper observation: <strong>Akka Streams predates Loom and its overhead model reflects that.</strong> When the actor model was designed, the goal was “do useful work on a small number of platform threads, by having actors share them”. Loom inverts that — threads are now cheap, so the actor abstraction’s overhead has gone from “small price for a useful property (multiplexing)” to “small-but-noticeable price for a property you may not need”. For per-request work, the multiplexing isn’t worth it; for long-running streams, the back-pressure and supervision still are.</p>

<p>async.java, written without the actor-model frame, doesn’t pay that tax. It also doesn’t give you actor supervision or graph-level back-pressure. Both libraries are honest about what they are.</p>

<h2 id="when-to-pick-what">When to pick what</h2>

<p>For per-request orchestration shapes — HTTP handlers, WS request/response, job pipelines — <strong>async.java</strong>:</p>

<ul>
  <li>~50 µs of overhead per orchestration</li>
  <li>Composable combinators that nest cleanly</li>
  <li>Loom-native: hand it a VT executor and stop thinking about thread pools</li>
  <li>Short stack traces (~10 frames) when failures happen</li>
  <li>Per-call cost stays flat as load grows</li>
</ul>

<p>For long-running stream consumers — Kafka, JetStream, CDC feeds, per-connection WS treated as a stream — <strong>Akka Streams (or Pekko Streams)</strong>:</p>

<ul>
  <li>Materialisation cost amortises over millions of messages</li>
  <li>Structural back-pressure is built in and type-checked</li>
  <li>Supervision strategies, error recovery semantics, async boundaries</li>
  <li>Mature ecosystem of connectors</li>
</ul>

<p>You can use both in the same service. The Vert.x server we run for batch pipelines uses async.java for per-job orchestration and stays on the Vert.x event loop for I/O. The Kafka consumer in front of it uses Akka Streams because that’s exactly the shape it serves well. The mistake is using one for the other’s job — and the table above is what that mistake looks like in numbers.</p>

<hr />

<h3 id="reproducing-the-benchmark">Reproducing the benchmark</h3>

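<p>The benchmark is fully open. To reproduce, the commands below drive each endpoint with 50 clients at 10 msg/s each (500 msg/s aggregate) for a 60-second hold:</p>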

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Build the Akka WS server</span>
git clone https://github.com/oresoftware/k8s-cluster
<span class="nb">cd </span>k8s-cluster/remote/akka-ws-server
mvn <span class="nt">-q</span> clean package

<span class="c"># Boot it</span>
java <span class="nt">-jar</span> target/dd-akka-ws-server.jar

<span class="c"># Build the Rust load tester</span>
<span class="nb">cd</span> ../ws-loadtest-rs
cargo build <span class="nt">--release</span>

<span class="c"># Hit each endpoint</span>
<span class="nv">LOAD_MODE</span><span class="o">=</span>pipeline <span class="nv">CLIENT_COUNT</span><span class="o">=</span>50 <span class="nv">MESSAGES_PER_SECOND_PER_CLIENT</span><span class="o">=</span>10 <span class="se">\</span>
  <span class="nv">HOLD_SECONDS</span><span class="o">=</span>60 <span class="se">\</span>
  <span class="nv">TARGET_WS_URL</span><span class="o">=</span>ws://127.0.0.1:8086/ws/asyncjava <span class="se">\</span>
  target/release/ws-loadtest-rs

<span class="nv">LOAD_MODE</span><span class="o">=</span>pipeline <span class="nv">CLIENT_COUNT</span><span class="o">=</span>50 <span class="nv">MESSAGES_PER_SECOND_PER_CLIENT</span><span class="o">=</span>10 <span class="se">\</span>
  <span class="nv">HOLD_SECONDS</span><span class="o">=</span>60 <span class="se">\</span>
  <span class="nv">TARGET_WS_URL</span><span class="o">=</span>ws://127.0.0.1:8086/ws/akkastreams <span class="se">\</span>
  target/release/ws-loadtest-rs
</code></pre></div></div>

<p>The Gleam/Node load tester is at <a href="https://github.com/oresoftware/k8s-cluster/tree/main/remote/gleamlang-ws-loadtest"><code class="language-plaintext highlighter-rouge">remote/gleamlang-ws-loadtest</code></a> and produces equivalent numbers.</p>

<p>The bug fixes that closed the message-drop gap are in <a href="https://github.com/async-java/async.java/pull/9">PR #9</a>, <a href="https://github.com/async-java/async.java/pull/10">PR #10</a>, and the <a href="https://github.com/async-java/async.java/releases/tag/v0.2.2">v0.2.2 release</a>. All reproducer tests live in <code class="language-plaintext highlighter-rouge">src/test/java/general/</code> in the async.java repo.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A detailed performance comparison of async.java v0.2.2 and Akka Streams 2.8.8 across five load points. Includes the mechanical breakdown of where the latency gap comes from and how both libraries integrate with Project Loom.]]></summary></entry></feed>