5ab5traction5 - World Wide and Wonderful

An initial investigation into using Zig to speed up Raku code

Note: This post is also available as a gist if you find that format more readable.

Introduction

This research was conducted while preparing an upcoming Raku Advent Calendar post. The Raku code uses a basic supply pipeline to feed $volume objects through a validation stage that requires a CRC32 check before going to the output sink, which prints the processing time of the validation stage.
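
As a rough illustration of the shape of that pipeline, here is a minimal Python sketch (it is not the Raku code from this post). The `Package` class, the function names, the address format, and the use of `zlib.crc32` are all assumptions made for the example; the original is built on a Raku supply pipeline.

```python
import time
import zlib
from dataclasses import dataclass


@dataclass
class Package:
    """Hypothetical stand-in for the objects fed through the pipeline."""
    address: str
    crc32: int


def make_packages(volume):
    """Source stage: yield `volume` packages with a matching checksum."""
    for i in range(volume):
        address = f"{i} Candycane Lane"
        yield Package(address, zlib.crc32(address.encode("utf-8")))


def validate(package):
    """Validation stage: recompute the CRC32 and compare with the stored one."""
    return zlib.crc32(package.address.encode("utf-8")) == package.crc32


def run_batch(volume):
    """Sink stage: return (valid count, seconds spent in the validation stage)."""
    packages = list(make_packages(volume))  # built before the timer starts
    start = time.perf_counter()
    valid = sum(1 for p in packages if validate(p))
    return valid, time.perf_counter() - start


if __name__ == "__main__":
    valid, seconds = run_batch(10_000)
    print(f"validated {valid} packages in {seconds:.6f}s")
```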

The "reaction graph" is designed to simulate a stream processing flow, where inputs arrive and depart via Candycane™ queues (that's the name of Santa's Workshop Software's queueing service, in case you weren't familiar).

The entire scenario is contrived: CRC32 was chosen because native implementations are available in both Raku and Zig, which allows a direct comparison. It's not an endorsement of using CRC32 in address validation to deliver Santa's, or anyone's, packages.

Also, thanks to the very helpful folks at ziggit.dev for answering my newbie question in depth.

Methodology

The source code:

At larger volumes, Raku struggles with the initialization speed of the $volume objects that are instantiated. I replaced the native Raku class with one written in Zig, using the is repr('CStruct') trait in Raku and the extern struct qualifier in Zig.
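
Both the is repr('CStruct') trait and the extern struct qualifier ask for a C-compatible field layout, which is what lets the two sides share one object. As an analogy only, Python's `ctypes.Structure` makes the same promise. The field names below are hypothetical, since the fields of the real class are not listed here.

```python
import ctypes


class Package(ctypes.Structure):
    """Hypothetical C-layout struct: fields are laid out in declaration order."""
    _fields_ = [
        ("address", ctypes.c_char_p),  # pointer to a NUL-terminated string
        ("crc32", ctypes.c_uint32),    # 32-bit checksum stored next to it
    ]


def describe_layout(struct_type):
    """Return (name, offset, size) for each field of a ctypes structure."""
    return [
        (name, getattr(struct_type, name).offset, getattr(struct_type, name).size)
        for name, _ctype in struct_type._fields_
    ]


if __name__ == "__main__":
    for name, offset, size in describe_layout(Package):
        print(f"{name}: offset={offset} size={size}")
    print("total size:", ctypes.sizeof(Package))
```

Because the layout follows C rules, any language that speaks the C ABI can read the same bytes without a conversion step.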

In Zig I use a combination of an arena allocator (for the string passed from Raku) and a memory pool (designed to quickly make copies of a single type, exactly fitting our use case) to construct Package objects.
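
The idea behind a memory pool is to hand out slots for one fixed type and recycle released slots instead of going back to a general-purpose allocator each time. The Python sketch below shows only that free-list idea. It is not Zig's allocator API, and the `Pool` class and its method names are invented for the illustration.

```python
class Pool:
    """Toy fixed-type pool: recycles released objects through a free list."""

    def __init__(self, factory):
        self._factory = factory  # builds a fresh object when the free list is empty
        self._free = []          # released objects waiting to be reused
        self.allocations = 0     # how many times the factory had to run

    def create(self):
        """Hand out a recycled object if one is available, else build a new one."""
        if self._free:
            return self._free.pop()
        self.allocations += 1
        return self._factory()

    def destroy(self, obj):
        """Return an object to the pool so a later create() can reuse it."""
        self._free.append(obj)


if __name__ == "__main__":
    pool = Pool(dict)
    first = pool.create()
    pool.destroy(first)
    second = pool.create()  # reuses the object released above
    print("reused:", first is second, "allocations:", pool.allocations)
```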

Additionally, for Raku+Zig the CRC32 hashing routine from Zig's stdlib is used via a tiny wrapper function.
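
For reference, the checksum itself is small. The sketch below is a bit-at-a-time CRC-32 in Python, assuming the common IEEE variant (reflected polynomial 0xEDB88320), checked against `zlib.crc32`. It illustrates the computation only and is neither the Zig stdlib routine nor the wrapper function mentioned above.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """Compute CRC-32 (IEEE, reflected) one bit at a time."""
    crc = 0xFFFFFFFF                      # initial value: all ones
    for byte in data:
        crc ^= byte                       # fold the next byte into the low bits
        for _ in range(8):
            if crc & 1:                   # low bit set: shift and apply polynomial
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF               # final inversion


if __name__ == "__main__":
    sample = b"123456789"                 # the conventional CRC test input
    print(hex(crc32_bitwise(sample)), hex(zlib.crc32(sample)))
```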

A --bad-packages option is provided by both Raku scripts, which makes 10% of the objects have a mismatched address/CRC32 pair.
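
One way to picture that option is the Python sketch below, in which every tenth package gets a corrupted checksum. How the real scripts choose which 10% to corrupt is not stated here, so the every-tenth rule, the `make_pairs` name, and the address format are assumptions.

```python
import zlib


def make_pairs(volume, bad_packages=False):
    """Return (address, crc32) pairs; optionally corrupt every tenth checksum."""
    pairs = []
    for i in range(volume):
        address = f"{i} Candycane Lane"
        checksum = zlib.crc32(address.encode("utf-8"))
        if bad_packages and i % 10 == 0:
            checksum ^= 1  # flip one bit so the pair no longer matches
        pairs.append((address, checksum))
    return pairs


def count_mismatches(pairs):
    """Count pairs whose stored checksum differs from the recomputed one."""
    return sum(1 for address, checksum in pairs
               if zlib.crc32(address.encode("utf-8")) != checksum)


if __name__ == "__main__":
    print(count_mismatches(make_pairs(10_000, bad_packages=True)))
```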

The library tested was compiled with -Doptimize=ReleaseFast.

Batches are repeated $batch times, which defaults to 5.

All results from an M2 MacBook Pro.

Caveats

This test is only intended to reflect the case where an object is constructed in Zig based on input from Raku. It is not intended to be a test of Zig's native speed in the creation of structs.

There is a call to sleep that gives time -- 0.001 seconds -- to get the react block up and running before emitting the first True on the $ticker-supplier. This affects overall runtime but not the batch or initialization metrics.

Raku+Zig ran so quickly that the tool used for measurement (cmdbench) could not find the process in ps because it had already finished. Those results are marked as Unmeasured.

In the next iteration of this research, there should be two additional entries in the data tables below for:

Results

10,000

| Volume | Edition | Runtime | Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 | Initialization | Max bytes |
|---|---|---|---|---|---|---|---|---|---|
| 10,000 | Raku | 1.072s | 0.146596686s | 0.138983732s | 0.142380065s | 0.136050775s | 0.134760525s | 0.008991746s | 180240384 |
| 10,000 | Raku+Zig | 0.44s | 0.010978411s | 0.006575705s | 0.004145623s | 0.004280415s | 0.00468929s | 0.020358033s | Unmeasured |
| 10,000 | Raku (bad-packages) | 1.112s | 0.157788932s | 0.149544686s | 0.156293433s | 0.151365477s | 0.147947436s | 0.008059955s | 196263936 |
| 10,000 | Raku+Zig (bad-packages) | 0.463s | 0.031300276s | 0.01006562s | 0.010693328s | 0.011056994s | 0.010770828s | 0.010954495s | Unmeasured |

Notes

The Raku+Zig solution wins on overall performance but loses the initialization race. Raku puts in a decent showing here, which reflects how far it has come performance-wise.

100,000

| Volume | Edition | Runtime | Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 | Initialization | Max bytes |
|---|---|---|---|---|---|---|---|---|---|
| 100,000 | Raku | 7.163s | 1.360029456s | 1.32534014s | 1.353072834s | 1.346668338s | 1.351110502s | 0.062402473s | 210173952 |
| 100,000 | Raku+Zig | 0.75s | 0.079802007s | 0.073638176s | 0.053291894s | 0.05087652s | 0.050394687s | 0.05855585s | 241205248 |
| 100,000 | Raku (bad-packages) | 7.89s | 1.496982355s | 1.484494027s | 1.497365023s | 1.490810525s | 1.492416774s | 0.060026016s | 209403904 |
| 100,000 | Raku+Zig (bad-packages) | 1.076s | 0.16960934s | 0.111172493s | 0.110844786s | 0.113021202s | 0.111713535s | 0.051436311s | 242450432 |

Notes

Raku+Zig takes first place in everything but memory consumption, which is presumably a function of using the NativeCall bridge, not to mention my newness as a Zig programmer.

1,000,000

| Volume | Edition | Runtime | Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 | Initialization | Max bytes |
|---|---|---|---|---|---|---|---|---|---|
| 1,000,000 | Raku | 68.081s | 13.475302627s | 13.161153845s | 13.293998956s | 13.364662217s | 13.474755295s | 0.95481884s | 417103872 |
| 1,000,000 | Raku+Zig | 3.758s | 0.788083286s | 0.509883905s | 0.492898873s | 0.500868284s | 0.498677495s | 0.575087671s | 514064384 |
| 1,000,000 | Raku (bad-packages) | 75.796s | 14.940173822s | 14.632683637s | 14.866796226s | 15.272903792s | 15.027481448s | 0.704549212s | 396656640 |
| 1,000,000 | Raku+Zig (bad-packages) | 6.553s | 1.362189763s | 1.061496504s | 1.069134685s | 1.062746049s | 1.061096044s | 0.528011288s | 462766080 |

Notes

Raku's native CRC32 performance is clearly lagging here. Raku+Zig keeps its dominance except in the realm of memory usage. It would be hard to justify using the Raku native version strictly on its reduced memory usage, considering the performance advantage on display here.

A "slow first batch" problem begins to affect Raku+Zig. Running with bad-packages enabled slows down the Raku+Zig crc32 loop, hinting that there might be some optimizations on either the Raku or the Zig/clang side of things that can't kick in when the looped data is heterogeneous.

Dynamic runtime optimization sounds more like a Rakudo thing than a Zig thing, though.

10,000,000

| Volume | Edition | Runtime | Batch 1 | Batch 2 | Batch 3 | Batch 4 | Batch 5 | Initialization | Max bytes |
|---|---|---|---|---|---|---|---|---|---|
| 10,000,000 | Raku | 704.852s | 136.588638184s | 136.851019628s | 138.44696743s | 139.777040922s | 139.490784317s | 13.299274221s | 2055012352 |
| 10,000,000 | Raku+Zig | 38.505s | 8.843459877s | 4.84300835s | 4.991842433s | 5.077245603s | 4.939533707s | 9.375436134s | 2881126400 |
| 10,000,000 | Raku (bad-packages) | 792.1s | 162.333803401s | 174.815386318s | 168.299796081s | 162.643428135s | 163.205406678s | 10.252639311s | 2124267520 |
| 10,000,000 | Raku+Zig (bad-packages) | 65.174s | 14.41616445s | 11.078961309s | 10.662389991s | 11.20240076s | 10.614430063s | 6.778600235s | 2861596672 |

Notes

Pure Raku really struggles with a volume of this order of magnitude. But if you add in just a little bit of Zig, you can reasonably supercharge Raku's capabilities.
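
To put a number on that, the overall runtimes from the tables above (default runs, without --bad-packages) can be divided directly. The short Python sketch below does only that arithmetic, and its names are illustrative.

```python
# Overall runtimes in seconds, copied from the tables above: (Raku, Raku+Zig).
RUNTIMES = {
    10_000: (1.072, 0.44),
    100_000: (7.163, 0.75),
    1_000_000: (68.081, 3.758),
    10_000_000: (704.852, 38.505),
}


def speedups(runtimes):
    """Map each volume to how many times faster Raku+Zig finished."""
    return {volume: raku / zig for volume, (raku, zig) in runtimes.items()}


if __name__ == "__main__":
    for volume, factor in speedups(RUNTIMES).items():
        print(f"{volume:>10,}: {factor:.1f}x")
```

On these figures the gap widens from roughly 2.4x at 10,000 objects to roughly 18x at 1,000,000 and 10,000,000.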

The "slow first batch" for Raku+Zig has been appearing in more understated forms in other tests. Here the first batch takes roughly 1.8 times as long as the second batch (8.84s versus 4.84s). What is causing this?
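
The size of the effect can be read off the batch columns. This Python sketch divides the first batch time by the second for the default Raku+Zig runs, using the figures from the tables above; the names are illustrative.

```python
# First and second batch times in seconds for Raku+Zig, from the tables above.
BATCHES = {
    10_000: (0.010978411, 0.006575705),
    100_000: (0.079802007, 0.073638176),
    1_000_000: (0.788083286, 0.509883905),
    10_000_000: (8.843459877, 4.84300835),
}


def first_batch_ratios(batches):
    """Map each volume to first-batch time divided by second-batch time."""
    return {volume: first / second for volume, (first, second) in batches.items()}


if __name__ == "__main__":
    for volume, ratio in first_batch_ratios(BATCHES).items():
        print(f"{volume:>10,}: first batch took {ratio:.2f}x the second")
```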

100,000,000

This doesn't seem to work. At least, I'm not patient enough. The process seems to stall, growing and shrinking memory but never finishing.

Final Thoughts

This is a preliminary report in blog post form based on a contrived code sample written for another, entirely different blog post. More data and deeper analysis will have to come later.

Zig's C ABI compatibility is clearly no put-on. It works seamlessly with Raku's NativeCall. Granted, we haven't really pushed the boundaries of what the C ABI can look like, but one of the core takeaways is that with Zig we get to design that interface. In other words, we are in charge of how ugly, or not, it gets. Considering how dead simple the extern struct <-> is repr('CStruct') support is, I don't think the function signatures need to get nearly as gnarly as they do in C.

Sussing the truth of that supposition out will take some time and effort in learning Zig. I'm looking forward to it. My first stop will probably be a JSON library that uses Zig. I'm also going to be looking into using Zig as the compiler for Rakudo, as it might simplify our releases significantly.