An initial investigation into using Zig to speed up Raku code

27 Nov, 2023

Note: This post is also available as a gist if you find that format more readable.

Introduction

This research was conducted while preparing an upcoming Raku Advent Calendar post. The Raku code uses a basic supply pipeline to feed $volume objects through a validation stage that requires a CRC32 check before going to the output sink, which prints the processing time of the validation stage.

The "reaction graph" is designed to simulate a stream processing flow, where inputs arrive and depart via Candycane™ queues (that's the name of Santa's Workshop Software's queueing service, in case you weren't familiar).

The entire scenario is contrived in that CRC32 was chosen due to native implementation availability in both Raku and Zig, allowing comparison. It's not an endorsement of using CRC32 in address validation to deliver Santa's, or anyone's, packages.

Also, thanks to the very helpful folks at ziggit.dev for answering my newbie question in depth.

Methodology

The source code:

Raku - crc-getter.raku
Raku+Zig - crc-getter-extended.raku, main.zig

At larger volumes, Raku struggles with the initialization speed of the $volume objects that are instantiated. I replaced the native Raku class with one written in Zig, using the is repr('CStruct') trait in Raku and the extern struct qualifier in Zig.
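To make the idea of a shared, C-compatible layout concrete, here is a minimal sketch using Python's ctypes. It is not the Raku or Zig source (that lives in the files listed above). The field names `address` and `crc32` are assumptions made for illustration, since the actual fields of Package are defined in main.zig.

```python
import ctypes

class Package(ctypes.Structure):
    # Assumed fields, for illustration only; the real layout is in main.zig.
    _fields_ = [
        ("address", ctypes.c_char_p),  # pointer to a NUL-terminated string
        ("crc32", ctypes.c_uint32),    # checksum stored alongside the address
    ]

pkg = Package(b"1 Candy Cane Lane", 0)

# Fields are placed in declaration order using C alignment rules. That fixed
# layout is what lets two languages read and write the same struct.
print(Package.address.offset, Package.crc32.offset, ctypes.sizeof(Package))
```

Both `is repr('CStruct')` and `extern struct` give the same kind of guarantee: both sides agree on field order and alignment, so no per-field conversion is needed at the language boundary.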

In Zig I use a combination of an arena allocator (for the string passed from Raku) and a memory pool (designed to quickly make copies of a single type, exactly fitting our use case) to construct Package objects.

Additionally, for Raku+Zig the CRC32 hashing routine from Zig's stdlib is used via a tiny wrapper function.

A --bad-packages option is provided by both Raku scripts, which makes 10% of the objects have a mismatched address/CRC32 pair.
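As a rough illustration of what the validation stage checks, the sketch below builds address/CRC32 pairs and corrupts a share of them. It uses Python's `zlib.crc32` rather than the Raku or Zig routines. The helper names, the address format, and the choice to corrupt every tenth package are assumptions; the scripts may pick their bad packages differently.

```python
import zlib

def make_packages(volume, bad_every=0):
    """Build (address, crc32) pairs; every `bad_every`-th pair gets a wrong CRC32."""
    packages = []
    for i in range(volume):
        address = f"{i} Candy Cane Lane".encode()
        crc = zlib.crc32(address)
        if bad_every and i % bad_every == 0:
            crc ^= 0xFFFFFFFF  # flip every bit so the pair can no longer match
        packages.append((address, crc))
    return packages

def is_valid(address, crc):
    """A package passes when the stored CRC32 matches the recomputed one."""
    return zlib.crc32(address) == crc

packages = make_packages(10_000, bad_every=10)  # 10% mismatched pairs
bad = sum(1 for address, crc in packages if not is_valid(address, crc))
print(f"{bad} of {len(packages)} packages failed validation")
```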

The library tested was compiled with -Doptimize=ReleaseFast.

Batches are repeated $batch times, which defaults to 5.
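The per-batch figures in the tables below come from timing the same validation pass several times. A minimal sketch of that measurement pattern, assuming a hypothetical `time_batches` helper and Python's `zlib.crc32` standing in for the real check:

```python
import time
import zlib

def time_batches(packages, batch=5):
    """Validate every package `batch` times; return the seconds spent per batch."""
    timings = []
    for _ in range(batch):
        start = time.perf_counter()
        for address, crc in packages:
            _ = zlib.crc32(address) == crc  # the validation stage being timed
        timings.append(time.perf_counter() - start)
    return timings

addresses = [f"{i} Candy Cane Lane".encode() for i in range(10_000)]
packages = [(address, zlib.crc32(address)) for address in addresses]

for number, seconds in enumerate(time_batches(packages), start=1):
    print(f"{number}: {seconds:.9f}s")
```

Timing each batch separately, rather than only the total, is what makes effects such as a slow first batch visible.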

All results from an M2 MacBook Pro.

Caveats

This test and its results are only intended to reflect the case where an object is constructed in Zig based on input from Raku. It is not intended to be a test of Zig's native speed in the creation of structs.

There is a call to sleep that gives time -- 0.001 seconds -- to get the react block up and running before emitting the first True on the $ticker-supplier. This affects overall runtime but not the batch or initialization metrics.

Raku+Zig ran so fast at the smallest volume that the measurement tool (cmdbench) could not find the process in ps because it had already finished. These cases are marked as Unmeasured.

In the next iteration of this research, there should be two additional entries in the data tables below for:

Raku+Zig: Raku-managed objects / Zig crc32
Raku+Zig: Zig-managed objects / Raku crc32

Results

10,000

Volume	Edition	Runtime	Batch Time	Initialization	Max bytes
10,000	Raku	1.072s	1: 0.146596686s 2: 0.138983732s 3: 0.142380065s 4: 0.136050775s 5: 0.134760525s	0.008991746s	180240384
10,000	Raku+Zig	0.44s	1: 0.010978411s 2: 0.006575705s 3: 0.004145623s 4: 0.004280415s 5: 0.00468929s	0.020358033s	`Unmeasured`
10,000	Raku (`bad-packages`)	1.112s	1: 0.157788932s 2: 0.149544686s 3: 0.156293433s 4: 0.151365477s 5: 0.147947436s	0.008059955s	196263936
10,000	Raku+Zig (`bad-packages`)	0.463s	1: 0.031300276s 2: 0.01006562s 3: 0.010693328s 4: 0.011056994s 5: 0.010770828s	0.010954495s	`Unmeasured`

Notes

The Raku+Zig solution wins on batch time and overall runtime, but loses the initialization race. Pure Raku still makes a decent showing, a reflection of how far it has come performance-wise.

100,000

Volume	Edition	Runtime	Batch Time	Initialization	Max bytes
100,000	Raku	7.163s	1: 1.360029456s 2: 1.32534014s 3: 1.353072834s 4: 1.346668338s 5: 1.351110502s	0.062402473s	210173952
100,000	Raku+Zig	0.75s	1: 0.079802007s 2: 0.073638176s 3: 0.053291894s 4: 0.05087652s 5: 0.050394687s	0.05855585s	241205248
100,000	Raku (`bad-packages`)	7.89s	1: 1.496982355s 2: 1.484494027s 3: 1.497365023s 4: 1.490810525s 5: 1.492416774s	0.060026016s	209403904
100,000	Raku+Zig (`bad-packages`)	1.076s	1: 0.16960934s 2: 0.111172493s 3: 0.110844786s 4: 0.113021202s 5: 0.111713535s	0.051436311s	242450432

Notes

We see Raku+Zig take first place in everything but memory consumption, which is probably a function of using the NativeCall bridge, not to mention my newness as a Zig programmer.

1,000,000

Volume	Edition	Runtime	Batch Time	Initialization	Max bytes
1,000,000	Raku	68.081s	1: 13.475302627s 2: 13.161153845s 3: 13.293998956s 4: 13.364662217s 5: 13.474755295s	0.95481884s	417103872
1,000,000	Raku+Zig	3.758s	1: 0.788083286s 2: 0.509883905s 3: 0.492898873s 4: 0.500868284s 5: 0.498677495s	0.575087671s	514064384
1,000,000	Raku (`bad-packages`)	75.796s	1: 14.940173822s 2: 14.632683637s 3: 14.866796226s 4: 15.272903792s 5: 15.027481448s	0.704549212s	396656640
1,000,000	Raku+Zig (`bad-packages`)	6.553s	1: 1.362189763s 2: 1.061496504s 3: 1.069134685s 4: 1.062746049s 5: 1.061096044s	0.528011288s	462766080

Notes

Raku's native CRC32 performance is clearly lagging here. Raku+Zig maintains its dominance except in the realm of memory usage. It would be hard to justify using the pure Raku version strictly on its reduced memory usage, considering the performance advantage on display here.

A "slow first batch" problem begins to affect Raku+Zig. Running with bad-packages enabled slows down the Raku+Zig crc32 loop, hinting that there might be some optimizations on either the Raku or the Zig/clang side of things that can't kick in when the looped data is heterogenous.

Dynamic runtime optimization sounds more like a Rakudo thing than a Zig thing, though.

10,000,000

Volume	Edition	Runtime	Batch Time	Initialization	Max bytes
10,000,000	Raku	704.852s	1: 136.588638184s 2: 136.851019628s 3: 138.44696743s 4: 139.777040922s 5: 139.490784317s	13.299274221s	2055012352
10,000,000	Raku+Zig	38.505s	1: 8.843459877s 2: 4.84300835s 3: 4.991842433s 4: 5.077245603s 5: 4.939533707s	9.375436134s	2881126400
10,000,000	Raku (`bad-packages`)	792.1s	1: 162.333803401s 2: 174.815386318s 3: 168.299796081s 4: 162.643428135s 5: 163.205406678s	10.252639311s	2124267520
10,000,000	Raku+Zig (`bad-packages`)	65.174s	1: 14.41616445s 2: 11.078961309s 3: 10.662389991s 4: 11.20240076s 5: 10.614430063s	6.778600235s	2861596672

Notes

Pure Raku really struggles with a volume of this order of magnitude. But if you add in just a little bit of Zig, you can reasonably supercharge Raku's capabilities.
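To put a number on that, the overall speedup at each volume can be computed from the Runtime column of the plain runs (no bad packages). The figures are copied from the tables above; the variable names are just for illustration.

```python
# Runtime in seconds as (Raku, Raku+Zig) for the plain runs at each volume.
runtimes = {
    10_000: (1.072, 0.44),
    100_000: (7.163, 0.75),
    1_000_000: (68.081, 3.758),
    10_000_000: (704.852, 38.505),
}

# Speedup = pure Raku runtime divided by Raku+Zig runtime.
speedups = {volume: raku / raku_zig for volume, (raku, raku_zig) in runtimes.items()}

for volume, speedup in speedups.items():
    print(f"{volume:>10,}: {speedup:.1f}x")
```

The ratio grows with volume and levels off at roughly 18x for the two largest runs, since the fixed startup cost matters less as the batches get longer.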

The "slow first batch" for Raku+Zig has been appearing in more understated forms in the other tests. Here the first batch takes about 1.8 times as long as the second batch (8.84s versus 4.84s). What is causing this?
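One way to track the slow first batch across volumes is to divide the first batch time by the mean of the remaining batches. The sketch below does this for the plain Raku+Zig rows, with the batch times copied from the tables above; `first_batch_ratio` is a hypothetical helper, not part of the benchmark scripts.

```python
from statistics import mean

# Raku+Zig batch times in seconds (plain runs), one list per volume.
batches = {
    10_000: [0.010978411, 0.006575705, 0.004145623, 0.004280415, 0.00468929],
    100_000: [0.079802007, 0.073638176, 0.053291894, 0.05087652, 0.050394687],
    1_000_000: [0.788083286, 0.509883905, 0.492898873, 0.500868284, 0.498677495],
    10_000_000: [8.843459877, 4.84300835, 4.991842433, 5.077245603, 4.939533707],
}

def first_batch_ratio(times):
    """How many times slower the first batch is than the average later batch."""
    return times[0] / mean(times[1:])

ratios = {volume: first_batch_ratio(times) for volume, times in batches.items()}
for volume, ratio in ratios.items():
    print(f"{volume:>10,}: first batch is {ratio:.2f}x the later average")
```

A ratio well above 1 at every volume would point to a one-time warm-up cost rather than noise, which is the question the data raises but does not answer.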

100,000,000

This doesn't seem to work. At least, I'm not patient enough. The process seems to stall, growing and shrinking memory but never finishing.

Final Thoughts

This is a preliminary report in blog post form based on a contrived code sample written for another, entirely different blog post. More data and deeper analysis will have to come later.

Zig's C ABI compatibility is clearly no put-on. It works seamlessly with Raku's NativeCall. Granted, we haven't really pushed the boundaries of what the C ABI can look like, but one of the core takeaways is that with Zig we get to design that interface. In other words, we are in charge of how ugly, or not, it gets. Considering how dead simple the extern struct <-> is repr('CStruct') support is, I don't think the function signatures need to get nearly as gnarly as they get in C.

Sussing the truth of that supposition out will take some time and effort in learning Zig. I'm looking forward to it. My first stop will probably be a JSON library that uses Zig. I'm also going to be looking into using Zig as the compiler for Rakudo, as it might simplify our releases significantly.