Note: this was written in early January 2022 but took a while before posting. In the meantime Evan Martin came up with n2, which departs from a fair number of assumptions in the ninja code base. Other exciting projects are also coming along.
I spent the 2021-2022 holidays rewriting Ninja in Go to learn more about rewriting highly optimized software in a new language. This was fun because the baseline is that ninja is extremely efficient, which didn’t leave much wiggle room for an inefficient translation! I learned a few things along the way about what worked and what surprised me. If you see yourself specifically having to do a C++ to Go translation, this post is for you.
My primary goal in rewriting Ninja in Go was to optimize no-op incremental builds on very large projects. The difference between a 10 second and a 1 second incremental build is very significant for developers. The current ninja maintainers are doing a great job at keeping the project as stable and as boring as possible (this is a good thing!).
On the other hand, going from 10s to 1s may require drastic changes to the way the internals work. I felt that “just yet another ninja fork” was not the best way to go. It was clear to me that getting 10x faster required not only doing less, but also doing it with more concurrency. Other ideas floating around were to use a daemon, or other tricks that are really more of a pain in C++ than in Go. Source based distribution in Go is as easy as with npm or Rust; all an order of magnitude easier than in C++. Since go1.17, releasing a Go executable is:
go install github.com/maruel/nin/cmd/nin@latest
It can hardly get easier than that.
My secondary goal was to make it a library to enable more dependency graph analysis. I extracted the code from the “main” executable, which forced me to think about the API layer. There are a few reasons for this. It forced me to make the code clearer; I only exposed what was used by the ninja.cc main file. It also permitted me to extract the core algorithm, which is the first step to mutating it.
There are two areas of improvement I’d like to see eventually: content addressed builds and native remote step execution. A leader for content addressed builds is bazel. One big problem with bazel is that it owns the whole stack; you cannot compose it with other systems the way GN, CMake or Meson do with ninja. Remote step execution doesn’t require a content addressed build; both features are completely orthogonal. Where it matters is that a content addressed build will execute faster when run remotely due to better cache hits in the content addressed cache. In fact Goma (born out of Chromium) is not fundamentally content addressed; the files transferred are content addressed but the steps are not. I think that eventually it can be done with a clear remote execution API.
I wrote Chromium’s task distribution service (Swarming) and participated in prototyping what eventually became bazel’s native remote build execution API, so it’s a problem space dear to me. As for incremental start, I’ll discuss it more below.
My day job is managing Fuchsia Engineering Productivity. The Fuchsia no-op incremental build currently sits around 5.2 seconds on a fairly powerful workstation. This post is not about Fuchsia itself, but optimizing for this specific use case was the primary quantified target.
Why Go instead of, say, Rust? Because I knew that I could disable the garbage collector for a quick boost and that I personally write Go code significantly faster than Rust. I know the ecosystem better: tools like profiling, code coverage, documentation. I wanted to learn more about profiling than anything else. And to see how fast I could get it done!
Jeff thinks it’s a stupid idea and I respect him.
I knew I had a 90% chance of failure, due to lack of interest in the final outcome, or my own failure to make it faster than the C++ implementation. Being as good is not good enough; it has to provide more value than that.
The acceptance aspect was less clear; I couldn’t ask colleagues about hypotheticals, so I basically had to do it first and show the outcome as a fait accompli. Nobody at work believed I could do it within a month, which I did. As Evan mentioned in his 2020 ninja retrospective, the social aspect is the most important and I knew it would be an extremely hard sell. As such, I decided to cherish the journey as much as the end result.
As of now, 11 months after writing this project, the ideas have been picked up by other team members, so I think in the end it was time well spent.
First of all, this would not have been possible if ninja’s code base was poorly written. It would not have been possible if its unit test coverage wasn’t as good as it is. Frankly, ninja’s code is great, its tests are great. It really made the process a breeze. Which is kind of ironic: if it’s great, why do the rewrite in the first place? My TL;DR is that concurrency and RPCs in C++ are much harder than in Go, and making good reusable C++ libraries is still hard, definitely much harder than in current generation languages (Go, Rust, nodejs).
I started the effort by writing a throw-away program, github.com/maruel/cc2go, to do regexp based conversion of anything that I could do in an automated way. One of my biggest surprises was that the hardest part was figuring out how to add brackets {} to bracket-less one liner conditions in C++. Figuring out the right algorithm felt like doing a coding interview, implementing just the right structure with a recursive algorithm.
The generated code opted out of compilation with a //go:build nobuild constraint. This permitted me to make incremental progress, keeping the project as a whole compilable and testable, even if there wasn’t much to play with initially.
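For reference, an opted-out file looks something like the following minimal sketch (the exact contents here are illustrative); since the nobuild tag is never passed to go build, the file is simply ignored until it is cleaned up and the constraint removed:

//go:build nobuild

package nin

// Machine-translated code that does not compile yet lives below this
// constraint; removing the //go:build line above is the signal that the
// file has been manually reviewed and fixed up.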
I heavily leveraged GitHub Actions. Having used it for my other popular projects (panicparse and periph.io), it was well known territory to me.
I enabled all sorts of linters, eventually enforcing them as post submit hard checks. Using linters (go vet, go vet shadow, staticcheck, gosec) as hard checks is generally recommended against, but for a C++ to Go translation the trade off was worth it, as they caught a large number of serious errors. As a note, golint is now officially deprecated but I hadn’t realized it at the time.
As soon as the code started being compilable, I enabled Codecov’s code coverage integration. This permitted me to easily spot under-tested areas, directly from my web browser.
C++ iterators are something. They are verbose, and the algorithm methods are heavily overloaded. This means that each call site could do a subtly different thing based on argument types. The std::set<>::insert() function has 8 overloads, some returning a std::pair<iterator, bool>, others an iterator, others void. It was tricky when the iterator return value was used.
Lack of clarity made the conversion process significantly harder. In Go, the for x := range y idiom is used to iterate over a collection. In C++, set is ordered by default! So I had to figure out when ordering was actually needed (it was generally not) and where micro optimizations in the C++ code base were unnecessary in Go because Go’s map[string]struct{} is just so damn fast.
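As a hedged sketch of what such a translation looks like (illustrative code, not the actual nin sources), the C++ idiom if (seen.insert(x).second) { ... } becomes an explicit presence check on a map, with sorting only added where ordering truly matters:

package main

import (
	"fmt"
	"sort"
)

func main() {
	seen := map[string]struct{}{}

	// C++: if (seen.insert("foo").second) { ... }
	// Go: check for presence, then insert explicitly.
	if _, ok := seen["foo"]; !ok {
		seen["foo"] = struct{}{}
		fmt.Println("inserted foo")
	}

	// Unlike std::set, Go maps are unordered; when ordering actually
	// matters, sort the keys explicitly.
	keys := make([]string, 0, len(seen))
	for k := range seen {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(keys)
}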
C++ functions can only return one value. To side-step this limitation, the ninja code uses a lot of pointers to variables to return additional values. One pattern used a lot in the ninja code base is to take an err *std::string argument and return a bool. One problem with this pattern is that while most functions returned false exactly when err was set, a handful of functions didn’t match this pattern. This reduced clarity. I had to be careful during the translation and had to improve clarity by manually figuring out if the bool return value was significant or not. During the conversion process, I discovered (#2068) that one function had not returned false for many years, yet call sites were still keying on its return value.
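Here is a hedged sketch of what the translation of that pattern looks like; the function names and signatures below are illustrative, not the real ninja or nin API:

package nin

import "fmt"

// C++ signature: bool Load(const std::string& path, std::string* err);
// A literal translation keeps the bool and the output parameter:
func loadLiteral(path string, err *string) bool {
	*err = "file not found: " + path
	return false
}

// The idiomatic Go version returns an error, so the question of whether
// the bool is significant disappears entirely:
func load(path string) error {
	return fmt.Errorf("file not found: %s", path)
}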
The ninja code base is monolithic. It is not designed to be used as a library, and previous attempts at refactoring it were denied. I really wanted to do so, so I migrated the main executable outside of the primary package. This brought more clarity into what was a real API surface versus what was internal detail.
It was very interesting: it happens that the C++ code style used by ninja is a derivative of the code style used by Chromium, which is itself a derivative of Google’s C++ style, which uses an upper case first letter for classes and functions. This meant that in Go all of these were exported by default! It made my life slightly easier.
I leveraged the tool golang.org/x/tools/cmd/gorename to unexport symbols that were not used by the ./cmd/nin executable. This resulted in tens of commits similar to this one, where I unexported one or a few symbols at a time.
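Each of these commits boiled down to an invocation along these lines; the symbol name here is just an example, not one of the actual renames:

gorename -from '"github.com/maruel/nin".SomeHelper' -to someHelper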
Another great side effect of Google’s C++ style is that it disallows function overloads. Ninja’s code base does use a handful of overloaded functions; their rarity definitely helped speed up the conversion process.
I got caught by a few oddities. One worth mentioning is the behavior difference between atoi() and strconv.Atoi(): atoi() stops on the first non-parsable character but strconv.Atoi() fails if the whole string is not a number.
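To illustrate the difference, here is a small sketch; atoiPrefix is a hypothetical helper mimicking what the C semantics require:

package main

import (
	"fmt"
	"strconv"
)

// atoiPrefix mimics C's atoi(): parse the leading digits and ignore the rest.
// (Sign and leading whitespace handling are omitted for brevity.)
func atoiPrefix(s string) int {
	i := 0
	for i < len(s) && s[i] >= '0' && s[i] <= '9' {
		i++
	}
	n, _ := strconv.Atoi(s[:i]) // an empty prefix parses as 0 via the error path
	return n
}

func main() {
	fmt.Println(atoiPrefix("123abc")) // 123, like C's atoi()

	_, err := strconv.Atoi("123abc")
	fmt.Println(err) // strconv.Atoi: parsing "123abc": invalid syntax
}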
As I improved the code, I enabled both the sampling CPU profiler and the memory profile dump.
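Both hook into the standard runtime/pprof package; here is a minimal sketch with illustrative flag names, not necessarily the exact ones nin exposes:

package main

import (
	"flag"
	"log"
	"os"
	"runtime/pprof"
)

var (
	cpuProfile = flag.String("cpuprofile", "", "write a CPU profile to this file")
	memProfile = flag.String("memprofile", "", "write a heap profile to this file")
)

func main() {
	flag.Parse()
	if *cpuProfile != "" {
		f, err := os.Create(*cpuProfile)
		if err != nil {
			log.Fatal(err)
		}
		if err := pprof.StartCPUProfile(f); err != nil {
			log.Fatal(err)
		}
		defer pprof.StopCPUProfile()
	}

	// ... run the actual build logic here ...

	if *memProfile != "" {
		f, err := os.Create(*memProfile)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		if err := pprof.WriteHeapProfile(f); err != nil {
			log.Fatal(err)
		}
	}
}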
I wrote a script to run the C++ benchmarks and their Go equivalents as a comparison. This was useful to ensure that I was doing an apples-to-apples comparison with synthetic tests.
I wrote a small set of tools, github.com/maruel/pat, to help me iterate faster on the optimization process:
ba is a quick tool to run Go benchmarks against the parent commit HEAD~1.
disfunc disassembles a single function with code annotation and color highlighting.
boundcheck highlights the bound checks found in the code by processing the disassembly.
I enabled Bencher to be able to see trends in the benchmarks.
One of the biggest problems was that the synthetic ninja tests were not representative of the performance gains I was achieving with the Fuchsia no-op incremental build! I had to test manually much more than is reasonable. This was a bit unfortunate, but it is an example of why it is important to doubt measurements and to cross-correlate performance across multiple angles.
If this effort was continued, significant effort would be required to make more representative benchmarks.
My first biggest win was to identify 4 lines to remove. Like really just remove 4 lines and that’s it!
The reason is that the C++ code uses a slice of std::string. std::string has an internal buffer, so appending a single character generally does not cause a memory allocation. In Go, string is an immutable type, so these concatenations caused a lot of memory allocations.
There is still room for improvement in this area since this function is only called from the lexer, which means that when a concatenation occurs, the string content is already contiguous in memory.
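A minimal sketch (not the actual lexer code) of why the removal mattered; the first form allocates on every iteration, the second amortizes into a single buffer:

package nin

// buildString concatenates byte by byte. Each += allocates a new string
// because Go strings are immutable (assuming ASCII input here).
func buildString(parts []byte) string {
	s := ""
	for _, c := range parts {
		s += string(c)
	}
	return s
}

// buildBytes appends into a preallocated []byte and converts once at the end.
func buildBytes(parts []byte) string {
	b := make([]byte, 0, len(parts))
	for _, c := range parts {
		b = append(b, c)
	}
	return string(b)
}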
I optimized CanonicalizePath() performance by 39% by removing 2 memory copies. This was done by creating a []byte slice of the expected size and modifying it in place. Once done with the slice, it is returned as a string using the unsafe package with the “unsafeString” pattern. The rationale is that it is safe, since the slice is only ever touched by the function.
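The pattern itself is only a few lines; here is a hedged sketch of the pre-Go 1.20 form (newer toolchains can use unsafe.String instead). It is only sound because the slice is never written to after the conversion:

package nin

import "unsafe"

// unsafeString reinterprets b as a string without copying it.
// The caller must guarantee that b is never modified afterwards.
func unsafeString(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}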
HashCommand() is another fairly hot function. I decided to cheat the same way and use the same “unsafeString” pattern. This accelerated the function by 65%, from 248ns to 86ns on my workstation.
There was one problem when I optimized the function: I accidentally changed the algorithm without realizing it. 😱 So I added a unit test to ensure that my optimization still led to valid results.
I made reading text files return []byte instead of the string I originally used in the quick translation. This reduced memory usage by 8% and improved end-to-end performance by 5%. It is easy to hack a []byte into a string but the reverse is not true; doing mutating operations on a string leads to memory allocations.
I accelerated deps log file loading by 30% by reading the file in one shot instead of using the io.Reader pattern. I did a follow up using binary.LittleEndian.Uint64(data) instead of binary.Read(r, binary.LittleEndian, ...), which greatly improved performance. Sadly I forgot to note the numbers; it was a very large improvement.
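The difference boils down to decoding integers straight out of the in-memory buffer instead of going through the reflection-based io.Reader path; a hedged sketch:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"log"
)

func main() {
	data := []byte{1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0}

	// Slow path: binary.Read pulls from an io.Reader and uses reflection.
	var a, b uint64
	r := bytes.NewReader(data)
	if err := binary.Read(r, binary.LittleEndian, &a); err != nil {
		log.Fatal(err)
	}
	if err := binary.Read(r, binary.LittleEndian, &b); err != nil {
		log.Fatal(err)
	}
	fmt.Println(a, b) // 1 2

	// Fast path: index directly into the already-loaded buffer.
	a = binary.LittleEndian.Uint64(data[0:8])
	b = binary.LittleEndian.Uint64(data[8:16])
	fmt.Println(a, b) // 1 2
}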
During the translation I had commented out the preallocations. Here’s an example where I added preallocation back; it reduced memory pressure. Preallocation must not be abused: most of my attempts to preallocate resulted in worse performance overall. I was surprised more than once. Always double-check!
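In Go, the preallocation itself is just a capacity hint passed to make; a minimal sketch with an illustrative edge type:

package nin

type edge struct{ output string }

// outputsOf collects the output of every edge. Preallocating the slice with
// its final capacity means append never has to grow and copy the backing
// array.
func outputsOf(edges []*edge) []string {
	out := make([]string, 0, len(edges))
	for _, e := range edges {
		out = append(out, e.output)
	}
	return out
}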
Using stack allocated arrays for temporary buffers can help hot functions a lot. This only works for small amounts; otherwise the Go runtime has to grow the goroutine’s stack. I optimized EdgeEnv.MakePathList() and EvalString.Evaluate(), which reduced the number of memory allocations by 21%. This is a big deal to reduce pressure on the memory allocator!
This is a generic reusable optimization pattern. The pattern looks like:
var z [64]byte
var s []byte
// neededSize stands for however many bytes this particular call requires.
if l := neededSize; l > cap(z) {
	// Too large, allocate on heap.
	s = make([]byte, l)
} else {
	// Fast path. Zero heap allocation.
	s = z[:l]
}
One important feature that the synthetic tests didn’t measure is subninjas. These files are child manifests that contain more targets to build. The Fuchsia build uses over 3000 of these. My first optimization was to read these files in concurrent goroutines and process them after the main manifest. I first changed the order without concurrency to see if it would break anything, then I added concurrency.
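A hedged sketch of the idea (the types and names are illustrative, not nin’s actual internals): spawn one goroutine per subninja to do the file reads, then hand the contents back in order for serial processing.

package nin

import (
	"os"
	"sync"
)

// readSubninjas reads every subninja file concurrently but returns the
// contents in the original order, so later processing stays deterministic.
func readSubninjas(paths []string) ([][]byte, error) {
	type result struct {
		data []byte
		err  error
	}
	results := make([]result, len(paths))
	var wg sync.WaitGroup
	for i, p := range paths {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			d, err := os.ReadFile(p)
			results[i] = result{d, err}
		}(i, p)
	}
	wg.Wait()

	out := make([][]byte, 0, len(paths))
	for _, r := range results {
		if r.err != nil {
			return nil, r.err
		}
		out = append(out, r.data)
	}
	return out, nil
}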
You can pass the flag -noprewarm to see the effect of prewarming. I sadly didn’t get stats on the ninja prewarming but the effect was significant, especially on workstations with more than 8 cores.
I made the manifest parser (build.ninja) and the state processor (updating the build graph) separate goroutines. As I was progressing, I realized that I needed to be able to compare them both; doing before/after with separate commits didn’t cut it. So I refactored the code to have both implementations (serial, concurrent) live simultaneously. Having both implementations was critical to compare the performance characteristics. Implementing the concurrent one was fairly hard: errors found late in state processing still needed to be reported with the right lexer state, which resulted in a lot of lexer state data being passed around, and sending more data around was a concern performance wise.
The data is sent between the two goroutines via a channel. One surprising optimization was to send 10 items at a time to reduce both the channel length and the “context switches”, reducing the CPU overhead of concurrency by 6% and user visible latency by 4%! Since both algorithms are implemented, you can pass the -serial flag to see the impact of concurrent parsing.
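A hedged sketch of the 10-items-at-a-time batching described above (the types are illustrative): the parser accumulates a small slice and sends it as a single channel operation, so the consumer wakes up far less often.

package nin

// produce sends parsed statements to the state processor in batches of up
// to 10, reducing channel operations and goroutine wakeups.
func produce(statements []string, out chan<- []string) {
	const batchSize = 10
	batch := make([]string, 0, batchSize)
	for _, s := range statements {
		batch = append(batch, s)
		if len(batch) == batchSize {
			out <- batch
			batch = make([]string, 0, batchSize)
		}
	}
	if len(batch) != 0 {
		out <- batch
	}
	close(out)
}

// consume drains the batches and processes each statement serially.
func consume(in <-chan []string, process func(string)) {
	for batch := range in {
		for _, s := range batch {
			process(s)
		}
	}
}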
I tested with and without garbage collection (GC). This is a one line change in cmd/nin/ninja.go. Running without GC definitely helped performance but did increase the runtime variance, which was surprising to me.
A follow up worth trying would be to reenable the GC after the manifest is processed but before starting the build steps.
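The knob for this is debug.SetGCPercent from runtime/debug; a minimal sketch of the one-liner and the follow up idea (the placement here is illustrative):

package main

import "runtime/debug"

func main() {
	// Disable the garbage collector while parsing the manifest and loading
	// the logs; the process is short-lived enough that the heap growth is
	// acceptable.
	old := debug.SetGCPercent(-1)

	// ... parse build.ninja, load the build and deps logs ...

	// Follow up idea: restore the GC before running the build steps, which
	// can run for a long time and would otherwise grow the heap without
	// bound.
	debug.SetGCPercent(old)

	// ... run the build ...
}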
I saved numerous failed optimization attempts in the repository as failed_* branches. It was quite interesting to see which ideas that sounded right to me did not work out in practice. Intuition is often wrong. This is a big deal for two reasons.
First, for non-server code, you cannot and should not optimize if you do not have a reproducible benchmark. For servers, you can more easily afford to A/B test on live production data via canaries and distributed profiling; for CLI tools, distributed profiling requires analytics embedded in the tool.
Still, the first step in optimizing any code should always be to write a benchmark. Thankfully there were already great synthetic benchmarks in the ninja code, and I added more when needed for micro optimizations. The failed branches are a testament to the usefulness of these benchmarks.
Second, synthetic benchmarks only go so far and testing in the real world is irreplaceable. I leveraged the Fuchsia no-op incremental build as my primary real world benchmark, and it was really useful to catch optimizations that the synthetic benchmarks did not.
Ninja’s algorithm has a significant drawback: it stat()s all inputs before starting execution. For large projects, this is too slow. Evan Martin, the original ninja author, created a new tool called n2. It uses a different algorithm that is more aligned with the structural changes I wanted to make. The difference is that I wanted to try to achieve it incrementally, while Evan decided to start a fresh new project. Another interesting difference is that Evan decided to use Rust while I used Go.
I didn’t go into improving the subprocess performance. The way Go handles subprocesses doesn’t scale as well as I had hoped, especially compared to how ninja micro-optimizes subprocess handling. This means that getting on par with ninja for local execution performance would require significantly more work; the performance is currently visibly slower.
On the other hand, if the end goal is to run the majority of commands remotely via RPCs, it may not be as critical.
nin is still not as fast as I would like it to be; it is ~20% faster than ninja at manifest parsing but ~10% slower at actually running the build. My North Star was clear: making no-op incremental builds magically faster, which I achieved.
The social aspect has been mostly what I was expecting: all teams are understandably fairly cautious about a brand new code base. I do think that having the build graph as a library opens the door to completely changing the way the build is managed, but that is in itself a non-trivial amount of work.
I want to acknowledge Jan Niklas Hasse’s help with handling my PRs upstream and his diligence in keeping the project boring (my words, not his).
In the end, I succeeded at my personal goal of growing better knowledge of performance optimization, and I hope you learned a bit too.