I just finished watching Fast and Small C++ on YouTube. The title was a tad misleading for my taste, as the talk was a case study of a few implementations rather than a guide to writing fast, small C++ yourself, but here we are.

The speaker delivers a (very) few examples but does a case study on three implementations of the Small String Optimization (SSO): MSVC, GNU libstdc++, and LLVM’s libc++. I started writing this as a comment on YouTube but figured it was better to write it here (where nobody will read it) instead of a long post on YouTube (where nobody will read it).

The questioners touched on an interesting topic that probably doesn’t get enough attention. Implementers of systems software, probably more than application developers, almost always have room to negotiate their tradeoffs.

If you KNOW what generations of hardware you run on (perhaps you don’t care about 36-bit, middle-endian, context switching on an RCA 1802, output to clay marble tablets, or whatever), **and** you’re willing to throw in some static_asserts and a run-time test suite, it’s often OK to just plain not care if your pointer address == your SSO string length (or whatever example the questioner called out) because you’re literally writing the rule book, and you can just declare that to not be a case you care about.

* Unaligned accesses are slow? Don’t care.
* Unaligned accesses crashing? Don’t care.
* Your memory bits count from zero, up, and your data bits count from 64, down? Don’t care.

Repeat this pattern. The list might be long. You might get a bonus for documenting these decisions, asserting them, etc., so that the next person who tries to move this from a data center to a $0.09 CH32V003 at least knows there are dragons ahead. That team, operating under different constraints, is then free/encouraged/required to revisit those decisions. Centralizing those decisions is the key.
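Here’s a rough sketch of what that centralized rule book might look like. The specific requirements (8-bit bytes, 64-bit pointers, little-endian) and the function names are assumptions I made up for illustration; your list will differ. The compile-time half fails the build on a target you’ve declared unsupported, and the run-time half is meant to be called from a test suite or at start-up.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical rule book for one code base. The specific requirements are
// assumptions made up for this sketch, not anything from the talk.

// Compile-time half: an unsupported target fails the build with a readable message.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
static_assert(sizeof(void*) == 8, "this code assumes 64-bit pointers");
static_assert(sizeof(std::size_t) == sizeof(void*),
              "this code assumes size_t is as wide as a pointer");

// Run-time half: byte order is probed with memcpy, which is well defined
// for any object type.
inline bool is_little_endian() {
    const std::uint32_t probe = 0x01020304u;
    unsigned char first_byte = 0;
    std::memcpy(&first_byte, &probe, 1);
    return first_byte == 0x04;  // lowest-addressed byte holds the low-order bits
}

// Call once from the test suite or at start-up (hypothetical hook).
inline void check_runtime_assumptions() {
    assert(is_little_endian() && "this code assumes little-endian byte order");
}
```

The point isn’t these particular checks. It’s that the next team finds every assumption in one place, with a message that says which rule they just broke.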

If you’re writing code that runs inside a data center on hardware you control, you have incredible control over where that code runs and what compromises you have to make. Yes, it’d be polite to give an unfortunate time-traveler the courtesy of a failed compile-time assertion. But if you can save Google, Facebook, OpenAI, or whomever 5% of a runtime (which may correspond to the electricity bill of a small city, and that’s not even much of an exaggeration), architectural purity for compatibility with mercury delay-line memory may not matter. You can change one line in a config file and say, “Software built with flags X doesn’t run in data center nodes Y and Z. Don’t bother me if you try,” because you walked down the (virtual) halls and blessed it with the compiler team, the hardware team, the OS team, the CPU team, etc. That’s actually often OK.

I’m guessing that if you’re the Microsoft STL team, it’s within your rights to not care much if your floating-point library screams on an Arduino that’s emulating IEEE 754 in software anyway, for example. The team implementing the STL for a Google data center may not care about SSO in the way that a similar-sounding team implementing the STL for ChromeOS may have to.

Some of these bit-tweaking decisions can have HUGE ripple effects through modern large systems. If your SSO can hold 23 bytes instead of 7 at a cost of 8% compute overhead, yet it saves you $$$$ in VM/paging overhead because you’ve profiled your last zillion transactions and you KNOW that 89% of your strings are 19 bytes long or less, you can look like a hero. Maybe your cores penalize you heavily for branches but make speculation free. Maybe that all falls apart when your loops get automatically vectorized to completely different packages and not just different execution units within the same core. These things can all lead to wildly different core data structures and implementations.

If you’re implementing the std::string that goes with a specific compiler for a specific class of computing environments (STM32 just has different rules than z/OS, and this is OK), it’s fine to declare that the rules you need to play by are different. As a courtesy, document or at least announce those compromises in a test suite or at compile time.

Don’t go nuts with this liberty. If you make a web server on an ESP32 for an air quality device running in a single home and you save a clock cycle (in a way that nobody actually cares about) by making it fail silently on an ESP32-H2 when it works on an ESP32-C3 [1], you’re just a psychopath and deserve to have no friends. Don’t do that. You need to be able to justify such decisions on a really large scale, IMO. I worked on the URL parser in GPSBabel that probably parses many hundreds of URLs a year, scattered across many different computers. I also made some changes to the URL parser at Google.com. They have very, very different constraints.

I was in systems software for a long time, and I’d like one of the key takeaways from this talk to be that even if your goal is to make something as “simple” as a std::string be 24 bytes long [2], there are different approaches with different tradeoffs and different cliffs when the constraints are violated. (For example, there’s a LOT of value in having the SSO buffer start at offset zero of the structure. The equivalent of an unprototyped call to puts() on a std::string (eek!) has some chance of having at least some of the string be recognizable.)
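To make the “24 bytes, buffer at offset zero” idea concrete, here’s a minimal sketch of one possible layout. It is not the layout of any of the three libraries from the talk; the type and function names are invented, and it only covers the inline case. It assumes 64-bit pointers, and it says so with a static_assert. Storing “spare capacity” in the last byte means a completely full buffer ends in a zero byte that doubles as the terminator.

```cpp
#include <cstddef>      // offsetof, std::size_t
#include <string_view>
#include <type_traits>

// Hypothetical 24-byte layout; all names are invented for this sketch.
struct HeapRep {
    char*       data;
    std::size_t size;
    std::size_t capacity;  // a real design would steal flag bits somewhere
};

struct InlineRep {
    char          text[23];  // characters live at offset 0 of the object
    unsigned char spare;     // 23 - length; 0 doubles as the NUL when full
};

union StringRep {
    InlineRep small;
    HeapRep   large;
};

// The layout decisions, stated where the next reader will trip over them.
static_assert(sizeof(void*) == 8, "sketch assumes 64-bit pointers");
static_assert(sizeof(StringRep) == 24, "the whole string is 24 bytes");
static_assert(offsetof(StringRep, small) == 0, "inline buffer starts at offset 0");
static_assert(offsetof(InlineRep, text) == 0, "characters start at offset 0");
static_assert(std::is_standard_layout_v<StringRep>, "offsetof needs standard layout");

constexpr std::size_t kInlineCapacity = sizeof(InlineRep::text);  // 23

constexpr bool fits_inline(std::size_t length) {
    return length <= kInlineCapacity;
}

// Precondition: fits_inline(s.size()). Longer strings would take the heap
// path, which this sketch leaves out.
constexpr InlineRep make_inline(std::string_view s) {
    InlineRep rep{};  // zero-fill, so unused bytes act as terminators
    for (std::size_t i = 0; i < s.size(); ++i) {
        rep.text[i] = s[i];
    }
    rep.spare = static_cast<unsigned char>(kInlineCapacity - s.size());
    return rep;
}
```

Because the characters sit at the very start of the object and the unused bytes are zero, something that blindly treats the object’s address as a C string still sees readable text for short strings. That’s a debugging convenience, not a guarantee to build on.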

Large companies these days have access to total systems measurements we dreamed of in the past. “Let’s build the top 1000 GitHub projects on a toolchain implemented to measure X and see what the average string size is” is incredibly powerful for helping decide if the SSO length should be 7, 8, 15, 23, or (gasp) more. We have the tools these days to measure this stuff (before AND after) and to constrain the worlds where the code runs to fit those resulting decisions. Maybe the SSO values can be different for a Z80 and a Google Datacenter, where every opcode from boot to recycling is controlled by one software stack.
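As a toy version of that measurement, the sketch below counts how many observed string lengths would fit inline for a few candidate SSO capacities. The sample lengths are made up purely for illustration, and the function name is hypothetical; real numbers would come from instrumenting your own toolchain or workload.

```cpp
#include <array>
#include <cstddef>

// Counts how many of the observed lengths fit in an inline buffer that can
// hold `capacity` characters.
template <std::size_t N>
constexpr std::size_t count_fitting(const std::array<std::size_t, N>& lengths,
                                    std::size_t capacity) {
    std::size_t fit = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (lengths[i] <= capacity) {
            ++fit;
        }
    }
    return fit;
}

// Made-up sample of ten string lengths. Not real measurements.
constexpr std::array<std::size_t, 10> sample_lengths{
    {3, 5, 7, 8, 12, 15, 16, 19, 23, 40}};

// With this invented sample: 3 of 10 fit at capacity 7, 6 of 10 at 15,
// and 9 of 10 at 23.
static_assert(count_fitting(sample_lengths, 7) == 3, "capacity 7");
static_assert(count_fitting(sample_lengths, 15) == 6, "capacity 15");
static_assert(count_fitting(sample_lengths, 23) == 9, "capacity 23");
```

The interesting part in practice is pairing a table like this with the cost side (object size, cache footprint, branchiness) before and after the change.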

Signed,
Worked in really large companies.

[1] I’m not even totally sure that such a thing is possible. My point is that it’s probably not wise. There should be mountains of hair saved when shaving such yaks, not individual hairs. You should be able to point to receipts of shipping those bales of hair and include projections of shipping costs for future generations of yak hair.

[2] It’s not defended that “24” was even the best choice, though it was the choice of all three implementations. “As small as possible” has strong merit, but there are a lot of apps that deal in strings that don’t quite fit into SSO buffers. There’s also a large number of programs that won’t possibly care about the difference between SSO breaking points being 7 or 28. This is probably not why your CRUD app is slow. If you’re going to care about this, be sure it’s defensible.
