Optimising Ubuntu performance on amd64 architecture
Michael Hudson-Doyle
on 12 December 2023
Tags: Intel , Performance , silicon , Ubuntu
Everyone wants the Linux distribution they are using to be fast. This is practically a content-free statement, of course: who would want their distro to be slow?
But at the same time, what does it mean for your distribution to be fast? For example, Ubuntu 21.10 switched the default compression for packages to zstd, which made them faster to both download and decompress, improving the performance of one important operation on Ubuntu. But of course there are many, many other aspects of performance, and this article is about something very different: the processor features Ubuntu assumes are available.
In this post, I will talk a little about the history of the amd64 architecture and some investigations we are doing in collaboration with Intel to make better use of newer processors.
Background
By far the most used architecture for Ubuntu is amd64, also known as x86-64 in some contexts. Ubuntu is still built for the very first amd64 CPUs, the AMD K8 from 2003 and Intel’s 64-bit Prescott from 2004, using the original instruction set architecture (ISA).
Over the years, Intel and AMD have added a number of extensions to the ISA, for example:
- SIMD: SSE3, SSE4, AVX, AVX512, etc
- Special purpose: RDRAND, AES-NI, VNNI
- Slightly more general: cmpxchg16b (atomic compare and exchange), vfmadd* (fused multiply-add for floating point), movbe (byte order conversion)
Not using these new instructions to improve performance throughout the distribution seems like a missed opportunity. A few core packages like glibc and openssl do runtime detection to use newer instructions when they are available but vastly more packages do no such thing.
A significant difference between an architecture like amd64 and, say, POWER is the diversity of implementations. The POWER architecture has been extended over the years in several ways, but a processor from 2018 can reasonably be assumed to support every instruction supported by a processor from 2013. This is not at all true for the amd64 world. For example, SSE4.1 was introduced in the Penryn microarchitecture in 2007 but as late as 2012 designs that did not support it (e.g. the Centerton range of Atoms) were being released. In addition, both AMD and Intel have introduced extensions that the other has eventually implemented (as well as extensions that never really became widely used and eventually disappeared such as 3DNow! and TSX).
For a long time, the dynamic loader (part of glibc) has allowed distributions to take some advantage of newer CPU features by searching extra directories when support for these features is detected, but on amd64, versions of glibc prior to 2.33 based these additional directories on ad-hoc, poorly defined selections of capabilities. For example, to my knowledge /lib/x86_64-linux-gnu/haswell was searched on most Intel processors since 2014, but on no AMD ones at all.
In 2020, the glibc developers, particularly Florian Weimer of Red Hat, got sufficiently fed up with this mess to propose a solution on the libc-alpha mailing list: assemble reasonable sets of CPU features into “levels” that are mostly supported together, and have the dynamic loader search directories based on these names.
Some bikeshedding later, four levels were defined, each including the previous: “v1” or baseline, “v2”, “v3”, “v4” and these definitions were added to the “psABI” specification (roughly speaking the document that defines what binary code for an amd64 Linux system looks like):
| Level name | CPU feature | Example instruction |
| --- | --- | --- |
| (baseline) | CMOV | cmov |
| | CX8 | cmpxchg8b |
| | FPU | fld |
| | FXSR | fxsave |
| | MMX | emms |
| | OSFXSR | fxsave |
| | SCE | syscall |
| | SSE | cvtss2si |
| | SSE2 | cvtpi2pd |
| x86-64-v2 | CMPXCHG16B | cmpxchg16b |
| | LAHF-SAHF | lahf |
| | POPCNT | popcnt |
| | SSE3 | addsubpd |
| | SSE4_1 | blendpd |
| | SSE4_2 | pcmpestri |
| | SSSE3 | phadd |
| x86-64-v3 | AVX | vzeroall |
| | AVX2 | vpermd |
| | BMI1 | andn |
| | BMI2 | bzhi |
| | F16C | vcvtph2ps |
| | FMA | vfmadd132pd |
| | LZCNT | lzcnt |
| | MOVBE | movbe |
| | OSXSAVE | xgetbv |
| x86-64-v4 | AVX512F | kmovw |
| | AVX512BW | vdbpsadbw |
| | AVX512CD | vplzcntd |
| | AVX512DQ | vpmullq |
| | AVX512VL | n/a |
Reference: page 14 of the psABI
As alluded to above, it’s not really possible to say that a processor from a given era supports a given level, but as a rough guide most processors from 2009 onward support v2 and most processors from 2015 on support v3.
v4 is complicated: Intel’s 11th Gen processors have support, but 12th and 13th Gen processors do not, while AMD’s new Zen 4 microarchitecture adds it. It’s hard to know what the future holds for AVX512 and I’m not going to consider it for the rest of this article.
From glibc to the toolchains
Although the original idea of these levels was to rationalise the process by which the dynamic loader looks for shared libraries, they also provide a sensible label for a set of instructions assumed to be available by all parts of the distribution. Support for using “x86-64-v$N” as values for the -march flag was added to GCC in version 11 and LLVM in version 12.
It is worth noting here that we are only really talking about the C and C++ toolchains in this document. While the distribution clearly contains a great and increasing amount of code in other languages (Python, Go, Rust, Java, Ruby, …), a large majority of the code is in C/C++. For some language ecosystems, in particular Python, a lot of the performance sensitive code is in C/C++ anyway (e.g. numpy). The other statically compiled toolchains (like Rust and Go) do have support for selecting the precise ISA they target but for the rest of this document we will only think about C and C++.
Bumping the baseline?
It is a trivial change to the packaging of GCC to change the default value for -march, and some distributions have already made this change – both RHEL 9 and openSUSE Tumbleweed (as of Nov 2022) target x86-64-v2.
These changes have both a cost and a benefit:
- For users that have hardware that is too old to support v2 instructions, these operating systems will not work at all.
- For users that have paid for better hardware, these operating systems take better advantage of that hardware.
For a commercial distribution like RHEL, this probably still makes sense: if you are spending the money on a RHEL (or SLES or …) subscription, you are probably already running reasonably up-to-date hardware, or at least the additional cost of updating to hardware that is less than 10 years old is fairly insignificant. It is interesting to note that SUSE’s new “Adaptive Linux Platform” product originally proposed targeting v3 and later scaled this back to v2.
For a free distribution like Ubuntu (or Fedora), the calculation is different: allowing users to extend the life of hardware by installing a free Linux distribution is a significant, positive aspect of the open source world, and it is very likely that the users who are still running Ubuntu on 2008-era hardware are the users who are least able to upgrade.
That said, hardware doesn’t last forever. A few years ago, the cost of maintaining full support for 32-bit x86 machines started to outweigh the benefits and we stopped building most packages. Making a considered decision here requires data. Specifically:
- Usage – How many Ubuntu users are using hardware that supports only v1 or v2?
- Performance – How much performance improvement does changing the default to x86-64-v2 or x86-64-v3 bring anyway?
Neither of these questions is easy to answer.
Trying it for yourself
While we continue our own performance analysis and further assess the needs of our users, we have released an experimental Ubuntu 23.04 Server build – using -march=x86-64-v3 and -mtune=icelake-server – for the community to try out. As we consider the potential perks and drawbacks of using v3 system-wide, your feedback and observations will be an invaluable part of the process. Here are some of the questions we have for our own efforts:
- On aggregate, is the v3 version faster than the baseline v1 version of Ubuntu? As we mentioned above, this can mean a lot of different things, from quantitative benchmarking to a looser, qualitative feeling about speed from the user perspective.
- Are there certain domains where performance overwhelmingly benefits from or regresses because of v3?
- Do these changes break anything?
This Discourse post explains where to find an installer for this build, which is not only built from the rebuilt packages but will also install packages from the rebuild archive by default. Please note that this is for testing only. Systems installed using this installer will receive no security (or any other) updates and will be in no way suitable for use in production.
We will be making another post when our own benchmarking is complete to explain what we tested and the results we found. Stay tuned!