A sample use-case? I was developing an Erlang-like actor platform that had to operate under Linux as well as a bare-metal microkernel, and all I needed was a light layer over syscalls instead of pulling in the entire glibc. It also provides simple implementations of standard C functions (memcpy, printf) so I don't have to write them myself.
Another use-case is when you are writing threaded code that uses the clone() syscall instead of pthreads, usually for something with high performance, unusual clone flags, or a very small stack.
Most libc functions, including the syscall wrappers and all pthreads functions, aren't safe to call in threads created by raw clone(). Anything that reads or writes errno, for example, is not safe.
I've had to do this a couple of times. One, a long time ago, was a real-time audio-mixing thread for a video game, which had to keep the audio device fed with low-latency frames for sound effects. In those days, pthreads wasn't good enough. For talking to the audio device, we had to use the Linux syscall wrapper macros, which have since been replaced by nolibc. More recently, a thread pool for high-performance storage I/O, which ran slightly faster than io_uring, and ran well on older kernels and on kernels with io_uring disabled for security.
Where was this when I needed it!
As one of my class projects, I built a Linux compatibility layer for the toy OS we had built, by adding a proper ELF loader and emulating syscalls. I really struggled to get glibc or even musl working, and so I ended up hand-coding some `-nostdlib` programs instead of being able to use coreutils. If nolibc really works as a minimal libc, it would have been incredibly cool to be able to run coreutils on my OS!
nolibc seems kinda neglected, or like a minimal subset of what's actually useful. There's no pread/pwrite etc, only read/write, forcing you to use lseek and ruining concurrent use.
It doesn't strive to be a complete implementation.
If users send patches to support them, new syscalls can be added.
Or downstream users can keep those definitions in their own program.
More complex features like threads are out of scope.
(Disclaimer: I'm one of the maintainers)
Sure, I'm more saying, because nolibc is so minimal, that still leaves room for things like this Chromium library.
Wasn’t that originally just for integration tests where you wanted to boot a minimal image that just runs your kernel CI test?
Do I understand correctly that nolibc is just another implementation of the C standard library in terms of Linux syscalls? Comparably to, say, musl libc?
glibc is a space shuttle, musl is a hatchback, nolibc is a skateboard
They all do the same thing (take you from A to B), but offer different levels of comfort, efficiency and utility :)
Who can take their space shuttle to work these days, what with the price of rocket fuel‽
It's a bit unwieldy, but the good thing is that it comes for free with your copy of GNU/Linux!
Apparently almost every linux app
And parking is always a nightmare for my shuttles
And passenger safety?
In a head-on collision, the space shuttle passengers will fare better than the hatchback. Even so, it wouldn't be my first choice for most destinations.
No it's not comparable to musl libc. Standard I/O functions don't support buffering and the printf implementation can't print floats, for example.
> What would be a use-case?
Maybe bootstrapping a new language with no dependencies.
Yes. Go for example doesn't use glibc and instead interfaces with syscalls directly.
https://pkg.go.dev/syscall
Except there are some platforms where you need to go through libc and the direct syscall interface is considered private, or subject to change. OpenBSD is like this, and I believe Mac is too.
Linux is the only one that is not like this. I wrote an article about the subject:
https://www.matheusmoreira.com/articles/linux-system-calls
While Linux does have a stable syscall interface, unlike other OSes, you still want to go through glibc, at least for NSS; otherwise your app could be broken.
Golang has CGO_ENABLED=1 as the default for this reason.
Worth mentioning that the golang.org/x/sys/unix package has better support for syscalls than the og syscall package nowadays, especially for some of the newer ones like cachestat[0] which was added to the kernel in 6.5. AFAIK the original syscall package was 'frozen' a while back to preserve backward compatibility, and at one point there was even a bit of drama[1] around it being marked as deprecated instead of frozen.
[0]: https://github.com/golang/go/issues/61917 [1]: https://github.com/golang/go/issues/60797
Undeprecating something is truly a rare sight.
So far I only knew about PHP undeprecating "is_a" function, so I guess this puts Go in good company ^^
Didn't they go back to Glibc in 2017 after a syscall silently corrupted several of their tightly packed tiny Go stacks? The page you link to seems to refer to a proposal from 2014 as "new".
> Didn't they go back to Glibc in 2017 after a syscall silently corrupted several of their tightly packed tiny Go stacks?
You must be thinking of https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/ which was about the vDSO (a virtual dynamically linked library which is automatically mapped by the kernel on every process), not system calls. You normally call into the vDSO instead of doing direct system calls, because the vDSO can do some things (like reading the clock) in an optimized way without entering the kernel, but you can always bypass it and do the system calls directly; doing the system calls directly will not use any of the userspace stack (it immediately switches to a kernel stack and does all the work there).
IIRC that was specifically on macOS and other BSDs which don't have a stable syscall interface. They still use raw syscalls on Linux, which guarantees syscall stability on pain of Linus Torvalds yelling at you if you break it.
I'm 100% with Linus on this one.
Linus ships a kernel: where would his stable interface live if not the syscall ABI? The *BSD and macOS folks ship operating systems, where they have the option of defining their ABI at a higher level of abstraction.
And for what it's worth, the reason Windows has such strong binary backwards compatibility that Win 98 programs can easily run on Win 11 is that it has this extra abstraction layer.
Linux could have made their own libc and mandated use of it. But they didn't. They chose a language agnostic binary interface that's documented at the instruction set level.
As a result of that brilliant design choice, every single language can make Linux system calls natively. It should be simple for JIT compilers to generate Linux system call code. No need to pull in some huge C library just for this. AOT compilers could have a linux_system_call builtin that just generates the required instructions. I actually posted this proposal to the GCC mailing list.
> Linux could have made their own libc
That's not what Linux is, though. It's a kernel. libc is a userspace library. The Linux developers could also make their own libpng and put their stable interface in there, but that's not in scope for their project.
> As a result of that brilliant design choice, every single language can make Linux system calls natively.
That is like saying it's a brilliant design choice for an artist to paint the sky blue on a sunny day. If Linux is a kernel, and if a kernel's interface with userspace is syscalls, and if Linux wants to avoid breaking userspace with kernel updates, then it needs a stable syscall interface.
> No need to pull in some huge C library just for this.
Again, I'm not sure why the Linux project would invent this "huge C library" to use as their stable kernel interface.
They could but they didn't. At some point, Linux almost got its own klibc. The developers realized such a thing wasn't needed in the kernel. Greg Kroah-Hartman told me about it when I asked on his AMA:
https://old.reddit.com/r/linux/comments/fx5e4v/im_greg_kroah...
The importance of this design should not be understated. It's not really an obvious thing to realize. If it was, every other operating system and kernel out there would be doing it as well. They aren't. They all make people link against some library.
So Linux is actually pretty special. It's the only system where you actually can trash the entire userspace and rewrite the world in Rust. Don't need to link against any "core" system libraries. People usually do but it's not forced upon them.
> if Linux wants to avoid breaking userspace with kernel updates, then it needs a stable syscall interface
Every kernel and operating system wants to maximize backwards compatibility and minimize user space breakage. Most of them simply stabilize the system libraries instead. The core libraries are stable, the kernel interfaces used by those core libraries are not.
So it doesn't follow that it needs a stable syscall interface. They could have solved it via user space impositions. The fact they chose a better solution is one of many things that makes Linux special.
> The importance of this design should not be understated. It's not really an obvious thing to realize. If it was, every other operating system and kernel out there would be doing it as well.
No, they would not. I can say this with confidence because at any point in the last several decades, any OS vendor could have started to do so, and they have not. They have uniformly decided that having a userspace library as their stable kernel interface is easier to maintain, so that's what they do. The idea that the rest of the world hasn't "realized" that, in addition to maintaining binary compatibility in their libc, they could also maintain binary compatible syscalls is nonsensical.
The Linux kernel, on the other hand, doesn't ship a userspace. If they wanted their stable interface to be a userspace library, they'd need to invent one! And that would be more work than providing stable syscalls.
> So Linux is actually pretty special. It's the only system where you actually can trash the entire userspace and rewrite the world in Rust.
That's not rewriting the world, that would be a new userspace for the Linux kernel. You're still calling down into C, there's just one fewer indirection along the way.
> So it doesn't follow that it needs a stable syscall interface. They could have solved it via user space impositions.
They could have, but as Greg Kroah-Hartman pointed out, that would have just shifted the complexity around. Stability at the syscall level is the simplest solution to the problem that the Linux project has, so that's what they do.
It would be pretty funny if the kernel's stability strategy was in service of allowing userspace to avoid linking a C library, considering it's been 30+ years and the Linux userspace is almost entirely C and C++ anyway.
I’m aware of this, but I really don’t see the benefits of this approach; it causes issues in e.g. OpenBSD, where you can only call syscalls from libc, and it seems like they’re trying to outsmart the OS developers. I just don’t see an advantage.
> 1. No overhead from libc; minimizes syscall cost
The few nanoseconds of a straight function call are absolutely irrelevant next to the tens of microseconds a syscall costs. You also lose out on any optimizations a libc has that you might not have thought about (like memoization of getpid()), and you take on keeping up with syscall evolution and best practices, which a libc generally has a good handle on.
> No dependency on libc and C language ABI/toolchains
This obviously doesn't apply to a C syscall header, though, such as the case in OP :)
> Because of the aforementioned problems, since glibc 2.25,
> the PID cache is removed: calls to getpid() always invoke
> the actual system call, rather than returning a cached value.
Get rid of libc and you gain the ability to have zero global state in exchange. Freestanding C actually makes sense and is a very fun language to program in. No legacy nonsense to worry about. Even errno is gone.
> Get rid of libc and you gain the ability to have zero global state in exchange.
No you don't, you still have the global state the kernel is maintaining on your behalf. Open FDs, memory mappings, process & thread scheduling states, limits, the command line arguments, environment variables, etc... There's a shitload of global state in /proc/self/
Also the external connections to the process (eg, stdin/out/err) are still inherently global, regardless of however your runtime is pretending to treat them.
And it's not like you even managed to reduce duplicated state since every memory allocator is going to still track what regions it received from the kernel and recycle those.
> Freestanding C actually makes sense [..] No legacy nonsense to worry about
GNU aren't the OS developers of the Linux kernel. Think of the Go standard library on Linux as another libc-level library. On the BSDs there is a single libc that's part of the OS, on Linux there are several options for libc.
This is a big one. Linking against libc on many platforms also means making your binaries relocatable. It's a lot of unnecessary, incidental complexity.
You can still randomize heap allocations (but not with as much entropy), as usually the heap segment is quite large. But you don't get randomization of, e.g. the code.
ASLR is a weak defense. It's akin to randomizing which of the kitchen drawers you'll put your jewelry in. Not the same level of security as say, a locked safe.
Attacks are increasingly sophisticated, composed of multiple exploits in a chain, one of which is some form of ASLR bypass. It's usually one of the easiest links in the chain.
> On the other hand all of that comes back to bone you if you’re trying to benefit from vDSO without going through a libc.
At least the vDSO functions really don't need much in the way of stack space: generally there's nothing much there but clock_gettime() and gettimeofday(), which just read some values from the vvar area.
The bigger pain, of course, is actually looking up the symbols in the vDSO, which takes at least a minimal ELF parser.
> At least the vDSO functions really don't need much in the way of stack space: generally there's nothing much there but clock_gettime() and gettimeofday(), which just read some values from the vvar area.
> OpenBSD allows making syscalls from static binaries as well.
Do you have a source for this? My Google searches and personal recollections say that OpenBSD does not have a stable syscall ABI in the way that Linux does and the proper/supported way to make syscalls on OpenBSD is through dynamically linked libc; statically linking libc, or invoking the syscall mechanism it uses directly, results in binaries that can be broken on future OpenBSD versions.
I upvoted for the great links, but I still don't think a static binary that will break in the future is meeting the expectations many have when static linking.
> we here at OpenBSD are the kings of ABI-instability
> Program to the API rather than the ABI. When we see benefits, we change the ABI more often than the API.
> I have altered the ABI. Pray I do not alter it further.
The term ABI here though is a little imprecise. I believe it just refers to the syscall ABI. So, it should be possible to make an "almost static" binary by statically linking everything except libc, and that binary should continue to work in future versions of OpenBSD.
It's a lisp interpreter with a built in system-call primitive. The plan is to implement everything else from inside the language. Completely freestanding, no libc needed. In the future I expect to be able to boot Linux directly into this thing.
Only major feature still needed for kernel support is a binary structure parser for the C structures. I already implemented and tested the primitives for it. I even added support for unaligned memory accesses.
Iteration is the only major language feature that's still missing. I'm working on implementing continuations in the interpreter so that I can have elegant Ruby style iteration. This is taking longer than expected.
This interpreter can make the Linux kernel load lisp modules before its code even runs. I invented a self-contained ELF loading system that allows embedding arbitrary data into a loadable ELF segment that the kernel automatically maps into memory. Then it's just a matter of reaching it via the auxiliary vector. The interpreter uses this to automatically run code, allowing it to become a freestanding lisp executable.
nolibc is NOT under GPL. See first line of each file.
/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
It's technically not part of Linux's headers either. It's published under the tools subdirectory, so it's something that ships along with the kernel, but isn't used by the kernel itself. Basically it's there because some people might find it useful, but it could've just as well been a separate repo.
Ya totally - those wacky people at chrome must've just never heard of those headers /s
What you don't understand, because you don't work on Chrome or Chrome-sized projects, is that generic, lowest-common-denominator implementations cannot be optimal for all use-cases, and at scale (a Chrome-sized project) those inefficiencies matter. That's why this exists, that's why folly exists, that's why abseil exists, that's why no, not everyone can just use boost, etc etc etc
Well... last time I looked at the assembly code for syscall entry on x86_64, I was scared away: that piece of "assembly" requires some very niche C compiler options to be compatible (stack alignment and more, if I recall properly).
Linux "C" code's hard dependency on gcc/clang ("ultra complex compilers") is getting worse by the day. It should (very easy to say, I know) have stayed very simple and plain C99+, with smart macro definitions to be replaced with pure assembly for the bits modern hardware programming is missing (atomics/memory barriers, explicit unaligned access, etc). Abominations like _Generic (or static assert, __thread, etc) are toxic additions to the C standard. There are too many toxic additions and not enough removal/simplification/hardening in ISO C; yes, we will have to define a "C profile" which breaks backward compatibility with hardening and simplifications.
I'm not saying all extensions are bad, but they need more "smart and pertinent pick and choose" (and I know this is a tough call), just because they "cost". For instance, for a kernel, we know it must have fine-grained control of ELF object sections... or we would get many more source files (one per pertinent section) or "many more source configuration macros" (...but there I start to wonder if that was not actually the "right" way, instead of requiring a C compiler to support such an extension; it moves everything to the linker script, which is "required" for a kernel anyway).
Linus T. is not omnipotent and can only do so much, and a lot of "official" Linux devs are putting really nasty SDK dependency requirements into everyday/everywhere kernels.
That said, on my side, many of my user apps now use Linux syscalls directly... but they are written in RISC-V assembly interpreted on x86_64 (I have a super lean interpreter/syscall translator written in x86_64 assembly, and a super lean executable format wrapped in the ELF executable format), or in very plain and simple C99+ (legacy, or because I want some apps to be a bit more 'platform crossy'... for now).
Can you elaborate on the complexity here for syscall entry on x86_64? (Or link to what you were reading?) Another commenter linked to Linux's own "nolibc", which is similar to, though simpler than, the Google project in the OP. Its x86_64 arch support is here, which looks simple enough, putting things into registers: https://github.com/torvalds/linux/blob/master/tools/include/...
I don't see any complex stack alignment or anything which reads to me like it would require "niche C compiler options", so I'm curious if I'm missing something?
It is hard to take seriously someone that claims that thread locals are a toxic addition to the standard. (incidentally __thread is a GCC extension that predates the standard by almost a decade).
This little keyword makes the compiler generate calls to the system threading libs (libpthread), or even worse, declare a TLS slot in the ELF format, which is a limited resource (you cannot reasonably resize the TLS segment on the fly).
Yes, that amount of complexity is obviously toxic... and saying otherwise is what will make you hard to take seriously, come on dude...
Disappointing that errors are still signaled by assigning to `errno` (you can apparently override this to some other global, but it has to be a global or global-like lvalue).
The kernel actually signals errors by returning a negative error code (on most arches), which seems like a better calling convention. Storing errors in something like `errno` opens a whole can of worms around thread safety and signal safety, while seemingly providing very little benefit beyond following tradition.
There's a funny circular dependency in the glibc sources: errno lives in the TLS block, which is allocated using __sbrk, which can set errno before it's allocated (see __libc_setup_tls).
The branch that actually touches the errno is unlikely to be executed. However I did experience a puzzling crash with a cross-compiled libc because the compiler was smart enough to inject a speculative load of errno outside of the branch. Fun times.
While that might be true and the industry has evolved and learned about "better" ways, the old systems still exist. I don't see any reason to complain about it.
Yes, we can do better. Yes, we probably should do better. But in some cases you really have to think through every edge case and in the end someone has to do it. So just be grateful for what we have.
Disappointing is an understatement. Can't believe these people are making a browser. I'm sure they have some Google-flavored excuse for why to repeat this ridiculous threadlocal errno API.
> We try to hide some of the differences between arches when reasonably feasible. e.g. Newer architectures no longer provide an open syscall, but do provide openat. We will still expose a sys_open helper by default that calls into openat instead.
Sometimes you actually want to make sure that the exact syscall is called; e.g. you're writing a little program protected by strict seccomp rules. If the layer can magically call some other syscall under the hood this won't work anymore.
Glibc definitely does this transparent mapping as well. Calling
int fd = open(<path>, O_RDONLY)
yields
openat(AT_FDCWD, <path>, O_RDONLY)
when running through strace.
This really surprised me when I was digging into Linux tracing technology and noticed no `open` syscalls on my running system. It was all `openat`. I don't know when this transition happened, but I totally missed it.
You have to be pretty careful when using syscalls directly, at least in the presence of some libc. For example, from what I have gathered from random tidbits here and there, it's not a good idea to use any of the fork-like calls (fork, clone, etc.) if you have any threads that were created using glibc.
Just a friendly reminder that syscall() is a vararg function. Meaning, you can't just go throwing arguments at it (so maybe it's better to use this wrapper to avoid problems instead).
For example, on a 64-bit arch, this code would be sus.
The last argument would be on the stack instead of in a register which is where the kernel expects to find the arguments. But a proper syscall implementation would handle this just fine (e.g. <https://github.com/bminor/glibc/blob/ba60be873554ecd141b55ea...>), so I don't think there's anything sus about it.
The problem is something a bit different (jstarks figured it out somewhere below). I'm not a compiler/ABI engineer, but it seems to depend on the compiler, e.g. consider this with clang-16:
#include <sys/syscall.h>
#include <unistd.h>
#include <alloca.h>
#include <string.h>
/* Leaves all-ones garbage in the argument registers (and one stack
   slot) right before the syscall() call below. */
void s(long a, long b, long c, long d, long e, long f, long g) {
}
int main(void) {
    long a = 0xFFFFFFFFFFFFFFFF;
    s(a, a, a, a, a, a, a);
    /* The int literals are only default-promoted to 32-bit int, so the
       upper halves of the 64-bit argument registers may still hold the
       garbage left behind by s(). */
    syscall(9999, 1, 2, 3, 4, 5, 6);
    return 0;
}
I think you misunderstand. The red zone is on the opposite side of rsp. This line is trying to read an argument that may not exist, relying on the fact that this will put garbage in the register which syscall then ignores. But this only works if the memory is readable.
They are not some professor at my school, some valued colleague, or a known kernel expert. They are a stranger on the internet. No, I can't be bothered to research every person who claims to have some wisdom they won't articulate, to cultivate an air of mystery.
Professors get paid, we got a pointer and a lesson for free.
Here's a free dollar. "Only one?"
They said the word "vararg". That is everything. You take that, and you say "oh shit, right, thanks for the heads up" or if you don't already know what's so special about that, you do know they obviously said that for some reason so you fucking google it.
Either way, they pointed you in the right direction, and that is helpful.
The further reading that you find so unbearable takes you exactly the same time to read as something written again on the spot, bespoke for you; the difference is that it has already been written and is just sitting out there, free to look up.
And since, as you say, they aren't a professor or colleague you personally know and respect, why do you care if they write out a full article or just a pointer? You just said you don't trust a rando. You don't trust their full article anyway.
How do I know if they are "pointing me in the right direction"?
Once again, you assume the conclusion that their comment is helpful and correct and meaningful, and you work backwards to excuse their poor explanation that they justified with "let's say it's a quiz".
And if you don't like my reply, take your own advice and go away. Why do you care what I think of their phrasing? You're not going to get me to stop anyway, or the dozens or people who upvoted me.
Or keep swearing at me and getting downvoted, whatever floats your boat.
"How do I know if they are "pointing me in the right direction"?"
How do you justify complaining about how little they wrote for you when you will ignore them anyway, because "how do I know they are pointing me in the right direction?"
I guess if the arch’s varargs conventions do something other than put each 32-bit value in a 64-bit “slot” (likely for inputs that end up on the stack, at least), then some of the arguments will not line up. Probably some of the last args will get combined into high/low parts of a 64-bit register when moved into registers to pass to the kernel. And then subsequent register inputs will get garbage from the stack.
Need to cast them to long or size_t or whatever to prevent this.
See also Linux's nolibc headers, which let you write C software that completely bypasses libc and instead operates directly through syscalls.
https://github.com/torvalds/linux/tree/master/tools/include/...
Every kernel and operating system wants to maximize backwards compatibility and minimize user space breakage. Most of them simply stabilize the system libraries instead. The core libraries are stable, the kernel interfaces used by those core libraries are not.
So it doesn't follow that it needs a stable syscall interface. They could have solved it via user space impositions. The fact they chose a better solution is one of many things that makes Linux special.
> The importance of this design should not be understated. It's not really an obvious thing to realize. If it was, every other operating system and kernel out there would be doing it as well.
No, they would not. I can say this with confidence because at any point in the last several decades, any OS vendor could have started to do so, and they have not. They have uniformly decided that having a userspace library as their stable kernel interface is easier to maintain, so that's what they do. The idea that the rest of the world hasn't "realized" that, in addition to maintaining binary compatibility in their libc, they could also maintain binary compatible syscalls is nonsensical.
The Linux kernel, on the other hand, doesn't ship a userspace. If they wanted their stable interface to be a userspace library, they'd need to invent one! And that would be more work than providing stable syscalls.
> So Linux is actually pretty special. It's the only system where you actually can trash the entire userspace and rewrite the world in Rust.
That's not rewriting the world, that would be a new userspace for the Linux kernel. You're still calling down into C, there's just one fewer indirection along the way.
> So it doesn't follow that it needs a stable syscall interface. They could have solved it via user space impositions.
They could have, but as Greg Kroah-Hartman pointed out, that would have just shifted the complexity around. Stability at the syscall level is the simplest solution to the problem that the Linux project has, so that's what they do.
It would be pretty funny if the kernel's stability strategy was in service of allowing userspace to avoid linking a C library, considering it's been 30+ years and the Linux userspace is almost entirely C and C++ anyway.
I mean they could stuff kernel libc into vDSO.
Exactly, a stable language-agnostic binary ABI is the proper layering choice.
That is the documentation for the Go syscall package. If you scroll down to the bottom of the page you'll see links to the source files.
I’m aware of this but I really don’t see the benefits of this approach; it causes issues on e.g. OpenBSD, where you can only call syscalls from libc, and it seems like they’re trying to outsmart the OS developers. I just don’t see an advantage.
Is it faster? More stable?
There are several advantages to using kernel syscalls directly:
1. No overhead from libc; minimizes syscall cost
2. No dependency on libc and C language ABI/toolchains
3. Reduced attack surface. libc can and does have bugs and potentially ROP or Spectre gadgets.
4. Bootstrapping other languages, e.g. Virgil
> 1. No overhead from libc; minimizes syscall cost
The few nanoseconds of a straight function call are absolutely irrelevant vs the 10s of microseconds of a syscall cost. You also lose out on any optimizations a libc has that you might not have thought about (like memoization of getpid()), and you need to take on keeping up with syscall evolution and best practices, which a libc generally has a good handle on.
> No dependency on libc and C language ABI/toolchains
This obviously doesn't apply to a C syscall header, though, such as the case in OP :)
A syscall can be way less than 10us. Especially if it is not doing I/O.
But a kernel mode switch is definitely more expensive than a trivial (likely cached) jump instruction.
> you lose out on any of the optimizations a libc has that you might not or didn't think about (like memoization of getpid() )
Not much of a big deal. These "optimizations" caused enough bugs that they actually got reverted.
https://www.man7.org/linux/man-pages/man2/getpid.2.html
Get rid of libc and you gain the ability to have zero global state in exchange. Freestanding C actually makes sense and is a very fun language to program in. No legacy nonsense to worry about. Even errno is gone.
> Get rid of libc and you gain the ability to have zero global state in exchange.
No you don't, you still have the global state the kernel is maintaining on your behalf. Open FDs, memory mappings, process & thread scheduling states, limits, the command line arguments, environment variables, etc... There's a shitload of global state in /proc/self/
Also the external connections to the process (eg, stdin/out/err) are still inherently global, regardless of however your runtime is pretending to treat them.
And it's not like you even managed to reduce duplicated state since every memory allocator is going to still track what regions it received from the kernel and recycle those.
> Freestanding C actually makes sense [..] No legacy nonsense to worry about
Except for, you know, the entire language :p
GNU aren't the OS developers of the Linux kernel. Think of the Go standard library on Linux as another libc-level library. On the BSDs there is a single libc that's part of the OS, on Linux there are several options for libc.
> I just don’t see an advantage.
You don’t have to deal with C ABI requirements with respect to stack, or registers management. You also don’t need to do dynamic linking.
On the other hand all of that comes back to bone you if you’re trying to benefit from vDSO without going through a libc.
> You also don’t need to do dynamic linking.
This is a big one. Linking against libc on many platforms also means making your binaries relocatable. It's a lot of unnecessary, incidental complexity.
It also means giving up ASLR, though.
You can still randomize heap allocations (but not with as much entropy), as usually the heap segment is quite large. But you don't get randomization of, e.g. the code.
ASLR is a weak defense. It's akin to randomizing which of the kitchen drawers you'll put your jewelry in. Not the same level of security as say, a locked safe.
Attacks are increasingly sophisticated, composed of multiple exploits in a chain, one of which is some form of ASLR bypass. It's usually one of the easiest links in the chain.
> On the other hand all of that comes back to bone you if you’re trying to benefit from vDSO without going through a libc.
At least the vDSO functions really don't need much in the way of stack space: generally there's nothing much there but clock_gettime() and gettimeofday(), which just read some values from the vvar area.
The bigger pain, of course, is actually looking up the symbols in the vDSO, which takes at least a minimal ELF parser.
The kernel also provides a minimal vdso elf parser:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
> At least the vDSO functions really don't need much in the way of stack space: generally there's nothing much there but clock_gettime() and gettimeofday(), which just read some values from the vvar area.
And yet that’s exactly one of the things Go fucked up in the past: https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/
> It causes issues in eg openbsd where you can only call syscalls from libc
OpenBSD allows making syscalls from static binaries as well. If Go binaries are static, it shouldn't cause any problems.
> OpenBSD allows making syscalls from static binaries as well.
Do you have a source for this? My Google searches and personal recollections say that OpenBSD does not have a stable syscall ABI in the way that Linux does and the proper/supported way to make syscalls on OpenBSD is through dynamically linked libc; statically linking libc, or invoking the syscall mechanism it uses directly, results in binaries that can be broken on future OpenBSD versions.
> > OpenBSD allows making syscalls from static binaries as well.
> Do you have a source for this?
One article from 2019 about this can be found at https://lwn.net/Articles/806776/ (later updates https://lwn.net/Articles/949078/ and https://lwn.net/Articles/959562/). Yes, it does not have a stable system call ABI, but as long as your program was statically compiled with the libc from the same OpenBSD release, AFAIK it should work.
I upvoted for the great links, but I still don't think a static binary that will break in the future is meeting the expectations many have when static linking.
Yeah. Do you have any information as to how/when the OpenBSD system call ABI has changed recently? I wouldn't expect that to happen very often.
From 2019: https://lwn.net/Articles/806776/
In particular, from Theo de Raadt himself:
> we here at OpenBSD are the kings of ABI-instability
> Program to the API rather than the ABI. When we see benefits, we change the ABI more often than the API.
> I have altered the ABI. Pray I do not alter it further.
The term ABI here though is a little imprecise. I believe it just refers to the syscall ABI. So, it should be possible to make an "almost static" binary by statically linking everything except libc, and that binary should continue to work in future versions of OpenBSD.
Go recently got run through the wringer to remove syscalls (and various Go ports are probably still broken) due to pinsyscalls.
Indeed, Zig does this for instance (at least for x86_64 linux [0]) as a way to avoid having to link libc at all
[0] https://github.com/ziglang/zig/blob/ee9f00d673f2bccddc2751c3...
I've been working on that language for a while now!
https://github.com/lone-lang/lone
It's a lisp interpreter with a built in system-call primitive. The plan is to implement everything else from inside the language. Completely freestanding, no libc needed. In the future I expect to be able to boot Linux directly into this thing.
Only major feature still needed for kernel support is a binary structure parser for the C structures. I already implemented and tested the primitives for it. I even added support for unaligned memory accesses.
Iteration is the only major language feature that's still missing. I'm working on implementing continuations in the interpreter so that I can have elegant Ruby style iteration. This is taking longer than expected.
This interpreter can make the Linux kernel load lisp modules before its code even runs. I invented a self-contained ELF loading system that allows embedding arbitrary data into a loadable ELF segment that the kernel automatically maps into memory. Then it's just a matter of reaching it via the auxiliary vector, The interpreter uses this to automatically run code, allowing it to become a freestanding lisp executable.
I wrote an article about how it works here:
https://www.matheusmoreira.com/articles/self-contained-lone-...
> I'm working on implementing continuations in the interpreter
You might find this paper (and everything else on the site) interesting and relevant.
https://okmij.org/ftp/continuations/ZFS/context-OS.pdf
> See also Linux's nolibc headers
Kind of an understatement. The existence of an official interface obsoletes 3rd party projects like the one posted.
Might be a license thing? The Linux headers are probably GPL like the rest of Linux.
The Linux kernel licence explicitly says programs using the syscall interface are not considered derivative works and that GPL does not apply to them: https://github.com/torvalds/linux/blob/master/LICENSES/excep...
nolibc is NOT under GPL. See first line of each file.
/* SPDX-License-Identifier: LGPL-2.1 OR MIT */
It's technically not part of Linux's headers either. It's published under the tools subdirectory, so it's something that ships along with the kernel but isn't used by the kernel itself. Basically it's there because some people might find it useful, but it could just as well have been a separate repo.
nolibc seems very minimal. For example, no pread/pwrite just read/write, forcing you to lseek and ruining concurrent use.
Ya totally - those wacky people at chrome must've just never heard of those headers /s
What you don't understand, because you don't work on Chrome or Chrome-sized projects, is that generic, lowest-common-denominator implementations cannot be optimal for all use-cases, and at scale (a Chrome-sized project) those inefficiencies matter. That's why this exists, that's why folly exists, that's why abseil exists, that's why not everyone can just use boost, etc etc etc
Well... last time I had a look at the assembly code of syscall entry on x86_64, I was scared away... this piece of "assembly" does require some very niche C compiler options to be compatible (stack alignment and more if I recall properly).
Linux "C" code hard dependency on gcc/clang/("ultra complex compilers") is getting worse by the day. It should (very easy to say, I know) have stayed very simple and plain C99+ with smart macro definitions to be replaced with pure assembly or the missing bits for modern hardware programming (atomics/memory barriers, explicit unaligned access, etc), but those abominations like _generic (or static assert,__thread,etc) are toxic additions to the C standard (there are too many toxic additions and not enough removal/simplification/hardening in ISO C, yes, we will have to define a "C profile" which breaks backward compatibility with hardening and simplifications).
I don't say all extensions are bad, but I think they need more "smart and pertinent pick and choose" (and I know this is a tough call), just because they "cost". For instance, for a kernel, we know it must have fine grained control of ELF object sections... or we would get much more source files (one per pertinent section) or "many more source configuration macros" (....but there I start to wonder if it was not actually the "right" way instead of requiring a C compiler to support such extension, it moves everything to the linker script... which is "required" anyway for a kernel).
Linus T. is not omnipotent and can do only that much and a lot of "official" linux devs are putting really nasty SDK dependency requirements in everyday/everywhere kernels.
That said, on my side, many of my user apps are now directly using Linux syscalls... but they are written in RISC-V assembly interpreted on x86_64 (I have a super lean interpreter/syscall translator written in x86_64 assembly and a super lean executable format wrapped in the ELF executable format), or in very plain and simple C99+ (legacy, or because I want some apps to be a bit more cross-platform... for now).
Can you elaborate on the complexity here for syscall entry on x86_64? (Or link to what you were reading?) Another commenter linked to Linux's own "nolibc" which is similar to, though simpler than, the Google project in the OP. Their x64_64 arch support is here, which looks simple enough, putting things into registers: https://github.com/torvalds/linux/blob/master/tools/include/...
The non-arch-specific callers which use this are here, which also look relatively straightforward: https://github.com/torvalds/linux/blob/master/tools/include/...
I don't see any complex stack alignment or anything which reads to me like it would require "niche C compiler options", so I'm curious if I'm missing something?
You linked the same file twice, was that intentional?
Don't forget the cancer of AI bots...
Linux has literally never been standard C. Linus used as many GCC extensions as he could from day 1.
It is hard to take seriously someone that claims that thread locals are a toxic addition to the standard. (incidentally __thread is a GCC extension that predates the standard by almost a decade).
This little keyword makes the compiler generate calls to the system threading libs (libpthread), or even worse, declare a TLS slot in the ELF format, which is a limited resource (you cannot reasonably resize the TLS segment on the fly).
Yes, that amount of complexity is obviously toxic... and saying otherwise is what will make it hard for you to be taken seriously, come on dude...
Can't wait for Zig team to adopt this over libc, citing concerns about "libc not existing on certain configurations"[1]
[1] https://github.com/ziglang/zig/issues/1840
Zig on Linux already directly interfaces with syscalls,[0] unless your library or application directly links libc.
[0]: https://ziglang.org/documentation/master/std/#std.os.linux
Oh, there you go :)
Welcome to 2016. https://github.com/ziglang/zig/blob/5f0bfcac24036e1fff0b2bed...
:'(
Disappointing that errors are still signaled by assigning to `errno` (you can apparently override this to some other global, but it has to be a global or global-like lvalue).
The kernel actually signals errors by returning a negative error code (on most arches), which seems like a better calling convention. Storing errors in something like `errno` opens a whole can of worms around thread safety and signal safety, while seemingly providing very little benefit beyond following tradition.
There's a funny circular dependency in glibc sources because errno lives in the TLS block which is allocated using __sbrk which can set the errno before it's allocated (see the __libc_setup_tls).
The branch that actually touches the errno is unlikely to be executed. However I did experience a puzzling crash with a cross-compiled libc because the compiler was smart enough to inject a speculative load of errno outside of the branch. Fun times.
code that uses errno is also a bit harder to understand. I like the way Rust does it -- if a function can fail, it returns a Result.
While that might be true and the industry has evolved and learned about "better" ways, the old systems still exist. I don't see any reason to complain about it.
Yes, we can do better. Yes, we probably should do better. But in some cases you really have to think through every edge case and in the end someone has to do it. So just be grateful for what we have.
For old systems -- yes, of course. But designing a new, incompatible API around errno is just backwards.
I don't think this is an "old system" though.
Disappointing is an understatement. Can't believe these people are making a browser. I'm sure they have some Google-flavored excuse for why to repeat this ridiculous threadlocal errno API.
That seems a bit harsh. Maybe they just wanted it to be a drop-in replacement for glibc...
> We try to hide some of the differences between arches when reasonably feasible. e.g. Newer architectures no longer provide an open syscall, but do provide openat. We will still expose a sys_open helper by default that calls into openat instead.
Sometimes you actually want to make sure that the exact syscall is called; e.g. you're writing a little program protected by strict seccomp rules. If the layer can magically call some other syscall under the hood this won't work anymore.
musl does this too. glibc may also, I haven't checked in a long time. I bet rust, etc., does too. You always need to check.
Glibc definitely does this transparent mapping as well. Calling int fd = open(<path>, O_RDONLY) yields openat(AT_FDCWD, <path>, O_RDONLY) when running through strace.
This really surprised me when I was digging into Linux tracing technology and noticed no `open` syscalls on my running system. It was all `openat`. I don't know when this transition happened, but I totally missed it.
You have to be pretty careful when using syscalls directly, at least in the presence of some libc. For example, from what I have gathered from random tidbits here and there, it's not a good idea to use any of the fork-like calls (fork, clone, etc.) if you have any threads that were created using glibc.
I've been using my own version of this. Maybe I'll switch over, this looks more complete.
Using go is a nice way to do that by default as it also directly uses syscalls (see the sys package)
Just a friendly reminder that syscall() is a vararg function. Meaning, you can't just go throwing arguments at it (so maybe it's better to use this wrapper to avoid problems instead).
For example, on a 64-bit arch, this code would be sus.
syscall(__NR_syscall_taking_6_args, 1, 2, 3, 4, 5, 6);
Quiz: why?
PS: it's a common mistake, so I thought I'd save you a trip down the debugging rabbit hole.
A quiz is the opposite of saving someone effort.
Exactly, I am now morally bound to figure out the answer instead of going to work.
The last argument would be on the stack instead of in a register which is where the kernel expects to find the arguments. But a proper syscall implementation would handle this just fine (e.g. <https://github.com/bminor/glibc/blob/ba60be873554ecd141b55ea...>), so I don't think there's anything sus about it.
> movq 8(%rsp),%r9
This is a huge edge case, but is 8(%rsp) guaranteed to be readable memory?
Yes, see https://en.wikipedia.org/wiki/Red_zone_(computing)
The problem is something else (jstarks figured it out somewhere below). I'm not a compiler/ABI engineer, but it seems to depend on the compiler; e.g. with clang-16, strace and objdump -d show that only 4 bytes are put on the stack, but syscall will read 8. It's tricky if one doesn't control the types of the arguments passed as varargs.
I think you misunderstand. The red zone is on the opposite side of rsp. This line is trying to read an argument that may not exist, relying on the fact that this will put garbage in the register which syscall then ignores. But this only works if the memory is readable.
I never get people who aren't grateful for pointers. Being shown which direction to walk is of no value, they must also be carried there.
They didn't claim to save work, they claimed to save hitting a bug, and having to debug it.
They said the word "vararg". They gave you everything.
They are not some professor in my school, some valued colleague, or a known kernel expert. They are a stranger on the internet. No, I can't be bothered to research every person that claims to have some wisdom they won't articulate, to cultivate an air of mystery.
They gave me everything to dismiss their claim.
Professors get paid, we got a pointer and a lesson for free.
Here's a free dollar. "Only one?"
They said the word "vararg". That is everything. You take that, and you say "oh shit, right, thanks for the heads up" or if you don't already know what's so special about that, you do know they obviously said that for some reason so you fucking google it.
Either way, they pointed you in the right direction, and that is helpful.
The further reading that you find so unbearable takes you exactly the same time to read something that has already been written and is just sitting out there to look up for free, as to read something you demand they write again on the spot bespoke for you.
And since, as you say, they aren't a professor or colleague you personally know and respect, why do you care whether they write out a full article or just a pointer? You just said you don't trust a rando. You wouldn't trust their full article anyway.
How do I know if they are "pointing me in the right direction"?
Once again, you assume the conclusion that their comment is helpful and correct and meaningful, and you work backwards to excuse their poor explanation that they justified with "let's say it's a quiz".
And if you don't like my reply, take your own advice and go away. Why do you care what I think of their phrasing? You're not going to get me to stop anyway, or the dozens or people who upvoted me.
Or keep swearing at me and getting downvoted, whatever floats your boat.
"How do I know if they are "pointing me in the right direction"?"
How do you justify complaining about how little they wrote for you when you will ignore them anyway, because "how do I know they are pointing me in the right direction?"
You can't have it both ways at the same time.
Pointing out the problem in their comment is the opposite of "ignoring them".
This is such a weird discussion honestly. I don't know what you want of me, but I sure can't wait for your next strawman!
You pointed out no problem.
https://xkcd.com/1058/ :)
I guess if the arch’s varargs conventions do something other than put each 32-bit value in a 64-bit “slot” (likely for inputs that end up on the stack, at least), then some of the arguments will not line up. Probably some of the last args will get combined into high/low parts of a 64-bit register when moved into registers to pass to the kernel. And then subsequent register inputs will get garbage from the stack.
Need to cast them to long or size_t or whatever to prevent this.
Yes
0-Day incoming
So web apps can make Linux syscalls? Or is it about ChromeOS?
The chrome browser itself I would think