I may be mistaken, but I believe the bug still exists in a more esoteric form, and a future change could reintroduce it. The author might want to warn against use of `tokio::task::block_in_place`, if the underlying issue can't be fixed.
The reason the current approach works is that it runs on tokio's worker threads, which last for the lifetime of the tokio runtime. However, if `tokio::task::block_in_place` is called, the current worker thread is demoted to the blocking thread pool, and a new worker thread is spawned in its place.
There can be a situation where the stars align such that:
1. Thread A spawns Process X.
2. N minutes/hours/days pass, and Thread A hits a section of code that calls `tokio::task::block_in_place`.
3. Thread A goes into the blocking pool.
4. After some idle time, Thread A dies, prematurely killing Process X, causing the same bug again.
You can imagine that this would be much harder to reproduce and debug, because the thread's lifetime would be completely divorced from when the process was spawned. It's actually pretty lucky that the author reached for `spawn_blocking` instead of `block_in_place`; when benchmarking, `block_in_place` is a bit more tempting, and had they used it, this bug may have been much harder to catch.
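To make the failure mode concrete, here's a minimal sketch of the spawn pattern the article describes (assuming the `libc` crate; the function name is mine, not HQ's):

```rust
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

// Sketch: tie the child's lifetime to the *spawning thread* via
// PR_SET_PDEATHSIG. If that thread later dies (e.g. after being demoted by
// block_in_place and idling out of the blocking pool), the kernel delivers
// SIGKILL to the child.
fn spawn_with_pdeathsig(program: &str) -> std::io::Result<Child> {
    let mut cmd = Command::new(program);
    unsafe {
        cmd.pre_exec(|| {
            // Runs in the forked child before exec; the "parent" whose death
            // is watched is the thread that performed the fork.
            if libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL as libc::c_ulong) == -1 {
                return Err(std::io::Error::last_os_error());
            }
            Ok(())
        });
    }
    cmd.spawn()
}
```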
My knowledge isn't very good here, but I assumed that since they're using the single-threaded executor, everything was being spawned on the main thread. The only time new (temporary) threads were created was when calling `spawn_blocking`. And the main thread can't be moved because it's part of the `main()` call stack? Maybe...
> It is called PR_SET_DEATHSIG, and we configure it when spawning tasks using the prctl syscall like this
PDEATHSIG was, to my knowledge (85% confidence), created for the original Linux userspace pthreads implementation (LinuxThreads¹, before NPTL), back when threads were implemented via kernel processes (the kernel had no concept of threads yet). This is AFAIK also why it behaves oddly with regard to later-added kernel threads. I have a flag for "don't use this, it's highly fragile" in my head but don't remember where that's from.
If the receiving side can be controlled, there's always the option of opening a pipe; if the other end dies, that's always detectable. It doesn't work with arbitrary processes, though (random other code won't care if some fd ≥3 is suddenly closed…)
¹ https://en.wikipedia.org/wiki/LinuxThreads
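A minimal sketch of the pipe trick, assuming the `libc` crate and that both sides have agreed on which fd carries the pipe:

```rust
// Sketch: the watcher blocks on the read end of a pipe whose write end is
// held by the watched process. When every copy of the write end is gone
// (i.e. the peer died), read() returns 0.
fn wait_for_peer_death(read_fd: libc::c_int) {
    let mut buf = [0u8; 1];
    loop {
        match unsafe { libc::read(read_fd, buf.as_mut_ptr().cast(), 1) } {
            0 => {
                eprintln!("peer died: pipe closed"); // EOF
                return;
            }
            -1 if std::io::Error::last_os_error().kind()
                == std::io::ErrorKind::Interrupted => continue, // EINTR: retry
            _ => continue, // real data (or other error); ignored in this sketch
        }
    }
}
```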
There's been no fundamental change in the kernel-level representation of pthreads; they are still clone()d processes with just some sharing flags set differently that e.g. affect how PIDs work.
If you want to register per-thread signal handlers, you're forced to step outside the bounds of glibc and pthreads, which I think is quite unfortunate.
If you want to be able to spawn processes that fast, then `fork()` is NOT your friend. You want either `vfork()` (or the `clone()` equivalent) or `posix_spawn()`.
`fork()` is inherently very slow due to the need to either copy the VM of the parent, or arrange to copy pages on write, or copy the resident set of the parent (then copy any pages paged in when those page-in events happen) -- all three of these options are very expensive.
Also, what I might recommend here is to create a `posix_spawn()`-like API that gives you asynchronous notification of the exec starting or failing, so that you don't block even for that. I'd use a `pipe()` set to close on exec, which will therefore close when the exec starts; if the exec fails, I'd write the `errno` value into the pipe. That way EOF on the pipe implies the exec started, while a successful read on the pipe implies the exec failed, and you can read the error number out of the pipe.
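A rough sketch of that scheme, assuming the `libc` crate:

```rust
use std::ffi::CString;
use std::io;

// Sketch of the pipe-on-exec trick described above (glibc's posix_spawn and
// Rust's std::process report exec failures the same way). A blocking read is
// shown for brevity; the async variant would hand read_fd to the event loop.
fn spawn_reporting_exec_errors(path: &CString) -> io::Result<libc::pid_t> {
    unsafe {
        let mut fds = [0 as libc::c_int; 2];
        // O_CLOEXEC: the write end disappears the instant exec succeeds.
        if libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC) == -1 {
            return Err(io::Error::last_os_error());
        }
        let (read_fd, write_fd) = (fds[0], fds[1]);
        match libc::fork() {
            -1 => Err(io::Error::last_os_error()),
            0 => {
                // Child: argv must be NULL-terminated.
                let argv = [path.as_ptr(), std::ptr::null()];
                libc::execv(path.as_ptr(), argv.as_ptr());
                // Only reached if exec failed: ship errno to the parent.
                let err = io::Error::last_os_error().raw_os_error().unwrap_or(0);
                libc::write(write_fd, err.to_ne_bytes().as_ptr().cast(), 4);
                libc::_exit(127)
            }
            pid => {
                libc::close(write_fd);
                let mut buf = [0u8; 4];
                let n = libc::read(read_fd, buf.as_mut_ptr().cast(), 4);
                libc::close(read_fd);
                if n == 0 {
                    Ok(pid) // EOF: the exec started
                } else {
                    // Data (read errors glossed over here): exec failed.
                    Err(io::Error::from_raw_os_error(i32::from_ne_bytes(buf)))
                }
            }
        }
    }
}
```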
Normally I'd stay away from job control POSIX APIs - but since HyperQueue is a job control system, it might be appropriate if the worker were a session leader. If it dies, then all its subprocesses would receive SIGHUP - which is fatal by default.
Generally you'd use this functionality to implement something like sshd or an interactive shell. HQ seems roughly analogous.
https://notes.shichao.io/apue/ch9/#sessions
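A minimal sketch of the session-leader part, assuming the `libc` crate:

```rust
// Sketch: make the worker a session leader. Note the precise semantic:
// when a session leader with a controlling terminal dies, SIGHUP goes to
// the foreground process group of that terminal, so children need to stay
// in it for this to double as a cleanup mechanism.
fn become_session_leader() -> std::io::Result<libc::pid_t> {
    // setsid() fails with EPERM if the caller already leads a process
    // group, which is why daemons traditionally fork() first.
    let sid = unsafe { libc::setsid() };
    if sid == -1 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(sid)
}
```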
I think they don't want PR_SET_PDEATHSIG but rather PR_SET_CHILD_SUBREAPER, which I think would be both more correct than PDEATHSIG for letting them wait on grand-children / preventing grand-child-zombies, while also avoiding the issue they ran into here entirely.
They would need one special "main thread" that deals with reaping and that isn't subject to tokio's runtime cleaning it up, but presumably they already have that, or else the fix they did apply wouldn't have worked.
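A minimal sketch of the subreaper setup, assuming the `libc` crate:

```rust
// Sketch: mark this process as a "child subreaper" so that orphaned
// descendants reparent to it instead of init, letting the dedicated
// reaping thread wait() on grandchildren as well.
fn become_child_subreaper() -> std::io::Result<()> {
    if unsafe { libc::prctl(libc::PR_SET_CHILD_SUBREAPER, 1 as libc::c_ulong) } == -1 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```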
Alternatively, if they want they could integrate with systemd, even just by wrapping the children all in 'systemd-run', which would reliably allow cleaning up of children (via cgroups).
> I think they don't want PR_SET_PDEATHSIG but rather PR_SET_CHILD_SUBREAPER, which I think would be both more correct than PDEATHSIG for letting them wait on grand-children / preventing grand-child-zombies, while also avoiding the issue they ran into here entirely.
PR_SET_PDEATHSIG automatically kills your children if you die, but unfortunately doesn’t extend to their descendants.
As far as I’m aware, PR_SET_CHILD_SUBREAPER doesn’t do anything if you die. Assuming you yourself don’t crash, it can be used to help clean up orphaned descendant processes, by ensuring they reparent to you instead of init; but in the event you do crash, it doesn’t do anything to help.
PID namespaces do exactly what you want - if their init process dies, it automatically kills all its descendants. However, they require privilege - unless you use an unprivileged user namespace - but those are frequently disabled, and even when enabled, using them potentially introduces a whole host of other issues.
> Alternatively, if they want they could integrate with systemd
The problem is that a lot of code runs in environments without systemd - e.g. code running in containers (Docker, K8s, etc.); most containers don’t contain systemd. So any systemd-centric solution is only going to work for some people.
Really, it would be great if Linux added some new process grouping construct which included the “kill all members of this group if its leader dies” semantic of PID namespaces without any of its other semantics. It is those other semantics (especially the new PID number semantics) which are the primary source of the security concerns, so a construct which offered only the “kill-if-leader-dies” semantic should be safe to allow for unprivileged access. (The one complexity is setuid/setgid/file capabilities - allowing an unprivileged process to effectively kill a privileged process at an arbitrary point in its execution is a security risk. Plausible solutions include refusing to execute any setuid/setgid/caps executable, or allowing such executables to run but removing the process from the grouping when it executes one.)
> PR_SET_PDEATHSIG automatically kills your children if you die, but unfortunately doesn’t extend to their descendants
It indirectly does: unless you unset it, the child dying will trigger another run of PDEATHSIG on the grandchildren, and so on. (The setting is retained across forks, as shown in the original article.)
That’s not what the man page says:
> The parent-death signal setting is cleared for the child of a fork(2).
https://man7.org/linux/man-pages/man2/pr_set_pdeathsig.2cons...
Unless the man page is wrong?
> when the orphan terminates, it is the subreaper process that will receive a SIGCHLD signal and will be able to wait(2) on the process to discover its termination status
Seems like you don’t need a dedicated “always alive” thread if the signal is delivered to the process: tokio automatically does signal masking for its threads, so you can register for signals using its asynchronous mechanisms and avoid the issues around signal safety, which it abstracts away for you (i.e. as long as you’re handling the SIGCHLD signal somewhere, or even just ignoring it, as I don’t think they actually care?).
That being said, it’s not clear PR_SET_CHILD_SUBREAPER actually causes grandchildren to be killed when the reaper process dies, which is the effect they’re looking for here (not the reverse, where you reap forked children as they die). So you may need to spawn a dedicated reaper process, rather than a thread, to manage the lifetime of children, which is much more complicated.
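For what it's worth, a sketch of such a reap loop on top of tokio's async signal stream (illustrative names, not HQ code):

```rust
use tokio::signal::unix::{signal, SignalKind};

// Sketch: handle SIGCHLD via tokio's signal stream, so none of our own code
// runs inside a signal handler, and reap whatever is currently reapable.
async fn reap_children() -> std::io::Result<()> {
    let mut sigchld = signal(SignalKind::child())?;
    loop {
        sigchld.recv().await;
        // SIGCHLD coalesces, so drain every currently-reapable child.
        loop {
            let mut status = 0;
            let pid = unsafe { libc::waitpid(-1, &mut status, libc::WNOHANG) };
            if pid <= 0 {
                break;
            }
            eprintln!("reaped child {pid} (status {status})");
        }
    }
}
```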
> That being said, it’s not clear PR_SET_CHILD_SUBREAPER actually causes grand children to be killed when the reaper process dies
CHILD_SUBREAPER kills neither children nor grandchildren. Its effect is in the other direction, intended for sub-service-managers that want to keep track of all children. If the subreaper dies, children are reparented to the next subreaper up (or init).
Yeah, I was assuming they have something calling `wait` somewhere since they say "HyperQueue is essentially a process manager", and to me "process manager" implies pretty strongly "spawns and waits for processes".
> Edit: Someone on Reddit sent me a link to a method that can override the thread keep-alive duration. Its description makes it clear why the tasks were failing after exactly 10 seconds
> Yeah, testing if a task can run for 20 seconds isn’t great, but hey, at least it’s something
Well, a reasonable thing to me is then to use the override within the test to shorten it (e.g. set it to 1s and use a 2s timeout).
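Something like this, as a sketch (the builder method is tokio's `thread_keep_alive`; the durations are illustrative):

```rust
use std::time::Duration;

// Sketch: build the test's runtime with a 1s blocking-thread keep-alive so
// the "idle thread dies and its PDEATHSIG children die with it" scenario
// reproduces within a 2s timeout instead of needing a 20s test.
fn short_keepalive_runtime() -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_current_thread()
        .thread_keep_alive(Duration::from_secs(1))
        .enable_all()
        .build()
        .expect("failed to build tokio runtime")
}
```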
> In particular, it is not always possible for HQ to ensure that when a process that spawns tasks (called worker) quits unexpectedly (e.g. when it receives SIGKILL), its spawned tasks will be cleaned up. Sadly, Linux does not seem to provide any way of implementing perfect structured process management in user space. In other words, when a parent process dies, it is possible for its (grand)children to continue executing.
Uh, it does? It's called pid-namesp—
> There is a solution for this called PID namespaces,
I think maybe you've got the wrong idea about what "in user space" means? — processes running as root are still "in user space". The opposite of "user space" is "in the kernel".
> but it requires elevated privileges
I think that's only technically true. I believe you can unshare the PID namespace if you first unshare the user namespace — which causes the thing doing the unsharing of the user namespace to become "root" within that new namespace, and from there it is permitted to unshare the pid namespace. I think: https://unix.stackexchange.com/a/672462/6013
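A minimal sketch of that two-step unshare, assuming the `libc` crate and that unprivileged user namespaces are enabled on the host:

```rust
// Sketch: CLONE_NEWUSER makes us "root" inside the new user namespace,
// which is what authorizes CLONE_NEWPID without real privileges.
fn unshare_pid_namespace() -> std::io::Result<()> {
    if unsafe { libc::unshare(libc::CLONE_NEWUSER | libc::CLONE_NEWPID) } == -1 {
        return Err(std::io::Error::last_os_error());
    }
    // CLONE_NEWPID only affects children created *after* this call: the next
    // fork() becomes PID 1 of the namespace, and when it dies the kernel
    // kills everything left inside.
    Ok(())
}
```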
I have no idea why that hoop has to be jumped through / I don't know what is being protected against by preventing unprivileged processes from making pid namespaces.
Whether or not that fits well with HQ's design … you'd have to be the judge of that.
There's also prctl(PR_SET_CHILD_SUBREAPER, ...)
Leaving PDEATHSIG enabled would make it harder for me to sleep at night, but I understand why the alternatives probably aren't appealing. Seems like a future bug waiting to happen. At least the author knows what to expect now.
Good writeup of yet another bug different from all the other bugs.
The Linux kernel isn't really bothered by the difference between threads and processes. Threads are just processes that happen to share an address space, file descriptor table, and thread group ID (what most tools call a PID). I think there are some subtle things related to the thread group ID, but they're subtle. The rest is implemented in glibc.
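For illustration, roughly the sharing flags glibc's pthread_create passes to clone(2) (abridged; assumes the `libc` crate for the constants):

```rust
// A "thread" is just a task created with flags like these: it shares the
// listed resources with its creator and joins its thread group.
const PTHREAD_LIKE_CLONE_FLAGS: libc::c_int = libc::CLONE_VM // share address space
    | libc::CLONE_FS      // share cwd, umask, root
    | libc::CLONE_FILES   // share the fd table
    | libc::CLONE_SIGHAND // share signal handlers
    | libc::CLONE_THREAD  // same thread group, so same "PID" in most tools
    | libc::CLONE_SYSVSEM; // share SysV semaphore undo state
```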
The distinction isn't quite as subtle as you believe; it also shows up in e.g. file locks, AF_UNIX SO_PEERCRED, and with any process-directed signal.
As a matter of fact, the original implementation of POSIX threads for Linux was userspace based and had unfixable bugs and issues that necessitated introducing the concept of threads into the Linux kernel.
Are there any differences between threads and processes in how signals are handled?
I recently learned that aside from processes there are process groups, process sessions (setsid), process group and session leaders, trees have associated VT ownership data, systemd sessions (which seem to be inherited by the entire subtree and can't be purged), and possibly other layered metadata spaces that I haven't heard of yet.
And I feel like there's got to be some way to tag or associate custom metadata with processes, but I haven't found it yet.
I really wish there were an overview of all these things and how they interact with each other somewhere.
> Are there any differences between threads and processes in how signals are handled?
Yes. As signal(7) notes [0], Linux has both “process-directed signals” (which can be handled by any thread in a process) and “thread-directed signals” (which are targeted at a specific thread and only handled by that thread). For user-generated signals, the classification depends on which syscall you use (kill/rt_sigqueueinfo generate process-directed signals, tgkill/rt_tgsigqueueinfo generate thread-directed ones). For system-generated signals, it is up to the kernel code generating the signal to decide. So the same signal number can be thread-directed in some cases and process-directed in others.
[0] https://man7.org/linux/man-pages/man7/signal.7.html
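A small illustration, assuming the `libc` crate (tgkill is issued as a raw syscall, since a libc wrapper isn't universally available):

```rust
// The same signal number, sent process-directed vs thread-directed.
fn send_both(pid: libc::pid_t, tid: libc::pid_t) {
    unsafe {
        // Process-directed: any one thread that hasn't blocked it may handle it.
        libc::kill(pid, libc::SIGUSR1);
        // Thread-directed: delivered only to thread `tid` of thread group `pid`.
        libc::syscall(libc::SYS_tgkill, pid, tid, libc::SIGUSR1);
    }
}
```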
> systemd sessions (which seem to be inherited by the entire subtree and can't be purged)
At a kernel level those are implemented with cgroups.
> I really wish there were an overview of all these things
Unfortunately I think Linux has grown a complex mess of different features in this area, all of which are full of complicated limitations and gotchas. Despite attempts to introduce orthogonality (e.g. with several different types of namespaces), the end result is still a long way from any ideal of orthogonality.
> Are there any differences between threads and processes in how signals are handled?
Yes, absolutely, there are thread-directed and process-directed signals; for the latter a thread is chosen at random (more or less) to handle the signal.