A Plumber’s Wish List for Linux, updated version 4

Update: this is the third version. It incorporates the original list, adds a couple of new items, and includes references to some useful feedback and patches that have already been prepared.

We’d like to share our current wish list of plumbing layer features we are hoping to see implemented in the near future in the Linux kernel and associated tools. Some items we can implement on our own, others are not our area of expertise, and we will need help getting them implemented.

Acknowledging that this wish list of ours only gets longer and not shorter, even though we have implemented a number of other features on our own in previous years, we are posting this list here in the hope of finding some help.

If you happen to be interested in working on something from this list or able to help out, we’d be delighted. Please ping us in case you need clarifications or more information on specific items.

Thanks,

Kay, Lennart, Harald, David in the name of all the other plumbers

And here is the wish list, in no particular order:

tmpfs:

* [PATCH] support user quota on tmpfs to prevent DoS vulnerabilities on /tmp, /dev/shm, /run/user/$USER. This is kinda important. Idea: global RLIMIT_TMPFS_QUOTA over all mounted tmpfs file systems.

Patch from Davidlohr Bueso: https://lkml.org/lkml/2011/11/6/135

fanotify:

* events for renames

* allow safe unprivileged access

* pass information about the open flags to the file system monitors, in order to allow clients to figure out whether other applications opened files for writing or just read-only.

* allow to find out if a file actually was written to, when closed after opening it read-write

filesystems:

* (ioctl based?) interface to query and modify the label of a mounted FAT volume: A FAT label is implemented as a hidden directory entry in the file system, which needs to be renamed when changing the file system label. This is impossible to do from userspace without remounting. Hence we’d like to see a kernel interface that is available on the mounted file system mount point itself. Bonus points, of course, if this new interface can be implemented for other file systems as well.

* faster xattrs on ext2/3/4 (i.e. allow userspace to make use of xattrs without paying the performance penalty for the seeks. Alex Larsson will provide you with the measurement data showing how xattr checking is orders of magnitude slower when trying to implement a simple file list). Suggestion: provide a simple flag in struct stat to inform userspace whether it is worth looking for xattrs (i.e. think STAT_XATTRS_FOUND or STAT_XATTRS_MAYBE)

* fsetxattrat() and friends. Necessary for race-free recursive SELinux relabelling. New!

mounting:

* allow creation of read-only bind mounts in a single mount() call, instead of two

* allow marking mounts read-only recursively

* Similarly, allow configuration of namespace propagation settings for mount points in the initial mount() syscall, instead of always requiring two (which is racy, and ugly, and stuff).
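For illustration, the two-step sequence that the first mounting item refers to might look like the sketch below. The helper name bind_mount_readonly() is made up for this example; the mount() flags are the existing ones. The comment marks the window in which the bind mount is already visible but still writable.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical helper: bind-mount src onto dst read-only, using the two
 * mount() calls that are needed today.
 * Returns 0 on success, -1 with errno set on failure. */
int bind_mount_readonly(const char *src, const char *dst)
{
    /* Step 1: create the bind mount. It is writable at this point. */
    if (mount(src, dst, NULL, MS_BIND, NULL) < 0)
        return -1;

    /* Window: the mount is visible to others and not yet read-only. */

    /* Step 2: remount the bind mount read-only. */
    if (mount(NULL, dst, NULL, MS_BIND | MS_REMOUNT | MS_RDONLY, NULL) < 0) {
        int saved = errno;
        umount2(dst, MNT_DETACH);   /* best effort: undo step 1 */
        errno = saved;
        return -1;
    }
    return 0;
}
```

Calling it requires the privileges needed for mount(); without them, or with paths that do not exist, the first call fails and the helper returns -1.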

memory management:

* swappiness control as madvise() for individual memory pages

* A new flag for madvise() that allows marking of memory as “droppable”, meaning that it may be forgotten by the kernel under memory pressure, and in that case on the next access results in SIGBUS, which the process can hook into to refill the memory space. This would allow implementation of sensible userspace memory paging where memory can be filled on-demand as necessary and is automatically dropped as the kernel sees fit. (This feature has been requested by the Mozilla folks).

* Plugins, libraries and other code that needs to hook into SIGBUS/SIGSEGV would prefer to do that for specific memory ranges only, in order not to interfere with the handlers installed by other code within the same process. This is required to make userspace memory paging viable for modern applications which link a lot of code from different sources in the main process. It would hence be desirable if there was a way to define a set of SIGBUS/SIGSEGV handlers for specific memory ranges, to avoid any issues regarding “who controls SIGBUS?”. (Also a request by the Mozilla folks)

core kernel:

* allow 64 bit PIDs / use 32 bit pids by default (currently 15 bit by default, 22 bit possible), in order to fix PID recycle vulnerabilities

* allow changing argv[] of a process without mucking with environ[]:

Something like setproctitle() or a prctl() would be ideal. Of course it is questionable whether services like sendmail would make use of this, but on the other hand, for services which fork but do not immediately exec() another binary, being able to rename these child processes in ps is of importance.
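As a point of comparison, the closest existing knob is prctl(PR_SET_NAME), sketched below. It only changes the short comm name of the calling thread (15 bytes plus NUL, as shown in /proc/$PID/comm), not argv[], so it does not cover the request above. The wrapper names set_comm() and get_comm() are made up for this example.

```c
#include <errno.h>
#include <string.h>
#include <sys/prctl.h>

/* Hypothetical wrapper: set the comm name of the calling thread.
 * The kernel would silently truncate longer names, so reject them here.
 * Returns 0 on success, -1 with errno set on failure. */
int set_comm(const char *name)
{
    if (strlen(name) > 15) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return prctl(PR_SET_NAME, (unsigned long) name, 0, 0, 0);
}

/* Hypothetical wrapper: read the comm name back into a 16-byte buffer. */
int get_comm(char buf[16])
{
    return prctl(PR_GET_NAME, (unsigned long) buf, 0, 0, 0);
}
```

A forked worker could call set_comm("worker: idle") right after fork(); the full command line shown by ps would still be the parent's.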

* Reading /proc/$PID/exe should not require CAP_SYS_PTRACE but a more sane capability instead (or, even, no cap at all)

* sockopt for changing AF_UNIX datagram queue length (currently defaults to 10 globally)

* implement a childfd() API which allows userspace code to subscribe to the wait status of specific PIDs with a poll()able fd. The existing APIs of signalfd()+waitid() are not useful for this, since they require changing the process-wide signal mask before any thread is created, which is impossible for most libraries, especially if they are pulled in from loadable modules. Libraries generally are only interested in (and should only reap) the specific child PIDs they themselves actually created, and they often have many of them, which means the directed waitid(pid, …) call as well as the undirected waitid(-1, …) call are not appropriate/efficient for this use case.

Suggested API:

    fd = childfd();
    pid = fork();
    if (pid == 0) { /* … child … */ exit(0); }
    childfd_add(fd, pid);

* An API how userspace can be notified about suspend/resume cycles. Useful for code like Avahi, to implement the protocol for refreshing services on resume.

* namespaced core_patterns so that coredumps end up in the containers they are generated in rather than in the host. New!

* allow setting default foreground/background color for VTs. New!

* dumpable flag should be readable from /proc/$PID/status or so. New!
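For context on the dumpable item: a process can already query and set its own flag with prctl(), as in the sketch below, but there is no comparable way to read the flag of another PID. The wrapper names are made up for this example.

```c
#include <sys/prctl.h>

/* Hypothetical wrapper: return the dumpable flag of the calling process
 * (0 = not dumpable, 1 = dumpable, 2 = readable by root only),
 * or -1 on error. */
int own_dumpable_flag(void)
{
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}

/* Hypothetical wrapper: switch the flag of the calling process on or off.
 * Returns 0 on success, -1 on error. */
int set_own_dumpable(int on)
{
    return prctl(PR_SET_DUMPABLE, on ? 1 : 0, 0, 0, 0);
}
```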

driver model:

* export ‘struct device_type fb/fbcon’ of ‘struct class graphics’

Userspace wants to easily distinguish ‘fb’ and ‘fbcon’ from each other without the need to match on the device name.

cgroups:

* fork throttling mechanism as basic cgroup functionality that is available in all hierarchies independent of the controllers used:

This is important to implement race-free killing of all members of a cgroup, so that cgroup member processes cannot fork faster than a cgroup supervisor process could kill them. This needs to be recursive, so that not only a cgroup but all its subgroups are covered as well.

Related: Patches for task_counter from Frederic Weisbecker

http://article.gmane.org/gmane.linux.kernel/1198795

Possibly use the freezer Tejun is looking into.

* proper cgroup-is-empty notification interface:

The current call_usermodehelper() interface is an inefficient and ugly hack. Tools would prefer something more lightweight, like a netlink, poll() or fanotify interface. The current logic is completely broken in containers.

* allow making use of the “cpu” cgroup controller by default without breaking RT. Right now creating a cgroup in the “cpu” hierarchy that shall be able to take advantage of RT is impossible for the generic case, since it needs an RT budget configured which comes from a limited resource pool. What we want is the ability to create cgroups in “cpu” whose processes get a non-RT weight applied, but for RT take advantage of the parent’s RT budget. We want the separation of RT and non-RT budget assignment in the “cpu” hierarchy, because right now you lose RT functionality in it unless you assign an RT budget. This issue severely limits the usefulness of the “cpu” hierarchy on general purpose systems right now.

[PATCH] * allow user xattrs to be set on files in the cgroupfs (and maybe procfs?)

Patch from Li Zefan https://lkml.org/lkml/2012/1/16/51

[PATCH] * Add a timerslack cgroup controller, to allow increasing the timer slack of user session cgroups when the machine is idle.

  Patch from: Kirill A. Shutemov

  http://article.gmane.org/gmane.linux.kernel/1201782

  http://lwn.net/Articles/463357/

AF_UNIX:

* An auxiliary meta data message for AF_UNIX called SCM_CGROUPS (or something like that), i.e. a way to attach sender cgroup membership to messages sent via AF_UNIX. This is useful in case services such as syslog shall be shared among various containers (or service cgroups), and the syslog implementation needs to be able to distinguish the sending cgroup in order to separate the logs on disk. Of course SCM_CREDENTIALS can be used to look up the PID of the sender, followed by a check in /proc/$PID/cgroup, but that is necessarily racy, and the race is very real in practice.

* SCM_PROCSTATUS for retrieving sender process information supplying at least: comm, exec, cmdline, audit session, audit loginuid, thread id.

* Implicit SCM_CREDENTIALS should carry EUID too
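The racy lookup mentioned in the SCM_CGROUPS item can be sketched as follows. The receiver enables SO_PASSCRED, takes the sender PID from the SCM_CREDENTIALS control message, and only afterwards reads /proc/$PID/cgroup. All function names are made up for this example, and the demo simply sends a datagram to itself over a socketpair.

```c
#define _GNU_SOURCE           /* for struct ucred and SCM_CREDENTIALS */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Receive one datagram on fd and copy the kernel-verified sender
 * credentials into *out. Returns 0 on success, -1 otherwise. */
int recv_sender_creds(int fd, struct ucred *out)
{
    char data[64];
    union {
        struct cmsghdr hdr;   /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(struct ucred))];
    } control;
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = control.buf, .msg_controllen = sizeof(control.buf),
    };

    if (recvmsg(fd, &msg, 0) < 0)
        return -1;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_CREDENTIALS) {
            memcpy(out, CMSG_DATA(c), sizeof(*out));
            return 0;
        }
    }
    return -1;
}

/* Racy step: by the time this file is read, the sender may have exited,
 * moved to another cgroup, or its PID may have been reused. */
int read_cgroup_of_pid(pid_t pid, char *buf, size_t len)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%ld/cgroup", (long) pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, len - 1, f);
    buf[n] = '\0';
    fclose(f);
    return 0;
}

/* Demo: send ourselves a datagram and return the sender PID the
 * receiving side sees, or -1 on error. */
pid_t demo_sender_pid(void)
{
    int sv[2], on = 1;
    struct ucred cred = { 0 };

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    /* SO_PASSCRED makes the kernel attach SCM_CREDENTIALS implicitly. */
    if (setsockopt(sv[0], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on)) < 0 ||
        send(sv[1], "log line", 8, 0) < 0 ||
        recv_sender_creds(sv[0], &cred) < 0)
        cred.pid = -1;
    close(sv[0]);
    close(sv[1]);
    return cred.pid;
}
```

A syslog-like daemon would call read_cgroup_of_pid() with the PID it got from recv_sender_creds(); nothing ties the two steps together, which is why a cgroup field delivered with the message itself would be preferable.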

New additions from LPC 2012:

* A new epoll_ctl() command to immediately wake up somebody sleeping in epoll_wait(). This can currently be emulated with eventfd(), but it would be nicer to do this directly. New!
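The eventfd() emulation mentioned in the epoll_ctl() item might look roughly like this. The struct waker type and the waker_*() functions are made up for this example; only the eventfd is registered with the epoll instance, to keep the sketch short.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper type: an epoll instance plus the eventfd used to
 * interrupt it. */
struct waker {
    int epfd;
    int efd;
};

/* Create the epoll instance and register the eventfd with it. */
int waker_init(struct waker *w)
{
    struct epoll_event ev = { .events = EPOLLIN };

    w->epfd = epoll_create1(EPOLL_CLOEXEC);
    w->efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    if (w->epfd >= 0 && w->efd >= 0 &&
        epoll_ctl(w->epfd, EPOLL_CTL_ADD, w->efd, &ev) == 0)
        return 0;
    if (w->epfd >= 0)
        close(w->epfd);
    if (w->efd >= 0)
        close(w->efd);
    return -1;
}

/* Callable from any thread: makes a concurrent or later epoll_wait()
 * return immediately. */
int waker_wake(struct waker *w)
{
    uint64_t one = 1;
    return write(w->efd, &one, sizeof(one)) == (ssize_t) sizeof(one) ? 0 : -1;
}

/* Returns 1 if woken up, 0 on timeout, -1 on error. */
int waker_wait(struct waker *w, int timeout_ms)
{
    struct epoll_event ev;
    uint64_t count;
    int n = epoll_wait(w->epfd, &ev, 1, timeout_ms);

    if (n <= 0)
        return n;
    /* Only the eventfd is registered, so any event is a wakeup.
     * Reading resets the eventfd counter to zero. */
    if (read(w->efd, &count, sizeof(count)) != (ssize_t) sizeof(count))
        return -1;
    return 1;
}
```

The cost of the emulation is an extra file descriptor plus a write() and a read() per wakeup, which a dedicated epoll_ctl() command could avoid.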

* It would be good to have timer “ranges” in timerfd time events, i.e. allowing specification of two points in time between which the wakeup should happen, which would allow the kernel to more neatly coalesce wakeups. New!

* openat() with a NULL file name to create a file without a name but within a directory. This is useful to create temporary files whose fd can be passed around, without first creating a random name for it to avoid collisions and then immediately deleting it. This should make code a lot simpler and avoid the short window where a program created a temporary file in order to immediately delete it but is killed first leaving the temporary file around. New!
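The create-then-delete pattern that the openat() item wants to get rid of can be sketched like this. The function name anonymous_tempfile() and the "tmp-XXXXXX" template are made up for this example; the comment marks the window in which a killed process leaves the file behind.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: create a temporary file in dir and remove its name
 * again, so that only the fd refers to it.
 * Returns the fd, or -1 on error. */
int anonymous_tempfile(const char *dir)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/tmp-XXXXXX", dir);
    if (n < 0 || (size_t) n >= sizeof(path))
        return -1;

    /* A random name is needed only to avoid collisions. */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* Window: if the process is killed here, the file stays around. */

    if (unlink(path) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The returned fd can be written to and passed around like any other; the file's storage is released once the last fd referring to it is closed.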

All time favourites:

These items have been requested many times already, and we want to make sure they aren’t forgotten. We know they are hard to implement, and we don’t know how to get there, but nonetheless, here they are:

* Oldie But Goldie: some kind of union mount. A minimal version that supports only read-only filesystems would already be a big step forward.

* revoke()

* Notifications when non-child processes die, in an efficient way focussing on explicit PIDs (i.e. not taskstats) in some form (idea: poll() for POLLERR on /proc/$PID)

DONE

[CLOSED] * module-init-tools: provide a proper libmodprobe.so from module-init-tools:

Early boot tools, installers, driver install disks want to access information about available modules, and match devices to available modules to hook up driver overwrites, driver update disks, installer tweaks, and to optimize bootup module handling.

Lucas de Marchi and his colleagues from ProFUSION: http://blog.gustavobarbieri.com.br/2011/12/21/kmod-announcement-and-how-to-help-testing-it/

[CLOSED] * expose CAP_LAST_CAP somehow in the running kernel at runtime:

Userspace needs to know the highest valid capability of the running kernel, which right now cannot reliably be retrieved from header files only. The fact that this value cannot be detected properly right now creates various problems for libraries compiled on newer header files which are run on older kernels. They assume capabilities are available which actually aren’t. Specifically, libcap-ng claims that all running processes retain the higher capabilities in this case due to the “inverted” semantics of CapBnd in /proc/$PID/status. (Fixed by Dan Ballard https://lkml.org/lkml/2011/10/12/452)
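For kernels where no such value is exported, one commonly used runtime probe is to ask prctl(PR_CAPBSET_READ) about increasing capability numbers until the kernel rejects one. The sketch below shows this technique; it is not the interface added by the patch referenced above, and the function name probe_last_cap() is made up for this example.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Hypothetical helper: return the highest capability number the running
 * kernel knows about, or -1 if probing is not possible. */
int probe_last_cap(void)
{
    int cap = 0;

    /* PR_CAPBSET_READ returns 0 or 1 for a valid capability and fails
     * with EINVAL for a number the kernel does not know. */
    while (prctl(PR_CAPBSET_READ, (unsigned long) cap, 0, 0, 0) >= 0)
        cap++;

    if (errno != EINVAL)
        return -1;
    return cap - 1;
}
```

A library built against newer headers could compare the result with its compile-time CAP_LAST_CAP and ignore the capability bits the running kernel does not implement.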

[CLOSED] * CPU modaliases in /sys/devices/system/cpu/cpuX/modalias:

Useful to allow module auto-loading of e.g. cpufreq drivers and KVM modules. Andi Kleen has a patch to create the alias file itself. CPU ‘struct sysdev’ needs to be converted to ‘struct device’ and a ‘struct bus_type cpu’ needs to be introduced to allow proper CPU coldplug event replay at bootup. This is one of the last remaining places where automatic hardware-triggered module auto-loading is not available, and we’d like to see that fixed so that the numerous ugly userspace work-arounds that achieve the same can go away.

Fixed by Andi Kleen. Rebased patches on the way to the next kernel.

[CLOSED] * hostname change notification:

Patch by Lucas de Marchi:

http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=commitdiff;h=70b932563a9514b248cc71a29bd0907bf95b4a5e

[CLOSED] * PR_SET_CHILD_SUBREAPER

Merged into Andrew’s -mm tree.

Patch by Kay Sievers and Lennart Poettering:

  http://permalink.gmane.org/gmane.linux.man/2071
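A minimal sketch of how a service manager might use PR_SET_CHILD_SUBREAPER, assuming a kernel that provides it: after a double fork, the orphaned grandchild is reparented to the calling process instead of PID 1, so its exit status can still be collected. The function name reap_orphaned_grandchild() and the fallback definition of the constant are additions for this example.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PR_SET_CHILD_SUBREAPER
#define PR_SET_CHILD_SUBREAPER 36   /* fallback for older userspace headers */
#endif

/* Hypothetical demo: become a subreaper, double-fork, and return the exit
 * status of the orphaned grandchild. Returns -1 on error. */
int reap_orphaned_grandchild(int grandchild_status)
{
    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Intermediate process: fork the grandchild and exit at once. */
        pid_t grandchild = fork();
        if (grandchild == 0)
            _exit(grandchild_status);
        _exit(grandchild < 0 ? 1 : 0);
    }

    int status;
    /* Reap the intermediate process first. */
    if (waitpid(child, &status, 0) != child)
        return -1;

    /* The grandchild has been reparented to us, so it shows up as one of
     * our own children. */
    pid_t pid = waitpid(-1, &status, 0);
    if (pid > 0 && WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1;
}
```

Without the prctl() call the grandchild would be reparented to init, and the second waitpid() would fail with ECHILD.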

[CLOSED] * support fallocate() properly:

    fallocate(5, 0, 0, 7663616) = -1 EOPNOTSUPP

[CLOSED] * fix the input subsystem to allow more than 32 devices per class; multi-seat setups easily run out of the devices they need.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7f8d4cad1e4e11a45d02bd6e024cc2812963c38a