Opened 3 years ago

Closed 3 years ago

#4394 closed task (fixed)


Reported by: Douglas R. Reno
Owned by: Douglas R. Reno
Priority: highest
Milestone: 8.4
Component: Book
Version: SVN
Severity: normal
Keywords:


New version

Product: systemd (tmpfiles)
Versions-affected: 239 and earlier
Author: Michael Orlitzky
Fixed-in: v240
  Franck Bui of SUSE put forth a massive amount of effort to fix this,
  and Lennart Poettering consistently provided timely reviews over the
  course of a few months.

== Summary ==

Before version 240, the systemd-tmpfiles program will follow symlinks present in a non-terminal path component while adjusting permissions and ownership. Often -- and particularly with "Z" type entries -- an attacker can introduce such a symlink and take control of arbitrary files on the system to gain root. The "fs.protected_symlinks" sysctl does not prevent this attack. Version 239 contained a partial fix, but only for the easy-to-exploit recursive "Z" type entries.

== Details ==

The systemd-tmpfiles program tries to avoid following symlinks in the last component of a path. To that end, the following trick is used in src/tmpfiles/tmpfiles.c:

  fd = open(path, O_NOFOLLOW|O_CLOEXEC|O_PATH);
  xsprintf(fn, "/proc/self/fd/%i", fd);
  if (chown(fn, ...

The call to chown will follow the "/proc/self/fd/%i" symlink, but only once; it will then operate on the real file described by fd.

However, there is another way to exploit the code above. The call to open() will follow symlinks if they appear in a non-terminal component of path, even with the O_NOFOLLOW flag set. Citing the open(2) man page,


  If pathname is a symbolic link, then the open fails, with the error
  ELOOP. Symbolic links in earlier components of the pathname will still
  be followed.

So, for example, if the path variable contains "/run/foo/a/b" and if "a" is a symlink, then open() will follow it. If systemd-tmpfiles will be changing ownership of "/run/foo/a/b" after that of "/run/foo", then the owner of "/run/foo" can exploit that fact to gain root by replacing "/run/foo/a" with a symlink. With a Z-type tmpfiles.d entry, the attacker can create this situation himself.
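The open(2) behaviour quoted above can be checked directly. The following is a minimal sketch in Python, whose os.open() wraps the same system call on Linux. The directory layout and the helper name demo_nofollow are made up for illustration and are not taken from systemd.

```python
import errno
import os
import tempfile


def demo_nofollow():
    """Show which symlinks O_NOFOLLOW blocks; returns (terminal_errno, content)."""
    with tempfile.TemporaryDirectory() as top:
        # Hypothetical layout: real/b is a regular file, "a" is a symlink to
        # the directory "real", and "link" is a symlink to the file itself.
        os.mkdir(os.path.join(top, "real"))
        with open(os.path.join(top, "real", "b"), "w") as f:
            f.write("secret")
        os.symlink("real", os.path.join(top, "a"))
        os.symlink(os.path.join("real", "b"), os.path.join(top, "link"))

        # Terminal component is a symlink: O_NOFOLLOW makes open() fail.
        terminal_errno = 0
        try:
            os.close(os.open(os.path.join(top, "link"),
                             os.O_RDONLY | os.O_NOFOLLOW))
        except OSError as e:
            terminal_errno = e.errno

        # Non-terminal component "a" is a symlink: it is followed anyway.
        fd = os.open(os.path.join(top, "a", "b"), os.O_RDONLY | os.O_NOFOLLOW)
        try:
            content = os.read(fd, 16).decode()
        finally:
            os.close(fd)
        return terminal_errno, content
```

On Linux the first open() is expected to fail with ELOOP, while the second succeeds and reads the file through the "a" symlink, which is exactly the gap the attack relies on.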

The "fs.protected_symlinks" sysctl does not protect against these sorts of attacks. Due to the widespread and legitimate use of symlinks in situations like these, the symlink protection is much weaker than the corresponding hardlink protection.

== Exploitation ==

Consider the following entry in /etc/tmpfiles.d/exploit-recursive.conf:

  d /var/lib/systemd-exploit-recursive 0755 mjo mjo
  Z /var/lib/systemd-exploit-recursive 0755 mjo mjo

After systemd-tmpfiles has run once, my "mjo" user will own that directory:

  mjo $ sudo ./build/systemd-tmpfiles --create
  mjo $ ls -ld /var/lib/systemd-exploit-recursive
  drwxr-xr-x  2 mjo mjo 4.0K 2018-02-13 09:38 /var/lib/systemd...

At this point, I am able to create a directory "foo" and a file "foo/passwd" under /var/lib/systemd-exploit-recursive. The next time that systemd-tmpfiles is run (perhaps after a reboot), the tmpfiles.c function item_do_children() will be called on "foo". Within that function, there is a macro FOREACH_DIRENT_ALL that loops through the entries of "foo".

The FOREACH_DIRENT_ALL macro defers to readdir(3), and thus requires the real directory stream pointer for "foo", because we want it to see "foo/passwd". However, while the macro is iterating, the "q = action(i, p)" will be performed on "p" which consists of the path "foo" and some filename "d", but without reference to its file descriptor. So, between the time that item_do_children() is called on "foo" and the time that "q = action(i, p)" is run on "foo/passwd", I have the opportunity to replace "foo" with a symlink to "/etc", causing "/etc/passwd" to be affected by the change of ownership and permissions.

But there's more: the FOREACH_DIRENT_ALL macro processes the contents of "foo" in whatever order readdir() returns them. Since mjo owns "foo", I can fill it with junk to buy myself as much time as I like before "foo/passwd" is reached:

  mjo $ cd /var/lib/systemd-exploit-recursive
  mjo $ mkdir foo
  mjo $ cd foo
  mjo $ echo $(seq 1 500000) | xargs touch
  mjo $ touch passwd

Now, restarting systemd-tmpfiles will change ownership of all of those files...

  mjo $ sudo ./build/systemd-tmpfiles --create

and it takes some time for it to process the 500,000 dummy files before reaching "foo/passwd". At my leisure, I can replace foo with a symlink:

  mjo $ cd /var/lib/systemd-exploit-recursive
  mjo $ mv foo bar && ln -s /etc ./foo

After some time, systemd-tmpfiles will eventually reach the path "foo/passwd", which now points to "/etc/passwd", and grant me root access.

A similar, but more difficult attack works against non-recursive entry types. Consider the following tmpfiles.d entry:

  d /var/lib/systemd-exploit 0755 mjo mjo
  d /var/lib/systemd-exploit/foo 0755 mjo mjo
  f /var/lib/systemd-exploit/foo/bar 0755 mjo mjo

After "/var/lib/systemd-exploit/foo" is created but before the permissions are adjusted on "/var/lib/systemd-exploit/foo/bar", there is a short window of opportunity for me to replace "foo" with a symlink to (for example) "/etc/env.d". If I'm fast enough, tmpfiles will open "foo/bar", following the "foo" symlink, and give me ownership of something sensitive in the "/etc/env.d" directory. However, this attack is more difficult because I can't arbitrarily widen my own window of opportunity with junk files, as was possible with the "Z" type entries.

== Resolution ==

Commit 936f6bdb, which is present in systemd v239, changes the recursive loop in two important ways. First, it passes file descriptors -- rather than parent paths -- to each successive iteration. That allows the next iteration to use the openat() system call, eliminating the non-terminal path components from the equation. Second, it ensures that each "open" call has the O_NOFOLLOW and O_PATH flags to prevent symlinks from being followed at the current depth. Note: only the recursive loop was made safe; the call to open() the top-level path will still follow non-terminal symlinks and is vulnerable to the second attack above.
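To illustrate why passing file descriptors rather than parent paths closes the race, here is a small deterministic sketch in Python. It is not systemd code: the directories "etc" and "foo" are stand-ins created in a temporary directory, and the function name is hypothetical. The dir_fd= argument of os.open() maps to openat().

```python
import os
import tempfile


def demo_dirfd_survives_swap():
    """Return (content_via_path, content_via_dirfd) for "passwd" after a swap."""
    with tempfile.TemporaryDirectory() as top:
        # Hypothetical stand-ins: "etc" plays the role of /etc, "foo" is the
        # attacker-controlled directory.
        for d, text in (("etc", "system"), ("foo", "attacker")):
            os.mkdir(os.path.join(top, d))
            with open(os.path.join(top, d, "passwd"), "w") as f:
                f.write(text)

        # The privileged program opens "foo" once and keeps the descriptor.
        dir_fd = os.open(os.path.join(top, "foo"),
                         os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW)
        try:
            # The attacker swaps "foo" for a symlink to "etc".
            os.rename(os.path.join(top, "foo"), os.path.join(top, "bar"))
            os.symlink("etc", os.path.join(top, "foo"))

            # Path-based access re-resolves "foo" and lands in "etc".
            with open(os.path.join(top, "foo", "passwd")) as f:
                via_path = f.read()

            # openat()-style access stays inside the original directory.
            fd = os.open("passwd", os.O_RDONLY | os.O_NOFOLLOW, dir_fd=dir_fd)
            try:
                via_dirfd = os.read(fd, 16).decode()
            finally:
                os.close(fd)
        finally:
            os.close(dir_fd)
        return via_path, via_dirfd
```

The path-based read ends up in the stand-in for /etc, while the descriptor-based read still reaches the file in the directory that was originally opened.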

The commits in pull request 8822 aim to make everything safe from this type of symlink attack. As far as tmpfiles is concerned, the main idea is to use the chase_symlinks() function in place of the open() system call. Since chase_symlinks() calls openat() recursively from the root up, it will never follow a non-terminal symlink. Commit 1f56e4ce then introduces the CHASE_NOFOLLOW flag for that function, preventing it from following terminal symlinks. In subsequent commits (e.g. addc3e30), the consumers of chase_symlinks() were updated to pass CHASE_NOFOLLOW to chase_symlinks(), preventing them from following any symlinks.
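The general idea of resolving a path one component at a time with openat() can be sketched as follows. This is an illustration only and not the chase_symlinks() implementation: the function names open_no_symlinks and demo_walk are hypothetical, and unlike chase_symlinks() this sketch simply refuses every symlink instead of resolving it safely.

```python
import errno
import os
import stat
import tempfile


def open_no_symlinks(root, relpath):
    """Open root/relpath one component at a time, refusing every symlink.

    Returns an O_PATH file descriptor; raises OSError if any component of
    relpath, terminal or not, is a symlink.
    """
    flags = os.O_PATH | os.O_NOFOLLOW | os.O_CLOEXEC
    fd = os.open(root, flags | os.O_DIRECTORY)
    parts = [p for p in relpath.split("/") if p]
    for index, name in enumerate(parts):
        is_last = index == len(parts) - 1
        try:
            if name in (".", ".."):
                raise OSError(errno.EINVAL, "unsupported component", name)
            # Intermediate components must be real directories; O_DIRECTORY
            # together with O_NOFOLLOW rejects a symlink at this position.
            child = os.open(name, flags if is_last else flags | os.O_DIRECTORY,
                            dir_fd=fd)
        finally:
            os.close(fd)
        fd = child
        # O_PATH | O_NOFOLLOW can return a descriptor for a symlink itself,
        # so check the type explicitly.
        if stat.S_ISLNK(os.fstat(fd).st_mode):
            os.close(fd)
            raise OSError(errno.ELOOP, "refusing to follow symlink", name)
    return fd


def demo_walk():
    """Return which of three hypothetical paths open_no_symlinks() accepts."""
    results = {}
    with tempfile.TemporaryDirectory() as top:
        os.mkdir(os.path.join(top, "d"))
        with open(os.path.join(top, "d", "f"), "w"):
            pass
        os.symlink("d", os.path.join(top, "s"))       # non-terminal symlink
        os.symlink("f", os.path.join(top, "d", "l"))  # terminal symlink
        for rel in ("d/f", "s/f", "d/l"):
            try:
                os.close(open_no_symlinks(top, rel))
                results[rel] = True
            except OSError:
                results[rel] = False
    return results
```

In demo_walk(), only "d/f" is accepted; "s/f" is rejected at the non-terminal symlink and "d/l" at the terminal one.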

The complete fix is available in systemd v240.

Change History (4)

comment:1 by Douglas R. Reno, 3 years ago

Owner: changed from lfs-book to Douglas R. Reno
Status: changed from new to assigned

comment:2 by Douglas R. Reno, 3 years ago

Priority: changed from normal to highest

Arbitrary Code Execution Fix (CVE-2018-15688):

An out-of-bounds write has been found in the DHCPv6 option handling code of systemd-networkd up to and including v239.

It was discovered that systemd-networkd does not correctly keep track of a buffer size in the dhcp6_option_append_ia() function when constructing DHCPv6 packets. This flaw may lead to an integer underflow that can be used to produce a heap-based buffer overflow. A malicious host on the same network segment as the victim may advertise itself as a DHCPv6 server and exploit this flaw to cause a denial of service or potentially gain code execution on the victim's machine. The overflow can be triggered relatively easily by advertising a DHCPv6 server with a server-id >= 493 characters long.

Privilege Escalation Issue (CVE-2018-15687):

A security issue has been found in systemd up to and including 239, where a race condition in the chown_one() function can be used to escalate privileges via a crafted symlink.

Privilege Escalation Issue (CVE-2018-15686):

A security issue has been found in systemd up to and including 239, where the use of fgets() allows an attacker to escalate privilege via a crafted service with NotifyAccess.

That's in addition to the vulnerability in the description above.

comment:3 by Douglas R. Reno, 3 years ago

Actual change notes:


        * NoNewPrivileges=yes has been set for all long-running services
          implemented by systemd. Previously, this was problematic due to
          SELinux (as this would also prohibit the transition from PID1's label
          to the service's label). This restriction has since been lifted, but
          an SELinux policy update is required.
          (See e.g.

        * DynamicUser=yes is dropped from systemd-networkd.service,
          systemd-resolved.service and systemd-timesyncd.service, which was
          enabled in v239 for systemd-networkd.service and systemd-resolved.service,
          and since v236 for systemd-timesyncd.service. The users and groups
          systemd-network, systemd-resolve and systemd-timesync are created
          by systemd-sysusers again. Distributors or system administrators
          may need to create these users and groups if they do not exist (or need
          to re-enable DynamicUser= for those units) while upgrading systemd.

        * When unit files are loaded from disk, previously systemd would
          sometimes (depending on the unit loading order) load units from the
          target path of symlinks in .wants/ or .requires/ directories of other
          units. This meant that a unit could be loaded from different paths
          depending on whether the unit was requested explicitly or as a
          dependency of another unit, not honouring the priority of directories
          in search path. It also meant that it was possible to successfully
          load and start units which are not found in the unit search path, as
          long as they were requested as a dependency and linked to from
          .wants/ or .requires/. The target paths of those symlinks are not
          used for loading units anymore and the unit file must be found in
          the search path.

        * A new service type has been added: Type=exec. It's very similar to
          Type=simple but ensures the service manager will wait for both fork()
          and execve() of the main service binary to complete before proceeding
          with follow-up units. This is primarily useful so that the manager
          propagates any errors in the preparation phase of service execution
          back to the job that requested the unit to be started. For example,
          consider a service that has ExecStart= set to a file system binary
          that doesn't exist. With Type=simple starting the unit would be
          considered instantly successful, as only fork() has to complete
          successfully and the manager does not wait for execve(), and hence
          its failure is seen "too late". With the new Type=exec service type
          starting the unit will fail, as the manager will wait for the
          execve() and notice its failure, which is then propagated back to the
          start job.

          NOTE: with the next release 241 of systemd we intend to change the
          systemd-run tool to default to Type=exec for transient services
          started by it. This should be mostly safe, but in specific corner
          cases might result in problems, as the systemd-run tool will then
          block on NSS calls (such as user name look-ups due to User=) done
          between the fork() and execve(), which under specific circumstances
          might cause problems. It is recommended to specify "-p Type=simple"
          explicitly in the few cases where this applies. For regular,
          non-transient services (i.e. those defined with unit files on disk)
          we will continue to default to Type=simple.

        * The Linux kernel's current default RLIMIT_NOFILE resource limit for
          userspace processes is set to 1024 (soft) and 4096
          (hard). Previously, systemd passed this on unmodified to all
          processes it forked off. With this systemd release the hard limit
          systemd passes on is increased to 512K, overriding the kernel's
          defaults and substantially increasing the number of simultaneous file
          descriptors unprivileged userspace processes can allocate. Note that
          the soft limit remains at 1024 for compatibility reasons: the
          traditional UNIX select() call cannot deal with file descriptors >=
          1024 and increasing the soft limit globally might thus result in
          programs unexpectedly allocating a high file descriptor and thus
          failing abnormally when attempting to use it with select() (of
          course, programs shouldn't use select() anymore, and prefer
          poll()/epoll, but the call unfortunately remains undeservedly popular
          at this time). This change reflects the fact that file descriptor
          handling in the Linux kernel has been optimized in more recent
          kernels and allocating large numbers of them should be much cheaper
          both in memory and in performance than it used to be. Programs that
          want to benefit from the increased limit have to explicitly "opt in" to
          high file descriptors by raising their soft limit. Of
          course, when they do that they must acknowledge that they cannot use
          select() anymore (and neither can any shared library they use — or
          any shared library used by any shared library they use and so on).
          Which default hard limit is most appropriate is of course hard to
          decide. However, given reports that ~300K file descriptors are used
          in real-life applications we believe 512K is sufficiently high as new
          default for now. Note that there are also reports that using very
          high hard limits (e.g. 1G) is problematic: some software allocates
          large arrays with one element for each potential file descriptor
          (Java, …) — a high hard limit thus triggers excessively large memory
          allocations in these applications. Hopefully, the new default of 512K
          is a good middle ground: higher than what real-life applications
          currently need, and low enough to avoid triggering excessively large
          allocations in problematic software. (And yes, somebody should fix
          Java.)

        * The fs.nr_open and fs.file-max sysctls are now automatically bumped
          to the highest possible values, as separate accounting of file
          descriptors is no longer necessary, as memcg tracks them correctly as
          part of the memory accounting anyway. Thus, from the four limits on
          file descriptors currently enforced (fs.file-max, fs.nr_open,
          RLIMIT_NOFILE hard, RLIMIT_NOFILE soft) we turn off the first two,
          and keep only the latter two. A set of build-time options
          (-Dbump-proc-sys-fs-file-max=no and -Dbump-proc-sys-fs-nr-open=no)
          has been added to revert this change in behaviour, which might be
          an option for systems that turn off memcg in the kernel.

        * When no /etc/locale.conf file exists (and hence no locale settings
          are in place), systemd will now use the "C.UTF-8" locale by default,
          and set LANG= to it. This locale is supported by various
          distributions including Fedora, with clear indications that upstream
          glibc is going to make it available too. This locale enables UTF-8
          mode by default, which appears appropriate for 2018.

        * The "net.ipv4.conf.all.rp_filter" sysctl will now be set to 2 by
          default. This effectively switches the RFC3704 Reverse Path filtering
          from Strict mode to Loose mode. This is more appropriate for hosts
          that have multiple links with routes to the same networks (e.g.
          a client with a Wi-Fi and Ethernet both connected to the internet).

          Consult the kernel documentation for details on this sysctl:

        * CPUAccounting=yes no longer enables the CPU controller when using
          kernel 4.15+ and the unified cgroup hierarchy, as required accounting
          statistics are now provided independently from the CPU controller.

        * Support for disabling a particular cgroup controller within a sub-tree
          has been added through the DisableControllers= directive.

        * cgroup_no_v1=all on the kernel command line now also implies
          using the unified cgroup hierarchy, unless one explicitly passes
          systemd.unified_cgroup_hierarchy=0 on the kernel command line.

        * The new "MemoryMin=" unit file property may now be used to set the
          memory usage protection limit of processes invoked by the unit. This
          controls the cgroupsv2 memory.min attribute. Similarly, the new
          "IODeviceLatencyTargetSec=" property has been added, wrapping the new
          cgroupsv2 io.latency cgroup property for configuring per-service I/O
          latency targets.

        * systemd now supports the cgroupsv2 devices BPF logic, as counterpart
          to the cgroupsv1 "devices" cgroup controller.

        * systemd-escape now is able to combine --unescape with --template. It
          also learnt a new option --instance for extracting and unescaping the
          instance part of a unit name.

        * sd-bus now provides the sd_bus_message_readv() which is similar to
          sd_bus_message_read() but takes a va_list object. The pair
          sd_bus_set_method_call_timeout() and sd_bus_get_method_call_timeout()
          has been added for configuring the default method call timeout to
          use. sd_bus_error_move() may be used to efficiently move the contents
          from one sd_bus_error structure to another, invalidating the
          source. sd_bus_set_close_on_exit() and sd_bus_get_close_on_exit() may
          be used to control whether a bus connection object is automatically
          flushed when an sd-event loop is exited.

        * When processing classic BSD syslog log messages, journald will now
          save the original time-stamp string supplied in the new
          SYSLOG_TIMESTAMP= journal field. This permits consumers to
          reconstruct the original BSD syslog message more correctly.

        * StandardOutput=/StandardError= in service files gained support for
          new "append:…" parameters, for connecting STDOUT/STDERR of a service
          to a file, and appending to it.

        * The signal to use as last step of killing of unit processes is now
          configurable. Previously it was hard-coded to SIGKILL, which may now
          be overridden with the new FinalKillSignal= setting. Note that this is the
          signal used when regular termination (i.e. SIGTERM) does not suffice.
          Similarly, the signal used when aborting a program in case of a
          watchdog timeout may now be configured too (WatchdogSignal=).

        * The XDG_SESSION_DESKTOP environment variable may now be configured in
          the pam_systemd argument line, using the new desktop= switch. This is
          useful to initialize it properly from a display manager without
          having to touch C code.

        * Most configuration options that previously accepted percentage values
          now also accept permille values with the '‰' suffix (instead of '%').

        * systemd-resolved may now optionally use OpenSSL instead of GnuTLS for
          DNS-over-TLS.

        * systemd-resolved's configuration file resolved.conf gained a new
          option ReadEtcHosts= which may be used to turn off processing and
          honoring /etc/hosts entries.

        * The "--wait" switch may now be passed to "systemctl
          is-system-running", in which case the tool will synchronously wait
          until the system finished start-up.

        * hostnamed gained a new bus call to determine the DMI product UUID.

        * On x86-64 systemd will now prefer using the RDRAND processor
          instruction over /dev/urandom whenever it requires randomness that
          neither has to be crypto-grade nor should be reproducible. This
          should substantially reduce the amount of entropy systemd requests
          from the kernel during initialization on such systems, though not
          reduce it to zero. (Why not zero? systemd still needs to allocate
          UUIDs and such uniquely, which require high-quality randomness.)

        * networkd gained support for Foo-Over-UDP, ERSPAN and ISATAP
          tunnels. It also gained a new option ForceDHCPv6PDOtherInformation=
          for forcing the "Other Information" bit in IPv6 RA messages. The
          bonding logic gained four new options AdActorSystemPriority=,
          AdUserPortKey=, AdActorSystem= for configuring various 802.3ad
          aspects, and DynamicTransmitLoadBalancing= for enabling dynamic
          shuffling of flows. The tunnel logic gained a new
          IPv6RapidDeploymentPrefix= option for configuring IPv6 Rapid
          Deployment. The policy rule logic gained four new options IPProtocol=,
          SourcePort=, DestinationPort= and InvertRule=. The bridge logic gained
          support for the MulticastToUnicast= option. networkd also gained
          support for configuring static IPv4 ARP or IPv6 neighbor entries.

        * .preset files (as read by 'systemctl preset') may now be used to
          instantiate services.

        * /etc/crypttab now understands the sector-size= option to configure
          the sector size for an encrypted partition.

        * Key material for encrypted disks may now be placed on a formatted
          medium, and referenced from /etc/crypttab by the UUID of the file
          system, followed by "=" suffixed by the path to the key file.

        * The "collect" udev component has been removed without replacement, as
          it is neither used nor maintained.

        * When the RuntimeDirectory=, StateDirectory=, CacheDirectory=,
          LogsDirectory=, ConfigurationDirectory= settings are used in a
          service the executed processes will now receive a set of environment
          variables containing the full paths of these directories.
          Specifically, RUNTIME_DIRECTORY, STATE_DIRECTORY, CACHE_DIRECTORY,
          LOGS_DIRECTORY, CONFIGURATION_DIRECTORY are now set if these options
          are used. Note that these options may be used multiple times per
          service in which case the resulting paths will be concatenated and
          separated by colons.

        * Predictable interface naming has been extended to cover InfiniBand
          NICs. They will be exposed with an "ib" prefix.

        * tmpfiles.d/ line types may now be suffixed with a '-' character, in
          which case the respective line failing is ignored.

        * .link files may now be used to configure the equivalent to the
          "ethtool advertise" commands.

        * The sd-device.h and sd-hwdb.h APIs are now exported, as an
          alternative to libudev.h. Previously, the latter was just an internal
          wrapper around the former, but now these two APIs are exposed
          directly.

        * sd-id128.h gained a new function sd_id128_get_boot_app_specific()
          which calculates an app-specific boot ID similar to how
          sd_id128_get_machine_app_specific() generates an app-specific machine
          ID.

        * A new tool systemd-id128 has been added that can be used to determine
          and generate various 128bit IDs.

        * /etc/os-release gained two new standardized fields DOCUMENTATION_URL=
          and LOGO=.

        * systemd-hibernate-resume-generator will now honor the "noresume"
          kernel command line option, in which case it will bypass resuming
          from any hibernated image.

        * The systemd-sleep.conf configuration file gained new options
          AllowSuspend=, AllowHibernation=, AllowSuspendThenHibernate=,
          AllowHybridSleep= for prohibiting specific sleep modes even if the
          kernel exports them.

        * portablectl is now officially supported and has thus moved to

        * bootctl learnt the two new commands "set-default" and "set-oneshot"
          for setting the default boot loader item to boot to (either
          persistently or only for the next boot). This is currently only
          compatible with sd-boot, but may be implemented on other boot loaders
          too, that follow the boot loader interface. The updated interface is
          now documented here:

        * A new kernel command line option systemd.early_core_pattern= is now
          understood which may be used to influence the core_pattern PID 1
          installs during early boot.

        * busctl learnt two new options -j and --json= for outputting method
          call replies, properties and monitoring output in JSON.

        * journalctl's JSON output now supports simple ANSI coloring as well as
          a new "json-seq" mode for generating RFC7464 output.

        * Unit files now support the %g/%G specifiers that resolve to the UNIX
          group/GID the service manager runs as, similar to the existing
          %u/%U specifiers that resolve to the UNIX user/UID.

        * systemd-logind learnt a new global configuration option
          UserStopDelaySec= that may be set in logind.conf. It specifies how
          long the systemd --user instance shall remain started after a user
          logs out. This is useful to speed up repetitive re-connections of the
          same user, as it means the user's service manager doesn't have to be
          stopped/restarted on each iteration, but can be reused between
          subsequent options. This setting defaults to 10s. systemd-logind also
          exports two new properties on its Manager D-Bus objects indicating
          whether the system's lid is currently closed, and whether the system
          is on AC power.

        * systemd gained support for a generic boot counting logic, which
          generically permits automatic reverting to older boot loader entries
          if newer updated ones don't work. The boot loader side is implemented
          in sd-boot, but is kept open for other boot loaders too. For details

        * The SuccessAction=/FailureAction= unit file settings now learnt two
          new parameters: "exit" and "exit-force", which result in immediate
          exiting of the service manager, and are only useful in systemd --user
          and container environments.

        * Unit files gained support for a pair of options
          FailureActionExitStatus=/SuccessActionExitStatus= for configuring the
          exit status to use as service manager exit status when
          SuccessAction=/FailureAction= is set to exit or exit-force.

        * A pair of LogRateLimitIntervalSec=/LogRateLimitBurst= per-service
          options may now be used to configure the log rate limiting applied by
          journald per-service.

        * systemd-analyze gained a new verb "timespan" for parsing and
          normalizing time span values (i.e. strings like "5min 7s 8us").

        * systemd-analyze also gained a new verb "security" for analyzing the
          security and sand-boxing settings of services in order to determine an
          "exposure level" for them, indicating whether a service would benefit
          from more sand-boxing options turned on for them.

        * "systemd-analyze syscall-filter" will now also show system calls
          supported by the local kernel but not included in any of the defined
          groups.

        * .nspawn files now understand the Ephemeral= setting, matching the
          --ephemeral command line switch.

        * sd-event gained the new APIs sd_event_source_get_floating() and
          sd_event_source_set_floating() for controlling whether a specific
          event source is "floating", i.e. destroyed along with the event loop
          object itself.

        * Unit objects on D-Bus gained a new "Refs" property that lists all
          clients that currently have a reference on the unit (to ensure it is
          not unloaded).

        * The JoinControllers= option in system.conf is no longer supported, as
          it didn't work correctly, is hard to support properly, is legacy (as
          the concept only exists on cgroupsv1) and apparently wasn't used.

        * Journal messages that are generated whenever a unit enters the failed
          state are now tagged with a unique MESSAGE_ID. Similarly, messages
          generated whenever a service process exits are now made recognizable,
          too. A tagged message is also emitted whenever a unit enters the
          "dead" state on success.

        * systemd-run gained a new switch --working-directory= for configuring
          the working directory of the service to start. A shortcut -d is
          equivalent, setting the working directory of the service to the
          current working directory of the invoking program. The new --shell
          (or just -S) option has been added for invoking the $SHELL of the
          caller as a service, and implies --pty --same-dir --wait --collect
          --service-type=exec. Or in other words, "systemd-run -S" is now the
          quickest way to get an interactive shell in a fully clean and
          well-defined system service context.

        * machinectl gained a new verb "import-fs" for importing an OS tree
          from a directory. Moreover, when a directory or tarball is imported
          and a single top-level directory is found with the OS itself below
          it, the OS tree is automatically mangled and moved one level up.

        * systemd-importd will no longer set up an implicit btrfs loop-back
          file system on /var/lib/machines. If one is already set up, it will
          continue to be used.

        * A new generator "systemd-run-generator" has been added. It will
          synthesize a unit from one or more program command lines included in
          the kernel command line. This is very useful in container managers
          for example:

          # systemd-nspawn -i someimage.raw -b systemd.run='"some command line"'

          This will run "systemd-nspawn" on an image, invoke the specified
          command line and immediately shut down the container again, returning
          the command line's exit code.

        * The block device locking logic is now documented:

        * loginctl and machinectl now optionally output the various tables in
          JSON using the --output= switch. It is our intention to add similar
          support to systemctl and all other commands.

        * udevadm's query and trigger verbs now optionally take a .device unit
          name as argument.

        * systemd-udevd's network naming logic now understands a new
          net.naming-scheme= kernel command line switch, which may be used to
          pick a specific version of the naming scheme. This helps stabilize
          interface names even as systemd/udev are updated and the naming logic
          is improved.

        * sd-id128.h learnt two new auxiliary helpers: sd_id128_is_allf() and
          SD_ID128_ALLF to test if a 128bit ID is set to all 0xFF bytes, and to
          initialize one to all 0xFF.

        * After loading the SELinux policy systemd will now recursively relabel
          all files and directories listed in
          /run/systemd/relabel-extra.d/*.relabel (which should be simple
          newline separated lists of paths) in addition to the ones it already
          implicitly relabels in /run, /dev and /sys. After the relabelling is
          completed the *.relabel files (and /run/systemd/relabel-extra.d/) are
          removed. This is useful to permit initrds (i.e. code running before
          the SELinux policy is in effect) to generate files in the host
          filesystem safely and ensure that the correct label is applied during
          the transition to the host OS.
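
As an illustration of the file format, early-boot code could register paths for relabelling roughly as follows. This is a sketch under assumptions: the function name is hypothetical, requiring absolute paths is a choice of this sketch rather than something stated above, and the directory is a parameter so that the helper can be tried outside /run.

```python
import os

RELABEL_DIR = "/run/systemd/relabel-extra.d"


def write_relabel_list(paths, name, directory=RELABEL_DIR):
    """Write a newline-separated list of paths to <directory>/<name>."""
    if not name.endswith(".relabel"):
        raise ValueError("file name must end in .relabel")
    for path in paths:
        # Assumption of this sketch: only absolute, single-line paths are accepted.
        if "\n" in path or not os.path.isabs(path):
            raise ValueError(f"unsuitable path: {path!r}")
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, name)
    temporary = target + ".tmp"  # does not match *.relabel while incomplete
    with open(temporary, "w", encoding="utf-8") as f:
        f.write("".join(path + "\n" for path in paths))
    os.replace(temporary, target)  # rename so the final file is never half-written
    return target
```

Writing to a temporary name first means a reader of *.relabel files cannot observe a partially written list.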

        * KERNEL API BREAKAGE: Linux kernel 4.18 changed behaviour regarding
          mknod() handling in user namespaces. Previously mknod() would always
          fail with EPERM in user namespaces. Since 4.18 mknod() will succeed
          but device nodes generated that way cannot be opened, and attempts to
          open them result in EPERM. This breaks the "graceful fallback" logic
          in systemd's PrivateDevices= sand-boxing option. This option is
          implemented defensively, so that when systemd detects it runs in a
          restricted environment (such as a user namespace, or an environment
          where mknod() is blocked through seccomp or absence of CAP_MKNOD)
          where device nodes cannot be created the effect of PrivateDevices= is
          bypassed (following the logic that 2nd-level sand-boxing is not
          essential if the system systemd runs in is itself already sand-boxed
          as a whole). This logic breaks with 4.18 in container managers where
          user namespacing is used: suddenly PrivateDevices= succeeds setting
          up a private /dev/ file system containing device nodes, but when
          these are opened they don't work.

          At this point it is recommended that container managers utilizing
          user namespaces that intend to run systemd in the payload explicitly
          block mknod() with seccomp or similar, so that the graceful fallback
          logic works again.

          We are very sorry for the breakage and the requirement to change
          container configurations for newer kernels. It's purely caused by an
          incompatible kernel change. The relevant kernel developers have been
          notified about this userspace breakage quickly, but they chose to
          ignore it.
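
Container manager authors who want to see which of these behaviours their environment exhibits could use a small probe along the following lines. This is not systemd's detection logic, only a sketch; the function names and the three result strings are hypothetical. It tries to create a character device with the numbers of /dev/null (1:3) and then to open it.

```python
import os
import stat
import tempfile


def probe_device_nodes(directory):
    """Classify device node handling in `directory`.

    Returns "blocked" if mknod() fails, "unusable" if the node is created
    but cannot be opened, and "usable" if both steps succeed.
    """
    path = os.path.join(directory, "probe-null")
    try:
        # Character device 1:3 is /dev/null on Linux.
        os.mknod(path, stat.S_IFCHR | 0o600, os.makedev(1, 3))
    except OSError:
        return "blocked"
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return "unusable"
    else:
        os.close(fd)
        return "usable"
    finally:
        os.unlink(path)  # always remove the probe node again


def probe():
    """Run the probe in a throwaway temporary directory."""
    with tempfile.TemporaryDirectory() as directory:
        return probe_device_nodes(directory)
```

A result of "unusable" corresponds to the situation described above, where creation succeeds but opening fails. A result of "blocked" is the case in which the graceful fallback can take effect. Note that a temporary directory on a file system mounted with nodev also reports "unusable", so the result needs to be interpreted with the mount options in mind.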

        Contributions from: afg, Alan Jenkins, Aleksei Timofeyev, Alexander
        Filippov, Alexander Kurtz, Alexey Bogdanenko, Andreas Henriksson,
        Andrew Jorgensen, Anita Zhang, apnix-uk, Arkan49, Arseny Maslennikov,
        asavah, Asbjørn Apeland, aszlig, Bastien Nocera, Ben Boeckel, Benedikt
        Morbach, Benjamin Berg, Bruce Zhang, Carlo Caione, Cedric Viou, Chen
        Qi, Chris Chiu, Chris Down, Chris Morin, Christian Rebischke, Claudius
        Ellsel, Colin Guthrie, dana, Daniel, Daniele Medri, Daniel Kahn
        Gillmor, Daniel Rusek, Daniel van Vugt, Dariusz Gadomski, Dave Reisner,
        David Anderson, Davide Cavalca, David Leeds, David Malcolm, David
        Strauss, David Tardon, Dimitri John Ledkov, Dmitry Torokhov, dj-kaktus,
        Dongsu Park, Elias Probst, Emil Soleyman, Erik Kooistra, Ervin Peters,
        Evgeni Golov, Evgeny Vereshchagin, Fabrice Fontaine, Faheel Ahmad,
        Faizal Luthfi, Felix Yan, Filipe Brandenburger, Franck Bui, Frank
        Schaefer, Frantisek Sumsal, Gautier Husson, Gianluca Boiano, Giuseppe
        Scrivano, glitsj16, Hans de Goede, Harald Hoyer, Harry Mallon, Harshit
        Jain, Helmut Grohne, Henry Tung, Hui Yiqun, imayoda, Insun Pyo, Iwan
        Timmer, Jan Janssen, Jan Pokorný, Jan Synacek, Jason A. Donenfeld,
        javitoom, Jérémy Nouhaud, Jeremy Su, Jiuyang Liu, João Paulo Rechi
        Vita, Joe Hershberger, Joe Rayhawk, Joerg Behrmann, Joerg Steffens,
        Jonas Dorel, Jon Ringle, Josh Soref, Julian Andres Klode, Jun Bo Bi,
        Jürg Billeter, Keith Busch, Khem Raj, Kirill Marinushkin, Larry
        Bernstone, Lennart Poettering, Lion Yang, Li Song, Lorenz
        Hübschle-Schneider, Lubomir Rintel, Lucas Werkmeister, Ludwin Janvier,
        Lukáš Nykrýn, Luke Shumaker, mal, Marc-Antoine Perennou, Marcin
        Skarbek, Marco Trevisan (Treviño), Marian Cepok, Mario Hros, Marko
        Myllynen, Markus Grimm, Martin Pitt, Martin Sobotka, Martin Wilck,
        Mathieu Trudel-Lapierre, Matthew Leeds, Michael Biebl, Michael Olbrich,
        Michael 'pbone' Pobega, Michael Scherer, Michal Koutný, Michal
        Sekletar, Michal Soltys, Mike Gilbert, Mike Palmer, Muhammet Kara, Neal
        Gompa, Neil Brown, Network Silence, Niklas Tibbling, Nikolas Nyby,
        Nogisaka Sadata, Oliver Smith, Patrik Flykt, Pavel Hrdina, Paweł
        Szewczyk, Peter Hutterer, Piotr Drąg, Ray Strode, Reinhold Mueller,
        Renaud Métrich, Roman Gushchin, Ronny Chevalier, Rubén Suárez Alvarez,
        Ruixin Bao, RussianNeuroMancer, Ryutaroh Matsumoto, Saleem Rashid, Sam
        Morris, Samuel Morris, Sandy Carter, scootergrisen, Sébastien Bacher,
        Sergey Ptashnick, Shawn Landden, Shengyao Xue, Shih-Yuan Lee
        (FourDollars), Silvio Knizek, Sjoerd Simons, Stasiek Michalski, Stephen
        Gallagher, Steven Allen, Steve Ramage, Susant Sahani, Sven Joachim,
        Sylvain Plantefève, Tanu Kaskinen, Tejun Heo, Thiago Macieira, Thomas
        Blume, Thomas Haller, Thomas H. P. Andersen, Tim Ruffing, TJ, Tobias
        Jungel, Todd Walton, Tommi Rantala, Tomsod M, Tony Novak, Tore
        Anderson, Trevonn, Victor Laskurain, Victor Tapia, Violet Halo, Vojtech
        Trefny, welaq, William A. Kennington III, William Douglas, Wyatt Ward,
        Xiang Fan, Xi Ruoyao, Xuanwo, Yann E. Morin, YmrDtnJu, Yu Watanabe,
        Zbigniew Jędrzejewski-Szmek, Zhang Xianwei, Zsolt Dollenstein

— Warsaw, 2018-12-21

NOTE: This fixes an incompatibility with Linux 4.18.x+

comment:4 by Douglas R. Reno, 3 years ago

Resolution: fixed
Status: assigned → closed

Fixed at r11495
