OSPM-summit-18 topics

name/surname | affiliation | short bio | email | title | abstract | 30min/50min slot
Rafael Wysocki / Thomas Ilsche | Intel / Technical University of Dresden | rafael@kernel.org | CPU Idle Loop Ordering Problem | There is a design issue in the Linux kernel's CPU idle loop that leads to undesirable outcomes, such as excessive energy usage or unnecessary overhead. Namely, before invoking the CPU idle governor, the timekeeping subsystem stops the scheduler tick timer (and possibly reprograms the clock event device associated with it) unless there are other timers due to expire before it on the given CPU. There is substantial overhead related to that. Next, the idle governor predicts the duration of the idle time and selects the idle state to put the CPU into on that basis. If the scheduler tick timer has been stopped and the idle duration predicted by the governor is short, the outcome is problematic regardless of whether or not the prediction is accurate. If it is accurate, the overhead of stopping the scheduler tick timer was unnecessary. If it is not accurate, the idle state selected by the governor is inadequate and the CPU will spend too much time in it, which leads to excessive energy usage, as recently demonstrated by researchers at the Technical University of Dresden. This issue is independent of the CPU idle governor in use, so the only way to really address it is to redesign the idle loop. | 50
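The reordering this abstract argues for can be sketched as a toy model (plain Python, not kernel code; the tick period and the residency values are made-up numbers): consult the governor's prediction first, and stop the tick only when the predicted idle period outlasts it.

```python
TICK_NS = 4_000_000  # hypothetical 250 Hz scheduler tick period

def should_stop_tick(predicted_idle_ns):
    """Reordered idle loop: decide about the tick *after* the
    governor has predicted the idle duration, so the cost of
    stopping (and later restarting) the tick is only paid when
    the predicted residency justifies it."""
    return predicted_idle_ns > TICK_NS

# Short predicted idle: keep the tick running, avoid the overhead,
# and let the next tick bound the damage of a misprediction.
assert should_stop_tick(100_000) is False
# Long predicted idle: stopping the tick is worth it.
assert should_stop_tick(10_000_000) is True
```

The point of the reordering shows in the short-idle case: nothing is torn down before the governor has spoken.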
Rafael Wysocki | Intel | rafael@kernel.org | Towards a tighter integration of the system-wide and runtime PM of devices | Generally speaking, the handling of devices during system-wide power transitions may be optimized in some cases. First, some devices, if they are in runtime suspend before a system-wide transition to a sleep state (or equivalent), may be left in suspend going forward. Second, some devices may be left in suspend after system-wide resume transitions. Finally, for some devices, the same driver callbacks may be used for runtime PM and system-wide PM. The problem is that, in general, each of these cases requires some kind of coordination between the device driver, the PM core and (possibly) some middle-layer code between them, such as a bus type or a PM domain. Moreover, there are drivers that need to work with different middle layers (e.g. different PM domains) or without a middle layer at all, so the coordination mechanism, whatever it is, needs to be agreed upon and universal. The current proposal is to allow drivers to set flags informing the PM core and middle layers of the drivers' expectations. Some progress has been made in this direction and it will be good to report on its current status at least. | 50
Tiejun Chen | VMware | tiejunc@vmware.com | Real Time Virtualization Exploration | Virtualization technologies have already been deployed in embedded systems, but in many use scenarios applications/services must respond in real time. Some typical hypervisors are constructed as real-time hypervisors, but this solution still cannot meet hard real-time requirements even when a real-time guest OS is adopted, because hypervisors have no knowledge of the tasks inside the VM. Based on this two-level scheduling domain, I'd like to review several potential but different approaches to constructing a real-time virtualization solution: 1) paravirtualize the guest OS (e.g. Linux) to synchronize scheduling between the guest OS and the hypervisor; 2) an AMP hypervisor; 3) a single-process purpose OS; 4) EPT-only protected applications/services; 5) some possible combinations. | 30
Georgi Djakov / Vincent Guittot | Scaling the interconnect bus | Modern SoCs have multiple processors and various dedicated cores (video, GPU, graphics, modem). These cores talk to each other and can generate a lot of data flowing through the on-chip interconnects. These interconnect buses can form different topologies, such as crossbars, point-to-point buses or hierarchical buses, or use the network-on-chip concept. These buses are usually sized to handle use cases with high data throughput, but that is not necessary all the time, and they consume a lot of power. Furthermore, the priority between masters can vary depending on the running use case, such as video playback or CPU-intensive tasks. Give a status update on the framework and the API, and discuss their usage by drivers and frameworks.
Viresh Kumar | Linaro | viresh.kumar@linaro.org | Notifying devices of performance state changes of their master PM domain | Problem statement: With ARM DynamIQ technology, big and LITTLE CPUs can be part of the same cluster. We also have a DSU (DynamIQ Shared Unit) that comprises the L3 memory system. All or some of these entities can be part of the same voltage domain, and these units now need to synchronize while doing DVFS. How do we make sure that the CPUs and the DSU are configured to give the best performance for the power used?

Our proposed solution is to capitalize on the performance-state work done in the power domain core (genpd) and handle the voltage domain as a power domain with performance states. These power domains will allow devices to register callbacks, which will be called when the performance state of the domain changes, so that the devices can align their frequencies to the newly selected performance state. So, if a CPU needs to run at 800 MHz and that requires performance state 1, it requests that state. But if the state eventually selected for the voltage domain is 5 (because the DSU requested a higher state), then the CPU should actually run at 1.3 GHz instead, since it is already running at the higher voltage.
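A small Python model of the aggregation described above (the state-to-frequency table and the device names are invented for illustration; the real mechanism lives in genpd): the domain runs at the maximum state requested by its members, and registered callbacks let each device re-align its frequency.

```python
# Hypothetical performance-state -> CPU frequency (kHz) table.
STATE_FREQ = {1: 800_000, 3: 1_000_000, 5: 1_300_000}

class PerfDomain:
    """Toy voltage/performance domain: aggregates per-device state
    requests and notifies callbacks of the resulting domain state."""
    def __init__(self):
        self.requests = {}
        self.callbacks = []

    def request(self, dev, state):
        self.requests[dev] = state
        new_state = max(self.requests.values())  # highest request wins
        for cb in self.callbacks:
            cb(new_state)
        return new_state

domain = PerfDomain()
cpu_freq = []
domain.callbacks.append(lambda s: cpu_freq.append(STATE_FREQ[s]))

domain.request("cpu", 1)   # CPU wants 800 MHz -> state 1
domain.request("dsu", 5)   # DSU lifts the domain to state 5
assert cpu_freq == [800_000, 1_300_000]  # CPU re-aligns to 1.3 GHz
```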
Joel Fernandes | Google | joelaf@google.com | eBPF super powers on arm64 and Android | eBPF has gained popularity for low-overhead tracing and data aggregation used to understand the inner workings of the Linux kernel. The bcc tools based on eBPF work only on machines where the host and target of development are the same. For the tools to work, LLVM libraries, Python and the kernel sources are required on the target. This installation is easy when host and target are the same, but is more difficult and cumbersome in a cross-development model, such as development for arm64, where the kernel sources are typically on the host, not the target. The proposed solution also avoids the extra step of cross-compiling the tools and libraries, which also takes up target disk space. It runs the bcc tools on a remote machine which has all the dependencies already installed (such as an x86 machine), with the target to be traced (such as an arm64 machine) connected over USB or the network. This is made possible by a daemon Joel hacked together, called bpfd, along with changes to the core BCC libraries to talk to the daemon. With this solution, no preparation of the remote target is needed, and debugging it becomes effortless. This session will go through the proposed solution and its challenges, along with a possible demo.
Tomasz Kloda | University of Modena | tomasz.kloda@unimore.it | HERCULES: a real-time architecture for low-power embedded systems | In this talk, we will illustrate the HERCULES software architecture, which aims at reducing energy consumption while meeting real-time constraints on modern multi-core ARM architectures. The infrastructure consists of a hypervisor for the concurrent execution of OSs with different levels of criticality (i.e., Linux and an RTOS). Access to shared hardware resources is managed by implementing mechanisms (e.g., PREM, MemGuard, cache coloring) to reduce interference between different OSs and cores.
Charles Garcia-Tobin | Arm | charles.garcia-tobin@arm.com | Security and performance trade-offs in power interfaces | Many Arm embedded SoCs present very low-level power controls to the kernel, with separate voltage regulator and frequency controls. Whilst this provides a rich environment for developing cpufreq drivers, it comes at the cost of security. The CLKSCREW attack [1] enabled leaking of TrustZone secrets, and even code injection into TrustZone, through separate manipulation of voltage and frequency. Firmware interfaces for performance/power are also moving away from direct voltage and frequency control to more abstract performance requests. These interfaces rely on having an inherently trusted controller for power, which does not expose separate controls. Here we present two such interfaces, Arm's System Control and Management Interface (SCMI) and ACPI's Collaborative Processor Performance Control (CPPC), and relate the two.
[1] https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-tang.pdf
Daniel Lezcano | Linaro | daniel.lezcano@linaro.org | Do more with less | In the mobile world, thermal mitigation is challenging as it must reduce power without actively dissipating the heat. Faced with this problem, DVFS has taken a major role as an efficient passive cooling device. In conjunction with the power model and the Intelligent Power Allocator governor, a SoC can efficiently compute the best OPP to sustain the power budget and mitigate the temperature at a desired level.
The presentation will introduce the idle alter ego of the cpufreq cooling device, where we mathematically derive the auto-adjustable running cycle given a fixed idle injection cycle. Thanks to the power information now available in the DT, we can precisely compute the idle/running cycle ratio.
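The idle/running ratio referred to above can be illustrated with a back-of-the-envelope calculation (the power numbers are invented; real values would come from the DT power model, and idle power is approximated as zero here):

```python
def running_fraction(p_budget_mw, p_running_mw):
    """Fraction of each injection cycle the CPU may stay running so
    that average power stays within the budget, assuming the idle
    state consumes (approximately) nothing."""
    return min(1.0, p_budget_mw / p_running_mw)

def running_slice_ms(idle_slice_ms, fraction):
    """Auto-adjusted running slice for a fixed idle injection slice:
    run / (run + idle) == fraction  =>  run = idle * f / (1 - f)."""
    return idle_slice_ms * fraction / (1.0 - fraction)

# Hypothetical OPP drawing 1500 mW, throttled to a 600 mW budget.
f = running_fraction(600, 1500)
assert abs(f - 0.4) < 1e-9
# With a fixed 10 ms idle injection slice, run ~6.67 ms per cycle.
run_ms = running_slice_ms(10, f)
assert abs(run_ms - 10 * 0.4 / 0.6) < 1e-9
```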

Both of these cooling devices, cpufreq and cpuidle, can mitigate the temperature, but at the cost of a performance loss. We will show that we can use a smart CPU cooling device which computes and chooses the best strategy by aggregating and combining the cooling effects of each of them.
Vincent Guittot | Linaro | vincent.guittot@linaro.org | What is still missing in load tracking? | The load tracking mechanism in the scheduler is regularly evolving to track more accurately the load and the utilization of the CPUs. During this presentation, we will start with a short status of the current state of the load tracking mechanism and will then dig into the open items that remain to be solved. | 50
Patrick Bellasi | ARM | patrick.bellasi@arm.com | Status update on Utilization Clamping | Utilization clamping is a mechanism that allows enforcing a lower and/or upper bound on the utilization of SCHED_OTHER tasks. While the actual utilization of a task, tracked by PELT, is not affected, the clamped value can be conveniently used to bias decisions of both the CPUFreq schedutil governor and the FAIR scheduler, e.g. running a task within a maximum frequency or preferring certain CPUs over others on SD_ASYM_CPUCAPACITY systems. The most appropriate clamp values for a task are expected to be defined from user-space, so a suitable interface has to be designed to support this new mechanism. This presentation will give a short introduction to the status of the proposal and then aims at collecting feedback on what the most appropriate userspace interface should be.
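As a rough illustration of the mechanism (not the proposed kernel interface; the schedutil-style formula and all numbers are simplifications): the PELT value stays intact, and only the copy consumed by frequency selection is clamped.

```python
def clamp_util(util, util_min, util_max):
    """Clamped utilization as seen by schedutil; the PELT-tracked
    value itself is left untouched."""
    return max(util_min, min(util, util_max))

def next_freq(util, max_freq, capacity=1024):
    # Simplified schedutil-style mapping: f = 1.25 * f_max * util / C
    return int(1.25 * max_freq * util / capacity)

# Boost: a lightly loaded task still gets a decent frequency floor.
assert clamp_util(100, util_min=512, util_max=1024) == 512
assert next_freq(512, 2_000_000) == 1_250_000  # kHz, on a 2 GHz CPU
# Cap: a heavy background task's frequency demand is bounded.
assert clamp_util(900, util_min=0, util_max=400) == 400
```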
Morten Rasmussen / Lorenzo Pieralisi | Arm | morten.rasmussen@arm.com | arm64 topology representation | To be updated... Challenges related to describing CPU topology for a wide variety of systems, particularly the relation between clusters and packages, in DT and ACPI, and how the information can be used to affect scheduling decisions.
Chris Redpath | Arm | chris.redpath@arm.com | Why Android phones don't run mainline Linux (yet) | 50
Alessio Balsini | Arm | alessio.balsini@arm.com | SCHED_DEADLINE: The Android Audio Pipeline Case Study | Counting more than one billion devices, Android is nowadays one of the most popular general-purpose operating systems based on Linux. It manages many different workloads, some of them requiring performance/QoS guarantees, e.g. real-time multimedia streaming. This work focuses on the challenges and improvements to the real-time performance of the Android audio pipeline, through proper modifications to Android and to the Linux kernel internals related to the management of real-time tasks, and specifically to the SCHED_DEADLINE scheduler.
Quentin Perret / Dietmar Eggemann | Arm | quentin.perret@arm.com | EAS in Mainline: where we are | To be updated... Energy Aware Scheduling has been used for years in Android and shipped in several products, but hasn't made its way into the mainline Linux scheduler yet. This talk details the design choices and experimental results achieved with a simplified, less intrusive and yet efficient version of EAS aimed to be pushed for mainline. | 50
Dhaval Giani | Oracle | dhaval.giani@oracle.com | Workload consolidation by dynamic adaptation | When consolidating different workloads on a system, there is mutual performance interference over shared resources. The Linux scheduler by itself does not eliminate mutual interference and QoS impacts due to shared hardware components such as shared pipelines or caches. When applications use cgroup shares, threads of other workloads running on the system can mix with them, leading to performance interference. With exclusive cpusets, if the workload needs additional CPUs, it cannot possibly get them. At lower utilization, workloads with no consolidation technique ("unbound") perform better. As the utilization of each workload increases, they start interfering with each other, and the workloads which were assigned their own mutually exclusive cpusets ("hard-bound") perform better than "unbound". We want something which can dynamically adapt to utilization within constraints, performing like "unbound" at low utilization and like "hard-bound" at higher utilization, to provide the best of both worlds.
Dhaval Giani | Oracle | dhaval.giani@oracle.com | Towards a constant-time search for the wakeup fast path | The objective of the wakeup fast path is to find an idle CPU/core for the wakee task as fast as possible, keeping cache affinity in mind. Although it sounds like a trivial problem, there are inherent scaling problems associated with it. The idle CPU scanning cost should not be more than the average idle time of the runqueue. The current implementation tries to find a fully idle core sharing the same last-level cache (LLC) in an iterative fashion. If that fails, it then searches for an idle CPU. This cost is proportional to the number of CPUs in the LLC (especially as the number of cores per LLC starts to increase). To contain the cost of the search, we use heuristics. These heuristics cause non-deterministic behaviour for some workloads. This topic discusses how we can try to improve the search cost, ideally heading towards a constant-time search. Building on the wakeup fast path discussion, we also want to talk about adding heuristics around "large" SMT (>> 4 SMT threads per core). These two topics can be combined if needed.
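The bounded scan and the non-determinism it introduces can be mimicked in a few lines (a Python toy, not the scheduler code; the real scan-budget heuristic differs):

```python
def find_idle_cpu(idle, llc_cpus, nr_scan):
    """Linear scan over the CPUs sharing an LLC, capped at nr_scan
    inspections. The cap bounds the search cost, but whether an idle
    CPU is found now depends on where it sits in the scan order:
    the source of the non-deterministic behaviour."""
    for i, cpu in enumerate(llc_cpus):
        if i >= nr_scan:
            return None  # budget exhausted, fall back to the target
        if idle[cpu]:
            return cpu
    return None

idle = {0: False, 1: False, 2: False, 3: True}
assert find_idle_cpu(idle, [0, 1, 2, 3], nr_scan=4) == 3
# Same system, smaller budget: the idle CPU is missed entirely.
assert find_idle_cpu(idle, [0, 1, 2, 3], nr_scan=2) is None
```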
Dhaval Giani | Oracle | dhaval.giani@oracle.com | Scheduler unit tests | At Linux Plumbers Conference, there was talk about generating traces to run with rt-app to test the scheduler. A consensus was reached that we should generate these traces and store them in a "central" project. Let's discuss this and try to come up with a plan.
Ulf Hansson | Linaro | ulf.hansson@linaro.org | Next steps of CPU cluster idling | The CPU cluster idling series has been iterated on the mailing lists for quite some time now, although not all changes have been applied. This session intends to give an update on the current status, as well as discuss some of the next steps and their related issues. For example, one of the next steps is to hook non-CPU devices and non-CPU PM domains into the CPU PM domain topology. This is needed when, for example, a device needs to put QoS constraints on the cluster idle state. However, this may get complicated if the cluster has multiple idle states. Another example is dealing with SoC-specific operations, such as saving and restoring the register context of interconnect devices, which could be needed immediately around when the cluster powers off/on, for coherency reasons. | 30
Ulf Hansson | Linaro | ulf.hansson@linaro.org | Integration of system-wide/runtime PM of devices - what are the requirements for a common solution? | While optimizing device PM for system-wide PM, there are a number of requirements that must be fulfilled by a generic solution. While we continue to explore the two existing approaches, which consist of either using pm_runtime_force_suspend|resume() or the new driver PM flags, it also seems like a good idea to try to agree on what the requirements for the generic solution are. (I can either run this as a part of Rafael's session on the similar topic, or run it as an introduction to it.) | 30
Luca Abeni | Scuola S. Anna, Pisa | luca.abeni@santannapisa.it | Power-aware and capacity-aware migrations for real-time tasks | Currently, both SCHED_DEADLINE and SCHED_{FIFO,RR} migrate tasks between CPUs/cores (to implement global EDF or global fixed priorities) without caring about the speed/capacity of the various CPUs or cores. In this way, the scheduler assumes a symmetric multiprocessor architecture and does not consider heterogeneous systems (for example, ARM big.LITTLE) or CPUs running at different frequencies. Since real-time scheduling on heterogeneous multiprocessors has already been studied in the literature, these theoretical results can be implemented in the Linux scheduler to achieve better performance on modern systems. Also, migrations between "slow" cores and "fast" cores can be used instead of DVFS to save energy (while not breaking real-time guarantees) when the frequency switch time is too high (on some systems, more than 3 ms have been measured).
Houssam-Eddine Zahaf | Université de Lille | houssam-eddine.zahaf@univ-lille1.fr | Energy-aware real-time scheduling: Parallel or sequential? From analysis to implementation | Heterogeneous Multicore Processors (HMP) such as ARM big.LITTLE are designed to combine performance and energy efficiency, so they are one of the preferred choices for parallel applications, especially when these systems are operated on battery power. These platforms are becoming pervasive and may run time-critical applications. Recently, a few works have focused on modeling intra-task parallelism while reducing energy consumption for real-time systems on heterogeneous single-ISA architectures.

In this presentation, we introduce two task models and two sufficient feasibility tests. Based on these tests, we present two energy-aware partitioning techniques.

First, we present a model of the performance and energy consumption of a parallel real-time task executed on an ARM big.LITTLE architecture. We then use this model in the second part to introduce the partitioning heuristics. Hence, this presentation has a special focus on Dynamic Voltage and Frequency Scaling (DVFS), parallelization, real-time scheduling and resource allocation techniques.

The presentation has two parts:

In the first, I present power and energy consumption models. For this part, I injected real-time code into MiBench on an ARM big.LITTLE platform and "physically" measured the power dissipation of the cores and RAM, and based on that I built an energy profile (it is different from what is usually used in the literature; you can take a quick look at the power model at this link: http://www.sciencedirect.com/science/article/pii/S138376211730019X).
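As a schematic of the kind of model involved (the capacitance and voltage figures below are invented; the measured model in the linked paper is richer and also covers static and memory power):

```python
def dynamic_energy_j(cap_nf, volt, cycles):
    """Toy dynamic-energy model: E = C * V^2 per cycle, times the
    number of cycles executed. Real per-cluster coefficients would
    be fitted from physical measurements as described above."""
    return cap_nf * 1e-9 * volt ** 2 * cycles

# A big core at high voltage costs more energy per cycle than a
# LITTLE core at low voltage, even though it finishes sooner.
e_big = dynamic_energy_j(1.0, 1.1, 1_000_000)
e_little = dynamic_energy_j(0.5, 0.9, 1_000_000)
assert e_big > e_little
```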

In the second part, I present two parallel task models. This part is restricted to analysis. The tool has been written in Scala. No kernel-level implementation is needed because all my proposals can be done in userspace (I have directed a master's student to build an OpenMP-like platform to easily express our models using C code, and it is still ongoing with a new one this year).
Hans de Goede | Red Hat | hdegoede@redhat.com | Improving Linux battery life through better defaults | Modern laptops can use a lot less energy than laptops from a decade ago, but in order to actually get this low energy usage, the operating system needs to make efficient use of the hardware. Linux supports a lot of hardware power-saving features, but many of them are disabled by default because they cause problems on certain devices or in certain, often corner-case, circumstances. The first half of the presentation will be a short description of recent work <https://fedoraproject.org/wiki/Changes/ImprovedLaptopBatteryLife> done to improve some of the PM-related default settings. The second half is intended for discussion on how to pick better defaults for new PM-related features in the future. | 30
Tot. 50 min: 8
Tot. 30 min: 8
Unspecified duration: 7