Enhance Linux System Call in Hyper-Threading CPU
Taken over from Vasily Tarasov on 2008.4.1
Last update 2008.4.19
An HT processor has two logical processors (or hardware threads). Our plan is to dedicate one logical processor (say LP1) to an OS process that runs constantly, and the other logical processor (LP2) to application processes. Every time an application process makes a system call, the OS process notices the request, services the system call, and returns the result to the application process. A shared memory region is set up to pass the input arguments and results of system calls between the application process and the OS process. In the normal case the OS process sits in a polling loop; when the application process makes a system call, it in turn sits in a polling loop waiting for the system call to complete. The important point is that no context switch takes place during a system call, so its latency is expected to be much lower.
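The handshake described above can be sketched in user-space C. This is only a minimal single-threaded model, not the project's code: struct demo_region, its fields, the SLOT_* states, and the three helper functions are all assumptions made for illustration, and the stand-in "system call" simply doubles its argument. In the real design each side would run its step in a loop on its own logical processor.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified shared region; names are assumptions. */
enum { SLOT_IDLE = 0, SLOT_REQUEST = 1, SLOT_DONE = 2 };

struct demo_region {
    atomic_int flag;   /* SLOT_IDLE, SLOT_REQUEST, or SLOT_DONE */
    int syscall_num;   /* which service the application asks for */
    long arg;          /* single input argument */
    long result;       /* filled in by the OS side */
};

/* Application side: publish a request. The release store makes the
 * arguments visible before the OS side can observe SLOT_REQUEST. */
static void ap_post(struct demo_region *r, int num, long arg)
{
    assert(r != NULL);
    r->syscall_num = num;
    r->arg = arg;
    atomic_store_explicit(&r->flag, SLOT_REQUEST, memory_order_release);
}

/* OS side: one iteration of the polling loop. Returns 1 if a request
 * was serviced, 0 if there was nothing to do. */
static int sp_poll_once(struct demo_region *r)
{
    if (atomic_load_explicit(&r->flag, memory_order_acquire) != SLOT_REQUEST)
        return 0;
    /* Stand-in "system call": number 1 doubles its argument. */
    r->result = (r->syscall_num == 1) ? r->arg * 2 : -1;
    atomic_store_explicit(&r->flag, SLOT_DONE, memory_order_release);
    return 1;
}

/* Application side: one iteration of the wait loop. Returns 1 and stores
 * the result once the OS side has finished, 0 if it must keep polling. */
static int ap_poll_once(struct demo_region *r, long *out)
{
    if (atomic_load_explicit(&r->flag, memory_order_acquire) != SLOT_DONE)
        return 0;
    *out = r->result;
    atomic_store_explicit(&r->flag, SLOT_IDLE, memory_order_relaxed);
    return 1;
}
```

A caller would invoke ap_post() once and then spin on ap_poll_once(), while the OS side spins on sp_poll_once(). Neither side enters the kernel through a trap, which is the point of the design.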
There are four main tasks in this project:
(1) Implement the basic design above so that the application process and the OS process can pass data and control to each other.
(2) Whenever a new application process is scheduled on LP2, the context of the OS process (CR3) should be set to that of the new application process. This allows the OS process to directly access the memory of the application process currently running on LP2.
(3) Reduce the CPU consumption of the OS process during normal operation (when there is no system call) by forcing it to yield in some way.
(4) For system calls that may block, use the standard way of invoking system calls, because latency is not important for these system calls.
Linux System Call
When a user process makes a system call, Linux does the following:
(1) Issuing the system call via int 0x80 (a trap gate in the Interrupt
Descriptor Table)
* for example, a process calls getpid()
(2) Switching to kernel mode
cs: from user code segment to kernel code segment
ss: from user stack segment to kernel stack segment
cr3: does not change when switching to kernel mode.
(3) Saving the current context (pushing onto the kernel stack)
save the system call number (passed in %eax) on the kernel stack.
eflags, cs, eip, ss, and esp are saved by the hardware control unit.
save the registers used later by the handler.
* Steps (2) and (3) are implemented in the system_call section of linux/arch/i386/kernel/entry.S;
the kernel has already installed this handler in the IDT with set_system_gate(0x80, &system_call)
(4) Invoking the service routine identified by the system call number
(in %eax) and returning the result in %eax
* for example, sys_getpid() is invoked. Note that these sys_ functions are not exported as kernel symbols. On the other hand, if the user issues gettimeofday(), it triggers sys_gettimeofday(), which in turn calls the exported function do_gettimeofday().
(5) Restoring the previous context (popping from the kernel stack)
the reverse of step (3)
(6) Switching back to user mode (cs, ss)
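For comparison with the fast path, the standard path can be exercised from user space. The sketch below is an assumption-labelled illustration rather than part of the project: raw_getpid is a hypothetical helper that calls the generic syscall(2) entry point of the C library, which selects the trap mechanism for the running architecture (int 0x80 on the older i386 setups discussed here).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke getpid through syscall(2) instead of the getpid() wrapper.
 * Every call goes through the full sequence (1)-(6) listed above. */
static pid_t raw_getpid(void)
{
    long ret = syscall(SYS_getpid);   /* full user/kernel round trip */
    assert(ret > 0);                  /* getpid cannot fail */
    return (pid_t)ret;
}
```

Timing a loop of raw_getpid() calls would give a baseline against which the polling design can be compared.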
Fast System Call
In our fast system call design, however, steps (1), (2), (3), (5), and (6) are not necessary.
This works for the following reasons:
(a) A system process (SP) runs on one logical processor and the application
process (AP) runs on the other.
(b) The SP polls a flag in the shared memory region to wait for
requests from the AP, and the AP polls a flag in the same region to
get the response from the SP.
(c) The AP passes three things to the SP through the shared
memory region:
- system call number
- value of the current process pointer (current)
- address of parameters
The addresses here are linear addresses in the process
address space of the AP.
Since cr3 does not change when switching to kernel mode, a linear
address used in user mode remains valid in kernel-mode routines.
When a long result must be passed from the SP to the AP, this can be
done as follows.
In the SP, we retrieve the cr3 value from the current->mm->pgd field of the AP, which refers to
the Page Global Directory (PGD) of the AP.
We then replace the cr3 of the SP with that of the AP and invoke the desired
service routine on the SP.
Afterwards, the copy_to_user kernel routine passes the result back to the AP.
(With the same linear addresses and page tables, this emulates the SP and AP
running on the same CPU.)
The Linux kernel assigns each process a Page Global
Directory (PGD) for paging. One part of the PGD is the same for all
processes: the part used for paging inside the kernel (3G to 4G). The above
explanation is reasonable if we assume the following:
the kernel parts of the PGDs of the SP and the AP are identical, even though the SP and the AP
have two different cr3 values (because they run on different CPUs).
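The split between the per-process part and the shared kernel part of the PGD can be made concrete with a little arithmetic. The sketch below assumes classic i386 two-level paging (4 KB pages, no PAE) and the 3G/1G split mentioned above. The constants mirror the usual kernel names, but in_kernel_part is a hypothetical helper introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: i386, 4 KB pages, two-level paging (no PAE), 3G/1G split. */
#define PGDIR_SHIFT   22                      /* each PGD entry maps 4 MB */
#define PTRS_PER_PGD  1024
#define PAGE_OFFSET   UINT32_C(0xC0000000)    /* start of kernel space (3G) */

/* The 1024 entries of 4 MB each must cover the whole 4 GB linear space. */
static_assert((UINT64_C(1) << 32) >> PGDIR_SHIFT == PTRS_PER_PGD,
              "PGD must cover 4 GB");

/* Index of the PGD entry that translates a 32-bit linear address. */
static unsigned pgd_index(uint32_t addr)
{
    return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
}

/* Nonzero if the address is translated by the kernel part of the PGD,
 * i.e. the entries assumed to be identical in the SP and the AP. */
static int in_kernel_part(uint32_t addr)
{
    return pgd_index(addr) >= pgd_index(PAGE_OFFSET);
}
```

Under these assumptions entries 0 to 767 describe the user part, which differs per process, and entries 768 to 1023 describe the kernel part. Loading the AP's cr3 on the SP therefore changes only what the lower 768 entries map.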
OS Setup
(1) Get the kernel source (currently ubuntu-7.10-server edition)
sudo apt-get build-dep linux-source-2.6.22
apt-get source linux-source-2.6.22
(2) Configure the kernel
make mrproper
cp /boot/config-`uname -r` ./.config
run make menuconfig and make sure the following options are enabled:
CONFIG_ACPI
CONFIG_SMP
CONFIG_X86_SMP
CONFIG_X86_HT
(3) To build and install the kernel, execute these three commands:
make all
make modules_install
make install
(4) Add it to the boot menu
vi /boot/grub/menu.lst
Detail of Implementation
ACPU: application CPU
SCPU: system CPU
* htctl.c
(1) create a /dev/htctl device file
- register the character device with the following operations
static struct file_operations htctl_fops = {
    .owner          = THIS_MODULE,
    .unlocked_ioctl = htctl_ioctl,
    .mmap           = htctl_mmap,
};
- htctl_ioctl: lets the user process issue the ioctl system call (the actual system calls are identified by watching the shared memory region, and dispatched by htctl_ioctl). The sub-function htctl_get_handler finds the actual command issued by the user process.
- htctl_mmap: lets the user process map the shared region
(2) create the shared memory region (struct shared_region *to_kernel)
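struct shared_region
{
    int flag;                   /* polled by both the SP and the AP */
    unsigned long currentptr;   /* value of current for the AP */
    int syscall_num;            /* system call number */
    union {                     /* per-call parameters */
        struct {
            struct timeval *tv;
            struct timezone *tz;
        } gettimeofday;
        struct {
            int pid;
        } getpid;
    };
};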
- kmalloc() returns the starting linear address of a physically contiguous memory region.
- SetPageReserved(virt_to_page(virt_addr)) marks the pages as reserved in kernel space.
- some extra pages are allocated before and after the shared region
* When we call mmap in user space as follows
shm_fd = open("/dev/htctl", O_RDWR, 0);
to_kernel = mmap(0, sizeof(struct shared_region),
                 (PROT_READ | PROT_WRITE),
                 MAP_FILE | MAP_SHARED, shm_fd, 0);
sys_mmap2() in the kernel is called. It prepares a linear address interval and then invokes htctl_mmap() in htctl.c from the following place:
http://lxr.linux.no/linux+v2.6.22.14/mm/mmap.c#L1101
Afterwards, we use the following call to map the physical address of shared_region_area in the kernel into the vma (linear address interval) of the user process.
ret = remap_pfn_range(vma,
                      vma->vm_start,
                      virt_to_phys((void *)((unsigned long)shared_region_area)) >> PAGE_SHIFT,
                      vma->vm_end - vma->vm_start,
                      PAGE_SHARED);
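The address arithmetic behind the remap_pfn_range() arguments can be worked through in user space. The sketch below is a model under stated assumptions (i386, 4 KB pages, kmalloc memory in the direct mapping that starts at 3G). The model_* helpers are hypothetical stand-ins and do not call the kernel routines.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions: i386, 4 KB pages, directly mapped low memory at 3G. */
#define PAGE_SHIFT   12
#define PAGE_SIZE    (UINT32_C(1) << PAGE_SHIFT)
#define PAGE_OFFSET  UINT32_C(0xC0000000)

/* Model of virt_to_phys() for a kmalloc'ed (low memory) kernel address. */
static uint32_t model_virt_to_phys(uint32_t kvaddr)
{
    assert(kvaddr >= PAGE_OFFSET);   /* only valid for the direct mapping */
    return kvaddr - PAGE_OFFSET;
}

/* Page frame number of the kind passed as the third argument. */
static uint32_t model_pfn(uint32_t kvaddr)
{
    return model_virt_to_phys(kvaddr) >> PAGE_SHIFT;
}

/* Length of the vma created for an mmap() of len bytes:
 * the request is rounded up to whole pages. */
static uint32_t model_vma_len(uint32_t len)
{
    return (len + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}
```

For example, a region at kernel address 0xC1234000 has physical address 0x01234000 and page frame number 0x1234. Because the structure is much smaller than a page, vma->vm_end - vma->vm_start comes out as one full page under these assumptions.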
(3) htioctl_register
- maintains a LIST_HEAD structure and inserts the possible ioctl operations (described by htioctl_info structures) that the user process will issue on the device file
struct htioctl_info {
    unsigned cmd;
    int (*ioctl)(struct file *, unsigned int, unsigned long);
    struct module *owner;
    struct list_head list;
};
- the cmd field is used by the user process to pass the command number to the kernel. For example, cmd can be HTTHREAD_IOCTL_CMD, which is generated by the following
#define HTTHREAD_IOCTL_CMD _IOR(0x0, 0x0, struct htthread_ioctl_args)
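The registration and lookup idea, and what _IOR packs into cmd, can be tried out in user space. The sketch below is a stand-in under assumptions: the layout of struct htthread_ioctl_args is not shown in these notes, so the single field used here is invented, and struct demo_info with its helpers is a simplified replacement for htioctl_info that uses a plain next pointer instead of struct list_head.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Hypothetical argument layout; the real struct is not shown here. */
struct htthread_ioctl_args {
    int cpu;
};

#define HTTHREAD_IOCTL_CMD _IOR(0x0, 0x0, struct htthread_ioctl_args)

typedef int (*demo_handler)(unsigned int cmd, unsigned long arg);

/* User-space stand-in for struct htioctl_info. */
struct demo_info {
    unsigned int cmd;
    demo_handler ioctl;
    struct demo_info *next;
};

/* Insert at the head of the list, as a registration routine might do. */
static void demo_register(struct demo_info **head, struct demo_info *info)
{
    assert(info != NULL && info->ioctl != NULL);
    info->next = *head;
    *head = info;
}

/* Walk the list and return the handler registered for cmd, or NULL. */
static demo_handler demo_get_handler(const struct demo_info *head,
                                     unsigned int cmd)
{
    for (; head != NULL; head = head->next)
        if (head->cmd == cmd)
            return head->ioctl;
    return NULL;
}

/* Example handler that only reports success. */
static int demo_htthread(unsigned int cmd, unsigned long arg)
{
    (void)cmd;
    (void)arg;
    return 0;
}
```

The _IOR macro encodes the transfer direction, the type and number fields (both 0x0 here), and the size of the argument structure into one integer, so two commands differ as soon as any of these differ. The _IOC_DIR, _IOC_TYPE, _IOC_NR, and _IOC_SIZE macros recover the individual fields.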
* htthread.c
All the system call routines are implemented here, and they are later inserted into the list by htioctl_register. The cmd values above correspond to the following functions inside htthread.c
(1) htthread_ioctl: create the kernel thread on SCPU
- htthread_ioctl
// create a kernel thread running the htthread_loop function
htthread_system = kthread_create(htthread_loop, NULL, "htthread_system");
// bind the thread to SCPU
kthread_bind(htthread_system, STCPUNUM);
// insert the thread into the run queue and change its state to TASK_RUNNING
wake_up_process(htthread_system);
- htthread_loop
To yield the CPU when the kernel thread has nothing to handle, we use the following in htthread_loop.
if (!to_kernel->flag) {
    // arm the monitor on the flag in to_kernel
    __monitor((void *)&to_kernel->flag, 0, 0);
    // wait until a store to the flag (or an interrupt) occurs
    __mwait(0, 0);
}
... // check the flag in to_kernel modified by the user process, and service the system call
// yield to other processes
schedule();
// also refer to http://softwarecommunity.intel.com/Wiki/DevelopforCoreprocessor/284.htm
- Reference for monitor & mwait
MWAIT--Monitor_Wait guideline:
MONITOR sets up an address range for the monitor hardware using the
content of EAX as a logical address and resets the monitor event pending
flag (of the CPU). The memory address range should be within memory of the write-back
caching type. A store to the specified address range sets the monitor
event pending flag. The contents of ECX and EDX are used to communicate
other information to the MONITOR instruction.
MWAIT takes the argument in EAX as a hint extension and is architected to take the argument in ECX as an instruction extension.
// MWAIT EAX, ECX
{
    WHILE (! ("monitor_event_pending_flag" OR "monitor_not_active")) {
        implementation_dependent_optimized_state(EAX, ECX);
    }
    Clear monitor_event_pending_flag;
}
- from asm-i386/processor.h
static inline void __monitor(const void *eax, unsigned long ecx,
                             unsigned long edx)
{
    /* opcode bytes of the MONITOR instruction */
    asm volatile(
        ".byte 0x0f,0x01,0xc8;"
        : :"a" (eax), "c" (ecx), "d"(edx));
}

static inline void __mwait(unsigned long eax, unsigned long ecx)
{
    /* opcode bytes of the MWAIT instruction */
    asm volatile(
        ".byte 0x0f,0x01,0xc9;"
        : :"a" (eax), "c" (ecx));
}
(2) htmigr_ioctl: migrate all processes to ACPU
(3) htshmem_ioctl: not implemented
(4) getcurrent_ioctl: get the value of current (current process pointer)
Because "current" changes from time to time, for example after a transition from user space to kernel space, the only way for a user process to get its "current" is to ask the kernel while the process is running. It does so by issuing an ioctl command to the kernel; the value returned is simply the one held in the "current" variable in the kernel context.
(5) unhtthread_ioctl: stop the kernel thread
Problem when enhancing CPU idle performance
(1) monitor/mwait problem
First, look at the following code. The kernel thread runs
this loop to wait for requests from the user process.
---------------------------------
while(1) {
    monitor(flag);
    mwait();
    if(flag == 1) {
        service_the_syscall();
        flag = 0;
    }
    schedule();
}
----------------------------------
monitor/mwait leaves the wait state only if
- the flag changes (either from 1 to 0, or from 0 to 1), or
- an interrupt comes in.
As a result, the kernel thread can wait for a very long time if the following steps
happen consecutively:
- the kernel thread starts the loop
- it executes monitor/mwait (flag is 0 here)
- an interrupt arrives (e.g. the timer interrupt, 250 times every second)
- the "if(flag == 1)" check is evaluated and skipped (flag is 0)
- the user process issues a system call (flag set to 1)
- monitor/mwait waits on the flag (flag is already 1, so the kernel thread hangs
here because the user process has already posted its request; only an interrupt can
get the thread out of the mwait state)
Thus, we need to check the flag just before entering monitor/mwait, as
follows
------------------------------------------
if (!flag) {
    monitor(flag);
    mwait();
}
----------------------------------------
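The lost-wakeup sequence above can be replayed with a toy model of the monitor hardware. This is purely illustrative and rests on simplifying assumptions: struct mon_model and all model_* and wait_* names are invented here, an armed monitor is triggered only by stores made after MONITOR, and interrupts are not modelled (a return value of 1 means "MWAIT would sleep until the next interrupt").

```c
#include <assert.h>

/* Toy model of the monitor hardware; all names are illustrative. */
struct mon_model {
    int flag;      /* the shared request flag */
    int armed;     /* MONITOR has been executed and not yet consumed */
    int pending;   /* a store hit the armed address range */
};

static void model_monitor(struct mon_model *m)
{
    m->armed = 1;
    m->pending = 0;
}

/* A store by the user process; it only triggers an armed monitor. */
static void model_store(struct mon_model *m, int value)
{
    m->flag = value;
    if (m->armed)
        m->pending = 1;
}

/* Returns 1 if MWAIT would sleep until the next interrupt, 0 if it
 * falls through at once. Either way the monitor is consumed. */
static int model_mwait(struct mon_model *m)
{
    int sleeps;
    assert(m->armed);   /* MWAIT without MONITOR is not modelled */
    sleeps = !m->pending;
    m->armed = 0;
    m->pending = 0;
    return sleeps;
}

/* Wait step of the original loop: always MONITOR, then MWAIT. */
static int wait_unguarded(struct mon_model *m)
{
    model_monitor(m);
    return model_mwait(m);
}

/* Wait step of the fixed loop: skip the wait if a request is posted. */
static int wait_guarded(struct mon_model *m)
{
    if (m->flag)
        return 0;
    model_monitor(m);
    return model_mwait(m);
}
```

If the request is stored while the monitor is not armed, wait_unguarded() reports that the thread would sleep, whereas wait_guarded() returns immediately. A store that lands between the flag check and MONITOR would still be missed by the guarded form. Arming the monitor first and re-checking the flag before MWAIT is the usual way to close that remaining window.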
Another thing is the schedule() can not be removed, because
(2) To isolate the system CPU from all other processes, the kernel
boot option "isolcpus=1" can force the scheduler not to schedule
any further processes onto the system CPU (the same effect as setting
affinity in code, but this way is safer and cleaner)
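For reference, setting affinity in code can look like the following user-space sketch. It is not the project's kernel-side code (which uses kthread_bind and htmigr_ioctl). The three helper names are assumptions, and the sketch only shows the standard sched_setaffinity(2) and sched_getaffinity(2) calls applied to the calling thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Lowest-numbered CPU the calling thread may run on, or -1 on error. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restrict the calling thread to a single CPU. Returns 0 on success. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Number of CPUs the calling thread may run on, or -1 on error. */
static int allowed_cpu_count(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

Affinity set this way has to be applied to every process individually, whereas isolcpus keeps the scheduler away from the system CPU for all ordinary processes from boot onwards.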
Synchronization Problem
This project involves more than one CPU, so synchronization problems arise whenever two CPUs are active at the same time.
Spin Locks
Read/Write Spin Locks
Seqlocks
Read-Copy Update (RCU): rcu_read_lock();
Reference: http://www.informit.com/articles/article.aspx?p=414983
Different classes of system call
Non-blocking system call:
getpid system call
* Ultimate purpose
Improve the performance of virtual machines, because the current I/O consumes too much time.
Keywords: hypervisor, Xen