Enhance Linux System Call in Hyper-Threading CPU

Taken over from Vasily Tarasov on 2008.4.1
Last update 2008.4.19

An HT processor has two logical processors (or hardware threads). Our plan is to dedicate one logical processor (say LP1) to an OS process that runs constantly, and the other logical processor (LP2) to application processes. Every time an application process makes a system call, the OS process notices it, services the system call, and returns the result to the application process. A shared memory region is set up to pass the input arguments and results of system calls between the application process and the OS process. In the normal case, the OS process sits in a polling loop; when the application process makes a system call, it in turn sits in a polling loop waiting for the system call to complete. The important point is that no context switch takes place during a system call, so its latency is expected to be much lower.

There are four main tasks in this project:
(1) Implement the basic design above so that the application process and the OS process can pass data and control to each other.
(2) Whenever a new application process is scheduled on LP2, the context of the OS process (CR3) should be set to that of the new application process. This allows the OS process to directly access the memory of the application process currently running on LP2.
(3) Reduce the CPU consumption of the OS process during normal operation (when there is no system call) by forcing it to yield in some way.
(4) For system calls that may block, use the standard way of invoking system calls, because latency is not important for them.

Linux System Call

When a user process makes a system call, Linux does the following:
(1) Issue the system call via int 0x80 (a trap gate in the Interrupt Descriptor Table)
* for example, a process calls getpid()
(2) Switch to kernel mode
cs: from the user code segment to the kernel code segment
ss: from the user stack segment to the kernel stack segment
cr3: does not change when switching to kernel mode
(3) Save the current context (pushed onto the kernel stack)
the system call number, passed in %eax, is saved (the interrupt vector itself is always 0x80)
eflags, cs, eip, ss, esp are saved by the hardware control unit
the registers used later by the handler are saved
* steps 2 and 3 are defined in the system_call section of linux/arch/i386/kernel/entry.S; the kernel has already installed it in the IDT with set_system_gate(0x80, &system_call)
(4) Invoke the service routine identified by the system call number in %eax and return the result in %eax
* for example, sys_getpid() is invoked; note that these sys_-prefixed functions are not exported as kernel symbols. On the other hand, if the user calls gettimeofday(), it triggers sys_gettimeofday(), which in turn calls the exported function do_gettimeofday()
(5) Restore the previous context (popped from the kernel stack)
the reverse of step (3)
(6) Switch back to user mode (cs, ss)
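
As a user-space illustration of step (1), the sketch below issues getpid through the generic syscall() wrapper and not through the getpid() library function. It is a minimal example assuming a Linux system with glibc. The helper name raw_getpid is ours, and the C library, not this code, chooses the actual trap instruction (int 0x80 on old i386 builds).

```c
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <sys/types.h>
#include <unistd.h>        /* syscall(), getpid() */

/* Issue getpid by system call number through syscall(2).
 * Assumption: glibc selects the trap mechanism for the platform;
 * the int 0x80 sequence is not hand-coded here. */
static long raw_getpid(void)
{
    long pid = syscall(SYS_getpid);

    assert(pid > 0);   /* getpid cannot fail */
    return pid;
}
```

Calling raw_getpid() goes through the whole path (1) to (6) described above, which is the cost the fast system call design tries to avoid.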

Fast System Call

In our fast system call design, however, steps (1), (2), (3), (5) and (6) are not necessary.
We can skip them for the following reasons:
(a) A system process (SP) runs on one processor and the application process (AP) on another.
(b) SP polls a flag in a shared memory region to wait for requests from AP, and AP polls a flag in the same region to get the response from SP.
(c) AP passes three things to SP through the shared memory region:
- system call number
- address of the current process pointer (current)
- address of the parameters
The addresses here are linear addresses in the process address space of AP.
Since cr3 does not change when switching to kernel mode, a linear address used in user mode can still be used by kernel mode routines.
When we need to pass a long result from SP to AP, we do the following.
In SP, we retrieve the cr3 value from the current->mm->pgd field, which refers to the Page Global Directory (PGD) of AP.
Then we replace the cr3 of SP with that of AP and invoke the desired system call service routine on SP.
Afterwards, we use the copy_to_user kernel routine to pass the result back to AP.
(With the same linear addresses and page tables, this emulates SP and AP running on the same CPU.)

The Linux kernel assigns each process a Page Global Directory (PGD) for paging. One part of the PGD, the part that maps the kernel (3G to 4G), is the same in all processes. The explanation above holds if we assume the following:
the kernel part of the PGD is identical in SP and AP, even though SP and AP have two different cr3 values (because they run on different CPUs).
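
The polling handshake in (b) and (c) can be tried out in user space. The sketch below is only a simulation under stated assumptions: two pthreads stand in for SP and AP, a plain struct stands in for the shared memory region, and the names demo_region, REQ_* and DEMO_GETPID are hypothetical. It does not touch cr3 or page tables.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

enum { REQ_IDLE, REQ_PENDING, REQ_DONE, REQ_STOP };  /* flag states (hypothetical) */
enum { DEMO_GETPID = 1 };                            /* hypothetical call number */

struct demo_region {
    atomic_int flag;     /* polled by both sides */
    int syscall_num;     /* written by AP before the request is raised */
    long result;         /* written by SP before the request is completed */
};

/* SP side: poll the flag, service a pending request, publish the result. */
static void *sp_loop(void *arg)
{
    struct demo_region *r = arg;

    for (;;) {
        int f = atomic_load_explicit(&r->flag, memory_order_acquire);
        if (f == REQ_STOP)
            return NULL;
        if (f != REQ_PENDING)
            continue;                       /* keep polling */
        r->result = (r->syscall_num == DEMO_GETPID) ? (long)getpid() : -1;
        atomic_store_explicit(&r->flag, REQ_DONE, memory_order_release);
    }
}

/* AP side: raise a request and poll until SP marks it done. */
static long ap_call(struct demo_region *r, int num)
{
    long res;

    r->syscall_num = num;
    atomic_store_explicit(&r->flag, REQ_PENDING, memory_order_release);
    while (atomic_load_explicit(&r->flag, memory_order_acquire) != REQ_DONE)
        ;                                   /* no context switch is requested */
    res = r->result;
    atomic_store_explicit(&r->flag, REQ_IDLE, memory_order_relaxed);
    return res;
}

/* Start an SP thread, make one call from the AP side, stop the SP thread. */
static long fast_call_demo(int num)
{
    struct demo_region r;
    pthread_t sp;
    long res;

    atomic_init(&r.flag, REQ_IDLE);
    r.syscall_num = 0;
    r.result = 0;
    if (pthread_create(&sp, NULL, sp_loop, &r) != 0)
        return -1;
    res = ap_call(&r, num);
    atomic_store_explicit(&r.flag, REQ_STOP, memory_order_release);
    pthread_join(sp, NULL);
    return res;
}
```

The release store on the flag and the acquire load on the other side make sure that the request fields and the result are visible before the flag change is observed, which is the ordering the real shared region also needs between the two logical processors.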

OS Setup

(1) Get the kernel source (currently ubuntu-7.10-server edition)
sudo apt-get build-dep linux-source-2.6.22
apt-get source linux-source-2.6.22
(2) Configure the kernel

make mrproper
cp /boot/config-`uname -r` ./.config

Run make menuconfig and make sure the following options are enabled:

CONFIG_ACPI
CONFIG_SMP
CONFIG_X86_SMP
CONFIG_X86_HT

(3) To build and install the kernel, execute these three commands:

make all
make modules_install
make install

(4) Add it to the boot menu
vi /boot/grub/menu.lst

Detail of Implementation

ACPU: application CPU
SCPU: system CPU
* htctl.c 
(1) create a /dev/htctl device file
- register with the following operations for character device
static struct file_operations htctl_fops = {
        .owner          = THIS_MODULE,
        .unlocked_ioctl = htctl_ioctl,
        .mmap           = htctl_mmap,
};
- htctl_ioctl: lets the user process issue the ioctl system call. (The actual system calls are identified by watching the shared memory region and are dispatched by htctl_ioctl.) The sub-function htctl_get_handler finds the actual command issued by the user process.
- htctl_mmap: lets the user process map the shared region
(2) create the shared memory region (struct shared_region *to_kernel)
struct shared_region
{
    int flag;
    unsigned long currentptr;
    int syscall_num;
    union {
        struct {
            struct timeval *tv;
            struct timezone *tz;
        } gettimeofday;
        struct {
            int    pid;
        } getpid;
    };
};
- kmalloc() returns the starting linear address of a physically contiguous memory region.
- SetPageReserved(virt_to_page(virt_addr)) reserves the pages in kernel space.
- some redundant pages are allocated before and after the shared region
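
The size arithmetic behind the last point can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the page size and the number of redundant pages are parameters because the notes do not state the values used in htctl.c.

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole pages needed to hold `size` bytes. */
static size_t pages_for(size_t size, size_t page_size)
{
    assert(page_size > 0);
    return (size + page_size - 1) / page_size;
}

/* Bytes to allocate: the payload pages plus `guard` redundant pages
 * before and after them. */
static size_t padded_alloc_size(size_t size, size_t page_size, size_t guard)
{
    return (pages_for(size, page_size) + 2 * guard) * page_size;
}

/* Byte offset of the usable region inside the padded allocation. */
static size_t region_offset(size_t page_size, size_t guard)
{
    return guard * page_size;
}
```

For example, with 4096-byte pages and one redundant page on each side, a shared_region of a few dozen bytes needs three pages, and the usable region starts 4096 bytes into the allocation.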

* When we call mmap in user space as follows
 shm_fd = open("/dev/htctl", O_RDWR, 0);
 to_kernel =  mmap(0, sizeof(struct shared_region),
                (PROT_READ | PROT_WRITE),
                MAP_FILE | MAP_SHARED, shm_fd, 0);
Then sys_mmap2() is called in the kernel; it prepares a linear address interval and then invokes htctl_mmap() in htctl.c from the following place:
http://lxr.linux.no/linux+v2.6.22.14/mm/mmap.c#L1101
Afterwards, we use the following to map the physical address of shared_region_area, which belongs to the kernel thread, into the vma (linear address interval) of the user process.
ret = remap_pfn_range (vma,
   vma->vm_start,
   virt_to_phys((void*)((unsigned long)shared_region_area))
                             >> PAGE_SHIFT,
   vma->vm_end-vma->vm_start,
   PAGE_SHARED);
(3) htioctl_register
- maintain a LIST_HEAD structure and insert the possible ioctl operations (as htioctl_info structures) that the user process may issue on the device file
struct htioctl_info {
        unsigned cmd;
        int (*ioctl)(struct file *, unsigned int, unsigned long);
        struct module *owner;
        struct list_head list;
};
- cmd is used by the user process to pass the command number to the kernel. For example, cmd can be HTTHREAD_IOCTL_CMD, which is generated by the following
#define HTTHREAD_IOCTL_CMD      _IOR(0x0, 0x0, struct htthread_ioctl_args)
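
The register-then-dispatch idea can be sketched in user space as below. This is not the code in htctl.c: a hand-rolled singly linked list replaces the kernel's list_head, the handler takes only the argument, and every demo_ name is hypothetical. Returning -ENOTTY for an unknown command follows the usual ioctl convention.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_handler {
    unsigned cmd;                        /* command number */
    int (*ioctl)(unsigned long arg);     /* routine that serves it */
    struct demo_handler *next;
};

static struct demo_handler *demo_handlers;   /* head of the handler list */

/* Insert a handler; refuse a command number that is already taken. */
static int demo_register(struct demo_handler *h)
{
    struct demo_handler *p;

    for (p = demo_handlers; p != NULL; p = p->next)
        if (p->cmd == h->cmd)
            return -EBUSY;
    h->next = demo_handlers;
    demo_handlers = h;
    return 0;
}

/* Look up the handler for cmd and call it. */
static int demo_dispatch(unsigned cmd, unsigned long arg)
{
    struct demo_handler *p;

    for (p = demo_handlers; p != NULL; p = p->next)
        if (p->cmd == cmd)
            return p->ioctl(arg);
    return -ENOTTY;                      /* no handler for this command */
}

/* Example handler used for illustration only. */
static int demo_double(unsigned long arg)
{
    return (int)(arg * 2);
}
```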
* htthread.c
All the system call routines are implemented here, and they are later inserted into the list by htioctl_register. The cmd above corresponds to the following functions inside htthread.c
(1) htthread_ioctl: create the kernel thread on SCPU
- htthread_ioctl
//create kernel thread running on htthread_loop function
htthread_system = kthread_create(htthread_loop, NULL, "htthread_system");
//bind the thread to SCPU
kthread_bind(htthread_system, STCPUNUM);
//insert the process into run queue and change its state to TASK_RUNNING
wake_up_process(htthread_system);
- htthread_loop
To yield the CPU when the kernel thread has nothing to handle, we use the following in htthread_loop.
if(!to_kernel->flag) {
  // watches the flag in to_kernel
  __monitor((void *)&to_kernel->flag, 0, 0);
  // wait until some changes happen
  __mwait(0, 0);
}
... //checks the flag in to_kernel modified by user process, and service the system call
// yield to other processes
schedule();
// also refer to http://softwarecommunity.intel.com/Wiki/DevelopforCoreprocessor/284.htm
- Reference for monitor & mwait: MWAIT--Monitor_WaitGuideline
  MONITOR sets up an address range for the monitor hardware using the content of EAX as a logical address and resets the monitor event pending flag (of CPU). The memory address range should be within memory of the write-back caching type. A store to the specified address range will set the monitor event pending flag. The content of ECX and EDX are used to communicate other information to the MONITOR instruction.
  MWAIT takes the argument in EAX as a hint extension and is architected to take the argument in ECX as an instruction extension.

// MWAIT EAX, ECX
{
    WHILE (! ("monitor_event_pending_flag" OR "monitor_not_active")) {
        implementation_dependent_optimized_state(EAX, ECX);
    }
    Clear monitor_event_pending_flag;
}

- from asm-i386/processor.h
static inline void __monitor(const void *eax, unsigned long ecx,
                             unsigned long edx)
{
        /* "monitor %eax,%ecx,%edx;" */
        asm volatile(
                ".byte 0x0f,0x01,0xc8;"
                : :"a" (eax), "c" (ecx), "d"(edx));
}

static inline void __mwait(unsigned long eax, unsigned long ecx)
{
        /* "mwait %eax,%ecx;" */
        asm volatile(
                ".byte 0x0f,0x01,0xc9;"
                : :"a" (eax), "c" (ecx));
}
(2) htmigr_ioctl: migrate all processes to ACPU
(3) htshmem_ioctl: not implemented
(4) getcurrent_ioctl: get the value of current (current process pointer)
Because the value of "current" changes from time to time (for example after a transition from user space to kernel space), the only way for a user process to get it is to ask the kernel while the process is running. It does so by issuing an ioctl command to the kernel; the value returned is simply the one held in the "current" variable in kernel context.
(5) unhtthread_ioctl: stop the kernel thread
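
The wait in the kernel thread depends on the MONITOR/MWAIT instructions, and the whole design depends on Hyper-Threading being present. Both can be checked from user space with CPUID leaf 1 (ECX bit 3 for MONITOR/MWAIT, EDX bit 28 for HTT). The sketch below assumes GCC or Clang with <cpuid.h> on x86; the function names are ours, and on other architectures the functions simply report -1.

```c
#include <assert.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* Read one feature bit of CPUID leaf 1.
 * use_ecx != 0 selects ECX, otherwise EDX.
 * Returns 1 or 0, or -1 if CPUID leaf 1 is not available. */
static int cpuid1_bit(int use_ecx, unsigned bit)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return (int)(((use_ecx ? ecx : edx) >> bit) & 1u);
#else
    (void)use_ecx;
    (void)bit;
    return -1;
#endif
}

/* CPUID.01H:ECX bit 3: MONITOR/MWAIT supported. */
static int cpu_has_monitor_mwait(void)
{
    return cpuid1_bit(1, 3);
}

/* CPUID.01H:EDX bit 28: HTT flag. */
static int cpu_has_htt(void)
{
    return cpuid1_bit(0, 28);
}
```

Note that a virtual machine may hide the MONITOR/MWAIT bit even when the physical CPU supports it, so a 0 here is not conclusive about the hardware.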

Problem when enhancing CPU idle performance

(1) monitor/mwait problem
First look at the following code; the kernel thread runs this loop to wait for user process requests.
---------------------------------
while(1) {
  monitor(flag);
  mwait();
  if(flag == 1) {
     service_the_syscall();
     flag = 0;
  }
  schedule();
}
----------------------------------
monitor/mwait only leaves the wait state if
- the flag changes (either from 1 to 0, or from 0 to 1)
- an interrupt comes in
As a result, the kernel thread can stall for a very long time if the following steps happen consecutively:
- the kernel thread starts this loop
- it executes monitor/mwait (flag is 0 here)
- an interrupt arrives (e.g. the timer interrupt, 250 times every second)
- the "if(flag == 1)" check is evaluated and skipped (flag is 0)
- the user process issues a system call (flag is set to 1)
- monitor/mwait waits on the flag (flag is already 1, so the kernel thread hangs here because the user process has already made its request; only an interrupt can get the thread out of the mwait state)
Thus, we need to check the flag just before entering monitor/mwait, as follows
------------------------------------------
  if (!flag) {
     monitor(flag);
     mwait();
  }
----------------------------------------
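
A variant of the check above that can be run in user space is sketched below. It is not the project's code: demo_monitor and demo_mwait are hypothetical stubs (the real instructions are only usable in the kernel), and the second test of the flag after arming the monitor is our addition. That second test follows the usual MONITOR/MWAIT usage of re-checking the condition once the monitor is armed, so that a store landing between the first test and monitor() is not missed.

```c
#include <assert.h>

static int demo_mwait_calls;   /* how many times the stub "slept" */

/* Hypothetical stand-ins for __monitor() and __mwait(). */
static void demo_monitor(const volatile int *addr)
{
    (void)addr;                /* the real instruction arms the address range */
}

static void demo_mwait(void)
{
    demo_mwait_calls++;        /* the real instruction waits for a store or interrupt */
}

/* One iteration of the service loop.
 * Returns 1 if a request was served, 0 otherwise. */
static int demo_serve_once(volatile int *flag)
{
    if (!*flag) {
        demo_monitor(flag);    /* arm the monitor first */
        if (!*flag)            /* re-check after arming (our addition) */
            demo_mwait();
    }
    if (*flag == 1) {
        /* service the system call here */
        *flag = 0;
        return 1;
    }
    return 0;
}
```

When the flag is already 1 on entry, the loop body serves the request without entering the wait at all, which is the behaviour the fix is meant to guarantee.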
Another thing is the schedule() can not be removed, because

(2) To isolate the system CPU from all other processes, the kernel boot option "isolcpus=1" keeps the scheduler from placing any further processes on the system CPU (the same effect as the code that sets affinity, but safer and cleaner).
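
For comparison with the boot option, this is roughly what setting affinity looks like for a single process in user space. It is a sketch assuming Linux with glibc (the affinity calls need _GNU_SOURCE); the function name is hypothetical, and it pins to the first CPU the caller is already allowed to use, since the CPU numbers of ACPU and SCPU depend on the machine. The project's own migration code lives in the kernel and is not shown here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for sched_setaffinity() and CPU_* macros */
#endif
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to the first CPU in its current affinity mask.
 * Returns that CPU number, or -1 on error. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed, one;
    int cpu;

    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        if (sched_setaffinity(0, sizeof(one), &one) != 0)
            return -1;
        return cpu;
    }
    return -1;
}
```

Unlike isolcpus, this only moves the calling thread; every other process still has to be moved away from the system CPU separately.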

Synchronization Problem

This project deals with more than one CPU, so synchronization problems arise as soon as two CPUs are involved.
Spin Locks
Read/Write Spin Locks
Seqlocks
Read-Copy Update (RCU): rcu_read_lock();
Reference: http://www.informit.com/articles/article.aspx?p=414983
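
To make the first item concrete, here is a user-space sketch of a spin lock protecting a counter that two threads update. It is an analogy only: it uses C11 atomic_flag and pthreads, not the kernel's spin_lock() API, and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_flag demo_lock = ATOMIC_FLAG_INIT;
static long demo_counter;      /* protected by demo_lock */

static void demo_spin_lock(void)
{
    /* spin until the flag was clear before we set it */
    while (atomic_flag_test_and_set_explicit(&demo_lock, memory_order_acquire))
        ;
}

static void demo_spin_unlock(void)
{
    atomic_flag_clear_explicit(&demo_lock, memory_order_release);
}

/* Increment the shared counter n times under the lock. */
static void *demo_worker(void *arg)
{
    long n = *(long *)arg;
    long i;

    for (i = 0; i < n; i++) {
        demo_spin_lock();
        demo_counter++;
        demo_spin_unlock();
    }
    return NULL;
}

/* Run two workers concurrently and return the final counter value. */
static long demo_run_two_workers(long n)
{
    pthread_t a, b;

    demo_counter = 0;
    if (pthread_create(&a, NULL, demo_worker, &n) != 0)
        return -1;
    if (pthread_create(&b, NULL, demo_worker, &n) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return demo_counter;
}
```

Without the lock the two increments can interleave and lose updates; with it the final value is always twice the per-thread count.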

Different classes of system call

Non-blocking system call: getpid system call

* Ultimate purpose
Improve the performance of virtual machines, because I/O currently consumes too much time.
Keywords: hypervisor, Xen