Linux Internals

1 Survival Kit

2 An Operating System, What for?

3 System Calls

3.1 Standards

3.1.1 The C Standard Library

3.1.2 POSIX

3.2 Principles and Implementation

3.2.1 Kernel (or system) mode and user mode

3.3 POSIX Essentials

3.3.1 POSIX Standard

3.3.2 System Calls Essentials

4 Files and File Systems

4.1 Inode

4.2 Structure and Storage

4.2.1 Mounting

/etc/fstab

4.3 Kernel Structures

4.4 System Calls

4.4.1 Inode Manipulation

stat()

access()

link(), symlink(), unlink()

chown()

chmod()

umask()

mknod()

open(), creat()

close()

lseek()

read(), write()

fread(), fwrite(), fopen(), fclose()

4.4.2 File Descriptor Manipulation

dup()

4.5 Directories

4.6 Extended File Descriptor Manipulation

4.7 Device-Specific Operations

ioctl()

4.8 I/O Redirection

dup(int fd), dup2(int fd1, int fd2)

4.9 Communication Through a Pipe

pipe(int p[2])

mkfifo()(3)

5 Processes and Memory Management

5.1 Process Abstraction

brk(), sbrk()

5.2 Introduction to Memory Management

malloc()

5.3 Process Implementation

5.4 States and Scheduling

5.5 Programmer Interface

fork()

execve()

fcntl()

_exit() and exit()

5.6 Process Genealogy

5.7 Daemons, Sessions and Groups

getpgid(int pid), getpgrp()

setsid()

6 Process Event Flow

6.1 Monitoring Processes

wait(), waitpid()

6.2 Signals

kill()

signal() (ISO C)

pause()

6.3 Typical Applications

alarm()

sleep() (ISO C)

(g/s)etitimer

6.4 Advanced Synchronization With Signals

sigaction()

7 Communication and Synchronization

7.1 Message Queues

mq_open()

mq_send(), mq_receive(), mq_getattr(), mq_setattr(), mq_close(), mq_unlink()

7.2 Advanced Memory Management

mmap()

munmap()

7.3 Shared Memory Segments

shm_open(), shm_unlink() (3)

8 Concurrency and Mutual Exclusion

8.1 Mutual Exclusion in Shared Memory

8.1.1 Dekker's Algorithm

8.1.2 Peterson's Algorithm

For Two Processes

For n Processes

8.1.3 Other Algorithms

8.1.4 Shared Memory Consistency Model

8.2 Semaphores

Lock

Semaphore

Read-Write Semaphore

C std lib primitives

8.3 Mutual Exclusion and Deadlocks

8.3.1 Dining Philosophers Problem

8.3.2 How to Avoid Deadlocks

8.4 File Locks

Fcntl

8.5 System V IPC

9 Threads

9.1 Applications

9.2 Principles

Thread vs. Process

9.3 Programmer Interface

Example

Primitives

pthread_create

pthread_init_attr, pthread_attr_destroy

9.4 Threads and Signals

9.6 Threads and Mutual Exclusion

9.7 Logical Threads vs. Hardware Threads

10 Network Interface

10.1 Principles

10.2 Connectionless Communications

10.3 Connection-Based Communications

10.4 Programmer Interface

10.5 Threaded Server Model

10.6 Distributed Systems

11 Kernel Design

11.1 Interrupts and Exceptions

11.1.1 Interrupts

11.1.2 Exceptions

11.2 Low-Level Synchronization

11.3 Low-Level Input/Output

11.4 Devices and Driver Model

11.5 File Systems and Persistent Storage

11.6 Memory Management

11.7 Process Management and Scheduling

11.8 Operating System Trends

11.9 Alternative Operating System Designs

12 Introduction to Virtual Machines

12.1 Modern Applications

12.1.1 Machine-level virtualization

12.1.2 Processor-level virtualization

12.1.3 System-level virtualization

12.1.4 Language-level virtualization

12.2 Challenges of Virtual Machine Monitors

12.3 Historical Perspective

12.4 Classification

13 Appendix A: List of System Calls

13.1 Files Manipulation

13.2 Processes

14 Standard C Library Functions

14.1 File Manipulation

[Made a set in Quizlet for system calls.]

[Notes mostly on http://www.enseignement.polytechnique.fr/informatique/INF583/]

[See also Understanding the Lunix Kernel, 3rd ed. Read until 1.5 An Overview of the Linux Filesystem excluded]

1 Survival Kit

2 An Operating System, What for?

3 System Calls

3.1 Standards

3.1.1 The C Standard Library

[https://en.wikipedia.org/wiki/C_standard_library]

A ISO standard (also called ISO C lib, or ANSI C lib?), which specify an API, which actually contains only 29 headers. The C library POSIX specification is a superset of it, which specifies Unix-specific functionality (ex: unistd.h, signal.h).

Documented in the man pages, section 3.

See also man 7 glibc, man 7 libc

Some implementations:

Both Unix and C were created at AT&T's Bell Laboratories in the late 1960s and early 1970s.

A header vs. standard list: http://www.schweikhardt.net/identifiers.html. From this list, it seems the C POSIX library is just the POSIX specification for system calls. Official list of headers in the C POSIX lib: http://pubs.opengroup.org/onlinepubs/9699919799/idx/head.html 

3.1.2 POSIX

[https://en.wikipedia.org/wiki/POSIX]

IEEE standard aiming at compatibility between OSs. Developed at the same time as ANSI C (C89), first version in 1988. Most recent version is POSIX.1 2013

Contains C API and shell utilities.

The standard is available here: http://pubs.opengroup.org/onlinepubs/9699919799/ 

Has 4 volumes (see top left column).

Glibc implements part of POSIX

  1. 3.2 Principles and Implementation

Are part of the C Standard library(?), but the C std lib also include mere library routines, like malloc. Some of those routine may call system calls.

If you download the C library source, it seems most system calls are unimplemented and throw ENOSYS. Is the kernel just an implementation of those system calls?

Man page sections:

[Glibc source code: https://www.gnu.org/software/libc/download.html. (checked out in ~/code/major-open-source/glibc, made eclipse project)]

[copy-paste]{

#define __NR_chmod 90

#define SYS_chmod __NR_chmod

extern int __dup (int __fd);

libc_hidden_proto (__dup)

}

3.2.1 Kernel (or system) mode and user mode

= modes of operation of the CPU, tracked in some register, which place restrictions on the type and scope of operations that can be performed.

see http://www.linfo.org/kernel_mode.html:

When the CPU is in kernel mode, it is assumed to be executing trusted software, and thus it can execute any instructions and reference any memory addresses (i.e., locations in memory). The kernel (which is the core of the operating system and has complete control over everything that occurs in the system) is trusted software, but all other programs are considered untrusted software. Thus, all user mode software must request use of the kernel by means of a system call in order to perform privileged instructions, such as process creation or input/output operations.

Wrapper’s tasks :

  1. Move parameters from the user stack to processor registers.  Passing arguments through registers is easier than playing with both user and kernel stacks at the same time (about stacks, see chapter about Processes and Memory Management)
  2. Switch to kernel mode and jump to the system call handler Call processor-specific instruction (trap, sysenter, ...)
  3. Post-process the return value and compute errno . Linux: typically negate the value returned by the service function

Handler’s tasks :

  1. Save processor registers into the kernel mode stack Call the service function in the kernel
  2. Linux: array of function pointers indexed by system call number
  3. Restore processor registers
  4. Switch back to user mode Call processor-specific instruction (rti, sysexit, ...)

  1. 3.3 POSIX Essentials

= Portable Operating System Interface

3.3.1 POSIX Standard

The standard itself must be purchased on IEEE website here.

Includes:

POSIX is portable and does not evolve much, but it is still too high level for many OS interactions. E.g., it does not specify file systems, network interfaces or power management.

Other Linux standard exist for that:

C applications deal with portability with:

3.3.2 System Calls Essentials

Returns int or long:

errno:

        

if (mysyscall() == -1) {

                printf(errno);

}

System types:

Interrupted System calls:

$ strace ./myProgram

execve("./hello", ["./hello"], [/* 36 vars */])        = 0

brk(0)                                                        = 0xf7f2e000

access("/etc/ld.so.nohwcap", F_OK)                        = -1 ENOENT (No such file or

  directory)

...


4 Files and File Systems

In UNIX, everything is a file…

See also man 7 file-hierarchy

4.1 Inode

inode = data structure to store file metadata. Name of data structure is actually stat (see man 2 stat for the sys call and data struct). Set of minimum attributes (defined by POSIX), which must be provided by any FS. For disk-based filesystems, usually corresponds to a file control block stored on disk.

What is a directory? (assumption)

A file containing a list of:

Note on hard and soft links:

Consequences:

        

4.2 Structure and Storage

ext4 FS simplified (see also man ext4, which documents a long list of flags -> not so interesting, https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout):

1 data block or physical record = (usually?) 512B. All block devices have blocks. Speeds up I/O. Reads/writes per whole blocks. Block size in FS may be different to physical block size.

-> there is the inode on disk and the inode data structure = stat.

Boot block:

Super block:

In ext4, every Inode has a table of block addresses [sl. 49/6]:

Q/ Does that mean data is fragmented?

Yes, it can be, but Linux tries to reduce it.

4.2.1 Mounting

mount -t type device dir

Previous content + owner + mode of dir become invisible

Device = either:

Types depend of platform. Ex: adfs, autofs, auto, cifs, ext4, hfs, hpfs, msdos, nfs, ntfs, proc (for the special proc mount point), ramfr, romfs, smbfs, sysv, usbfs, vfat…

Mount with no arguments shows a list of mounted FS: rootfs, sysfs, udev… Also available in /proc/mounts

...XXX

/etc/fstab

man fstab

List of file systems maintained by a human administrator. Can be used by mount to mount those FS more easily, or mount them automatically at boot?

First column is either special file path, UUID or label.

...

4.3 Kernel Structures

cf. Understanding the Linux Kernel 12.2.6 p. 478/496

file descriptor = a non-negative integer corresponding to an index in a descriptor table. returned by open(), argument of read()

The open file table is system-wide. Entry = {offset, opening flags from open()}. Names for its entries:

# descriptors / # openings = nb of arrows pointing to this entry. Allows to free the entry when not referred any longer.

To list the FD of a process with ID 1234, ls /proc/1234/fd

ls -l /proc/1234/fd

lrwx------ root root ….. 0 -> /dev/null

lrwx------ root root ….. 1 -> /dev/null

File locks: see ch. 8

4.4 System Calls

4.4.1 Inode Manipulation

l-prefix variant (ex: lstat()): do not follow soft links.

f-prefix variant (ex: fstat()): operate on file descriptor instead of file path. /!\ some f- variants actually are C standard library routines: fread, fwrite, fopen

-at suffix variant (ex: openat): instead of pathname arg, takes dirfd + rel path. Rationale: see notes at the end of man 2 open:

stat()

int stat(const char *path, struct stat *inode);

stat = inode data struct -> get the inode

access()

int access(const char *pathname, int mode);

check whether calling process can access file with mode mode = R_OK, W_OK, X_OK of F_OK (for mere existence check).

link(), symlink(), unlink()

int link(const char *oldpath, const char *newpath);

int unlink(const char* path)

(cannot hard link directory, to maintain tree structures of the directories)

unlink actually used to delete a file (but not directories: use rmdir())

chown()

int chown(const char *path, uid_t owner, gid_t group);

chmod()

int chmod(const char *path, mode_t mode);

change access rights, using masks

Ex: mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH; // 0644

umask()

A process has a umask. Files created with mknod, open, mkdir, creat, etc. have mode m & ~umask (where m is the mode passed as argument).

A common value is 022 = does not allow to create files writable by grp and others.

mknod()

int mknod(const char *pathname, mode_t mode, dev_t dev);

Make an inode, mostly used to create device special files or pipes. (Use open() to create a regular file)

NB: there is also a mknod command:

mknod name type major minor

with type ∈ {b (=block = buffered), c or u (=character = unbuffered), p (= pipe = FIFO => major and minor not needed)}.  

dev: use makedev(unsigned int maj, unsigned int min)(3) to create. Cf. also major(dev_t), minor(dev_t).

Mode must or one of those: S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO, S_IFSOCK

open(), creat()

int open(const char *pathname, int flags, [mode_t mode]);

int creat(const char *pathname, mode_t mode);

Return fd (the lowest available).

Flag must include one of O_RDONLY, O_WRONLY, O_RDWR + optionally any of these:

creat() = O_CREAT | O_WRONLY | O_TRUNC

Some interesting errors:

See also the notes at the end of the man page…

close()

int close(int fd);

Allow fd to be reused. If open file table entry is not used anymore, it is freed. If file was unlinked and has no path reference left and this was the last file descriptor to it, the file is deleted.

/!\ Any locks on the associated files is removed, regardless of the fd used to obtain the lock, and even if the file is still pointed to by another fd.

lseek()

off_t lseek(int fd, off_t offset, int whence)

Set offset in open file table entry. Whence belongs to {

Result of those last two depend on file system, because holes are managed by FS.

If new offset is after end of file, file size does not change until data is written. If data is written, that creates a hole on FS that support it (i.e., the hole is not actually always filled with zeros on the device).

Notes: some devices are incapable of seeking.

read(), write()

ssize_t read(int fd, void *buf, size_t count);

ssize_t write(int fd, const void *buf, size_t count);

ssize_t = signed size type, because can return -1 in case of error

Attempts to read (resp. write) up to count bite, returns number of bytes effectively read (resp. written) (may stop reading resp. writing because of EOF, signal, reading from socket…). Read returns 0 if at EOF.

Reading errors

fread(), fwrite(), fopen(), fclose()

[TD 1]

FILE *fopen(const char *path, const char *mode);

FILE *fdopen(int fd, const char *mode);

size_t fread(void *ptr, size_t size, size_t nmemb, FILE * stream );

size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);

int fclose(FILE *fp);

Not system calls but part of the Standard C Library. Use FILE* instead of fd, i.e. uses file streams: enables buffering of IO. Flushing the stream (with fclose of fflush) ensures data is written to the stream, but not necessarily to the file yet: synchronisation between stream and file is done by kernel.

4.4.2 File Descriptor Manipulation

/!\ Use fdopen() and fileno() to get a file C library FILE* from a file descriptor and reciprocally, but do not mix C library and system call I/O on the same file (because of C library internal buffers)

dup()

will see when study redirections

4.5 Directories

Part of C Standard Library.

Types: 

Primitives

See also man 7 path_resolution

4.6 Extended File Descriptor Manipulation

fcntl(int fd, int cmd, [... /*  args */])

cmd allows to manipulate attributes of the entries in the open file table (see 4.3)?

Type of args depends on the command.

cmd (with the form F_CMD(arg type)):

4.7 Device-Specific Operations

ioctl()

int ioctl(int fd, unsigned long request, ...);

... is traditionally a char *argp. fd is an open fd.

For other operations than read and write, for example: eject a disk., control a terminal.

Request is device-dependent. List in man 2 ioctl_list, but not documented...

4.8 I/O Redirection

dup(int fd), dup2(int fd1, int fd2)

If dt is the descriptor table of a process, copy dt[fd] in some dt[newfd] and returns newfd. -> used for redirection. newfd is either the lowest available (dup) or fd2 (dup2), which is closed beforehand if necessary. The new descriptor's close-on-exec flag is reset (to set it, use dup3).

Bildschirmfoto 2016-08-06 um 10.59.52.png

4.9 Communication Through a Pipe

Usual pipe is

but can also be

Limited capacity (65536 bytes (= 16 * 4kB pages) on Linux). fcntl(F_SETPIPE_SZ) can adjust pipe size.

Bildschirmfoto 2016-08-06 um 11.03.24.png

NB: There is a FIFO inode, but as for pipes, data is in memory. They have "kernel persistence" (= stored in the memory of the kernel?).

pipe(int p[2])

stores a pair of fd in p, one with O_RDONLY, the other with O_WRONLY

See also man 7 pipe.

(because execve preserves file descriptors, see 5.5)

mkfifo()(3)

int mkfifo(const char* pathname, mode_t mode)

(front-end to mknod)

(Mode modified by the process's umask.)

FIFO supports rendez-vous protocol: open at one ends blocks until other end also opened, except if O_NONBLOCK: opening O_RDONLY succeeds, O_WRONLY fails with ENXIO.

See also man 7 fifo.


5 Processes and Memory Management

5.1 Process Abstraction

In the kernel: process descriptor, …

See slide 91 and search

Memory segments:

[http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory/, provides also many possibly-interesting pieces of info which I did not write down]

Notes on some segments:

Segment frontiers are randomized in Linux for security reasons.

Memory-related faults:

Segments are divided in areas, ex:

Areas can be examined in /proc/<pid>/maps

brk(), sbrk()

int brk(void *end_data_segment);

void *sbrk(intptr_t displacement);

move the program break = end of heap (referred to as "end of data segment" in man pages, but some people use data segment to refer to data segment + BSS + heap [see Gustavo Duarte blog post])

(only used in the kernel)

Actually allocate or deallocate memory to the process.

On success, sbrk returns previous program break = pointer to the start of the newly allocated segment memory (=> sbrk(0) can be used to simply get the program break).

See also man 3 end, edata, etext.

How to get the addresses of each segment:

double t[0x02000000]

void segments()

{

static int s = 42;

void *p = malloc(1024);

printf( "stack: %010p \n",                                 &p)

printf( "brk: %010p \n",                                 sbrk(0))

printf( "heap: %010p \n",                                 p)

printf( "static (bss): %010p \n",                         t)

printf( "static (initialized): %010p \n",         &s)

printf( "text : %010p \n",                                 segments)

}

Stack                                0xbff86fe0

Brk                                0x1806b000

Heap                                0x1804a008

Static (bss)                         0x08049720

Static (initialized)                 0x080496e4

Text                                0x080483f4

5.2 Introduction to Memory Management

[Notes here also include some of 7.2 Advanced Memory Management]

Virtual address mechanism

Page = fixed size memory region (2n, e.g. 4 KB)

Translation mechanism: not clear… By CPU (Memory Management Unit (MMU) + Translation Lookaside Buffer (TLB))

The following is architecture dependent (valid at least for x86)?

Bildschirmfoto 2016-09-13 um 08.57.40.png

One paging control register (PCR) in the processor, pointing to the beginning of the page directory.

At context switch, the content of this register is modified (+TLB is flushed) => one page directory per process.

In memory, there is one page directory and several page tables per process.

1 page directory size = 1 page table size = 1024 * 32-bits entries = 1 page.

Translation:

This memory translation is done in hardware in Memory Management Unit (MMU). This is further speeded up by Translation Look-aside Buffer (TLB).

.INF583_05_TLB.png

See /proc/<pid>/pagemap

A page table entry stores the physical address of a page, i.e. a page-aligned address => there are 12 bits available to contain page status:

mem_map_t structure stores info about used physical pages:

[See also 2.4 in https://www.kernel.org/doc/gorman/html/understand/understand005.html for more details]

Saving resources and enhancing performance :

Software caches: use memory pages to cache files or block devices (...XXX)

malloc()

void *malloc(size_t size)

void *calloc                        allocate several objects

void *realloc(void *ptr, size_t size)

void *valloc                         aligns allocated memory on page boundary

void  free(void *ptr)                 ptr must be return value of previous malloc. If not, you

have a nasty bug

How it works:

valgrind tool: detects memory leaks, phantom pointers, corrupt calls to free, and other bugs…

5.3 Process Implementation

[See also http://www.makelinux.net/books/lkd2/ch03lev1sec1]

Process descriptor task_struct is a fairly big struct (1.7 kB for 32 bit arch). Stored in the task list = circular doubly-linked list.

[the rest of the section is not interesting, will be described in more details further]

5.4 States and Scheduling

Bildschirmfoto 2016-07-23 um 14.08.27.png

Bildschirmfoto 2016-07-23 um 14.15.27.png

5.5 Programmer Interface

fork()

pid_t fork(void);

Child identical to parent:

/!\ Child has single thread (the one that forked), even though memory is exactly the same => pthread_atfork(3) may be handy.

This is a system call. There is also a glibc wrapper. See also clone() (for both process and thread)...XXX

Ex:

switch (cpid = fork()) {

        case -1:

                …

                exit(-1);

        case 0:

                continue_child();

                break;

        default:

                continue_parent(cpid);

                break;

}

execve()

int execve(const char *filename, char *const argv[], char *const envp[]);

Arguments:

Does not return on success, instead “replaces” the process (returns -1 on error, e.g. EACCESS, ENOEXEC):

Variants:

int execl(const char *path, const char *arg, ...);

int execv(const char *path, char *const argv[]);

int execlp(const char *file, const char *arg, ...);

int execvp(const char *file, char *const argv[]);

int execle(const char *path, const char * arg, ..., char *const envp[]);

int execve(const char *filename, char *const argv[], char *const envp[]);

execl*:         arg list terminated with null

exec*p:         file in the $PATH

Environment manipulation:

Simply declare: extern char **environ (but recommended use only as execve argument)

fcntl()

int fcntl(int fd, int cmd, [long arg]);

Commands:

_exit() and exit()

void _exit(int status);

Argument:

Terminates process:

Never fails and does not return!

void exit(int status);

-> Calls in addition methods registered with atexit() (in reverse order of registration).

5.6 Process Genealogy

Swapper process (pid 0)

init process (pid 1)

5.7 Daemons, Sessions and Groups

Ex: inetd: (sl.126/45 …)

About PID, PPID, PGID, SID, see man 7 credentials

[https://www.win.tue.nl/~aeb/linux/lk/lk-10.html]

Process sessions and process groups:

[cf. Understanding the Linux Kernel 1.6.7.2 p. 29/47]

getpgid(int pid), getpgrp()

setsid()

6 Process Event Flow

6.1 Monitoring Processes

wait(), waitpid()

pid_t wait(int *status); /* waits for any child */

pid_t waitpid(pid_t pid, int *status, int options);

Monitor state changes (i.e. all sorts of signals received by the child, not just termination) of one or more children, and return PID of the child whose state has changed.

If a child terminates, it remains in a zombie state until wait() is performed to retrieve its state (and free the associated process descriptor)

pid special values:

status can be inspected with macros:

6.2 Signals

[see also man 7 signal]

Signals can have signal handlers to “catch” them (runs in user mode).

Asynchronous:

/!\ No queueing of pending signals.

[signal(7)]

POSIX defines:

Signal disposition = process (default?) behavior when delivered the signal = one of:

Can be changed with sigaction(2).

By default, the signal handler is invoked on the normal process stack.  It is possible to arrange that the signal handler uses an alternate stack; see sigaltstack(2) for a discussion of how to do this and when it might be useful.

...

kill()

int kill(pid_t pid, int sig);

syscall (also shell command) used to send any signal.

signal() (ISO C) (deprecated)

typedef void (*sighandler_t)(int);

sighandler_t signal(int signum, sighandler_t handler);

Deprecated with multithread: use sigaction() (see 6.4)

If signal received during handler execution, deferred.

pause()

Suspend a process until it is delivered a signal (except if this signal is SIG_IGNed).

6.3 Typical Applications

alarm()

deliver SIGALRM after a delay (/!\ default action is to terminate).

sleep() (ISO C)

(g/s)etitimer

interval timer…

6.4 Advanced Synchronization With Signals

What happens if CONT signal somehow delivered before the thread pauses? Deadlock! → Flow in ISO C signals (see 10.4 in Advanced Programming in the UNIX Environment).

sigaction()

int sigaction(int signum, const struct sigaction *act, struct sigaction *oldact);

struct sigaction {

void     (*sa_handler)(int);

void     (*sa_sigaction)(int, siginfo_t *, void *);

sigset_t   sa_mask;

int        sa_flags;

};

signum is any valid signal except SIGKILL and SIGSTOP.

...XXX

Signals in Terminals

Ctrl-Z         → SIGTSTP        tty stop

fg         → SIGCONT        continue

Ctrl-C        → SIGINT        Interrupt

7 Communication and Synchronization

7.1 Message Queues

= sth yet different as signals and FIFOs.

[see also mq_overview(7)]

In Linux: a pseudo-FS mounted at /dev/mqueue

Bildschirmfoto 2016-09-13 um 08.50.08.png

mq_open()

mqd_t mq_open(const char *name, int flags);

mqd_t mq_open(const char *name, int flags, mode_t mode, struct mq_attr *attr);

Arguments:

mq_send(), mq_receive(), mq_getattr(), mq_setattr(), mq_close(), mq_unlink()

7.2 Advanced Memory Management

mmap()

void *mmap(void *start, size_t length, int protection, int flags, int fd, off_t offset);

Map file or device into memory. Transparent synchronization between memory page and file.

Facilitates sharing of memory pages: use file name to identify shared memory region.

Arguments:

Returns: the start address of the memory region, or MAP_FAILED (= (void *)-1)

munmap()

int munmap(void *start, size_t length);

Unmaps. Synchronizes any pending changes (see also msync()). NB: closing a fd does not unmap the region.

7.3 Shared Memory Segments

mmap allows sharing memory region, but uses disk space for the actual file. => Use shared memory system calls. They actually are like mmap(MAP_SHARED) but mapping special files from pseudo FS /dev/shm.

See shm_overview(7)...XXX

shm_open(), shm_unlink() (3)

int shm_open(const char *name, int flags, mode_t mode);

Creates and opens a new, or opens an existing, POSIX shared memory object.

Flag = one of (O_RDONLY, O_RDWR) OR any of:

Returns a fd with FD_CLOEXEC set.

Errors:

8 Concurrency and Mutual Exclusion

Critical section = program section accessing shared resource.

Mutual exclusion = make sure at most one process or thread enters a critical section.

Properties to guarantee:

8.1 Mutual Exclusion in Shared Memory

8.1.1 Dekker's Algorithm

[wikipedia]

Uses 3 variables: wants_to_enter[0], wants_to_enter[1] and turn.

A process must:

But still, why exactly this way? I think the answer is in the original papers: http://www.cs.utexas.edu/users/EWD/transcriptions/EWD01xx/EWD123.html. This may be interesting too: http://cs.stackexchange.com/questions/12621/contrasting-peterson-s-and-dekker-s-algorithms 

variables

        wants_to_enter : array of 2 booleans

        turn : integer

    wants_to_enter[0] ← false

    wants_to_enter[1] ← false

    turn ← 0   // or 1

p0:

   wants_to_enter[0] ← true

   while wants_to_enter[1] {

      if turn ≠ 0 {

         wants_to_enter[0] ← false

         while turn ≠ 0 {

           // busy wait

         }

         wants_to_enter[0] ← true

      }

   }

   // critical section

   ...

   turn ← 1

   wants_to_enter[0] ← false

   // remainder section

p1:

   wants_to_enter[1] ← true

   while wants_to_enter[0] {

      if turn ≠ 1 {

         wants_to_enter[1] ← false

         while turn ≠ 1 {

           // busy wait

         }

         wants_to_enter[1] ← true

      }

   }

 

   // critical section

   ...

   turn ← 0

   wants_to_enter[1] ← false

   // remainder section

Problems:

8.1.2 Peterson's Algorithm

Original paper: http://cs.nyu.edu/~lerner/spring12/Read03-MutualExclusion.pdf 

For Two Processes

The wrong algorithm with a mutex variable, for process 1:

while (mutexBelongsTo != -1) {} // wait while mutex belongs to someone

mutexBelongsTo = 1 // take the mutex

// critical section

mutexBelongsTo = -1 // Give away the mutex

=> Correctness is not satisfied because of race condition.

=> need to know:

=> use variable isApplying and turn, arbitrarily initialized to 0.

For process 1:

isApplying[1] = 1;

while (isApplying[0] && turn == 0) {} // process 0 is applying and it is its turn, so wait

// Now it is my turn or process 0 stopped applying so I can go

// Critical section

// Give away the turn and stop applying

turn = 0;

isApplying[1] = 0;

This one above I found myself, and it is quite similar to the Peterson's algorithm (btw, there is an error in the slides, see below):

For process 1:

isApplying[1] = 1;

turn = 0; // Process 1 says: "Please go ahead process 0"

while (isApplying[0] && turn == 0) { } // same as above

// Critical section

isApplying[1] = 0;

For n Processes

[Wikipedia]

The level variables take on values up to N−1, each representing a distinct "waiting room" before the critical section.[6] Processes advance from one room to the next, finishing in room N−1 which is the critical section. Specifically, to acquire a lock, process i executes:

for ℓ from 0 to N−1 exclusive
   level[i] ← ℓ
   last_to_enter[ℓ] ← i
   
while last_to_enter[ℓ] = i and there exists k ≠ i, such that level[k] ≥ ℓ
       
Wait

8.1.3 Other Algorithms

8.1.4 Shared Memory Consistency Model

Defined by when a write is visible by the other processes.

Dekker and Peterson require the strongest consistency model: sequential consistency = (according to Leslie Lamport):

The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”

But usually, consistency is weaker, because of:

=> Solution = hardware support:

8.2 Semaphores

A lock or mutex is useful when processes are competing for only one resource.

=> A semaphore is like a lock for several resources: it has some "counter".

8.2.1 Lock

Implementable with test-and-set:

void lock(volatile int *lock_pointer) {

    while (test_and_set(lock_pointer) == 1);

}

void unlock(volatile int *lock_pointer) {

    *lock_pointer = 0 // Release lock

}

int lock_variable = 1;

void lock_example() {

    lock(&lock_variable);

    // Critical section

    unlock(&lock_variable);

}

8.2.2 Semaphore

void init(semaphore s, int v) {

    s.value = v;

}

void procure(semaphore s) { // also called down() or wait()

    wait until (s.value > 0);

    s.value--;     // Must be atomic with the previous test

}

void vacate(semaphore s) { // also called up(), post() or signal()

    s.value++;     // Must be atomic

}

8.2.3 Read-Write Semaphore

Allowing multiple readers and a single writer:

void init(rw_semaphore l) {

    l.value = 0;   // Number of readers (resp. writers)

                   //   if positive (resp. negative)

}

void procure_read(rw_semaphore l) {

    wait until (l.value >= 0);

    l.value++;     // Must be atomic with the previous test

}

void vacate_read(rw_semaphore l) {

    l.value--;     // Must be atomic

}

void procure_write(rw_semaphore l) {

    wait until (l.value == 0);

    l.value = -1;  // Must be atomic with the previous test

}

void vacate_write(rw_semaphore l) {

    l.value = 0;

}

8.2.4 C std lib primitives

man 7 sem_overview

As with shm_ primitives, can be named = associated to a file /dev/shm.

sem_t * sem_open   (const char *name, int flags, [mode_t mode, unsigned int value]);

int     sem_wait   (sem_t *sem);

int     sem_post   (sem_t *sem);

int     sem_close  (sem_t *sem);

int     sem_unlink (const char *name);

Arguments flags and mode allow for a subset of their values for open():

close(): Undefined behavior when closing a semaphore other processes are currently blocked on.

Other calls:

8.3 Mutual Exclusion and Deadlocks

8.3.1 Dining Philosophers Problem

By Dijkstra and Hoare

Showing deadlock when more than one processes want to access more than one resources each.

Each philosopher first take the chopstick on their left, then attempt to take the one on their right => deadlock.

8.3.2 How to Avoid Deadlocks

8.4 File Locks

=> to get a write / read lock, file must be open for writing / reading.

=> When inode removed on file closure, locks are also deleted. But actually, closing any fd on such a file releases the locks, which is unwanted.

Locks are advisory = not enforced. The program must comply with the lock rule, i.e. not write or read when it does not have the lock.

Locks are local on some file region = record = file segment.

Inheritance:

fcntl

int fcntl(int fd, int cmd, ... /* arg */ );

With:

F_SETLK (struct flock *)

Acquire a lock (when l_type is F_RDLCK or F_WRLCK) or release a lock (when l_type is F_UNLCK) on the bytes specified by the l_whence, l_start, and l_len fields of lock. If a conflicting lock is held by another process, this call returns -1 and sets errno to EACCES or EAGAIN.

F_SETLKW (struct flock *)

Same as for F_SETLK, but wait. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)).

F_GETLK (struct flock *)

On input to this call, lock describes a lock we would like to place on the file. If the lock could be placed, fcntl() does not actually place it, but returns F_UNLCK in the l_type field of lock and leaves the other fields of the structure unchanged. If one or more incompatible locks would prevent this lock being placed, then fcntl() returns details about one of these locks in the l_type, l_whence, l_start, and l_len fields of lock and sets l_pid to be the PID of the process holding that lock.

8.5 System V IPC

9 Threads

9.1 Applications

Servers (web, database...), computational (numerical simulation, signal processing…)...

9.2 Principles

POSIX thread = logical thread.

Linux implementation: NPTL (Native POSIX Thread Library)

All memory segments are in common (data, heap, code), except stacks.

Kernel responsible for scheduling, mapping on hardware threads

man 7 pthreads

Compile with -pthread

9.2.1 Thread vs. Process

In the kernel memory: (XXX review each unclear point. This list is also in man 7 pthreads)

Some common attributes:

Distincts attributes:

9.3 Programmer Interface

9.3.1 Example

#include <pthread.h>

#include <stdio.h>

#include <stdlib.h>

#include <unistd.h>

#include <string.h>

#include <errno.h>

#include <sys/times.h>

#define NTHREADS 5

void *thread_func(void *num) {

    int i = *(int *)num;

    printf("Thread %d\n", i); // Or pthread_self()

    // ...

    // More thread-specific code

    // ...

    pthread_exit(NULL);

}

pthread_t threads[NTHREADS];

int main(int argc, char *argv[]) {

    pthread_attr_t attr;

    int i, error;

    for (i = 0; i < NTHREADS; i++) {

        pthread_attr_init(&attr);

        int *ii = malloc(sizeof(int)); *ii = i;

        error = pthread_create(&threads[i], &attr, thread_func, ii);

        if (error != 0) {

            fprintf(stderr, "Error in pthread_create: %s \n",

                    strerror(error));

            exit(1);

        }

    }

    for (i=0; i < NTHREADS; i++) {

        error = pthread_join(threads[i], NULL);

        if (error != 0) {

            fprintf(stderr, "Error in pthread_join: %s \n", strerror(error));

            exit(1);

        }

    }

}

9.3.2 Primitives

pthread_create

int pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);

Create and start a thread.

pthread_t: opaque type for the thread ID. Also returned by pthread_t pthread_self()(3), compare with int pthread_equal(pthread_t, pthread_t)(3).

attr: see pthread_init_attr(). Can be NULL.

On error, errno is not set.

Thread termination is either:

When thread terminates, its resources are still there until thread is detached or joined, see below.

Inheritance:

Errors:

Joinable vs. detached:

Thread stack size: default 2 MB, or RLIMIT_STACK in NPTL. See also pthread_attr_setstacksize(3).

pthread_init_attr, pthread_attr_destroy

int pthread_attr_init(pthread_attr_t *attr);

int pthread_attr_destroy(pthread_attr_t *attr);

Initialize attr with default values. Then fields can be modified individually using primitives like pthread_attr_*(). See See Also section.

...XXX

(Undefined behavior if attr has already been initialized. OK if has been destroyed.)

Once attr no longer needed, should be destroyed.

pthread_exit

void pthread_exit(void *retval);

Second arg of pthread_join is set to retval.

Called implicitly if thread function returns, with arg = the return value.

Never returns.

Does some cleanup:

pthread_join

int pthread_join(pthread_t thread, void **thread_return);

Waits until either:

If calling thread is canceled, thread remains joinable.

Some errors:

pthread_detach

int pthread_detach(pthread_t thread);

Mark thread as detached. Can alternatively be made detached at creation using pthread_attr_setdetachstate(3).

Can't be made joinable again afterwards. Detaching already detached thread => unspecified behavior.

9.3.3 Thread-Specific Data (TSD)

For thread-specific objects that need to be available across function call ~= global data, but per thread.

Finalization...XXX

pthread_key_create

...XXX

9.4 Threads and Signals

pthread_kill

…XXX

pthread_sigmask

…XXX

sigwait

...XXX

9.5 Threads and Mutual Exclusion

9.5.1 Mutex

pthread_mutex_init

pthread_mutex_lock, pthread_mutex_unlock, pthread_mutex_trylock, pthread_mutex_timedlock

9.5.2 R/W Lock

pthread_rwlock_*

9.5.3 Condition Variable

9.5.4 Futex

9.6 Logical Threads vs. Hardware Threads

...

10 Network Interface

10.1 Principles

The layers…

10.2 Connectionless Communications

10.3 Connection-Based Communications

10.4 Programmer Interface

10.5 Threaded Server Model

10.6 Distributed Systems

11 Kernel Design

11.1 Interrupts and Exceptions

11.1.1 Interrupts

Received by the processor from a piece of hardware or software.

=> Involves an electrical signal called interrupt request (IRQ).

Lowest level hardware synchronization mechanism.

Requires immediate attention: processor switches to kernel mode and calls a specific interrupt service routine or interrupt handler.

Can be masked, except for non-maskable interrupts (NMI).

Multiple drivers may share a single IRQ line → IRQ handler must identify the source of the interrupt to call the proper service routine

11.1.2 Exceptions

Are originally raised by CPU. CPU switches to kernel mode and calls a specific exception handler or exception service routine, which often sends signal to faulty process (e.g. SIGFPE if division by 0).

[see also Understanding the Linux Kernel, 3rd ed., 4.5]

"Mechanism to implement system call"-> Does that mean a syscall always starts with an exception to switch to kernel mode? Se maybe http://stackoverflow.com/questions/19057503/switching-from-user-mode-to-kernel-mode XXX

11.2 Low-Level Synchronization

11.3 Low-Level Input/Output

11.4 Devices and Driver Model

2 Types: block- or character-oriented, accessed through block or character special files, located in /dev:

Type

Ex device

Ex device special file

Block

Disks and other memories

/dev/had, /dev/sdb2, /dev/md1…

Character

serial ports, console terminals, audio

/ dev/tty0, /dev/pts/0, /dev/usb/hiddev0, /dev/mixer, /dev/null

Major and minor numbers: to (logically) project device drivers to device special files

rest is not so clear… See also 4.24 in Advanced Programming in the UNIX Environment.

11.5 File Systems and Persistent Storage

11.6 Memory Management

11.7 Process Management and Scheduling

11.8 Operating System Trends

11.9 Alternative Operating System Designs

12 Introduction to Virtual Machines

12.1 Modern Applications

12.1.1 Machine-level virtualization

Ex: VMware, Parallel Desktop, Virtual Box

12.1.2 Processor-level virtualization

Ex: VMware, Virtual Box, QEMU, Rosetta

Basically translates instructions from guest processor to host processor.

12.1.3 System-level virtualization

Ex: Xen (linux only)

ease OS development…

12.1.4 Language-level virtualization

Ex: Java, .NET

12.2 Challenges of Virtual Machine Monitors

12.3 Historical Perspective

12.4 Classification

13 Appendix A: List of System Calls

man section: 2

13.1 Files Manipulation

stat()

Get the inode

access()

check access permission

link(), symlink(), unlink()

chown()

chmod()

mknod()

open(), close()

may also create a new file

create()

read(), write()

provides unbuffered I/O: need to provide manually set buffer and size to read. For buffered I/O, use standard C library

lseek()

move r/w pointer in file

fcntl()

manipulate file descriptor: duplicate (slightly != dup()), flags, locking, manage signals, leases...

dup()

duplicate a file descriptor

13.2 Processes

fork()

exec()

exit()

14 Standard C Library Functions

man section: 3

14.1 File Manipulation

see also ‘man stdio’

(f)getc(), (f)gets(),(f)getw()

read character, entire line (‘s’ = string) or word

(f)putc(), (f)puts(), (f)putw()

printf(), fprintf(), sprintf(), vprintf()

formatted print to stdout, f- variant print to file, other variants have just different signatures…

fread(), fwrite()

same as read()/write(), but with FILE * instad of fd

fopen(), fdopen()

returns FILE * instead of a fd

fseek()