A ISO standard (also called ISO C lib, or ANSI C lib?), which specify an API, which actually contains only 29 headers. The C library POSIX specification is a superset of it, which specifies Unix-specific functionality (ex: unistd.h, signal.h).

Documented in the man pages, section 3.

3.1.2 POSIX

[https://en.wikipedia.org/wiki/POSIX]

IEEE standard aiming at compatibility between OSs. Developed at the same time as ANSI C (C89), first version in 1988. Most recent version is POSIX.1 2013

Contains C API and shell utilities.

The standard is available here: http://pubs.opengroup.org/onlinepubs/9699919799/

Has 4 volumes (see top left column).

Glibc implements part of POSIX

3.2 Principles and Implementation

Are part of the C Standard library(?), but the C std lib also include mere library routines, like malloc. Some of those routine may call system calls.

If you download the C library source, it seems most system calls are unimplemented and throw ENOSYS. Is the kernel just an implementation of those system calls?

Man page sections:

2 = system calls, see man -k . | grep "(2)"
3 = library calls, see man -k . | grep "(3)"

[Glibc source code: https://www.gnu.org/software/libc/download.html. (checked out in ~/code/major-open-source/glibc, made eclipse project)]

[copy-paste]{

All system calls defined in OS-specific header file. Linux: /usr/include/sys/syscall.h (which includes /usr/include/bits/syscall.h) (exact location depends on distro)

Location of header files is configured in gcc, see this. Similarly location of libraries like libc is configured in ld, see this. For example /lib

In debian:

/usr/include/syscall.h -> sys/syscall.h
/usr/include/x86_64-linux-gnu/sys/syscall.h ->

asm/unistd.h = /usr/include/x86_64-linux-gnu/asm/unistd.h -> asm/unistd_64.h: list of

#define __NR_chmod 90

bits/syscall.h: list of

#define SYS_chmod __NR_chmod

In glibc:

include/syscall.h -> misc/syscall.h -> sys/syscall.h: this one does not exist. Probably refers to sysdeps/*/sys/syscall.h after compilation.
sysdeps/generic/sys/syscall.h: empty
sysdeps/mach/sys/syscall.h: empty with comment saying calls are in sysdeps/mach/syscall_sw.h, but this does not exist
sysdeps/unix/sysv/linux/sys/syscall.h ->

asm/unistd.h: there 3 unistd.h, but no asm dir

include/unistd.h:

Includes posix/unistd.h
List of declaration like:

extern int __dup (int __fd);

libc_hidden_proto (__dup)

posix/unistd.h:

Incude features.h
Many constants and others

include/bits/unistd.h -> posix/bits/unistd.h
include/sys/unistd.h -> posix/sys/unistd.h
posix/bits/unistd.h: many stuffs
posix/sys/unistd.h -> unistd.h

bits/syscall.h: not found. Lots of bits directories:

One for each sysdeps/<proc-arch>
One in include, string, io, locale, math...

Syscall.h defines some SYS_syscallname macros as numbers, used then in the 3rd box in below diagram
C library wraps processor-specific parts into a plain function

}

3.2.1 Kernel (or system) mode and user mode

= modes of operation of the CPU, tracked in some register, which place restrictions on the type and scope of operations that can be performed.

see http://www.linfo.org/kernel_mode.html:

When the CPU is in kernel mode, it is assumed to be executing trusted software, and thus it can execute any instructions and reference any memory addresses (i.e., locations in memory). The kernel (which is the core of the operating system and has complete control over everything that occurs in the system) is trusted software, but all other programs are considered untrusted software. Thus, all user mode software must request use of the kernel by means of a system call in order to perform privileged instructions, such as process creation or input/output operations.

Wrapper’s tasks :

Move parameters from the user stack to processor registers. Passing arguments through registers is easier than playing with both user and kernel stacks at the same time (about stacks, see chapter about Processes and Memory Management)
Switch to kernel mode and jump to the system call handler Call processor-specific instruction (trap, sysenter, ...)
Post-process the return value and compute errno . Linux: typically negate the value returned by the service function

Handler’s tasks :

Save processor registers into the kernel mode stack Call the service function in the kernel
Linux: array of function pointers indexed by system call number
Restore processor registers
Switch back to user mode Call processor-specific instruction (rti, sysexit, ...)

3.3 POSIX Essentials

= Portable Operating System Interface

3.3.1 POSIX Standard

The standard itself must be purchased on IEEE website here.

Includes:

UNIX commands and shell scripts
The OS "API" (= the system calls themselves?)
Inode attributes

POSIX is portable and does not evolve much, but it is still too high level for many OS interactions. E.g., it does not specify file systems, network interfaces or power management.

Other Linux standard exist for that:

Linux Standard Base (LSB),
Filesystem Hierarchy Standard (FHS)...

C applications deal with portability with:

C preprocessor: conditional compilation
Multi-target Makefile rules or GNU configure scripts to generate Makefiles
Shell environment variables (LD LIBRARY PATH, LD PRELOAD), etc.

3.3.2 System Calls Essentials

Returns int or long:

>= 0: successful
-1: error

errno:

#include <errno.h>

if (mysyscall() == -1) {

printf(errno);

}

Error codes: cf. ‘man errno’. Ex: EPERM, ENOENT, ESRCH, EINTR, EIO, ECHILD, EACCESS, EAGAIN/EWOULDBLOCK…
Print error message: perror() (see also strerror())

System types:

#include <sys/types.h>
Ex: clock_t, dev_t, uid_t, gid_t, pid_t, mode_t, sigset_t, size_t, time_t

Interrupted System calls:

Can be interrupted by signals but not hardware interrupt
2 possible behaviors:

Fail with EINTR
Does not fail but return info about partial execution, ex: read(), write()

$ strace ./myProgram

execve("./hello", ["./hello"], [/* 36 vars */]) = 0

brk(0) = 0xf7f2e000

access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or

directory)

...

4 Files and File Systems

In UNIX, everything is a file…

4.1 Inode

inode = data structure to store file metadata. Name of data structure is actually stat (see man 2 stat for the sys call and data struct). Set of minimum attributes (defined by POSIX), which must be provided by any FS. For disk-based filesystems, usually corresponds to a file control block stored on disk.

Inode number: ls -i
Number of hard links (they all share the same inode)
Size:

File length in bytes
Block size for file system I/O: give the "preferred” blocksize for efficient file system I/O. Writing in smaller chunks may cause inefficient read-modify-rewrite.
Number of 512B blocks allocated (may be smaller than #bytes/512 if file has holes): ls -s

Device identifier (DID) of device containing the file (see also 4.24 in Advanced Programming in the UNIX Environment): type dev_t -> use major() and minor() macros
If this is a special file representing a device, DID of this device
User identifier (UID, file owner)

User group identifier (GID, user group of the file)

Timestamps:

last status change (any change in the inode, e.g., creation): ls -lc
last modification: ls -l
last access: ls -lu

Access rights and file mode. Itself a data struct which contains the following info:

Access rights:

whether user/group/others have read/write/exec access

File mode: consists of several flags (see man 2 stat for details):

File type. Excerpt from ls man page:

- Regular file.
b special file which represents a block device (usually in /dev, see 11.4)
c special file which represent a character device (usually in /dev, see 11.4)
d Directory.
l Symbolic link.
s Socket
p FIFO, or rather, named pipe

suid (or setuid flag): if set, then when file executed, gets uid of the owner instead of user which executes the file.

ls /usr/bin/passwd: -rwsr-xr-x. This s means both file executable + suid set. If file not executable but suid set, then would be S.
Chmod u+s, chmod g+s, chmod +s = chmod a+s gives rwsrwsrwx

sgid: idem with group. If file is a directory, means files created in dir have GID of dir and if those files are dirs, they also get the sgid. + Other special uses, see man page...

Would be -rwxr-s--- ?

Sticky bit: I think various meanings depending on platform / distribution. Ex Debian: on a dir, indicate that a file in the dir can be renamed or deleted only by the owner of the dir or file or a priviledged user. (else, anyone with write permission on the dir can do that)

ls -l: -rwxrwxrwt (t always appears in the 'other'section). Appears as T if 'others' don't have exec permission.
How to set: chmod +t, chmod o+t, cmod a+t (chmod u+t and chmod g+t do nothing

Access rights (r/w/x for u/g/o).

o = others excluding group. g = group exluding user. Means ---rwxrwx allows everything to everyone except user.
/!\ write permission of file does not determin rights to rename / delete the file. That is controlled by the write permission of the directory containing the file. If in addition sticky bit of directory is set, then only owner of dir or of file can rename (but must still have write permission as well) (only on some platforms?)
x for directory means searchable = as if you have access to the list of file names in the directory but not to their pointers to inodes (see "What is a directory below")?

Necessary to cd in directory.
Necessary to write file in dir
Not necessary to ls, but necessary to ls -l

Changing the mode or other inode content is doable only by file owner, regardless of access rghts.

What is a directory? (assumption)

A file containing a list of:

name1 : pointer -> inode1
name2 : pointer -> inode2
name3 : pointer -> inode3
...

Note on hard and soft links:

soft link: a separate file which contains the path to another file
hard link: an entry in a directory (which is itself a file) which points to the same inode as another file

Consequences:

soft links can be broken, by moving or removing pointed file. Hard links cannot.
hard links have the same file attributes: size, owner, access right, last status change/last modification…
cannot hard link directories. Otherwise, the directory tree would be a graph. But why is that a problem exactly?

4.2 Structure and Storage

ext4 FS simplified (see also man ext4, which documents a long list of flags -> not so interesting, https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout):

1 data block or physical record = (usually?) 512B. All block devices have blocks. Speeds up I/O. Reads/writes per whole blocks. Block size in FS may be different to physical block size.

-> there is the inode on disk and the inode data structure = stat.

Boot block:

Bootstrap mode and "bootable" flag
Link to data block holding boot code

Super block:

File system status (mount point)
Number of allocated and free nodes + link to lists of allocated and free nodes

In ext4, every Inode has a table of block addresses [sl. 49/6]:

Q/ Does that mean data is fragmented?

Yes, it can be, but Linux tries to reduce it.

4.2.1 Mounting

mount -t type device dir

Previous content + owner + mode of dir become invisible

Device = either:

Special file, using its path like /dev/sda1, or with -L its “filesystem label”/“partition label”, or with -U its UUID (see /proc/partitions?)
host:/path for NFS

Types depend of platform. Ex: adfs, autofs, auto, cifs, ext4, hfs, hpfs, msdos, nfs, ntfs, proc (for the special proc mount point), ramfr, romfs, smbfs, sysv, usbfs, vfat…

Mount with no arguments shows a list of mounted FS: rootfs, sysfs, udev… Also available in /proc/mounts

...XXX

/etc/fstab

man fstab

List of file systems maintained by a human administrator. Can be used by mount to mount those FS more easily, or mount them automatically at boot?

First column is either special file path, UUID or label.

...

4.3 Kernel Structures

cf. Understanding the Linux Kernel 12.2.6 p. 478/496

file descriptor = a non-negative integer corresponding to an index in a descriptor table. returned by open(), argument of read()…

The open file table is system-wide. Entry = {offset, opening flags from open()}. Names for its entries:

open file description
= open file objects
= file handle
= open file table entry
= struct file.

# descriptors / # openings = nb of arrows pointing to this entry. Allows to free the entry when not referred any longer.

To list the FD of a process with ID 1234, ls /proc/1234/fd

ls -l /proc/1234/fd

lrwx------ root root ….. 0 -> /dev/null

lrwx------ root root ….. 1 -> /dev/null

File locks: see ch. 8

4.4 System Calls

4.4.1 Inode Manipulation

l-prefix variant (ex: lstat()): do not follow soft links.

f-prefix variant (ex: fstat()): operate on file descriptor instead of file path. /!\ some f- variants actually are C standard library routines: fread, fwrite, fopen…

-at suffix variant (ex: openat): instead of pathname arg, takes dirfd + rel path. Rationale: see notes at the end of man 2 open:

Allow concurrent modification of the path behind dirfd
Allow implementation of a “per-thread current directory”

stat()

int stat(const char *path, struct stat *inode);

stat = inode data struct -> get the inode

access()

int access(const char *pathname, int mode);

check whether calling process can access file with mode mode = R_OK, W_OK, X_OK of F_OK (for mere existence check).

link(), symlink(), unlink()

int link(const char *oldpath, const char *newpath);

int unlink(const char* path)

(cannot hard link directory, to maintain tree structures of the directories)

unlink actually used to delete a file (but not directories: use rmdir())

chown()

int chown(const char *path, uid_t owner, gid_t group);

chmod()

int chmod(const char *path, mode_t mode);

change access rights, using masks

Ex: mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH; // 0644

umask()

A process has a umask. Files created with mknod, open, mkdir, creat, etc. have mode m & ~umask (where m is the mode passed as argument).

A common value is 022 = does not allow to create files writable by grp and others.

mknod()

int mknod(const char *pathname, mode_t mode, dev_t dev);

Make an inode, mostly used to create device special files or pipes. (Use open() to create a regular file)

NB: there is also a mknod command:

mknod name type major minor

with type ∈ {b (=block = buffered), c or u (=character = unbuffered), p (= pipe = FIFO => major and minor not needed)}.

dev: use makedev(unsigned int maj, unsigned int min)(3) to create. Cf. also major(dev_t), minor(dev_t).

Mode must or one of those: S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO, S_IFSOCK

open(), creat()

int open(const char *pathname, int flags, [mode_t mode]);

int creat(const char *pathname, mode_t mode);

Return fd (the lowest available).

Flag must include one of O_RDONLY, O_WRONLY, O_RDWR + optionally any of these:

O_APPEND: See also O_TRUNC
O_ASYNC: allow signal driven IO (for pipes, sockets, TTY…)
O_CLOEXEC: execve closes fd (not the default behavior). Sets a flag on the file descriptor, not in the open file table entry. This is afaik the only fd flag. See fcntl, which can also set / get this flag.
O_CREAT: if the file does not exist, it will be created. Then mode must be provided. See also O_EXCL.
O_DIRECT: try minimizing cache effects of the I/O to and from this file. See also O_DSYNC and O_SYNC
O_DIRECTORY: makes the call fail if pathname is not a directory
O_DSYNC: makes write synchronous: by the time writes returns, everything written is readable by another process
O_EXCL: O_CREAT must also be set: makes the call fail if file already exists.
O_LARGEFILE: for files more than 4GB (needs size represented by off64_t instead of off_t)
O_NOATIME: do not update access time
O_NOCTTY: if pathname is a tty, it will not become the process’s controlling terminal even if it does not have one.
O_NONBLOCK: if possible, neither open nor read/write will cause the process to wait
O_PATH: does not actually open the file, but get a fd that allows only inode operations
O_SYNC: Synchronous I/O
O_TMPFILE: creates a tmp file…
O_TRUNC: reverse of O_APPEND

creat() = O_CREAT | O_WRONLY | O_TRUNC

Some interesting errors:

EMFILE: process already has max nb of files open
EOVERFLOW: file size > (1<<31)-1 and application was compiles on 32-bit platform without flag -D_FILE_OFFSET_BITS=64

See also the notes at the end of the man page…

close()

int close(int fd);

Allow fd to be reused. If open file table entry is not used anymore, it is freed. If file was unlinked and has no path reference left and this was the last file descriptor to it, the file is deleted.

/!\ Any locks on the associated files is removed, regardless of the fd used to obtain the lock, and even if the file is still pointed to by another fd.

lseek()

off_t lseek(int fd, off_t offset, int whence)

Set offset in open file table entry. Whence belongs to {

SEEK_SET: new offset = offset arg
SEEK_CUR: new offset = current offset + offset arg
SEEK_END: new offset = file size + offset arg.
SEEK_DATA: new offset = next offset after offset arg which contains data
SEEK_HOLE: new offset = next hole after offset arg (which contains no data)

Result of those last two depend on file system, because holes are managed by FS.

If new offset is after end of file, file size does not change until data is written. If data is written, that creates a hole on FS that support it (i.e., the hole is not actually always filled with zeros on the device).

Notes: some devices are incapable of seeking.

read(), write()

ssize_t read(int fd, void *buf, size_t count);

ssize_t write(int fd, const void *buf, size_t count);

ssize_t = signed size type, because can return -1 in case of error

Attempts to read (resp. write) up to count bite, returns number of bytes effectively read (resp. written) (may stop reading resp. writing because of EOF, signal, reading from socket…). Read returns 0 if at EOF.

Reading errors

EINTR: interrupted before anything was read.
EAGAIN: non-blocking I/O was selected and no data was available.

fread(), fwrite(), fopen(), fclose()

[TD 1]

FILE *fopen(const char *path, const char *mode);

FILE *fdopen(int fd, const char *mode);

size_t fread(void *ptr, size_t size, size_t nmemb, FILE * stream );

size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);

int fclose(FILE *fp);

Not system calls but part of the Standard C Library. Use FILE* instead of fd, i.e. uses file streams: enables buffering of IO. Flushing the stream (with fclose of fflush) ensures data is written to the stream, but not necessarily to the file yet: synchronisation between stream and file is done by kernel.

4.4.2 File Descriptor Manipulation

/!\ Use fdopen() and fileno() to get a file C library FILE* from a file descriptor and reciprocally, but do not mix C library and system call I/O on the same file (because of C library internal buffers)

dup()

will see when study redirections

4.5 Directories

Part of C Standard Library.

Types:

DIR * (= a directory stream = something to read the entries of a directory, with an internal pointer that progresses from one entry to the next)
dirent * (= directory entry)

inode number
file name (POSIX requires it is the last field in dirent, since it is the only variable-length field). Usually there is a max possible length for it.
non-POSIX:

an "offset" which is actually not an offset but the the position within the dir (returned also by telldir(DIR*)(3)). Not an integer nor a file offset but an FS-dependent opaque value.
Type of file (only on some FS)
Length of the record:

Primitives

DIR *opendir(const char *name);

Returns NULL on error

struct dirent *readdir(DIR *dir);

Returns NULL at end of dir or if error

int closedir(DIR *dir);

Also closes the underlying fd associated to the dir.

4.6 Extended File Descriptor Manipulation

fcntl(int fd, int cmd, [... /* args */])

cmd allows to manipulate attributes of the entries in the open file table (see 4.3)?

Type of args depends on the command.

cmd (with the form F_CMD(arg type)):

F_DUPFD(int), F_DUPFD_CLOEXEC(int): duplicate fd by finding lowest available fd >= arg. != dup2. _CLOEXEC version sets close-on-exec flag on the duplicate file descriptor.
F_GETFD, F_SETFD(int): get / set the only file descriptor flag which is the close-on-exec flag.
F_GETFL, F_SETFL(int): get / set file status flag, stored in the open file table entry, set by open. Some flags in arg are ignored (access mode, O_CREAT, O_EXCL, O_NOCTTY, O_TRUNC). Can only modify O_APPEND, O_ASYNC, and so on.
Some locking commands. ...XXX
Some signal related commands. ...XXX
Leases commands: see locks? ...XXX
F_NOTIFY(int): notify directory change…XXX
F_GETPIPE_SZ, F_SETPIPE_SZ(int): Change the capacity of a pipe.

4.7 Device-Specific Operations

ioctl()

int ioctl(int fd, unsigned long request, ...);

... is traditionally a char *argp. fd is an open fd.

For other operations than read and write, for example: eject a disk., control a terminal.

Request is device-dependent. List in man 2 ioctl_list, but not documented...

4.8 I/O Redirection

dup(int fd), dup2(int fd1, int fd2)

If dt is the descriptor table of a process, copy dt[fd] in some dt[newfd] and returns newfd. -> used for redirection. newfd is either the lowest available (dup) or fd2 (dup2), which is closed beforehand if necessary. The new descriptor's close-on-exec flag is reset (to set it, use dup3).

Bildschirmfoto 2016-08-06 um 10.59.52.png

4.9 Communication Through a Pipe

Usual pipe is

anonymous = it persists only as long the process is running, created with pipe(2),

but can also be

named = usually named FIFO or named pipe, created with mkfifo (1,2).

Limited capacity (65536 bytes (= 16 * 4kB pages) on Linux). fcntl(F_SETPIPE_SZ) can adjust pipe size.

Blocking read (if empty) and write (if full) by default
Illegal to write into a pipe without reader (if the reading end fd has been closed for example): will cause a SIGPIPE signal to the writer process. If signal ignored, write() fails with EPIPE .

Write to a full pipe will block (except if O_NONBLOCK)

A pipe without writer (i.e. writer fd is closed) simulates end-of-file: read() returns 0.

Read of an empty pipe will block (except if O_NONBLOCK)

Writes of less than PIPE_BUF are atomic. (PIPE_BUF >= 512, = 4096 on Linux) Writes of more than PIPE_BUF may be interleaved with writes from other processes. If O_NONBLOCK, writes may fail with EAGAIN or write partially.
Other than O_NONBLOCK, O_ASYNC is the only meaningful flag that can be set on a pipe fd. O_ASYNC on read end of pipe => process receives SIGIO when new input becomes available in the pipe.

Bildschirmfoto 2016-08-06 um 11.03.24.png

NB: There is a FIFO inode, but as for pipes, data is in memory. They have "kernel persistence" (= stored in the memory of the kernel?).

pipe(int p[2])

stores a pair of fd in p, one with O_RDONLY, the other with O_WRONLY

mkfifo()(3)

int mkfifo(const char* pathname, mode_t mode)

(front-end to mknod)

(Mode modified by the process's umask.)

FIFO supports rendez-vous protocol: open at one ends blocks until other end also opened, except if O_NONBLOCK: opening O_RDONLY succeeds, O_WRONLY fails with ENXIO.

5 Processes and Memory Management

5.1 Process Abstraction

In the kernel: process descriptor, …

See slide 91 and search

Memory segments:

[http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory/, provides also many possibly-interesting pieces of info which I did not write down]

Notes on some segments:

Kernel space: contains the kernel stack of the process = the stack of system calls.
Stack:

One stack frame per function call
There is actually one per thread.
Dynamic growth: if stack too small, get page fault, which is handled by expanding stack. Stack can't grow beyond RLIMIT_STACK -> would trigger seg fault

Memory map segment: see mmap()

Can also be anonymous, i.e. does not actually map a file. E.g. when calling malloc() of size > MMAP_THRESHOLD (default 128 kB, adjustable via mallopt())

Heap: malloc() in C, new in Java and other languages...
BSS = historically Block Started by Symbol
Data segment: maps binary file privately, i.e. writes here won't update the binary file.

Segment frontiers are randomized in Linux for security reasons.

Memory-related faults:

Page fault:

Dynamic stack extension
...

Segmentation fault:

Stack extension beyond RLIMIT_STACK
Any access to unmapped virtual memory region which is not in the context of dynamic stack extension.
Read or write in kernel space
Writes to text segment
…

Segments are divided in areas, ex:

One per memory-mapped file
Dynamic libraries have extra areas similar to BSS and data.

Areas can be examined in /proc/<pid>/maps

brk(), sbrk()

int brk(void *end_data_segment);

void *sbrk(intptr_t displacement);

move the program break = end of heap (referred to as "end of data segment" in man pages, but some people use data segment to refer to data segment + BSS + heap [see Gustavo Duarte blog post])

(only used in the kernel)

Actually allocate or deallocate memory to the process.

On success, sbrk returns previous program break = pointer to the start of the newly allocated segment memory (=> sbrk(0) can be used to simply get the program break).

5.2 Introduction to Memory Management

[Notes here also include some of 7.2 Advanced Memory Management]

Virtual address mechanism

Page = fixed size memory region (2n, e.g. 4 KB)

Translation mechanism: not clear… By CPU (Memory Management Unit (MMU) + Translation Lookaside Buffer (TLB))

The following is architecture dependent (valid at least for x86)?

Bildschirmfoto 2016-09-13 um 08.57.40.png

One paging control register (PCR) in the processor, pointing to the beginning of the page directory.

At context switch, the content of this register is modified (+TLB is flushed) => one page directory per process.

In memory, there is one page directory and several page tables per process.

1 page directory size = 1 page table size = 1024 * 32-bits entries = 1 page.

Translation:

10 bits of address to refer to 1 of 1024 page directory entries
10 bits of address to refer to 1 of 1024 page table entries
12 bits of address to refer to 1 of 4096 page bytes
= 32 bits

This memory translation is done in hardware in Memory Management Unit (MMU). This is further speeded up by Translation Look-aside Buffer (TLB).

See /proc/<pid>/pagemap

A page table entry stores the physical address of a page, i.e. a page-aligned address => there are 12 bits available to contain page status:

_PAGE_PRESENT Page is resident in memory and not swapped out
_PAGE_PROTNONE Page is resident but not accessable
_PAGE_RW Set if the page may be written to
_PAGE_USER Set if the page is accessible from user space
_PAGE_DIRTY Set if the page is written to
_PAGE_ACCESSED Set if the page is accessed

mem_map_t structure stores info about used physical pages:

counter: how many users are mapping a physical page
age: timestamp for swapping heuristics: Belady algorithm
map nr: physical page number

[See also 2.4 in https://www.kernel.org/doc/gorman/html/understand/understand005.html for more details]

Saving resources and enhancing performance :

Demand-paging: delay allocation and mapping until the process accesses an address in the page.
Overcommitting (on Linux): malloc() of n*4 KB returns ok even is not n pages are available. When really runs out of free space, kills non-root processes.
copy-on-write: when cloning process, do not copy its pages, but mark them as “to be copied on next write access”.

Software caches: use memory pages to cache files or block devices (...XXX)

malloc()

void *malloc(size_t size)

void *calloc allocate several objects

void *realloc(void *ptr, size_t size)

void *valloc aligns allocated memory on page boundary

void free(void *ptr) ptr must be return value of previous malloc. If not, you

have a nasty bug

How it works:

for intra-page allocation: free list

buddy system: allocation of 2n bytes (2n > page size) will be done at an address which is a multiple of 2n. Better TLB usage and DRAM control (XXX)

valgrind tool: detects memory leaks, phantom pointers, corrupt calls to free, and other bugs…

5.3 Process Implementation

Process descriptor task_struct is a fairly big struct (1.7 kB for 32 bit arch). Stored in the task list = circular doubly-linked list.

State ready/running, stopped, zombie...
Kernel stack typically one memory page
Flags e.g., FD_CLOEXEC
Memory map pointer to table of memory page descriptors (maps)
Parent pointer to parent process (allow to obtain PPID)
TTY control terminal (if any)
Thread TID and thread information
Files working directory and table of file descriptors
Limits resource limits, see getrlimit()
Signals signal handlers, masked and pending signals
Prio, policy ...XXX
...

[the rest of the section is not interesting, will be described in more details further]

5.4 States and Scheduling

Bildschirmfoto 2016-07-23 um 14.08.27.png

Stopped waits for CONT signal
Suspended waits for wake-up condition from the kernel
Traced = being debugged

Bildschirmfoto 2016-07-23 um 14.15.27.png

5.5 Programmer Interface

fork()

pid_t fork(void);

Child identical to parent:

The whole memory is copied (actually just “copy-on-write”, see 5.1)
fd table is the same (but open file table entries are not duplicated).

Same for open message queue descriptors (see mq_overview(7))
Same for open directory streams (see opendir(3)). Except in linux, they don't share the directory stream positioning.

Except for:

pid
IPC objects: memory locks (see mlock(2)), ...
pending signals
file locks
...

/!\ Child has single thread (the one that forked), even though memory is exactly the same => pthread_atfork(3) may be handy.

This is a system call. There is also a glibc wrapper. See also clone() (for both process and thread)...XXX

Ex:

switch (cpid = fork()) {

case -1:

…

exit(-1);

case 0:

continue_child();

break;

default:

continue_parent(cpid);

break;

}

execve()

int execve(const char *filename, char *const argv[], char *const envp[]);

Arguments:

filename: absolute path of executable file (executable or text starting with shebang)
argv: arguments (starting with program name)
envp: environment variables (strings of the form “name=value”)

Does not return on success, instead “replaces” the process (returns -1 on error, e.g. EACCESS, ENOEXEC):

overwrites all memory segments
preserve pid, ppid, open file descriptors
(if file has setuid/setgid bits set, change uid/gid to the owners of the file.)

Variants:

int execl(const char *path, const char *arg, ...);

int execv(const char *path, char *const argv[]);

int execlp(const char *file, const char *arg, ...);

int execvp(const char *file, char *const argv[]);

int execle(const char *path, const char * arg, ..., char *const envp[]);

int execve(const char *filename, char *const argv[], char *const envp[]);

execl*: arg list terminated with null

exec*p: file in the $PATH

Environment manipulation:

getenv(), setenv(), putenv(), unsetenv()
also accessible through global variable environ, which is made available by the exec() call at the origin or the process.

Simply declare: extern char **environ (but recommended use only as execve argument)

man 7 environ

fcntl()

int fcntl(int fd, int cmd, [long arg]);

Commands:

F_GETFD: get fd flags
F_SETFD: set fd flags. Only FD CLOEXEC is defined: sets the file descriptor to be closed upon calls to execve() (typically a security measure)

_exit() and exit()

void _exit(int status);

Argument:

status:

0 if success, any other value if error (not to be confused with sys call return code! >=0 if success, -1 if error)
(status & 0xff) (= least significant byte of status) will be passed to any other process which has called wait()

Terminates process:

close files
free pages (except shared ones)
child processes are inherited by init
parent gets SIGCHLD signal
If the process is a session leader and its controlling terminal also controls the session, disassociate the terminal from the session and send a SIGHUP signal to all processes in the foreground group (terminate process by default)

Never fails and does not return!

void exit(int status);

-> Calls in addition methods registered with atexit() (in reverse order of registration).

5.6 Process Genealogy

Swapper process (pid 0)

One per CPU
Historically called swapper: used to swap whole memory of processes in and out (now paged-base swapping)
No program on disk corresponds to this process, which is part of the kernel.
Built from scratch by the kernel at boot time (not created using fork(), like every other process)
Is a system process (!= user process), runs in kernel mode.
Uses statically-allocated data
Constructs memory structures and initializes virtual memory
Initializes the main kernel data structures
Creates kernel threads (swap, kernel logging, etc.)
Enables interrupts, and creates a kernel thread with PID = 1
Becomes then scheduler process on some Unices (one per CPU) or the idle task on Linux (so that there is always a process to be scheduled).
Great history and explanations here: http://superuser.com/questions/377572/what-is-the-main-purpose-of-the-swapper-process-in-unix

init process (pid 1)

Shares all its data with process 0
Completes the initialization of the kernel
Switch to user mode
Executes /sbin/init, becoming a regular process and burying the structures and address space of process 0
/sbin/init

Builds the OS environment

From /etc/inittab: type of bootstrap sequence, control terminals
From /etc/rc*.d: scripts to run system daemons

ancestor of all processes (with a succession of fork() calls). Adopts all orphans.

man init, man 7 bootup, man shutdown

5.7 Daemons, Sessions and Groups

Ex: inetd: (sl.126/45 …)

/etc/services…
/etc/inetd.conf…

About PID, PPID, PGID, SID, see man 7 credentials

[https://www.win.tue.nl/~aeb/linux/lk/lk-10.html]

Process sessions and process groups:

Sessions and groups are orthogonal to process hierarchy
SID/PGID = PID of the leader of the session/group.
Children inherit PGID and SID of their parents.
session:

Can have a controlling tty
At most one group is the foreground group. Receives ˆC, ˆZ, etc. signals. When exit, session leader send SIGHUP to foreground groups (but probably also background depending on the shell?).
Typically associated to user login, interactive terminals, daemon processes

Group:

If shell manages jobs, like bash, piped processes will be in the same group. Else all children are in the same group as the shell.
Creation: with setpgid(pid, pgid) with pgid = 0 or pgid = pid, or with setpgrp() (<=> setpgid(0, 0)). Setpgid can only be called by:

Process itself
Parent if in the same session and before child calls exec().
This ensures that regardless of whether parent or child is scheduled first, the process group setting is as expected by both:
p = fork();
if (p == (pid_t) -1) { /* ERROR */
...
} else if (p == 0) { /* CHILD */
setpgid(0, pgid);
...
} else { /* PARENT */
setpgid(p, pgid);
...
}

Can be killed (killpg(pgid, sig)), or waited for (waitpid(-pgid, …))

[cf. Understanding the Linux Kernel 1.6.7.2 p. 29/47]

getpgid(int pid), getpgrp()

setsid()

…

6 Process Event Flow

6.1 Monitoring Processes

wait(), waitpid()

pid_t wait(int *status); /* waits for any child */

pid_t waitpid(pid_t pid, int *status, int options);

Monitor state changes (i.e. all sorts of signals received by the child, not just termination) of one or more children, and return PID of the child whose state has changed.

If a child terminates, it remains in a zombie state until wait() is performed to retrieve its state (and free the associated process descriptor)

When a zombie’s parent terminates, the zombie is adopted by init, which always waits for its children -> a zombie is also removed when its parent terminates.

pid special values:

-1: waits for any child.
< -1: understands it as the opposite of a PGID.
0: any child in the same process group

status can be inspected with macros:

WIFEXITED(status), WEXITSTATUS(status)
WIFSIGNALED(status), WTERMSIG(status)
WCOREDUMP(status)
WIFSTOPPED(status), WSTOPSIG(status)
WIFCONTINUED(status)

6.2 Signals

[see also man 7 signal]

Signals can have signal handlers to “catch” them (runs in user mode).

Asynchronous:

signal can be pending (= delivered but not yet handled, because blocked, or the kernel did not yet check for its delivery status XXX).

/!\ No queueing of pending signals.

signal caught when the process switches from kernel to user mode (context switch of syscall return)

[signal(7)]

POSIX defines:

Reliable or standard signals =?XXX
Real-time signals

Signal disposition = process (default?) behavior when delivered the signal = one of:

Term
Ign
Core (terminate + core dump)
Stop
Cont

Can be changed with sigaction(2).

By default, the signal handler is invoked on the normal process stack. It is possible to arrange that the signal handler uses an alternate stack; see sigaltstack(2) for a discussion of how to do this and when it might be useful.

...

kill()

int kill(pid_t pid, int sig);

syscall (also shell command) used to send any signal.

signal() (ISO C) (deprecated)

typedef void (*sighandler_t)(int);

sighandler_t signal(int signum, sighandler_t handler);

Deprecated with multithread: use sigaction() (see 6.4)

If signal received during handler execution, deferred.

pause()

Suspend a process until it is delivered a signal (except if this signal is SIG_IGNed).

6.3 Typical Applications

alarm()

deliver SIGALRM after a delay (/!\ default action is to terminate).

sleep() (ISO C)

(g/s)etitimer

interval timer…

6.4 Advanced Synchronization With Signals

What happens if CONT signal somehow delivered before the thread pauses? Deadlock! → Flow in ISO C signals (see 10.4 in Advanced Programming in the UNIX Environment).

sigaction()

int sigaction(int signum, const struct sigaction *act, struct sigaction *oldact);

struct sigaction {

void (*sa_handler)(int);

void (*sa_sigaction)(int, siginfo_t *, void *);

sigset_t sa_mask;

int sa_flags;

};

sa_handler (exclusive with sa_sigaction)
sa_sigaction (exclusive with sa_handler)
sa_mask: mask of blocked signals when executing the signal handler
sa_flags: bitwise or of handler behavior options...XXX

signum is any valid signal except SIGKILL and SIGSTOP.

...XXX

Signals in Terminals

Ctrl-Z → SIGTSTP tty stop

fg → SIGCONT continue

Ctrl-C → SIGINT Interrupt

7 Communication and Synchronization

7.1 Message Queues

= sth yet different as signals and FIFOs.

[see also mq_overview(7)]

In Linux: a pseudo-FS mounted at /dev/mqueue

Bildschirmfoto 2016-09-13 um 08.50.08.png

mq_open()

mqd_t mq_open(const char *name, int flags);

mqd_t mq_open(const char *name, int flags, mode_t mode, struct mq_attr *attr);

Arguments:

Name: looks like "/myQueue"(only one / at the beginning)
...

mq_send(), mq_receive(), mq_getattr(), mq_setattr(), mq_close(), mq_unlink()

…

7.2 Advanced Memory Management

mmap()

void *mmap(void *start, size_t length, int protection, int flags, int fd, off_t offset);

Map file or device into memory. Transparent synchronization between memory page and file.

Facilitates sharing of memory pages: use file name to identify shared memory region.

Arguments:

start: a multiple of page size, usually NULL, i.e. let the system decide.
fd, offset: fd to map with an offset.
protection: memory protection, either PROT_NONE or an OR of

PROT_EXEC
PROT_READ
PROT_WRITE

flags: either

MAP_PRIVATE: Writes won't affect the file. Copy-on-write.
MAP_SHARED: Share with all other processes which map this file.
MAP_ANONYMOUS: Does not actually map a file. Used whenever need to grow/shrink virtual memory segment (e.g. malloc, stack management)

Returns: the start address of the memory region, or MAP_FAILED (= (void *)-1)

munmap()

int munmap(void *start, size_t length);

Unmaps. Synchronizes any pending changes (see also msync()). NB: closing a fd does not unmap the region.

7.3 Shared Memory Segments

mmap allows sharing memory region, but uses disk space for the actual file. => Use shared memory system calls. They actually are like mmap(MAP_SHARED) but mapping special files from pseudo FS /dev/shm.

See shm_overview(7)...XXX

shm_open(), shm_unlink() (3)

int shm_open(const char *name, int flags, mode_t mode);

Creates and opens a new, or opens an existing, POSIX shared memory object.

Flag = one of (O_RDONLY, O_RDWR) OR any of:

O_CREAT: creates with mode if does not exist.
O_EXCL: with O_CREAT: triggers error if already exists.
O_TRUNC: if already exists, truncate to 0 bytes.

Returns a fd with FD_CLOEXEC set.

Errors:

Opening a non-existing file (without O_CREAT): ENOENT

8 Concurrency and Mutual Exclusion

Critical section = program section accessing shared resource.

Mutual exclusion = make sure at most one process or thread enters a critical section.

Properties to guarantee:

Correctness: only one process
No deadlocks
Both:

Guaranteed progress: If only one process willing to enter critical section, it can do it
Bounded waiting or starvation free

Performance:

Reduce overhead
Scalable parallelism

8.1 Mutual Exclusion in Shared Memory

8.1.1 Dekker's Algorithm

[wikipedia]

Uses 3 variables: wants_to_enter[0], wants_to_enter[1] and turn.

A process must:

unset its flag if it is not its turn, can keep it up when it is its turn;
never enter critical section when the other process is trying or it is not its turn.

But still, why exactly this way? I think the answer is in the original papers: http://www.cs.utexas.edu/users/EWD/transcriptions/EWD01xx/EWD123.html. This may be interesting too: http://cs.stackexchange.com/questions/12621/contrasting-peterson-s-and-dekker-s-algorithms

variables

wants_to_enter : array of 2 booleans

turn : integer

wants_to_enter[0] ← false

wants_to_enter[1] ← false

turn ← 0 // or 1

p0:

wants_to_enter[0] ← true

while wants_to_enter[1] {

if turn ≠ 0 {

wants_to_enter[0] ← false

while turn ≠ 0 {

// busy wait

}

wants_to_enter[0] ← true

}

// critical section

...

turn ← 1

wants_to_enter[0] ← false

// remainder section

p1:

wants_to_enter[1] ← true

while wants_to_enter[0] {

if turn ≠ 1 {

wants_to_enter[1] ← false

while turn ≠ 1 {

// busy wait

}

wants_to_enter[1] ← true

}

// critical section

...

turn ← 0

wants_to_enter[1] ← false

// remainder section

Problems:

Not fair?
May cause starvation?

8.1.2 Peterson's Algorithm

Original paper: http://cs.nyu.edu/~lerner/spring12/Read03-MutualExclusion.pdf

For Two Processes

The wrong algorithm with a mutex variable, for process 1:

while (mutexBelongsTo != -1) {} // wait while mutex belongs to someone

mutexBelongsTo = 1 // take the mutex

// critical section

mutexBelongsTo = -1 // Give away the mutex

=> Correctness is not satisfied because of race condition.

=> need to know:

whether the other process also wants the mutex
and need to agree whose turn it is to take the mutex.

=> use variable isApplying and turn, arbitrarily initialized to 0.

For process 1:

isApplying[1] = 1;

while (isApplying[0] && turn == 0) {} // process 0 is applying and it is its turn, so wait

// Now it is my turn or process 0 stopped applying so I can go

// Critical section

// Give away the turn and stop applying

turn = 0;

isApplying[1] = 0;

This one above I found myself, and it is quite similar to the Peterson's algorithm (btw, there is an error in the slides, see below):

For process 1:

isApplying[1] = 1;

turn = 0; // Process 1 says: "Please go ahead process 0"

while (isApplying[0] && turn == 0) { } // same as above

// Critical section

isApplying[1] = 0;

For n Processes

[Wikipedia]

The level variables take on values up to N−1, each representing a distinct "waiting room" before the critical section.[6] Processes advance from one room to the next, finishing in room N−1 which is the critical section. Specifically, to acquire a lock, process i executes:

for ℓ from 0 to N−1 exclusive
level[i] ← ℓ
last_to_enter[ℓ] ← i
while last_to_enter[ℓ] = i and there exists k ≠ i, such that level[k] ≥ ℓ
Wait

8.1.3 Other Algorithms

Lamport's bakery algorithm
Szymanski's algorithm
Taubenfeld's black-white bakery algorithm

8.1.4 Shared Memory Consistency Model

Defined by when a write is visible by the other processes.

Dekker and Peterson require the strongest consistency model: sequential consistency = (according to Leslie Lamport):

“The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”

But usually, consistency is weaker, because of:

Compilation: loop-invariant code motion, instruction reordering
Hardware/run-time optimizations: out-of-order superscalar execution (local), out-of-order cache coherence (multi-processor)

=> Solution = hardware support:

with memory fences = commit all pending memory and cache coherence transactions. Forbids cross-fence code motion by the compiler.
In C, use volatile to indicate a variable can be accessed by other processes:

No compile-time reordering of accesses to volatile variables
Never consider accesses to volatile variables as dead code

With atomic operations, each having specific processor instruction, e.g.:

Atomic exchange: exchange value of a register and a memory location, atomically
Test-and-set: a particular case of atomic exchange for booleans: set a memory location to 1 and return whether the old value was null or not
Compare-and-swap:
bool compare_and_swap(int *accum, int *dest, int newval) {
if (*accum == *dest) {
*dest = newval;
return true;
} else {
*accum = *dest;
return false;
}
}

8.2 Semaphores

A lock or mutex is useful when processes are competing for only one resource.

=> A semaphore is like a lock for several resources: it has some "counter".

8.2.1 Lock

Implementable with test-and-set:

void lock(volatile int *lock_pointer) {

while (test_and_set(lock_pointer) == 1);

}

void unlock(volatile int *lock_pointer) {

*lock_pointer = 0 // Release lock

}

int lock_variable = 1;

void lock_example() {

lock(&lock_variable);

// Critical section

unlock(&lock_variable);

}

8.2.2 Semaphore

void init(semaphore s, int v) {

s.value = v;

}

void procure(semaphore s) { // also called down() or wait()

wait until (s.value > 0);

s.value--; // Must be atomic with the previous test

}

void vacate(semaphore s) { // also called up(), post() or signal()

s.value++; // Must be atomic

}

8.2.3 Read-Write Semaphore

Allowing multiple readers and a single writer:

Cannot read or write when someone writes
(But multiple readers at the same time is ok)

void init(rw_semaphore l) {

l.value = 0; // Number of readers (resp. writers)

// if positive (resp. negative)

}

void procure_read(rw_semaphore l) {

wait until (l.value >= 0);

l.value++; // Must be atomic with the previous test

}

void vacate_read(rw_semaphore l) {

l.value--; // Must be atomic

}

void procure_write(rw_semaphore l) {

wait until (l.value == 0);

l.value = -1; // Must be atomic with the previous test

}

void vacate_write(rw_semaphore l) {

l.value = 0;

}

8.2.4 C std lib primitives

man 7 sem_overview

As with shm_ primitives, can be named = associated to a file /dev/shm.

sem_t * sem_open (const char *name, int flags, [mode_t mode, unsigned int value]);

int sem_wait (sem_t *sem);

int sem_post (sem_t *sem);

int sem_close (sem_t *sem);

int sem_unlink (const char *name);

Arguments flags and mode allow for a subset of their values for open():

flags: only O_CREAT and O_EXCL (FD_CLOEXEC flag is set automatically)
mode: S_IRUSR, S_IWUSR, S_IXUSR, etc.

close(): Undefined behavior when closing a semaphore other processes are currently blocked on.

Other calls:

sem_init()
sem_getvalue()
sem_destroy() = close + unlink
sem_trywait(), sem_timedwait()

8.3 Mutual Exclusion and Deadlocks

8.3.1 Dining Philosophers Problem

By Dijkstra and Hoare

Showing deadlock when more than one processes want to access more than one resources each.

Each philosopher first take the chopstick on their left, then attempt to take the one on their right => deadlock.

8.3.2 How to Avoid Deadlocks

Avoid cyclic / symmetric acquire patterns if possible
Use higher level mutual exclusion:

Atomic transactions
Monitors = ??

Dynamic deadlock avoidance: build a graph of resource usage, detect cycles and break them, see also Banker's algorithm (Dijkstra) (XXX)
Breaking deadlocks:

1. Timeout
2. Analyze
3. Attempt to reacquire different resources or in a different order
!! May trigger livelocks (ex: 2 people meeting in a corridor moving aside at the same time…)

8.4 File Locks

=> to get a write / read lock, file must be open for writing / reading.

=> When inode removed on file closure, locks are also deleted. But actually, closing any fd on such a file releases the locks, which is unwanted.

Locks are advisory = not enforced. The program must comply with the lock rule, i.e. not write or read when it does not have the lock.

Locks are local on some file region = record = file segment.

Inheritance:

Not inherited across fork
Inherited across execve

fcntl

int fcntl(int fd, int cmd, ... /* arg */ );

With:

cmd = F_GETLK, F_SETLK or F_SETLKW
arg of type flock *:
struct flock {
…
short l_type; /* Type of lock: F_RDLCK,
F_WRLCK, F_UNLCK */
short l_whence; /* How to interpret l_start:
SEEK_SET, SEEK_CUR, SEEK_END */
off_t l_start; /* Starting offset for lock */
off_t l_len; /* Number of bytes to lock */
pid_t l_pid; /* PID of process blocking our lock
(F_GETLK only) */
…
};

F_SETLK (struct flock *)

Acquire a lock (when l_type is F_RDLCK or F_WRLCK) or release a lock (when l_type is F_UNLCK) on the bytes specified by the l_whence, l_start, and l_len fields of lock. If a conflicting lock is held by another process, this call returns -1 and sets errno to EACCES or EAGAIN.

F_SETLKW (struct flock *)

Same as for F_SETLK, but wait. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)).

F_GETLK (struct flock *)

On input to this call, lock describes a lock we would like to place on the file. If the lock could be placed, fcntl() does not actually place it, but returns F_UNLCK in the l_type field of lock and leaves the other fields of the structure unchanged. If one or more incompatible locks would prevent this lock being placed, then fcntl() returns details about one of these locks in the l_type, l_whence, l_start, and l_len fields of lock and sets l_pid to be the PID of the process holding that lock.

8.5 System V IPC

9 Threads

9.1 Applications

Servers (web, database...), computational (numerical simulation, signal processing…)...

9.2 Principles

POSIX thread = logical thread.

Linux implementation: NPTL (Native POSIX Thread Library)

All memory segments are in common (data, heap, code), except stacks.

Kernel responsible for scheduling, mapping on hardware threads

man 7 pthreads

Compile with -pthread

9.2.1 Thread vs. Process

In the kernel memory: (XXX review each unclear point. This list is also in man 7 pthreads)

Some common attributes:

PID, PPID, PGID, SID, UID, GID
Files:

Current and root directory (chdir, chroot),
controlling terminal
Open file descriptors
Record locks
File creation mask (umask)

Timers, signal settings
Priority (nice value, setpriority())
Resource limits (setrlimits) and usage (setrusage)

Distincts attributes:

Thread id (in pthread_t), unique within process. pthread_t is an opaque type. While it is often represented by an int or long, you cannot guarantee the value.
Signal mask (pthread_sigmask()), alternate signal stack (sigaltstack())
Errno
Scheduling policy and real-time priority (man 7 sched)
CPU affinity (sched_setaffinity())
Capabilities (man 7 capabilities)

9.3 Programmer Interface

9.3.1 Example

#include <pthread.h>

#include <stdio.h>

#include <stdlib.h>

#include <unistd.h>

#include <string.h>

#include <errno.h>

#include <sys/times.h>

#define NTHREADS 5

void *thread_func(void *num) {

int i = *(int *)num;

printf("Thread %d\n", i); // Or pthread_self()

// ...

// More thread-specific code

// ...

pthread_exit(NULL);

}

pthread_t threads[NTHREADS];

int main(int argc, char *argv[]) {

pthread_attr_t attr;

int i, error;

for (i = 0; i < NTHREADS; i++) {

pthread_attr_init(&attr);

int *ii = malloc(sizeof(int)); *ii = i;

error = pthread_create(&threads[i], &attr, thread_func, ii);

if (error != 0) {

fprintf(stderr, "Error in pthread_create: %s \n",

strerror(error));

exit(1);

}

for (i=0; i < NTHREADS; i++) {

error = pthread_join(threads[i], NULL);

if (error != 0) {

fprintf(stderr, "Error in pthread_join: %s \n", strerror(error));

exit(1);

}

9.3.2 Primitives

pthread_create

int pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);

Create and start a thread.

pthread_t: opaque type for the thread ID. Also returned by pthread_t pthread_self()(3), compare with int pthread_equal(pthread_t, pthread_t)(3).

attr: see pthread_init_attr(). Can be NULL.

On error, errno is not set.

Thread termination is either:

Call pthread_exit(void *)
return x ⇔ pthread_exit(x)
Canceled with pthread_cancel()
Any thread calls exit() (terminate all threads)

When thread terminates, its resources are still there until thread is detached or joined, see below.

Inheritance:

Inherits :

Sig mask (see pthread_sigmask(3)).
Floating point environment (fenv(3))
Capability sets (man 7 capabilities)
CPU affinity mask (sched_setaffinity(2))

Does not inherits:

Pending signal set
Alternate signal stack (sigaltstack(3))

Errors:

On error, thread is undefined
EAGAIN: nb. Thread limit reached. Either

RLIMIT_NPROC (setrlimit(2)) (per user limit)
/proc/sys/kernel/threads-max (proc(5)) (system-wide)
/proc/sys/kernel/pid_max, (proc(5))

Joinable vs. detached:

Joinable: (default) after termination, resources stay there (= zombie thread) until the thread is joined by another thread with pthread_join(3). Joining allows to obtain the thread exit value. Joining a non-terminated thread makes you wait.
Detached: Upon termination, resources are freed, return value is lost. See pthread_detach(3) and pthread_attr_setdetachstate(3).

Thread stack size: default 2 MB, or RLIMIT_STACK in NPTL. See also pthread_attr_setstacksize(3).

pthread_init_attr, pthread_attr_destroy

int pthread_attr_init(pthread_attr_t *attr);

int pthread_attr_destroy(pthread_attr_t *attr);

Initialize attr with default values. Then fields can be modified individually using primitives like pthread_attr_*(). See See Also section.

...XXX

(Undefined behavior if attr has already been initialized. OK if has been destroyed.)

Once attr no longer needed, should be destroyed.

pthread_exit

void pthread_exit(void *retval);

Second arg of pthread_join is set to retval.

Called implicitly if thread function returns, with arg = the return value.

Never returns.

Does some cleanup:

Cleanup handlers …
…

pthread_join

int pthread_join(pthread_t thread, void **thread_return);

Waits until either:

thread exits => thread_return is set to the arg of pthread_exit() (must be first NULL).
thread is cancelled => thread_return is set to PTHREAD_CANCELED.

If calling thread is canceled, thread remains joinable.

Some errors:

Thread is not joinable (=> is detached?) => EINVAL
Joining a thread that has already been joined => undefined behavior.
Another thread is already waiting to join with this thread => EINVAL
Join deadlock, e.g. threads joining each other or thread joining itself => EDEADLK

pthread_detach

int pthread_detach(pthread_t thread);

Mark thread as detached. Can alternatively be made detached at creation using pthread_attr_setdetachstate(3).

Can't be made joinable again afterwards. Detaching already detached thread => unspecified behavior.

9.3.3 Thread-Specific Data (TSD)

For thread-specific objects that need to be available across function call ~= global data, but per thread.

Finalization...XXX

pthread_key_create

...XXX

9.4 Threads and Signals

pthread_kill

…XXX

pthread_sigmask

…XXX

sigwait

...XXX

9.5 Threads and Mutual Exclusion

9.5.1 Mutex

pthread_mutex_init

…

pthread_mutex_lock, pthread_mutex_unlock, pthread_mutex_trylock, pthread_mutex_timedlock

…

9.5.2 R/W Lock

pthread_rwlock_*

…

9.5.3 Condition Variable

…

9.5.4 Futex

9.6 Logical Threads vs. Hardware Threads

...

10 Network Interface

10.1 Principles

The layers…

10.2 Connectionless Communications

10.3 Connection-Based Communications

10.4 Programmer Interface

10.5 Threaded Server Model

10.6 Distributed Systems

11 Kernel Design

11.1 Interrupts and Exceptions

11.1.1 Interrupts

Received by the processor from a piece of hardware or software.

=> Involves an electrical signal called interrupt request (IRQ).

Lowest level hardware synchronization mechanism.

Requires immediate attention: processor switches to kernel mode and calls a specific interrupt service routine or interrupt handler.

Can be masked, except for non-maskable interrupts (NMI).

Multiple drivers may share a single IRQ line → IRQ handler must identify the source of the interrupt to call the proper service routine

11.1.2 Exceptions

Are originally raised by CPU. CPU switches to kernel mode and calls a specific exception handler or exception service routine, which often sends signal to faulty process (e.g. SIGFPE if division by 0).

"Mechanism to implement system call"-> Does that mean a syscall always starts with an exception to switch to kernel mode? Se maybe http://stackoverflow.com/questions/19057503/switching-from-user-mode-to-kernel-mode XXX

11.2 Low-Level Synchronization

11.3 Low-Level Input/Output

11.4 Devices and Driver Model

2 Types: block- or character-oriented, accessed through block or character special files, located in /dev:

Type	Ex device	Ex device special file
Block	Disks and other memories	/dev/had, /dev/sdb2, /dev/md1…
Character	serial ports, console terminals, audio	/ dev/tty0, /dev/pts/0, /dev/usb/hiddev0, /dev/mixer, /dev/null

Major and minor numbers: to (logically) project device drivers to device special files

automatically assigned by the kernel

rest is not so clear… See also 4.24 in Advanced Programming in the UNIX Environment.

11.5 File Systems and Persistent Storage

11.6 Memory Management

11.7 Process Management and Scheduling

11.8 Operating System Trends

11.9 Alternative Operating System Designs

12 Introduction to Virtual Machines

12.1 Modern Applications

12.1.1 Machine-level virtualization

Ex: VMware, Parallel Desktop, Virtual Box

…

12.1.2 Processor-level virtualization

Ex: VMware, Virtual Box, QEMU, Rosetta

Basically translates instructions from guest processor to host processor.

12.1.3 System-level virtualization

Ex: Xen (linux only)

ease OS development…

12.1.4 Language-level virtualization

Ex: Java, .NET

12.2 Challenges of Virtual Machine Monitors

12.3 Historical Perspective

12.4 Classification

13 Appendix A: List of System Calls

man section: 2

13.1 Files Manipulation

stat()	Get the inode
access()	check access permission
link(), symlink(), unlink()
chown()
chmod()
mknod()
open(), close()	may also create a new file
create()
read(), write()	provides unbuffered I/O: need to provide manually set buffer and size to read. For buffered I/O, use standard C library
lseek()	move r/w pointer in file
fcntl()	manipulate file descriptor: duplicate (slightly != dup()), flags, locking, manage signals, leases...
dup()	duplicate a file descriptor

13.2 Processes

fork()
exec()
exit()

14 Standard C Library Functions

man section: 3

(f)getc(), (f)gets(),(f)getw()	read character, entire line (‘s’ = string) or word
(f)putc(), (f)puts(), (f)putw()
printf(), fprintf(), sprintf(), vprintf()	formatted print to stdout, f- variant print to file, other variants have just different signatures…
fread(), fwrite()	same as read()/write(), but with FILE * instad of fd
fopen(), fdopen()	returns FILE * instead of a fd
fseek()