Linux Internals
2 An Operating System, What for?
3.2 Principles and Implementation
3.2.1 Kernel (or system) mode and user mode
fread(), fwrite(), fopen(), fclose()
4.4.2 File Descriptor Manipulation
4.6 Extended File Descriptor Manipulation
4.7 Device-Specific Operations
dup(int fd), dup2(int fd1, int fd2)
4.9 Communication Through a Pipe
5 Processes and Memory Management
5.2 Introduction to Memory Management
5.7 Daemons, Sessions and Groups
6.4 Advanced Synchronization With Signals
7 Communication and Synchronization
mq_send(), mq_receive(), mq_getattr(), mq_setattr(), mq_close(), mq_unlink()
7.2 Advanced Memory Management
8 Concurrency and Mutual Exclusion
8.1 Mutual Exclusion in Shared Memory
8.1.4 Shared Memory Consistency Model
8.3 Mutual Exclusion and Deadlocks
8.3.1 Dining Philosophers Problem
pthread_init_attr, pthread_attr_destroy
9.6 Threads and Mutual Exclusion
9.7 Logical Threads vs. Hardware Threads
10.2 Connectionless Communications
10.3 Connection-Based Communications
11.1 Interrupts and Exceptions
11.2 Low-Level Synchronization
11.5 File Systems and Persistent Storage
11.7 Process Management and Scheduling
11.9 Alternative Operating System Designs
12 Introduction to Virtual Machines
12.1.1 Machine-level virtualization
12.1.2 Processor-level virtualization
12.1.3 System-level virtualization
12.1.4 Language-level virtualization
12.2 Challenges of Virtual Machine Monitors
13 Appendix A: List of System Calls
14 Standard C Library Functions
[Made a set in Quizlet for system calls.]
[Notes mostly on http://www.enseignement.polytechnique.fr/informatique/INF583/]
[See also Understanding the Lunix Kernel, 3rd ed. Read until 1.5 An Overview of the Linux Filesystem excluded]
[https://en.wikipedia.org/wiki/C_standard_library]
A ISO standard (also called ISO C lib, or ANSI C lib?), which specify an API, which actually contains only 29 headers. The C library POSIX specification is a superset of it, which specifies Unix-specific functionality (ex: unistd.h, signal.h).
Documented in the man pages, section 3.
See also man 7 glibc, man 7 libc
Some implementations:
Both Unix and C were created at AT&T's Bell Laboratories in the late 1960s and early 1970s.
A header vs. standard list: http://www.schweikhardt.net/identifiers.html. From this list, it seems the C POSIX library is just the POSIX specification for system calls. Official list of headers in the C POSIX lib: http://pubs.opengroup.org/onlinepubs/9699919799/idx/head.html
[https://en.wikipedia.org/wiki/POSIX]
IEEE standard aiming at compatibility between OSs. Developed at the same time as ANSI C (C89), first version in 1988. Most recent version is POSIX.1 2013
Contains C API and shell utilities.
The standard is available here: http://pubs.opengroup.org/onlinepubs/9699919799/
Has 4 volumes (see top left column).
Glibc implements part of POSIX
Are part of the C Standard library(?), but the C std lib also include mere library routines, like malloc. Some of those routine may call system calls.
If you download the C library source, it seems most system calls are unimplemented and throw ENOSYS. Is the kernel just an implementation of those system calls?
Man page sections:
[Glibc source code: https://www.gnu.org/software/libc/download.html. (checked out in ~/code/major-open-source/glibc, made eclipse project)]
[copy-paste]{
#define __NR_chmod 90
#define SYS_chmod __NR_chmod
extern int __dup (int __fd);
libc_hidden_proto (__dup)
}
= modes of operation of the CPU, tracked in some register, which place restrictions on the type and scope of operations that can be performed.
see http://www.linfo.org/kernel_mode.html:
When the CPU is in kernel mode, it is assumed to be executing trusted software, and thus it can execute any instructions and reference any memory addresses (i.e., locations in memory). The kernel (which is the core of the operating system and has complete control over everything that occurs in the system) is trusted software, but all other programs are considered untrusted software. Thus, all user mode software must request use of the kernel by means of a system call in order to perform privileged instructions, such as process creation or input/output operations.
Wrapper’s tasks :
Handler’s tasks :
= Portable Operating System Interface
The standard itself must be purchased on IEEE website here.
Includes:
POSIX is portable and does not evolve much, but it is still too high level for many OS interactions. E.g., it does not specify file systems, network interfaces or power management.
Other Linux standard exist for that:
C applications deal with portability with:
Returns int or long:
errno:
if (mysyscall() == -1) {
printf(errno);
}
System types:
Interrupted System calls:
$ strace ./myProgram
execve("./hello", ["./hello"], [/* 36 vars */]) = 0
brk(0) = 0xf7f2e000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or
directory)
...
In UNIX, everything is a file…
See also man 7 file-hierarchy
inode = data structure to store file metadata. Name of data structure is actually stat (see man 2 stat for the sys call and data struct). Set of minimum attributes (defined by POSIX), which must be provided by any FS. For disk-based filesystems, usually corresponds to a file control block stored on disk.
What is a directory? (assumption)
A file containing a list of:
Note on hard and soft links:
Consequences:
ext4 FS simplified (see also man ext4, which documents a long list of flags -> not so interesting, https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout):
1 data block or physical record = (usually?) 512B. All block devices have blocks. Speeds up I/O. Reads/writes per whole blocks. Block size in FS may be different to physical block size.
-> there is the inode on disk and the inode data structure = stat.
Boot block:
Super block:
In ext4, every Inode has a table of block addresses [sl. 49/6]:
Q/ Does that mean data is fragmented?
Yes, it can be, but Linux tries to reduce it.
mount -t type device dir
Previous content + owner + mode of dir become invisible
Device = either:
Types depend of platform. Ex: adfs, autofs, auto, cifs, ext4, hfs, hpfs, msdos, nfs, ntfs, proc (for the special proc mount point), ramfr, romfs, smbfs, sysv, usbfs, vfat…
Mount with no arguments shows a list of mounted FS: rootfs, sysfs, udev… Also available in /proc/mounts
...XXX
man fstab
List of file systems maintained by a human administrator. Can be used by mount to mount those FS more easily, or mount them automatically at boot?
First column is either special file path, UUID or label.
...
cf. Understanding the Linux Kernel 12.2.6 p. 478/496
file descriptor = a non-negative integer corresponding to an index in a descriptor table. returned by open(), argument of read()…
The open file table is system-wide. Entry = {offset, opening flags from open()}. Names for its entries:
# descriptors / # openings = nb of arrows pointing to this entry. Allows to free the entry when not referred any longer.
To list the FD of a process with ID 1234, ls /proc/1234/fd
ls -l /proc/1234/fd
lrwx------ root root ….. 0 -> /dev/null
lrwx------ root root ….. 1 -> /dev/null
File locks: see ch. 8
l-prefix variant (ex: lstat()): do not follow soft links.
f-prefix variant (ex: fstat()): operate on file descriptor instead of file path. /!\ some f- variants actually are C standard library routines: fread, fwrite, fopen…
-at suffix variant (ex: openat): instead of pathname arg, takes dirfd + rel path. Rationale: see notes at the end of man 2 open:
int stat(const char *path, struct stat *inode);
stat = inode data struct -> get the inode
int access(const char *pathname, int mode);
check whether calling process can access file with mode mode = R_OK, W_OK, X_OK of F_OK (for mere existence check).
int link(const char *oldpath, const char *newpath);
int unlink(const char* path)
(cannot hard link directory, to maintain tree structures of the directories)
unlink actually used to delete a file (but not directories: use rmdir())
int chown(const char *path, uid_t owner, gid_t group);
int chmod(const char *path, mode_t mode);
change access rights, using masks
Ex: mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH; // 0644
A process has a umask. Files created with mknod, open, mkdir, creat, etc. have mode m & ~umask (where m is the mode passed as argument).
A common value is 022 = does not allow to create files writable by grp and others.
int mknod(const char *pathname, mode_t mode, dev_t dev);
Make an inode, mostly used to create device special files or pipes. (Use open() to create a regular file)
NB: there is also a mknod command:
mknod name type major minor
with type ∈ {b (=block = buffered), c or u (=character = unbuffered), p (= pipe = FIFO => major and minor not needed)}.
dev: use makedev(unsigned int maj, unsigned int min)(3) to create. Cf. also major(dev_t), minor(dev_t).
Mode must or one of those: S_IFREG, S_IFCHR, S_IFBLK, S_IFIFO, S_IFSOCK
int open(const char *pathname, int flags, [mode_t mode]);
int creat(const char *pathname, mode_t mode);
Return fd (the lowest available).
Flag must include one of O_RDONLY, O_WRONLY, O_RDWR + optionally any of these:
creat() = O_CREAT | O_WRONLY | O_TRUNC
Some interesting errors:
See also the notes at the end of the man page…
int close(int fd);
Allow fd to be reused. If open file table entry is not used anymore, it is freed. If file was unlinked and has no path reference left and this was the last file descriptor to it, the file is deleted.
/!\ Any locks on the associated files is removed, regardless of the fd used to obtain the lock, and even if the file is still pointed to by another fd.
off_t lseek(int fd, off_t offset, int whence)
Set offset in open file table entry. Whence belongs to {
Result of those last two depend on file system, because holes are managed by FS.
If new offset is after end of file, file size does not change until data is written. If data is written, that creates a hole on FS that support it (i.e., the hole is not actually always filled with zeros on the device).
Notes: some devices are incapable of seeking.
ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);
ssize_t = signed size type, because can return -1 in case of error
Attempts to read (resp. write) up to count bite, returns number of bytes effectively read (resp. written) (may stop reading resp. writing because of EOF, signal, reading from socket…). Read returns 0 if at EOF.
Reading errors
[TD 1]
FILE *fopen(const char *path, const char *mode);
FILE *fdopen(int fd, const char *mode);
size_t fread(void *ptr, size_t size, size_t nmemb, FILE * stream );
size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);
int fclose(FILE *fp);
Not system calls but part of the Standard C Library. Use FILE* instead of fd, i.e. uses file streams: enables buffering of IO. Flushing the stream (with fclose of fflush) ensures data is written to the stream, but not necessarily to the file yet: synchronisation between stream and file is done by kernel.
/!\ Use fdopen() and fileno() to get a file C library FILE* from a file descriptor and reciprocally, but do not mix C library and system call I/O on the same file (because of C library internal buffers)
will see when study redirections
Part of C Standard Library.
Types:
Primitives
See also man 7 path_resolution
fcntl(int fd, int cmd, [... /* args */])
cmd allows to manipulate attributes of the entries in the open file table (see 4.3)?
Type of args depends on the command.
cmd (with the form F_CMD(arg type)):
int ioctl(int fd, unsigned long request, ...);
... is traditionally a char *argp. fd is an open fd.
For other operations than read and write, for example: eject a disk., control a terminal.
Request is device-dependent. List in man 2 ioctl_list, but not documented...
If dt is the descriptor table of a process, copy dt[fd] in some dt[newfd] and returns newfd. -> used for redirection. newfd is either the lowest available (dup) or fd2 (dup2), which is closed beforehand if necessary. The new descriptor's close-on-exec flag is reset (to set it, use dup3).
Usual pipe is
but can also be
Limited capacity (65536 bytes (= 16 * 4kB pages) on Linux). fcntl(F_SETPIPE_SZ) can adjust pipe size.
NB: There is a FIFO inode, but as for pipes, data is in memory. They have "kernel persistence" (= stored in the memory of the kernel?).
stores a pair of fd in p, one with O_RDONLY, the other with O_WRONLY
See also man 7 pipe.
(because execve preserves file descriptors, see 5.5)
int mkfifo(const char* pathname, mode_t mode)
(front-end to mknod)
(Mode modified by the process's umask.)
FIFO supports rendez-vous protocol: open at one ends blocks until other end also opened, except if O_NONBLOCK: opening O_RDONLY succeeds, O_WRONLY fails with ENXIO.
See also man 7 fifo.
In the kernel: process descriptor, …
See slide 91 and search
Memory segments:
[http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory/, provides also many possibly-interesting pieces of info which I did not write down]
Notes on some segments:
Segment frontiers are randomized in Linux for security reasons.
Memory-related faults:
Segments are divided in areas, ex:
Areas can be examined in /proc/<pid>/maps
int brk(void *end_data_segment);
void *sbrk(intptr_t displacement);
move the program break = end of heap (referred to as "end of data segment" in man pages, but some people use data segment to refer to data segment + BSS + heap [see Gustavo Duarte blog post])
(only used in the kernel)
Actually allocate or deallocate memory to the process.
On success, sbrk returns previous program break = pointer to the start of the newly allocated segment memory (=> sbrk(0) can be used to simply get the program break).
See also man 3 end, edata, etext.
How to get the addresses of each segment:
double t[0x02000000]
void segments()
{
static int s = 42;
void *p = malloc(1024);
printf( "stack: %010p \n", &p)
printf( "brk: %010p \n", sbrk(0))
printf( "heap: %010p \n", p)
printf( "static (bss): %010p \n", t)
printf( "static (initialized): %010p \n", &s)
printf( "text : %010p \n", segments)
}
Stack 0xbff86fe0
Brk 0x1806b000
Heap 0x1804a008
Static (bss) 0x08049720
Static (initialized) 0x080496e4
Text 0x080483f4
[Notes here also include some of 7.2 Advanced Memory Management]
Virtual address mechanism
Page = fixed size memory region (2n, e.g. 4 KB)
Translation mechanism: not clear… By CPU (Memory Management Unit (MMU) + Translation Lookaside Buffer (TLB))
The following is architecture dependent (valid at least for x86)?
One paging control register (PCR) in the processor, pointing to the beginning of the page directory.
At context switch, the content of this register is modified (+TLB is flushed) => one page directory per process.
In memory, there is one page directory and several page tables per process.
1 page directory size = 1 page table size = 1024 * 32-bits entries = 1 page.
Translation:
This memory translation is done in hardware in Memory Management Unit (MMU). This is further speeded up by Translation Look-aside Buffer (TLB).
.
See /proc/<pid>/pagemap
A page table entry stores the physical address of a page, i.e. a page-aligned address => there are 12 bits available to contain page status:
mem_map_t structure stores info about used physical pages:
[See also 2.4 in https://www.kernel.org/doc/gorman/html/understand/understand005.html for more details]
Saving resources and enhancing performance :
Software caches: use memory pages to cache files or block devices (...XXX)
void *malloc(size_t size)
void *calloc allocate several objects
void *realloc(void *ptr, size_t size)
void *valloc aligns allocated memory on page boundary
void free(void *ptr) ptr must be return value of previous malloc. If not, you
have a nasty bug
How it works:
valgrind tool: detects memory leaks, phantom pointers, corrupt calls to free, and other bugs…
[See also http://www.makelinux.net/books/lkd2/ch03lev1sec1]
Process descriptor task_struct is a fairly big struct (1.7 kB for 32 bit arch). Stored in the task list = circular doubly-linked list.
[the rest of the section is not interesting, will be described in more details further]
pid_t fork(void);
Child identical to parent:
/!\ Child has single thread (the one that forked), even though memory is exactly the same => pthread_atfork(3) may be handy.
This is a system call. There is also a glibc wrapper. See also clone() (for both process and thread)...XXX
Ex:
switch (cpid = fork()) {
case -1:
…
exit(-1);
case 0:
continue_child();
break;
default:
continue_parent(cpid);
break;
}
int execve(const char *filename, char *const argv[], char *const envp[]);
Arguments:
Does not return on success, instead “replaces” the process (returns -1 on error, e.g. EACCESS, ENOEXEC):
Variants:
int execl(const char *path, const char *arg, ...);
int execv(const char *path, char *const argv[]);
int execlp(const char *file, const char *arg, ...);
int execvp(const char *file, char *const argv[]);
int execle(const char *path, const char * arg, ..., char *const envp[]);
int execve(const char *filename, char *const argv[], char *const envp[]);
execl*: arg list terminated with null
exec*p: file in the $PATH
Environment manipulation:
Simply declare: extern char **environ (but recommended use only as execve argument)
int fcntl(int fd, int cmd, [long arg]);
Commands:
void _exit(int status);
Argument:
Terminates process:
Never fails and does not return!
void exit(int status);
-> Calls in addition methods registered with atexit() (in reverse order of registration).
Swapper process (pid 0)
init process (pid 1)
Ex: inetd: (sl.126/45 …)
About PID, PPID, PGID, SID, see man 7 credentials
[https://www.win.tue.nl/~aeb/linux/lk/lk-10.html]
Process sessions and process groups:
[cf. Understanding the Linux Kernel 1.6.7.2 p. 29/47]
…
pid_t wait(int *status); /* waits for any child */
pid_t waitpid(pid_t pid, int *status, int options);
Monitor state changes (i.e. all sorts of signals received by the child, not just termination) of one or more children, and return PID of the child whose state has changed.
If a child terminates, it remains in a zombie state until wait() is performed to retrieve its state (and free the associated process descriptor)
pid special values:
status can be inspected with macros:
[see also man 7 signal]
Signals can have signal handlers to “catch” them (runs in user mode).
Asynchronous:
/!\ No queueing of pending signals.
[signal(7)]
POSIX defines:
Signal disposition = process (default?) behavior when delivered the signal = one of:
Can be changed with sigaction(2).
By default, the signal handler is invoked on the normal process stack. It is possible to arrange that the signal handler uses an alternate stack; see sigaltstack(2) for a discussion of how to do this and when it might be useful.
...
int kill(pid_t pid, int sig);
syscall (also shell command) used to send any signal.
typedef void (*sighandler_t)(int);
sighandler_t signal(int signum, sighandler_t handler);
Deprecated with multithread: use sigaction() (see 6.4)
If signal received during handler execution, deferred.
Suspend a process until it is delivered a signal (except if this signal is SIG_IGNed).
deliver SIGALRM after a delay (/!\ default action is to terminate).
interval timer…
What happens if CONT signal somehow delivered before the thread pauses? Deadlock! → Flow in ISO C signals (see 10.4 in Advanced Programming in the UNIX Environment).
int sigaction(int signum, const struct sigaction *act, struct sigaction *oldact);
struct sigaction {
void (*sa_handler)(int);
void (*sa_sigaction)(int, siginfo_t *, void *);
sigset_t sa_mask;
int sa_flags;
};
signum is any valid signal except SIGKILL and SIGSTOP.
...XXX
Ctrl-Z → SIGTSTP tty stop
fg → SIGCONT continue
Ctrl-C → SIGINT Interrupt
= sth yet different as signals and FIFOs.
[see also mq_overview(7)]
In Linux: a pseudo-FS mounted at /dev/mqueue
mqd_t mq_open(const char *name, int flags);
mqd_t mq_open(const char *name, int flags, mode_t mode, struct mq_attr *attr);
Arguments:
…
void *mmap(void *start, size_t length, int protection, int flags, int fd, off_t offset);
Map file or device into memory. Transparent synchronization between memory page and file.
Facilitates sharing of memory pages: use file name to identify shared memory region.
Arguments:
Returns: the start address of the memory region, or MAP_FAILED (= (void *)-1)
int munmap(void *start, size_t length);
Unmaps. Synchronizes any pending changes (see also msync()). NB: closing a fd does not unmap the region.
mmap allows sharing memory region, but uses disk space for the actual file. => Use shared memory system calls. They actually are like mmap(MAP_SHARED) but mapping special files from pseudo FS /dev/shm.
See shm_overview(7)...XXX
int shm_open(const char *name, int flags, mode_t mode);
Creates and opens a new, or opens an existing, POSIX shared memory object.
Flag = one of (O_RDONLY, O_RDWR) OR any of:
Returns a fd with FD_CLOEXEC set.
Errors:
Critical section = program section accessing shared resource.
Mutual exclusion = make sure at most one process or thread enters a critical section.
Properties to guarantee:
[wikipedia]
Uses 3 variables: wants_to_enter[0], wants_to_enter[1] and turn.
A process must:
But still, why exactly this way? I think the answer is in the original papers: http://www.cs.utexas.edu/users/EWD/transcriptions/EWD01xx/EWD123.html. This may be interesting too: http://cs.stackexchange.com/questions/12621/contrasting-peterson-s-and-dekker-s-algorithms
variables
wants_to_enter : array of 2 booleans
turn : integer
wants_to_enter[0] ← false
wants_to_enter[1] ← false
turn ← 0 // or 1
p0: wants_to_enter[0] ← true while wants_to_enter[1] { if turn ≠ 0 { wants_to_enter[0] ← false while turn ≠ 0 { // busy wait } wants_to_enter[0] ← true } } // critical section ... turn ← 1 wants_to_enter[0] ← false // remainder section | p1: wants_to_enter[1] ← true while wants_to_enter[0] { if turn ≠ 1 { wants_to_enter[1] ← false while turn ≠ 1 { // busy wait } wants_to_enter[1] ← true } }
// critical section ... turn ← 0 wants_to_enter[1] ← false // remainder section |
Problems:
Original paper: http://cs.nyu.edu/~lerner/spring12/Read03-MutualExclusion.pdf
The wrong algorithm with a mutex variable, for process 1:
while (mutexBelongsTo != -1) {} // wait while mutex belongs to someone
mutexBelongsTo = 1 // take the mutex
// critical section
mutexBelongsTo = -1 // Give away the mutex
=> Correctness is not satisfied because of race condition.
=> need to know:
=> use variable isApplying and turn, arbitrarily initialized to 0.
For process 1:
isApplying[1] = 1;
while (isApplying[0] && turn == 0) {} // process 0 is applying and it is its turn, so wait
// Now it is my turn or process 0 stopped applying so I can go
// Critical section
// Give away the turn and stop applying
turn = 0;
isApplying[1] = 0;
This one above I found myself, and it is quite similar to the Peterson's algorithm (btw, there is an error in the slides, see below):
For process 1:
isApplying[1] = 1;
turn = 0; // Process 1 says: "Please go ahead process 0"
while (isApplying[0] && turn == 0) { } // same as above
// Critical section
isApplying[1] = 0;
[Wikipedia]
The level variables take on values up to N−1, each representing a distinct "waiting room" before the critical section.[6] Processes advance from one room to the next, finishing in room N−1 which is the critical section. Specifically, to acquire a lock, process i executes:
for ℓ from 0 to N−1 exclusive
level[i] ← ℓ
last_to_enter[ℓ] ← i
while last_to_enter[ℓ] = i and there exists k ≠ i, such that level[k] ≥ ℓ
Wait
Defined by when a write is visible by the other processes.
Dekker and Peterson require the strongest consistency model: sequential consistency = (according to Leslie Lamport):
“The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”
But usually, consistency is weaker, because of:
=> Solution = hardware support:
A lock or mutex is useful when processes are competing for only one resource.
=> A semaphore is like a lock for several resources: it has some "counter".
Implementable with test-and-set:
void lock(volatile int *lock_pointer) {
while (test_and_set(lock_pointer) == 1);
}
void unlock(volatile int *lock_pointer) {
*lock_pointer = 0 // Release lock
}
int lock_variable = 1;
void lock_example() {
lock(&lock_variable);
// Critical section
unlock(&lock_variable);
}
void init(semaphore s, int v) {
s.value = v;
}
void procure(semaphore s) { // also called down() or wait()
wait until (s.value > 0);
s.value--; // Must be atomic with the previous test
}
void vacate(semaphore s) { // also called up(), post() or signal()
s.value++; // Must be atomic
}
Allowing multiple readers and a single writer:
void init(rw_semaphore l) {
l.value = 0; // Number of readers (resp. writers)
// if positive (resp. negative)
}
void procure_read(rw_semaphore l) {
wait until (l.value >= 0);
l.value++; // Must be atomic with the previous test
}
void vacate_read(rw_semaphore l) {
l.value--; // Must be atomic
}
void procure_write(rw_semaphore l) {
wait until (l.value == 0);
l.value = -1; // Must be atomic with the previous test
}
void vacate_write(rw_semaphore l) {
l.value = 0;
}
man 7 sem_overview
As with shm_ primitives, can be named = associated to a file /dev/shm.
sem_t * sem_open (const char *name, int flags, [mode_t mode, unsigned int value]);
int sem_wait (sem_t *sem);
int sem_post (sem_t *sem);
int sem_close (sem_t *sem);
int sem_unlink (const char *name);
Arguments flags and mode allow for a subset of their values for open():
close(): Undefined behavior when closing a semaphore other processes are currently blocked on.
Other calls:
By Dijkstra and Hoare
Showing deadlock when more than one processes want to access more than one resources each.
Each philosopher first take the chopstick on their left, then attempt to take the one on their right => deadlock.
=> to get a write / read lock, file must be open for writing / reading.
=> When inode removed on file closure, locks are also deleted. But actually, closing any fd on such a file releases the locks, which is unwanted.
Locks are advisory = not enforced. The program must comply with the lock rule, i.e. not write or read when it does not have the lock.
Locks are local on some file region = record = file segment.
Inheritance:
int fcntl(int fd, int cmd, ... /* arg */ );
With:
F_SETLK (struct flock *)
Acquire a lock (when l_type is F_RDLCK or F_WRLCK) or release a lock (when l_type is F_UNLCK) on the bytes specified by the l_whence, l_start, and l_len fields of lock. If a conflicting lock is held by another process, this call returns -1 and sets errno to EACCES or EAGAIN.
F_SETLKW (struct flock *)
Same as for F_SETLK, but wait. If a signal is caught while waiting, then the call is interrupted and (after the signal handler has returned) returns immediately (with return value -1 and errno set to EINTR; see signal(7)).
F_GETLK (struct flock *)
On input to this call, lock describes a lock we would like to place on the file. If the lock could be placed, fcntl() does not actually place it, but returns F_UNLCK in the l_type field of lock and leaves the other fields of the structure unchanged. If one or more incompatible locks would prevent this lock being placed, then fcntl() returns details about one of these locks in the l_type, l_whence, l_start, and l_len fields of lock and sets l_pid to be the PID of the process holding that lock.
Servers (web, database...), computational (numerical simulation, signal processing…)...
POSIX thread = logical thread.
Linux implementation: NPTL (Native POSIX Thread Library)
All memory segments are in common (data, heap, code), except stacks.
Kernel responsible for scheduling, mapping on hardware threads
man 7 pthreads
Compile with -pthread
In the kernel memory: (XXX review each unclear point. This list is also in man 7 pthreads)
Some common attributes:
Distincts attributes:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <sys/times.h>
#define NTHREADS 5
void *thread_func(void *num) {
int i = *(int *)num;
printf("Thread %d\n", i); // Or pthread_self()
// ...
// More thread-specific code
// ...
pthread_exit(NULL);
}
pthread_t threads[NTHREADS];
int main(int argc, char *argv[]) {
pthread_attr_t attr;
int i, error;
for (i = 0; i < NTHREADS; i++) {
pthread_attr_init(&attr);
int *ii = malloc(sizeof(int)); *ii = i;
error = pthread_create(&threads[i], &attr, thread_func, ii);
if (error != 0) {
fprintf(stderr, "Error in pthread_create: %s \n",
strerror(error));
exit(1);
}
}
for (i=0; i < NTHREADS; i++) {
error = pthread_join(threads[i], NULL);
if (error != 0) {
fprintf(stderr, "Error in pthread_join: %s \n", strerror(error));
exit(1);
}
}
}
int pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);
Create and start a thread.
pthread_t: opaque type for the thread ID. Also returned by pthread_t pthread_self()(3), compare with int pthread_equal(pthread_t, pthread_t)(3).
attr: see pthread_init_attr(). Can be NULL.
On error, errno is not set.
Thread termination is either:
When thread terminates, its resources are still there until thread is detached or joined, see below.
Inheritance:
Errors:
Joinable vs. detached:
Thread stack size: default 2 MB, or RLIMIT_STACK in NPTL. See also pthread_attr_setstacksize(3).
int pthread_attr_init(pthread_attr_t *attr);
int pthread_attr_destroy(pthread_attr_t *attr);
Initialize attr with default values. Then fields can be modified individually using primitives like pthread_attr_*(). See See Also section.
...XXX
(Undefined behavior if attr has already been initialized. OK if has been destroyed.)
Once attr no longer needed, should be destroyed.
void pthread_exit(void *retval);
Second arg of pthread_join is set to retval.
Called implicitly if thread function returns, with arg = the return value.
Never returns.
Does some cleanup:
int pthread_join(pthread_t thread, void **thread_return);
Waits until either:
If calling thread is canceled, thread remains joinable.
Some errors:
int pthread_detach(pthread_t thread);
Mark thread as detached. Can alternatively be made detached at creation using pthread_attr_setdetachstate(3).
Can't be made joinable again afterwards. Detaching already detached thread => unspecified behavior.
For thread-specific objects that need to be available across function call ~= global data, but per thread.
Finalization...XXX
...XXX
…XXX
…XXX
...XXX
…
…
…
…
...
The layers…
Received by the processor from a piece of hardware or software.
=> Involves an electrical signal called interrupt request (IRQ).
Lowest level hardware synchronization mechanism.
Requires immediate attention: processor switches to kernel mode and calls a specific interrupt service routine or interrupt handler.
Can be masked, except for non-maskable interrupts (NMI).
Multiple drivers may share a single IRQ line → IRQ handler must identify the source of the interrupt to call the proper service routine
Are originally raised by CPU. CPU switches to kernel mode and calls a specific exception handler or exception service routine, which often sends signal to faulty process (e.g. SIGFPE if division by 0).
[see also Understanding the Linux Kernel, 3rd ed., 4.5]
"Mechanism to implement system call"-> Does that mean a syscall always starts with an exception to switch to kernel mode? Se maybe http://stackoverflow.com/questions/19057503/switching-from-user-mode-to-kernel-mode XXX
2 Types: block- or character-oriented, accessed through block or character special files, located in /dev:
Type | Ex device | Ex device special file |
Block | Disks and other memories | /dev/had, /dev/sdb2, /dev/md1… |
Character | serial ports, console terminals, audio | / dev/tty0, /dev/pts/0, /dev/usb/hiddev0, /dev/mixer, /dev/null |
Major and minor numbers: to (logically) project device drivers to device special files
rest is not so clear… See also 4.24 in Advanced Programming in the UNIX Environment.
Ex: VMware, Parallel Desktop, Virtual Box
…
Ex: VMware, Virtual Box, QEMU, Rosetta
Basically translates instructions from guest processor to host processor.
Ex: Xen (linux only)
ease OS development…
Ex: Java, .NET
man section: 2
stat() | Get the inode |
access() | check access permission |
link(), symlink(), unlink() | |
chown() | |
chmod() | |
mknod() | |
open(), close() | may also create a new file |
create() | |
read(), write() | provides unbuffered I/O: need to provide manually set buffer and size to read. For buffered I/O, use standard C library |
lseek() | move r/w pointer in file |
fcntl() | manipulate file descriptor: duplicate (slightly != dup()), flags, locking, manage signals, leases... |
dup() | duplicate a file descriptor |
fork() | |
exec() | |
exit() |
man section: 3
see also ‘man stdio’
(f)getc(), (f)gets(),(f)getw() | read character, entire line (‘s’ = string) or word |
(f)putc(), (f)puts(), (f)putw() | |
printf(), fprintf(), sprintf(), vprintf() | formatted print to stdout, f- variant print to file, other variants have just different signatures… |
fread(), fwrite() | same as read()/write(), but with FILE * instad of fd |
fopen(), fdopen() | returns FILE * instead of a fd |
fseek() |