sort -m -o segfault

Version:

coreutils-7.2 (fixed in 7.3)

How to reproduce?

coreutils-7.2/src/sort -m -o ./text1 ./text1

Symptom:

sort -m -o text1 text1

segmentation fault

Root Cause:

An error check was omitted early in the execution, which allowed the error to escape and silently overwrite the contents of a 'file' record. Execution then went completely off course and eventually crashed. Since the crash point is quite far from the root cause, the 'segmentation fault' message, and even the core dump, are of little help in diagnosing the cause.

Here is the backtrace at the segmentation fault point:

Program received signal SIGSEGV, Segmentation fault.

0x000000000040c3e6 in hash_lookup (table=0x0, entry=0x7fff0b8fb7f0)

        at hash.c:250

250             = table->bucket + table->hasher (entry, table->n_buckets);

(gdb) bt

#0  0x000000000040c3e6 in hash_lookup (table=0x0, entry=0x7fff0b8fb7f0)

        at hash.c:250

#1  0x0000000000402762 in wait_proc (pid=49) at sort.c:680

#2  0x0000000000402f5c in open_temp (name=0x0, pid=49) at sort.c:995

#3  0x0000000000405d98 in open_input_files (files=0x7b26250, nfiles=1,

        pfps=0x7fff0b8fb910) at sort.c:2206

#4  0x000000000040733a in merge (files=0x7b26250, ntemps=0, nfiles=1,

        output_file=0x7fff0b8fdb0b "text1") at sort.c:2700

#5  0x00000000004092c7 in main (argc=5, argv=0x7fff0b8fbde8) at sort.c:3559

Comments written by us are marked inline in the code excerpts below.

------------------ hash.c -------------------------------------------

/* table == NULL here, so the dereference below crashes... */
void *
hash_lookup (const Hash_table *table, const void *entry)
{
  struct hash_entry const *bucket
    = table->bucket + table->hasher (entry, table->n_buckets);
  ...
}

Caller of hash_lookup:

------------------------ sort.c ----------------------------------------

wait_proc (pid_t pid) {
  struct procnode test, *node;
  test.pid = pid;
  node = hash_lookup (proctab, &test);  // proctab is a global that was never initialized, so it is NULL here
  ...
}

...

In fact, the root cause is quite far away from the crash point, so let us explain with the call paths.

In buggy execution, the call path is:

main

-> merge (file[0].name = "./text1", file[0].pid = 0)

 -> open_input_files {files[i].pid == 49

   ? open_temp (files[i].name == 0x00, files[i].pid == 49)

   : stream_open (files[i].name, "r"));}

   -> open_temp  // Here is the ERROR

   -> wait_proc

   -> hash_lookup

In correct execution:

main

  -> merge (file[0].name = "./text1", file[0].pid = 0)

      -> open_input_files {files[i].pid == 0

         ? open_temp (files[i].name == "/tmp/sortmFa5YE", files[i].pid == 0)

         : stream_open (files[i].name, "r"));}

     -> stream_open

In other words, the execution starts to deviate at "open_input_files"; the "NULL pointer dereference" itself is only a downstream symptom.

The root cause of this deviation in execution flow is the function "avoid_trashing_input", which is called as follows:

main

  -> merge (file[0].name = "./text1", file[0].pid = 0)

       -> avoid_trashing_input (file[0].name = "./text1", file[0].pid = 0)

            // After this, files[0].name = 0x0, files[0].pid = 49, which is wrong!

       -> open_input_files {

                files[i].pid == 49

                ? open_temp (files[i].name == 0x00, files[i].pid == 49)

                 : stream_open (files[i].name, "r"));}

                  -> open_temp

                          -> wait_proc

                                  -> hash_lookup


avoid_trashing_input {
  ...
  while (i + num_merged < nfiles)
    {
      num_merged += mergefiles (&files[i], 0, nfiles - i, tftp, temp);
      files[i].name = temp;
      files[i].pid = pid;

      /* Here, i = 0, num_merged = 1, nfiles = 1. So in the buggy
         execution, the memmove resets files[0], which shouldn't happen.
         The fix guards the memmove with a condition check (and moves
         the surviving entries to &files[i + 1]): */
-     memmove (&files[i], &files[i + num_merged], num_merged * sizeof *files);
+     if (i + num_merged < nfiles)   /* i = 0, num_merged = 1, nfiles = 1 */
+       memmove (&files[i + 1], &files[i + num_merged], num_merged * sizeof *files);
      ntemps += 1;
      nfiles -= num_merged - 1;;
      i += num_merged;
    }
  ...
}

So in the correct version, after running "avoid_trashing_input", files[0] becomes:

(gdb) p files[0]

$2 = {name = 0x1208c27c "/tmp/sortW6WiL5", pid = 0}

Does the application print any error msg?

No. The "segmentation fault" message comes from the shell, reporting the signal delivered by the OS; the application itself neither catches the signal nor prints anything.

Can Errlog anticipate the error and print an error message?

Yes, with the exception signal pattern.