1 of 43

Differentiating Communication Styles

of Leaders on the

Linux Kernel Mailing List

Daniel Schneider, Scott Spurlock, Megan Squire

Elon University

@MeganSquire0

FLOSSmole / FLOSSdata / FLOSSpapers

OpenSym '16 (Berlin)

August 17, 2016

2 of 43

The LKML

Linux Kernel Mailing List

--one of 148 mailing lists used to manage the Linux operating system

--7,000 subscribers

--300 messages per day

--earliest archive is 1995

--2,160,000 messages (1995 - March 2015)

--54,000 unique senders (addresses)

3 of 43

The Linux Kernel Civility Incident - July 2013

4 of 43

Linus Torvalds

5 of 43

Greg Kroah-Hartman

6 of 43

Sarah Sharp

11 of 43

Research Questions

RQ1: Considering the two LKML leaders who were at the center of the 2013 controversy (Torvalds & Kroah-Hartman), what are the interesting features of their written discourse? How different are their communication styles?

RQ2: Can we automatically differentiate emails written by each person, solely based on their (non-code) content? What features of the email content are most helpful to this task?

12 of 43

First, we collected all the emails sent by Linus and by Greg.

15 of 43

RQ1: lexical differences

Flesch-Kincaid Reading Ease (FKRE) is a simplistic method for measuring how easy a text is to read, based on average sentence length and average syllables per word (higher scores are easier to read).

The same two inputs can also be expressed as a school grade level equivalent (FKGL).
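As a rough illustration of how these two scores are computed, the sketch below applies the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The syllable counter is a crude vowel-group heuristic and the function names (`count_syllables`, `readability`) are assumptions made for this example; it is not necessarily the tooling used in the study.

```python
import re
from typing import Tuple

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels."""
    lower = word.lower()
    count = len(re.findall(r"[aeiouy]+", lower))
    # Treat a trailing silent "e" as non-syllabic ("made"), but keep "-le" ("table").
    if lower.endswith("e") and not lower.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def readability(text: str) -> Tuple[float, float]:
    """Return (FKRE, FKGL) for a piece of plain text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    # Standard Flesch Reading Ease formula (higher = easier).
    fkre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Standard Flesch-Kincaid Grade Level formula (higher = harder).
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fkre, fkgl
```

Short sentences made of short words score high on FKRE and low on FKGL; long sentences full of polysyllabic words do the opposite.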

18 of 43

RQ1: expletive caveats

  • No obscured expletives included, e.g. f*ck or sh!t
  • Related terms have been combined (crap, crappy)
  • We added the word bullshit
  • English language expletives only

19 of 43

RQ2: automatically distinguishing the two authors

  • We built a classifier that could automatically distinguish between the two authors.

  • With our findings from RQ1 (parts of speech & expletive usage), we can experiment with tuning the classifier.

  • What goes into the classifier:
    • Bag of words
    • Adverb count
    • Expletive count
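The three inputs listed above could be assembled per email roughly as follows. This is a minimal sketch under stated assumptions: the adverb list is just the handful of adverbs named elsewhere in these slides (a real pipeline would more likely use a part-of-speech tagger), the expletive list holds only the terms mentioned in the caveats, and `extract_features` and the `__...__` feature names are hypothetical. The choice of classifier is left open because the slides do not name one.

```python
import re
from collections import Counter

# Illustrative word lists only (assumptions, not the study's actual lists).
ADVERBS = {"actually", "always", "basically", "never", "only",
           "really", "totally", "very"}
EXPLETIVES = {"crap", "bullshit"}
# Related terms are combined into one canonical form, e.g. crappy -> crap.
CANONICAL = {"crappy": "crap"}

def extract_features(email_body: str) -> dict:
    """Build a bag of words plus adverb and expletive counts for one email."""
    raw_tokens = re.findall(r"[a-z']+", email_body.lower())
    tokens = [CANONICAL.get(t, t) for t in raw_tokens]
    features = Counter(tokens)  # bag of words
    features["__adverb_count__"] = sum(1 for t in tokens if t in ADVERBS)
    features["__expletive_count__"] = sum(1 for t in tokens if t in EXPLETIVES)
    return dict(features)
```

Each resulting dictionary is one training example; the author of the email is the label.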

20 of 43

RQ2: automatically distinguishing the two authors

Perhaps this is too good?

21 of 43

Which features are actually interesting?

22 of 43

Thanks

At first we suspected that Greg simply used a "thanks" signature and Linus did not.

However, this is not the whole story.

  1. While Greg often does say "Thanks, Greg" as a sign-off, he varies its capitalization and punctuation. We therefore cannot treat it as boilerplate text, since he evidently types it each time.
  2. While both Greg and Linus use the word "thanks" inside text paragraphs, Greg uses it more often than Linus does.
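One way to check whether a sign-off is boilerplate is to count how many distinct surface forms it takes. The sketch below tallies the capitalization and punctuation variants of a line-initial "thanks"; the function name `thanks_variants` and the line-based input are assumptions for illustration, not the procedure used in the study.

```python
import re
from collections import Counter

# A line that starts with "thanks" in any capitalization, plus any punctuation
# that immediately follows it ("Thanks,", "thanks!", "THANKS.").
THANKS_LINE = re.compile(r"^\s*(thanks\b[^\w\s]*)", re.IGNORECASE)

def thanks_variants(lines):
    """Count the distinct surface forms of a line-initial 'thanks'."""
    variants = Counter()
    for line in lines:
        match = THANKS_LINE.match(line)
        if match:
            variants[match.group(1)] += 1
    return variants
```

A signature inserted automatically would show up as a single dominant variant; several variants with comparable counts suggest the word is typed by hand.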

23 of 43

Examples of "thanks" (all Greg)

Greg:

  • thanks a lot for the review, I really appreciate it
  • Thanks to [name] for pointing out my mistake
  • Thanks to [name] for the big patch, and all of the other people who helped figure this out.

24 of 43

Which features are actually interesting?

25 of 43

Sorry

Greg says sorry a lot more than Linus does.

  1. The raw counts alone were enough to make "sorry" a discriminating feature.
  2. In addition, when Linus uses "sorry" he tends to be sarcastic, whereas Greg does not.

26 of 43

Examples of 'sorry' usage

Greg:

  • sorry, my script isn't that smart...
  • sorry it's late

Linus:

  • Sorry, but you're a bit late
  • Ugh. Sorry, but this patch just looks stupid

27 of 43

Which features are actually interesting?

28 of 43

AdverbCount

Here we counted "actually" both in the total count of adverbs and as a feature on its own.

Adverb usage is an interesting fingerprint of Linus' writing. Linus' most frequently used adverbs include:

  • Actually
  • Always
  • Basically
  • Never
  • Only
  • Really
  • Totally
  • Very

29 of 43

Examples of adverbs (all Linus)

  • but that's not actually true at all
  • Let's hope it actually works. Because otherwise this was just a totally pointless pain in the *ss.
  • The other point of irritation was that there really was a lot of stuff that came in yesterday and basically treated the merge window as some kind of high-tech limbo dance.
  • Right. And that's basically how this 'patch' was actually tested originally - by doing this by hand, without actually having a patch in hand. I told people: this seems to work really automatically.

30 of 43

Which features are actually interesting?

31 of 43

ExpletivesCount

Whether expletives were used or not was an interesting feature.

  1. Linus has certain mild expletives that he uses a lot.
  2. He does tend to obfuscate his stronger expletives (f*ck, sh!t), so we should account for that in future work.
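A first pass at catching obfuscated expletives might flag any token that mixes letters with masking symbols. This is only a heuristic sketch: the symbol set and the name `find_masked_words` are assumptions, and it will also flag innocent tokens such as `user@host` or C pointer expressions like `*ptr`, so it presumes code has already been stripped from the email.

```python
# Symbols commonly used to mask letters (an assumed, non-exhaustive set).
MASK_CHARS = "*!@#$%&"
# Ordinary punctuation to trim from the ends of a token before testing it.
EDGE_PUNCTUATION = ".,;:!?\"'()"

def find_masked_words(text):
    """Return tokens made up only of letters and masking symbols, with both present."""
    hits = []
    for raw in text.split():
        token = raw.strip(EDGE_PUNCTUATION)
        has_letter = any(c.isalpha() for c in token)
        has_mask = any(c in MASK_CHARS for c in token)
        only_allowed = all(c.isalpha() or c in MASK_CHARS for c in token)
        if len(token) >= 3 and has_letter and has_mask and only_allowed:
            hits.append(token)
    return hits
```

Flagged tokens would still need to be mapped back to a known expletive (or reviewed by hand) before being added to the expletive count.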

32 of 43

Which features are actually interesting?

33 of 43

Names

  1. We removed each author's own name from his emails.
  2. Greg refers to "Linus" more often than Linus refers to "Greg".
  3. This is likely due to their relative status (both are leaders, but one ranks higher than the other).

34 of 43

Examples of Name Usage

Greg:

  • Feel free to play around with this patch, I've sent it on to Linus.
  • Thanks, I've applied this to my trees, and will include it in the next round of changesets to Linus.

Linus:

  • Greg seems to use some seriously bad drugs, and creates totally crap commit messages that are just annoying when you have to look at them because there's some conflict. Greg - please fix your crazy tools. Look at this:...and tell me why the f*&% you have commit messages like this....

35 of 43

Which features are actually interesting?

36 of 43

Thing

I was curious why the word "thing" should be such a strong fingerprint for Linus.

  1. He uses the word "thing" many times as part of a multi-word phrase prefaced by "the", as in "the _____ thing"
  2. This construction is called the step-back construction
  3. Step-back construction is used to create distance between the speaker and the item being discussed.
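The "the _____ thing" pattern described above can be approximated with a regular expression. The sketch below allows one to three modifiers between "the" and "thing", where each modifier is either a double-quoted phrase or a bare word; the three-modifier limit and the name `step_back_phrases` are assumptions chosen for illustration, and the pattern will produce some false positives on ordinary sentences.

```python
import re

STEP_BACK = re.compile(
    r'''\bthe\s+                          # the article
        (                                 # capture the modifier(s)
          (?:"[^"]+"|[\w-]+)              # quoted phrase or bare word
          (?:\s+(?:"[^"]+"|[\w-]+)){0,2}  # up to two more modifiers
        )
        \s+thing\b                        # the head noun
    ''',
    re.IGNORECASE | re.VERBOSE,
)

def step_back_phrases(text):
    """Return the modifier part of every 'the ... thing' phrase in text."""
    return [match.group(1) for match in STEP_BACK.finditer(text)]
```

Variants introduced by "this" or "that" (as in "this stupid bridge setup thing") are not matched and would need the article alternation to be widened.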

37 of 43

Examples of "the thing" (all Linus)

  • But I don't actuall ysee [sic] the "hardlink+complex file" thing as a very hard thing to do necessarily
  • Note that I'll probably use the "ALIGN(4096)" thing for other things too
  • The obvious breakage is that even though you disabled the IS_SOFT testing for pending signals, you didn't do the "current-state" thing right.
  • The "task-refcount" thing is just silly
  • It was just this stupid bridge setup thing that was broken

42 of 43

Future Work

  • Obfuscated expletives
  • Changes after Code of Conflict [sic] was passed (March 2015)
  • Compare other leaders
  • Compare change over time
  • Heightened emotional content
    • sentiment
    • _Emotional_ **TYPING**!
