1 of 24

Diffs and Commenting on Diffs

Create Deep Dive

Oswaldo Ferreira - Backend Engineer

January 28, 2018

1

2 of 24

Today we’ll cover

  • Overview of what a “git diff” is
  • Where do we present diffs on GitLab? (Demo / Introduction)
  • Overview of how diffs are stored, fetched and presented on GitLab and Gitaly
    • For standard comparison view
    • For merge requests
    • For comments on merge requests and commits

2

3 of 24

Table of Contents

  • Demo
  • Workflows
  • Tables
  • Caching layers
  • Code dive
  • Questions

3

4 of 24

What are git diffs?

git-diff is a function that takes two input data sets and outputs the changes between them. git diff is a multi-use Git command that when executed runs a diff function on Git data sources. These data sources can be commits, branches, files and more.

4

5 of 24

What are git diffs?

🧙

5

6 of 24

Demo

6

7 of 24

Standard comparison view diffs (workflow)

  • Fetching
    • Submits a diff request to Gitaly (through diff#commit_diff RPC) with limits and refs
    • Diff limits are applied (on Gitaly) to the diff file collection
      • Gitlab::Diff::FileCollection::Compare
      • Gitlab::Git::DiffCollection
    • Most of the process is triggered via Gitlab::Diff::FileCollection::Compare
  • Presentation
    • Each diff file is parsed
      • Gitlab::Diff::File
      • Gitlab::Diff::Line
    • Load through project/diffs/_diffs.html.haml (not async)

7

8 of 24

Merge request diffs

  • Storage
  • Fetching
  • Presentation

8

9 of 24

Merge request diffs (Storage)

9

10 of 24

Merge request diffs (Storage workflow)

  • When a push is received for a branch (source), a new MR version is stored (fetch is done via Gitaly though the same process of the comparison view)
    • MergeRequests::ReloadDiffsService#execute
  • Creates a new merge_request_diffs record and one or more merge_request_diff_files (one for each file)
  • Refreshes the diff highlighting cache (Highlighting is essentially a heavy process)
  • If a new push is received, a new merge_request_diffs is created, and merge_request_diff_files are re-created (no deletion or update happens here)
  • We’re looking forward to store these in Object Storage soon

10

11 of 24

Merge request diffs (Fetching workflow)

  • Once we have the persisted MR version, we fetch it from DB
    • Gitlab::Diff::FileCollection::MergeRequestDiff
  • The diff highlighting cache is refreshed (if empty) and used (7 days cache)
  • Diff stats (files additions and deletions) for each file are also fetched from Gitaly (diff_service#diff_stats RPC) on this process
  • If there is any diff comments in positions outside the diff (the diff was expanded by the user and a comment was left), we unfold it (see: Gitlab::Diff::LinesUnfolder) on the fly

11

12 of 24

Merge request diffs (Presentation workflow)

  • Unlike the standard comparison view, we do load diffs async for MR Diffs tab
  • Diff files and lines serialization is mainly done by DiffFileEntity
  • We can see a few performance issues with the actual size of the serialized JSON and we’re looking forward to improve that by:
    • Reducing the amount of data we return (being mindful if everything is really needed by FE)
    • In the future, loading diffs in sequential batches, which should lead to a much better UX

12

13 of 24

Merge request diffs (caching layers)

  • Postgres: The actual raw diff files content
    • Why: At some point in time, we didn’t have keep-around refs, therefore, after a MR was merged it was impossible to present the diffs
    • “Side-effect”: Performance improvement 🚀
    • How: merge_request_diff_files table
  • Redis: Latest highlighted diff content
    • Why: Generating diff highlight HTML (today we use Rouge) is a relatively slow task to make under a request
    • How: Gitlab::Diff::HighlightCache

13

14 of 24

Comments on diffs (Discussion tab)

14

15 of 24

Comments on diffs

  • Storage
  • Fetching
  • Presentation

15

16 of 24

Comments on diffs (Storage)

File path SHA, e.g. 02d635fb83402a9a1a0c113772f1e6d365723b95_93_90

16

17 of 24

Comments on merge request diffs (Storage)

Obs: We delete merge_request_diff_files after a MR gets merged, therefore reusing all diffs from MRs is not possible.

17

18 of 24

Comments on diffs (Storage)

  • DiffNote positions
    • original_position
      • Mainly used to present the comment in the Discussion tab of Merge Requests
      • Isn’t updated as new MR versions are added
    • position
      • Mainly used to know where exactly we should present the comment in the Diffs tab
      • Is updated to the latest MR version if the line wasn’t changed (outdated)
    • change_position
      • Mainly used to know in which context the commented line was changed
      • Is updated when the line the note was left was changed (position stops being updated)
  • base_sha: Point in time where it was branched off of the target branch
  • start_sha: Latest HEAD of target branch
  • head_sha: Latest HEAD of source branch

18

19 of 24

Comments on diffs (Storage workflow)

  • When someone comments in a diff (the whole diff is not persisted):
    • It fetches the raw diff for the original_position ref (which won’t change after updating the MR), commit_service#find_commit RPC is used
      • original_position is a Gitlab::Diff::Position containing the line positions and ref in the diff when it was originally received
    • It chunks the diff (because we don’t need all of it), then persist on separate note_diff_files
    • The same process happens when leaving a comment in a commit
  • When the diff is updated (push for instance):
    • We use the Gitlab::Diff::PositionTracer to update the position of every diff note (if needed)
      • That’s exactly what makes a diff note outdated or not, or move it around if needed. If the position reference stays in a revision behind the MR HEAD, we got an outdated note

19

20 of 24

Comments on diffs (caching layers)

  • Postgres: The commented diff file hunks
    • Why: As you might expect, in the past we were fetching diffs in a N+1 manner for different revisions from Gitaly. In a MR with more than 100 comments, things started a getting bit out of control.
      • Additionally, we started deleting diff files from DB after the MR got merged (table getting too big), therefore, no way to reuse all existing persisted diffs.
    • How: Chunking the diff file (from top to commented line) and persisting the end result to note_diff_files
  • Redis: The highlighted diff file hunks
    • Why: Same as the standard MR diff file problem. It was spending way too much time highlighting every diff hunk
    • How: Gitlab::DiscussionsDiff::HighlightCache

20

21 of 24

Questions?

21

22 of 24

Code Dive

22

23 of 24

Questions?

23

24 of 24

Thank you

Oswaldo Ferreira - Backend Engineer

oswaldo@gitlab.com

24