#10 roughtime response

(Incident #10)

Date: 2020/08/14

Authors: Terence, Raul, Preston

Status: Awaiting resolution. Root cause identified. Issue mitigated.

Network: Medalla

Summary: One of the roughtime servers returned incorrect time information, and Prysm nodes did not properly fall back from the bad responses. This bug caused all Prysm nodes to exhibit clock skew. Because of this clock skew, validators incorrectly proposed blocks and attestations for future slots. Read the blog post write-up here.

Impact: Due to the erroneous roughtime responses and the resulting clock skew, validators proposed invalid attestations and blocks for future slots, reducing global participation. All Prysm nodes were affected from approximately 17:30 to 18:45 UTC.

Root Causes: Erroneous roughtime responses from one of the default set of roughtime servers configured by the Cloudflare roughtime client library. Specifically, the “ticktock” roughtime server managed by mixmin.net reported a time 24 hours in the future. Averaged across all 6 servers, this timestamp produced a +4 hour clock adjustment in all Prysm nodes.

Resolution: Make roughtime an opt-in flag while we evaluate potential issues stemming from erroneous roughtime responses.
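The shape of the mitigation is a flag that defaults to off, so clock calibration only runs when the operator explicitly requests it. A hedged sketch of that pattern (the flag name below is hypothetical; the actual flag was added in PR 6898):

```go
package main

import (
	"flag"
	"fmt"
)

// enableRoughtime is a hypothetical opt-in flag: calibration runs
// only when the operator explicitly passes it on the command line.
var enableRoughtime = flag.Bool("enable-roughtime", false,
	"opt in to adjusting the local clock from roughtime servers")

func main() {
	flag.Parse()
	if *enableRoughtime {
		fmt.Println("roughtime calibration enabled")
		// The calibration routine would query the server set here.
	} else {
		fmt.Println("roughtime calibration disabled (default)")
	}
}
```

Defaulting to off means a future erroneous server response cannot skew nodes that never opted in.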

Detection: Terence first observed a local beacon node rejecting future blocks and future attestations. A few minutes later, the on-calls were paged by the "High roughtime offset" alert. At the same time, many more users in the #general and #bug-report rooms began to report local nodes rejecting future blocks and attestations.

Action Items:

| Action Item | Type | Owner | Relevant Link |
| --- | --- | --- | --- |
| Announce the issue and escalation steps in comm channels | Announcement | Terence | |
| Make roughtime an opt-in flag | Code | Ivan/Raul | https://github.com/prysmaticlabs/prysm/pull/6898 |
| Ensure the hot-fix works as expected and leads to a healthy beacon node | Pre-flight test | Raul/Terence (tested with 512 validators) | |
| Build the new release with the opt-in flag | Release | Preston | |
| Announce new build | Release | Preston | |
| Add slasher to prylabs infra | Infrastructure | Preston | Blocked: remote slashing protection does not work |
| Fix broken test in alpha21 | Test | Terence | |
| Work on a blog post to further explain the incident and lessons learned | Reflection | Terence/Raul | |
| Investigate external slashing protection failures | Investigation | | |
| Reach out to “ticktock” server owners | Investigation | Preston | |

Lessons Learned

What went wrong

  • We mistakenly believed we had a proper fallback in place for roughtime servers going down.
  • Every single Prysm node in the network was simultaneously affected, leading to significant participation drops.
  • The Prysmatic Labs team believed there was enough distribution in the default set of 6 roughtime servers that this type of incident would not occur as a global outage.

Where we got lucky

  • One of our contributors already had an in-progress PR (thanks @ncitron), which would opt-out of roughtime recalibration.
  • We were able to immediately opt out of roughtime by using a flag, making the fix trivial and easy to verify in a single pull request.
  • Users were consistently engaged on Discord, coming in droves when their nodes were having issues, offering detailed reports and important metrics of their failures.
  • We had a constant re-sync that would keep updating our local node’s time when it had an offset over 2 seconds. We kept recalibrating roughtime in an attempt to resolve the issue faster. This likely saved 30 minutes to 1 hour on the incident end time.
  • The erroneous roughtime results seemed to resolve themselves after approximately 90 minutes, and the incident had ended before we were able to push an emergency release.

Timeline (all times UTC):

2020/08/14

17:25: Terence noticed his local nodes logging large numbers of warnings about rejecting incoming blocks due to slots from the future. The incoming blocks were all 4+ hours into the future.

17:28: High roughtime offset alerts fired via Prometheus, and on-call received a "10 epochs since finality" page.

17:35: At least 30 Discord users in the lobby began reporting the following warning:

WARN roughtime: Roughtime reports your clock is off by more than 2 seconds offset=4h0m0.028854657s

17:41: Issues escalated to the multi-client chat rooms.

17:42: Ivan picked up and completed https://github.com/prysmaticlabs/prysm/pull/6898

17:43: Terence alerted @everyone in the #war-room channel that this is a PS0 incident and we need all hands on deck.

17:45: Discord users noted that restarting both the beacon node and validator clients does not resolve the situation, even temporarily. The most viable option moving forward is to implement the roughtime opt-in flag.

18:00: Terence tested 6898 locally with 512 validators

18:20: Captured debug logs showed that the “ticktock” server was reporting +24 hours:


| Server Name | Midpoint |
| --- | --- |
| Google-Sandbox-Roughtime | Fri, 14 Aug 2020 18:20:23 GMT |
| Caesium | Fri, 14 Aug 2020 18:20:23 GMT |
| Chainpoint-Roughtime | Fri, 14 Aug 2020 18:20:23 GMT |
| Cloudflare-Roughtime | Fri, 14 Aug 2020 18:20:23 GMT |
| int08h-Roughtime | Fri, 14 Aug 2020 18:20:23 GMT |
| ticktock | Sat, 15 Aug 2020 18:20:23 GMT |

18:27: Raul got hold of Preston, who would be back in an hour to build the release. In the meantime, we would advertise docker images.

18:40: Preston noted the fix was not enough; we need to disable roughtime 100% by default.

18:42: Raul begins investigating Kibana and analyzing debug log responses from roughtime, utilizing the fluentd filters

18:43: Terence cross-checks kubectl logs for all pods in the beacon-chain namespace. Pods unhealthy with roughtime offset issues as expected.

18:46: Raul pushed the corrected fix to 6898.

19:05: Raul confirms the fix allows the nodes to work locally. If clock skew exists, the node logs a warning to the user but no longer attempts to update based on the roughtime servers.

19:08: Terence announces to everyone in our discord channel: “The prysm nodes have encountered erroneous rough-time responses and the rough-time fall back did not work as expected. We have identified the issue, and are pushing an emergency hotfix soon + release in the next hour. The next upcoming release will make roughtime sync opt-out by default.”

19:18: Buildkite unit tests, spec tests, docker images successfully build, e2e tests pending completion. Preston ready to start the release process.

19:22: Commit hash ready: https://github.com/prysmaticlabs/prysm/commit/d24f99d66db22691b69c76bc57c7509e7f3ba8fe, Terence confirms it fixes his validators. Preston starting a rolling restart to our prod stateful set pods with the new docker image. Cluster validators are updated with the new images (ephemeral disk not preserved).

19:34: Docker images are tagged for alpha.21 and stable, binaries being built

19:34: Health for statefulset prod pods being monitored, ensuring rolling update succeeded

19:36: Began rolling start to our validator pods with the new docker image.

20:29: Looking at logs for returned delay values. On average, they seem to be under 0.1s. Update: Delay is not the key metric to investigate, rather the “midpoint” is the field to study. Note: chart screenshot is in PST. https://kibana.prylabs.network/goto/e5f5f64a4426c85aee1d76abc2d994be

20:32: Looking at reported offsets over 2 seconds. The data indicates an incident that lasted about 90 minutes where Prylabs production nodes observed an offset of about 14000 seconds. Note: chart screenshot is in PST. https://kibana.prylabs.network/goto/6ce2d73c13c0eef600b604fee6d8f4f4 

20:41: Looking at prometheus data for the average offset indicates a clear issue starting around 17:30 and increasing until ~18:45 then the offset starts dropping and resolving.

20:52: Immediate investigation ends. It is clear that the incident has ended and a fix has been released. Nodes that have updated will immediately be resolved, nodes that have not been updated will resolve in the coming hours. Oncall is monitoring participation which is gradually increasing.

22:20: Users are reporting slashing protection kicking in. This is due to the earlier clock skew which produced attestations and blocks for 4 hours into the future. Instead of repeating that work and potentially getting slashed, Prysm validators are refusing to produce the slashable condition.
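The refusal behavior described in this entry can be sketched as a minimal local protection check; the type and method names below are hypothetical, not Prysm's actual slashing-protection API. Having already signed attestations for epochs hours ahead, validators must refuse to sign again for the epochs in between, since a second signature over an already-covered target is a slashable condition.

```go
package main

import "fmt"

// protector tracks the highest target epoch this validator has
// attested to. A real implementation persists this to disk; the
// names here are hypothetical.
type protector struct {
	lastSignedTarget uint64
}

// canSign refuses any attestation whose target epoch is not strictly
// greater than one already signed, trading liveness for safety.
func (p *protector) canSign(target uint64) bool {
	if target <= p.lastSignedTarget {
		return false
	}
	p.lastSignedTarget = target
	return true
}

func main() {
	p := &protector{}
	fmt.Println(p.canSign(100)) // true: first attestation
	// During the skew, the node signs an epoch far in the "future".
	fmt.Println(p.canSign(137)) // true, but for a future epoch
	// After the skew resolves, present-day epochs are refused.
	fmt.Println(p.canSign(101)) // false: would risk slashing
}
```

This is exactly why affected validators went silent for several hours rather than getting slashed: the protection check fails closed.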

2020/08/15

00:13 INCIDENT REOPENED

00:13: Nishant notices a critical flaw in the fix PR 6898. Feature flags are not being set unless the roughtime flag is on.
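The flaw Nishant spotted has a classic shape: global configuration populated only inside a conditional branch. A simplified Go reconstruction of this bug class (not the literal code from PR 6898):

```go
package main

import "fmt"

type featureConfig struct {
	enableRoughtime bool
	otherFeature    bool
}

var cfg *featureConfig

// buggyInit mirrors the reported bug class: the global config is only
// assigned on the roughtime path, so with roughtime off every other
// feature flag is silently dropped. Simplified reconstruction.
func buggyInit(roughtime, other bool) {
	cfg = nil
	if roughtime {
		cfg = &featureConfig{enableRoughtime: true, otherFeature: other}
	}
	// BUG: when roughtime is off, cfg is never populated.
}

// fixedInit assigns the config unconditionally.
func fixedInit(roughtime, other bool) {
	cfg = &featureConfig{enableRoughtime: roughtime, otherFeature: other}
}

func main() {
	buggyInit(false, true)
	fmt.Println(cfg == nil) // true: other flags silently lost
	fixedInit(false, true)
	fmt.Println(cfg.otherFeature) // true: flag survives
}
```

The failure is invisible in testing that only exercises the roughtime-on path, which is how it slipped through the hot-fix verification.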

00:16 Preston updates the “latest” binary to point to alpha.20 as a temporary rollback and announces a recommendation to rollback to alpha.20. We are now waiting for PR 7004 to merge as the release candidate for alpha.22.

00:45 Oncalls are evaluating whether or not to extend the size of the hot state cache to help resolve finality recovery faster for alpha.22. The current default hot state cache size is for 8 epochs, but Medalla is approaching closer to 100 epochs since finality. While Prysm can resolve this, it would be faster to use a bit more cache memory for hot states.

01:12 Oncalls decide to update the default cache size to 64 and make it configurable via flag. Some initial testing is that this will inflate memory usage potentially by 1.5gb. The flag can be tuned later after finality is reached.
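As a rough sketch of the tunable being discussed, a bounded cache keyed by epoch behaves like the following; the names are hypothetical and this is not Prysm's actual hot state cache. With the old size of 8, states needed for finality recovery ~100 epochs back would already be evicted, forcing slow regeneration.

```go
package main

import (
	"container/list"
	"fmt"
)

// hotStateCache is a minimal FIFO-bounded cache keyed by epoch,
// illustrating why the size tunable matters. Hypothetical sketch.
type hotStateCache struct {
	size  int
	order *list.List
	items map[uint64][]byte
}

func newHotStateCache(size int) *hotStateCache {
	return &hotStateCache{size: size, order: list.New(), items: map[uint64][]byte{}}
}

// put inserts a state, evicting the oldest entry when at capacity.
func (c *hotStateCache) put(epoch uint64, state []byte) {
	if c.order.Len() >= c.size {
		oldest := c.order.Remove(c.order.Front()).(uint64)
		delete(c.items, oldest)
	}
	c.order.PushBack(epoch)
	c.items[epoch] = state
}

// has reports whether the state for an epoch is still cached.
func (c *hotStateCache) has(epoch uint64) bool {
	_, ok := c.items[epoch]
	return ok
}

func main() {
	c := newHotStateCache(8) // old default: 8 epochs
	for e := uint64(0); e < 100; e++ {
		c.put(e, nil)
	}
	fmt.Println(c.has(5))  // false: long since evicted
	fmt.Println(c.has(99)) // true: still hot
}
```

Raising the bound to 64 keeps far more of the unfinalized range resident at the cost of extra memory, matching the ~1.5gb estimate above.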

01:57 All of Prysmatic Labs' validators produced slashable attestations. The Prylabs validators' local storage was wiped in the emergency fix, and no external slasher was running. To be confirmed… At least 800 of 1024 validators slashed or will be slashed.

02:37 Several reports of nodes unable to sync. The current thinking is that too much of the network is trying to sync at the same time. Alpha.22 is delayed until further notice.

02:46 Prylabs team decides the best approach now is to wait. Users should run alpha.20 or the latest docker image.

18:12 Investigation into syncing difficulties is ongoing.

2020/08/16

03:36 Nishant and Victor open a potential fix for initial sync. PR 7012

17:51 PR 7012 merges. Some users are reporting successful sync with that branch. Prysmatic Labs starts deploying to their production nodes.  

21:15 Build started for alpha.22 at commit 0be1957c2897909b943b80fdd028f5346ae6cde6.

21:33 Alpha.22 release is available. Link.

21:40 Alpha.22 announced via discord. Prysmatic oncalls continue to monitor sync for potential optimizations while more users reach the head of the chain.

2020/08/17

04:53 Alpha.23 is released and announced on discord. This release contains several sync fixes and is expected to resolve Medalla issues. Users are recommended to run with “--dev” flag for a better experience.