#10 roughtime response
(Incident #10)
Date: 2020/08/14
Authors: Terence, Raul, Preston
Status: Awaiting resolution. Root cause identified. Issue mitigated.
Network: Medalla
Summary: One of the roughtime servers returned incorrect time information, and Prysm nodes did not properly fall back when this happened. This bug caused all Prysm nodes to exhibit clock skew. Because of the clock skew, validators incorrectly proposed blocks and attestations for future slots. Read the blog post write-up here.
Impact: Due to the erroneous roughtime responses and the resulting clock skew, validators proposed invalid attestations and blocks from the future. This impacted global participation. All Prysm nodes were affected from approximately 17:30 to 18:45 UTC.
Root Causes: Erroneous roughtime responses from one of the default set of roughtime servers configured by the Cloudflare roughtime client library. Specifically, the “ticktock” roughtime server managed by mixmin.net reported a time 24 hours in the future. This timestamp, averaged across all 6 servers, resulted in a +4 hour clock adjustment on all Prysm nodes.
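To illustrate the arithmetic behind the root cause, the sketch below (in Go, with illustrative values rather than Prysm's actual roughtime code) shows how averaging per-server offsets lets a single faulty server shift the result by 24h / 6 = 4h:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative offsets reported by the six default roughtime servers
	// relative to the local clock. Five report no skew; "ticktock" reports
	// a time 24 hours in the future.
	offsets := map[string]time.Duration{
		"Google-Sandbox-Roughtime": 0,
		"Caesium":                  0,
		"Chainpoint-Roughtime":     0,
		"Cloudflare-Roughtime":     0,
		"int08h-Roughtime":         0,
		"ticktock":                 24 * time.Hour,
	}

	var sum time.Duration
	for _, off := range offsets {
		sum += off
	}
	avg := sum / time.Duration(len(offsets))

	// 24h averaged over 6 servers = 4h, matching the skew observed on nodes.
	fmt.Printf("averaged clock adjustment: %s\n", avg)
}
```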
Resolution: Start with an opt-in flag for roughtime while we evaluate potential issues stemming from erroneous roughtime responses.
Detection: Terence first observed a local beacon node rejecting future blocks and future attestations. A few minutes later, the on-calls were paged by the high roughtime offset alert. At the same time, many more users in the #general and #bug-report rooms began reporting local nodes rejecting future blocks and attestations.
Action Items:
| Action Item | Type | Owner | Relevant Link |
| --- | --- | --- | --- |
| Announce the issue and escalation steps in comm channels | Announcement | Terence | |
| Make roughtime an opt-in flag | Code | Ivan/Raul | |
| Ensure the hot-fix works as expected and leads to a healthy beacon node | Pre-flight test | Raul/Terence (tested with 512 validators) | |
| Build the new release with the opt-in flag | Release | Preston | |
| Announce the new build | Release | Preston | |
| Add slasher to Prylabs infra | Infrastructure | Preston | Blocked, remote slashing protection does not work |
| Fix broken test in alpha.21 | Test | Terence | |
| Work on a blog post to further explain the incident and lessons learned | Reflection | Terence/Raul | |
| Investigate external slashing protection failures | Investigation | | |
| Reach out to “ticktock” server owners | Investigation | Preston | |
What went wrong
Where we got lucky
Timeline (all times UTC):
2020/08/14
17:25: Terence noticed his local nodes logging large numbers of warnings about rejecting incoming blocks whose slots were in the future. The incoming blocks were all 4+ hours ahead
17:28: Received high roughtime offset alerts via Prometheus, and on-call was paged with a 10-epochs-since-finality alert.
17:35: At least 30 Discord users in the lobby started to report the following warning:
WARN roughtime: Roughtime reports your clock is off by more than 2 seconds offset=4h0m0.028854657s
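For context on why a roughly 4-hour offset surfaces as "blocks from the future", the sketch below (assuming Medalla's 12-second slot interval; not Prysm's actual slot code) shows how that skew translates into roughly 1200 slots:

```go
package main

import (
	"fmt"
	"time"
)

const secondsPerSlot = 12 // Medalla slot interval

// currentSlot computes the slot a node believes it is in, given its clock.
func currentSlot(genesis, now time.Time) uint64 {
	return uint64(now.Sub(genesis).Seconds()) / secondsPerSlot
}

func main() {
	genesis := time.Now().Add(-24 * time.Hour) // illustrative genesis time
	trueNow := time.Now()
	skewedNow := trueNow.Add(4 * time.Hour) // clock after the roughtime skew

	fmt.Println("slot on a correct clock:", currentSlot(genesis, trueNow))
	fmt.Println("slot on a skewed clock: ", currentSlot(genesis, skewedNow))
	// The skewed node runs ~1200 slots (4h / 12s) ahead, so blocks and
	// attestations it produces look "from the future" to correctly-clocked
	// peers, which reject them.
}
```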
17:43: Terence alerted @everyone in the #war-room channel that this is a PS0 incident and that we need all hands on deck.
17:41: Issue escalated to the multi-client chat rooms
17:42: Ivan picked up and completed https://github.com/prysmaticlabs/prysm/pull/6898
17:45: Discord users noted that restarting both the beacon node and validator clients does not resolve the situation, even temporarily. The most viable option moving forward is to implement the roughtime opt-in flag
18:00: Terence tested PR 6898 locally with 512 validators
18:20: Captured debug logs showed that the “ticktock” server was reporting a midpoint +24 hours in the future:
| Midpoint | Server Name |
| --- | --- |
| Fri, 14 Aug 2020 18:20:23 GMT | Google-Sandbox-Roughtime |
| Fri, 14 Aug 2020 18:20:23 GMT | Caesium |
| Fri, 14 Aug 2020 18:20:23 GMT | Chainpoint-Roughtime |
| Fri, 14 Aug 2020 18:20:23 GMT | Cloudflare-Roughtime |
| Fri, 14 Aug 2020 18:20:23 GMT | int08h-Roughtime |
| Sat, 15 Aug 2020 18:20:23 GMT | ticktock |
18:27: Raul got a hold of Preston. Preston will be back in an hour to build the release. In the meantime, we will advertise docker images
18:40: Preston mentioned the fix was not enough; we need to disable roughtime 100% by default
18:42: Raul begins investigating Kibana and analyzing debug log responses from roughtime, utilizing the fluentd filters
18:43: Terence cross-checks kubectl logs for all pods in the beacon-chain namespace. Pods unhealthy with roughtime offset issues as expected.
18:46: Raul pushed the corrected fix to PR 6898
19:05: Raul confirms the fix allows the nodes to work locally. If clock skew exists, the node logs a warning to the user but does not attempt to adjust its clock based on the roughtime servers.
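A hedged sketch of the post-fix behavior described in this entry (the names below are illustrative, not Prysm's actual API): the roughtime offset is surfaced as a warning but never applied to the time the node uses.

```go
package timeutils

import (
	"log"
	"time"
)

const offsetWarningThreshold = 2 * time.Second

// checkRoughtimeOffset is assumed to run periodically with the averaged
// offset reported by the roughtime servers. It only warns the operator.
func checkRoughtimeOffset(offset time.Duration) {
	if offset > offsetWarningThreshold || offset < -offsetWarningThreshold {
		log.Printf("WARN roughtime: your clock appears off by %s; offset not applied", offset)
	}
}

// Now returns the unadjusted system time. Before the fix, the roughtime
// offset would have been added here.
func Now() time.Time {
	return time.Now()
}
```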
19:08: Terence announces to everyone in our Discord channel: “The prysm nodes have encountered erroneous rough-time responses and the rough-time fall back did not work as expected. We have identified the issue, and are pushing an emergency hotfix soon + release in the next hour. The next upcoming release will make roughtime sync opt-out by default.”
19:18: Buildkite unit tests, spec tests, and docker images build successfully; e2e tests pending completion. Preston is ready to start the release process.
19:22: Commit hash ready: https://github.com/prysmaticlabs/prysm/commit/d24f99d66db22691b69c76bc57c7509e7f3ba8fe, and Terence confirms it fixes his validators. Preston starting a rolling restart of our prod stateful set pods with the new docker image. Cluster validators are updated with the new images (ephemeral disk not preserved).
19:34: Docker images are tagged for alpha.21 and stable, binaries being built
19:34: Health for statefulset prod pods being monitored, ensuring rolling update succeeded
19:36: Began a rolling restart of our validator pods with the new docker image.
20:29: Looking at logs for returned delay values. On average, they seem to be under 0.1s. Update: delay is not the key metric to investigate; rather, the “midpoint” field is the one to study. Note: chart screenshot is in PST. https://kibana.prylabs.network/goto/e5f5f64a4426c85aee1d76abc2d994be
20:32: Looking at reported offsets over 2 seconds. The data indicates an incident lasting about 90 minutes during which Prylabs production nodes observed an offset of about 14,000 seconds (roughly the 4-hour skew). Note: chart screenshot is in PST. https://kibana.prylabs.network/goto/6ce2d73c13c0eef600b604fee6d8f4f4
20:41: Prometheus data for the average offset indicates a clear issue starting around 17:30 and increasing until ~18:45, when the offset starts dropping and resolving.
20:52: Immediate investigation ends. It is clear that the incident has ended and a fix has been released. Nodes that have updated will recover immediately; nodes that have not will recover over the coming hours. On-call is monitoring participation, which is gradually increasing.
22:20: Users are reporting slashing protection kicking in. This is due to the earlier clock skew, which produced attestations and blocks 4 hours into the future. Rather than repeating that work and potentially getting slashed, Prysm validators refuse to produce the slashable condition.
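An illustrative sketch (not Prysm's actual slashing-protection code) of why validators refused to sign once clocks corrected: the local history already holds attestations with target epochs hours in the future, so new attestations for the true current epoch are refused rather than risking a slashable vote.

```go
package protection

import "errors"

// ErrSlashable means signing would conflict with the local signing history.
var ErrSlashable = errors.New("refusing to sign: conflicts with local signing history")

// History is a hypothetical record of the highest target epoch each
// validator has already attested to.
type History struct {
	highestSignedTarget map[uint64]uint64 // validator index -> target epoch
}

// NewHistory returns an empty signing history.
func NewHistory() *History {
	return &History{highestSignedTarget: make(map[uint64]uint64)}
}

// CheckAndRecord rejects an attestation whose target epoch does not strictly
// advance past what this validator has already signed. After the incident,
// the history already contained targets hours in the future, so attestations
// for the (correct) current epoch were refused.
func (h *History) CheckAndRecord(validatorIdx, targetEpoch uint64) error {
	if prev, ok := h.highestSignedTarget[validatorIdx]; ok && targetEpoch <= prev {
		return ErrSlashable
	}
	h.highestSignedTarget[validatorIdx] = targetEpoch
	return nil
}
```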
2020/08/15
00:13: INCIDENT REOPENED
00:13: Nishant notices a critical flaw in the fix PR 6898. Feature flags are not being set unless the roughtime flag is on.
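A hedged reconstruction of the bug pattern described above (illustrative shape only, not the literal code from PR 6898): feature-flag setup was nested under the roughtime check, so running without the roughtime flag skipped it entirely.

```go
package main

import "fmt"

// Config is a stand-in for the node's feature-flag configuration.
type Config struct{ FeatureFlagsSet bool }

func applyFeatureFlags(cfg *Config) { cfg.FeatureFlagsSet = true }
func initRoughtime()                { fmt.Println("roughtime enabled") }

// Buggy shape: feature flags are only applied when roughtime is enabled.
func configureBuggy(enableRoughtime bool, cfg *Config) {
	if enableRoughtime {
		applyFeatureFlags(cfg) // skipped entirely when roughtime is off
		initRoughtime()
	}
}

// Fixed shape: feature flags are applied unconditionally; only the
// roughtime behavior is gated behind its flag.
func configureFixed(enableRoughtime bool, cfg *Config) {
	applyFeatureFlags(cfg)
	if enableRoughtime {
		initRoughtime()
	}
}

func main() {
	cfg := &Config{}
	configureBuggy(false, cfg)
	fmt.Println("buggy, roughtime off ->", cfg.FeatureFlagsSet) // false
	configureFixed(false, cfg)
	fmt.Println("fixed, roughtime off ->", cfg.FeatureFlagsSet) // true
}
```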
00:16: Preston updates the “latest” binary to point to alpha.20 as a temporary rollback and announces a recommendation to roll back to alpha.20. We are now waiting for PR 7004 to merge as the release candidate for alpha.22.
00:45: On-calls are evaluating whether or not to extend the size of the hot state cache to help speed up finality recovery for alpha.22. The current default hot state cache size is 8 epochs, but Medalla is approaching 100 epochs since finality. While Prysm can recover from this, it would be faster to use a bit more cache memory for hot states.
01:12: On-calls decide to update the default cache size to 64 and make it configurable via a flag. Initial testing suggests this may inflate memory usage by roughly 1.5 GB. The flag can be tuned later after finality is reached.
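A minimal sketch, under assumptions (names, structure, and the flag are illustrative, not Prysm's actual hot state cache), of a size-bounded cache whose capacity is set from such a flag; raising the capacity trades memory for faster state access while finality lags.

```go
package cache

import "container/list"

// entry is what we store in the eviction list.
type entry struct {
	root  [32]byte
	state []byte // stand-in for a serialized beacon state
}

// HotStateCache keeps at most `capacity` recently used states.
type HotStateCache struct {
	capacity int
	order    *list.List                 // most recently used at the front
	items    map[[32]byte]*list.Element // block root -> list element
}

// New builds a cache sized from a hypothetical cache-size flag, e.g. the
// default discussed above moving from 8 to 64.
func New(capacity int) *HotStateCache {
	return &HotStateCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[[32]byte]*list.Element),
	}
}

// Put inserts or refreshes a state and evicts the least recently used one
// when the cache is over capacity.
func (c *HotStateCache) Put(root [32]byte, state []byte) {
	if el, ok := c.items[root]; ok {
		c.order.MoveToFront(el)
		el.Value.(*entry).state = state
		return
	}
	c.items[root] = c.order.PushFront(&entry{root: root, state: state})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).root)
	}
}

// Get returns a cached state and marks it as recently used.
func (c *HotStateCache) Get(root [32]byte) ([]byte, bool) {
	if el, ok := c.items[root]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*entry).state, true
	}
	return nil, false
}
```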
01:57: All of Prysmatic Labs' validators produced slashable attestations. Prylabs validators' local storage was wiped in the emergency fix, and no external slasher was running. To be confirmed… At least 800 of 1024 validators are slashed or will be slashed.
02:37: Several reports of nodes unable to sync. The current thinking is that too much of the network is trying to sync at the same time. Alpha.22 is delayed until further notice.
02:46: The Prylabs team decides the best approach now is to wait. Users should run alpha.20 or the latest docker image.
18:12: Investigation into syncing difficulties is ongoing.
2020/08/16
03:36: Nishant and Victor open a potential fix for initial sync: PR 7012
17:51: PR 7012 merges. Some users are reporting successful sync with that branch. Prysmatic Labs starts deploying it to their production nodes.
21:15: Build started for alpha.22 at commit 0be1957c2897909b943b80fdd028f5346ae6cde6.
21:33: Alpha.22 release is available. Link.
21:40: Alpha.22 announced via Discord. Prysmatic on-calls continue to monitor sync for potential optimizations while more users reach the head of the chain.
2020/08/17
04:53: Alpha.23 is released and announced on Discord. This release contains several sync fixes and is expected to resolve the Medalla issues. Users are recommended to run with the “--dev” flag for a better experience.