Thanks for your review of our paper and for sharing your thoughts! You raised a number of good points. I just want to clarify a few things:
1. Why not use a specification to define gray failure?
First, as we mentioned in the paper and the talk (https://www.cs.jhu.edu/~huang/talk/hotos17_talk.pdf), there are various ways to define gray failure, and the model we propose is admittedly not the only one (although, among the various definitions, the engineers we talked to seem to agree most with the differential observability perspective). We did think about it from the system-specification perspective you proposed but did not go that route, for the reasons below.
In an ideal world, if we could fully specify the expected behaviors of a system under all sorts of operational conditions and workloads, then yes, we could use that specification to "watch the watchers" and define gray failures as a buggy detector missing some specification checks. Indeed, if the full specification is available, we can even use it to characterize arbitrary faults (https://www.cis.upenn.edu/~ahae/papers/detection-opodis2009.pdf). But in practice, it is extremely difficult to come up with that full specification beforehand for real-world complex systems, at least at the level of system implementations. Certain behaviors are also inherently difficult to "specify" as acceptable or not without considering the context, e.g., an agent leaking memory slowly (https://aws.amazon.com/message/680342/), the network having a 2% packet loss rate, or a component being unresponsive for 10 ms.
So overall, I think if we interpret gray failures purely from the specification-gap aspect, the main implication is that we should try to write down the specification for each system component and make it complete. The statement is true, but it would require arduous effort from every component's engineers and perhaps hindsight from many incidents. To be clear, I think for protocol-level faults, this is a principled way to characterize and address the problem. But for implementation-level faults or faults related to the asynchrony of a system, this is hard (even detecting simple crash faults is difficult). Jeff Mogul raised a great point at the workshop that some gray failures are extrinsic, while others are intrinsic. The extrinsic ones are primarily due to a poor job in implementing the fault detectors; these are the relatively easy cases and can perhaps be handled by gradually fixing the bugs in the detectors. But the intrinsic ones are due to the inherent difficulty in defining and measuring the SLOs (https://ai.google/research/pubs/pub48033).
2. Why use the differential observability aspect to define gray failure?
We want to define the problem by taking into consideration the practical constraints and characteristics of cloud systems. Because cloud systems have many complex components, and one component may be serving many different kinds of "applications" that have all sorts of "requirements", we are pessimistic that we would be able to obtain the perfect fault definition for every component. Even if we did, the measurement and checking might be prohibitively expensive. In addition, we should allow complex cloud systems to operate in a degraded mode, as all complex systems do. In other words, having some faults like a memory leak or unresponsiveness may be acceptable at times. Our main insight is that while judging the absolute health of a system component in isolation can be tricky, the gray faults that practitioners care about most usually have some "external effect". This again reflects the characteristics of cloud systems: because the system consists of many highly interactive components across layers, when a component becomes unhealthy, the issue is likely observable through its negative effects on the execution of some other components in the system. Multi-tenancy, a key feature of the cloud, also implies that different components may observe the system status differently. So we decided to focus on the differential observability aspect.
The main implication of our definition is that instrumentation, logging, and measurement at one component in isolation may not be enough for gray failures (i.e., sometimes the nines we measure about our system could be meaningless); we need to consider the views from the "applications"/consumers. This is different from just advocating that we should add more logging/metrics to our software. We actually leverage this implication and the above insight to build a solution later (shameless plug here): https://www.cs.jhu.edu/~huang/paper/panorama-osdi18.pdf [a][b]
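To make that implication a bit more concrete, here is a minimal, hypothetical sketch (not the actual Panorama implementation; the names `Observation` and `gray_failure_suspects`, and the threshold, are made up for illustration): a component is flagged as a potential gray failure when its own detector says it is healthy, yet several of its consumers keep reporting failed interactions with it.

```python
# Hypothetical sketch of the differential-observability idea (illustration only).
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Observation:
    observer: str   # the consumer component reporting the evidence
    subject: str    # the component being observed
    healthy: bool   # did the interaction with `subject` succeed?

def gray_failure_suspects(self_reports, observations, min_negative=3):
    """self_reports: {component: True/False} from its own detector (e.g., heartbeat).
    observations: external evidence gathered from consumer components."""
    negatives = defaultdict(set)
    for ob in observations:
        if not ob.healthy:
            negatives[ob.subject].add(ob.observer)
    suspects = []
    for subject, observers in negatives.items():
        # Differential observability: the component's own detector says it is
        # fine, but enough distinct consumers observe it misbehaving.
        if self_reports.get(subject, True) and len(observers) >= min_negative:
            suspects.append(subject)
    return suspects
```

For example, a store that still answers its heartbeats but makes three different front-ends time out would surface as a suspect; gathering exactly this kind of consumer-side evidence is the idea behind the OSDI '18 work above.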
3. Latent masked faults vs. gray failures
Actually, latent masked faults are not considered gray failures in our paper. If the system is aware of a fault and masks it without causing disruption to others, we think this is the benign form of differential observability (case 3 in Table 1), where the fault-tolerance mechanism is working. In other words, we are not advocating exposing the latent faults to applications. Exposing faults is an interesting idea that is explored in Pigeon (https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final90.pdf). We do advocate that at the application/consumer side, in addition to handling faults, it is beneficial to expose that error evidence to a system fault detector. That is the basic idea of our OSDI '18 work listed above.
4. The high redundancy hurts example
The example is actually based on a real story in Azure rather than being contrived. The caveat is that the networking infrastructure is owned by a separate team. When that team designed the infrastructure, they did not necessarily anticipate the buggy switch and all the traffic patterns beforehand. So, intuitively, engineers assumed that adding more switches means higher availability. And it did work great initially. But when another team, e.g., Bing, later started to shift onto this infrastructure, they started to notice this weird symptom. After some calculations, the problem is kind of "obvious" (see the sketch below). Yes, after the issue occurred, we can perhaps change the moral of the story to "multiplicity does not imply redundancy". But it is hard to get it right the first time.
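For the "after some calculations" part, here is a back-of-the-envelope sketch with made-up parameters (the per-switch fault probability and the fan-out are illustration values, not the actual Azure numbers): adding core switches makes it more likely that at least one of them develops a gray fault, while a large request fan-out hashed across the fabric means almost every request still crosses that one bad switch.

```python
# Back-of-the-envelope sketch (made-up numbers) of why "more switches"
# does not automatically mean "more availability" once a switch can fail
# partially (silently dropping packets) instead of crashing cleanly.

def p_some_switch_gray(n, p):
    """Chance that at least one of n core switches has a gray fault,
    if each one independently develops it with probability p."""
    return 1 - (1 - p) ** n

def p_request_hits_bad_switch(n, m):
    """Chance that a request whose m flows are hashed uniformly and
    independently across n switches touches the single faulty one."""
    return 1 - (1 - 1 / n) ** m

for n in (16, 64, 256):
    m = 4 * n  # assume request fan-out grows with the size of the fabric
    print(n,
          round(p_some_switch_gray(n, p=0.01), 3),
          round(p_request_hits_bad_switch(n, m), 3))
# With p=1% per switch: more switches -> more likely some switch is gray,
# while a fan-out proportional to n keeps the per-request hit rate near 98%.
```

Under these (assumed) conditions, the extra switches add capacity, but each request ends up depending on more of them, which is the sense in which multiplicity is not redundancy.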
5. Robust-yet-fragile concept
I did not know much about this concept. Thanks for sharing it.
6. Accrual detector
The accrual detectors are more advanced than simple heartbeats. But they are still based on heartbeat signals (with more statistical analysis) and are still crash detectors. They actually address the other aspect of the fault detection problem: when the detector has *not* received a heartbeat from another component, that does not necessarily mean the other component is down, due to asynchrony. The accrual detector's contribution is that it will eventually have high confidence when it convicts a component as "down". For gray failures, the issue is the opposite: even when the detector has received a heartbeat, that does not necessarily mean the other component is healthy. Regarding the question "are there any other fuzzy detectors that work for identifying latent failures": besides the Panorama [OSDI '18] observers, we are working on automatically constructing detectors specifically for partial failures. If you are interested, we have a workshop paper at HotOS this year that describes our preliminary investigation: https://www.cs.jhu.edu/~huang/paper/watchdog-hotos19-preprint.pdf
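For context, here is a minimal sketch of the accrual (phi) idea, assuming heartbeat inter-arrival times are roughly normal; the window size and any suspicion threshold here are arbitrary illustration values, not taken from a specific implementation.

```python
# Minimal sketch of an accrual (phi) failure detector, assuming a normal
# approximation of heartbeat inter-arrival times (illustration only).
import math
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window=100):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: higher phi = more confidence the peer is down."""
        if len(self.intervals) < 2:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-6)
        elapsed = now - self.last_arrival
        # Probability that a heartbeat would still arrive this late,
        # under the normal model of inter-arrival times.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-12))

# Convict the peer once phi exceeds some threshold (e.g., 8). Note that phi
# only grows when heartbeats stop arriving; it says nothing about whether
# the heartbeats that did arrive came from a component that is serving
# requests correctly -- which is exactly the gray failure gap.
```

The comment at the end captures the point relevant to gray failure: the statistical machinery sharpens the "no heartbeat" case but still treats any received heartbeat as evidence of health.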
Sorry for all the shameless plugs :)
And thanks again for your review and thoughts!
[a] I cannot help but think of the proliferation of open-source distributed tracing systems and their overlap with Panorama. I may be making an unfair comparison, based on both being instrumented at the component level. I'm curious to know your thoughts.
[b] It appears that the mental model for these open-source distributed tracing systems comes from the research publication behind Google's Dapper distributed tracing system.