Architecture Paper

Contents:
- Will Kubernetes work with Aurae?
- Does Aurae replace Kubernetes?
- Control Systems (Control Plane)
- Aurae Daemon Identity and Authentication
- Service Transit Systems (Data Planes)
- Decision to stay away from BitTorrent infrastructure
Aurae is an opinionated systems kernel designed to be a building block for higher level distributed systems such as Kubernetes. Aurae intends to simplify and secure the relationship between a Node and a distributed system, as well as provide powerful networking primitives to higher order systems in the stack.
A primary goal of Aurae is to replace systemd. We also aim to replace the runtime node components in a modern distributed system such as Kubernetes.
Aurae calls out a standard library and a Turing-complete scripting language with which platform teams can develop and compose their platforms in an opinionated and scalable way. Our hope is that distributed systems such as Kubernetes and modern service mesh systems will find value in building on top of Aurae.
Aurae is the Turing-complete platform language designed for platform teams.
Auraed is the core daemon that supports it.
We believe that there is untapped opportunity in how we manage nodes in distributed systems. Specifically, we believe that better multi-tenant building blocks at the node level will unlock more effective platform abstractions (such as the Kubernetes control plane) on top.
The project is a result of a simplification of the core needs of a modern platform infrastructure team. Aurae attempts to replace systemd and the lower level Kubernetes components that bring a traditional Kubernetes environment to life.
Aurae was originally started by Kris Nóva. The project draws inspiration from Plan 9, Kubernetes, and the COSI project. The primary motivation for the project was to follow the dream that distributed systems kernels could adopt a broader scope while also undergoing simplification.
Core engineers on the project in its early stages include Duffie Cooley and Tani Aura.
Works that influenced and validated the project include
Yes. We intend to simplify a layer of the stack. The Aurae API will solve many of the same problems as Kubernetes, and there is no reason a translation between Kubernetes objects and Aurae API calls cannot be maintained. In fact, the project calls out the ability to have a transparent Kubernetes deployment running on top of Aurae as one of its goals.
It depends, but mostly no. Think of Aurae as systemd and the kubelet wrapped up into a single system that takes networking, storage, identity, and runtime into scope. Instead of managing an entire Linux system, kernel, systemd services, container runtimes, CNI, CSI, and so on underneath Kubernetes, you can just run Aurae on the node instead. The Aurae APIs will enable the same functionality that is otherwise available on a traditional Kubernetes node. Aurae aims to allow an engineer to leverage the Aurae mechanics to support higher level orchestration systems such as the Kubernetes API and control plane. Aurae goes after the node, not the cluster.
Aurae intends to replace systemd and the kubelet in one fell swoop. Will it be successful in its mission? All we can do now is pray.
No.
Aurae does not have a centralized source of truth like a Kubernetes cluster does. However, Aurae does persist configuration at the node level. Each node in a mesh is responsible for maintaining its own source of truth. Additionally, each node provides functionality that allows a system on one node to mutate a system on another node.
We use SQLite for our primary data store. The database runs on each node instead of in a centralized model.
Also under consideration is SpiceDB, which has authorization and identity baked in at the database level.
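As a sketch of what a per-node source of truth could look like (the table layout and key naming here are our own illustration, not auraed's actual schema), a node might keep its state in a local SQLite file:

```python
import sqlite3

# Each node owns its own database file; ":memory:" keeps this
# sketch self-contained. (Hypothetical schema for illustration.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (key TEXT PRIMARY KEY, value TEXT)")

def put(key, value):
    # Upsert: repeated writes converge on the latest value.
    conn.execute(
        "INSERT INTO state (key, value) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
        (key, value),
    )
    conn.commit()

def get(key):
    row = conn.execute(
        "SELECT value FROM state WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None

put("pods/blog/status", "running")
```

Because each node is its own source of truth, a write becomes durable locally without any consensus protocol.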
The simplest model for Aurae is running multiple tenants on a single system.
Each tenant (in this example)[a][b] leverages the Aurae language over mTLS gRPC over a Unix domain socket. Each tenant has unique certificate material loaded at runtime. Each tenant executes against the same core daemon. Each tenant is reasoned about independently by the daemon based on the tenant’s identity at runtime.
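A toy model of that per-tenant reasoning (the tenant names and the permission table are invented for illustration; in reality the identity would come from the tenant's mTLS certificate material):

```python
# Hypothetical permission table: each tenant identity maps to the
# subsystem calls it is allowed to execute against the daemon.
PERMISSIONS = {
    "nova": {"runtime.Run", "runtime.Stop", "observe.Stdout"},
    "alice": {"observe.Stdout"},
}

def authorize(tenant, call):
    # The daemon reasons about each tenant independently, keyed
    # purely on the identity the tenant proved at connection time.
    return call in PERMISSIONS.get(tenant, set())
```

An unknown identity gets the empty permission set, so unrecognized tenants are denied by default.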
Containers run in pods. Pods run in MicroVM isolation sandboxes. Sandboxes are the new namespace. They all run on a single piece of hardware.[c][d]
Aurae is intended to grow organically with the needs of a business. Starting with Aurae is simple: the recommendation is always to begin with a single node. Add a second node when you have reached a critical size on the first node, and so on.
Aurae nodes build and maintain a peer-to-peer mesh at runtime.
Auraed will listen on a Unix domain socket by default. Auraed will manage network devices for the system itself, as well as guests.
Auraed will provide service discovery and a lookup mechanism for other nodes in the mesh.
Each connection from one node to another will be a direct point-to-point connection.
Aurae is designed to work in a decentralized way. Work is scheduled directly where it runs, or work is not scheduled at all. The networking model is designed to be as flat as possible. Aurae nodes navigate the mesh by calculating Hamiltonian paths[g][h][i][j][k][l][m][n] at runtime.
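To illustrate the Hamiltonian-path idea, here is a naive backtracking search over a small adjacency map. This is a sketch only: real mesh adjacency would be derived from the DHT, and this exponential search would not scale to large meshes.

```python
def hamiltonian_path(graph, start, path=None):
    """Return a path visiting every node exactly once, or None."""
    path = path or [start]
    if len(path) == len(graph):
        return path
    for neighbor in graph[path[-1]]:
        if neighbor not in path:
            found = hamiltonian_path(graph, start, path + [neighbor])
            if found:
                return found
    return None

# Three nodes where "a" can only reach "c" through "b".
mesh = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
```

Note that such a path exists from "a" (a-b-c) but not from "b", which shows why path calculation depends on where traversal begins.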
Aurae will also call out a node registry as a supported service in the future.
Aurae improves DNS[o] and addresses the decentralization problem by calling out a simple routing syntax for the nodes to follow.
@service@node@domain[p]
For example, to route a packet to a service, a user needs to know the service name[q][r][s][t][u][v], the node name, and the domain name of the intended destination. Routing to my blog running on a node named “alice” would look like this:
@blog@alice@nivenly.com
A wildcard may be used in place of the node name to reach the service on any node:
@blog@*@nivenly.com
The node segment may also be omitted entirely:
@blog@nivenly.com
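A sketch of parsing this syntax (the function and its error handling are ours; the wildcard and omitted-node behavior follow the address forms shown above):

```python
def parse_route(addr):
    """Split an @service@node@domain address into (service, node, domain).

    The node segment may be "*" (any node) or omitted entirely,
    which we treat as equivalent to the wildcard.
    """
    if not addr.startswith("@"):
        raise ValueError("routes must begin with '@'")
    parts = addr[1:].split("@")
    if len(parts) == 3:
        service, node, domain = parts
    elif len(parts) == 2:  # @service@domain
        service, domain = parts
        node = "*"
    else:
        raise ValueError("unrecognized route: " + addr)
    return service, node, domain
```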
Aurae calls mesh management and higher order decision making out of scope. However, the project will likely end up maintaining one or more higher order services that will reason about where to send various messages and instructions in a mesh. The control system will be “node aware” and will use that awareness to make scheduling and routing decisions.
Aurae leverages an internally scoped DHT for service discovery. The paradigm for a DHT resembles that of public DNS. Each Aurae node must be hydrated with a routable address in order to begin finding other nodes in the mesh.[w]
The DHT hydration paradigm is no different than defining 8.8.8.8 in resolv.conf.
A small public facing service can be scheduled on an Aurae node to begin serving as the initial bootstrapping hop.
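To make the DHT mechanics concrete, here is a Kademlia-style XOR-distance lookup. The XOR metric is our assumption for illustration; the document does not commit to a specific DHT design.

```python
import hashlib

def dht_id(name):
    # Hash a name into a 64-bit DHT keyspace.
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:8], "big")

def closest_node(key, known_nodes):
    # Kademlia-style: the node responsible for a key is the one
    # whose ID has the smallest XOR distance to the key's hash.
    target = dht_id(key)
    return min(known_nodes, key=lambda n: dht_id(n) ^ target)
```

Under this scheme, a freshly hydrated node only needs one routable peer: it asks that peer for the nodes closest to its own ID and iteratively walks toward them.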
Every Aurae daemon will have a cryptographic identity based on a combination of Public Key Infrastructure (PKI) certificates and SSH keys. An admin may connect to and control an Aurae Daemon[x][y] using an ‘SSH Certificate’: a combination of SSH keys and a signed, time-limited certificate. The Aurae Daemon itself also receives an SSH Certificate that it uses to authenticate itself to incoming SSH connections, and to authenticate itself with other Aurae Daemons.
For production environments scaling beyond one Aurae Daemon, the CA should be configured before the Aurae Daemon starts. A plugin model will be provided to support various key management solutions such as Cloud KMS or proprietary solutions.
An Aurae Daemon may also start in detached mode, where it is not federating with other systems. In detached mode, authorized SSH keys are configured[z][aa][ab] to control who or what is allowed to bootstrap the system. The Aurae Daemon will create a private CA, sign SSH keys, and return the SSH certificate to the user.
This approach should be flexible enough for a developer to get started quickly with safe defaults. For major enterprises, the PKI may be backed by KMS, HSM, or other devices, providing full control over the attestation and signing process.
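The time-limited nature of those certificates can be sketched as a validity-window check. The field names here are illustrative, not auraed's actual types, and a real check would also verify the CA signature over the certificate:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class SshCertificate:
    # Illustrative fields: a real SSH certificate also carries the
    # signed public key, principals, serial, and the CA signature.
    principal: str
    valid_after: datetime
    valid_before: datetime

def is_valid(cert, now=None):
    # A certificate is usable only inside its validity window.
    now = now or datetime.now(timezone.utc)
    return cert.valid_after <= now < cert.valid_before

issued = datetime(2022, 9, 1, tzinfo=timezone.utc)
cert = SshCertificate("nova", issued, issued + timedelta(hours=8))
```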
Authentication between systems with different root CAs will require a mechanism to share the CA across boundaries. This will allow for arbitrary federation of identities between organizations.
There is also an opportunity, which will be investigated, to federate with cloud infrastructure identities using OpenID Connect (OIDC).
Authorization is maintained separately from Authentication and consumes Aurae’s cryptographic identity.
The endpoints model is what gives Aurae the ability to power the large service mesh topologies seen with projects like Istio, Dapr, and Linkerd. Instead of positioning a sidecar next to each application, Aurae will instead deploy endpoints onto the node. These endpoints provide powerful networking functionality such as service discovery, NAT translation, proxy routing, and name service resolution.
Aurae will ship with a set of opinionated flat networking endpoints by default; however, more advanced networking topologies will be possible simply by implementing the endpoint interface on the system.
Aurae will need to identify a way for services to communicate within the mesh. There are two networking layers that will need to be discussed.
Host Mesh Network | Composed of node-aware network endpoints that are connected together. The host mesh network will be end-to-end encrypted by default. |
Service Mesh Network | Pod-level networking managed by injecting interfaces directly into pods and sandboxes. The possibility of leveraging existing CNI networking toolchains here is in scope. |
The minimal requirements to join one Aurae instance to another are a communication bridge and node awareness. In other words, a public DHT, public DNS, or hard-coded network addresses will need to be identified for a specific mesh. The more known routable nodes in a mesh, the more resilient the joining process will be.
It will be possible to join a node through a control system, as the control system will be aware of every node in the mesh. Thus, the only awareness a node will need is the root control system.
Aurae is built on the concept of subsystems, similar to Linux subsystems or Kubernetes resource groups.
Subsystem | Description | Examples |
Runtime | Stateless executive subsystem for direct interaction with a system’s runtime resources such as Linux processes, container runtimes, microVM hypervisors, process management, etc | runtime.Run(myPod) runtime.Run(myDaemon) runtime.Stop(myPod.name) |
Schedule | Higher level stateful wrapper system for Runtime. Here is where systemd unit files, and Kubernetes manifests will become relevant. This subsystem is responsible for scheduling runtime events under various criteria. | schedule.Cron(myPod, “* * *”) schedule.Now(myPod) schedule.Pin(myPod, node) schedule.Lax(myPod) |
Secrets | Secrets should never enter a codebase. The goal is to make it easy to do the right thing with secrets. | myPod.Env(“user”, “nova”) myPod.Env(“pass”, secrets.Get(“nova”)) |
Identity | Identity is a wrapper subsystem that brings certificate management, authorization (authz), and IAM identity as low as possible in the stack. Auditing and identity should be easy to set up and manage by default. | nova = identity.User(“nova”) nova.Allow(runtime) nova.Allow(runtime.Run) nova.Allow(runtime.Stop) nova.Deny(schedule.Cron) |
Observe | Aurae will capture all stdout and stderr on a system and manage it for the daemon. Observe is how this data, and other data, is accessed. | observe.Stdout(myPod) observe.Stderr(myPod) observe.Stdevent(myPod) observe.Metrics(myPod) observe.Stdout(myPod).withContext(ctx) |
Route | Routing is a network abstraction that abstracts most of a networking stack away from the daemon. Here is where endpoints take over with service to service routing. | route.Open(myPod, “@foo@nivenly”) route.Open(myPod, “@baz@nivenly”) |
Batch[ao] | The batch system is a way of mutating large groups of Aurae objects at runtime. This feature is a core primitive of Aurae and will replace systems such as Helm for managing YAML. | myPod1.Name = “nova” myPod2.Name = “alice” myPod3.Name = “emma” [ “myPod1”, “myPod2”, “myPod3” ] |
Mount | Mount will attach POSIX compliant storage to a pod. | mount.Device(myPod, s3, “/data”) |
The Aurae language is a Turing-complete alternative to YAML that ships with the same memory safety and runtime guarantees as the Rust programming language.[ap][aq][ar][as][at][au][av][aw][ax][ay][az][ba]
Application owners and platform engineers will use the same language to represent static applications that can be used to mutate a system. In other words, Aurae files can reference each other. Application teams will be able to define their components without acting on them, simply by providing a static file describing their application’s needs.
Cluster-specific configuration will be managed using the batch subsystem. Application owners can build their applications however they like. The infrastructure-specific changes will come later in a build pipeline.
The Architecture tries to define and reference resources whenever possible. We do most of that here in the Appendix.
We do our best not to refer to a group of Aurae nodes as a cluster, but rather as a mesh. We do this to outline the difference in the peer-to-peer relationship between nodes and the organic growth paradigm of Aurae.
The Aurae project does not use any public BitTorrent or libp2p infrastructure in any way.
While the core DHT paradigms might be similar, an Aurae mesh will be responsible for hosting or identifying its own public service discovery infrastructure such as public DNS or a DHT.
Authors: Kris Nóva
[a]What is a Tenant ?
Abstractly here I am thinking that #!/bin/aurae manages a tenant process of some form -- for example a CronJob or a DaemonSet. Am I close ?
[b]is it a namespace kind of thing for Aurae? Like can each tenant have different kubelet, cni configs etc?
[c]More: https://github.com/aurae-runtime/aurae/issues/21
https://github.com/aurae-runtime/aurae/issues/20
[d]each sandbox is a MicroVM with its own cri, kubelet?
[e]What does "scaling Aurae" mean? I was thinking that Aurae is a single-node system, and k8s manages a set of them.
[f]Aurae is a single node, but it's being designed so that it can federate with other nodes. It will be possible to build a scheduler on top of Aurae, and we intend to produce at least one such scheduler.
[g]Could we consider the Hamiltonian Path as a backup in case direct connectivity is not possible?
Also, I think it would be worth studying Kademlia to help with scaling. BitTorrent is the killer app for Kademlia, but I think it could apply nicely here too.
[h]So this is a big decision as we need each node aware of its neighbors in order to make this work at the mesh level.
I think we can certainly implement a DHT per mesh, which obviously would also solve service discovery for us.
I do not want to use any of the native bittorrent or libp2p public infrastructure. The libraries and techniques however are in scope.
I still think we will need a Ham path to traverse the nodes, which can be derived from a DHT. We can start with a DHT and have each node register itself.
[i]What's "neighbor" in networking sense? Same subnet?
[j]Yeah just a node it can route to somehow. In theory it doesn't have to be on the same subnet -- just something that routes.
[k]Do you think nodes that can't route to one another are within scope? Like, do we want cluster that doesn't have routes from node A to node C unless it goes through B?
[l]I think that in some ways aurae can also be represented as a collection of disconnected meshes. Where we share a control plane across pockets of aurae nodes. Think of an edge compute model. I want apps deployed to nodes in one store to allow for communications between themselves. But I may not want the traffic from those nodes to be able to route to nodes in another store. Each store is an island.
[m]I added some information about aurae daemon authentication pattern further down that may help here. I think this model could be expanded to the running pods and provide support in constructing a service mesh. https://github.com/aurae-runtime/architecture/blob/main/accepted/001.md also has more information.
[n]I think we need to write down what we want to achieve here. Currently it says: We're using Mesh and Hamilton Paths instead of a centralized architecture.
But why? What problems are solved by doing this? Which other problems do we get by going this way?
[p]Should we consider using plain old domain names, routed with a different protocol? For example, aurae://service.node.domain?
[q]Maybe I missed this, what mechanism is used to prevent service (and node/domain) name collisions within the mesh?
[r](Hi and welcome to Aurae!)
Each auraed is coupled to an SSL-signed identity with SPIFFE/SPIRE. We use the central authority and soon-to-be-ironed-out auth mechanism to ensure that we only issue a single cert for a single identity. If another node tries to register that identity, we should reject the signing request for a new cert, and thus the node won't be able to authenticate with any other nodes in the mesh.
[s]Ah ok, so the mesh is decentralized but authn/z is not. I misunderstood.
[v]Mind if I keep this open for my notes? I want to clarify this eventually.
[w]Dumb question: how does this differ (if at all) from a gossip protocol?
[x]How do we configure the root ca and serve cert/key on the nodes?
[y]I have some minor comments on that a bit further down. I'm not entirely sure what mechanism we should use just yet. I think this will need to be pluggable though. Eventually, identity providers will probably want to provide implementations specific to themselves. E.g. Keyfactor, Venafi, Cloud KMS in all major clouds.
I'll fill out the next paragraph on this.
[z]Do we adopt openssh conventions here? we can vendor the library defaults directly from the C source code
[aa]I think we should, unless there is a compelling reason not to.
[ac]Other subsystems to consider not yet called out:
Building (CI/CD/OCI)
Pipeline (Scan, Sign, Compile)
OCI (Build, Format, Deploy, Push, Pull)
Prevention (Admissions, Validation)
Detection (Observability, eBPF, Kernel monitoring)
[ad]I also would like to re imagine what /proc might look like if it was represented at a higher level? Perhaps this could finally be the abstraction we need to deprecate the junk drawer of proc once and for all?
Proc (kernel metrics, events, process data, device tree, etc)
[ae]how do we store secrets in mesh (without etcd or another state store)?
[af]Secrets are a higher order primitive on top of the mount subsystem.
Aurae will implement an SQLite store on every node (via the mount subsystem) by default. I am unsure whether I want Aurae to call encrypting secrets in scope or not.
Ultimately Aurae will form *some* opinion on secrets and store those in a reasonable way.
[ag]In my opinion, identity _could_ very fundamentally disintermediate credentials. So "secrets" might be an eternal temporary "hack". I mention this as there might be room to represent this in the correlations of these two subsystems.
[ah]Ideally, we can do away with secrets through the use of cryptographic identity. Once a workload has a cryptographic identity, it can authenticate itself with a system designed to store and manage secrets. This needs to be pluggable so that companies can make use of their current choices.
[ai]are we rolling our own observe or do we want to use something like otel?
[aj]Do you have a recommendation?
[ak]Ideally I'd like to roll with open telemetry. The main thing is we'd have to write logging ourselves since rust and friends don't have a logger yet
[al]Would you be willing to write a proposal on this in https://github.com/aurae-runtime/architecture/
I'd be happy to help you if you need.
[am]Yeah, will do. I'm a bit pressed for time since I'm traveling next week, but I will probably get something out there next week.
[an]There is no rush, it can wait until you have more time. Thank you!
[ao]This looks a bit overzealous to me. If I understand this point correctly, this seems to be the domain of config languages such as Jsonnet, Nickel, Cue, Dhall or Nix. Is it?
[ap]The code snippet looks imperative to me. That raises the question: where is the reconciliation (control-)loop designed to reside? Is the user expected to script it?
I have trouble pinning it down, but I have a sense that the lazy evaled functional paradigm of Nix or Nickel to render a description of the desired state can provide a "better" interface, if resource identity can be abstracted away.
Nix only has the `derivation` "resource", which represents a file on disk. That file is used as indirection for configuration state.
Aurae and Nickel might present an opportunity to implement the distinct aurae built-ins (such as `container()`) as part of Nickel reconciliation targets (backed up by a reconciliation loop).
I apologize, this is still a volatile thought in the making. So at this point, I might just invite you to scrutinize the Nickel configuration language.
You might come up with the more mature conclusions, anyway.
[aq]https://github.com/tweag/nickel
[at]Thanks, will spend some time thinking about this and forming an opinion. I'm familiar with nix and cue, but haven't seen nickel. Will definitely take a look.
[au]I've reached out to the maintainer of Nickel. I wish your paths may cross one day (at least in the technical sense). Of course, I'm doing a matchmaker's work here, for whose intrusiveness I sincerely apologize. OTOH, as a principle, I could only be a "rotten telephone". :-)
[av]Based on yesterday's post, I updated myself on Nickel again a bit, and it looks like the only real endpoint of concern for Nickel is data(-structures). So the control-loop domain would still be _aurae_ native, which is of course good news.
In that sense, maybe a shallow integration is really just good enough.
Since you're familiar with Nix, you might have thought about (folks like) me who struggle with the domain walls between cloud schedulers and pet schedulers (systemd as used by NixOS) every day.
Some influential Nix folks seem already to be aware of your project, and I would absolutely cheer a vision where NixOS embraced aurae as a vehicle to gradually emancipate itself from the workstation use case, while still porting over the utility of the immensely vast collection of the NixOS (systemd) service library.
[aw]It would be nice if Yann was to read this and can share his perspective / vision about possible integration points...
[ax]One last thing for today:
When I mentioned the problem of "resource identity" (or identification), I might have had a proto-remembrance that only reified today of the Nomia project, which set out to generically solve resource naming, resolution, and acquisition issues.
It seems not to be actively developed at the moment, but the fundamental ideas and concepts (and maybe even its grammar) might cross-pollinate resource naming, resolution, and acquisition in the context of aurae:
https://github.com/scarf-sh/nomia
[ay]Thank you for the thoughtful response! Matchmaking is always a good thing to do. I think diversity of opinions is good, though we will eventually need to converge, form an opinion, and execute against it.
I didn't realize any of the nixos people were paying attention to us yet. Once we're further along, I do intend to submit the package to nixpkgs and create a nixos service so that people can set up and deploy easily.
Our intention is to consume OCI images. I know that nixos has the capability to generate these with pkgs.dockerTools.buildImage and buildLayeredImage, but this feels orthogonal to where the image should run. An initial integration could be just wiring these up to aurae. I'm sure a plugin with a deeper integration could be possible. I'm hesitant to add it to the scope here, but I think another document outlining a possible NixOS integration path would be amazing. I would love to engage with the NixOS community to see where this can go. Please forward this document to anyone who you think would find this interesting.
I'll also take a look at nomia. I feel like this would need to live at a higher layer than auraed since it appears to be scheduling workloads. Our intention is to have auraed just run workloads. Something else will need to schedule what runs on any given auraed. This will eventually be in scope, but we're focusing on getting auraed out the door first.
Again, thank you soooo much!!! You've given me things to think about and I am very excited that word is getting out even when we're so early!
[az]Re: OCI images, on a tangent... A couple of folks have gathered together to potentially amend the OCI image layer standard with Nix locators (image layers of type `[...].nar`), thereby short-circuiting the (very slow) overlay constructions that the runtime is still burdened with.
This works, since Nix implements non-conflicting identifiers for a filesystem within its immutable subtree/store.
OCI steering folks seemed preliminarily interested. This might converge matters further around runnable artifact resolution (and acquisition), independent of the jail that you (or the user) end up choosing.
Edit: I'm still the same, no clue why this shows as "anonymous".
[ba]Ah weird, not sure what google docs is doing here. There are some workloads that are delivered by OCI that nar will likely break since it doesn't support extended attributes. Fortunately, these will be in the minority.