Support for pods in Mesos
6.3.3.1. Launch a sub-container
6.3.3.2. Destroy a sub-container
6.3.3.3. Update a sub-container
6.3.4.1. Executor to agent messages
6.3.4.2. Sub-container reaping
6.3.4.3. Checkpointing and recovery
6.3.4.6. Why let agent fork-exec?
Mesos doesn’t have first class support for running pods. A pod can be defined as a set of containers co-located and co-managed on an agent that share some resources (e.g., network namespace, volumes) but not others (e.g., container images, resource limits).
Mesos already has a concept of Executors and Tasks. An executor can launch multiple tasks, where each of the tasks as well as the executor itself can specify a ContainerInfo. The main idea is to leverage the executor and task abstractions to provide pod semantics, where the executor runs in an outer container (also called executor container) and each of its tasks might run in separate nested containers (also called sub containers) inside the outer container.
The limitation of the existing Scheduler and Executor APIs is that there is no way to send a collection of tasks, atomically, to an executor. Currently, even if a scheduler launches multiple tasks targeted for the same executor in a LAUNCH offer operation, those tasks are delivered to the executor one at a time via separate LAUNCH events. This lack of atomicity is problematic for providing pod semantics because some of the tasks might be dropped (e.g., due to network issues), resulting in a pod that is in a partially functioning state. It is easier to reason about a pod if it has all-or-nothing semantics with respect to its constituent containers.
To achieve this, we will introduce a new abstraction called TaskGroupInfo. A `TaskGroupInfo` is a collection of tasks, potentially with some metadata, that is atomically delivered to the executor to launch. An executor that wants to provide pod semantics could interpret the `TaskGroupInfo` as a group of nested/sub-containers.
The workflow for schedulers to launch pods will be very similar to how they launch tasks today. When a scheduler gets an offer, it will use a new LAUNCH_GROUP offer operation to launch a collection of tasks. The master does the required validation and accounting and forwards the task group to the agent via a new RunGroupMessage message. The agent first sets up the outer container and launches an executor inside it. When the executor registers, the agent sends it a new LAUNCH_GROUP executor event. If the executor wants to set up a nested container for one of its tasks, it can send a new LAUNCH call to the agent. Note that an executor might not want to isolate its tasks and hence not put them inside nested containers. (TODO: Need a flow diagram here)
message TaskGroupInfo {
  repeated TaskInfo tasks = 2;
}
message Offer {
  ...
  message Operation {
    enum Type {
      LAUNCH = 1;
      LAUNCH_GROUP = 6; // <-- New operation
      ...
    }
    ...
    message LaunchGroup {
      required ExecutorInfo executor = 1;
      required TaskGroupInfo task_group = 2;
    }
    ...
    optional LaunchGroup launch_group = 7;
  }
}
To allow frameworks to launch a task group with the default Mesos executor, we will let frameworks set `LaunchGroup.ExecutorInfo` with only some fields set (e.g., ExecutorID). The Mesos master/agent will fill in the necessary fields in the ExecutorInfo (e.g., CommandInfo, Resources) for the default executor. Having the framework set the ExecutorID is important because it lets the framework easily shut down the pod/executor (and its constituent tasks) and/or learn the termination status of the executor. To make this work, we need to change `ExecutorInfo.command` from required to optional, thus making all fields in ExecutorInfo optional except `executor_id`. We will also add a new `ExecutorInfo.type` field to explicitly distinguish default and custom executors.
message ExecutorInfo {
  enum Type {
    UNKNOWN = 0;

    // Mesos provides a simple built-in default executor that frameworks can
    // leverage to run shell commands and containers.
    //
    // NOTES:
    //
    // 1) `command` must not be set when using a default executor.
    //
    // 2) If "cpu" or "mem" resources are not set, default values of 0.1 cpu
    //    and 32 MB memory will be used; schedulers should account for these
    //    resources when accepting offers.
    //
    // 3) The default executor only accepts a *single* `LAUNCH` or
    //    `LAUNCH_GROUP` offer operation.
    DEFAULT = 1;

    // For frameworks that need custom functionality to run tasks, a `CUSTOM`
    // executor can be used. Note that `command` must be set when using a
    // `CUSTOM` executor.
    CUSTOM = 2;
  }

  // For backwards compatibility, if this field is not set when using the
  // `LAUNCH` offer operation, Mesos will infer the type by checking if
  // `command` is set (`CUSTOM`) or unset (`DEFAULT`). `type` must be set
  // when using the `LAUNCH_GROUP` offer operation.
  //
  // TODO(vinod): Add support for explicitly setting `type` to `DEFAULT`
  // in the `LAUNCH` offer operation.
  optional Type type = 15;
  ...
  optional CommandInfo command = 7; // <-- Change this from `required` to `optional`
  ...
}
Note that, when using the default executor (identified by `ExecutorInfo.type` being set to DEFAULT), frameworks are expected to set the resources needed for the executor. Specifically, the only fields that frameworks are not allowed to set when using the default executor are:
To allow containers to share a network namespace and volumes, frameworks can set `Volume` and `NetworkInfo` in `ExecutorInfo.container`. Subsequently, frameworks can set `Volume` on `TaskInfo` to mount a volume set on `ExecutorInfo` (See `Volume Sharing` section below for details).
On the executor side, we need to add a new LAUNCH_GROUP event. Other changes needed for the Executor API are discussed in the “Agent Changes” section.
message Event {
  enum Type {
    ...
    LAUNCH = 2;
    LAUNCH_GROUP = 8; // <-- new
    ...
  }

  message LaunchGroup {
    required TaskGroupInfo task_group = 1;
  }
  ...
  optional LaunchGroup launch_group = 8;
}
Every task will have a “restart policy” that determines what an executor should do with a task if the task terminates[b][c][d][e][f][g][h].
message TaskInfo {
  message RestartPolicy {
    enum Type {
      ALWAYS = 1;
      ON_FAILURE = 2;
    }
    optional Type type = 1;
    optional int32 max_restarts = 2;
    optional DurationInfo min_restart_delay = 3;
  }
  ...
  optional RestartPolicy restart_policy = 13; // <-- New
}
If a restart policy is set for a task, the executor is responsible for restarting the task (by communicating with the agent) per policy. The executor could also send a status update when a task restarts (TASK_RESTARTING) to inform the scheduler about the restart, so that the scheduler can use the information to better reschedule the pod.
For the MVP, we might get away with not having a first-class task-level restart policy. Instead, the default executor will terminate if any of its tasks terminates with a non-zero exit code.
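For the full (non-MVP) policy, the decision an executor makes when a task terminates can be sketched as follows. This is a minimal illustration, not Mesos code: the enum and struct mirror the `RestartPolicy` proto above, `shouldRestart` and its signature are assumptions, and the `min_restart_delay` backoff is omitted for brevity.

```cpp
#include <cstdint>

// Illustrative stand-ins for the proposed `TaskInfo.RestartPolicy` fields.
enum class RestartPolicyType { ALWAYS, ON_FAILURE };

struct RestartPolicy {
  RestartPolicyType type;
  int32_t maxRestarts;  // Upper bound on restart attempts.
};

// Decide whether the executor should relaunch a terminated task.
// `exitStatus` is the task's exit code; `restarts` counts prior restarts.
bool shouldRestart(const RestartPolicy& policy, int exitStatus, int restarts) {
  if (restarts >= policy.maxRestarts) {
    return false;                        // Restart budget exhausted.
  }
  switch (policy.type) {
    case RestartPolicyType::ALWAYS:      // Restart regardless of exit code.
      return true;
    case RestartPolicyType::ON_FAILURE:  // Restart only on non-zero exit.
      return exitStatus != 0;
  }
  return false;
}
```

An executor applying this logic would follow a successful restart decision with a TASK_RESTARTING status update, as described above.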
The master should have a handler for the new LAUNCH_GROUP operation. After receiving the operation, it does the requisite validation (e.g., that the total resources of the tasks and executor are not more than the offered resources). For the MVP, the master will store tasks belonging to a TaskGroup in a flat list of tasks, just like it does today for non-task-group tasks.
The master sends the `TaskGroup` information to the agent via the new `RunGroupMessage` message. The master should know whether a given agent can actually understand `RunGroupMessage`. Ideally, this will be done via a newly introduced agent capability. For the MVP, the master will leverage the agent version for this.
message RunGroupMessage {
  required FrameworkInfo framework = 1;
  required TaskGroupInfo task_group = 2;
  optional string pid = 3; // Framework PID for sending framework messages.
}
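The master's validation rule for `LAUNCH_GROUP` (executor resources plus the sum of all task resources must fit within the offer) can be sketched as follows. This covers scalar resources only; `fitsInOffer` and the simplified resource map are illustrative assumptions, not the real Mesos `Resources` class, which also handles ranges, sets and roles.

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified scalar resource vector, e.g., {"cpus": 2.0, "mem": 1024.0}.
using ScalarResources = std::map<std::string, double>;

ScalarResources add(const ScalarResources& a, const ScalarResources& b) {
  ScalarResources sum = a;
  for (const auto& r : b) {
    sum[r.first] += r.second;
  }
  return sum;
}

// The LAUNCH_GROUP validation rule: the executor's resources plus the sum
// of all tasks' resources must not exceed the offered resources.
bool fitsInOffer(
    const ScalarResources& executor,
    const std::vector<ScalarResources>& tasks,
    const ScalarResources& offered) {
  ScalarResources total = executor;
  for (const auto& t : tasks) {
    total = add(total, t);
  }
  for (const auto& r : total) {
    auto it = offered.find(r.first);
    if (it == offered.end() || r.second > it->second) {
      return false;
    }
  }
  return true;
}
```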
The default executor should be able to leverage the existing containerization infrastructure to create sub-containers[i], instead of implementing its own containerization. Ideally, any custom executor should be able to leverage the same primitives to create sub-containers.
The agent will be responsible for fork-execing the sub-container, ideally the same way it fork-execs the executor container. The executor will initiate sub-container creation by making an API request to the agent; the request will contain all the information about the sub-container. The agent will use the same flow to provision the rootfs, invoke isolators, fork the child process, exec the command, and finally reap the container. The isolators and launcher need to be nesting aware so that cgroups, namespaces and filesystems can be nested properly under the executor container.
We might not want to always tie sub-containers to tasks. Eventually, we’d like to allow executors to create sub-containers that are not tied to tasks. For instance, a custom executor could use the same primitive to launch a Docker sub-container for some executor-related work that shares the executor’s resources.
Currently, the following isolators need to be adjusted to be nesting aware[m]:
For resource related isolators like `cgroups/*` and `disk/*`, we plan not to tackle nesting in the MVP. Resource isolation is still enforced at the executor level.[n][o][p][q][r][s][t][u][v][w][x][y][z][aa][ab][ac][ad][ae][af][ag][ah][ai]
When the executor container terminates, all of its sub-containers will be recursively destroyed as well. The launcher will also GC the checkpointed pids and exit statuses of sub-containers at that time because they are no longer needed.
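The recursive teardown can be sketched as a depth-first walk over the container tree, destroying children before parents (mirroring how nested cgroups and namespaces must be torn down innermost-first). The `ContainerTree` representation with flat string ids and the function name are assumptions for illustration, not the Mesos `ContainerID` hierarchy.

```cpp
#include <map>
#include <string>
#include <vector>

// Maps a container id to its immediate sub-containers.
using ContainerTree = std::map<std::string, std::vector<std::string>>;

// Destroy `containerId` and, depth-first, every sub-container beneath it.
// Destroyed ids are appended children-first, so each sub-container is torn
// down before its parent.
void destroyRecursively(
    const ContainerTree& children,
    const std::string& containerId,
    std::vector<std::string>* destroyed) {
  auto it = children.find(containerId);
  if (it != children.end()) {
    for (const auto& child : it->second) {
      destroyRecursively(children, child, destroyed);
    }
  }
  destroyed->push_back(containerId);
}
```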
Update will not be supported for sub-containers. This should not be a problem for the pod case because we tie a task to a sub-container, and a task’s resources cannot be changed after it is launched.
The following calls will be added to the Agent API to manage sub containers. An executor or any of its tasks or sub-containers can use this API to launch sub-containers.
package mesos.agent;

message Call {
  enum Type {
    ...
    // Calls for managing nested containers underneath an executor's container.
    NESTED_CONTAINER_LAUNCH = 14; // See 'NestedContainerLaunch' below.
    NESTED_CONTAINER_WAIT = 15;   // See 'NestedContainerWait' below.
    NESTED_CONTAINER_KILL = 16;   // See 'NestedContainerKill' below.
  }

  // Launches a nested container within an executor's tree of containers.
  message NestedContainerLaunch {
    required ContainerID container_id = 1;
    optional CommandInfo command = 2;
    optional ContainerInfo container = 3;
    repeated Resource resources = 4;
  }

  // Waits for the nested container to terminate and receives the exit status.
  message NestedContainerWait {
    required ContainerID container_id = 1;
  }

  // Kills the nested container. Currently only supports SIGKILL.
  message NestedContainerKill {
    required ContainerID container_id = 1;
  }

  optional Type type = 1;
  ...
  optional NestedContainerLaunch nested_container_launch = 6;
  optional NestedContainerWait nested_container_wait = 7;
  optional NestedContainerKill nested_container_kill = 8;
}
message Response {
  enum Type {
    ...
    NESTED_CONTAINER_WAIT = 13; // See 'NestedContainerWait' below.
  }

  // Returns termination information about the nested container.
  message NestedContainerWait {
    optional int32 exit_status = 1;
  }

  optional Type type = 1;
  ...
  optional NestedContainerWait nested_container_wait = 14;
}
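To make the call flow concrete, here is a sketch of an executor constructing a launch call for a task's sub-container followed by a wait call on it. The plain structs and helper functions below stand in for the protobuf messages above and are illustrative assumptions, not the actual generated API; `ContainerID` and `CommandInfo` are reduced to strings.

```cpp
#include <string>

// Stand-ins for the proposed agent API messages; field names follow the
// proto sketch above, but these plain structs are illustrative only.
struct NestedContainerLaunch {
  std::string containerId;
  std::string command;  // Shell command; the real proto uses CommandInfo.
};

struct NestedContainerWait {
  std::string containerId;
};

enum class CallType { NESTED_CONTAINER_LAUNCH, NESTED_CONTAINER_WAIT };

struct Call {
  CallType type;
  NestedContainerLaunch launch;  // Set when type is ..._LAUNCH.
  NestedContainerWait wait;      // Set when type is ..._WAIT.
};

// Build the launch call the executor sends when it receives a task.
Call makeLaunch(const std::string& containerId, const std::string& command) {
  Call call;
  call.type = CallType::NESTED_CONTAINER_LAUNCH;
  call.launch = {containerId, command};
  return call;
}

// Build the wait call used to retrieve the sub-container's exit status.
Call makeWait(const std::string& containerId) {
  Call call;
  call.type = CallType::NESTED_CONTAINER_WAIT;
  call.wait = {containerId};
  return call;
}
```

The response to the wait call carries the exit status the executor needs for generating the task's terminal status update.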
Since the Agent API changes above allow any client to potentially launch sub-containers, these calls should be properly authenticated and authorized.
To do this, we will add a new field `principal` to ExecutorInfo to identify the principal of the executor container. Note that all sub-containers launched within the executor container are expected to use the same principal for launching further sub-containers.
message ExecutorInfo {
  ...
  optional string principal = 10;
}
While the framework is expected to set the principal of the executor, it is the agent’s (or an agent module’s) responsibility to inject the proper credentials for that principal into the container. We do not want to add a `secret` field to ExecutorInfo because it’s not a secure way to transmit credentials.
For agents configured with the default CRAM-MD5 authenticator, the agent will inject the `principal` and `secret` into the container’s (or sub-container’s) environment. The agent looks inside the credentials provided via the `--http_credentials` flag to find the matching secret for a given principal.
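The secret lookup the agent performs can be sketched as follows. The map-based credential store and the helper name are assumptions for illustration; the real agent parses the credentials file passed to `--http_credentials`.

```cpp
#include <map>
#include <string>
#include <utility>

// principal -> secret, as loaded from the --http_credentials file.
using Credentials = std::map<std::string, std::string>;

// Returns {found, secret}. The pair is injected into the container's
// environment only when a matching principal exists.
std::pair<bool, std::string> lookupSecret(
    const Credentials& credentials, const std::string& principal) {
  auto it = credentials.find(principal);
  if (it == credentials.end()) {
    return {false, ""};
  }
  return {true, it->second};
}
```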
For authorizing sub-container management (i.e., launch, wait and kill API calls) we will add a new authorizable action `MANAGE_NESTED_CONTAINER`.
// NOTE: This is an existing proto.
message Object {
  optional string value = 1;
  optional FrameworkInfo framework_info = 2;
  optional Task task = 3;
  optional TaskInfo task_info = 4;
  optional ExecutorInfo executor_info = 5;
  optional quota.QuotaInfo quota_info = 6;
}

enum Action {
  ...
  // This action will authorize the `NESTED_CONTAINER_LAUNCH`,
  // `NESTED_CONTAINER_WAIT` and `NESTED_CONTAINER_KILL` agent API calls.
  // It will have an object with `FrameworkInfo` and `ExecutorInfo` set.
  MANAGE_NESTED_CONTAINER = 19;
}
The default local authorizer will have the following new ACLs.
message ACL {
  ...
  message ManageNestedContainer {
    required Entity principals = 1;
    required Entity users = 2;
  }
}
Note that this means a sub-container inside an executor can launch a sub-container inside another executor’s container, assuming it has the proper privileges!
Since the agent forks the sub-container, the executor cannot directly reap its exit status. However, the executor does need to know the exit status to be able to generate a proper status update for the framework.
One way to solve this problem is to improve the existing launcher so that it serves as a nanny process, reaping the pid as well as forwarding signals. The pid and exit status of the sub-container will be written to a well-known location (e.g., /var/run/mesos/launcher/linux/<container_id>) so that we can get the exit status even if the agent crashes and restarts. We need to introduce a `wait` function for the launcher instead of relying on `process::reap`, which assumes a parent-child relationship.
Currently, containerizer checkpoints the pid of the executor (`forkedPid`) so that when the agent crashes and restarts, it can use that information to wait for the executor to exit (i.e., `process::reap`). Some isolators (e.g., the port mapping network isolator) also rely on that to do proper recovery.
Ideally, we should improve the Launcher interface to handle the wait, instead of calling `process::reap`. This is because after the agent crashes and restarts, the parent-child relationship is broken: calling `waitpid` on the executor pid won’t get us the exit code. Although this is not a problem for executors, it becomes a problem for task containers because the pod executor needs the exit code to decide what status update to send back to the framework.
We should introduce a new method in the Launcher interface (e.g., `wait`) to wait for a container to terminate and return the exit code.
class Launcher
{
public:
  virtual Future<Option<int>> wait(const ContainerID& containerId);
};
In fact, the Launcher should be the one that checkpoints the pid, instead of the containerizer. For backwards compatibility, the containerizer still needs to know the pid of the executor because some isolators still depend on it. Also, we need to deal with scenarios where the Launcher does not have a checkpointed pid (since the Launcher does not do checkpointing currently).
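The checkpoint-and-recover idea can be sketched as follows: the exit status of a (sub-)container is persisted to a per-container file so it survives an agent restart, when `waitpid` can no longer recover it. The flat `<id>.status` file under `runDir` is an illustrative simplification of the per-container directory layout mentioned above, and the helper names are assumptions.

```cpp
#include <fstream>
#include <string>

// Well-known checkpoint location for a container's exit status.
std::string checkpointPath(const std::string& runDir,
                           const std::string& containerId) {
  return runDir + "/" + containerId + ".status";
}

// Called by the nanny/launcher after reaping the container.
void checkpointExitStatus(const std::string& runDir,
                          const std::string& containerId,
                          int status) {
  std::ofstream out(checkpointPath(runDir, containerId));
  out << status;
}

// On recovery, read back the checkpointed status. Returns false when no
// checkpoint exists, i.e., the container has not terminated yet.
bool recoverExitStatus(const std::string& runDir,
                       const std::string& containerId,
                       int* status) {
  std::ifstream in(checkpointPath(runDir, containerId));
  if (!in.is_open()) {
    return false;
  }
  in >> *status;
  return !in.fail();
}
```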
For orphan containers, the logic should be:
The sandbox layout for an executor with nested containers will look like the following:
.../executors/<executorId>/runs/<containerId>/
|--- stdout
|--- stderr
|--- volume/
|--- .containers/
|--- <containerId>/
|--- stdout
|--- stderr
|--- volume/
|--- .containers/
|--- …
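Computing a sub-container's sandbox path from this layout is mechanical: each nesting level adds a `.containers/<containerId>` component under its parent's sandbox. The helper below is an illustrative sketch; `executorRunDir` corresponds to the `.../executors/<executorId>/runs/<containerId>` directory above.

```cpp
#include <string>
#include <vector>

// Builds the sandbox path for a nested container. `ids` is the chain of
// sub-container ids from outermost to innermost.
std::string nestedSandboxPath(const std::string& executorRunDir,
                              const std::vector<std::string>& ids) {
  std::string path = executorRunDir;
  for (const std::string& id : ids) {
    path += "/.containers/" + id;
  }
  return path;
}
```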
For command tasks, this is not a backwards-compatible change. We need to have some exception in the code to deal with that.
It is a common requirement that containers in a pod be able to share volumes so that they can communicate through the filesystem.
Since the rootfs of the sub-container sits on the host filesystem, the executor container might not have access to it. That means the sub-container has to be able to see the host filesystem before it changes its filesystem root. One implication is that we might not be able to support a host volume for a sub-container whose source is inside its parent container. We can still support it if the source is from the parent container’s sandbox; in that case, we need to introduce a new Volume.Source.
Persistent volumes or external volumes can be specified at either the executor level or the task level. To share a persistent volume or external volume between containers in a pod, the framework should specify those volumes at the executor level and define a new volume for the task whose source is the executor’s sandbox[ak][al][am][an][ao] (the new Volume.Source discussed above).
We ask the agent to do the fork-exec for sub-containers, instead of creating a binary for the executor to do the fork-exec, for the following reasons:
Currently, the command executor and the task are in the same executor container. Eventually, we should not treat that as a special case in the sense that the command executor should be in the executor container and create a sub-container for its task.
Ideally, we should improve the command executor to support pods, making it a general *default* executor in Mesos. Alternatively, if this is too aggressive, we can keep the existing command executor and create a new executor for pods. We’ll eventually retire the old command executor once the pod executor is stable.[ap]
class Containerizer
{
public:
  virtual Future<Nothing> recover(...);

  // Launch the executor container.
  // @param environment Additional environment variables for the executor.
  // @param pidFile     Path to the pid file for checkpointing (if needed).
  //                    This will be deprecated soon in favor of letting the
  //                    Launcher do the checkpointing.
  virtual Future<bool> launch(
      const ContainerID& containerId,
      const ExecutorInfo& executorInfo[aq],
      const string& directory,
      const Option<string>& user,
      const map<string, string>& environment,
      const Option<string>& pidFile);

  // Launch a sub-container nested under the executor container.
  virtual Future<bool> launch(
      const ContainerID& containerId,
      const ContainerID& parentContainerId,
      const CommandInfo& command,
      const Option<ContainerInfo>& containerInfo,
      const Resources& resources,
      const string& directory,
      const Option<string>& user);

  // This only applies to the executor container. It should be called only
  // if the total resources of an executor change, e.g., when it receives a
  // new task or a task becomes terminal.
  virtual Future<Nothing> update(
      const ContainerID& containerId,
      const Resources& resources);

  // Resources and status can only be obtained at the executor level for now.
  virtual Future<ResourceStatistics> usage(const ContainerID& containerId);
  virtual Future<ContainerStatus> status(const ContainerID& containerId);

  // Wait for the container to terminate.
  virtual Future<ContainerTermination> wait(const ContainerID& containerId);

  // Destroy the given container.
  virtual Future<Nothing> destroy(const ContainerID& containerId);
};
message ContainerTermination {
  message Reason {
    enum Type {
      MEMORY_LIMIT_REACHED = 0;
      DISK_LIMIT_REACHED = 1;
      LAUNCH_FAILED = 2;
      UPDATE_FAILED = 3;
    }
    optional Type type = 1;
    optional string message = 2;
  }

  // Exit status of the init process.
  optional int32 status = 1;

  // Reasons that caused the termination.
  repeated Reason reasons = 2;
}
http://kubernetes.io/docs/user-guide/pods/
https://github.com/docker/docker/issues/8781
http://aurora.apache.org/documentation/latest/reference/configuration/
[a]+clambert@mesosphere.io here is the blurb about the default executor resources. the min values allowed are 0.1 cpu and 32 MB (see proto above).
[b]What about taskgroup level restart policy? If we delegate that to the framework, will that be latency issue? or what if there's a network partition.
[c]is it really a task-group level "restart" policy, or "longevity" policy?
1. group lives as long as ANY task is running
2. group lives as long as ALL tasks are running
[d]as a follow-up, how should a pod react to a task-group that exits w/ a failure condition? does it stop processing all following task groups, or attempt to continue?
[e]There can be tasks that are as important as the executor within the Taskgroup. In such cases we might want to kill the task group or restart the entire TaskGroup if that specific task dies. How do we handle TaskGroup restarts based on status of tasks other than the executor?
[f]We will punt on this for MVP. The default executor will terminate if any of the tasks terminate.
[g]_Marked as resolved_
[h]_Re-opened_
Your proposal for the MVP will make for **very** fragile pods
[i]It would be helpful to have a strong definition on what is meant by "sub-container".
[j]SHOULD or MUST?
[k]Is nesting strictly hierarchical? That is, can a nested child access something in the parent? Can this express the need (for example), for multiple tasks to be in the same PID namespace?
[l]Maybe this should be --nested_isolation to match the --isolation flag?
[m]What is the contract with an isolator defined to be 'nested aware'? Where is the definition of what 'nested' even means?
[n]Can an Executor successfully launch multiple TaskGroups? If there is no resource isolation within the Executor then separate Pods will be able to harm each other through excessive consumption of resources.
[o]Resource isolation is still enforced at the Pod level. Currently, we do not support launching multiple TaskGroups in an executor.
[p]Will this be enforced on Custom Executors or are you just referring to the behavior of the Command Executor?
[q]Yes, it'll be enforced for custom executor as well.
[r]How will this enforcement manifest itself? If a CustomExecutor is currently running, and a Scheduler Launches a new TaskGroup specifying that already running Executor, what will happen?
[s]It will be rejected by the agent.
[t]Ok. Seems an odd change in behavior. Tasks can reuse running Executors. TaskGroups cannot. This seems to be a side effect of the desire to delay tackling sub-container cgroups, and not necessarily something which should be true forever. If we pretended that cgroup and disk isolators were already implemented for sub-tasks would you still make this design decision?
[u]I agree this seems an odd behavior, and I think it does not make sense for the agent to reject such a request (launching one more task group with an existing custom executor). The reason is that a custom executor is free to launch a task group as sub-containers or in whatever way it wants, but the agent can never know this in advance. So what if a custom executor just wants to launch a new task group as normal processes rather than sub-containers? Does it still make sense for the agent to reject it?
Actually I think a custom executor should be free to launch any number of task groups in whatever way it wants; if it launches a task group as sub-containers, then that task group is a pod. And it is OK for the command executor to only launch one task group, just like it can only launch one task now.
[v]There are some open questions related to allowing multiple task groups (which also exist today with multi-task executors). For example, if the executor exits just before task group 2 is received by the master, should we re-launch the executor automatically? Is that what the user/framework wanted? The resource accounting for executor also gets tricky because the master has no clue whether the executor is already running or just terminated and will be relaunched by the agent etc. We can allow multiple task groups with the same caveats, but not sure if it's useful or becomes a crutch for users. Another option is to disallow it, until we figure out the right way to solve this.
[w]Can you clarify which issues you describe here are specific to multi-task Executors and which are specific to TaskGroups running on Executors?
ExecutorIDs are listed in Offers today. It would perhaps disambiguate some of the problems you describe if an attempt to launch a Task or TaskGroup on an already running Executor treated running Executors like any other Resource. If the Executor crashes just before launch of a Task or TaskGroup then the Task or TaskGroup cannot be launched as the Resource (running Executor) is no longer available. Just an idea. The answer to the question above is still important to me.
[x]@Vinod, for the issue you described, if the executor exits just before task group 2 is received by the master, I think the executor will be relaunched by the agent when the agent receives task group 2; that should be the current behavior of multi-task executors, right? I think we can just follow it for multi-task-group executors.
Actually when you look at Kubernetes on Mesos, each Mesos agent node will have a kubelet running on it as a custom executor, and that kubelet will be responsible for launching multiple pods; it would be strange for it to launch only a single pod. I think we should enable any Mesos custom executor with a similar capability.
[y]We will allow multiple task groups for custom executor with the above caveats. It won't be supported in the default pod executor.
[z]_Marked as resolved_
[aa]_Re-opened_
OK, so that means custom executor can launch multiple pods. And what about network? Can each pod have its own network? Or they all have to be in the same network with the executor?
[ab]not multiple pods. multiple task groups per pod. there is still going to be one network info that will be shared by all task groups.
[ac]I am a bit confused about the difference between pod and task group. So here a pod is:
1. A default executor + a single task group.
2. A custom executor + multiple task groups.
Right?
[ad]Task group has nothing to do with Pod. It is just a mechanism in Mesos to send a group of tasks with all-or-nothing semantics. It's a building block for building pod support. Can you state your use case and it will be easier to go from there.
[ae]The use case is that user may want to launch a group of co-located containers which share the same network so that they can access each other with localhost. This is similar to how Kubernetes supports pod.
[af]OK, this sounds quite doable using the existing primitive. The executor container will have a network namespace which all sub-containers underneath will share. A sub-container is not necessarily tied to a task.
[ag]But user may want the custom executor to launch multiple task groups and the tasks of each group are in a separate network. Consider that the framework may be multi-user/multi-tenant aware, so each framework user may launch their own task group, but they may not want all these task groups in the same network. It is natural that user1 wants to launch a task group in network1, and user2 wants to launch a task group in network2. I do not think the existing primitive in this design doc can support it.
And can you please elaborate a bit about "sub-container does not necessarily tied to tasks"? I think each task in a task group should be a sub-container, right?
[ah]ok, in that case, can you use multiple executor containers each of which has a separate network stack?
[ai]Yes, we can. But the problem is, how can framework know when it should launch multiple executors each of which is responsible for launching a task group in a separate network, and when it should launch a single executor which is responsible for launching multiple task groups which are all in the same network? And as framework users, they do not care about executor, they only care about their task group and the network it will join, but with the current primitives, how can a user specify whether a task group should be put into a separate network or not?
[aj]+benh@mesosphere.io please take a look at the newly added authn and authz section for nested containers.
[ak]Is this also how I can share scratch space between containers? I'm imagining it would consume executor "disk" resources; NOT persistent or external. +jie@mesosphere.io
[al]What's the behavior of a HOST volume mount with a relative path?
[am]It'll be the same.
[an]So the relative path .. it is relative to the sandbox of the executor?
[ao]relative to the sandbox of the sub-container
[ap]Strongly +1 to this. We've already felt the pain of supporting multiple types of executors (even possibly written in different programming languages) in a multi-framework setup in the same Mesos cluster. With pod support, there is less reason to reinvent the wheel, because many usages exist only because people need to run multiple tasks under the same executor.
[aq]Think about whether we can make the containerizer interface not depend on `ExecutorInfo` and `TaskInfo`. We should probably use a protobuf to wrap all of them.