Support for pods in Mesos

1. Motivation

2. User Stories

3. Goals

4. Non-goals

5. Solution Overview

6. Solution Details

6.1. Framework API Changes

6.1.1. Restart Policy

6.2. Master Changes

6.3. Agent Changes

6.3.1. Goal

6.3.2. Main idea

6.3.3. Workflow overview

6.3.3.1. Launch a sub-container

6.3.3.2. Destroy a sub-container

6.3.3.3. Destroy an executor container

6.3.3.4. Update a sub-container

6.3.4. Details

6.3.4.1. Agent API changes

6.3.4.2. Authentication and Authorization

6.3.4.3. Sub-container reaping

6.3.4.4. Checkpointing and recovery

6.3.4.5. Sandbox layout

6.3.4.6. Volume sharing

6.3.4.7. Why let agent fork-exec?

6.3.4.8. Command executor

6.3.4.9. Containerizer API

6.3.5. Parking lot

7. References

1. Motivation

Mesos doesn’t have first class support for running pods. A pod can be defined as a set of containers co-located and co-managed on an agent that share some resources (e.g., network namespace, volumes) but not others (e.g., container images, resource limits).

2. User Stories

  • As a user, I want to be able to run a side-car container (e.g., logger) next to my main application container.
  • As a user, I want to be able to run an adapter container (e.g., metrics endpoint, queue consumer) next to my main container.
  • As a user, I want to run transient tasks inside a pod for operations which are short-lived and whose exit does not imply that the pod should exit (e.g., a task which backs up data in a persistent volume).

3. Goals

  • Frameworks should be able to launch a group of co-located containers on an agent and manage them as a single unit.
  • Containers in a pod should be able to share network namespaces and volumes but potentially have separate resource limits and images.
  • Both the Mesos built-in executor and custom executors should be able to launch and manage pods. The built-in executor might be different from the existing built-in command and Docker executors.

4. Non-goals

  • Updating the pod.
  • Adding support for pods in DockerContainerizer.

5. Solution Overview

Mesos already has a concept of Executors and Tasks. An executor can launch multiple tasks, where each of the tasks as well as the executor itself can specify a ContainerInfo. The main idea is to leverage the executor and task abstractions to provide pod semantics, where the executor runs in an outer container (also called the executor container) and each of its tasks might run in a separate nested container (also called a sub-container) inside the outer container.

The limitation of the existing Scheduler and Executor APIs is that there is no way to send a collection of tasks, atomically, to an executor. Currently, even if a scheduler launches multiple tasks targeted at the same executor in a LAUNCH offer operation, those tasks are delivered to the executor one at a time via separate LAUNCH events. This lack of atomicity is problematic for providing pod semantics because some of the tasks might be dropped (e.g., due to network issues), resulting in a pod that is in a partially functioning state. A pod is easier to reason about if it has all-or-nothing semantics with respect to its constituent containers.

To achieve this, we will introduce a new abstraction called TaskGroupInfo. A `TaskGroupInfo` is a collection of tasks, potentially with some metadata, that is atomically delivered to the executor to launch. An executor that wants to provide pod semantics could interpret the `TaskGroupInfo` as a group of nested/sub-containers.

6. Solution Details

The workflow for schedulers to launch pods will be very similar to how they launch tasks today. When a scheduler gets an offer, it will use a new LAUNCH_GROUP offer operation to launch a collection of tasks. The master does the required validation and accounting and forwards the task group to the agent via a new RunGroupMessage message. The agent first sets up the outer container and launches an executor inside it. When the executor registers, the agent sends it a new LAUNCH_GROUP executor event. If the executor wants to set up a nested container for a task, it can send a new LAUNCH call to the agent. Note that an executor might not want to isolate its tasks and hence not put them inside nested containers. (TODO: Need a flow diagram here)

6.1. Framework API Changes

message TaskGroupInfo {
  repeated TaskInfo tasks = 2;
}


message Offer {
  ...
  message Operation {
    enum Type {
      LAUNCH = 1;
      LAUNCH_GROUP = 6;    // <-- New operation
      ...
    }
    ...
    message LaunchGroup {
      required ExecutorInfo executor = 1;
      required TaskGroupInfo task_group = 2;
    }
    ...
    optional LaunchGroup launch_group = 7;
  }
}

To allow frameworks to launch a task group with the default Mesos executor, we will let frameworks set `LaunchGroup.ExecutorInfo` with only some fields set (e.g., ExecutorID). The Mesos master/agent will fill in the necessary fields in the ExecutorInfo (e.g., CommandInfo, Resources) for the default executor. Having the framework set the ExecutorID is important because it lets the framework easily shut down the pod/executor (and its constituent tasks) and/or know the termination status of the executor. To make this work, we need to change `ExecutorInfo.command` from required to optional, thus making all fields in ExecutorInfo, except `executor_id`, optional. We will also add a new `ExecutorInfo.type` field to explicitly distinguish default and custom executors.

message ExecutorInfo {
  enum Type {
    UNKNOWN = 0;

    // Mesos provides a simple built-in default executor that frameworks can
    // leverage to run shell commands and containers.
    //
    // NOTES:
    //
    // 1) `command` must not be set when using a default executor.
    //
    // 2) If "cpu" or "mem" resources are not set, default values of 0.1 cpu
    //    and 32 MB memory will be used; schedulers should account for these
    //    resources when accepting offers.
    //
    // 3) Default executor only accepts a *single* `LAUNCH` or `LAUNCH_GROUP`
    //    offer operation.
    DEFAULT = 1;

    // For frameworks that need custom functionality to run tasks, a `CUSTOM`
    // executor can be used. Note that `command` must be set when using a
    // `CUSTOM` executor.
    CUSTOM = 2;
  }

  // For backwards compatibility, if this field is not set when using `LAUNCH`
  // offer operation, Mesos will infer the type by checking if `command` is
  // set (`CUSTOM`) or unset (`DEFAULT`). `type` must be set when using
  // `LAUNCH_GROUP` offer operation.
  //
  // TODO(vinod): Add support for explicitly setting `type` to `DEFAULT`
  // in `LAUNCH` offer operation.
  optional Type type = 15;

  ...
  optional CommandInfo command = 7;  // <-- Change this from `required` to `optional`
  ...
}

Note that, when using the default executor (identified by `ExecutorInfo.type` being set to DEFAULT), frameworks are expected to set the resources needed for the executor. Specifically, the only fields that frameworks are not allowed to set when using the default executor are:

  • `ExecutorInfo.command`
  • `ExecutorInfo.container.type`, `ExecutorInfo.container.docker` and `ExecutorInfo.container.mesos`[a]

To allow containers to share a network namespace and volumes, frameworks can set `Volume` and `NetworkInfo` in `ExecutorInfo.container`. Subsequently, frameworks can set `Volume` on `TaskInfo` to mount a volume set on `ExecutorInfo` (See `Volume Sharing` section below for details).
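For illustration, here is a minimal sketch (assuming generated C++ bindings for the protos above; the names, resource amounts and the `scalar` helper are illustrative only, and required fields such as the agent id from the offer are omitted) of how a scheduler could construct a `LAUNCH_GROUP` operation for a pod with a main container and a side-car logger, using the default executor:

// Illustrative helper to build a scalar resource (not part of the proposal).
static Resource scalar(const std::string& name, double value)
{
  Resource resource;
  resource.set_name(name);
  resource.set_type(Value::SCALAR);
  resource.mutable_scalar()->set_value(value);
  return resource;
}

// Build a LAUNCH_GROUP offer operation for a two-container pod.
Offer::Operation buildLaunchGroup()
{
  // Default executor: only `executor_id`, `type` and resources are set;
  // Mesos fills in the rest (e.g., CommandInfo).
  ExecutorInfo executor;
  executor.set_type(ExecutorInfo::DEFAULT);
  executor.mutable_executor_id()->set_value("my-pod-executor");
  executor.add_resources()->CopyFrom(scalar("cpus", 0.1));
  executor.add_resources()->CopyFrom(scalar("mem", 32));

  // Two co-located tasks, delivered atomically as one task group.
  TaskGroupInfo taskGroup;

  TaskInfo* app = taskGroup.add_tasks();
  app->set_name("app");
  app->mutable_task_id()->set_value("app-1");
  app->mutable_command()->set_value("./run-app.sh");
  app->add_resources()->CopyFrom(scalar("cpus", 1));
  app->add_resources()->CopyFrom(scalar("mem", 128));

  TaskInfo* logger = taskGroup.add_tasks();
  logger->set_name("logger");
  logger->mutable_task_id()->set_value("logger-1");
  logger->mutable_command()->set_value("./ship-logs.sh");
  logger->add_resources()->CopyFrom(scalar("cpus", 0.1));
  logger->add_resources()->CopyFrom(scalar("mem", 32));

  // The new offer operation, sent back to the master when accepting the offer.
  Offer::Operation operation;
  operation.set_type(Offer::Operation::LAUNCH_GROUP);
  operation.mutable_launch_group()->mutable_executor()->CopyFrom(executor);
  operation.mutable_launch_group()->mutable_task_group()->CopyFrom(taskGroup);

  return operation;
}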

On the executor side, we need to add a new LAUNCH_GROUP event. There are other changes needed for the Executor API that are discussed in the “Agent Changes” section.

message Event {
  enum Type {
    ...
    LAUNCH = 2;
    LAUNCH_GROUP = 8;  // <-- New
    ...
  }

  message LaunchGroup {
    required TaskGroupInfo task_group = 1;
  }

  ...
  optional LaunchGroup launch_group = 8;
}

6.1.1. Restart Policy

Every task will have a “restart policy” that determines what an executor should do with a task if the task terminates[b][c][d][e][f][g][h].


message TaskInfo {
  message RestartPolicy {
    enum Type {
      ALWAYS = 1;
      ON_FAILURE = 2;
    }

    optional Type type = 1;
    optional int32 max_restarts = 2;
    optional DurationInfo min_restart_delay = 3;
  }

  ...
  optional RestartPolicy restart_policy = 13;  // <-- New
}

If a restart policy is set for a task, the executor is responsible for restarting the task (by communicating with the agent) per policy. The executor could also send a status update when a task restarts (TASK_RESTARTING) to inform the scheduler about the restart, so that the scheduler can use the information to better reschedule the pod.
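For illustration, a sketch (assuming generated bindings for the proposed `RestartPolicy` proto above; the values are arbitrary) of a task that asks to be restarted on failure at most three times, with at least ten seconds between restarts:

// `task` is a TaskInfo being prepared, e.g., one of the tasks in the earlier
// LAUNCH_GROUP sketch. Restart on failure, at most 3 times, waiting >= 10s
// between restarts.
TaskInfo::RestartPolicy* policy = task.mutable_restart_policy();
policy->set_type(TaskInfo::RestartPolicy::ON_FAILURE);
policy->set_max_restarts(3);

// DurationInfo is expressed in nanoseconds.
policy->mutable_min_restart_delay()->set_nanoseconds(
    10LL * 1000 * 1000 * 1000);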

For the MVP, we might get away with not having a first-class task-level restart policy. Instead, the default executor will terminate if any of its tasks terminates with a non-zero exit code.

6.2. Master Changes

The master should have a handler for the new LAUNCH_GROUP operation. After receiving the operation, it does the requisite validation (e.g., the total resources of the tasks and executor are not more than the offered resources). For the MVP, the master will store tasks belonging to a TaskGroup in a flat list of tasks, just like it does today for non-task-group tasks.

The master sends the `TaskGroup` information to the agent via the new `RunGroupMessage` message. The master should know whether a given agent can actually understand `RunGroupMessage`. Ideally, this will be done via a newly introduced agent capability. For the MVP, the master will leverage the agent version for this.

message RunGroupMessage {
  required FrameworkInfo framework = 1;
  required TaskGroupInfo task_group = 2;
  optional string pid = 3; // Framework PID for sending Framework messages.
}

6.3. Agent Changes

6.3.1. Goal

The default executor should be able to leverage the existing containerization infrastructure to create sub-containers[i], instead of implementing its own containerization. Ideally, any custom executor should be able to leverage the same primitives to create sub-containers.

6.3.2. Main idea

The agent will be responsible for fork-exec'ing the sub-container, ideally in the same way it fork-execs the executor container. The executor will initiate the sub-container creation by making an API request to the agent. The request will contain all the information about the sub-container. The agent will use the same flow to provision the rootfs, invoke isolators, fork the child process, exec the command, and finally reap the container. The isolators and launcher need to be nesting aware so that cgroups, namespaces and filesystems can be nested properly under the executor container.

6.3.3. Workflow overview

We might not want to always tie sub-containers to tasks. Eventually, we’d like to allow executors to create sub-containers that are not tied to tasks. For instance, a custom executor can use the same primitive to launch a sub Docker container for some executor-related work that shares the executor’s resources.

6.3.3.1. Launch a sub-container

  1. Executor sends a Call to the agent. The message should[j] include `ContainerInfo`, `CommandInfo` and `Resources`. In order for the executor to wait for the sub-container, the executor has to generate a `ContainerID` as well.
  2. Agent receives the message, finds the corresponding executor container id according to the connection, and asks the containerizer to launch a sub-container nested under the parent executor container.
  3. Containerizer will provision the image specified in `ContainerInfo` in a given location. The output of this step is a path to the root filesystem of the sub-container and the corresponding image manifest information.
  4. For each isolator, containerizer will call prepare. The isolator needs to be nesting aware. For instance, cgroups isolators will create sub-cgroups that are nested under the parent cgroup. Some isolators might not support nesting[k]. This is OK because the executor container is still properly isolated! The output of this stage is a launch info containing information about the env/workdir/capability/namespaces/etc (basically, how the sub-container should be launched).
  5. Launcher is used to fork the sub-container. Launcher will remember the sub-container in its in-memory data structure so that it can destroy the sub-container as well. The Launcher is responsible for putting the sub-container into properly nested namespaces (e.g., for pid namespace, the launcher will call setns and unshare).
  6. Containerizer will call launcher `wait` on the sub-container, which waits for the sub-container to finish. The `wait` will return the exit status code of the sub-container. The reaping of the sub-container is discussed in the “Sub-container reaping” section below.
  7. For a set of operator configured isolators (e.g., `--nested_isolators[l]`), containerizer will call isolate with the pid. The pid needs to be the pid in the host pid namespace. These isolators need to be nesting aware.
  8. Containerizer will fetch URIs if specified.
  9. Containerizer will signal the sub-container to continue and exec the command.
  10. Agent will reply to the executor with a Response indicating whether the launch was successful. If not, an error message will be included in the response.
  11. The executor will send another Call to the agent to wait for the sub-container to terminate. The agent will then call containerizer to wait on the given sub-container id. If the sub-container terminates, a Response will be sent back to the executor.

Currently, the following isolators need to be adjusted to be nesting aware[m]:

  • `filesystem/linux`: for sandbox mounts, persistent volumes
  • `docker/runtime`: for getting env/workdir/etc. information from image manifest
  • `docker/volume`: for docker volumes used by sub-container
  • `network/cni`: for preparing /etc/* files for each sub-container
  • `gpu/nvidia`: for nvidia library files
  • `disk/du`: to exclude volumes for tasks

We do not plan to tackle resource-related isolators like `cgroups/*` and `disk/*` in the MVP. Resource isolation is still enforced at the executor level.[n][o][p][q][r][s][t][u][v][w][x][y][z][aa][ab][ac][ad][ae][af][ag][ah][ai]

6.3.3.2. Destroy a sub-container

  1. A sub-container destroy is initiated in two scenarios: 1) the launcher `wait` returns, indicating that the init process has exited; 2) the executor decides to kill the sub-container because it received a kill-task request (or due to some kill policy).
  2. In the second case above, the executor will send a Call to the agent with the corresponding sub-container id.
  3. The launcher’s `destroy` is called with the sub-container id to properly kill the sub-container.
  4. For each isolator, containerizer will call `cleanup` with the sub-container id.
  5. Containerizer will call deprovision with the sub-container id.
  6. The containerizer `wait` on the sub-container id will be ready, and the agent will send a Response to the executor notifying that the sub-container has terminated (i.e., a Response will be generated for the wait Call with the exit status).

6.3.3.3. Destroy an executor container

When the executor container terminates, all of its sub-containers will be recursively destroyed as well. The launcher will also GC all the checkpointed pids and exit statuses for sub-containers at that time, because they are no longer needed.

6.3.3.4. Update a sub-container

Update will not be supported for sub-containers. This should not be a problem for the Pod case because we’ll tie a task to a sub-container, and a task’s resources cannot be changed after it is launched.

6.3.4. Details

6.3.4.1. Agent API changes

The following calls will be added to the Agent API to manage sub-containers. An executor or any of its tasks or sub-containers can use this API to launch sub-containers.

package mesos.agent;

message Call {
  enum Type {
    ...

    // Calls for managing nested containers underneath an executor's container.
    NESTED_CONTAINER_LAUNCH = 14;  // See 'NestedContainerLaunch' below.
    NESTED_CONTAINER_WAIT = 15;    // See 'NestedContainerWait' below.
    NESTED_CONTAINER_KILL = 16;    // See 'NestedContainerKill' below.
  }

  // Launches a nested container within an executor's tree of containers.
  message NestedContainerLaunch {
    required ContainerID container_id = 1;
    optional CommandInfo command = 2;
    optional ContainerInfo container = 3;
    repeated Resource resources = 4;
  }

  // Waits for the nested container to terminate and receives the exit status.
  message NestedContainerWait {
    required ContainerID container_id = 1;
  }

  // Kills the nested container. Currently only supports SIGKILL.
  message NestedContainerKill {
    required ContainerID container_id = 1;
  }

  optional Type type = 1;

  ...

  optional NestedContainerLaunch nested_container_launch = 6;
  optional NestedContainerWait nested_container_wait = 7;
  optional NestedContainerKill nested_container_kill = 8;
}

message Response {
  enum Type {
    ...

    NESTED_CONTAINER_WAIT = 13;    // See 'NestedContainerWait' below.
  }

  // Returns termination information about the nested container.
  message NestedContainerWait {
    optional int32 exit_status = 1;
  }

  optional Type type = 1;

  ...

  optional NestedContainerWait nested_container_wait = 14;
}
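For illustration, a minimal sketch (assuming generated C++ bindings for the `mesos.agent` protos above; the `send()` helper that posts a Call to the agent's API endpoint is hypothetical) of an executor launching a nested container and then waiting for it:

// Launch a nested container running "sleep 1000" under this executor's
// container. The ContainerID is generated by the executor (see step 1 of the
// launch workflow above).
mesos::agent::Call launch;
launch.set_type(mesos::agent::Call::NESTED_CONTAINER_LAUNCH);

ContainerID containerId;
containerId.set_value("nested-container-1");  // Illustrative; e.g., a UUID.

mesos::agent::Call::NestedContainerLaunch* nested =
  launch.mutable_nested_container_launch();
nested->mutable_container_id()->CopyFrom(containerId);
nested->mutable_command()->set_value("sleep 1000");

send(launch);  // Hypothetical helper: POST the Call to the agent API.

// Wait for the nested container to terminate; the agent eventually replies
// with a NESTED_CONTAINER_WAIT Response carrying the exit status.
mesos::agent::Call wait;
wait.set_type(mesos::agent::Call::NESTED_CONTAINER_WAIT);
wait.mutable_nested_container_wait()
  ->mutable_container_id()->CopyFrom(containerId);

send(wait);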

6.3.4.2. Authentication and Authorization[aj]

Since the Agent API changes above allow any client to potentially launch sub-containers, these calls should be properly authenticated and authorized.

To do this, we will add a new field `principal` to ExecutorInfo to identify the principal of the executor container. Note that all sub-containers launched within the executor container are expected to use the same principal for launching further sub-containers.

message ExecutorInfo {
  ...
  optional string principal = 10;
}

While the framework is expected to set the principal of the executor, it is the agent’s (or an agent module’s) responsibility to inject the proper credentials for that principal into the container. We do not want to add a `secret` field to ExecutorInfo because it’s not a secure way to transmit credentials.

For agents configured with the default CRAM-MD5 authenticator, the agent will inject the `principal` and `secret` into the container’s (or sub-container’s) environment. The agent looks inside the credentials provided by the `--http_credentials` flag to find the matching secret for a given principal.

For authorizing sub-container management (i.e., launch, wait and kill API calls) we will add a new authorizable action `MANAGE_NESTED_CONTAINER`.

// NOTE: This is an existing proto.
message Object {
  optional string value = 1;
  optional FrameworkInfo framework_info = 2;
  optional Task task = 3;
  optional TaskInfo task_info = 4;
  optional ExecutorInfo executor_info = 5;
  optional quota.QuotaInfo quota_info = 6;
}

enum Action {
  ...

  // This action will authorize the `NESTED_CONTAINER_LAUNCH`,
  // `NESTED_CONTAINER_WAIT` and `NESTED_CONTAINER_KILL` agent API calls.
  // This will have an object with `FrameworkInfo` and `ExecutorInfo` set.
  MANAGE_NESTED_CONTAINER = 19;
}

The default local authorizer will have the following new ACLs.

message ACL {
  ...

  message ManageNestedContainer {
    required Entity principals = 1;
    required Entity users = 2;
  }
}

Note that this means a sub-container inside an executor can launch a sub-container inside another executor’s container, assuming it has the proper privileges!
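For illustration, a minimal sketch of configuring the new ACL with the default local authorizer (the `manage_nested_containers` field on the `ACLs` message is hypothetical; only the `ManageNestedContainer` message itself is proposed above):

// Illustrative only: allow principal "pod-framework" to manage nested
// containers running as OS user "webapp", and deny everyone else.
ACLs acls;
acls.set_permissive(false);

// Hypothetical repeated field on `ACLs` holding the new ACL message.
ACL::ManageNestedContainer* acl = acls.add_manage_nested_containers();
acl->mutable_principals()->add_values("pod-framework");
acl->mutable_users()->add_values("webapp");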

6.3.4.3. Sub-container reaping

Since the agent forks the sub-container, the executor cannot directly reap the sub-container’s exit status. However, the executor does need to know the exit status to be able to generate a proper status update for the framework.

One way to solve this problem is to improve the existing launcher so that it serves as a nanny process, reaping the pid as well as forwarding signals. The pid as well as the exit status of the sub-container will be written to a well known location (e.g., /var/run/mesos/launcher/linux/<container_id>) so that we can get the exit status even if the agent crashes and restarts. We need to introduce a `wait` function for the launcher instead of relying on `process::reap`, which assumes a parent-child relationship.
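For illustration, the per-container checkpoint location could look like the following (the `pid` and `exit_status` file names are hypothetical; only the base path is from the proposal above):

/var/run/mesos/launcher/linux/<container_id>/
      |--- pid
      |--- exit_status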

6.3.4.4. Checkpointing and recovery

Currently, containerizer checkpoints the pid of the executor (`forkedPid`) so that when the agent crashes and restarts, it can use that information to wait for the executor to exit (i.e., `process::reap`). Some isolators (e.g., the port mapping network isolator) also rely on that to do proper recovery.

Ideally, we should improve the Launcher interface to handle the wait, instead of calling `process::reap`. This is because after the agent crashes and restarts, the parent-child relationship is broken. Calling `waitpid` on the executor pid won’t get us the exit code. Although this is not a problem for executors, it’ll become a problem for task containers because the pod executor needs the exit code to decide what status update to send back to the framework.

We should introduce a new method in the Launcher interface (e.g., `wait`) to wait for a container to terminate and return the exit code.

class Launcher
{
public:
  // Wait for the container to terminate and return its exit status.
  virtual Future<Option<int>> wait(const ContainerID& containerId);
};

In fact, the Launcher should be the one that checkpoints the pid, instead of the containerizer. For backwards compatibility, the containerizer still needs to know the pid of the executor because some isolators still depend on that. Also, we need to deal with scenarios where the Launcher does not have a checkpointed pid (since the Launcher does not do checkpointing currently).

For orphan containers, the logic should be:

  1. The launcher will discover all hierarchical relationships between containers.
  2. Any unknown executor container (known to the launcher but not to the agent) will be treated as an orphan. Any sub-container nested underneath will be treated as an orphan too (see the sketch below).
  3. The containerizer will call `wait` on the non-orphaned executor containers and their sub-containers (that means the launcher needs to tell the containerizer about all the containers, including sub-containers, in a structured way).
  4. Orphans will be destroyed. Ideally, destroying the executor container will recursively destroy all its sub-containers.
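A minimal sketch of the classification in step 2 (a hypothetical helper, assuming a `ContainerID` that carries a reference to its parent container and a `hashset` of the executor containers known to the agent):

// A nested container inherits the orphan status of its top-level (executor)
// ancestor: if the executor container is unknown to the agent, the whole
// subtree is treated as orphaned.
bool isOrphan(
    const ContainerID& containerId,
    const hashset<ContainerID>& knownExecutorContainers)
{
  // Walk up to the top-level executor container.
  ContainerID current = containerId;
  while (current.has_parent()) {
    current = current.parent();
  }

  return !knownExecutorContainers.contains(current);
}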

6.3.4.5. Sandbox layout

The sandbox layout for an executor with nested containers will look like the following:

.../executors/<executorId>/runs/<containerId>/
      |--- stdout
      |--- stderr
      |--- volume/
      |--- .containers/
             |--- <containerId>/
                     |--- stdout
                     |--- stderr
                     |--- volume/
                     |--- .containers/
                             |--- …

For command tasks, this is not a backwards-compatible change. We need to have some exceptions in the code to deal with that.

6.3.4.6. Volume sharing

It is a common requirement that containers in a pod be able to share volumes so that they can communicate through the filesystem.

Since the rootfs of the sub-container sits on the host filesystem, the executor container might not have access to it. That means the sub-container has to be able to see the host filesystem before it changes its filesystem root. The implication is that we might not be able to support a host volume for a sub-container whose source is a path in its parent container. We can still support it if the source is in the parent container’s sandbox. In that case, we need to introduce a new Volume.Source.

Persistent volumes or external volumes can be specified at either the executor level or the task level. To be able to share a persistent volume or external volume between containers in a Pod, the framework should specify those volumes at the executor level and define a new volume for the task whose source is the executor’s sandbox[ak][al][am][an][ao] (the new Volume.Source discussed above).
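For illustration, a sketch of sharing a path between the executor and a task (the `SANDBOX_PATH` source type is a hypothetical name for the new `Volume.Source` mentioned above; everything else uses the existing `Volume` proto, and `executor`/`task` are the messages from the earlier LAUNCH_GROUP sketch):

// Executor-level volume: e.g., a persistent or external volume mounted at
// "<executor sandbox>/shared" in the executor container (details elided).
Volume* shared = executor.mutable_container()->add_volumes();
shared->set_mode(Volume::RW);
shared->set_container_path("shared");

// Task-level volume whose source points back into the executor's sandbox,
// using the hypothetical new source type.
Volume* taskVolume = task.mutable_container()->add_volumes();
taskVolume->set_mode(Volume::RW);
taskVolume->set_container_path("shared");
// taskVolume->mutable_source()->set_type(Volume::Source::SANDBOX_PATH);     // hypothetical
// taskVolume->mutable_source()->mutable_sandbox_path()->set_path("shared"); // hypothetical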

6.3.4.7. Why let agent fork-exec?

We ask the agent to do the fork-exec for the sub-containers, instead of creating a binary for the executor to do the fork-exec, for the following reasons:

  1. We need CAP_SYS_ADMIN for most of the namespace forking. Although with user namespace support the executor can become privileged inside its own user namespace (user namespace forking does not require CAP_SYS_ADMIN), we don’t want to tie pod support to user namespace support.
  2. We want to avoid complex synchronization between the agent and the executor. Isolators live in the agent, so the agent will be responsible for calling all the isolate functions. If we were to do the fork-exec in the executor, we would have to do a bunch of synchronization between the two processes, which is far more complex and error prone than having all synchronization live within the agent.
  3. The executor might not have access to the host filesystem where images are provisioned. If we were to let the executor do the fork-exec and subsequently pivot_root (chroot), we would have to have a way to inject the root filesystem for the task into the executor container.

6.3.4.8. Command executor

Currently, the command executor and the task are in the same executor container. Eventually, we should stop treating this as a special case: the command executor should run in the executor container and create a sub-container for its task.

Ideally, we should improve the command executor to support Pods, making it a general *default* executor in Mesos. Alternatively, if this is too aggressive, we can keep the existing command executor and create a new executor for Pods. We’ll eventually retire the old command executor once the Pod executor is stable.[ap]

6.3.4.9. Containerizer API

class Containerizer
{
public:
  virtual Future<Nothing> recover(...);

  // Launch the executor container.
  // @param environment additional environment variables for the executor.
  // @param pidFile path to the pid file for checkpointing (if needed).
  //        This will be deprecated soon in favor of letting the Launcher
  //        checkpoint.
  virtual Future<bool> launch(
      const ContainerID& containerId,
      const ExecutorInfo& executorInfo[aq],
      const string& directory,
      const Option<string>& user,
      const map<string, string>& environment,
      const Option<string>& pidFile);

  // Launch a sub-container nested under the executor container.
  virtual Future<bool> launch(
      const ContainerID& containerId,
      const ContainerID& parentContainerId,
      const CommandInfo& command,
      const Option<ContainerInfo>& containerInfo,
      const Resources& resources,
      const string& directory,
      const Option<string>& user);

  // This only applies to the executor container. It should be called
  // only if the total resources of an executor change. For example,
  // it receives a new task, or a task becomes terminal.
  virtual Future<Nothing> update(
      const ContainerID& containerId,
      const Resources& resources);

  // Resources and status can only be retrieved at the executor level for now.
  virtual Future<ResourceStatistics> usage(const ContainerID& containerId);
  virtual Future<ContainerStatus> status(const ContainerID& containerId);

  // Wait for the container to terminate.
  virtual Future<ContainerTermination> wait(const ContainerID& containerId);

  // Destroy the given container.
  virtual Future<Nothing> destroy(const ContainerID& containerId);
};

message ContainerTermination {
  message Reason {
    enum Type {
      MEMORY_LIMIT_REACHED = 0;
      DISK_LIMIT_REACHED = 1;
      LAUNCH_FAILED = 2;
      UPDATE_FAILED = 3;
    }

    optional Type type = 1;
    optional string message = 2;
  }

  // Exit status of the init process.
  optional int32 status = 1;

  // Reasons that caused the termination.
  repeated Reason reasons = 2;
}
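For illustration, a minimal sketch (a hypothetical agent-side helper, not part of the proposal) of mapping a `NESTED_CONTAINER_LAUNCH` call onto the nested `launch()` overload above and chaining the `wait()` so the exit status can be relayed back to the executor:

// Launch the requested sub-container under the executor's container and
// return a future that completes when the sub-container terminates.
Future<ContainerTermination> launchAndWait(
    Containerizer* containerizer,
    const ContainerID& parentContainerId,  // The executor's (outer) container.
    const mesos::agent::Call::NestedContainerLaunch& launch,
    const std::string& sandbox)            // The sub-container's sandbox path.
{
  return containerizer->launch(
      launch.container_id(),
      parentContainerId,
      launch.command(),
      launch.has_container()
        ? Option<ContainerInfo>(launch.container())
        : Option<ContainerInfo>::none(),
      Resources(launch.resources()),
      sandbox,
      None())
    .then([=](bool launched) -> Future<ContainerTermination> {
      if (!launched) {
        return Failure("Failed to launch the sub-container");
      }

      // The containerizer `wait()` completes once the sub-container exits.
      return containerizer->wait(launch.container_id());
    });
}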

6.3.5. Parking lot

  1. Agent should reject requests to create a sub-container if the container was launched by the Docker containerizer.
  2. setns/unshare for pid namespace needs a double fork.
  3. Move image volume provision to an isolator (i.e., `volume/image`). We can have other volume isolators in the future, e.g., `volume/ebs`, `volume/host`. We probably need to split filesystem/linux into multiple volume isolators.
  4. We probably need to remember the namespaces used by the top container so that we can setns into it (pid might be enough).
  5. Signal forwarding is needed for the nanny process?
  6. NetworkInfo should be only specified at ExecutorInfo level?
  7. ContainerLogger?
  8. What about signal escalation?
  9. Remove external containerizer?
  10. Tech debt around `Termination` protobuf.
  11. Reject if posix launcher is used?
  12. No nested cgroups for MVP?
  13. Isolator capability and --nested_isolators flag

7. References

http://kubernetes.io/docs/user-guide/pods/

https://github.com/docker/docker/issues/8781

http://aurora.apache.org/documentation/latest/reference/configuration/

[a]+clambert@mesosphere.io here is the blurb about the default executor resources. the min values allowed are 0.1 cpu and 32 MB (see proto above).

[b]What about taskgroup level restart policy? If we delegate that to the framework, will that be latency issue? or what if there's a network partition.

[c]is it really a task-group level "restart" policy, or "longevity" policy?

1. group lives as long as ANY task is running

2. group lives as long as ALL tasks are running

[d]as a follow-up, how should a pod react to a task-group that exits w/ a failure condition? does it stop processing all following task groups, or attempt to continue?

[e]There can be tasks that are as important as the executor within the Taskgroup. In such cases we might want to kill the task group or restart the entire TaskGroup if that specific task dies. How do we handle TaskGroup restarts based on status of tasks other than the executor?

[f]We will punt on this for MVP. The default executor will terminate if any of the tasks terminate.

[g]_Marked as resolved_

[h]_Re-opened_

Your proposal for the MVP will make for **very** fragile pods

[i]It would be helpful to have a strong definition on what is meant by "sub-container".

[j]SHOULD or MUST?

[k]Is nesting strictly hierarchical? That is, can a nested child access something in the parent? Can this express the need (for example), for multiple tasks to be in the same PID namespace?

[l]Maybe this should be --nested_isolation to match the --isolation flag?

[m]What is the contract with an isolator defined to be 'nested aware'? Where is the definition of what 'nested' even means?

[n]Can an Executor successfully launch multiple TaskGroups?  If there is no resource isolation within the Executor then separate Pods will be able to harm each other through excessive consumption of resources.

[o]Resource isolation is still enforced at the Pod level. Currently, we do not support launching multiple TaskGroups in an executor.

[p]Will this be enforced on Custom Executors or are you just referring to the behavior of the Command Executor?

[q]Yes, it'll be enforced for custom executor as well.

[r]How will this enforcement manifest itself?  If a CustomExecutor is currently running, and a Scheduler Launches a new TaskGroup specifying that already running Executor, what will happen?

[s]It will be rejected by the agent.

[t]Ok.  Seems an odd change in behavior.  Tasks can reuse running Executors.  TaskGroups cannot.  This seems to be a side effect of the desire to delay tackling sub-container cgroups, and not necessarily something which should be true forever.  If we pretended that cgroup and disk isolators were already implemented for sub-tasks would you still make this design decision?

[u]I agree this seems an odd behavior, and I think it does not make sense for the agent to reject such a request (launching one more task group with an existing custom executor). The reason is that a custom executor is free to launch task groups as sub-containers or in whatever way it wants, but the agent can never know this in advance. So what if a custom executor just wants to launch a new task group as normal processes rather than sub-containers? Does it still make sense for the agent to reject it?

Actually I think a custom executor should be free to launch any number of task groups in whatever way it wants; if it wants to launch a task group as sub-containers, then that task group is a pod. And it is OK for the command executor to only launch one task group, just like now it can only launch one task.

[v]There are some open questions related to allowing multiple task groups (which also exist today with multi-task executors). For example, if the executor exits just before task group 2 is received by the master, should we re-launch the executor automatically? Is that what the user/framework wanted? The resource accounting for executor also gets tricky because the master has no clue whether the executor is already running or just terminated and will be relaunched by the agent etc. We can allow multiple task groups with the same caveats, but not sure if it's useful or becomes a crutch for users. Another option is to disallow it, until we figure out the right way to solve this.

[w]Can you clarify which issues you describe here are specific to multi-task Executors and which are specific to TaskGroups running on Executors?

ExecutorIDs are listed in Offers today.  It would perhaps disambiguate some of the problems you describe if an attempt to launch a Task or TaskGroup on an already running Executor treated running Executors like any other Resource. If the Executor crashes just before launch of a Task or TaskGroup then the Task or TaskGroup cannot be launched as the Resource (running Executor) is no longer available.  Just an idea.  The answer to the question above is still important to me.

[x]@Vinod, for the issue you described, if the executor exits just before task group 2 is received by the master, I think the executor will be relaunched by the agent when the agent receives task group 2; that should be the current behavior of a multi-task executor, right? I think we can just follow it for a multi-task-group executor.

Actually when you look at Kubernetes on Mesos, each Mesos agent node will have a kubelet running on it as a custom executor, and that kubelet will be responsible for launching multiple pods; it would be strange for it to just launch a single pod. I think we should give any Mesos custom executor a similar capability.

[y]We will allow multiple task groups for custom executor with the above caveats. It won't be supported in the default pod executor.

[z]_Marked as resolved_

[aa]_Re-opened_

OK, so that means custom executor can launch multiple pods. And what about network? Can each pod have its own network? Or they all have to be in the same network with the executor?

[ab]not multiple pods. multiple task groups per pod. there is still going to be one network info that will be shared by all task groups.

[ac]I am a bit confused about the difference between pod and task group. So here a pod is:

1. A default executor + a single task group.

2. A custom executor + multiple task groups.

Right?

[ad]Task group has nothing to do with Pod. It is just a mechanism in Mesos to send a group of tasks with all-or-nothing semantics. It's a building block for pod support. Can you state your use case? It will be easier to go from there.

[ae]The use case is that user may want to launch a group of co-located containers which share the same network so that they can access each other with localhost. This is similar to how Kubernetes supports pod.

[af]OK, this sounds quite doable using the existing primitive. The executor container will have a network namespace which all sub-containers underneath will share. A sub-container is not necessarily tied to a task.

[ag]But user may want the custom executor to launch multiple task groups and the tasks of each group are in a separate network. Consider that the framework may be multi-user/multi-tenant aware, so each framework user may launch their own task group, but they may not want all these task groups in the same network. It is natural that user1 wants to launch a task group in network1, and user2 wants to launch a task group in network2. I do not think the existing primitive in this design doc can support it.

And can you please elaborate a bit about "sub-container does not necessarily tied to tasks"? I think each task in a task group should be a sub-container, right?

[ah]ok, in that case, can you use multiple executor containers each of which has a separate network stack?

[ai]Yes, we can. But the problem is, how can framework know when it should launch multiple executors each of which is responsible for launching a task group in a separate network, and when it should launch a single executor which is responsible for launching multiple task groups which are all in the same network? And as framework users, they do not care about executor, they only care about their task group and the network it will join, but with the current primitives, how can a user specify whether a task group should be put into a separate network or not?

[aj]+benh@mesosphere.io please take a look at the newly added authn and authz section for nested containers.

[ak]Is this also how I can share scratch space between containers? I'm imagining it would consume executor "disk" resources; NOT persistent or external. +jie@mesosphere.io

[al]What's the behavior of a HOST volume mount with a relative path?

[am]It'll be the same.

[an]So the relative path .. it is relative to the sandbox of the executor?

[ao]relative to the sandbox of the sub-container

[ap]Strongly +1 to this. We've already felt the pain of supporting multiple types of executors (possibly even written in different programming languages) in a multi-framework setup in the same Mesos cluster. With pod support, there is less reason to reinvent the wheel, because many usages exist only because they need to run multiple tasks under the same executor.

[aq]Think about whether we can make the containerizer interface not depend on `ExecutorInfo` and `TaskInfo`. We should probably use a protobuf to wrap all of them.