1 of 24

Ephemeral hypervisors in an Ironic (Super)Cloud

Jacob Anders, HPC Team Lead, CSIRO

jacob.anders@csiro.au

OpenStack Scientific SIG meetup,

24th October 2018

2 of 24

Background

  • Since the beginning of 2018, CSIRO has been working on a bare-metal OpenStack system with Software-defined InfiniBand,
  • We call it CSIRO SuperCloud ( SuperComputer + Cloud ),
  • The main motivation behind this effort is Cloud-ifying High Performance Computing - gaining flexibility without sacrificing performance,
  • We work on SuperCloud in close collaboration with Mellanox and Red Hat

3 of 24

Background

  • as we progressed on this journey, we realised that the value goes beyond pure HPC and that we can cover many more use cases on one system,
  • the idea of an API-driven DataCentre was born,
  • as a result, SuperCloud has gained a range of new capabilities

4 of 24

Obvious questions:

  • ...but… didn’t you have “hypervisor” in the presentation title?
  • ... and if so, why would you do all the work to remove virt, just to add it back?
  • … all valid questions and we have valid answers :)

5 of 24

Motivation

  • Bare-metal with software-defined InfiniBand is an extremely capable platform for anything high performance,
  • The ability to drive physical servers through an OpenStack API is extremely powerful..
  • ..however, not every application:
    • needs the entire node,
    • puts the same emphasis on performance,
    • can leverage RDMA/InfiniBand,
    • can run smoothly on InfiniBand in the first place

6 of 24

Motivation

  • A bare-metal instance takes a good 5-10 minutes to build, while a VM takes seconds when you just want to test something quickly,
  • It’s possible to run hundreds of VMs on a few physical servers, which is very useful for testing software at scale,
  • SuperCloud hardware is a prime platform to run non-HPC workloads in a high-density setting (more on that later)

7 of 24

Design

  • Traditionally, OpenStack nodes are very static and have a well-defined role (controller, networker, compute, database, storage, monitoring node, …)
  • TripleO made a promise of provisioning OpenStack in a cloud-like fashion, but did it really deliver?

8 of 24

Design

We propose a very different approach:

  • SuperCloud supports bare-metal with multi-tenant software-defined networking out of the box,
  • How about we provision KVM compute nodes in the exact same way as our tenants provision their bare-metal instances (sketched below)?
  • This turns nova-compute into a cloud-native application
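
As a rough sketch of what this means in practice (the node name supercloud06 is hypothetical; the image is assumed to carry, or bootstrap, a configured nova-compute/libvirt stack; image, flavor, key and network names are the ones used elsewhere in this deck):

# Boot an extra hypervisor exactly like any other bare-metal tenant instance
openstack server create \
  --image rhel-7.5-baremetal \
  --flavor my-baremetal-flavor \
  --network internalapi \
  --key-name sigtest \
  supercloud06

# Once nova-compute on the new node registers itself, it shows up next to the
# existing hypervisors
openstack compute service list --service nova-compute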

9 of 24

Design

  • Both bare-metal and VM instances run on a software-defined InfiniBand fabric and can seamlessly connect to each other,
  • “Internalapi” is a private neutron VLAN network that carries all the OpenStack internal traffic,
  • All the OpenStack nodes are connected to this neutron network,
  • Currently we run both bare-metal and VM instances in VLAN tenant networks; only SR-IOV is supported for VMs (more about this later)

10 of 24

Example

# openstack server list --ip 192.168.2
+--------------------------------------+--------------+--------+--------------------------+--------------------+---------------------+
| ID                                   | Name         | Status | Networks                 | Image              | Flavor              |
+--------------------------------------+--------------+--------+--------------------------+--------------------+---------------------+
| 6c470638-cfdd-4430-ac10-7ab42a332bf7 | supercloud05 | ACTIVE | internalapi=192.168.2.23 | rhel-7.5-baremetal | my-baremetal-flavor |
| 85c06ed9-c826-4d76-8aba-e6fafa1165d3 | supercloud04 | ACTIVE | internalapi=192.168.2.26 | rhel-7.5-baremetal | my-baremetal-flavor |
+--------------------------------------+--------------+--------+--------------------------+--------------------+---------------------+

# openstack compute service list
+----+------------------+-----------------------+----------+---------+-------+----------------------------+
| ID | Binary           | Host                  | Zone     | Status  | State | Updated At                 |
+----+------------------+-----------------------+----------+---------+-------+----------------------------+
| 9  | nova-conductor   | supercloud03.csiro.au | internal | enabled | up    | 2018-10-24T08:41:56.000000 |
| 12 | nova-scheduler   | supercloud03.csiro.au | internal | enabled | up    | 2018-10-24T08:42:03.000000 |
| 13 | nova-consoleauth | supercloud03.csiro.au | internal | enabled | up    | 2018-10-24T08:42:00.000000 |
| 15 | nova-compute     | supercloud03.csiro.au | nova     | enabled | up    | 2018-10-24T08:41:58.000000 |
| 18 | nova-compute     | supercloud05.csiro.au | nova     | enabled | up    | 2018-10-24T08:42:02.000000 |
| 19 | nova-compute     | supercloud04.csiro.au | nova     | enabled | up    | 2018-10-24T08:41:58.000000 |
+----+------------------+-----------------------+----------+---------+-------+----------------------------+

11 of 24

Preparing lab environment for the SIG presentation

# openstack network create sigdemo01
# openstack subnet create --network sigdemo01 --subnet-range 192.168.1.0/24 sigdemo01

# openstack network list
+--------------------------------------+-----------+--------------------------------------+
| ID                                   | Name      | Subnets                              |
+--------------------------------------+-----------+--------------------------------------+
| 7be3a29d-e500-451c-a781-b7504e8e99e7 | sigdemo01 | 9275ace2-0d5f-4999-b023-e92a5e5b31f3 |
| baded261-3d37-45c1-8d2c-d7113c9aae44 | external  | a14e3d06-a5d3-4a4b-8ad2-239d5e7f0692 |
+--------------------------------------+-----------+--------------------------------------+
# openstack network show 7be3a29d-e500-451c-a781-b7504e8e99e7 | grep segmentation_id
| provider:segmentation_id | 7

# cat sigtest.sh

#!/bin/bash
# Create 8 VM instances, each with an SR-IOV (direct) port on sigdemo01
for i in $(seq -w 1 8); do
  openstack port show sigtest$i || openstack port create --network sigdemo01 --vnic-type direct sigtest$i
  openstack server create --image rhel-7.5-baremetal --key-name sigtest --flavor m1.small --port sigtest$i sigdemo$i
done
# Create 2 bare-metal instances on the same network
for i in $(seq -w 1 2); do
  openstack server create --image rhel-7.5-baremetal --key-name sigtest --flavor my-baremetal-flavor --network sigdemo01 sigdemo-bm$i
done

12 of 24

Example

# openstack server list --all-projects --name sig --long -c Name -c Status -c Networks -c 'Flavor Name' -c Host
+-------------+--------+--------------------------------------+---------------------+-----------------------+
| Name        | Status | Networks                             | Flavor Name         | Host                  |
+-------------+--------+--------------------------------------+---------------------+-----------------------+
| sigtest8    | ACTIVE | sigdemo01=192.168.1.12               | m1.small            | supercloud05.csiro.au |
| sigtest7    | ACTIVE | sigdemo01=192.168.1.14               | m1.small            | supercloud04.csiro.au |
| sigtest1    | ACTIVE | sigdemo01=192.168.1.4, 152.83.14.133 | m1.small            | supercloud04.csiro.au |
| sigdemo-bm2 | ACTIVE | sigdemo01=192.168.1.7                | my-baremetal-flavor | supercloud03.csiro.au |
| sigdemo-bm1 | ACTIVE | sigdemo01=192.168.1.6, 152.83.14.144 | my-baremetal-flavor | supercloud03.csiro.au |
+-------------+--------+--------------------------------------+---------------------+-----------------------+

13 of 24

Quick RDMA demo

# openstack server list --all-projects --name sig --long -c Name -c Status -c Networks -c 'Flavor Name' -c Host
+-------------+--------+--------------------------------------+---------------------+-----------------------+
| Name        | Status | Networks                             | Flavor Name         | Host                  |
+-------------+--------+--------------------------------------+---------------------+-----------------------+
(...)
| sigdemo2    | ACTIVE | sigdemo01=192.168.1.8, 152.83.14.143 | m1.small            | supercloud05.csiro.au |
| sigdemo1    | ACTIVE | sigdemo01=192.168.1.4, 152.83.14.133 | m1.small            | supercloud04.csiro.au |
| sigdemo-bm2 | ACTIVE | sigdemo01=192.168.1.7, 152.83.14.132 | my-baremetal-flavor | supercloud03.csiro.au |
| sigdemo-bm1 | ACTIVE | sigdemo01=192.168.1.6, 152.83.14.144 | my-baremetal-flavor | supercloud03.csiro.au |
+-------------+--------+--------------------------------------+---------------------+-----------------------+

14 of 24

RDMA demo (running between VM & BM on FDR10)

[root@sigdemo1 ~]# ib_write_bw 192.168.1.7
RDMA_Write BW Test
Number of qps   : 1         Transport type : IB
Connection type : RC        Using SRQ      : OFF
TX depth        : 128
CQ Moderation   : 100
Mtu             : 2048[B]
Link type       : IB
---------------------------------------------------------------------------------------
#bytes   #iterations   BW peak[MB/sec]   BW average[MB/sec]   MsgRate[Mpps]
65536    5000          4510.19           4509.52              0.072152
---------------------------------------------------------------------------------------
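
Not shown on the slide: the receiving end of the test which, per the table on the previous slide, is the bare-metal instance sigdemo-bm2 (192.168.1.7). With the standard perftest tools, the server side is simply started without a target address before the client command above is run from the VM:

[root@sigdemo-bm2 ~]# ib_write_bw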

15 of 24

How does SDN handle that?

  • Compute nodes are added to “VLANs” as instances on these “VLANs” get scheduled to them,
  • From there, the hypervisors map ports to VLANs appropriately (the mapping can be checked as sketched below),
  • SDN flexibility is maintained and so is network isolation
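
A quick way to see that mapping from the API side, reusing the port and network names from the earlier demo (the exact fields printed depend on the client and plugin versions):

# How the SR-IOV port is bound and which hypervisor it landed on
openstack port show sigtest1 | grep binding
# Which VLAN segment the tenant network was allocated
openstack network show sigdemo01 | grep provider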

16 of 24

Now, you mentioned high-density

  • How to achieve high VM density with SR-IOV?
  • How to achieve high network density with VLAN tenant networks?

17 of 24

SDN limitations

  • Currently, our SDN does not support vxlan tenant networks (however, this might change in the future),
  • On the other hand, bare-metal + SDN gives us plenty of options to work around such limitations :)

18 of 24

There’s more to SuperCloud than what I’ve shown:

# openstack server list --all-projects --name supercloud --long -c Name -c Status -c Networks -c 'Flavor Name' -c Host
+--------------+--------+--------------------------+---------------------+-----------------------+
| Name         | Status | Networks                 | Flavor Name         | Host                  |
+--------------+--------+--------------------------+---------------------+-----------------------+
| supercloud10 | ACTIVE | private=10.0.0.24        | my-baremetal-flavor | supercloud03.csiro.au |
| supercloud09 | ACTIVE | private=10.0.0.8         | my-baremetal-flavor | supercloud03.csiro.au |
| supercloud08 | ACTIVE | private=10.0.0.19        | my-baremetal-flavor | supercloud03.csiro.au |
| supercloud07 | ACTIVE | private=10.0.0.16        | my-baremetal-flavor | supercloud03.csiro.au |
| supercloud05 | ACTIVE | internalapi=192.168.2.23 | my-baremetal-flavor | supercloud03.csiro.au |
| supercloud04 | ACTIVE | internalapi=192.168.2.26 | my-baremetal-flavor | supercloud03.csiro.au |
+--------------+--------+--------------------------+---------------------+-----------------------+

19 of 24

These run a separate OpenStack instance with fully virtualised vxlan networking

# openstack compute service list
+----+------------------+-----------------------+----------+---------+-------+----------------------------+
| ID | Binary           | Host                  | Zone     | Status  | State | Updated At                 |
+----+------------------+-----------------------+----------+---------+-------+----------------------------+
| 3  | nova-conductor   | supercloud10.csiro.au | internal | enabled | up    | 2018-10-24T09:16:09.000000 |
| 4  | nova-scheduler   | supercloud10.csiro.au | internal | enabled | up    | 2018-10-24T09:16:07.000000 |
| 5  | nova-consoleauth | supercloud10.csiro.au | internal | enabled | up    | 2018-10-24T09:16:07.000000 |
| 9  | nova-compute     | supercloud08.csiro.au | nova     | enabled | up    | 2018-10-24T09:16:05.000000 |
| 10 | nova-compute     | supercloud09.csiro.au | nova     | enabled | up    | 2018-10-24T09:16:05.000000 |
| 13 | nova-compute     | supercloud07.csiro.au | nova     | enabled | up    | 2018-10-24T09:16:04.000000 |
+----+------------------+-----------------------+----------+---------+-------+----------------------------+

20 of 24

It’s much higher density than SuperCloud proper

# openstack hypervisor stats show
+----------------------+--------+
| Field                | Value  |
+----------------------+--------+
| count                | 3      |
| current_workload     | 0      |
| disk_available_least | 620    |
| free_disk_gb         | 691    |
| free_ram_mb          | 82294  |
| local_gb             | 1311   |
| local_gb_used        | 620    |
| memory_mb            | 393078 |
| memory_mb_used       | 310784 |
| running_vms          | 601    |
| vcpus                | 60     |
| vcpus_used           | 601    |
+----------------------+--------+

# openstack server list | grep -c ACTIVE
600

# ./stackstats.sh
Wed Oct 24 05:17:22 EDT 2018 ACTIVE=600 BUILD=0 ERROR=0
[root@supercloud10 ~(keystone_admin)]#
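
Putting those numbers together: 601 running VMs on 3 KVM compute nodes with 60 physical vCPUs in total works out to roughly 200 VMs per node and about a 10:1 vCPU overcommit, which is exactly the kind of density that makes the nested system useful for scale testing.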

21 of 24

It’s much higher density than SuperCloud proper

# openstack server list --long -c Name -c Status -c Networks -c 'Flavor Name' -c Host --limit 20
+--------+--------+-----------------+-------------+-----------------------+
| Name   | Status | Networks        | Flavor Name | Host                  |
+--------+--------+-----------------+-------------+-----------------------+
| t2-285 | ACTIVE | lab=172.16.2.66 | m1.tiny     | supercloud08.csiro.au |
| t2-299 | ACTIVE | lab=172.16.2.97 | m1.tiny     | supercloud07.csiro.au |
| t2-295 | ACTIVE | lab=172.16.2.74 | m1.tiny     | supercloud09.csiro.au |
(...)
| t2-288 | ACTIVE | lab=172.16.2.64 | m1.tiny     | supercloud08.csiro.au |
| t2-281 | ACTIVE | lab=172.16.2.65 | m1.tiny     | supercloud07.csiro.au |
| t2-269 | ACTIVE | lab=172.16.2.80 | m1.tiny     | supercloud07.csiro.au |
+--------+--------+-----------------+-------------+-----------------------+

22 of 24

Summary

  • While our main focus is bare-metal OpenStack with SDN InfiniBand, we still see value in virtualisation capability,
  • Virtualisation is of even higher value if it can be seamlessly scaled up and down,
  • Ironic-to-tenant functionality is a powerful, flexible, efficient and elegant way of providing virtualisation capability on bare-metal OpenStack,
  • Most configurability limitations can be overcome by using Ironic and the OpenStack API to create more services,
  • HPC hardware is a prime cloud platform, and if the density is right, running cloud on HPC hardware can make sense even for non-HPC workloads

23 of 24

Future work

  • Enable SDN-IB to fully support vxlan tenant networks..
  • ..as well as fully-virtualised networking,
  • When that’s done, we’ll merge the HPC and high-density systems into one,
  • Orchestrate VM migration when a hypervisor needs to be deleted (a possible workflow is sketched below),
  • Enhance support for trunk ports in Ironic to enable a more flexible internal network infrastructure
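
A hypothetical drain workflow for the VM-migration point above, assuming admin credentials and working live migration between the KVM compute nodes (host and instance names are illustrative):

# Stop scheduling new VMs onto the hypervisor that is about to be removed
openstack compute service set --disable supercloud05.csiro.au nova-compute

# Live-migrate the remaining VMs away and let the scheduler pick the targets
# (newer openstack clients can also do this via "openstack server migrate")
for vm in $(openstack server list --all-projects --host supercloud05.csiro.au -f value -c ID); do
  nova live-migration "$vm"
done

# Once the node is empty, delete the underlying bare-metal instance and its
# stale compute service record
openstack server delete supercloud05
openstack compute service delete <service-id>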

24 of 24

Thank you for your attention

Question time.