Campaign Storage Hardware

a guide for resellers & users - v1.0, 2017-01

Dr. Peter Braam

Introduction

Networking

Campaign Storage Metadata Repository and Mover Nodes

The Campaign Storage Metadata Repository

The Campaign Storage Object Repository

Campaign Storage Mover nodes

Robinhood Nodes

Small Campaign Storage Systems

Scale Out

Incorporation of fast caches

Scaling the Campaign Storage Object Repository

High availability

Hardware Implications

Introduction

Campaign Storage (CS) provides a new approach to managing large volumes of data.  Its strengths lie in nearly unlimited scalability, fast movement of data into and out of the Campaign Storage system, radically new data policy management, and the use of a wide variety of low-cost, industry-standard storage software and hardware.

A campaign storage hardware system must include at least one of each of the following:

  1. CS metadata repository (CS-MR).  This is a simplified Lustre installation on ZFS that includes CS metadata servers (CS-MDS) and CS object storage servers (CS-OSS). A minimum of two servers is required.
  2. CS object repository (CS-OR). The object repository holds file data and can consist of objects leveraging tape, objects in public or private cloud object storage, or the objects can be embedded in the CS metadata repository for maximum speed and full POSIX semantics.
  3. CS mover nodes. The mover nodes, of which there must be at least one, are connected both to the existing customer system and to the Campaign Storage system (a minimal configuration is sketched below).
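
As a planning aid, the Python sketch below captures the minimum component counts listed above; the dictionary structure and field names are illustrative placeholders rather than a prescribed configuration.

    # Minimal Campaign Storage bill of nodes (illustrative only;
    # component names follow the list above, counts are the stated minimums).
    minimal_cs_system = {
        "cs_mr": {"cs_mds": 1, "cs_oss": 1},  # metadata repository: at least two servers
        "cs_or": {"object_stores": 1},        # object repository (Black Pearl, cloud, or embedded)
        "movers": 1,                          # at least one mover node
    }

    def is_complete(system):
        """Check that at least one of each required component is present."""
        return (system["cs_mr"]["cs_mds"] >= 1
                and system["cs_mr"]["cs_oss"] >= 1
                and system["cs_or"]["object_stores"] >= 1
                and system["movers"] >= 1)

    assert is_complete(minimal_cs_system)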

As we will see, larger CS systems are not essentially different.  Larger systems simply contain more servers and drives in the CS metadata and object repositories and more mover nodes.  To achieve high availability, drives are shared between active-passive failover pairs using enclosures.  Most system design proposals below allow for re-use of some components when scaling up an initial CS system.  Figure 1 summarizes this layout.

Figure 1: Campaign Storage schematic layout.

Figure 1 implies the first customer-specific consideration: which network connections will be used between the systems?

After we have articulated the network questions, we describe the qualities the Campaign Storage nodes must have, and finally we address how the system can scale and provide high availability with different hardware choices.

Networking

The user-provided storage system, a Lustre cluster or other storage farm, will be using an interconnect that is typically already deployed.  The Campaign Storage system itself uses one or more interconnects, and can optionally share the interconnect used by the user-provided storage system.

The following questions address the customer environment and preferences; each item below gives the question, details for the answer, and the response to the customer (a port-count sketch follows this list):

Question: CS Object Repository

Details for answer: Each object store in CS v1 must be on a 10/25/40GbE network shared with each mover node.  The metadata repository must also be on this network, minimally with 1GbE connections.  Does the customer data center have sufficiently many 10GbE and 1GbE switch ports to use for Campaign Storage?

Consider that the bulk of archival data will be stored on the object storage repository, while typically only smaller amounts of data will be held for (de)staging on the metadata cluster, which can provide higher performance and POSIX semantics when needed.

Note: IB to CS object repositories is not supported in v1.

Response to Customer:

YES: We would like to use those ports, i.e.

m - 10/25/40GbE ports for m movers

o - 10/25/40GbE ports for o Black Pearl server units

s - 1GbE ports

NO: Our installation will include a 10GbE switch to connect mover nodes and object stores.

Question: CS Metadata Repository

Details for answer: The mover nodes and the metadata repository (both CS-MDS and CS-OSS nodes) require connectivity supported by Lustre on a high-performance network.  Do you have at least one spare port on this network for each CS-MDS, each CS-OSS and each mover node?

If the CS metadata network is the same network as the customer storage network, then CS mover nodes do not need a second interface but can use their existing interface to communicate with the CS metadata repository.

The CS metadata repository can also join the customer storage network through a single available port, or it can be kept on a separate network.  In the latter case mover nodes require two interfaces.

Response to Customer:

YES: We would like to use those ports, i.e. #CS MDS and #CS OSS ports and #mover ports for the metadata repository.  See figure 2.

One port is available: We suggest sharing the customer storage network through an auxiliary switch (see figure 3).

Ports are available for mover nodes: The Campaign Storage installation will include an IB switch with ports for all CS-MDS, CS-OSS and CS mover nodes (see figure 4).  The customer storage system and Campaign Storage system have separate networks.

No ports are available: Campaign Storage mover node software must be installed on nodes in the customer storage cluster (see figure 5).  The customer storage system and Campaign Storage system have separate networks.

Question: Optional Robinhood Server

Details for answer: Are Robinhood servers required?

Response to Customer:

YES: Make network provisions for Robinhood nodes as for CS mover nodes.

Question: Management Network

Details for answer: Each mover and repository node needs to be connected to a 1GbE network.  Are ports available?

Consider the sensitivity of the data on the CS cluster.  A separate management network may facilitate different authorization levels for administration.

Response to Customer:

YES: Provide a port in a switch connected to the internet.

NO: A small network switch will be included to provide a management network for the Campaign Storage servers.  This switch should have a port to connect to a site-internal network.

Question: External Management Access

Details for answer: For optional external access, the CS OSS node running the management service requires a second GbE interface used solely for a VPN connection (ssh or other).  Is external management selected?

Response to Customer:

YES: Provide a port in a switch connected to the internet, to be used for the VPN connection to the dedicated interface on the CS OSS node running the management service.

NO: All management has to be done from another node in the facility connected to the management network.
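
To summarize the port questions above, the sketch below tallies switch ports per network for a candidate configuration.  It is a simplification under stated assumptions: one data-network port per mover and per object store, one metadata-network port per CS-MDS, CS-OSS, mover and Robinhood node, one management port per node, and an extra GbE port only when external management is selected.

    # Rough switch-port tally per network (illustrative; adjust per site).
    def port_counts(movers, object_stores, cs_mds, cs_oss, robinhood=0,
                    external_management=False):
        nodes = movers + cs_mds + cs_oss + robinhood
        return {
            "10/25/40GbE (object repository)": movers + object_stores,
            "metadata network (e.g. IB)": cs_mds + cs_oss + movers + robinhood,
            "1GbE management": nodes,
            "1GbE VPN": 1 if external_management else 0,
        }

    # Example: 2 movers, 1 Black Pearl unit, 1 CS-MDS, 1 CS-OSS, 1 Robinhood node.
    print(port_counts(movers=2, object_stores=1, cs_mds=1, cs_oss=1, robinhood=1,
                      external_management=True))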

The following figures omit the Campaign Object Repository network, the management network and the VPN connections, and focus solely on the connections between the Campaign Metadata Repository, the CS mover nodes and the customer cluster.  Figure 2 shows the networking schematic where the Customer Storage Cluster and the Campaign Storage metadata repository share a network because sufficiently many ports are available.  Figure 3 shows the schematic that applies when the network is shared through an auxiliary switch, while Figure 4 shows the networking schematic when the Customer Storage Cluster and the Campaign Storage metadata repository are on separate networks.

Figure 2: Customer Storage System switch has sufficiently many ports.

Figure 3: Customer Storage System switch has one free port.  Customer Storage network and Campaign Storage network are shared through auxiliary switch.

Figure 4: Ports available for mover nodes. Customer Storage System and Campaign Storage network form separate networks.

When the customer storage cluster has no free ports, the mover node software must run on nodes in the user storage cluster.  This configuration has the disadvantage that the user storage cluster software and the Campaign Storage software must co-exist on those nodes.  This is drawn schematically in figure 5.

Figure 5: No ports are available in the customer storage cluster.  The mover nodes are embedded in the customer storage cluster and require a second network interface to connect to the campaign switch.  Customer Storage System and Campaign Storage network form separate networks.

Figure 6 shows the networking that arises if the Campaign Object Repository network is added to the case depicted in figure 3.  The optional Robinhood node does not require high-speed access to the CS-OR because it invokes dedicated Lustre-HSM-to-Campaign mover nodes for data movement to the CS-OR.

Figure 6: Campaign Storage MR networking and Campaign Storage OR networking schematic.

Campaign Storage Metadata Repository and Mover Nodes

The Campaign Storage Metadata Repository

Campaign Storage v1 leverages a collection of Lustre cluster file systems for metadata storage.  These cluster file systems use ZFS.  If desired for staging files, so-called embedded Campaign Storage object storage can be included, in which case file data placed in that particular CS OR has the full semantics of the Lustre file system.  In this subsection, we describe the hardware used when Campaign Storage uses a single “small” cluster file system as its CS-MR.

The repository consists minimally of one CS-MDS and one CS-OSS server node.  Each should have storage, formatted as ZFS on the CS-MDS and as ZFS or ldiskfs on the CS-OSS.  This storage holds inode information, including extended attributes relating the inodes in the CS MR to data objects on the CS OR; optionally, archival attributes, work lists and similar data can also be stored on the CS MR.
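
Since the link between CS MR inodes and CS OR objects is kept in extended attributes, those attributes can be examined with standard Linux xattr calls.  The sketch below merely lists whatever attributes are present on a file; the CS-specific attribute names and the example path are not specified by this guide and are placeholders.

    import os

    def dump_xattrs(path):
        """List the extended attributes stored on a file in the CS metadata repository."""
        for name in os.listxattr(path):
            value = os.getxattr(path, name)
            print(f"{path}: {name} = {value!r}")

    # Example (hypothetical path inside a mounted CS MR file system):
    # dump_xattrs("/mnt/cs_mr/projects/run42/output.dat")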

Fact: The metadata usage for each file or directory in the CS MR is likely to average 1KB to 2KB on the CS MDS, and 512 bytes on a CS OSS.  A “Trash” folder in Campaign Storage tracks deleted files and objects that can be removed, and until the trash is cleared, deleted files should be assumed to consume metadata as well.  Unless the CS MR is used for staging, the CS OSS only stores this metadata for each file[1].  If staging is used, and we in fact recommend always including a small staging area on the CS OSS because of the very low additional cost, then the CS OSS must additionally have sufficient space to store staged files.

The wide variation of disk use per inode is determined by how much metadata and operational data is saved; for example, Campaign Storage can have a per file command log when archiving, but such a log can double disk space consumption.

The following questions cover customer decisions regarding CS MR storage space; each item gives the question, considerations, and the resulting hardware design (a worked sizing example follows):

Question: How many files will be stored in the CS system?

Considerations: For performance reasons flash solutions (NVMe or SATA/SAS) must be used, and because this storage holds the core metadata, redundancy such as dual or triple replication is required.

Hardware Design: The CS MDS servers must have access to 2K * N bytes of storage for N files, multiplied by the redundancy factor.  E.g. for 1B files with triple redundancy, approximately 6TB of space is needed.

Question: How much staging space must be available?

Considerations: Each inode, staged or not, will consume roughly 500 bytes on the CS OSS nodes.  Staging data is in addition to this.  Staging can use disk storage if the staged files are large.
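
The per-file figures above can be combined into a simple capacity estimate.  The sketch below uses the roughly 2KB-per-file MDS figure, the roughly 512-byte-per-file OSS figure and a chosen redundancy factor; applying the redundancy factor to the OSS figure and the chosen staging size are assumptions made for illustration.

    # Metadata capacity estimate (illustrative; per-file figures from the text above).
    def cs_mr_capacity_bytes(num_files, redundancy=3,
                             mds_bytes_per_file=2048, oss_bytes_per_file=512,
                             staging_bytes=0):
        mds = num_files * mds_bytes_per_file * redundancy
        oss = num_files * oss_bytes_per_file * redundancy + staging_bytes
        return mds, oss

    # Example: 1 billion files, triple redundancy, 10 TB of staging space.
    mds, oss = cs_mr_capacity_bytes(1_000_000_000, redundancy=3, staging_bytes=10 * 10**12)
    print(f"CS-MDS: {mds / 10**12:.1f} TB, CS-OSS: {oss / 10**12:.1f} TB")
    # CS-MDS: 6.1 TB, consistent with the ~6TB worked example above.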

Regarding the servers in the CS MR, the following performance considerations must be kept in mind (a rough time-to-populate estimate follows):

Question: What is the desired CS MR file creation rate?

Considerations: A single dual-socket server with 128GB RAM can sustain approximately 5-20K creations/sec up to 10B files[2][3].

Question: What is the desired index update rate?
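
The quoted creation rate also gives a rough time-to-populate estimate; the rate range below comes from the item above, while the file count and the assumption of a single metadata server are illustrative.

    # Rough time to create N files at the quoted per-server creation rates.
    def populate_hours(num_files, creates_per_sec):
        return num_files / creates_per_sec / 3600

    for rate in (5_000, 20_000):  # 5-20K creations/sec (from the item above)
        print(f"1B files at {rate}/s: {populate_hours(1_000_000_000, rate):.0f} hours")
    # Roughly 14 to 56 hours for one billion files on a single server.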

The Campaign Storage Object Repository

The CS OR can contain one or more nodes, and in CS v1 such nodes will be hybrid disk/tape Spectra Logic Black Pearl nodes or disk-based Spectra Logic Black Pearl nodes.

The capacity and performance of moving data into the CS OR should be discussed with the supplier of the CS OR storage nodes.

Campaign Storage Mover nodes

Mover and Robinhood nodes will normally incur a higher CPU load.  They traverse source file systems or their changelogs and move files while computing checksums and perhaps encrypting data.  In the case of Robinhood, the node also makes database entries, which has proven to be very CPU and memory intensive.

The CS mover node should have enough CPU power to drive an incoming and outgoing stream of data, maximizing the network rates to the CS object storage repository and to the Customer Storage Cluster.  A typical dual socket, 64GB RAM compute client will be able to do this.
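
To illustrate why mover CPU capacity matters, the sketch below streams a file in chunks while computing a checksum, which is representative of the per-byte work a mover performs; the paths, chunk size and choice of SHA-256 are placeholders, not the mover's actual implementation.

    import hashlib

    def stream_with_checksum(src_path, dst_path, chunk_size=4 * 1024 * 1024):
        """Copy a file in chunks while computing a SHA-256 digest of the data moved."""
        digest = hashlib.sha256()
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while chunk := src.read(chunk_size):
                digest.update(chunk)  # CPU cost grows with the number of bytes moved
                dst.write(chunk)
        return digest.hexdigest()

    # Example (hypothetical paths):
    # stream_with_checksum("/mnt/customer_fs/data.bin", "/mnt/cs_mr/staging/data.bin")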

Robinhood Nodes

Robinhood servers assist with Lustre HSM management.  They are optional but allow more complex batch commands to be easily executed, and they require database storage because they scan the source file system.  Their operations tend to require high CPU utilization.  Good results have been shown with fast flash storage, very large amounts of RAM and appropriately tuned MySQL databases[4].  Although optional, Robinhood is presently the only mechanism integrated with Campaign Storage for scanning source file systems and applying policies based on database searches.

Small Campaign Storage Systems

A Campaign Storage file archive with 1B files may be held in as little as 512GB of storage on the CS MDS and CS OSS.  Moreover, the ZFS based storage model makes it easy to expand by adding zpools to the CS MDS and CS OSS.

For redundancy we recommend, for example, using 3 disks with parity in each ZFS pool; hence 3x 256GB flash drives or a mirrored 512GB pair can be a good entry point to provide 1TB of metadata space.
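
The two entry-level layouts provide the same usable space per pool; the arithmetic below is a simplified estimate that ignores ZFS overheads, and it assumes the ~1TB figure counts one such pool on the CS MDS plus one on the CS OSS.

    # Simplified usable capacity for the two entry-level metadata pool layouts
    # (ignores ZFS overheads such as slop space and padding).
    def raidz1_usable_gb(num_drives, drive_gb):
        return (num_drives - 1) * drive_gb  # one drive's worth of capacity goes to parity

    def mirror_usable_gb(drive_gb):
        return drive_gb  # a mirrored pair stores a single copy's worth of data

    per_pool = raidz1_usable_gb(3, 256)       # 3x 256GB flash with parity -> ~512GB usable
    assert per_pool == mirror_usable_gb(512)  # equivalent to a mirrored 512GB pair
    print(f"~{2 * per_pool / 1024:.0f} TB across one pool each on the CS MDS and CS OSS")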

Because metadata space consumption is relatively small, we support only flash (or successor technologies) as metadata storage.  PCIe-connected drives will increase speed roughly 2x over SAS SSDs but may offer less expandability.

CPU usage for smaller metadata collections can be low on the MDS and OSS nodes, and we recommend nodes with a single Xeon chip and 64GB of RAM.

The OSS node also functions as a management server from which other servers boot.  One should reserve a separate mirrored pair of drives on such a node for booting purposes.

Figure 7: Small Footprint, non-HA CS system with Robinhood

Trials can be conducted without redundancy.

Figure 8: Campaign Storage Trial System

Scale Out

Incorporation of fast caches

The CS-MR can in all configurations include an internal CS-OR, the I-CS-OR.  When the I-CS-OR is designated as the storage location for files, the data store of the file system underlying the CS-MR is used for storage.  This I-CS-OR will have superior semantics and performance over external CS-ORs, but will not offer the lower cost and archival features that can be gained with external CS-ORs such as Black Pearl.

Scaling the Campaign Storage Object Repository

Object mirroring, sharding and striping will be supported in early versions of the CS product.  No special configurations are necessary except multiple instances and sufficient network connectivity to the CS-OR instances.
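
As a generic illustration of sharding objects across several CS-OR instances (not the product's actual placement algorithm), a stable hash of the object identifier can select the instance:

    import hashlib

    def pick_or_instance(object_id, or_instances):
        """Map an object identifier to one of several CS-OR instances (illustrative only)."""
        h = int(hashlib.sha1(object_id.encode()).hexdigest(), 16)
        return or_instances[h % len(or_instances)]

    # Example with hypothetical instance names:
    # pick_or_instance("0x1a2b3c", ["blackpearl-1", "blackpearl-2", "cloud-bucket-a"])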

High availability

Campaign Storage offers multiple ways to create high availability:

  1. Intra Data Center availability
    a. In this case, redundant Lustre servers are used with multipathed, redundant storage, and the CS-OR offers its own form of redundancy.  Hardware choices are widely discussed by vendors and on the web.
    b. The benefit, with careful testing and configuration (which in practice tends to be hard), can be minimal downtime of operations during hardware failures.
  2. Inter Data Center availability
    a. In this case, replication takes place at the ZFS level to mirror snapshots to a secondary location.  The CS-OR performs its native mirroring to a second data center.
    b. This form of availability and redundancy offers protection from disasters.
    c. It is suitable for wide-area maintenance of archives.

Hardware Implications

Hardware offering the two forms of availability is easily configured.

If intra data center availability is desired, vendor solutions for the CS-MR may be preferred. Such solutions command a premium over standard configurations.  

Wide area replication takes place at the level of ZFS Pools on Lustre Servers offering the CS-MR.  This uses TCP/IP, and hence all Lustre servers should be equipped with suitable connectivity.   The CS-OR typically also replicates over TCP/IP.
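
As a sketch of this style of replication, a ZFS snapshot stream can be piped over ssh to the secondary site; the pool, snapshot and host names below are hypothetical, and a production setup would use incremental sends and its own scheduling.

    import subprocess

    def replicate_snapshot(dataset, snapshot, remote_host, remote_dataset):
        """Send one ZFS snapshot to a remote pool over ssh (illustrative sketch)."""
        send = subprocess.Popen(["zfs", "send", f"{dataset}@{snapshot}"],
                                stdout=subprocess.PIPE)
        subprocess.run(["ssh", remote_host, "zfs", "receive", "-F", remote_dataset],
                       stdin=send.stdout, check=True)
        send.stdout.close()
        return send.wait()

    # Example with hypothetical names:
    # replicate_snapshot("csmr-pool/mdt0", "hourly-2017-01-01", "dr-site", "csmr-pool/mdt0")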



[1] Future versions of Campaign Storage may optionally run without CS OSS nodes.

[2] The creation rate is strongly dependent on the amount of metadata stored by the mover utilities. The highest rates are achieved when merely inodes are created, while adding archival attributes or work logs can lead to a slowdown measured to be approximately 4x.

[3] The creation of files in CS often goes hand in hand with a traversal of the Customer Storage System.  Such systems may exhibit limited performance or require throttling of data management operations so that they can perform their primary functions at adequate performance.  These considerations are beyond the scope of Campaign Storage performance planning.

[4] Refer to the LUG presentations from Bull (2015) and Stanford (2016).