Draft: Towards user facing dynamic provisioning

Gregor von Laszewski, Fugang Wang, Javier Diaz, Archit K.  (please add your name if you have contributed)

The purpose of this document is to coordinate the activities related to the deployment and software development of user facing dynamic provisioning.

Phase II deployment

In Phase I/II of FutureGrid we wanted to configure Moab in such a way that it is able to dynamically provision operating systems provided by the system administrator. This also allows a minimal user facing deployment possibility, in that a user can contact the system administrator to deploy such an image. Naturally the application of this process is limited by the following factors: (a) the deployment of the image is not automated and needs to be conducted by the system administrator; (b) the system administrator must trust the person that generates the image; (c) testing of an image deployment can only be done after step (a) is completed. We term this kind of provisioning “administrator controlled dynamic provisioning”.

However, we found in our own use cases that users would need access to the deployment mechanism to make this feature truly useful (we call this user facing dynamic provisioning).

Mitigation restrictive access for FG developers: As a mitigation to this issue we decided that a minicluster be made available to the core FG developers to test out such features. As a “testbed”, FG has restricted some of the services offered to general users to those that could be deployed on production systems such as TeraGrid. The goal here is to set up a system ASAP with the newest versions of XCAT and Moab so we can test out the new system BEFORE it is deployed in general. This activity needs to be started now.

        Status: the backup of the minicluster is in progress and will be completed by June 10, 2011

        A Jira task is filed at http://jira.futuregrid.org/browse/FG-1156

Mitigation restrictive access for FG users: In the next phase of the development we will be working on services that are distinct from those typically offered in production systems, while also focusing on the provisioning of bare metal environments via “user facing dynamic provisioning”, where users can invoke a process that creates an image for their needs that can then be deployed.

Mitigation incompatible software deployments: One of the issues we have in the current FG is that the versions of some of the key software are so different that software development is difficult to coordinate and conduct, due to incompatibility between even minor versions. As we do not have enough software developers to deal with this, we need to address this issue by aligning the software. Table 1 depicts the status of the software on the various systems as of June 7th, 2011. It was decided to upgrade most of the iDataPlex systems to the same version. The coordination of this activity is a task to be executed by Greg Pike. Recently we also found out that UC is not using XCAT, which was in our original deployment plan. We need to decide how we deal with this. This is important, as questions were recently raised whether we should be using XCAT at all (??? verify with Greg)

Mitigation reusing our own software tools to further develop them: As part of addressing the divergence in the deployed software versions, we could be using the tools that we develop ourselves to keep the system up to date. That is, instead of generating an image by hand, we first generate an image with the software installed as automatically as possible. Then we provide specialized configuration scripts that modify the images for a particular machine or node and deploy them from a repository. If a new operating system is to be used or a change takes place, we will have documented the process of quickly generating a new distribution; an update of the system will then be much less time consuming, and we can focus development and deployment resources elsewhere. The reuse of our own tools within the deployment infrastructure has the following benefits: (a) we can test them; (b) we can enhance them if we find issues or want new features.

Table 1: Installed software versions on the HPC images for the various clusters (green indicates up to date software). Items updated on June 7th are indicated with a *. Although the newest version of XCAT is installed, it is lacking the support to conduct dynamic provisioning using the image deployment tools from Javier. Also, we have identified a number of deployment issues, as some of the configuration files need to be updated. Possibly the best way to deal with this is to conduct a new installation of XCAT.

           iu-india   iu-xray                   tacc-alamo   uc-hotel   ucsd-sierra   ufl-foxtrot
gcc        4.1.2      4.1.2                     4.1.2        4.1.2      3.4.6         4.1.2
moab       5.4.0      5.3.6                     5.4.3        6.0.3      5.4.0         n/a
modules    3.2.7      3.1.6                     3.2.6        3.2.8      3.2.7         n/a
openmpi    1.4.2      n/a                       1.4.3        1.4.3      1.4.2         n/a
torque     2.5.5      2.4.4-snap.200912210954   2.4.8        2.5.5      2.4.3         n/a
xcat       2.6*       n/a                       n/a          n/a        2.6*          2.4.3

Phase III: User facing dynamic provisioning

User Stories

To motivate the Phase III development of dynamic provisioning, we first summarize a number of use cases that we wish to address.

Story A. Per Job Reprovisioning (of administrator provided images, see Phase II)

A user wishes to ensure that a particular image gets provisioned as part of the job submission to the queuing system. This use case is covered by the Moab features provided by version 6.0.4. However, the image must be deployed by the system administrator; users do not have the ability to provide their own images.

  1. The administrator includes an image in the image repository
  2. The user submits a job to a general queue. This job specifies an Image type attached to it.
  3. The Image gets reprovisioned on the resources
  4. The job gets executed within that image
  5. After the job is done, the image is no longer needed.

Process View

Sequence Diagram
Figure: Reprovisioning per job

Development Tasks

Development tasks for this activity are to set up an environment that allows the easy integration of new images into the set of images available to the users, as the user facing interfaces are already provided by the queuing system integration.

To simplify the interaction and create an abstraction layer that can be in future adapted to other environments we will develop a function called

fg-moab-image -upload -image <filename> -label <label>

This command uploads the image to the repository that can be read by the queuing system

fg-moab-image -delete -label <label>

        This command deletes the image from the repository accessible to moab

        fg-moab-image -init

This command may be called after one or more images have been uploaded in order to initialize the queuing system. The option -init can also be specified directly when an image is uploaded or deleted, and is then executed immediately after the action has been performed, as the last command.
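To make the intended behavior concrete, the following is a minimal Python sketch of the fg-moab-image wrapper. The repository location, the file layout, and the registration step performed by -init are all assumptions; the actual integration with Moab is exactly what the administrator is tasked to document below.

```python
import argparse
import os
import shutil

# Assumed repository location; the real path depends on how Moab and the
# image deployment tools are configured on the target cluster.
REPO_DIR = "/var/fg/moab-images"

def upload(image, label, repo=REPO_DIR):
    """Copy an image file into the repository under the given label."""
    dest = os.path.join(repo, label)
    os.makedirs(dest, exist_ok=True)
    shutil.copy(image, os.path.join(dest, os.path.basename(image)))

def delete(label, repo=REPO_DIR):
    """Remove a labeled image from the repository."""
    shutil.rmtree(os.path.join(repo, label))

def init(repo=REPO_DIR):
    """Placeholder: re-initialize the queuing system so it picks up the
    current set of images. The concrete Moab steps are still to be
    documented by the administrators."""

def main(argv=None):
    parser = argparse.ArgumentParser(prog="fg-moab-image")
    parser.add_argument("-upload", action="store_true")
    parser.add_argument("-delete", action="store_true")
    parser.add_argument("-init", action="store_true")
    parser.add_argument("-image")
    parser.add_argument("-label")
    args = parser.parse_args(argv)
    if args.upload:
        upload(args.image, args.label)
    if args.delete:
        delete(args.label)
    if args.init:
        init()

if __name__ == "__main__":
    main()
```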

 

The commands are restricted to authorized users. In contrast, in Phase I such commands were not available and the steps were executed by hand by the administrator.

Tasks

  1. The administrator documents how the current dynamic provisioning image is included into Moab. All steps are documented, including possible side effects, such as whether restarting the environment is necessary and what consequences this may have.
  2. A team member will be assigned to implement the above functions.
  3. A deployment instruction and strategy will be worked out to identify a mechanism on how to restrict the access to the commands through unix groups.
  4. Documentation to the user and administrators is provided as part of a manual page.
  5. We will develop a tutorial based on the information and improve the manual if needed.
  6. Commands will be developed to add users to this group. This is a convenient wrapper to the normal unix group command but provides a possible interface to a project that is managed as part of the FG portal.

Story B. Reprovisioning based on Prior State (Slight modification to Story A.)

In Story A, the system is provisioned each time the user specifies the image. However, Moab internally has the ability to not reprovision the image in case the requested image is already provisioned.

Process View

The use case documents how simple job-based dynamic provisioning can occur.

  1. The user submits a job to a general queue. This job specifies an OS type attached to it.
  2. The queue evaluates the OS requirement.
  1. Repeat the provisioning steps if the job requires multiple processors (such as a large MPI job).

Sequence Diagram

Figure: Reprovisioning based on prior state

Note: This use case also makes it possible to provision an OS, as we will be able to place it into an image. Thus a function Image := get(OS) could be used in any of our diagrams.

Development Tasks

  1. Document how Moab needs to be set up to support Story A and Story B
  2. Contrast the setup
  3. Decide which setup to provide
  4. Identify if we need a switch function to change the behaviour of Moab based on user demand

A possible command for switching this behavior could be developed:

        fg-moab -mode perjob|ondemand

This switches the behavior of Moab to either per job (Story A) or on demand (Story B).
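A minimal sketch of how such a switch could record the mode, assuming the mode is persisted to a state file whose change is then translated into the corresponding Moab configuration (the file location is hypothetical):

```python
# Hypothetical state file; in a real deployment switching the mode would
# also rewrite the relevant Moab configuration and reconfigure the scheduler.
MODE_FILE = "/etc/fg/moab-mode"
VALID_MODES = ("perjob", "ondemand")

def set_mode(mode, path=MODE_FILE):
    """Record the provisioning mode: per job (Story A) or on demand (Story B)."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of: " + ", ".join(VALID_MODES))
    with open(path, "w") as f:
        f.write(mode + "\n")

def get_mode(path=MODE_FILE):
    """Return the currently recorded provisioning mode."""
    with open(path) as f:
        return f.read().strip()
```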

Story C. Queues for using special images (we will not consider this case for now)

In certain cases it may be more convenient to generate special queues that are associated with a special image. The queue can be managed through other Moab tools, and its availability may be controlled by the administrator. The advantage is that features could be built into the system to only start jobs if the cost of provisioning the resources is below a certain threshold. An example would be when multiple queued jobs justify the provisioning of the image.

This use case illustrates how a group of users or a Virtual Organization (VO) can handle their own queue to specifically tune their application environment to their specification.

  1. A VO sets up a new queue, and provides an Operating System image that is associated with this queue.
  1. A user within the VO submits a job to the VO queue.
  2. The queue is evaluated, and determines if there are free resource nodes available.
  1. Repeat the provisioning steps if multiple processors are required (such as an MPI job).

Process View

Tasks

At this time we are not yet considering implementing this use case.

Story D. Dynamic Resource Re-Assignment

The resources of FG are assigned to different service endpoints. The current services include HPC, Nimbus, and Eucalyptus, but could also include other services. The nodes associated with the HPC service are managed through Moab, and thus all servers will have to be integrated into it. In case a server is moved to the Eucalyptus, Nimbus, or OpenStack clouds, it needs to be removed from the HPC resource pool and marked to be managed as part of the other service.

Development Tasks

In order to achieve this interaction a convenient command line tool will be made available to privileged users. This command will later be included in a metascheduler.

  1. Develop a simple command line tool
  2. Integrate the tool into the fg shell
  3. Provide functionality to declare resource pools, and move resources between the pools.

A possible command to conduct this action would be

fg-pool -create <name>

This command creates a named resource pool

fg-pool -list <name>

        Lists the resources in the named pool

fg-pool -list

        List the names of the resource pools

fg-pool -to <to> -resource <name>

moves the named resource to a particular pool

fg-pool -from <from> -to <to> -resource <name>

moves the named resource to a particular pool. However, it first checks whether the resource was previously in the from pool; if it was not, an error is returned.
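The semantics of fg-pool can be sketched in Python as follows. This is an in-memory model only; a real implementation would have to drive Moab and the respective cloud services to actually reassign the nodes:

```python
class ResourcePools:
    """In-memory sketch of the fg-pool semantics (pool and resource
    names are arbitrary strings, e.g. "hpc" or "eucalyptus")."""

    def __init__(self):
        self.pools = {}  # pool name -> set of resource names

    def create(self, name):
        """fg-pool -create <name>"""
        self.pools.setdefault(name, set())

    def list(self, name=None):
        """fg-pool -list [<name>]: pool names, or members of one pool."""
        if name is None:
            return sorted(self.pools)
        return sorted(self.pools[name])

    def move(self, resource, to, frm=None):
        """fg-pool [-from <from>] -to <to> -resource <name>.
        With frm given, it is an error if the resource is not there."""
        if frm is not None:
            if resource not in self.pools.get(frm, set()):
                raise ValueError("%s is not in pool %s" % (resource, frm))
            self.pools[frm].discard(resource)
        else:
            # remove the resource from whichever pool currently holds it
            for members in self.pools.values():
                members.discard(resource)
        self.pools.setdefault(to, set()).add(resource)
```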

Technological Decisions

The previous discussions are general design considerations that require a system software layer to interface with standard image management features. A variety of tools exist to deal with images, and we have to identify which tools are suitable for our purpose of booting a variety of images onto compute servers.

Our strategy, documented within the PEP plan, was identified in the early stages of the project and could not be changed since then: the tool to use was XCAT, integrated with Moab, while considering only RHEL as the OS. However, since then major events have taken place that motivate us to revisit the issue:

  1. the staff members promoting the exclusive use of XCAT have since left the project. The new staff is not in favour of XCAT
  2. new versions of XCAT have been released. The features that we used differ significantly between these versions, and XCAT is deemed less significant
  3. TACC has identified that XCAT does not run on their systems so they have installed a commercial product.
  4. UC is not using XCAT

Thus it is timely, before we conduct a software upgrade on all the systems, to identify a more homogeneous solution. We will however be faced with the issue that TACC is using its own cluster management strategy. As TACC provides its own solution, IU may also provide a different one.

As a consequence we will initially work on user facing dynamic provisioning only for the india and sierra machines, while foxtrot will be able to leverage that environment if desired.

We will be asking FG management for guidance on how to proceed in driving the decision.

The systems planned to have dynamic provisioning available in its various forms are depicted in Table 2.

Table 2: Proposed dynamic provisioning solutions

Host      dynamic resource provisioning   user facing OS provisioning
india     TG11?                           TG11?
sierra    TG11?                           TG11?
foxtrot   to be decided                   to be decided
hotel     Fall 2011 (impl. by IU)         to be decided
alamo     Fall 2011 (impl. by TACC)       to be decided
xray      N/A                             N/A

In order to spend software resources wisely, all software development will use the newest Moab version that is supported by adaptivecomputing.com. It will be important that the development team works together with the systems team to enable an environment that can be used for such development activities. It was decided by the system management team to provide a minicluster to the software team that will include all the newest software and mimic the setup of india once that setup has been completed. The goal here is that the software developed on the minicluster can easily be transferred and deployed onto india.

As there is variation in the available system software distributed by Adaptive Computing, the first task will be to decide which version to use. From reading the documentation, it will be the newest 64 bit version of Moab, using an ODBC connector. We will also use the newest version of TORQUE from the 2.5 branch (according to Greg, the TORQUE developers recommended not to use the newest TORQUE release). In case we choose XCAT, we would also use the newest version of XCAT.

Deciding whether to use XCAT

One of the discussions that took place recently was whether we should use XCAT for our solution. We have to keep in mind that we would like to have dynamic provisioning deployed by TG11, but also that we have limited software development support to provide such solutions. Thus we discuss the impact of two different solutions in the next sections:

  1. Solution 1: Involving XCAT as part of the solution
  2. Solution 2: Do not use XCAT as part of the solution.

Involving xCAT

The following list provides some information on what is needed for a solution using the newest version of XCAT.

  1. Using FG Image Generator and the XCAT packimage command
    (Current Implementation)

  1. Procedure
  1. We include information in the xCAT tables
  2. Create needed directories
  3. Copy the image
  4. packimage
  5. Register the image with Moab Service Manager
  1. Problems
  1. This is not a generic solution because xCAT does not provide much support outside of IBM hardware
  2. Currently, only India and Sierra have xCAT installed. Also, the installed versions require some updates to make sure xCAT is properly deployed. A recommendation was given. alamo is not IBM hardware and will not have xCAT installed
  3. All sites on which we enable this initial version of user facing dynamic provisioning should have a very similar configuration for xCAT and Moab
  4. Recent changes in the xCAT software identified that the changes were not minimal and that an update between versions is problematic
  5. Restricted to RPM based installs, when using genimage - RHEL, Centos, Fedora and OpenSuse
  6. We “almost” did it for xCAT 2.5 on India. One problem was that we needed to change xCAT’s MAC table. The other problem was that we could not provision the image using MOAB. We know the same will happen with the new version of xCAT that is now on India and Sierra:
  1. mitigation:
  1. reinstallation of XCAT and moab on at least one of the machines
  2. assure that the same version of software is used on the minicluster and on the selected machine.

  1. Advantages
  1. We have achieved this for the minicluster configuration of xCAT 2.6 and MOAB 5.4
  1. question: should we upgrade to moab 6
  2. question: do we need a second minicluster in case we are developing for the new version of Moab? (Could we not dynamically provision the old minicluster configuration if something goes wrong? ;-) We asked Archit to document how to restore.)
  1. Status
  1. We need to solve some issues with the dynamic provisioning through MOAB, because the configuration on the production machine is different from the testing machine
  2. Using the XCAT genimage and XCAT packimage commands

While the previous solution uses the FG image generation framework, this solution uses the XCAT provided generation that is targeted towards RHEL. It is a different approach.

  1. Procedure
  1. Copy the OS directory under another name (perhaps a soft link can solve it)
  2. Personalize the image created using rsync between custom directory and the master image
  3. Execute genimage
  4. packimage
  5. Register the image with Moab Service Manager
  1. Problems
  1. This is not a generic solution because xCAT is only for IBM
  2. Currently, only India and Sierra have xCAT
  3. We need to execute more steps on the management node; that is, we need to personalize the image by using chroot to install packages into the image
  4. Approach has not been shown to work
  5. If we were to do this approach the entire code base has to be changed as our FG image generation is done differently.
  1. Advantages
  1. this should work for a normal XCAT deployment.

DO NOT use xCAT

This method is documented here to provide us with an alternative to XCAT and to bring the software development to a broader user community (e.g. systems that do not use XCAT).

  1. Deploy TFTP boot server, PXE, and DHCP

  1. Create a plugin for MOAB to dynamically provision images. Modify the os.switch.pl script or create a new one. Alternatively, create a plugin for MOAB MSM, which includes setting up the system using TFTP, DHCP, and PXE booting; writing scripts to do the node reprovisioning using this setup; and implementing a Perl module as documented in plugin_howto.txt, found under $MOABINSTALL/tools/msm/doc.
  1. mitigation: we had a number of people in the past participate in MOAB seminars to learn how to develop code to enable the os.switch.pl command and give guidance on how to proceed. We suggest that this knowledge be shared with others, and that the original team be tasked to develop such a script in collaboration with the software developers that developed the image repository
  1. Procedure
  1. Copy the image to the TFTP server directory (the previous plugin should do everything when MOAB requests it)
  2. Register the image with Moab Service Manager
  1. Problems
  1. It will most likely conflict with the current configuration of the systems, especially with the TFTP boot server of xCAT, and with MOAB when it is configured to do dynamic provisioning through xCAT or Bright Computing
  2. Example of a MOAB conflict: in the current moab.cfg file we have lines with RMCFG[msm], and we would have to change these to lines with RMCFG[prov], including the one that indicates the URI of os.switch.pl
  3. Any other conflicts? DHCP server configuration?
  4. How can we control hardware that does not support IPMI?
  1. mitigation: this is only relevant for gravel; while we previously had two clusters to test things, we now only have the minicluster. It would be irrelevant for us if we had more experimental machines that are IPMI enabled, or if we set aside 4 additional nodes from a cluster on its own separate network. This has to be discussed further. Wake-on-LAN may be another interesting solution; this would allow others to build FG as a NOW (network of workstations) and promote an additional use case important for educational institutions that have lots of servers but no HPC center.
  1. The software team doubts that they can redesign the entire software and also be responsible for investigating deployment issues in time for TeraGrid 2011. The expertise for deploying such an environment is with the system administration team. To interface with each other, this mechanism needs to be learned first by using the Internet and asking the FG Admin Team
  1. Advantages
  1. It is a generic solution in terms of hardware and OS supported
  2. It should be easier for people to install a TFTP boot server with PXE than to install xCAT

  1. Status
  1. The software team has installed a TFTP server with PXE and a DHCP server on the gravel machines.
  2. We can boot images in the same way that xCAT does. For that, we use the kernel and initrd of xCAT and one of our images.
  3. We need to create an initramfs image to use instead of the normal initrd, and we also need to create a kernel that supports all these things.
  1. This is not done yet, because we are still trying to understand how this works in order to generate a proper image
  1. The main problem now is that after generating the generic image, we do not know how to customize it to get the image booting through PXE.
  1. Suggestion: we think that this process should be done by the Admin team and they should give us the procedure on how to customize the image for that purpose. The same applies to the kernel.
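To illustrate the reprovisioning step such a MOAB plugin would perform (here sketched in Python rather than the Perl of os.switch.pl), the code below writes a per-node pxelinux configuration so that the node's next network boot loads the requested kernel/initrd. The TFTP root and file layout follow the common pxelinux conventions and are assumptions about our eventual setup:

```python
import os

TFTP_ROOT = "/tftpboot"  # assumed TFTP server root

def pxe_config_name(mac):
    """pxelinux looks up per-node configs by '01-' + MAC with dashes."""
    return "01-" + mac.lower().replace(":", "-")

def set_node_image(mac, kernel, initrd, tftp_root=TFTP_ROOT):
    """Point the node's next PXE boot at the given kernel and initrd.
    After this, the node would be power-cycled (via IPMI, or wake-on-LAN
    where IPMI is unavailable) to pick up the new image."""
    cfg_dir = os.path.join(tftp_root, "pxelinux.cfg")
    os.makedirs(cfg_dir, exist_ok=True)
    config = (
        "default provisioned\n"
        "label provisioned\n"
        "  kernel %s\n"
        "  append initrd=%s\n" % (kernel, initrd)
    )
    path = os.path.join(cfg_dir, pxe_config_name(mac))
    with open(path, "w") as f:
        f.write(config)
    return path
```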

Conclusion

The question arises which solution we should continue with, based on a number of constraints that may conflict with each other.

These include the long term plan, the available developer resources, the available expertise, and the deadline for TG11. In order for this decision to be made, we would like you to include comments in this document and make additional improvements.

  1. dynamic provisioning with our FG image generation
  2. dynamic provisioning with XCAT genimage
  3. dynamic provisioning with no xCAT

(A hybrid solution of 1 and 3 is also possible. First we continue the current approach as experimented with on the minicluster. This requires less effort, so it could be possible to make the TG deadline. Meanwhile we can continue exploring approach 3, which may require some significant development effort, but whose final output would be a single solution across all FG sites. The two solutions could coexist side by side, as both of them are clients of the Moab MSM, so switching between the two is easy.) The disadvantage of this approach is that at TG11 we can only say that dynamic provisioning is available on our minicluster (which is not accessible to the users and is considered a test environment).

Based on the available resources and the short timeline, we believe solution 1 is still the one we should continue with.

Service Interface and Portal Access

The service interface of the dynamic provisioning is a related but somewhat independent issue, considering that the same architecture and mechanism could be used for virtually all FG services, namely image management (including generation and repository), dynamic provisioning/RAIN, experiment management, etc.

Currently we have adopted Python as the main development language, and we would like to continue using it to develop the service interfaces for the image generation/deployment and image repository scripts. A preliminary RESTful interface for the image repository has been started by community volunteers from UIC. We will continue to evaluate the frameworks available in Python for RESTful services, especially those that offer, or can easily be integrated with, secured REST services.

CherryPy is the framework that we have tested and that the UIC group is using. The next step is to find a compatible library/framework/solution to do HTTPS with user authentication (either HTTP Basic or Digest). Since each user already has a username/password pair from the FG portal, this credential could be used to access the secured service from a CLI or the portal.
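The credential check itself is straightforward; the sketch below shows the HTTP Basic validation logic standalone (in CherryPy this is what a checkpassword-style callback would implement). In production the username/password would be verified against the FG portal accounts rather than a local dictionary:

```python
import base64

def check_basic_auth(auth_header, users):
    """Validate an 'Authorization: Basic <base64>' header against a
    {username: password} mapping. Returns the username on success,
    None otherwise. Must only be used over HTTPS, since Basic auth
    transmits the credentials essentially in clear text."""
    if not auth_header or not auth_header.startswith("Basic "):
        return None
    try:
        decoded = base64.b64decode(auth_header[6:]).decode("utf-8")
    except Exception:
        return None  # malformed base64 or encoding
    username, sep, password = decoded.partition(":")
    if sep and users.get(username) == password:
        return username
    return None
```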

The CLI will be part of the FG software release, which will contain the service clients to access the secured FG services. A portal interface will also be provided via the current Drupal solution, for which we need a service proxy that connects to the secured FG service on one end, and serves the pages (typically generated via PHP/JavaScript) in the Drupal site where the end users initiate the service interaction and receive results.

The service proxy in Drupal is to be developed based on the Drupal Services module. We have developed and deployed such services in the portal, but only for unrestricted services. When access control is needed, more exploration is required of API key based access and/or role/username-password based authentication/authorization.

References for the Portal component GUI development

https://portal.futuregrid.org/rest-services-python

https://portal.futuregrid.org/rain-futuregrid-image-generation

References

Moab

  1. http://www.adaptivecomputing.com/resources/docs/adaptivehpc/tftp_server_and_pxe_booting.php
  2. http://www.adaptivecomputing.com/resources/docs/adaptivehpc/index.php
  3. http://www.adaptivecomputing.com/resources/docs/adaptivehpc/on_demand_provisioning.php

FG

  1. https://wiki.futuregrid.org/index.php/Sw:manual/Dynamic_Provisioning/Usecase
  2. https://wiki.futuregrid.org/index.php/Sw:Manual/UseCases
  3. https://wiki.futuregrid.org/index.php/Manual/design/queues
  4. https://wiki.futuregrid.org/index.php/Sw:Manual/RAIN
  5. https://wiki.futuregrid.org/index.php/Sw:manual/Dynamic_Provisioning
  6. https://wiki.futuregrid.org/index.php/Docs/HostOrganizer
  7. https://wiki.futuregrid.org/index.php/Hw:NodeUses
  8. https://wiki.futuregrid.org/index.php/Sw:man/fg-image-deploy
  9. https://wiki.futuregrid.org/index.php/Sw:Manual/Image_Creation/Notes
  10. https://wiki.futuregrid.org/index.php/FG_Software_Plan:_Phase_II

Tasks

  1. http://jira.futuregrid.org/browse/FG-1156

Appendix

Demo for Teragrid 2011

Guideline

  1. From a client machine, we generate an image with the packages that we want. As a result, we get a Generic Image in our client machine (Figure 1, steps from 1 to 4).
  2. We show how to upload this image to the Image Repository (IR)
  3. We list the available images in the IR and execute other commands to show how the IR works
  4. We retrieve an image, for example the same one that we just uploaded
  5. We deploy the image for HPC. This procedure includes the customization of the generic image for HPC (Figure 1, steps 5 and 6).
  6. We log into the HPC login machine and dynamically provision the image using MOAB (qsub command)
  7. We can use the command rcons to see the booting procedure and other MOAB commands like checknode or checkjob to see what is going on.

Background

  1. Figure 1 shows how the Image Management procedure works

Figure 1 Image management

  1. The image can be stored in the image repository and retrieved. For example, Step 4 could upload the image directly into the Image Repository. Figure 2 shows how to use the Image Repository.

Figure 2 Image Repository

APPENDIX NOTES TO BE INTEGRATED IN THE ABOVE TEXT

Dynamic Provisioning with XCat and Moab on the fg-gravel testing cluster

Testbed Setup

XCat Management Node: fg-gravel1.futuregrid.iu.edu

Moab Scheduler Node: fg-gravel1.futuregrid.iu.edu

Torque Server Node: fg-gravel1.futuregrid.iu.edu

XCat Clients: fg-gravel5.futuregrid.iu.edu, fg-gravel6.futuregrid.iu.edu

Testbed Requirements

Server Setup

Client/Node Setup

Submitting a Dynamically provisioned job

In order to submit a job that is provisioned dynamically, do:

msub -l os=centos5 hello.sh

This will request a job with node property centos5.

References

http://sourceforge.net/apps/mediawiki/xcat/index.php?title=Moab_Adaptive_Computing

APPENDIX - current documentation on how we add images (incomplete and not verified)

Available images on FG

PLEASE INCLUDE THE LIST OF IMAGES HERE WE HAVE AND WHAT THEY DO

To change system images, currently you need to be root and modify the nodetype table:

 tabedit nodelist

And then change the node group to what you intend.

PLEASE PROVIDE THE EXACT COMMAND

Then for stateful images, you run the command:

nodeset {node} boot

and for stateless images, you run the command:

 nodeset {node} netboot

then you reboot the machine.