Ansible for HPC workflows?

I’m exploring Ansible for managing HPC workflows. The use case is that a user
prepares a set of files on a central computer (e.g., a workstation or laptop),
transfers them to several HPC machines, and submits a job to each of their
queuing systems. As soon as the job starts running on any one of them, the jobs
on the other machines are canceled. When the job finishes, the files are
transferred back.

It seems that Ansible has the necessary parts to transfer files, run commands,
poll status, pull information back from the remote machines, etc. I’m wondering
whether there are already modules that handle these types of tasks. If not,
what would you recommend for putting together such a solution? I imagine a few
new plugins and modules would have to be developed. This would be really useful
for avoiding the file-syncing disaster where each of the servers has some, but
not all, of the latest data. A lot of people roll their own impromptu solutions,
but most just “live” with the pain.
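
To make the idea a bit more concrete, here is the kind of logic I have in mind,
written as a plain Python sketch rather than Ansible tasks. Everything in it is
assumed: Slurm-style sbatch/squeue/scancel commands, passwordless SSH to
hypothetical login nodes called cluster-a and cluster-b, and a job script that
has already been transferred. The file transfer and polling steps are exactly
the parts I hope Ansible modules could take over.

    import subprocess
    import time

    CLUSTERS = ["cluster-a", "cluster-b"]   # hypothetical login nodes
    JOB_SCRIPT = "~/staging/job.sh"         # assumed to be transferred already

    def ssh(host, command):
        # Run a command on a login node over SSH and return its stdout.
        result = subprocess.run(["ssh", host, command],
                                capture_output=True, text=True)
        return result.stdout.strip()

    # Submit the same job everywhere; --parsable makes sbatch print only the job ID.
    jobs = {host: ssh(host, f"sbatch --parsable {JOB_SCRIPT}") for host in CLUSTERS}

    # Poll until one of the jobs starts running (a real tool would also handle
    # jobs that fail or finish before they are ever seen in the RUNNING state).
    winner = None
    while winner is None:
        time.sleep(60)
        for host, job_id in jobs.items():
            state = ssh(host, f"squeue -h -j {job_id} -o %T")  # e.g. PENDING, RUNNING
            if state == "RUNNING":
                winner = host
                break

    # Cancel the queued copies on every other machine.
    for host, job_id in jobs.items():
        if host != winner:
            ssh(host, f"scancel {job_id}")

    print(f"job is running on {winner}; results will be pulled back when it finishes")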

I’ve used Ansible to install and configure a very basic HPC cluster, but I stopped short of managing jobs with it. I’d say you would need some custom Ansible modules to get a decent implementation of job management and interaction with the scheduler. On Windows this is fairly easy, because Ansible’s Windows modules are written in PowerShell, so a custom module can call the HPC cmdlets directly.

Thanks

Jordan

I thought so too, assuming no one else has developed HPC modules before. Good to
know about the HPC cmdlets in PowerShell; most of the machines I have access to
are Linux-based, though.

I will start playing a little more with Ansible for this purpose. Do you know of
any good libraries, Python or otherwise, for interacting with job/resource
managers? I’d like to avoid reinventing the wheel if at all possible.

Ansible probably isn’t the right tool for workflows/pipelines that stage data in and out of a cluster; you might be better off looking at something like Makeflow, Luigi, or Airflow (there’s a whole bunch of these pipeline/workflow tools). We use Luigi with a Slurm plugin to stage data into a job and execute whatever is needed. As others have pointed out, Ansible is more suited to the setup and maintenance of a cluster.
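
As a very rough illustration of the shape of this (the task names, paths, and the direct srun call below are all made up; our real pipeline goes through a Slurm plugin and is more involved):

    import subprocess
    import luigi

    # This sketch assumes it runs cluster-side, i.e. somewhere the scheduler
    # commands and the shared filesystem are directly reachable.

    class StageInput(luigi.Task):
        # Copy the input data set onto the cluster's shared filesystem (hypothetical paths).
        dataset = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget(f"/shared/staging/{self.dataset}/input.dat")

        def run(self):
            subprocess.run(
                ["rsync", "-a", f"data/{self.dataset}/input.dat", self.output().path],
                check=True,
            )

    class RunSimulation(luigi.Task):
        # Run the job once its input has been staged; srun blocks until it finishes.
        dataset = luigi.Parameter()

        def requires(self):
            return StageInput(self.dataset)

        def output(self):
            return luigi.LocalTarget(f"/shared/staging/{self.dataset}/result.dat")

        def run(self):
            subprocess.run(
                ["srun", "simulate", self.input().path, self.output().path],
                check=True,
            )

    if __name__ == "__main__":
        luigi.build([RunSimulation(dataset="test-case")], local_scheduler=True)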

The primary focus of these workflow tools usually seems to be the dependency/pipeline aspect, and the ones I tried did not seem very robust when dealing with guarded HPC systems (e.g., systems that require SSH multi-hop and don’t allow long-running processes on the login nodes). They are probably fine if we can run them directly on the login nodes, or keep a daemon running there. However, for the use case of setting up jobs on a personal workstation and submitting them to multiple HPC systems opportunistically, they would all seem to require developing some new plugins. I feel that Ansible may be a suitable framework for such plugins because it is quite reliable across various SSH scenarios and does not require a remote agent.

This is getting slightly off-topic for Ansible, but if you want to submit to multiple clusters at different sites, you are going to run into scheduling problems, different scheduler configurations, data-locality issues if the data sets are big, and probably a number of other issues. These problems are reasonably well solved in the bioinformatics and HEP communities. I’ve found that most pipeline tools work best with homogeneous systems, and their habit of failing fast and early is actually a good thing: you don’t want tasks in your pipeline to continue if you have bad data.

The biggest challenges you’ll probably come across are how to track and check job IDs, how to decide whether a job/task has run successfully, how to define where the data is and offer data-validation steps, and how to do all of this in a disconnected way, since the user’s laptop might drop off the network. If you’re doing things opportunistically, you’ll want to stage data in before the jobs are set up, and then you run into a garbage-collection problem for jobs that failed or never ran. You would probably need to come up with a scheme for this before implementing plugins for Ansible. All of this assumes you have already addressed the issue of keeping the executables at the same versions on the different clusters, since versions matter in some simulations.
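
To be clear about the sort of bookkeeping I mean: before writing any Ansible plugin you’d want something like a small local manifest that records what was submitted where, what state it was last seen in, and what was staged so it can be garbage-collected later. A throwaway sketch (location, fields, and values all invented):

    import json
    import time
    from pathlib import Path

    MANIFEST = Path.home() / ".hpc-workflow" / "jobs.json"  # hypothetical location

    def load_manifest():
        # Read the local job manifest, or start an empty one.
        if MANIFEST.exists():
            return json.loads(MANIFEST.read_text())
        return {"jobs": []}

    def record_submission(manifest, cluster, job_id, staged_paths):
        # Remember what was submitted where, so state can be re-checked after a disconnect.
        manifest["jobs"].append({
            "cluster": cluster,
            "job_id": job_id,
            "state": "SUBMITTED",          # later: RUNNING / DONE / FAILED / CANCELED
            "staged_paths": staged_paths,  # what to garbage-collect if the job never runs
            "submitted_at": time.time(),
        })

    def save_manifest(manifest):
        MANIFEST.parent.mkdir(parents=True, exist_ok=True)
        MANIFEST.write_text(json.dumps(manifest, indent=2))

    # Example: record two opportunistic submissions of the same job.
    m = load_manifest()
    record_submission(m, "cluster-a", "123456", ["/shared/staging/test-case"])
    record_submission(m, "cluster-b", "987654", ["/lustre/scratch/test-case"])
    save_manifest(m)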

JFTR, Luigi has an SSH plugin that lets you scp things in and out of a cluster.