HPC Issues list

We need to detail a list of issues that people have using the Imperial HPC so that we can get them resolved. Please list/describe any issues that you have below.

You cannot run more than 50 jobs at a time

Most of the queues (even throughput!) have a limit of 50 jobs at a time (excluding array jobs). This is crazy low. At the Sanger it would be common to run >400 jobs at a time and they'd all start near enough instantly.

https://www.imperial.ac.uk/admin-services/ict/self-service/research-support/rcs/computing/job-sizing-guidance/high-throughput/
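
Array jobs are excluded from the 50-job cap, so one workaround is to bundle many similar tasks into a single array job. A minimal sketch of a PBS Pro array job follows; the resource values, index range, and process_sample.sh script are illustrative, not an Imperial-specific recipe:

#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -J 1-100

cd "$PBS_O_WORKDIR"
# Each sub-job picks its own input via the array index;
# process_sample.sh is a hypothetical per-sample script.
./process_sample.sh "sample_${PBS_ARRAY_INDEX}.txt"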

HPC handles NextFlow badly

Running the HelloWorld tutorial sometimes takes 30 minutes just to queue.

You cannot queue more than 50 jobs at a time (which is absurd).

See some related notes here: NextFlow

Normally, on the HPC you can run job arrays, which allow you to run up to 10,000 jobs at a time. Essentially these are all put into the queue at the same time, and whenever resources matching the job parameters you have specified become available, the jobs run. You can't do this with Nextflow; however, I believe this is a Nextflow limitation and not an HPC one.

NextFlow pipelines that use Java-based applications such as Picard fail with a Java SIGBUS error.
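
One partial mitigation on the Nextflow side is to cap how many jobs Nextflow itself keeps in the PBS queue. executor.queueSize and executor.submitRateLimit are standard Nextflow settings, but the executor name and the values below are only a sketch to adapt:

# Write a minimal nextflow.config next to your pipeline (values illustrative).
cat > nextflow.config <<'EOF'
process.executor = 'pbspro'           // submit tasks via PBS Pro
executor.queueSize = 49               // keep at most 49 jobs queued/running
executor.submitRateLimit = '10/1min'  // throttle submissions (optional)
EOF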

RStudio server cannot be launched with user-specified amounts of RAM

Chris Cave-Ayland (RCS team) noted:

  • "There is one upcoming project that may lay the ground work for users to deploy their own RStudio sessions within jobs but hard to say if/when that may come to fruition."
  • "Well a planned configuration change in the near future should make the 256gb doable in batch (without having to resort to the large memory queue)"
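
For reference, the "256gb doable in batch" mentioned above would presumably look something like the resource request below. This is only a sketch (core count, walltime, and analysis.R are illustrative), not a confirmed configuration:

#!/bin/bash
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=8:mem=256gb

cd "$PBS_O_WORKDIR"
# Run the memory-hungry R analysis non-interactively instead of via RStudio.
Rscript analysis.R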

RStudio server can become heavily laggy

I'd previously asked if it would be possible to launch new RStudio instances on dedicated nodes. They said they had an idea for how this could be done, but that they couldn't spend any time implementing it anytime soon.

The login nodes can be laggy

Login nodes on the HPC can have extreme lag (30-60+ seconds, sometimes even several minutes), even for basic commands that should be instantaneous (e.g. ls, cd). This is due to a combination of design flaws in the HPC and its susceptibility to being overloaded by one or more users running memory-intensive commands on the login nodes. This occurs frequently and makes the HPC highly impractical to use. Some efforts are finally underway to correct this with better login node design. The new nodes include:

ssh <USERNAME>@login-a.hpc.ic.ac.uk
ssh <USERNAME>@login-b.hpc.ic.ac.uk
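
If you switch between login nodes a lot, a couple of SSH aliases can save typing. The Host names below are made up; replace <USERNAME> with your own:

# Append hypothetical aliases for the new login nodes to your SSH config.
cat >> ~/.ssh/config <<'EOF'
Host hpc-login-a
    HostName login-a.hpc.ic.ac.uk
    User <USERNAME>
Host hpc-login-b
    HostName login-b.hpc.ic.ac.uk
    User <USERNAME>
EOF

Then "ssh hpc-login-a" connects to the new node directly.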

It generally takes a long time to queue for jobs. In theory, this could be overcome by paying for priority. However, even with express queues the waiting time can be quite long, at times holding for 3 days (at least for large-mem jobs).

Large-mem jobs (with CPUs in multiples of 10 and memory in multiples of 120 GB), which are sent to ax4, frequently fail due to exceeding the CPU quota.
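
As a concrete illustration of that sizing rule (the numbers are just one example of the "CPUs in multiples of 10, memory in multiples of 120 GB" pattern, not a recommendation):

#PBS -l walltime=72:00:00
#PBS -l select=1:ncpus=20:mem=240gb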

A recent meeting with Santiago revealed that ax4 is going to be retired very soon and that some newer machines will take over large-mem jobs, but I have my doubts.

Large memory queue for med-bio is not accepting jobs

The med-bio queue for large memory jobs is no longer accepting submissions (15/09/21). The specs of the run I was trying to submit were:

#PBS -l walltime=186:00:00
#PBS -l select=1:ncpus=950:mem=11400gb -q med-bio

This is necessary to rerun an scRNA-Seq differential analysis job which needs to be run on multiple cores. The workaround is to run it on a non-med-bio queue, but this has the downsides of a maximum runtime drop from 186 hours to 72 hours and a dramatically longer wait time (~1 week versus <1 day).

This issue seems to be related to the recent changeover of nodes on the HPC, but after contacting someone from the HPC team, they stated:

The pqmedbio-large queue should not be being affected, as its underlying hardware is not part of the refresh this time. However it does seem that that queue is no longer accepting jobs. It's possible something's got messed up in the queue definitions. I'll take a look.

I have yet to hear anything back.

UPDATE - the pqmedbio-large queue appears to be back up now (22/09/21), although I wasn't informed that this was the case and haven't been given a reason as to why it went down in the first place.

Useful Info

Santiago from ICT had the following recommendations during a recent (Q2 2021) slow HPC day:

There is an ongoing issue with the filesystem. Comms were sent regarding this on 27/04/2021 (see attached).

Unfortunately there is no quick or easy fix. 

The best approach would be to try and connect to another login-node, they are:

login-7.hpc.ic.ac.uk
login-6.hpc.ic.ac.uk
login-5.hpc.ic.ac.uk
login-4.hpc.ic.ac.uk

Lastly, can you check you are not doing any of the following, as these can cause slowness even if there is no underlying technical issue.

* A large number (>10,000) of files in a single directory. Any listing operations or wildcard operations (such as grep *) will be significantly slowed by many files in a directory.
* In programming, do not open a file, write a small amount of data, and close it many times a second. Always write to files in large chunks if at all possible.
* Using small files. Any file < 8M will not leverage parallel access to disks and as such will only go at the speed of a single disk.
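
For the "many small files" and "small files" points above, one common mitigation (a sketch, not official RCS guidance) is to bundle small files into a single archive on the shared filesystem and unpack them to node-local scratch inside the job, assuming $TMPDIR points at node-local storage as is typical under PBS:

# On the login node: bundle thousands of small files into one archive (paths hypothetical).
tar -czf samples.tar.gz samples/

# Inside the job script: unpack to node-local space, work there,
# then copy results back to the shared filesystem in one large write.
cd "$TMPDIR"
tar -xzf "$PBS_O_WORKDIR/samples.tar.gz"
./run_analysis.sh samples/    # hypothetical analysis script
tar -czf results.tar.gz results/
cp results.tar.gz "$PBS_O_WORKDIR/"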

Some background to the HPC issues

From an e-mail sent by Simon Clifford on 27/04/2021

Hello all,

This email has two parts.  The first is about the ongoing issues with
the Research Data Service (RDS), the second is about our email
communications.

The RDS is what provides around 11 petabytes of data storage space to
the HPC cluster's nodes and logins.  It is also accessible as a shared
drive.  For over a year now it has been suffering an assortment of
problems which I will now try to explain.

In 2019 we implemented a new parallel storage filesystem, called GPFS,
designed to scale to our workloads.  It immediately exposed problems
with some of the infrastructure switches of the cx2 nodes, which caused
system instability.  These switches should have worked, but didn't.  It
is not possible to replace, repair or work around them.

As a temporary measure a simpler filesystem, NFS, was put in place
across the entire cluster.  NFS does not perform well at the scale of
our systems, but it is in some ways less sensitive to network
instability.  Most of the issues currently being experienced on the
cluster now are due to NFS struggling to keep up.

The best solution is to work towards enabling GPFS on the new cx3 nodes
leaving older nodes on NFS.  The reduced load on NFS should make its
lack of scalability irrelevant.  Work has been ongoing to implement
this; a major blocker is that the new infrastructure has been put into
our network using only IPv6 addresses and this causes conflicts with
our existing mixed IPv4 / IPv6 kit.  A "big bang" implementation of the
network redesign would necessitate an outage of several weeks so we
have instead been using an incremental strategy in order to keep the
system available as much as possible.  This has introduced some
unforeseen issues which you will have noticed over the last few weeks.
ICT Networks, IBM, and others are assisting in resolving these.  We
anticipate the migration of cx3 to GPFS should be complete before the
end of May.

Please be reassured that none of these issues will affect the safety of
your data.  And while they are very frustrating, for most running jobs
the 'hangs' are not terminal -- the job will just wait until the
storage is available again.

Regarding the RCS's communications: it has become apparent, through
some feedback that we are not communicating enough to our users.  We
apologise for this.  We are still quite understaffed, and when a crisis
appears our instincts are to fix it as soon as possible.  Spending time
notifying our users beyond a brief line on the status page
(https://api.rcs.imperial.ac.uk/service-status) is perceived as time
not spent addressing the problem.  However, this mailing list is almost
unused, apart from reminders of service outages.  We intend to use it
more on matters that will still be strictly relevant to the cluster and
RDS, addressing service issues and software installs as well as
maintenance.  We will be guided by your feedback on this.