Update documentation
npcarter committed Aug 20, 2024
1 parent 5c25ffd commit 2ee3a77
Showing 5 changed files with 237 additions and 90 deletions.
1 change: 1 addition & 0 deletions index.html
@@ -96,6 +96,7 @@ <h1 class="websitetitle"><a href="https://eddyrivaslab.github.io/">Eddy and Riva
<h2>Available HOWTOs</h2>
<ul>
<li><a href="https://eddyrivaslab.github.io/pages/cluster-computing-in-the-eddy-and-rivas-labs.html">Eddy and Rivas Lab Cluster Resources and how to Access Them</a></li>
<li><a href="https://eddyrivaslab.github.io/pages/leaving-the-lab.html">Leaving the Lab</a></li>
<li><a href="https://eddyrivaslab.github.io/pages/modifying-this-website.html">Modifying This Website</a></li>
<li><a href="https://eddyrivaslab.github.io/pages/my-jobs-arent-running.html">My Jobs Aren't Running</a></li>
<li><a href="https://eddyrivaslab.github.io/pages/running-jobs-on-our-cluster.html">Running Jobs on Our RC Machines</a></li>
143 changes: 84 additions & 59 deletions pages/cluster-computing-in-the-eddy-and-rivas-labs.html
@@ -98,7 +98,7 @@ <h2>Overview</h2>
When you log in, that's where you'll land. You have 100GB of space
here. </p>
<p>Our <em>lab storage</em> is <code>/n/eddy_lab/</code>. We have 400TB of what RC calls
Tier 1 storage. </p>
Tier 1 storage, which is fast but expensive. </p>
<p>Both your home directory and our lab storage are backed up nightly to
what RC calls <em>snapshots</em>, and periodically to what RC calls <em>disaster
recovery</em> (DR) backups.</p>
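<p>If you want a rough check on how much space you're using, ordinary
commands work from any login node (a quick sketch; RC may also provide
its own quota-reporting tools):</p>
<div class="highlight"><pre><span></span><code># size of your home directory (can be slow on large trees)
du -sh ~

# overall usage of the lab's Tier 1 filesystem
df -h /n/eddy_lab
</code></pre></div>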
@@ -109,44 +109,17 @@ <h2>Overview</h2>
machine using <code>samba</code>. (Warning: a samba mount is slow, and may
sometimes be flaky; don't rely on it except for lightweight tasks.)
Instructions are below.</p>
<p>RC also provides <em>shared scratch storage</em> for us in
<code>/n/holyscratch01/eddy_lab</code>. You have write access here, so at any
time you can create your own temp directory(s). Best practice is to
use a directory of your own, in
<code>/n/holyscratch01/eddy_lab/Users/&lt;username&gt;</code>. We have a 50TB
allocation. This space can't be remote mounted, isn't backed up, and
is automatically deleted after 90 days.</p>
<p>RC also provides <em>shared scratch storage</em>, which is very fast but not backed up. Files on the scratch storage that are older than 90 days are automatically deleted, and RC strongly frowns on playing tricks to make files look younger than they are. Because RC occasionally moves the scratch storage to different devices, the easiest way to access it is through the <code>$SCRATCH</code> environment variable, which is defined on all RC machines. Our lab has an <code>eddy_lab</code> directory on the scratch space with a 50TB quota, which contains a <code>Users</code> directory, so <code>$SCRATCH/eddy_lab/Users/&lt;yourusername&gt;</code> will point to your directory on the scratch space.<span class="marginnote">The Users directory was pre-populated with space for a set of usernames at some point in the past. If your username wasn't included, you'll have to email RC to get a directory created for you.</span></p>
<p>The scratch space is intended for temporary data, so it's a great place to put input or output files from jobs, particularly if you intend to post-process your outputs to extract a smaller amount of data from them.</p>
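<p>For example, a minimal sketch of staging a job's working files there
(it assumes your <code>Users</code> directory already exists, per the margin note
above; the directory and file names are just placeholders):</p>
<div class="highlight"><pre><span></span><code># a convenience variable pointing at your personal scratch directory
MYSCRATCH=$SCRATCH/eddy_lab/Users/$USER

# stage a temporary working directory for one run
mkdir -p $MYSCRATCH/run1
cp big_input.fa $MYSCRATCH/run1/
</code></pre></div>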
<p>You can read
<a href="https://docs.rc.fas.harvard.edu/kb/cluster-storage/">more documentation on how RC storage works</a>.</p>
<p>We have three compute partitions dedicated to our lab (the <code>-p</code>, for
partition, will make sense when you learn how to launch compute jobs
with the <code>slurm</code> scheduler):</p>
<ul>
<li>
<p><strong>-p eddy:</strong> 640 cores, 16 nodes (40 cores/node). We use this partition for most of
our computing.</p>
</li>
<li>
<p><strong>-p eddy_gpu:</strong>
4 GPU nodes [holyb0909,holyb0910,holygpu2c0923,holygpu2c1121].
Each holyb node has 4 <a href="https://www.nvidia.com/en-us/data-center/v100/">NVIDIA Tesla V100 NVLINK GPUs</a>
with 32G VRAM, 2 16-core Xeon CPUs, and 192G RAM [installed 2018].
Each holygpu2c node has 8 <a href="https://www.nvidia.com/en-us/data-center/a40/">NVIDIA Ampere A40 GPUs</a>
with 48G VRAM, 2 24-core Xeon CPUs, and 768G RAM [installed 2022].</p>
</li>
</ul>
<p>We are awaiting one more GPU node with 4 <a href="https://www.nvidia.com/en-us/data-center/hgx/">NVIDIA HGX A100 GPUs</a>
with 80G VRAM, 2 24-core AMD CPUs, and 1024G RAM [shipping expected Nov 2022].</p>
<p>We use this partition for GPU-enabled machine learning stuff, TensorFlow and the like.</p>
<ul>
<li><strong>-p eddy_hmmer:</strong> 576 cores in 16 nodes. These are older cores
(circa 2016). We use this partition for long-running or large jobs, to
keep them from getting in people's way on <code>-p eddy</code>.</li>
</ul>
<p>We are awaiting installation of another 1536 CPU cores (in 24 nodes,
64 cores/node) [expected fall 2022].</p>
<p>All of our lab's computing equipment is in the eddy partition, which has 1,872 cores. Most of our machines have 8GB of RAM per core. In addition, we have three GPU-equipped machines that are part of the partition: holygpu2c0923, holygpu2c1121, and holygpu7c0920.<span class="marginnote">The "holy" at the beginning of our machine names refers to their location in the Holyoke data center.</span></p>
<p>Each holygpu2c node has 8 <a href="https://www.nvidia.com/en-us/data-center/a40/">NVIDIA Ampere A40 GPUs</a>
with 48G VRAM [installed 2022]. </p>
<p>The holygpu7 node has 4 <a href="https://www.nvidia.com/en-us/data-center/hgx/">NVIDIA HGX A100 GPUs</a>
with 80G VRAM [installed 2023]. </p>
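<p>To land on one of the GPU nodes you ask slurm for a GPU with
<code>--gres</code>; a minimal sketch for an interactive session (the partition
name follows the description above, and the GPU count is just an
example):</p>
<div class="highlight"><pre><span></span><code> srun -p eddy --gres=gpu:1 --pty bash
</code></pre></div>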
<p>We can also use Harvard-wide shared partitions on the RC cluster. <code>-p
shared</code> is 17,952 cores (in 375 nodes), for example. RC has
shared</code> is 19,104 cores (in 399 nodes), for example (as of Jan 2023). RC has
<a href="https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions">much more documentation on available partitions</a>.</p>
<h2>Accessing the cluster</h2>
<h3>logging on, first time</h3>
@@ -197,20 +170,20 @@ <h3>configuring an ssh host alias</h3>

<p>You still have to authenticate by password and OpenAuth code, though.</p>
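<p>(The full example for this section isn't shown in this diff hunk; a
minimal sketch of such an alias entry, with a placeholder username,
looks like this:)</p>
<div class="highlight"><pre><span></span><code>Host ody
    User &lt;yourusername&gt;
    HostName login.rc.fas.harvard.edu
</code></pre></div>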
<h3>configuring single sign-on scp access</h3>
<p>It can get tedious to have to authenticate every time you <code>ssh</code> to RC,
especially if you're using ssh-based tools like <code>scp</code> to copy
individual files back and forth. You can streamline this using
<p>Even better, but a little more complicated: you can make it so you
only have to authenticate once, and every ssh or scp after that is
passwordless. To do this, I use
<a href="https://docs.rc.fas.harvard.edu/kb/using-ssh-controlmaster-for-single-sign-on/">SSH ControlMaster for single sign-on</a>,
to open a single <code>ssh</code> connection that you authenticate once, and all
subsequent <code>ssh</code>-based traffic to RC goes via that connection.</p>
<p>RC's
<a href="https://docs.rc.fas.harvard.edu/kb/using-ssh-controlmaster-for-single-sign-on/">instructions are here</a>
but briefly:</p>
<ul>
<li>Add another hostname alias to your <code>.ssh/config</code> file. Mine is
called <strong>odx</strong>:</li>
<li>Replace the above hostname alias in your <code>.ssh/config</code> file with
something like this:</li>
</ul>
<div class="highlight"><pre><span></span><code>Host odx
<div class="highlight"><pre><span></span><code>Host ody
User seddy
HostName login.rc.fas.harvard.edu
ControlMaster auto
@@ -221,19 +194,19 @@
<ul>
<li>Add some aliases to your <code>.bashrc</code> file:</li>
</ul>
<div class="highlight"><pre><span></span><code> <span class="nb">alias</span> odx-start<span class="o">=</span><span class="s1">&#39;ssh -Y -o ServerAliveInterval=30 -fN odx&#39;</span>
<span class="nb">alias</span> odx-stop<span class="o">=</span><span class="s1">&#39;ssh -O stop odx&#39;</span>
<span class="nb">alias</span> odx-kill<span class="o">=</span><span class="s1">&#39;ssh -O exit odx&#39;</span>
<div class="highlight"><pre><span></span><code> <span class="nb">alias</span> ody-start<span class="o">=</span><span class="s1">&#39;ssh -Y -o ServerAliveInterval=30 -fN ody&#39;</span>
<span class="nb">alias</span> ody-stop<span class="o">=</span><span class="s1">&#39;ssh -O stop ody&#39;</span>
<span class="nb">alias</span> ody-kill<span class="o">=</span><span class="s1">&#39;ssh -O exit ody&#39;</span>
</code></pre></div>

<p>Now you can launch a session with:</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="c">% odx-start</span><span class="w"></span>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="c">% ody-start</span><span class="w"></span>
</code></pre></div>

<p>It'll ask you to authenticate. After you do this, all your ssh-based
commands (in any terminal window) will work without further
authentication. To stop the connection, do</p>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="c">% odx-stop</span><span class="w"></span>
<div class="highlight"><pre><span></span><code><span class="w"> </span><span class="c">% ody-stop</span><span class="w"></span>
</code></pre></div>
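<p>If you're not sure whether the master connection is still up, ssh's
control commands can tell you; for example:</p>
<div class="highlight"><pre><span></span><code>    % ssh -O check ody
</code></pre></div>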

<p>If you forget to stop it, no big deal, the connection will eventually
@@ -395,17 +368,7 @@ <h3>writing an sbatch script</h3>
format. An example that (stupidly) loads gcc and just calls
<code>hostname</code>, so the output will be the name of the compute node the
script ran on:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="ch">#!/bin/bash</span>
<div class="highlight"><pre><span></span><code><span class="ch">#!/bin/bash</span>
<span class="c1">#SBATCH -c 1 # Number of cores/threads</span>
<span class="c1">#SBATCH -N 1 # Ensure that all cores are on one machine</span>
<span class="c1">#SBATCH -t 6-00:00 # Runtime in D-HH:MM</span>
@@ -416,7 +379,7 @@ <h3>writing an sbatch script</h3>

module load gcc
hostname
</code></pre></div>

<p>Save this to a file (<code>foo.sh</code> for example) and submit it with <code>sbatch</code>:</p>
<div class="highlight"><pre><span></span><code> sbatch foo.sh
@@ -468,6 +431,68 @@ <h3>etiquette</h3>
<p>You can also add <code>--nice 1000</code> to your <code>sbatch</code> command, to downgrade
your running priority in the queue, which helps let other people's
jobs get run before yours.</p>
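<p>For example, a sketch using the <code>foo.sh</code> script from above (the
<code>--nice=1000</code> form avoids any ambiguity about the optional argument):</p>
<div class="highlight"><pre><span></span><code> sbatch --nice=1000 foo.sh
</code></pre></div>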
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";

if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}

var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';

var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";

(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>

</article>
<footer>Powered by <a href="https://getpelican.com/">Pelican</a>. Site theme is a modified version of <a href="https://github.com/andrewheiss/ath-tufte-pelican">ath-tufte-pelican</a>.</footer>