NOTE: NYU HAS MOVED TO SINGULARITY (please refer to <link> for its setup)
Download the Python 3.6 version of Anaconda for Ubuntu from here:
wget <download link>
unzip <downloaded file> / tar -xf <downloaded file>
chmod +x <downloaded file>.sh
./<downloaded file>.sh
conda info --envs
conda create --name bert python=3.5
conda activate bert
(if something goes wrong, start fresh) Clean the environment :
conda deactivate
conda clean -i -l -t -p -s
conda build purge-all
Conda seems to have a bug: it doesn't delete the local files, which will eventually exhaust all the storage space.
conda remove --name env_name --all
Try to follow the sequence of commands exactly as given. Do not do module load anaconda; it doesn't seem to work for me.
module purge
Types of GPUs available can be found here :
srun -c4 -t5:00:00 --mem=30000 --gres=gpu:p40:1 --pty /bin/bash
To request a specific GPU node, add: --nodelist=gpu-90
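If you keep retyping the srun line with small changes, a tiny helper that assembles it can help. This is only a sketch: build_srun is a made-up name, it just prints the command (it does not run it), and the defaults are copied from the request above.

```shell
# Hypothetical helper: prints (does not run) an srun request like the one above.
# Arguments: GPU type, hours, memory in MB, optional node name.
build_srun() {
  local gpu="${1:-p40}"
  local hours="${2:-5}"
  local mem_mb="${3:-30000}"
  local node="${4:-}"
  local cmd="srun -c4 -t${hours}:00:00 --mem=${mem_mb} --gres=gpu:${gpu}:1"
  # Only add --nodelist when a specific node was asked for.
  if [ -n "$node" ]; then
    cmd="$cmd --nodelist=${node}"
  fi
  printf '%s --pty /bin/bash\n' "$cmd"
}

build_srun p40 5 30000
build_srun p40 5 30000 gpu-90
```

Copy the printed line into the terminal once it looks right.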
module load cudnn/9.0v7.0.5
module load cuda/9.0.176
module load cudnn/9.0v7.3.0.29
module load cuda/9.0.176
module load cudnn/10.0v7.6.2.24
module load cuda/10.1.105
just do module spider cudnn
or module spider cuda
to see which versions are available, and load accordingly. This must be done before installing tensorflow. If it wasn't, pip uninstall tensorflow-gpu==x.x.x and reinstall.
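Since the order matters (modules first, tensorflow second), a quick check before running pip can save a reinstall. A minimal sketch, assuming the module system exports LOADEDMODULES as a colon-separated list (Lmod and Environment Modules both do); module_loaded is a hypothetical helper name.

```shell
# Returns success if a module whose name starts with "$1/" is loaded.
# Assumption: LOADEDMODULES looks like "cudnn/9.0v7.0.5:cuda/9.0.176".
module_loaded() {
  case ":${LOADEDMODULES:-}:" in
    *":$1/"*) return 0 ;;
    *) return 1 ;;
  esac
}

for m in cuda cudnn; do
  if module_loaded "$m"; then
    echo "$m: loaded"
  else
    echo "$m: NOT loaded, run 'module load $m/<version>' before pip install"
  fi
done
```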
In case of a regex error:
module load gcc/9.1.0
conda install -c conda-forge regex
(Use pip, not conda!!) conda install also pulls in its own cuda and cudnn, which are already installed on prince, and the versions it pulls can conflict with what tensorflow requires.
pip install h5py nltk pyhocon scipy scikit-learn
pip install tensorflow-gpu==1.7.0
pip install tensorflow-gpu==1.11.0
conda env create -f req.yml
req.yml:
name: env_name
channels:
  - pytorch
dependencies:
  - python=3.6
  - pytorch
  - torchvision
  - numpy
  - scikit-learn
  - h5py
  - scipy
Create a file file_name.s
time: hrs:mins:seconds
gpu type: always use p40
mem: given in MB, so 50000 == 50 GB (the max one can request is about 100 GB; don't request this much, you don't need it :) )
nodes > 1 is not allowed (not sure why)
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --nodes=1
#SBATCH --gres=gpu:p40:1
#SBATCH --time=01:00:00
#SBATCH --mem=50000
#SBATCH --job-name=pp1953
#SBATCH --output=slurm_%j.out
#command line argument
. ~/.bashrc
module load anaconda3/5.3.1
conda activate <conda env name>
conda install -n <conda env name> nb_conda_kernels
# conda activate
cd code/Video-Person-ReID-master/current
python <your script>.py --opt=3 >> gpu-sgd.out
Run the above script as
sbatch file_name.s
squeue -u pp1953
squeue -j 4654238
scancel 4654238
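The header of the batch file is easy to get wrong (missing shebang, memory in the wrong unit), so here is a rough sketch that writes it for you. write_sbatch is a hypothetical helper; it assumes memory is passed in GB and multiplied by 1000 to get the MB value that --mem expects, matching the round numbers used in these notes.

```shell
# Hypothetical helper: writes a minimal one-GPU batch script.
# Arguments: output file, hours, memory in GB, GPU type (default p40).
write_sbatch() {
  local out="$1" hours="$2" mem_gb="$3" gpu="${4:-p40}"
  # --mem is given in MB, so 50 GB becomes 50000.
  local mem_mb=$(( mem_gb * 1000 ))
  cat > "$out" <<EOF
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --nodes=1
#SBATCH --gres=gpu:${gpu}:1
#SBATCH --time=$(printf '%02d' "$hours"):00:00
#SBATCH --mem=${mem_mb}
#SBATCH --output=slurm_%j.out

. ~/.bashrc
conda activate <conda env name>
# your commands go here
EOF
}

# Example (not executed here):
#   write_sbatch file_name.s 1 50 p40 && sbatch file_name.s
```

Edit the generated file to put in the real environment name and commands before submitting.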
In case the steps below aren't working, use this link:
conda install -n <conda env name> nb_conda_kernels
python -m ipykernel install --user --name build_central --display-name <conda env name>
Now create an sbatch file like this:
#!/bin/bash
#SBATCH --cpus-per-task=8
#SBATCH --nodes=1
#SBATCH --time=100:00:00
#SBATCH --gres=gpu:v100:1
#SBATCH --mem=100000
#SBATCH --job-name=pp1953
#SBATCH --output=outputs/slurm_%j.out
. ~/.bashrc
module load anaconda3/5.3.1
module load jupyter-kernels/py3.5
conda activate PPUU
conda install -n PPUU nb_conda_kernels
port=$(shuf -i 10000-65500 -n 1)
/usr/bin/ssh -N -f -R $port:localhost:$port log-0
/usr/bin/ssh -N -f -R $port:localhost:$port log-1
cat<<EOF
Jupyter server is running on: $(hostname)
Job starts at: $(date)
Step 1 :
ssh -L $port:localhost:$port $
ssh -L $port:localhost:$port $USER@prince
Step 2:
the URL is something: http://localhost:${port}/?token=XXXXXXXX (see your token below)
EOF
unset XDG_RUNTIME_DIR
if [ "$SLURM_JOBTMP" != "" ]; then
    export XDG_RUNTIME_DIR=$SLURM_JOBTMP
fi
jupyter notebook --no-browser --port $port --notebook-dir=$(pwd)
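Once the job starts, the notebook URL from Step 2 ends up in the Slurm output file. A minimal sketch for pulling it out, assuming the log contains a URL of the form http://localhost:<port>/?token=<token> as described above; jupyter_url is a hypothetical helper name.

```shell
# Prints the first notebook URL (with its token) found in a Slurm output file.
# Assumption: the token is made of letters and digits only.
jupyter_url() {
  grep -oE 'http://localhost:[0-9]+/\?token=[0-9A-Za-z]+' "$1" | head -n 1
}

# Example (not executed here):
#   jupyter_url outputs/slurm_<job id>.out
```

Paste the printed URL into the browser after the Step 1 tunnel is open.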
If the permissions on .ssh/config ever get messed up, run chmod 700 .ssh/config
pip install jupyter
pip install jupyter[notebook]
pip install matplotlib
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter notebook --generate-config
For normal servers: in the generated config file, uncomment the following lines (remove the #) and set them to:
c.NotebookApp.ip = ''
c.NotebookApp.allow_origin = '*'
jupyter notebook --no-browser --port=XXXX (XXXX is some port on the server)
ssh -N -f -L localhost:YYYY:localhost:XXXX user@<ip address>
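To keep XXXX and YYYY straight: YYYY is the port on your own machine, XXXX is the port the notebook listens on at the server. A small sketch that prints the forwarding command from those two numbers; forward_cmd and the example values are made up for illustration.

```shell
# Prints (does not run) the local port-forward command.
# Arguments: local port (YYYY), remote port (XXXX), user@server.
forward_cmd() {
  local local_port="$1" remote_port="$2" target="$3"
  printf 'ssh -N -f -L localhost:%s:localhost:%s %s\n' \
    "$local_port" "$remote_port" "$target"
}

forward_cmd 8889 8888 user@server
# After running the printed command, the notebook is at http://localhost:8889
```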
conda install -c conda-forge jsonnet
module spider gcc
module load gcc/9.1.0
conda install -c conda-forge regex
pip install allennlp
pip install --user <package> might be better to use than plain pip install <package>.
pip install <package> <--> conda install -c conda-forge <package>
Don't load python module!!!
sacct --format=User,JobID,partition,state,time,start,end,elapsed,nodelist -j 4821655
scontrol write batch_script 5553599
or scontrol write batch_script 5553599 -
1080ti, titanxp, titanblack, k40, k20, k20x, m2090, titanv
export PATH="/misc/vlgscratch4/LakeGroup/pathak/anaconda3/bin:$PATH"
conda activate pathak
conda install -n pathak nb_conda_kernels
- conda install reverts the effect of loading cuda/cudnn.
Use: conda uninstall tensorflow-gpu cudatoolkit cudnn
- Tensorflow was compiled with a different version of cudnn than the one currently loaded. Just load the correct/earlier version of cudnn, the one that was loaded when tensorflow-gpu was installed.
- Tensorflow cannot use the gpu: the cuda/cudnn loaded during installation doesn't match the one the tensorflow binary was built against.
pip uninstall tensorflow-gpu
or possibly delete the whole environment and follow the above procedure.
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive/My\ Drive/...
pip install gdown
gdown <id here>
Local Computer:
In ~/.ssh/config
Add the following
Host prince
User net_id
PubKeyAuthentication yes
IdentityFile /Users/local_name/.ssh/id_rsa
The last line, IdentityFile, is the private key to use.
chmod 700 /home/<net_id>/
chmod 700 ~/.ssh
chmod 600 ~/.ssh/*
Add your public key to ~/.ssh/authorized_keys
ssh prince
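The config entry and permission steps above can be scripted so they are not redone by hand on every new laptop. A rough sketch: add_ssh_host is a hypothetical helper that only writes the fields shown above. If the alias is not itself a resolvable host name you would also need a HostName line, which is left out here because the notes do not give one.

```shell
# Hypothetical helper: appends a Host block to an ssh config file unless the
# alias is already there, then restricts the file's permissions.
# Arguments: config file, host alias, user name, private key path.
add_ssh_host() {
  local cfg="$1" alias_name="$2" user="$3" key="$4"
  mkdir -p "$(dirname "$cfg")"
  touch "$cfg"
  if grep -q "^Host ${alias_name}\$" "$cfg"; then
    echo "Host ${alias_name} already present in $cfg"
    return 0
  fi
  cat >> "$cfg" <<EOF
Host ${alias_name}
    User ${user}
    PubKeyAuthentication yes
    IdentityFile ${key}
EOF
  chmod 600 "$cfg"
}

# Example (not executed here):
#   add_ssh_host ~/.ssh/config prince net_id ~/.ssh/id_rsa
```

Running it twice is harmless, since the second call sees the existing Host line and does nothing.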
ssh into the tunnel server first, or use the Cisco VPN (allows file transfer)
sshfs is very helpful if you are not a big vim fan
sshfs -p 22 ~/project
ssh -f -L -N
sshfs -p 2222 pp1953@ ~/NYU/temp/
ssh -f <user_name>@<tunnel server> -L 2222:<target server>:22 -N
sshfs -p 2222 <user_name>@ local/path/
One single command for the above shall look something like this :
alias storage='ssh -f -L -N; sshfs -p 5555 pp1953@ ~/NYU/project'
alias prince='ssh -p 5555 pp1953@'
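For reference, here is how the tunnel and the mount fit together when written out with every piece named. This is a sketch under assumptions: mount_cmds is a made-up helper that only prints the two commands, and it assumes sshfs connects to localhost on the forwarded port, which is how a -L tunnel is normally used. All host names and paths in the example call are placeholders.

```shell
# Prints (does not run) the tunnel command and the matching sshfs mount.
# Arguments: user, tunnel server, target server, local port,
#            remote directory, local mount point.
mount_cmds() {
  local user="$1" tunnel="$2" target="$3" port="$4" remote_dir="$5" mount_dir="$6"
  # Forward local $port to port 22 on the target, via the tunnel server.
  printf 'ssh -f %s@%s -L %s:%s:22 -N\n' "$user" "$tunnel" "$port" "$target"
  # Mount through the forwarded port on localhost.
  printf 'sshfs -p %s %s@localhost:%s %s\n' "$port" "$user" "$remote_dir" "$mount_dir"
}

mount_cmds net_id tunnel_server target_server 2222 /remote/dir /local/mount/point
```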
In case the mount throws an error such as:
mount_osxfuse: /Users/ppriyank/project/: Input/output error
Then do the following :
pgrep -lf sshfs
kill -9 <pid of the process corresponding to the mount /Users/ppriyank/project/>
sudo umount -f /Users/ppriyank/project/
Then proceed with the usual mount again.
The unmounting process remains the same:
umount -f ~/NYU/temp/
For god's sake, use tmux.
Do not open multiple tmux sessions on multiple login nodes; that's just bad practice. I have all my tmux windows running on log-3, so whatever login node you land on, just do ssh log-3
and then launch the tmux session
tmux new -s session_name
tmux a -t session_name
control + b -> # (switch to window #), or control + b -> ' -> # (for window numbers of 10 and above)
Turn mouse scrolling on (on Mac) with: control + b -> shift + : -> setw -g mouse on
or, on older tmux versions: setw -g mode-mouse on
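To avoid accidentally creating a second session, you can check whether the session exists first and attach to it if so. A minimal sketch: tmux_session and the DRY_RUN switch are hypothetical names. With DRY_RUN set, the helper only prints the command it would run.

```shell
# Attaches to the named tmux session if it exists, otherwise creates it.
# Set DRY_RUN to any non-empty value to print the command instead.
tmux_session() {
  local name="$1"
  if tmux has-session -t "$name" 2>/dev/null; then
    set -- tmux a -t "$name"
  else
    set -- tmux new -s "$name"
  fi
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example (not executed here):
#   tmux_session session_name
```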
from IPython import embed; embed()
pgrep -lf sshfs
kill -9 xyz
sudo umount -f /Users/ppriyank/NYU/project2
sshfs -p 22 ~/NYU/project
tmux a -t pathak
srun -c4 -t100:00:00 --mem=50000 --gres=gpu:p40:1 --pty /bin/bash
cd /scratch/pp1953/model2/codes/
module load cudnn/9.0v7.3.0.29
module load cuda/9.0.176
conda activate bert
Quick setup
sh /mnt/data/
conda create --name pathak python=3.8 ; conda activate pathak ; pip install timm torchvision torch numpy PyYAML yacs termcolor ;
vim .bash_profile
source /home/priyank/miniconda3/etc/profile.d/
conda activate pathak
alias session="tmux a -t pathak || tmux new -s pathak"
vim ~/.bashrc
source ~/.bash_profile
source ~/.bashrc