Introduction to SMU SuperPOD
Overview
Teaching: 10 min
Exercises: 0 min
Questions
Objectives
Onboarding to SMU SuperPOD
Introduction
- The SMU SuperPOD is a high-performance computing (HPC) cluster, built by NVIDIA, specifically tailored to meet the demands of cutting-edge research.
- This shared resource consists of 20 NVIDIA DGX A100 nodes, each with 8 advanced and powerful graphics processing units (GPUs) to accelerate calculations and train AI models.
- The SMU Office of Information Technology (OIT) and the Center for Research Computing (CRC) jointly manage and provide both access and support for this top-of-the-line machine.
NVIDIA DGX SuperPOD Advantage Specifications
| Specification | Values |
|---|---|
| Computational Ability | 1,644 TFLOPS |
| Number of Nodes | 20 |
| CPU Cores | 2,560 |
| Total Memory | 52.5 TB |
| Node Interconnect Bandwidth | 200 Gb/s InfiniBand connections per node |
| Work Storage | 768 TB (shared) |
| Scratch Storage | 750 TB (raw) |
| Archival Storage | N/A |
| Operating System | Ubuntu 20.04 |
Specification for each compute node:

| Specification | Values |
|---|---|
| CPU cores | 128 |
| GPUs | 8 |
| Memory | 1,910 GB |
| Time Limit | 2 days |
| Home Storage | 200 GB (independent from M3) |
| Scratch Storage | Unlimited (independent from M3) |
| Work Storage | 8 TB (shared with M3) |
Command to check the configuration of all nodes:
$ sinfo --Format="PartitionName,Nodes:10,CPUs:8,Memory:12,Time:15,Features:18,Gres:14"
Storage
Note that:
- SuperPOD’s home and scratch directories are different from M3’s.
- However, both SuperPOD and M3 share the same $WORK storage.
| Variable | Path | Quota | Usage |
|---|---|---|---|
| ${HOME} | /users/${USER} | 200 GB | Home directory, backed up |
| ${WORK} | /work/users/${USER} | 8 TB | Long term storage |
| ${SCRATCH} | /scratch/users/${USER} | None | Temporary scratch space |
| ${JOB_SCRATCH} | /scratch/_tmp/${USER:0:1}/${USER}/${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID} | None | Per job scratch space; ${SLURM_ARRAY_TASK_ID} is zero for standard jobs |
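As a hypothetical sketch of how the per-job scratch space can be used inside a job script (the program and file names below are placeholders, not part of the SuperPOD documentation):
# Hypothetical sketch: stage temporary files in $JOB_SCRATCH during a job
cp $WORK/input.dat $JOB_SCRATCH/        # input.dat is a placeholder input file
cd $JOB_SCRATCH
./my_program input.dat > results.out    # my_program is a placeholder executable
cp results.out $WORK/                   # copy only the results back to long-term storage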
Command to check the available quota of your work storage:
$ lfs quota -h -u $USERNAME /work
Login to SuperPOD
- Make sure you have a SuperPOD account created for you. You can ask your supervisor to request an account by submitting this form.
- There are several ways to log in to SuperPOD, via its two login nodes (you must be on the VPN):
$ ssh username@superpod.smu.edu
$ ssh username@slogin-01.superpod.smu.edu
$ ssh username@slogin-02.superpod.smu.edu
SuperPOD uses the same module system as M3, so nearly all commands are similar.
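For reference, the day-to-day module commands work exactly as they do on M3 (a short sketch; gcc is used here only because gcc/11.2.0 appears in the default module list shown later):
$ module list        # show currently loaded modules
$ module load gcc    # load a module
$ module unload gcc  # unload it again
$ module purge       # unload all loaded modules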
Requesting a compute node
SuperPOD uses SLURM as its scheduler, so requesting an interactive node is no different from M3.
For example, to request a node with 1 GPU, 10 CPUs, and 128 GB of memory for 12 hours:
$ srun -N 1 -G 1 -c 10 --mem=128G --time=12:00:00 --pty $SHELL
$ srun -N 1 -G 1 -c 10 --mem=128G --time=12:00:00 --pty bash
For this on-campus workshop, a workshop queue is available (use the flag -p workshop) to speed up the process of requesting resources:
$ srun -N 1 -G 1 -c 10 --mem=64G -p workshop --time=12:00:00 --pty $SHELL
Transferring data
- Transferring data to and from SuperPOD is no different if you are familiar with M3; you can use scp for regular transfers:
scp /link/fileA username@superpod.smu.edu:/users/username
or use WinSCP on a Windows machine if you don't want to use the CLI.
- Tip: since SuperPOD and M3 share the same $WORK storage, you can use this shared storage for both systems, as sketched below.
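As a hedged sketch (paths and usernames are placeholders), a whole folder can be copied recursively with scp, or synchronized with rsync so that interrupted transfers can resume:
$ scp -r ./my_dataset username@superpod.smu.edu:/work/users/username/
$ rsync -avP ./my_dataset/ username@superpod.smu.edu:/work/users/username/my_dataset/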
Working with module
By default, very few modules are available when using module avail:
$ module avail
------------------------------------------------------------------------- /hpc/mp/module_files/compilers -------------------------------------------------------------------------
amd/aocc/4.1.0 gcc/11.2.0 intel/oneapi/2023.2 nvidia/nvhpc/23.7
--------------------------------------------------------------------------- /hpc/mp/module_files/apps ----------------------------------------------------------------------------
amber/22 apptainer/1.1.9 conda gaussian/g16c02 julia/1.9.2 lammps/may22 spack
Similar to M3, SuperPOD also uses Spack as its module manager. Therefore, you can find all the modules you need after loading spack:
$ module load spack
$ module avail
------------------------------------------------------------------ /hpc/mp/spack_modules/linux-ubuntu22.04-zen2 ------------------------------------------------------------------
aocc-4.1.0/aocl-sparse/4.0-t2kjb3u gcc-11.2.0/aocl-sparse/4.0-zczy7ug gcc-11.2.0/lz4/1.9.4-gtzsc3c
aocc-4.1.0/autoconf-archive/2023.02.20-inwkm6b gcc-11.2.0/autoconf-archive/2023.02.20-r5lazua gcc-11.2.0/lzo/2.10-x6itbky
aocc-4.1.0/autoconf/2.69-x53b2ii gcc-11.2.0/autoconf/2.69-xlmuzvq gcc-11.2.0/m4/1.4.19-sv4d5ah
aocc-4.1.0/automake/1.16.5-hfcjabg gcc-11.2.0/automake/1.16.5-nsy2ron gcc-11.2.0/mbedtls/2.28.2-xvf3rc3
aocc-4.1.0/berkeley-db/18.1.40-5po7n7c gcc-11.2.0/berkeley-db/18.1.40-hlnjdqn gcc-11.2.0/mbedtls/2.28.2-42lnomn (D)
aocc-4.1.0/binutils/2.40-eivqxcw gcc-11.2.0/binutils/2.40-u6hr2wz gcc-11.2.0/meson/1.1.0-teqdfz5
aocc-4.1.0/bzip2/1.0.8-5ag7qmi gcc-11.2.0/bison/3.8.2-tifozqf gcc-11.2.0/metis/5.1.0-coza6f3
aocc-4.1.0/cmake/3.26.3-p6v5a7t gcc-11.2.0/boost/1.82.0-xpmd3v6 gcc-11.2.0/mpfr/4.2.0-meodww2
aocc-4.1.0/diffutils/3.9-bzq7rzo gcc-11.2.0/bzip2/1.0.8-qaxdt7f gcc-11.2.0/msgpack-c/3.1.1-d624eki
aocc-4.1.0/expat/2.5.0-kav5ad4 gcc-11.2.0/cmake/3.26.3-r23mmbq gcc-11.2.0/nasm/2.15.05-mdqravc
aocc-4.1.0/gdbm/1.23-6r6asdl gcc-11.2.0/cmake/3.26.3-utseokk (D) gcc-11.2.0/ncurses/6.4-rfw5ur5
aocc-4.1.0/gettext/0.21.1-dmnukqt gcc-11.2.0/curl/8.0.1-cp7iioq gcc-11.2.0/neovim/0.8.3-mdppjp3
....
Note: Press “q” to quit the module listing.
As we are still in the installation process, if you do not see the modules you need, please inform us so we can install them for you.
Key Points
SuperPOD 101
Working with Conda Environment
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to create a personal conda environment on SuperPOD?
Objectives
Create Conda environments for AI & ML applications
2. Conda Environment
- Besides the Spack module manager installed on SuperPOD, you can also use Conda as your own package manager.
- In many cases, you will want a Conda environment for AI & ML applications, just like you do on M3.
- First things first, load the installed conda module:
$ module load conda
$ conda env list
# conda environments:
#
base /hpc/mp/apps/conda
Create a conda environment for Tensorflow with GPU support
Next, let’s create a conda environment for Tensorflow 2.9. Here are the steps:
(1) Request a compute node with 1 GPU
$ srun -N1 -G1 -c10 --mem=64G --time=12:00:00 --pty $SHELL
(2) Load the cuda and cudnn modules for GPU support
$ module load conda gcc
$ module load cuda
$ module load cudnn
(3) Create the Tensorflow environment with your preferred version of Python
$ conda create --prefix ~/tensorflow_2.9 python=3.8 pip -y
The conda environment named tensorflow_2.9 is created in your home directory.
(4) Activate the conda environment and install Tensorflow 2.9.1 (or your preferred TF version)
$ source activate ~/tensorflow_2.9/
$ pip install tensorflow==2.9.1
Install ipykernel and create the kernel for notebooks
$ pip install ipykernel
$ python3 -m ipykernel install --user --name tensorflow_2.9 --display-name TensorflowGPU29
(5) Once the installation is done, check whether the conda environment can use the GPU
$ python
>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Usage of the conda environment manager is no different from running on M3.
Create conda environment for Pytorch with GPUs support
Similar to Tensorflow, one can create a conda environment for Pytorch with GPU support.
Following are the brief steps (3) to (5) to create the environment and install Pytorch, after requesting a node and loading the libraries:
$ conda create --prefix ~/pytorch_1.13 python=3.8 pip -y
$ source activate ~/pytorch_1.13
$ conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia -y
$ python
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
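If you also want to use this environment in Jupyter, you can register a kernel for it, mirroring step (4) of the Tensorflow environment above (the kernel name and display name below are just examples):
$ pip install ipykernel
$ python3 -m ipykernel install --user --name pytorch_1.13 --display-name PytorchGPU113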
Key Points
Conda environment
Using NGC Container in SuperPOD
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to use NGC Container in SuperPOD?
Objectives
Learn how to master NGC Container usage in SuperPOD
3. Using NVIDIA NGC Container in SuperPOD
What is a Container?
- Containers have demonstrated their efficiency for application deployment in HPC.
- Containers can encapsulate complex programs with their dependencies in isolated environments, making applications more portable.
- A container is a portable unit of software that combines the application and all its dependencies into a single package that is agnostic to the underlying host OS.
- They thereby remove the need to build complex environments and simplify the path from application development to deployment.
Docker Container
- Docker is the most popular container system at this time
- It allows applications to be deployed inside a container on Linux systems.
NVIDIA NGC Container
- NGC stands for NVIDIA GPU Cloud.
- NGC provides a complete catalog of GPU-accelerated containers that can be deployed and maintained for artificial intelligence applications.
- It enables users to run their projects on a reliable and efficient platform that respects confidentiality, reversibility and transparency.
- NVIDIA NGC containers and their comprehensive catalog are an amazing suite of prebuilt software stacks (using the Docker backend) that simplifies the use of complex deep learning and HPC libraries that must leverage some sort of GPU-accelerated computing infrastructure.
- The complete NGC catalog can be found here, where you can find containers for Tensorflow, Pytorch, NeMo, Merlin, TAO, etc.
ENROOT
It is very convenient to download Docker and NGC containers to SuperPOD. Here I would like to introduce a very effective tool named enroot:
- A simple, yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.
- This approach is generally preferred in high-performance or virtualized environments where portability and reproducibility are important, but extra isolation is not warranted.
Importing a Docker container to SuperPOD from Docker Hub
- The following commands import the ubuntu Docker container from https://hub.docker.com/_/ubuntu
- They then create the squash file named ubuntu.sqsh in the same location
- Finally, they start the ubuntu container
$ enroot import docker://ubuntu
$ enroot create ubuntu.sqsh
$ enroot start ubuntu
#Type ls to see the content of container:
# ls
bin dev home lib32 libx32 mnt proc run srv tmp usr
boot etc lib lib64 media opt root sbin sys users var
- Type exit to quit container environment
Exercise
Go to Docker Hub, search for any container, for example lolcow, then use enroot to construct that container environment:
enroot import docker://godlovedc/lolcow
enroot create godlovedc+lolcow.sqsh
enroot start godlovedc+lolcow
Download Tensorflow container
- Now, let’s download the Tensorflow container from NGC. Browsing the NGC Catalog and searching for Tensorflow gives the link: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow
- Copy the image path from the website. The following path was copied to the clipboard when selecting the 22.12-tf2 version:
nvcr.io/nvidia/tensorflow:22.12-tf2-py3
- I will download version 22.12-tf2 to my work location using enroot; pay attention to the syntax difference when pasting (enroot uses # to separate the registry host from the image path):
$ cd $WORK/sqsh
$ enroot import docker://nvcr.io#nvidia/tensorflow:22.12-tf2-py3
The squash file nvidia+tensorflow+22.12-tf2-py3.sqsh is created.
- Next, create the container filesystem from the squash file:
$ enroot create nvidia+tensorflow+22.12-tf2-py3.sqsh
Working with NGC containers in interactive mode:
Once the container has been imported and created in your folder on SuperPOD, you can simply activate it from the login node when requesting a compute node:
$ srun -N1 -G1 -c10 --mem=64G --time=12:00:00 --container-image $WORK/sqsh/nvidia+tensorflow+22.12-tf2-py3.sqsh --container-mounts=$WORK --pty $SHELL
- Once loaded, you are placed into /workspace, which is the container's local storage. You can navigate to your $HOME or $WORK folder freely.
- Note that in this example I mounted only the $WORK location into the container, but you can always mount your own working directory, as sketched below.
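For example (a hedged sketch; /path/to/your/project is a placeholder, and this assumes the SLURM container plugin's SRC:DST mount syntax), you can mount an additional directory into the container alongside $WORK:
$ srun -N1 -G1 -c10 --mem=64G --time=12:00:00 --container-image $WORK/sqsh/nvidia+tensorflow+22.12-tf2-py3.sqsh --container-mounts=$WORK,/path/to/your/project:/workspace/project --pty $SHELL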
Check that the GPU is enabled:
$ python
>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Exit the container using the exit command.
Working with NGC container in Batch mode
- Similar to M3, a container can be loaded and executed in batch mode.
- Following is sample content for a batch file named spod_testing.sh, which runs a Python file testing.py:
#!/bin/bash
#SBATCH -J Testing # job name to display in squeue
#SBATCH -o output-%j.txt # standard output file
#SBATCH -e error-%j.txt # standard error file
#SBATCH -p batch -c 12 --mem=20G --gres=gpu:1 # requested partition, CPUs, memory, and GPU
#SBATCH -t 1440 # maximum runtime in minutes
#SBATCH -D /link-to-your-folder/
srun --container-image=/work/users/tuev/sqsh/nvidia+tensorflow+22.12-tf2-py3.sqsh --container-mounts=$WORK python testing.py
- Content of testing.py
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
Working with NGC container in Jupyter Lab
- It is a little bit different if you want to use an NGC container in Jupyter Lab.
- After requesting a node running with your container, you need to run the jupyter command with the additional flag --allow-root:
root@bcm-dgxa100-0001:/workspace# jupyter lab --allow-root --no-browser --ip=0.0.0.0
The following URL appears with its token:
Or copy and paste this URL:
http://hostname:8888/?token=fd6495a28350afe11f0d0489755bc3cfd18f8893718555d2
Note that you must replace hostname with the corresponding node that you are on, in this case bcm-dgxa100-0001.
Therefore, change the above address to the following and paste it into Firefox:
http://bcm-dgxa100-0001:8888/?token=fd6495a28350afe11f0d0489755bc3cfd18f8893718555d2
Note: you should select the default Python 3 (ipykernel) instead of any other kernels for running the container.
Tip: Once forwarded to Jupyter Lab, you are placed in the container’s root. It’s recommended to create a symlink to your folder so you can navigate away:
$ ln -s $WORK work
Key Points
NGC Container
Using Jupyter Lab in SuperPOD
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to use Jupyter Lab in SuperPOD?
Objectives
Learn the port-forwarding technique to enable Jupyter Lab
4. Jupyter Lab on SuperPOD
- There is NO display configuration or Open OnDemand setup on SuperPOD, so it is not quite straightforward to use Jupyter Lab.
- However, it is still possible to use port forwarding on SuperPOD in order to run Jupyter Lab.
- Please download and use VSCode on any OS (Windows/macOS/Linux). From the VSCode terminal, ssh to SuperPOD with a specific port, for example port 8000:
$ ssh -C -D 8000 username@superpod.smu.edu
The -C flag enables compression and -D sets up dynamic port forwarding (SOCKS4/5) on port 8000. Feel free to change the port, and remember to configure it in your browser.
4.1 Set up the browser to enable proxy viewing (similar for macOS/Linux as well)
4.1.1 Using Firefox as browser:
Open Firefox (my version is 104.0.2). Use the key combination Alt+T+S to open the settings tab. Scroll to the bottom and select Settings under Network Settings:
- Select Manual Proxy Configuration
- In the SOCKS Host, enter localhost, Port 8000
- Check SOCKS v5.
- Check Proxy DNS when using SOCKS v5.
- Check Enable DNS over HTTPS.
- Make sure everything else is unchecked, then click OK.
- Your settings should look like the screenshot below:
4.1.2 Using Chrome/Safari as browser:
Search for proxies and set a SOCKS proxy with server localhost and port 8000.
4.2 Test Proxy
4.2.1. Test Proxy using conda environment:
Go back to your terminal (for example MobaXterm) and log in to SuperPOD using regular SSH, then request a compute node:
$ srun -N1 -G1 -c10 --mem=64G --time=12:00:00 --pty $SHELL
Load cuda and cudnn, then activate any of your conda environments, for example tensorflow_2.9 in the home directory:
$ module load conda gcc; module load cuda; module load cudnn
$ conda activate ~/tensorflow_2.9
Make sure JupyterLab is installed:
$ pip install jupyterlab
Next, run one of the following commands:
$ jupyter notebook --ip=0.0.0.0 --no-browser
# or
$ jupyter lab --ip=0.0.0.0 --no-browser
The following screen appears.
Copy the highlighted URL into Firefox; you will see the Jupyter Notebook forwarded to your browser.
Select the TensorflowGPU29 kernel notebook and check the GPU device:
4.2.2. Test Proxy using a docker container:
For a docker container, the command needs one additional flag:
$ jupyter lab --ip=0.0.0.0 --no-browser --allow-root
You will need to replace the hostname with the name of the node you are on.
For example, for the previous command, you would copy and paste the following line into the Firefox browser:
http://bcm-dgxa100-0016:8888/?token=daefb1c3e2754b37b6b94b619387cb3fd9710608e0152182
Troubleshooting a notebook that requests a password
In certain cases, your Jupyter Notebook requires a password. You can set the password using the command below prior to launching the Jupyter Lab instance:
$ jupyter notebook password
If changing the password does not help, the forwarded port may have a problem. In that case you should either:
(1) change the default port 8888 to another one (8889, for example), or
(2) change the local port used when you first log in to SuperPOD (8000 in this case) to another local port (5000, for example), as shown in the sketch below.
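For example (the port numbers here are arbitrary choices, not required values):
# option (1): start Jupyter on a different port on the compute node
$ jupyter lab --ip=0.0.0.0 --no-browser --port=8889
# option (2): open the SSH tunnel on a different local port, then use 5000 in the browser proxy settings
$ ssh -C -D 5000 username@superpod.smu.edu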
Key Points
Jupyter Lab, Port-Forwarding
Using Batch script in SuperPOD
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to run a Batch script in SuperPOD?
Objectives
Running a batch script using a CIFAR10 template model
5. Using Batch script in SuperPOD
SuperPOD uses SLURM as its scheduler, so running a batch script is no different from ManeFrame 3 (M3). However, there are some commands you need to pay attention to when running a batch script that uses a container.
Following are the instructions on how to run a batch script for a computer vision sample using CIFAR10 data. Here, I use a Python file called model_CNN_CIFAR10.py.
The file can be downloaded from here to your $WORK folder:
$ cd $WORK
$ wget https://raw.githubusercontent.com/SouthernMethodistUniversity/SMU_SuperPOD_101/e6315c29ca0542351b79233729708dfa16161cdf/files/model_CNN_CIFAR10.py
5.1 Running Batch script with conda environment
Prepare the batch script with name: modelCNN.sh using the following content:
#!/bin/bash
#SBATCH -J CNN_CIFAR10_SPOD # job name to display in squeue
#SBATCH -t 60 # maximum runtime in minutes
#SBATCH -c 2 # request 2 cpus
#SBATCH -G 1 # request 1 gpu a100
#SBATCH -p workshop # request queue name workshop (optional)
#SBATCH -D /work/users/tuev # link to your folder
#SBATCH --mem=32gb # request 32gb memory
#SBATCH --mail-user tuev@smu.edu     # send email notifications to this address
#SBATCH --mail-type=end              # send mail when the job ends
module load conda gcc
module load cuda cudnn
conda activate ~/tensorflow_2.9
python model_CNN_CIFAR10.py
From the login node, submit the batch script:
$ sbatch modelCNN.sh
5.2 Running Batch script with container
Prepare the batch script with name: modelCNN_ngc.sh using the following content:
#!/bin/bash
#SBATCH -J CNN_CIFAR10_SPOD # job name to display in squeue
#SBATCH -t 60 # maximum runtime in minutes
#SBATCH -c 2 # request 2 cpus
#SBATCH -G 1 # request 1 gpu a100
#SBATCH -p workshop # request queue name workshop (optional)
#SBATCH --mem=32gb # request 32gb memory
#SBATCH --mail-user tuev@smu.edu     # send email notifications to this address
#SBATCH --mail-type=end              # send mail when the job ends
srun --container-image=/work/users/tuev/sqsh/nvidia+tensorflow+22.12-tf2-py3.sqsh --container-mounts=$WORK python $WORK/model_CNN_CIFAR10.py
From the login node, submit the batch script:
$ sbatch modelCNN_ngc.sh
Key Points
Batch script, Computer Vision
Job queueing and control in SuperPOD
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to control jobs in SuperPOD?
Objectives
Learn the commands for working with jobs in SLURM
The SuperPOD cluster uses the Simple Linux Utility for Resource Management system (SLURM) to manage jobs.
5b. Job Queue and Control
In SLURM there are several useful commands for checking your job:
Lifecycle of a Job
The life of a job begins when you submit the job to the scheduler. If accepted, it will enter the Queued state.
Thereafter, the job may move to other states, as defined below:
- Queued - the job has been accepted by the scheduler and is eligible for execution; waiting for resources.
- Held - the job is not eligible for execution because it was held by user request, administrative action, or job dependency.
- Running - the job is currently executing on the compute node(s).
- Finished - the job finished executing or was canceled/deleted.
The diagram below demonstrates these relationships in graphical form.
Useful Commands
Here are some basic SLURM commands for submitting, querying and deleting jobs in SuperPOD:
| Command | Actions |
|---|---|
| srun -N1 -G1 --pty $SHELL | Submit an interactive job (reserves 1 node, 1 GPU, 1 CPU, 6 GB RAM, 1-hour walltime) |
| sbatch job.sh | Submit the job script job.sh |
| sstat <job id> | Check the status of the job with the given job ID |
| sstat <job id> --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID | Narrow the information shown by sstat |
| squeue -u <username> | Check the status of all jobs submitted by the given username |
| scontrol show job <job id> | Check the detailed information for the job with the given job ID |
| scancel <job id> | Delete the queued or running job with the given job ID |
Check pending and running jobs:
$ squeue -u $USERNAME
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 workshop bash tuev R 39:46 1 bcm-dgxa100-0002
The above job has JOBID=12345, which will be used below.
Check the configuration of any requested job using its JOBID:
$ scontrol show job 12345 | grep ReqTRES
ReqTRES=cpu=5,mem=30G,node=1,billing=5,gres/gpu=1
Delete any job:
$ scancel 12345
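For completed jobs, if job accounting is enabled on the cluster, sacct can report resource usage after the fact (12345 is the example job ID from above):
$ sacct -j 12345 --format=JobID,JobName,Elapsed,State,MaxRSS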
Checking how your job is running on a node
When you know your working node, for example bcm-dgxa100-0001, you can log in to the compute node from the login node and check the processes:
- Command to check working CPUs:
$ ssh bcm-dgxa100-0001
$ top -u $USERNAME
- Command to check working GPUs:
$ ssh bcm-dgxa100-0001
$ nvidia-smi
Or, to refresh the output every 0.2 s:
$ watch -n .2 nvidia-smi
Key Points
Job queue, control
Data Science workflow with GPUs using RAPIDS
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to install and use RAPIDS
Objectives
Using GPUs directly to work with data
RAPIDS
RAPIDS provides unmatched speed with familiar APIs that match the most popular PyData libraries. Built on the shoulders of giants including NVIDIA CUDA and Apache Arrow, it unlocks the speed of GPUs with code you already know.
https://rapids.ai/
Installing RAPIDS
There are several ways to install RAPIDS on HPC systems.
Using Conda Environment
This is the simplest method and is usable on both the M3 and SuperPOD systems. You can install interactively; first, you just need to request a GPU node and load the corresponding module:
- In M3:
$ srun -n1 --gres=gpu:1 -c2 --mem=4gb --time=12:00:00 -p gpu-dev --pty $SHELL
$ module load conda
- In SuperPOD:
$ srun -n1 --gres=gpu:1 -c2 --mem=4gb --time=12:00:00 --pty $SHELL
$ module load conda
Once the necessary module has been loaded, you just need to create the conda environment and install RAPIDS. The following command gets the latest standard version from https://rapids.ai/:
$ conda create -n rapids-23.02 -c rapidsai -c conda-forge -c nvidia rapids=23.02 python=3.10 cudatoolkit=11.8
If you need a more customized version, you can select the corresponding options and copy the command from the RAPIDS website: rapids.ai
Using container
This approach works on SuperPOD only. We will need to download the RAPIDS container from NGC:
$ enroot import docker://nvcr.io#nvidia/rapidsai/rapidsai:cuda11.2-runtime-centos7-py3.10
$ enroot create nvidia+rapidsai+rapidsai+cuda11.2-runtime-centos7-py3.10.sqsh
Once the container has been downloaded to my home/scratch/work directory, I can load it from the login node:
$ srun -N1 -G1 -c10 --mem=64G --time=12:00:00 --container-image $WORK/sqsh/nvidia+rapidsai+rapidsai+cuda11.2-runtime-centos7-py3.10.sqsh --container-mounts=$WORK --pty $SHELL
Your installation is done!
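As a quick sanity check, here is a minimal sketch (assuming the rapids environment or container above is active and a GPU is allocated); cuDF intentionally mirrors the pandas API:
import cudf

# build a small DataFrame on the GPU and run a groupby aggregation there
df = cudf.DataFrame({"category": ["a", "b", "a", "c"],
                     "value": [1.0, 2.0, 3.0, 4.0]})
print(df.groupby("category")["value"].mean())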
Key Points
NGC Container, RAPIDS, cudf, cuDask
Sample Application of NEMO for Sentiment Analysis
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to use NeMo in a container?
Objectives
Apply NeMo to run sentiment analysis
NeMo
- NeMo (Neural Modules) is a toolkit for building new state-of-the-art conversational AI models.
- NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) synthesis models.
- Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures.
Import and Create NeMo sqsh file:
The NGC for NeMo can be found here: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo
$ enroot import docker://nvcr.io#nvidia/nemo:22.09
$ enroot create nvidia+nemo+22.09.sqsh
Sentiment Analysis using NeMo
Here we use this sentiment sample from NVIDIA
SST2 data:
We download the Stanford Sentiment Treebank v2 (SST-2) dataset and preprocess it into NeMo format for the training and testing data:
cd $WORK
mkdir nemo && cd nemo
curl -s -O https://dl.fbaipublicfiles.com/glue/data/SST-2.zip\
&& unzip -o SST-2.zip -d ./\
&& sed 1d ./SST-2/train.tsv > ./train_nemo_format.tsv\
&& sed 1d ./SST-2/dev.tsv > ./dev_nemo_format.tsv &
Request a compute node with the NeMo container enabled and one GPU:
srun -N1 -G1 -c10 --mem=64G --time=12:00:00 --container-image $WORK/sqsh/nvidia+nemo+22.09.sqsh --container-mounts=$WORK --pty bash -i
Let’s run Sentiment Analysis using NeMo
- The model is named ‘bert-base-cased’
- Computation uses 2 GPUs and 20 epochs
cd $WORK/nemo/SST-2
python /workspace/nemo/examples/nlp/text_classification/text_classification_with_bert.py \
model.dataset.num_classes=2 \
model.dataset.max_seq_length=256 \
model.train_ds.batch_size=64 \
model.validation_ds.batch_size=64 \
model.language_model.pretrained_model_name='bert-base-cased' \
model.train_ds.file_path=train_nemo_format.tsv \
model.validation_ds.file_path=dev_nemo_format.tsv \
trainer.num_nodes=1 \
trainer.max_epochs=20 \
trainer.precision=16 \
model.optim.name=adam \
model.optim.lr=1e-4
Check the GPU usage with nvidia-smi command
Output of the model training is text_classification_model.nemo
Model Evaluation and Inference
- After saving the model in .nemo format, you can load the model and perform evaluation or inference on it.
- Following is the content of a Python file that loads the trained NeMo model and evaluates it:
from nemo.collections.nlp.models.text_classification import TextClassificationModel
model = TextClassificationModel.restore_from("text_classification_model.nemo")
model.to("cuda")
# define the list of queries for inference
queries = ['legendary irish writer brendan behan memoir , borstal boy',
'demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop ',
'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up',
'uneasy mishmash of styles and genres']
results = model.classifytext(queries=queries, batch_size=3, max_seq_length=512)
print('The prediction results of some sample queries with the trained model:')
for query, result in zip(queries, results):
    print(f'Query : {query}')
    print(f'Predicted label: {result}')
Key Points
NGC Container, NEMO, Sentiment Analysis
Sample Applications of MultiGPUs for Computer Vision using Horovod
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to utilize multiple GPUs in SuperPOD
Objectives
Apply Horovod to drive multiple GPUs using CIFAR100
Multiple GPUs using CIFAR100
- In the code, Horovod is imported to enable multi-GPU training
- The rest of the code is a regular computer vision model as seen in many other papers
Here is the sample Python code that uses Tensorflow to train on the CIFAR100 dataset:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Conv2D # convolutional layers to reduce image size
from tensorflow.keras.layers import MaxPooling2D,AveragePooling2D # Max pooling layers to further reduce image size
from tensorflow.keras.layers import Flatten # flatten data from 2D to column for Dense layer
from tensorflow.keras.datasets import cifar100
import matplotlib.pyplot as plt
# TODO: Step 1: import Horovod
import horovod.tensorflow.keras as hvd
# TODO: Step 1: initialize Horovod
hvd.init()
# TODO: Step 1: pin to a GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], True)
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
def plot_acc_loss(history):
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['training', 'validation'], loc='best')
    plt.savefig("calval_hvod.png")  # save the accuracy plot as png
    plt.show()
# load data
(X_train, y_train), (X_test, y_test) = cifar100.load_data()
# Normalized data to range (0, 1):
X_train, X_test = X_train/X_train.max(), X_test/X_test.max()
num_categories=100
y_train = tf.keras.utils.to_categorical(y_train,num_categories)
y_test = tf.keras.utils.to_categorical(y_test,num_categories)
model = Sequential()
model.add(Conv2D(1024, (3, 3), strides=(1, 1), activation='relu', input_shape=(32, 32, 3)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(.1))
model.add(Conv2D(512, (3, 3), strides=(1, 1), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(.1))
model.add(Conv2D(256, (3, 3), strides=(1, 1), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(.1))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(.1))
# Output layer contains 100 classes
model.add(Dense(100, activation='softmax'))
model.summary()
# create model
model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])
#Train the model
model_CNN = model.fit(X_train, y_train, epochs=40, verbose=1,
                      validation_data=(X_test, y_test))
plot_acc_loss(model_CNN)
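Note that the script above only initializes Horovod and pins one GPU per process; a complete multi-GPU setup normally also wraps the optimizer and broadcasts the initial weights. A hedged sketch of those remaining steps (assumptions, not part of the original file) would replace the plain compile/fit above with something like:
# Sketch of the usual remaining Horovod steps (not in the original script)
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())   # scale the learning rate by the number of workers
opt = hvd.DistributedOptimizer(opt)                  # average gradients across all GPUs

model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights from rank 0
model_CNN = model.fit(X_train, y_train, epochs=40,
                      callbacks=callbacks,
                      verbose=1 if hvd.rank() == 0 else 0,       # print progress only on rank 0
                      validation_data=(X_test, y_test))
if hvd.rank() == 0:
    plot_acc_loss(model_CNN)                         # save the plot only once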
Using SuperPOD to run with multiple GPUs
The following batch script submits the training job using 8 GPUs and the Tensorflow 22.02 NGC container:
#!/bin/bash
#SBATCH -J CIFAR100M # job name to display in squeue
#SBATCH -c 16 --mem=750G # requested CPUs and memory
#SBATCH -o output-%j.txt # standard output file
#SBATCH -e error-%j.txt # standard error file
#SBATCH --gres=gpu:8
#SBATCH -t 1440 # maximum runtime in minutes
#SBATCH -D /work/users/tuev/cv1/cifar100/multi
#SBATCH --exclusive
#SBATCH --mail-user tuev@smu.edu
#SBATCH --mail-type=end
srun --container-image=$WORK/sqsh/nvidia+tensorflow+22.02-tf2-py3.sqsh --container-mounts=$WORK mpirun -np 8 --allow-run-as-root --oversubscribe python /work/users/tuev/cv1/cifar100/multi/cifar100spod-hvod.py
Make sure to use nvidia-smi to check the usage of all 8 GPUs
Key Points
NGC Container, Horovod, Computer Vision
Using YOLOv5 for object detection
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to train YOLOv5 to detect objects
Objectives
Download a pretrained YOLOv5 model and images, then apply YOLO to detect objects
YOLOv5
YOLOv5 🚀 is the world’s most loved vision AI, representing Ultralytics open-source research into future vision AI methods, incorporating lessons learned and best practices evolved over thousands of hours of research and development.
To download YOLO, simply go to the github page and clone it to your home or work directory:
$ git clone https://github.com/ultralytics/yolov5.git
Suggestion: It is better to use the $WORK directory to store the code and data, to avoid filling up your $HOME directory.
Open a Conda env (or container) and install the requirements
Prior to training the YOLOv5 model, it's better to go to your own conda environment (or a container) and install the missing libraries. For simplicity, I use the NeMo container:
$ srun -n1 --gres=gpu:1 --container-image $WORK/sqsh/nvidia+nemo+22.04.sqsh --container-mounts=$WORK --time=12:00:00 --pty $SHELL
Go to the yolov5 folder and install the missing libraries:
$ cd yolov5
$ pip install -r requirements.txt
Select Pretrained model
Refer to this table for a full comparison of models. Here, let's use yolov5l6 for better performance.
Dataset for training:
YOLOv5 is trained using the COCO (Common Objects in Context) dataset; here we use coco128, a small subset consisting of the first 128 images of the larger COCO dataset.
The dataset is automatically downloaded when using the flag --data coco128.yaml
Train YOLOv5
Let’s train the model with an image size of 1280 pixels, a batch size of 32, and 10 epochs; the data in use is coco128 and the pretrained model is yolov5l6:
$ python train.py --img 1280 --batch 32 --epochs 10 --data coco128.yaml --weights yolov5l6.pt
Tail of the output from model training:
Epoch GPU_mem box_loss obj_loss cls_loss Instances Size
9/9 75.5G 0.02099 0.05281 0.006695 573 1280: 100%|██████████| 4/4 [00:03<00:00, 1.17it/s]
Class Images Instances P R mAP50 mAP50-95: 100%|██████████| 2/2 [00:01<00:00, 1.01it/s]
all 128 929 0.905 0.805 0.902 0.736
10 epochs completed in 0.031 hours.
Optimizer stripped from runs/train/exp/weights/last.pt, 154.9MB
Optimizer stripped from runs/train/exp/weights/best.pt, 154.9MB
Here we see that two weight files are created by the training process, last.pt and best.pt, in the corresponding output location.
We will use the best.pt weights for model inference.
To validate the model inference, we use the data from Kaggle
The Kaggle dataset can be found here: https://www.kaggle.com/competitions/open-images-2019-object-detection/data#
Using Kaggle API, one can simply download the dataset from CLI:
kaggle competitions download -c open-images-2019-object-detection
Unzip the open-images-2019-object-detection.zip to get the test folder with 100000 images.
Inference using YOLOv5 for object detection with Kaggle data
The weights used are from the trained model best.pt:
$ python detect.py --weights runs/train/exp/weights/best.pt --img 1280 --conf 0.25 --source ../test
The model output can be found in runs/detect/exp.
Sample model result:
Inference using YOLOv5 for object detection with video
We can also use YOLOv5 for video detection. Starting from a sample video like this:
https://user-images.githubusercontent.com/43855029/222778747-b5312f6d-58c9-4f63-9233-93dfa65f8345.mp4
We run the inference with the best pretrained model using the following command:
$ python detect.py --weights runs/train/exp/weights/best.pt --source video.mp4
The output of the inference looks like:
detect: weights=['runs/train/exp/weights/best.pt'], source=../test/before_short.mp4, data=data/coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
YOLOv5 🚀 v7.0-56-gc0ca1d2 Python-3.8.13 torch-1.13.0a0+d0d6b1f CUDA:0 (NVIDIA A100-SXM4-80GB, 81251MiB)
Fusing layers...
Model summary: 157 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
video 1/1 (1/120) /work/users/tuev/YOLO/test/before_short.mp4: 384x640 2 trains, 156.8ms
video 1/1 (2/120) /work/users/tuev/YOLO/test/before_short.mp4: 384x640 2 trains, 8.2ms
video 1/1 (3/120) /work/users/tuev/YOLO/test/before_short.mp4: 384x640 2 trains, 8.2ms
video 1/1 (4/120) /work/users/tuev/YOLO/test/before_short.mp4: 384x640 2 trains, 8.1ms
video 1/1 (5/120) /work/users/tuev/YOLO/test/before_short.mp4: 384x640 2 trains, 8.1ms
video 1/1 (6/120) /work/users/tuev/YOLO/test/before_short.mp4: 384x640 2 trains, 8.1ms
video 1/1 (7/120) /work/users/tuev/YOLO/test/before_short.mp4: 384x640 3 trains, 8.1ms
video 1/1 (8/120) /work/users/tuev/YOLO/test/before_short.mp4: 384x640 2 trains, 8.1ms
video 1/1 (9/120) /work/users/tuev/YOLO/test/before_short.mp4: 384x640 2 trains, 8.2ms
video 1/1 (10/120) /work/users/tuev/YOLO/test/before_short.mp4: 384x640 2 trains, 8.2ms
Speed: 0.3ms pre-process, 9.4ms inference, 2.2ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/exp2
and the output video is saved in runs/detect/exp2 folder:
https://user-images.githubusercontent.com/43855029/222778650-f68c4a4f-ad51-4237-92a8-bfb0ad37cd54.mp4
Key Points
YOLOv5, object detection, inference
Using Transfer Learning with ResNet50
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to apply transfer learning to detect objects
Objectives
Apply ResNet50 model in transfer learning
The following lecture notes are based on NVIDIA's Fundamental Introduction to Deep Learning course, with different input data.
Transfer Learning
So far, we have trained accurate models on large datasets, and also downloaded a pre-trained model that we used with no training necessary. But what if we cannot find a pre-trained model that does exactly what you need, and what if we do not have a sufficiently large dataset to train a model from scratch? In this case, there is a very helpful technique we can use called transfer learning.
With transfer learning, we take a pre-trained model and retrain it on a task that has some overlap with the original training task. A good analogy for this is an artist who is skilled in one medium, such as painting, who wants to learn to practice in another medium, such as charcoal drawing. We can imagine that the skills they learned while painting would be very valuable in learning how to draw with charcoal.
As an example in deep learning, say we have a pre-trained model that is very good at recognizing different types of cars, and we want to train a model to recognize types of motorcycles. A lot of the learnings of the car model would likely be very useful, for instance the ability to recognize headlights and wheels.
Transfer learning is especially powerful when we do not have a large and varied dataset. In this case, a model trained from scratch would likely memorize the training data quickly, but not be able to generalize well to new data. With transfer learning, you can increase your chances of training an accurate and robust model on a small dataset.
Here we just use a simple tensorflow conda environment or container:
$ srun -n1 -G1 --container-image $WORK/sqsh/nvidia+tensorflow+22.02-tf2-py3.sqsh --container-mounts=$WORK --time=12:00:00 --pty bash -i
Objective
- Prepare a pretrained model for transfer learning
- Perform transfer learning with your own small dataset on a pretrained model
- Further fine tune the model for even better performance
Urban or Rural
In this example, we would like to create a model that recognizes urban and rural scenes. The data can be downloaded from here.
Download the pre-trained model
The ImageNet pre-trained models are often good choices for computer vision transfer learning, as they have learned to classify various different types of images. In doing this, they have learned to detect many different types of features that could be valuable in image recognition.
Let us start by downloading the pre-trained model. Again, this is available directly from the Keras library. As we are downloading, there is going to be an important difference. The last layer of an ImageNet model is a dense layer of 1000 units, representing the 1000 possible classes in the dataset. In our case, we want it to make a different classification: is this urban or rural? Because we want the classification to be different, we are going to remove the last layer of the model. We can do this by setting the flag include_top=False
when downloading the model. After removing this top layer, we can add new layers that will yield the type of classification that we want:
from tensorflow.keras.applications.resnet50 import ResNet50
base_model = ResNet50(
weights='imagenet', # Load weights pre-trained on ImageNet.
input_shape=(224, 224, 3),
include_top=False)
base_model.summary()
Freezing the Base Model
Before we add our new layers onto the pre-trained model, we should take an important step: freezing the model’s pre-trained layers. This means that when we train, we will not update the base layers from the pre-trained model. Instead we will only update the new layers that we add on the end for our new classification. We freeze the initial layers because we want to retain the learning achieved from training on the ImageNet dataset. If they were unfrozen at this stage, we would likely destroy this valuable information. There will be an option to unfreeze and train these layers later, in a process called fine-tuning.
Freezing the base layers is as simple as setting trainable on the model to False
.
base_model.trainable = False
Adding new layer
We can now add the new trainable layers to the pre-trained model. They will take the features from the pre-trained layers and turn them into predictions on the new dataset. We will add two layers to the model. First will be a pooling layer like we saw in our earlier convolutional neural network. (If you want a more thorough understanding of the role of pooling layers in CNNs, please read this detailed blog post). We then need to add our final layer, which will classify urban or rural. This will be a densely connected layer with one output.
from tensorflow import keras
inputs = keras.Input(shape=(224, 224, 3))
# Separately from setting trainable on the model, we set training to False
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
# A Dense classifier with a single unit (binary classification)
outputs = keras.layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.summary()
Keras gives us a nice summary here, as it shows the ResNet50 pre-trained model as one unit, rather than showing all of its internal layers. It is also worth noting that we have many non-trainable parameters as we have frozen the pre-trained model.
Compile the model
As with our previous exercises, we need to compile the model with loss and metrics options. We have to make some different choices here. In previous cases we had many categories in our classification problem. As a result, we picked categorical crossentropy for the calculation of our loss. In this case we only have a binary classification problem (Urban or Rural), and so we will use binary crossentropy. Further detail about the differences between the two can be found here. We will also use binary accuracy instead of traditional accuracy.
By setting from_logits=True
we inform the loss function that the output values are not normalized (e.g. with softmax).
# Important to use binary crossentropy and binary accuracy as we now have a binary classification problem
model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=True), metrics=[keras.metrics.BinaryAccuracy()])
Augmenting the data
Now that we are dealing with a very small dataset, it is especially important that we augment our data. As before, we will make small modifications to the existing images, which will allow the model to see a wider variety of images to learn from. This will help it learn to recognize new pictures of Urban/Rural instead of just memorizing the pictures it trains on.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# create a data generator
datagen = ImageDataGenerator(
samplewise_center=True, # set each sample mean to 0
rotation_range=10, # randomly rotate images in the range (degrees, 0 to 180)
zoom_range = 0.1, # Randomly zoom image
width_shift_range=0.1, # randomly shift images horizontally (fraction of total width)
height_shift_range=0.1, # randomly shift images vertically (fraction of total height)
horizontal_flip=True, # randomly flip images
vertical_flip=False) # we don't expect the image to be taken upside-down
Loading the data
We have seen datasets in a couple different formats so far. In the MNIST exercise, we were able to download the dataset directly from within the Keras library. For the sign language dataset, the data was in CSV files. For this exercise, we are going to load images directly from folders using Keras’ flow_from_directory
function. We have set up our directories to help this process go smoothly as our labels are inferred from the folder names. In the data
directory, we have train and validation directories, which each have folders for images of Urban or Rural. Feel free to explore the images to get a sense of our dataset.
Note that flow_from_directory will also allow us to size our images to match the model: 224x224 pixels with 3 channels.
# load and iterate training dataset
train_it = datagen.flow_from_directory('data/train/',
target_size=(224, 224),
color_mode='rgb',
class_mode='binary',
batch_size=8)
# load and iterate validation dataset
valid_it = datagen.flow_from_directory('data/val/',
target_size=(224, 224),
color_mode='rgb',
class_mode='binary',
batch_size=8)
Training the model
Time to train our model and see how it does. Recall that when using a data generator, we have to explicitly set the number of steps_per_epoch
:
model.fit(train_it, steps_per_epoch=12, validation_data=valid_it, validation_steps=4, epochs=20)
Discussion of Results
Both the training and validation accuracy should be quite high. This is a pretty awesome result! We were able to train on a small dataset, but because of the knowledge transferred from the ImageNet model, it was able to achieve high accuracy and generalize well. This means it has a very good sense of Urban and Rural.
If you saw some fluctuation in the validation accuracy, that is okay too. We have a technique for improving our model in the next section.
Fine tuning the model
Now that the new layers of the model are trained, we have the option to apply a final trick to improve the model, called fine-tuning. To do this we unfreeze the entire model, and train it again with a very small learning rate. This will cause the base pre-trained layers to take very small steps and adjust slightly, improving the model by a small amount.
Note that it is important to only do this step after the model with frozen layers has been fully trained. The untrained pooling and classification layers that we added to the model earlier were randomly initialized. This means they needed to be updated quite a lot to correctly classify the images. Through the process of backpropagation, large initial updates in the last layers would have caused potentially large updates in the pre-trained layers as well. These updates would have destroyed those important pre-trained features. However, now that those final layers are trained and have converged, any updates to the model as a whole will be much smaller (especially with a very small learning rate) and will not destroy the features of the earlier layers.
Let’s try unfreezing the pre-trained layers, and then fine tuning the model:
# Unfreeze the base model
base_model.trainable = True
# It's important to recompile your model after you make any changes
# to the `trainable` attribute of any inner layer, so that your changes
# are taken into account
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate = .00001), # Very low learning rate
loss=keras.losses.BinaryCrossentropy(from_logits=True),
metrics=[keras.metrics.BinaryAccuracy()])
model.fit(train_it, steps_per_epoch=12, validation_data=valid_it, validation_steps=4, epochs=10)
Examine the Prediction
Now that we have a well-trained model, it is time to use it to detect Urban or Rural. We can start by looking at the predictions that come from the model.
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from tensorflow.keras.preprocessing import image as image_utils
from tensorflow.keras.applications.imagenet_utils import preprocess_input
def show_image(image_path):
    image = mpimg.imread(image_path)
    plt.imshow(image)

def make_predictions(image_path):
    show_image(image_path)
    image = image_utils.load_img(image_path, target_size=(224, 224))
    image = image_utils.img_to_array(image)
    image = image.reshape(1, 224, 224, 3)
    image = preprocess_input(image)
    preds = model.predict(image)
    return preds
make_predictions('data/val/urban/urban_20.jpeg')
make_predictions('data/val/rural/rural5.jpeg')
It looks like a negative prediction means the image is Rural and a positive prediction means it is Urban. We can use this information to differentiate these scenes.
def detect_img(image_path):
    preds = make_predictions(image_path)
    if preds[0] < 0:
        print("It's Rural! So freshy")
    else:
        print("It's Urban! So developed!")
import numpy as np
detect_img('data/val/rural/rural15.jpeg')
detect_img('data/val/urban/urban_40.jpeg')
Summary
Great work! With transfer learning, you have built a highly accurate model using a very small dataset. This can be an extremely powerful technique, and be the difference between a successful project and one that cannot get off the ground. We hope these techniques can help you out in similar situations in the future!
There is a wealth of helpful resources for transfer learning in the NVIDIA Transfer Learning Toolkit.
Key Points
ResNet50, object detection, transfer learning
Using Stable Diffusion with SuperPOD
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to use the Stable Diffusion model
Objectives
Learn how to download and install Stable Diffusion from HuggingFace
Users can now access Stable Diffusion from Hugging Face while still utilizing the power of SuperPOD's A100 GPUs to run inference on any incoming prompt. The following example uses the Stable Diffusion XL model from Hugging Face.
- First of all, download the library:
pip install diffusers --upgrade
- Then use the following code with a prompt to generate an image:
from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda")
# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()
prompt = "An astronaut riding a green horse"
images = pipe(prompt=prompt).images[0]
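The pipeline returns a PIL image, so you can save it to disk to view the result (the file name below is arbitrary):
# save the generated PIL image to disk
images.save("astronaut_green_horse.png")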
Key Points
Stable Diffusion, Prompt, HuggingFace
Using Pre-trained model from HuggingFace
Overview
Teaching: 20 min
Exercises: 0 min
Questions
How to use pre-trained models already available from the Hugging Face hub
Objectives
To master the usage of pre-trained deep learning models from Hugging Face
Hugging Face hub
- The Hugging Face hub is considered the GitHub of machine learning
- It is a platform with over 120k models, 20k datasets, and 50k demo apps, all open source and publicly available
- It is an all-in-one platform where people can easily deploy, collaborate, and build ML models
Transformers library
- The Transformers library, developed by Hugging Face, has played a significant role in making state-of-the-art NLP models more accessible to researchers and developers.
- It includes pre-trained models like BERT, GPT, RoBERTa, and more, which can be fine-tuned for specific tasks such as text classification, language generation, question answering, and more.
- The library offers a consistent API for various NLP tasks, making it easier for practitioners to experiment with and deploy these models.
Model task
The screenshot below describes the model tasks from Hugging Face, covering many different aspects from computer vision to NLP, audio, and reinforcement learning.
Pipeline for inference
- The pipeline() makes it simple to use any model from the Hub for inference on any language, computer vision, speech, and multimodal task.
- Even if you don’t have experience with a specific modality or aren’t familiar with the underlying code behind the models, you can still use them for inference with the pipeline()!
Pipeline for NLP Sentiment Analysis
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("I am so excited to use the new SuperPOD from NVIDIA")
[{'label': 'POSITIVE', 'score': 0.9995261430740356}]
classifier(
["I am so excited to use the new SuperPOD from NVIDIA", "I hate running late"])
[{'label': 'POSITIVE', 'score': 0.9995261430740356},
{'label': 'NEGATIVE', 'score': 0.9943193197250366}]
Pipeline Text Generation
from transformers import pipeline
generator = pipeline("text-generation")
generator("Using SMU latest HPC cluster NVIDIA SuperPOD, you will be able to")
[{'generated_text': 'Using SMU latest HPC cluster NVIDIA SuperPOD, you will be able to connect to other SSE nodes such as the following and use them as a HPC node:\n\n[CPU: CPU1, GIGABYTE'}]
Pipeline for Mask filling
from transformers import pipeline
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
[{'score': 0.19619698822498322,
'token': 30412,
'token_str': ' mathematical',
'sequence': 'This course will teach you all about mathematical models.'},
{'score': 0.04052705690264702,
'token': 38163,
'token_str': ' computational',
'sequence': 'This course will teach you all about computational models.'}]
Pipeline for Named Entity Recognition
from transformers import pipeline
ner = pipeline("ner", grouped_entities=True)
ner("My name is Tue Vu and I work at SMU in Dallas")
[{'entity_group': 'PER',
'score': 0.9868829,
'word': 'Tue Vu',
'start': 11,
'end': 17},
{'entity_group': 'ORG',
'score': 0.9965092,
'word': 'SMU',
'start': 32,
'end': 35},
{'entity_group': 'LOC',
'score': 0.9950755,
'word': 'Dallas',
'start': 39,
'end': 45}]
Pipeline for Question Answering
from transformers import pipeline
question_answerer = pipeline("question-answering")
question_answerer(
question="Where do I work?",
context="My name is Tue Vu and I work at SMU in Dallas",
)
{'score': 0.3651700019836426, 'start': 32, 'end': 35, 'answer': 'SMU'}
Pipeline for Conversational
from transformers import pipeline, Conversation
converse = pipeline("conversational")
conversation_1 = Conversation("What do you think about using HPC SuperPOD")
conversation_2 = Conversation("Do you believe in God?")
converse([conversation_1, conversation_2])
Answer:
[Conversation id: 44cf473c-29f2-4b44-be6c-15352dab13a2
user >> What do you think about using HPC SuperPOD
bot >> I think it's a good idea, but I don't think it's a good idea to use it for a lot of things. ,
Conversation id: 489d923c-f127-4847-8cde-972c77470230
user >> What do you do to optimize the Python workflow?
bot >> I believe in the power of love.]
Pipeline for Computer Vision - Image Classification
from transformers import pipeline
clf = pipeline("image-classification")
Display the image:
import urllib.request
from io import BytesIO
from PIL import Image

url = 'https://t4.ftcdn.net/jpg/02/66/72/41/360_F_266724172_Iy8gdKgMa7XmrhYYxLCxyhx6J7070Pr8.jpg'
with urllib.request.urlopen(url) as response:
    img = Image.open(BytesIO(response.read()))
img
Model Inference
clf(img)
[{'score': 0.49216628074645996, 'label': 'Egyptian cat'},
{'score': 0.41306015849113464, 'label': 'tabby, tabby cat'},
{'score': 0.050162095576524734, 'label': 'tiger cat'},
{'score': 0.012556081637740135, 'label': 'lynx, catamount'},
{'score': 0.00524393143132329, 'label': 'ping-pong ball'}]
Key Points
Hugging Face, pre-trained, pipeline