HPC Connect test notebook

examples/hpc_connect_tutorial.livemd

Mix.install([
    {:kino, "~> 0.19"},
    {:table, "0.1.2"},
    {:hpc_connect, github: "penthooose/hpc_connect", force: true}
])

Setup and bootstrap

The HpcConnect.prepare_livebook_session/1 helper wraps the Livebook-only Kino.Input.* UI into a single form. With persist_form: true, the text and select values can be reused the next time the notebook is opened. The uploaded SSH key is never persisted (it is overwritten), so it must be selected again each time.

boot =
    HpcConnect.prepare_livebook_session(
        cluster: :alex,
        # optional for gated HF models:
        # hf_token: System.get_env("HF_TOKEN"),
        remote_command: "hostname && whoami",
        persist_form: true,
        submit_label: "Connect to HPC"
    )
    |> HpcConnect.bootstrap()

session = boot.session
boot

Bootstrap quick summary

%{
    details: boot.details,
    probe: boot.probe,
    command_preview: boot.command_preview
}

%{
    gpus: boot.gpus,
    models: boot.models,
    quota: boot.quota,
    jobs: boot.jobs
}

Or view everything summarized together:

boot.startup

vLLM test configuration

vllm_args = [
    partition: "a100",
    gpus: 1,
    walltime: "02:00:00",
    model: "meta-llama/Llama-3.2-1B-Instruct",
    port: 50200
]
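
The argument list above is a plain Elixir keyword list, so a variation for a second run can be derived from it without retyping every entry. The following is a minimal sketch using only the standard library; the override values (a shorter walltime and a different port) are illustrative assumptions, not values required by HPC Connect:

```elixir
# Base arguments, copied from the configuration cell above.
base_args = [
    partition: "a100",
    gpus: 1,
    walltime: "02:00:00",
    model: "meta-llama/Llama-3.2-1B-Instruct",
    port: 50200
]

# Hypothetical overrides for a shorter test run on a different port.
overrides = [walltime: "00:30:00", port: 50201]

# Keyword.merge/2 keeps the base entries and lets the overrides win
# for any key that appears in both lists.
short_run_args = Keyword.merge(base_args, overrides)

IO.inspect(short_run_args, label: "short_run_args")
```

The merged list can be passed as args: in the same way as vllm_args. Note that Keyword.merge/2 does not promise to preserve the original key order, so read values with Keyword.get/2 rather than by position.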

Start vLLM

vllm =
    HpcConnect.start_app(session,
        app: "vllm",
        args: vllm_args
    )

Talk to the running app

HpcConnect.vllm_chat(vllm, "Hello from Livebook")

answer =
    HpcConnect.vllm_chat(
        vllm,
        "Can you solve: forall x in {a,b}: (P(x) -> Q(x)) and (Q(x) -> R(x))"
    ).answer

answer
|> String.split("\n")
|> Enum.each(&IO.puts/1)

Kill the job and delete the session

# HpcConnect.cancel_job(session, vllm.job_id)

# or directly with the job ID:
# HpcConnect.cancel_job(session, "3604228")

If the session was interrupted, or you want to cancel all jobs:

HpcConnect.cancel_all_jobs(session)

# to check if all jobs are gone:
HpcConnect.list_jobs_summary(session)

Clean up all session data (including the SSH key file):

HpcConnect.cleanup_livebook_session(session)

Manual status queries after the app test

These explicit query commands mostly duplicate the boot summaries, so they are kept here, after the app test.

HpcConnect.available_gpu_summary(session)
HpcConnect.list_downloaded_models(session)
HpcConnect.list_jobs_summary(session)
HpcConnect.quota_summary(session)

Download a model from HuggingFace

HpcConnect.download_model(session, "meta-llama/Llama-3.2-1B-Instruct")

Build an Apptainer SIF image

HpcConnect.build_sif(session, "vllm")

Allocate a GPU interactively without running an app

alloc = HpcConnect.allocate_gpu(session, partition: "a100", walltime: "01:00:00")

# Release GPU
# HpcConnect.release_gpu(session, alloc)

# or with job id directly:
# HpcConnect.release_gpu(session, "JOBID")
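
The walltime values in this notebook are written as HH:MM:SS strings, and a typo there only shows up once the request reaches the cluster. As a hedged sketch, a small local check can catch malformed values before submitting. The Walltime module below is hypothetical and not part of HPC Connect, and it assumes the HH:MM:SS form used in this notebook is the only one needed; the scheduler may accept other forms that this helper would reject:

```elixir
defmodule Walltime do
    # Parses "HH:MM:SS" into a number of seconds.
    # Returns {:ok, seconds}, or :error for any other input shape.
    def to_seconds(walltime) when is_binary(walltime) do
        with [h, m, s] <- String.split(walltime, ":"),
             {hours, ""} <- Integer.parse(h),
             {minutes, ""} <- Integer.parse(m),
             {seconds, ""} <- Integer.parse(s),
             true <- hours >= 0 and minutes in 0..59 and seconds in 0..59 do
            {:ok, hours * 3600 + minutes * 60 + seconds}
        else
            _ -> :error
        end
    end
end

IO.inspect(Walltime.to_seconds("01:00:00"))  # {:ok, 3600}
IO.inspect(Walltime.to_seconds("1h"))        # :error
```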

Reconnect to an existing vLLM job after notebook/runtime restart

# Get the job ID of the first listed job (raises a MatchError if the list is empty)
[%{job_id: job_id} | _others] = HpcConnect.list_jobs_summary(session)
job_id

reconnected_vllm =
    HpcConnect.reconnect(session, job_id,
        app: "vllm",
        args: [port: 50200]
    )

HpcConnect.vllm_chat(reconnected_vllm, "Hello from Livebook")

answer =
    HpcConnect.vllm_chat(
        reconnected_vllm,
        "Can you solve: forall x in {a,b}: (P(x) -> Q(x)) and (Q(x) -> R(x))"
    ).answer

answer
|> String.split("\n")
|> Enum.each(&IO.puts/1)
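
The pattern match used in this section to fetch the job ID fails with a MatchError when no job is running. A more defensive lookup can be sketched as follows. It assumes only what that match already implies, namely that HpcConnect.list_jobs_summary/1 returns a list of maps with a :job_id key. The JobLookup module is hypothetical, and the sample list stands in for the real call so the cell runs without a cluster:

```elixir
defmodule JobLookup do
    # Returns {:ok, job_id} for the first job, or :none for an empty list.
    def first_job_id([%{job_id: job_id} | _others]), do: {:ok, job_id}
    def first_job_id([]), do: :none
end

# Sample data standing in for HpcConnect.list_jobs_summary(session).
jobs = [%{job_id: "3604228"}]

case JobLookup.first_job_id(jobs) do
    {:ok, job_id} -> IO.puts("Reconnecting to job #{job_id}")
    :none -> IO.puts("No running jobs found; start the app again instead")
end
```

Returning a tagged tuple instead of matching directly lets the notebook show a readable message when the job has already finished or been cancelled.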

Remote Apptainer paths and re-uploading HPC_Connect files

bootstrap/1 already installs the bundled helper scripts and uploads bundled definition files by default, so the next cells are mainly useful for reruns or debugging.

HpcConnect.remote_def_path(session)
HpcConnect.remote_sif_path(session)
# reuploads any hpc_connect files needed for proper execution
HpcConnect.install_remote_scripts!(session)
HpcConnect.upload_def_file(session)

Cleanup

Run this exit command to cancel all jobs, clean up the session data (including the SSH key file), and clear the Apptainer cache:

HpcConnect.exit(boot)

Clear Apptainer cache to free up some space on the cluster:

HpcConnect.clear_app_cache(boot)

Use this to clean up only the session while you still have boot or session:

HpcConnect.cleanup_livebook_session(boot)

Use this only as a recovery helper after an interrupted notebook/runtime where you no longer have the original boot or session value:

HpcConnect.cleanup_livebook_orphans(delete_uploaded: true)

Uninstall HPC_Connect from the cluster

Remove all HPC_Connect-related files from the cluster file system:

HpcConnect.uninstall(boot)

Remove all files including downloaded models:

HpcConnect.uninstall(boot, remove_models: true)