GCP Cluster Setup

Prerequisites

Install GCP SDK: If you haven’t installed the GCP SDK, do so by following the official guide.

Authentication: Authenticate your account with GCP.

gcloud auth login
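
If you work with several projects, you may also want to point gcloud at the project that will host the cluster (MY_PROJECT is a placeholder for your project ID):

gcloud config set project MY_PROJECT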

Cluster Creation

Let’s show two ways to create a cluster. To launch the app with GPU support, use the second option.

Create a simple two-node cluster (one node dedicated to the job):

gcloud container clusters create sarusapp --zone=europe-west1 --machine-type=n1-standard-2 --num-nodes=2

We also need to taint and label the worker node:

kubectl taint nodes <name-of-Worker-node> gpu=true:NoSchedule
kubectl label nodes <name-of-Worker-node> gpu-node=true
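
If needed, you can look up the worker node’s name by listing the cluster’s nodes:

kubectl get nodes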

Create the cluster with a GPU node pool:

gcloud container clusters create sarusappwithgpu --zone=europe-west2-a --machine-type=n2-highmem-4 --num-nodes=1 --addons=GcpFilestoreCsiDriver

gcloud container node-pools create gpu-node-pool  --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=default --machine-type n2-highmem-2 --region europe-west2-a  --cluster sarusappwithgpu --node-locations europe-west2-a  --num-nodes 0 --min-nodes 0 --max-nodes 1 --enable-autoscaling  --node-taints=gpu=true:NoSchedule --node-labels=gpu-node=true

Remarks:

  • --addons=GcpFilestoreCsiDriver enables the use of ReadWriteMany storage (shared storage between nodes)

  • --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=default specifies the type of GPU

  • --min-nodes 0 --max-nodes 1 --enable-autoscaling lets the GPU node pool scale down to zero, so a GPU is only provisioned (and billed) when needed

  • --node-taints=gpu=true:NoSchedule --node-labels=gpu-node=true reserves this node pool for GPU jobs rather than the Sarus app itself
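
To double-check that both node pools were created, you can list them (a standard gcloud command; adjust the zone if you created the cluster elsewhere):

gcloud container node-pools list --cluster=sarusappwithgpu --zone=europe-west2-a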

Cluster Access and Configuration

Access Configuration: Follow the GCP tutorial to configure kubectl for the new cluster. To set up cluster credentials, run:

gcloud container clusters get-credentials sarusappwithgpu --region=europe-west2-a

Test your setup with:

kubectl get nodes

You should see the nodes listed, indicating that they are part of the Google Kubernetes Engine (GKE) cluster.

User Permissions

Assign Roles: Assign the container.roles.create and container.roleBindings.create permissions to the user who will be running the Helm install. These permissions enable the application of specific Kubernetes YAML files (rolebinding.yaml and job-role.yaml) and allow one pod to execute jobs on other nodes.
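
As a sketch of how these permissions could be granted with a custom IAM role (the role ID, project ID, and user email below are placeholders, not values from this guide):

gcloud iam roles create sarusHelmInstaller \
    --project=MY_PROJECT \
    --title="Sarus Helm installer" \
    --permissions=container.roles.create,container.roleBindings.create

gcloud projects add-iam-policy-binding MY_PROJECT \
    --member=user:installer@example.com \
    --role=projects/MY_PROJECT/roles/sarusHelmInstaller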

After granting these permissions, you can install the app with helm install as in the local setup.

You need to set cloud: GKE in your values.yaml, then run:

RELEASE="foobar"; helm install -f values.yaml ${RELEASE} . --set sarus-registry-chart.enabled=true --set sarus-secrets-chart.enabled=true

Azure Cluster Setup

This section will guide you through the steps to set up a Kubernetes cluster on Azure using Azure Kubernetes Service (AKS).

Prerequisites

Login to Azure

Login to your Azure account through the CLI:

az login

For more details on Azure Kubernetes deployment, refer to:

Create a Resource Group

Resource groups are logical containers for resources deployed on Azure.

az group create --name sarusappResourceGroup2 --location westus3

Create AKS Cluster

Set up an AKS cluster:

az aks create \
    --resource-group sarusappResourceGroup2 \
    --name sarusappClustergpu \
    --node-count 1 \
    --node-vm-size standard_d5_v2_promo \
    --enable-addons monitoring

If you encounter errors related to microsoft.insights, register the necessary provider:

az provider register --namespace microsoft.insights

Create a GPU Node Pool

Add a GPU node pool:

az aks nodepool add \
    --resource-group sarusappResourceGroup2 \
    --cluster-name sarusappClustergpu \
    --name gpupool \
    --node-count 0 \
    --min-count 0 \
    --max-count 1 \
    --enable-cluster-autoscaler \
    --node-vm-size Standard_NC24ads_A100_v4 \
    --node-taints gpu=true:NoSchedule \
    --labels gpu-node=true \
    --no-wait

Configure kubectl for the AKS Cluster

To configure kubectl to interact with your AKS cluster:

az aks get-credentials --resource-group sarusappResourceGroup2 --name sarusappClustergpu

Apply NVIDIA Plugin

This will ensure that your AKS cluster can interact with NVIDIA GPUs.

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --patch='{"spec": {"template": {"spec": {"tolerations": [{"key": "gpu","operator": "Equal","value": "true","effect": "NoSchedule"}]}}}}'
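
As a quick sanity check (the daemonset name comes from the manifest above; GPU capacity will only appear once the autoscaler has actually provisioned a GPU node):

kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system
kubectl get nodes -l gpu-node=true -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'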

There is no need to add any user permissions for Azure.

Then you can install the app using the helm command. Make sure cloud: AKS is set in your values.yaml, then run:

RELEASE="foobar"; helm install -f values.yaml ${RELEASE} . --set sarus-registry-chart.enabled=true --set sarus-secrets-chart.enabled=true

AWS Cluster Setup (EKS)

We recommend using the tool eksctl to manage EKS resources. You first need to install and configure eksctl to create and manage your Kubernetes cluster.

We’ll also use the cluster name and region a lot, so we suggest starting with:

cluster_name=MY_CLUSTER
region=MY_REGION

Cluster and nodegroups creation

See https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html

Create an EKS cluster with:

eksctl create cluster --name $cluster_name --region $region --version 1.28 --without-nodegroup

Then add nodes to your cluster by creating nodegroups:

eksctl create nodegroup \
  --cluster $cluster_name \
  --region $region \
  --name sarus-main \
  --node-ami-family Ubuntu2004 \
  --node-type m5.xlarge \
  --nodes 1 \
  --nodes-min 1 \
  --nodes-max 2

If you want to use a GPU, add another nodegroup (it’s easier to have a separate one, as you can activate or deactivate it to reduce costs).

WARNING: the instance type used in the config below (g5.2xlarge) comes with an A10G GPU and is a bit expensive.

Save this config file as gpu-nodegroup.yaml:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: MYCLUSTER
  region: MYREGION
  version: "1.28"
managedNodeGroups:
  - name: sarus-gpu
    instanceType: g5.2xlarge
    amiFamily: AmazonLinux2
    desiredCapacity: 1
    minSize: 1
    maxSize: 2
    spot: true
    labels: {
      gpu-node: "true"
    }
    taints:
      - {
          "key": "gpu",
          "value": "true",
          "effect": "NoSchedule"
        }

Then run:

eksctl create nodegroup --config-file=gpu-nodegroup.yaml
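
You can then list the nodegroups attached to the cluster to check that both were created:

eksctl get nodegroup --cluster $cluster_name --region $region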

Apply NVIDIA Plugin

This will ensure that your EKS cluster can interact with NVIDIA GPUs.

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --patch='{"spec": {"template": {"spec": {"tolerations": [{"key": "gpu","operator": "Equal","value": "true","effect": "NoSchedule"}]}}}}'

EFS file system creation

Create an EFS file system. Make sure it’s on the same VPC as the EKS cluster.

Follow the steps described at https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/docs/efs-create-filesystem.md

Note the file system ID; you’ll need it later in the values.yaml you use to install Sarus.

Don’t forget to create the mount targets.
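
As a minimal sketch of the same steps with the AWS CLI (the subnet and security group IDs are placeholders; use subnets from the EKS cluster’s VPC and a security group that allows NFS traffic on port 2049 from the nodes):

file_system_id=$(aws efs create-file-system \
    --region $region \
    --tags Key=Name,Value=sarus-efs \
    --query 'FileSystemId' --output text)

aws efs create-mount-target \
    --region $region \
    --file-system-id $file_system_id \
    --subnet-id subnet-0123456789abcdef0 \
    --security-groups sg-0123456789abcdef0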

Cluster setup: OIDC and EFS driver

Install OIDC provider

See https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html

oidc_id=$(aws eks describe-cluster --name $cluster_name  --region $region --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
aws iam list-open-id-connect-providers | grep $oidc_id | cut -d "/" -f4
eksctl utils associate-iam-oidc-provider --cluster $cluster_name --approve --region $region

Install EBS CSI driver

Now, with eksctl in place, create the Amazon EBS CSI driver IAM role:

eksctl create iamserviceaccount \
  --region $region \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster $cluster_name \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve \
  --role-only \
  --role-name AmazonEKS_EBS_CSI_DriverRole

AWS maintains a managed policy we can simply use (arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy). Only if you use encrypted EBS drives do you need to add extra configuration to the policy.

The command deploys an AWS CloudFormation stack that creates an IAM role, attaches the IAM policy to it, and annotates the existing ebs-csi-controller-sa service account with the Amazon Resource Name (ARN) of the IAM role.

Now we can finally add the EBS CSI add-on. For this we also need the AWS account ID, which we can obtain by running aws sts get-caller-identity --query Account --output text. The eksctl create addon command looks like this:

eksctl create addon --name aws-ebs-csi-driver \
--cluster $cluster_name \
--region $region \
--service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole --force

Install EFS CSI driver

The full documentation is here: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html

First, you need to create a role for the EFS CSI driver:

export role_name=AmazonEKS_EFS_CSI_DriverRole
eksctl create iamserviceaccount \
    --name efs-csi-controller-sa \
    --namespace kube-system \
    --cluster $cluster_name \
    --role-name $role_name \
    --role-only \
    --region $region \
    --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEFSCSIDriverPolicy \
    --approve
TRUST_POLICY=$(aws iam get-role --role-name $role_name --query 'Role.AssumeRolePolicyDocument' | \
    sed -e 's/efs-csi-controller-sa/efs-csi-*/' -e 's/StringEquals/StringLike/')
aws iam update-assume-role-policy --role-name $role_name --policy-document "$TRUST_POLICY"

Then you need to actually install the driver. We recommend using an add-on. You can do this with eksctl or in the AWS console.

eksctl create addon --name aws-efs-csi-driver \
--cluster $cluster_name \
--region $region \
--version latest \
--service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EFS_CSI_DriverRole --force
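
To confirm that both the EBS and EFS CSI add-ons are active:

eksctl get addon --cluster $cluster_name --region $region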

Install Sarus with Helm

See Installation on Kubernetes.

OVH Cluster Setup

Setting up a Kubernetes cluster on OVH requires an account with the OVH Public Cloud.

Cluster Creation

  1. Creating a Cluster: Log into your OVH Public Cloud dashboard and navigate to the Kubernetes section. Create a cluster with at least one node dedicated to running your application. To have GPU support, it is recommended to create the cluster in the Gravelines region (GRA7).

  2. Adding a GPU Node Pool: Add a node pool with GPU support to your cluster and enable autoscaling. This ensures that you can dynamically allocate resources based on the workload.

The best way to add a node pool with the right taint and label is to use the OVH API (the UI does not allow adding taints and labels) and to pass a configuration like the one below:

{
  "antiAffinity": false,
  "autoscale": true,
  "desiredNodes": 0,
  "flavorName": "t1-45",
  "maxNodes": 1,
  "minNodes": 0,
  "monthlyBilled": true,
  "template": {
    "metadata": {
      "annotations": {
        "my-annotation": "my-value"
      },
      "finalizers": [],
      "labels": {
        "gpu-node": "true"
      }
    },
    "spec": {
      "taints": [
        {
          "effect": "NoSchedule",
          "key": "gpu",
          "value": "true"
        }
      ],
      "unschedulable": false
    }
  }
}

Configuration

  1. Download Kubeconfig: Once the cluster is set up, download the Kubeconfig file from the OVH UI.

export KUBECONFIG=/path/to/your/downloaded/kubeconfig.yml

  2. Set up NFS:

Complete the nfs part of values.yaml.

Add the nfs-subdir-external-provisioner Helm repository and install the NFS provisioner:

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner -n kube-system nfs-subdir-external-provisioner/nfs-subdir-external-provisioner -f values.yaml

To configure the NFS, consult the guide in OVH’s official documentation.
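
For reference, the provisioner chart reads the NFS server address and export path from its standard nfs.server and nfs.path values; an equivalent install using --set with placeholder values would look like this:

helm install nfs-subdir-external-provisioner -n kube-system nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --set nfs.server=NFS_SERVER_IP \
    --set nfs.path=/EXPORT/PATH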

It is important to add the main node to the NFS access control list (ACL). You can get its IP address with:

kubectl get nodes -o jsonpath='{ $.items[*].status.addresses[?(@.type=="InternalIP")].address }'

  3. Set up the GPU Operator: Add NVIDIA’s Helm repository and install the GPU operator:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --wait

  4. Provide OVH Credentials: Complete the ovhCredentials part of values.yaml.

Generate and provide OVH credentials for your application. You can generate these tokens at OVH’s API token generation page. Ensure you create a token with permissions for both POST and GET requests. For better security, you can restrict the token to /dedicated/nasha/ rights.

Running the Application

Make sure cloud: OVH is set in your values.yaml.

  1. Install the Application via Helm:

RELEASE="foobar"; helm install -f values.yaml ${RELEASE} .  --set sarus-registry-chart.enabled=true --set sarus-secrets-chart.enabled=true

  2. Monitor Deployment: Check the status of your pods to ensure everything is running smoothly:

kubectl get pods

  3. Access the Application: To get the URL of your deployed application:

kubectl get service

Look for the EXTERNAL-IP column corresponding to nginx or your ingress service. That’s the IP you can use to access your application.
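
If you prefer a one-liner (nginx is a placeholder for whichever service exposes your app):

kubectl get service nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}'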

Note: upscaling on OVH takes longer than on other clouds. It can take a few hours depending on the time of day.

Clean Up

Remember to delete or scale down resources when they’re not in use to avoid unnecessary costs.