Setting up a Kubernetes cluster for Sarus
GCP Cluster Setup
Prerequisites
Install GCP SDK: If you haven’t installed the GCP SDK, do so by following the official guide.
Authentication: Authenticate your account with GCP.
gcloud auth login
Cluster Creation
Let’s look at two ways to create a cluster. To launch the app with a GPU, use the second one.
## Create a simple two-node cluster (one node dedicated to the job)
gcloud container clusters create sarusapp --zone=europe-west1 --machine-type=n1-standard-2 --num-nodes=2
We also need to taint and label the worker node that will run the jobs:
kubectl taint nodes <name-of-Worker-node> gpu=true:NoSchedule
kubectl label nodes <name-of-Worker-node> gpu-node=true
Create the cluster with a GPU node pool:
gcloud container clusters create sarusappwithgpu --zone=europe-west2-a --machine-type=n2-highmem-4 --num-nodes=1 --addons=GcpFilestoreCsiDriver
gcloud container node-pools create gpu-node-pool --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=default --machine-type n2-highmem-2 --region europe-west2-a --cluster sarusappwithgpu --node-locations europe-west2-a --num-nodes 0 --min-nodes 0 --max-nodes 1 --enable-autoscaling --node-taints=gpu=true:NoSchedule --node-labels=gpu-node=true
Remarks:
- --addons=GcpFilestoreCsiDriver enables ReadWriteMany storage (shared storage between nodes).
- --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=default specifies the type of GPU.
- --min-nodes 0 --max-nodes 1 --enable-autoscaling lets the cluster provision a GPU node only when needed.
- --node-taints=gpu=true:NoSchedule --node-labels=gpu-node=true reserves this node pool for GPU jobs, keeping the Sarus app itself off it.
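For reference, this is what the taint and label achieve: only workloads that both tolerate the gpu taint and select the gpu-node label can land on this pool. The Sarus chart sets this up for its GPU jobs; the pod below is only a hedged scheduling smoke test (the pod name and image are placeholders, not part of Sarus):

```bash
# Hypothetical test pod: it tolerates the gpu=true:NoSchedule taint, selects
# gpu-node=true and requests one GPU, so the autoscaler should bring up a node
# from the GPU pool (this can take a few minutes) and schedule the pod there.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-schedule-test
spec:
  restartPolicy: Never
  nodeSelector:
    gpu-node: "true"
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: test
      image: busybox
      command: ["sh", "-c", "echo running on a GPU node; sleep 30"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Clean up the test pod afterwards
kubectl delete pod gpu-schedule-test
```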
Cluster Access and Configuration
Access Configuration: Follow the GCP tutorial to configure kubectl for the new cluster. To set up cluster credentials, run:
gcloud container clusters get-credentials sarusappwithgpu --region=europe-west2-a
Test your setup with:
kubectl get nodes
You should see the nodes listed, confirming that they are part of your Google Kubernetes Engine (GKE) cluster.
User Permissions
Assign Roles: Assign the container.roles.create and container.roleBindings.create permissions to the user who will be running the Helm install. These permissions enable the application of specific Kubernetes YAML files (rolebinding.yaml and job-role.yaml) and allow one pod to execute jobs on other nodes.
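One way to grant them is sketched below with gcloud (the project ID, user email, and custom role name are placeholders; a broader predefined role such as roles/container.admin would also cover these permissions):

```bash
# Hypothetical custom role bundling the two permissions needed for the Helm install
gcloud iam roles create sarusHelmInstaller \
  --project=MY_PROJECT \
  --permissions=container.roles.create,container.roleBindings.create

# Bind the custom role to the user who will run helm install
gcloud projects add-iam-policy-binding MY_PROJECT \
  --member=user:someone@example.com \
  --role=projects/MY_PROJECT/roles/sarusHelmInstaller
```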
Once these permissions are granted, you can run the app with helm install as in the local setup. Make sure your values.yaml contains cloud: GKE:
RELEASE="foobar"; helm install -f values.yaml ${RELEASE} . --set sarus-registry-chart.enabled=true --set sarus-secrets-chart.enabled=true
Azure Cluster Setup
This section will guide you through the steps to set up a Kubernetes cluster on Azure using Azure Kubernetes Service (AKS).
Prerequisites
Install the Azure Command-Line Interface (CLI) by following the official guide.
Login to Azure
Login to your Azure account through the CLI:
az login
For more details on Azure Kubernetes deployment, refer to the official AKS documentation.
Create a Resource Group
Resource groups are logical containers for resources deployed on Azure.
az group create --name sarusappResourceGroup2 --location westus3
Create AKS Cluster
Set up an AKS cluster:
az aks create \
--resource-group sarusappResourceGroup2 \
--name sarusappClustergpu \
--node-count 1 \
--node-vm-size standard_d5_v2_promo \
--enable-addons monitoring
If you encounter errors related to microsoft.insights, register the necessary provider:
az provider register --namespace microsoft.insights
Create a GPU Node Pool
Add a GPU node pool:
az aks nodepool add \
--resource-group sarusappResourceGroup2 \
--cluster-name sarusappClustergpu \
--name gpupool \
--node-count 0 \
--min-count 0 \
--max-count 1 \
--enable-cluster-autoscaler \
--node-vm-size Standard_NC24ads_A100_v4 \
--node-taints gpu=true:NoSchedule \
--labels gpu-node=true \
--no-wait
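Once the node pool is created, you can optionally check that the taint, label, and autoscaler bounds were applied as expected; the JMESPath query below assumes the standard az output fields:

```bash
# Should show the gpu=true:NoSchedule taint, the gpu-node=true label and min/max of 0/1
az aks nodepool show \
  --resource-group sarusappResourceGroup2 \
  --cluster-name sarusappClustergpu \
  --name gpupool \
  --query '{taints: nodeTaints, labels: nodeLabels, min: minCount, max: maxCount}'
```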
Configure kubectl for the AKS Cluster
To configure kubectl to interact with your AKS cluster:
az aks get-credentials --resource-group sarusappResourceGroup2 --name sarusappClustergpu
Apply NVIDIA Plugin
This will ensure that your AKS cluster can interact with NVIDIA GPUs.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --patch='{"spec": {"template": {"spec": {"tolerations": [{"key": "gpu","operator": "Equal","value": "true","effect": "NoSchedule"}]}}}}'
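You can verify that the patch took effect and that the plugin pods run once a GPU node exists (the label selector assumes the default labels from the upstream manifest):

```bash
# Show the tolerations now present on the device-plugin DaemonSet
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system \
  -o jsonpath='{.spec.template.spec.tolerations}'

# Plugin pods appear once a GPU node from gpupool has been scaled up
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
```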
There is no need to add any user permissions for Azure.
Then you can install the app using the helm command. Make sure your values.yaml contains cloud: AKS:
RELEASE="foobar"; helm install -f values.yaml ${RELEASE} . --set sarus-registry-chart.enabled=true --set sarus-secrets-chart.enabled=true
AWS Cluster Setup (EKS)
We recommend using the tool eksctl to manage EKS resources. You first need to install and configure eksctl (and the AWS CLI) to create and manage your Kubernetes cluster.
We’ll use the cluster name and region a lot, so we suggest starting with:
cluster_name=MY_CLUSTER
region=MY_REGION
Cluster and nodegroups creation
See https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html
Create an EKS cluster with:
eksctl create cluster --name $cluster_name --region $region --version 1.28 --without-nodegroup
Then add nodes to your cluster by creating nodegroups:
eksctl create nodegroup \
--cluster $cluster_name \
--region $region \
--name sarus-main \
--node-ami-family Ubuntu2004 \
--node-type m5.xlarge \
--nodes 1 \
--nodes-min 1 \
--nodes-max 2
If you want to use a GPU, add another nodegroup (it’s easier to have a separate one, as you can scale it up or down to reduce costs; see the sketch after the nodegroup creation below).
WARNING: the instance type set below uses an A10 GPU and is a bit expensive.
Save this config file as gpu-nodegroup.yaml:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: MYCLUSTER
  region: MYREGION
  version: "1.28"
managedNodeGroups:
  - name: sarus-gpu
    instanceType: g5.2xlarge
    amiFamily: AmazonLinux2
    desiredCapacity: 1
    minSize: 1
    maxSize: 2
    spot: true
    labels:
      gpu-node: "true"
    taints:
      - key: gpu
        value: "true"
        effect: NoSchedule
Then run:
eksctl create nodegroup --config-file=gpu-nodegroup.yaml
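As mentioned above, keeping the GPU nodegroup separate makes it easy to cut costs when no GPU job is running; a sketch using eksctl scale nodegroup:

```bash
# Scale the GPU nodegroup to zero when idle...
eksctl scale nodegroup --cluster=$cluster_name --region=$region \
  --name=sarus-gpu --nodes=0 --nodes-min=0

# ...and bring it back before launching GPU jobs
eksctl scale nodegroup --cluster=$cluster_name --region=$region \
  --name=sarus-gpu --nodes=1 --nodes-min=1
```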
Apply NVIDIA Plugin
This will ensure that your EKS cluster can interact with NVIDIA GPUs.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
kubectl patch daemonset nvidia-device-plugin-daemonset -n kube-system --patch='{"spec": {"template": {"spec": {"tolerations": [{"key": "gpu","operator": "Equal","value": "true","effect": "NoSchedule"}]}}}}'
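Once a node from the sarus-gpu nodegroup is up, it should advertise the nvidia.com/gpu resource; a quick optional check:

```bash
# Nodes carrying the gpu-node=true label should list nvidia.com/gpu
# under Capacity and Allocatable.
kubectl describe nodes -l gpu-node=true | grep -i "nvidia.com/gpu"
```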
EFS file system creation
Create an EFS file system. Make sure it’s in the same VPC as the EKS cluster.
Follow the steps described here: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/docs/efs-create-filesystem.md
Note the file system ID; you’ll need it later in the values.yaml you’ll use to install Sarus.
Don’t forget to create the mount targets.
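If you prefer the CLI over the console steps in that guide, a minimal sketch looks like this (the subnet and security group IDs are placeholders; as the linked doc explains, the security group must allow NFS traffic on port 2049 from the cluster’s VPC):

```bash
# Create the file system and capture its ID (you will need it in values.yaml)
file_system_id=$(aws efs create-file-system \
  --region $region \
  --performance-mode generalPurpose \
  --query 'FileSystemId' --output text)
echo $file_system_id

# Create one mount target per subnet used by the cluster nodes
aws efs create-mount-target \
  --region $region \
  --file-system-id $file_system_id \
  --subnet-id subnet-0123456789abcdef0 \
  --security-groups sg-0123456789abcdef0
```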
Cluster setup: OIDC and EFS driver
Install OIDC provider
See https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html
oidc_id=$(aws eks describe-cluster --name $cluster_name --region $region --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
aws iam list-open-id-connect-providers | grep $oidc_id | cut -d "/" -f4
eksctl utils associate-iam-oidc-provider --cluster $cluster_name --approve --region $region
Install EBS CSI driver
Create the Amazon EBS CSI driver IAM role. With eksctl in place, create the role:
eksctl create iamserviceaccount \
--region $region \
--name ebs-csi-controller-sa \
--namespace kube-system \
--cluster $cluster_name \
--attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
--approve \
--role-only \
--role-name AmazonEKS_EBS_CSI_DriverRole
AWS maintains a managed policy that we can simply use (ARN arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy). Only if you use encrypted EBS drives do you need to add extra configuration to the policy.
The command deploys an AWS CloudFormation stack that creates the IAM role and attaches the IAM policy to it; the EBS CSI add-on installed below then uses this role through the ebs-csi-controller-sa service account.
Now we can finally add the EBS CSI add-on. For this we also need the AWS account ID, which we can obtain by running aws sts get-caller-identity --query Account --output text. The eksctl create addon command looks like this:
eksctl create addon --name aws-ebs-csi-driver \
--cluster $cluster_name \
--region $region \
--service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole --force
Install EFS CSI driver
The full doc is there: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html
First, you need to create a role for the EFS CSI driver:
export role_name=AmazonEKS_EFS_CSI_DriverRole
eksctl create iamserviceaccount \
--name efs-csi-controller-sa \
--namespace kube-system \
--cluster $cluster_name \
--role-name $role_name \
--role-only \
--region $region \
--attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEFSCSIDriverPolicy \
--approve
TRUST_POLICY=$(aws iam get-role --role-name $role_name --query 'Role.AssumeRolePolicyDocument' | \
sed -e 's/efs-csi-controller-sa/efs-csi-*/' -e 's/StringEquals/StringLike/')
aws iam update-assume-role-policy --role-name $role_name --policy-document "$TRUST_POLICY"
Then you need to actually install the driver. We recommend using an add-on, either with eksctl or in the AWS console.
eksctl create addon --name aws-efs-csi-driver --cluster $cluster_name --region $region --version latest --service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EFS_CSI_DriverRole --force
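If you want to check the EFS driver independently of the Sarus chart, a dynamic-provisioning StorageClass pointing at the file system created earlier looks roughly like this (the fileSystemId is a placeholder; the chart’s values.yaml normally wires this up for you):

```bash
cat <<'EOF' | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap              # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0    # replace with your file system ID
  directoryPerms: "700"
EOF
```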
Install Sarus with Helm
See Installation on Kubernetes.
OVH Cluster Setup
The best way to add a node pool with the right taint and label is to use the OVH API (the UI does not allow adding taints and labels) and to submit a configuration like the one below:
bash { "antiAffinity": false, "autoscale": true, "desiredNodes": 0, "flavorName": "t1-45", "maxNodes": 1, "minNodes": 0, "monthlyBilled": true, "template": { "metadata": { "annotations": { "my-annotation": "my-value" }, "finalizers": [], "labels": { "gpu-node": "true" } }, "spec": { "taints": [ { "effect": "NoSchedule", "key": "gpu", "value": "true" } ], "unschedulable": false } } }
Configuration
Download Kubeconfig: Once the cluster is set up, download the Kubeconfig file from the OVH UI.
export KUBECONFIG=/path/to/your/downloaded/kubeconfig.yml
Set up NFS:
Complete the nfs section of your values file.
Add the nfs-subdir-external-provisioner Helm repository and install the NFS provisioner:
```bash
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner -n kube-system nfs-subdir-external-provisioner/nfs-subdir-external-provisioner -f values.yaml
```
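After the install you should see the provisioner’s storage class (named nfs-client by default, unless overridden in the values) and its pod:

```bash
kubectl get storageclass
kubectl get pods -n kube-system -l app=nfs-subdir-external-provisioner
```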
To configure the NFS, consult the guide in OVH’s official documentation.
It is important to add the main node to the access control list (Manage ACL). You can get the node IP addresses with:
kubectl get nodes -o jsonpath='{ $.items[*].status.addresses[?(@.type=="InternalIP")].address }'
Setup GPU Operator: Add NVIDIA’s Helm repository and install the GPU operator:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --wait
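The operator, driver, and toolkit pods can take several minutes to come up; you can check them with:

```bash
kubectl get pods -n gpu-operator
```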
Provide OVH Credentials: Complete the ovhCredentials section of your values file.
Generate and provide OVH credentials for your application. You can generate these tokens at OVH’s API token generation page. Ensure you create a token with permissions for both POST and GET requests. For better security, you can restrict the token to /dedicated/nasha/ rights.
Running the Application
Make sure your values.yaml contains cloud: OVH.
Install the Application via Helm:
RELEASE="foobar"; helm install -f values.yaml ${RELEASE} . --set sarus-registry-chart.enabled=true --set sarus-secrets-chart.enabled=true
Monitor Deployment: Check the status of your pods to ensure everything is running smoothly:
kubectl get pods
Access the Application: To get the URL of your deployed application:
kubectl get service
Look for the EXTERNAL-IP column corresponding to nginx or your ingress service. That’s the IP you can use to access your application.
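If you prefer a one-liner (assuming the service exposing the app is named nginx, which may differ in your release):

```bash
kubectl get service nginx -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```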
Note: upscaling on OVH takes longer than on other clouds. It can take a few hours depending on the time of day.
### Clean Up
Remember to delete or scale down resources when they’re not in use to avoid unnecessary costs.
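For example, to remove the clusters created in this guide (pick the command matching your cloud; OVH clusters are deleted from the control panel or API):

```bash
# GKE
gcloud container clusters delete sarusappwithgpu --zone=europe-west2-a

# AKS (removes the whole resource group, including the cluster)
az group delete --name sarusappResourceGroup2

# EKS
eksctl delete cluster --name $cluster_name --region $region
```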