
AI Infrastructure Engineer
Required Skills
NVIDIA GPU systems (A100/H100)
CUDA stack
Linux
Containers
Kubernetes
OpenShift
Red Hat OpenShift AI (RHODS)
Terraform
Ansible
GitOps
ArgoCD
MLflow
Kubeflow
Airflow
Distributed training frameworks (DDP, DeepSpeed)
LLM hosting (Llama, Mistral, Falcon)
GenAI
RAG architecture
Vector databases (Milvus, pgvector)
Cloud infrastructure
DevOps
Multi-cloud
Hybrid cloud
High-performance storage
High-performance networking
Infrastructure as Code
MLOps
Documentation
Written communication
Troubleshooting
Optimization
Enterprise AI adoption
Government sector experience
Asynchronous team communication
Job Description
Job Title: AI Infrastructure Engineer
Job Type: Full-time
Location: On-site, Sharjah, United Arab Emirates
Job Summary
Join our team as an AI Infrastructure Engineer and help architect, build, and operate the next generation of high-performance AI platforms. You will play a pivotal role in supporting advanced AI workloads—including LLMs, GenAI, Computer Vision, and MLOps—powering 200+ government entities through SDD’s Sovereign Cloud and hybrid/multi-cloud environments. If you are passionate about designing robust GPU clusters and enabling enterprise-grade AI adoption, this is an exciting opportunity to make a real impact.
Key Responsibilities
- Design, implement, and optimize GPU-based compute clusters (NVIDIA A100/H100) to deliver scalable, high-availability AI infrastructure.
- Architect reference platforms for LLM hosting, vector databases, MLOps, and high-performance storage/networking.
- Deploy, configure, and maintain Red Hat OpenShift AI (RHODS) for secure, multi-tenant AI workloads, with seamless GPU orchestration.
- Enable and support enterprise-level AI adoption, including onboarding and serving open-source LLMs (Llama, Falcon, Mistral) and RAG pipelines.
- Implement Infrastructure-as-Code (Terraform, Ansible) and GitOps for automation, scaling, and lifecycle management of the AI platform.
- Develop and maintain robust MLOps pipelines (MLflow, Kubeflow) to manage data preparation, model training, evaluation, and inference.
- Document architectures and operational runbooks, and deliver clear, comprehensive written communication in an asynchronous team environment.
Required Skills and Qualifications
- 7–12 years of hands-on experience in Cloud Infrastructure, DevOps, ML Infrastructure, or Platform Engineering roles.
- Expertise in NVIDIA GPU systems (A100/H100), CUDA stack, Linux, containers, and Kubernetes/OpenShift-based orchestration.
- Demonstrated experience with OpenShift AI (RHODS) and multi-cloud/hybrid AI workloads.
- Strong background in LLM hosting (Llama, Mistral, Falcon), GenAI, and RAG architecture using vector databases (Milvus, pgvector).
- Advanced skills in MLOps stacks and pipelines (MLflow, Kubeflow, Airflow) and distributed training frameworks (DDP, DeepSpeed).
- Proficiency with Infrastructure-as-Code tools (Terraform, Ansible) and GitOps methods (ArgoCD, etc.).
- Exceptional troubleshooting skills, an optimization mindset, and a preference for clear, detailed written communication in an async culture.
Preferred Qualifications
- NVIDIA Deep Learning/AI Infrastructure Certification, Red Hat OpenShift AI specialization, or Kubernetes CKA/CKAD.
- Relevant certifications in Azure AI, Oracle Cloud AI, Terraform, or Ansible.
- Experience supporting enterprise AI adoption at scale across government or highly regulated sectors.