
AI Infrastructure Engineer
Required Skills
NVIDIA GPU systems (A100/H100)
CUDA stack
Linux
Containers
Kubernetes
OpenShift
Red Hat OpenShift AI (RHODS)
Terraform
Ansible
GitOps
ArgoCD
MLflow
Kubeflow
Airflow
Distributed training frameworks (DDP, DeepSpeed)
LLM hosting (Llama, Mistral, Falcon)
GenAI
RAG architecture
Vector databases (Milvus, pgvector)
Cloud infrastructure
DevOps
Multi-cloud
Hybrid cloud
High-performance storage
High-performance networking
Infrastructure as Code
MLOps
Documentation
Written communication
Troubleshooting
Optimization
Enterprise AI adoption
Government sector experience
Asynchronous team communication
Job Description
Job Title: AI Infrastructure Engineer
Job Type: Full-time
Location: On-site, Sharjah, United Arab Emirates
Job Summary
Join our team as an AI Infrastructure Engineer and help architect, build, and operate the next generation of high-performance AI platforms. You will play a pivotal role in supporting advanced AI workloads—including LLMs, GenAI, Computer Vision, and MLOps—powering 200+ government entities through SDD’s Sovereign Cloud and hybrid/multi-cloud environments. If you are passionate about designing robust GPU clusters and enabling enterprise-grade AI adoption, this is an exciting opportunity to make a real impact.
Key Responsibilities
- Design, implement, and optimize GPU-based compute clusters (NVIDIA A100/H100) to deliver scalable, high-availability AI infrastructure.
- Architect reference platforms for LLM hosting, vector databases, MLOps, and high-performance storage/networking.
- Deploy, configure, and maintain Red Hat OpenShift AI (RHODS) for secure, multi-tenant AI workloads, with seamless GPU orchestration.
- Enable and support enterprise-level AI adoption, including onboarding and serving open-source LLMs (Llama, Falcon, Mistral) and RAG pipelines.
- Implement Infrastructure-as-Code (Terraform, Ansible) and GitOps for automation, scaling, and lifecycle management of the AI platform.
- Develop and maintain robust MLOps pipelines (MLflow, Kubeflow) to manage data preparation, model training, evaluation, and inference.
- Document architectures and operational runbooks, and deliver clear, comprehensive written communication in an asynchronous team environment.
Required Skills and Qualifications
- 7–12 years of hands-on experience in Cloud Infrastructure, DevOps, ML Infrastructure, or Platform Engineering roles.
- Expertise in NVIDIA GPU systems (A100/H100), CUDA stack, Linux, containers, and Kubernetes/OpenShift-based orchestration.
- Demonstrated experience with OpenShift AI (RHODS) and multi-cloud/hybrid AI workloads.
- Strong background in LLM hosting (Llama, Mistral, Falcon), GenAI, and RAG architecture using vector databases (Milvus, pgvector).
- Advanced skills in MLOps stacks and pipelines (MLflow, Kubeflow, Airflow) and distributed training frameworks (DDP, DeepSpeed).
- Proficiency with Infrastructure-as-Code tools (Terraform, Ansible) and GitOps methods (ArgoCD, etc.).
- Exceptional troubleshooting skills, an optimization mindset, and a preference for clear, detailed written communication in an async culture.
Preferred Qualifications
- NVIDIA Deep Learning/AI Infrastructure Certification, Red Hat OpenShift AI specialization, or Kubernetes CKA/CKAD.
- Relevant certifications in Azure AI, Oracle Cloud AI, Terraform, or Ansible.
- Experience supporting enterprise AI adoption at scale across government or highly regulated sectors.