Job Title: Site Reliability Engineer

Job Type: Contractor

Location: Remote

Job Summary: In this role, you'll apply your expertise to help train next-generation AI systems. Your work will shape how models learn, reason, and perform through high-quality, real-world input. No prior experience in AI is required — your domain knowledge is what matters.

As an expert, you will create reinforcement learning (RL) environments that test an AI model’s ability to solve complex software engineering workflows. These workflows are similar in scope to common DevOps, CI/CD, and debugging workflows and use common CLI tools such as git, docker, gdb, ASan, ffmpeg, and many more. Your task will be to create reproducible RL environments that test a model’s ability to solve these workflows, each accompanied by a golden reference solution.

Key Responsibilities:

• Lead the deployment, monitoring, and recovery of complex, containerized AI training environments using advanced terminal techniques.

• Proactively identify, diagnose, and resolve infrastructure bottlenecks and failures in long-running processes.

• Orchestrate resilient system builds and infrastructure management, ensuring stability and optimal resource utilization.

• Collaborate closely with engineering teams to refine CI/CD pipelines and automate routine operational tasks.

• Manage and optimize filesystem structures, networked storage, and process scheduling in Dockerized sandboxes.

• Conduct rapid mid-execution replanning during error states and unforeseen runtime issues.

• Document best practices and emergent solutions, and contribute to knowledge transfer across the team.

Required Skills and Qualifications:

• Demonstrated expert proficiency with terminal-based problem solving and complex system administration.

• Mastery of dynamic infrastructure recovery and long-running operational process management.

• Deep expertise in containerized environments (e.g., Docker, Kubernetes) and sandbox orchestration.

• Strong Python skills, with the ability to script, automate, and debug real-world production systems.

• Proficiency in Bash and familiarity with JavaScript/TypeScript, Go, Rust, and C/C++.

• Experience with build systems, package managers, databases, version control, and cryptography tools.

• Adept at troubleshooting, documenting, and replanning in high-velocity technical environments.

Preferred Qualifications:

• Background in machine learning operations or AI infrastructure.

• Familiarity with ML frameworks and distributed computing.

• Experience supporting multi-phase, high-intensity engineering projects.

Compensation Structure:

Compensation is output-based: experts are paid per task that meets the project specifications. The time required to complete work may vary depending on the expert’s experience and workflow. Experts must submit a minimum number of tasks per week.

Start Timeline & Availability:

We typically fill roles within 48 hours and are looking for experts ready to jump in right away. If selected, we expect you to start your first tasks within 24–48 hours of completing onboarding.