Site Reliability Engineer

Pay: $40 - $70/hour

Required Skills

Terminal-Native Problem Solving
Dynamic Infrastructure Recovery
Containerized Environment Mastery
Python

About micro1
micro1 is a data engine that helps AI labs train foundational models and enterprises build AI agents. We provide frontier evaluations and reinforcement learning environments used to improve LLM capabilities, as well as contextual evaluations used to monitor and improve AI agents in enterprise settings. Our data engine includes an AI recruiter agent that sources and vets domain experts, a data platform that enables rapid production of high-quality training data, and a pipeline performance system that ensures both quality and velocity.
Our goal is to have 1 billion people doing meaningful work by contributing their expertise to the development of frontier AI models. We’ve raised $40M+ in funding, and our AI recruiter has powered more than 1 million AI-led interviews as our global network of experts expands to form the human intelligence layer for AGI.

Job Description

Job Title: Site Reliability Engineer


Job Type: Contractor


Location: Remote


Job Summary:

Join our customer's team as a Site Reliability Engineer on a specialized, high-intensity project focused on training and optimizing AI models in containerized infrastructure. The work is terminal-intensive and calls for a systems-first approach, real-time troubleshooting, and dynamic process recovery. Standout performers may see the engagement extended or move into later phases of the project.


Key Responsibilities:

• Lead the deployment, monitoring, and recovery of complex, containerized AI training environments using advanced terminal techniques.

• Proactively identify, diagnose, and resolve infrastructure bottlenecks and failures in long-running processes.

• Orchestrate resilient system builds and infrastructure management, ensuring stability and optimal resource utilization.

• Collaborate closely with engineering teams to refine CI/CD pipelines and automate routine operational tasks.

• Manage and optimize filesystem structures, networked storage, and process scheduling in Dockerized sandboxes.

• Conduct rapid mid-execution replanning during error states and unforeseen runtime issues.

• Document best practices and emergent solutions, and contribute to knowledge transfer across the team.


Required Skills and Qualifications:

• Demonstrated expert proficiency with terminal-based problem solving and complex system administration.

• Mastery of dynamic infrastructure recovery and long-running operational process management.

• Deep expertise in containerized environments (e.g., Docker, Kubernetes) and sandbox orchestration.

• Strong Python skills, with the ability to script, automate, and debug real-world production systems.

• Proficiency in Bash and familiarity with JavaScript/TypeScript, Go, Rust, and C/C++.

• Experience with build systems, package managers, databases, version control, and cryptography tools.

• Adept at troubleshooting, documenting, and replanning in high-velocity technical environments.


Preferred Qualifications:

• Background in machine learning operations or AI infrastructure.

• Familiarity with ML frameworks and distributed computing.

• Experience supporting multi-phase, high-intensity engineering projects.

Please note that after completing the interview process, you’ll be added to our talent pool and considered for this and other roles that match your skills.
