Site Reliability Engineer (LInE)

Required Skills

Linux
Kubernetes
Prometheus

About micro1

micro1 connects domain experts to the development of frontier AI models. Real-world expertise is turned into training data, evaluations, and feedback loops that improve how models perform. AI labs and enterprises use micro1 to train models and build reliable AI agents through advanced evaluations and reinforcement learning environments. Experts contribute directly to how AI systems learn, reason, and perform across domains like finance, healthcare, engineering, and more. Our platform identifies and vets top talent through an AI recruiter, enabling high-quality contributions at scale.
Our goal is to enable 1 billion people to do meaningful work by applying their expertise to AI. We’ve raised $40M+ in funding, and our AI recruiter has powered over 1 million AI-led interviews as our global network of experts grows into the human intelligence layer for AI.

Job Description

Job Title: Site Reliability Engineer


Job Type: Contractor


Location: Remote


Job Summary:

Join our customer's team as an expert Site Reliability Engineer and play a pivotal role in ensuring the performance, reliability, and scalability of mission-critical infrastructure. You'll leverage your deep expertise in Linux, Kubernetes, and Prometheus to architect, monitor, and enhance robust systems supporting innovative applications.


Key Responsibilities:

  1. Design, implement, and maintain scalable infrastructure on Linux and Kubernetes, with Prometheus-based monitoring.
  2. Monitor system health, analyze performance metrics, and proactively address bottlenecks or potential failures.
  3. Automate operational processes to minimize manual intervention and increase system reliability.
  4. Respond swiftly to incidents, conduct root cause analysis, and drive continuous improvements in incident response procedures.
  5. Collaborate closely with development and operations teams to deliver seamless deployments and high system availability.
  6. Create comprehensive documentation and clear runbooks for operational excellence and knowledge sharing.
  7. Champion best practices in SRE, security, and compliance across the customer's ecosystem.


Required Skills and Qualifications:

  1. Expert-level hands-on experience with Linux system administration and troubleshooting.
  2. Advanced proficiency with Kubernetes, including cluster deployment, operations, and management.
  3. Deep knowledge of Prometheus for monitoring, metrics collection, and alerting.
  4. Strong scripting abilities (Bash, Python, or similar) for automation and tooling.
  5. Excellent written and verbal communication skills, with the ability to document and share knowledge effectively.
  6. Proven track record in site reliability engineering or similar roles in high-availability environments.
  7. Demonstrated commitment to proactive problem-solving and collaborative teamwork.


Preferred Qualifications:

  1. Experience with other cloud-native tools (e.g., Grafana, Helm, Istio, or similar).
  2. Certifications in Kubernetes, Linux, or cloud platforms.
  3. Background in high-growth or large-scale production environments.

Please note that after completing the interview process, you’ll be added to our talent pool and considered for this and other roles that match your skills.
