Open Role
AI Infra Engineer
at Perplexity
San Francisco, CA · Posted Jul 3
About the role
We are looking for an AI Infrastructure Engineer to join our growing team. We work with Kubernetes, Slurm, Python, C++, and PyTorch, primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.
Responsibilities
• Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
• Manage and optimize Slurm-based HPC environments for distributed training of large language models
• Develop robust APIs and orchestration systems for both training pipelines and inference services
• Implement resource scheduling and job management systems across heterogeneous compute environments
• Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
• Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
• Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
• Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands
Qualifications
• Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
• Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
• Experience deploying and managing distributed training systems at scale
• Deep understanding of container orchestration and distributed systems architecture
• High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-/Grouped-Query Attention, distributed training strategies)
• Experience managing GPU clusters and optimizing compute resource utilization
Required Skills
• Expert-level Kubernetes administration and YAML configuration management
• Proficiency with Slurm job scheduling, resource management, and cluster configuration
• Python and C++ programming with a focus on systems and infrastructure automation
• Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
• Strong understanding of networking, storage, and compute resource management for ML workloads
• Experience developing APIs and managing distributed systems for both batch and real-time workloads
• Solid debugging and monitoring skills with expertise in observability tools for containerized environments
Preferred Skills
• Experience with Kubernetes operators and custom controllers for ML workloads
• Advanced Slurm administration, including multi-cluster federation and advanced scheduling policies
• Familiarity with GPU cluster management and CUDA optimization
• Experience with other ML frameworks such as TensorFlow or distributed training libraries
• Background in HPC environments, parallel computing, and high-performance networking
• Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
• Experience with container registries, image optimization, and multi-stage builds for ML workloads
Required Experience
• Demonstrated experience managing large-scale Kubernetes deployments in production environments
• Proven track record with Slurm cluster administration and HPC workload management
• Previous roles in SRE, DevOps, or Platform Engineering with a focus on ML infrastructure
• Experience supporting both long-running training jobs and high-availability inference services
• Ideally, 3-5 years of relevant experience in ML systems deployment with a specific focus on cluster orchestration and resource management
About Perplexity

Redefines AI-powered search.
- HQ: San Francisco, CA
- Stage: Series C+
- Total Raised: $2.2B
- Employees: 1,001-5,000
- Founded: 2022