Staff Software Engineer, Compute
Job Description
You are an experienced software engineer who thrives on building large-scale computation platforms. You have deep expertise in backend systems that orchestrate workloads and route requests efficiently while respecting capacity and resource constraints. You possess a strong understanding of foundational cloud infrastructure and of Linux provisioning and management tools. You know how to achieve reliability and scale with minimal operational load.
Key Responsibilities:
Develop and maintain our core Python platform, which handles request routing, orchestration of AI workloads, GPU server capacity management, observability, authentication, rate limiting, and more
Develop and maintain our infrastructure layer where we use Terraform, Ansible, and provider APIs to manage our fleet of GPU workers
Own K8s, FluxCD, Nomad, Prometheus, Thanos, Grafana, Loki, distributed networking storage, and other technologies that underpin our platform
Create the vision and lay the foundation for where our infrastructure should go over the next one, two, and five years
Requirements:
Deep experience building distributed compute platforms, preferably with Python
Strong foundation in managing both cloud and bare metal infrastructure
Solid understanding of K8s and of running CI/CD pipelines on it
Excellent communication
Self-starter who executes quickly, takes ownership and constantly seeks improvement
Location:
San Francisco, CA
What we offer at fal
Interesting and challenging work
Base salary of $180,000 to $250,000, plus equity
Employee-friendly equity terms (early exercise, extended exercise)
A lot of learning and growth opportunities
We are currently hiring in downtown San Francisco. We prefer to work in person, but we also offer remote work opportunities for exceptional candidates.
We offer visa sponsorship and will help you relocate to San Francisco.
Health, dental, and vision insurance (US)
Regular team events and offsites
Company Information
Location: Overland Park, Kansas, United States
Type: Hybrid