Director of AI Infrastructure, Seattle, WA

Allen Institute for AI (Ai2)
🗓 Posted 2026-06-04

Role Overview
Ai2 is a non-profit research institute at the forefront of open-source AI development. Unlike industry peers, our goal is to share our findings, data, code, and models with the global scientific community. We are seeking a Director of AI Infrastructure to oversee the systems that power our research. This leader will be responsible for the full lifecycle of our high-performance computing (HPC) environment, which includes on-prem GPU clusters and the software orchestration layer that schedules workloads across a hybrid cloud environment.

Responsibilities
The essential functions include, but are not limited to, the following:
- Cluster Management: Oversee the availability and performance of dense on-prem GPU clusters. You will partner with hardware vendors and internal teams to ensure our physical infrastructure meets the demands of frontier model training.
- Orchestration & Scheduling: Direct the strategy for Beaker, our internal orchestration platform. Your goal is to optimize job scheduling, ensuring high utilization of both on-prem assets and elastic cloud resources (AWS/GCP).
- Storage Architecture: Develop and execute a long-term roadmap for storage that balances high-throughput performance for active training with cost-effective durability for petascale research data.
- Resource Economics: Act as the primary steward of our GPU compute budget. You will make data-driven decisions on when to burst to the cloud versus when to invest in on-prem capacity.
- User Support & Velocity: Serve as the technical bridge to our research teams. You will ensure that infrastructure is an accelerator, not a bottleneck, for a diverse set of research objectives.

Requirements and Qualifications
Who You Are:
- Systems Expert: You have a deep understanding of the Linux kernel, container runtimes, and distributed systems. You understand the performance implications of InfiniBand topologies and NCCL optimizations.
- Strategic Thinker: You look beyond the immediate "fire" to design systems that will scale for the next 3–5 years of AI research.
- Pragmatic Leader: You are comfortable making trade-offs between technical elegance and operational necessity. You prioritize reliability and researcher velocity above all else.

What You’ll Need:
- Experience: 12+ years in infrastructure, systems engineering, or HPC, with at least 5 years in a leadership role managing multi-disciplinary engineering teams.
- Education: Bachelor’s degree in a related field; a relevant advanced degree may substitute for an equivalent number of years of technical work experience.
- GPU/HPC Stack: Direct experience managing large-scale NVIDIA GPU clusters and high-performance networking (InfiniBand/RoCE).
- Cloud Native: Strong background in Kubernetes, Slurm, or similar orchestration frameworks, particularly in hybrid-cloud configurations.
- Storage Mastery: Experience with distributed filesystems (e.g., WEKA, Ceph, Lustre) and cloud storage integration at scale.
- Software Development: Proficient in Go or Python, with the ability to review architecture and code for our internal tooling.

Physical Demands and Work Environment:
- Must be able to remain in a stationary position for long periods of time.
- The ability to communicate information and ideas so others will understand, and to exchange accurate information in these situations.
- The ability to observe details at close range.
- Can work under deadlines.

What They Offer
Compensation:
- Base salary range: $176,400 - $264,600.
- Generous bonus plans to provide a competitive compensation package.
- Annual bonuses and participation in the long-term incentive plan.

Benefits:
- Medical, dental, vision, and an employee assistance program for team members and their families.
- Enrollment in health savings account plan, healthcare reimbursement arrangement plan, and health care and dependent care flexible spending account plans.
- Company’s 401k plan.
- $125 per month to assist with commuting or internet expenses.
- $200 per month for fitness and wellbeing expenses.
- Up to ten sick days per year, up to seven personal days per year, up to 20 vacation days per year, and twelve paid holidays throughout the calendar year.

Culture and Perks:
- Learning organization with weekly Ai2 Academy lectures and world-class AI experts as guest speakers.
- Commitment to diversity, inclusion, and a healthy work/life balance.
- Collaborative and transparent team environment.
- Seattle office located on the water with access to mountains, lakes, and outdoor activities.

