Cambridge Residency Programme: Next-Generation AI Datacentre Networking
Track A: Modelling & Simulation
Best suited to candidates whose primary strength is analytical reasoning, performance modelling, or simulation.

- Design and analyse novel network architectures (e.g., hybrid optical-electrical, reconfigurable topologies) tailored for AI communication patterns.
- Develop analytical models and simulators to quantify the performance, cost, and energy trade-offs of proposed designs.
- Study architectural trade-offs involving topology, transport, collective communication, and emerging optical/networking hardware.
- Collaborate with systems researchers to compare model predictions with testbed measurements.
- Evolve existing evaluation tools and frameworks to address new research questions and scenarios relevant to product teams.
- Implement and evaluate network protocols, transport mechanisms, and collective communication schemes on experimental hardware testbeds featuring modern GPUs, optical circuit switches, and RDMA interconnects.
- Build and run communication-intensive workloads (e.g., collective algorithm benchmarks, distributed training/inference jobs) to stress-test new network designs.

- PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Mathematics, Operations Research, or a related field.
- Evidence of independent research, such as first-author publications, strong thesis work, or impactful prototypes.
- Ability to communicate research clearly through papers, talks, and cross-functional collaboration.

- Experience with datacentre network architectures, transport protocols, or collective communication.
- Familiarity with circuit-switched or optical networking concepts (e.g., optical circuit switches, co-packaged optics).
- Understanding of AI/ML workload communication patterns (e.g., all-reduce, MoE routing, pipeline parallelism).
- Experience building simulators, evaluation frameworks, or experimental prototypes.
- Proficiency in Python and familiarity with scientific computing libraries (NumPy, SciPy, pandas).
- High-performance networking: RDMA (RoCEv2, InfiniBand), transport protocol implementation, or congestion control.
- GPU and distributed ML communication: CUDA programming, NCCL, or experience with ML training/inference systems (e.g., PyTorch, Megatron, vLLM).
- Experimental infrastructure: building or managing hardware testbeds, measurement and profiling.