Principal Network Architect
United States, Washington, Redmond
Overview

Do you want to be at the forefront of innovating the latest hardware designs to propel Microsoft's cloud growth? Are you seeking a unique career opportunity that combines technical capabilities and cross-team collaboration with business insight and strategy?

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to achieve our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond. In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.

Join the AI System Architecture (ASA) team within Microsoft's Azure Hardware Systems and Infrastructure (AHSI) organization, the team behind Microsoft's expanding cloud infrastructure and responsible for powering Microsoft's "Intelligent Cloud" mission. We are looking for a Principal Network Architect to join our team!
Responsibilities

- Own end-to-end network architecture for AI training/inference clusters: topology, routing, transport, congestion control, QoS, telemetry, reliability, and failure domains.
- Lead and grow a high-performing team (~10 engineers) across architecture, performance, and validation; set goals, mentor, and drive execution.
- Define scale-out/scale-up designs (e.g., leaf-spine, dragonfly/dragonfly+, Clos/fat-tree, 2D/3D torus variants) and network services for job schedulers and accelerator runtimes.
- Drive congestion-control strategy (ECN/PFC, DCQCN, HPCC, TIMELY, HULL, adaptive load balancing like CONGA/HULA) and transport tuning (RDMA/RoCEv2, QUIC/TCP variants) for tail-latency and throughput SLAs.
- Hands-on analysis of switch/NIC behavior using counters, traces, and telemetry (PFC/ECN stats, INT, in-band telemetry, gNMI/gNOI, sFlow/NetFlow, eBPF); create reproducible perf tests.
- Evaluate and influence silicon & optics (ASIC feature roadmaps, queueing/scheduling, packet recirculation, shared buffer, VOQs, cut-through vs store-and-forward, 400/800G, linear vs retimed optics).
- Prototype and validate in lab and pre-prod: build testbeds, craft microbenchmarks and realistic AI workloads; automate with Python/Go/Ansible; codify SLOs and pass/fail gates.
- Partner across teams (accelerator/HBM, storage, orchestration, reliability) to co-design network-aware collective ops (all-reduce/all-to-all/mixture-of-experts) and placement policies.
- Influence standards and industry direction via active participation in IEEE 802.3/802.1, IETF, OCP, OIF, Ethernet Alliance, and vendor ecosystems; drive MSFT requirements into roadmaps.
- Operational excellence: define observability, fault isolation, failure testing (Jepsen-style chaos, link flap/black-hole, incast), capacity planning, and upgrade/rollout strategies.
- Documentation & reviews: author design docs, RFCs, and executive briefs; run design and readiness reviews.