
Senior Software Engineer

Microsoft
United States, Texas, Irving
7000 State Highway 161
Sep 26, 2025
Overview

The Azure Compute team builds a fault-tolerant, distributed system on top of commodity datacenter hardware to deliver infrastructure for hosting cloud applications in virtual machines (VMs). The team creates the illusion that resources are limitless, infinitely elastic, and always available.

This role is in the Availability Platform team within Azure Compute, which focuses on ensuring every Azure VM achieves an SLA of 99.99+%. Achieving and exceeding this target requires out-of-the-box thinking, backed by sound data-driven decisions and intelligent automation. The team owns services that monitor the health of millions of Azure machines and the control plane services that make all repair decisions in Azure. We leverage AI and machine learning to build predictive failure models that proactively live-migrate VMs before failures occur, minimizing customer impact and improving platform resilience.

We are also exploring the use of generative AI to enhance diagnostics, automate root cause analysis, and accelerate incident resolution. Our collaboration with data scientists and AI researchers enables us to continuously evolve our platform with smarter, self-healing capabilities.

As a Senior Software Engineer, you will join a talented team that invests in people and technology for the long term. We emphasize comprehensive designs, incremental development with high quality, frequent shipping, and rapid adaptation to customer feedback. Join us in pushing the boundaries of scale, reliability, availability, and efficiency while integrating cutting-edge AI to redefine cloud infrastructure. If you want hands-on experience with services architecture at hyperscale, this is the role for you.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities

* Collaborates with appropriate stakeholders to determine user requirements for a scenario. Drives identification of dependencies and the development of design documents for a product, application, service, or platform.
* Works with partner teams to ensure a project/sub-system of a product works well with the components of the partner team, ensuring proper end-to-end testing, live-site coverage, scalability, performance, and DRI escalation pathways are established before going live.
* Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance and maintainability, effectiveness, and return on investment (ROI).
* Leverages subject-matter expertise of product features and partners with appropriate stakeholders (e.g., project managers) to drive a workgroup's project plans, release plans, and work items.
* Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate.
* Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale.
* Applies best practices to build code based on well-established methods and secure design principles while also applying best practices for new code development and formal validation of security invariants. Drives product development and scaling to customer requirements and applies best practices for meeting scaling needs, performance expectations, and security promises.