Azure Reliability: 3 Fixes You Need

by HubSite 365 about John Savill [MVP]

Principal Cloud Solutions Architect


Microsoft pro tips to boost Azure reliability with Availability Zones, gateway SKUs and resilient network connectivity

Key insights

Availability Zones: Distribute VMs and services across zones to reduce single-point failures and improve uptime.
Use zones for key resources so outages in one zone won’t take down your whole application.
Network gateway SKUs: Pick the gateway SKU that matches your throughput and resiliency needs to avoid bottlenecks during failover.
Review performance and HA features of each SKU and scale or upgrade before traffic peaks.
Default outbound internet access and NAT Gateway: Microsoft will retire implicit outbound access for new VMs, so configure explicit outbound methods like NAT Gateway or public IPs.
Inventory existing VMs, plan migrations away from legacy implicit access, and apply explicit outbound controls for security and predictability.
Service-managed failover and High availability for Cosmos DB: Enable service-managed failover and HA to let Cosmos DB switch write regions automatically during outages.
Activating these features reduces downtime and helps meet SLA expectations for production workloads.
Azure Site Reliability Engineering Agent and AI-driven incident management: Use the new SRE agent to speed root cause analysis by automatically analyzing logs and metrics with AI.
This cuts mean time to repair and frees engineers to focus on higher-value tasks.
Platform changes and migrations: Revisit disaster recovery for App Service (platform DR changes) and plan to replace Functions Proxies before end of support.
Update DR plans, test failover procedures, and migrate deprecated features well ahead of deadlines to avoid service gaps.

Introduction

The recent YouTube video by John Savill [MVP] offers a clear, practical look at three high-impact actions to improve Azure reliability. In the video, Savill walks viewers through concrete steps, timestamps, and rationale that help teams prioritize resilience work without overhauling their entire cloud estate. He emphasizes low-friction changes that deliver measurable gains, and he frames each recommendation with examples and timing so teams can plan work effectively.

The presentation also balances urgency with pragmatism, explaining why some defaults are no longer safe and what to do about them. For readers, the video serves as both a checklist and a discussion starter about tradeoffs between cost, complexity, and risk. These suggestions are best read as prioritized actions rather than one-size-fits-all mandates.

Availability Zones: Design for Fault Isolation

Savill recommends using Availability Zones to protect workloads against datacenter-level failures. He explains that spreading instances across zones reduces single points of failure and can qualify a deployment for a stronger availability SLA, which is essential for production systems that must remain online during infrastructure incidents. The video shows how bringing zone awareness to virtual machines, managed disks, and certain platform services can make a big difference with relatively little operational change.

However, there are tradeoffs to consider. For example, cross-zone deployments may increase network egress or add configuration complexity, and some services might not be zone-aware in every region. Consequently, teams must weigh cost and latency impacts against the reliability gains, and they should test failover paths as part of deployment validation.
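To make the fault-isolation idea concrete, here is a minimal Python sketch, not taken from the video, that spreads a set of instances across zones round-robin and checks how many would survive the loss of any single zone. The instance names, zone labels, and helper functions are all hypothetical, and the code does not call any Azure API.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_names, zones):
    """Assign instances to zones round-robin so load is spread evenly."""
    zone_cycle = cycle(zones)
    return {name: next(zone_cycle) for name in instance_names}


def surviving_instances(placement, failed_zone):
    """Return the instances still running if one zone goes down."""
    return [name for name, zone in placement.items() if zone != failed_zone]


# Hypothetical workload: five web instances spread over three zones.
placement = spread_across_zones([f"web-{i}" for i in range(5)], ["1", "2", "3"])
per_zone = Counter(placement.values())

# Worst case: the outage hits the zone that hosts the most instances.
worst_case = min(len(surviving_instances(placement, zone)) for zone in per_zone)

print(dict(per_zone))
print(f"Instances left after the worst single-zone outage: {worst_case}")
```

A check like this can feed capacity planning: if the surviving instances cannot carry peak load, the deployment is zone-spread but not yet zone-resilient.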

Network Gateway SKUs and Outbound Connectivity

An important segment of the video covers network gateway SKUs and the upcoming shift away from implicit internet access for new VMs. Savill stresses that relying on the old default outbound behavior creates hidden risks because new virtual machines will soon require explicit outbound mechanisms such as a NAT Gateway or assigned public IPs. He advises organizations to audit current VM networking and to migrate workloads to managed outbound solutions to avoid surprises during rollouts or incidents.

At the same time, choosing the right gateway SKU or NAT configuration involves balancing throughput, cost, and feature needs. Higher-capacity SKUs provide better performance and resiliency but raise monthly costs, and some legacy network setups can complicate migration. Thus, planning, staged testing, and cost modeling are necessary to make the change sustainable and effective.
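The audit step can be sketched in a few lines of Python. This is an illustrative outline only: the `VmNetworkRecord` fields are a hypothetical inventory format rather than an Azure schema, and the sketch checks just the two explicit methods the video names, NAT Gateway and public IPs. In practice the inventory would come from whatever export or tooling a team already uses.

```python
from dataclasses import dataclass


@dataclass
class VmNetworkRecord:
    """Hypothetical inventory row; field names are illustrative only."""
    name: str
    has_public_ip: bool = False
    subnet_has_nat_gateway: bool = False


def explicit_outbound_methods(vm):
    """List the explicit outbound methods recorded for a VM."""
    methods = []
    if vm.subnet_has_nat_gateway:
        methods.append("NAT Gateway")
    if vm.has_public_ip:
        methods.append("public IP")
    return methods


def vms_relying_on_default_outbound(inventory):
    """Return names of VMs with no explicit outbound method recorded."""
    return [vm.name for vm in inventory if not explicit_outbound_methods(vm)]


# Made-up sample data for illustration.
inventory = [
    VmNetworkRecord("app-01", subnet_has_nat_gateway=True),
    VmNetworkRecord("app-02"),
    VmNetworkRecord("jump-01", has_public_ip=True),
    VmNetworkRecord("batch-01"),
]

to_migrate = vms_relying_on_default_outbound(inventory)
print("VMs to move to an explicit outbound method:", to_migrate)
```

The resulting list gives a starting point for staged migration and for the cost modeling mentioned above.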

Network Connectivity and Cosmos DB Failover

Savill highlights network connectivity patterns and introduces reliability recommendations for managed data services such as Cosmos DB. He recommends enabling service-managed failover and activating high availability so that write regions can shift automatically during outages and applications continue to function. The video shows that these options reduce manual intervention and improve mean time to recovery when they are configured and validated properly.

Nonetheless, automatic failover has tradeoffs, including potential data latency, the need for a multi-region replication strategy, and application readiness for regional writes. Teams must test consistency models and update connection logic to handle region changes. Therefore, a thorough failover playbook and regular drills are essential to make automated failover dependable.
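A failover drill can be rehearsed in code before it is tried against a live account. The Python sketch below is a simulation under stated assumptions, not the Cosmos DB SDK's API: `RegionUnavailable`, `write_with_region_fallback`, and the fake writer are hypothetical names, and the region list is only an example. It shows the general shape of connection logic that tolerates a change of write region.

```python
class RegionUnavailable(Exception):
    """Hypothetical error standing in for a real client's region failure."""


def write_with_region_fallback(write_fn, preferred_regions, item):
    """Try each region in order and return the one that accepted the write."""
    errors = {}
    for region in preferred_regions:
        try:
            write_fn(region, item)
            return region
        except RegionUnavailable as exc:
            errors[region] = str(exc)  # remember why this region was skipped
    raise RuntimeError(f"All regions failed: {errors}")


def make_fake_writer(down_regions):
    """Build a stand-in writer that fails for regions marked as down."""
    def fake_write(region, item):
        if region in down_regions:
            raise RegionUnavailable(f"{region} is offline")
    return fake_write


# Simulated drill: normal operation, then an outage in the first region.
regions = ["East US", "West US"]
item = {"id": "1"}
used_normal = write_with_region_fallback(make_fake_writer(set()), regions, item)
used_outage = write_with_region_fallback(make_fake_writer({"East US"}), regions, item)

print(f"Normal write served by: {used_normal}")
print(f"Write during outage served by: {used_outage}")
```

Running the same drill with every region marked as down exercises the final error path, which is the case a playbook should cover explicitly.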

Azure Site Reliability Engineering Agent and Guidance Hub

Finally, the presentation covers new tools and resources for incident analysis, including the Azure Site Reliability Engineering Agent and the Reliability Guidance Hub. Savill describes how AI-assisted diagnostics can sift through logs and signals faster, helping engineers pinpoint causes and reduce downtime. He also points viewers to centralized reliability documentation that aggregates best practices and practical playbooks to guide teams through common scenarios.

Despite the promise of automation, challenges remain in integrating AI tools into existing workflows and trusting automated recommendations. Teams should validate outputs, preserve human oversight, and train staff to interpret tool findings. Additionally, adopting advanced tooling may require new permissions, telemetry setup, and budget planning, so organizations should plan the rollout carefully.

Summary and Next Steps

In short, John Savill's video frames three practical reliability actions: adopt Availability Zones, configure explicit outbound networking with appropriate gateway SKUs, and enable managed failover for critical services like Cosmos DB, while exploring new AI-driven diagnostics. Each recommendation brings improvements in uptime and fault tolerance, yet each also carries tradeoffs in cost, complexity, and operational change. Hence, teams should prioritize work based on risk, test thoroughly, and update runbooks to reflect new behaviors.

Ultimately, the video serves as a concise guide for teams that need to raise their resilience posture without unnecessary overhaul. For organizations that follow the advice, the rewards include clearer network control, faster recovery, and fewer surprises during incidents. Therefore, readers should treat these actions as part of a staged reliability program that balances immediate wins with long-term resilience goals.

Related resources

Azure Weekly Update - Azure Reliability: 3 Fixes You Need
