
In a recent episode of the Fabric Insider series, Reza Rad of RADACAD, a Microsoft MVP, interviews Santhosh Kumar Ravindran from the Microsoft Fabric Data Engineering team about Spark performance in Microsoft Fabric. The conversation focuses on practical fixes and new features that aim to reduce notebook startup times and improve the interactive experience for data engineers and data scientists. As the episode explains, poor performance is often rooted in configuration choices rather than in the underlying Fabric Spark runtime itself. Consequently, the discussion highlights both product capabilities and the operational tradeoffs teams must weigh.
The episode, titled “Fabric Spark: Custom Live Pools, Resource Profiles & Performance Fixes,” frames its discussion around three central themes: pre-warmed compute, workload-aware sizing, and diagnostic practices. Reza Rad and Santhosh walk through how Microsoft has adapted open-source Spark to fit Fabric’s notebook-first experience and why those adaptations matter for interactive analytics. They also explain how Fabric’s governance model layers capacity, workspace, environment, and session-level settings to control compute behavior. Thus, listeners get both conceptual background and operational guidance in a single session.
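The layering of capacity, workspace, environment, and session settings can be pictured as a chain of lookups. The sketch below is only an illustration: the setting names and values are hypothetical, and it assumes that a more specific layer overrides a broader one, which the episode summary does not spell out.

```python
from collections import ChainMap

# Hypothetical setting names and values, for illustration only.
capacity = {"max_nodes": 16, "idle_timeout_min": 20}
workspace = {"max_nodes": 8}
environment = {"resource_profile": "readHeavy"}
session = {"idle_timeout_min": 5}

LAYER_NAMES = ("session", "environment", "workspace", "capacity")

# ChainMap searches its maps left to right, so the most specific layer
# is listed first (assumption: specific layers win over broad ones).
effective = ChainMap(session, environment, workspace, capacity)


def resolve(key):
    """Return the effective value for key and the layer that supplied it."""
    for name, layer in zip(LAYER_NAMES, effective.maps):
        if key in layer:
            return layer[key], name
    raise KeyError(key)


for key in ("max_nodes", "idle_timeout_min", "resource_profile"):
    value, source = resolve(key)
    print(f"{key} = {value} (from {source})")
```

Tracing a setting back to the layer that supplied it is often the first step when a notebook behaves differently than expected in one workspace.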
One of the flagship features discussed is custom live pools, which keep clusters pre-hydrated on a schedule so notebook sessions can start in roughly five seconds instead of minutes. This approach reduces cold start delays by preparing nodes and environments before users connect, and it works best for predictable, recurring notebook workloads such as scheduled runs or routine explorations. However, teams must accept tradeoffs: live pools require paid capacity SKUs and mandatory schedules, and they support only notebook-based sessions rather than all Spark job types.
Furthermore, Santhosh explains that while live pools improve interactivity, they shift some responsibilities to capacity planning and scheduling. Organizations must choose how many clusters to keep warm and what environments to attach, which increases predictability but also locks in reserved resources during the scheduled window. Therefore, teams that prioritize low-latency interactive work will value the responsiveness, while those focused on cost minimization may prefer on-demand pools despite longer startup times.
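A rough way to frame this tradeoff is to compare the waiting time a warm pool removes with the compute it keeps reserved. The sketch below is a back-of-the-envelope estimate, not a Fabric billing model: the function names are hypothetical, and the 180-second cold start, session count, and cluster sizes are assumed figures. Only the roughly five-second warm start comes from the episode.

```python
def startup_minutes_saved(sessions_per_day, cold_start_s, warm_start_s):
    """Total user waiting time (minutes per day) removed by warm starts."""
    return sessions_per_day * (cold_start_s - warm_start_s) / 60


def reserved_node_hours(warm_clusters, nodes_per_cluster, window_hours):
    """Node-hours held warm during the scheduled window, used or not."""
    return warm_clusters * nodes_per_cluster * window_hours


# Assumed pilot figures: 40 sessions a day, 3-minute cold starts,
# two 3-node clusters kept warm for an 8-hour window.
saved = startup_minutes_saved(sessions_per_day=40, cold_start_s=180, warm_start_s=5)
reserved = reserved_node_hours(warm_clusters=2, nodes_per_cluster=3, window_hours=8)

print(f"Waiting time saved: {saved:.1f} minutes per day")
print(f"Capacity reserved: {reserved} node-hours per day")
```

Putting both numbers side by side makes it easier to decide whether a given team's session volume justifies a warm window.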
Alongside live pools, the episode introduces Resource Profiles as a way to tune cluster sizing and behavior for different workload patterns. These profiles let administrators set characteristics like read-heavy or write-heavy behaviors, which in turn influence how many clusters stay warm and what instance types are used. By matching profiles to usage patterns, organizations can balance concurrency, cost, and startup time more effectively.
Yet configuring resource profiles requires careful testing and observation. Santhosh recommends starting with conservative maximum cluster counts and using utilization metrics to refine idle timeouts and schedules, because over-provisioning wastes capacity while under-provisioning leaves users waiting for sessions. As a result, the best strategy combines telemetry-driven adjustments with an initial schedule buffer to guarantee hydration completes before peak use.
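The schedule buffer can be worked out with simple date arithmetic. The sketch below assumes a measured hydration time and a chosen safety margin; the function name and the sample values are hypothetical, and the resulting time would still need to be entered in whatever scheduling surface the pool exposes.

```python
from datetime import datetime, timedelta


def pool_schedule_start(peak_start, hydration, buffer):
    """Latest schedule start so hydration finishes, with slack, before peak use."""
    return peak_start - (hydration + buffer)


# Assumed values: users arrive at 09:00, hydration was measured at
# about 12 minutes, and the team wants a 10-minute safety margin.
peak = datetime(2025, 1, 6, 9, 0)
start = pool_schedule_start(
    peak,
    hydration=timedelta(minutes=12),
    buffer=timedelta(minutes=10),
)

print(start.strftime("%H:%M"))  # 08:38
```

As utilization data accumulates, the buffer can be narrowed so that less capacity sits warm before anyone connects.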
The episode also addresses why Spark sometimes “feels slow” and how policy choices can create cold starts, missed session sharing, and capacity underutilization. Diagnostics and throttling resolution are core topics, and Santhosh points to session-level instrumentation and capacity metrics as primary tools to pinpoint bottlenecks. Moreover, the team highlights limits like the Max Job Lifetime that impose hard ceilings and must be considered when designing long-running workloads.
Tradeoffs emerge clearly: aggressive session-sharing and longer idle timeouts improve utilization but can increase contention for CPU and memory, while strict isolation improves predictability at the cost of resource efficiency. Therefore, architects must weigh the need for fast interactive response against the financial and operational consequences of reserved capacity. The episode presents practical diagnostic steps and emphasizes iterative tuning rather than one-size-fits-all settings.
Finally, Reza and Santhosh outline an actionable rollout plan for organizations that want to adopt these features without disruption. They recommend pilot projects that target a subset of notebook workloads, measure real-world hydration times, and then expand schedules and profiles based on observed concurrency and utilization. This phased approach reduces risk and surfaces hidden constraints such as library management and environment publishing requirements.
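For the measurement step of such a pilot, a small summary of observed session start times is often enough to see whether warm starts are holding up. The sketch below uses made-up sample timings and a simple nearest-rank percentile; how the timings are collected is left open, since the episode summary does not name a specific API for it.

```python
import math
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(startup_seconds):
    """Summarize session start times collected during a pilot."""
    return {
        "count": len(startup_seconds),
        "median_s": median(startup_seconds),
        "p95_s": percentile(startup_seconds, 95),
    }


# Made-up pilot sample: mostly warm starts plus one cold start.
samples = [4.8, 5.1, 5.4, 6.0, 5.2, 4.9, 31.0, 5.3, 5.0, 5.6]
summary = summarize(samples)
print(summary)
```

A median near the warm-start target with a much higher 95th percentile suggests that some sessions are still missing the warm pool, which is a cue to revisit the schedule or the number of clusters kept warm.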
Looking ahead, the conversation teases ongoing roadmap work to improve Fabric Spark’s management surfaces and broaden automation options, though some current limitations remain, like the need to publish environments in the portal after library updates. In short, the episode offers a practical blend of new product capabilities, configuration patterns, and realistic tradeoffs for teams aiming to make Spark notebooks snappier in Microsoft Fabric.
Overall, the Fabric Insider episode led by Reza Rad and guest Santhosh Kumar Ravindran provides a concise, usable guide for teams balancing responsiveness, cost, and manageability in notebook-driven Spark workloads. By combining feature walkthroughs with diagnostic tips and a sensible rollout plan, the conversation helps teams decide when to use custom live pools and Resource Profiles and how to measure success as they scale. Consequently, organizations can move from trial and error to repeatable practices that improve developer productivity and platform efficiency.