Ilya Baybikov's home page

Lovely Heap and poor CPU

Let’s assume you’re dealing with a legacy application where most of the logic lives in stored procedures, functions, triggers, and so on.

The app uses a queue implemented on top of SQL Server - not unusual for legacy applications. The interesting part is that the table behind the queue is a Heap with no nonclustered indexes. Let’s call this table Lovely Heap.

The application has hundreds of thousands of tables, and each one has a trigger that inserts one or more rows into Lovely Heap. Okay, okay, not hundreds of thousands - just hundreds. But you get the idea of how everything was built.

Due to an error, Lovely Heap was flooded with millions of rows that would never be removed by the queue processor (which, of course, was just a SQL Server Agent job).

That data is no longer needed, and the decision is simple: delete those rows manually. No sooner said than done.

Suppose that before the deletion, Lovely Heap had ~9k pages, and after the deletion it still had ~6k pages allocated.

Now, place your bets: what do you think happened on the production SQL Server to the stored procedure that reads from Lovely Heap once all the invalid rows were deleted?

And surprise! - right after the deletion, CPU utilization went through the roof. Poor CPU…

Testing

The problem was eventually solved - but what exactly happened, and why did CPU usage skyrocket? To investigate, I will run some tests using the latest and greatest SQL Server 2025 + CU3, starting by preparing the schema and test data:

```sql
create table dbo.LovelyHeap
(
	Id			bigint identity(1,1) not null,
	QueueId		tinyint not null,
	OperationId	tinyint not null,
	[Data]		bigint not null,
	CreateDate	datetime2(7) not null
		constraint DF_LovelyHeap__CreateDate default (sysutcdatetime())
)
go
create proc dbo.stp_LovelyHeapRead
(
	@QueueId tinyint
)
as
set nocount on

delete lh
output deleted.Id, deleted.OperationId, deleted.[Data]
from	dbo.LovelyHeap lh with(readpast)
where QueueId = @QueueId
go
```

Next, I inserted some data using SqlQueryStress created by Adam Machanic.

I ran two instances simultaneously to mix the data:
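The insert statements themselves aren't shown here (they ran inside SqlQueryStress), but judging by the QueueId values that appear later, the two instances presumably ran something like this hypothetical pair - the exact statements and ranges are my assumptions:

```sql
-- Instance 1 (hypothetical): "normal" queue rows, QueueId 1 to 100
insert	dbo.LovelyHeap(QueueId, OperationId, [Data])
select	floor(rand() * 100) + 1, floor(rand() * 100), floor(rand() * 2147483647)

-- Instance 2 (hypothetical): the flood that gets deleted later, QueueId 254-255
insert	dbo.LovelyHeap(QueueId, OperationId, [Data])
select	floor(rand() * 2) + 254, floor(rand() * 100), floor(rand() * 2147483647)
```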

Let’s look at index physical stats using this query:

```sql
select  object_name([object_id]) table_name,
		record_count record_cnt,
		forwarded_record_count fwd_record_cnt,
		ghost_record_count ghost_record_cnt,
		avg_fragmentation_in_percent,
		page_count page_cnt
from	sys.dm_db_index_physical_stats(db_id(), null, null, null, 'detailed')
where [object_id] = object_id('dbo.LovelyHeap')
```

Result:

| table_name | record_cnt | fwd_record_cnt | ghost_record_cnt | avg_fragmentation_in_percent | page_cnt |
| --- | --- | --- | --- | --- | --- |
| LovelyHeap | 2100000 | 0 | 0 | 0 | 9544 |

Now let’s run the stored procedure and collect execution stats (10 threads × 500 iterations):

```sql
declare	@QueueId tinyint = floor(rand() * 100) + 1	-- 1 to 100
exec dbo.stp_LovelyHeapRead @QueueId = @QueueId
```

Total execution time was about 1 minute 16 seconds.

To collect execution stats, I will use Query Store and the following query:

```sql
select	object_name(qsq.[object_id]) proc_name,
		rs.count_executions execution_cnt,
		rs.avg_duration / 1000.0 avg_duration_ms,
		rs.avg_cpu_time / 1000.0 avg_cpu_ms,
		rs.avg_logical_io_reads avg_logical_reads,
		rs.avg_logical_io_writes avg_logical_writes
from	sys.query_store_query qsq
		join sys.query_store_plan qsp
		on qsp.query_id = qsq.query_id
		join sys.query_store_runtime_stats rs
		on rs.plan_id = qsp.plan_id
where qsq.[object_id] = object_id('dbo.stp_LovelyHeapRead')
```

Execution stats:

| proc_name | execution_cnt | avg_duration_ms | avg_cpu_ms | avg_logical_reads | avg_logical_writes |
| --- | --- | --- | --- | --- | --- |
| stp_LovelyHeapRead | 5000 | 148.7901908 | 96.8456538 | 9554.45 | 1.454 |

Index physical stats:

| table_name | record_cnt | fwd_record_cnt | ghost_record_cnt | avg_fragmentation_in_percent | page_cnt |
| --- | --- | --- | --- | --- | --- |
| LovelyHeap | 2000000 | 0 | 0 | 0 | 9546 |

Now let’s delete the rows where QueueId equals 254 or 255:

```sql
delete dbo.LovelyHeap
where QueueId in (254, 255)
```

Index physical stats:

| table_name | record_cnt | fwd_record_cnt | ghost_record_cnt | avg_fragmentation_in_percent | page_cnt |
| --- | --- | --- | --- | --- | --- |
| LovelyHeap | 0 | 0 | 0 | 1.13065326633166 | 6311 |
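All the rows are gone, yet 6,311 pages remain allocated to the heap. This is documented behavior: a DELETE on a heap can only deallocate the pages it empties when it holds an exclusive table lock; otherwise the empty pages stay allocated to the table. A sketch of the two usual remedies - deliberately not applied here, since I want to reproduce the problem:

```sql
-- Variant 1: delete under a table lock so emptied pages can be deallocated
delete dbo.LovelyHeap with (tablock)
where QueueId in (254, 255)

-- Variant 2: rebuild the heap afterwards to release the empty pages
alter table dbo.LovelyHeap rebuild
```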

Insert another 100k rows:

```sql
insert	dbo.LovelyHeap(QueueId, OperationId, [Data])
select	floor(rand() * 100) + 1 QueueId,	-- 1 to 100
		floor(rand() * 100) OperationId,	-- 0 to 99
		floor(rand() * 2147483647) [Data]	-- 0 to ~2bil
go 100000	-- repeat the batch 100,000 times (SSMS/sqlcmd batch count)
```

Index physical stats:

| table_name | record_cnt | fwd_record_cnt | ghost_record_cnt | avg_fragmentation_in_percent | page_cnt |
| --- | --- | --- | --- | --- | --- |
| LovelyHeap | 100000 | 0 | 0 | 1.13065326633166 | 6311 |

Run stp_LovelyHeapRead again (10 threads × 500 iterations). This time it took about 4 minutes 48 seconds. Execution stats:

| proc_name | execution_cnt | avg_duration_ms | avg_cpu_ms | avg_logical_reads | avg_logical_writes |
| --- | --- | --- | --- | --- | --- |
| stp_LovelyHeapRead | 5000 | 444.030594 | 305.071358 | 6317.3034 | 0.1564 |

Index physical stats again:

| table_name | record_cnt | fwd_record_cnt | ghost_record_cnt | avg_fragmentation_in_percent | page_cnt |
| --- | --- | --- | --- | --- | --- |
| LovelyHeap | 0 | 0 | 0 | 1.13065326633166 | 6311 |

Average logical reads decreased as expected, but CPU time and duration increased roughly 3x! This behavior was unexpected to me. What is going on here?

Execution plan

It doesn’t matter whether the table has rows to return or not - the observed behavior stays the same. From here on, I’ll compare two scenarios to keep things simple:

- Scenario A: the heap before the mass delete (~9.5k allocated pages, ~2.1M rows)
- Scenario B: the heap after the mass delete (~6.3k allocated pages, most of them empty)

Let’s run the stored procedure for both scenarios and review the execution plans:

Execution plan

Looking at the execution plans:

In both cases the execution plan is the same, and the most expensive operator is a Table Scan. Weird, right?
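One way to see how many of those allocated pages are actually empty is the sys.dm_db_database_page_allocations DMF. It is undocumented (test environments only, subject to change), so treat this as a sketch against the demo table above:

```sql
-- Count allocated heap data pages and how many of them hold no rows at all
select	count(*) allocated_data_pages,
		sum(case when page_free_space_percent = 100 then 1 else 0 end) empty_pages
from	sys.dm_db_database_page_allocations(db_id(), object_id('dbo.LovelyHeap'),
			null, null, 'DETAILED')
where	is_allocated = 1
  and	page_type_desc = 'DATA_PAGE'
```

The logical reads in the stats above (~6,317 per execution) sitting right at the page count (6,311) already hint at the answer: every execution still scans every allocated page, whether it holds rows or not.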

Under the hood

Now let’s see what SQL Server is doing under the hood using Windows Performance Toolkit.

Stack trace for Scenario A (slice duration ~80 ms): Scenario A trace

Stack trace for Scenario B (slice duration ~322 ms): Scenario B trace

The key difference is locking behavior:

Yes, I know - there are a lot of 2s here, but that’s purely accidental, I swear.

Anyway, that’s roughly 1 ms of CPU per HeapFragment::LockResource call.

Conclusion

At first glance, this behavior feels completely wrong: fewer rows, fewer logical reads, the same execution plan - yet 3× higher CPU and longer execution time.

Why this happens isn’t obvious at first. But based on the investigation, the key difference is this: in Scenario B, the workload shifts from data processing to CPU-heavy internal synchronization and the engine spends most of its time on coordination overhead.

And the main takeaway for me is simple - sometimes, less data means more work for SQL Server.

One more practical lesson: heaps are usually a poor fit for OLTP workloads. Without a clustered index, SQL Server relies heavily on allocation structures and page scans, which can lead to unpredictable performance like this. Heaps still have their place - for example, in staging tables or bulk data loads - but for most OLTP scenarios, a properly designed clustered index is the safer and more predictable option.
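To make that concrete, here is a minimal sketch - my suggestion, not necessarily what the original system did - of turning Lovely Heap into a clustered table suited to the dequeue pattern:

```sql
-- Cluster on (QueueId, Id): the readpast dequeue becomes a range seek
-- per queue, and emptied pages are deallocated by normal B-tree cleanup
create clustered index CX_LovelyHeap__QueueId_Id
	on dbo.LovelyHeap (QueueId, Id)
```

Combined with the existing delete … output … with(readpast) procedure, this is the classic table-as-queue pattern in SQL Server.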

To be honest, I don’t fully understand why it behaves this way. If you have a better explanation, feel free to share it in the comments below the LinkedIn post.

Catch you next time!

#Tsql #DatabaseInternals