Leila Gharani [MVP] explains why Power Query can keep the wrong record
In a recent YouTube video, trainer Leila Gharani [MVP] demonstrates a subtle but important pitfall when using Power Query's Remove Duplicates to keep the latest record. The video shows that a straightforward approach, sorting newest-first and then removing duplicates, can silently return incorrect results on larger datasets. Consequently, reports may publish bad numbers without any error or warning. The demonstration highlights how common tools can fail in production unless users understand how Power Query evaluates steps internally.
The problem demonstrated
Gharani walks through a typical scenario: sort by date so the newest entry appears first, then use Remove Duplicates to keep that top row for each key. At small scale the output looks correct, but when the same steps run on a large CSV the count is wrong and some keys keep older rows. In other words, the visible editor order does not always guarantee the row order used by the deduplication step. As a result, teams who trust the editor preview can be surprised by quietly incorrect data feeding downstream reports.
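The scenario above can be sketched in M roughly as follows. This is a minimal illustration, not the code from the video: the inline table and the column names CustomerID, OrderDate, and Amount are hypothetical stand-ins for a real CSV source.

```powerquery
let
    // Hypothetical sample data; the column names are assumptions for illustration
    Source = #table(
        type table [CustomerID = text, OrderDate = date, Amount = number],
        {
            {"A", #date(2024, 1, 5), 10},
            {"A", #date(2024, 3, 1), 20},
            {"B", #date(2024, 2, 10), 30}
        }
    ),
    // Sort newest-first so the latest row per key appears on top
    Sorted = Table.Sort(Source, {{"OrderDate", Order.Descending}}),
    // Remove Duplicates on the key; this relies on the sort order being honored,
    // which is the assumption the video shows can break on larger inputs
    Deduped = Table.Distinct(Sorted, {"CustomerID"})
in
    Deduped
```

On a table this small the result will most likely look right, which is why the problem tends to go unnoticed until the same steps run against a large file.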
Why Power Query keeps the wrong row
The video explains that the root cause is how Power Query optimizes work, often through query folding and lazy evaluation. Query folding can push operations back to the data source or reorder steps to improve efficiency, and lazy evaluation means a step is computed only when a later step requests its rows, so the editor’s sort may not materialize before duplicates are removed. The duplicate-removal step therefore operates on the table’s current internal state rather than the visually sorted state. This behavior is efficient, but when you need a stable ordering for correct deduplication, it becomes a hazard.
The simple fix and an alternative
Gharani proposes a one-line fix: wrap the sorted table in Table.Buffer so Power Query materializes the sorted data in memory before deduplication. With buffering, the sort is fully realized and Table.Distinct or the UI Remove Duplicates step will keep the intended row. She also presents a second approach that avoids relying on order: use the Group By method to compute a maximum date or an index per key, then merge or filter to keep the correct record. Both solutions work, but they come with tradeoffs that users should weigh.
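The buffering fix can be sketched as below, under the same assumptions as before: the inline table and the column names CustomerID, OrderDate, and Amount are hypothetical, and the exact step names in the video may differ.

```powerquery
let
    // Hypothetical sample data; the column names are assumptions for illustration
    Source = #table(
        type table [CustomerID = text, OrderDate = date, Amount = number],
        {
            {"A", #date(2024, 1, 5), 10},
            {"A", #date(2024, 3, 1), 20},
            {"B", #date(2024, 2, 10), 30}
        }
    ),
    // Sort newest-first
    Sorted = Table.Sort(Source, {{"OrderDate", Order.Descending}}),
    // Materialize the sorted rows in memory so the order is fixed
    // before the deduplication step reads them
    Buffered = Table.Buffer(Sorted),
    // Keep the first (newest) row for each key
    Deduped = Table.Distinct(Buffered, {"CustomerID"})
in
    Deduped
```

The only change from the naive pattern is the Table.Buffer step between the sort and Table.Distinct.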
Tradeoffs: performance, memory, and folding
Buffering enforces order, yet it increases memory usage because the table is materialized in Power Query’s process. Thus, for very large datasets buffering can slow queries or even exhaust available memory. On the other hand, the Group By pattern often avoids buffering and preserves query folding, which can be faster when the source supports it. However, grouping can be more complex to implement and may change how transformations are structured, which affects maintainability.
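One way the Group By pattern might look is sketched below: compute the maximum date per key, then join back to the original rows. This is an assumed implementation of the idea, not necessarily the exact steps from the video, and the table and column names (CustomerID, OrderDate, Amount, MaxDate, Match) are hypothetical.

```powerquery
let
    // Hypothetical sample data; the column names are assumptions for illustration
    Source = #table(
        type table [CustomerID = text, OrderDate = date, Amount = number],
        {
            {"A", #date(2024, 1, 5), 10},
            {"A", #date(2024, 3, 1), 20},
            {"B", #date(2024, 2, 10), 30}
        }
    ),
    // One row per key holding that key's latest date
    MaxPerKey = Table.Group(
        Source,
        {"CustomerID"},
        {{"MaxDate", each List.Max([OrderDate]), type nullable date}}
    ),
    // Inner join keeps only source rows whose date equals the key's latest date
    Joined = Table.NestedJoin(
        Source, {"CustomerID", "OrderDate"},
        MaxPerKey, {"CustomerID", "MaxDate"},
        "Match", JoinKind.Inner
    ),
    // Drop the helper column created by the join
    Latest = Table.RemoveColumns(Joined, {"Match"})
in
    Latest
```

Because the result is selected by value rather than by position, it does not depend on row order. Note that if two rows share the same key and the same latest date, this sketch keeps both, so a tie-breaking rule may still be needed.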
Practical verification and safe practices
The video emphasizes verifying results against the original source. For example, compare deduplicated output to the CSV or source query counts before publishing, and test on real dataset sizes. Additionally, normalizing text with Trim, Clean, or consistent casing helps avoid hidden mismatches that look identical in the editor. Finally, deduplicate on the correct business key rather than the entire row so you control which fields determine uniqueness.
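These practices can be combined into a self-checking query along the following lines. This is a sketch under assumptions: the sample rows, the column names CustomerID and OrderDate, and the error reason "DedupCheckFailed" are all hypothetical, and the check shown is one possible test rather than the one used in the video.

```powerquery
let
    // Hypothetical sample data with keys that differ only by case and whitespace
    Source = #table(
        type table [CustomerID = text, OrderDate = date],
        {
            {" a1", #date(2024, 1, 5)},
            {"A1 ", #date(2024, 3, 1)},
            {"B2", #date(2024, 2, 10)}
        }
    ),
    // Normalize the business key: remove control characters, trim, unify casing
    Normalized = Table.TransformColumns(
        Source,
        {{"CustomerID", each Text.Upper(Text.Trim(Text.Clean(_))), type text}}
    ),
    // Deduplicate on the business key only, with the sort buffered first
    Sorted = Table.Sort(Normalized, {{"OrderDate", Order.Descending}}),
    Deduped = Table.Distinct(Table.Buffer(Sorted), {"CustomerID"}),
    // Verification 1: expect one output row per distinct key
    ExpectedKeys = List.Count(List.Distinct(Normalized[CustomerID])),
    // Verification 2: every kept row should carry its key's latest date
    MaxPerKey = Table.Group(
        Normalized,
        {"CustomerID"},
        {{"MaxDate", each List.Max([OrderDate]), type nullable date}}
    ),
    Matched = Table.RowCount(
        Table.NestedJoin(
            Deduped, {"CustomerID", "OrderDate"},
            MaxPerKey, {"CustomerID", "MaxDate"},
            "Match", JoinKind.Inner
        )
    ),
    // Fail loudly instead of publishing quietly wrong data
    Checked =
        if Table.RowCount(Deduped) = ExpectedKeys and Matched = ExpectedKeys
        then Deduped
        else error Error.Record(
            "DedupCheckFailed",
            "Deduplicated output does not hold the latest row for every key"
        )
in
    Checked
```

Raising an error on a mismatch turns a silent data problem into a visible refresh failure, which is easier to catch before a report is published.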
Challenges in production use
Teams must balance robustness, speed, and resource use when choosing an approach. Buffering is simple and reliable for moderate datasets, but it can raise memory use and refresh time at scale. Conversely, preserving query folding can deliver faster results and lower memory demand, yet it may hide ordering assumptions and require more careful step design. Therefore, developers should document the chosen pattern and include tests to detect regressions when data grows or sources change.
Conclusion
Gharani’s video offers a clear and practical lesson: the visible order in Power Query’s editor is not always the order used by downstream steps, and that mismatch can silently corrupt results. By either materializing the sort with Table.Buffer or adopting a grouping strategy, analysts can ensure they keep the correct record. Ultimately, the right choice depends on dataset size, source capabilities, and operational constraints, so validate and monitor any deduplication pattern before it reaches production.
