Recent Projects

Garnet

Garnet is a high-performance remote cache-store designed as a faster, modern, and extensible alternative to traditional cache-store systems. Garnet adopts the popular RESP wire protocol, enabling compatibility with unmodified Redis clients across most programming languages. Garnet delivers exceptional throughput and scalability with many client connections, achieving up to 100x higher throughput and 4x lower latency at the higher percentiles when running on commodity cloud VMs. Garnet’s storage layer, called Tsavorite, enables thread-scalable operations, tiered storage support (memory, SSD, cloud), fast checkpointing, recovery, and multi-key transaction support.

Built on modern .NET, Garnet is cross-platform and easily extensible with custom operations and data types in C#. Garnet is open source with over 11K stars on GitHub and is deployed in production at scale within Microsoft. Read about the technology in our research paper and visit the project website for more details.
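Garnet speaks RESP, so any client that can produce RESP frames can talk to it. The minimal Python sketch below illustrates the protocol only; it is not Garnet code. It encodes a command the way Redis clients send it, as an array of length-prefixed bulk strings, and decodes simple-string and error replies. The function names are hypothetical, and the other reply types (integers, bulk strings, arrays) are left out.

```python
def encode_resp_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings, the form clients send."""
    parts = [f"*{len(args)}\r\n".encode()]  # array header: number of elements
    for arg in args:
        data = arg.encode("utf-8")
        # Bulk string: $<byte length>\r\n<bytes>\r\n
        parts.append(f"${len(data)}\r\n".encode() + data + b"\r\n")
    return b"".join(parts)


def decode_simple_reply(reply: bytes) -> str:
    """Decode a simple-string reply (+OK) or raise on an error reply (-ERR ...)."""
    line = reply.rstrip(b"\r\n")
    if line.startswith(b"+"):
        return line[1:].decode()
    if line.startswith(b"-"):
        raise RuntimeError(line[1:].decode())
    raise ValueError("this sketch only handles simple strings and errors")


if __name__ == "__main__":
    print(encode_resp_command("SET", "greeting", "hello"))
    # b'*3\r\n$3\r\nSET\r\n$8\r\ngreeting\r\n$5\r\nhello\r\n'
```

A successful `SET` sent over a TCP connection to a RESP server is answered with the simple string `+OK\r\n`.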

Bf-Tree

Bf-Tree is a modern read-write-optimized concurrent larger-than-memory range index, implemented in Rust. Traditional B-Trees face challenges with larger-than-memory workloads due to inefficient page-based caching and high write amplification. Bf-Tree introduces a novel mini-page abstraction that enables fine-grained, variable-length in-memory cache units for buffering reads, writes, and gaps in the index. This design uses a variable-length buffer pool with LRU-approximation for efficient memory management. Benchmarks show Bf-Tree achieves 2.5x faster scans than RocksDB, 6x faster writes than classic B-Trees, and 2x faster point lookups for small records. Bf-Tree is open source and available as a Rust crate. Learn more from our research paper presented at VLDB 2024.
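A variable-length buffer pool needs an eviction policy that works on entries of different sizes. One classic LRU approximation is CLOCK (second chance). The sketch below applies it to variable-size entries. This policy is an assumption chosen for illustration, not Bf-Tree's actual eviction algorithm. `ClockPool`, `insert`, and `access` are hypothetical names, and the sketch ignores concurrency.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    key: str
    size: int                # bytes occupied by this variable-length entry
    referenced: bool = True  # CLOCK reference bit, set on insert and on access


class ClockPool:
    """CLOCK (second-chance) eviction over variable-size entries, single-threaded."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # total bytes available
        self.used = 0
        self.ring: list[Entry] = []
        self.hand = 0             # position of the clock hand in the ring
        self.index: dict[str, Entry] = {}

    def access(self, key: str) -> bool:
        """Record a use of `key`; return True on a hit."""
        entry = self.index.get(key)
        if entry is None:
            return False
        entry.referenced = True
        return True

    def insert(self, key: str, size: int) -> list[str]:
        """Insert a new key, evicting entries until it fits; return evicted keys."""
        if key in self.index or size > self.capacity:
            raise ValueError("key already present or entry larger than the pool")
        evicted = []
        while self.used + size > self.capacity:
            if self.hand >= len(self.ring):
                self.hand = 0                  # wrap around the ring
            victim = self.ring[self.hand]
            if victim.referenced:
                victim.referenced = False      # give it a second chance
                self.hand += 1
            else:
                self.ring.pop(self.hand)       # hand now points at the next entry
                del self.index[victim.key]
                self.used -= victim.size
                evicted.append(victim.key)
        entry = Entry(key, size)
        self.ring.append(entry)
        self.index[key] = entry
        self.used += size
        return evicted
```

For example, take a 100-byte pool holding three 30-byte entries `a`, `b`, and `c`. Inserting `d` evicts `a`. If `b` is then accessed, inserting `e` passes over `b` and evicts `c` instead.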

Vector Databases

We are actively working on vector database technologies in collaboration with the DiskANN team. We are adding native vector support to Garnet, enabling high-performance approximate nearest neighbor (ANN) search capabilities within the cache-store framework.

We have also used Bf-Tree to build a new storage provider for DiskANN, making it possible to efficiently store and index vectors alongside the ANN graph. Bf-Tree’s range indexing capabilities also enable indexing of vector attributes, providing the ability to perform filtered ANN searches where queries can combine vector similarity with attribute predicates.

Additionally, we developed IP-DiskANN, the first in-place update method for graph-based ANN indices. Traditional proximity graph indices like DiskANN are challenging to update in real-time, especially for deletions, as they only store outgoing edges. IP-DiskANN introduces an approximate in-neighbor identification technique using greedy search, enabling targeted edge rewiring that performs localized repairs without full graph traversal. This allows truly streaming updates with both insertions and deletions performed in-place, maintaining high recall while achieving superior throughput compared to batch-based approaches. Learn more from our research paper.
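The following hedged Python sketch makes the deletion idea concrete. It searches greedily toward the vector being deleted, treats the visited nodes that hold an edge to it as its approximate in-neighbors, and rewires only those nodes to the deleted node's out-neighbors. The graph layout, the beam width, and the rewiring rule (keep the `max_degree` closest candidates) are simplifying assumptions. The paper's actual pruning and search parameters may differ.

```python
import math

Vector = tuple[float, ...]
Graph = dict[int, list[int]]  # node id -> out-neighbor ids (directed edges)


def greedy_search(graph: Graph, vectors: dict[int, Vector], start: int,
                  query: Vector, beam: int = 4) -> set[int]:
    """Greedy beam search toward `query`; returns every node it visited."""
    def d(n: int) -> float:
        return math.dist(vectors[n], query)

    candidates, visited = {start}, set()
    while True:
        frontier = sorted(candidates - visited, key=d)
        if not frontier:
            return visited
        node = frontier[0]                 # expand the closest unvisited candidate
        visited.add(node)
        candidates.update(m for m in graph[node] if m in graph)  # skip dangling edges
        candidates = set(sorted(candidates, key=d)[:beam])


def delete_in_place(graph: Graph, vectors: dict[int, Vector], start: int,
                    victim: int, max_degree: int = 3) -> list[int]:
    """Delete `victim`, repairing only its approximate in-neighbors."""
    assert victim != start, "keep the entry point alive in this sketch"
    visited = greedy_search(graph, vectors, start, vectors[victim])
    # Approximate in-neighbors: visited nodes that hold an edge to the victim.
    in_neighbors = [n for n in visited if n != victim and victim in graph[n]]
    replacements = [m for m in graph[victim] if m != victim]
    for n in in_neighbors:
        pool = {m for m in (*graph[n], *replacements) if m in graph} - {victim, n}
        # Simplified rewiring: keep the `max_degree` closest candidates.
        graph[n] = sorted(pool, key=lambda m: math.dist(vectors[n], vectors[m]))[:max_degree]
    del graph[victim]
    del vectors[victim]
    return in_neighbors
```

Nodes that the search does not visit may keep dangling edges to the deleted id. This sketch simply skips such edges during later searches.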

FASTER

Managing large and frequently updated application state efficiently and reliably is a hard problem in the cloud and at the edge. Drawing on lessons from state management challenges in Trill, I designed and built a high-performance key-value store called FASTER. FASTER bridges the gap between larger-than-memory and pure in-memory data structures using a novel hybrid log organization. FASTER is open source, available in C# and C++, with over 6,600 stars on GitHub. Read about the technology in our research paper presented at SIGMOD 2018, and visit the project website for more details.
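One way to picture the hybrid log is as a single logical address space split into regions. Records near the tail are mutable and are updated in place. Older records are read-only, so an update appends a new version at the tail (read-copy-update). The toy model below shows only that rule. The class name is hypothetical, and the model omits concurrency, epoch protection, the fuzzy region, and the on-disk region that FASTER itself manages.

```python
from typing import Any, Optional


class HybridLog:
    """Toy hybrid log: record i lives at logical address i."""

    def __init__(self) -> None:
        self.records: list[tuple[str, Any]] = []
        self.index: dict[str, int] = {}   # key -> address of its latest record
        self.read_only_address = 0        # addresses below this are immutable

    @property
    def tail_address(self) -> int:
        return len(self.records)

    def upsert(self, key: str, value: Any) -> str:
        addr = self.index.get(key)
        if addr is not None and addr >= self.read_only_address:
            self.records[addr] = (key, value)    # mutable region: update in place
            return "in-place"
        self.records.append((key, value))        # read-only region: append a new version
        self.index[key] = self.tail_address - 1
        return "appended"

    def read(self, key: str) -> Optional[Any]:
        addr = self.index.get(key)
        return None if addr is None else self.records[addr][1]

    def shift_read_only(self, new_address: int) -> None:
        # The prefix below read_only_address becomes immutable; in FASTER it is
        # then flushed to storage, which this toy model does not do.
        self.read_only_address = max(self.read_only_address,
                                     min(new_address, self.tail_address))
```

Only records still in the mutable region are updated in place. Once `shift_read_only` moves the boundary past a record, the next update to that key appends a fresh copy at the tail and leaves the old version untouched.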

Recoverability

In FASTER, we first introduced Concurrent Prefix Recovery (CPR), a novel thread-scalable recovery model and protocol that also applies to traditional databases. CPR provides prefix-consistent checkpoints without the overhead of a separate write-ahead log. Learn about CPR in our research paper presented at SIGMOD 2019. We subsequently extended CPR to distributed and serverless environments, building systems and protocols for recoverability in such settings:

  • DPR (Distributed Prefix Recovery): Enables fast, fault-tolerant updates in distributed key-value stores while ensuring prefix durability through asynchronous, non-blocking rollback protocols. Learn more in our research paper at SIGMOD 2021.

  • DARQ: Introduces Composable Resilient Steps (CReSt) for building fault-tolerant cloud applications, where DARQ enforces step semantics with speculative execution that removes persistence bottlenecks while preserving fault-tolerance. Read our research paper from SIGMOD 2023.

  • DSE (Distributed Speculative Execution): Decouples durable execution from physical execution, allowing applications to bypass persistence overhead while automatically repairing state after failures, reducing latency by up to an order of magnitude. See our research paper.

  • SSMS (Serverless State Management Systems): Proposes a new cloud architecture that addresses fault-tolerance, deployment, and scaling complexities in serverless applications through logical application models and resilient programming primitives. Read our vision paper from CIDR 2024.

FishStore

We built FishStore, an open-source ingestion and storage layer for flexible- and fixed-schema datasets. FishStore lets you dynamically register complex predicates over the data to define interesting subsets of it. Such predicates are called PSFs (predicated subset functions). FishStore partially parses the ingested data (based on the active PSFs) in a fast, parallel, and micro-batched manner, and hash-indexes records for fast subsequent PSF-based retrieval. To accomplish its goals, FishStore leverages and extends the FASTER hash key-value store, and uses an unmodified parser interface for fast parsing (we use simdjson in many of our examples). The FishStore research paper appeared at SIGMOD 2019, and the system was demonstrated at VLDB 2019.
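Here is a minimal sketch of PSF-based indexing. It assumes a PSF is any function that maps a parsed record to a value, or to `None` when the record is outside the subset. Each (PSF, value) pair then acts as an index key that points at the addresses of the records producing it. The class and method names are hypothetical. FishStore builds its index on top of FASTER's hash index and log and parses only the fields that active PSFs need. This sketch instead keeps plain address lists in a dictionary and parses whole JSON records.

```python
import json
from collections import defaultdict
from typing import Any, Callable, Optional

PSF = Callable[[dict], Optional[Any]]  # record -> property value, or None if not in subset


class PsfStore:
    def __init__(self) -> None:
        self.log: list[str] = []   # raw records; the address is the list position
        self.psfs: dict[str, PSF] = {}
        self.index: defaultdict[tuple[str, Any], list[int]] = defaultdict(list)

    def register(self, name: str, psf: PSF) -> None:
        self.psfs[name] = psf      # only records ingested afterwards are indexed

    def ingest(self, raw: str) -> int:
        address = len(self.log)
        self.log.append(raw)
        record = json.loads(raw)   # full parse here; FishStore parses only needed fields
        for name, psf in self.psfs.items():
            value = psf(record)
            if value is not None:
                self.index[(name, value)].append(address)
        return address

    def retrieve(self, name: str, value: Any) -> list[dict]:
        return [json.loads(self.log[a]) for a in self.index.get((name, value), [])]
```

A PSF can be a simple field projection, such as `lambda r: r.get("repo")`, or a boolean predicate that returns `True` for members of the subset and `None` otherwise.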

ALEX

We built ALEX, an optimized learned range index that exploits machine learning to achieve very high performance while handling updates efficiently. ALEX uses a hierarchy of models to predict key locations, adaptively learning from the data distribution. Learn more from our research paper at SIGMOD 2020 and the open source implementation.
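At its core, a learned index uses a model to predict where a key sits in a sorted array, then runs a short local search to correct the prediction. The sketch below fits a single least-squares linear model and corrects its output with exponential search, the same correction step ALEX uses within its data nodes. ALEX's gapped arrays, model hierarchy, and insertion logic are not modeled, and `LearnedNode` is a hypothetical name.

```python
import bisect


class LearnedNode:
    """One sorted array plus a linear model that predicts a key's position."""

    def __init__(self, keys: list[float]) -> None:
        if not keys:
            raise ValueError("this sketch needs at least one key")
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position ~ slope * key + intercept.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k

    def predict(self, key: float) -> int:
        pos = int(self.slope * key + self.intercept)
        return min(max(pos, 0), len(self.keys) - 1)   # clamp to a valid index

    def lookup(self, key: float):
        """Return the index of `key`, or None, correcting the prediction by exponential search."""
        keys, pos = self.keys, self.predict(key)
        if keys[pos] == key:
            return pos
        step = 1
        if keys[pos] < key:                   # key can only be to the right
            while pos + step < len(keys) and keys[pos + step] < key:
                step *= 2
            lo, hi = pos + 1, min(pos + step, len(keys) - 1) + 1
        else:                                  # key can only be to the left
            while pos - step >= 0 and keys[pos - step] > key:
                step *= 2
            lo, hi = max(pos - step, 0), pos
        i = bisect.bisect_left(keys, key, lo, hi)  # binary search in the bracketed range
        return i if i < len(keys) and keys[i] == key else None
```

When the model fits the data well, the exponential search touches only a few slots. When the model is poor, as with skewed keys, lookups stay correct but cost more.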

Qd-Tree & Crystal

Towards simplifying storage for modern OLAP applications, we developed techniques to learn data partitioning strategies from a dataset and query workload using deep reinforcement learning. The Qd-Tree approach automatically discovers optimal data layouts that minimize query costs. Learn more from our research paper at SIGMOD 2020.
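Once a qd-tree has been learned, using it is simple. Each inner node holds a cut predicate, rows are routed down to leaf blocks, and a query reads only the blocks whose predicates can overlap it. The sketch below assumes cuts of the form `column < threshold` and half-open range queries. It does not show the reinforcement-learning search that chooses the cuts, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QdNode:
    column: Optional[str] = None        # None marks a leaf block
    threshold: float = 0.0              # rows with row[column] < threshold go left
    left: Optional["QdNode"] = None
    right: Optional["QdNode"] = None
    block_id: int = -1                  # only meaningful for leaves


def route(node: QdNode, row: dict) -> int:
    """Return the block a row is stored in."""
    while node.column is not None:
        node = node.left if row[node.column] < node.threshold else node.right
    return node.block_id


def blocks_for_query(node: QdNode, lo: dict, hi: dict) -> list[int]:
    """Blocks that may hold rows with lo[c] <= row[c] < hi[c] for every bounded column c."""
    if node.column is None:
        return [node.block_id]
    c, out = node.column, []
    if lo.get(c, float("-inf")) < node.threshold:   # query overlaps the left side
        out += blocks_for_query(node.left, lo, hi)
    if hi.get(c, float("inf")) > node.threshold:    # query overlaps the right side
        out += blocks_for_query(node.right, lo, hi)
    return out
```

For example, a tree that first cuts on `date < 100` and then on `region < 5` lets a query for `date >= 120` skip both left-hand blocks entirely.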

Building on Qd-Tree, we developed Crystal, a unified cache storage system for analytical databases operating in cloud environments. Modern cloud analytical databases use disaggregated storage, where compute nodes access large, remote, columnar data (such as Parquet on Amazon S3 and Azure Blob). Crystal acts as a smart, query-aware cache at the compute node, leveraging connections to database query engines and push-down predicates to efficiently cache frequently queried regions of tables. This leads to substantial improvements in query latencies and reduced bandwidth usage for frameworks like Spark. Learn more from our research paper at VLDB 2021.

CRA & Ambrosia

I am interested in distributed data processing and resilient microservice architectures. CRA is an open-source framework that makes it easy to author resilient distributed platforms and applications. Learn about CRA from our technical report and research paper at ICDE 2019. We used CRA to build Ambrosia, a virtually resilient microservices framework prototype based on robust exactly-once message delivery. Read about Ambrosia in our technical report and research paper at VLDB 2020.

Netherite

Netherite is a high-performance distributed workflow execution engine for Azure Durable Functions and the Durable Task Framework. Unlike the default Azure Storage backend which relies on many small queue and table operations, Netherite uses Azure Event Hubs for ordered stream-based messaging and stores partition state using an immutable log with checkpoints in Azure Page Blobs. This architecture enables over an order of magnitude improvement in throughput and latency compared to the default storage provider. Netherite leverages our FASTER technology for high-performance in-memory operations with efficient persistence. Netherite is open source and can be used as a drop-in backend replacement for existing Durable Functions applications. Read about the technology in our research paper presented at VLDB 2022.
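Storing partition state as an immutable log plus checkpoints makes recovery a two-step process: load the latest checkpoint, then replay the part of the log written after it. The toy model below shows that generic pattern for a map of counters. The names are hypothetical, and the model leaves out Event Hubs, Page Blobs, and FASTER entirely.

```python
class PartitionState:
    """Toy partition: an append-only event log plus periodic checkpoints."""

    def __init__(self) -> None:
        self.log: list[tuple[str, int]] = []          # immutable events (key, delta)
        self.state: dict[str, int] = {}               # live in-memory state
        self.checkpoint: tuple[dict[str, int], int] = ({}, 0)  # (snapshot, log position)

    def apply(self, key: str, delta: int) -> None:
        self.log.append((key, delta))                 # record the event first
        self.state[key] = self.state.get(key, 0) + delta

    def take_checkpoint(self) -> None:
        self.checkpoint = (dict(self.state), len(self.log))

    def recover(self) -> dict[str, int]:
        """Rebuild state: load the snapshot, then replay events logged after it."""
        snapshot, position = self.checkpoint
        state = dict(snapshot)
        for key, delta in self.log[position:]:
            state[key] = state.get(key, 0) + delta
        return state
```

Checkpoints bound recovery time because only the log suffix after the last snapshot has to be replayed.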

Trill & Quill

Starting in 2012, I led the creation of Trill, a high-performance incremental analytics engine (for stream processing) built as a C# .NET library. Trill employs a “one-size-fits-many” system architecture that provides best-of-breed or better performance across a diverse range of analytics styles and latency needs. You can learn more about Trill from the research paper at VLDB 2015 or from my VLDB slides. Trill is open source, and usage samples are available. Visit the project website for more information. Quill is a distributed system that leverages Trill for large-scale temporal analytics, presented at VLDB 2016.

Prior to Trill, I worked on stream processing research in areas such as out-of-order processing, recursive streaming, pattern detection, latency estimation, unifying real-time and offline analytics, and progressive offline analytics and visualization. My work on streams shipped commercially as part of Microsoft SQL Server, as the StreamInsight engine.

Mison & Parquet-Select

I am interested in data processing over raw data such as CSV and JSON, where the schema may not always be known a priori. A core building block for raw data processing is parsing. My colleague and I built a SIMD-based parser for raw data, called Mison. You can read about Mison in our research paper at VLDB 2017. We also designed a technique that makes it possible to parse large CSV files in parallel. This work appeared at SIGMOD 2019.
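Mison's central idea is to build bitmaps of structural characters and locate fields with bitwise operations instead of parsing byte by byte. The sketch below uses Python integers as bitmaps, one bit per input byte, to find the colons that separate keys from values. It masks out colons inside strings with a prefix XOR over the quote bitmap. The sketch assumes a flat object with no escaped quotes and builds the bitmaps with a plain loop where Mison uses SIMD. The function names are hypothetical.

```python
def char_bitmap(data: bytes, ch: str) -> int:
    """Bit i is set when data[i] == ch (Mison builds these with SIMD)."""
    target = ord(ch)
    bits = 0
    for i, b in enumerate(data):
        if b == target:
            bits |= 1 << i
    return bits


def string_mask(quote_bits: int, n: int) -> int:
    """Bit i is set when byte i is inside a string: a prefix XOR of the quote bits."""
    mask, inside = 0, 0
    for i in range(n):
        inside ^= (quote_bits >> i) & 1
        if inside:
            mask |= 1 << i
    return mask


def colon_positions(data: bytes) -> list[int]:
    """Offsets of key/value colons, ignoring colons inside strings (no escapes assumed)."""
    quotes = char_bitmap(data, '"')
    colons = char_bitmap(data, ":") & ~string_mask(quotes, len(data))
    positions = []
    while colons:
        lowest = colons & -colons              # isolate the lowest set bit
        positions.append(lowest.bit_length() - 1)
        colons ^= lowest                       # clear it and continue
    return positions
```

In `{"a:b": 1, "c": 2}`, the colon inside the key `"a:b"` is masked out, leaving only the two structural colons at offsets 6 and 14.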

Following Mison, we developed Parquet-Select, a fast select operator that extracts only the values matching a user’s filter directly from encoded, compressed columnar data. Parquet-Select leverages Bit Manipulation Instructions (BMI) on modern x86 CPUs to implement predicate pushdown without requiring full data decoding. This approach demonstrates up to an order of magnitude speedup in micro-benchmarks and up to 5.5x end-to-end improvement in Apache Spark queries. Learn more from our research paper at SIGMOD 2023.
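The BMI2 instruction PEXT gathers the bits of a word selected by a mask into the low-order bits of its result. That is the operation needed to compact selected values out of a bit-packed column. The sketch below emulates PEXT in software and applies it to four 8-bit values packed into one word. Building the extraction mask with a plain loop is a simplification, and the function names are hypothetical rather than Parquet-Select's API.

```python
def pext(value: int, mask: int) -> int:
    """Software emulation of BMI2 PEXT: pack the bits of value selected by mask into the low bits."""
    result, out_bit = 0, 0
    while mask:
        lowest = mask & -mask          # next selected bit position, low to high
        if value & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask ^= lowest
    return result


def select_packed(word: int, bit_width: int, selected: list[bool]) -> list[int]:
    """Extract the selected fixed-width values from a bit-packed word."""
    lane = (1 << bit_width) - 1
    mask = 0
    for i, keep in enumerate(selected):
        if keep:
            mask |= lane << (i * bit_width)   # mark every bit of a selected value
    compact = pext(word, mask)
    return [(compact >> (i * bit_width)) & lane for i in range(sum(selected))]
```

On hardware, a single PEXT instruction replaces the loop, so compacting an entire word of selected values takes constant time regardless of how many values it holds.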


© 2025. All rights reserved.