Recent Projects

SimpleStore - Simplifying Storage

I work on an umbrella project called SimpleStore, which aims to simplify access to storage in modern cloud, edge, actor, and serverless environments. See my overview slides here. In the context of SimpleStore, I have worked on several research projects, described below.

FASTER (2017-now)

Managing large and frequently updated application state efficiently and reliably is a hard problem in the cloud and at the edge. Drawing on lessons from the state management challenges in Trill, I designed and built a new high-performance key-value store called FASTER. FASTER bridges the gap between larger-than-memory and pure in-memory data structures using a novel hybrid log organization. FASTER is now open source, available in C# and C++, with more than 3300 stars on GitHub. Read about the technology in our research paper and on the project website.

FASTER employs a new scalable recovery model called Concurrent Prefix Recovery (CPR), which is also applicable to traditional databases, and avoids the overhead of a separate write-ahead log. Learn about CPR in our new research paper.

We are currently working on extensions to FASTER and CPR to make them work better in distributed and serverless environments.

FishStore (2019-now)

Recently, we built FishStore, an ingestion and storage layer for flexible- and fixed-schema datasets, available as open source here. It allows you to dynamically register complex predicates over the data in order to define interesting subsets of it. Such predicates are called PSFs (for predicated subset functions). FishStore performs partial parsing of the ingested data (based on active PSFs) in a fast, parallel, and micro-batched manner, and hash-indexes records for subsequent fast PSF-based retrieval. To accomplish its goals, FishStore leverages and extends the FASTER hash key-value store, and uses an unmodified parser interface for fast parsing (we use simdjson in many of our examples). The FishStore research paper appeared at SIGMOD 2019 and the system was demonstrated at VLDB 2019.
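
To make the PSF idea concrete, here is a minimal Python sketch of registering predicates and indexing records by them. It is a toy model, not the FishStore API: the class `ToyPsfStore`, its methods, and the example field names are all hypothetical, and it assumes a PSF can be modeled as a function that returns a value for records it selects and `None` otherwise. Unlike FishStore, it fully parses every record with the standard `json` module and runs single-threaded.

```python
import json
from collections import defaultdict


class ToyPsfStore:
    """Toy illustration of PSF-based indexing (hypothetical, not the FishStore API)."""

    def __init__(self):
        self.log = []                    # append-only list of raw records
        self.psfs = {}                   # PSF name -> function(record dict) -> value or None
        self.index = defaultdict(list)   # (PSF name, value) -> offsets into the log

    def register_psf(self, name, func):
        # Only records ingested after registration are indexed under this PSF.
        self.psfs[name] = func

    def ingest(self, raw_json):
        # Full parse for simplicity; a real system would parse only the fields
        # that the active PSFs need.
        record = json.loads(raw_json)
        offset = len(self.log)
        self.log.append(raw_json)
        for name, func in self.psfs.items():
            value = func(record)
            if value is not None:        # None: record is not selected by this PSF
                self.index[(name, value)].append(offset)
        return offset

    def scan(self, name, value):
        # Retrieve raw records whose PSF `name` evaluated to `value`.
        return [self.log[i] for i in self.index.get((name, value), [])]


store = ToyPsfStore()
store.register_psf("by_repo", lambda r: r.get("repo"))
store.register_psf(
    "big_push",
    lambda r: True if r.get("type") == "push" and r.get("size", 0) > 100 else None,
)
store.ingest('{"type": "push", "size": 500, "repo": "faster"}')
store.ingest('{"type": "push", "size": 10, "repo": "faster"}')
store.ingest('{"type": "issue", "repo": "trill"}')

big_pushes = store.scan("big_push", True)
faster_records = store.scan("by_repo", "faster")
```

In this sketch, each `(PSF name, value)` pair identifies one subset of the ingested records, so a lookup touches only the records in that subset rather than the whole log.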

ML for Storage (2018-now)

Towards simplifying storage for modern OLAP applications, we recently developed techniques to learn data partitioning strategies from a dataset and query workload, using deep reinforcement learning. This work is under submission, and more details will be posted in the future.

We have also built a new learned range index called ALEX, which uses machine learning to deliver very high performance while still handling updates efficiently. This work, also under submission, is accessible via arXiv.

CRA & Ambrosia - Simplifying Compute

I am interested in distributed data processing and resilient microservice architectures. CRA is my open-source framework that makes it easy to author resilient distributed platforms and applications such as Quill - a distributed temporal analytics system. Learn about CRA from our technical report and from our paper in ICDE 2019. Recently, we used CRA to build Ambrosia, a microservices framework based on robust exactly-once message delivery. Read about Ambrosia in our technical report and our research paper in VLDB 2020.

Stream Processing

Starting in 2012, I led the creation of Trill, a high-performance incremental analytics engine built as a C# .NET library. Trill employs a new “one-size-fits-many” system architecture that provides best-of-breed or better performance across a diverse range of analytics styles and latency needs. You can learn more about Trill from the research paper or from my VLDB slides. Trill is now open source (usage samples). Visit the project website for more information on Trill. Quill is a distributed system that leverages Trill for large-scale temporal analytics.

Prior to Trill, I worked on stream processing research in areas such as out-of-order processing, recursive streaming, pattern detection, latency estimation, unifying real-time and offline analytics, and progressive offline analytics and visualization. My work on streams shipped commercially as part of Microsoft SQL Server, as the StreamInsight engine.

Raw Data Parsing and Storage

I am interested in data processing over raw data such as CSV and JSON, where the schema may not always be known a priori. A core building block for raw data processing is parsing. My colleague Yinan led the creation of a SIMD-based parser for raw data, called Mison. You can read about Mison in our research paper. We also recently designed a new technique that makes it possible to parse large CSV files in parallel. This work appeared in SIGMOD 2019.


© 2021. All rights reserved.