That substrate unlocks a family of frontier problems. Loss-bounded compression that preserves analytic fidelity. Sub-millisecond subselection that skips 99.9% of blocks. Generative augmentation for rare-event inference. Retrieval and indexing that exploit grid-aware attention. Probabilistic execution plans with deterministic fallbacks, all under continual learning from live traffic.

Papers and patents from the lab.
Each entry is the public artifact of a line of work that informs the platform. Venues are listed where peer-reviewed. Patents are listed with their USPTO grant number.
Currently chasing
- OPEN-01: Loss-bounded compression. Squeeze data to within a breath of Shannon while preserving analytic fidelity downstream.
- OPEN-02: Sub-millisecond subselection. Skip 99.9% of blocks at query time without giving up a single qualifying row.
- OPEN-03: Generative augmentation. Synthesize rare-event data that genuinely improves inference rather than memorizing the tail.
- OPEN-04: Grid-aware retrieval. Indexing and attention that exploit storage-grid locality across exabyte lakes.

- Train on Validation (ToV): Fast data selection with applications to fine-tuning. ICLR 2026. Read paper
- Scaling laws for learning with real and surrogate data. NeurIPS 2024. Read paper
- Towards a statistical theory of data selection under weak supervision. ICLR 2024. Read paper
- Scaling training data with lossy image compression. KDD 2024. Read paper
- Compressing tabular data via latent variable estimation. ICML 2023. Read paper
- Sampling, diffusion, and stochastic localization. Preprint. Read paper
- Inline data detection in large data streams. USPTO patent. Read patent
- Efficient data deduplication through sketch computation and similarity metrics. USPTO patent. Read patent
Footnotes.
- ¹ Venue abbreviations. NeurIPS is the Conference on Neural Information Processing Systems. ICLR is the International Conference on Learning Representations. KDD is the Conference on Knowledge Discovery and Data Mining. ICML is the International Conference on Machine Learning.
- ² Patents are listed with their USPTO grant number. Read patent links resolve to the public USPTO image archive and may render slowly.
- ³ Author lines are abbreviated. Full author attribution lives on the linked artifact, where the paper, conference proceedings, or patent record carries the authoritative byline.
If turning entropy into intelligence at exabyte scale sounds like research you want to stretch.
The lab works in tight collaboration with the platform teams. Open problems range from loss-bounded compression to grid-aware retrieval to probabilistic plans. We hire researchers who want a production substrate to test their ideas against.