What the Heck is SmithDB?

LangChain built SmithDB, a purpose-built distributed database in Rust that now powers LangSmith’s agent observability. Modern AI agent traces have gotten too big and complex for general-purpose databases, with hundreds of nested spans, multi-modal content, and spans that stay open for hours. SmithDB uses an object-storage-backed LSM architecture built on Apache DataFusion and the Vortex file format, delivering P50 latencies of 92ms for trace tree loads and making LangSmith up to 15x faster. It is not a standalone product you install; it is the new data layer behind LangSmith, and it is already serving 100% of US Cloud ingestion and tracing UI traffic.


This content originally appeared on HackerNoon and was authored by Shawn Gordon

Introduction

If you have spent any time building with LLMs over the last couple of years, you have probably heard of LangChain. Their open-source framework for building LLM applications is one of the most popular in the space, and their commercial product, LangSmith, is a platform for observing, evaluating, and deploying agents. What you might not have heard about yet is SmithDB, which LangChain announced at their Interrupt conference on May 13, 2026. It is a purpose-built distributed database for agent observability, and it solves a problem that I think a lot of people building production agents are going to recognize immediately.

The problem statement is basically that agent traces have gotten too big and too complex for general-purpose databases to handle well. When LangSmith first launched in 2023, most people were building simple RAG pipelines and prompt chains. The traces generated by those applications were small and predictable. Now in 2026, agents are running for hours, making hundreds of nested tool calls, and generating traces with multi-modal content like images and audio. A single modern agent trace can be megabytes of deeply nested data, and a span might start minutes or even hours before it finishes. Traditional observability stores were never designed for that kind of workload. All the images I used for SmithDB came from LangSmith.

Payload growth (credit LangSmith)

Let's Dive In

SmithDB is built in Rust and uses the Apache DataFusion query engine and the Vortex columnar file format, with heavy customizations for LangSmith's specific workload patterns. If you are not familiar with those tools, DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format. Vortex is a newer columnar file format that positions itself as an aspirational successor to Apache Parquet, with significantly faster random access reads and scans. I’ve had Vortex on my list of things to look at specifically, but hadn’t had a chance yet.

At its core, SmithDB is an object-storage-backed log-structured merge tree (LSM). It initially made me think of SlateDB, but on further review, they are pretty different. The architecture has three main pieces: object storage for durable trace data, a small Postgres metastore for segment metadata, and stateless ingestion, query, and compaction services. Because the query and ingestion services are stateless, scaling is just a matter of adding compute rather than managing local disks. That is a big deal for enterprises that need to self-host or deploy across multiple clouds. I borrowed the architecture diagram from LangSmith.

SmithDB Architecture (credit LangSmith)
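
To make the three pieces concrete, here is a minimal Python sketch of how I picture them fitting together. To be clear, this is not SmithDB's code: the dictionaries stand in for object storage and the Postgres metastore, and every name in it (SegmentMeta, Cluster, ingest, compact, query) is my own invention for illustration. The point is only that all durable state lives in the two shared stores, so the service functions carry nothing between calls.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class SegmentMeta:
    """Metastore row describing one immutable segment (hypothetical schema)."""
    key: str        # object-storage key of the segment
    min_ts: int     # earliest timestamp in the segment
    max_ts: int     # latest timestamp in the segment
    level: int = 0  # 0 = freshly ingested, 1 = compacted


@dataclass
class Cluster:
    """All durable state; the service functions below keep none of their own."""
    object_store: dict = field(default_factory=dict)  # stand-in for S3/GCS
    metastore: list = field(default_factory=list)     # stand-in for Postgres
    ids: object = field(default_factory=itertools.count)  # unique segment ids


def ingest(cluster, rows):
    """Stateless ingestion: write one sorted, immutable segment, then register it."""
    if not rows:
        raise ValueError("cannot write an empty segment")
    rows = sorted(rows, key=lambda r: r["ts"])
    key = f"segments/L0-{next(cluster.ids):06d}"
    cluster.object_store[key] = rows
    cluster.metastore.append(SegmentMeta(key, rows[0]["ts"], rows[-1]["ts"]))
    return key


def compact(cluster):
    """Stateless compaction: merge all level-0 segments into one level-1 segment."""
    small = [m for m in cluster.metastore if m.level == 0]
    if len(small) < 2:
        return None
    merged = sorted((r for m in small for r in cluster.object_store[m.key]),
                    key=lambda r: r["ts"])
    key = f"segments/L1-{next(cluster.ids):06d}"
    cluster.object_store[key] = merged
    # Swap the metadata over to the merged segment, then drop the old objects.
    cluster.metastore[:] = [m for m in cluster.metastore if m.level != 0]
    cluster.metastore.append(
        SegmentMeta(key, merged[0]["ts"], merged[-1]["ts"], level=1))
    for m in small:
        del cluster.object_store[m.key]
    return key


def query(cluster, start_ts, end_ts):
    """Stateless query: check metadata first, fetch only overlapping segments."""
    hits = []
    for meta in cluster.metastore:
        if meta.max_ts < start_ts or meta.min_ts > end_ts:
            continue  # pruned without touching object storage
        hits.extend(r for r in cluster.object_store[meta.key]
                    if start_ts <= r["ts"] <= end_ts)
    return sorted(hits, key=lambda r: r["ts"])
```

Because each function takes the shared state as an argument and returns, you could run any number of copies of them side by side, which is the "just add compute" property described above.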

The performance numbers are worth noting. SmithDB delivers P50 latencies of 92ms for trace tree loads, 71ms for single run loads, 82ms for run filtering, and 400ms for full-text search. LangChain claims this makes core LangSmith experiences 12 to 15 times faster than before, and the early customer feedback from companies like Clay, Vanta, and Cogent Security seems to back that up. Clay, for example, logs hundreds of millions of observability events per day and reported that the performance improvements were immediately noticeable.

What Makes It Different

There are a few engineering decisions here that I think are genuinely interesting and worth understanding.

The first is how SmithDB handles progressive querying over object storage. Most LangSmith queries want the newest data for a given project. A naive approach would scan all candidate files, sort-merge everything, then apply a limit. SmithDB instead walks backward through time, builds a bounded window over the newest segments, and stops as soon as it has enough data. That turns an expensive "sort everything, then limit" operation into a much cheaper bounded scan.

Query windowing (credit LangSmith)

Query Process (credit LangSmith)
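
Here is a small Python sketch of the "walk backward and stop early" idea. It is my own simplification, not SmithDB's implementation: I am assuming each segment's newest timestamp is known from metadata, and the Segment class and newest_rows function are hypothetical names. Reading a segment's rows stands in for an object-storage request, so the read counter is what you want to keep small.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class Segment:
    max_ts: int  # newest timestamp in the segment, known from metadata
    rows: list   # (ts, payload) pairs; reading them stands in for a GET


def newest_rows(segments, limit):
    """Return the `limit` newest rows and how many segments were read."""
    if limit <= 0:
        return [], 0
    window = []              # min-heap of the newest `limit` rows so far
    seq = itertools.count()  # tie-breaker so payloads are never compared
    reads = 0
    # Visit segments from newest to oldest, based on metadata only.
    for seg in sorted(segments, key=lambda s: s.max_ts, reverse=True):
        # Stop once the oldest row in a full window is at least as new as
        # anything the remaining, older segments could contain.
        if len(window) == limit and window[0][0] >= seg.max_ts:
            break
        reads += 1
        for ts, payload in seg.rows:
            item = (ts, next(seq), payload)
            if len(window) < limit:
                heapq.heappush(window, item)
            elif ts > window[0][0]:
                heapq.heapreplace(window, item)
    newest_first = sorted(window, reverse=True)
    return [(ts, payload) for ts, _, payload in newest_first], reads


segments = [
    Segment(50, [(40, "f"), (50, "g")]),
    Segment(100, [(95, "a"), (100, "b"), (98, "c")]),
    Segment(90, [(85, "d"), (90, "e")]),
]
rows, reads = newest_rows(segments, limit=3)
print(rows, reads)  # three newest rows, found after reading one segment
```

The naive version would read all three segments and sort seven rows to return three. The bounded version reads one segment here, and the gap only widens as a project accumulates more history.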

The second is how it handles the fact that agent spans are not point-in-time events. In traditional request/response applications, a span starts and finishes in milliseconds. Agent spans can stay open for a long time while the agent makes tool calls, retries, or hands off to other agents. SmithDB treats a run as a sequence of events rather than a single immutable row. That sounds simple, but it affects the entire query engine, from how filters are applied to how compaction works.
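
A quick sketch of what "a run is a sequence of events" can look like in practice. The event kinds (start, tool_call, end) and field names here are assumptions I made up for illustration, since the announcement does not spell out SmithDB's event schema. The idea is that a run's current state is derived by folding over whatever events have arrived so far, so a run that is still open is perfectly queryable.

```python
from collections import defaultdict


def materialize(events):
    """Fold events into the current state of each run (hypothetical schema)."""
    runs = defaultdict(lambda: {
        "status": "unknown", "start_ts": None, "end_ts": None, "tool_calls": 0,
    })
    # Events may arrive out of order, so apply them in timestamp order.
    for ev in sorted(events, key=lambda e: e["ts"]):
        run = runs[ev["run_id"]]
        if ev["kind"] == "start":
            run.update(status="running", start_ts=ev["ts"])
        elif ev["kind"] == "tool_call":
            run["tool_calls"] += 1
        elif ev["kind"] == "end":
            run.update(status=ev.get("status", "success"), end_ts=ev["ts"])
    return dict(runs)


events = [
    {"run_id": "r1", "kind": "start", "ts": 0},
    {"run_id": "r2", "kind": "start", "ts": 10},
    {"run_id": "r2", "kind": "tool_call", "ts": 20},
    {"run_id": "r2", "kind": "tool_call", "ts": 30},
    {"run_id": "r1", "kind": "tool_call", "ts": 60},
    {"run_id": "r1", "kind": "end", "ts": 7200},  # r1 closes two hours later
]

state = materialize(events)
still_open = [rid for rid, run in state.items() if run["end_ts"] is None]
print(still_open)  # r2 has no end event yet, but it still shows up in queries
```

Feed the same function only the events up to a given timestamp and you get the state of every run as of that moment, which is a big part of why the event model touches filtering and compaction as well as storage.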

The third is late materialization of large fields. Agent traces contain big payloads, sometimes megabytes of JSON from tool outputs and LLM responses. SmithDB separates core run fields from these large fields, keeping only pointers in the main rows. The query engine only fetches the full payload when you actually open a specific run or explicitly ask for those fields. That means loading a list of runs or applying filters stays fast because you are not reading megabytes of data you do not need.
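
Late materialization is easy to show in a few lines of Python. This is a toy model with names I invented (PayloadStore, write_run, list_runs, open_run), not SmithDB's storage layout. The core rows carry only small fields plus a pointer, and the read counter on the payload store shows when the big blobs are actually touched.

```python
class PayloadStore:
    """Stand-in for the separate storage that holds large fields."""

    def __init__(self):
        self._blobs = {}
        self.reads = 0  # counts how often a large payload is fetched

    def put(self, key, blob):
        self._blobs[key] = blob

    def get(self, key):
        self.reads += 1
        return self._blobs[key]


def write_run(core_rows, store, run):
    """Split a run into a small core row and a separately stored payload."""
    ref = f"payloads/{run['id']}"
    store.put(ref, run["outputs"])
    core_rows.append({
        "id": run["id"],
        "name": run["name"],
        "latency_ms": run["latency_ms"],
        "outputs_ref": ref,  # pointer only, never the payload itself
    })


def list_runs(core_rows, min_latency_ms):
    """Filtering and listing read core rows only."""
    return [r for r in core_rows if r["latency_ms"] >= min_latency_ms]


def open_run(core_rows, store, run_id):
    """Opening one run is the moment its payload gets materialized."""
    row = next(r for r in core_rows if r["id"] == run_id)
    return {**row, "outputs": store.get(row["outputs_ref"])}


core_rows, store = [], PayloadStore()
write_run(core_rows, store, {"id": "r1", "name": "search_tool",
                             "latency_ms": 900, "outputs": "x" * 1_000_000})
write_run(core_rows, store, {"id": "r2", "name": "llm_call",
                             "latency_ms": 120, "outputs": "y" * 2_000_000})

slow = list_runs(core_rows, min_latency_ms=500)
print([r["id"] for r in slow], store.reads)  # the filter fetched no payloads
detail = open_run(core_rows, store, "r1")
print(len(detail["outputs"]), store.reads)   # one payload fetched on demand
```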

There is also a custom inverted index layout optimized for object storage that powers sub-second full-text search and JSON key-path filtering. On local disk, an index can rely on cheap random seeks. On object storage, that pattern falls apart because every unnecessary request adds latency. SmithDB's index layout uses term-sorted row groups with min/max zones so it can prune aggressively before fetching any postings data.
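
The zone-pruning idea can be sketched like this. Again, this is a toy version under my own assumptions, not SmithDB's index format: I am using a whitespace tokenizer, a fixed number of terms per row group, and the hypothetical function names build_index and search. Each row group records the smallest and largest term it holds, and a lookup skips any group whose range cannot contain the term, so it never pays for that group's postings.

```python
def build_index(docs, group_size=3):
    """Build term-sorted row groups, each with a min/max term zone."""
    postings = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)
    terms = sorted(postings)
    groups = []
    for i in range(0, len(terms), group_size):
        chunk = terms[i:i + group_size]
        groups.append({
            "min": chunk[0],   # zone metadata, cheap to keep around
            "max": chunk[-1],
            "rows": [(t, sorted(postings[t])) for t in chunk],  # the costly part
        })
    return groups


def search(groups, term):
    """Return matching doc ids and how many row groups had to be fetched."""
    fetched = 0
    for group in groups:
        if term < group["min"] or term > group["max"]:
            continue  # pruned using the zone alone, no postings read
        fetched += 1
        for t, doc_ids in group["rows"]:
            if t == term:
                return doc_ids, fetched
    return [], fetched


docs = {
    1: "timeout calling search tool",
    2: "search tool returned results",
    3: "agent retry after timeout",
}
index = build_index(docs)
print(search(index, "timeout"))  # both matching runs, one row group fetched
print(search(index, "zebra"))    # no match, and nothing fetched at all
```

Because the terms are sorted before they are grouped, the zones never overlap, so a lookup touches at most one row group no matter how large the index gets.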

Trying It Out

SmithDB is not a standalone product that you download and install separately. It is the new data layer powering LangSmith. As of the announcement, 100% of US Cloud ingestion and 100% of tracing UI query traffic runs through SmithDB. All major filters, including metadata, feedback, text search, tree filters, and trace filters, are backed by SmithDB. If you are already using LangSmith, you are already using SmithDB, whether you realized it or not.

For self-hosted deployments, SmithDB is not yet available but is supposed to be coming soon. Given the object-storage-backed architecture, self-hosting should be straightforward since there are no local disks to manage and no complex sharding to configure.

If you want to try LangSmith itself, you can sign up at smith.langchain.com. The platform is framework-agnostic, so you do not need to be using LangChain or LangGraph to take advantage of it. It works with the OpenAI SDK, the Anthropic SDK, the Vercel AI SDK, LlamaIndex, and custom implementations via OpenTelemetry.

Summary

The broader trend here is interesting. As AI agents get more complex and longer-running, the infrastructure around them needs to evolve too. Traditional APM and observability tools were built for request/response workloads where a span lasts milliseconds. Agent observability is a fundamentally different problem with deeply nested traces, multi-modal content, and spans that can stay open for hours. The fact that LangChain felt the need to build an entirely new database to solve this shows how much the shape of the data has changed and how much new tooling this AI world is going to need.

So, what the heck is SmithDB? It is the purpose-built data layer behind LangSmith, designed from the ground up to handle the unique challenges of agent observability at scale. Built in Rust on top of Apache DataFusion and Vortex, it uses an object-storage-backed LSM architecture that delivers P50 trace loads under 100ms and sub-second full-text search. It is not something you install on its own, but if you are using LangSmith, it is what makes everything fast. And given how much agent trace data is growing, having a purpose-built engine for it feels less like a luxury and more like a necessity.

Check out my other What the Heck is… articles at the links below:




