DuckDB lets data folks use familiar tools on their own computers, which keeps them happy and efficient.

Struggles with cloud providers and common data stacks

I prefer to work on a capable computer that I control at the operating system level. This way, I have my favorite tools at my disposal, can lift all the curtains when I run into issues, and, crucially, when data sits close to the CPU and the code I write, I don't have to wait as long.

This preference is incompatible with most cloud providers, as you often cannot run their services natively on your own machine. Additionally, the problem spaces I work in commonly involve data stacks meant to run on clusters, with complicated setup processes and operations.

As a result, I often surrender the idea of running things on my own computer, as working on the hosted services tends to be the least frustrating experience overall. Instead of writing code and starting local processes, I package the code into artifacts and submit jobs to remote services.

And everything is good, right? No, because I cannot use my preferred tools once the code ships elsewhere to run, and because the feedback loops are long. I have grown fond of specific editors, utilities, and ways of debugging and grappling with data over a long time. My hardware is perfectly capable. I don't see the upside of learning to solve problems in new but less efficient ways.

DuckDB to the rescue?

DuckDB is an in-process database management system for analytics. It is helpful for local data exploration and wrangling. If you’re fluent in SQL but don’t know the ins and outs of Pandas or derivatives, DuckDB allows you to write SQL with great Python interoperability.
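To make this concrete, here is a minimal sketch of that interoperability, assuming the duckdb and pandas packages are installed; the DataFrame and its columns are invented for illustration:

```python
import duckdb
import pandas as pd

# An invented example table; any DataFrame in scope works the same way.
orders = pd.DataFrame({
    "customer": ["alice", "bob", "alice"],
    "amount": [120.0, 80.0, 45.5],
})

# DuckDB's replacement scans let SQL reference the DataFrame by its
# Python variable name, with no registration step required.
result = duckdb.sql("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
""").df()  # ...and the result comes back as a DataFrame.

print(result)
```

The round trip matters: you can drop into SQL for the heavy lifting and return to Pandas whenever a library expects a DataFrame.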

R2 as a repository for data

R2 is Cloudflare's flavor of object storage, and it exposes an S3-compatible API. I already confessed my love for local development, so why bring internet buckets into the mix? For the sake of collaboration, and to have somewhere for production systems to send data continuously.

R2 pairs nicely with local development because egress is free. Egress costs could be substantial if you regularly download datasets to your laptop, workstation, servers, or other devices not managed by the service provider that stores your data.
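To sketch what this combination looks like in practice: DuckDB's httpfs extension can be pointed at R2's S3-compatible endpoint, so Parquet files become queryable in place. The account ID, bucket, credentials, and object path below are placeholders:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# R2-specific settings: the 'auto' region and path-style URLs.
con.execute("SET s3_endpoint='<ACCOUNT_ID>.r2.cloudflarestorage.com'")
con.execute("SET s3_access_key_id='<R2_ACCESS_KEY_ID>'")
con.execute("SET s3_secret_access_key='<R2_SECRET_ACCESS_KEY>'")
con.execute("SET s3_region='auto'")
con.execute("SET s3_url_style='path'")

# Query Parquet files in the bucket without downloading them first.
df = con.execute(
    "SELECT COUNT(*) FROM read_parquet('s3://my-bucket/events/*.parquet')"
).df()
print(df)
```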

Even if you run production systems on a major cloud provider with high egress costs, you could benefit from replicating data to R2 if you expect frequent downloads. See my other post for examples of how much you could save.

Reflections

After experimenting with DuckDB and comparing it with AWS Athena for querying Parquet files on a large table, I have some opinions.

It's hard to get excited about object storage, so let's start there: R2 does what it says on the tin. I had hoped for better performance when querying Parquet files on R2 directly from DuckDB, however.

For ad-hoc exploration, notebooks, and development in Python, I find DuckDB superb. When I ran into issues or questions, I got input on their Discord server from Mark Raasveldt, a co-founder of the backing company DuckDB Labs.

I recommend evaluating DuckDB for production workloads if data volume and query resource usage are not concerns. During my experiments, I encountered out-of-memory errors and occasionally saw temporary storage fill up.
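If you evaluate it anyway, DuckDB exposes a few settings that may keep those failure modes in check; a minimal sketch, with illustrative values to adapt to your own hardware:

```python
import duckdb

con = duckdb.connect()
con.execute("SET memory_limit='8GB'")             # cap memory before things fall over
con.execute("SET temp_directory='/mnt/scratch'")  # spill to a disk with room to spare
```

Some operators can spill to the temporary directory when the memory limit is reached, but as noted above, temporary storage can fill up too.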

So when does data volume become a problem for DuckDB? As of the latest version, 0.7.1, I'd say when the data doesn't fit in memory, simply because that's when I started to run into issues. See my other post if you're interested in pricing and scalability aspects.
