Costs to run analytics on dedicated servers compared to AWS

If you don’t already care about DuckDB, go ahead and read my other post first. In this post, I explore what DuckDB could mean for your hosting bill.

When Sweden is your market

Suppose we’re building a product for the domestic market in Sweden. If it’s a B2B product, that’s at most 618,000 corporations.

Let’s say we have 500 GB of compressed data that we need to continuously query and report on. If every Swedish corporation were our customer, that’s roughly 800 KB of compressed data per customer, maybe 3 MB decompressed.
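For the record, here is the back-of-the-envelope arithmetic behind those per-customer figures; the decompression factor of roughly 4x is my assumption to match the ~3 MB estimate.

```python
# Back-of-the-envelope: data volume per customer if every Swedish
# corporation were a customer, using the figures from this post.
corporations = 618_000            # potential B2B customers in Sweden
compressed_total_gb = 500         # compressed dataset size
decompression_factor = 4          # assumed ratio, ~800 KB -> ~3 MB

compressed_per_customer_kb = compressed_total_gb * 1_000_000 / corporations
decompressed_per_customer_mb = compressed_per_customer_kb * decompression_factor / 1_000

print(f"~{compressed_per_customer_kb:.0f} KB compressed per customer")      # ~809 KB
print(f"~{decompressed_per_customer_mb:.1f} MB decompressed per customer")  # ~3.2 MB
```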

The use case we have in mind is internal analytics or BI, where we need to see how customers compare and thus need to join or summarize data from all of them. If we were to provide an analytics-heavy product to our customers, we could scale beyond a few megabytes per customer by avoiding multitenancy.

Dedicated server costs

Say that we need 1 TB of RAM to query our dataset efficiently using DuckDB for internal analytics where one customer equals one corporation. Why so much RAM? Because I started to run into issues when I ran out of it.

We could go for a Hetzner AX161 dedicated server with 1 TB of RAM and 2x 8 TB NVMe SSDs for €850 per month. That gives roughly 1.5 MB of in-memory data for every Swedish corporation in the same box.

The extra disk space gives us plenty of room for derived tables, various staging areas, and such.

Also, chances are that we won’t need this much RAM. Our query patterns could be different, we could work around the limitations I encountered by making a series of less memory-intensive queries, or later releases of DuckDB (I used version 0.7.1) might optimize hash aggregates on partitions and other queries. If we bring the RAM down to 256 GB on the AX161 configuration, the cost is €395 per month. If we’re okay with Arm and half the amount of disk, the Hetzner RX220 dedicated server is just €260 monthly.
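To make “a series of less memory-intensive queries” concrete, here is a rough sketch using DuckDB’s Python API: cap the memory limit, let temporary data spill to the NVMe disks, and aggregate one partition at a time instead of in one giant hash aggregate. The table and column names (events, month, customer_id, amount) are hypothetical.

```python
import duckdb

con = duckdb.connect("analytics.duckdb")
# Cap DuckDB's memory use and let temporary data spill to local NVMe.
con.execute("SET memory_limit = '200GB'")
con.execute("SET temp_directory = '/mnt/nvme/duckdb_tmp'")

con.execute("""
    CREATE TABLE IF NOT EXISTS monthly_totals (
        month DATE, customer_id BIGINT, total DOUBLE)
""")

# Hypothetical fact table: events(month, customer_id, amount).
months = [row[0] for row in con.execute(
    "SELECT DISTINCT month FROM events ORDER BY month").fetchall()]

# One partition at a time keeps each hash aggregate small.
for month in months:
    con.execute("""
        INSERT INTO monthly_totals
        SELECT month, customer_id, sum(amount) AS total
        FROM events
        WHERE month = ?
        GROUP BY month, customer_id
    """, [month])
```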

AWS pricing

Athena and S3

Athena costs €4.6 per TB scanned. If we don’t query the data much or often, it’s cheap. If we end up scanning all of the data hourly, we will pay at least €1.7k monthly. (S3 request costs would be negligible unless we have tiny files or use inefficient file formats.)
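The €1.7k figure is just the scan rate multiplied out; a quick sketch using the numbers above:

```python
# Athena cost if we scan the full compressed dataset every hour.
price_per_tb_scanned_eur = 4.6
dataset_tb = 0.5                  # 500 GB compressed
scans_per_month = 24 * 30         # hourly, ~30 days

monthly_cost_eur = price_per_tb_scanned_eur * dataset_tb * scans_per_month
print(f"~€{monthly_cost_eur:,.0f} per month")  # ~€1,656
```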

Storage on S3 is €10 per month for 500 GB of standard storage. On top of that, we pay for requests and transfer costs (more on that later).

Redshift

Redshift comes in many flavors. If we go for a provisioned cluster composed of 3x RA3 nodes with 96 GiB RAM and 12 vCPUs each, we would pay €7,100 monthly for servers with on-demand pricing. Storing the raw datasets would be €11 monthly for Redshift Managed Storage (RMS, a mix of local SSDs and S3), assuming that the compression rate in Redshift tables is the same as for the original files and that no data is duplicated across nodes. If we pay upfront to reserve the same nodes for 3 years, we pay €2,700 monthly. If we’re hesitant to pay €97,100 upfront and choose a 1-year term with nothing upfront instead, we pay €5,000 monthly. We pay nothing extra for querying data that resides on these nodes. We could also turn on concurrency scaling to add capacity to the cluster dynamically, which might let us reduce its base capacity.
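To put the three payment options side by side, here is a quick sketch based on the figures above; the only derived number is the effective monthly cost of the 3-year all-upfront reservation.

```python
# Effective monthly cost of the 3x RA3 cluster under each payment option.
on_demand_monthly_eur = 7_100
one_year_no_upfront_monthly_eur = 5_000
three_year_all_upfront_eur = 97_100

three_year_effective_monthly_eur = three_year_all_upfront_eur / 36
print(f"On-demand:           €{on_demand_monthly_eur:,}/month")
print(f"1-year, no upfront:  €{one_year_no_upfront_monthly_eur:,}/month")
print(f"3-year, all upfront: ~€{three_year_effective_monthly_eur:,.0f}/month")  # ~€2,697
```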

If we use the Redshift Spectrum feature, we can query data on S3 with pricing similar to Athena.

The other flavor is Redshift Serverless. When active, we pay for Redshift Processing Units (RPUs) with a one-minute minimum charge. We also pay for storage, similar to a provisioned cluster of RA3 nodes. When we query files on S3 from Redshift Serverless, we don’t pay extra for Spectrum. Because Serverless costs depend on query patterns and frequency, and RPU consumption is opaque, they aren’t easy to estimate.

Egress fees, where Cloudflare R2 shines

Now let’s consider egress, and say that we must transfer the full dataset out of AWS S3 daily. Transfer pricing is tiered, and we would end up paying around €600 monthly. This is where Cloudflare R2 shines, as it’s €7 per month for storage and nothing for transfer.

Scaling

What if we plan to expand outside the land of fika and meatballs? I won’t elaborate on why it’s hard to foresee bottlenecks. Just get metrics in place. If our use cases are internal only, like BI or product development, it may be okay if things break now and then.

If we go for Parquet files on object storage, it’s simple to add new query engines to the mix or to limit resource usage by querying subsets of data, as I illustrated in the other post. DuckDB can also generate Parquet files and write them to S3, effectively materializing views that can be queried more efficiently.
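As a sketch of what that looks like in practice, DuckDB can write query results straight to Parquet on S3 through its httpfs extension; the bucket, paths, and credentials below are placeholders.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region = 'eu-north-1'")
con.execute("SET s3_access_key_id = '...'")        # placeholder credentials
con.execute("SET s3_secret_access_key = '...'")

# Materialize a derived table as Parquet on S3 so other engines
# (Athena, Spark, another DuckDB instance) can query it directly.
con.execute("""
    COPY (
        SELECT customer_id, sum(amount) AS total_amount
        FROM read_parquet('s3://example-bucket/raw/*.parquet')
        GROUP BY customer_id
    )
    TO 's3://example-bucket/derived/customer_totals.parquet'
    (FORMAT PARQUET)
""")
```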

Closing thoughts

With data on local disk and R2, our costs are more predictable than when we use AWS S3 and Athena. A fixed-size Redshift cluster on AWS is predictably priced but expensive compared to servers from a hosting provider like Hetzner.

On the other hand, we may dislike running servers on our own. Also, consider the larger ecosystem and community around AWS, and that we can mix services and storage options and scale down compute capacity during off-hours. However, bespoke setups on AWS add complexity and require engineering.

If we have a method to produce Parquet files, we’re not locked into any particular set of services or tools. This is also true if we’re generating JSON, CSV, or are running anything familiar like PostgreSQL, MySQL, or SQLite.

If we’re starting out and haven’t yet decided, let’s pick a proven, boring DBMS that we use for everything, including analytics, until it becomes a problem.
