Throwing 107 GB and 5 billion fake rows of order data at DuckDB and Athena
- DuckDB is an in-process database management system for OLAP.
- Athena is Amazon’s serverless analytics service.
- Cloudflare R2 is an alternative to AWS S3 where egress is free of charge.
If you don’t already care about DuckDB, go ahead and read my other post first. This post details some highlights of my experiments on a technical level. If you’re more interested in what it means for your hosting bill, I wrote about that too.
Test data
I used a dataset with one table containing 5 billion fake order lines from 1 billion orders over 10 years. The data is in snappy-compressed Parquet files, partitioned by year and month. The total file size is 107 GB (about 330 GB uncompressed). Here is a sample of records:
┌──────────────────────┬───────┬────────────┬──────────┬────────────────────┬────────────────────┬─────────────┬─────────────────────┬───────────┬─────────┬─────────┬──────────┬─────────┐
│ order_id │ line │ product_id │ quantity │ price │ line_price │ customer_id │ created │ num_lines │ day │ month │ store_id │ year │
│ int64 │ int64 │ int64 │ int64 │ double │ double │ int64 │ timestamp │ int64 │ varchar │ varchar │ varchar │ varchar │
├──────────────────────┼───────┼────────────┼──────────┼────────────────────┼────────────────────┼─────────────┼─────────────────────┼───────────┼─────────┼─────────┼──────────┼─────────┤
│ 1135077679086384973 │ 1 │ 854 │ 5 │ 16.083097153631755 │ 80.41548576815877 │ 1535 │ 2016-09-06 01:03:04 │ 4 │ 6 │ 9 │ 8 │ 2016 │
│ 1135077679086384973 │ 2 │ 561 │ 3 │ 43.970102508551975 │ 131.91030752565592 │ 1535 │ 2016-09-06 01:03:04 │ 4 │ 6 │ 9 │ 8 │ 2016 │
│ 1135077679086384973 │ 3 │ 99 │ 3 │ 54.659783092799465 │ 163.9793492783984 │ 1535 │ 2016-09-06 01:03:04 │ 4 │ 6 │ 9 │ 8 │ 2016 │
│ 1135077679086384973 │ 4 │ 261 │ 5 │ 33.972849604393296 │ 169.8642480219665 │ 1535 │ 2016-09-06 01:03:04 │ 4 │ 6 │ 9 │ 8 │ 2016 │
│ 4791786128601458882 │ 3 │ 854 │ 6 │ 16.083097153631755 │ 96.49858292179053 │ 9267 │ 2016-09-06 04:15:31 │ 5 │ 6 │ 9 │ 8 │ 2016 │
│ 4791786128601458882 │ 2 │ 83 │ 2 │ 88.68105154796304 │ 177.36210309592607 │ 9267 │ 2016-09-06 04:15:31 │ 5 │ 6 │ 9 │ 8 │ 2016 │
│ 4791786128601458882 │ 1 │ 368 │ 10 │ 33.87760446706489 │ 338.7760446706489 │ 9267 │ 2016-09-06 04:15:31 │ 5 │ 6 │ 9 │ 8 │ 2016 │
│ 4791786128601458882 │ 4 │ 781 │ 2 │ 97.59143722363255 │ 195.1828744472651 │ 9267 │ 2016-09-06 04:15:31 │ 5 │ 6 │ 9 │ 8 │ 2016 │
│ 4791786128601458882 │ 5 │ 396 │ 2 │ 3.9038578062505547 │ 7.807715612501109 │ 9267 │ 2016-09-06 04:15:31 │ 5 │ 6 │ 9 │ 8 │ 2016 │
│ -1029262612490749341 │ 8 │ 854 │ 10 │ 16.083097153631755 │ 160.83097153631755 │ 5408 │ 2016-09-06 05:27:59 │ 9 │ 6 │ 9 │ 8 │ 2016 │
├──────────────────────┴───────┴────────────┴──────────┴────────────────────┴────────────────────┴─────────────┴─────────────────────┴───────────┴─────────┴─────────┴──────────┴─────────┤
│ 10 rows 13 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Each Parquet file has a size of approximately 50 MB. A sample of file paths:
orders/year=2023/month=3/data_0.parquet
orders/year=2023/month=3/data_10.parquet
orders/year=2023/month=3/data_11.parquet
orders/year=2023/month=3/data_12.parquet
orders/year=2023/month=3/data_13.parquet
orders/year=2023/month=3/data_14.parquet
orders/year=2023/month=3/data_15.parquet
orders/year=2023/month=3/data_1.parquet
orders/year=2023/month=3/data_2.parquet
orders/year=2023/month=3/data_3.parquet
orders/year=2023/month=3/data_4.parquet
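For what it’s worth, this directory layout (year=.../month=... folders containing data_N.parquet files) matches what DuckDB’s partitioned Parquet export writes. A minimal sketch, assuming an orders table already exists in DuckDB (snappy is DuckDB’s default Parquet codec anyway):
copy orders to 'orders'
    (format parquet, partition_by (year, month), compression snappy);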
Point DuckDB at local Parquet files
create or replace view orders as
select * from parquet_scan('orders/*/*/*.parquet', hive_partitioning=1);
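With hive_partitioning=1 the year and month directory names become ordinary columns on the view, and filters on them prune files before any data is read. A quick way to check (a sketch, not part of the timings below):
describe orders;
explain select count(1) from orders where year = '2022' and month = '6';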
Get data to and from R2
I used rclone to upload data to R2 and got a throughput of ~470 Mb/s from a box in Helsinki. I downloaded the dataset to a box in Falkenstein at ~920 Mb/s.
Your rclone.conf needs a block like this:
[r2]
type = s3
provider = Cloudflare
env_auth = false
access_key_id = $access
secret_access_key = $secret
endpoint = https://$account.r2.cloudflarestorage.com
acl = private
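With that remote in place, the upload and download are plain rclone copies along these lines; the bucket and prefix are the ones the DuckDB view points at later:
rclone copy --progress orders/ r2:duckdata/year-month/orders/
rclone copy --progress r2:duckdata/year-month/orders/ orders/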
Point DuckDB at R2
install httpfs;
load httpfs;
set s3_endpoint='$account.r2.cloudflarestorage.com';
set s3_region='auto';
set s3_access_key_id='$access';
set s3_secret_access_key='$secret';
create or replace view orders as
select * from parquet_scan('s3://duckdata/year-month/orders/*/*/*.parquet', hive_partitioning=1);
Notably, my first attempt to query R2 failed while later attempts succeeded:
explain analyze select count(1) from orders;
14% ▕████████▍ ▏ Error: IO Error: Connection error for HTTP HEAD to 'https://duckdata.$account.r2.cloudflarestorage.com/year-month/orders/year%3D2019/month%3D9/data_4.parquet'
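I didn’t dig into the root cause, but since later attempts succeeded it looks transient. The httpfs extension has retry and timeout settings worth bumping in that case (a sketch; names and defaults may vary between DuckDB versions):
set http_retries=5;           -- more attempts on transient failures
set http_retry_wait_ms=1000;  -- longer wait between attempts
set http_timeout=120000;      -- more generous timeout; check the unit for your version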
Upload data to S3
Again I used rclone to upload the dataset to AWS S3 and achieved a throughput of 650 Mb/s.
[s3]
type = s3
provider = AWS
env_auth = false
region = $region
access_key_id = $access
secret_access_key = $secret
acl = private
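The upload itself is the same kind of rclone copy, just against the s3 remote ($bucket as in the Athena DDL below):
rclone copy --progress orders/ s3:$bucket/orders/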
Create a table in Athena
CREATE EXTERNAL TABLE orders (
order_id bigint,
store_id string,
customer_id string,
product_id bigint,
created timestamp,
num_lines bigint,
line bigint,
quantity bigint,
price double,
line_price double,
day varchar(2)
)
PARTITIONED BY (year varchar(4), month varchar(2))
STORED AS PARQUET
LOCATION 's3://$bucket/orders/'
tblproperties ("parquet.compression"="SNAPPY");
MSCK REPAIR TABLE orders;
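After the repair, SHOW PARTITIONS is a quick sanity check that Athena picked up every year/month directory:
SHOW PARTITIONS orders;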
I had to declare some columns as string in Athena even though DuckDB recognizes them as int64. I didn’t investigate further, but please reach out if you know what the reason might be. I got errors like these:
HIVE_BAD_DATA: Field store_id's type BINARY in parquet file s3://$bucket/orders/year=2023/month=4/data_11.parquet is incompatible with type bigint defined in table schema
HIVE_BAD_DATA: Field order_id's type INT64 in parquet file s3://$bucket/orders/year=2023/month=4/data_8.parquet is incompatible with type varchar defined in table schema
HIVE_CURSOR_ERROR: Failed to read Parquet file: s3://$bucket/orders/year=2022/month=11/data_15.parquet
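One way to inspect what the files actually contain is DuckDB’s parquet_schema function, which lists each column’s physical Parquet type, i.e. what Athena is comparing against the table schema (a sketch against one of the local files):
select * from parquet_schema('orders/year=2023/month=4/data_11.parquet');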
Query
This is not a benchmark; I just wanted to get a rough sense of what one might expect.
Simple count
select count(1) from orders;
Setup | Time |
---|---|
Athena | 2.547s |
DuckDB (Xeon 2017) | 1.81s |
DuckDB (Xeon 2017 + R2) | 90s |
DuckDB (Ampere) | 2.50s |
DuckDB (i9 2022) | 0.797s |
I decided not to run additional queries against the R2 storage due to how long this one took.
Grouped count
select year, store_id, count(1) order_line_count
from orders
group by 1, 2
order by 3 desc, 1, 2
limit 10;
Setup | Time |
---|---|
Athena | 9.095s |
DuckDB (Xeon 2017) | 20.20s |
DuckDB (Ampere) | 10.39s |
DuckDB (i9 2022) | 6.76s |
Count distinct
select year, store_id, count(distinct order_id) order_count
from orders
where year = '2022'
group by 1, 2
order by 3 desc, 1, 2
limit 10;
Setup | Time |
---|---|
Athena | 4.823s |
DuckDB (Xeon 2017) | 10.52s |
DuckDB (Ampere) | 6.88s |
DuckDB (i9 2022) | 4.57s |
I had to add a filter by year to prevent out-of-memory errors on DuckDB. This is because DuckDB cannot use partitioning information for hash aggregates.
I attempted to rewrite the query as a bunch of union all branches that query each partition individually, since DuckDB supports filter pushdown into its Parquet reader, but I still got OOM errors, presumably because it tries to run all the branches in parallel.
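For reference, the shape of that rewrite (a sketch; the remaining month branches are elided, and the per-partition counts would still need to be combined afterwards):
select year, month, store_id, count(distinct order_id) order_count
from orders
where year = '2022' and month = '1'
group by 1, 2, 3
union all
select year, month, store_id, count(distinct order_id) order_count
from orders
where year = '2022' and month = '2'
group by 1, 2, 3
-- ... one branch per partition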
Join
with orders_sub as (
select
order_id,
product_id
from
orders
where
year = '2022'
),
order_pairs as (
select
o1.product_id as product_id1,
o2.product_id as product_id2,
o1.order_id as order_id
from
orders_sub o1
join orders_sub o2 on o1.order_id = o2.order_id
where
o1.product_id < o2.product_id
)
select
product_id1,
product_id2,
count(order_id) as num_orders
from
order_pairs
group by
product_id1,
product_id2
order by
num_orders desc
limit
10;
Setup | Time |
---|---|
Athena | 14.668s |
DuckDB (Xeon 2017) | 56.59s |
DuckDB (Ampere) | 39.78s |
DuckDB (i9 2022) | 39.05s |
Hardware
Benchmarks by yabs.sh.
Xeon 2017
A KVM guest on a Hetzner dedicated server with a Xeon W-2145. Spinning disks, ZFS on the host.
Basic System Information:
---------------------------------
Uptime : 13 days, 11 hours, 39 minutes
Processor : Common KVM processor
CPU cores : 16 @ 3695.998 MHz
AES-NI : ❌ Disabled
VM-x/AMD-V : ❌ Disabled
RAM : 62.8 GiB
Swap : 975.0 MiB
Disk : 9.8 TiB
Distro : Debian GNU/Linux 11 (bullseye)
Kernel : 5.10.0-21-amd64
VM Type : KVM
Net Online : IPv4
fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k (IOPS) | 64k (IOPS)
------ | --- ---- | ---- ----
Read | 280.18 MB/s (70.0k) | 528.06 MB/s (8.2k)
Write | 280.92 MB/s (70.2k) | 530.83 MB/s (8.2k)
Total | 561.11 MB/s (140.2k) | 1.05 GB/s (16.5k)
| |
Block Size | 512k (IOPS) | 1m (IOPS)
------ | --- ---- | ---- ----
Read | 1.97 GB/s (3.8k) | 207.88 MB/s (203)
Write | 2.08 GB/s (4.0k) | 221.72 MB/s (216)
Total | 4.06 GB/s (7.9k) | 429.61 MB/s (419)
Geekbench 6 Benchmark Test:
---------------------------------
Test | Value
|
Single Core | 894
Multi Core | 5202
Ampere
Hetzner CAX41, Ampere Altra. NVMe SSD, ext4.
Basic System Information:
---------------------------------
Uptime : 0 days, 0 hours, 32 minutes
Processor : Neoverse-N1
CPU cores : 16 @ ??? MHz
AES-NI : ✔ Enabled
VM-x/AMD-V : ❌ Disabled
RAM : 30.5 GiB
Swap : 0.0 KiB
Disk : 496.2 GiB
Distro : Ubuntu 22.04.2 LTS
Kernel : 5.15.0-69-generic
VM Type : NONE
Net Online : IPv4 & IPv6
fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k (IOPS) | 64k (IOPS)
------ | --- ---- | ---- ----
Read | 155.07 MB/s (38.7k) | 1.19 GB/s (18.5k)
Write | 154.97 MB/s (38.7k) | 1.22 GB/s (19.1k)
Total | 310.04 MB/s (77.5k) | 2.41 GB/s (37.7k)
| |
Block Size | 512k (IOPS) | 1m (IOPS)
------ | --- ---- | ---- ----
Read | 1.56 GB/s (3.0k) | 1.52 GB/s (1.4k)
Write | 1.70 GB/s (3.3k) | 1.70 GB/s (1.6k)
Total | 3.26 GB/s (6.3k) | 3.23 GB/s (3.1k)
Geekbench 6 Benchmark Test:
---------------------------------
Test | Value
|
Single Core | 1088
Multi Core | 8700
i9 2022
A WSL2 guest on i9-13900KF. Guest has its own NVMe SSD, xfs.
Basic System Information:
---------------------------------
Uptime : 0 days, 5 hours, 18 minutes
Processor : 13th Gen Intel(R) Core(TM) i9-13900KF
CPU cores : 32 @ 2995.198 MHz
AES-NI : ✔ Enabled
VM-x/AMD-V : ✔ Enabled
RAM : 31.2 GiB
Swap : 8.0 GiB
Disk : 1.1 TiB
Distro : Debian GNU/Linux 11 (bullseye)
Kernel : 5.15.90.1-microsoft-standard-WSL2
VM Type : WSL
Net Online : IPv4
fio Disk Speed Tests (Mixed R/W 50/50):
---------------------------------
Block Size | 4k (IOPS) | 64k (IOPS)
------ | --- ---- | ---- ----
Read | 1.41 GB/s (353.7k) | 2.08 GB/s (32.6k)
Write | 1.41 GB/s (354.7k) | 2.10 GB/s (32.8k)
Total | 2.83 GB/s (708.4k) | 4.19 GB/s (65.4k)
| |
Block Size | 512k (IOPS) | 1m (IOPS)
------ | --- ---- | ---- ----
Read | 2.74 GB/s (5.3k) | 3.01 GB/s (2.9k)
Write | 2.89 GB/s (5.6k) | 3.21 GB/s (3.1k)
Total | 5.64 GB/s (11.0k) | 6.22 GB/s (6.0k)
Geekbench 6 Benchmark Test:
---------------------------------
Test | Value
|
Single Core | 2441
Multi Core | 17531
Dig deeper
- Video lecture by Mark Raasveldt: DuckDB Internals (CMU Advanced Databases / Spring 2023)
- groupby and join benchmarks published by DuckDB: Database-like ops benchmark