Optimize DuckDB performance for large datasets with partitioning

Advanced | 45 min | May 12, 2026
Ubuntu 24.04, Debian 12, AlmaLinux 9, Rocky Linux 9

Configure DuckDB with advanced partitioning strategies and memory optimization for processing multi-gigabyte datasets efficiently. Includes Python integration, query optimization techniques, and comprehensive monitoring setup.

Prerequisites

  • Root access or sudo privileges
  • At least 8GB RAM available
  • Python 3.8 or higher
  • 50GB free disk space for testing

What this solves

DuckDB excels at analytical workloads but requires careful configuration for datasets larger than system memory. This tutorial shows you how to implement table partitioning, configure memory management, and optimize query performance for datasets ranging from gigabytes to terabytes. You'll also set up monitoring to track performance metrics and identify bottlenecks.

Step-by-step configuration

Install DuckDB with Python integration

Start by installing DuckDB and the Python client for programmatic access and advanced configuration options.

# Ubuntu / Debian
sudo apt update
sudo apt install -y python3-pip python3-dev build-essential
pip3 install duckdb==0.9.2 pandas pyarrow

# AlmaLinux / Rocky Linux
sudo dnf update -y
sudo dnf install -y python3-pip python3-devel gcc gcc-c++ make
pip3 install duckdb==0.9.2 pandas pyarrow

Configure system memory limits

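Set memory limits for the duckdb user to prevent system overload during large dataset operations. The entries below use the pam_limits format (typically added to /etc/security/limits.conf or a file under /etc/security/limits.d/). Values are in KB, so the soft address-space limit is 8 GB and the hard limit is 16 GB.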

# DuckDB memory limits
duckdb           soft    as              8388608
duckdb           hard    as              16777216
duckdb           soft    memlock         4194304
duckdb           hard    memlock         8388608

Create DuckDB configuration file

Configure memory allocation, temporary storage location, and threading parameters for optimal performance.

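mkdir -p ~/.config/duckdb
sudo mkdir -p /var/lib/duckdb/temp
# Own the whole data directory so the database file can be created there too
sudo chown -R $USER:$USER /var/lib/duckdb

Add the following statements to the DuckDB configuration file. Setting names vary between DuckDB versions; SELECT name FROM duckdb_settings() lists the ones your build accepts.

-- Memory configuration
SET memory_limit = '4GB';
SET temp_directory = '/var/lib/duckdb/temp';
SET threads = 4;

-- Performance settings
SET enable_progress_bar = true;
-- wal_autocheckpoint is an alias of checkpoint_threshold, so only one is set
SET checkpoint_threshold = '16MB';

-- Optimization settings
PRAGMA enable_optimizer;
PRAGMA enable_profiling;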

Set up partitioned table structure

Create a Python script that defines the base tables. DuckDB does not implement PostgreSQL-style declarative partitioning (PARTITION BY RANGE or PARTITION BY HASH in CREATE TABLE, or PARTITION OF child tables), so statements using that syntax fail to parse. The tables are therefore created as ordinary tables. DuckDB's native storage keeps min-max metadata per row group, which lets date-range filters skip row groups when rows are loaded in roughly date order. For explicit file-level partitions, export the data as Hive-partitioned Parquet with COPY ... (FORMAT PARQUET, PARTITION_BY (...)).

#!/usr/bin/env python3
import duckdb

def create_partitioned_tables():
    conn = duckdb.connect('/var/lib/duckdb/analytics.db')
    
    # Sales fact table. Load rows in transaction_date order so the
    # per-row-group min-max metadata can skip data for date-range filters.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS sales_data (
            id INTEGER,
            transaction_date DATE,
            customer_id INTEGER,
            product_id INTEGER,
            amount DECIMAL(10,2),
            region VARCHAR(50)
        );
    """)
    
    # Customer dimension table, keyed by customer_id
    conn.execute("""
        CREATE TABLE IF NOT EXISTS customer_data (
            customer_id INTEGER,
            name VARCHAR(100),
            email VARCHAR(100),
            created_at TIMESTAMP,
            lifetime_value DECIMAL(12,2)
        );
    """)
    
    conn.close()
    print("Tables created successfully")

if __name__ == "__main__":
    create_partitioned_tables()

Save the script as partition_setup.py, then copy it to /opt/duckdb and run it:

sudo mkdir -p /opt/duckdb
sudo cp partition_setup.py /opt/duckdb/
sudo chmod +x /opt/duckdb/partition_setup.py
python3 /opt/duckdb/partition_setup.py

Configure memory management and query optimization

Create advanced configuration for memory allocation, parallel processing, and query optimization.

#!/usr/bin/env python3
import duckdb
import psutil

def configure_performance_settings():
    conn = duckdb.connect('/var/lib/duckdb/analytics.db')
    
    # Size the limits from current system resources
    available_memory = psutil.virtual_memory().available
    # cpu_count(logical=False) can return None on some platforms
    cpu_count = psutil.cpu_count(logical=False) or psutil.cpu_count() or 1
    
    # Allocate 60% of available memory to DuckDB, with a 1GB floor
    memory_limit_gb = max(int(available_memory * 0.6) // (1024**3), 1)
    threads = min(cpu_count, 8)
    
    # SET applies to the running process only and is not stored in the
    # database file, so issue these statements in every script that opens it.
    conn.execute(f"SET memory_limit = '{memory_limit_gb}GB';")
    conn.execute(f"SET threads = {threads};")
    
    # Spill larger-than-memory operators to a dedicated directory
    conn.execute("SET temp_directory = '/var/lib/duckdb/temp';")
    
    # Cache Parquet metadata between queries
    conn.execute("SET enable_object_cache = true;")
    # Allow results in any order, which lowers memory use for large
    # imports and exports
    conn.execute("SET preserve_insertion_order = false;")
    
    print(f"Configured DuckDB with {memory_limit_gb}GB memory, {threads} threads")
    print(f"CPU cores: {cpu_count}")
    
    conn.close()

if __name__ == "__main__":
    configure_performance_settings()

Save the script as /opt/duckdb/optimization_config.py, then install psutil and run it:

pip3 install psutil
python3 /opt/duckdb/optimization_config.py
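
Sizing the limit in whole gigabytes rounds down sharply on small machines (1.5 GB of headroom becomes 1 GB, and anything under 1 GB becomes 0). One possible alternative is to compute the limit in megabytes with a floor. The sketch below keeps the 60% rule from the script above; the function names, the 512 MB floor, and the cap of 8 threads as a parameter are illustrative assumptions, not DuckDB recommendations.

```python
def pick_memory_limit(available_bytes, fraction=0.6, floor_mb=512):
    """Return a memory_limit string such as '4915MB' for a SET statement."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    limit_mb = int(available_bytes * fraction) // (1024 ** 2)
    return f"{max(limit_mb, floor_mb)}MB"


def pick_threads(physical_cores, cap=8):
    """Cap the thread count; fall back to 1 when the core count is unknown."""
    return max(1, min(physical_cores or 1, cap))


if __name__ == "__main__":
    eight_gib = 8 * 1024 ** 3
    print(pick_memory_limit(eight_gib))  # 60% of 8 GiB, expressed in MB
    print(pick_threads(None))            # unknown core count falls back to 1
```

The returned string can be interpolated into SET memory_limit = '...' in the same way the script does with its GB value.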

Implement partition pruning and query optimization

Create scripts to demonstrate efficient querying techniques and partition elimination strategies.

#!/usr/bin/env python3
import duckdb
import time
from datetime import datetime, timedelta

def create_optimized_indexes():
    conn = duckdb.connect('/var/lib/duckdb/analytics.db')
    
    # Create indexes for efficient partition pruning
    conn.execute("""
        CREATE INDEX IF NOT EXISTS idx_sales_date_region 
        ON sales_data (transaction_date, region);
    """)
    
    conn.execute("""
        CREATE INDEX IF NOT EXISTS idx_customer_created 
        ON customer_data (created_at);
    """)
    
    conn.execute("""
        CREATE INDEX IF NOT EXISTS idx_customer_value 
        ON customer_data (lifetime_value DESC);
    """)
    
    conn.close()
    print("Indexes created for partition pruning")

def demonstrate_partition_pruning():
    conn = duckdb.connect('/var/lib/duckdb/analytics.db')
    
    # Enable query profiling
    conn.execute("PRAGMA enable_profiling = 'query_tree';")
    
    # Example 1: Date range query that should prune partitions
    start_time = time.time()
    result = conn.execute("""
        SELECT region, COUNT(*), SUM(amount)
        FROM sales_data 
        WHERE transaction_date >= '2024-01-01' 
          AND transaction_date < '2024-03-01'
        GROUP BY region
        ORDER BY SUM(amount) DESC;
    """).fetchall()
    
    query_time = time.time() - start_time
    print(f"Date range query completed in {query_time:.2f} seconds")
    print(f"Results: {len(result)} regions found")
    
    # Example 2: Hash partition query
    start_time = time.time()
    result = conn.execute("""
        SELECT COUNT(*), AVG(lifetime_value)
        FROM customer_data 
        WHERE customer_id % 8 = 3
          AND lifetime_value > 1000;
    """).fetchall()
    
    query_time = time.time() - start_time
    print(f"Hash partition query completed in {query_time:.2f} seconds")
    
    # Show query plan for optimization analysis.
    # EXPLAIN returns (key, plan_text) rows, so the plan is in the second column.
    plan = conn.execute("EXPLAIN ANALYZE SELECT * FROM sales_data WHERE transaction_date >= '2024-01-01' LIMIT 10;").fetchall()
    print("\nQuery execution plan:")
    for row in plan:
        print(row[1])
    
    conn.close()

if __name__ == "__main__":
    create_optimized_indexes()
    demonstrate_partition_pruning()

Save the script as /opt/duckdb/query_optimizer.py and run it:

python3 /opt/duckdb/query_optimizer.py

Set up performance monitoring

Install monitoring tools to track DuckDB performance metrics and resource usage.

# Ubuntu / Debian
sudo apt install -y prometheus-node-exporter
pip3 install prometheus-client flask

# AlmaLinux / Rocky Linux
sudo dnf install -y golang
go install github.com/prometheus/node_exporter@latest
pip3 install prometheus-client flask

Create DuckDB metrics exporter

Build a custom metrics exporter to track query performance, memory usage, and partition statistics.

#!/usr/bin/env python3
import duckdb
import time
import psutil
from prometheus_client import start_http_server, Gauge, Counter, Histogram
from threading import Thread
import logging

# Prometheus metrics
query_duration = Histogram('duckdb_query_duration_seconds', 'Query execution time')
active_connections = Gauge('duckdb_active_connections', 'Number of active connections')
memory_usage = Gauge('duckdb_memory_usage_bytes', 'Memory usage in bytes')
partitions_scanned = Counter('duckdb_partitions_scanned_total', 'Total partitions scanned')
rows_processed = Counter('duckdb_rows_processed_total', 'Total rows processed')

class DuckDBMonitor:
    def __init__(self, db_path):
        self.db_path = db_path
        self.running = True
    
    def collect_metrics(self):
        while self.running:
            try:
                conn = duckdb.connect(self.db_path)
                
                # Collect system metrics
                memory_info = psutil.virtual_memory()
                memory_usage.set(memory_info.used)
                
                # Collect database statistics
                tables_info = conn.execute("""
                    SELECT table_name, estimated_size
                    FROM duckdb_tables()
                    WHERE schema_name = 'main';
                """).fetchall()
                
                # Check active queries (if available)
                try:
                    active_queries = conn.execute("""
                        SELECT COUNT(*)
                        FROM duckdb_queries()
                        WHERE state = 'RUNNING';
                    """).fetchone()[0]
                    active_connections.set(active_queries)
                except Exception:
                    # Fallback if queries table not available
                    active_connections.set(1)
                
                conn.close()
                
            except Exception as e:
                logging.error(f"Error collecting metrics: {e}")
            
            time.sleep(30)  # Collect metrics every 30 seconds
    
    def start(self):
        # Start metrics collection thread
        metrics_thread = Thread(target=self.collect_metrics)
        metrics_thread.daemon = True
        metrics_thread.start()
        
        # Start Prometheus metrics server
        start_http_server(8000)
        print("DuckDB metrics exporter started on port 8000")
        
        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            self.running = False
            print("Shutting down metrics exporter")

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    monitor = DuckDBMonitor('/var/lib/duckdb/analytics.db')
    monitor.start()

Save the script as /opt/duckdb/metrics_exporter.py and start it in the background:

python3 /opt/duckdb/metrics_exporter.py &

Configure systemd service for DuckDB monitoring

Create a systemd service to ensure the metrics exporter runs automatically on system startup.

[Unit]
Description=DuckDB Performance Monitor
After=network.target

[Service]
Type=simple
User=duckdb
Group=duckdb
WorkingDirectory=/opt/duckdb
ExecStart=/usr/bin/python3 /opt/duckdb/metrics_exporter.py
Restart=always
RestartSec=10
Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Environment=PYTHONPATH=/opt/duckdb

[Install]
WantedBy=multi-user.target

Save the unit file as /etc/systemd/system/duckdb-monitor.service, then create the service user and enable the service:

sudo useradd -r -s /bin/false duckdb
sudo chown -R duckdb:duckdb /opt/duckdb /var/lib/duckdb
sudo systemctl daemon-reload
sudo systemctl enable --now duckdb-monitor

Set up automated performance benchmarking

Create a benchmarking script to regularly test query performance and partition efficiency.

#!/usr/bin/env python3
import duckdb
import time
import json
from datetime import datetime
import pandas as pd

def run_benchmark_suite():
    conn = duckdb.connect('/var/lib/duckdb/analytics.db')
    results = []
    
    # Benchmark queries with different partition access patterns
    benchmark_queries = [
        {
            "name": "single_partition_scan",
            "query": """
                SELECT COUNT(*), AVG(amount) 
                FROM sales_data 
                WHERE transaction_date >= '2024-01-01' 
                  AND transaction_date < '2024-02-01'
            """
        },
        {
            "name": "multi_partition_scan",
            "query": """
                SELECT region, COUNT(*), SUM(amount) 
                FROM sales_data 
                WHERE transaction_date >= '2024-01-01' 
                  AND transaction_date < '2024-06-01'
                GROUP BY region
            """
        },
        {
            "name": "hash_partition_join",
            "query": """
                SELECT c.name, COUNT(s.id), SUM(s.amount)
                FROM customer_data c
                JOIN sales_data s ON c.customer_id = s.customer_id
                WHERE c.lifetime_value > 5000
                  AND s.transaction_date >= '2024-01-01'
                GROUP BY c.name
                ORDER BY SUM(s.amount) DESC
                LIMIT 100
            """
        }
    ]
    
    for benchmark in benchmark_queries:
        # Enable profiling
        conn.execute("PRAGMA enable_profiling = 'query_tree';")
        
        # Run query multiple times and average
        times = []
        for _ in range(3):
            start_time = time.time()
            result = conn.execute(benchmark["query"]).fetchall()
            end_time = time.time()
            times.append(end_time - start_time)
        
        avg_time = sum(times) / len(times)
        
        results.append({
            "query_name": benchmark["name"],
            "avg_execution_time": avg_time,
            "result_count": len(result),
            "timestamp": datetime.now().isoformat()
        })
        
        print(f"{benchmark['name']}: {avg_time:.2f}s avg ({len(result)} rows)")
    
    # Save results to file
    with open('/var/log/duckdb-benchmark.json', 'a') as f:
        for result in results:
            f.write(json.dumps(result) + '\n')
    
    conn.close()
    return results

if __name__ == "__main__":
    print("Starting DuckDB performance benchmark...")
    results = run_benchmark_suite()
    print(f"Benchmark completed. {len(results)} queries executed.")

Save the script as /opt/duckdb/benchmark.py, then create the results log and run the benchmark:

sudo mkdir -p /var/log
sudo touch /var/log/duckdb-benchmark.json
sudo chown duckdb:duckdb /var/log/duckdb-benchmark.json
python3 /opt/duckdb/benchmark.py
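
The benchmark appends one JSON object per line to /var/log/duckdb-benchmark.json, so trends can be summarized with the standard library alone. The sketch below assumes the record fields written by the script above (query_name, avg_execution_time, timestamp); the function names and the sample records are illustrative.

```python
import json
from collections import defaultdict
from statistics import mean


def load_results(lines):
    """Parse JSON-lines benchmark records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize(records):
    """Group timings by query name and report run count, mean, and latest run."""
    by_query = defaultdict(list)
    # ISO 8601 timestamps in one format sort chronologically as strings
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        by_query[rec["query_name"]].append(rec["avg_execution_time"])
    return {
        name: {"runs": len(times), "mean": mean(times), "latest": times[-1]}
        for name, times in by_query.items()
    }


if __name__ == "__main__":
    sample = [
        '{"query_name": "single_partition_scan", "avg_execution_time": 0.5, '
        '"result_count": 1, "timestamp": "2024-01-01T00:00:00"}',
        '{"query_name": "single_partition_scan", "avg_execution_time": 1.5, '
        '"result_count": 1, "timestamp": "2024-01-01T04:00:00"}',
    ]
    for name, stats in summarize(load_results(sample)).items():
        print(name, stats)
```

To run it against the real log, pass the open file object to load_results, since iterating a file yields its lines.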

Configure automated benchmark scheduling

Set up a cron job to run performance benchmarks regularly and track performance trends over time.

sudo crontab -u duckdb -e

# Run performance benchmark every 4 hours
0 */4 * * * /usr/bin/python3 /opt/duckdb/benchmark.py >> /var/log/duckdb-benchmark.log 2>&1

# Clean up old temp files daily
0 2 * * * find /var/lib/duckdb/temp -type f -mtime +7 -delete

# Rotate benchmark logs weekly
0 0 * * 0 logrotate -f /etc/logrotate.d/duckdb

Create /etc/logrotate.d/duckdb with the following rotation policy:

/var/log/duckdb-benchmark.log {
    weekly
    rotate 12
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
    su duckdb duckdb
}

Verify your setup

Test the partitioned database configuration and confirm optimal performance settings are active.

# Check DuckDB version and configuration
python3 -c "import duckdb; print(f'DuckDB version: {duckdb.__version__}')"

# Verify the tables exist
python3 -c "
import duckdb
conn = duckdb.connect('/var/lib/duckdb/analytics.db')
result = conn.execute(\"SELECT table_name FROM duckdb_tables() WHERE schema_name = 'main';\").fetchall()
print('Tables:', [r[0] for r in result])
conn.close()"

# Check monitoring service status
sudo systemctl status duckdb-monitor

# Test metrics endpoint
curl -s http://localhost:8000/metrics | grep duckdb_

# Run a quick benchmark
python3 /opt/duckdb/benchmark.py

# Check memory configuration
free -h
cat /proc/meminfo | grep -E 'MemTotal|MemAvailable'

Common issues

Symptom | Cause | Fix
Out of memory errors | Memory limit too high for system | Reduce memory_limit in config; check available RAM with free -h
Slow partition queries | Missing indexes or partition pruning not working | Check query plan with EXPLAIN ANALYZE; ensure WHERE clauses match partition keys
Temp directory full | Large query spilling to disk | Increase temp_directory size or clean old files with find /var/lib/duckdb/temp -mtime +1 -delete
Metrics exporter not starting | Permission issues or missing dependencies | Check service logs with journalctl -u duckdb-monitor; verify ownership of /opt/duckdb
Poor query performance | Suboptimal threading or buffer configuration | Run python3 /opt/duckdb/optimization_config.py to recalculate settings
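
For the "Temp directory full" case, it can help to measure how much spill data is on disk before deleting anything. The helper below is a minimal standard-library sketch; dir_size_bytes is a name chosen for illustration, and the path in the usage section is the temp directory configured earlier in this tutorial.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file was removed while scanning, which is expected
                # for short-lived spill files
                pass
    return total


if __name__ == "__main__":
    temp_dir = "/var/lib/duckdb/temp"
    size_mb = dir_size_bytes(temp_dir) / (1024 ** 2)
    print(f"{temp_dir}: {size_mb:.1f} MB in use")
```

A directory that does not exist yields 0, because os.walk silently produces no entries for a missing path.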
