Optimize Elasticsearch 8 indexing performance for large datasets with bulk operations and memory tuning

Advanced · 45 min · Apr 03, 2026
Applies to: Ubuntu 24.04, Debian 12, AlmaLinux 9, Rocky Linux 9

Configure Elasticsearch 8 for maximum indexing performance when handling large datasets through bulk API optimization, JVM memory tuning, and index mapping strategies. This guide covers production-grade performance tuning for high-throughput indexing workloads.

Prerequisites

  • Elasticsearch 8 installed and running
  • At least 16GB RAM available
  • Root or sudo access
  • Python 3.6+ installed

What this solves

When indexing large datasets into Elasticsearch, default configurations often result in poor performance, high memory usage, and indexing timeouts. This tutorial shows you how to optimize Elasticsearch 8 for high-throughput bulk indexing operations through JVM heap tuning, bulk API configuration, index settings optimization, and OS-level performance improvements.

You'll learn to handle datasets with millions of documents efficiently while maintaining cluster stability and search performance. These optimizations are essential for log aggregation systems, data lakes, and real-time analytics platforms that require fast data ingestion.

Prerequisites and system requirements

This tutorial assumes you have Elasticsearch 8 already installed and running. If you need to install Elasticsearch first, follow our Elasticsearch 8 installation guide.

Your system should have at least 16GB RAM for optimal performance with the configurations shown here. For production environments processing large datasets, 32GB or more is recommended.

Step-by-step performance optimization

Configure JVM heap size for optimal memory usage

Set the JVM heap to 50% of available RAM, capped just under 32GB so the JVM can keep using compressed object pointers. This leaves memory for the OS file system cache, which Elasticsearch uses heavily for performance.

# Set initial and maximum heap size
-Xms16g
-Xmx16g

Tune G1GC for better large-heap performance (the JDK bundled with Elasticsearch 8 already defaults to G1GC, so these flags adjust its behavior rather than enable it):

-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:+G1UseAdaptiveIHOP
-XX:G1MixedGCCountTarget=8
-XX:G1HeapWastePercent=5

Optimize GC logging for monitoring

-Xlog:gc*,gc+age=trace,safepoint:gc.log:time,level,tags
-XX:+UnlockDiagnosticVMOptions
-XX:+LogVMOutput
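
All of the flags above belong in a custom options file rather than in jvm.options itself. A minimal sketch for package (deb/rpm) installs, assuming the default config layout; the file name performance.options is arbitrary:

sudo tee /etc/elasticsearch/jvm.options.d/performance.options <<'EOF'
# Heap: 50% of RAM, identical min and max
-Xms16g
-Xmx16g
# Add the G1GC tuning and GC logging flags from above here as well
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
EOF
sudo systemctl restart elasticsearch

# Confirm the JVM picked up the flags
curl -s "localhost:9200/_nodes/jvm?filter_path=nodes.*.jvm.input_arguments" | jq .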

Optimize Elasticsearch cluster settings for bulk operations

Configure cluster-level settings to handle high indexing loads efficiently. Note that the thread pool and indexing buffer settings are static node settings, so they belong in /etc/elasticsearch/elasticsearch.yml and require a restart; only the disk watermarks are dynamic and can be changed through the cluster settings API.

# /etc/elasticsearch/elasticsearch.yml (static node settings; restart to apply)
thread_pool.write.queue_size: 1000
thread_pool.write.size: 8
indices.memory.index_buffer_size: 20%
indices.memory.min_index_buffer_size: 96mb

Apply the dynamic settings via the API:

curl -X PUT "localhost:9200/_cluster/settings" -H "Content-Type: application/json" -d '
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}'
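
A quick check that both sets of settings are active, assuming the default port and plain HTTP as in the rest of this guide:

# Dynamic settings stored in the cluster state
curl -s "localhost:9200/_cluster/settings?flat_settings=true" | jq .persistent

# Static settings read from elasticsearch.yml on this node
curl -s "localhost:9200/_nodes/_local/settings?flat_settings=true" | jq '.nodes[].settings'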

Create optimized index templates for bulk indexing

Set up index templates with settings optimized for high-throughput indexing. These settings reduce replica overhead during bulk operations and optimize segment merging.

curl -X PUT "localhost:9200/_index_template/bulk_optimized_template" -H "Content-Type: application/json" -d '
{
  "index_patterns": ["logs-", "metrics-", "bulk-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 0,
      "refresh_interval": "30s",
      "index.translog.flush_threshold_size": "1gb",
      "index.translog.sync_interval": "30s",
      "index.merge.policy.max_merge_at_once": 5,
      "index.merge.policy.segments_per_tier": 5,
      "index.merge.scheduler.max_thread_count": 2,
      "index.codec": "best_compression",
      "index.mapping.total_fields.limit": 10000
    },
    "mappings": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        {
          "timestamps": {
            "match": "timestamp",
            "mapping": {
              "type": "date",
              "format": "strict_date_optional_time||epoch_millis"
            }
          }
        }
      ]
    }
  }
}'
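
Before indexing anything, you can preview exactly which settings and mappings a matching index would receive from the template; the index name below is hypothetical:

curl -s -X POST "localhost:9200/_index_template/_simulate_index/bulk-test-000001" | jq '.template.settings'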

Configure OS-level performance optimizations

Optimize kernel parameters and file system settings for Elasticsearch performance. These changes improve I/O performance and memory management.

# /etc/sysctl.d/99-elasticsearch.conf

# Virtual memory settings
vm.max_map_count=262144
vm.swappiness=1
vm.dirty_ratio=15
vm.dirty_background_ratio=5

# Network settings
net.core.somaxconn=32768
net.core.netdev_max_backlog=5000
net.core.rmem_default=262144
net.core.rmem_max=16777216
net.core.wmem_default=262144
net.core.wmem_max=16777216

# File system settings
fs.file-max=1000000

Apply the kernel parameter changes:

sudo sysctl -p /etc/sysctl.d/99-elasticsearch.conf
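
Confirm the kernel now reports the new values:

sysctl vm.max_map_count vm.swappiness fs.file-max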

Set file descriptor limits for Elasticsearch user

Increase file descriptor limits to handle large numbers of concurrent connections and open files during bulk operations. Add these entries to /etc/security/limits.conf (or a file under /etc/security/limits.d/); note that they apply to login sessions, while the systemd service picks up its limits from the unit override in the next step.

elasticsearch soft nofile 1000000
elasticsearch hard nofile 1000000
elasticsearch soft nproc 32768
elasticsearch hard nproc 32768
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited

Configure systemd service limits

Update the Elasticsearch systemd service to use the increased limits and prevent memory swapping. Create a drop-in override with sudo systemctl edit elasticsearch and add:

[Service]
LimitNOFILE=1000000
LimitNPROC=32768
LimitMEMLOCK=infinity
TimeoutStartSec=180

Reload systemd and restart Elasticsearch:

sudo systemctl daemon-reload
sudo systemctl restart elasticsearch
sudo systemctl status elasticsearch
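
Verify the limits actually reached the service and the running process; the second command reads the file descriptor ceiling Elasticsearch itself sees:

systemctl show elasticsearch -p LimitNOFILE -p LimitMEMLOCK
curl -s "localhost:9200/_nodes/stats/process?filter_path=nodes.*.process.max_file_descriptors" | jq .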

Create bulk indexing script with optimal batch sizes

Create a Python script that demonstrates optimal bulk indexing techniques with proper error handling and performance monitoring.

#!/usr/bin/env python3
import json
import time
import requests
from datetime import datetime, timezone
import threading
from queue import Queue

class BulkIndexer:
    def __init__(self, es_host="localhost:9200", batch_size=5000, workers=4):
        self.es_host = es_host
        self.batch_size = batch_size
        self.workers = workers
        self.queue = Queue(maxsize=workers * 2)
        self.stats = {
            'indexed': 0,
            'errors': 0,
            'start_time': time.time()
        }
    
    def bulk_index_worker(self):
        """Worker thread for bulk indexing"""
        session = requests.Session()
        while True:
            batch = self.queue.get()
            if batch is None:
                break
            
            try:
                response = session.post(
                    f"http://{self.es_host}/_bulk",
                    data=batch,
                    headers={'Content-Type': 'application/x-ndjson'},
                    timeout=300
                )
                
                if response.status_code == 200:
                    result = response.json()
                    self.stats['indexed'] += len(result.get('items', []))
                    
                    # Check for indexing errors
                    for item in result.get('items', []):
                        if 'error' in item.get('index', {}):
                            self.stats['errors'] += 1
                            print(f"Index error: {item['index']['error']}")
                else:
                    self.stats['errors'] += 1
                    print(f"HTTP Error: {response.status_code} - {response.text}")
                    
            except Exception as e:
                self.stats['errors'] += 1
                print(f"Request error: {e}")
            finally:
                self.queue.task_done()
    
    def index_documents(self, documents, index_name):
        """Index documents using bulk API with optimal batching"""
        # Start worker threads
        threads = []
        for _ in range(self.workers):
            t = threading.Thread(target=self.bulk_index_worker)
            t.daemon = True
            t.start()
            threads.append(t)
        
        # Process documents in batches
        batch = []
        for doc in documents:
            # Add index action
            action = {"index": {"_index": index_name}}
            batch.append(json.dumps(action))
            batch.append(json.dumps(doc))
            
            if len(batch) >= self.batch_size * 2:  # * 2 because each doc has an action line + a data line
                self.queue.put('\n'.join(batch) + '\n')
                batch = []
        
        # Process remaining documents
        if batch:
            self.queue.put('\n'.join(batch) + '\n')
        
        # Wait for all tasks to complete
        self.queue.join()
        
        # Stop workers
        for _ in range(self.workers):
            self.queue.put(None)
        for t in threads:
            t.join()
        
        # Print statistics
        elapsed = time.time() - self.stats['start_time']
        rate = self.stats['indexed'] / elapsed if elapsed > 0 else 0
        print(f"Indexing complete:")
        print(f"  Documents: {self.stats['indexed']}")
        print(f"  Errors: {self.stats['errors']}")
        print(f"  Time: {elapsed:.2f}s")
        print(f"  Rate: {rate:.2f} docs/sec")

Example usage

if __name__ == "__main__":
    # Generate sample documents
    def generate_sample_docs(count=100000):
        for i in range(count):
            yield {
                "@timestamp": datetime.now(timezone.utc).isoformat(),
                "message": f"Sample log message {i}",
                "level": "INFO" if i % 10 != 0 else "ERROR",
                "user_id": i % 1000,
                "request_id": f"req_{i}",
                "response_time": (i % 100) * 10
            }

    indexer = BulkIndexer(batch_size=5000, workers=4)
    docs = generate_sample_docs(100000)
    indexer.index_documents(docs, "bulk-test-index")

Save the script as /opt/elasticsearch/bulk_indexer.py and make it executable:

sudo chmod 755 /opt/elasticsearch/bulk_indexer.py

Install Python dependencies for bulk indexing

Install the required Python packages for the bulk indexing script.

On Ubuntu 24.04 / Debian 12:

sudo apt update
sudo apt install -y python3-pip python3-venv
python3 -m venv /opt/elasticsearch/venv
source /opt/elasticsearch/venv/bin/activate
pip install requests

On AlmaLinux 9 / Rocky Linux 9 (the venv module ships with the python3 package, so no separate python3-venv package is needed):

sudo dnf install -y python3-pip
python3 -m venv /opt/elasticsearch/venv
source /opt/elasticsearch/venv/bin/activate
pip install requests
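
A quick smoke test that the environment can reach the cluster (this assumes HTTP on the default port with security disabled, as elsewhere in this guide):

python3 -c "import requests; print(requests.get('http://localhost:9200').json()['version']['number'])"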

Configure monitoring for indexing performance

Set up monitoring to track indexing performance and identify bottlenecks. These API calls provide real-time metrics during bulk operations.

#!/bin/bash

echo "=== Elasticsearch Indexing Performance Monitor ==="
echo "Press Ctrl+C to stop monitoring"
echo

while true; do
    echo "--- $(date) ---"
    
    # Indexing stats
    curl -s "localhost:9200/_stats/indexing" | jq -r '
        .indices | to_entries[] | 
        "\(.key): indexed=\(.value.total.indexing.index_total) time=\(.value.total.indexing.index_time_in_millis)ms"
    ' | head -5
    
    echo
    
    # Thread pool stats
    curl -s "localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed"
    
    echo
    
    # JVM stats
    curl -s "localhost:9200/_nodes/stats/jvm" | jq -r '
        .nodes[] | 
        "JVM: heap_used=\(.jvm.mem.heap_used_percent)% gc_time=\(.jvm.gc.collectors.old.collection_time_in_millis)ms"
    '
    
    echo "----------------------------------------"
    sleep 10
done
Save this as /opt/elasticsearch/monitor_indexing.sh and make it executable:

sudo chmod 755 /opt/elasticsearch/monitor_indexing.sh

Optimize index settings for specific use cases

Configure time-based indices for log data

For time-series data like logs, use daily or hourly indices with Index Lifecycle Management (ILM) to optimize performance and storage.

curl -X PUT "localhost:9200/_ilm/policy/logs_policy" -H "Content-Type: application/json" -d '
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "10GB",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "1d",
        "actions": {
          "set_priority": {
            "priority": 50
          },
          "allocate": {
            "number_of_replicas": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}'
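
Confirm the policy was stored with the expected phases:

curl -s "localhost:9200/_ilm/policy/logs_policy" | jq '.logs_policy.policy.phases | keys'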

Create data stream template for continuous indexing

Set up a data stream template that automatically applies the ILM policy and optimized settings for continuous data ingestion.

curl -X PUT "localhost:9200/_index_template/logs_stream_template" -H "Content-Type: application/json" -d '
{
  "index_patterns": ["logs-app-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs-app",
      "number_of_shards": 2,
      "number_of_replicas": 0,
      "refresh_interval": "30s",
      "index.codec": "best_compression"
    },
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "message": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "level": {
          "type": "keyword"
        }
      }
    }
  },
  "priority": 200
}'
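
To verify the template end to end, create a matching data stream and check that ILM is managing its backing index; the stream name is just an example that matches the logs-app-* pattern:

curl -X PUT "localhost:9200/_data_stream/logs-app-test"
curl -s "localhost:9200/logs-app-test/_ilm/explain" | jq '.indices[].managed'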

Verify your setup

Test the optimized configuration by running the bulk indexing script and monitoring performance:

# Test the bulk indexer
source /opt/elasticsearch/venv/bin/activate
python3 /opt/elasticsearch/bulk_indexer.py

# Monitor indexing in another terminal
/opt/elasticsearch/monitor_indexing.sh

# Check cluster health
curl -s "localhost:9200/_cluster/health" | jq .

# Verify index statistics
curl -s "localhost:9200/bulk-test-index/_stats" | jq '.indices[].total.indexing'

# Check JVM memory usage
curl -s "localhost:9200/_nodes/stats/jvm" | jq '.nodes[].jvm.mem'

Note: After bulk indexing completes, consider increasing the replica count and shortening the refresh interval for better search performance.

Performance tuning recommendations

Scenario              Batch Size   Workers  Refresh Interval  Replicas
Initial data load     5000-10000   4-8      -1 (disable)      0
Real-time ingestion   1000-2000    2-4      30s-60s           1
Log aggregation       3000-5000    4-6      30s               0-1
Time-series metrics   2000-3000    3-5      30s               1
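
To apply the initial data load profile to an existing index before a large import (hypothetical index name; restore the settings afterwards as shown in the verification section):

curl -X PUT "localhost:9200/bulk-test-index/_settings" -H "Content-Type: application/json" -d '
{
  "refresh_interval": "-1",
  "number_of_replicas": 0
}'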

Common issues

Symptom                             Cause                                       Fix
Bulk requests timing out            Batch size too large or insufficient heap  Reduce batch size to 1000-2000 docs, increase JVM heap
High memory usage during indexing   Index buffer size too high                 Reduce indices.memory.index_buffer_size to 10%-15%
Thread pool rejections              Too many concurrent bulk requests          Increase thread_pool.write.queue_size or reduce workers
Slow indexing performance           Too many replicas during bulk load         Set replicas to 0 during indexing, increase after completion
OutOfMemoryError                    JVM heap too small for workload            Increase heap size but keep under 32GB, monitor GC logs
Disk space issues                   Translog and segment files accumulating    Reduce refresh_interval and translog.flush_threshold_size

Warning: Never disable refresh completely (-1) in production without a plan to re-enable it. Documents won't be searchable until refresh occurs.
