Configure Elasticsearch 8 for maximum indexing performance when handling large datasets through bulk API optimization, JVM memory tuning, and index mapping strategies. This guide covers production-grade performance tuning for high-throughput indexing workloads.
Prerequisites
- Elasticsearch 8 installed and running
- At least 16GB RAM available
- Root or sudo access
- Python 3.6+ installed
What this solves
When indexing large datasets into Elasticsearch, default configurations often result in poor performance, high memory usage, and indexing timeouts. This tutorial shows you how to optimize Elasticsearch 8 for high-throughput bulk indexing operations through JVM heap tuning, bulk API configuration, index settings optimization, and OS-level performance improvements.
You'll learn to handle datasets with millions of documents efficiently while maintaining cluster stability and search performance. These optimizations are essential for log aggregation systems, data lakes, and real-time analytics platforms that require fast data ingestion.
Prerequisites and system requirements
This tutorial assumes you have Elasticsearch 8 already installed and running. If you need to install Elasticsearch first, follow our Elasticsearch 8 installation guide.
Your system should have at least 16GB RAM for optimal performance with the configurations shown here. For production environments processing large datasets, 32GB or more is recommended.
Step-by-step performance optimization
Configure JVM heap size for optimal memory usage
Set the JVM heap to 50% of available RAM, but keep it below roughly 31GB so the JVM can still use compressed object pointers; crossing that threshold makes every object pointer larger and can actually reduce usable heap. The remaining memory is left to the OS file system cache, which Elasticsearch relies on heavily for performance. In Elasticsearch 8, place these flags in a file under /etc/elasticsearch/jvm.options.d/ rather than editing jvm.options directly.
# Set initial and maximum heap size
-Xms16g
-Xmx16g
Enable G1GC for better large heap performance
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:+G1UseAdaptiveIHOP
-XX:G1MixedGCCountTarget=8
-XX:G1HeapWastePercent=5
Optimize GC logging for monitoring
-Xlog:gc*,gc+age=trace,safepoint:gc.log:time,level,tags
-XX:+UnlockDiagnosticVMOptions
-XX:+LogVMOutput
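The heap-sizing rule above (half of RAM, capped below the compressed-oops cutoff) can be expressed as a small helper. This is an illustrative sketch: `recommended_heap_gb` is a hypothetical function name, and 31GB is a conservative cap chosen to stay under the ~32GB compressed-oops threshold.

```python
def recommended_heap_gb(total_ram_gb: int) -> int:
    """Half of physical RAM, capped below the ~32 GB compressed-oops cutoff."""
    COMPRESSED_OOPS_SAFE_GB = 31  # stay under the threshold with some margin
    return min(total_ram_gb // 2, COMPRESSED_OOPS_SAFE_GB)

# A 32 GB host gets -Xms16g/-Xmx16g; a 128 GB host is capped at 31 GB
for ram in (16, 32, 64, 128):
    print(f"{ram} GB RAM -> -Xms{recommended_heap_gb(ram)}g -Xmx{recommended_heap_gb(ram)}g")
```

The 16g values used in this guide correspond to a 32GB host under this rule.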
Optimize Elasticsearch cluster settings for bulk operations
Configure settings to handle high indexing loads efficiently. In Elasticsearch 8 the write thread pool and indexing buffer sizes are static node settings, so they cannot be changed through the cluster settings API; add them to /etc/elasticsearch/elasticsearch.yml and restart the node:

thread_pool.write.queue_size: 1000
thread_pool.write.size: 8
indices.memory.index_buffer_size: 20%
indices.memory.min_index_buffer_size: 96mb

The disk watermarks are dynamic and can be applied at runtime through the cluster settings API:

curl -X PUT "localhost:9200/_cluster/settings" -H "Content-Type: application/json" -d '
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}'
Create optimized index templates for bulk indexing
Set up index templates with settings optimized for high-throughput indexing. These settings reduce replica overhead during bulk operations and optimize segment merging.
curl -X PUT "localhost:9200/_index_template/bulk_optimized_template" -H "Content-Type: application/json" -d '
{
  "index_patterns": ["logs-*", "metrics-*", "bulk-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 0,
      "refresh_interval": "30s",
      "index.translog.flush_threshold_size": "1gb",
      "index.translog.sync_interval": "30s",
      "index.merge.policy.max_merge_at_once": 5,
      "index.merge.policy.segments_per_tier": 5,
      "index.merge.scheduler.max_thread_count": 2,
      "index.codec": "best_compression",
      "index.mapping.total_fields.limit": 10000
    },
    "mappings": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        {
          "timestamps": {
            "match": "*timestamp*",
            "mapping": {
              "type": "date",
              "format": "strict_date_optional_time||epoch_millis"
            }
          }
        }
      ]
    }
  }
}'
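To sanity-check which indices this template will apply to, the wildcard patterns can be tested locally with Python's `fnmatch` (an illustration only; `template_applies` is a hypothetical helper, and real template selection in Elasticsearch also depends on template priority):

```python
from fnmatch import fnmatch

# Same patterns as the bulk_optimized_template above
patterns = ["logs-*", "metrics-*", "bulk-*"]

def template_applies(index_name):
    """True if any of the template's wildcard patterns matches the index name."""
    return any(fnmatch(index_name, p) for p in patterns)

print(template_applies("logs-2024.06.01"))  # True: matches logs-*
print(template_applies("orders-2024"))      # False: no pattern matches
```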
Configure OS-level performance optimizations
Optimize kernel parameters for Elasticsearch performance. Add the following settings to /etc/sysctl.d/99-elasticsearch.conf; they improve I/O performance and memory management.
# Virtual memory settings
vm.max_map_count=262144
vm.swappiness=1
vm.dirty_ratio=15
vm.dirty_background_ratio=5
# Network settings
net.core.somaxconn=32768
net.core.netdev_max_backlog=5000
net.core.rmem_default=262144
net.core.rmem_max=16777216
net.core.wmem_default=262144
net.core.wmem_max=16777216
# File system settings
fs.file-max=1000000
Apply the kernel parameter changes:
sudo sysctl -p /etc/sysctl.d/99-elasticsearch.conf
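Verifying that the values actually took effect can be scripted by parsing the config file and comparing it against live values (for example, read from /proc/sys). A minimal sketch; `parse_sysctl` and `mismatches` are hypothetical helpers, and the live values are passed in explicitly so the comparison logic stays testable anywhere:

```python
def parse_sysctl(text):
    """Parse key=value lines from a sysctl config, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        settings[key.strip()] = value.strip()
    return settings

def mismatches(expected, live):
    """Return expected settings whose live value differs or is missing."""
    return {k: v for k, v in expected.items() if live.get(k) != v}

config = """
# Elasticsearch performance optimizations
vm.max_map_count=262144
vm.swappiness=1
"""
expected = parse_sysctl(config)
# Simulated live values: max_map_count still at the common default
print(mismatches(expected, {"vm.max_map_count": "65530", "vm.swappiness": "1"}))
# -> {'vm.max_map_count': '262144'}
```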
Set file descriptor limits for Elasticsearch user
Increase file descriptor limits to handle large numbers of concurrent connections and open files during bulk operations. Add these lines to /etc/security/limits.conf (or a file under /etc/security/limits.d/):
elasticsearch soft nofile 1000000
elasticsearch hard nofile 1000000
elasticsearch soft nproc 32768
elasticsearch hard nproc 32768
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
Configure systemd service limits
Update the Elasticsearch systemd service to use the increased limits and prevent memory swapping. Create an override with sudo systemctl edit elasticsearch and add:
[Service]
LimitNOFILE=1000000
LimitNPROC=32768
LimitMEMLOCK=infinity
TimeoutStartSec=180
Reload systemd and restart Elasticsearch:
sudo systemctl daemon-reload
sudo systemctl restart elasticsearch
sudo systemctl status elasticsearch
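With LimitMEMLOCK in place, it is worth confirming that memory locking actually took effect via GET _nodes/process, which reports an mlockall flag per node. The check over the parsed response can be written as a pure function (a sketch; `all_nodes_mlockall` is a hypothetical helper, and the sample dict mirrors the standard nodes-info response shape):

```python
def all_nodes_mlockall(nodes_info):
    """True if every node in a GET _nodes/process response reports mlockall."""
    nodes = nodes_info.get("nodes", {})
    return bool(nodes) and all(
        n.get("process", {}).get("mlockall", False) for n in nodes.values()
    )

sample = {"nodes": {"abc123": {"process": {"mlockall": True}}}}
print(all_nodes_mlockall(sample))  # True
```

Note that mlockall also requires bootstrap.memory_lock: true in elasticsearch.yml, not just the systemd limit.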
Create bulk indexing script with optimal batch sizes
Create a Python script that demonstrates optimal bulk indexing techniques with proper error handling and performance monitoring.
#!/usr/bin/env python3
import json
import time
import requests
from datetime import datetime, timezone
import threading
from queue import Queue


class BulkIndexer:
    def __init__(self, es_host="localhost:9200", batch_size=5000, workers=4):
        self.es_host = es_host
        self.batch_size = batch_size
        self.workers = workers
        self.queue = Queue(maxsize=workers * 2)
        self.stats = {
            'indexed': 0,
            'errors': 0,
            'start_time': time.time()
        }

    def bulk_index_worker(self):
        """Worker thread for bulk indexing"""
        session = requests.Session()
        while True:
            batch = self.queue.get()
            if batch is None:
                break
            try:
                response = session.post(
                    f"http://{self.es_host}/_bulk",
                    data=batch,
                    headers={'Content-Type': 'application/x-ndjson'},
                    timeout=300
                )
                if response.status_code == 200:
                    result = response.json()
                    self.stats['indexed'] += len(result.get('items', []))
                    # Check for indexing errors
                    for item in result.get('items', []):
                        if 'error' in item.get('index', {}):
                            self.stats['errors'] += 1
                            print(f"Index error: {item['index']['error']}")
                else:
                    self.stats['errors'] += 1
                    print(f"HTTP Error: {response.status_code} - {response.text}")
            except Exception as e:
                self.stats['errors'] += 1
                print(f"Request error: {e}")
            finally:
                self.queue.task_done()

    def index_documents(self, documents, index_name):
        """Index documents using bulk API with optimal batching"""
        # Start worker threads
        threads = []
        for _ in range(self.workers):
            t = threading.Thread(target=self.bulk_index_worker)
            t.daemon = True
            t.start()
            threads.append(t)
        # Process documents in batches
        batch = []
        for doc in documents:
            # Add index action
            action = {"index": {"_index": index_name}}
            batch.append(json.dumps(action))
            batch.append(json.dumps(doc))
            if len(batch) >= self.batch_size * 2:  # * 2 because each doc has action + data
                self.queue.put('\n'.join(batch) + '\n')
                batch = []
        # Process remaining documents
        if batch:
            self.queue.put('\n'.join(batch) + '\n')
        # Wait for all tasks to complete
        self.queue.join()
        # Stop workers
        for _ in range(self.workers):
            self.queue.put(None)
        for t in threads:
            t.join()
        # Print statistics
        elapsed = time.time() - self.stats['start_time']
        rate = self.stats['indexed'] / elapsed if elapsed > 0 else 0
        print("Indexing complete:")
        print(f"  Documents: {self.stats['indexed']}")
        print(f"  Errors: {self.stats['errors']}")
        print(f"  Time: {elapsed:.2f}s")
        print(f"  Rate: {rate:.2f} docs/sec")


# Example usage
if __name__ == "__main__":
    # Generate sample documents
    def generate_sample_docs(count=100000):
        for i in range(count):
            yield {
                "@timestamp": datetime.now(timezone.utc).isoformat(),
                "message": f"Sample log message {i}",
                "level": "INFO" if i % 10 != 0 else "ERROR",
                "user_id": i % 1000,
                "request_id": f"req_{i}",
                "response_time": (i % 100) * 10
            }

    indexer = BulkIndexer(batch_size=5000, workers=4)
    docs = generate_sample_docs(100000)
    indexer.index_documents(docs, "bulk-test-index")
Make the script executable:
sudo chmod 755 /opt/elasticsearch/bulk_indexer.py
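The script above batches by document count; Elastic's guidance is usually framed in payload bytes (roughly 5-15MB per bulk request). A count-based batcher can be complemented with a byte-size cap. This is a sketch, not part of the script above: `chunk_by_bytes` is a hypothetical helper, and the 10MB default is an example value.

```python
import json

def chunk_by_bytes(documents, index_name, max_bytes=10 * 1024 * 1024):
    """Yield NDJSON bulk bodies, starting a new batch before max_bytes is exceeded."""
    lines, size = [], 0
    for doc in documents:
        action = json.dumps({"index": {"_index": index_name}})
        source = json.dumps(doc)
        entry_size = len(action) + len(source) + 2  # +2 for the two newlines
        if lines and size + entry_size > max_bytes:
            yield "\n".join(lines) + "\n"
            lines, size = [], 0
        lines.append(action)
        lines.append(source)
        size += entry_size
    if lines:
        yield "\n".join(lines) + "\n"

docs = [{"message": "x" * 100, "n": i} for i in range(1000)]
batches = list(chunk_by_bytes(docs, "bulk-test-index", max_bytes=20_000))
print(len(batches), "batches")
```

Each yielded body can be POSTed to _bulk exactly like the batches in the worker above.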
Install Python dependencies for bulk indexing
Install the required Python packages for the bulk indexing script.
sudo apt update
sudo apt install -y python3-pip python3-venv
python3 -m venv /opt/elasticsearch/venv
source /opt/elasticsearch/venv/bin/activate
pip install requests
Configure monitoring for indexing performance
Set up monitoring to track indexing performance and identify bottlenecks. These API calls provide real-time metrics during bulk operations.
#!/bin/bash
echo "=== Elasticsearch Indexing Performance Monitor ==="
echo "Press Ctrl+C to stop monitoring"
echo
while true; do
echo "--- $(date) ---"
# Indexing stats
curl -s "localhost:9200/_stats/indexing" | jq -r '
.indices | to_entries[] |
"\(.key): indexed=\(.value.total.indexing.index_total) time=\(.value.total.indexing.index_time_in_millis)ms"
' | head -5
echo
# Thread pool stats
curl -s "localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed"
echo
# JVM stats
curl -s "localhost:9200/_nodes/stats/jvm" | jq -r '
.nodes[] |
"JVM: heap_used=\(.jvm.mem.heap_used_percent)% gc_time=\(.jvm.gc.collectors.old.collection_time_in_millis)ms"
'
echo "----------------------------------------"
sleep 10
done
sudo chmod 755 /opt/elasticsearch/monitor_indexing.sh
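The _cat/thread_pool output used in the monitor is column-aligned text; a small parser makes it easy to alert on rejections programmatically instead of eyeballing the terminal. A sketch assuming the header columns requested above (`parse_cat_threadpool` is a hypothetical helper):

```python
def parse_cat_threadpool(cat_output):
    """Parse _cat/thread_pool output (with a ?v header row) into dicts."""
    lines = [l for l in cat_output.strip().splitlines() if l.strip()]
    header = lines[0].split()
    return [dict(zip(header, row.split())) for row in lines[1:]]

sample = """node_name name  active queue rejected completed
node-1    write 4      12    0        981245
node-2    write 8      1000  37       765432
"""
rows = parse_cat_threadpool(sample)
rejected = [r["node_name"] for r in rows if int(r["rejected"]) > 0]
print(rejected)  # -> ['node-2']
```

A nonzero rejected count is the signal to shrink batch sizes or worker counts on the client side.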
Optimize index settings for specific use cases
Configure time-based indices for log data
For time-series data like logs, use daily or hourly indices with Index Lifecycle Management (ILM) to optimize performance and storage.
curl -X PUT "localhost:9200/_ilm/policy/logs_policy" -H "Content-Type: application/json" -d '
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_size": "10GB",
"max_age": "1d"
},
"set_priority": {
"priority": 100
}
}
},
"warm": {
"min_age": "1d",
"actions": {
"set_priority": {
"priority": 50
},
"allocate": {
"number_of_replicas": 1
},
"forcemerge": {
"max_num_segments": 1
}
}
},
"delete": {
"min_age": "30d",
"actions": {
"delete": {}
}
}
}
}
}'
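The hot-phase rollover above fires when ANY configured condition is met: the index exceeds max_size or outlives max_age, whichever comes first. That OR semantics can be sketched as a pure function (`should_rollover` is a hypothetical illustration; threshold defaults mirror the policy above):

```python
def should_rollover(size_gb, age_days, max_size_gb=10, max_age_days=1):
    """ILM rollover fires when ANY configured max_* condition is met."""
    return size_gb >= max_size_gb or age_days >= max_age_days

print(should_rollover(size_gb=12, age_days=0.2))  # True: size threshold hit first
print(should_rollover(size_gb=3, age_days=0.5))   # False: neither condition met
```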
Create data stream template for continuous indexing
Set up a data stream template that automatically applies the ILM policy and optimized settings for continuous data ingestion.
curl -X PUT "localhost:9200/_index_template/logs_stream_template" -H "Content-Type: application/json" -d '
{
"index_patterns": ["logs-app-*"],
"data_stream": {},
"template": {
"settings": {
"index.lifecycle.name": "logs_policy",
"index.lifecycle.rollover_alias": "logs-app",
"number_of_shards": 2,
"number_of_replicas": 0,
"refresh_interval": "30s",
"index.codec": "best_compression"
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"level": {
"type": "keyword"
}
}
}
},
"priority": 200
}'
Verify your setup
Test the optimized configuration by running the bulk indexing script and monitoring performance:
# Test the bulk indexer
source /opt/elasticsearch/venv/bin/activate
python3 /opt/elasticsearch/bulk_indexer.py

# Monitor indexing in another terminal
/opt/elasticsearch/monitor_indexing.sh

# Check cluster health
curl -s "localhost:9200/_cluster/health" | jq .

# Verify index statistics
curl -s "localhost:9200/bulk-test-index/_stats" | jq '.indices[].total.indexing'

# Check JVM memory usage
curl -s "localhost:9200/_nodes/stats/jvm" | jq '.nodes[].jvm.mem'
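When scripting these checks, the _cluster/health response can be reduced to a pass/fail plus a short note (a sketch over the standard health fields; `health_summary` and its thresholds are illustrative choices, e.g. treating yellow as acceptable during a replicas=0 bulk load):

```python
def health_summary(health):
    """Summarize a _cluster/health response for bulk-indexing checks."""
    status = health.get("status", "unknown")
    relocating = health.get("relocating_shards", 0)
    pending = health.get("number_of_pending_tasks", 0)
    # Yellow is tolerable mid-load (unassigned replicas); red is not
    ok = status in ("green", "yellow") and pending < 100
    return {"ok": ok, "status": status,
            "note": f"relocating={relocating} pending_tasks={pending}"}

sample = {"status": "yellow", "relocating_shards": 0,
          "number_of_pending_tasks": 2}
print(health_summary(sample))
```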
Performance tuning recommendations
| Scenario | Batch Size | Workers | Refresh Interval | Replicas |
|---|---|---|---|---|
| Initial data load | 5000-10000 | 4-8 | -1 (disable) | 0 |
| Real-time ingestion | 1000-2000 | 2-4 | 30s-60s | 1 |
| Log aggregation | 3000-5000 | 4-6 | 30s | 0-1 |
| Time-series metrics | 2000-3000 | 3-5 | 30s | 1 |
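The table above can be encoded directly, which is handy when an ingestion job wants to pick settings by scenario. The values are transcribed from the table (taking the upper end of each range); treat them as starting points, not hard rules, and note `TUNING_PROFILES` and `index_settings` are hypothetical names:

```python
TUNING_PROFILES = {
    "initial_load":    {"batch_size": 10000, "workers": 8, "refresh_interval": "-1",  "replicas": 0},
    "realtime":        {"batch_size": 2000,  "workers": 4, "refresh_interval": "30s", "replicas": 1},
    "log_aggregation": {"batch_size": 5000,  "workers": 6, "refresh_interval": "30s", "replicas": 0},
    "timeseries":      {"batch_size": 3000,  "workers": 5, "refresh_interval": "30s", "replicas": 1},
}

def index_settings(profile):
    """Index-level settings body suitable for PUT <index>/_settings."""
    p = TUNING_PROFILES[profile]
    return {"index": {"refresh_interval": p["refresh_interval"],
                      "number_of_replicas": p["replicas"]}}

print(index_settings("initial_load"))
```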
Common issues
| Symptom | Cause | Fix |
|---|---|---|
| Bulk requests timing out | Batch size too large or insufficient heap | Reduce batch size to 1000-2000 docs, increase JVM heap |
| High memory usage during indexing | Index buffer size too high | Reduce indices.memory.index_buffer_size to 10%-15% |
| Thread pool rejections | Too many concurrent bulk requests | Increase thread_pool.write.queue_size or reduce workers |
| Slow indexing performance | Too many replicas during bulk load | Set replicas to 0 during indexing, increase after completion |
| OutOfMemoryError | JVM heap too small for workload | Increase heap size but keep under 32GB, monitor GC logs |
| Disk space issues | Translog and segment files accumulating | Reduce refresh_interval and translog.flush_threshold_size |
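For the thread pool rejections row in particular, the usual client-side complement is to retry 429 responses with capped exponential backoff rather than enlarging server queues indefinitely. A minimal backoff schedule (a sketch; the base, cap, and retry count are example values):

```python
def backoff_delays(retries, base=0.5, cap=30.0):
    """Exponential backoff delays in seconds: base * 2^attempt, capped."""
    return [min(base * (2 ** attempt), cap) for attempt in range(retries)]

def should_retry(status_code, attempt, max_retries=6):
    """Retry a bulk request only on HTTP 429 (thread pool rejection)."""
    return status_code == 429 and attempt < max_retries

print(backoff_delays(6))  # -> [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
```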
Next steps
- Configure Jaeger with Elasticsearch backend for application performance monitoring
- Set up Grafana dashboards for monitoring Elasticsearch performance metrics
- Implement Spark streaming with Kafka for real-time data processing before indexing
- Configure Elasticsearch cross-cluster replication for high availability
- Implement Elasticsearch snapshot lifecycle management for data backup and archival
Automated install script
Run this script to automate the entire setup:
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Usage function
usage() {
echo "Usage: $0 [OPTIONS]"
echo "Options:"
echo " -h, --help Show this help message"
echo " --heap-size SIZE Set JVM heap size (default: 16g)"
echo " --skip-os-tuning Skip OS-level optimizations"
exit 1
}
# Default values
HEAP_SIZE="16g"
SKIP_OS_TUNING=false
# Parse arguments
while [[ $# -gt 0 ]]; do
case $1 in
-h|--help)
usage
;;
--heap-size)
HEAP_SIZE="$2"
shift 2
;;
--skip-os-tuning)
SKIP_OS_TUNING=true
shift
;;
*)
echo -e "${RED}Unknown option: $1${NC}"
usage
;;
esac
done
# Cleanup function for rollback on failure
cleanup() {
echo -e "${RED}Script failed! Rolling back changes...${NC}"
if [[ -f /etc/elasticsearch/jvm.options.backup ]]; then
mv /etc/elasticsearch/jvm.options.backup /etc/elasticsearch/jvm.options
fi
if [[ -f /etc/sysctl.conf.backup ]]; then
mv /etc/sysctl.conf.backup /etc/sysctl.conf
fi
exit 1
}
trap cleanup ERR
# Check if running as root
if [[ $EUID -ne 0 ]]; then
echo -e "${RED}This script must be run as root${NC}"
exit 1
fi
# Auto-detect distribution
if [[ -f /etc/os-release ]]; then
. /etc/os-release
case "$ID" in
ubuntu|debian)
PKG_MGR="apt"
PKG_INSTALL="apt install -y"
PKG_UPDATE="apt update"
ES_CONFIG_DIR="/etc/elasticsearch"
ES_SERVICE="elasticsearch"
;;
almalinux|rocky|centos|rhel|ol|fedora)
PKG_MGR="dnf"
PKG_INSTALL="dnf install -y"
PKG_UPDATE="dnf update -y"
ES_CONFIG_DIR="/etc/elasticsearch"
ES_SERVICE="elasticsearch"
;;
amzn)
PKG_MGR="yum"
PKG_INSTALL="yum install -y"
PKG_UPDATE="yum update -y"
ES_CONFIG_DIR="/etc/elasticsearch"
ES_SERVICE="elasticsearch"
;;
*)
echo -e "${RED}Unsupported distribution: $ID${NC}"
exit 1
;;
esac
else
echo -e "${RED}Cannot detect distribution${NC}"
exit 1
fi
echo -e "${GREEN}Starting Elasticsearch 8 performance optimization for $PRETTY_NAME${NC}"
# Check if Elasticsearch is installed
echo -e "${YELLOW}[1/6] Checking prerequisites...${NC}"
# Package installs don't put the elasticsearch binary on PATH, so check the config dir
if [[ ! -d "$ES_CONFIG_DIR" ]]; then
echo -e "${RED}Elasticsearch is not installed. Please install it first.${NC}"
exit 1
fi
if ! systemctl is-active --quiet $ES_SERVICE; then
echo -e "${RED}Elasticsearch service is not running. Starting it...${NC}"
systemctl start $ES_SERVICE
sleep 10
fi
# Install required tools
$PKG_INSTALL curl jq
# Configure JVM heap size and GC settings
echo -e "${YELLOW}[2/6] Configuring JVM heap size and garbage collection...${NC}"
if [[ -f "$ES_CONFIG_DIR/jvm.options" ]]; then
cp "$ES_CONFIG_DIR/jvm.options" "$ES_CONFIG_DIR/jvm.options.backup"
fi
cat > "$ES_CONFIG_DIR/jvm.options.d/performance.options" << EOF
# Heap size configuration
-Xms$HEAP_SIZE
-Xmx$HEAP_SIZE
# G1GC configuration for better large heap performance
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:+G1UseAdaptiveIHOP
-XX:G1MixedGCCountTarget=8
-XX:G1HeapWastePercent=5
# GC logging for monitoring
-Xlog:gc*,gc+age=trace,safepoint:gc.log:time,level,tags
-XX:+UnlockDiagnosticVMOptions
-XX:+LogVMOutput
EOF
chown root:elasticsearch "$ES_CONFIG_DIR/jvm.options.d/performance.options"
chmod 644 "$ES_CONFIG_DIR/jvm.options.d/performance.options"
# Configure OS-level performance optimizations
if [[ "$SKIP_OS_TUNING" == false ]]; then
echo -e "${YELLOW}[3/6] Applying OS-level performance optimizations...${NC}"
# Backup sysctl.conf
cp /etc/sysctl.conf /etc/sysctl.conf.backup
# Apply kernel parameters
cat >> /etc/sysctl.conf << EOF
# Elasticsearch performance optimizations
vm.max_map_count=262144
vm.swappiness=1
vm.dirty_ratio=15
vm.dirty_background_ratio=5
net.core.rmem_max=134217728
net.core.wmem_max=134217728
net.ipv4.tcp_rmem=4096 65536 134217728
net.ipv4.tcp_wmem=4096 65536 134217728
EOF
# Apply settings immediately
sysctl -p
fi
# Restart Elasticsearch to apply JVM changes
echo -e "${YELLOW}[4/6] Restarting Elasticsearch to apply JVM settings...${NC}"
systemctl restart $ES_SERVICE
sleep 15
# Wait for Elasticsearch to be ready
echo "Waiting for Elasticsearch to be ready..."
for i in {1..30}; do
if curl -s http://localhost:9200/_cluster/health >/dev/null 2>&1; then
break
fi
sleep 2
done
# Configure settings for bulk operations
echo -e "${YELLOW}[5/6] Optimizing settings for bulk operations...${NC}"
# Thread pool and index buffer sizes are static node settings in Elasticsearch 8,
# so they belong in elasticsearch.yml rather than the cluster settings API
cat >> "$ES_CONFIG_DIR/elasticsearch.yml" << EOF
# Bulk indexing optimizations
thread_pool.write.queue_size: 1000
thread_pool.write.size: 8
indices.memory.index_buffer_size: 20%
indices.memory.min_index_buffer_size: 96mb
EOF
systemctl restart $ES_SERVICE
sleep 15
# Disk watermarks are dynamic and can be applied through the cluster settings API
curl -X PUT "localhost:9200/_cluster/settings" -H "Content-Type: application/json" -d '{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "85%",
"cluster.routing.allocation.disk.watermark.high": "90%",
"cluster.routing.allocation.disk.watermark.flood_stage": "95%"
}
}' || echo -e "${YELLOW}Warning: Could not update cluster settings. Elasticsearch may still be starting.${NC}"
# Create optimized index template
echo -e "${YELLOW}[6/6] Creating optimized index template for bulk indexing...${NC}"
curl -X PUT "localhost:9200/_index_template/bulk_optimized_template" -H "Content-Type: application/json" -d '{
"index_patterns": ["logs-*", "metrics-*", "bulk-*"],
"template": {
"settings": {
"number_of_shards": 2,
"number_of_replicas": 0,
"refresh_interval": "30s",
"index.translog.flush_threshold_size": "1gb",
"index.translog.sync_interval": "30s",
"index.merge.policy.max_merge_at_once": 5,
"index.merge.policy.segments_per_tier": 5,
"index.merge.scheduler.max_thread_count": 2,
"index.codec": "best_compression",
"index.mapping.total_fields.limit": 10000
},
"mappings": {
"dynamic_templates": [
{
"strings_as_keywords": {
"match_mapping_type": "string",
"mapping": {
"type": "keyword",
"ignore_above": 256
}
}
},
{
"timestamps": {
"match": "*timestamp*",
"mapping": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
}
}
}
]
}
}
}' || echo -e "${YELLOW}Warning: Could not create index template. Elasticsearch may still be starting.${NC}"
# Verification checks
echo -e "${GREEN}Running verification checks...${NC}"
# Check if Elasticsearch is responding
if curl -s http://localhost:9200/_cluster/health | jq -e '.status' >/dev/null 2>&1; then
echo -e "${GREEN}✓ Elasticsearch is responding${NC}"
CLUSTER_STATUS=$(curl -s http://localhost:9200/_cluster/health | jq -r '.status')
echo -e "${GREEN}✓ Cluster status: $CLUSTER_STATUS${NC}"
else
echo -e "${RED}✗ Elasticsearch is not responding properly${NC}"
fi
# Check JVM heap size
HEAP_INFO=$(curl -s http://localhost:9200/_nodes/stats/jvm | jq -r '.nodes | to_entries[0].value.jvm.mem.heap_max_in_bytes' 2>/dev/null || echo "unknown")
if [[ "$HEAP_INFO" =~ ^[0-9]+$ ]]; then
HEAP_GB=$((HEAP_INFO / 1024 / 1024 / 1024))
echo -e "${GREEN}✓ JVM heap size configured: ${HEAP_GB}GB${NC}"
else
echo -e "${YELLOW}⚠ Could not verify JVM heap size${NC}"
fi
# Check if template was created
if curl -sf http://localhost:9200/_index_template/bulk_optimized_template >/dev/null 2>&1; then
echo -e "${GREEN}✓ Bulk optimized template created successfully${NC}"
else
echo -e "${YELLOW}⚠ Could not verify template creation${NC}"
fi
echo -e "${GREEN}Elasticsearch 8 performance optimization completed!${NC}"
echo ""
echo -e "${YELLOW}Next steps:${NC}"
echo "1. Monitor GC logs in Elasticsearch logs directory"
echo "2. Use bulk API with batch sizes of 5-15MB for optimal performance"
echo "3. Monitor cluster performance with: curl localhost:9200/_cluster/stats"
echo "4. Consider adding replicas after bulk indexing is complete"
Review the script before running. Execute with: bash install.sh