EFK Logging Platform: Architecture and Operations Guide

EFK Platform Overview

What is EFK

EFK is a complete solution for log collection, storage, analysis, and visualization, built from three core components:

EFK component architecture:

E - Elasticsearch: distributed search and analytics engine
- Data storage and indexing
- Full-text search and aggregations
- RESTful API
- Horizontal scalability

F - Fluentd: unified log collection and processing
- Multi-source data collection
- Flexible plugin architecture
- Data parsing and routing
- Buffering and retry mechanisms

K - Kibana: data visualization and analysis platform
- Interactive dashboards
- Real-time search and filtering
- Charts and reports
- Alerting and monitoring UI

EFK Core Advantages

Technical advantages:
- Real-time: near-real-time data processing and analysis
- Scalable: horizontal scaling for massive data volumes
- Flexible: supports many data sources and formats
- Open source: fully open source, with an active community
- Ecosystem: rich plugins and integrations

Business value:
- Operations monitoring: real-time visibility into system state
- Troubleshooting: fast fault location and analysis
- Security auditing: tracking and analysis of security events
- Business analytics: user behavior and business metrics
- Compliance: log auditing and compliance reporting

Production Architecture Design

Hardware Resource Planning

Recommended server configurations

# Elasticsearch node sizing
Master nodes:
  CPU: 4-8 cores
  Memory: 8-16 GB
  Storage: 100 GB SSD (OS + configuration)
  Network: 1 Gbps
  Count: 3 (an odd number, to avoid split-brain)

Data nodes:
  CPU: 16-32 cores
  Memory: 64-128 GB
  Storage: 2-8 TB SSD/NVMe (data)
  Network: 10 Gbps
  Count: driven by data volume and performance requirements

Coordinating nodes:
  CPU: 8-16 cores
  Memory: 16-32 GB
  Storage: 200 GB SSD
  Network: 10 Gbps
  Count: 2-4

# Fluentd aggregator sizing
Fluentd aggregators:
  CPU: 8-16 cores
  Memory: 16-32 GB
  Storage: 500 GB SSD (buffering)
  Network: 10 Gbps
  Count: 2-4 (for high availability)

# Kibana sizing
Kibana servers:
  CPU: 4-8 cores
  Memory: 8-16 GB
  Storage: 100 GB SSD
  Network: 1 Gbps
  Count: 2 (behind a load balancer)

Storage Architecture Design

# Storage tiering strategy
Hot tier:
  - Retention: days 0-7
  - Media: NVMe SSD
  - Index settings: tuned for high write and query throughput
  - Replicas: 1

Warm tier:
  - Retention: days 7-30
  - Media: SATA SSD
  - Index settings: read-only, compression-optimized
  - Replicas: 1

Cold tier:
  - Retention: 30 days to 1 year
  - Media: HDD or object storage
  - Index settings: high compression, infrequent queries
  - Replicas: 0

Archive tier:
  - Retention: beyond 1 year
  - Media: object storage (S3/OSS)
  - Index settings: maximum compression, rarely queried
  - Backup: periodic snapshots
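These tier labels only take effect if data nodes advertise them as node attributes; the ILM policy later in this guide allocates shards by a custom box_type attribute. A minimal sketch for a warm-tier node:

# /etc/elasticsearch/elasticsearch.yml on a warm-tier data node
node.attr.box_type: warm

# Verify the attribute is visible cluster-wide
curl -sk -u elastic:changeme "https://10.1.0.20:9200/_cat/nodeattrs?v&h=node,attr,value" | grep box_type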

Network Architecture Design

Network topology planning

# Network segmentation
Management network:
  Purpose: cluster administration and monitoring
  Subnet: 10.1.0.0/24
  Bandwidth: 1 Gbps
  Security: access control lists

Data network:
  Purpose: inter-node data transfer
  Subnet: 10.2.0.0/24
  Bandwidth: 10 Gbps
  Optimization: jumbo frame support

Client network:
  Purpose: external access and API calls
  Subnet: 10.3.0.0/24
  Bandwidth: 1-10 Gbps
  Security: load balancing and firewalls

Storage network:
  Purpose: shared storage access
  Subnet: 10.4.0.0/24
  Bandwidth: 10 Gbps
  Protocols: iSCSI/NFS

Elasticsearch Cluster Deployment

Environment Preparation

OS tuning

# Kernel parameters
# /etc/sysctl.conf
vm.max_map_count=262144
vm.swappiness=1
net.core.somaxconn=65535
net.ipv4.tcp_max_syn_backlog=65535
fs.file-max=655360

# Apply the settings
sysctl -p

# Resource limits for the elasticsearch user
# /etc/security/limits.conf
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
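Before starting Elasticsearch it is worth confirming these settings actually took effect; a quick check, assuming the elasticsearch user already exists:

# Confirm kernel parameters
sysctl vm.max_map_count vm.swappiness

# Confirm the limits seen by the elasticsearch user (open files, processes, memlock)
sudo -u elasticsearch bash -c 'ulimit -n; ulimit -u; ulimit -l'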

# JVM heap sizing principles
# - Heap at most 50% of system RAM
# - Heap below ~32 GB (the compressed-oops threshold)
# - Example: on a 64 GB server, set the ES heap to 30 GB
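To confirm a chosen heap size still benefits from compressed object pointers, ask the JVM directly (the exact cutoff varies slightly between JVM builds):

java -Xmx30g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops
# "UseCompressedOops ... = true" means a 30 GB heap is below the compressed-oops cutoff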

Java runtime installation

# Install OpenJDK 11 or 17
sudo apt update
sudo apt install openjdk-11-jdk

# Verify the Java version
java -version

# Set the JAVA_HOME environment variable
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

Elasticsearch Installation

Package installation

# Import the Elasticsearch GPG key
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

# Add the Elasticsearch repository
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-7.x.list

# Install Elasticsearch
sudo apt update
sudo apt install elasticsearch

# Enable start on boot
sudo systemctl enable elasticsearch

# Create data and log directories
sudo mkdir -p /var/lib/elasticsearch
sudo mkdir -p /var/log/elasticsearch
sudo chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
sudo chown -R elasticsearch:elasticsearch /var/log/elasticsearch

Master node configuration

# /etc/elasticsearch/elasticsearch.yml - master node
cluster.name: production-efk-cluster
node.name: es-master-01
node.roles: [master]

# Network
network.host: 10.1.0.10
http.port: 9200
transport.port: 9300

# Paths
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

# Cluster discovery
discovery.seed_hosts: ["10.1.0.10", "10.1.0.11", "10.1.0.12"]
cluster.initial_master_nodes: ["es-master-01", "es-master-02", "es-master-03"]

# Memory
bootstrap.memory_lock: true

# Security
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12

# Monitoring
xpack.monitoring.collection.enabled: true
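After all three master nodes start, a quick way to confirm the cluster formed and each node carries its intended role (the asterisk in the master column marks the elected master):

curl -sk -u elastic:changeme "https://10.1.0.10:9200/_cat/nodes?v&h=name,node.role,master"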

Data node configuration

# /etc/elasticsearch/elasticsearch.yml - data node
cluster.name: production-efk-cluster
node.name: es-data-01
node.roles: [data, data_content, data_hot, data_warm, data_cold]

# Network
network.host: 10.1.0.20
http.port: 9200
transport.port: 9300

# Paths
path.data: ["/data1/elasticsearch", "/data2/elasticsearch"]
path.logs: /var/log/elasticsearch

# Cluster discovery
discovery.seed_hosts: ["10.1.0.10", "10.1.0.11", "10.1.0.12"]

# Memory
bootstrap.memory_lock: true

# Note: index-level settings (index.number_of_shards, index.number_of_replicas,
# index.store.type, index.merge.scheduler.max_thread_count) may not be set in
# elasticsearch.yml since ES 5.x - the node refuses to start. Define them in
# index templates instead (see the template section later in this guide).

# Indexing buffer (node-level, valid here)
indices.memory.index_buffer_size: 20%
indices.memory.min_index_buffer_size: 96mb

# Security
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12

JVM configuration

# /etc/elasticsearch/jvm.options
# Heap size (adjust to the server's RAM)
-Xms30g
-Xmx30g

# Garbage collector
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
-XX:MaxGCPauseMillis=200

# Memory
-XX:+AlwaysPreTouch
-Xss1m
-Djava.awt.headless=true

# Misc
-Dfile.encoding=UTF-8
-Djna.nosys=true

# GC logging - unified logging syntax (JDK 9+). The old JDK 8 flags
# (-XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles, -XX:GCLogFileSize)
# were removed in JDK 9 and would prevent the JVM from starting;
# rotation is expressed inside -Xlog instead.
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m

# Temp directory
-Djava.io.tmpdir=${ES_TMPDIR}

# Dumps on JVM crashes
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

Fluentd Deployment

Fluentd Installation

Choosing an installation method

# Option 1: official td-agent package
curl -fsSL https://toolbelt.treasuredata.com/sh/install-ubuntu-bionic-td-agent4.sh | sh

# Option 2: Ruby gem
gem install fluentd

# Option 3: Docker container
docker run -d \
  --name fluentd \
  -p 24224:24224 \
  -p 24224:24224/udp \
  -v /var/log:/var/log \
  -v $(pwd)/fluent.conf:/fluentd/etc/fluent.conf \
  fluent/fluentd:v1.16-debian-1

# Enable and start the service (td-agent package)
sudo systemctl enable td-agent
sudo systemctl start td-agent
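For containers whose logs should feed this pipeline, Docker's fluentd logging driver ships stdout/stderr straight to a forward input; a sketch (the image and tag pattern are illustrative):

# Route a container's stdout/stderr to the local forward input on 24224
docker run -d \
  --log-driver=fluentd \
  --log-opt fluentd-address=localhost:24224 \
  --log-opt tag="docker.{{.Name}}" \
  nginx:latest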

Core plugin installation

# Elasticsearch output plugin
sudo td-agent-gem install fluent-plugin-elasticsearch

# Kafka plugin
sudo td-agent-gem install fluent-plugin-kafka

# systemd journal input plugin
sudo td-agent-gem install fluent-plugin-systemd

# Parser plugin
sudo td-agent-gem install fluent-plugin-parser

# Redis plugin (for buffering)
sudo td-agent-gem install fluent-plugin-redis

# Verify the installed plugins
sudo td-agent-gem list | grep fluent-plugin
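With the plugins installed, a one-line smoke test of a forward input (fluent-cat ships inside the td-agent bundle; the path below assumes the td-agent 4 layout, and that the input does not enforce a shared key):

echo '{"message":"hello from fluent-cat","level":"info"}' | \
  /opt/td-agent/embedded/bin/fluent-cat debug.test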

Agent Node Configuration

Basic log collection configuration

# /etc/td-agent/td-agent.conf - agent node

# Global settings
<system>
  log_level info
  workers 4
  root_dir /var/log/td-agent
</system>

# Input - systemd journal
<source>
  @type systemd
  @id systemd_input
  tag systemd
  path /var/log/journal
  <storage>
    @type local
    persistent true
    path /var/log/td-agent/systemd.pos
  </storage>
  <entry>
    field_map {"MESSAGE": "message", "_HOSTNAME": "hostname", "_SYSTEMD_UNIT": "unit"}
    field_map_strict true
  </entry>
</source>

# Input - Nginx access log
<source>
  @type tail
  @id nginx_access_log
  tag nginx.access
  path /var/log/nginx/access.log
  pos_file /var/log/td-agent/nginx_access.pos
  refresh_interval 10
  <parse>
    # The built-in nginx parser supplies the access-log regexp and the
    # %d/%b/%Y:%H:%M:%S %z time format
    @type nginx
  </parse>
</source>

# Input - application logs
<source>
  @type tail
  @id application_log
  tag app.logs
  path /var/log/app/*.log
  pos_file /var/log/td-agent/app.pos
  refresh_interval 5
  <parse>
    @type json
    time_key time
    time_format %Y-%m-%d %H:%M:%S.%L
  </parse>
</source>

# Input - Docker container logs (forward protocol)
<source>
  @type forward
  @id docker_input
  port 24224
  bind 0.0.0.0
  <security>
    self_hostname "#{Socket.gethostname}"
    shared_key fluentd_shared_key
  </security>
</source>

# Processing - attach hostname and timestamp to every record
<filter **>
  @type record_transformer
  <record>
    hostname "#{Socket.gethostname}"
    timestamp ${time}
  </record>
</filter>

# Processing - GeoIP enrichment of access logs
<filter nginx.access>
  @type geoip
  geoip_lookup_keys remote
  <record>
    location_country ${country_name["remote"]}
    location_city ${city_name["remote"]}
    location_latitude ${latitude["remote"]}
    location_longitude ${longitude["remote"]}
  </record>
  skip_adding_null_record false
</filter>

# Processing - mask sensitive data (the two substitutions are chained so
# both passwords and tokens are masked in the same message field)
<filter app.logs>
  @type record_transformer
  enable_ruby true
  <record>
    message ${record["message"].gsub(/password["\s]*[:=]["\s]*[^"\s,}]+/, 'password=***').gsub(/token["\s]*[:=]["\s]*[^"\s,}]+/, 'token=***')}
  </record>
</filter>

# Output - forward everything to the Fluentd aggregators
<match **>
  @type forward
  @id forward_output

  <server>
    name aggregator1
    host 10.1.0.30
    port 24224
    weight 60
  </server>
  <server>
    name aggregator2
    host 10.1.0.31
    port 24224
    weight 40
  </server>

  # Buffering
  <buffer>
    @type file
    path /var/log/td-agent/buffer/forward
    flush_mode interval
    flush_interval 30s
    flush_thread_count 2
    retry_type exponential_backoff
    retry_forever
    retry_max_interval 30
    chunk_limit_size 2M
    queue_limit_length 8
    overflow_action block
  </buffer>

  # Security
  <security>
    self_hostname "#{Socket.gethostname}"
    shared_key fluentd_shared_key
  </security>

  # Health checking
  heartbeat_type tcp
</match>

Aggregator Node Configuration

Fluentd aggregator configuration

# /etc/td-agent/td-agent.conf - aggregator node

# Global settings
<system>
  log_level info
  workers 8
  root_dir /var/log/td-agent
</system>

# Input - receive data from the agents
<source>
  @type forward
  @id forward_input
  port 24224
  bind 0.0.0.0
  <security>
    self_hostname "#{Socket.gethostname}"
    shared_key fluentd_shared_key
  </security>
</source>

# Input - HTTP endpoint
<source>
  @type http
  @id http_input
  port 8888
  bind 0.0.0.0
  cors_allow_origins ["*"]
  <parse>
    @type json
  </parse>
</source>

# Routing - systemd logs to Elasticsearch, plus a local file backup
<match systemd>
  @type copy

  <store>
    @type elasticsearch
    @id elasticsearch_systemd
    # fluent-plugin-elasticsearch takes multiple endpoints via "hosts"
    hosts 10.1.0.20:9200,10.1.0.21:9200,10.1.0.22:9200,10.1.0.23:9200
    scheme https
    ssl_verify false
    user elastic
    password changeme

    # Index settings
    index_name systemd-logs-%Y.%m.%d
    type_name _doc

    # Index template
    template_name systemd_template
    template_file /etc/td-agent/templates/systemd_template.json

    # Buffering (chunked by time so the %Y.%m.%d placeholders resolve)
    <buffer time>
      @type file
      path /var/log/td-agent/buffer/systemd
      timekey 1h
      timekey_wait 10m
      timekey_use_utc true
      flush_mode interval
      flush_interval 5s
      flush_thread_count 8
      retry_type exponential_backoff
      retry_forever
      retry_max_interval 30
      chunk_limit_size 10M
      queue_limit_length 32
      overflow_action block
    </buffer>
  </store>

  # Backup copy to local files
  <store>
    @type file
    @id file_backup_systemd
    path /var/log/td-agent/backup/systemd.%Y%m%d_%H
    compress gzip
    <buffer time>
      timekey 1h
      timekey_use_utc true
    </buffer>
  </store>
</match>

# Routing - Nginx access logs
<match nginx.access>
  @type elasticsearch
  @id elasticsearch_nginx
  hosts 10.1.0.20:9200,10.1.0.21:9200,10.1.0.22:9200,10.1.0.23:9200
  scheme https
  ssl_verify false
  user elastic
  password changeme

  # Index settings
  index_name nginx-access-%Y.%m.%d
  type_name _doc

  # Lifecycle policy
  ilm_policy_id nginx_access_policy

  # Buffering
  <buffer time>
    @type file
    path /var/log/td-agent/buffer/nginx
    timekey 1h
    timekey_wait 10m
    timekey_use_utc true
    flush_mode interval
    flush_interval 5s
    flush_thread_count 8
    retry_type exponential_backoff
    retry_forever
    retry_max_interval 30
    chunk_limit_size 10M
    queue_limit_length 32
    overflow_action block
  </buffer>
</match>

# Routing - application logs
<match app.logs>
  @type elasticsearch
  @id elasticsearch_app
  hosts 10.1.0.20:9200,10.1.0.21:9200,10.1.0.22:9200,10.1.0.23:9200
  scheme https
  ssl_verify false
  user elastic
  password changeme

  # Index settings
  index_name application-logs-%Y.%m.%d
  type_name _doc

  # Dynamic index name, when a record carries @target_index
  target_index_key @target_index

  <buffer time>
    @type file
    path /var/log/td-agent/buffer/app
    timekey 1h
    timekey_wait 10m
    timekey_use_utc true
    flush_mode interval
    flush_interval 5s
    flush_thread_count 8
    retry_type exponential_backoff
    retry_forever
    retry_max_interval 30
    chunk_limit_size 10M
    queue_limit_length 32
    overflow_action block
  </buffer>
</match>

# Catch-all - write anything unmatched to an error file
<match **>
  @type file
  @id error_file
  path /var/log/td-agent/error/error.log
  <buffer>
    flush_mode interval
    flush_interval 30s
    flush_thread_count 2
    retry_type exponential_backoff
    retry_forever
    retry_max_interval 30
    chunk_limit_size 2M
    queue_limit_length 8
    overflow_action block
  </buffer>
</match>

Kibana Deployment

Kibana Installation

Package installation

# Install Kibana (from the Elastic repository added earlier)
sudo apt install kibana

# Enable the service
sudo systemctl enable kibana

Base configuration

# /etc/kibana/kibana.yml
server.port: 5601
server.host: "0.0.0.0"
server.name: "kibana-prod-01"

# Elasticsearch connection
elasticsearch.hosts: ["https://10.1.0.20:9200", "https://10.1.0.21:9200"]
elasticsearch.username: "kibana_system"
elasticsearch.password: "changeme"

# SSL
elasticsearch.ssl.certificateAuthorities: ["/etc/kibana/certs/ca.crt"]
elasticsearch.ssl.verificationMode: "certificate"

# Security
xpack.security.enabled: true
xpack.security.encryptionKey: "something_at_least_32_characters"
xpack.security.session.idleTimeout: "1h"
xpack.security.session.lifespan: "30d"

# Monitoring (xpack.monitoring.enabled was removed in later 7.x releases;
# monitoring.ui.enabled is the current setting)
monitoring.ui.enabled: true

# Logging
logging.appenders:
  file:
    type: file
    fileName: /var/log/kibana/kibana.log
    layout:
      type: json
logging.root:
  appenders:
    - default
    - file
  level: warn

# Performance
elasticsearch.requestTimeout: 30000
elasticsearch.shardTimeout: 30000
server.maxPayload: 1048576

# Maps
map.includeElasticMapsService: false
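Once Kibana is up, its status API gives a quick health readout; the jq path below matches most 7.x releases but treat it as an assumption, and add credentials if anonymous access is disabled:

curl -s "http://localhost:5601/api/status" | jq '.status.overall'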

Load Balancer Configuration

Nginx load balancing

# /etc/nginx/sites-available/kibana
upstream kibana_backend {
    least_conn;
    server 10.1.0.40:5601 max_fails=3 fail_timeout=30s;
    server 10.1.0.41:5601 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 80;
    listen 443 ssl http2;
    server_name kibana.company.com;

    # SSL
    ssl_certificate /etc/nginx/ssl/kibana.crt;
    ssl_certificate_key /etc/nginx/ssl/kibana.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;

    # Security headers
    add_header X-Frame-Options DENY;
    add_header X-Content-Type-Options nosniff;
    add_header X-XSS-Protection "1; mode=block";
    add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload";

    # Logging
    access_log /var/log/nginx/kibana_access.log;
    error_log /var/log/nginx/kibana_error.log;

    # Proxying
    location / {
        proxy_pass http://kibana_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;

        # Timeouts
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Buffering
        proxy_buffering on;
        proxy_buffer_size 128k;
        proxy_buffers 4 256k;
        proxy_busy_buffers_size 256k;
    }

    # Health check
    location /status {
        access_log off;
        return 200 "healthy\n";
        add_header Content-Type text/plain;
    }
}
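Before putting the load balancer into service, validate the configuration and exercise the health endpoint:

# Syntax check, then reload without dropping connections
sudo nginx -t && sudo systemctl reload nginx

# Exercise the health-check location through the proxy
curl -sk https://kibana.company.com/status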

Index Templates and Lifecycle Policies

Index lifecycle policy

PUT _ilm/policy/efk_logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "10GB",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0,
            "include": {
              "box_type": "warm"
            }
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0,
            "include": {
              "box_type": "cold"
            }
          },
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
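After registering the policy, confirm it and watch how ILM treats matching indices; a curl sketch (credentials and -k for the self-signed CA follow the earlier sections):

# Confirm the policy is registered
curl -sk -u elastic:changeme "https://10.1.0.20:9200/_ilm/policy/efk_logs_policy?pretty"

# Show the current ILM phase of matching indices
curl -sk -u elastic:changeme "https://10.1.0.20:9200/application-logs-*/_ilm/explain?pretty"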

Index template

PUT _index_template/efk_logs_template
{
  "index_patterns": ["*-logs-*"],
  "priority": 200,
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "index.lifecycle.name": "efk_logs_policy",
      "index.lifecycle.rollover_alias": "logs",
      "index.mapping.total_fields.limit": 2000,
      "index.refresh_interval": "5s",
      "index.max_result_window": 10000
    },
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "hostname": {
          "type": "keyword"
        },
        "level": {
          "type": "keyword"
        },
        "message": {
          "type": "text",
          "analyzer": "standard"
        },
        "tags": {
          "type": "keyword"
        },
        "source": {
          "type": "keyword"
        },
        "fields": {
          "type": "object",
          "dynamic": true
        }
      }
    }
  }
}
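On ES 7.9+ the simulate API previews exactly which settings and mappings a would-be index receives from this template; a quick check (the index name is illustrative):

curl -sk -u elastic:changeme -X POST \
  "https://10.1.0.20:9200/_index_template/_simulate_index/application-logs-2024.01.01?pretty"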

Monitoring and Alerting

ElastAlert Configuration

ElastAlert installation

# Install ElastAlert
pip install elastalert

# Create the configuration directories
sudo mkdir -p /etc/elastalert
sudo mkdir -p /var/log/elastalert

# Initialize the ElastAlert writeback index
elastalert-create-index

Alert rule configuration

# /etc/elastalert/rules/error_logs_alert.yaml
name: Application Error Logs Alert
type: frequency
index: application-logs-*
num_events: 10
timeframe:
  minutes: 5

filter:
- term:
    level: "ERROR"

alert:
- "email"
- "slack"

email:
- "ops-team@company.com"
- "dev-team@company.com"
smtp_host: "smtp.company.com"
smtp_port: 587
smtp_auth_file: "/etc/elastalert/smtp_auth.yaml"
from_addr: "alerts@company.com"

# Slack options are flat keys, not a nested mapping
slack_webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
slack_channel_override: "#alerts"
slack_username_override: "ElastAlert"

alert_text: |
  Application error log alert!
  {0} error log entries were detected in the last 5 minutes
  Time range: {1} - {2}
  Index: {3}
  Please check the system status promptly!
alert_text_type: alert_text_only

include:
- "@timestamp"
- "hostname"
- "message"
- "level"
- "source"

Elasticsearch Cluster Monitoring

Cluster health check script

#!/bin/bash
# /usr/local/bin/es_cluster_check.sh

ES_HOST="https://10.1.0.20:9200"
ES_USER="elastic"
ES_PASS="changeme"

# -k skips certificate verification against the self-signed cluster CA

# Cluster health
CLUSTER_HEALTH=$(curl -sk -u "$ES_USER:$ES_PASS" "$ES_HOST/_cluster/health")
CLUSTER_STATUS=$(echo "$CLUSTER_HEALTH" | jq -r '.status')

# Node statistics
NODES_INFO=$(curl -sk -u "$ES_USER:$ES_PASS" "$ES_HOST/_nodes/stats")
TOTAL_NODES=$(echo "$NODES_INFO" | jq '.nodes | length')

# Index statistics
INDICES_STATS=$(curl -sk -u "$ES_USER:$ES_PASS" "$ES_HOST/_cat/indices?v&h=index,health,status,pri,rep,docs.count,store.size&format=json")

# Push the metrics to the monitoring system
cat << EOF | curl -X POST "http://monitoring.company.com/api/metrics" \
  -H "Content-Type: application/json" \
  -d @-
{
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "source": "elasticsearch",
  "cluster": {
    "status": "$CLUSTER_STATUS",
    "nodes": $TOTAL_NODES,
    "indices": $(echo "$INDICES_STATS" | jq length)
  },
  "health": $CLUSTER_HEALTH,
  "nodes": $NODES_INFO,
  "indices": $INDICES_STATS
}
EOF

# Alert when the cluster is not green
if [ "$CLUSTER_STATUS" != "green" ]; then
  echo "WARNING: Elasticsearch cluster status is $CLUSTER_STATUS" | \
    mail -s "Elasticsearch Cluster Alert" ops-team@company.com
fi
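The check is designed to run unattended; a five-minute cron cadence is a reasonable default:

# /etc/cron.d/es-cluster-check (sketch)
*/5 * * * * root /usr/local/bin/es_cluster_check.sh >> /var/log/es_cluster_check.log 2>&1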

Troubleshooting Guide

Common Diagnostics

Elasticsearch troubleshooting

# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Check node statistics
curl -X GET "localhost:9200/_nodes/stats?pretty"

# List red indices
curl -X GET "localhost:9200/_cat/indices?v&health=red"

# Explain unassigned shards
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

# Recover a red index by closing and reopening it
curl -X POST "localhost:9200/red-index/_close"
curl -X POST "localhost:9200/red-index/_open"

# Manually allocate a primary shard (accepts data loss - last resort)
curl -X POST "localhost:9200/_cluster/reroute" \
  -H "Content-Type: application/json" \
  -d '{
  "commands": [
    {
      "allocate_primary": {
        "index": "my-index",
        "shard": 0,
        "node": "node-1",
        "accept_data_loss": true
      }
    }
  ]
}'

Fluentd troubleshooting

# Check the Fluentd service
sudo systemctl status td-agent

# Tail the service log
sudo tail -f /var/log/td-agent/td-agent.log

# Validate the configuration syntax
sudo td-agent --dry-run -c /etc/td-agent/td-agent.conf

# Inspect the buffer directories
sudo ls -la /var/log/td-agent/buffer/

# Restart the service
sudo systemctl restart td-agent

# Send a test log through the HTTP input
echo '{"message":"test log","level":"info"}' | \
  curl -X POST -d @- http://localhost:8888/test.log

Kibana troubleshooting

# Check the Kibana service
sudo systemctl status kibana

# Tail the Kibana log
sudo tail -f /var/log/kibana/kibana.log

# Check Kibana's status API (it also reports the Elasticsearch connection)
curl -X GET "http://localhost:5601/api/status"

# Delete a broken index pattern (Kibana's saved-objects API requires
# the kbn-xsrf header)
curl -X DELETE -H "kbn-xsrf: true" "localhost:5601/api/saved_objects/index-pattern/logs-*"

# Clear the optimize cache (older Kibana versions)
sudo rm -rf /var/lib/kibana/optimize/

Capacity Planning Guide

Storage capacity estimation

# Estimation formula
Daily stored volume = lines per day x average line size / compression ratio

# Worked example
Lines per day: 100 million
Average line size: 200 bytes
Compression ratio: 3:1

Daily raw volume: 100,000,000 x 200 bytes = 20 GB
Daily stored volume: 20 GB / 3 = 6.7 GB

# Storage requirement
Retention: 90 days
Replicas: 1
Total storage: 6.7 GB x 90 days x 2 (primary + replica) = 1.2 TB
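The same arithmetic as a small script, so the estimate can be re-run whenever the inputs change (the defaults mirror the example above; decimal GB, as in the example):

#!/bin/bash
# Usage: estimate.sh [lines_per_day] [avg_line_bytes] [compression] [retention_days] [copies]
LINES_PER_DAY=${1:-100000000}
AVG_LINE_BYTES=${2:-200}
COMPRESSION=${3:-3}
RETENTION_DAYS=${4:-90}
COPIES=${5:-2}   # primary + replica

awk -v l="$LINES_PER_DAY" -v b="$AVG_LINE_BYTES" -v c="$COMPRESSION" \
    -v d="$RETENTION_DAYS" -v n="$COPIES" 'BEGIN {
  raw = l * b / 1e9
  stored = raw / c
  total = stored * d * n
  printf "Raw per day:    %.1f GB\n", raw
  printf "Stored per day: %.1f GB\n", stored
  printf "Total:          %.1f GB (%.2f TB)\n", total, total / 1000
}'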

Cluster sizing

# Sizing by environment
Small (< 1 TB/month):
  Master nodes: 3 x 8 GB RAM
  Data nodes: 3 x 32 GB RAM x 1 TB storage
  Coordinating nodes: 2 x 16 GB RAM

Medium (1-10 TB/month):
  Master nodes: 3 x 16 GB RAM
  Data nodes: 6 x 64 GB RAM x 2 TB storage
  Coordinating nodes: 3 x 32 GB RAM

Large (10+ TB/month):
  Master nodes: 3 x 32 GB RAM
  Data nodes: 12+ x 128 GB RAM x 4 TB storage
  Coordinating nodes: 6 x 64 GB RAM

Security Management

X-Pack Security Configuration

User and role management

# Generate passwords for the built-in users
sudo /usr/share/elasticsearch/bin/elasticsearch-setup-passwords auto

# Create a custom role
curl -X POST "https://localhost:9200/_security/role/log_reader" \
  -H "Content-Type: application/json" \
  -u elastic:password \
  -d '{
  "cluster": ["monitor"],
  "indices": [
    {
      "names": ["*-logs-*"],
      "privileges": ["read", "view_index_metadata"]
    }
  ]
}'

# Create a user
curl -X POST "https://localhost:9200/_security/user/log_analyst" \
  -H "Content-Type: application/json" \
  -u elastic:password \
  -d '{
  "password": "secure_password",
  "roles": ["log_reader"],
  "full_name": "Log Analyst",
  "email": "analyst@company.com"
}'
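A quick way to confirm the new account works and carries the intended role:

# Authenticate as the new user and list the roles attached to the session
curl -sk -u log_analyst:secure_password \
  "https://localhost:9200/_security/_authenticate?pretty"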

SSL/TLS certificate setup

# Generate the CA certificate
sudo /usr/share/elasticsearch/bin/elasticsearch-certutil ca

# Generate node certificates
sudo /usr/share/elasticsearch/bin/elasticsearch-certutil cert \
  --ca elastic-stack-ca.p12 \
  --dns elasticsearch-01,elasticsearch-02,elasticsearch-03 \
  --ip 10.1.0.20,10.1.0.21,10.1.0.22,10.1.0.23 \
  --out elastic-certificates.p12

# Copy the certificate to each node
sudo cp elastic-certificates.p12 /etc/elasticsearch/
sudo chown elasticsearch:elasticsearch /etc/elasticsearch/elastic-certificates.p12
sudo chmod 660 /etc/elasticsearch/elastic-certificates.p12
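After distributing the certificates and restarting the nodes, TLS can be spot-checked from a shell; a sketch (it assumes HTTPS is also enabled on the REST port, as the rest of this guide does):

# Confirm the transport port (9300) presents a certificate
openssl s_client -connect 10.1.0.20:9300 -brief </dev/null

# Confirm HTTPS on the REST port (drop -k once clients trust the CA)
curl -sk -u elastic:changeme "https://10.1.0.20:9200/_cluster/health?pretty"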

Performance Tuning

Elasticsearch Tuning

JVM tuning parameters

# /etc/elasticsearch/jvm.options.d/performance.options
# GC tuning
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapRegionSize=32m
-XX:+UnlockExperimentalVMOptions
-XX:G1NewSizePercent=30
-XX:G1MaxNewSizePercent=40

# Memory
-XX:+AlwaysPreTouch
-XX:+UseLargePages
-XX:LargePageSizeInBytes=2m

# Diagnostics
-XX:+UnlockDiagnosticVMOptions
-XX:+LogVMOutput
-XX:LogFile=/var/log/elasticsearch/vm.log

# GC visibility: the JDK 8 -XX:+PrintGC* flags were removed in JDK 9+;
# GC detail, timestamps, and pause times come from the unified
# -Xlog:gc* option already configured in jvm.options

Fluentd Tuning

Buffer tuning

# High-throughput buffer settings (inside an output <match> block)
<buffer>
  @type file
  path /data/fluentd/buffer

  # Sizing
  chunk_limit_size 32MB
  total_limit_size 8GB
  queue_limit_length 1024

  # Flushing
  flush_mode interval
  flush_interval 5s
  flush_thread_count 16

  # Compression
  compress gzip

  # Retries
  retry_type exponential_backoff
  retry_wait 1s
  retry_max_interval 60s
  retry_forever true

  # Overflow handling
  overflow_action drop_oldest_chunk
</buffer>
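Buffer backlog is the first thing to watch when tuning; Fluentd's monitor_agent input exposes per-plugin buffer metrics over HTTP. A sketch, assuming monitor_agent is enabled on its default port 24220:

# Enable in td-agent.conf:
#   <source>
#     @type monitor_agent
#     bind 0.0.0.0
#     port 24220
#   </source>

# Then inspect output-plugin buffer depth and retry counts
curl -s http://localhost:24220/api/plugins.json | \
  jq '.plugins[] | select(.output_plugin == true) | {plugin_id, buffer_queue_length, buffer_total_queued_size, retry_count}'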

Production Best Practices

Deployment best practices

Cluster architecture:
- Separate the master, data, and coordinating node roles
- Use an odd number of master nodes to avoid split-brain
- Plan the network topology and storage architecture deliberately

Security:
- Enable X-Pack security
- Encrypt transport with SSL/TLS
- Enforce fine-grained access control
- Rotate keys and certificates regularly

Monitoring and alerting:
- Deploy comprehensive monitoring coverage
- Configure real-time alerting rules
- Establish an operational response process
- Review system health regularly

Backup and recovery:
- Configure automatic snapshot backups
- Test the restore procedure regularly
- Maintain a disaster recovery plan
- Safeguard data integrity

Performance:
- Tune the configuration to the workload
- Track system performance metrics
- Revisit the indexing strategy periodically
- Practice ongoing capacity planning

Operations Management

Index lifecycle housekeeping

#!/bin/bash
# Delete indices older than the retention window
RETENTION_DAYS=90

INDICES_TO_DELETE=$(curl -s "http://localhost:9200/_cat/indices" | \
  awk '{print $3}' | \
  grep -E '^.*-[0-9]{4}\.[0-9]{2}\.[0-9]{2}$' | \
  while read index; do
    # Parse the date suffix (e.g. 2024.01.15) and compare against today
    index_date=$(echo $index | grep -oE '[0-9]{4}\.[0-9]{2}\.[0-9]{2}$')
    index_timestamp=$(date -d "${index_date//./-}" +%s)
    current_timestamp=$(date +%s)
    days_diff=$(( (current_timestamp - index_timestamp) / 86400 ))
    if [ $days_diff -gt $RETENTION_DAYS ]; then
      echo $index
    fi
  done)

for index in $INDICES_TO_DELETE; do
  echo "Deleting index: $index"
  curl -X DELETE "http://localhost:9200/$index"
done

Cluster maintenance schedule

# Maintenance checklist
Daily:
- Check cluster health
- Monitor disk usage
- Review error logs and alerts
- Verify backup integrity

Weekly:
- Clean up expired indices
- Tune index settings
- Review node performance metrics
- Apply security patches

Monthly:
- Reassess capacity plans
- Analyze performance tuning results
- Run a disaster recovery drill
- Plan system upgrades

Quarterly:
- Review the architecture for optimization opportunities
- Run a security audit
- Analyze cost effectiveness
- Plan technology stack upgrades

With the deployment and management practices above, you can build a highly available, high-performance, and secure enterprise log platform that gives the business solid data support and real-time visibility. In a real production environment, adjust the configuration parameters and architecture to your specific workload and technology stack to keep the system stable and scalable.
