Zabbix Optimization

When Zabbix monitors only a small number of hosts and items, there is no need to over-optimize.

Optimization methods

Official documentation:

https://www.zabbix.com/documentation/5.0/zh/manual/appendix/performance_tuning
https://blog.zabbix.com/monitoring-how-busy-zabbix-processes-are/457/

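With many hosts and items, Zabbix Server may run into performance problems. Typical symptoms:

The web interface is sluggish and often returns 502 errors
Graphs show gaps
Alerts are delayed

The queue shows the current state of Zabbix performance

Administration -- Queue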

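Optimization methods

Database: the workload is write-heavy and read-light, with frequent data collection, so consider PostgreSQL
Use active mode to reduce the load on Zabbix Server
Use Zabbix Proxy to monitor remote hosts
Delete unused items; prefer custom templates and items
Increase item update intervals where appropriate and shorten the history storage period; expired data is removed periodically by the housekeeper process
Partition the Zabbix history and trend tables periodically
Tune Zabbix Server processes: find the process type that is the bottleneck and increase its process count
Tune Zabbix Server caches: find the cache that is low on free memory and increase its size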
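
The last two items amount to a simple rule: find what is saturated, then raise the matching parameter. The following is a minimal Python sketch of that rule, not an official tool. The function name suggest_tuning, the input dictionaries, and the 75% busy / 25% free thresholds are assumptions chosen for illustration; the parameter names are the ones listed in the configuration file reference in this document.

```python
# Map Zabbix Server process types and caches to the zabbix_server.conf
# parameters that control them.
PROCESS_PARAM = {
    "poller": "StartPollers",
    "unreachable poller": "StartPollersUnreachable",
    "http poller": "StartHTTPPollers",
    "ipmi poller": "StartIPMIPollers",
    "discoverer": "StartDiscoverers",
    "history syncer": "StartDBSyncers",
}
CACHE_PARAM = {
    "configuration": "CacheSize",
    "history": "HistoryCacheSize",
    "trend": "TrendCacheSize",
    "value": "ValueCacheSize",
}


def suggest_tuning(busy_pct, cache_free_pct, busy_limit=75.0, free_limit=25.0):
    """Return the parameters worth increasing.

    busy_pct:       {process type: average busy percentage}
    cache_free_pct: {cache name: free percentage}
    The two limits are illustrative assumptions, not official values.
    """
    suggestions = []
    for proc, pct in busy_pct.items():
        if pct > busy_limit and proc in PROCESS_PARAM:
            suggestions.append(PROCESS_PARAM[proc])
    for cache, pct in cache_free_pct.items():
        if pct < free_limit and cache in CACHE_PARAM:
            suggestions.append(CACHE_PARAM[cache])
    return suggestions


if __name__ == "__main__":
    # Hypothetical readings: one saturated process type, one nearly full cache.
    print(suggest_tuning(
        {"poller": 20.0, "unreachable poller": 98.0},
        {"configuration": 3.0, "history": 90.0},
    ))
```

The readings would normally come from the busy and cache graphs of the Zabbix Server host; here they are typed in by hand.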

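Database space estimation

Retention period for item data: by default, history is kept for 90 days and trends for 1 year

The database can take up a lot of disk space. How can the usage be estimated? The official documentation provides formulas for this

Official formulas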

https://www.zabbix.com/documentation/6.0/zh/manual/installation/requirements
https://www.zabbix.com/documentation/5.0/zh/manual/installation/requirements

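The following table contains formulas that can be used to calculate the disk space required by a Zabbix system:

Parameter	Formula for required disk space and notes
Zabbix configuration	Usually 10 MB or less
History	Formula: `days * (items / refresh rate) * 24 * 3600 * bytes` ● `items`: number of items. ● `days`: number of days to keep history. ● `refresh rate`: average update interval of the items, in seconds. ● `bytes`: number of bytes needed to store a single value; depends on the database engine, usually about 90 bytes.
Trends	Formula: `days * (items / 3600) * 24 * 3600 * bytes` ● `items`: number of items. ● `days`: number of days to keep trends. ● `bytes`: number of bytes needed to store a single trend record; depends on the database engine, usually about 90 bytes.
Events	Formula: `days * events * 24 * 3600 * bytes` ● `events`: number of events generated per second; assume 1 event per second in the worst case. ● `days`: number of days to keep events. ● `bytes`: number of bytes needed to store a single event; depends on the database engine, usually about 170 bytes.

The byte counts are averages collected from real-world statistics with a MySQL backend: about 90 bytes for a numeric item value and about 170 bytes for an event.

The total required disk space is therefore calculated as:

Configuration + History + Trends + Events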

Example: assume 100 hosts with about 100 items each (10,000 items in total), and every item is collected once every 60 seconds

# History:
days*(items/refresh rate)*24*3600*bytes
90*(10000/60)*24*3600*90=116640000000 bytes, about 116.6 GB

# Trends:
days*(items/3600)*24*3600*bytes
365*(10000/3600)*24*3600*90=7884000000 bytes, about 7.9 GB

# Events:
days*events*24*3600*bytes
365*1*24*3600*170=5361120000 bytes, about 5.4 GB

# Total: about 130 GB
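
The same arithmetic can be scripted so that it is easy to rerun with other host counts or retention periods. This is a small Python sketch of the official formulas; the function names are illustrative, the defaults (90 and 170 bytes, 90 days of history, 365 days of trends and events, a 10 MB configuration allowance) are taken from the text above, and the division is done last so that the fractional part of items / refresh rate is not lost.

```python
SECONDS_PER_DAY = 24 * 3600


def history_bytes(days, items, refresh_rate, bytes_per_value=90):
    # days * (items / refresh rate) * 24 * 3600 * bytes
    return days * items * SECONDS_PER_DAY * bytes_per_value // refresh_rate


def trends_bytes(days, items, bytes_per_trend=90):
    # days * (items / 3600) * 24 * 3600 * bytes
    return days * items * SECONDS_PER_DAY * bytes_per_trend // 3600


def events_bytes(days, events_per_second=1, bytes_per_event=170):
    # days * events * 24 * 3600 * bytes
    return days * events_per_second * SECONDS_PER_DAY * bytes_per_event


def total_bytes(items, refresh_rate, history_days=90, trend_days=365,
                event_days=365, config_bytes=10 * 1000**2):
    # Configuration + History + Trends + Events
    return (config_bytes
            + history_bytes(history_days, items, refresh_rate)
            + trends_bytes(trend_days, items)
            + events_bytes(event_days))


if __name__ == "__main__":
    # 100 hosts x 100 items, collected every 60 seconds.
    items, interval = 100 * 100, 60
    for label, size in [
        ("history", history_bytes(90, items, interval)),
        ("trends", trends_bytes(365, items)),
        ("events", events_bytes(365)),
        ("total", total_bytes(items, interval)),
    ]:
        print(f"{label}: {size} bytes ({size / 1e9:.1f} GB)")
```

The result is only a rough planning figure, since the real size per value depends on the database engine and the item value types.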

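Zabbix Server configuration file reference

AlertScriptsPath
Default: /usr/local/share/zabbix/alertscripts
Description: directory containing alert scripts

AllowRoot
Default: 0
Description: whether the server may run as root (0: not allowed, 1: allowed). By default the Zabbix processes are started as the zabbix user; running as root is not recommended

CacheSize
Range: 128K-8G (128K-64G as of Zabbix 5.0)
Default: 8M (32M in Zabbix 6.0)
Description: configuration cache, which stores host, item, and trigger data. The maximum was 2G before version 2.2.3, later 8G, and is 64G as of Zabbix 5.0

CacheUpdateFrequency
Range: 1-3600
Default: 60
Description: how often the configuration cache is updated, in seconds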

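DBHost
Default: localhost
Description: database host address

DBName
Default: none
Mandatory: yes

DBPassword
Default: none
Description: database password

DBPort
Range: 1024-65535
Default: 3306
Description: database port. Ignored when SQLite is the database or when the connection uses a socket

DBSchema
Description: schema name. Used for IBM DB2 and PostgreSQL

DBSocket
Default: /tmp/mysql.sock
Description: path to the MySQL socket file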

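DebugLevel
Range: 0-5
Default: 3
Description: debug level
0 - basic information
1 - critical information
2 - error information
3 - warnings
4 - debug logging; very verbose, use with care
5 - extended debugging, used for web and VMware monitoring

ExternalScripts
Default: /usr/local/share/zabbix/externalscripts
Description: directory containing external scripts

Fping6Location
Default: /usr/sbin/fping6
Description: path to fping6. If Zabbix does not run as root, set the SUID bit on fping6

FpingLocation
Default: /usr/sbin/fping
Description: path to fping; the same SUID note applies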

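HistoryCacheSize
Range: 128K-2G
Default: 8M (16M in Zabbix 5.0)
Description: size of the history cache, which stores history data

HistoryTextCacheSize
Range: 128K-2G
Default: 16M
Description: size of the cache for text-type history; stores character, text, and log history data

HousekeepingFrequency
Range: 0-24
Default: 1
Description: how often housekeeping runs, in hours. By default expired data is deleted every hour. After a server restart, housekeeping first runs 30 minutes later and then once every hour

Include
Description: includes additional configuration files. Wildcard patterns are allowed, for example: /usr/local/zabbix-2.4.4/conf/ttlsa.com/*.conf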

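JavaGateway
Description: host name of the Zabbix Java gateway. Java pollers must be started to use it

JavaGatewayPort
Range: 1024-32767
Default: 10052
Description: port the Zabbix Java gateway listens on

ListenIP
Default: 0.0.0.0
Description: listen address. If left empty, the server listens on all addresses. Multiple IP addresses can be given, separated by commas, for example: 127.0.0.1,10.10.0.2

ListenPort
Range: 1024-32767
Default: 10051
Description: listen port

LoadModule
Description: module to load, in the form LoadModule= followed by the module file name. The file must be in the directory set by LoadModulePath. To load several modules, specify the parameter several times

LoadModulePath
Description: module directory; see LoadModule above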

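LogFile
Description: log file, for example: /data/logs/zabbix/zabbix-server.log

LogFileSize
Range: 0-1024
Default: 1
Description: maximum log file size, in MB. 0 disables automatic log rotation. If the log reaches the limit and rotation fails, the old log file is truncated and a new log is started

LogSlowQueries
Range: 0-3600000
Default: 0
Description: how long a database query may take before it is logged, in milliseconds. 0 disables slow query logging. Only takes effect when DebugLevel is 3 or higher

MaxHousekeeperDelete
Range: 0-1000000
Default: 5000
Description: housekeeping deletes no more than MaxHousekeeperDelete rows in one run

PidFile
Default: /tmp/zabbix_server.pid
Description: PID file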

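ProxyConfigFrequency
Range: 1-604800
Default: 3600
Description: for proxies in passive mode, how often the server sends configuration data to the proxy, in seconds

ProxyDataFrequency
Range: 1-3600
Default: 1
Description: for proxies in passive mode, how often the server requests history data from the proxy, in seconds

SenderFrequency
Range: 5-3600
Default: 30
Description: how often the server retries sending unsent alerts, in seconds

SNMPTrapperFile
Default: /tmp/zabbix_traps.tmp
Description: temporary file used to pass SNMP trap data to the server

SourceIP
Description: source IP address for outgoing connections

SSHKeyLocation
Description: location of the SSH public and private keys

SSLCertLocation
Description: directory containing SSL certificates, used for web monitoring

SSLKeyLocation
Description: location of the SSL private keys for authentication, used for web monitoring

SSLCALocation
Description: location of the CA files for SSL verification. If empty, the system default CA location is used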

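StartDBSyncers
Range: 1-100
Default: 4
Description: number of pre-forked DB syncer instances. The maximum was 64 before version 1.8.5

StartDiscoverers
Range: 0-250
Default: 1
Description: number of pre-forked discoverer instances. The maximum was 255 before version 1.8.5

StartPollers
Range: 0-1000
Default: 5
Description: number of pre-forked poller instances

StartHTTPPollers
Range: 0-1000
Default: 1
Description: number of pre-forked HTTP poller instances. The maximum was 255 before version 1.8.5

StartIPMIPollers
Range: 0-1000
Default: 0
Description: number of pre-forked IPMI poller instances. The maximum was 255 before version 1.8.5

StartPollersUnreachable
Range: 0-1000
Default: 1
Description: number of pre-forked pollers for unreachable hosts (including IPMI and Java)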

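Timeout
Range: 1-30
Default: 3
Description: timeout for agent, SNMP, and external checks, in seconds

TmpDir
Default: /tmp
Description: temporary directory

TrapperTimeout
Range: 1-300
Default: 300
Description: timeout for processing trapper data, in seconds

TrendCacheSize
Range: 128K-2G
Default: 4M
Description: size of the trend cache, which stores trend data

UnavailableDelay
Range: 1-3600
Default: 60
Description: how often an unavailable host is checked for availability, in seconds

UnreachableDelay
Range: 1-3600
Default: 15
Description: how often an unreachable host is checked for reachability, in seconds

UnreachablePeriod
Range: 1-3600
Default: 45
Description: how long a host may stay unreachable before it is treated as unavailable, in seconds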
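
The three parameters above work together. As a simplified model (an assumption made for illustration, not the server's actual scheduling code): a host that stops responding is first rechecked every UnreachableDelay seconds, and once it has been unreachable for UnreachablePeriod seconds it is treated as unavailable and rechecked every UnavailableDelay seconds. The hypothetical helper below expresses that model with the default values.

```python
def next_check_delay(seconds_unreachable,
                     unreachable_delay=15,
                     unreachable_period=45,
                     unavailable_delay=60):
    """Seconds until the next check of a host that keeps failing.

    seconds_unreachable: how long the host has been failing so far.
    Defaults mirror UnreachableDelay, UnreachablePeriod and UnavailableDelay.
    """
    if seconds_unreachable < unreachable_period:
        # Still within the unreachable period: retry quickly.
        return unreachable_delay
    # Treated as unavailable: retry at the slower rate.
    return unavailable_delay


if __name__ == "__main__":
    for elapsed in (0, 15, 30, 45, 105):
        print(elapsed, next_check_delay(elapsed))
```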

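User
Default: zabbix
Description: user that Zabbix Server runs as. Only takes effect when running as root is disallowed and the server is started from a root shell. If it is started by another user, for example ttlsa, the server runs as ttlsa

ValueCacheSize
Range: 0,128K-64G
Default: 8M
Description: size of the history value cache; 0 disables it. When the cache runs out of space, a message is written to the server log every 5 minutes, so make a habit of checking the log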
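
To check a configuration file against the ranges listed above, it helps to read the active (uncommented) settings and convert size values such as 8M into bytes. The following Python sketch shows one way to do that. The function names are illustrative, and two assumptions are made: the K, M, and G suffixes are multiples of 1024, and only a # at the start of a line marks a comment, so anything after the = sign is kept as the value.

```python
import re

SIZE_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(value):
    """Convert a size such as '128K', '8M' or '64G' into bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.strip())
    if match is None:
        raise ValueError(f"not a size value: {value!r}")
    return int(match.group(1)) * SIZE_UNITS[match.group(2)]


def parse_conf(text):
    """Return {parameter: value} for every active line of the config text.

    Repeatable parameters (for example Include or LoadModule) keep only
    their last value in this simplified version.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def in_range(value, low, high):
    """True if the size value lies within [low, high], all given as size strings."""
    return parse_size(low) <= parse_size(value) <= parse_size(high)


if __name__ == "__main__":
    sample = "# CacheSize=8M\nCacheSize=256M\nStartPollers=5\n"
    conf = parse_conf(sample)
    print(conf)
    print(parse_size(conf["CacheSize"]), in_range(conf["CacheSize"], "128K", "64G"))
```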

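Hands-on optimization cases

Case

The graph below shows how busy the Zabbix data collector processes are

Monitoring --- Hosts --- Zabbix Server --- Graphs

Modify the discovery rules

Modify the Zabbix Server configuration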

[root@zabbix-server ~]#vim /etc/zabbix/zabbix_server.conf
# StartDiscoverers=1 is the default; increase it
StartDiscoverers=100

[root@zabbix-server ~]#systemctl restart zabbix-server.service

Zabbix Server has other related processes as well

Case

Cache usage is shown in the graph below

Cache-related options in the Zabbix Server configuration

[root@zabbix-server ~]#grep -i cache /etc/zabbix/zabbix_server.conf
### Option: VMwareCacheSize
# Size of VMware cache, in bytes.
# VMwareCacheSize=8M
### Option: CacheSize
# Size of configuration cache, in bytes.
# CacheSize=8M #Do not set this value too small, or Zabbix Server may fail to start
### Option: CacheUpdateFrequency
# How often Zabbix will perform update of configuration cache, in seconds.
# CacheUpdateFrequency=60
### Option: HistoryCacheSize
# Size of history cache, in bytes.
# HistoryCacheSize=16M
### Option: HistoryIndexCacheSize
# Size of history index cache, in bytes.
# Shared memory size for indexing history cache.
# HistoryIndexCacheSize=4M
### Option: TrendCacheSize
# Size of trend cache, in bytes.
# TrendCacheSize=4M
### Option: ValueCacheSize
# Size of history value cache, in bytes.
# Setting to 0 disables value cache.
# ValueCacheSize=8M

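Case

After several hundred monitored hosts were added through the API, the default CacheSize of 8M was too small and Zabbix Server kept restarting. The log shows the following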

[root@zabbix-server ~]#tail -f /var/log/zabbix/zabbix_server.log
......
8588:20220617:213218.764 syncing trend data done
8588:20220617:213218.764 Zabbix Server stopped. Zabbix 5.0.24 (revision 313ff6504e3).
8611:20220617:213228.800 Starting Zabbix Server. Zabbix 5.0.24 (revision 313ff6504e3).
8611:20220617:213228.800 ****** Enabled features ******
8611:20220617:213228.800 SNMP monitoring: YES
8611:20220617:213228.800 IPMI monitoring: YES
8611:20220617:213228.800 Web monitoring: YES
8611:20220617:213228.800 VMware monitoring: YES
8611:20220617:213228.800 SMTP authentication: YES
8611:20220617:213228.800 ODBC: YES
8611:20220617:213228.800 SSH support: YES
8611:20220617:213228.800 IPv6 support: YES
8611:20220617:213228.800 TLS support: YES
8636:20220617:213239.302 ******************************
8636:20220617:213239.302 using configuration file: /etc/zabbix/zabbix_server.conf
8636:20220617:213239.306 current database version (mandatory/optional): 05000000/05000005
8636:20220617:213239.306 required mandatory version: 05000000
8636:20220617:213239.315 server #0 started [main process]
8637:20220617:213239.316 server #1 started [configuration syncer #1]
8637:20220617:213239.535 __mem_malloc: skipped 0 asked 72 skip_min 18446744073709551615 skip_max 0
8637:20220617:213239.535 [file:dbconfig.c,line:96] __zbx_mem_malloc(): out of memory (requested 72 bytes)
8637:20220617:213239.535 [file:dbconfig.c,line:96] __zbx_mem_malloc(): please increase CacheSize configuration parameter
8637:20220617:213239.535 === memory statistics for configuration cache ===
8637:20220617:213239.535 free chunks of size 24 bytes: 55
8637:20220617:213239.535 free chunks of size 32 bytes: 11
8637:20220617:213239.535 free chunks of size 40 bytes: 6
8637:20220617:213239.535 free chunks of size 48 bytes: 5
8637:20220617:213239.535 free chunks of size 56 bytes: 3
8637:20220617:213239.535 min chunk size: 24 bytes
8637:20220617:213239.535 max chunk size: 56 bytes
8637:20220617:213239.535 memory of total size 7264040 bytes fragmented into 70263 chunks
8637:20220617:213239.535 of those, 2320 bytes are in 80 free chunks
8637:20220617:213239.535 of those, 7261720 bytes are in 70183 used chunks
8637:20220617:213239.535 of those, 1124192 bytes are used by allocation overhead
8637:20220617:213239.535 ================================

[root@zabbix-server ~]#cat /var/log/syslog
Jun 17 21:37:18 zabbix-server systemd[1]: zabbix-server.service: Scheduled restart job, restart counter is at 31.
Jun 17 21:37:18 zabbix-server systemd[1]: Stopped Zabbix Server.
Jun 17 21:37:18 zabbix-server systemd[1]: Starting Zabbix Server...
Jun 17 21:37:18 zabbix-server systemd[1]: Started Zabbix Server.
Jun 17 21:37:18 zabbix-server kill[9306]: Usage:
Jun 17 21:37:18 zabbix-server kill[9306]: kill [options] <pid> [...]
Jun 17 21:37:18 zabbix-server kill[9306]: Options:
Jun 17 21:37:18 zabbix-server kill[9306]: <pid> [...] send signal to every <pid> listed
Jun 17 21:37:18 zabbix-server kill[9306]: -<signal>, -s, --signal <signal>
Jun 17 21:37:18 zabbix-server kill[9306]: specify the <signal> to be sent
Jun 17 21:37:18 zabbix-server kill[9306]: -l, --list=[<signal>] list all signal names, or convert one to a name
Jun 17 21:37:18 zabbix-server kill[9306]: -L, --table list all signal names in a nice table
Jun 17 21:37:18 zabbix-server kill[9306]: -h, --help display this help and exit
Jun 17 21:37:18 zabbix-server kill[9306]: -V, --version output version information and exit
Jun 17 21:37:18 zabbix-server kill[9306]: For more details see kill(1).
Jun 17 21:37:18 zabbix-server systemd[1]: zabbix-server.service: Control process exited, code=exited, status=1/FAILURE
Jun 17 21:37:18 zabbix-server systemd[1]: zabbix-server.service: Failed with result 'exit-code'.

[root@zabbix-server ~]#systemctl status zabbix-server.service
● zabbix-server.service - Zabbix Server
Loaded: loaded (/lib/systemd/system/zabbix-server.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: exit-code) since Fri 2022-06-17 21:38:51 CST; 8s ago
Process: 9509 ExecStart=/usr/sbin/zabbix_server -c $CONFFILE (code=exited, status=0/SUCCESS)
Process: 9528 ExecStop=/bin/kill -SIGTERM $MAINPID (code=exited, status=1/FAILURE)
Main PID: 9526 (code=exited, status=0/SUCCESS)
Jun 17 21:38:51 zabbix-server.wang.org systemd[1]: zabbix-server.service: Control process exited, code=exited, status=1/FAILURE
Jun 17 21:38:51 zabbix-server.wang.org systemd[1]: zabbix-server.service: Failed with result 'exit-code'.

# Increase the cache
[root@zabbix-server ~]#vim /etc/zabbix/zabbix_server.conf
# CacheSize=8M
CacheSize=256M

# Check the result: Zabbix Server now starts successfully
[root@zabbix-server ~]#systemctl is-active zabbix-server.service
active
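
The two useful facts in that log are the name of the parameter to increase and how full the cache was. The following Python sketch pulls both out of a log excerpt. It relies only on the message formats visible in the log above; the function name diagnose and the SAMPLE constant are illustrative, and whitespace is collapsed first so that hard-wrapped lines still match.

```python
import re

INCREASE_RE = re.compile(r"please increase (\w+) configuration parameter")
TOTAL_RE = re.compile(r"memory of total size (\d+) bytes")
USED_RE = re.compile(r"of those, (\d+) bytes are in \d+ used chunks")

# Excerpt of the log above, including its hard line wraps.
SAMPLE = """\
8637:20220617:213239.535 [file:dbconfig.c,line:96] __zbx_mem_malloc(): please
increase CacheSize configuration parameter
8637:20220617:213239.535 memory of total size 7264040 bytes fragmented into
70263 chunks
8637:20220617:213239.535 of those, 7261720 bytes are in 70183 used
chunks
"""


def diagnose(log_text):
    """Return (parameter to increase, used percentage); either may be None."""
    # Collapse all whitespace so messages split across lines are rejoined.
    text = " ".join(log_text.split())
    param = INCREASE_RE.search(text)
    total = TOTAL_RE.search(text)
    used = USED_RE.search(text)
    name = param.group(1) if param else None
    used_pct = None
    if total and used:
        used_pct = 100.0 * int(used.group(1)) / int(total.group(1))
    return name, used_pct


if __name__ == "__main__":
    name, used_pct = diagnose(SAMPLE)
    print(f"increase {name}; cache was {used_pct:.2f}% used")
```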

Case

Because many unreachable hosts were added, the unreachable poller process becomes very busy. The cause is that only one such process is running

[root@zabbix-server ~]#ps aux|grep unreachable
zabbix 9820 0.0 0.6 365768 12308 ? S 21:40 0:00 /usr/sbin/zabbix_server: unreachable poller #1 [got 2 values in 8.005057 sec, getting values]
root 11642 0.0 0.0 9524 720 pts/3 S+ 22:06 0:00 grep --color=auto unreachable

Edit the configuration to increase the number of processes

[root@zabbix-server ~]#vi /etc/zabbix/zabbix_server.conf
# StartPollersUnreachable=1
StartPollersUnreachable=100
[root@zabbix-server ~]#systemctl restart zabbix-server.service
[root@zabbix-server ~]#ps aux|grep unreachable |wc -l
101
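
Note that the count of 101 includes the grep process itself, so it corresponds to 100 unreachable pollers. A way to count only real server processes, grouped by type, is to match the process title that zabbix_server sets. The Python sketch below does this on captured ps output; the function name and the sample text are illustrative, and the pattern assumes titles of the form "zabbix_server: <type> #<n>" as seen in the output above.

```python
import re
from collections import Counter

# Process titles look like: /usr/sbin/zabbix_server: unreachable poller #1 [...]
TITLE_RE = re.compile(r"zabbix_server: ([a-z ]+?) #\d+")

SAMPLE = """\
zabbix 9820 0.0 0.6 365768 12308 ? S 21:40 0:00 /usr/sbin/zabbix_server: unreachable poller #1 [got 2 values in 8.005057 sec, getting values]
zabbix 9821 0.0 0.6 365768 12308 ? S 21:40 0:00 /usr/sbin/zabbix_server: unreachable poller #2 [got 0 values in 0.000012 sec, idle 5 sec]
zabbix 9830 0.0 0.5 366096 10876 ? S 21:40 0:00 /usr/sbin/zabbix_server: discoverer #1 [processed 2 rules in 251.928169 sec, performing discovery]
root 11642 0.0 0.0 9524 720 pts/3 S+ 22:06 0:00 grep --color=auto unreachable
"""


def count_processes(ps_output):
    """Count zabbix_server processes per type; the grep line never matches."""
    # Collapse whitespace so a title wrapped over two lines still matches.
    text = " ".join(ps_output.split())
    return Counter(TITLE_RE.findall(text))


if __name__ == "__main__":
    for proc_type, count in count_processes(SAMPLE).items():
        print(proc_type, count)
```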

Check the data and graphs again: the busy rate has dropped sharply

Case

There is only one discovery process

# One process by default
[root@zabbix-server ~]#ps aux|grep discoverer
zabbix 12141 0.0 0.5 366096 10876 ? S 22:09 0:00 /usr/sbin/zabbix_server: discoverer #1 [processed 2 rules in 251.928169 sec, performing discovery]
root 13553 0.0 0.1 9524 2620 pts/3 S+ 22:25 0:00 grep --color=auto discoverer
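
The status text in brackets already tells how slow discovery is: the single discoverer handled 2 rules in about 252 seconds. A small Python sketch for reading that figure from a process title follows; the function name is illustrative and the pattern assumes the "processed N rules in S sec" wording shown above.

```python
import re

STATUS_RE = re.compile(r"processed (\d+) rules in ([\d.]+) sec")


def seconds_per_rule(process_title):
    """Average seconds spent per discovery rule, or None if not available."""
    # Collapse whitespace in case the title was wrapped across lines.
    match = STATUS_RE.search(" ".join(process_title.split()))
    if match is None or int(match.group(1)) == 0:
        return None
    return float(match.group(2)) / int(match.group(1))


if __name__ == "__main__":
    title = ("/usr/sbin/zabbix_server: discoverer #1 "
             "[processed 2 rules in 251.928169 sec, performing discovery]")
    print(seconds_per_rule(title))
```

With one process the rules are handled one after another, so this average multiplied by the number of rules gives a rough idea of how long a full discovery pass takes.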

Create several discovery rules to shorten the discovery time, configured as follows

Observe how busy the processes are