Redis 集群与高可用

Redis单机服务存在数据和服务的单点问题,而且单机性能也存在着上限,可以利用Redis的集群相关技术来解决这些问题

Redis 主从复制

Redis 主从复制架构

Redis和MySQL的主从模式类似，也支持主从模式（master/slave），可以实现Redis数据的跨主机的远程备份

常见客户端连接主从的架构:

程序APP先连接到高可用性 LB 集群提供的虚拟IP，再由LB调度将用户的请求至后端Redis 服务器来真正提供服务

主从复制特点

一个master可以有多个slave
一个slave只能有一个master
数据流向是从master到slave单向的
master 可读可写
slave 只读

主从复制实现

当master出现故障后,可以自动提升一个slave节点变成新的Mster,因此Redis Slave 需要设置和master相同的连接密码

此外当一个Slave提升为新的master时需要通过持久化实现数据的恢复

当配置Redis复制功能时，强烈建议打开主服务器的持久化功能。否则的话，由于延迟等问题，部署的主节点Redis服务应该要避免自动启动。

参考案例: 导致主从服务器数据全部丢失

1.假设节点A为主服务器，并且关闭了持久化。并且节点B和节点C从节点A复制数据
2.节点A崩溃，然后由自动拉起服务重启了节点A.由于节点A的持久化被关闭了，所以重启之后没有任何数据
3.节点B和节点C将从节点A复制数据，但是A的数据是空的，于是就把自身保存的数据副本删除。

在关闭主服务器上的持久化，并同时开启自动拉起进程的情况下，即便使用Sentinel来实现Redis的高可用性，也是非常危险的。因为主服务器可能拉起得非常快，以至于Sentinel在配置的心跳时间间隔内没有检测到主服务器已被重启，然后还是会发生上面描述的情况,导致数据丢失。

无论何时，数据安全都是极其重要的，所以应该禁止主服务器关闭持久化的同时自动启动。

主从命令配置

启用主从同步

Redis Server 默认为 master节点，如果要配置为从节点,需要指定master服务器的IP,端口及连接密码在从节点执行 REPLICAOF MASTER_IP PORT 指令可以启用主从同步复制功能,早期版本使用 SLAVEOF指令

127.0.0.1:6379> REPLICAOF MASTER_IP PORT #新版推荐使用
127.0.0.1:6379> SLAVEOF MasterIP Port #旧版使用，将被淘汰
127.0.0.1:6379> CONFIG SET masterauth <masterpass>

#在mater上设置key1
[root@centos8 ~]#redis-cli
127.0.0.1:6379> AUTH 123456
OK
127.0.0.1:6379> INFO replication
# Replication
role:master
connected_slaves:0
master_replid:a3504cab4d33e9723a7bc988ff8e022f6d9325bf
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
127.0.0.1:6379> SET key1 v1-master
OK
127.0.0.1:6379> KEYS *
1) "key1"
127.0.0.1:6379> GET key1
"v1-master"
127.0.0.1:6379>

#以下都在slave上执行，登录
[root@centos8 ~]#redis-cli
127.0.0.1:6379> info
NOAUTH Authentication required.
127.0.0.1:6379> AUTH 123456
OK
127.0.0.1:6379> INFO replication #查看当前角色默认为master
# Replication
role:master
connected_slaves:0
master_replid:a3504cab4d33e9723a7bc988ff8e022f6d9325bf
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
127.0.0.1:6379> SET key1 v1-slave-18
OK
127.0.0.1:6379> KEYS *
1) "key1"
127.0.0.1:6379> GET key1
"v1-slave-18"
127.0.0.1:6379>

#在第二个slave,也设置相同的key1,但值不同
127.0.0.1:6379> KEYS *
1) "key1"
127.0.0.1:6379> GET key1
"v1-slave-28"

127.0.0.1:6379> INFO replication
# Replication
role:master
connected_slaves:0
master_replid:a3504cab4d33e9723a7bc988ff8e022f6d9325bf
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
127.0.0.1:6379>

#在slave上设置master的IP和端口，4.0版之前的指令为slaveof
127.0.0.1:6379> REPLICAOF 10.0.0.8 6379 #仍可使用SLAVEOF MasterIP Port
OK
#在slave上设置master的密码，才可以同步
127.0.0.1:6379> CONFIG SET masterauth 123456
OK
127.0.0.1:6379> INFO replication
# Replication #角色变为slave
role:slave
master_host:10.0.0.8 #指向master
master_port:6379
master_link_status:up
master_last_io_seconds_ago:8
master_sync_in_progress:0
slave_repl_offset:42
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:b69908f23236fb20b810d198f7f4539f795e0ee5
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:42
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:42
#查看已经同步成功
127.0.0.1:6379> GET key1
"v1-master"
#在master上可以看到所有slave信息
127.0.0.1:6379> INFO replication
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.18,port=6379,state=online,offset=112,lag=1 #slave信息
slave1:ip=10.0.0.28,port=6379,state=online,offset=112,lag=1
master_replid:dc30f86c2d3c9029b6d07831ae3f27f8dbacac62
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:112
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:112
127.0.0.1:6379>

删除主从同步

在从节点执行 REPLICAOF NO ONE 或 SLAVEOF NO ONE 指令可以取消主从复制

取消复制会断开和master的连接而不再有主从复制关联, 但不会清除slave上已有的数据

#新版
127.0.0.1:6379> REPLICAOF NO ONE
#旧版
127.0.0.1:6379> SLAVEOF NO ONE

验证同步

在 master 上观察日志

[root@centos8 ~]#tail /var/log/redis/redis.log
24402:M 06 Oct 2020 09:09:16.448 * Replica 10.0.0.18:6379 asks for
synchronization
24402:M 06 Oct 2020 09:09:16.448 * Full resync requested by replica
10.0.0.18:6379
24402:M 06 Oct 2020 09:09:16.448 * Starting BGSAVE for SYNC with target: disk
24402:M 06 Oct 2020 09:09:16.453 * Background saving started by pid 24507
24507:C 06 Oct 2020 09:09:16.454 * DB saved on disk
24507:C 06 Oct 2020 09:09:16.455 * RDB: 2 MB of memory used by copy-on-write
24402:M 06 Oct 2020 09:09:16.489 * Background saving terminated with success
24402:M 06 Oct 2020 09:09:16.490 * Synchronization with replica 10.0.0.18:6379
succeeded

在 slave 节点观察日志

[root@centos8 ~]#tail -f /var/log/redis/redis.log
24395:S 06 Oct 2020 09:09:16.411 * Connecting to MASTER 10.0.0.8:6379
24395:S 06 Oct 2020 09:09:16.412 * MASTER <-> REPLICA sync started
24395:S 06 Oct 2020 09:09:16.412 * Non blocking connect for SYNC fired the
event.
24395:S 06 Oct 2020 09:09:16.412 * Master replied to PING, replication can
continue...
24395:S 06 Oct 2020 09:09:16.414 * Partial resynchronization not possible (no
cached master)
24395:S 06 Oct 2020 09:09:16.419 * Full resync from master:
20ec2450b850782b6eeaed4a29a61a25b9a7f4da:0
24395:S 06 Oct 2020 09:09:16.456 * MASTER <-> REPLICA sync: receiving 196 bytes
from master
24395:S 06 Oct 2020 09:09:16.456 * MASTER <-> REPLICA sync: Flushing old data
24395:S 06 Oct 2020 09:09:16.456 * MASTER <-> REPLICA sync: Loading DB in memory
24395:S 06 Oct 2020 09:09:16.457 * MASTER <-> REPLICA sync: Finished with
success

修改 Slave 节点配置文件

范例:

[root@centos8 ~]#vim /etc/redis.conf
.......

# replicaof <masterip> <masterport>
replicaof 10.0.0.8 6379 #指定master的IP和端口号
......

# masterauth <master-password>
masterauth 123456 #如果密码需要设置
requirepass 123456 #和masterauth保持一致，用于将来从节点提升主后使用
.......

[root@centos8 ~]#systemctl restart redis

Master 和 Slave查看状态

#在master上查看状态
127.0.0.1:6379> info replication
# Replication
role:master
connected_slaves:1
slave0:ip=10.0.0.18,port=6379,state=online,offset=1104403,lag=0
master_replid:b2517cd6cb3ad1508c516a38caed5b9d2d9a3e73
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1104403
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:55828
repl_backlog_histlen:1048576
127.0.0.1:6379>

#在slave上查看状态
127.0.0.1:6379> get key1 #同步成功后,slave原key信息丢失，获取master复制过来新的值
"v1-master"
127.0.0.1:6379> INFO replication
# Replication
role:slave
master_host:10.0.0.8
master_port:6379
master_link_status:up
master_last_io_seconds_ago:6 #如果主从复制通信正常，每10秒重新从0计数，此值无法修改，如果无法通信，当计数到60时，master_link_status显示为down
master_sync_in_progress:0 #0表示同步完成，1表示正在同步
slave_repl_offset:1104431
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:b2517cd6cb3ad1508c516a38caed5b9d2d9a3e73
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1104431
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:55856
repl_backlog_histlen:1048576
127.0.0.1:6379>

#停止master的redis服务：systemctl stop redis,在slave上可以观察到以下现象
127.0.0.1:6379> INFO replication
# Replication
role:slave
master_host:10.0.0.8
master_port:6379
master_link_status:down #显示down，表示无法连接master
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:1104529
master_link_down_since_seconds:4
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:b2517cd6cb3ad1508c516a38caed5b9d2d9a3e73
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1104529
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:55954
repl_backlog_histlen:1048576
127.0.0.1:6379>

Slave 日志

[root@centos8 ~]#tail -f /var/log/redis/redis.log
24592:S 20 Feb 2020 12:03:58.792 * Connecting to MASTER 10.0.0.8:6379
24592:S 20 Feb 2020 12:03:58.792 * MASTER <-> REPLICA sync started
24592:S 20 Feb 2020 12:03:58.797 * Non blocking connect for SYNC fired the
event.
24592:S 20 Feb 2020 12:03:58.797 * Master replied to PING, replication can
continue...
24592:S 20 Feb 2020 12:03:58.798 * Partial resynchronization not possible (no
cached master)
24592:S 20 Feb 2020 12:03:58.801 * Full resync from master:
b69908f23236fb20b810d198f7f4539f795e0ee5:2440
24592:S 20 Feb 2020 12:03:58.863 * MASTER <-> REPLICA sync: receiving 213 bytes
from master
24592:S 20 Feb 2020 12:03:58.863 * MASTER <-> REPLICA sync: Flushing old data
24592:S 20 Feb 2020 12:03:58.863 * MASTER <-> REPLICA sync: Loading DB in memory
24592:S 20 Feb 2020 12:03:58.863 * MASTER <-> REPLICA sync: Finished with
success

Master日志

[root@centos8 ~]#tail /var/log/redis/redis.log
11846:M 20 Feb 2020 12:11:35.171 * DB loaded from disk: 0.000 seconds
11846:M 20 Feb 2020 12:11:35.171 * Ready to accept connections
11846:M 20 Feb 2020 12:11:36.086 * Replica 10.0.0.18:6379 asks for
synchronization
11846:M 20 Feb 2020 12:11:36.086 * Partial resynchronization not accepted:
Replication ID mismatch (Replica asked for
'b69908f23236fb20b810d198f7f4539f795e0ee5', my replication IDs are
'4bff970970c073c1f3d8e8ad20b1c1f126a5f31c' and
'0000000000000000000000000000000000000000')
11846:M 20 Feb 2020 12:11:36.086 * Starting BGSAVE for SYNC with target: disk
11846:M 20 Feb 2020 12:11:36.095 * Background saving started by pid 11850
11850:C 20 Feb 2020 12:11:36.121 * DB saved on disk
11850:C 20 Feb 2020 12:11:36.121 * RDB: 4 MB of memory used by copy-on-write
11846:M 20 Feb 2020 12:11:36.180 * Background saving terminated with success
11846:M 20 Feb 2020 12:11:36.180 * Synchronization with replica 10.0.0.18:6379
succeeded

Slave 只读状态

验证Slave节点为只读状态, 不支持写入

127.0.0.1:6379> set key1 v1-slave
(error) READONLY You can't write against a read only replica.

主从复制故障恢复

主从复制故障恢复过程介绍

Slave 节点故障和恢复

当 slave 节点故障时，将Redis Client指向另一个 slave 节点即可,并及时修复故障从节点

Master 节点故障和恢复

当 master 节点故障时，需要提升slave为新的master

master故障后，当前还只能手动提升一个slave为新master，不能自动切换。

之后将其它的slave节点重新指定新的master为master节点

Master的切换会导致master_replid发生变化，slave之前的master_replid就和当前master不一致从而会引发所有 slave的全量同步。

主从复制故障恢复实现

假设当前主节点10.0.0.8故障,提升10.0.0.18为新的master

#查看当前10.0.0.18节点的状态为slave,master指向10.0.0.8
127.0.0.1:6379> INFO replication
# Replication
role:slave
master_host:10.0.0.8
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:3794
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:8e8279e461fdf0f1a3464ef768675149ad4b54a3
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:3794
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:3781
repl_backlog_histlen:14
127.0.0.1:6379>

停止slave同步并提升为新的master

#将当前 slave 节点提升为 master 角色
127.0.0.1:6379> REPLICAOF NO ONE #旧版使用SLAVEOF no one
OK
(5.04s)
127.0.0.1:6379> info replication
# Replication
role:master
connected_slaves:0
master_replid:94901d6b8ff812ec4a4b3ac6bb33faa11e55c274
master_replid2:0083e5a9c96aa4f2196934e10b910937d82b4e19
master_repl_offset:3514
second_repl_offset:3515
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:3431
repl_backlog_histlen:84
127.0.0.1:6379>

测试能否写入数据：

127.0.0.1:6379> set keytest1 vtest1
OK

修改所有slave 指向新的master节点

#修改10.0.0.28节点指向新的master节点10.0.0.18
127.0.0.1:6379> SLAVEOF 10.0.0.18 6379
OK
127.0.0.1:6379> set key100 v100
(error) READONLY You can't write against a read only replica.

#查看日志
[root@centos8 ~]#tail -f /var/log/redis/redis.log
1762:S 20 Feb 2020 13:28:21.943 # Connection with master lost.
1762:S 20 Feb 2020 13:28:21.943 * Caching the disconnected master state.
1762:S 20 Feb 2020 13:28:21.943 * REPLICAOF 10.0.0.18:6379 enabled (user request
from 'id=5 addr=127.0.0.1:59668 fd=9 name= age=149 idle=0 flags=N db=0 sub=0
psub=0 multi=-1 qbuf=41 qbuf-free=32727 obl=0 oll=0 omem=0 events=r
cmd=slaveof')
1762:S 20 Feb 2020 13:28:21.966 * Connecting to MASTER 10.0.0.18:6379
1762:S 20 Feb 2020 13:28:21.966 * MASTER <-> REPLICA sync started
1762:S 20 Feb 2020 13:28:21.967 * Non blocking connect for SYNC fired the event.
1762:S 20 Feb 2020 13:28:21.968 * Master replied to PING, replication can
continue...
1762:S 20 Feb 2020 13:28:21.968 * Trying a partial resynchronization (request
8e8279e461fdf0f1a3464ef768675149ad4b54a3:3991).
1762:S 20 Feb 2020 13:28:21.969 * Successful partial resynchronization with
master.
1762:S 20 Feb 2020 13:28:21.969 * MASTER <-> REPLICA sync: Master accepted a
Partial Resynchronization.

在新master可看到slave

#在新master节点10.0.0.18上查看状态
127.0.0.1:6379> INFO replication
# Replication
role:master
connected_slaves:1
slave0:ip=10.0.0.28,port=6379,state=online,offset=4606,lag=0
master_replid:8e8279e461fdf0f1a3464ef768675149ad4b54a3
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:4606
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:4606
127.0.0.1:6379>

实现 Redis 的级联复制

即实现基于Slave节点的Slave

master和slave1节点无需修改,只需要修改slave2及slave3指向slave1做为mater即可

#在slave2和slave3上执行下面指令
127.0.0.1:6379> REPLICAOF 10.0.0.18 6379
OK
127.0.0.1:6379> CONFIG SET masterauth 123456

在 master 设置key,观察是否同步

#在master新建key
127.0.0.1:6379> set key2 v2
OK
127.0.0.1:6379> get key2
"v2"

#在slave1和slave2验证key
127.0.0.1:6379> get key2
"v2

#在slave1和slave2都无法新建key
127.0.0.1:6379> set key3 v3
(error) READONLY You can't write against a read only replica

在中间那个slave1查看状态

127.0.0.1:6379> INFO replication
# Replication
role:slave
master_host:10.0.0.8
master_port:6379
master_link_status:up
master_last_io_seconds_ago:8 #最近一次与master通信已经过去多少秒。
master_sync_in_progress:0 #是否正在与master通信。
slave_repl_offset:4312 #当前同步的偏移量
slave_priority:100 #slave优先级，master故障后值越小越优先同步。
slave_read_only:1
connected_slaves:
slave0:ip=10.0.0.28,port=6379,state=online,offset=4312,lag=0 #slave的slave节点
master_replid:8e8279e461fdf0f1a3464ef768675149ad4b54a3
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:4312
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:4312

主从复制优化

主从复制过程

Redis主从复制分为全量同步和增量同步

Redis 的主从同步是非阻塞的，即同步过程不会影响主服务器的正常访问.

注意:主节点重启会导致全量同步,从节点重启只会导致增量同步

从redis 2.8版本以前，并不支持部分同步，当主从服务器之间的连接断掉之后，master服务器和slave服务器之间都是进行全量数据同步，从redis 2.8开始，即使主从连接中途断掉，也不需要进行全量同步，因为从这个版本开始引入了部分同步。

全量复制过程 Full resync

4.0之前版本的复制：run_id和复制偏移量来判断进行全量复制还是部分复制
4.0之后版本的复制：根据master_replid和复制偏移量来判断进行全量复制还是部分复制
主从关系建立后master_replid，保存的是当前主节点的master_replid master_replid2保存的是上一个主节点的master_replid

主从节点建立连接,验证身份后,从节点向主节点发送PSYNC(2.8版本之前是SYNC)命令
主节点向从节点发送FULLRESYNC命令,包括master_replid(runID)和offset
从节点保存主节点信息
主节点执行BGSAVE保存RDB文件,同时记录新的记录到buffer中
主节点发送RDB文件给从节点
主节点将新收到buffer中的记录发送至从节点
从节点删除本机的旧数据
从节点加载RDB
从节点同步主节点的buffer信息

全量复制发生在下面情况

从节点首次连接主节点(无master_replid/run_id)
从节点的复制偏移量不在复制积压缓冲区内
从节点无法连接主节点超过一定的时间

范例：查看RUNID

#Redis 重启服务后，RUNID会发生变化
127.0.0.1:6379> info server
# Server
redis_version:7.0.5
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:77bd58d092d1d003
redis_mode:standalone
os:Linux 5.4.0-124-generic x86_64
arch_bits:64
monotonic_clock:POSIX clock_gettime
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:9.4.0
process_id:16407
process_supervised:systemd
run_id:9e954950c255644ef291f6be0c579ae893c16aad
tcp_port:6379
server_time_usec:1667276559043301
uptime_in_seconds:3463
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:6332175
executable:/apps/redis/bin/redis-server
config_file:/apps/redis/etc/redis.conf
io_threads_active:0

增量复制过程 partial resynchronization

在主从复制首次完成全量同步之后再次需要同步时,从服务器只要发送当前的offset位置(类似于MySQL的binlog的位置)给主服务器，然后主服务器根据相应的位置将之后的数据(包括写在缓冲区的积压数据)发送给从服务器,再次将其保存到从节点内存即可。

即首次全量复制,之后的复制基本增量复制实现

主从同步完整过程

主从同步完整过程如下：

slave发起连接master，验证通过后,发送PSYNC命令
master接收到PSYNC命令后，执行BGSAVE命令将全部数据保存至RDB文件中,并将后续发生的写操作记录至buffer中
master向所有slave发送RDB文件
master向所有slave发送后续记录在buffer中写操作
slave收到快照文件后丢弃所有旧数据
slave加载收到的RDB到内存
slave 执行来自master接收到的buffer写操作
当slave完成全量复制后,后续master只会先发送slave_repl_offset信息
以后slave比较自身和master的差异,只会进行增量复制的数据即可

复制缓冲区(环形队列)配置参数：

#master的写入数据缓冲区，用于记录自上一次同步后到下一次同步过程中间的写入命令，计算公式：repl-backlog-size = 允许从节点最大中断时长 * 主实例offset每秒写入量，比如:master每秒最大写入64mb，最大允许60秒，那么就要设置为64mb*60秒=3840MB(3.8G),建议此值是设置的足够大，默认值为1M
repl-backlog-size 1mb

#如果一段时间后没有slave连接到master，则backlog size的内存将会被释放。如果值为0则表示永远不释放这部份内存。
repl-backlog-ttl 3600

避免全量复制

第一次全量复制不可避免,后续的全量复制可以利用小主节点(内存小),业务低峰时进行全量
节点RUN_ID不匹配:主节点重启会导致RUN_ID变化,可能会触发全量复制,可以利用config命令动态修改配置，故障转移例如哨兵或集群选举新的主节点也不会全量复制,而从节点重启动,不会导致全量复制,只会增量复制
复制积压缓冲区不足: 当主节点生成的新数据大于缓冲区大小,从节点恢复和主节点连接后,会导致全量复制.解决方法将repl-backlog-size 调大

避免复制风暴

单主节点复制风暴

当主节点重启，多从节点复制

解决方法:更换复制拓扑

一主带一从一从带多从

单机器多实例复制风暴

机器宕机后，大量全量复制

解决方法:主节点分散多机器

主从同步优化配置

Redis在2.8版本之前没有提供增量部分复制的功能，当网络闪断或者slave Redis重启之后会导致主从之间的全量同步，即从2.8版本开始增加了部分复制的功能。

性能相关配置

repl-diskless-sync no # 是否使用无盘方式进行同步RDB文件，默认为no(编译安装默认为yes)，no表示不使用无盘，需要将RDB文件保存到磁盘后再发送给slave，yes表示使用无盘，即RDB文件不需要保存至本地磁盘，而且直接通过网络发送给slave

repl-diskless-sync-delay 5 #无盘时复制的服务器等待的延迟时间

repl-ping-slave-period 10 #slave向master发送ping指令的时间间隔，默认为10s

repl-timeout 60 #指定ping连接超时时间,超过此值无法连接,master_link_status显示为down状态,并记录错误日志

repl-disable-tcp-nodelay no #是否启用TCP_NODELAY
#设置成yes，则redis会合并多个小的TCP包成一个大包再发送,此方式可以节省带宽，但会造成同步延迟时长的增加，导致master与slave数据短期内不一致
#设置成no，则master会立即同步数据

repl-backlog-size 1mb #master的写入数据缓冲区，用于记录自上一次同步后到下一次同步前期间的写入命令，计算公式：repl-backlog-size = 允许slave最大中断时长 * master节点offset每秒写入量，如:master每秒最大写入量为32MB，最长允许中断60秒，就要至少设置为32*60=1920MB,建议此值是设置的足够大,如果此值太小,会造成全量复制

repl-backlog-ttl 3600 #指定多长时间后如果没有slave连接到master，则backlog的内存数据将会过期。如果值为0表示永远不过期。

slave-priority 100 #slave参与选举新的master的优先级，此整数值越小则优先级越高。当master故障时将会按照优先级来选择slave端进行选举新的master，如果值设置为0，则表示该slave节点永远不会被选为master节点。

min-replicas-to-write 1 #指定master的可用slave不能少于个数，如果少于此值,master将无法执行写操作,默认为0,生产建议设为1,

min-slaves-max-lag 20 #指定至少有min-replicas-to-write数量的slave延迟时间都大于此秒数时，master将不能执行写操作,默认为10s

常见主从复制故障

主从硬件和软件配置不一致

主从节点的maxmemory不一致,主节点内存大于从节点内存,主从复制可能丢失数据

rename-command 命令不一致,如在主节点启用flushdb,从节点禁用此命令,结果在master节点执行flushdb后,导致slave节点不同步

#在从节点定义rename-command flushall "",但是在主节点没有此配置,则当在主节点执行flushall时,会在从节点提示下面同步错误
10822:S 16 Oct 2020 20:03:45.291 # == CRITICAL == This replica is sending an
error to its master: 'unknown command `flushall`, with args beginning with: '
after processing the command '<unknown>'

#master有一个rename-command flushdb "wang",而slave没有这个配置,则同步时从节点可以看到以下同步错误
3181:S 21 Oct 2020 17:34:50.581 # == CRITICAL == This replica is sending an
error to its master: 'unknown command `wang`, with args beginning with: ' after
processing the command '<unknown>'

Master 节点密码错误

如果slave节点配置的master密码错误，导致验证不通过,自然将无法建立主从同步关系。

[root@centos8 ~]#tail -f /var/log/redis/redis.log
24930:S 20 Feb 2020 13:53:57.029 * Connecting to MASTER 10.0.0.8:6379
24930:S 20 Feb 2020 13:53:57.030 * MASTER <-> REPLICA sync started
24930:S 20 Feb 2020 13:53:57.030 * Non blocking connect for SYNC fired the
event.
24930:S 20 Feb 2020 13:53:57.030 * Master replied to PING, replication can
continue...
24930:S 20 Feb 2020 13:53:57.031 # Unable to AUTH to MASTER: -ERR invalid
password

Redis 版本不一致

不同的redis 版本之间尤其是大版本间可能会存在兼容性问题，如：Redis 3,4,5,6之间

因此主从复制的所有节点应该使用相同的版本

安全模式下无法远程连接

如果开启了安全模式，并且没有设置bind地址和密码,会导致无法远程连接

[root@centos8 ~]#vim /etc/redis.conf
#bind 127.0.0.1 #将此行注释
[root@centos8 ~]#systemctl restart redis
[root@centos8 ~]#ss -ntl
State Recv-Q Send-Q Local Address:Port Peer
Address:Port
LISTEN 0 128 0.0.0.0:22
0.0.0.0:*
LISTEN 0 100 127.0.0.1:25
0.0.0.0:*
LISTEN 0 128 0.0.0.0:6379
0.0.0.0:*
LISTEN 0 128 [::]:22
[::]:*
LISTEN 0 100 [::1]:25
[::]:*
LISTEN 0 128 [::]:6379
[::]:*

[root@centos8 ~]#redis-cli -h 10.0.0.8
10.0.0.8:6379> KEYS *
(error) DENIED Redis is running in protected mode because protected mode is
enabled, no bind address was specified, no authentication password is requested
to clients. In this mode connections are only accepted from the loopback
interface. If you want to connect from external computers to Redis you may adopt
one of the following solutions: 1) Just disable protected mode sending the
command 'CONFIG SET protected-mode no' from the loopback interface by connecting
to Redis from the same host the server is running, however MAKE SURE Redis is not
publicly accessible from internet if you do so. Use CONFIG REWRITE to make this
change permanent. 2) Alternatively you can just disable the protected mode by
editing the Redis configuration file, and setting the protected mode option to
'no', and then restarting the server. 3) If you started the server manually just
for testing, restart it with the '--protected-mode no' option. 4) Setup a bind
address or an authentication password. NOTE: You only need to do one of the
above things in order for the server to start accepting connections from the
outside.
10.0.0.38:6379> exit

#可以本机登录
[root@centos8 ~]#redis-cli
127.0.0.1:6379> KEYS *
(empty list or set

Redis 哨兵 Sentinel

Redis 集群介绍

主从架构和MySQL的主从复制一样,无法实现master和slave角色的自动切换，即当master出现故障时,不能实现自动的将一个slave 节点提升为新的master节点,即主从复制无法实现自动的故障转移功能,如果想实现转移,则需要手动修改配置,才能将 slave 服务器提升新的master节点.此外只有一个主节点支持写操作,所以业务量很大时会导致Redis服务性能达到瓶颈

需要解决的主从复制以下存在的问题：

master和slave角色的自动切换，且不能影响业务
提升Redis服务整体性能，支持更高并发访问

哨兵 Sentinel 工作原理

哨兵Sentinel从Redis2.6版本开始引用，Redis 2.8版本之后稳定可用。生产环境如果要使用此功能建议使用Redis的2.8版本以上版本

Sentinel 架构和故障转移机制

Sentinel 故障转移

专门的Sentinel 服务进程是用于监控redis集群中Master工作的状态，当Master主服务器发生故障的时候，可以实现Master和Slave的角色的自动切换，从而实现系统的高可用性

Sentinel是一个分布式系统,即需要在多个节点上各自同时运行一个sentinel进程，Sentienl 进程通过流言协议(gossip protocols)来接收关于Master是否下线状态，并使用投票协议(Agreement Protocols)来决定是否执行自动故障转移,并选择合适的Slave作为新的Master

每个Sentinel进程会向其它Sentinel、Master、Slave定时发送消息，来确认对方是否存活，如果发现某个节点在指定配置时间内未得到响应，则会认为此节点已离线，即为主观宕机Subjective Down，简称为 SDOWN

如果哨兵集群中的多数Sentinel进程认为Master存在SDOWN，共同利用 is-master-down-by-addr 命令互相通知后，则认为客观宕机Objectively Down，简称 ODOWN

接下来利用投票算法，从所有slave节点中，选一台合适的slave将之提升为新Master节点，然后自动修改其它slave相关配置，指向新的master节点,最终实现故障转移failover

Redis Sentinel中的Sentinel节点个数应该为大于等于3且最好为奇数

客户端初始化时连接的是Sentinel节点集合，不再是具体的Redis节点，即 Sentinel只是配置中心不是代理。

Redis Sentinel 节点与普通 Redis 没有区别,要实现读写分离依赖于客户端程序

Sentinel 机制类似于MySQL中的MHA功能,只解决master和slave角色的自动故障转移问题，但单个Master 的性能瓶颈问题并没有解决

Redis 3.0 之前版本中,生产环境一般使用哨兵模式较多,Redis 3.0后推出Redis cluster功能,可以支持更大规模的高并发环境

Sentinel中的三个定时任务

每10 秒每个sentinel 对master和slave执行info
- 发现slave节点
- 确认主从关系
每2秒每个sentinel通过master节点的channel交换信息(pub/sub)
- 通过sentinel__:hello频道交互
- 交互对节点的“看法”和自身信息
每1秒每个sentinel对其他sentinel和redis执行ping

实现哨兵架构

以下案例实现一主两从的基于哨兵的高可用Redis架构

哨兵需要先实现主从复制

哨兵的前提是已经实现了Redis的主从复制

注意: master 的配置文件中masterauth 和slave 都必须相同

所有主从节点的 redis.conf 中关健配置

范例: 准备主从环境配置

#在所有主从节点执行
#基于包安装
[root@centos8 ~]#yum -y install redis
[root@ubuntu2004 ~]#apt -y install redis redis-sentinel

[root@centos8 ~]#vim /etc/redis.conf
bind 0.0.0.0
masterauth "123456"
requirepass "123456"

#或者非交互执行
[root@centos8 ~]#sed -i -e 's/bind 127.0.0.1/bind 0.0.0.0/' -e 's/^# masterauth .*/masterauth 123456/' -e 's/^# requirepass .*/requirepass 123456/' /etc/redis.conf

#在所有从节点执行
[root@centos8 ~]#echo "replicaof 10.0.0.8 6379" >> /etc/redis.conf

#在所有主从节点执行
[root@centos8 ~]#systemctl enable --now redis

master 服务器状态

[root@redis-master ~]#redis-cli -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface
may not
127.0.0.1:6379> INFO replication
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.28,port=6379,state=online,offset=112,lag=1
slave1:ip=10.0.0.18,port=6379,state=online,offset=112,lag=0
master_replid:8fdca730a2ae48fb9c8b7e739dcd2efcc76794f3
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:112
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:112
127.0.0.1:6379>

配置 slave1

[root@redis-slave1 ~]#redis-cli -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface
may not be safe.
127.0.0.1:6379> REPLICAOF 10.0.0.8 6379
OK
127.0.0.1:6379> CONFIG SET masterauth "123456"
OK
127.0.0.1:6379> INFO replication
# Replication
role:slave
master_host:10.0.0.8
master_port:6379
master_link_status:up
master_last_io_seconds_ago:4
master_sync_in_progress:0
slave_repl_offset:140
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:8fdca730a2ae48fb9c8b7e739dcd2efcc76794f3
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:140
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:99
repl_backlog_histlen:42

配置 slave2

[root@redis-slave2 ~]#redis-cli -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface
may not be safe.
127.0.0.1:6379> REPLICAOF 10.0.0.8 6379
OK
127.0.0.1:6379> CONFIG SET masterauth "123456"
OK
127.0.0.1:6379> INFO replication
# Replication
role:slave
master_host:10.0.0.8
master_port:6379
master_link_status:up
master_last_io_seconds_ago:3
master_sync_in_progress:0
slave_repl_offset:182
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:8fdca730a2ae48fb9c8b7e739dcd2efcc76794f3
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:182
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:15
repl_backlog_histlen:168
127.0.0.1:6379>

编辑哨兵配置

sentinel 配置

Sentinel实际上是一个特殊的redis服务器,有些redis指令支持,但很多指令并不支持.默认监听在26379/tcp端口

哨兵服务可以和Redis服务器分开部署在不同主机，但为了节约成本一般会部署在一起

所有redis节点使用相同的以下示例的配置文件

#如果是编译安装，在源码目录有sentinel.conf，复制到安装目录即可，
如:/apps/redis/etc/sentinel.conf
[root@ubuntu2204 ~]#cp redis-7.0.5/sentinel.conf /apps/redis/etc/sentinel.conf
[root@centos8 ~]#cp redis-6.2.5/sentinel.conf /apps/redis/etc/sentinel.conf

[root@ubuntu2204 ~]#chown redis.redis /apps/redis/etc/sentinel.conf

#包安装修改配置文件
[root@centos8 ~]#vim /etc/redis-sentinel.conf
bind 0.0.0.0
port 26379
daemonize yes
pidfile "redis-sentinel.pid"
logfile "sentinel_26379.log"
dir "/tmp" #工作目录

sentinel monitor mymaster 10.0.0.8 6379 2

#mymaster是集群的名称，此行指定当前mymaster集群中master服务器的地址和端口
#2为法定人数限制(quorum)，即有几个sentinel认为master down了就进行故障转移，一般此值是所有sentinel节点(一般总数是>=3的 奇数,如:3,5,7等)的一半以上的整数值，比如，总数是3，即3/2=1.5，取整为2,是master的ODOWN客观下线的依据

sentinel auth-pass mymaster 123456
#mymaster集群中master的密码，注意此行要在上面行的下面

sentinel down-after-milliseconds mymaster 30000
#判断mymaster集群中所有节点的主观下线(SDOWN)的时间，单位：毫秒，建议3000

sentinel parallel-syncs mymaster 1
#发生故障转移后，可以同时向新master同步数据的slave的数量，数字越小总同步时间越长，但可以减轻新master的负载压力

sentinel failover-timeout mymaster 180000
#所有slaves指向新的master所需的超时时间，单位：毫秒

sentinel deny-scripts-reconfig yes #禁止修改脚本

logfile /var/log/redis/sentinel.log

#编译安装修改配置文件
[root@ubuntu2204 ~]#vim /apps/redis/etc/sentinel.conf
[root@ubuntu2204 ~]#grep -Ev "#|^$" /apps/redis/etc/sentinel.conf
protected-mode no
port 26379
daemonize no
pidfile "/apps/redis/run/redis-sentinel.pid"
logfile "/apps/redis/log/redis-sentinel.log"
dir "/tmp"
sentinel monitor mymaster 10.0.0.102 6379 2
sentinel auth-pass mymaster 123456
sentinel down-after-milliseconds mymaster 3000
acllog-max-len 128
sentinel deny-scripts-reconfig yes
sentinel resolve-hostnames no
sentinel announce-hostnames no

三个哨兵服务器的配置都如下

[root@redis-master ~]#grep -vE "^#|^$" /etc/redis-sentinel.conf
port 26379
daemonize no
pidfile "/var/run/redis-sentinel.pid"
logfile "/var/log/redis/sentinel.log"
dir "/tmp"
sentinel monitor mymaster 10.0.0.8 6379 2 #修改此行
sentinel auth-pass mymaster 123456 #增加此行
sentinel down-after-milliseconds mymaster 3000 #修改此行
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
sentinel deny-scripts-reconfig yes

#注意此行自动生成必须唯一,一般不需要修改，如果相同则修改此值需重启redis和sentinel服务
sentinel myid 50547f34ed71fd48c197924969937e738a39975b

.....
# Generated by CONFIG REWRITE
protected-mode no
supervised systemd
sentinel leader-epoch mymaster 0
sentinel known-replica mymaster 10.0.0.28 6379
sentinel known-replica mymaster 10.0.0.18 6379
sentinel current-epoch 0

[root@redis-master ~]#scp /etc/redis-sentinel.conf redis-slave1:/etc/
[root@redis-master ~]#scp /etc/redis-sentinel.conf redis-slave2:/etc/

启动哨兵服务

将所有哨兵服务器都启动起来

#确保每个哨兵主机myid不同，如果相同，必须手动修改为不同的值
[root@redis-slave1 ~]#vim /etc/redis-sentinel.conf
sentinel myid 50547f34ed71fd48c197924969937e738a39975c
[root@redis-slave2 ~]#vim /etc/redis-sentinel.conf
sentinel myid 50547f34ed71fd48c197924969937e738a39975d


[root@redis-master ~]#systemctl enable --now redis-sentinel.service
[root@redis-slave1 ~]#systemctl enable --now redis-sentinel.service
[root@redis-slave2 ~]#systemctl enable --now redis-sentinel.service

如果是编译安装,在所有哨兵服务器执行下面操作启动哨兵

[root@redis-master ~]##vim /apps/redis/etc/sentinel.conf
bind 0.0.0.0
port 26379
daemonize yes
pidfile "redis-sentinel.pid"
Logfile "sentinel_26379.log"
dir "/apps/redis/data"
sentinel monitor mymaster 10.0.0.8 6379 2
sentinel auth-pass mymaster 123456
sentinel down-after-milliseconds mymaster 15000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
sentinel deny-scripts-reconfig yes

[root@redis-master ~]#/apps/redis/bin/redis-sentinel /apps/redis/etc/sentinel.conf

#如果是编译安装，可以在所有节点生成新的service文件
[root@redis-master ~]#cat /lib/systemd/system/redis-sentinel.service
[Unit]
Description=Redis Sentinel
After=network.target

[Service]
ExecStart=/apps/redis/bin/redis-sentinel /apps/redis/etc/sentinel.conf --supervised systemd
ExecStop=/bin/kill -s QUIT $MAINPID
User=redis
Group=redis
RuntimeDirectory=redis
RuntimeDirectory
Mode=0755

[Install]
WantedBy=multi-user.target

#注意所有节点的目录权限,否则无法启动服务
[root@redis-master ~]#chown -R redis.redis /apps/redis/
[root@redis-master ~]#systemctl daemon-reload
[root@redis-master ~]#systemctl enable --now redis-sentinel.service

验证哨兵服务

查看哨兵服务端口状态

[root@redis-master ~]#ss -ntl
State Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
LISTEN 0 128 0.0.0.0:26379 0.0.0.0:*
LISTEN 0 128 0.0.0.0:6379 0.0.0.0:*
LISTEN 0 128 [::]:22 [::]:*
LISTEN 0 128 [::]:26379 [::]:*
LISTEN 0 128 [::]:6379 [::]:*

查看哨兵日志

master的哨兵日志

[root@redis-master ~]#tail -f /var/log/redis/sentinel.log
38028:X 20 Feb 2020 17:13:08.702 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
38028:X 20 Feb 2020 17:13:08.702 # Redis version=5.0.3, bits=64,
commit=00000000, modified=0, pid=38028, just started
8028:X 20 Feb 2020 17:13:08.702 # Configuration loaded
38028:X 20 Feb 2020 17:13:08.702 * supervised by systemd, will signal readiness
38028:X 20 Feb 2020 17:13:08.703 * Running mode=sentinel, port=26379.
38028:X 20 Feb 2020 17:13:08.703 # WARNING: The TCP backlog setting of 511
cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value
of 128.
38028:X 20 Feb 2020 17:13:08.704 # Sentinel ID is
50547f34ed71fd48c197924969937e738a39975b
38028:X 20 Feb 2020 17:13:08.704 # +monitor master mymaster 10.0.0.8 6379 quorum
2
38028:X 20 Feb 2020 17:13:08.709 * +slave slave 10.0.0.28:6379 10.0.0.28 6379 @
mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:13:08.709 * +slave slave 10.0.0.18:6379 10.0.0.18 6379 @
mymaster 10.0.0.8 6379

slave的哨兵日志

[root@redis-slave1 ~]#tail -f /var/log/redis/sentinel.log
25509:X 20 Feb 2020 17:13:27.435 * Removing the pid file.
25509:X 20 Feb 2020 17:13:27.435 # Sentinel is now ready to exit, bye bye...
25572:X 20 Feb 2020 17:13:27.448 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
25572:X 20 Feb 2020 17:13:27.448 # Redis version=5.0.3, bits=64,
commit=00000000, modified=0, pid=25572, just started
25572:X 20 Feb 2020 17:13:27.448 # Configuration loaded
25572:X 20 Feb 2020 17:13:27.448 * supervised by systemd, will signal readiness
25572:X 20 Feb 2020 17:13:27.449 * Running mode=sentinel, port=26379.
25572:X 20 Feb 2020 17:13:27.449 # WARNING: The TCP backlog setting of 511
cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value
of 128.
25572:X 20 Feb 2020 17:13:27.449 # Sentinel ID is
50547f34ed71fd48c197924969937e738a39975b
25572:X 20 Feb 2020 17:13:27.449 # +monitor master mymaster 10.0.0.8 6379 quorum
2

当前sentinel状态

在sentinel状态中尤其是最后一行，涉及到masterIP是多少，有几个slave，有几个sentinels，必须是符合全部服务器数量

[root@redis-master ~]#redis-cli -p 26379
127.0.0.1:26379> INFO sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymaster,status=ok,address=10.0.0.8:6379,slaves=2,sentinels=3 #两个slave,三个sentinel服务器,如果sentinels值不符合,检查myid可能冲突

停止 Master 节点实现故障转移

停止 Master 节点

[root@redis-master ~]#killall redis-server

查看各节点上哨兵信息：

[root@redis-master ~]#redis-cli -p 26379
Warning: Using a password with '-a' or '-u' option on the command line interface
may not be safe.
127.0.0.1:26379> INFO sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymaster,status=ok,address=10.0.0.18:6379,slaves=2,sentinels=3

故障转移时sentinel的信息：

[root@redis-master ~]#tail -f /var/log/redis/sentinel.log
38028:X 20 Feb 2020 17:42:27.362 # +sdown master mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:42:27.418 # +odown master mymaster 10.0.0.8 6379 #quorum
2/2
38028:X 20 Feb 2020 17:42:27.418 # +new-epoch 1
38028:X 20 Feb 2020 17:42:27.418 # +try-failover master mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:42:27.419 # +vote-for-leader
50547f34ed71fd48c197924969937e738a39975b 1
38028:X 20 Feb 2020 17:42:27.422 # 50547f34ed71fd48c197924969937e738a39975d
voted for 50547f34ed71fd48c197924969937e738a39975b 1
38028:X 20 Feb 2020 17:42:27.475 # +elected-leader master mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:42:27.475 # +failover-state-select-slave master mymaster
10.0.0.8 6379
38028:X 20 Feb 2020 17:42:27.529 # +selected-slave slave 10.0.0.18:6379
10.0.0.18 6379 @ mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:42:27.529 * +failover-state-send-slaveof-noone slave
10.0.0.18:6379 10.0.0.18 6379 @ mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:42:27.613 * +failover-state-wait-promotion slave
10.0.0.18:6379 10.0.0.18 6379 @ mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:42:28.506 # +promoted-slave slave 10.0.0.18:6379
10.0.0.18 6379 @ mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:42:28.506 # +failover-state-reconf-slaves master mymaster
10.0.0.8 6379
38028:X 20 Feb 2020 17:42:28.582 * +slave-reconf-sent slave 10.0.0.28:6379
10.0.0.28 6379 @ mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:42:28.736 * +slave-reconf-inprog slave 10.0.0.28:6379
10.0.0.28 6379 @ mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:42:28.736 * +slave-reconf-done slave 10.0.0.28:6379
10.0.0.28 6379 @ mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:42:28.799 # +failover-end master mymaster 10.0.0.8 6379
38028:X 20 Feb 2020 17:42:28.799 # +switch-master mymaster 10.0.0.8 6379
10.0.0.18 6379
38028:X 20 Feb 2020 17:42:28.799 * +slave slave 10.0.0.28:6379 10.0.0.28 6379 @
mymaster 10.0.0.18 6379
38028:X 20 Feb 2020 17:42:28.799 * +slave slave 10.0.0.8:6379 10.0.0.8 6379 @
mymaster 10.0.0.18 6379
38028:X 20 Feb 2020 17:42:31.809 # +sdown slave 10.0.0.8:6379 10.0.0.8 6379 @
mymaster 10.0.0.18 6379

验证故障转移

故障转移后redis.conf中的replicaof行的master IP会被修改

[root@redis-slave2 ~]#grep ^replicaof /etc/redis.conf
replicaof 10.0.0.18 6379

哨兵配置文件的sentinel monitor IP 同样也会被修改

[root@redis-slave1 ~]#grep "^[a-Z]" /etc/redis-sentinel.conf
port 26379
daemonize no
pidfile "/var/run/redis-sentinel.pid"
logfile "/var/log/redis/sentinel.log"
dir "/tmp"
sentinel myid 50547f34ed71fd48c197924969937e738a39975b
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 10.0.0.18 6379 2 #自动修改此行
sentinel down-after-milliseconds mymaster 3000
sentinel auth-pass mymaster 123456
sentinel config-epoch mymaster 1
protected-mode no
supervised systemd
sentinel leader-epoch mymaster 1
sentinel known-replica mymaster 10.0.0.8 6379
sentinel known-replica mymaster 10.0.0.28 6379
sentinel known-sentinel mymaster 10.0.0.28 26379
50547f34ed71fd48c197924969937e738a39975d
sentinel current-epoch 1

[root@redis-slave2 ~]#grep "^[a-Z]" /etc/redis-sentinel.conf
port 26379
daemonize no
pidfile "/var/run/redis-sentinel.pid"
logfile "/var/log/redis/sentinel.log"
dir "/tmp"
sentinel myid 50547f34ed71fd48c197924969937e738a39975d
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 10.0.0.18 6379 2 #自动修改此行
sentinel down-after-milliseconds mymaster 3000
sentinel auth-pass mymaster 123456
sentinel config-epoch mymaster 1
protected-mode no
supervised systemd
sentinel leader-epoch mymaster 1
sentinel known-replica mymaster 10.0.0.28 6379
sentinel known-replica mymaster 10.0.0.8 6379
sentinel known-sentinel mymaster 10.0.0.8 26379
50547f34ed71fd48c197924969937e738a39975b
sentinel current-epoch 1

验证 Redis 各节点状态

新的master 状态

[root@redis-slave1 ~]#redis-cli -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface
may not be safe.
127.0.0.1:6379> INFO replication
# Replication
role:master #提升为master
connected_slaves:1
slave0:ip=10.0.0.28,port=6379,state=online,offset=56225,lag=1
master_replid:75e3f205082c5a10824fbe6580b6ad4437140b94
master_replid2:b2fb4653bdf498691e5f88519ded65b6c000e25c
master_repl_offset:56490
second_repl_offset:46451
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:287
repl_backlog_histlen:56204

另一个slave指向新的master

[root@redis-slave2 ~]#redis-cli -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface
may not be safe.
127.0.0.1:6379> INFO replication
# Replication
role:slave
master_host:10.0.0.18 #指向新的master
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:61029
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:75e3f205082c5a10824fbe6580b6ad4437140b94
master_replid2:b2fb4653bdf498691e5f88519ded65b6c000e25c
master_repl_offset:61029
second_repl_offset:46451
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:61029

原 Master 重新加入 Redis 集群

[root@redis-master ~]#cat /etc/redis.conf
#sentinel会自动修改下面行指向新的master
replicaof 10.0.0.18 6379

在原 master上观察状态

[root@redis-master ~]#redis-cli -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface
may not be safe.
127.0.0.1:6379> INFO replication
# Replication
role:slave
master_host:10.0.0.18
master_port:6379
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:764754
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:75e3f205082c5a10824fbe6580b6ad4437140b94
master_replid2:b2fb4653bdf498691e5f88519ded65b6c000e25c
master_repl_offset:764754
second_repl_offset:46451
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:46451
repl_backlog_histlen:718304

[root@redis-master ~]#redis-cli -p 26379
127.0.0.1:26379> INFO sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymaster,status=ok,address=10.0.0.18:6379,slaves=2,sentinels=3
127.0.0.1:26379>

观察新master上状态和日志

[root@redis-slave1 ~]#redis-cli -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface
may not be safe.
127.0.0.1:6379> INFO replication
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.28,port=6379,state=online,offset=769027,lag=0
slave1:ip=10.0.0.8,port=6379,state=online,offset=769027,lag=0
master_replid:75e3f205082c5a10824fbe6580b6ad4437140b94
master_replid2:b2fb4653bdf498691e5f88519ded65b6c000e25c
master_repl_offset:769160
second_repl_offset:46451
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:287
repl_backlog_histlen:768874
127.0.0.1:6379>

[root@redis-slave1 ~]#tail -f /var/log/redis/sentinel.log
25717:X 20 Feb 2020 17:42:33.757 # +sdown slave 10.0.0.8:6379 10.0.0.8 6379 @
mymaster 10.0.0.18 6379
25717:X 20 Feb 2020 18:41:29.566 # -sdown slave 10.0.0.8:6379 10.0.0.8 6379 @
mymaster 10.0.0.18 6379

Sentinel 运维

在Sentinel主机手动触发故障切换

#redis-cli -p 26379
127.0.0.1:26379> sentinel failover <masterName>

范例: 手动故障转移

[root@centos8 ~]#vim /etc/redis.conf
replica-priority 10 #指定优先级,值越小sentinel会优先将之选为新的master,默为值为100
[root@centos8 ~]#systemctl restart redis

#或者动态修改
[root@centos8 ~]#redis-cli -a 123456
Warning: Using a password with '-a' or '-u' option on the command line interface
may not be safe.
127.0.0.1:6379> CONFIG GET replica-priority
1) "replica-priority"
2) "100"
127.0.0.1:6379> CONFIG SET replica-priority 99
OK
127.0.0.1:6379> CONFIG GET replica-priority
1) "replica-priority"
2) "99"

[root@centos8 ~]#redis-cli -p 26379
127.0.0.1:26379> sentinel failover mymaster #原主节点自动变成从节点
OK

应用程序连接 Sentinel

Redis 官方支持多种开发语言的客户端：https://redis.io/clients

客户端连接 Sentinel 工作原理

客户端获取 Sentinel 节点集合,选举出一个 Sentinel
由这个sentinel 通过masterName 获取master节点信息,客户端通过sentinel get-master-addr-by-name master-name这个api来获取对应主节点信息
客户端发送role指令确认master的信息,验证当前获取的“主节点”是真正的主节点，这样的目的是为了防止故障转移期间主节点的变化
客户端保持和Sentinel节点集合的联系，即订阅Sentinel节点相关频道，时刻获取关于主节点的相关信息,获取新的master 信息变化,并自动连接新的master

Java 连接Sentinel哨兵

Java 客户端连接Redis：https://github.com/xetorthio/jedis/blob/master/pom.xml

#jedis/pom.xml 配置连接redis
<properties>
		<redis-hosts>localhost:6379,localhost:6380,localhost:6381,localhost:6382,localhost:6383,localhost:6384,localhost:6385</redis-hosts>
		<sentinel-hosts>localhost:26379,localhost:26380,localhost:26381</sentinel-hosts>
		<cluster-hosts>localhost:7379,localhost:7380,localhost:7381,localhost:7382,localhost:7383,localhost:7384,localhost:7385</cluster-hosts>
		<github.global.server>github</github.global.server>
</properties>

java客户端连接单机的redis是通过Jedis来实现的，java代码用的时候只要创建Jedis对象就可以建多个Jedis连接池来连接redis，应用程序再直接调用连接池即可连接Redis。而Redis为了保障高可用,服务一般都是Sentinel部署方式，当Redis服务中的主服务挂掉之后,会仲裁出另外一台Slaves服务充当Master。这个时候,我们的应用即使使用了Jedis 连接池,如果Master服务挂了,应用将还是无法连接新的Master服务，为了解决这个问题, Jedis也提供了相应的Sentinel实现,能够在Redis Sentinel主从切换时候,通知应用,把应用连接到新的Master服务。

Redis Sentinel的使用也是十分简单的,只是在JedisPool中添加了Sentinel和MasterName参数，JRedisSentinel底层基于Redis订阅实现Redis主从服务的切换通知，当Reids发生主从切换时，Sentinel会发送通知主动通知Jedis进行连接的切换，JedisSentinelPool在每次从连接池中获取链接对象的时候,都要对连接对象进行检测,如果此链接和Sentinel的Master服务连接参数不一致,则会关闭此连接,重新获取新的Jedis连接对象。

Python 连接 Sentinel 哨兵

[root@ubuntu2204 ~]#apt -y install python3-redis
[root@centos8 ~]#yum -y install python3 python3-redis

[root@centos8 ~]#cat sentinel_test.py
#!/usr/bin/python3
import redis
from redis.sentinel import Sentinel

#连接哨兵服务器(主机名也可以用域名)
sentinel = Sentinel([('10.0.0.8', 26379),
                     ('10.0.0.18', 26379),
                     ('10.0.0.28', 26379)],
                    socket_timeout=0.5)

redis_auth_pass = '123456'

#mymaster 是配置哨兵模式的redis集群名称，此为默认值,实际名称按照个人部署案例来填写
#获取主服务器地址
master = sentinel.discover_master('mymaster')
print(master)
#获取从服务器地址
slave = sentinel.discover_slaves('mymaster')
print(slave)

#获取主服务器进行写入
master = sentinel.master_for('mymaster', socket_timeout=0.5,
password=redis_auth_pass, db=0)
w_ret = master.set('name', 'wang')
#输出：True

#获取从服务器进行读取（默认是round-roubin）
slave = sentinel.slave_for('mymaster', socket_timeout=0.5,
password=redis_auth_pass, db=0)
r_ret = slave.get('name')
print(r_ret)
#输出：wang

[root@centos8 ~]#chmod +x sentinel_test.py
[root@centos8 ~]#./sentinel_test.py
('10.0.0.8', 6379)
[('10.0.0.18', 6379), ('10.0.0.28', 6379)]
b'wang'