1. Disk failure
Because the underlying storage is configured with RAID, a single hardware failure is handled by simply replacing the disk; the data resynchronizes automatically.
If the bricks are not on RAID, handle it as follows:
On a healthy node, run gluster volume status and record the failed node's UUID.
Run getfattr -d -m '.*' /brick on a healthy brick.
Record the trusted.glusterfs.volume-id and trusted.gfid values.
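The recording step above can be scripted. A minimal sketch, assuming you have saved the output of getfattr from the healthy brick (here a captured sample stands in for the live command so the parsing logic can be shown; the brick path is illustrative):

```shell
# On the real node you would capture the dump with:
#   getfattr -d -m '.*' /storage/brick2 > /tmp/brick2.xattrs
# Sample dump, mirroring the walkthrough below:
sample='# file: storage/brick2
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAAAAAAA/////w==
trusted.glusterfs.dht.commithash="3330055593"
trusted.glusterfs.volume-id=0sIo9jxAIZTDmOh/OuI3/22Q=='

# Extract the two values to record; cut -f2- keeps the trailing '=='
# of the base64-style values intact.
volume_id=$(printf '%s\n' "$sample" | grep '^trusted.glusterfs.volume-id' | cut -d= -f2-)
gfid=$(printf '%s\n' "$sample" | grep '^trusted.gfid' | cut -d= -f2-)
echo "volume-id: $volume_id"
echo "gfid: $gfid"
```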
Walkthrough
(1) Check the gv2 volume info
[root@mystorage1 /]# gluster volume info gv2
Volume Name: gv2
Type: Replicate
Volume ID: 228f63c4-0219-4c39-8e87-f3ae237ff6d9
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: mystorage1:/storage/brick2
Brick2: mystorage2:/storage/brick2
Options Reconfigured:
performance.cache-size: 256MB
performance.read-ahead: on
performance.readdir-ahead: on
(2) Simulate a disk failure: delete the /dev/sdc disk of storage2 in VMware. GlusterFS immediately reports the brick as failed:

[root@mystorage2 ~]# fdisk -l | grep /dev/sdc
[root@mystorage2 ~]#
Message from syslogd@mystorage2 at Mar 21 23:10:14 ...
storage-brick2[3101]: [2017-03-21 15:10:14.549397] M [MSGID: 113075] [posix-helpers.c:1856:posix_health_check_thread_proc] 0-gv2-posix: health-check failed, going down

# The data still exists on storage1's brick and is not lost
# (replicated-volume redundancy)
[root@mystorage1 brick2]# ls
10M-1.file  10M-3.file  10M-4.file  20M.file
(3) Simulate replacing the disk: attach a new disk to storage2, format it, and mount it.

[root@mystorage2 ~]# mkfs.xfs /dev/sdc
[root@mystorage2 ~]# mount /dev/sdc /storage/brick2

# brick2 on storage2 no longer contains the .glusterfs, .trashcan,
# and other metadata directories
[root@mystorage2 brick2]# ls -a
.  ..
[root@mystorage1 brick2]# ls -a
.  ..  10M-1.file  10M-3.file  10M-4.file  20M.file  .glusterfs  .trashcan
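Before restoring the attributes, it is worth confirming the replacement brick directory is in the expected state. A small sketch of such a pre-flight check; brick_ready is an illustrative helper name, not a Gluster command:

```shell
# Check that the replacement brick directory exists and is empty
# (no leftover .glusterfs metadata from a previous life).
brick_ready() {
    local dir="$1"
    [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
    # ls -A lists hidden entries too, so stale .glusterfs is caught
    if [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "not empty: $dir"
        return 1
    fi
    echo "ready: $dir"
}
```

Usage on the failed node would be something like `brick_ready /storage/brick2`.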
(4) Restore the data

# Inspect the volume attributes on the healthy node with getfattr
[root@mystorage1 brick2]# getfattr -d -m '.*' /storage/brick2
getfattr: Removing leading '/' from absolute path names
# file: storage/brick2
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAAAAAAA/////w==
trusted.glusterfs.dht.commithash="3330055593"
trusted.glusterfs.volume-id=0sIo9jxAIZTDmOh/OuI3/22Q==

# On the failed node, the healthy brick (brick1) still carries its
# attributes, while the replaced brick (brick2) has none
[root@mystorage2 brick2]# getfattr -d -m '.*' /storage/brick2
[root@mystorage2 brick2]# getfattr -d -m '.*' /storage/brick1
getfattr: Removing leading '/' from absolute path names
# file: storage/brick1
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAB//////////w==
trusted.glusterfs.volume-id=0s1yGtRzv7Rf68R2fOEdGa+Q==

# Set the healthy brick's attributes on the replaced brick so that it
# can sync with its replica and the data can be recovered:
#   setfattr -n trusted.glusterfs.volume-id -v <recorded value> <brick path>
#   setfattr -n trusted.gfid -v <recorded value> <brick path>
#   ...
[root@mystorage2 brick2]# setfattr -n trusted.glusterfs.volume-id -v 0sIo9jxAIZTDmOh/OuI3/22Q== /storage/brick2
[root@mystorage2 brick2]# getfattr -d -m '.*' /storage/brick2
getfattr: Removing leading '/' from absolute path names
# file: storage/brick2
trusted.glusterfs.volume-id=0sIo9jxAIZTDmOh/OuI3/22Q==

# Set the remaining attributes the same way
[root@mystorage2 brick2]# setfattr -n trusted.gfid -v 0sAAAAAAAAAAAAAAAAAAAAAQ== /storage/brick2
[root@mystorage2 brick2]# setfattr -n trusted.glusterfs.dht -v 0sAAAAAQAAAAAAAAAA/////w== /storage/brick2
[root@mystorage2 brick2]# setfattr -n trusted.glusterfs.dht.commithash -v "3330055593" /storage/brick2
[root@mystorage2 brick2]# getfattr -d -m '.*' /storage/brick2
getfattr: Removing leading '/' from absolute path names
# file: storage/brick2
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAAAAAAA/////w==
trusted.glusterfs.dht.commithash="3330055593"
trusted.glusterfs.volume-id=0sIo9jxAIZTDmOh/OuI3/22Q==

# Restart glusterd and verify that the data syncs. Simply copying the
# files over without setting these attributes does not work; the data
# will not sync.
[root@mystorage2 brick2]# /etc/init.d/glusterd restart
Stopping glusterd:                                         [  OK  ]
Starting glusterd:                                         [  OK  ]
[root@mystorage2 brick2]# ls -a
.  ..  10M-1.file  10M-3.file  10M-4.file  20M.file  .glusterfs  .trashcan
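The one-attribute-at-a-time restore above can be turned into a small generator: feed it the getfattr dump taken on the healthy brick and it emits the matching setfattr commands for the replacement brick. This is a sketch; gen_setfattr is an illustrative helper name and the brick path is an assumption. Review the generated commands before running them as root on the real node:

```shell
# Read a getfattr dump on stdin and print one setfattr command per
# attribute, preserving '=' characters inside the base64-style values.
gen_setfattr() {
    local brick="$1"
    while IFS= read -r line; do
        case "$line" in
            '#'*) continue ;;   # skip "# file: ..." comment lines
            *=*)  ;;            # attribute lines contain '='
            *)    continue ;;   # skip blanks and getfattr notices
        esac
        name=${line%%=*}        # everything before the first '='
        value=${line#*=}        # everything after it, '==' suffix intact
        printf 'setfattr -n %s -v %s %s\n' "$name" "$value" "$brick"
    done
}
```

On a live system you would pipe the saved dump through it, e.g. `gen_setfattr /storage/brick2 < /tmp/brick2.xattrs`.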
2. Host failure
A node failure covers the following cases:
a) a physical failure;
b) multiple disks failing at once, causing data loss;
c) unrecoverable system corruption.
Solution:
Find an identical machine (at minimum, the same number and size of disks), install the OS, configure the same IP as the failed machine, install the Gluster software, and make sure the configuration matches. On a healthy node, run gluster peer status to find the failed server's UUID:

[root@mystorage1 gv2]# gluster peer status
Number of Peers: 3
Hostname: mystorage2
Uuid: 2e3b51aa-45b2-4cc0-bc44-457d42210ff1
State: Peer in Cluster (Disconnected)

Edit /var/lib/glusterd/glusterd.info on the new machine so its UUID matches the failed machine's:

cat /var/lib/glusterd/glusterd.info
UUID=2e3b51aa-45b2-4cc0-bc44-457d42210ff1

On the new machine's brick mount points, repeat the disk-failure recovery steps from section 1. Then trigger a full heal from any node:

[root@drbd01 ~]# gluster volume heal gv2 full
Launching Heal operation on volume gv2 has been successful

Synchronization starts automatically, but it degrades overall system performance while it runs. Check its status with:

[root@drbd01 ~]# gluster volume heal gv2 info
Gathering Heal info on volume gv2 has been successful
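Pulling the disconnected peer's UUID out of the status output can also be automated. A sketch using a captured sample of the output shown above; on a live node you would pipe the real `gluster peer status` output in instead:

```shell
# Sample `gluster peer status` output, mirroring the walkthrough above.
peer_status='Number of Peers: 3
Hostname: mystorage2
Uuid: 2e3b51aa-45b2-4cc0-bc44-457d42210ff1
State: Peer in Cluster (Disconnected)'

# Remember each Uuid line; print the last one seen when the peer's
# State line says Disconnected.
failed_uuid=$(printf '%s\n' "$peer_status" | awk '
    /^Uuid:/       { uuid = $2 }
    /Disconnected/ { print uuid }')
echo "failed peer uuid: $failed_uuid"
```

The UUID printed here is the value to place in /var/lib/glusterd/glusterd.info on the replacement machine.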