root@cl02-n02:~# pvscan
/dev/NAS01LUN0VG0/vzsnap-cl02-n02-0: read failed after 0 of 4096 at 21478965248: Input/output error
/dev/NAS01LUN0VG0/vzsnap-cl02-n02-0: read failed after 0 of 4096 at 21479022592: Input/output error
/dev/NAS01LUN0VG0/vzsnap-cl02-n02-0: read failed after 0 of 4096 at 0: Input/output error
/dev/NAS01LUN0VG0/vzsnap-cl02-n02-0: read failed after 0 of 4096 at 4096: Input/output error
/dev/sdb: Checksum error
PV /dev/sdk VG NAS01LUN9VG0 lvm2 [2.00 TiB / 2.00 TiB free]
PV /dev/sdj VG NAS01LUN8VG0 lvm2 [2.00 TiB / 2.00 TiB free]
PV /dev/sdi VG NAS01LUN7VG0 lvm2 [511.98 GiB / 511.98 GiB free]
PV /dev/sdh VG NAS01LUN6VG0 lvm2 [511.98 GiB / 511.98 GiB free]
PV /dev/sdg VG NAS01LUN5VG0 lvm2 [255.99 GiB / 159.99 GiB free]
PV /dev/sdf VG NAS01LUN4VG0 lvm2 [255.99 GiB / 157.99 GiB free]
PV /dev/sde VG NAS01LUN3VG0 lvm2 [127.99 GiB / 29.99 GiB free]
PV /dev/sdd VG NAS01LUN2VG0 lvm2 [127.99 GiB / 95.99 GiB free]
PV /dev/sdc VG NAS01LUN1VG1 lvm2 [127.99 GiB / 26.99 GiB free]
PV /dev/sda2 VG pve lvm2 [297.59 GiB / 16.00 GiB free]
PV /dev/sdb lvm2 [128.00 GiB]
Total: 11 [6.29 TiB] / in use: 10 [6.17 TiB] / in no VG: 1 [128.00 GiB]
Thursday, August 2, 2012
London's bridge is falling down...
Friday, June 22, 2012
SMS notification from "The Dude" monitoring system
At my server farm I currently use a perfect (and free) monitoring system named "The Dude". I think it is wonderful software because it is easy to use, easy to set up, works perfectly on a Linux host under Wine, and has a package (npk) for Mikrotik RouterOS.
Saturday, April 21, 2012
Tuesday, April 17, 2012
Friday, April 13, 2012
Disappointment
So, finally I must agree with El Di Pablo that GlusterFS is Not Ready For HA SAN Storage, because my HA storage with the last config could not pass my failover test.
It was better than the first time: after rebooting the first node, the iSCSI target was still available and the VM kept writing data. But after the first node came back up and I rebooted the second node, the iSCSI target was lost, and in the end all data on the storage was corrupted. OMG!
Well, now I have decided to leave GlusterFS and use the classical solution: Ubuntu + DRBD (master/slave) + IET + Heartbeat. Yes, I know about the possible performance problems. But for now I think it is the only suitable HA solution (available without payment) for production use.
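Just to fix the idea, here is a minimal sketch of the DRBD and IET part of that stack. The resource name, hostnames, backing partition, port and IQN below are placeholders of mine, not taken from a real setup; only the addresses are the storage network I already use.
# /etc/drbd.d/iscsi0.res -- minimal two-node master/slave resource (DRBD 8.3 syntax)
resource iscsi0 {
  protocol C;                   # synchronous replication
  on nas01-node01 {
    device    /dev/drbd0;
    disk      /dev/sdb1;        # backing partition (placeholder)
    address   172.16.0.1:7788;  # storage network
    meta-disk internal;
  }
  on nas01-node02 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   172.16.0.2:7788;
    meta-disk internal;
  }
}
# /etc/ietd.conf -- IET exports the DRBD device on whichever node is currently primary
Target iqn.2012-04.nas01:iscsi.drbd0
    Lun 0 Path=/dev/drbd0,Type=blockio
Heartbeat then only has to promote DRBD to primary on the surviving node, bring up the service IP and start ietd there.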
Thursday, April 5, 2012
Test#2
Today I deployed the updated configuration to the nodes of my HA storage, created a virtual machine, and started installing Ubuntu on it. During the installation I rebooted the active node. As a result, the PM node reported:
root@cl02-n01:~# pvscan
/dev/NASLUN0VG0/vm-100-disk-1: read failed after 0 of 4096 at 34359672832: Input/output error
/dev/NASLUN0VG0/vm-100-disk-1: read failed after 0 of 4096 at 34359730176: Input/output error
/dev/NASLUN0VG0/vm-100-disk-1: read failed after 0 of 4096 at 0: Input/output error
/dev/NASLUN0VG0/vm-100-disk-1: read failed after 0 of 4096 at 4096: Input/output error
PV /dev/sdb VG NASLUN0VG0 lvm2 [4.00 TiB / 3.97 TiB free]
PV /dev/sda2 VG pve lvm2 [297.59 GiB / 16.00 GiB free]
Total: 2 [4.29 TiB] / in use: 2 [4.29 TiB] / in no VG: 0 [0 ]
And the VM hung up.
So, after the rebooted node came back online, I rebooted the second one. On the PM node I got:
root@cl02-n01:~# pvscan
Found duplicate PV 4hKm0l1uebbcn5s3eV3ZUT6nb6exKjOD: using /dev/sdc not /dev/sdb
PV /dev/sdc VG NASLUN0VG0 lvm2 [4.00 TiB / 3.97 TiB free]
PV /dev/sda2 VG pve lvm2 [297.59 GiB / 16.00 GiB free]
Total: 2 [4.29 TiB] / in use: 2 [4.29 TiB] / in no VG: 0 [0 ]
It seems /dev/NASLUN0VG0/vm-100-disk-1 is available again, but the VM is still hung. The PM GUI shows for my VM:
Status: running
CPU usage: 50% of 2 CPUs (I use 2 CPUs in the config)
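For completeness, this is roughly how I check the same thing from the PM node's shell (standard Proxmox and LVM commands, with the VM id and VG name from above):
qm status 100         # Proxmox's view of the VM state
lvs NASLUN0VG0        # are the logical volumes on the iSCSI-backed VG still visible?
dmesg | tail -n 20    # any fresh I/O errors on the initiator side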
P.S. On the active storage node, dmesg shows:
...
[ 3984.639799] iscsi_trgt: scsi_cmnd_start(972) e000080 25
[ 3984.665551] iscsi_trgt: cmnd_skip_pdu(475) e000080 1c 25 0
[ 3984.690853] iscsi_trgt: scsi_cmnd_start(972) f000080 25
[ 3984.716664] iscsi_trgt: cmnd_skip_pdu(475) f000080 1c 25 0
[ 3984.741993] iscsi_trgt: scsi_cmnd_start(972) 10000080 25
[ 3984.767753] iscsi_trgt: cmnd_skip_pdu(475) 10000080 1c 25 0
[ 3984.793068] iscsi_trgt: scsi_cmnd_start(972) 11000080 1a
[ 3984.818886] iscsi_trgt: cmnd_skip_pdu(475) 11000080 1c 1a 0
[ 3984.844352] iscsi_trgt: scsi_cmnd_start(972) 12000080 1a
[ 3984.870156] iscsi_trgt: cmnd_skip_pdu(475) 12000080 1c 1a 0
[ 3984.895546] iscsi_trgt: scsi_cmnd_start(972) 13000080 1a
[ 3984.921360] iscsi_trgt: cmnd_skip_pdu(475) 13000080 1c 1a 0
[ 3984.947039] iscsi_trgt: scsi_cmnd_start(972) 14000080 1a
...
and these messages keep coming, more and more.
Wednesday, April 4, 2012
Let's change the GlusterFS config!
The first test showed that rebooting the storage cluster nodes with the current configuration causes data corruption: the virtual machine image ends up corrupted. I think the problem is in the GlusterFS synchronization and self-heal mechanism. My other thought is that the damage may be caused by the use of "thin provisioning"; I'll check that option if the new configuration does not work correctly.
Now I will change the setup of my GlusterFS server and client to use the AFR translator:
Server config on NAS01-NODE01 (172.16.0.1):
##############################################
### GlusterFS Server Volume Specification ##
##############################################
# dataspace on node1
volume gfs-ds
type storage/posix
option directory /data
end-volume
# posix locks
volume gfs-ds-locks
type features/posix-locks
subvolumes gfs-ds
end-volume
# dataspace on node2
volume gfs-node2-ds
type protocol/client
option transport-type tcp/client
option remote-host 172.16.0.2 # storage network
option remote-subvolume gfs-ds-locks
option transport-timeout 10 # value in seconds; it should be set relatively low
end-volume
# automatic file replication translator for dataspace
volume gfs-ds-afr
type cluster/afr
subvolumes gfs-ds-locks gfs-node2-ds # local and remote dataspaces
end-volume
# the actual exported volume
volume gfs
type performance/io-threads
option thread-count 8
option cache-size 64MB
subvolumes gfs-ds-afr
end-volume
# finally, the server declaration
volume server
type protocol/server
option transport-type tcp/server
subvolumes gfs
# storage network access only
option auth.ip.gfs-ds-locks.allow 172.16.0.*,127.0.0.1
option auth.ip.gfs.allow 172.16.0.*
end-volume
Server config on NAS01-NODE02 (172.16.0.2):
##############################################
### GlusterFS Server Volume Specification ##
##############################################
# dataspace on node2
volume gfs-ds
type storage/posix
option directory /data
end-volume
# posix locks
volume gfs-ds-locks
type features/posix-locks
subvolumes gfs-ds
end-volume
# dataspace on node1
volume gfs-node1-ds
type protocol/client
option transport-type tcp/client
option remote-host 172.16.0.1 # storage network
option remote-subvolume gfs-ds-locks
option transport-timeout 10 # value in seconds; it should be set relatively low
end-volume
# automatic file replication translator for dataspace
volume gfs-ds-afr
type cluster/afr
subvolumes gfs-ds-locks gfs-node1-ds # local and remote dataspaces
end-volume
# the actual exported volume
volume gfs
type performance/io-threads
option thread-count 8
option cache-size 64MB
subvolumes gfs-ds-afr
end-volume
# finally, the server declaration
volume server
type protocol/server
option transport-type tcp/server
subvolumes gfs
# storage network access only
option auth.ip.gfs-ds-locks.allow 172.16.0.*,127.0.0.1
option auth.ip.gfs.allow 172.16.0.*
end-volume
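To bring the servers up with these volfiles, I start glusterfsd on each storage node roughly like this (the volfile and log paths are just where I keep them, adjust to taste):
glusterfsd -f /etc/glusterfs/glusterfsd-server.vol -l /var/log/glusterfs/glusterfsd.log
# quick check that the daemon is running and listening
netstat -tlnp | grep gluster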
Client config on both nodes:
#############################################
## GlusterFS Client Volume Specification ##
#############################################
# the exported volume to mount # required!
volume cluster
type protocol/client
option transport-type tcp/client
option remote-host 172.16.0.1 # use 172.16.0.2 on node2
option remote-subvolume gfs # exported volume
option transport-timeout 10 # value in seconds, should be relatively low
end-volume
# performance block for cluster # optional!
volume writeback
type performance/write-behind
option aggregate-size 131072
subvolumes cluster
end-volume
# performance block for cluster # optional!
volume readahead
type performance/read-ahead
option page-size 65536
option page-count 16
subvolumes writeback
end-volume
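Mounting the client volume on both Proxmox nodes is then just (the mount point and volfile path are my own choice):
mkdir -p /mnt/glusterfs
# either with the glusterfs binary directly...
glusterfs -f /etc/glusterfs/glusterfs-client.vol /mnt/glusterfs
# ...or through mount(8), if mount.glusterfs is installed:
# mount -t glusterfs /etc/glusterfs/glusterfs-client.vol /mnt/glusterfs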
Monday, April 2, 2012
Meanwhile, Cluster Storage
Interesting dmesg:
[ 3773.406686] iscsi_trgt: Abort Task (01) issued on tid:1 lun:1 by sid:844424967684608 (Function Complete)
[ 3801.645553] iscsi_trgt: cmnd_rx_start(1863) 1 4b000010 -7
[ 3801.645861] iscsi_trgt: cmnd_skip_pdu(475) 4b000010 1 28 0
[ 3863.275926] iscsi_trgt: Abort Task (01) issued on tid:1 lun:1 by sid:844424967684608 (Function Complete)
[ 3863.276207] iscsi_trgt: Abort Task (01) issued on tid:1 lun:1 by sid:844424967684608 (Function Complete)
[ 3873.259744] iscsi_trgt: Abort Task (01) issued on tid:1 lun:1 by sid:844424967684608 (Function Complete)
[ 3873.260209] iscsi_trgt: Logical Unit Reset (05) issued on tid:1 lun:1 by sid:844424967684608 (Function Complete)
[ 3883.243507] iscsi_trgt: Abort Task (01) issued on tid:1 lun:1 by sid:844424967684608 (Function Complete)
[ 3883.243788] iscsi_trgt: Target Warm Reset (06) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 3893.227235] iscsi_trgt: Abort Task (01) issued on tid:1 lun:1 by sid:844424967684608 (Function Complete)
[ 4019.005333] iscsi_trgt: Abort Task (01) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4019.005688] iscsi_trgt: Abort Task (01) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4028.989010] iscsi_trgt: Abort Task (01) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4028.989375] iscsi_trgt: Logical Unit Reset (05) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4038.972785] iscsi_trgt: Abort Task (01) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4038.973091] iscsi_trgt: Target Warm Reset (06) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4048.956574] iscsi_trgt: Abort Task (01) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4090.888634] iscsi_trgt: Abort Task (01) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4090.888985] iscsi_trgt: Logical Unit Reset (05) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4090.889242] iscsi_trgt: Target Warm Reset (06) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4090.889575] iscsi_trgt: scsi_cmnd_start(972) 16000010 0
[ 4090.897476] iscsi_trgt: cmnd_skip_pdu(475) 16000010 1c 0 0
[ 4100.873321] iscsi_trgt: Abort Task (01) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4100.873629] iscsi_trgt: Logical Unit Reset (05) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4100.873892] iscsi_trgt: Target Warm Reset (06) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4142.810375] iscsi_trgt: Abort Task (01) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4142.810768] iscsi_trgt: Logical Unit Reset (05) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4142.811135] iscsi_trgt: Target Warm Reset (06) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4142.811485] iscsi_trgt: scsi_cmnd_start(972) 56000010 0
[ 4142.819205] iscsi_trgt: cmnd_skip_pdu(475) 56000010 1c 0 0
[ 4152.795191] iscsi_trgt: Abort Task (01) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4152.795542] iscsi_trgt: Logical Unit Reset (05) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4152.795799] iscsi_trgt: Target Warm Reset (06) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4194.738037] iscsi_trgt: Abort Task (01) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4194.738300] iscsi_trgt: Logical Unit Reset (05) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4194.738583] iscsi_trgt: Target Warm Reset (06) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4194.738789] iscsi_trgt: scsi_cmnd_start(972) 68000010 0
[ 4194.746329] iscsi_trgt: cmnd_skip_pdu(475) 68000010 1c 0 0
[ 4204.721919] iscsi_trgt: Abort Task (01) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4204.722125] iscsi_trgt: Logical Unit Reset (05) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4204.722380] iscsi_trgt: Target Warm Reset (06) issued on tid:1 lun:0 by sid:844424967684608 (Function Complete)
[ 4245.655427] iscsi_trgt: nop_out_start(907) ignore this request 69000010
[ 4245.663146] iscsi_trgt: cmnd_rx_start(1863) 0 69000010 -7
[ 4245.670986] iscsi_trgt: cmnd_skip_pdu(475) 69000010 0 0 0
[ 4246.653815] iscsi_trgt: Abort Task (01) issued on tid:1 lun:2 by sid:844424967684608 (Unknown LUN)
[ 4246.653981] iscsi_trgt: cmnd_rx_start(1863) 2 4b000010 -7
[ 4246.661635] iscsi_trgt: cmnd_skip_pdu(475) 4b000010 2 0 0
Addendum to test #1
Finally, I do:
root@cl02-n01:~# qm start 100
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 3221159936: Input/output error
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 3221217280: Input/output error
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 0: Input/output error
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 4096: Input/output error
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 3221159936: Input/output error
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 3221217280: Input/output error
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 0: Input/output error
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 4096: Input/output error
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 3221159936: Input/output error
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 3221217280: Input/output error
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 0: Input/output error
/dev/NAS01LUN1VG0/vm-100-disk-1: read failed after 0 of 4096 at 4096: Input/output error
Volume group "NAS01LUN1VG0" not found
can't activate LV 'NAS01-LUN1:vm-100-disk-1': Skipping volume group NAS01LUN1VG0
Hehe, now "content" via PM GUI for both NAS01-PVELUNS and NAS01-LUN1 is empty.
I successfully broke all stuff.
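For the record, once the target is reachable again, something along these lines is what I would try before rebuilding anything (standard open-iscsi and LVM commands; no promise it saves the data):
iscsiadm -m session --rescan    # re-read the LUNs from the target
pvscan                          # does LVM see the PV again?
vgchange -ay NAS01LUN1VG0       # try to re-activate the volume group
lvs NAS01LUN1VG0                # is vm-100-disk-1 back?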
Preparing for testing
The cluster is ready to use.
Now I need to add my "HA-cluster-Storage" as a storage for my virtual machines:
Go to "Datacenter"-"Storage"-""Add" - iSCSITarget
Fill
ID:NAS01-PVELUNS
Portal : 172.16.70.2
Target: iqn.2012-03.nas01:iscsi.PVELUNS
Uncheck "Use LUNs directly"
And then, "Add" - LVM group
Fill in:
ID:NAS01-LUN1
Base storage: NAS01-PVELUNS
Base volume: CH 00 ID 0 LUN 1
Volume group: NAS01LUN1VG0
Check "Shared"
Ok, now we can see:
Now let's look at node1:
root@cl02-n01:~# pvdisplay
Found duplicate PV MiXXJdMcRElPXQPEtzc6pPFAQLhQn0lC: using /dev/sde not /dev/sdd
--- Physical volume ---
PV Name /dev/sde
VG Name NAS01LUN1VG0
PV Size 4.00 TiB / not usable 4.00 MiB
Allocatable yes
PE Size 4.00 MiB
Total PE 1048575
Free PE 1047807
Allocated PE 768
PV UUID MiXXJd-McRE-lPXQ-PEtz-c6pP-FAQL-hQn0lC
--- Physical volume ---
PV Name /dev/sda2
VG Name pve
PV Size 297.59 GiB / not usable 0
Allocatable yes
PE Size 4.00 MiB
Total PE 76183
Free PE 4095
Allocated PE 72088
PV UUID mMvpci-6zko-K3uS-VOG1-GO4Z-Nsdj-t9WbtS
What does "Found duplicate PV MiXXJdMcRElPXQPEtzc6pPFAQLhQn0lC: using /dev/sde not /dev/sdd" mean? And why this appear?
I will think about it...
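If I want to dig into it later, this is roughly how to see which block devices present the same PV signature (device names are the ones from the pvdisplay output above):
pvs -o pv_name,pv_uuid,vg_name              # LVM prints the duplicate warning here as well
ls -l /dev/disk/by-id/ | grep -E 'sd[de]'   # are sdd and sde two paths to the same LUN?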
Saturday, March 31, 2012
Creating PM 2.0 Cluster
So, today I installed my fresh new Proxmox Virtual Environment.
The first problem occurred with the GUI.
I asked on the forum: http://forum.proxmox.com/threads/8949-GUI-login-problem
Thinking....
Proxmox VE 2.0 final release !!!
That's perfect!
11 hours ago the Proxmox VE 2.0 final release came out!
First of all, I will make a USB stick for installation on my four ready-to-use nodes.
Friday, March 30, 2012
Project: HA Storage cluster for Proxmox 2.0
Hi all!
For the last few months I have been waiting for the stable release of Proxmox 2.0. I hope it will happen soon. And for my new cluster I have been searching for a suitable "rock-solid" (hehe, "dream-come-true") HA solution.
Following a post by El Di Pablo (http://www.bauer-power.net/2011/08/roll-your-own-fail-over-san-cluster.html), I tried to build such a cluster and test it.
So here it is:
But later El Di Pablo told a horrible story about the failure of his cluster:
http://www.bauer-power.net/2012/03/glusterfs-is-not-ready-for-san-storage.html
Anyway, I want to check this for myself and make this project a really useful and easy-to-use solution.
In this blog I will try to report what I have done and what is happening with my cluster.