Hi
I have installed two Intel P4600 NVMe devices in a server running Proxmox 5.1-3. The running kernel is 4.13.13-6-pve. There is no RAID controller involved.
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     BTLE736103CH4P0KGN   INTEL SSDPEDKE040T7                      1           4.00  TB /   4.00  TB    512   B +  0 B   QDV10170
/dev/nvme1n1     BTLE736103AG4P0KGN   INTEL SSDPEDKE040T7                      1           4.00  TB /   4.00  TB    512   B +  0 B   QDV10170
The output of "isdct show -a -intelssd" is attached as "intelssdp4600-2.txt".
Using LVM, I can reliably reproduce a "hang" of 1-2 minutes that does not lead to any fatal error:
vgcreate SSD /dev/nvme0n1 /dev/nvme1n1
lvcreate -l 100%FREE -n SSDVMSTORE01 --stripes 2 --stripesize 128 --type striped SSD
lvremove -d -v SSD/SSDVMSTORE01
Do you really want to remove and DISCARD active logical volume SSD/SSDVMSTORE01? [y/n]: y
Here the command hangs for 1-2 minutes. In the logs I see:
Feb 21 14:14:17 px kernel: [ 3654.745355] nvme nvme0: I/O 200 QID 14 timeout, aborting
Feb 21 14:14:17 px kernel: [ 3654.745772] nvme nvme0: I/O 201 QID 14 timeout, aborting
Feb 21 14:14:17 px kernel: [ 3654.746110] nvme nvme0: I/O 202 QID 14 timeout, aborting
Feb 21 14:14:17 px kernel: [ 3654.746436] nvme nvme0: I/O 203 QID 14 timeout, aborting
Feb 21 14:14:32 px kernel: [ 3669.013614] nvme nvme0: Abort status: 0x0
Feb 21 14:14:32 px kernel: [ 3669.014012] nvme nvme0: Abort status: 0x0
Feb 21 14:14:32 px kernel: [ 3669.014325] nvme nvme0: Abort status: 0x0
Feb 21 14:14:32 px kernel: [ 3669.014629] nvme nvme0: Abort status: 0x0
Feb 21 14:15:10 px kernel: [ 3707.737495] nvme nvme1: I/O 297 QID 14 timeout, aborting
Feb 21 14:15:10 px kernel: [ 3707.737902] nvme nvme1: I/O 298 QID 14 timeout, aborting
Feb 21 14:15:10 px kernel: [ 3707.738231] nvme nvme1: I/O 299 QID 14 timeout, aborting
Feb 21 14:15:10 px kernel: [ 3707.738547] nvme nvme1: I/O 300 QID 14 timeout, aborting
Feb 21 14:15:25 px kernel: [ 3722.005726] nvme nvme1: Abort status: 0x0
Feb 21 14:15:25 px kernel: [ 3722.006113] nvme nvme1: Abort status: 0x0
Feb 21 14:15:25 px kernel: [ 3722.006434] nvme nvme1: Abort status: 0x0
Feb 21 14:15:25 px kernel: [ 3722.006751] nvme nvme1: Abort status: 0x0
After this, the command completes without error.
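The DISCARD in the lvremove prompt presumably comes from issue_discards being enabled in /etc/lvm/lvm.conf, so my guess is that the whole-device Deallocate is simply running into the kernel's default 30-second NVMe I/O timeout. If it helps to narrow things down, the same discard could be issued directly with blkdiscard from util-linux, taking LVM out of the picture entirely (destructive, of course, and assuming it exercises the same Deallocate path as lvremove):
# WARNING: wipes everything on the namespace
time blkdiscard /dev/nvme0n1
time blkdiscard /dev/nvme1n1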
This does not happen with Debian 9.3 (kernel 4.9.x). If I partition each device into a ~2 GB primary partition plus a second partition covering the rest, and run the same LVM operations on /dev/nvmeXn1p1 or p2, the timeout does happen on the second partition but not on the first:
parted -a optimal /dev/nvme0n1 mklabel gpt
parted -a optimal /dev/nvme0n1 mkpart primary 4 2047
parted -a optimal /dev/nvme0n1 mkpart primary 2048 100%
parted -a optimal /dev/nvme1n1 mklabel gpt
parted -a optimal /dev/nvme1n1 mkpart primary 4 2047
parted -a optimal /dev/nvme1n1 mkpart primary 2048 100%
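To check whether it is really the discarded region that matters rather than anything LVM does, timing a direct discard of each partition should show the same asymmetry (again destructive; a sketch, assuming blkdiscard behaves like the discard issued by lvremove):
time blkdiscard /dev/nvme0n1p1   # ~2 GB partition: expected to return quickly
time blkdiscard /dev/nvme0n1p2   # remainder of the 4 TB namespace: expected to hit the timeout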
Any clues?
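In the meantime, would raising the per-command timeout via the nvme_core.io_timeout module parameter (default 30 s) be a sane stopgap? Something like the following, assuming the nvme driver is built into the pve kernel so the parameter has to go on the kernel command line:
# append to the existing GRUB_CMDLINE_LINUX in /etc/default/grub,
# then run update-grub and reboot
GRUB_CMDLINE_LINUX="nvme_core.io_timeout=120"
I realise this would only mask the slow Deallocate rather than fix it.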
Best,
Pierre