Dear all, I have an SC2600CP board in a server with 2 Xeon CPUs and 196GB of RAM.
This machine is used as a calculation node in a cluster environment, alongside other machines that have almost the same configuration.
A few days ago it started rebooting for no apparent reason.
To try to identify the problem, I checked all the DIMM slots and all the memory modules, looking for a faulty one. I tested all of them but could not find any errors.
Then I checked the SEL logs:
1 | 09/17/2017 | 16:38:10 | Event Logging Disabled #0x07 | Log area reset/cleared | Asserted
2 | 09/17/2017 | 17:16:55 | Power Unit #0x01 | Failure detected | Asserted
3 | 09/17/2017 | 17:16:56 | Power Unit #0x01 | Power off/down | Asserted
4 | 09/17/2017 | 17:17:01 | Power Unit #0x01 | Power off/down | Deasserted
5 | 09/17/2017 | 17:17:01 | Power Unit #0x01 | Failure detected | Deasserted
6 | 09/17/2017 | 17:17:02 | Power Unit #0x01 | Power off/down | Asserted
7 | 09/17/2017 | 17:17:07 | Power Unit #0x01 | Power off/down | Deasserted
8 | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Non-critical going low | Deasserted
9 | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Critical going low | Deasserted
a | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Non-critical going low | Deasserted
b | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Critical going low | Deasserted
c | 09/17/2017 | 17:17:24 | Fan #0x32 | Lower Non-critical going low | Asserted
d | 09/17/2017 | 17:17:24 | Fan #0x32 | Lower Critical going low | Asserted
e | 09/17/2017 | 17:17:31 | System Event #0x83 | Timestamp Clock Sync | Asserted
f | 09/17/2017 | 17:17:32 | System Event #0x83 | Timestamp Clock Sync | Asserted
10 | 09/17/2017 | 17:17:55 | System Event #0x83 | OEM System boot event | Asserted
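To make the listing above easier to read, here is a minimal Python sketch that parses it and reports which sensor/event pairs end the excerpt in the Asserted state. It assumes the six-column pipe-separated layout exactly as pasted; the names `parse_sel` and `still_asserted` are only illustrative and are not part of any IPMI tool. One-shot events such as the boot event or the clock sync never log a Deasserted entry, so they will always show up in the result.

```python
from datetime import datetime

# The SEL excerpt exactly as pasted above.
SEL_TEXT = """\
1 | 09/17/2017 | 16:38:10 | Event Logging Disabled #0x07 | Log area reset/cleared | Asserted
2 | 09/17/2017 | 17:16:55 | Power Unit #0x01 | Failure detected | Asserted
3 | 09/17/2017 | 17:16:56 | Power Unit #0x01 | Power off/down | Asserted
4 | 09/17/2017 | 17:17:01 | Power Unit #0x01 | Power off/down | Deasserted
5 | 09/17/2017 | 17:17:01 | Power Unit #0x01 | Failure detected | Deasserted
6 | 09/17/2017 | 17:17:02 | Power Unit #0x01 | Power off/down | Asserted
7 | 09/17/2017 | 17:17:07 | Power Unit #0x01 | Power off/down | Deasserted
8 | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Non-critical going low | Deasserted
9 | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Critical going low | Deasserted
a | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Non-critical going low | Deasserted
b | 09/17/2017 | 17:17:13 | Fan #0x32 | Lower Critical going low | Deasserted
c | 09/17/2017 | 17:17:24 | Fan #0x32 | Lower Non-critical going low | Asserted
d | 09/17/2017 | 17:17:24 | Fan #0x32 | Lower Critical going low | Asserted
e | 09/17/2017 | 17:17:31 | System Event #0x83 | Timestamp Clock Sync | Asserted
f | 09/17/2017 | 17:17:32 | System Event #0x83 | Timestamp Clock Sync | Asserted
10 | 09/17/2017 | 17:17:55 | System Event #0x83 | OEM System boot event | Asserted
"""


def parse_sel(text):
    """Parse 'id | date | time | sensor | event | state' lines into dicts."""
    records = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6:
            continue  # skip anything that is not a six-column SEL line
        rec_id, date, time, sensor, event, state = fields
        records.append({
            "id": int(rec_id, 16),  # record IDs in this listing are hexadecimal
            "when": datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S"),
            "sensor": sensor,
            "event": event,
            "state": state,
        })
    return records


def still_asserted(records):
    """Return the (sensor, event) pairs whose most recent entry is 'Asserted'."""
    last_state = {}
    for rec in sorted(records, key=lambda r: r["id"]):
        last_state[(rec["sensor"], rec["event"])] = rec["state"]
    return sorted(key for key, state in last_state.items() if state == "Asserted")


for sensor, event in still_asserted(parse_sel(SEL_TEXT)):
    print(f"{sensor}: {event}")
```

On this excerpt, both Power Unit events end up Deasserted, while the two Fan #0x32 lower-threshold events are Asserted at 17:17:24 and have no later Deasserted entry. Whether that is just the fan spinning up during power-on or something related to the amber LED cannot be decided from the log alone.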
and on the BMC web console:
30 | 09/17/2017 17:39:32 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Asserted |
29 | 09/17/2017 17:37:19 | BIOS Evt Sensor | System Event | reports OEM System Boot Event - Asserted |
28 | 09/17/2017 17:36:56 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted |
27 | 09/17/2017 17:36:56 | BIOS Evt Sensor | System Event | reports Timestamp Clock Sync. Event is one of two expected events from BIOS on every power on. - Asserted |
26 | 09/17/2017 17:36:49 | System Fan 3 | Fan | reports the sensor is in a low, critical, and going lower state - Asserted |
25 | 09/17/2017 17:36:49 | System Fan 3 | Fan | reports the sensor is in a low, but non-critical, and going lower state - Asserted |
24 | 09/17/2017 17:36:36 | System Fan 3 | Fan | reports the sensor is in a low, critical, and going lower state - Deasserted |
23 | 09/17/2017 17:36:36 | System Fan 3 | Fan | reports the sensor is in a low, but non-critical, and going lower state - Deasserted |
22 | 09/17/2017 17:36:34 | System Fan 3 | Fan | reports the sensor is in a low, critical, and going lower state - Deasserted |
21 | 09/17/2017 17:36:34 | System Fan 3 | Fan | reports the sensor is in a low, but non-critical, and going lower state - Deasserted |
20 | 09/17/2017 17:36:31 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Deasserted |
19 | 09/17/2017 17:36:26 | Pwr Unit Status | Power Unit | reports the power unit has suffered a failure - Deasserted |
18 | 09/17/2017 17:36:20 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Asserted |
17 | 09/17/2017 17:36:20 | Pwr Unit Status | Power Unit | reports the power unit has suffered a failure - Asserted |
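The web console entries can also be used to measure how long the system stayed up after a boot event before the next power-off was logged. The sketch below is only an illustration: it assumes the five-column layout as pasted (with the state after the final " - "), uses four of the lines above as sample input, and the names `parse_console` and `uptime_after_boot` are made up for this example.

```python
from datetime import datetime

# Four entries copied from the web console listing above.
CONSOLE_TEXT = """\
30 | 09/17/2017 17:39:32 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Asserted |
29 | 09/17/2017 17:37:19 | BIOS Evt Sensor | System Event | reports OEM System Boot Event - Asserted |
18 | 09/17/2017 17:36:20 | Pwr Unit Status | Power Unit | reports the power unit is powered off or being powered down - Asserted |
17 | 09/17/2017 17:36:20 | Pwr Unit Status | Power Unit | reports the power unit has suffered a failure - Asserted |
"""


def parse_console(text):
    """Parse 'id | timestamp | sensor name | type | message - state |' lines."""
    records = []
    for line in text.splitlines():
        # Drop the trailing pipe, then split into the five columns.
        fields = [f.strip() for f in line.strip().strip("|").split("|")]
        if len(fields) != 5:
            continue
        rec_id, stamp, name, kind, message = fields
        description, _, state = message.rpartition(" - ")
        records.append({
            "id": int(rec_id),  # the web console numbers entries in decimal
            "when": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
            "name": name,
            "type": kind,
            "description": description,
            "state": state,
        })
    # The console lists newest first; return oldest first.
    return sorted(records, key=lambda r: r["id"])


def uptime_after_boot(records):
    """Seconds from each boot event to the next asserted power-off, if any."""
    results = []
    boot_time = None
    for rec in records:
        if rec["state"] != "Asserted":
            continue
        if "System Boot Event" in rec["description"]:
            boot_time = rec["when"]
        elif "powered off" in rec["description"] and boot_time is not None:
            results.append((rec["when"] - boot_time).total_seconds())
            boot_time = None
    return results


print(uptime_after_boot(parse_console(CONSOLE_TEXT)))
```

For the entries shown, the boot event at 17:37:19 is followed by a power-off at 17:39:32, that is 133 seconds later.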
The power unit "Failure detected" event does not seem to be the root cause, since I have replaced the power supply and the problem remains.
All the sensors, fans, etc. read OK and show no problems, but the fault LED keeps blinking amber.
No errors are reported in the BIOS, only in the SEL.
I have downloaded the debug logs, but I could not inspect them because they are password protected.
The board information is as follows:
Manufacturing Date: 2012-10-09 03:53
Manufacturer: Intel Corporation
Product Name: S2600CO
Serial Number: QSCO22700376
Part/Model Number: G29920-205
FRU File ID: FRU Ver 1.00
If anyone could help me, it would be great.
Best regards,