[tech] Temperature Monitoring in Server Room [repost]

Tue Mar 19 12:14:56 AWST 2019

Hi Dylan,

It depends on:

* Whether the sector failure rate is increasing, particularly if the increase is marked (that indicates a drive near end of life).

* How critical the data on the drive is. 

Generally, once sectors start to fail, the rest of the flash on the drive has aged to the point that more and more sectors will fail over time. The drive is certainly still usable for non-critical applications on a desktop machine, with data that isn't financial, health related, or otherwise mission critical (meaning that changes to the data could either get you sued for millions of dollars or endanger human life).
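
To make "is the failure rate increasing" concrete, here is a minimal Python sketch. It assumes you log (power-on hours, reallocated sector raw count) pairs from time to time; the function names and the sample readings are hypothetical, and comparing only the last two intervals is an arbitrary simplification rather than an established rule.

```python
from typing import List, Tuple

Sample = Tuple[int, int]  # (power_on_hours, reallocated_sector_raw_count)

def reallocation_rates(samples: List[Sample]) -> List[float]:
    """Reallocated sectors per 1000 power-on hours between consecutive samples."""
    rates = []
    for (h0, c0), (h1, c1) in zip(samples, samples[1:]):
        hours = h1 - h0
        if hours <= 0:
            continue  # skip duplicate or out-of-order readings
        rates.append((c1 - c0) * 1000 / hours)
    return rates

def is_accelerating(samples: List[Sample]) -> bool:
    """True if the most recent interval failed faster than the one before it."""
    rates = reallocation_rates(samples)
    return len(rates) >= 2 and rates[-1] > 0 and rates[-1] > rates[-2]

# Hypothetical readings, not taken from any real drive.
steady = [(1000, 2), (2000, 3), (3000, 4)]
worsening = [(1000, 2), (2000, 3), (3000, 9)]

print(is_accelerating(steady))     # False
print(is_accelerating(worsening))  # True
```

A marked jump like the second series is the "near end of life" pattern from the first bullet point above.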

The expectation is that a server SSD, which may be storing financial or critical data, should look like this:

  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       15708

  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       15708

That data is from two mirrored SSDs on Nyx, one of my production servers. Each drive has been running for a total of about 15,700 hours, or just under two years of 24/7 operation.
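
For reference, the hours-to-years arithmetic as a small Python sketch (the helper name is made up, and it ignores leap days):

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap days

def power_on_years(hours: int) -> float:
    """Convert a Power_On_Hours raw value to years of continuous running."""
    return hours / HOURS_PER_YEAR

print(round(power_on_years(15708), 2))  # 1.79
```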

If the reallocated sector count changes from zero on either drive, I will require OVH to replace the drive in question, and schedule a 3AM outage while the array rebuilds.
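
A check like that is easy to automate. The following is only a rough Python sketch of one way to do it, not a description of what actually runs on Nyx: it parses attribute rows in the layout shown above and flags a non-zero Reallocated_Sector_Ct raw value. The class and function names are hypothetical, and it assumes the ten-column attribute table that smartctl prints for ATA devices.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    value: int   # normalised value (higher is healthier)
    worst: int
    thresh: int
    raw: str     # kept as text; some vendors append extra fields

def parse_attributes(text: str) -> Dict[str, SmartAttribute]:
    """Parse attribute rows from smartctl-style output, keyed by attribute name."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = SmartAttribute(
            attr_id=int(fields[0]), name=fields[1],
            value=int(fields[3]), worst=int(fields[4]), thresh=int(fields[5]),
            raw=" ".join(fields[9:]),
        )
    return attrs

def needs_replacement(attrs: Dict[str, SmartAttribute]) -> bool:
    """Strict server policy from this email: any reallocated sector means replace."""
    attr = attrs.get("Reallocated_Sector_Ct")
    return attr is not None and int(attr.raw.split()[0]) != 0

SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       15708
"""

attrs = parse_attributes(SAMPLE)
print(needs_replacement(attrs))  # False
```

The input text could come from running `smartctl -A` against each device and feeding the output to `parse_attributes`; how the alert is then delivered (email, SMS) is left out here.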

And from my 7-year-old 2012 MacBook Pro:

  5 Reallocated_Sector_Ct   0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18196

That is in spite of this wear levelling count:

173 Wear_Leveling_Count     0x0032   191   191   100    Old_age   Always       -       304961552547

If you have a gaming PC, and you keep daily backups of your uni work, you definitely can keep using an SSD once it starts showing wear, although I'd begin to budget for its replacement. There may also be an argument for selling it on eBay (with a warning about its reallocated sectors, if you are honest).

If it's at UCC, then... I wouldn't allow it on Ashera, but the rest of the machines are up to Wheel discretion.

If anyone's life or livelihood depends on the data not being corrupted, please Goddess no. Where the data on the drive is worth 100x or more what the drive itself is worth, you don't want to take that risk.

Regards,

Melissa

> On 18 Mar 2019, at 4:59 pm, Dylan H <dylanh333 at ucc.asn.au> wrote:
> 
> Hi Melissa,
> 
> Can you please clarify whether it really is an issue to have any reallocated sectors at all on an SSD, even if only very few?
> 
> My understanding is that the Raw Value is the actual number of sectors reallocated, and the Value is the normalised value, which should not reach the threshold (lower is worse, so a Value of 98 is better than 20).
> Assuming this, I don't imagine a small handful of reallocated sectors is much of an issue, unless the normalised value is getting close to the threshold (0 in this case), as most SSDs come with a sizeable chunk of spare, hidden sectors to swap in, as far as I am aware.
> 
> My main concern is that replacing an SSD that only has a small handful of reallocated sectors but a normalised value of 90+ (for example) would be a bit wasteful, when it still has 90% to go before reaching the vendor-defined threshold.
> 
> Thanks! 
> 
> Kind regards,
> Dylan Hicks [333]
> 
> 
> On 18 Mar 2019 3:13 pm, Melissa Star <melissa at netexperts.com.au> wrote:
> Hi Everyone,
> 
> I just realised - if you have smartmontools installed on Linux machines, each hard drive or SSD will report its “Airflow Temperature”, which I can extract via script.
> 
> I'm thinking of centralising this for all the servers I run, and collecting the data to chart, having a display at home that gives me live info for all machines under my control.
> 
> I could make a similar display for UCC, which could be on the website and/or a monitor in the club room (although this would likely be in the winter holidays due to increasing workload).
> 
> Note the reallocated sector count for SSDs: once this starts to increase, the drive should be replaced.
> 
> For SSDs (and also HDDs) mounted at the front of servers, the airflow over the sensor is sucked in directly from ambient air and the drives are thermally insulated from the rest of the machine, so this reading will be equal to the temperature of the room.
> 
> For example, right now, the UCC server room temperature is 29 degrees according to three of the four installed drives, and 30 degrees according to the fourth.
> 
> For PCs, the same test will provide the temperature in the case. Some drives also have a count of total hours run outside of their acceptable temperature range and G/shocks or drops, as well as all types of other interesting data.
> 
> If there is an interest, I could parse this data, and the page with Ashera-related information could provide it and could also e-mail (and/or SMS) warnings to anyone on the list if the temperature passes a key threshold.
> 
> Here is what the data actually looks like (I've highlighted the airflow temperature):
> 
> smartctl -d sat -a /dev/pass1
> smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-STABLE amd64] (local build)
> Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Device Model:     Samsung SSD 860 QVO 1TB
> Serial Number:    S4CZNG0M138175F
> LU WWN Device Id: 5 002538 e701b1df5
> Firmware Version: RVQ01B6Q
> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
> Sector Size:      512 bytes logical/physical
> Rotation Rate:    Solid State Device
> Form Factor:      2.5 inches
> Device is:        Not in smartctl database [for details use: -P showall]
> ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
> SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Mon Mar 18 15:03:46 2019 AWST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> ... (cut to prevent this email becoming ridiculous) ...
> 
> SMART Attributes Data Structure revision number: 1
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       648
>  12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       15
> 177 Wear_Leveling_Count     0x0013   100   100   000    Pre-fail  Always       -       0
> 179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
> 181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
> 182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
> 183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0032   071   058   000    Old_age   Always       -       29
> 195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
> 199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
> 235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       13
> 241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       336661820
> 
> SMART Error Log Version: 1
> No Errors Logged
> 
> Regards,
> 
> Melissa
> 
