[tech] Temperature Monitoring in Server Room [repost]
Melissa Star
melissa at netexperts.com.au
Tue Mar 19 12:14:56 AWST 2019
Hi Dylan,
It depends on:
* Whether the sector failure rate is increasing, particularly if the increase is marked (that indicates a drive near end of life).
* How critical the data on the drive is.
Generally, once sectors start to fail, all of the flash on the drive has aged to the point that more and more sectors will fail over time. The drive is certainly still usable for non-critical applications on a desktop machine, on data that isn't financial, health-related, or otherwise mission critical (meaning changes to the data would either get you sued for millions of dollars or endanger human life).
The expectation is that a server SSD, which may be storing financial or critical data, should look like this:
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 15708
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 15708
That data is from two mirrored SSDs on Nyx, one of my production servers. Each drive has been running for about 15,700 hours, or roughly 1.8 years of 24/7 operation.
If the reallocated sector count on either drive changes from zero, I will require OVH to replace the drive in question, and schedule a 3AM outage while the array rebuilds.
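To make the two ways of reading these lines concrete (my "raw count must stay at zero" rule, versus Dylan's "normalised value versus threshold" reading), here is a minimal Python sketch. It assumes the ten-column attribute layout shown in the smartctl output quoted below; the names SmartAttr, parse_attr, has_reallocations, headroom and hours_to_years are made up for illustration and are not part of smartmontools.

```python
from typing import NamedTuple, Optional


class SmartAttr(NamedTuple):
    """One row of the smartctl attribute table (hypothetical helper type)."""
    attr_id: int
    name: str
    value: int   # normalised value, higher is better
    worst: int
    thresh: int  # vendor-defined failure threshold
    raw: int     # raw value, e.g. the actual number of reallocated sectors


def parse_attr(line: str) -> Optional[SmartAttr]:
    """Parse 'ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW'."""
    parts = line.split()
    if len(parts) < 10 or not parts[0].isdigit():
        return None  # header line, blank line, or something else entirely
    try:
        return SmartAttr(int(parts[0]), parts[1], int(parts[3]),
                         int(parts[4]), int(parts[5]), int(parts[9]))
    except ValueError:
        return None


def has_reallocations(attr: SmartAttr) -> bool:
    """Strict policy: any reallocated sector at all (attribute 5, raw > 0)."""
    return attr.attr_id == 5 and attr.raw > 0


def headroom(attr: SmartAttr) -> int:
    """Lenient reading: how far the normalised value sits above the threshold."""
    return attr.value - attr.thresh


def hours_to_years(hours: int) -> float:
    """Convert power-on hours to years of continuous (24/7) operation."""
    return hours / (24 * 365)


if __name__ == "__main__":
    nyx = parse_attr("5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0")
    print(nyx, has_reallocations(nyx), headroom(nyx))
    print(hours_to_years(15708))
```

On a server I would act on has_reallocations(); on a desktop with good backups, watching headroom() shrink over time is probably enough.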
And from my 7-year-old 2012 MacBook Pro:
5 Reallocated_Sector_Ct 0x0033 100 100 000 Pre-fail Always - 0
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 18196
This is in spite of:
173 Wear_Leveling_Count 0x0032 191 191 100 Old_age Always - 304961552547
If you have a gaming PC, and you keep daily backups of your uni work, you definitely can keep using an SSD once it starts showing wear, although I'd begin to budget for its replacement, and there may be an argument for selling it on eBay (with a warning about its reallocated sectors, if you are honest).
If it's at UCC, then... I wouldn't allow it on Ashera, but the rest of the machines are up to Wheel's discretion.
If anyone's life or livelihood depends on data not being corrupted, please Goddess no. Where the data on the drive is worth 100x or more what the drive is worth, you don't want to take that risk.
Regards,
Melissa
> On 18 Mar 2019, at 4:59 pm, Dylan H <dylanh333 at ucc.asn.au> wrote:
>
> Hi Melissa,
>
> Can you please clarify whether it really is an issue to have any reallocated sectors at all on an SSD, even if only very few?
>
> My understanding is that the Raw Value is the actual number of sectors reallocated, and the Value is the normalised value, which should not reach the threshold (lower is worse, so a Value of 98 is better than 20).
> Assuming this, I don't imagine a small handful of reallocated sectors is much of an issue, unless the normalised value is getting close to the threshold (0 in this case), as most SSDs come with a sizeable chunk of spare, hidden sectors to swap in, as far as I am aware.
>
> My main concern is that replacing an SSD that only has a small handful of reallocated sectors but a normalised value of 90+ (for example) would be a bit wasteful, when it still has 90% to go before reaching the vendor-defined threshold.
>
> Thanks!
>
> Kind regards,
> Dylan Hicks [333]
>
>
> On 18 Mar 2019 3:13 pm, Melissa Star <melissa at netexperts.com.au> wrote:
> Hi Everyone,
>
> I just realised - if you have smartmontools installed on Linux machines, each hard drive or SSD will provide its “Airflow Temperature”, which I can extract via script.
>
> I'm thinking of centralising this for all the servers I run, and collecting the data to chart, having a display at home that gives me live info for all machines under my control.
>
> I could make a similar display for UCC, which could be on the website and/or a monitor in the club room (although this would likely be in the winter holidays due to increasing workload).
>
> Note the reallocated sector count for SSDs: once sectors start being reallocated, the drive should be replaced.
>
> For SSDs (and also HDDs) mounted at the front of servers, the airflow temperature will be equal to the temperature of the room, because the air reaching the sensor is drawn in directly from ambient air and the drives are thermally insulated from the rest of the machine.
>
> For example, right now, the UCC server room temperature is 29 degrees, according to 3 of the four installed drives, and 30 degrees according to the 4th one.
>
> For PCs, the same test will provide the temperature in the case. Some drives also have a count of total hours run outside of their acceptable temperature range and G/shocks or drops, as well as all types of other interesting data.
>
> If there is an interest, I could parse this data, and the page with Ashera-related information could provide it and could also e-mail (and/or SMS) warnings to anyone on the list if the temperature passes a key threshold.
>
> Here is what the data actually looks like (I've highlighted the airflow temperature):
>
> smartctl -d sat -a /dev/pass1
> smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-STABLE amd64] (local build)
> Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org <http://www.smartmontools.org/>
>
> === START OF INFORMATION SECTION ===
> Device Model: Samsung SSD 860 QVO 1TB
> Serial Number: S4CZNG0M138175F
> LU WWN Device Id: 5 002538 e701b1df5
> Firmware Version: RVQ01B6Q
> User Capacity: 1,000,204,886,016 bytes [1.00 TB]
> Sector Size: 512 bytes logical/physical
> Rotation Rate: Solid State Device
> Form Factor: 2.5 inches
> Device is: Not in smartctl database [for details use: -P showall]
> ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
> SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is: Mon Mar 18 15:03:46 2019 AWST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> ... (cut to prevent this email becoming ridiculous) ...
>
> SMART Attributes Data Structure revision number: 1
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
> 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
> 9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 648
> 12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 15
> 177 Wear_Leveling_Count 0x0013 100 100 000 Pre-fail Always - 0
> 179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
> 181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
> 182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
> 183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
> 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
> 190 Airflow_Temperature_Cel 0x0032 071 058 000 Old_age Always - 29
> 195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0
> 199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
> 235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 13
> 241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 336661820
>
> SMART Error Log Version: 1
> No Errors Logged
>
> Regards,
>
> Melissa
>
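For anyone curious what the temperature extraction described in the quoted message might look like, here is a rough Python sketch. It assumes smartctl is on the PATH and that the drive reports attribute 190 (Airflow_Temperature_Cel) in the table format shown above; not every drive does. The function names and the 35 degree warning limit are placeholders I've made up, not agreed values.

```python
import subprocess
from typing import Optional

WARN_CELSIUS = 35  # hypothetical warning limit, not an agreed threshold


def airflow_temperature(smartctl_output: str) -> Optional[int]:
    """Return the raw value of attribute 190 in degrees C, or None if absent."""
    for line in smartctl_output.splitlines():
        parts = line.split()
        if (len(parts) >= 10 and parts[0] == "190"
                and parts[1] == "Airflow_Temperature_Cel"):
            try:
                return int(parts[9])  # first token of the RAW_VALUE column
            except ValueError:
                return None
    return None


def read_drive_temperature(device: str) -> Optional[int]:
    """Run 'smartctl -d sat -a <device>' (as in the quoted output) and parse it."""
    # smartctl's exit status is a bitmask and can be non-zero even on a
    # successful read, so the return code is deliberately not checked here.
    result = subprocess.run(["smartctl", "-d", "sat", "-a", device],
                            capture_output=True, text=True, check=False)
    return airflow_temperature(result.stdout)


def is_too_hot(temp: Optional[int], limit: int = WARN_CELSIUS) -> bool:
    """True only when a temperature was read and it exceeds the limit."""
    return temp is not None and temp > limit


if __name__ == "__main__":
    temp = read_drive_temperature("/dev/pass1")
    print("temperature:", temp, "too hot:", is_too_hot(temp))
```

The parsing is kept separate from the smartctl call so it can be checked against saved output without needing root or a real drive; collection, charting and e-mail/SMS warnings would sit on top of this.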