Why did my hard disk fail or crash so fast & for no apparent reason?
It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives and the key factors that affect their lifetime. Most available data are based either on extrapolation from accelerated aging experiments or on relatively modest-sized field studies.
Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis.
Customers replace disk drives at rates far higher than those suggested by the estimated mean time between failures (MTBF) supplied by drive vendors, according to a study of about 100,000 drives conducted by Carnegie Mellon University.
Hard Disk Failure
The Carnegie Mellon study examined large production systems, including high-performance computing sites and Internet services sites running SCSI, FC, and SATA drives. The datasheets for those drives listed MTBF between 1 million and 1.5 million hours, which the study said should mean annual failure rates “of at most 0.88%.” However, the study showed typical annual replacement rates of between 2% and 4%, “and up to 13% observed on some systems.”
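The 0.88% figure follows from simple arithmetic: a drive that is powered on all year runs 8,760 hours, so a datasheet MTBF implies an annual failure rate of roughly 8,760 divided by the MTBF. Here is a minimal Python sketch of that conversion. It assumes continuous operation and a constant failure rate, and the function names are my own for illustration, not taken from the study.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the drive is powered on all year


def afr_linear(mtbf_hours: float) -> float:
    """Approximate annual failure rate as hours-per-year divided by MTBF."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_exponential(mtbf_hours: float) -> float:
    """Annual failure probability under a constant-failure-rate (exponential) model."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def implied_mtbf(annual_rate: float) -> float:
    """MTBF in hours that would correspond to an observed annual replacement rate."""
    return HOURS_PER_YEAR / annual_rate


if __name__ == "__main__":
    # Datasheet side: what the vendor MTBF figures imply.
    for mtbf in (1_000_000, 1_500_000):
        print(f"MTBF {mtbf:,} h -> annual failure rate about {afr_linear(mtbf):.2%}")

    # Field side: what the observed replacement rates would imply.
    for rate in (0.02, 0.04, 0.13):
        print(f"Replacement rate {rate:.0%} -> implied MTBF about {implied_mtbf(rate):,.0f} h")
```

By this rough estimate, a 2% to 4% annual replacement rate corresponds to an effective MTBF of about 219,000 to 438,000 hours, well short of the 1 million hours or more on the datasheets.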
So what does this mean to you, the consumer who purchases hard drives and computers with hard drives?
I have over 25 years of engineering, manufacturing, and software development experience, so first let’s examine an important aspect of typical manufacturing processes, from automobiles and airplanes to hard drives and smartphones. The typical manufacturer of any end product actually produces only a few of the components that make up that product. It outsources the manufacture, and often the design, of almost all subcomponents, with oversight of the supplier ranging from none at all to extensive specifications, testing, and audits. The supplier picked to supply a component is often the lowest bidder, although some manufacturers choose the best supplier based on value, which is a combination of price, quality, and reliability.
This system of outsourcing is often referred to as a tiered supplier base. A tier-one supplier supplies directly to the manufacturer of the end product. The suppliers to the tier-one supplier are tier-two suppliers, and so it goes down the food chain. Technically, a hard drive manufacturer is itself a tier-one supplier to the computer manufacturer. This system explains why, when the United States Government was wrestling with whether to bail out the US automobile manufacturers, people were quoted as saying that if the manufacturers were allowed to go under, hundreds of thousands of people would lose their jobs. They were referring to the employees of suppliers at every tier.
In a system like this, the quality of the end product is only as good as the weakest link in the supply chain. Most suppliers use very complex and rigid quality control and design methods to ensure the quality of their products, but in the end it still comes down to the potential for human error. Even the most sophisticated lights-out, 24/7, computer-controlled, robotized manufacturing plant in the world is subject to human error. The person programming the robot may not be concentrating on the task, causing the robot to place a microchip a fraction of a micrometer off target on every 100th operation. That is how your hard drive ends up with problems while your co-worker’s identical computer is just fine.
Early failures like this are not uncommon. They are what warranties refer to as “manufacturing defects.” The inside industry term is Infant Mortality Failure (IMF). Warranties have a time limit because they are intended to protect you against IMFs. There are, in fact, different levels of IMFs. Most electronics go through some sort of test often referred to as burn-in, which looks for an immediate failure or a failure in the first few minutes. Failures caught this way are caused by gross manufacturing defects that lead to catastrophic failure almost immediately.
The more bothersome IMFs are the ones that make it all the way to you, the consumer, perform flawlessly for a short period of time, and then bam, the drive is dead. Manufacturers hate these failures because now your opinion of the manufacturer is tarnished. You never knew of the failures during burn-in and were happy not knowing about them, but when your hard drive dies the night before a critical deadline, you go ballistic and demand the world in compensation. The cost of this failure is long-term and higher than the cost of a new hard drive: it may mean a customer lost forever. This is why I will never own another HP computer, even though they may be great computers. I got a bad one, and it soured me on HP for good.
So what can you do to protect yourself?
I personally always do a lot of research before any new electronics purchase. IMF can be a persistent problem with one manufacturer or model until the root cause is found and corrected. It could even be a design flaw rather than a manufacturing problem. I recently purchased a new big-screen HD TV, and I thought I wanted the top-of-the-line Panasonic 3D Plasma until I learned, through reading reviews from several sources, that the 2010 models experienced early (within 3 months) loss of black levels and that not enough information was available to determine whether it was fixed in the 2011 models. So I bought my second choice.
The other, more obvious thing you can do specifically with a computer hard drive is to back up your data or image your entire system. I personally use a product called Acronis True Image. I make a backup image of my entire system and then make incremental backups every night. I have it set to keep the 10 most recent increments, so I can always roll back to a recent earlier version. I back this up to a dedicated 1 TB external hard drive. What if that hard drive fails, you say? Well, the likelihood of your computer hard drive and your external hard drive failing at the same time is remote, but I own my own business, so I have a second external hard drive that I make redundant backups on just to be safe.
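To put a rough number on "remote," here is a back-of-the-envelope Python sketch. It is only an illustration under stated assumptions: the drives fail independently of each other, the annual failure rate is the 4% upper end of the typical replacement rates quoted from the study, and "at the same time" means within the same 7-day window. The function names and the 7-day window are my own choices, not measured values.

```python
def prob_fail_within(days: float, annual_rate: float) -> float:
    """Chance that one drive fails within `days`, spreading the annual rate evenly over the year."""
    return annual_rate * days / 365.0


def prob_all_fail(days: float, annual_rate: float, copies: int) -> float:
    """Chance that all `copies` drives fail in the same window, assuming independent failures."""
    return prob_fail_within(days, annual_rate) ** copies


if __name__ == "__main__":
    ANNUAL_RATE = 0.04  # assumption: upper end of the typical 2% to 4% replacement rate
    WINDOW_DAYS = 7     # assumption: "at the same time" means within the same week

    for copies in (1, 2, 3):
        p = prob_all_fail(WINDOW_DAYS, ANNUAL_RATE, copies)
        print(f"{copies} drive(s) all failing within {WINDOW_DAYS} days: {p:.2e}")
```

Each extra independent copy multiplies the risk by a small fraction, which is why a second backup drive buys so much. The independence assumption is the weak point: a shared cause such as a power surge or a fire can take out every drive in the room at once.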
I would also recommend you get a good-quality surge protector, not the kind you get at Walmart next to the extension cords but a good-quality unit from a retailer like Best Buy or any computer supply retailer. I use a Belkin unit that costs around $40 USD.
Check this if you need some Freeware to Monitor Hard Disk for Potential Failure.
The author of this guest post, Randy L. Miller, is the CEO of Alagad Incorporated.