Joel on Software

A week of Murphy's Law gone wild.

by Joel Spolsky, Saturday, January 25, 2003


Chapter One. The Linux server hosting our CVS repository (all our source code) fails. No big deal, it is automatically mirrored (using rdist) to a remote location. It takes a few hours to compress and transmit the mirrored data. We discover that we forgot the option to rdist that removes deleted files, so the mirror isn't perfect: it includes files that were deleted. These have to be manually removed.
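For anyone facing the same cleanup, here is a minimal sketch of how the leftover files could be listed rather than hunted down by hand. It assumes both the surviving source tree and the mirror are reachable as local directories; the function names `relative_files` and `find_stale_files` are made up for illustration and have nothing to do with rdist itself.

```python
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.add(os.path.relpath(full_path, root))
    return found


def find_stale_files(source_root, mirror_root):
    """List files that exist in the mirror but no longer exist in the source.

    Hypothetical helper: it compares by relative path only and does not
    look at file contents or timestamps.
    """
    return sorted(relative_files(mirror_root) - relative_files(source_root))
```

Calling `find_stale_files("/path/to/source", "/path/to/mirror")` returns the candidates for deletion; it only reports them, so nothing is removed until a human has looked at the list.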

When this is all done I decide to check out the whole source tree from scratch and compare it to what I already have, as a final sanity check. But I don't have enough disk space on my laptop to do this. Time to upgrade. I order a 60 GB laptop hard drive and a PCMCIA/hard drive connector that is supposed to allow you to clone the old hard drive onto the new one. This process takes something like 6 hours and fails when it is 50% complete, instructing me to "run scandisk." Which takes a couple of hours. Start another copy. 6 hours more. At 50%, it fails again. Only now, the original hard drive is toast, taking my entire life with it. It takes a couple of hours of fiddling around, putting the drive into different computers, etc., to discover that it is indeed lost.
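The sanity check mentioned at the start of that paragraph, comparing a fresh checkout against an existing working copy, could be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure actually used: the helper names are hypothetical, files are compared by SHA-256 digest, and directories named `CVS` are skipped on the assumption that their bookkeeping files differ between checkouts.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root, ignore_dirs=("CVS",)):
    """Map each relative file path under root to its content digest."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d not in ignore_dirs]
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            digests[os.path.relpath(full_path, root)] = file_digest(full_path)
    return digests


def compare_trees(left_root, right_root, ignore_dirs=("CVS",)):
    """Return (only_in_left, only_in_right, changed) as sorted path lists."""
    left = tree_digests(left_root, ignore_dirs)
    right = tree_digests(right_root, ignore_dirs)
    only_left = sorted(set(left) - set(right))
    only_right = sorted(set(right) - set(left))
    changed = sorted(p for p in set(left) & set(right) if left[p] != right[p])
    return only_left, only_right, changed
```

Three empty lists mean the two trees hold the same files with the same contents. Note that this still needs room for both trees on disk, which is exactly the constraint that started the trouble here.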

OK, not too big a deal, we have daily backups (NetBackup Pro). I put the new 60 GB drive into the laptop, format it, and install Windows XP Pro. I instruct NetBackup Pro to restore that machine to its pre-crash state. I'll lose a day of work, but it was a day in which I hardly got anything done, anyway. A day of email was lost so if you sent me something this week and I didn't respond, resend.

NetBackup Pro works for a few hours. I go home, to let it finish overnight. In the morning the system is completely toast and won't even boot. I hypothesize that it must be because I tried to restore a Win2K image on top of an XP Pro OS. So I start again, this time installing Win 2K (format hard drive: 1 hour; install Win 2K: 1 hour; then install the NetBackup Pro Client). And I start the restore again. Five hours later, it's only halfway done, and I go home.

The next morning, the system doesn't quite boot, it blue-screens, but a half hour of fiddling around with Safe Mode and I get it to boot happily. And behold, everything is restored, except, for some reason, a few files which I let Windows encrypt for me (using EFS) are inaccessible. This has something to do with public keys and certificates. When you restore a file that was encrypted I guess you can't read it. I still haven't found the solution to this. If you know how to fix this I will be forever indebted to you. [1/26: I fixed this problem after a few hours of tearing out my hair.]

Lesson Learned: This is not the first time that a hard drive failure has led to a series of other problems that wound up wasting days and days of work. Notice that I had a very respectable backup strategy, everything was backed up daily, offsite. In fact I believe this is the third time that a hard drive failure has led to a series of mishaps that wasted days. Conclusion: backups aren't good enough. I want RAID mirroring from now on. When a drive dies I want to spend 15 minutes putting in a new drive and resume working exactly where I left off. New policy: all non-laptops at Fog Creek will have RAID mirroring.

Chapter Two. Did you notice that our web server was down? On Friday around noon a fire in a local Verizon switch knocked out all our phone lines and our Internet connectivity. Verizon got the phone lines working in a couple of hours, but the T1 was a bit more problematic. We purchased the T1 from Savvis, which, in turn, hired MCI, which is now called WorldCom, to run the local loop, and of course WorldCom doesn't actually run any loops, God forbid they should get their hands dirty, they just buy the local loop from Verizon.

So from Friday at noon until Saturday at midnight, Michael and I, working as a tag team, call Savvis every hour or so to see what's going on. We're pushing on Savvis, who, occasionally, push on WorldCom, who have decided that some kind of SQL Server DDoS attack can be blamed for everything, so they kind of ignore Savvis, who don't tell us that WorldCom is ignoring them, and we push on Savvis again, and they push on WorldCom again, and around the third time WorldCom agrees to call Verizon, who send out a tech who fixes the thing. Honestly, it's like pushing on string. Just like the last time Savvis made our T1 go down for a day, the technical problem was relatively trivial and could have been diagnosed and fixed in minutes if we weren't dealing with so many idiot companies.

Lesson Learned: When you're buying a service from a company that's just outsourcing that service, one level deep, it's difficult to get decent customer service. When there are two levels of outsourcing, it's nearly impossible. Much as I hate to encourage monopolistic local telcos, the only thing worse than dealing with a local telco directly is dealing with another idiot bureaucratic company who themselves have no choice but to deal with the local telco. Our next office space will be wired by Verizon DSL, thank you very much.

Incidentally, none of you would have noticed this outage at all if Dell had delivered our damn server on time. We were supposed to be up and running in a nice Peer1 highly redundant secure colocation facility a month ago. See previous rant. Did I mention that I have a fever? I always get sick when things are going wrong.

Chapter Three. For the thousandth time, the heat on the fourth floor of the Fog Creek brownstone is out. Heat is supplied by hot water pipes running through the walls. These pipes were frozen solid. How did they get a chance to freeze? Oh, that's because the furnace went off last week, because it was installed by an idiot moron, probably unlicensed, who put in a 25 foot long horizontal chimney segment which prevents ventilation and has, so far, hospitalized one tenant and caused the furnace to switch off dozens of times. Finally someone at the heating company admitted that it was possible to install a draft inducer forcing the chimney to ventilate, which they did, but not before the hot water pipes had frozen. Of course, the pipes are inadequately insulated due to another incompetent in the New York City construction trade, but this wouldn't have mattered if the furnace had kept running.

Lesson Learned: Weak systems may appear perfectly healthy until neighboring systems break down. People with allergies and back problems may go for months without suffering from either one, but suddenly an attack of hay fever makes them sneeze hard enough to throw out their back. You see this in systems administration all the time. Use these opportunities to fix all the problems at once. Get RAID on all your PCs and do backups, and don't use EFS, and always get hard drives that are way too large so you'll never have to stop to upgrade them, and double-check the command line options to rdist. Install the draft inducer and insulate the pipes. Move your important servers to a secure colo facility and switch the office T1 to Verizon.
