Thursday, October 15, 2009

Microsoft Begins Restoration of Lost Sidekick Data

I promised I wouldn’t add any Sidekick or Bitbucket items to the current Windows Azure and Cloud Computing Posts for 10/14/2009+ post, so here’s the latest on Microsoft’s Sidekick backup fiasco:

Roz Ho, Microsoft’s Corporate Vice President for Premium Mobile Experiences, announced in this Microsoft Confirms Data Recovery for Sidekick Users thread of 10/15/2009 1:00 AM in T-Mobile’s Sidekick Help forum: “Data Restoration to Begin as Soon as Possible for Affected Customers:”

Dear T-Mobile Sidekick customers,

On behalf of Microsoft, I want to apologize for the recent problems with the Sidekick service and give you an update on the steps we have taken to resolve these problems.

We are pleased to report that we have recovered most, if not all, customer data for those  Sidekick customers whose data was affected by the recent outage.  We plan to begin restoring users’ personal data as soon as possible, starting with personal contacts, after we have validated the data and our restoration plan. We will then continue to work around the clock to restore data to all affected users, including calendar, notes, tasks, photographs and high scores, as quickly as possible.

We now believe that data loss affected a minority of Sidekick users.  If your Sidekick account was among those affected, please continue to log into these forums for the latest updates about when data restoration will begin, and any steps you may need to take. We will work with T-Mobile to post the next update on data restoration timing no later than Saturday.

We have determined that the outage was caused by a system failure that created data loss in the core database and the back-up. We rebuilt the system component by component, recovering data along the way.  This careful process has taken a significant amount of time, but was necessary to preserve the integrity of the data.

We will continue working closely with T-Mobile to restore user data as quickly as possible.  We are eager to deliver the level of reliable service that our incredibly loyal customers have become accustomed to, and we are taking immediate steps to help ensure this does not happen again.  Specifically, we have made changes to improve the overall stability of the Sidekick Service and initiated a more resilient backup process to ensure that the integrity of our database backups is maintained.

Once again, we apologize for this situation and the inconvenience that it has created.  Please know that we are working all-out to resolve this situation and restore the reliability of the service. 

Sincerely,

Roz Ho
Corporate Vice President
Premium Mobile Experiences, Microsoft Corporation

It’s clear from Roz’s explanation that shortcomings in hardware system design and governance made it impossible to restore users’ data immediately. If Danger, Inc. had maintained multiple complete data replicas on separate machines with automatic failover, which is Microsoft’s architecture for the Azure Services Platform and SQL Azure Database, the problem would not have occurred.

It’s equally clear that the failure isn’t related to cloud computing; the same issues would have confronted on-premises data center users.
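To make the replica-and-failover idea concrete, here is a minimal sketch in Python. It is an illustration only, not Danger’s, Microsoft’s, or SQL Azure’s actual design: the `Replica` and `ReplicatedStore` classes, the in-memory dictionaries standing in for separate machines, and the `healthy` flag are all hypothetical names invented for this example.

```python
class Replica:
    """Hypothetical stand-in for one storage machine holding a full data copy."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True  # flipped to False to simulate a machine failure


class ReplicatedStore:
    """Writes every record to all healthy replicas; reads fail over automatically."""

    def __init__(self, replicas):
        if len(replicas) < 2:
            raise ValueError("need at least two replicas for redundancy")
        self.replicas = list(replicas)

    def put(self, key, value):
        # Write to every healthy replica and report how many copies were stored.
        written = 0
        for replica in self.replicas:
            if replica.healthy:
                replica.data[key] = value
                written += 1
        if written == 0:
            raise RuntimeError("no healthy replica accepted the write")
        return written

    def get(self, key):
        # Try replicas in order, skipping failed ones (the "automatic failover").
        for replica in self.replicas:
            if replica.healthy and key in replica.data:
                return replica.data[key]
        raise KeyError(key)


if __name__ == "__main__":
    store = ReplicatedStore([Replica("a"), Replica("b"), Replica("c")])
    store.put("contacts:alice", ["555-0100"])
    store.replicas[0].healthy = False    # simulate losing the first machine
    print(store.get("contacts:alice"))   # still served by a surviving replica
```

The sketch omits what a real system needs most, such as resynchronizing a replica after it recovers and detecting failures without a manual flag. It only shows why one machine’s failure need not take the data with it when complete copies live elsewhere.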

As of 10/15/2009 8:31 PT, other bloggers and news media had picked up the story:

Mary Jo Foley says Microsoft recovers 'most, if not all' Sidekick users' data in her post to the All About Microsoft blog:

On October 15, Microsoft reversed itself, claiming now that instead of losing all of the personal data of Sidekick users, it has recovered “most, if not all” of it.

(Over the past few days, Microsoft has moved from saying all data was lost, to some, to possibly none.) …

Microsoft isn’t explaining beyond that what went wrong, starting in early October, that knocked out the hundreds of thousands of Sidekick users. There’s been lots of speculation, ranging from sabotage, to an attempt by Microsoft to move Sidekick’s back-end infrastructure from its current platform to a Windows-based one. (Danger, which Microsoft acquired in 2008, is still running the back-end infrastructure for the Sidekick.)

One of my Microsoft sources told me:

“(T)he data loss issue was caused by a hardware update on the existing Danger service that had NOT been ported over to a Microsoft platform and the issue was NOT part of a transition to an MS back end. It was an Oracle dB and Sun SAN solution that got a bad firmware update and the backup failed.”

Since then, I’ve heard from others that this scenario seems likely and that yes, Hitachi Data Systems was the company actually doing the maintenance/update for Microsoft. I’ve also heard that foul play has not been ruled out because the failure was so catastrophic and seemingly deliberate. Microsoft is supposedly continuing to do a full investigation.

Microsoft officials are declining to comment beyond the statement they posted to the Web on October 15. They are promising another update on the situation by this Saturday at the latest.

Meanwhile, lawsuits are beginning to pile up as a result of the Sidekick outage, though a full restoration of data may take the bite out of some of them, I’d think.

Adrian Kingsley-Hughes’ "Most, if not all" Sidekick data recovered post for ZDNet’s Hardware 2.0 blog begins:

The Sidekick data wipe debacle rolls on. There is, however, good news on the horizon for Sidekick users.

Roz Ho, corporate vice president for Microsoft’s Premium Mobile Experiences division apologizes for the recent upheaval experienced by Sidekick users and also offers the good news that “most, if not all” of the lost Sidekick data has been recovered. …

So a single system failure took out the main database and the backup. Seriously, what bone-headed backup system had to be in place to allow that to happen? Was the data just copied to another folder on the same drive? Likely not that simple, but something equally grossly incompetent had to have happened. …

Graphics credit: ZDNet

Ian Paul reports Sidekick Data Recovered by Microsoft for PCWorld:

Microsoft has good news for most Sidekick users: the company says it has recovered most of the data for T-Mobile Sidekick users who saw personal information accidentally wiped from their devices earlier this week.

Redmond also provided a few more details about what went wrong with the servers that stored the cloud-based data, which includes contact lists, notes, tasks, calendar appointments, photographs and gaming high scores. …

A number of Sidekick users complained of data loss last Saturday after a server crash at Microsoft Subsidiary Danger. At the time, Microsoft believed the data would be unrecoverable; however, by Monday, the company changed its tune saying some data recovery was possible. Now, it looks like only a small number of Sidekick users will suffer permanent data loss. …

Graphics credit: PCWorld

Dustin Amrhein’s Cloud Computing Isn't a Substitute For Due Diligence post of 10/15/2009 to the Cloud Interoperability Magazine blog sets “Expectations Straight:”

Well, everyone knew it wouldn't be long before cloud computing got thrown under the proverbial bus after the latest Sidekick failure. Observers point at this specific failure, as they have with Gmail, Amazon, and other cloud provider outages in the past, as a broader problem. Some like to use these service outages as an opportunity to initiate a full-fledged attack on the idea of cloud computing. However, can we really just blame cloud computing and move on?

If cloud computing plays a part in the blame game for these outages, it's because of the hype around the industry. If the cloud is being portrayed as a magical bag of beans that can solve all IT ills then that is a problem. Companies need to take a hard look at the ever increasing cloud offerings to understand if and when they can leverage particular cloud technologies. Cloud computing should be viewed as something that can enhance, not necessarily replace, a company's current IT solutions. Potential users should understand that cloud isn't about simply turning your operations over to a service provider for hosting (the infamous "your mess for less" model). It's more about driving efficiencies into the way we procure, utilize, and manage services within IT. These efficiencies can mean savings in costs, increased agility, and intensified focus on the services and activities that a company derives its competitive advantage from. …
