Skype Explains Why The Outage Happened


Written by Iohana Georgescu on December 30th 2010

As you may well remember, especially if you’re a Skype user, the service recently suffered a major outage that lasted for nearly a day. While the company promised to get the service back as soon as possible, it didn’t really explain in depth what caused the whole situation. Several days later, Skype’s Lars Rabbe took the time to provide a detailed explanation on the company’s blog. He started by explaining that Skype is based on a peer-to-peer network, which became unstable and went through a critical failure last week.

The failure went on from December 22 to December 23, totaling nearly 24 hours of downtime. So what caused that failure, you may ask. Well, on December 22 a cluster of support servers responsible for offline instant messaging reportedly became overloaded. Because of this, some Skype clients received delayed responses from those servers. A certain version of the Skype for Windows client (5.0.0.152) didn’t properly process the delayed responses received from the overloaded servers. Windows clients of that version eventually crashed as a consequence.

At this point it’s worth mentioning that roughly half of Skype users worldwide were using the problematic 5.0.0.152 version of the software for Windows, and the crashes took down about 40 percent of those clients. Apparently those clients included 25 to 30 percent of the publicly available supernodes, which also failed as a result of the same problem. Users running the latest Skype for Windows version (5.0.0.156), significantly older ones (4.0 versions), Skype for Mac, Skype for iPhone and Skype for several other devices were not affected by the first critical failure.
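
The figures quoted here imply what share of all Skype clients crashed in the first wave. A quick back-of-the-envelope check, taking “roughly half” as exactly 50 percent (an assumption made only for illustration):

```python
# Back-of-the-envelope check of the figures quoted in the article.
# Assumption: "roughly half" is taken as exactly 0.50.
share_on_buggy_version = 0.50   # share of all clients on 5.0.0.152
share_of_those_crashed = 0.40   # share of those clients that crashed

# Share of ALL Skype clients that crashed in the first wave.
share_of_all_clients_crashed = share_on_buggy_version * share_of_those_crashed
print(f"{share_of_all_clients_crashed:.0%} of all clients crashed")
```

Under that assumption, about one in five clients went down directly because of the bug.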

Yet many reports indicated that the service was down completely and that the vast majority of Skype users (if not all of them) were affected by the outage. So how is this possible, assuming only 20 percent of all Skype clients failed? Rabbe goes on to explain this as well. While the company’s staff responded as quickly as possible, disabling the overloaded servers and eliminating client requests to them, a large number of supernodes had already failed. A supernode matters to Skype because it takes on additional responsibilities compared to a regular node: it acts like a directory that supports other Skype clients and helps establish connections between them. Supernodes also create local clusters of a few hundred peer nodes each. When a supernode fails, even after it is restarted it takes a while to become available as a resource for the P2P network again.

Because of these failures, the network was left with about 25 to 30 percent fewer supernodes than usual. This led to a disproportionate load on the remaining supernodes. They had to cope with the increased load while a lot of users were trying to restart crashed Windows clients, and soon enough the load was simply too much to bear. A short while after the first crash, traffic to the remaining supernodes was about 100 times what Skype would normally see at that time of day.

“Supernodes have a built in mechanism to protect themselves and to avoid adverse impact on the systems hosting them when operational parameters do not fall into expected ranges. We believe that increased load in supernode traffic led to some of these parameters exceeding normal limits, and as a result, more supernodes started to shut down,” Rabbe added. This hugely increased load eventually caused a positive feedback loop, which led to the near-complete failure that most people experienced a few hours later. That’s when most Skype users discovered that they couldn’t use the service anymore. The problem went on for the next 24 hours.
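
The feedback loop described here can be illustrated with a toy model. This is only a sketch under invented assumptions, not Skype’s actual protection mechanism: the function name, the node counts, the `headroom` limit and the `shutdown_fraction` are all hypothetical. The total load is spread evenly over the surviving supernodes, and whenever the load per node exceeds what a node tolerates, a share of the survivors shut down, which pushes even more load onto the rest.

```python
def simulate_cascade(total_nodes, initially_failed, headroom,
                     load_multiplier=1.0, shutdown_fraction=0.2):
    """Return the number of supernodes alive after each round of a toy model.

    All figures are hypothetical. Normal load is 1.0 unit per supernode,
    and each supernode tolerates up to `headroom` units before it shuts
    itself down to protect its host.
    """
    total_load = total_nodes * 1.0 * load_multiplier
    alive = total_nodes - initially_failed
    history = [alive]
    while alive > 0:
        load_per_node = total_load / alive
        if load_per_node <= headroom:
            break  # survivors can absorb the load; the network stabilises
        # Overloaded: a share of the survivors shut down (at least one),
        # which raises the load on the remaining nodes in the next round.
        leaving = max(1, int(alive * shutdown_fraction))
        alive -= leaving
        history.append(alive)
    return history


# 30 percent of supernodes lost, normal traffic: survivors cope.
stable = simulate_cascade(1000, 300, headroom=1.5)
# Same loss, but reconnecting clients double the traffic: the loop runs away.
collapse = simulate_cascade(1000, 300, headroom=1.5, load_multiplier=2.0)

print("without extra traffic:", stable)
print("with extra traffic, supernodes left:", collapse[-1])
```

In this model, losing 30 percent of the supernodes is survivable on its own. Once the extra traffic pushes the per-node load past the limit, every shutdown makes the next one more likely, and the count of live supernodes falls to zero.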

During all this time the Skype team was naturally busy trying to come up with a solution and restore the service. The company’s engineering and operations team introduced hundreds of instances of the Skype software into the P2P network to provide temporary supernode capacity and speed up the recovery of the peer-to-peer cloud. These instances were called “mega-supernodes”. By Wednesday night, though, things were not progressing as well or as fast as initially anticipated, and only a small percentage (15 to 20 percent) of user connections had healed. The load on the supernodes was still exceptionally high, so Skype’s team went ahead and introduced several thousand more mega-supernodes. This plan eventually paid off and led to the full recovery of the P2P network; most users who wanted to connect on the morning of December 23 discovered that they could.

After an outage that caused significant negative feedback from users, Skype is now determined to make sure it doesn’t happen again. The company mentioned on its blog that it understands just how important the reliability, security and quality of its software are for users worldwide. The first step Skype is planning to take is to continue examining its software for potential issues, releasing “hotfixes” when they are needed, which will be available for download as well as delivered automatically to users. A fix to v5.0 for the bug in the problematic version of Skype was released prior to this issue, and the company plans to provide other updates for download during this week as well.

On top of that, Skype says it has learned all the lessons it can from this incident, which will lead the company to re-analyze its processes and procedures in order to detect problems faster and possibly avoid any sort of outage in the future. Skype also plans to find ways to recover the system faster after a failure.

The Skype for Windows v5 software was released after extensive internal testing and months of beta testing. Since it significantly misbehaved recently, Skype will now try to figure out whether there’s anything more it can do to detect and avoid bugs that can affect the system. The blog post ended politely: “Lessons will be learned and we will use this as an opportunity to identify and introduce areas of improvement to our software, further assess and invest in capacity and stability, and develop better processes for outage recovery and communications to our user base. Thank you to everyone.”

While I’m well aware that a lot of Skype users were quite bothered by this incident, it somehow makes sense that some services crash once in a blue moon. For this reason I find it quite helpful and appropriate that Skype not only tried to get its service back up as fast as it could but also bothered to explain exactly what caused the outage. But, as a commenter pointed out, this situation might have been prevented if a fix had been delivered through Skype’s auto update. Most people still do a manual upgrade check, and if they don’t remember, well… too bad for them. Fortunately, companies and people alike learn from mistakes, and occasionally messing up will typically result in better services in the future. As a small aside, this was also the day when the company released its new Skype app for iPhone, which now allows users to place video calls. If you want to learn more about it, head over to this link.

