Thursday, 02 March 2017 17:47

Human error broke the internet... Again.

“WE WANT TO APOLOGIZE FOR THE IMPACT THIS EVENT CAUSED FOR OUR CUSTOMERS...” - Amazon Web Services (AWS)

At 9:37AM PST on the morning of February 28th, 2017, a large chunk of the internet simply disappeared when the servers that powered it suddenly went offline. The servers were part of S3, Amazon’s popular cloud storage service, and when they went offline a "boat load" of big services dependent on the "S.S. AWS" were along for the ride into the nethers. Netflix, Reddit, IFTTT, Mashable, and, ironically, "Is It Down Right Now", a website that tells you when websites are down, were mostly offline or severely degraded. The servers came back online more than four hours later, after Amazon acknowledged the problem. In a cruel twist of fate, the day of the outage just happened to be "AWSome Day" in Edinburgh, Scotland. Talk about awkward timing.

Amazon has finally revealed the cause of the lengthy outage that disrupted service to dozens of internet services for hours — and it's pretty embarrassing.

The cause, according to the company, which posted a "post mortem" late this afternoon, was "human error". That sounds bad enough until you find out exactly what the "human error" was: a typo.

On Tuesday morning, members of the S3 team were debugging the billing system. As part of that, the team needed to take a small number of servers offline. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended,” Amazon said. “The servers that were inadvertently removed supported two other S3 subsystems.”

The subsystems were important. One of them “manages the metadata and location information of all S3 objects in the region,” Amazon said. Without it, services that depend on it couldn’t perform basic data retrieval and storage tasks.

After the servers were accidentally taken offline, the affected systems had to do “a full restart,” which apparently takes a lot longer than it does on your laptop. While S3 was down, a variety of other Amazon web services stopped functioning, including Amazon’s Elastic Compute Cloud (EC2), which is also popular with internet companies that need to rapidly expand their computing capacity.

Amazon explained that S3 was designed to be able to handle losing a few servers. What it had more trouble handling was the massive restart. “S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected,” the company said.

Preventing a recurrence

As a result, Amazon said it is making changes to S3 to enable its systems to recover more quickly. It’s also declaring war on typos. In the future, the company said, engineers will no longer be able to remove capacity from S3 if it would take subsystems below a certain threshold of server capacity.
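The capacity safeguard described above can be pictured with a minimal sketch. This is not Amazon's tooling: the function name `plan_removal`, the `min_capacity` floor, and the server counts are all hypothetical, chosen only to show the idea of rejecting a removal request that would leave a subsystem below a threshold.

```python
class CapacityError(Exception):
    """Raised when a removal would drop a subsystem below its minimum."""


def plan_removal(active_servers, requested, min_capacity):
    """Return the server count left after removal, or raise CapacityError.

    All parameters are hypothetical illustration values:
      active_servers: servers currently in service for the subsystem
      requested:      number of servers the operator asked to remove
      min_capacity:   floor below which the subsystem must not fall
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    remaining = active_servers - requested
    if remaining < min_capacity:
        raise CapacityError(
            f"refusing to remove {requested} servers: "
            f"{remaining} would remain, minimum is {min_capacity}"
        )
    return remaining


if __name__ == "__main__":
    # A small, intended removal passes the check.
    print(plan_removal(active_servers=100, requested=5, min_capacity=80))
    # A mistyped input (50 instead of 5) is rejected before any server is touched.
    try:
        plan_removal(active_servers=100, requested=50, min_capacity=80)
    except CapacityError as err:
        print(err)
```

The point of the design is that the check runs before anything is removed, so a typo in the input produces an error message instead of an outage.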

“...We will do everything we can to learn from this event and use it to improve our availability even further.” - AWS

AWS's service status system went down, too!

Amazon is also making a change to the AWS Service Health Dashboard. During the outage, the dashboard embarrassingly showed all services running green, because the dashboard itself was dependent on S3.

The next time S3 goes down, the dashboard should function properly, the company said.
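One way to picture the dashboard problem: a status page that depends on the very service it monitors cannot report that service's failure. The sketch below is only an illustration and does not reflect how the AWS dashboard is actually built. The `fetch_status` callable, the two example sources, and the status strings are all hypothetical. It shows a display function that reports "unknown" instead of a stale "green" when its status source cannot be reached.

```python
def displayed_status(fetch_status):
    """Return the status a dashboard should show.

    fetch_status is a hypothetical callable that returns a status string,
    or raises ConnectionError when the status source is unreachable.
    """
    try:
        return fetch_status()
    except ConnectionError:
        # An unreachable source is reported honestly, not as "green".
        return "unknown"


def healthy_source():
    """Hypothetical status source that is reachable and reports green."""
    return "green"


def unreachable_source():
    """Hypothetical status source that fails, as if its storage were down."""
    raise ConnectionError("status store unreachable")


if __name__ == "__main__":
    print(displayed_status(healthy_source))
    print(displayed_status(unreachable_source))
```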

Unfortunately, things like this happen to the best of companies, JJCOM.COM not excluded.

Please note, as of the date of this article's publishing, none of the services provided by J & J Communications use, or have ever used, any underlying AWS services. Not that we won't possibly in the future... AWS Services are pretty "AMAZ-ing".
