Complete.Org: Mailing Lists: Archives: freeciv: September 2002:
[Freeciv] FW: Friday network outage

To: freeciv@xxxxxxxxxxx, discussion@xxxxxxxxx
Subject: [Freeciv] FW: Friday network outage
From: John Goerzen <jgoerzen@xxxxxxxxxxxx>
Date: Sat, 21 Sep 2002 08:07:23 -0500

It makes me feel good that this provider actually cares that they went down
:-)

----- Forwarded message from support@[hidden] -----

From: support@[hidden]
Date: Fri, 20 Sep 2002 23:38:34 -0700 (PDT)
To: jgoerzen@xxxxxxxxxxxx
Subject: Friday network outage

Hello,

Just a final note on the network outage that happened over the first half
of today (Friday).

We have confirmation that it was indeed a software problem in a Sprintlink
router three hops out of our network.  As we said earlier, it did not cause
a problem for all persons trying to reach our network - in fact a good portion
of the US IP network could still hit us, but a good portion could not.

We are addressing this issue in two ways.  First, I am going to our upstream
provider personally some night at 4 am or whatever and forcing them to prove
network failover to me.  Our network is triple-homed, and thus, even if
Sprint does go down on us we should not lose any meaningful amount of traffic.

The fact that we did means that the triple-homing was not set up
correctly.  I am no longer taking their word for it, and am thus going there
and making them physically disconnect each backbone link one at a time and
prove to me that it fails over properly.  I will probably have someone on our
end repeat this process monthly.

The second thing we are doing is setting up two additional monitoring systems
- one each on UUnet and Genuity.  We have an off-site monitoring station,
but it is on Sprint, which obviously doesn't give us the distant early
warning we need in all cases.  By adding two more off-site stations (one in
Denver and one on the east coast somewhere) we can be sure to spot things
right away.
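
As a rough illustration of why stations on several carriers help, here is a
minimal sketch, not the provider's actual tooling.  It assumes each off-site
station reports a simple reachable/unreachable result; the function name
classify_outage and the result format are hypothetical.

```python
from typing import Mapping


def classify_outage(reachable: Mapping[str, bool]) -> str:
    """Summarize probe results keyed by the carrier each station sits on.

    `reachable` maps a carrier name to True if the station on that carrier
    could reach the network. The format is an assumption for illustration.
    """
    if not reachable:
        raise ValueError("need at least one monitoring station")
    # Carriers whose station could still reach us.
    up = [carrier for carrier, ok in reachable.items() if ok]
    if len(up) == len(reachable):
        return "ok"
    if not up:
        return "total outage"
    # Some stations succeed and some fail: a partial, path-dependent outage.
    down = sorted(set(reachable) - set(up))
    return "partial outage via " + ", ".join(down)


# Hypothetical results: one station alone can report "ok" while paths
# through other carriers are failing; three stations expose the difference.
print(classify_outage({"Sprint": True}))
print(classify_outage({"Sprint": True, "UUnet": False, "Genuity": True}))
```

With a single station, a failure that only affects other paths goes unseen;
with one station per carrier, the same failure shows up as a partial outage.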

We take what happened today _very, very_ seriously.  Not only are we bearing
down greatly on our providers (see step 1 above) but we are taking matters
into our own hands in terms of very thorough monitoring (see step 2 above).

In the end, this probably won't ever happen again - it was two years of
zero network downtime prior to this, and with our "pull the plug out and
prove it" auditing it seems inconceivable that it will happen again.  By
putting in place the extra monitoring, we are ensuring that much smaller
and more localized outages than happened today are also mitigated.

