BGP Health Monitoring
Paper (under submission):
M. Caesar, L. Subramanian, R. Katz, "Towards localizing root causes of BGP dynamics," U.C. Berkeley Technical Report UCB/CSD-04-1302, November 2003.
This technical report obsoletes technical report UCB/CSD-03-1292.
Internet routing today is plagued by several problems, including chronic instability, convergence problems, and router misconfigurations [2]. We believe that a first step towards making BGP robust to these dynamics is to develop a systematic methodology for analyzing routing changes and inferring why they happen and where they originate. Answers to these questions can provide useful insights into the sources of anomalous routing events and instabilities.
We are developing a BGP health inferencing system for determining the root cause of routing changes. The system collects and correlates route updates from multiple vantage points to determine the routing events that trigger each route update. We envision deploying our inference algorithms in data collection centers like Routeviews [3] and RIPE, which receive streams of route updates from multiple vantage points (views).
More generally, a BGP health monitor can be used to continuously infer the state of the network. Such inferences may then be used: (a) offline, for network performance monitoring and troubleshooting; or (b) online, to improve path selection and damping of instability.
I. Inference techniques
I.a Turbulent vs. Quiescent periods
The rate at which prefixes are updated indicates the type of event that caused the stream of updates. In a Turbulent period, one or a few major routing events cause many routes to be updated simultaneously. We assume that many observations in such a period are correlated (i.e., they arise from the same routing event). In a Quiescent period, when very few prefixes are updated, it is harder to determine which updates are caused by the same routing event. In this case, we analyze updates to each prefix in isolation.
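The distinction can be illustrated with a minimal sketch that buckets updates into fixed time windows and labels each window by the number of distinct prefixes updated in it. The window length, the threshold, and the function name classify_periods are assumptions made for illustration only; the report does not state the values or the exact criterion it uses.

```python
from collections import defaultdict


def classify_periods(updates, window_secs=60, turbulent_threshold=100):
    """Label each time window as "turbulent" or "quiescent".

    updates: iterable of (timestamp_secs, prefix) pairs.
    window_secs and turbulent_threshold are illustrative values,
    not parameters taken from the report.
    """
    prefixes_per_window = defaultdict(set)
    for ts, prefix in updates:
        # Align the timestamp to the start of its window.
        window_start = int(ts // window_secs) * window_secs
        prefixes_per_window[window_start].add(prefix)

    # A window counts as turbulent when many distinct prefixes changed in it.
    return {
        start: "turbulent" if len(pfx) >= turbulent_threshold else "quiescent"
        for start, pfx in sorted(prefixes_per_window.items())
    }
```

Windows in which no update arrived do not appear in the result. Updates that fall in a turbulent window would then be analyzed together, while those in a quiescent window would be analyzed per prefix.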
I.b Matching causes with observations
For every potential cause of a routing event, there exist different patterns of route updates that can be observed at a vantage point. Based on the pattern of observations, we classify the causes into equivalence classes, where each class contains the different causes that might trigger the same pattern of updates. While Griffin [1] has shown that matching causes with observations is a hard problem, we find that certain patterns of updates (e.g., the presence of route withdrawals) can help narrow down the set of possible causes.
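The grouping of causes into equivalence classes can be sketched as follows. The cause names, the pattern names, and the mapping between them are hypothetical placeholders chosen for illustration; they are not the classification used in the report.

```python
from collections import defaultdict

# Hypothetical table (not from the report): each candidate cause maps to the
# set of update patterns it might produce at a vantage point.
CAUSE_TO_PATTERNS = {
    "link_failure": frozenset({"withdrawal", "path_change"}),
    "session_reset": frozenset({"withdrawal", "path_change"}),
    "policy_change": frozenset({"path_change"}),
    "new_prefix_announced": frozenset({"announcement"}),
}


def equivalence_classes(cause_to_patterns):
    """Group causes whose sets of observable patterns are identical."""
    classes = defaultdict(set)
    for cause, patterns in cause_to_patterns.items():
        classes[patterns].add(cause)
    return dict(classes)


def candidate_causes(observed_pattern, cause_to_patterns):
    """Return every cause that could have produced the observed pattern."""
    return {
        cause
        for cause, patterns in cause_to_patterns.items()
        if observed_pattern in patterns
    }
```

With this made-up table, observing a withdrawal leaves only the causes in one class as candidates, whereas observing a path change leaves three candidates. This mirrors the point that some patterns are more informative than others.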
I.c Multiple vantage points
Observing the same event from several vantage points gives us additional information about the event. By comparing similarities and differences in observations across the views, and by measuring the magnitude of the event at each view, we can distinguish the signature of the event from effects introduced by intermediate routers along the path.
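One simple way to combine views, shown here only as an illustration and not as the inference algorithm of the report, is to intersect the inter-AS links that disappeared from the AS path at every view. The helper names path_links and suspect_links and the input format are assumptions.

```python
def path_links(as_path):
    """Return the set of undirected inter-AS links along an AS path."""
    return {
        tuple(sorted((a, b)))
        for a, b in zip(as_path, as_path[1:])
        if a != b  # skip repeated ASes caused by path prepending
    }


def suspect_links(observations):
    """Intersect the links removed from the path at each view.

    observations: list of (old_path, new_path) pairs, one per view,
    where each path is a list of AS numbers.
    """
    suspects = None
    for old_path, new_path in observations:
        removed = path_links(old_path) - path_links(new_path)
        suspects = removed if suspects is None else suspects & removed
    return suspects or set()
```

For example, if one view moves from path 1 2 3 4 to 1 5 3 4 and another from 7 2 3 4 to 7 6 3 4, the only link removed at both views is the one between AS 2 and AS 3. A heuristic like this ignores update timing and the magnitude of the event at each view, both of which the text above relies on.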
II. Validation and results
Most ISPs do not wish to reveal the types or frequency of events taking place in their networks, which makes validating our approach difficult. However, several well-known major events are public knowledge, such as the spread of Internet worms or routing problems suffered by major ISPs. In addition, we know where certain classes of updates originate: for example, updates pertaining to prefixes originated by the AS containing the vantage point, or updates generated by BGP Beacons [4]. We considered a large number of such updates and found that inference was performed correctly in every case. Although we cannot directly validate all of our inferences using this approach, we are able to verify the correctness of the base set of rules that we used to obtain our results.
To demonstrate the utility of such a system, we apply our inference methodology to updates collected from Routeviews and RIPE over a period of 18 months. We make several observations from our analysis:
- We can pinpoint the location where an update was generated to a single pair of ASes for over 70% of updates. We also output a list of potential causes for each event, although we cannot always identify the specific cause.
- Our system can detect major routing anomalies, many of which were previously unknown.
- We detected nearly 1,400 resets per month, and found certain inter-AS links to be perennially unstable.
- Roughly 25% of prefixes continuously flap at least every 30 minutes, and these account for a large fraction (20%) of routing updates.
- Routing events in the Internet core usually trigger short-term flaps, but an event taking place at the network edge is 9 times more likely to cause a long-term route change.
Bibliography:
[1] T. Griffin, "What is the sound of one route flapping?," presentation made at the Network Modeling and Simulation Summer Workshop, 2002.
[2] C. Labovitz, A. Ahuja, F. Jahanian, "Experimental study of Internet stability and wide-area network failures," in Proc. of Fault Tolerant Computing Symposium, June 1999.
[3] "Route Views Project," http://www.routeviews.org.
[4] Z. Mao, R. Bush, T. Griffin, M. Roughan, "BGP beacons," in Proc. Internet Measurement Conference, October 2003.