BGP Health Monitoring

Paper (under submission):
M. Caesar, L. Subramanian, R. Katz, "Towards localizing root causes of BGP dynamics," U.C. Berkeley Technical Report UCB/CSD-04-1302, November 2003. This technical report obsoletes technical report UCB/CSD-03-1292.

Internet routing is plagued with several problems today, including chronic instability, convergence problems, and misconfigurations of routers [2]. We believe that a first step towards making BGP robust to these dynamics is to develop a systematic methodology for analyzing routing changes and inferring why they happen and where they originate. Answers to these questions can provide useful insights into the sources of anomalous routing events and instabilities.

We are working towards development of a BGP health inferencing system for determining the root cause of routing changes. The health inferencing system collects and correlates route updates from multiple vantage points to determine the routing events that trigger each route update. We envision deploying our inference algorithms in data collection centers like Routeviews [3] and RIPE, which receive streams of route updates from multiple vantage points (views). More generally, we can use a BGP health monitor to continuously infer the state of the network. Such inferences may then be used: (a) offline for network performance monitoring and troubleshooting; or (b) online to improve path selection and damping of instability.

I. Inference techniques

I.a Turbulent vs. Quiescent periods

The rate at which prefixes are updated indicates the type(s) of event that caused the stream of updates. In a Turbulent period, one or a few major routing events cause several routes to be updated simultaneously. We assume that many observations in such a period are correlated (i.e., arise from the same routing event). In a Quiescent period, when very few prefixes are updated, it is harder to determine which updates are caused by the same routing event. In this case, we analyze updates to each prefix in isolation.
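
The distinction above can be sketched in a few lines of Python. This is only an illustration, not the classifier used in the paper: the fixed window length, the threshold on distinct prefixes, and the function name classify_periods are all assumptions introduced here.

```python
from collections import defaultdict

def classify_periods(updates, window_seconds=60, turbulent_threshold=100):
    """Label each time window as 'turbulent' or 'quiescent'.

    updates: iterable of (timestamp_seconds, prefix) tuples.
    window_seconds and turbulent_threshold are illustrative values,
    not parameters taken from the paper.
    """
    prefixes_per_window = defaultdict(set)
    for timestamp, prefix in updates:
        # Bucket each update into the window that contains its timestamp.
        window_start = int(timestamp // window_seconds) * window_seconds
        prefixes_per_window[window_start].add(prefix)

    # A window is turbulent when many distinct prefixes were updated in it.
    return {
        start: "turbulent" if len(prefixes) >= turbulent_threshold else "quiescent"
        for start, prefixes in sorted(prefixes_per_window.items())
    }
```

Counting distinct prefixes rather than raw updates keeps a single flapping prefix from making a window look turbulent; that choice is also an assumption of this sketch.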

I.b Matching causes with observations

For every potential cause of a routing event, there exist different patterns of route updates that can be observed at a vantage point. Based on the pattern of observations, we classify the causes into equivalence classes, where each class contains the different causes that might trigger the same pattern of updates. While Griffin et al. [1] have shown that matching causes with observations is a hard problem, we find that certain patterns of updates (e.g., the presence of route withdrawals) can help narrow down the set of possible causes.
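
As a rough illustration of grouping causes into equivalence classes, the Python sketch below inverts a cause-to-pattern table so that an observed pattern maps back to every cause that could have produced it. The cause names, pattern labels, and both function names are hypothetical placeholders invented for this example; they are not the classes defined in the paper.

```python
from collections import defaultdict

# Hypothetical table: cause -> pattern of updates seen at a vantage point.
# Entries are made up for illustration and are not taken from the paper.
CAUSE_PATTERNS = {
    "link_failure": frozenset({"withdrawal"}),
    "origin_outage": frozenset({"withdrawal"}),
    "policy_change": frozenset({"path_change"}),
    "session_reset": frozenset({"withdrawal", "reannouncement"}),
}

def equivalence_classes(cause_patterns):
    """Group causes that trigger the same observable pattern."""
    classes = defaultdict(set)
    for cause, pattern in cause_patterns.items():
        classes[pattern].add(cause)
    return dict(classes)

def candidate_causes(observed_pattern, cause_patterns):
    """Return every cause consistent with the observed pattern."""
    classes = equivalence_classes(cause_patterns)
    return classes.get(frozenset(observed_pattern), set())
```

With this table, observing only a withdrawal narrows the candidates to the two causes that share that pattern, which mirrors how a distinctive pattern shrinks the set of possible causes without identifying a single one.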

I.c Multiple vantage points

Observing the same event from several vantage points allows us to acquire additional information about the event. By comparing similarities and differences in observations across the views, and by measuring the magnitudes of the event at each view, we can distinguish the signature of the event from effects introduced by intermediate routers along the path.
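
One simple way to compare views is sketched below in Python. It assumes, purely for illustration, that prefixes updated at every view approximate the signature of the event, while prefixes seen at only some views may reflect effects introduced along the path. The function name compare_views and the use of prefix counts as the "magnitude" are assumptions of this sketch, not the method of the paper.

```python
def compare_views(observations):
    """Compare one event as seen from several vantage points.

    observations: dict mapping view name -> set of prefixes that were
    updated at that view during the event window.
    Returns (common, magnitude, view_specific).
    """
    prefix_sets = list(observations.values())
    # Prefixes updated at every view: a rough proxy for the event signature.
    common = set.intersection(*prefix_sets) if prefix_sets else set()
    # Magnitude of the event at each view, measured as prefixes updated.
    magnitude = {view: len(prefixes) for view, prefixes in observations.items()}
    # Prefixes not shared by all views: possibly effects of intermediate routers.
    view_specific = {view: prefixes - common for view, prefixes in observations.items()}
    return common, magnitude, view_specific
```

For example, given three views that all report updates to the same two prefixes and one view that reports an extra prefix, the extra prefix ends up in that view's view_specific set while the shared pair forms the common set.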

II. Validation and results

Most ISPs do not wish to reveal the types or frequency of events taking place in their networks, making validation of our approach difficult. However, there are several well-known major events that are public knowledge, such as the spread of Internet worms, or routing problems suffered by major ISPs. In addition, we know where certain classes of updates originate, for example updates pertaining to prefixes originated by the AS containing the vantage point, or updates generated by BGP Beacons [4]. We considered a large number of such updates, and found that inference was performed correctly in every case. Although we are not able to directly validate all of our inferences using this approach, we are able to verify the correctness of a base set of rules that we used to acquire our results.

To demonstrate the utility of such a system, we apply our inference methodology to updates collected from Routeviews and RIPE over a period of 18 months. We make several observations from our analysis:

Bibliography:

[1] T. Griffin, "What is the sound of one route flapping?," presentation made at the Network Modeling and Simulation Summer Workshop, 2002.
[2] C. Labovitz, A. Ahuja, F. Jahanian, "Experimental study of Internet stability and wide-area network failures," in Proc. of Fault Tolerant Computing Symposium, June 1999.
[3] "Route Views Project," http://www.routeviews.org.
[4] Z. Mao, R. Bush, T. Griffin, M. Roughan, "BGP beacons," in Proc. Internet Measurement Conference, October 2003.