
Modeling Alarms in Cellular Networks

 

Model-Based Diagnosis

As discussed above, the knowledge bases of most previous systems are difficult and error-prone to modify and extend when the network is reconfigured or new types of network elements are introduced. Many systems also handle only single faults or predefined fault scenarios. The model-based approach [13] was developed to address these shortcomings. Model-based diagnosis systems separate topological knowledge from behavioral knowledge, which is specified in a modular way for each type of network element. The behavioral knowledge specifies an abstract simulation model of the device under consideration, which is used to predict the expected behavior given the observed input parameters. Diagnoses are computed by comparing predicted and actual behavior.

This approach uses a model of the device, called the system description ($SD$), often formalized as a set of formulas expressed in first-order logic. The system description consists of a set of axioms characterizing the behavior of system components of certain types, e.g. a base station transceiver of a given vendor. The topology is modeled separately by a set of facts. In model-based systems, changes of topology can be carried out easily without affecting the consistency of the system description. Furthermore, since diagnoses are computed using only a model of the correct system behavior, unforeseen error situations can be diagnosed correctly.

Let us now define the diagnostic concept mathematically. The diagnostic problem is described by a system description $SD$, a set $COMP$ of components and a set $OBS$ of observations (logical facts). A predicate $AB$ is used to denote that a component is behaving abnormally: if $c$ is a component, the fact $AB(c)$ denotes that $c$ is not behaving correctly according to its specification. In Consistency-Based Diagnosis, the concept we use throughout this paper, a Diagnosis $D$ is a set of faulty components such that the observed behavior is consistent with the assumption that exactly the components in $D$ behave abnormally. If a diagnosis contains no proper subset which is itself a diagnosis, we call it a Minimal Diagnosis.


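Definition (Diagnosis). Given a system description $SD$, a set of components $COMP$ and a set of observations $OBS$, a Diagnosis is a set $D \subseteq COMP$ such that $SD \cup OBS \cup \{AB(c) \mid c \in D\} \cup \{\neg AB(c) \mid c \in COMP \setminus D\}$ is consistent. A diagnosis $D$ is a Minimal Diagnosis if no proper subset of $D$ is a diagnosis.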
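The notion of a minimal diagnosis can be illustrated with a small brute-force sketch. It is not the procedure used in this paper; the function `minimal_diagnoses`, the callback `is_consistent` and the toy observation are assumptions made purely for illustration. Candidates are enumerated by increasing size, so any candidate that contains an already accepted diagnosis can be skipped as non-minimal.

```python
from itertools import combinations


def minimal_diagnoses(components, is_consistent):
    """Enumerate minimal diagnoses by brute force (illustrative only).

    is_consistent(faulty) is a hypothetical callback that returns True when
    the observations are consistent with exactly the components in `faulty`
    behaving abnormally.
    """
    found = []
    for size in range(len(components) + 1):
        for candidate in combinations(sorted(components), size):
            candidate = frozenset(candidate)
            # A candidate containing a smaller diagnosis cannot be minimal.
            if any(d < candidate for d in found):
                continue
            if is_consistent(candidate):
                found.append(candidate)
    return found


def toy_consistent(faulty):
    # Assumed toy observation: a signal that crosses ML1 and then ML2 was
    # lost, while a signal that crosses only ML3 arrived.
    return bool(faulty & {"ML1", "ML2"}) and "ML3" not in faulty


result = minimal_diagnoses({"ML1", "ML2", "ML3"}, toy_consistent)
```

In this toy case the two single-link sets are the minimal diagnoses; the exponential enumeration is only practical for very small component sets.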

Minimal Diagnoses are a natural concept, as we do not want to assume a component to be faulty unless this is necessary to explain the observed behavior. Since the set of minimal diagnoses can still be quite large, several concepts exist to further discriminate among the minimal diagnoses. We will use the concept of Maximal Probability Diagnoses: in technical domains it is often useful to associate a probability $p(AB(c))$ with each literal $AB(c)$ denoting failure. The probability of a diagnosis is defined as the product of the individual failure probabilities.


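Definition (Maximal Probability Diagnosis). Let $p(AB(c))$ be the failure probability of component $c$. The probability of a diagnosis $D$ is $P(D) = \prod_{c \in D} p(AB(c))$. A diagnosis $D$ is a Maximal Probability Diagnosis if there is no diagnosis $D'$ with $P(D') > P(D)$.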

As de Kleer [3] pointed out, the exact value of the probabilities used is not important here. The probabilities are only approximate values, used to identify the more plausible diagnoses.
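As a minimal sketch of this ranking, assuming the failure probabilities are given as a dictionary: the function names and the numeric values below are hypothetical and only serve to show how the product of the individual failure probabilities selects the more plausible diagnoses.

```python
from math import prod


def diagnosis_probability(diagnosis, failure_prob):
    """Product of the individual failure probabilities of a diagnosis."""
    return prod(failure_prob[c] for c in diagnosis)


def maximal_probability_diagnoses(diagnoses, failure_prob):
    """Return the diagnoses whose probability is maximal."""
    scored = [(diagnosis_probability(d, failure_prob), d) for d in diagnoses]
    if not scored:
        return []
    best = max(p for p, _ in scored)
    return [d for p, d in scored if p == best]


# Assumed, purely illustrative failure probabilities.
FAILURE_PROB = {"ML1": 0.01, "ML2": 0.02, "ML3": 0.01}
DIAGNOSES = [frozenset({"ML1"}), frozenset({"ML2"}), frozenset({"ML1", "ML3"})]

best = maximal_probability_diagnoses(DIAGNOSES, FAILURE_PROB)
```

Only the ordering induced by the values matters here, which matches the remark that the exact probabilities are not important.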

Overview of the Necessary Model

As discussed before, model-based diagnosis needs a model $SD$ of the behavior of the system to simulate correct and/or faulty behavior and compare its results with the observations $OBS$. The set of components $COMP$ considered interesting consists of the network elements which can be faulty. The observations we evaluate are alarm messages received or generated at the base station controller. We therefore have to model the alarm behavior as well as the propagation of alarms over network elements and the connections between them, in order to explain the existence or non-existence of alarm messages. This model has to be modular, i.e. it describes behavior and propagation information for each type of network element. Additionally, topology information is included as a set of facts. In this way, new network elements can be added easily by changing or extending the facts representing topology information. New types of network elements (with possibly different alarm behavior) can be added by adding a description of the appropriate type together with a description of its alarm behavior.

We model base station transceivers, base station controllers and microwave links as basic network elements, which are connected in a star-like network. Data and alarm messages are sent through these elements and the connections between them. In our current model, only microwave links may be assumed to be faulty, as this is by far the most important fault in these networks and is not handled by the current fault management software. A diagnosis therefore consists of one or more microwave links (considered to be faulty). Other faults, e.g. of base station transceivers, can be (and actually are) handled by a simple alarm evaluation system, which can use a one-to-one correspondence between the alarm messages generated for these faults and the faulty network elements. The model-based alarm correlation system is used for the more complicated cases, where a one-to-one correspondence is not possible.

Of the six alarm messages related to the base station controller and the ten alarm messages related to the base station transceivers, we use only five of the latter, as the alarms are quite redundant. Moreover, a detailed model for these alarms (involving the protocols on level 1, 2 and above) is not necessary. It is sufficient to divide these alarm messages into two classes, farend and bts_failure messages.

farend alarms are generated by a component if the components connected to it on the down side are no longer reachable. bts_failure messages for a component are generated if this component is not reachable from a BSC. bts_failure messages are generated directly in the BSC when it detects (using the existence or non-existence of signals from the level 1 protocol) that the base station transceivers are not reachable. As an abstract model, we assume that periodic alive messages are sent by each component to the BSC. If the path from the BSC to the component is disrupted by a faulty component, these messages cannot be delivered to the BSC and the BSC generates the appropriate alarm message. Note that such a description of the alarm behavior does not require a detailed description of the underlying protocols.

Let us look at the relevant cases in detail using two examples. First, we get a bts_failure alarm for BTS20. This indicates that BTS20 is not reachable from the BSC. As alive messages can be lost only between the BSC and the BTS, and only a faulty microwave link can explain message loss, the three microwave links between the BSC and BTS20 are possible diagnoses. The following picture shows the model assuming the second microwave link to be faulty. In this case we predict that the alive messages of each BTS located downstream of the faulty microwave link, i.e. BTS19, BTS20 and BTS21, are lost, so that a bts_failure alarm is expected for each of them. We check the additional predictions for each diagnosis candidate to distinguish between the possible diagnoses.

[Figure: the model with the second microwave link assumed to be faulty]
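The reasoning of this first example can be sketched as follows. The chain below is an assumed topology chosen to be consistent with the description (three microwave links between the BSC and BTS20, and BTS19, BTS20 and BTS21 downstream of the second link); only the names BSC, ML16, BTS19, BTS20 and BTS21 occur in the text, while BTS18, ML17, ML18, ML19 and all function names are hypothetical.

```python
# Assumed chain, listed from the BSC outwards. BTS18, ML17, ML18 and ML19
# are invented names; the real network in the figure may differ.
CHAIN = ["BSC", "ML16", "BTS18", "ML17", "BTS19", "ML18", "BTS20", "ML19", "BTS21"]


def is_link(ne):
    return ne.startswith("ML")


def is_bts(ne):
    return ne.startswith("BTS")


def candidate_links(bts):
    """Microwave links on the path from the BSC to `bts`."""
    return [ne for ne in CHAIN[:CHAIN.index(bts)] if is_link(ne)]


def predicted_bts_failures(faulty_link):
    """BTSs whose alive messages are lost if `faulty_link` is faulty."""
    return {ne for ne in CHAIN[CHAIN.index(faulty_link) + 1:] if is_bts(ne)}


def consistent_candidates(alarmed_bts, observed):
    """Candidates whose predicted bts_failure alarms equal the observed ones."""
    return [ml for ml in candidate_links(alarmed_bts)
            if predicted_bts_failures(ml) == observed]


candidates = candidate_links("BTS20")
remaining = consistent_candidates("BTS20", {"BTS19", "BTS20", "BTS21"})
```

With this assumed chain, the bts_failure alarm for BTS20 alone yields three candidates, and the additional alarms for BTS19 and BTS21 leave only the second link.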

The corresponding alarm rule can be phrased like this:

If we have a bts_failure alarm for network element $c$
Then the BSC has not received the alive message from network element $c$.

On the other hand, assume we get a farend alarm from BTS20. This farend alarm tells us that the components located downstream of BTS20 are not reachable. Using this observation, we can conclude that the faulty microwave link is located downstream of BTS20, in this case narrowing the set of diagnosis candidates to one.

[Figure: the model for a farend alarm from BTS20]

The corresponding alarm rule can be phrased like this:

If we get a farend alarm from component $c$
Then component $c$ has sent a farend signal to the BSC and this signal has not been discarded on the way to the BSC.
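The narrowing effect of a farend alarm can be sketched in the same style. The chain is again an assumed topology (BTS18, ML17, ML18 and ML19 are invented names), and `narrow_with_farend` is a hypothetical helper, not part of the system described here.

```python
# Assumed chain, listed from the BSC outwards; only BSC, ML16, BTS19, BTS20
# and BTS21 are names taken from the text.
CHAIN = ["BSC", "ML16", "BTS18", "ML17", "BTS19", "ML18", "BTS20", "ML19", "BTS21"]


def narrow_with_farend(candidates, sender):
    """Keep only the candidate links located downstream of `sender`.

    A farend alarm from `sender` reached the BSC, so the links between the
    BSC and `sender` did not discard it; the fault must lie further down.
    """
    downstream = set(CHAIN[CHAIN.index(sender) + 1:])
    return [ml for ml in candidates if ml in downstream]


all_links = [ne for ne in CHAIN if ne.startswith("ML")]
after_farend = narrow_with_farend(all_links, "BTS20")
```

In this assumed chain a single link lies downstream of BTS20, so one candidate remains, as in the example above.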

Specific Model

 

The following set of predicate logic formulas expresses these informal specifications. First, we specify the signals used as well as their class (BTS_FAILURE alarm or FAREND alarm). The ALIVE signal is not an alarm, but a status signal from the base station transceivers.


[Formulas: facts declaring the signals and their classes]

Similarly, a set of facts describes the network elements and their types.


[Formulas: facts declaring the network elements and their types]

The topology is described by a set of connection facts. We denote the upstream port of each network element, i.e. the port directed towards the BSC, by UP, and the opposite, downstream port by DOWN. For example, Conn( ML16, UP, BSC, DOWN ) means that the UP port of ML16 is connected to the DOWN port of the BSC. When the topology of the network is changed, only this set of connection facts and the type facts described above have to be changed.


[Formulas: connection facts describing the topology]

The following formulas describe the alarm behavior as well as the alarm propagation. First, we describe the abstraction from specific bts_failure alarms for a network element to an abstract observation that (at least one) bts_failure alarm message has been received or generated for a specific network element. The predicate tex2html_wrap_inline684 represents the observation that the alarm message tex2html_wrap_inline686 has been received or generated for the network element tex2html_wrap_inline688, where tex2html_wrap_inline688 can be any network element of type tex2html_wrap_inline692.


[Formula: abstraction from specific bts_failure alarm messages to an abstract bts_failure observation]

We assume that each base station transceiver sends an alive message and that this message is present at its UP port. The fact tex2html_wrap_inline696 means that the signal ALIVE sent from network element ne2 is present at the port UP of network element ne1.


[Formula: each base station transceiver sends an ALIVE signal that is present at its UP port]

If we have observed a bts_failure alarm from a given network element, we can infer that no alive signal for this network element has been received at the base station controller BSC.


[Formula: an observed bts_failure alarm implies that no ALIVE signal was received at the BSC]

If a farend alarm from a network element has been observed, then that element has sent this farend alarm and it has not been discarded on the way from the sender to the base station controller.


[Formula: an observed farend alarm implies that the farend signal was sent and not discarded]

Signals are propagated over connections (in the direction of the base station controller).


[Formula: propagation of signals over connections]

Signals are also propagated over base stations (in the direction of the base station controller).


[Formula: propagation of signals over base stations]

Signals are also propagated over microwave links. If the microwave link is working correctly, it propagates a signal just like a connection or a base station transceiver. If the microwave link is defective, the signal is discarded.


[Formulas: propagation of signals over a correctly working microwave link; discarding of signals by a defective microwave link]

Finally, if a signal is discarded, a bts_failure alarm is generated.


[Formula: a discarded signal leads to a bts_failure alarm]
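The rules of this section can be read as a small forward simulation. The following sketch is an illustrative Python rendering under stated assumptions, not the logical formulas of the paper: the connection tuples mirror the Conn( ML16, UP, BSC, DOWN ) convention, the topology beyond ML16 and the BSC is assumed, and the names `TYPE`, `PARENT`, `alive_reaches_bsc`, `predicted_bts_failures` and `single_fault_diagnoses` are hypothetical.

```python
# Connection facts (element, port, neighbour, neighbour port). Only the first
# tuple is taken from the text; the rest of the chain is assumed.
CONN = [
    ("ML16", "UP", "BSC", "DOWN"),
    ("BTS18", "UP", "ML16", "DOWN"),
    ("ML17", "UP", "BTS18", "DOWN"),
    ("BTS19", "UP", "ML17", "DOWN"),
    ("ML18", "UP", "BTS19", "DOWN"),
    ("BTS20", "UP", "ML18", "DOWN"),
    ("ML19", "UP", "BTS20", "DOWN"),
    ("BTS21", "UP", "ML19", "DOWN"),
]

# Type facts, derived here from the (assumed) naming convention.
TYPE = {"BSC": "bsc"}
for ne, _, _, _ in CONN:
    TYPE[ne] = "ml" if ne.startswith("ML") else "bts"

# Upstream neighbour of each element, read off the UP-port connections.
PARENT = {ne: up for ne, port, up, _ in CONN if port == "UP"}


def alive_reaches_bsc(bts, faulty):
    """Propagate the ALIVE signal of `bts` upstream; faulty links discard it."""
    node = PARENT[bts]
    while node != "BSC":
        if TYPE[node] == "ml" and node in faulty:
            return False  # signal discarded by a defective microwave link
        node = PARENT[node]
    return True


def predicted_bts_failures(faulty):
    """BTSs for which a bts_failure alarm is predicted."""
    return {ne for ne, t in TYPE.items()
            if t == "bts" and not alive_reaches_bsc(ne, faulty)}


def single_fault_diagnoses(observed):
    """Single links whose deterministic predictions equal the observations."""
    links = sorted(ne for ne, t in TYPE.items() if t == "ml")
    return [ml for ml in links if predicted_bts_failures({ml}) == observed]
```

Changing the topology only requires editing `CONN`, which is the point of keeping connection and type facts separate from the behavioral rules. The sketch considers single faults only; multiple faults would need an enumeration over sets of links.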

A further enhancement of this specification of alarm behavior concerns the predictions of each candidate model. In the model described so far, all effects are deterministic. The model specifies, for example, that we have to observe a bts_failure alarm for each component below a faulty microwave link. These predictions are not totally accurate, as filtering mechanisms within the network drop some alarm messages. We can easily extend our model to include the possibility of such lost alarms by assigning a probability to such events. This allows us to tolerate lost alarms, as long as we have enough other messages indicating the faulty component. If too many alarms are dropped (so that fewer alarm messages are available for evaluation), more diagnosis candidates will be produced.

This enhancement is included by extending the last rule by the possibility that the message is lost, and assigning a probability to that event.


[Formula: a discarded signal leads to a bts_failure alarm unless the alarm message is lost]
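One possible way to score candidates under lost alarms is sketched below. The scoring scheme (failure probability times a factor for each received or lost predicted alarm), the probability values, the prediction table and the names `score` and `rank` are all assumptions made for illustration; the paper only states that a probability is assigned to the loss event.

```python
# Predicted bts_failure alarms per faulty link, for an assumed chain
# BSC, ML16, BTS18, ML17, BTS19, ML18, BTS20, ML19, BTS21.
PREDICTED = {
    "ML16": {"BTS18", "BTS19", "BTS20", "BTS21"},
    "ML17": {"BTS19", "BTS20", "BTS21"},
    "ML18": {"BTS20", "BTS21"},
    "ML19": {"BTS21"},
}


def score(link, observed, p_fail=0.01, p_lost=0.1):
    """Plausibility of `link` being faulty given the observed alarms.

    p_fail and p_lost are assumed, illustrative probabilities.
    """
    predicted = PREDICTED[link]
    if not observed <= predicted:
        return 0.0  # an observed alarm that the candidate cannot explain
    received = len(predicted & observed)
    lost = len(predicted - observed)
    return p_fail * (1.0 - p_lost) ** received * p_lost ** lost


def rank(observed):
    """Candidates with non-zero score, most plausible first."""
    scored = [(score(link, observed), link) for link in PREDICTED]
    return sorted(((s, l) for s, l in scored if s > 0.0), reverse=True)


# The alarm for BTS20 is assumed to have been dropped by a filter.
OBSERVED = {"BTS19", "BTS21"}
ranking = [link for _, link in rank(OBSERVED)]
```

With these assumed numbers the second link stays the most plausible candidate although one of its predicted alarms is missing, while the link next to the BSC remains a less plausible alternative. This reflects the remark that dropped alarms produce more diagnosis candidates.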

