As discussed above, modifying and extending the knowledge base of most previous systems is difficult and error-prone when the network is reconfigured or new types of network elements are introduced. Many systems handle only single faults or predefined fault scenarios. The model-based approach [13] has been developed to address these shortcomings. Model-based diagnosis systems separate topological knowledge from behavioral knowledge, which is specified in a modular way for each type of network element. Behavioral knowledge specifies an abstract simulation model for the device under consideration. It is used to predict the expected behavior, given the observed input parameters. Diagnoses are computed by comparing predicted with actual behavior.
This approach uses a model of the device, called the system description
(SD), often formalized as a set of formulas expressed in
first-order logic. The system description consists of a set of axioms
characterizing the behavior of system components of certain types, e.g.
a base station transceiver of a given vendor. The topology is modeled
separately by a set of facts. In model-based systems changes of
topology can be carried out easily without affecting the consistency of
the system description. Furthermore, since diagnoses are computed using
only a model of the correct system behavior, unforeseen error situations
can be diagnosed correctly.
Let us now define the diagnostic concept mathematically. The diagnostic
problem is described by a system description SD, a set COMP
of components and a set OBS of observations (logical facts). A
predicate AB is used to denote that a component is behaving
abnormally: if c is a component, the fact AB(c) denotes that c
is not behaving correctly according to its specification. In
Consistency-Based Diagnosis, the concept we are using throughout
this paper, a Diagnosis D is a set of faulty components such
that the observed behavior is consistent with the assumption that
exactly the components in D behave abnormally. If a diagnosis contains
no proper subset which is itself a diagnosis, we call it a Minimal
Diagnosis.
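
The two definitions above can be illustrated with a small sketch. It assumes a hypothetical callback is_consistent that stands in for the logical consistency check between the system description, the observations and the abnormality assumptions; the component names and the toy check are illustrative only and are not taken from the network model.

```python
from itertools import combinations


def diagnoses(components, is_consistent):
    """Enumerate every set D for which assuming D abnormal is consistent."""
    comps = sorted(components)
    found = []
    for size in range(len(comps) + 1):
        for subset in combinations(comps, size):
            candidate = frozenset(subset)
            if is_consistent(candidate):
                found.append(candidate)
    return found


def minimal_diagnoses(all_diagnoses):
    """Keep the diagnoses that have no proper subset which is a diagnosis."""
    return [d for d in all_diagnoses
            if not any(other < d for other in all_diagnoses)]


# Toy consistency check (assumption): the observations are explained
# whenever "A" is abnormal, or both "B" and "C" are abnormal.
def toy_check(d):
    return "A" in d or {"B", "C"} <= d


all_d = diagnoses({"A", "B", "C"}, toy_check)
minimal = minimal_diagnoses(all_d)
print(minimal)
```

The brute-force enumeration is exponential in the number of components and is only meant to make the definitions concrete.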

Minimal Diagnoses are a natural concept, as we do not want to assume a
component to be faulty, unless this is necessary to explain the observed
behavior. Since the set of minimal diagnoses can still be quite large,
several concepts exist to further discriminate among the minimal
diagnoses. We will use the concept of Maximal Probability
Diagnoses: in technical domains it is often useful to associate a
probability with each literal AB(c) denoting the failure of a
component c. The probability of a diagnosis is defined as the product of
the individual failure probabilities.

As de Kleer [3] pointed out, the exact value of the probabilities used is not important here. The probabilities are only approximate values, used to identify the more plausible diagnoses.
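
As a minimal sketch of this ranking, the following code multiplies per-component failure probabilities and sorts candidate diagnoses by the result. The component names and probability values are hypothetical placeholders; as noted above, only their rough relative size matters.

```python
from math import prod

# Hypothetical failure probabilities (rough estimates, not measured values).
failure_prob = {"ML1": 0.01, "ML2": 0.01, "ML3": 0.02}


def diagnosis_probability(diagnosis, failure_prob):
    """Product of the failure probabilities of the components in the diagnosis."""
    return prod(failure_prob[c] for c in diagnosis)


def rank(diagnoses, failure_prob):
    """Order diagnoses from most plausible to least plausible."""
    return sorted(diagnoses,
                  key=lambda d: diagnosis_probability(d, failure_prob),
                  reverse=True)


candidates = [frozenset({"ML1", "ML2"}), frozenset({"ML3"}), frozenset({"ML1"})]
ranked = rank(candidates, failure_prob)
print(ranked)
```

With these values the double fault ranks last, since its probability is the product of two small numbers.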
As discussed before, model-based diagnosis needs a model
of the behavior of the system to simulate correct and/or faulty
behavior and compare its results with the observations OBS.
The set of components COMP considered interesting consists of
the network elements which can be faulty. The observations we
evaluate are alarm messages received / generated at the base station
controller. We therefore have to model the alarm behavior as well as
the propagation of alarms over network elements and connections
between them to explain the existence or non-existence of alarm
messages. This model has to be modular, i.e. basically describing
behavior and propagation information for each type of network
element. Additionally, topology information is included as a set of
facts. In this way, new network elements can be easily added by just
changing / extending the facts representing topology information. New
types of network elements (with possibly different alarm behavior) can
be added by adding a description of the appropriate type together with
a description of its alarm behavior.
We model base station transceivers, base station controllers and microwave links as basic network elements, which are connected in a star-like network. Data and alarm messages are sent through these elements and the connections between them. In our current model, only microwave links may be assumed to be faulty, as this is by far the most important fault in these networks and is not handled by the current fault management software. A diagnosis therefore consists of one or more microwave links (considered to be faulty). Other faults, e.g. of base station transceivers, can be (and actually are) handled by a simple alarm evaluation system, which can use a one-to-one correspondence between the alarm messages generated for these faults and the faulty network elements. The model-based alarm correlation system is used for the more complicated cases, where a one-to-one correspondence is not possible.
Of the six alarm messages related to the base station controller and the ten alarm messages related to the base station transceivers, we use only five of the latter, as the alarms are quite redundant. Moreover, a detailed model for these alarms (involving the protocols on level 1, 2 and above) is not necessary. It is sufficient to divide these alarm messages into two classes, farend and bts_failure messages.
farend alarms are generated by a component if the components connected to it on the downstream side are no longer reachable. bts_failure messages for a component are generated if this component is not reachable from a BSC. bts_failure messages are generated directly in the BSC when it detects (using the existence or non-existence of signals from the level 1 protocol) that the base station transceivers are not reachable. As an abstract model, we assume that periodic alive messages are sent by each component to the BSC. If the path from BSC to component is disrupted by a faulty component, these messages cannot be delivered to the BSC and the BSC generates the appropriate alarm message. Note that such a description of the alarm behavior does not require a detailed description of the underlying protocols.
Let us look at the relevant cases in detail using two examples. First, we get a bts_failure alarm for BTS20. This indicates that BTS20 is not reachable from the BSC. As poll messages can be lost only between the BSC and the BTS, and only a faulty microwave link can explain message loss, the three microwave links between BSC and BTS20 are possible diagnoses. The following picture shows the model assuming the second microwave link to be faulty. In this case we predict lost poll messages for each BTS located downstream of the faulty microwave link, i.e. BTS19, BTS20 and BTS21. We check the additional predictions for each diagnosis candidate to distinguish between the possible diagnoses.
The corresponding alarm rule can be phrased like this:
If we have a bts_failure alarm for a network element,
then the BSC has not received the alive message from that
component.
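
The reasoning of this first example can be sketched in a few lines. The chain below is a hypothetical stand-in for the network in the picture: it only reproduces the counts stated in the text (three microwave links between BSC and BTS20, with BTS19, BTS20 and BTS21 downstream of the second link), and the names ML1 to ML4 and BTS18 are invented for illustration.

```python
# Hypothetical chain, ordered from the BSC outwards (assumption, see above).
CHAIN = ["BSC", "ML1", "BTS18", "ML2", "BTS19", "ML3", "BTS20", "ML4", "BTS21"]


def is_link(ne):
    return ne.startswith("ML")


def is_bts(ne):
    return ne.startswith("BTS")


def candidates_for_bts_failure(bts, chain):
    """Microwave links between the BSC and `bts`; each one alone would
    explain why the alive message of `bts` did not reach the BSC."""
    return [ne for ne in chain[:chain.index(bts)] if is_link(ne)]


def predicted_bts_failures(faulty_link, chain):
    """Every BTS downstream of the faulty link is predicted to be unreachable."""
    return [ne for ne in chain[chain.index(faulty_link) + 1:] if is_bts(ne)]


print(candidates_for_bts_failure("BTS20", CHAIN))
print(predicted_bts_failures("ML2", CHAIN))
```

Comparing the predicted alarms of each candidate with the alarms actually received is what discriminates between the three candidates.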
On the other hand, assume we get a farend alarm from BTS20. This farend alarm tells us that the components located downstream of BTS20 are not reachable. Using this observation, we can conclude that the faulty microwave link is located downstream of BTS20, in this case narrowing the set of diagnosis candidates to one.
The corresponding alarm rule can be phrased like this:
If we get a farend alarm from a component, then that component
has sent a farend signal to the BSC and this signal has not been
discarded on the way to the BSC.
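
A farend alarm carries two pieces of information, which the following sketch separates: the fault lies downstream of the sender, and the links between the sender and the BSC must have passed the signal. The chain is again a hypothetical stand-in for the real topology, with invented link names.

```python
# Hypothetical chain, ordered from the BSC outwards (assumption).
CHAIN = ["BSC", "ML1", "BTS18", "ML2", "BTS19", "ML3", "BTS20", "ML4", "BTS21"]


def candidates_for_farend(sender, chain):
    """Links downstream of the sender: the farend alarm reports that
    the components behind the sender are no longer reachable."""
    return [ne for ne in chain[chain.index(sender) + 1:] if ne.startswith("ML")]


def exonerated_by_farend(sender, chain):
    """Links between the BSC and the sender: the farend signal was not
    discarded on its way, so these links are assumed to work."""
    return [ne for ne in chain[:chain.index(sender)] if ne.startswith("ML")]


print(candidates_for_farend("BTS20", CHAIN))
print(exonerated_by_farend("BTS20", CHAIN))
```

In this toy chain a single link lies behind BTS20, which corresponds to the narrowing to one candidate described in the example.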
The following set of predicate logic formulas expresses these informal specifications. First, we specify the signals used as well as their class (BTS_FAILURE alarm or FAREND alarm). The ALIVE signal is not an alarm, but a status signal from the base station transceivers.

Similarly, a set of facts describes the network elements and their types.

The topology is described by a set of connection facts. We denote the
upstream port of each network element by UP, i.e. the port
directed towards the BSC, and the opposite, downstream port by
DOWN. For example, Conn( ML16, UP, BSC, DOWN ) means that the
UP port of ML16 is connected to the DOWN port
of the BSC. When the topology of the network is changed, only
this set of connection facts and the type facts described above have
to be changed.
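
To make the separation of topology and behavior concrete, the sketch below stores type and connection facts as plain data and derives the path towards the BSC from them. Only the ML16/BSC connection is taken from the text; the elements BTS_A, ML_B and BTS_B and the type labels are hypothetical.

```python
# Type facts: element -> type (labels are illustrative assumptions).
TYPES = {"BSC": "bsc", "ML16": "microwave_link", "BTS_A": "bts"}

# Connection facts (element, port, neighbour, neighbour_port),
# mirroring Conn( ML16, UP, BSC, DOWN ).
CONN = {
    ("ML16", "UP", "BSC", "DOWN"),
    ("BTS_A", "UP", "ML16", "DOWN"),
}


def upstream_of(element, conn):
    """Return the element attached to the UP port of `element`, or None."""
    for ne, port, other, other_port in conn:
        if ne == element and port == "UP" and other_port == "DOWN":
            return other
    return None


def path_to_bsc(element, conn):
    """Follow UP ports until no further upstream neighbour exists."""
    path = []
    current = upstream_of(element, conn)
    while current is not None:
        path.append(current)
        current = upstream_of(current, conn)
    return path


# Extending the network only adds facts; the functions above stay unchanged.
TYPES["ML_B"] = "microwave_link"
TYPES["BTS_B"] = "bts"
CONN |= {("ML_B", "UP", "BTS_A", "DOWN"), ("BTS_B", "UP", "ML_B", "DOWN")}

print(path_to_bsc("BTS_B", CONN))
```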

The following formulas describe the alarm behavior as well as the
alarm propagation. First, we describe the abstraction from specific
bts_failure alarms for a network element to an abstract
observation that (at least one) bts_failure alarm message
has been received / generated for a specific network element. An
observation predicate represents the fact that a given alarm message
has been received / generated for a given network element, which can
be any network element of the appropriate type.

We assume that each base station transceiver sends an alive
message and that this message is present at its UP port. A signal
fact states that a signal, here ALIVE, sent from network element
ne2 is present at the port UP of network element ne1.

If we have observed a bts_failure alarm for a given network
element, we can infer that no alive signal for this network
element has been received at the base station controller.

If a farend alarm from a network element has been observed, then that element has sent this farend alarm and it has not been discarded on the way from the sender to the base station controller.

Signals are propagated over connections (in the direction of the base station controller).

Signals are also propagated over base stations (in the direction of the base station controller).

Signals are also propagated over microwave links. If the microwave link is working correctly, it propagates a signal just like a connection or a base station transceiver. If the microwave link is defective, the signal is discarded.


Finally, if a signal is discarded, a bts_failure alarm is generated.
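
The propagation rules can be mimicked by a small simulation, given here as a sketch rather than as a translation of the logical formulas. It assumes a hypothetical linear chain with invented element names, and it treats a candidate as consistent only if its predicted bts_failure alarms match the observed ones exactly, which reflects the deterministic reading of the rules.

```python
# Hypothetical chain, ordered from the BSC outwards (assumption).
CHAIN = ["BSC", "ML1", "BTS18", "ML2", "BTS19", "ML3", "BTS20", "ML4", "BTS21"]


def simulate(faulty_links, chain):
    """Propagate the ALIVE signal of every BTS towards the BSC.
    A faulty microwave link discards the signal, which yields a
    bts_failure alarm for the sending BTS."""
    alarms = set()
    for i, ne in enumerate(chain):
        if not ne.startswith("BTS"):
            continue
        upstream = chain[:i]
        if any(link in faulty_links for link in upstream):
            alarms.add(("bts_failure", ne))
    return alarms


def consistent_single_faults(observed, chain):
    """Single-link candidates whose predictions equal the observations."""
    links = [ne for ne in chain if ne.startswith("ML")]
    return [link for link in links if simulate({link}, chain) == observed]


observed = {("bts_failure", "BTS19"),
            ("bts_failure", "BTS20"),
            ("bts_failure", "BTS21")}
print(consistent_single_faults(observed, CHAIN))
```

Under this strict reading, an observation set containing only the alarm for BTS20 has no consistent single-fault candidate in the toy chain, because every link that cuts off BTS20 also cuts off BTS21.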

A further enhancement of this specification of alarm behavior concerns
the predictions of each candidate model. In the model described so far
all effects are deterministic. The model specifies, for example, that
we have to observe a bts_failure alarm for each component
below a faulty microwave link. These predictions are not totally
accurate, as filtering mechanisms within the network drop some alarm
messages. We can easily extend our model to include the possibility of
such lost alarms by assigning a probability to such
events.
This allows us to tolerate lost alarms, as long as we have enough
other messages indicating the faulty component. If too many alarms are
dropped (meaning that fewer alarm messages are available for
evaluation), more diagnosis candidates will be produced.
This enhancement is included by extending the last rule by the possibility that the message is lost, and assigning a probability to that event.
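
One simple way to picture this extension is to score each candidate by its failure probability multiplied by the probability of every alarm it predicts but which was not observed. The sketch below does this for a hypothetical chain; the element names, the scoring formula and the values P_FAULT and P_LOST are assumptions chosen for illustration, not parameters from the model.

```python
# Hypothetical chain, ordered from the BSC outwards (assumption).
CHAIN = ["BSC", "ML1", "BTS18", "ML2", "BTS19", "ML3", "BTS20", "ML4", "BTS21"]

P_FAULT = 0.01  # assumed failure probability of a microwave link
P_LOST = 0.1    # assumed probability that a single alarm message is lost


def predicted(link, chain):
    """BTSs for which a bts_failure alarm is predicted if `link` is faulty."""
    return {ne for ne in chain[chain.index(link) + 1:] if ne.startswith("BTS")}


def score(link, observed, chain):
    """Plausibility of `link` being faulty given the observed alarms.
    An observed alarm that the candidate does not predict rules it out."""
    pred = predicted(link, chain)
    if not observed <= pred:
        return 0.0
    lost = len(pred - observed)
    return P_FAULT * P_LOST ** lost


def rank_links(observed, chain):
    """Candidates with non-zero score, most plausible first."""
    links = [ne for ne in chain if ne.startswith("ML")]
    scored = [(link, score(link, observed, chain)) for link in links]
    scored = [(link, s) for link, s in scored if s > 0.0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_links({"BTS20"}, CHAIN)
print([link for link, _ in ranked])
```

With only one alarm received, three candidates remain instead of none, and the one that requires the fewest lost alarms ranks first.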
