Increasing the Reliability of a Naval Tactical Data Link through the Design and Implementation of Automatic Mechanisms for Failure Recovery

Pérez, Marrugo, Gómez

1 COTECMAR. Cartagena de Indias, Colombia. e-mail: gustavoperezv@gmail.com
2 COTECMAR. Cartagena de Indias, Colombia. e-mail: smarrugo@cotecmar.com
3 Tecnológica de Bolivar University. Cartagena de Indias, Colombia. e-mail: egomez@unitecnologica.edu.co

Date Received: March 20th 2016
Date Accepted: June 14th 2016

Ship Science & Technology - Vol. 10 - n.° 19 - (27-45) July 2016 - Cartagena (Colombia)

Abstract

Currently, when the units of some military components operate in groups/task forces, tactical information transfer among them is crucial in order to provide a common tactical view that enables real time coordination of operations. This information exchange must be executed through a system that ensures privacy, that is reliable and user friendly, and that does not overly increase message size, because of bandwidth constraints in the available means of communication (HF, V/UHF). However, failures that reduce system reliability may occur during operation. These failures may be attributed to system design issues (dead states), physical network problems (equipment failure, unit disconnection), or electronic warfare technologies used to interfere with and override these types of systems. In order to ensure the reliability of this type of system, we must thoroughly study the media access control mechanism of the tool and the system states, and identify possible failure conditions. This makes it possible to design automatic recovery mechanisms suitable for the analyzed system, with minimum investment in specialized hardware, applying solutions from the software component and using system synchronization as a starting point. This work describes the methodology used to design and implement an automatic recovery system for failures detected in a tactical data link system, and how this increases the system's reliability by reducing average recovery times.

Introduction

The system under study is the prototype version of a Tactical Data Link developed by COTECMAR for the Colombian Navy. A Tactical Data Link is a tactical communication system based on radio communications that enables handling the tactical information of a force or task force and improves decision making and command and control functions through information exploitation tools (COTECMAR, 2011).

Tactical Data Links (TDL) enable radio data exchange between platforms, in order to minimize voice communications that may be critical in action or combat environments (CPT/CIA, 2008). Their basic operating principle is to provide a real time link between subordinate units and their corresponding operational command. Currently, a large portion of military communications (voice and non-voice) is transmitted as data, making it easier for the military forces to coordinate their land, sea, and air-based operations (Asenstorfer, Cox, & Wilksch, 2004).

Technically, TDLs define a family of protocols known as Links, which have broadened military communication coverage through wireless networks that connect vessels, submarines, tanks, land bases, etc. These protocols lie within the physical and link layers (one and two, respectively) of the OSI reference model, defining aspects regarding Media Access Control (MAC) and information transmission on the radio links (Benavides & Montañez, 2008).

Characteristics of the system under study

Technical characteristics of the system under study:
• Includes a cartographic system in S-57 and Shape format.
• Operates on HF/VHF/UHF frequency bands.
• It has AES private key cryptography.
• Three operating modes: Test, Silence, and Normal (operating).
• FSK modulation.
• On-demand (polling) MAC. This implies that there must be a network controlling station.

Functional characteristics of the system under study:
• Operation management (unit configuration, charts and groups).
• Network management (codes, network modes, radio communication options).
• Weapon management.
• Tactical information exchange (position reports, contacts, changes in configuration, unofficial messaging, alerts, correlation/decorrelation, among others).
• Information exploitation and decision making support (RAM traces, points of reference, radar prediction, interception, PMA, position simulation).

Hardware components of the system under study:
The described system is complemented by a hardware component that ensures integration of all the system functionalities with the radio communication equipment required to carry out the information exchange in the network.

System hardware consists of a communications integrated box, which incorporates COTS (1) components, such as: a multi-modem card to modulate and demodulate FSK data in the communication channels, a switch card with 8 ports to connect the on-board devices, an internal 12V and 5V DC source to power the cards, and adaptors for all internal connections.

(1) Commercial Off-The-Shelf. Non-developmental item (NDI) of supply, which is also commercial.

The box has an external 115VAC and 12VDC supply. It has a universal USB port for PC connection and serial ports to connect radio equipment. Fig. 1 shows images of the communications integrated box.

Fig. 2 shows the process followed in managing the system network. Basically, the Data Link is seen as a tactical data link between the participating units in a specific operation. During the process of exchanging tactical information to support decision making while performing the operation, the following resources are involved:
• The sensors in each unit, which become sources of combat intelligence information that is shared with all units.
• The analysis tools that support decision making.

• The external communications system of the units.
• The data modulating and demodulating devices, to be adapted for radio communications.
• Means to deploy information.
• Database managing systems that store information during the operations.
Fig. 3 shows the flowchart for the system under study for the "Normal" operation mode.
During system operation, a participating unit may fail due to internal or external factors, causing it to lose its connection involuntarily or to suffer a significant disturbance in communications.
The situations of this type for which system recovery processes have been considered are listed below:
• Fall of the Network Controller Station (NCS) or Control Unit (CU).
• Fall of the unit with the token.
• Fall of a participating unit.
These failures are shown in a more detailed system scheme in Figs. 4 and 5, for the CU and PU roles, respectively. Failure situations are shown in red in the schemes.

Table 1. System failure identification.
• Disturbed media: The transmission channel is blocked by a higher power signal and therefore information cannot be received/transmitted. (External; NCS - PUs)
• Fall of the NCS: The network control station becomes disconnected and therefore the network coordination actions and information relay cannot be performed. (Internal - NCS; PUs)
• Fall of the PU with Token: The participating unit that receives the token disconnects and does not return the token to the NCS, and therefore communications in the network are affected. (Internal - PU; NCS - PUs)
• Fall of the PU: A participating unit disconnects when it does not have the token; as a consequence, it cannot receive/transmit information.

Of the two possible failures that Fig. 4 shows from the CU perspective, the first one (from top to bottom) is the dropout of a PU. The controlling unit sends the token to the PU and, since no ACK is received, it repeats the attempt. However, the system design considered only the possibility of the PU reconnecting during that second opportunity, ruling out a voluntary or accidental disconnection of the PU. Upon reaching this point, therefore, the system fails: it cannot find a state in which to operate, and the network is completely disconfigured.

The second failure is the dropout of a PU with the token. In this case, the participating unit receives the token sent by the CU and replies with the ACK; the CU therefore sends all the information available to the PU and receives all the information coming from it, but in the end it does not receive the token back. This option was not considered during the design of the system. Upon reaching this point, the system fails and cannot find a state in which to operate, and the network is completely disconfigured.

Below, Fig. 12 shows the moment (seen from the PU) that generates two of the most complex failure situations:

In the current system design, it is assumed that, once connected to the network, the PU will always receive the token from the CU. However, whenever the controlling unit disconnects, voluntarily or involuntarily, there is no way to generate a token in the network or to manage the information exchange between the units. The system will therefore enter an indefinite silence, or the network will disconfigure, since the ID of the controller unit authorizing connections will not be detected.

Given the naval environment in which this type of system is used, it is also possible that both the PU and the CU are connected but fail to communicate because the communication channel is blocked or disturbed (electronic warfare techniques). In this case, the PU will also assume that the CU is not connected, and therefore the point of origin of the failure is assumed to be the same.

General design - failure recovery
The sequence diagrams show an overview of the recovery mechanism to be implemented in each case.

Detailed design - failure recovery
Considering the media access control of the system, and knowing that the main goal is to reduce recovery times, timers are implemented in the system as mechanisms to trigger failure identification and recovery routines.
These timers are based on the times designed for system synchronization; such times are listed and described below:
• To: Network performance optimized time.
It is noteworthy that during the token cycle, whenever a unit is disconnected, the NCS must wait for a time Tc so that the unit has a time frame in which to connect to the network. Tc is equivalent to 2To.
The "Disturbed media", "Fall of the PU with Token", and "Fall of the PU" failure situations described in Table 1 have a network recovery procedure seen from the CU or NCS, as shown in Fig. 9.
The "Fall of the NCS" failure situation shown in Table 1 has a network recovery procedure seen from the PUs, as shown in Fig. 10.
In general, the recovery mechanisms work as follows. In the CU, a timer (te) is triggered as soon as the unit receives the token ACK message, i.e. as soon as token delivery to a PU has been confirmed.

In the case of the fall of the PU failure, this timer is not triggered, since the PU does not receive the token message. The proposed solution in this situation is to disconnect the unit after the second attempt to deliver the token and to continue through the sequence of connected PUs to send the token. Thus, the affected PU may detect its inactivity and request a new connection in a subsequent token cycle. With this solution, the network does not lose its configuration and only the unit with problems is affected.

For the fall of the PU with token, the CU must verify the expiration of the timer (te). The purpose of this is to provide a reasonable time frame for the PU that has the token to transmit information or to reclaim the token. Once this time is exceeded, the CU disconnects the PU, invalidates the previous token, generates a new one, and continues with the token sequence. Thus, there are never two tokens in the network and only the failed PU is affected.

From the PU perspective, the two procedures mentioned above are not detected, unless it is the PU that had to be disconnected from the network, i.e. the failed PU. In this case, once each unit is connected and configured in the network, the timer (tet) that allows it to remain in a token-standby status is triggered. If this time frame is exceeded, the PU checks whether it remains in a connection state; if so, it automatically disconnects from the network and triggers a second timer (tcr). During this time frame, the PU remains in a "listening" mode. If it hears its ID (which is sent by the CU) within this time frame, it automatically sends its connection request message to the network again.

If the environment is still silent and the (tcr) time is exceeded, the PU automatically recognizes that something has happened to the CU, and it proceeds to verify whether it is its turn to assume CU functions. If so, the unit reconfigures itself and assumes the CU functions. If not, a third timer (tdisturbance) is triggered. If this timer is exceeded, the PU interprets that it is being disturbed and checks the configured frequency table to suggest that the operator switch to a secure frequency.
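The decision rules described above can be sketched as two small functions, one for the CU and one for the PU. This is a minimal sketch under stated assumptions, not the system's actual implementation: every type and function name is hypothetical, timer expiry is passed in as an event rather than measured, and returning a PU to the token-waiting state right after it sends its connection request is a simplification.

```cpp
// All names are hypothetical; only the timers (te, tet, tcr, tdisturbance)
// and the described actions come from the text.

// ---- Controlling unit (CU) ----
enum class CuAction {
    ResendToken,        // first delivery got no ACK: try once more
    DropPuAndContinue,  // second delivery got no ACK: disconnect PU, go on
    WaitForToken,       // ACK received and te has not expired yet
    DropPuAndReissue,   // te expired: disconnect PU, invalidate and reissue token
    PassTokenToNextPu   // token came back normally
};

struct CuObservation {
    int    deliveryAttempts;  // token deliveries already made to this PU
    bool   ackReceived;       // PU confirmed that it received the token
    bool   tokenReturned;     // PU handed the token back
    double secondsSinceAck;   // elapsed time on the te timer
};

CuAction decideCuAction(const CuObservation& o, double teSeconds) {
    if (!o.ackReceived) {
        return o.deliveryAttempts < 2 ? CuAction::ResendToken
                                      : CuAction::DropPuAndContinue;
    }
    if (o.tokenReturned) return CuAction::PassTokenToNextPu;
    return o.secondsSinceAck > teSeconds ? CuAction::DropPuAndReissue
                                         : CuAction::WaitForToken;
}

// ---- Participating unit (PU) ----
enum class PuState { AwaitingToken, Listening, AwaitingNewCu, ActingAsCu, Disturbed };
enum class PuEvent { TimerExpired, HeardOwnId };

struct PuStep {
    PuState     next;    // state after handling the event
    const char* action;  // short description of what the PU does
};

PuStep stepPu(PuState s, PuEvent e, bool nextInLineForCu) {
    switch (s) {
    case PuState::AwaitingToken:   // tet is running
        if (e == PuEvent::TimerExpired)
            return PuStep{PuState::Listening, "disconnect, start tcr, listen"};
        break;
    case PuState::Listening:       // tcr is running
        if (e == PuEvent::HeardOwnId)  // assumption: request is accepted
            return PuStep{PuState::AwaitingToken, "send connection request"};
        if (e == PuEvent::TimerExpired)
            return nextInLineForCu
                ? PuStep{PuState::ActingAsCu, "reconfigure as CU"}
                : PuStep{PuState::AwaitingNewCu, "start tdisturbance"};
        break;
    case PuState::AwaitingNewCu:   // tdisturbance is running
        if (e == PuEvent::TimerExpired)
            return PuStep{PuState::Disturbed, "suggest secure frequency to operator"};
        break;
    default:
        break;
    }
    return PuStep{s, "no change"};
}
```

Keeping both sides as pure decision functions makes each recovery path easy to check in isolation: every combination of inputs maps to exactly one action, which is the property the original design lacked in its dead states.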
The simulation model made for the system is shown below. The input variables for the simulation are read from an Excel spreadsheet. The purpose of this model is to assess how variation in the input variables affects the output, in order to select the values that yield the best performance for implementation in the system.
• Input variables: number of participating units and optimized network time.
• Output parameter: recovery time.
For the simulation, a failure in the system is assumed, using a random value distribution module that generates values between 0 and 3. These values lead the system to a failure (according to the details in Table 1). For each type of failure, a timer is activated and the recovery actions are taken, according to the figures shown in the previous section. To keep the simulated recovery mechanisms within admissible ranges, the times that the system takes to perform certain actions during experiments, such as table reading and ID identification, among others, were taken as model input.
The simulation was run 1000 times, and the first 100 recovery time results for each type of failure were taken as study data.
The simulation model was implemented for the following input conditions:
• Number of units: 4 (maximum number of units participating in operations).
• Network optimization time: 3.9 seconds (measured value).
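The sampling scheme described above (a random failure type between 0 and 3 per run, 1000 runs, and the first 100 results kept per failure type) can be sketched as follows. The original model was built in ExtendSim, so this C++ version is only an illustration: the function names are hypothetical, the uniform distribution over the four failure types is an assumption, and the recovery logic itself is left as a caller-supplied function because the paper does not give its internals.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <random>
#include <vector>

// Hypothetical stand-in for the model's recovery logic: given a failure type
// (0..3) and a random engine, it returns a recovery time in seconds.
typedef std::function<double(int, std::mt19937&)> RecoveryModel;

// Runs `runs` trials. Each trial draws a failure type from 0..3 (assumed
// uniform), evaluates the recovery model, and stores the result only while
// that failure type has fewer than `keepPerType` samples.
std::map<int, std::vector<double> > sampleRecoveryTimes(
        const RecoveryModel& model, std::size_t runs,
        std::size_t keepPerType, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> failureType(0, 3);
    std::map<int, std::vector<double> > samples;
    for (std::size_t i = 0; i < runs; ++i) {
        const int type = failureType(rng);
        const double recoverySeconds = model(type, rng);
        std::vector<double>& bucket = samples[type];
        if (bucket.size() < keepPerType) {
            bucket.push_back(recoverySeconds);
        }
    }
    return samples;
}
```

A call such as `sampleRecoveryTimes(model, 1000, 100, seed)` mirrors the run counts reported in the paper; the seed parameter is an addition that makes a run repeatable.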

Simulation Results
The results obtained from the simulations are shown below.
The minimum recovery time in the simulation was 25 seconds, while the maximum time was 38 seconds.
• Standard deviation: 2.65
The minimum recovery time in the simulation was 16 seconds, and the maximum time was 21 seconds.
• Standard deviation: 1.12
The minimum recovery time in the simulation was 145 seconds, while the maximum time was 154 seconds.
• Standard deviation: 1.57
The minimum recovery time in the simulation was 206 seconds, and the maximum time was 218 seconds.
• Standard deviation: 1.91
The table below summarizes the recovery times measured in the system simulation.
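For reference, the kind of summary reported above (minimum, maximum, and standard deviation of a set of recovery times) can be computed as in the sketch below. The names are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the paper does not say which estimator was used.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct Summary {
    double min;
    double max;
    double mean;
    double stddev;  // sample standard deviation (n - 1 in the denominator)
};

// Summarizes a set of recovery times in seconds.
// Precondition: `times` holds at least two values.
Summary summarize(const std::vector<double>& times) {
    Summary s;
    s.min = *std::min_element(times.begin(), times.end());
    s.max = *std::max_element(times.begin(), times.end());
    s.mean = std::accumulate(times.begin(), times.end(), 0.0) / times.size();
    double squaredDeviations = 0.0;
    for (std::size_t i = 0; i < times.size(); ++i) {
        const double d = times[i] - s.mean;
        squaredDeviations += d * d;
    }
    s.stddev = std::sqrt(squaredDeviations / (times.size() - 1));
    return s;
}
```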
As a result of the simulation process, we verified that the recovery times associated with each failure dropped by over 50% compared with the times measured in manual system recovery; we therefore confirm the feasibility of implementing the proposed mechanisms.
The programming language used to implement these mechanisms is C++, and the work environment was Visual Studio 2010. We used a licensed version of this tool, property of COTECMAR.
The timers implemented in the system as an integral part of the proposed failure recovery model are shown in Fig. 16, which presents, in the Visual Studio 2010 graphic interface, the icons and names assigned to each of them in the system.
In order to verify the lab performance of the automatic failure recovery mechanisms, we drew up a test plan consisting of four (04) packages (one for each type of failure). The devices and tools used to run the test plan are:
• Four computers in working condition, with Windows XP or higher installed and updated (in this specific case we used four standard DELL Latitude E6400).
• Radio equipment comprising:
  • Four tactical radios with antenna charger, Motorola Pro 3100 UHF, with power source.
  • Four communication integrating boxes.
• Wiring suitable to connect computers, multi-modems and radios.
Fig. 17 shows a picture of the laboratory where the tests were held. This lab is located in the COTECMAR facilities in Cartagena, and its use was authorized to run the testing protocol of the system under study, with the implemented automatic failure recovery system.
The summary of the results obtained in the lab tests is shown in Tables 3-4. Table 3 shows the operating results, i.e. whether the network could be returned to an operating state after the failure. Table 4 shows the average recovery times for the 50 tests run for each type of failure.
Table 5 summarizes the results presented above, including the reduction in recovery time (as a percentage) between manual recovery (whose data was taken prior to the beginning of this project) and the implemented automatic recovery.

System reliability
According to the specifications in Applied R&M Manual for Defense Systems Part D - Supporting [...] 3% as compared to the information supplied by IBM.
After validating the reference information for the specific reliability measurement experiment for the system under study, two continuous assessment periods of three (03) months (equivalent to 2160 hours) were considered.
Table 7 shows the conversion information used for periods of less than one year, equivalent to the data contained in Table 6.
Table 8, which summarizes the measured downtimes, was drawn from the information gathered during the three (03) months of testing. These values correspond to the average obtained during the testing period.
The measurements correspond to two conditions: manual recovery mode (data gathered between June and September 2012) and the implemented automatic recovery mode (data collected between October 2014 and January 2015).
In general, four (04) types of failures were detected in the system under study. 50% of the failures detected were due to token loss, whether because the NCS or the CU lost connection or because a PU left the network while holding the token. The other 50% of the failures were distributed as follows: 25% due to PU disconnections and 25% due to external factors (electronic warfare techniques).
75% of the failures detected in the system are a result of dead states in the system, while the remaining 25% are due to external factors. In the naval operating environment, direct energy radiation is the most commonly used electronic warfare technique (an external factor arising from the operating environment).
With the implementation of the automatic failure recovery system, the following can be affirmed:
• Comparing the data obtained in the tests run with manual recovery against the data obtained in the tests run with automatic recovery, we found a 61.75% reduction in the system's failure recovery time, going from an average recovery time of 241 seconds to an average of 102.25 seconds.
• We were able to increase the reliability of the data link system under study by 9 percentage points. The 61.75% average reduction in recovery times allowed the reliability of the system to increase from 90% (equivalent to 72.35 hours/month in which the system was down) to 99%.

Fig. 4 shows two possible failures, seen from the CU.

Fig. 11 shows the general layout and some subprocesses of the simulation model made with the ExtendSim 8 computing tool.

Fig. 17. Lab facilities where the tests were held.
Fig. 5. Detailed flowchart of the Participating Unit.

Fig. 10. Sequence of actions for the automatic system recovery (associated with an NCS dropout).

Synchronization times:
• Te: Lead time. It starts with a reception silence after a unit has already received the token. It is equivalent to 2To. This metric resets every time the NCS receives a message.
• Tdisturbance: Waiting time to receive connections. It makes it possible to determine whether the disturbed media (ECM) situation is present. It is equivalent to NTo, where N is the number of units in the table.
• Tic: Waiting time for units to switch to the new frequency (ECCM) and to synchronize the reconnection process.
• Treconnection: Time period that begins at the end of Tic and lasts until the connection time is finished (it is also equivalent to NTo).
• Tet: Time between tokens. It is the time of a token cycle. It comprises a time slot for each unit, multiplied by the number of units in the table.
• Tcr: Network cycle time. It is the time it takes to ensure that all units are aware of the NCS dropout.
• Tcm: Multi-modem configuration time. It is the time taken for the multi-modem to configure as an NCS.

Table 3. General summary of tests and results.

Table 4. Numerical summary of tests and results.

Table 6. Reliability/availability of a system according to system downtime.

Table 7. Reliability/availability of a system - system downtime equivalence.

Table 8. Reliability/availability of the system under study.