XIX IMEKO World Congress Fundamental and Applied Metrology September 6–11, 2009, Lisbon, Portugal

# SINGLE EVENT UPSET (SEU): DIAGNOSTIC AND ERROR CORRECTION SYSTEM FOR AVIONICS DEVICE

Lorenzo Ciani<sup>1</sup>, Marcantonio Catelani<sup>1</sup>, Lorenzo Veltroni<sup>2</sup>

<sup>1</sup> University of Florence, Department of Electronics and Telecommunications, via S. Marta 3, 50139, Florence (Italy) lorenzo.ciani@unifi.it, marcantonio.catelani@unifi.it <sup>2</sup> Sirio Panel S.p.A. – Loc. Levanella Becorpi – 52025 Montevarchi (AR)

**Abstract** – In aerospace applications, Commercial-Off-The-Shelf (COTS) Field Programmable Gate Arrays (FPGAs) are becoming increasingly attractive because they offer low-cost solutions, simplicity and flexibility.

This research addresses the problem of disturbances induced by high-energy particles in electronic devices. Based on a detailed analysis of this phenomenon, the work is divided into two parts. In the first part, the effects of the Single Event Upset (SEU) are evaluated in order to identify diagnostic and mitigation techniques for this disturbance, taking into account that testing is one of the fundamental issues in programmable electronic devices. In the second part, a fault tolerant technique is devised to meet the requirements of a real avionic system.

**Keywords**: Diagnostic system, Single Event Upset (SEU), Field Programmable Gate Array (FPGA).

## 1. ATMOSPHERIC RADIATION EFFECTS

The environment in which a system operates can influence the behaviour of the electronic components inside it, owing to high-energy ionizing and non-ionizing incident particles (electrons, ions, protons, neutrons, etc.).

Single Event Effects (SEE) [1], [2] are due to the action of a single particle crossing the substrate of an integrated circuit, whereas the other radiation effects are due to the cumulative action of the flow of particles to which the integrated circuit is subjected during its entire operating life.

Interest is mainly directed at SEE, because the increase in the integration scale of microelectronics has led to an increase in this kind of disturbance.

SEE are of different types and are divided into soft, firm and hard faults. In some cases, hard faults can be so catastrophic that they cause the breakdown of the device.

SEE are classified as:

• Soft Faults: Single Event Upset (SEU) and Single Event Transient (SET);

• Firm Faults: Single Event Functional Interrupt (SEFI);

• Permanent Faults: Single Event Latch up (SEL), Single Event Burnout (SEB) and Single Event Gate Rupture (SEGR).

In particular, single event upset (SEU) is defined by NASA as "radiation-induced errors in microelectronic circuits caused when charged particles (usually from the radiation belts or from cosmic rays) lose energy by ionizing the medium through which they pass, leaving behind a wake of electron-hole pairs".

When a SEU occurs, radiation deposits enough energy in a bistable element to flip its logic state (upset). This effect predominates among those produced by high-energy particle radiation. Upsets are readily observed in unprotected memory cells, particularly in SRAM (Static Random Access Memory) cells, as shown in Figure 1.



Figure 1. SRAM cell upset example.

When a charged particle, with a mechanism similar to that of a SEU, induces a transient on a logic gate, we talk about a SET. This effect is hard both to characterize and to predict, because no storage effect takes place as in the case of a SEU. It can therefore generally be observed only when the component is coupled with a memory element, such as a latch or flip-flop, which can store the transient as wrong information: in this case it is said that a SET has originated a SEU. Whether or not a SEU is obtained depends on the instant at which the logic transient occurs relative to the clock. If the transient occurs during the set-up or hold time of a register, the SET will produce a SEU.

## 2. CAUSES OF SINGLE EVENT UPSET

SEUs are caused by high-energy particles in cosmic radiation and, only to a small extent, by radioactive emission from the component package. Cosmic radiation consists of subatomic particles and high-energy photons, with a prevalence of protons produced by the thermonuclear reactions that occur in stars.

From this radiation, known as "primary", it is possible to distinguish a secondary radiation, composed of neutrons, protons and muons, which originates from collisions between primary rays and atoms of the atmosphere. Neutrons carry no electric charge; they are therefore less likely to be absorbed and have a high penetration capacity. This is why, at flight altitudes (30000-50000 ft), the prevalent particles responsible for upsets are neutrons [2], see Figure 2.



Figure 2. Atmospheric Radiation Environment.

To estimate the extent of the disturbance, a statistical approach is adopted: the upset rate, named Single Event Rate (SER), is calculated by means of statistical models that describe the flow of radiation. Such models have been developed by IBM [3], NASA [4], NRL (Naval Research Laboratory) [5], USNA (United States Naval Academy) [2] and Boeing [6]. The last one is the model usually adopted in the aeronautic field.

## 3. DIAGNOSTIC AND CORRECTION TECHNIQUES

The techniques for diagnosis and mitigation are classified into fault avoidance and fault tolerant techniques [7]. Fault avoidance consists of hardware techniques that reduce the sensitivity of the device to radiation, that is, reduce the probability that an upset will occur.

A large number of design solutions have been developed for memory cells, latches and registers; in particular, such solutions aim either to reduce the bandwidth of the cell, to achieve immunity to the transient caused by collected charge, or to provide redundant storage or blocking provisions to prevent upset.

Although hardening is effective in improving the single event upset characteristics of a cell, its disadvantages are that cost and die size may increase and device performance is typically reduced. Moreover, the reduced bandwidth conflicts with high-speed operation.

Fault tolerant techniques, instead, allow the system to function even in the presence of a fault. These techniques use redundancy to mask, correct or reveal upsets, and are the same ones used to protect digital systems from any other type of error. Variations include informational redundancy (redundant data structures), spatial redundancy (redundant hardware) and temporal redundancy (redundant sequential operations).

The most common way of mitigating SEUs in semiconductor devices is by error detection and correction (EDAC). Today, a large number of designs incorporate some form of EDAC. Some common methods of EDAC are shown in Table 1 [8].

Table 1. EDAC methods.

| EDAC Method                   | EDAC Capability                                          |
|-------------------------------|----------------------------------------------------------|
| Parity                        | Single bit error detect                                  |
| Cyclic Redundancy Check (CRC) | Detects if any errors have occurred in a given structure |
| Hamming Code                  | Single bit correct, double bit detect                    |
| Reed-Solomon Code             | Corrects multiple and consecutive bytes in error         |

The increase in hardware content required by informational redundancy is typically smaller than that required by spatial (modular) redundancy. Informational redundancy consists in the addition of k control bits to the m information bits, so obtaining a code of m+k bits. This code provides $2^{m+k}$ combinations: $2^m$ of them are valid words, while the others are invalid words, so that the minimum Hamming distance of the code can be made greater than one. Recall that the Hamming distance is defined as the number of bits in which two valid words differ. It is also known that if the minimum Hamming distance of the code is equal to t+1, t errors can be detected, whereas it must be at least 2t+1 to correct t errors.
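The relation between minimum Hamming distance and error detection or correction capability can be illustrated with a short Python sketch. The two example codes (an even-parity code on two information bits and a three-fold repetition code) and all function names are illustrative assumptions, not taken from the system described in this paper.

```python
from itertools import combinations


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def min_distance(codewords) -> int:
    """Minimum Hamming distance over all pairs of valid words."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def capability(d_min: int) -> dict:
    """d_min = t+1 detects t errors; d_min >= 2t+1 corrects t errors."""
    return {"detect": d_min - 1, "correct": (d_min - 1) // 2}


# Illustrative code 1: m = 2 information bits plus k = 1 even-parity bit.
parity_code = [(w << 1) | (bin(w).count("1") & 1) for w in range(4)]

# Illustrative code 2: one information bit repeated three times.
repetition_code = [0b000, 0b111]

for name, code in (("parity", parity_code), ("repetition", repetition_code)):
    d = min_distance(code)
    print(name, "d_min =", d, capability(d))
```

With these assumptions the parity code has minimum distance 2 (one error detected, none corrected), while the repetition code has minimum distance 3 (one error corrected, or two detected).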

The fault tolerant techniques most suitable for this kind of disturbance have been studied:

• The *parity bit* technique needs the addition of only one control bit, which makes the number of "1"s in the word even. This technique allows an odd number of errors to be detected but does not allow them to be corrected.

• *Single Error Correction – Double Error Detection* (SEC-DED) allows single errors to be corrected and upsets on two bits to be detected. Since a SEU affects a single bit and a Multiple Bit Upset (MBU) mainly strikes two bits, the SEC-DED technique is particularly suited to this type of disturbance.

• *Cyclic Redundancy Check* (CRC) only allows errors to be detected. Its performance depends on the algorithm, which determines the number and form of the control bits.

• *Triple Modular Redundancy* (TMR): a fault in an individual module is corrected by a voter that applies the majority consensus (two-out-of-three) voting rule.
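Two of the techniques listed above, the parity bit and the TMR voter, are simple enough to sketch in a few lines of Python. The 8-bit word and the function names are assumptions chosen for illustration; in hardware both would be implemented as combinational logic.

```python
def even_parity_bit(word: int) -> int:
    """Control bit that makes the total number of 1s (word + bit) even."""
    return bin(word).count("1") & 1


def parity_ok(word: int, parity: int) -> bool:
    """True when the stored parity bit still matches the word."""
    return even_parity_bit(word) == parity


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority voter."""
    return (a & b) | (a & c) | (b & c)


word = 0b10110010                 # illustrative 8-bit word
parity = even_parity_bit(word)

single_upset = word ^ 0b00000100  # one bit flipped
double_upset = word ^ 0b00010100  # two bits flipped

print(parity_ok(single_upset, parity))   # odd number of errors: detected
print(parity_ok(double_upset, parity))   # even number of errors: missed
print(tmr_vote(word, single_upset, word) == word)  # faulty copy outvoted
```

The sketch shows the limitation stated in the list: parity reveals an odd number of flipped bits but misses an even number, whereas the voter masks any fault confined to one of the three copies.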

In some applications, the disturbance produced by high-energy particles on an electronic device can be unacceptable. This is the case for avionics and space applications, where project requirements demand high reliability levels and, moreover, the extent of the disturbance is such that it cannot be ignored.

## 4. TECHNIQUE DEVELOPED FOR AN AVIONIC SYSTEM

In order to prove the validity of the proposed technique, we consider an avionic application. In particular, the aeronautical system under examination is an integrated control panel for a military aircraft cockpit. The main task of the system is to set the altitude of the landing airport by means of an encoder, to show the selected value on a display and to transmit this information to the other systems present in the aircraft cockpit. The block diagram of the integrated control panel is shown in Figure 3.



Figure 3. Integrated control panel block diagram.

The functions of the system are carried out by a logic device implemented in an FPGA. This component receives signals from the encoder and translates them into information that is shown on the display via the I2C (Inter-Integrated Circuit) bus and, at the same time, sent to the other subsystems of the cockpit via the CAN (Controller Area Network) bus.

In this system, the FPGA is the element subject to the upset phenomenon.

Once the FPGA has been identified as the critical component of the surveillance display, it is necessary to examine in detail the disturbances caused by high-energy particles in this particular component. In the literature [9], errors in FPGAs are classified as:

• *Permanent Errors* – they occur if the configuration memory is involved; permanent errors are considered the worst type of error and can be removed only by reconfiguring the memory. It is important to observe that these errors differ from those which damage the device (hard errors or physical defects): here the configuration bit remains erroneous only until the new configuration is downloaded into the FPGA, so these permanent errors are recoverable.

• *Transient Errors* – they are errors localized in the combinational logic components, in the registers and in the user memory. These errors are called transient because they may be overwritten or corrected using error-detection-and-correction techniques.

Table 2 reports the upset rates evaluated for the configuration memory, the user memory and the registers implemented in this project. The values are obtained from tests carried out at the Los Alamos Laboratory (New Mexico, USA) [10] and are related to the airborne environment by means of the Boeing model [6].

Table 2. FPGA upset rates for the EP2C5T Altera Cyclone II (SRAM, 90 nm).

| Configuration memory upset rate                         | Registers upset rate                                 | User memory upset rate                                |
|---------------------------------------------------------|------------------------------------------------------|-------------------------------------------------------|
| $\lambda_{C-RAM} = 1.08 \times 10^{-4}$ upsets/(chip·h) | $\lambda_{reg} = 1.6 \times 10^{-6}$ upsets/(chip·h) | $\lambda_{mem} = 2.39 \times 10^{-5}$ upsets/(chip·h) |

We can observe that the configuration memory has the highest upset rate, because its cells are realized in SRAM technology, which is highly sensitive to this disturbance.
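To put the figures of Table 2 in perspective, the following Python sketch sums the three rates and derives the share of each resource and the mean time between upsets for one chip. It assumes that upsets in the three resources are independent events with constant rates, so that the rates simply add; this assumption and the function names are ours and are not stated in the test report.

```python
# Upset rates from Table 2, in upsets per chip per hour.
UPSET_RATES = {
    "configuration memory": 1.08e-4,
    "registers": 1.6e-6,
    "user memory": 2.39e-5,
}


def total_rate(rates: dict) -> float:
    """Overall upset rate, assuming independent constant-rate processes."""
    return sum(rates.values())


def shares(rates: dict) -> dict:
    """Fraction of the overall rate contributed by each resource."""
    total = total_rate(rates)
    return {name: rate / total for name, rate in rates.items()}


def mean_hours_between_upsets(rates: dict) -> float:
    """Reciprocal of the overall rate, in hours per chip."""
    return 1.0 / total_rate(rates)


print(f"total rate: {total_rate(UPSET_RATES):.3e} upsets/(chip*h)")
for name, share in shares(UPSET_RATES).items():
    print(f"{name}: {share:.1%}")
print(f"mean time between upsets: {mean_hours_between_upsets(UPSET_RATES):.0f} h")
```

Under these assumptions the configuration memory accounts for roughly 81% of the overall rate of about 1.3 × 10⁻⁴ upsets/(chip·h), which corresponds to a mean time between upsets of about 7500 hours per chip.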

We chose informational redundancy techniques in order to avoid an excessive use of hardware redundancy, which often leads to a waste of resources; this choice allowed a remarkable cost reduction and a significant increase in the available resources and in the speed of the system.

By means of a detailed risk analysis and assessment, and of a reliability analysis of the possible design options, the best techniques have been chosen in order to comply with the project requirements. This analysis identifies the possible system states and the reliability performance with which the system has to comply, such as the probability of occurrence of faults (Table 3).

Table 3. System States.

| System State              | On                               | Loss                                 | Erroneous                        |
|---------------------------|----------------------------------|--------------------------------------|----------------------------------|
|                           | aren't errors or they            | the presence of errors in the system | errors not detected              |
| Risk Classification       | No Effects                       | Major                                | Catastrophic                     |
| Probability of Occurrence | 10<sup>-3</sup> per flight hour  | 10<sup>-5</sup> per flight hour      | 10<sup>-9</sup> per flight hour  |

Protection of the configuration memory is obtained by means of the CRC technique. The FPGA used in the project supports an automatic CRC check of up to 32 bits, which is enabled when the chip is programmed [11]. Alternatively, it is possible to use FPGAs with Flash or antifuse configuration memory, which is immune to these disturbances. These FPGAs have been developed with the clear intention of removing the upset problem that affects the more common SRAM-based FPGAs and of offering a higher level of immunity to upsets, but they are much more expensive than Commercial-Off-The-Shelf FPGAs. Whenever possible, however, the use of SRAM-based FPGAs, which normally offer better performance and can be protected with fault tolerant techniques, is preferred.
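The principle of CRC-based detection of a configuration upset can be sketched in Python as follows. The example uses the CRC-32 of the standard `zlib` module as a stand-in; the polynomial and checking procedure actually implemented in the FPGA hardware are not reproduced here, and the 64-byte "configuration frame" and the function names are purely illustrative assumptions.

```python
import zlib


def crc32_of(frame: bytes) -> int:
    """CRC-32 signature of a block of configuration data (zlib polynomial)."""
    return zlib.crc32(frame) & 0xFFFFFFFF


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Return a copy of the frame with one bit inverted, emulating a SEU."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical configuration frame and its reference signature,
# computed once when the device is programmed.
golden_frame = bytes(range(64))
reference_crc = crc32_of(golden_frame)

# Periodic check: recompute the signature and compare it with the reference.
upset_frame = flip_bit(golden_frame, 137)
print("frame intact:   ", crc32_of(golden_frame) == reference_crc)
print("upset detected: ", crc32_of(upset_frame) != reference_crc)
```

As noted in the list of techniques, the CRC only reveals that the configuration has changed; recovery still requires downloading the configuration again.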

Therefore, the SEC-DED fault tolerant technique is introduced, differentiating between harmful and harmless errors: only the data whose corruption leads to a lasting variation of the function have been protected, which reduces the complexity of the additional code. The protected data are the state signals of the state machine and the data contained in the ROM (Read-Only Memory) implemented in the system.

An extension of the SEC-DED technique has been realized for the state machine: SEC-DED is used together with the parity bit method, so that the registers are protected against upsets by the SEC-DED code and the signals inside the logic are protected against transients by the parity bit. This is obtained by adding a parity bit to the coding and by applying the SEC-DED code to the word. In this case it is necessary to implement a SEC-DED decoder inside the FPGA.

Figure 4 shows the block diagram of the fault tolerant technique for the state machine, where *state* is the state value register and *output* is the output value register.

The next state and the output of this state machine are functions of the input and of the current state, computed by the combinational logic components indicated as *lambda* (for the output) and *delta* (for the state).



Figure 4. Block diagram of the State Machine fault tolerant technique.

The user memory is protected by storing the data directly with the SEC-DED code and by checking the correctness of the data read out with a SEC-DED decoder (see Figure 5).
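A minimal Python sketch of a SEC-DED encoder and decoder is given below to illustrate the store-and-check scheme. It uses an extended Hamming (8,4) code on a 4-bit data word; the word width, the bit layout and the function names are assumptions made for the example and do not describe the code implemented in the actual design.

```python
def secded_encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Bit i of the result is codeword position i: positions 1, 2, 4 hold the
    Hamming check bits, 3, 5, 6, 7 the data, 0 the overall parity bit.
    """
    bits = [0] * 8
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1          # makes the whole word even parity
    return sum(b << i for i, b in enumerate(bits))


def secded_decode(code: int):
    """Return (data, status); data is None when a double error is detected."""
    bits = [(code >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):              # XOR of the positions holding a 1
        if bits[pos]:
            syndrome ^= pos
    overall_odd = sum(bits) & 1
    if overall_odd:                      # single error at position `syndrome`
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome:                       # even parity but non-zero syndrome
        return None, "double error detected"
    else:
        status = "ok"
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status


stored = secded_encode(0b1011)
print(secded_decode(stored))                 # no upset
print(secded_decode(stored ^ 0b00100000))    # single upset: corrected
print(secded_decode(stored ^ 0b00100010))    # double upset: detected only
```

In this sketch a single flipped bit is located by the syndrome and corrected, while two flipped bits leave the overall parity even with a non-zero syndrome and are therefore only reported.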



Figure 5. Block diagram of the user memory fault tolerant technique.

In order to verify the design choices, several specific tests have been carried out, simulating the presence of upsets in the system, and it has been shown that the design provides the required fault tolerance. The system has also shown a higher capability than expected in detecting multiple faults (quadruple error detection).

## 5. CONCLUSIONS

In this paper a mitigation technique for an SRAM-based FPGA avionics device has been proposed.

The knowledge acquired has allowed us to identify diagnostic and mitigation techniques for this disturbance, and to assess both the need to introduce fault tolerant techniques and which of them allow us to comply with the project requirements.

Therefore a fault tolerant technique for a system present on a military aircraft has been analyzed and devised. The technique which has been developed is general purpose and can be introduced into any generic electronic device on an aircraft.

Reliability problems of electronic systems due to radiation disturbance also affect application fields other than aerospace (automotive, railway, biomedical). The know-how acquired and the diagnosis and mitigation techniques developed can be used in other critical applications exposed to radiation phenomena.

## REFERENCES

- [1] Fan Wang and Vishwani D. Agrawal, "Single Event Upset: An Embedded Tutorial", Proc. of 21st International Conference on VLSI Design, Hyderabad – India, 4-8 Jan. 2008.
- [2] Justin A. Sarlese, Development of a semi-empirical model for SEUs in modern DRAMs, U.S. Naval Academy, Annapolis – MD, 2000.
- [3] J. F. Ziegler, "Terrestrial cosmic rays intensities" IBM J. Res. Develop., Vol. 42, No. 1, 1998.
- [4] E. Normand, T. J. Baker, "Altitude and latitude variations in avionics SEU and atmospheric neutron flux", IEEE Transactions on Nuclear Science, Vol. 40, No. 6, Part 1-2, Dec. 1993.
- [5] James H. Adams, Jr., Rein Silberberg and C. H. Tsao, "Cosmic Ray Effects on Microelectronics", IEEE Transactions on Nuclear Science, Vol. NS-29, No. 1, Feb. 1982.

- [6] E. Normand, "Single-Event Effects in Avionics", IEEE Transactions on Nuclear Science, Vol. 43, No. 2, Part 1, April 1996.
- [7] W. Heidergott, "SEU Tolerant Device, Circuit and Processor Design", Proc. of the 42nd Annual ACM IEEE Design Automation Conference, Anaheim, CA, USA, 13 - 17 June 2005.
- [8] M. Kochar, K. Murchek, Single event upsets in FPGAs, Quicklogic Corporation, 2003.
- [9] G. Asadi, M. B. Tahoori, "Soft Error Rate Estimation and Mitigation for SRAM Based FPGAs", Proc. of 13th international symposium on Field-programmable gate arrays, Monterey, CA – USA, February 20 - 22, 2005.
- [10] Overview of iROC Technologies Report: radiation results of the test of FPGA December 2005, Actel, Aug. 2006.
- [11] *Error Detection and Recovery Using CRC in Altera FPGA Devices*, Altera Application Note 357, Jan. 2007.