ABSTRACT
This paper addresses two reliability-based security threats and their mitigations for embedded systems, namely aging and thermal side channels. Device aging can be used as a hardware attack vector by using voltage scaling or specially crafted instruction sequences to violate embedded processor guard bands. Short-term aging effects can be utilized to cause transient degradation of the embedded device without leaving any trace of the attack. (Thermal) side channels can be used both as an attack vector and as a defense. Specifically, thermal side channels are an effective and secure way to remotely monitor code execution on an embedded processor and/or to possibly leak information. Although various algorithmic means to detect anomalies are available, machine learning tools are effective for this task. We show how deep learning networks can be used in conjunction with thermal side channels to detect code injection/modification, which constitutes an anomaly.

KEYWORDS
Embedded Systems; Cyber-Physical Systems; Reliability; Long-Term Aging; Short-Term Aging; Side Channels; Thermal Measurements; Infrared Images.

1 INTRODUCTION
Modern embedded systems (ES), industrial control systems (ICS), and other cyber-physical systems (CPS) are becoming complex interconnected systems of heterogeneous hardware and software components such as sensors, actuators, controllers, physical systems/processes that are controlled or monitored, computational nodes, and communication interfaces and protocols. Increasing network connectivity and remote programmability of embedded devices in CPS is increasing the attack surface. At the same time, these capabilities simplify deployment and maintenance of CPS that are geographically spread. One can appreciate the potential widespread and possibly long-lasting impact of attacks on embedded systems, especially when an attacker uses knowledge of the process dynamics to craft process-aware attacks that maximize process impact, elude detection, or both. The complexity and connectivity of embedded devices in CPS necessitate robust cyber-security techniques [20, 21, 42, 45, 53]. There have been several publicized attacks on CPS over the past few years [2, 3, 16, 18, 27, 47, 48, 50, 62, 69]. The number of incidents that the ICS Cyber Emergency Response Team (ICS-CERT) received and responded to in the US increased from 245 in 2014 to 295 in 2015 [2, 3].

While cyber-security for embedded systems is a broad research area spanning hardware, firmware, and software, both in terms of threats and mitigations, this paper addresses two specific directions. First, the paper addresses the security impact of device aging - both long-term aging and short-term aging. The second direction considered in this paper is thermal side channels for remote monitoring of an embedded system (or possibly creating leaks).

When a device is aged either by running specific instruction sequences or by varying the device supply voltage or the clock, it can create temporal violations of the device guard band. Aging is relevant to embedded systems both as a threat and as a mitigation. On one hand, aging degrades an embedded system impacting the physical processes in the CPS. Malicious aging can be utilized to launch a warranty attack (by a malicious consumer who wishes to wear out a device to misuse the warranty) or planned obsolescence (by a manufacturer who wishes to prematurely degrade a device to force users to replace/upgrade the device). On the other hand, aging can be used as a hardware-level signature of the system to detect counterfeit/damaged/compromised embedded devices.

When an embedded device executes code, the underlying physical processes on the device result in analog emissions called side channels. Side channel modalities include thermal, power, electromagnetic (EM), magnetic, and acoustic emissions and reflect, in general, information about the characteristics of the code being executed with different temporal scales and resolutions depending on the side channel modality. While side channels have been extensively studied in the context of information leakage, i.e., remotely inferring properties about code executing on a device with/without the assistance of malicious code resident on the device, we consider here the possibility of using side channels to remotely monitor the device to continuously verify that the device is operating as intended (e.g., to detect code modifications due to cyber-attacks). In particular, we consider the thermal side channel of an embedded device and show that thermal images can be used to remotely extract information on device activity patterns.

This paper is organized as follows: Device and circuit aging and its relevance to embedded systems security is addressed in Section 2. Side channels of embedded devices are discussed in Section 3 including, in particular, the thermal side channel. Some concluding remarks are provided in Section 4.

2 AGING IN CIRCUITS

Aging is one of the major concerns in current and emerging CMOS technologies, where displacing just a few atoms inside transistors due to aging phenomena may degrade the functionality of the circuit. Negative and Positive Bias Temperature Instability (NBTI and PBTI) are the most prominent phenomena. In general, BTI increases the delays of PMOS and NMOS devices and hence circuits become slower over time [10]. To ensure the correct functioning of a circuit, guard bands (i.e., safety margins) are included in order to compensate for and overcome any such delay increases during the projected lifetime of the device. In advanced technologies, larger guard bands are necessary since only a few defects within a transistor can degrade its functionality.

A guard band is an over-design of the circuit to tolerate degradations to sustain reliable operation during its projected lifetime. A timing guard band can be implemented by adding extra time on top of the maximum delay of the circuit as shown in Eq. 1 [12]:

\[
t_{\text{clk. period}} = t_{\text{delay}}(\text{critical path}) + t_{\text{GB}}.
\]

To avoid the increase in overall power and hence on-chip power density and temperature, timing guard bands are employed [11]. From a reliability perspective, circuit designers ponder two aging-related questions: a) how can one accurately estimate guard bands? and b) what is the smallest, yet reliable guard band?
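The relation in Eq. 1 can be illustrated with a minimal sketch. The function names and the numeric values (a 1.00 ns critical path and a 0.10 ns guard band) are hypothetical and chosen only for illustration; aging is modeled here simply as a fractional increase of the critical-path delay.

```python
def clock_period(t_critical_path: float, t_guard_band: float) -> float:
    """Eq. 1: clock period = critical-path delay + timing guard band."""
    return t_critical_path + t_guard_band


def violates_timing(t_fresh: float, aging_increase: float, t_clk: float) -> bool:
    """True when the aged critical-path delay no longer fits in one clock period.

    aging_increase is the fractional delay increase due to aging (0.05 = 5%).
    """
    return t_fresh * (1.0 + aging_increase) > t_clk


# Hypothetical numbers (not from the paper): delays in nanoseconds.
t_clk = clock_period(1.00, 0.10)
ok_after_5pct = not violates_timing(1.00, 0.05, t_clk)   # 1.05 ns fits in 1.10 ns
fails_after_15pct = violates_timing(1.00, 0.15, t_clk)   # 1.15 ns exceeds 1.10 ns
```

In this toy setting the guard band tolerates any aging-induced delay increase of up to 10% of the fresh critical-path delay; anything larger causes a timing violation.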

2.1 Long-Term Aging

Classic long-term aging considers BTI-based mechanisms that only increase the threshold voltage of transistors \(V_{\text{th}}\), which leads to a gradual increase in the circuit’s delay. NBTI occurs when traps are generated at the Si-SiO\(_2\) interface as a negative voltage is applied to a PMOS device [14]. (N)BTI increases the magnitude of the threshold voltage \(V_{\text{th}}\) of the PMOS transistor under stress and hence degrades the delay through it. At the circuit level, this manifests as circuit timing and functional failures [5, 43, 66].

Figure 1 shows the threshold voltage drift of a PMOS transistor at an operating temperature of 80°C that is continuously under stress for 6 months (blue) as well as a transistor that is under stress and recovery every other month (red). In practice, a PMOS transistor experiences stress (when the transistor is on, i.e., when a negative voltage is applied to its gate) and a recovery (when a positive voltage is applied to the gate of the transistor). The impact of BTI on circuit performance has become severe, especially after the introduction of high-k gate dielectrics since the 45 nm technology node [73]. For long-term aging, we use the model from [73].

However, other considerations have been noticed recently. The different types of defects (i.e., interface and oxide traps) generated by aging-related phenomena interact with the applied electric fields in the transistors and manifest as different degradations. Besides increasing \(V_{\text{th}}\), aging degrades other device parameters including carrier mobility (\(\mu\)), transconductance (\(g_m\)), drain current (\(I_d\)), sub-threshold slope (SS), and gate-drain capacitance \(C_{gd}\) [9].

In embedded systems that do not switch frequently between high and low operating voltages (i.e., chips that do not employ Dynamic Voltage and Frequency Scaling (DVFS)), considering BTI-based aging may be sufficient.

2.2 Limitations of Long-Term Aging Models

Considering the impact of aging only on \(V_{\text{th}}\) underestimates the overall impact of aging on circuit delay. Figure 2 confirms this with our aging analysis for the Berkeley Out-of-Order Machine (BOOM). Considering only \(V_{\text{th}}\) underestimates the impact of aging on timing guard bands by about 22%. Neglecting the impact of aging on carrier mobility underestimates the guard bands by 11% [7].

Besides accurately modeling all physical origins of aging, considering the operating conditions of the embedded system is a pre-requisite to improve the accuracy of estimates for the guard bands. Such operating conditions include voltage, temperature, and duty cycle (i.e., % of time the transistor is under stress).

The operating conditions stimulate the aging mechanisms [63]. We used different hardware and software approaches to study how workloads running on the system accelerate aging-induced degradation [8]. The workloads determine how a chip ages and how long the guard band can hold before timing violations start to occur. As a result of aging-induced degradations, transistors gradually slow down over their lifetime.

---

1 Based on our measurements on devices at the 45 nm technology node.

2 Software download: The short-term aging models, aging-aware cell libraries, reliability framework, etc. are publicly available at [36].

Further, to accurately estimate the guard band in advanced technology nodes, Random Telegraph Noise (RTN) needs to be considered as well. This is because while BTI is the dominant aging phenomenon in CMOS at high voltages (e.g., 1.2 V) [41], RTN is the dominant aging phenomenon at lower voltages (e.g., 0.7 V) [71]. In [71], we reported the first comprehensive model that jointly considers BTI- and RTN-based aging, allowing designers to accurately assess the impact of long-term and short-term aging-induced degradation across a wide range of voltages (\(V_{dd} \in [0.4\,\text{V}, 2.1\,\text{V}]\)).

2.3 Short-Term Aging [70]

Besides the long-term reliability mechanisms which gradually increase the delay in circuits, there has been a paradigm shift in our understanding: aging induces short-term reliability degradation as well [70]. The reason behind this is the integration of voltage regulators that support ultra-fast voltage switching (sub-\(\mu s\)) in Intel Haswell [17] and other chips. While ultra-fast voltage switching reduces the overhead of voltage switching, it considerably accelerates aging. Every time the voltage switches from high \(V_{dd}\) to low \(V_{dd}\), the circuit becomes sensitive to aging degradation [70]. While this increase in sensitivity to aging follows the voltage changes instantaneously, the recovery from degradation takes time.

When the circuit operates at high \(V_{dd}\), higher degradations accumulate. When the voltage rapidly switches (i.e., within \(\leq 1\mu s\)) to a lower \(V_{dd}\), the higher sensitivity at the lower \(V_{dd}\) combined with the high degradations (accumulated at the previous high \(V_{dd}\)) causes a temporal violation of the employed guard band. Such a violation is transient because at the lower \(V_{dd}\), recovery mechanisms kick in, healing the accumulated aging. Figure 3 demonstrates such guard band violations using SPICE simulations that employ a physics-based aging model that accounts for voltage dynamics [31]. Not every high-to-low \(V_{dd}\) switching causes a transient error. The two pre-requisites for short-term aging induced transient errors are a) the circuit spends sufficient time at the high \(V_{dd}\) to accumulate enough degradations and b) the voltage switches to a sufficiently low \(V_{dd}\) to amplify the impact of the degradation.

If the stress is not long enough or the voltage switches to only a slightly lower \(V_{dd}\), transient errors may not occur. Short-term aging does not occur when switching from low \(V_{dd}\) to high \(V_{dd}\). This is because degradations at the low \(V_{dd}\) are weak and the resiliency to degradation effects is large at high \(V_{dd}\).
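The conditions under which a voltage switch may lead to a transient error can be summarized as a simple predicate. This is only a sketch of the qualitative rule stated above: the parameter names and the two thresholds (`min_stress_time_s`, `max_low_vdd`) are hypothetical, since the actual values depend on the technology and on the physics-based aging model.

```python
def may_cause_transient_error(
    v_from: float,
    v_to: float,
    time_at_v_from_s: float,
    min_stress_time_s: float,
    max_low_vdd: float,
) -> bool:
    """Qualitative check of the pre-requisites for short-term aging errors.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    high_to_low = v_to < v_from                                  # low-to-high switches are safe
    stressed_long_enough = time_at_v_from_s >= min_stress_time_s  # pre-requisite a)
    low_enough = v_to <= max_low_vdd                              # pre-requisite b)
    return high_to_low and stressed_long_enough and low_enough


# Hypothetical thresholds: at least 1 s of stress, target Vdd of at most 0.8 V.
risky = may_cause_transient_error(1.2, 0.7, 5.0, 1.0, 0.8)
short_stress = may_cause_transient_error(1.2, 0.7, 0.1, 1.0, 0.8)
small_step = may_cause_transient_error(1.2, 1.1, 5.0, 1.0, 0.8)
low_to_high = may_cause_transient_error(0.7, 1.2, 5.0, 1.0, 0.8)
```

Only the first case satisfies all conditions; the other three correspond to insufficient stress time, an insufficiently low target voltage, and a low-to-high switch, respectively.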

In emerging embedded systems that switch frequently between high and low operating voltages (i.e., chips that employ DVFS) in order to meet performance and power constraints, it is a prerequisite to consider BTI- and RTN-based aging effects.

2.4 Malicious Aging [39]

Deliberately accelerating aging degradation can undermine embedded systems and is an emerging security threat. An adversary may maliciously accelerate the aging of an IC (MAGIC) and shorten its useful life as shown in Fig. 4. MAGIC does not need access to the IC user in an untrusted environment and it can cause catastrophic failure of the system [39]. MAGIC exploits the fact that the circuit delay is input dependent and that the circuit exhibits its worst-case delay for specific input patterns [77]. A MAGIC attack identifies such input patterns and constructs a malicious program that generates them. Executing this malicious program on the embedded device continues to age the chip and can cause it to fail sooner than expected, i.e., it shortens the chip’s normal lifetime; the attack will be especially effective in mobile phones, tablets, and PCs. We used the long-term aging models [73] to demonstrate MAGIC. One can envision at least two types of MAGIC attacks [39].

**Warranty Attack:** A consumer C purchases a device manufactured by company X. The warranty period for this device is \(W\) months. Consumer C uses the device for a while but, while the device is still under warranty, physical damage occurs (e.g., a scratch on the LCD). C wants to get a new device but the warranty does not cover physical damage. C downloads the MAGIC program and the OS from forums such as Cyanogenmod [1], executes the MAGIC program to intentionally wear out the device, and returns the device to X to get a new one. In the warranty attack, the user attempts to brick the device. The warranty attack may not inflict a considerable financial loss on the victim manufacturer, but may hurt its reputation. In the warranty attack, MAGIC is created by an expert attacker and launched by a non-expert malicious user. The expert attacker is usually an insider with access to the processor netlist and can use VLSI CAD tools. The expert attacker creates a MAGIC program and distributes it to users to bring disrepute to the manufacturer by wearing out even a few devices. On the other hand, the goal of the novice user is to damage their device before the warranty period expires and exchange it for a new one.

![Figure 2: \(\Delta V_{th}\) is one of many factors to consider when investigating the impact of aging on circuit delay.](image1)

![Figure 3: Aging-related degradation at high \(V_{dd}\) plus ultra-fast voltage switching cause short-term aging which results in transient errors by temporarily violating the timing guard band.](image2)

**Planned Obsolescence:** A malicious manufacturer M slows down the previously sold devices in order to nudge (force) its customers to buy a recently released device. M sends a patch to its customers before releasing a new device. Installing such a patch slows the older devices, forcing the users to buy the new device [55, 59, 65, 67, 78]. Planned obsolescence financially benefits the manufacturer by nudging its customers to upgrade to the latest device. The manufacturing company wears out the device. Planned obsolescence makes the device stop working properly while the malicious aging is hidden from the user, i.e., the manufacturer pretends that the device has aged normally and hides that the aging is due to MAGIC.

Figure 5 shows the design flow (solid box) and the MAGIC attack launched within this flow (dotted box). The processor is synthesized and netlist and layout are generated. The layout is sent for fabrication. After IC testing, fault-free ICs are shipped. The warranty attack can be launched as follows:

**Step 1:** A malicious insider in the design house obtains the processor netlist.

**Step 2:** The attacker identifies the critical path in the processor and creates input patterns that place this path under BTI stress. The attacker analyzes the processor Instruction Set Architecture (ISA) and crafts instructions to create a MAGIC program to generate the above patterns.

**Step 3:** The attacker uploads the MAGIC program to a website and non-experts download and execute it on their processors.

MAGIC [39] was demonstrated on the OpenSPARC T1 processor [56]. The degradation was evaluated using the long-term BTI-based aging model from [73]. The execute stage (E-stage) in the processor was maliciously aged. When the MAGIC program was executed, the performance of the E-stage degraded by 10.92%, 13.25%, and 16.8% after one, two, and six months, respectively, bypassing guard band and other protections, causing the processor to fail. MAGIC patterns can be generated for any pipeline stage. For OpenSPARC, we chose the E-stage as it has the critical paths.
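To make the relation between the reported degradation figures and a guard band concrete, the following sketch finds the first reported time point at which the degradation exceeds a given guard band. The degradation values are the ones quoted above; the guard band percentages are hypothetical, and treating the reported performance degradation as directly comparable to a guard band expressed in percent is a simplifying assumption.

```python
from typing import Optional, Sequence, Tuple

# (months of MAGIC execution, reported E-stage degradation in percent)
REPORTED_DEGRADATION = [(1, 10.92), (2, 13.25), (6, 16.8)]


def first_violation_month(
    degradation: Sequence[Tuple[int, float]], guard_band_pct: float
) -> Optional[int]:
    """Return the first reported month whose degradation exceeds the guard band.

    Returns None if no reported data point exceeds the guard band.
    """
    for months, pct in degradation:
        if pct > guard_band_pct:
            return months
    return None


# Hypothetical guard bands (assumptions for illustration only).
month_for_10pct = first_violation_month(REPORTED_DEGRADATION, 10.0)
month_for_12pct = first_violation_month(REPORTED_DEGRADATION, 12.0)
month_for_20pct = first_violation_month(REPORTED_DEGRADATION, 20.0)
```

With these assumed guard bands, a 10% margin is already exceeded at the one-month data point, a 12% margin at the two-month data point, and a 20% margin is not exceeded by any of the reported points.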

Post-manufacture critical path may differ from design-time critical path due to process variations. After manufacturing, when the MAGIC program is executed, the design-time critical path will age and become longer than the manufacture-time critical path. We observed that the top 10 longest paths in the E-stage of the OpenSPARC were within 2% of each other. The change in threshold voltage, and in turn the critical path delay, is affected by the temperature. Thus, temperature is another knob for the attacker.

2.5 What are the Security Implications of Short-term Aging?

Short-term aging manifests as a temporal violation of the guard band resulting in transient errors (i.e., timing violations caused by the increased path delay). We have investigated, and continue to investigate, important questions along the following directions: a) whether short-term aging attacks can be launched in a controlled manner to transiently undermine the security of on-chip systems at specific instances and for specific durations, b) short-term aging attacks to accelerate the warranty and obsolescence attacks that were demonstrated using long-term aging, c) short-term aging as a standalone attack vector; if the temporal violation of the guard band is large enough, the on-chip system becomes unstable because of the unsustainable clock frequency, leading to errors/crashes in the software due to the induced transient timing errors, and d) short-term aging as a trigger for a Trojan that is inserted in the chip (e.g., by a rogue in the foundry) and lies dormant until short-term aging effects exceed a threshold and then activates its malicious behavior.

3 SIDE CHANNELS [57]

Various analog side channel modalities including thermal, power, electromagnetic (EM), magnetic, and acoustic are relevant to embedded devices and have been heavily studied in the literature [4, 13, 15, 19, 23, 25, 26, 28, 29, 32–35, 37, 38, 40, 51, 52, 54, 60, 61, 64, 68, 72]. Various efforts have addressed side channels such as electromagnetic (EM) [22, 26, 28, 29, 32, 54, 72], acoustic [22, 30, 37, 49], thermal [35, 38], magnetic [15], and power [24, 61]. Side channels can leak information from air-gapped devices by running malicious code on the device so as to create signatures in the side channels which, when monitored, can yield sensitive information. For example, a computer infected with malicious code that excites specific radio frequency signal patterns using the graphics card can leak information to a mobile phone with an FM radio receiver [34].

The majority of the prior works have addressed side channels in the context of emission patterns from digital devices and information leakage through these analog side channels. Information leakage can be viewed as a special case of monitoring wherein specifically crafted code on the device generates sequences of activity patterns (of the processor load, clock, memory bus, peripherals, etc.) that can be decoded by an air-gapped receiver to extract messages (e.g., a sequence of bits) sent by the code resident on the device [44]. Multiple side channel modalities are applicable for remote monitoring of the device state (e.g., whether the code execution on the device matches nominal expected patterns or exhibits anomalies).

EM signals enable high-bandwidth monitoring of embedded processors (up to several GHz) and code execution patterns within the device by observing the signal frequency content and temporal patterns. The EM signals generated by the components of the embedded processor – including the CPU, GPU, memory, clocks, data storage components, voltage regulators, and input/output modules and associated analog/digital circuitry and wiring – vary depending on their usage, computation, and communication patterns (e.g., use of system memory bus to exchange data between CPU and memory). Actuators such as motors also generate distinctive EM signal patterns during their operation. From high-bandwidth EM signal measurements, machine learning techniques can provide a high-level of detail of the device activity. EM signals generated by components and operations in an embedded processor can be differentiated by their frequency content and temporal patterns yielding discernible signatures of events during code execution. To capture noisy EM signals over a wide range of frequencies and under cluttered conditions, combinations of multiple receiver/antenna pairs (e.g., helical, microstrip, Vivaldi) and antenna arrangements can be used with optimized configurations for specific types of devices. Multi-antenna geometric arrangements and polarizations can enable robust data acquisition in noisy environments where multiple devices and other EM sources may be present.

Power measurements provide aggregate readings reflecting the activity of the embedded processor including CPU and GPU usage patterns. Power measurements from CPS peripherals (such as sensors and actuators) yield information on device activity. Over a longer time scale, thermal measurements provide readings corresponding to CPS device activity. Thermal signatures can differentiate among components in a system (e.g., between two processors in a multi-processor system). Magnetic signals provide a low-bandwidth device signature that can be used in conjunction with EM signal measurements to detect hardware-level modifications (e.g., unauthorized hardware changes or tampering), especially in close proximity to the target device. Actuators (e.g., motors) generate distinctive acoustic signals correlated to their operational states (e.g., RPM of a motor). Various physical processes in a microcontroller generate an acoustic signal (e.g., vibrations of electronic components in the power regulation circuitry), albeit outside the human auditory range.

These analog side channels provide somewhat overlapping, but complementary, sources of information about the state of the monitored device. When multiple side channels are monitored, these side channel signals can be sampled at different sampling rates depending on the sensing modality and then time-synchronized. Fusion of multiple information streams enables robust, remote-awareness of the state of the monitored device including the device characteristics, code modifications, and in general, real-time analysis of the code execution state and control flow within the device.
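As a minimal sketch of the time-synchronization step, streams sampled at different rates can be resampled onto a common time axis by linear interpolation. The function name, the dictionary layout, and the choice of linear interpolation are assumptions made for illustration; the paper does not prescribe a specific resampling method.

```python
import numpy as np


def align_streams(streams, rate_hz):
    """Resample several side channel streams onto one common time axis.

    streams: dict mapping a name to (timestamps_s, values), timestamps increasing.
    rate_hz: sampling rate of the common time axis (an illustrative choice).
    Returns (t_common, dict of resampled value arrays).
    """
    # Use only the time interval covered by every stream.
    t_start = max(t[0] for t, _ in streams.values())
    t_end = min(t[-1] for t, _ in streams.values())
    n = int(np.floor((t_end - t_start) * rate_hz + 1e-9)) + 1
    t_common = t_start + np.arange(n) / rate_hz
    aligned = {
        name: np.interp(t_common, t, v) for name, (t, v) in streams.items()
    }
    return t_common, aligned


# Synthetic example: a 50 Hz thermal feature and a 1 kHz power trace.
t_thermal = np.arange(50) / 50.0
t_power = np.arange(1000) / 1000.0
streams = {
    "thermal": (t_thermal, 2.0 * t_thermal),
    "power": (t_power, 3.0 * t_power),
}
t_common, aligned = align_streams(streams, rate_hz=100.0)
```

After alignment, every stream has one value per common time stamp, so the per-modality samples can be stacked into a single feature vector for fusion.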

These analog side channels can be complemented by on-processor digital side channels that measure special-purpose registers (e.g., Hardware Performance Counters or HPCs). Real-time, on-device monitoring using digital side channels has been studied [74–76]. These methods can be used for signature-based detection of malware and detection of device code-specific HPC pattern deviations. HPCs are special-purpose registers built into modern processors (e.g., Intel x86, ARM, MIPS, and PowerPC). NumChecker [74, 75] and ConFirm [76] demonstrate that HPCs can detect malicious firmware and software modifications. For example, HPC-based monitoring can detect kernel rootkits by analyzing the system call behavior of unmodified and modified code blocks [74, 75].

Unlike on-device monitoring, the proposed approach uses remote monitoring of analog emissions across an air-gap. While prior work primarily used side channels as an information leakage mechanism and considers security vulnerabilities of analog emissions from an air-gapped device, the proposed approach uses these emissions for real-time monitoring. In particular, we consider the thermal side channel using an infrared camera and show that the code execution on the device can be remotely monitored using sequences of thermal images.

3.1 Thermal Side Channels

In this paper, the thermal side channel is considered as a representative remote sensing modality. The high-resolution thermal imaging testbed shown in Figure 6 is used for remote thermal monitoring of a multi-core Intel processor. In order to keep the processor operating without packaging and heat sinks, a stable and controlled source of cooling is provided by a thermoelectric Peltier element that dissipates the heat generated by the chip from the back side. This setup enables the thermal camera to capture the IR radiation emitted from the chip directly without any intervening layers interfering with it. Furthermore, the cooling mechanism from the back side can be calibrated by changing the power to the Peltier device to mimic the behavior of the original cooling of the chip using heat sinks, packaging, etc.

The thermal side channel monitoring approach described below in this section considers CPS applications wherein embedded devices run periodic computations. For instance, a CPS device implementing a control algorithm (e.g., to control motors and other electromechanical systems) typically displays a relatively well-structured temporal behavior as a repeated sequence of sensor reading, sensor data processing, control algorithm computation, and actuator writing steps as shown in Figure 7. Other CPS-relevant applications (e.g., aggregating data from sensors, fusing data from sensors to provide a situational awareness to a human operator, etc.) have similar periodic code structures. The periodicity of CPS code results in well-defined periodic characteristics of (thermal) side channel emissions from the device. Hence, by observing the characteristics and the temporal patterns of side channel emissions, deviations in the embedded device behavior during code execution can be detected.

The temporal patterns in the thermal imagery generated due to the code running on the processor can be used to identify the changes in code using machine learning. A simple approach will extract low-dimensional features such as temperature variations in each processor core, maximum, minimum or average temperatures in the region of each processor core, frequency-domain features like periodicity of the measured signal from thermal images. The time-series of such low-dimensional feature data can be used to detect changes from the “nominal” behavior using a one-class Support Vector Machine classifier. Furthermore, an end-to-end machine learning approach can be used to automatically learn subtle spatial and temporal patterns of thermal images obviating the need for manual feature extraction. One can automatically and implicitly learn feature representations optimized for the device computational activity estimation and anomaly detection.
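The simple feature-based approach can be sketched as follows. The per-core regions of interest, the function names, and the use of an FFT peak as the periodicity feature are assumptions for illustration; the resulting feature time series could then be passed to a one-class classifier, which is not shown here.

```python
import numpy as np


def core_features(frames, core_regions):
    """Per-frame mean, max, and min temperature for each processor core region.

    frames: array of shape (n_frames, height, width).
    core_regions: dict mapping a core name to (row_slice, col_slice);
                  the regions are hypothetical and depend on the chip layout.
    Returns a dict mapping each core name to an array of shape (n_frames, 3).
    """
    feats = {}
    for name, (rows, cols) in core_regions.items():
        roi = frames[:, rows, cols].reshape(frames.shape[0], -1)
        feats[name] = np.stack(
            [roi.mean(axis=1), roi.max(axis=1), roi.min(axis=1)], axis=1
        )
    return feats


def dominant_period_s(signal, fps):
    """Period (in seconds) of the strongest non-DC frequency component."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fps)
    k = 1 + int(np.argmax(spectrum[1:]))  # skip the DC bin
    return 1.0 / freqs[k]
```

For example, applying `dominant_period_s` to the mean-temperature series of one core estimates the loop period of the code running on that core, assuming the thermal signal is dominated by that periodic activity.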

Similar to typical embedded controller code in CPS devices, we consider code comprising periodic iterations of a time period of relatively high computational activity (activity time) followed by a fixed time period of low activity (sleep time). An instantiation of this code structure is characterized by a loop time period \( T \) and an activity time period \( \Delta \). The loop time period is the sampling time or iteration time in an embedded controller code in a CPS device while the activity time period \( \Delta \) is the amount of time required for the computations performed in an iteration of the loop. The sampling time \( T \) is typically a fixed quantity that is chosen depending on the task being performed by the CPS device while the activity time period \( \Delta \) depends on the computations being done within each sampling period. We consider a fixed period \( T \) and a variable activity time \( \Delta \in (0, T) \) and pose the machine learning problem as estimation of \( \Delta \), given a time sequence of thermal heat maps over a sliding window of time. From the estimated time-series of activity times \( \Delta \), an anomaly detection algorithm probabilistically determines whether the estimated activity times correspond to expected values for the device, based on the observation that a cyber-attack that removes, adds, or modifies code in a CPS device will result in a modification of the activity time. The overall machine learning methodology for computational activity time estimation and thereby anomaly detection is shown in Figure 8.
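A minimal sketch of such a periodic code structure is given below. It is not the code used in the experiments; the function name and the placeholder floating-point workload are assumptions, and the busy-wait is only one way to realize a fixed activity time \( \Delta \) within a loop period \( T \).

```python
import time


def run_periodic(loop_period_s, activity_time_s, iterations):
    """Run `iterations` loops: busy work for Delta seconds, then sleep until T.

    Returns the measured busy time of each iteration.
    """
    if not 0.0 < activity_time_s < loop_period_s:
        raise ValueError("activity time must satisfy 0 < Delta < T")
    busy_times = []
    x = 1.0
    for _ in range(iterations):
        start = time.perf_counter()
        # Activity phase: placeholder floating-point work for Delta seconds.
        while time.perf_counter() - start < activity_time_s:
            x = (x * 1.000001 + 0.5) % 1000.0
        busy_times.append(time.perf_counter() - start)
        # Sleep phase: idle for the remainder of the loop period T.
        remaining = loop_period_s - (time.perf_counter() - start)
        if remaining > 0:
            time.sleep(remaining)
    return busy_times
```

Code injected into the activity phase lengthens the busy time of each iteration, which is the quantity the thermal monitoring approach attempts to estimate remotely.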

For simplicity and to focus on machine learning based activity time estimation, we consider a scenario in which the activity time \( \Delta \) is nominally fixed during normal operation. In general, \( \Delta \) could vary during normal operation depending on, for example, input data to the CPS device. The methodology can be extended to such a case by characterizing ranges or temporal patterns of \( \Delta \) instead of a single fixed nominal \( \Delta \) and basing anomaly detection on the deviation between the machine learning based estimated activity times and the expected ranges or temporal patterns of activity time.

In our end-to-end machine learning based framework, a sequence of thermal heat maps of the microprocessor over a sliding window of time (defined here to be 0.5 s, corresponding to 25 consecutive images since the thermal imager provides 50 frames per second) is used as the input to a convolutional neural network (introduced in [57]) to predict the activity time. The use of a high-speed thermal imager will enable more precise activity time prediction due to finer temporal granularity.

Emerging (Un-)Reliability Based Security Threats and Mitigations for Embedded Systems

ES WEEK’17, Oct. 2017, Seoul, South Korea

Figure 7: The periodic code structure in a CPS device comprises periodically repeating computations interspersed with sleep times, e.g., a loop of sensor reading, control algorithm calculations, and actuator writing with a fixed sampling time (a code snippet is shown on the right). [57]

Figure 8: Machine learning based methodology to estimate the CPS device computational activity time and for anomaly detection from a sequence of thermal heat maps. [57]

The proposed neural network architecture has five convolutional blocks, each comprising a spatial convolution layer, a rectified linear unit (ReLU) layer, and a max-pooling layer. The numbers of convolutional kernels in the five blocks are 16, 32, 32, 64, and 64, respectively. The size of each convolutional kernel in all the blocks is 3x3 with a stride of 1 and the size of each max-pooling kernel is 2x2 with a stride of 2. The weights of all the convolutional blocks are shared over all the images in the specified time window. The outputs of the last convolutional block for all the images in the specified time window are flattened and concatenated. The concatenated output is passed through three fully connected layers, each with a ReLU non-linearity, to output feature vectors of size 1024, 128, and 32, respectively. The feature vector of size 32 is passed through a final fully connected layer to predict the activity time. The weights of the network were optimized using the adaptive moment estimation (Adam) optimizer with a Huber loss function.
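The size of the concatenated feature vector follows from standard convolution and pooling arithmetic, sketched below. The convolution padding is not stated in the text, so both a padding of 1 and a padding of 0 are evaluated as assumptions; the 270x270 input size and the 25-frame window are the values reported for the dataset.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Spatial output size of a convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial output size of a max-pooling layer (no padding)."""
    return (size - kernel) // stride + 1


def flattened_features(input_size, channels, frames, padding):
    """Return (final spatial size, features per frame, concatenated features)."""
    size = input_size
    for _ in channels:
        size = pool_out(conv_out(size, padding=padding))
    per_frame = channels[-1] * size * size
    return size, per_frame, frames * per_frame


CHANNELS = [16, 32, 32, 64, 64]  # kernels per convolutional block

# Padding is an assumption: 1 keeps the size through each convolution, 0 shrinks it.
with_padding = flattened_features(270, CHANNELS, frames=25, padding=1)
without_padding = flattened_features(270, CHANNELS, frames=25, padding=0)
```

Under these assumptions, the input to the first fully connected layer has 102400 entries with a padding of 1 and 57600 entries with no padding, which gives a sense of the dimensionality reduction performed by the first fully connected layer (output size 1024).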

To generate the thermal heat map dataset for our end-to-end learning system, code for various configurations of \( T \) and \( \Delta \) was implemented using the algorithmic structure shown in Figure 7. The code performs floating-point calculations over the specified time periods. Extraneous off-die parts of the acquired thermal image are cropped out to obtain a heat map of 270x270 pixels. For each value of \( \Delta \), sliding time windows are defined over the collected thermal data set with a stride of 5 frames, i.e., time windows comprising frames 1 to 25, frames 6 to 30, etc. The set of normalized gray-scale thermal images in a time window is input to the end-to-end learning system, which predicts the corresponding value of \( \Delta \) for that data set. The overall thermal dataset was split into training and validation datasets with a ratio of 75:25.
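The windowing and the 75:25 split described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the split was drawn, so the random shuffle at the window level (and its seed) is hypothetical, as are the function names. Because consecutive windows overlap by 20 frames, a split by recording rather than by window may be preferable in practice.

```python
import random


def sliding_windows(num_frames, window=25, stride=5):
    """Return 1-based inclusive (first, last) frame indices of each window.

    With window=25 and stride=5 this yields (1, 25), (6, 30), ... as in the text.
    """
    return [(start + 1, start + window)
            for start in range(0, num_frames - window + 1, stride)]


def split_windows(windows, train_fraction=0.75, seed=0):
    """Split windows into training and validation sets (75:25 by default).

    ASSUMPTION: a seeded random shuffle over windows; the actual split
    procedure is not described in the text.
    """
    shuffled = list(windows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    windows = sliding_windows(num_frames=100)
    train, val = split_windows(windows)
    print(windows[:2], len(windows), len(train), len(val))
```

At 50 frames per second, a stride of 5 frames means that a new window (and hence a new estimate of the activity time) becomes available every 0.1 s.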

The accuracy of estimation of \( \Delta \) was evaluated on the testing data set. The estimation of \( \Delta \) for testing data sets with actual \( \Delta \) value of 0.1 s is shown in Figure 9. In this figure, the time series of estimates of \( \Delta \) for sliding time windows of thermal image sequences (with a stride of 5 frames as discussed above; hence, a new estimate of \( \Delta \) after every 0.1 s since the thermal imager provides 50 frames...
Figure 9: Estimates of computation time from sliding time windows of thermal heat maps collected with computational activity time $\Delta = 0.1$ s. The top figure shows the time series of estimates of $\Delta$ from successive sliding time windows of heat maps and the bottom figure shows the histogram of the errors (predicted - actual) in the estimated values of $\Delta$. The histogram shows that the estimated values of $\Delta$ are centered around the correct value of 0.1 s with a Gaussian distribution of errors around this correct value. [57]

Based on the estimation of $\Delta$ from sequences of thermal images, anomalies (i.e., changes to the running code) are detected by probabilistically matching sequences of estimated $\Delta$ values over sliding time windows of thermal images against the nominal $\Delta$ or, more generally, against expected ranges or temporal variation patterns of $\Delta$. In the simplest case, wherein the nominal $\Delta$ is a constant $\Delta_{nom}$, consider the time sequence $\Delta_t$ of estimated $\Delta$ values over a time window (set here to 2 s, i.e., 20 consecutive estimates). An anomaly is detected if two conditions both hold: a sufficient percentage (set to 90%) of the estimates $\Delta_t$ differ from $\Delta_{nom}$ by more than a threshold (specified here as 0.0015 s), and the mean of the estimates $\Delta_t$ over the considered time window differs from $\Delta_{nom}$ by more than a threshold (also 0.0015 s).
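The simplest-case rule above (constant nominal value) can be sketched in a few lines. The function and parameter names are hypothetical, and interpreting "a sufficient percentage" as "at least 90%" is an assumption; the thresholds and the 20-estimate window follow the values given in the text.

```python
def is_anomalous(estimates, nominal, threshold=0.0015, min_fraction=0.9):
    """Apply the two-part test to one window of activity-time estimates.

    estimates: estimated activity times in seconds (e.g., 20 values for 2 s)
    nominal:   the constant nominal activity time in seconds
    ASSUMPTION: "sufficient percentage" is read as fraction >= min_fraction.
    """
    deviating = sum(1 for d in estimates if abs(d - nominal) > threshold)
    mean_estimate = sum(estimates) / len(estimates)
    enough_deviate = deviating / len(estimates) >= min_fraction
    mean_deviates = abs(mean_estimate - nominal) > threshold
    return enough_deviate and mean_deviates


if __name__ == "__main__":
    shifted = [0.104] * 20                    # every estimate is 4 ms too long
    noisy = [0.104] * 10 + [0.096] * 10       # large deviations, unbiased mean
    print(is_anomalous(shifted, nominal=0.1))
    print(is_anomalous(noisy, nominal=0.1))
```

Requiring both conditions makes the test conservative: a window whose estimates scatter widely but symmetrically around the nominal value fails the mean condition and is not flagged, whereas a consistent shift of the activity time satisfies both.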

The results of our anomaly detection algorithm on data sets collected with $\Delta$ settings in the ranges around 0.1 s and 0.2 s are shown in Figure 10. The anomaly detection likelihoods correspond to the percentages of time windows in these data sets that the anomaly detection algorithm declared as anomalous when comparing against the nominal values of 0.1 s and 0.2 s, respectively. There are no false positives, and increases in $\Delta$ of around 4 ms or decreases of around 8 ms are detected as anomalous with 100% accuracy (i.e., without false negatives). It is noteworthy that both the estimation of $\Delta$ and the detection of variations of $\Delta$ achieve temporal granularities finer than the 0.02 s sampling period (i.e., 50 frames per second) of the thermal imager. This indicates that the machine learning system can use the fine-grained variations in temperature and their spatial and temporal patterns to accurately estimate the computation activity.

Changes in periodic code structures can be robustly detected using the high-resolution thermal imaging data. While we use an infrared thermal camera in our experiments, the algorithmic approaches can operate on scalar temperature measurement streams as long as they are of sufficient thermal signal and temporal resolution. The technique can be effectively used with on-processor temperature measurements if the processor-integrated temperature sensors provide better resolution than the 1 degree Celsius typically provided by the on-chip, integrated sensors (on-chip monitoring using typical integrated temperature sensors and processor fan was considered in [58]). To find the regions of the thermal image (corresponding to discrete locations of a small set of on-chip, thermal sensors) that are of most utility for estimating processor activity, the machine learning system can be modified to include a sparsity-inducing component to automatically learn salient parts of the image. A masking matrix approach was utilized in [57] for this purpose. Salient parts of the thermal image were learned in terms of a masking matrix whose weights are learned through backpropagation in an end-to-end manner along with the weights of the proposed network architecture. The sum of absolute values of the weights was utilized as a sparsity-inducing regularization component in the machine learning loss function. It was seen in [57] that when retrained with this modified loss function, a small fraction of the overall image was sufficient to estimate \( \Delta \) without any appreciable loss of accuracy. Thus, it can be inferred that integrating high-resolution high-sampling-rate temperature sensors into processors at strategic locations (which may physically correspond to power circuitry, cores, caches, etc.) can enable accurate estimation of computation times and robust anomaly detection.
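One plausible way to write the modified loss described above is sketched below. The notation is ours and is an assumption rather than the formulation of [57]: \( M \) denotes the masking matrix, \( I_k \) the \( k \)-th heat map in the time window, \( f_\theta \) the network with weights \( \theta \), \( \hat{\Delta} \) the predicted activity time, \( \lambda > 0 \) a regularization weight, and \( \kappa > 0 \) the Huber threshold. The element-wise application of the mask to each image and the values of \( \lambda \) and \( \kappa \) are not specified in the text.

\[
\hat{\Delta} = f_\theta\left(M \odot I_1, \ldots, M \odot I_{25}\right), \qquad
\mathcal{L}(\theta, M) = H_\kappa\left(\hat{\Delta} - \Delta\right) + \lambda \sum_{i,j} \left| M_{ij} \right|,
\]

\[
H_\kappa(e) =
\begin{cases}
\frac{1}{2} e^2, & |e| \le \kappa, \\
\kappa \left( |e| - \frac{1}{2} \kappa \right), & |e| > \kappa.
\end{cases}
\]

Under this reading, the \( \ell_1 \) term drives most entries of \( M \) toward zero during training, so the pixels whose mask weights remain non-negligible indicate the image regions that matter for estimating \( \Delta \).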

4 CONCLUSION

While this paper addressed aging and analog side channels in embedded systems as separate topics, there are connections and synergies between these two directions that are relevant to embedded systems security. Side channels can be used to monitor for aging effects, both to detect aging-based attacks and to facilitate device integrity testing using short-term aging as a signature for the monitored embedded device. Short-term aging can be utilized to cause patterns of transient changes on the device to leak information via side channels. On a device with pre-loaded malicious code, short-term aging can be used as a trigger to create transient effects that wake up a Trojan on the device, which then executes the malicious code to leak information via side channels from the device or the CPS physical process. On the flip side, side channel monitoring can be used to detect execution of such malicious code. Aging effects (especially short-term aging) naturally disappear following an attack due to the "recovery" intrinsic to the aging mechanism, so subsequent forensic analysis to retrace the attack is difficult, if not impossible. This property is unique to aging, unlike other attack vectors that may leave a forensic trail.

5 ACKNOWLEDGEMENTS

The work is supported in part by a US-German travel supplement to NSF grant 1319841, ONR grants N00014-15-12182 and N-00014-17-12006, Boeing, and the German Research Foundation (DFG) as part of priority program “Dependable Embedded Systems” (SPP 1500 – http://spp1500.itec.kit.edu/).

REFERENCES
