THERMOS

This paper introduces THERMOS, a thermally aware, multi-objective scheduling framework designed for heterogeneous multi-chiplet Processing-In-Memory (PIM) architectures. By integrating multiple PIM implementations, including ReRAM-based and SRAM-based designs, the framework leverages the strengths of each technology while mitigating their limitations.

The scheduling challenge is tackled using a two-level approach:

- Multi-Objective Reinforcement Learning (MORL): At the high level, a single MORL policy is trained to map deep learning models to clusters of chiplets. This policy can dynamically optimize for execution time, energy consumption, or a balanced objective by adjusting a runtime preference vector.
- Proximity-Driven Assignment: At the lower level, a proximity-driven algorithm allocates the specific chiplets within the selected cluster, thereby reducing inter-chiplet communication overhead.
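
The lower-level step can be illustrated with a small sketch. The source does not spell out the proximity-driven algorithm, so the following is only one plausible greedy variant under stated assumptions: chiplets sit on a 2D mesh, distance is measured in Manhattan hops, and the function name `pick_chiplets` and its parameters are hypothetical.

```python
from typing import List, Set, Tuple

Coord = Tuple[int, int]  # (row, col) position on an assumed 2D mesh


def manhattan(a: Coord, b: Coord) -> int:
    """Hop distance between two mesh positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def pick_chiplets(free: Set[Coord], anchors: List[Coord], count: int) -> List[Coord]:
    """Greedily pick `count` free chiplets that are close to the anchors.

    `anchors` are chiplets already holding the producer layer (assumption).
    Each pick minimizes the summed hop distance to all anchors and to the
    chiplets chosen so far; ties are broken by coordinate order.
    """
    if count > len(free):
        raise ValueError("not enough free chiplets in the selected cluster")
    remaining = set(free)
    refs = list(anchors)
    chosen: List[Coord] = []
    for _ in range(count):
        best = min(remaining, key=lambda c: (sum(manhattan(c, r) for r in refs), c))
        chosen.append(best)
        remaining.remove(best)
        refs.append(best)  # later picks also stay close to this one
    return chosen


# Hypothetical cluster: four free chiplets, producer layer lives at (0, 0).
selection = pick_chiplets({(0, 1), (1, 1), (3, 3), (2, 0)}, [(0, 0)], 2)
print(selection)
```

Keeping the chosen chiplets near both the producer layer and each other is what shortens the paths that activations must travel.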

Comprehensive evaluations demonstrate that THERMOS significantly outperforms baseline schedulers, achieving up to 89% faster execution and 57% lower energy consumption. This work is the first to deliver a unified RL-based scheduling policy for heterogeneous chiplet systems that is both thermally aware and multi-objective.

Code Repository: THERMOS GitHub Repository

Simulation Framework

The simulation framework in this work is developed to accurately model the performance and energy consumption of deep learning (DL) workloads on PIM architectures. It incorporates two main models:

Compute Model for PIM: We profile DL layers on different PIM types using CimLoop, a fast, statistical, data-dependent simulator that integrates Accelergy, Timeloop, and NeuroSim. For a given layer n_i mapped across multiple chiplets under a scheduling policy ψ_i, CimLoop estimates the total compute time per input (t_comp,i), total compute energy (e_comp,i), and total leakage power (p_leak,i). Leakage energy is derived from an energy reference table (ERT) generated by CimLoop. When a layer is distributed across multiple chiplets, execution proceeds in parallel, and t_comp,i is determined by the slowest chiplet. During idle periods, PIM macros continue to consume leakage energy because they must retain the stored weights in their crossbar arrays. Additionally, the compute model tracks each chiplet's available memory (M_i(t)) as a function of time.
Communication Model: Designed specifically for DL workloads, this model uses a circuit-switched network to handle the transmission of large activation payloads (often hundreds of kilobytes) between chiplets. In contrast to CPU-based systems, where small cache-line transfers favor packet-switched networks, the bursty and deterministic nature of DL activations motivates the use of circuit switching. This approach simplifies router design: only the head flit makes routing decisions, while subsequent flits follow a precomputed path, ensuring predictable traffic flow and enabling a deterministic reward function in MORL training.
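
As a rough illustration of how per-chiplet profiles combine into layer-level quantities, consider the sketch below. The slowest-chiplet rule for t_comp,i comes from the description above; summing energy and leakage power across chiplets is an assumption based on the word "total", and the `ChipletProfile` type and `layer_compute_cost` function are hypothetical names, not part of CimLoop.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ChipletProfile:
    """Hypothetical per-chiplet result for its share of one layer."""
    t_comp: float  # seconds per input
    e_comp: float  # joules per input
    p_leak: float  # watts


def layer_compute_cost(shares: List[ChipletProfile]) -> Tuple[float, float, float]:
    """Combine the chiplets that jointly execute one layer n_i."""
    t_comp_i = max(s.t_comp for s in shares)  # parallel execution: slowest chiplet dominates
    e_comp_i = sum(s.e_comp for s in shares)  # assumed: energies add up
    p_leak_i = sum(s.p_leak for s in shares)  # assumed: leakage powers add up
    return t_comp_i, e_comp_i, p_leak_i


# Made-up numbers for two chiplets sharing one layer.
t, e, p = layer_compute_cost([
    ChipletProfile(t_comp=2e-3, e_comp=1e-3, p_leak=0.01),
    ChipletProfile(t_comp=3e-3, e_comp=2e-3, p_leak=0.02),
])
print(t, e, p)
```
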
For a given DL characterization graph (DCG), if two neural layers n_i and n_j are connected, the volume of data to be communicated is given by f_ij. The shortest path between the chiplets mapped to n_i and n_j is determined based on link availability; links along this path are allocated until all activations from n_i are transmitted. We assume that chiplets are connected via UCIe ports with the following parameters:
- Energy per bit per hop: e_ucie = 0.5 pJ
- Link traversal latency: t_ucie = 2 ns
- Link width: l_ucie = 64 bits
For links connecting different chiplet types, FIFO-based synchronization is employed to handle timing variations. If chiplet c_a (belonging to scheduling policy ψ_i) sends data to chiplet c_b (belonging to ψ_j), and c_a processes a fraction x of the activations f_ij, then the latency and energy for sending data from c_a to c_b over the shortest available path are given by:

$$ t_{\text{comm}, c_a \to c_b} = t_{\text{ucie}} \times (\# \text{hops}_{a \to b} - 1) + t_{\text{ucie}} \times \frac{x f_{ij}}{l_{\text{ucie}}} $$ $$ e_{\text{comm}, c_a \to c_b} = e_{\text{ucie}} \times x f_{ij} \times \# \text{hops}_{a \to b} $$

Here, # hops_{a → b} denotes the number of hops in the shortest path between chiplets c_a and c_b. We assume that the communications between all chiplet pairs mapped to layers n_i and n_j occur in parallel, so the total time and energy to transmit data from layer n_i to layer n_j are given by:

$$ t_{\text{comm}, ij} = \max \{ t_{\text{comm}, c_a \to c_b} \quad \forall \; c_a \in \psi_i,\; c_b \in \psi_j \} $$ $$ e_{\text{comm}, ij} = \sum \{ e_{\text{comm}, c_a \to c_b} \quad \forall \; c_a \in \psi_i,\; c_b \in \psi_j \} $$
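
The per-pair and per-layer-pair formulas can be checked numerically with a short sketch. It uses the UCIe parameters listed above and assumes that f_ij is expressed in bits (the source does not state the unit); the function names are hypothetical, and the transfer time follows the formula as written, without rounding up to whole link-width transfers.

```python
from typing import List, Tuple

T_UCIE_NS = 2.0   # link traversal latency, ns
E_UCIE_PJ = 0.5   # energy per bit per hop, pJ
L_UCIE_BITS = 64  # link width, bits


def pair_comm(hops: int, x: float, f_ij_bits: float) -> Tuple[float, float]:
    """Latency (ns) and energy (pJ) for c_a sending its fraction x of f_ij to c_b."""
    bits = x * f_ij_bits
    t_ns = T_UCIE_NS * (hops - 1) + T_UCIE_NS * bits / L_UCIE_BITS
    e_pj = E_UCIE_PJ * bits * hops
    return t_ns, e_pj


def layer_pair_comm(transfers: List[Tuple[int, float]], f_ij_bits: float) -> Tuple[float, float]:
    """Aggregate over all (c_a, c_b) pairs: max of latencies, sum of energies.

    `transfers` holds one (hops, x) tuple per chiplet pair.
    """
    costs = [pair_comm(hops, x, f_ij_bits) for hops, x in transfers]
    return max(t for t, _ in costs), sum(e for _, e in costs)


# Two sender chiplets, each holding half of a 64,000-bit activation tensor,
# one hop and three hops away from their receivers (made-up topology).
t_ij, e_ij = layer_pair_comm([(1, 0.5), (3, 0.5)], 64_000)
print(t_ij, e_ij)  # 1004.0 ns, 64000.0 pJ
```

In this example the serialization term (1000 ns) dominates the hop latency (4 ns), which is typical when activation payloads are large relative to the link width.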

For a receiving layer n_j, the total communication time and energy are computed by considering that all transmitting layers operate in parallel:

$$ t_{\text{comm}, j} = \max \{ t_{\text{comm}, ij} \quad \forall \; n_i \in \text{DCG} \} $$ $$ e_{\text{comm}, j} = \sum \{ e_{\text{comm}, ij} \quad \forall \; n_i \in \text{DCG} \} $$

Finally, assuming that layers process activations in a pipelined manner, with each input undergoing a communication-to-compute stage, the overall deterministic latency and energy for I inputs can be approximated using a coarse pipeline (each stage corresponding to a neural layer):

$$ t_{\text{DCG}} = \sum_{n_i \in \text{DCG}} \left( t_{\text{comp},i} + t_{\text{comm},i} \right) + (I - 1) \times \max_{n_i \in \text{DCG}} \left( t_{\text{comp},i} + t_{\text{comm},i} \right) $$ $$ e_{\text{DCG}} = I \times \sum_{n_i \in \text{DCG}} \left( e_{\text{comp},i} + e_{\text{comm},i} + p_{\text{leak},i} \times t_{\text{DCG}} \right) $$
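
A small worked sketch of the coarse-pipeline approximation follows. It transcribes the two expressions above directly; the function names are hypothetical, and the inputs are assumed to use consistent units (for example, time in ms and power in energy units per ms).

```python
from typing import Sequence


def dcg_latency(stage_times: Sequence[float], num_inputs: int) -> float:
    """t_DCG: fill the pipeline once, then the slowest stage paces the other I - 1 inputs.

    Each entry of `stage_times` is t_comp,i + t_comm,i for one layer.
    """
    return sum(stage_times) + (num_inputs - 1) * max(stage_times)


def dcg_energy(e_comp: Sequence[float], e_comm: Sequence[float],
               p_leak: Sequence[float], t_dcg: float, num_inputs: int) -> float:
    """e_DCG as written above: I times the per-layer sum of compute, communication, and leakage."""
    per_layer = (ec + em + pl * t_dcg for ec, em, pl in zip(e_comp, e_comm, p_leak))
    return num_inputs * sum(per_layer)


# Three layers with made-up stage times, four inputs.
t_dcg = dcg_latency([2.0, 5.0, 3.0], num_inputs=4)
e_dcg = dcg_energy([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [0.01, 0.01, 0.01], t_dcg, 4)
print(t_dcg, e_dcg)
```

With stage times of 2, 5, and 3, the first input takes 10 time units and each of the remaining three inputs adds 5, the duration of the slowest stage, for a total of 25.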

This communication model also tracks link availability over time and allocates energy consumption across all chiplets along the communication path. It captures the deterministic execution time and energy for a given DL workload, whereas temperature-dependent and non-deterministic effects require a runtime thermal model.
Thermal Model: We adopt the discrete state-space thermal model from the MFIT (Multi-Fidelity Thermal Modeling) framework. This model is designed to predict thermal behavior in multi-chiplet systems both accurately and quickly. It is constructed from the system's geometric properties: the chiplet layout and material properties determine the thermal resistance and capacitance of the system, so the model reflects the physical characteristics of the architecture.
The dynamic power supplied to the model for each chiplet is derived from its energy consumption and execution time. This relationship is crucial for capturing the time-varying nature of thermal behavior during system operation. The thermal dynamics are governed by a discrete-time state-space equation, which can be expressed as:

$$ \mathbf{T}(K+1) = \mathbf{A} \cdot \mathbf{T}(K) + \mathbf{B} \cdot \mathbf{P}(K) $$

where T(K) is the temperature vector at time step K (each step spans 10 ms), A is the thermal state matrix, B is the power input matrix, and P(K) is the power vector at time step K. The model captures the thermal dynamics of each chiplet and their interactions, allowing temperature variations to be predicted over time.
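
The update rule can be stepped forward with a few lines of code. The sketch below is illustrative only: the matrices A and B are made-up values for two chiplets (in MFIT they would be derived from the geometry and material properties), T is treated as temperature rise above ambient, which is an assumption of this sketch, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mat_vec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def thermal_step(a: Matrix, b: Matrix, temp: Vector, power: Vector) -> Vector:
    """One time step: T(K+1) = A * T(K) + B * P(K)."""
    return [x + y for x, y in zip(mat_vec(a, temp), mat_vec(b, power))]


def simulate(a: Matrix, b: Matrix, temp0: Vector, power_trace: List[Vector]) -> List[Vector]:
    """Roll the model forward over a power trace, one entry per time step."""
    temps = [temp0]
    for power in power_trace:
        temps.append(thermal_step(a, b, temps[-1], power))
    return temps


# Hypothetical two-chiplet system: chiplet 0 dissipates 2 W, chiplet 1 is idle.
A = [[0.9, 0.05], [0.05, 0.9]]  # off-diagonal terms couple neighboring chiplets
B = [[0.5, 0.0], [0.0, 0.5]]
trace = simulate(A, B, [0.0, 0.0], [[2.0, 0.0], [2.0, 0.0]])
print(trace)
```

Even though the second chiplet dissipates no power in this example, its temperature rises after the second step because of the coupling terms in A.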

This simulation framework is fast and accurate, enabling the evaluation of THERMOS across various PIM architectures and workloads. We plan to release the simulation framework along with the THERMOS code to facilitate further research in this area.

THERMOS: Thermally-Aware Multi-Objective Scheduling for Heterogeneous Multi-Chiplet PIM Architectures
