High-performance computing (HPC) servers must handle a growing number of increasingly complex tasks and, consequently, must address the resulting energy efficiency challenge. In addition to energy efficiency, it is essential to manage the lifetime limitations of power-hungry server components (e.g., cores and caches), so that a server does not fail before the end of its design lifetime. Traditional approaches focus either on using hybrid caches to reduce the leakage power of traditional static random-access memory (SRAM) caches, and thus increase energy efficiency, or on the tradeoff between the lifetime and performance of multicore processors. However, these approaches lack the flexibility and applicability that HPC tasks require for multiparametric optimization, which includes quality-of-service (QoS), lifetime reliability, and energy efficiency. In this article, we therefore propose COCKTAIL, a holistic strategy framework that jointly optimizes the energy efficiency of multicore server processors and task performance in the HPC context, while guaranteeing lifetime reliability. First, we analyze which cache technology, traditional SRAM or resistive random-access memory (RRAM), is best within hybrid cache architectures, in order to improve energy efficiency and manage cache endurance limits with respect to task requirements. Second, we introduce a novel, efficient proactive queue optimization policy that reorders HPC tasks for execution, considering their end time and their possible reliability effects on the hybrid caches. Third, we present a dynamic model predictive control (MPC)-based reliability management method that maximizes task performance by controlling the frequency, temperature, and target lifetime of the server processor. Our results demonstrate that, while consuming similar energy, COCKTAIL provides up to 60% QoS improvement compared with the latest state-of-the-art energy optimization and reliability management techniques in the HPC context.
Moreover, our strategy guarantees a design lifetime longer than five years for the whole HPC system.
               