Software tuning doubles computer processing speed, halves energy consumption

Existing processors in PCs, smartphones and other devices can deliver enormous gains in speed and efficiency using a new parallel processing software framework designed to eliminate bottlenecks and use multiple chips at once.

Most modern computers, from smartphones and PCs to data center servers, contain graphics processing units (GPUs) and hardware accelerators for AI and machine learning. Well-known commercial examples include Tensor Cores on NVIDIA GPUs, Tensor Processing Units (TPUs) on Google Cloud servers, Neural Engines on Apple iPhones, and Edge TPUs on Google Pixel phones.

Each of these components processes information separately, and moving data from one processing unit to the next often creates bottlenecks. In a new study, researchers from the University of California, Riverside (UCR) have demonstrated a method in which these existing, disparate components work simultaneously to greatly improve processing speed and reduce energy consumption.

“You don’t need to add new processors because you already have them,” said Hung-Wei Tseng, an associate professor of electrical and computer engineering at UCR and co-leader of the study.

The researchers’ framework, called simultaneous heterogeneous multithreading (SHMT), departs from traditional programming models, which delegate a region of code exclusively to one type of processor, leaving the other resources idle and contributing nothing to the current function.

Instead, SHMT exploits the diversity—or heterogeneity—of multiple components, breaking up the computational function to share it among them. In other words, it is a type of parallel processing.
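To make that concrete, the toy sketch below splits one array operation three ways and runs the pieces at the same time. It is purely illustrative: the three worker functions merely stand in for a CPU, a GPU and a TPU, and nothing here comes from SHMT's actual code.

```python
# A toy illustration of the idea, not the researchers' code: one
# function's work is split three ways and run simultaneously. The
# worker functions merely stand in for a CPU, a GPU and a TPU.
from concurrent.futures import ThreadPoolExecutor

def run_on_cpu(chunk):  # placeholder for work done on the CPU
    return [x * 2 for x in chunk]

def run_on_gpu(chunk):  # placeholder for a GPU kernel launch
    return [x * 2 for x in chunk]

def run_on_tpu(chunk):  # placeholder for a TPU offload
    return [x * 2 for x in chunk]

data = list(range(9))
chunks = [data[0:3], data[3:6], data[6:9]]  # split the work three ways
workers = [run_on_cpu, run_on_gpu, run_on_tpu]

# Each "device" processes its share at the same time; the partial
# results are then stitched back together in order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parts = list(pool.map(lambda w, c: w(c), workers, chunks))

result = [x for part in parts for x in part]
print(result)  # [0, 2, 4, 6, 8, 10, 12, 14, 16]
```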

Comparison of how (a) conventional heterogeneous computers, (b) conventional software-pipelined heterogeneous computers, and (c) SHMT perform functions

Hsu and Tseng

How does it work?

Feel free to skip this part, but for the more computer-science-savvy, here’s a (still very basic) overview of how SHMT works. A set of virtual operations (VOPs) allows a CPU program to ‘offload’ a function to a virtual hardware device. During program execution, SHMT’s runtime system manages this virtual hardware, measuring each hardware resource’s capability in order to make scheduling decisions.
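As a rough illustration of that offloading step, here is a minimal sketch in Python. The names (VirtualOp, offload, and the queue) are hypothetical stand-ins for the paper’s VOP abstraction, not SHMT’s actual API.

```python
# A minimal, hypothetical sketch of the VOP idea: the CPU program
# describes work in a hardware-neutral way and hands it to the
# runtime, which alone decides where it eventually executes.
from dataclasses import dataclass

@dataclass
class VirtualOp:
    """A device-independent description of offloaded work."""
    name: str      # e.g. "matmul"
    payload: dict  # inputs and parameters, kept hardware-neutral

runtime_queue: list[VirtualOp] = []

def offload(op: VirtualOp) -> None:
    # The program does not name a device; it only submits the VOP.
    runtime_queue.append(op)

offload(VirtualOp("matmul", {"shape": (1024, 1024)}))
print(runtime_queue)
```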

SHMT uses a quality-aware work-stealing (QAWS) scheduling policy that consumes few resources itself while helping maintain quality control and workload balance. The runtime system divides VOPs into one or more high-level operations (HLOPs) so that multiple hardware resources can be used simultaneously.

Then, SHMT’s runtime system assigns these HLOPs to the target hardware’s task queues. Because HLOPs are hardware-independent, the runtime can adjust task assignments as needed.
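Putting those two paragraphs together, a hypothetical sketch of the dispatch step might look like the following. The device names, profiling numbers and proportional-share policy are illustrative assumptions, not details from the paper.

```python
# A hypothetical sketch of the dispatch step: one VOP is split into
# hardware-independent HLOPs, which a quality-aware policy spreads
# across per-device task queues. Numbers and policy are made up.
from collections import deque

def split_into_hlops(vop_name, n_parts):
    # Break one virtual operation into smaller, device-neutral pieces.
    return [f"{vop_name}:part{i}" for i in range(n_parts)]

# Capability each device showed during profiling (made-up numbers);
# the runtime would refresh these as the program executes.
profile = {"cpu": 1.0, "gpu": 4.0, "tpu": 3.0}
queues = {dev: deque() for dev in profile}

def dispatch(hlops):
    # Give each device a share of HLOPs proportional to its measured
    # capability. Because HLOPs are hardware-independent, any piece
    # could later be moved to a different device's queue.
    total = sum(profile.values())
    it = iter(hlops)
    for dev, speed in profile.items():
        for _ in range(round(len(hlops) * speed / total)):
            piece = next(it, None)
            if piece is not None:
                queues[dev].append(piece)
    for leftover in it:  # rounding may leave a few pieces unassigned
        queues["gpu"].append(leftover)

dispatch(split_into_hlops("matmul", 8))
print({dev: list(q) for dev, q in queues.items()})
```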

Prototype testing and results

To test the concept, the researchers built a system using the kinds of chips and processing power you’d find in any decent late-model smartphone, with a few tweaks to also test how it would fare in a data center.

SHMT platform prototype

Hsu and Tseng

Specifically, they built a custom embedded system platform around NVIDIA’s Jetson Nano module, which features a quad-core ARM Cortex-A57 CPU and a 128-core Maxwell-architecture GPU. A Google Edge TPU was connected to the system via its M.2 Key E slot.

The CPU, GPU and TPU exchanged data via the built-in PCIe interface – a standardized interconnect for components such as graphics cards, network adapters and storage devices. The system’s main memory – 4 GB of 64-bit LPDDR4 running at 1600 MHz, for 25.6 GB/s of bandwidth – hosted the shared data. The Edge TPU additionally has 8 MB of its own device memory, and Ubuntu Linux 18.04 served as the operating system.

They tested the SHMT concept using benchmark applications and found that the framework, running its best-performing QAWS policy, knocked it out of the park, delivering a 1.95x speedup and a remarkable 51% reduction in energy consumption compared to the baseline GPU-only method.
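As a quick sanity check (our arithmetic, not the paper’s), those two headline numbers combine into the energy-delay product charted further below: a 1.95x speedup means each run takes about 51 percent of the baseline time, and multiplying that by the 49-percent energy figure puts the energy-delay product at roughly a quarter of the baseline.

```python
# Back-of-the-envelope check using only the two reported numbers.
speedup = 1.95        # 1.95x faster than the GPU-only baseline
energy_ratio = 0.49   # 51% less energy = 49% of baseline energy

delay_ratio = 1 / speedup                # relative run time
edp_ratio = energy_ratio * delay_ratio   # energy-delay product
print(f"run time: {delay_ratio:.2f}x, EDP: {edp_ratio:.2f}x baseline")
# -> run time: 0.51x, EDP: 0.25x baseline
```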

SHMT speedup with different scheduling rules (vs. base GPU)

Hsu and Tseng

What does it all mean?

The researchers say the implications for SHMT are huge. Yes, software applications on your existing phones, tablets, desktops and laptops could use this new software library to achieve some pretty incredible performance improvements. But it could also reduce the need for expensive high-performance components, leading to cheaper and more efficient devices.

By cutting both energy consumption and cooling requirements, the approach could trim two of a data center’s biggest costs, while also reducing its carbon emissions and water consumption.

Energy consumption and energy-delay products

UC Riverside

As always, further research is needed regarding system implementation, hardware support, and the types of applications that will benefit the most, but with results like these, we imagine the team will have little trouble attracting the resources to pull this off.

The study was presented at MICRO 2023, the 56th annual IEEE/ACM International Symposium on Microarchitecture.

Source: UCR


