A new technique enables AI on edge devices to keep learning over time.
With the PockEngine training method, machine-learning models can efficiently and continuously learn from user data on edge devices like smartphones.
Personalized deep-learning models can power artificial intelligence chatbots that adapt to understand a user’s accent, as well as smart keyboards that constantly update to better predict the next word based on a user’s typing history. This customization necessitates the continuous fine-tuning of a machine-learning model with new data.
Because smartphones and other edge devices lack the memory and computational power required for fine-tuning, user data is typically uploaded to cloud servers, where the model is updated. However, data transmission consumes a lot of energy, and sending sensitive user data to a cloud server is a security risk.
Researchers from MIT, the MIT-IBM Watson AI Lab, and elsewhere developed a technique that enables deep-learning models to efficiently adapt to new sensor data directly on an edge device.
Their on-device training method, called PockEngine, determines which parts of a huge machine-learning model need to be updated to improve accuracy and only stores and computes with those specific pieces. It performs the bulk of these computations while the model is being prepared, before runtime, which minimizes computational overhead and boosts the speed of the fine-tuning process.
PockEngine significantly accelerated on-device training when compared to other methods, performing up to 15 times faster on some hardware platforms. Furthermore, PockEngine had no effect on model accuracy. The researchers also discovered that their method of fine-tuning allowed a popular AI chatbot to answer complex questions more accurately.
“On-device fine-tuning can enable better privacy, lower costs, customization ability, and also lifelong learning, but it is not easy. Everything has to happen with a limited number of resources. We want to be able to run not only inference but also training on an edge device. With PockEngine, now we can,” says Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, a distinguished scientist at NVIDIA, and senior author of an open-access paper describing PockEngine.
Han is joined on the paper by lead author Ligeng Zhu, an EECS graduate student, as well as others at MIT, the MIT-IBM Watson AI Lab, and the University of California San Diego. The paper was recently presented at the IEEE/ACM International Symposium on Microarchitecture.
AI on edge devices – Layer by layer
Deep-learning models are built on neural networks, which are made up of many interconnected layers of nodes, or “neurons,” that process data in order to make a prediction. When the model is run, a data input (such as an image) is passed from layer to layer until the prediction (perhaps the image label) is output at the end, a process known as inference. During inference, each layer’s intermediate results can be discarded as soon as the data has passed through it.
However, during training and fine-tuning the model goes through a process known as backpropagation. Backpropagation compares the model’s output to the correct answer and then runs through the model in reverse, updating each layer so that the output moves closer to the correct answer.
Because each layer may require updating, the entire model and intermediate results must be saved, making fine-tuning more memory-intensive than inference.
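The contrast between the two passes can be made concrete with a toy model. The sketch below uses a stack of scalar “layers” (each just multiplies by a weight); all names and numbers are illustrative stand-ins, not part of PockEngine itself. Inference keeps only the current activation alive, while the training step must store every intermediate activation so the reverse pass can compute each layer’s update.

```python
def inference(weights, x):
    """Forward pass: each intermediate can be discarded immediately."""
    for w in weights:
        x = w * x  # only the current activation is alive
    return x

def training_step(weights, x, target, lr=0.001):
    """Backpropagation must keep every intermediate activation."""
    acts = [x]
    for w in weights:
        acts.append(w * acts[-1])  # store all activations for the reverse pass
    # Loss = (output - target)^2; run through the model in reverse.
    grad = 2 * (acts[-1] - target)
    new_weights = list(weights)
    for i in reversed(range(len(weights))):
        new_weights[i] -= lr * grad * acts[i]  # dL/dw_i needs the saved activation
        grad *= weights[i]                     # propagate to the earlier layer
    return new_weights

weights = [0.5, 2.0, 1.5]
out = inference(weights, 4.0)                       # 4 * 0.5 * 2.0 * 1.5 = 6.0
updated = training_step(weights, 4.0, target=3.0)   # one fine-tuning step
```

Note that `acts` grows with the depth of the model; this is exactly the extra memory that makes fine-tuning more expensive than inference.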
However, not all neural network layers are equally important for improving accuracy. Even for important layers, the entire layer may not need to be updated. Those layers and fragments of layers do not need to be saved. Furthermore, going all the way back to the first layer may not be necessary to improve accuracy; the process could be stopped somewhere in the middle.
PockEngine takes advantage of these factors to speed up the fine-tuning process and cut down on the amount of computation and memory required.
The system first fine-tunes each layer, one at a time, on a certain task and measures the accuracy improvement after each individual layer. In this way, PockEngine identifies the contribution of each layer, as well as trade-offs between accuracy and fine-tuning cost, and automatically determines the percentage of each layer that needs to be fine-tuned.
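The selection step described above can be sketched as a budgeted ranking problem: given a measured accuracy gain and a fine-tuning cost for each layer, keep the layers that pay their way. The gain and cost numbers below are made up for illustration, and this greedy heuristic is an assumption, not PockEngine’s actual algorithm.

```python
def select_layers(gains, costs, budget):
    """Greedily pick layers with the best accuracy-gain / cost ratio
    until the fine-tuning budget is exhausted."""
    ranked = sorted(range(len(gains)),
                    key=lambda i: gains[i] / costs[i], reverse=True)
    chosen, spent = [], 0.0
    for i in ranked:
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return sorted(chosen)

# Hypothetical per-layer measurements from tuning one layer at a time:
gains = [0.1, 2.5, 0.3, 1.8]   # accuracy points gained if this layer is tuned
costs = [4.0, 2.0, 3.0, 2.0]   # memory/compute cost of tuning it
print(select_layers(gains, costs, budget=5.0))  # → [1, 3]
```

Only the chosen layers (and their activations) would then need to be stored and updated during on-device fine-tuning.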
“This method matches the accuracy very well compared to full back propagation on different tasks and different neural networks,” Han adds.
A pared-down model
Traditionally, the backpropagation graph is generated during runtime, which requires a significant amount of computation. PockEngine, on the other hand, does this during compile time, while the model is being prepared for deployment.
PockEngine deletes code to remove unnecessary layers or pieces of layers, resulting in a model graph that can be used during runtime. It then performs additional optimizations on this graph to improve efficiency even further.
Because all of this only needs to be done once, it reduces runtime computational overhead.
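The compile-time idea can be sketched as precomputing a backward-pass schedule once, before deployment, so the runtime loop does no graph construction at all. This is a minimal sketch under the assumption that the set of trainable layers is fixed ahead of time; the function names are hypothetical.

```python
def compile_backward_schedule(num_layers, trainable):
    """Precompute, once, which layers participate in backpropagation and
    how far back the reverse pass must go (everything earlier is skipped)."""
    trainable = set(trainable)
    earliest = min(trainable)  # stop backpropagating here, not at layer 0
    return [i for i in range(num_layers - 1, earliest - 1, -1)
            if i in trainable]

# "Compile time": done once, while the model is prepared for deployment.
schedule = compile_backward_schedule(num_layers=6, trainable={3, 5})

def runtime_step(schedule):
    """At runtime there is no graph building: just follow the baked plan."""
    return [f"update layer {i}" for i in schedule]

print(runtime_step(schedule))  # → ['update layer 5', 'update layer 3']
```

Because the schedule is fixed, further optimizations (like fusing or deleting frozen-layer code) can also be applied once, offline, rather than on every training iteration.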
“It is like before setting out on a hiking trip. At home, you would do careful planning — which trails are you going to go on, which trails are you going to ignore. So then at execution time, when you are actually hiking, you already have a very careful plan to follow,” Han explains.
When they used PockEngine to train deep-learning models on various edge devices, such as Apple M1 chips, the digital signal processors found in many smartphones, and Raspberry Pi computers, it performed on-device training up to 15 times faster without sacrificing accuracy. PockEngine also reduced the amount of memory required for fine-tuning.
The technique was also applied to the large language model Llama-V2. The fine-tuning process for large language models involves providing many examples, and it’s critical for the model to learn how to interact with users, according to Han. The process is also important for models that must solve complex problems or reason about possible solutions.
For example, PockEngine-tuned Llama-V2 models answered the question “What was Michael Jackson’s last album?” whereas models that had not been fine-tuned performed poorly. On an NVIDIA Jetson Orin edge GPU platform, PockEngine reduced the time required for each iteration of the fine-tuning process from about seven seconds to less than one second.
The researchers hope to use PockEngine in the future to fine-tune even larger models designed to process text and images simultaneously.
“This work addresses growing efficiency challenges posed by the adoption of large AI models such as LLMs across diverse applications in many different industries. It not only holds promise for edge applications that incorporate larger models, but also for lowering the cost of maintaining and updating large AI models in the cloud,”says Ehry MacRostie, a senior manager in Amazon’s Artificial General Intelligence division who was not involved in this study but works with MIT on related AI research through the MIT-Amazon Science Hub.
This work was supported, in part, by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT-Amazon Science Hub, the National Science Foundation (NSF), and the Qualcomm Innovation Fellowship.