YOLO Enhanced with Temporal Race Logic: A Novel Framework for Energy-Efficient Real-Time Object Detection
Edward Gonzalez, Cooper Hawley, Jayden Jardine, Erik Rodriguez
1 Introduction
1.1 Introduction
We seem to be in the golden age of Artificial Intelligence (AI) and Machine Learning (ML). Rapid advancements in AI and ML have fueled a growing demand for high-performance, specialized computer architectures dedicated to training models and running inference. Neural Networks (NNs) and Convolutional Neural Networks (CNNs) are among the most widely used and successful models, capable of image recognition and segmentation, pattern finding, language processing, image generation, and much more.
While CNNs excel at high-performance, low-latency tasks, they are typically deployed in large-scale systems where power consumption is a minor concern. Their high computational requirements and significant power draw, however, present major challenges in small-scale or edge systems. As the field evolves, the need to adapt powerful networks for smaller, energy-constrained systems, such as electric vehicles, portable devices, and smart sensors, has become increasingly urgent. Achieving this scalability requires innovative approaches that balance energy efficiency, speed, and predictive accuracy. By reducing the barriers to AI integration, we can ensure its benefits are accessible and applicable across diverse situations and environments.

One of the most sought-after and critically important subfields of AI is computer vision and object detection. These technologies drive advancements in autonomous vehicles, security systems, and augmented reality. The current leader in real-time, accurate object detection is the YOLO (You Only Look Once) model [8]. This state-of-the-art model identifies objects, determines their positions within the data, and provides a confidence score for each classification, all in real time. It achieves this by passing its input through a feed-forward network, meaning the model has no cycles or loops. While this design is efficient, YOLO's convolutions present major challenges when embedded into mobile and edge computing environments, primarily because of computationally expensive Multiply-Accumulate (MAC) operations, the basic operations in performing convolutions [2]. When these models are deployed in resource-constrained environments, there is a delicate balancing act between accuracy and energy efficiency to ensure reliable performance.
1.2 Research Context and Problem Statement
The development of high-performance AI and machine learning models has become a central focus, with substantial time, energy, and expertise dedicated to maximizing their capabilities. These models perform exceptionally well in environments with abundant computing resources, but they fall short when energy efficiency and rapid processing are critical constraints, particularly in embedded systems. In such cases, relying on cloud-based data centers for computation is infeasible, as sending data over the internet for processing and waiting for the result introduces significant delays. To address this, the computing platform must be brought closer to the data source. The ultimate solution is to integrate the computing system directly with the sensor, enabling immediate data processing and inference. Embedding powerful computing platforms with sensors, however, presents a major challenge: achieving energy efficiency, especially when constrained by the need to preserve battery life. The current computational paradigm, optimized for environments with near-unlimited resources, does not translate well to these integrated, resource-constrained systems.

A promising approach to these energy and computation challenges lies in Temporal Encoding, a method that represents data as time delays instead of traditional binary bits [1][2]. This encoding scheme processes data within the temporal space using Race Logic (RL), a technique that formulates algorithms whose solutions are determined by which signals complete their operations first, leveraging the temporal dimension to optimize efficiency and computation [5]. Unlike traditional computations that run on CPU-based architectures, temporal encoding relies on CMOS circuits optimized for executing RL operations with significantly higher efficiency and performance. Temporal Encoding is particularly energy efficient for three reasons:
• First, since values are temporally encoded, energy is consumed only while a delay is actively being processed. Temporal encoding makes the system inherently stateless: no elements are stored, and values must be operated on immediately to retain their meaning, since their representation depends on the timing of their processing. This immediacy is required because new data is always arriving, and previously gathered data must finish processing before the new data is received.
• Second, inputs in the delay space support addition and multiplication far more efficiently than their digital counterparts. These Multiply-Accumulate (MAC) operations, fundamental to convolutions in CNNs, are calculated using simpler operations based on temporal arithmetic principles [2].
• Third, image sensor inputs can be easily converted into the delay space using Voltage-to-Time Conversion (VTC) [2]. This approach eliminates the need for traditional, energy-intensive Analog-to-Digital Conversion (ADC) of the entire image. Instead, operations are performed directly on the data in its temporal form (delay space), and only the final output is converted with an ADC, resulting in significant energy savings compared to digitizing the entire image.
Using a recursive architecture, convolutional layers can be built in the delay space and have been shown to achieve performance similar to importance-space CNNs while delivering significant energy savings [2]. Temporal encoding, however, presents some challenges. A fully RL-based system remains constrained by the absence of a native memory mechanism. To overcome this, researchers have developed hybrid solutions that temporarily store values in conventional digital memory [6]. These values are retrieved and converted back to delay space only when necessary, reducing energy consumption while maintaining performance. To optimize YOLO for reduced power consumption, we must process its convolutions in delay space. YOLO uses a streamlined CNN structure: its convolutional layers use the leaky rectified linear unit (Leaky ReLU) as their activation function [8] and max-pooling layers to reduce dimensionality, culminating in multiple fully connected layers that produce a final output of predictions.
Our project seeks to convert YOLO's neural network layers into their delay-space equivalents, creating a proof-of-concept Delay-Space YOLO (DS-YOLO) model. This adaptation will involve extensive modifications to the existing YOLO codebase.
While convolutional layers have previously been modified to use delay-space equivalent instructions [2], some functions (e.g., Leaky ReLU and fully connected layers) will need to be implemented or modified to work with YOLO (refer to Section 3.2 for a detailed breakdown). To evaluate the performance of DS-YOLO, we will develop simulations in Python. These will assess the network's predictive accuracy and computational efficiency compared to the standard implementation. A low-level architectural simulation using SPICE software will then help analyze delays and power consumption, providing insights into potential optimizations. Given the substantial data throughput requirements of CNN models like YOLO, efficient memory management becomes a critical challenge. Since Race Logic lacks a native memory representation, we may need to periodically convert values from delay space to conventional digital encoding for storage. These conversions are energy-intensive, so our project will aim to optimize their frequency to achieve the best balance between performance and energy efficiency. We will also explore strategies to streamline convolution operations, minimize memory access, and reduce energy consumption.
Overall, this research aspires to establish a foundation for energy-efficient, real-time AI systems tailored for embedded applications. By pushing the boundaries of integrated system design, we aim to unlock new possibilities for deploying powerful neural networks in resource-constrained environments, driving innovation in the field of embedded AI.
2 Related Work
2.1 Previously Proposed Solutions
Previous efforts to create energy-efficient neural networks have explored a variety of approaches, including changes in data representation [6], temporal computation frameworks [4], and biologically inspired models [10].

Spiking Neural Networks (SNNs) [10] leverage discrete spike-based communication to drastically reduce power consumption by mimicking the brain's energy-efficient signal processing. These networks encode information in sparse, event-driven spikes, minimizing redundant computations. However, they often struggle to match the accuracy of traditional networks and require significant reengineering of core neural operations, hindering their scalability and widespread adoption.

Temporal State Machines (TSMs) [4] introduce a framework for managing computations across time using temporal memory. This work emphasizes energy savings through time-evolving operations, similar to how temporal arithmetic applies delays for efficient computation, allowing computations to operate under delay-based race logic. However, the presented TSM model focuses on graph computations rather than convolutional operations, so further adaptation is required for CNNs.

Hybrid Temporal Computing [6] explores combining time-driven and conventional computing methods to improve energy efficiency. This approach provides a middle ground between purely temporal models and traditional computation, making it relevant to neural network research. However, the hybrid model introduces new challenges related to synchronization and consistency when combining different computation paradigms.

While these approaches offer important insights into energy-efficient computing, they do not directly solve the problem of implementing efficient real-time CNNs that maintain both energy savings and high accuracy. Spiking models like Spiking-YOLO reduce energy usage but sacrifice performance and accuracy [10]. Temporal State Machines demonstrate the potential of time-based operations but are not optimized for convolutional architectures [4]. Hybrid Temporal Computing suggests a promising direction but does not fully explore race logic or delay spaces for neural networks [6].
2.2 Race Logic
Race Logic (RL) is a fast, energy-efficient framework that encodes data as time delays, where a signal's value is represented by the time at which an energy spike arrives on a wire [5]. This temporal encoding allows logic operations to be performed by manipulating these delays rather than traditional binary values. Race Logic has four operations: First Arrival (FA), Last Arrival (LA), Add-Constant, and Inhibit [1].
• First Arrival (FA) and Last Arrival (LA) each take two input signals and pass through the encoded delay of the first- and last-arriving signal, respectively. FA uses an OR gate, which outputs the earliest arriving signal, propagating the shorter delay. In contrast, LA uses an AND gate, which waits for both signals to arrive, propagating the longer delay. As a result, FA computes the MIN of two signal delays, while LA computes the MAX. The difference between these delays, LA − FA, captures the temporal offset between the signals.
• Using a sequence of inverters and clocks, a fixed amount of time can be added to any delay; this operation is known as Add-Constant. It is useful for adding constants or shifting values into a desired range.
• An Inhibit signal determines whether a delay is passed through. If the inhibiting signal's delay is shorter than the data signal's delay, the data signal is never passed, and Inhibit outputs a delay of infinity. If the inhibiting signal's delay is longer, the data signal is passed through. Thus, the data signal passes through the Inhibit operator only if it is the MIN of the inhibit signal and the data signal; no signal is passed otherwise. The Inhibit operator can impose a threshold of "importance" on values, eliminating smaller-valued signals and increasing energy efficiency.

Thanks to these operations, Race Logic is highly applicable to graph problems and has proven to be a highly efficient way to implement shortest-path algorithms and decision trees [1][5]. This also makes Race Logic an efficient framework for working with CNNs. A minimal software model of the four primitives is sketched below.
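To make these four primitives concrete, here is a minimal behavioral model in Python (our own illustration, not hardware): each delay is represented as a non-negative float, and infinity stands for a signal that never fires.

    import math

    NEVER = math.inf  # a signal edge that never arrives

    def first_arrival(a, b):
        # OR gate: the earlier edge propagates, so FA is the MIN of the delays
        return min(a, b)

    def last_arrival(a, b):
        # AND gate: waits for both edges, so LA is the MAX of the delays
        return max(a, b)

    def add_constant(a, c):
        # a chain of clocked inverters appends a fixed delay c to the signal
        return a + c

    def inhibit(inhibiting, data):
        # the data edge passes only if it arrives before the inhibiting edge;
        # otherwise the output never fires (a delay of infinity)
        return data if data < inhibiting else NEVER

    a, b = 3.0, 5.0
    print(last_arrival(a, b) - first_arrival(a, b))  # 2.0, the temporal offset
    print(inhibit(4.0, 3.0))  # 3.0: data beats the inhibit threshold
    print(inhibit(2.0, 3.0))  # inf: data arrives too late and is suppressed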
2.3 Temporal Arithmetic and Energy Efficient Convolutions
One of the earlier challenges of Race Logic was mapping arithmetic functions into the delay space, particularly multiplication and negative numbers. Temporal Arithmetic (TA) enables operations such as multiplication, division, addition, subtraction, and the representation of negative numbers within the delay space, making it logically complete and compatible with the importance space [2]. Importance-space numbers are mapped onto the delay space through the Negative Log (NL) of the input (a property achieved using starved inverters in the VTC process). The NL function is particularly useful because larger weights in importance space map onto shorter delays in delay space. Larger numbers can be considered more "important" under convolutions in a CNN, and the delay space allows the most important values to arrive first. The mapping also has the property that addition in the delay space corresponds to multiplication in importance space, while importance-space addition maps to a negative log-sum-exponential (NLSE).

While these operations are possible, the delay space itself lacks a way to represent negative values. This is overcome by splitting every value into a non-negative pair <x_pos, x_neg>. A positive value is represented with x_pos carrying its magnitude and x_neg = 0; a negative value is represented with x_neg carrying its magnitude and x_pos = 0; zero is represented when x_pos = x_neg = 0. This approach expands the range of representable numbers by allowing operations to handle both positive and negative components, with a final renormalization step, setting x_pos, x_neg, or both to 0, to ensure the resulting values remain consistent. Although this renormalization step introduces some overhead, it typically needs to occur only once at the end of a series of operations, making its impact on performance minimal.

Implementing addition, subtraction, and multiplication gives the delay space arithmetic operations equivalent to those of the importance space. MAC instructions are the basis of CNNs, and a logically complete delay space means these operations can be processed entirely in the delay space. Multiplication maps to a less computationally demanding operation, addition, making convolutions much more efficient under Temporal Arithmetic. Representing negative values also allows negative inputs to be passed through convolutional layers, which is essential for processing negative weights. Using a recursive architecture to process MAC instructions, CNNs can be built and executed under Race Logic and Temporal Arithmetic. As a result, Temporal Arithmetic and Race Logic have become an emerging non-traditional form of computation for AI that allows for energy-efficient convolutions while preserving performance. Furthermore, Temporal Arithmetic and Race Logic have proved to be versatile and highly applicable across multiple domains, such as Pixel-to-Voltage Conversion and In-Sensor Classifiers [1], Superconducting Accelerators [9], Single-Photon 3D Imaging [3], and General Matrix Multiplication (GEMM) instructions [7]. To further expand their applicability, we propose an implementation in a highly important area of computer vision: image segmentation and object detection.
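As a sanity check on this mapping, the following Python sketch (our own illustration; the base of the logarithm is chosen arbitrarily as e) encodes importance-space values as negative-log delays, performs multiplication as delay addition and addition as an NLSE, and shows the dual-rail encoding of a signed value:

    import math

    def to_delay(x):
        # negative-log mapping: larger importance-space values arrive sooner
        return -math.log(x)

    def to_importance(d):
        return math.exp(-d)

    def multiply(d1, d2):
        # importance-space multiplication is simple addition of delays
        return d1 + d2

    def add(d1, d2):
        # importance-space addition is a negative log-sum-exponential (NLSE),
        # which delay-space hardware must approximate
        return -math.log(math.exp(-d1) + math.exp(-d2))

    def dual_rail(v):
        # a signed value v becomes a non-negative pair <x_pos, x_neg>
        return (v, 0.0) if v >= 0 else (0.0, -v)

    a, b = 0.8, 0.25
    da, db = to_delay(a), to_delay(b)
    assert math.isclose(to_importance(multiply(da, db)), a * b)
    assert math.isclose(to_importance(add(da, db)), a + b)
    print(dual_rail(-1.5))  # (0.0, 1.5)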
2.4 You Only Look Once (YOLO)
You Only Look Once (YOLO) [8] is a model for real-time object detection that identifies and delineates the bounds of common everyday objects in an image. YOLO marks a big improvement in object detection: its relatively simple class-probability and bounding-box approach allows object detection to be processed in real time. YOLO improves object detection by training classifiers on entire images, enabling the encoding of contextual information such as background details. This is achieved through its unified detection framework, which uses a single convolutional neural network to extract all necessary information efficiently. The model also generates a class probability map, allowing for streamlined and effective object classification with minimal impact on accuracy. YOLO divides an input image into an m × m grid, where each cell predicts two bounding boxes (later models of the YOLO framework use more) along with their dimensions and a set of probabilities indicating the likelihood that the detected object within each bounding box belongs to one of the pretrained classes.

While multiple versions of YOLO exist, each CNN consists of a set of convolutional layers using Leaky ReLU as the activation function, with max-pooling layers for dimensionality reduction between them. A final reduction layer and a sequence of fully connected layers produce the final tensor of predictions. Leaky ReLU is preferred over ReLU to address the dying-neuron problem, where a neuron becomes entirely inactive during training. It is defined by the piecewise function

f(x) = \begin{cases} \alpha x & \text{if } x < 0, \\ x & \text{if } x \geq 0, \end{cases}

where α is typically set to 0.1 because this value strikes a balance between allowing some gradient flow for negative inputs and maintaining non-linearity. A higher α would make the function behave more like a linear activation, reducing its ability to model complex patterns, while a lower α would cause it to act more like a standard ReLU, reintroducing the risk of dying neurons.

Later versions of YOLO introduce additional layers, fine-tuning, and functions to improve accuracy. Implementing these additional layers in the delay space will broaden the applicability of Temporal Arithmetic to other convolutional models; however, the complexity and runtime of the model increase with each iteration, affecting YOLO's suitability for embedded systems. Thus, our project will need to strike a balance between accuracy and energy consumption to ensure reliable performance.
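For reference, the activation itself is a single expression; the NumPy sketch below uses α = 0.1 as discussed above:

    import numpy as np

    def leaky_relu(x, alpha=0.1):
        # positive inputs pass through unchanged; negative inputs are scaled
        # by alpha, so gradients never vanish entirely and neurons cannot die
        return np.where(x >= 0, x, alpha * x)

    x = np.array([-2.0, -0.5, 0.0, 1.5])
    print(leaky_relu(x))  # [-0.2  -0.05  0.    1.5 ]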
2.5 Mural Mind Map
Below is a mural of previous research relating to and building on Temporal Arithmetic, split into the five categories shown. The pieces of prior research we believe our project most directly builds upon point to the top red box bearing this project's title. Here is a link to view the Mural.

3 Implementation Plan
3.1 Overall Implementation Plan
YOLO has become its own field of research, with multiple articles, research papers, and previous implementations being easily accessible. Our solution will implement a proof-of-concept DS-YOLO model, replacing its underlying convolutions with their delay-space equivalents.
The code implementation of our model will be twofold:
• First, we will either create or find a YOLO model built in TensorFlow, ensuring that it performs as expected in terms of accuracy and speed. Once validated, we will redefine each of the utilized TensorFlow layers by creating overrides that implement their delay-space equivalents. These overrides will be written in a combination of Python and C to efficiently simulate the temporal arithmetic operations. We will then iterate over the original YOLO model, layer by layer, converting each TensorFlow layer to its delay-space counterpart, ultimately yielding a fully delay-space DS-YOLO model (see the override sketch below).
• Second, after obtaining a functional DS-YOLO model, we will conduct comprehensive benchmarking tests. We will compare the performance of DS-YOLO with YOLO, focusing on two key aspects: predictive accuracy and inference time at a high level, and energy consumption at a low level. For the energy consumption analysis, we will use low-level architectural simulations in SPICE to model the energy characteristics of the delay-space computations. A critical consideration in our evaluation will be the energy trade-offs associated with the delay-space implementation, particularly regarding memory management and data storage. Since RL lacks a native memory representation, we need to assess whether it is feasible to perform all computations entirely in delay space or whether we must periodically convert values between delay space and traditional digital encoding for storage. These conversions, especially from digital back to the analog delay domain, can be energy-intensive and may offset some of the energy savings achieved through temporal arithmetic.

In the end, our solution will be a proof-of-concept model for energy-efficient object detection using YOLO. This will allow object detection to be applied in in-sensor and embedded applications and increase the overall versatility of neural networks.
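A minimal sketch of the override pattern is shown below, assuming a Keras-style YOLO model. The class and function names are ours, and the delay-space arithmetic itself is left as a placeholder; the point is the swap-and-copy-weights mechanism.

    import tensorflow as tf

    class DelaySpaceConv2D(tf.keras.layers.Conv2D):
        """Hypothetical override: keeps the stock Conv2D interface so it can
        be swapped into an existing model; call() is the hook where the
        simulated temporal-arithmetic MACs would replace the convolution."""

        def call(self, inputs):
            # Placeholder: a real override would encode inputs and weights as
            # delays, run the temporal MACs, and decode the result back into
            # a tensor. For now, delegate to the stock convolution.
            return super().call(inputs)

    def to_delay_space(model):
        # Rebuild the model layer by layer, swapping each Conv2D for its
        # delay-space counterpart, then copy the trained weights across.
        def clone_fn(layer):
            if type(layer) is tf.keras.layers.Conv2D:
                return DelaySpaceConv2D.from_config(layer.get_config())
            return layer.__class__.from_config(layer.get_config())
        ds_model = tf.keras.models.clone_model(model, clone_function=clone_fn)
        ds_model.set_weights(model.get_weights())
        return ds_model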
3.2 Implementing a Proof-of-Concept YOLO Model
The YOLO V3 model we are currently considering is built on a neural network backbone called Darknet. Darknet comprises convolutional layers organized into residual blocks, leveraging batch normalization and Leaky ReLU activation functions for stability and non-linearity. The architecture strategically employs max-pooling layers to reduce the spatial dimensions of feature maps without overcomplicating computations. Additionally, Darknet integrates skip connections via concatenation, upsampling layers, and specialized components for bounding-box regression and object detection. This model's modular design makes it a suitable candidate for delay-space implementation, providing a proof-of-concept for energy-efficient neural network computation.

Previous work has successfully implemented convolutional and max-pooling layers in delay space to validate energy-efficient computations [2]. Building on these efforts, we aim to extend delay-space implementations to all layers and methods defined in YOLO V3. This includes convolutional layers, max-pooling layers, activation functions such as Leaky ReLU, batch normalization, upsampling layers, skip connections (via concatenation), object detection layers, bounding-box regression, and the final non-max suppression step. Each component will be adapted to operate within the delay space while preserving its original functionality. Activation functions will be treated as their own layers, enabling straightforward conversion to delay space in the iterative process.

Training will still occur in importance space, allowing the neural network to learn its weights using conventional methods. After training, the learned weights will be extracted and converted into their delay-space equivalents. Inference will then be performed entirely in delay space using these converted weights, leveraging the efficiency of temporal encoding while retaining the accuracy of the pre-trained model.

We will adopt the dual-rail approach [2] by splitting tensors into positive and negative components, ensuring consistency with real-world hardware implementations and CMOS design principles. At the start of each delay-space layer, tensors will be split into these components, and the dual-rail approach will be applied throughout the layer's operations. Since we are directly overriding TensorFlow functions and leveraging TensorFlow's graph optimization for performance, the tensors will be re-merged at the end of the layer (to adhere to TensorFlow requirements) before being passed to the next layer in the model (see the sketch at the end of this subsection).

By converting the YOLO V3 model to operate within the delay space, we aim to demonstrate the feasibility of implementing complex neural network architectures in energy-efficient temporal encoding frameworks. Leveraging the modularity of Darknet and adhering to CMOS design principles ensures compatibility with real-world hardware implementations. This proof-of-concept not only validates delay-space computation for state-of-the-art object detection models but also paves the way for future advancements in energy-efficient neural network design.
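The split/merge bookkeeping at each layer boundary can be sketched in a few lines (our own illustration; the rails here hold importance-space magnitudes prior to delay encoding):

    import tensorflow as tf

    def split_dual_rail(x):
        # <x_pos, x_neg>: two non-negative rails with x = x_pos - x_neg
        return tf.nn.relu(x), tf.nn.relu(-x)

    def merge_dual_rail(x_pos, x_neg):
        # renormalization at the layer boundary: collapse the rails back into
        # one signed tensor so downstream TensorFlow layers are unaffected
        return x_pos - x_neg

    x = tf.constant([[-1.5, 2.0, 0.0]])
    pos, neg = split_dual_rail(x)
    tf.debugging.assert_equal(merge_dual_rail(pos, neg), x)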
3.3 Evaluating the YOLO Model
We will use the CIFAR-10 dataset for benchmarking. Training will likely be done on the COCO dataset, as many pre-trained weights are available for it. For a baseline of accuracy, we will run the default YOLO model on the testing data and record its average inference time and accuracy. We will then record the same data points for the DS-YOLO model and compare the two models. Accuracy is extremely important for image classification and segmentation; however, our solution inherently introduces some degree of negative influence on the model's accuracy. This stems from the inability to natively compute negative log-sum-exponentials in delay space, requiring approximations whose error can be made arbitrarily small. We expect the difference in accuracy to be negligible, but further testing is needed. It may seem illogical to purposefully make the model worse, but the trade-off is a hypothesized reduction in energy consumption.

Assessing energy efficiency in delay space will be done by mapping all operations done in digital logic to their CMOS-equivalent implementations. We will keep track of which operations were called and how many times. Each operation has been simulated using the SPICE hardware simulation software, and the energy consumption of every delay-space operation has been precalculated. To estimate the total energy consumption, we can use this simple formula:

E_{total} = \sum_{op \in Ops} N_{op} \cdot E_{op}

where N_{op} is the number of times operation op is invoked and E_{op} is its precalculated per-invocation energy cost.
This sum gives an estimated total energy consumption for our model. We will perform the same operation-cost breakdown for a purely digital system to see how DS-YOLO compares to the base YOLO model.
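In code, this estimate is a straightforward tally. The operation names and per-operation energies below are placeholders, not measured results; the real values come from the SPICE simulations described above.

    # Hypothetical per-operation energies (joules), precalculated via SPICE.
    ENERGY_PER_OP = {
        "first_arrival": 1.2e-12,
        "last_arrival": 1.3e-12,
        "add_constant": 0.9e-12,
        "inhibit": 1.1e-12,
    }

    def total_energy(op_counts):
        # E_total = sum over operations of (invocation count * per-op energy)
        return sum(n * ENERGY_PER_OP[op] for op, n in op_counts.items())

    counts = {"first_arrival": 1_000_000, "last_arrival": 950_000,
              "add_constant": 2_400_000, "inhibit": 120_000}
    print(f"estimated energy: {total_energy(counts):.3e} J")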
References
[1] Georgios Tzimpragos, Advait Madhavan, Dilip Vasudevan, Dmitri Strukov, and Timothy Sherwood. Boosted race trees for low energy classification. In ASPLOS '19: Architectural Support for Programming Languages and Operating Systems, 2019.
[2] Rhys Gretsch, Peiyang Song, Advait Madhavan, Jeremy Lau, and Timothy Sherwood. Energy efficient convolutions with temporal arithmetic. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS '24, pages 354–368, New York, NY, USA, 2024. Association for Computing Machinery.
[3] Atul Ingle and David Maier. Count-free single-photon 3D imaging with race logic. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–12, 2023.
[4] Advait Madhavan, Matthew W. Daniels, and Mark D. Stiles. Temporal state machines: Using temporal memory to stitch time-based graph computation. ACM Journal on Emerging Technologies in Computing Systems, 17(3), 2021.
[5] Advait Madhavan, Timothy Sherwood, and Dmitri Strukov. Race logic: A hardware acceleration for dynamic programming algorithms. In 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pages 517–528, 2014.
[6] Advait Madhavan and Mark D. Stiles. Hybrid temporal computing for lower power hardware accelerators. arXiv preprint arXiv:2407.08975, 2024.
[7] Zhewen Pan, Joshua San Miguel, and Di Wu. CARAT: Unlocking value-level parallelism for multiplier-free GEMMs. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS '24, pages 167–184, New York, NY, USA, 2024. Association for Computing Machinery.
[8] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection, 2016.
[9] Georgios Tzimpragos, Dilip Vasudevan, Nestan Tsiskaridze, George Michelogiannakis, Advait Madhavan, Jennifer Volk, John Shalf, and Timothy Sherwood. A computational temporal logic for superconducting accelerators. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '20, pages 435–448, New York, NY, USA, 2020. Association for Computing Machinery.
[10] Seijoon Kim, Seongsik Park, Byunggook Na, and Sungroh Yoon. Spiking-YOLO: Spiking neural network for energy-efficient object detection. arXiv preprint arXiv:1903.06530, 2019.