Neural Network Unit with Output Buffer Feedback for Performing Recurrent Neural Network Computations

Published: February 4, 2020

Assignee: not available in USPTO data we have

Inventors: G. GLENN HENRY, TERRY PARKS, KYLE T. O'BRIEN

Patent Claims

19 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A neural network unit (NNU) that performs calculations for a recurrent neural network (RNN) having input layer nodes, hidden layer nodes, output layer nodes and context layer nodes, wherein each of the input layer nodes, hidden layer nodes, output layer nodes, and context layer nodes is implemented in circuitry and configured to perform an arithmetic operation, the NNU comprising: an array of neural processing units (NPU), at least one random access memory (RAM), and an output buffer, the array of NPUs: (a) read, from the output buffer, values of the context layer nodes associated with a first time step; (b) read, from the RAM, values of the input layer nodes associated with a second time step subsequent to the first time step; (c) generate values of the hidden layer nodes associated with the second time step based on the values of the input layer nodes read from the RAM and the values of the context layer nodes read from the output buffer; (d) output the hidden layer node values associated with the second time step to the output buffer rather than to the RAM; (e) read, from the output buffer, the hidden layer node values associated with the second time step; (f) generate values of the context layer nodes associated with the second time step based on the hidden layer node values read from the output buffer; (g) output the context layer node values associated with the second time step to the hidden layer nodes rather than to the RAM; (h) generate values of the output layer nodes associated with the second time step using the hidden layer node values associated with the second time step; (i) write the output layer node values associated with the second time step to the RAM; and (j) repeat (a) through (i) for a sequence of time steps.

Plain English Translation

This invention relates to hardware acceleration for recurrent neural networks (RNNs). It addresses the challenge of efficiently processing sequential data by providing a specialized neural network unit (NNU). The NNU is designed to perform calculations for an RNN that includes input, hidden, output, and context layers. Each of these layers is implemented in circuitry capable of arithmetic operations. The NNU comprises an array of neural processing units (NPUs), random access memory (RAM), and an output buffer. The NPUs operate in a time-sequential manner. At each time step, they first retrieve context layer node values from the output buffer, which represent information from a previous time step. They then read input layer node values for the current time step from the RAM. Based on these inputs, the NPUs compute and generate values for the hidden layer nodes. Crucially, these hidden layer node values are sent directly to the output buffer, bypassing the RAM. Subsequently, the NPUs read these newly computed hidden layer node values from the output buffer to generate the context layer node values for the current time step. These context layer values are then outputted to the hidden layer nodes, again avoiding the RAM. The NPUs then use the hidden layer node values to compute the output layer node values, which are finally written to the RAM. This process is repeated for a sequence of time steps, enabling the NNU to process temporal data efficiently.
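The per-time-step loop in steps (a) through (j) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the patented circuitry: the function names, the clamp used as an activation function, and the specific arithmetic (summing the context values into each hidden node, then weighting the hidden values to form the context values, roughly as dependent claims 3, 5, 7, and 8 suggest) are illustrative choices. What it shows is that hidden and context values live only in the output buffer, while inputs and outputs pass through the RAM.

```python
def clamp01(v):
    # Hypothetical activation function; the claim does not name one.
    return max(0.0, min(1.0, v))


def run_rnn(ram_inputs, w_in, w_out):
    """Toy model of steps (a)-(j); every name here is an assumption."""
    n = len(w_in)
    output_buffer = [0.0] * n   # context values before the first step
    ram_outputs = []            # stands in for the RAM rows holding outputs

    for x in ram_inputs:                       # (j) repeat per time step
        context = list(output_buffer)          # (a) read context from the buffer
        # (b) x is the row of input layer values read from the RAM
        hidden = [                             # (c) hidden from input + context
            sum(context) + sum(w_in[j][i] * x[i] for i in range(n))
            for j in range(n)
        ]
        output_buffer = hidden                 # (d) to the buffer, not the RAM
        hidden_read = list(output_buffer)      # (e) read hidden back
        new_context = [                        # (f) context from hidden
            sum(w_out[j][i] * hidden_read[i] for i in range(n))
            for j in range(n)
        ]
        output_buffer = new_context            # (g) fed back, not to the RAM
        outputs = [clamp01(v) for v in new_context]   # (h) output layer values
        ram_outputs.append(outputs)            # (i) only outputs reach the RAM
    return ram_outputs


result = run_rnn(
    ram_inputs=[[1.0, 0.0], [0.0, 1.0]],
    w_in=[[1.0, 0.0], [0.0, 1.0]],
    w_out=[[0.5, 0.0], [0.0, 0.5]],
)
```

In the second time step the context values left in the buffer by the first step change the hidden values, which is the feedback the claim describes.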

Claim 2

Original Legal Text

2. The NNU of claim 1 , further comprising: the array of NPUs comprises N NPUs, each comprising a multiplexed register, an arithmetic unit, and an accumulator circuit, wherein the accumulator circuit has an output and the arithmetic unit that performs operations on inputs, and wherein N is an integer value; the arithmetic unit receives an output of the multiplexed register and an output of the accumulator circuit, and the arithmetic unit generates a result provided to the accumulator circuit; the output buffer is N words wide and is configured to hold N of the context/hidden layer node values; and the N multiplexed registers are arranged to form an N-word hardware rotater that receives the N words of the output buffer.

Plain English Translation

Building on the NNU of claim 1, this claim specifies the structure of the NPU array and the output buffer. The array contains N NPUs, where N is an integer, and each NPU includes a multiplexed register, an arithmetic unit, and an accumulator circuit. The arithmetic unit receives the output of the multiplexed register and the output of the accumulator circuit, and it generates a result that is provided back to the accumulator circuit. This feedback path lets each NPU build up a running result over successive operations. The output buffer is N words wide and holds N context or hidden layer node values. The N multiplexed registers are arranged to form an N-word hardware rotater that receives the N words of the output buffer, so values read from the buffer can be passed from NPU to NPU and each NPU can operate on every word in turn.
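As a rough illustration of the N-word rotater, the sketch below models the N multiplexed registers as a Python list that can either load N words (for example from the output buffer) or shift every word to the neighboring register. The class name, method names, and the direction of rotation are assumptions made for the example; the claim only says the registers form a rotater.

```python
class Rotater:
    """Toy model of N multiplexed registers chained into a ring."""

    def __init__(self, n):
        self.regs = [0] * n

    def load(self, words):
        # Mux selects the external input, e.g. the N words of the output buffer.
        if len(words) != len(self.regs):
            raise ValueError("expected exactly N words")
        self.regs = list(words)

    def rotate(self):
        # Mux selects the neighbor: register j takes the word from register j-1,
        # and register 0 takes the word from the last register.
        self.regs = [self.regs[-1]] + self.regs[:-1]


rot = Rotater(4)
rot.load([10, 20, 30, 40])
rot.rotate()
after_one = list(rot.regs)
for _ in range(3):
    rot.rotate()
after_four = list(rot.regs)
```

After N rotations every register has seen every word once and the ring is back in its starting position.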

Claim 3

Original Legal Text

3. The NNU of claim 2 , further comprising: to said (e) read, from the output buffer, the hidden layer node values associated with the second time step, the N NPUs read the N values of the hidden layer nodes from the output buffer into the rotater; the N NPUs read, from the RAM, weight values associated with connections between the hidden layer nodes and the output layer nodes; and to said (f) generate values of the context layer nodes associated with the second time step based on the hidden layer node values read from the output buffer, the N NPUs: rotate the N values of the hidden layer nodes through the rotater for provision to the arithmetic unit of each of the N NPUs; multiply, by each of the N arithmetic units, each of the N rotated hidden layer node values by one of the weight values to generate N respective products; and accumulate, into each of the N accumulator circuits, the N respective products to generate a result.

Plain English Translation

Building on claim 2, this claim details how the NPUs carry out steps (e) and (f). For step (e), the N NPUs read the N hidden layer node values for the second time step from the output buffer into the rotater, and they read from the RAM the weight values associated with the connections between the hidden layer nodes and the output layer nodes. For step (f), the NPUs rotate the N hidden layer node values through the rotater so that the values are provided to the arithmetic unit of each NPU. Each arithmetic unit multiplies each rotated hidden layer node value by one of the weight values, producing N products, and each accumulator circuit accumulates its N products to generate a result. These accumulated results are the basis for the context layer node values of the second time step.
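The rotate, multiply, and accumulate sequence amounts to a matrix-vector product in which each NPU computes one row. The following sketch shows that equivalence under stated assumptions: the function name, the rotation direction, and the weight layout `weights[j][i]` (weight applied by NPU j to hidden value i) are hypothetical and are not taken from the patent.

```python
def rotate_multiply_accumulate(values, weights):
    """Each of N 'NPUs' accumulates N products while the values rotate past it."""
    n = len(values)
    regs = list(values)          # rotater loaded from the output buffer
    acc = [0.0] * n              # one accumulator per NPU

    for k in range(n):
        for j in range(n):
            src = (j - k) % n    # index of the value now sitting in register j
            acc[j] += regs[j] * weights[j][src]
        regs = [regs[-1]] + regs[:-1]   # rotate by one word
    return acc


acc = rotate_multiply_accumulate(
    values=[1, 2, 3],
    weights=[[1, 0, 0],
             [0, 1, 0],
             [1, 1, 1]],
)
```

After N rotations accumulator j holds the sum over i of `weights[j][i] * values[i]`, so no NPU ever needs a direct connection to more than its own register and its neighbor.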

Claim 4

Original Legal Text

4. The NNU of claim 3 , further comprising: a plurality of activation function units (AFU) that receive the accumulator circuit output of associated one or more of the NPUs and perform an activation function on the accumulator circuit output, wherein each of the AFUs is implemented in circuitry.

Plain English Translation

Building on claim 3, the NNU further includes a plurality of activation function units (AFUs), each implemented in circuitry. Each AFU receives the accumulator circuit output of one or more associated NPUs and performs an activation function on that output. The claim does not name a particular activation function; sigmoid, hyperbolic tangent, and rectify functions are common examples in neural networks. Performing the activation function in dedicated circuitry keeps this step inside the NNU's data path rather than handing it to software.

Claim 5

Original Legal Text

5. The NNU of claim 4 , further comprising: to said (h) generate values of the output layer nodes associated with the second time step using the hidden layer node values associated with the second time step, the plurality of AFUs: for each result of the N results of the accumulated respective N respective products received from the accumulator circuit output of each of the N NPUs, perform an activation function on the result to generate a respective output layer node value.

Plain English Translation

Building on claim 4, this claim details step (h), generating the output layer node values for the second time step from the hidden layer node values of that same time step. Each of the N NPUs supplies, at its accumulator circuit output, the result of accumulating its N products of hidden layer node values and weights. For each of these N results, the AFUs perform an activation function on the result to generate the respective output layer node value. The output layer values for a time step are thus computed from the hidden layer values of that same time step.

Claim 6

Original Legal Text

6. The NNU of claim 4 , further comprising: the plurality of AFUs is N, and each of the N AFUs is coupled to receive the accumulator circuit output of a respective one of the N NPUs and to provide its result of the activation function to a respective one of the N words of the output buffer.

Plain English Translation

Building on claim 4, this claim fixes the number of AFUs at N, one per NPU. Each of the N AFUs is coupled to receive the accumulator circuit output of a respective one of the N NPUs, and each provides the result of its activation function to a respective one of the N words of the output buffer. Because every NPU has its own AFU and its own output buffer word, the N activation functions can be computed in parallel, and the results land in the output buffer where the NPUs can read them back in a later step.
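A minimal sketch of the one-AFU-per-NPU arrangement follows. The helper names and the rectify function are assumptions chosen for the example, since the claim does not say which activation function is used.

```python
def rectify(v):
    # Hypothetical activation function: pass positive values, zero the rest.
    return v if v > 0 else 0.0


def apply_afus(accumulators, activation):
    # AFU j reads accumulator j and writes word j of the output buffer.
    return [activation(acc) for acc in accumulators]


output_buffer = apply_afus([-1.5, 0.0, 2.5], rectify)
```

The list index plays the role of the fixed wiring between NPU j, AFU j, and output buffer word j.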

Claim 7

Original Legal Text

7. The NNU of claim 2 , further comprising: to said (a) read, from the output buffer, values of the context layer nodes associated with a first time step, the N NPUs read the N values of the context layer nodes from the output buffer into the rotater; and to said (c) generate values of the hidden layer nodes associated with the second time step based on the values of the input layer nodes read from the RAM and the values of the context layer nodes read from the output buffer, the N NPUs: rotate the N values of the context layer nodes through the rotater for provision to the arithmetic unit of each of the N NPUs; and accumulate, into each of the N accumulator circuits, the N values of the context layer nodes.

Plain English Translation

Building on claim 2, this claim details how the context layer node values enter the hidden layer computation. For step (a), the N NPUs read the N context layer node values for the first time step from the output buffer into the rotater. For step (c), the NPUs rotate the N context layer node values through the rotater so that they are provided to the arithmetic unit of each NPU, and each of the N accumulator circuits accumulates the N context layer node values. Note that the claim recites accumulating the context values directly; no multiplication by weights is recited for them.

Claim 8

Original Legal Text

8. The NNU of claim 7 , further comprising: to said (b) read, from the RAM, values of the input layer nodes associated with a second time step subsequent to the first time step, the N NPUs read the N values of the input layer nodes nodes from the RAM into the rotater; the N NPUs read, from the RAM, weight values associated with connections between the input layer nodes and the hidden layer nodes; and to said (c) generate values of the hidden layer nodes associated with the second time step based on the values of the input layer nodes read from the RAM and the values of the context layer nodes read from the output buffer, the N NPUs further: rotate the N values of the input layer nodes through the rotater for provision to the arithmetic unit of each of the N NPUs; multiply, by each of the N arithmetic units, each of the N rotated input layer node values by one of the weight values to generate N respective products; and accumulate, into each of the N accumulator circuits, the N respective products along with the accumulated N values of the context layer nodes.

Plain English Translation

Building on claim 7, this claim adds the input layer contribution to the hidden layer computation. For step (b), the N NPUs read the N input layer node values for the second time step from the RAM into the rotater, and they also read from the RAM the weight values associated with the connections between the input layer nodes and the hidden layer nodes. For step (c), the NPUs rotate the N input layer node values through the rotater so that they are provided to the arithmetic unit of each NPU. Each arithmetic unit multiplies each rotated input layer node value by one of the weight values to generate N products, and each accumulator circuit accumulates these N products along with the N context layer node values it accumulated under claim 7. The accumulated total in each NPU is the basis for one hidden layer node value of the second time step.
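Claims 7 and 8 together describe two passes through the rotater that feed the same accumulators: first the context values are summed, then the weighted input values are added on top. The sketch below illustrates that order of operations; the function name, the rotation direction, and the weight layout `w_in[j][i]` are assumptions made for the example.

```python
def hidden_accumulators(context, inputs, w_in):
    """Toy model: accumulate context values, then input * weight products."""
    n = len(context)
    acc = [0.0] * n

    # Pass 1 (claim 7): rotate the context values; each accumulator adds all N.
    regs = list(context)
    for _ in range(n):
        for j in range(n):
            acc[j] += regs[j]
        regs = [regs[-1]] + regs[:-1]

    # Pass 2 (claim 8): rotate the input values; each accumulator adds N products.
    regs = list(inputs)
    for k in range(n):
        for j in range(n):
            src = (j - k) % n            # which input value is in register j now
            acc[j] += regs[j] * w_in[j][src]
        regs = [regs[-1]] + regs[:-1]
    return acc


hidden_pre = hidden_accumulators(
    context=[1, 2],
    inputs=[3, 4],
    w_in=[[1, 0], [0, 1]],
)
```

With identity weights, each accumulator ends up with the sum of the context values plus its own input value.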

Claim 9

Original Legal Text

9. The NNU of claim 1 , further comprising: a program memory that holds instructions of a non-architectural program; a sequencer circuit that fetches the non-architectural program instructions from the program memory and generates micro-operations to control the array of NPUs to perform (a) through (j).

Plain English Translation

Building on claim 1, the NNU further includes a program memory and a sequencer circuit. The program memory holds the instructions of a non-architectural program, that is, a program made up of the NNU's own instructions rather than instructions of the host processor's instruction set. The sequencer circuit fetches these instructions from the program memory and generates micro-operations that control the array of NPUs to perform steps (a) through (j) of claim 1. The recurrent computation is therefore driven by a program running inside the NNU.
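The relationship between program memory, sequencer, and micro-operations can be illustrated with a small generator. Everything in this sketch is hypothetical: the instruction format, the opcode names, and the idea that a repeat count expands into one micro-operation per repetition are assumptions for illustration, not the instruction set described in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    opcode: str      # hypothetical opcode name
    count: int = 1   # hypothetical repeat count


def sequencer(program_memory):
    """Fetch each instruction in order and emit one micro-op per repetition."""
    for pc, instr in enumerate(program_memory):
        for repetition in range(instr.count):
            yield (pc, instr.opcode, repetition)


program_memory = [
    Instruction("LOAD_ROTATER_FROM_OUTPUT_BUFFER"),
    Instruction("MULT_ACCUM_ROTATE", count=3),
    Instruction("WRITE_OUTPUT_BUFFER"),
]
micro_ops = list(sequencer(program_memory))
```

Each tuple stands for a control signal bundle that would be broadcast to all N NPUs at once.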

Claim 10

Original Legal Text

10. The NNU of claim 9 , further comprising: the NNU is comprised in a processor that fetches and executes instructions of an architectural program of the processor.

Plain English Translation

Building on claim 9, the NNU is part of a processor that fetches and executes the instructions of an architectural program, meaning a program written in the processor's own instruction set. The claim thus distinguishes two programs: the architectural program that the processor runs, and the non-architectural program that the NNU's sequencer runs from the program memory.

Claim 11

Original Legal Text

11. The NNU of claim 10 , further comprising: the output buffer is accessible by the non-architectural program and is not accessible by the architectural program.

Plain English Translation

Building on claim 10, the output buffer is accessible by the non-architectural program and is not accessible by the architectural program. In other words, the output buffer is internal working storage of the NNU: the NNU program can use it to hold the hidden and context layer node values between steps, while the program running on the processor cannot read or write it directly.

Claim 12

Original Legal Text

12. The NNU of claim 10 , further comprising: the at least one memory is accessible by the architectural program to write the values of the input layer nodes associated with the sequence of time steps and to read the values of the output layer nodes associated with the sequence of time steps; and the program memory is accessible by the architectural program to write the non-architectural program to the program memory.

Plain English Translation

Building on claim 10, this claim describes what the architectural program can access. The RAM is accessible by the architectural program, which writes the input layer node values for the sequence of time steps to it and reads the output layer node values for the sequence of time steps from it. The program memory is also accessible by the architectural program, which writes the non-architectural program into it. The architectural program thus supplies the inputs and the NNU program, and collects the outputs, while the NNU performs the recurrent computation in between.
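The division of access between the two programs can be pictured with a small class. This is only a sketch: the class, its method names, and the placeholder computation in `run` (which simply doubles each input) are invented for the example and say nothing about how the real hardware exposes its memories.

```python
class NnuModel:
    """Toy model of which storage the architectural program can reach."""

    def __init__(self):
        self.ram = {}               # architectural program reads and writes this
        self.program_memory = []    # architectural program writes this
        self._output_buffer = []    # internal; no architectural accessor exists

    def arch_write_inputs(self, step, values):
        self.ram[("in", step)] = list(values)

    def arch_write_program(self, instructions):
        self.program_memory = list(instructions)

    def arch_read_outputs(self, step):
        return self.ram[("out", step)]

    def run(self):
        # Stand-in for the non-architectural program. Placeholder arithmetic:
        # each input is staged in the output buffer and doubled.
        steps = sorted(key[1] for key in self.ram if key[0] == "in")
        for step in steps:
            self._output_buffer = list(self.ram[("in", step)])
            self.ram[("out", step)] = [2 * v for v in self._output_buffer]


nnu = NnuModel()
nnu.arch_write_program(["hypothetical NNU instruction"])
nnu.arch_write_inputs(0, [1, 2])
nnu.arch_write_inputs(1, [3])
nnu.run()
outputs = [nnu.arch_read_outputs(0), nnu.arch_read_outputs(1)]
```

The architectural side touches only the RAM and the program memory, and the output buffer is used solely inside `run`.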

Claim 13

Original Legal Text

13. A method for operating a neural network unit (NNU) that performs calculations for a recurrent neural network (RNN) having input layer nodes, hidden layer nodes, output layer nodes and context layer nodes, the NNU having an array of neural processing units (NPU), at least one random access memory (RAM), and an output buffer, wherein each of the input layer nodes, hidden layer nodes, output layer nodes, and context layer nodes is implemented in circuitry and configured to perform an arithmetic operation, the method comprising: (a) reading, from the output buffer, values of the context layer nodes associated with a first time step; (b) reading, from the RAM, values of the input layer nodes associated with a second time step subsequent to the first time step; (c) generating values of the hidden layer nodes associated with the second time step based on the values of the input layer nodes read from the RAM and the values of the context layer nodes read from the output buffer; (d) outputting the hidden layer node values associated with the second time step to the output buffer rather than to the RAM; (e) reading, from the output buffer, the hidden layer node values associated with the second time step; (f) generating values of the context layer nodes associated with the second time step based on the hidden layer node values read from the output buffer; (g) outputting the context layer node values associated with the second time step to the hidden layer nodes rather than to the RAM; (h) generating values of the output layer nodes associated with the second time step using the hidden layer node values associated with the second time step; (i) writing the output layer node values associated with the second time step to the RAM; and (j) repeating (a) through (i) for a sequence of time steps.

Plain English Translation

This invention relates to hardware acceleration for recurrent neural networks (RNNs), addressing inefficiencies in processing sequential data. RNNs rely on context layer nodes to maintain state across time steps, requiring frequent memory access for intermediate values. Traditional implementations suffer from bottlenecks due to repeated reads and writes to memory, degrading performance. The method optimizes RNN execution in a neural network unit (NNU) with an array of neural processing units (NPUs), random access memory (RAM), and an output buffer. The NNU implements input, hidden, output, and context layer nodes as dedicated circuitry performing arithmetic operations. The process begins by reading context layer values from the output buffer for a first time step and input layer values from RAM for a subsequent time step. Hidden layer values are computed using these inputs and stored directly in the output buffer, bypassing RAM. These hidden values are then read from the buffer to generate context layer values for the next time step, which are fed back to the hidden layer nodes. Output layer values are computed from the hidden layer values and written to RAM. This sequence repeats for each time step, minimizing memory access by leveraging the output buffer for intermediate data. The approach reduces latency and improves throughput by localizing critical data transfers within the NNU.

Claim 14

Original Legal Text

14. The method of claim 13 , further comprising: the array of NPUs comprises N NPUs, each comprising a multiplexed register, an arithmetic unit, and an accumulator circuit, wherein the accumulator circuit has an output and the arithmetic unit that performs operations on inputs, and wherein N is an integer value; the arithmetic unit receives an output of the multiplexed register and an output of the accumulator circuit, and the arithmetic unit generates a result provided to the accumulator circuit; the output buffer is N words wide and is configured to hold N of the context/hidden layer node values; the N multiplexed registers are arranged to form an N-word hardware rotater that receives the N words of the output buffer; said (e) reading, from the output buffer, the hidden layer node values associated with the second time step comprises: reading, by the N NPUs, the N values of the hidden layer nodes from the output buffer into the rotater; reading, by the N NPUs from the RAM, weight values associated with connections between the hidden layer nodes and the output layer nodes; and said (f) generating values of the context layer nodes associated with the second time step based on the hidden layer node values read from the output buffer comprises: by the N NPUs: rotating the N values of the hidden layer nodes through the rotater for provision to the arithmetic unit of each of the N NPUs; multiplying, by each of the N arithmetic unit, each of the N rotated hidden layer node values by one of the weight values to generate N respective products; and accumulating, into each of the N accumulator circuits, the N respective products to generate a result.

Plain English Translation

This invention relates to neural processing units (NPUs) for accelerating neural network computations, particularly in recurrent neural networks (RNNs) where sequential data processing is required. The problem addressed is the inefficiency in handling hidden layer node values during time-step transitions, which can bottleneck performance in RNN inference. The system includes an array of N NPUs, each containing a multiplexed register, an arithmetic unit, and an accumulator circuit. The arithmetic unit processes inputs from both the multiplexed register and the accumulator, generating results stored in the accumulator. An N-word-wide output buffer holds hidden layer node values, and the N multiplexed registers form an N-word hardware rotater that receives these values. During operation, the NPUs read N hidden layer node values from the output buffer into the rotater and fetch corresponding weight values from RAM. The rotater sequentially provides these values to each NPU's arithmetic unit, which multiplies them by the weights. The resulting products are accumulated in each NPU's accumulator to produce context layer node values for the next time step. This parallelized, pipelined approach enhances throughput by efficiently reusing hidden layer data across multiple NPUs.

Claim 15

Original Legal Text

15. The method of claim 14 , further comprising: the apparatus includes a plurality of activation function units (AFU) that receive the accumulator circuit output of associated one or more of the NPUs and perform an activation function on the accumulator circuit output, wherein each of the AFUs is implemented in circuitry.

Plain English Translation

This invention relates to neural processing units (NPUs) and activation function units (AFUs) in hardware-accelerated neural network systems. The problem addressed is the efficient processing of neural network computations, particularly the application of activation functions to outputs from NPUs, to improve performance and reduce latency in hardware implementations. The system includes multiple NPUs, each with an accumulator circuit that processes data from neural network layers. The accumulator circuit outputs are then fed into a plurality of AFUs, each implemented in dedicated circuitry. Each AFU performs an activation function on the corresponding NPU's accumulator output. The activation functions may include operations like ReLU, sigmoid, or tanh, which are critical for neural network computations. By integrating AFUs directly into the hardware, the system avoids the need for software-based activation function processing, reducing computational overhead and improving throughput. The hardware implementation of AFUs ensures low-latency execution, making the system suitable for real-time applications. The modular design allows for parallel processing, where each AFU operates independently on its associated NPU's output, further enhancing efficiency. This approach optimizes neural network inference and training by accelerating critical operations in hardware.

Claim 16

Original Legal Text

16. The method of claim 15, further comprising: said (h) generating values of the output layer nodes associated with the second time step using the hidden layer node values associated with the second time step, comprises, by the plurality of AFUs: for each result of the N results of the accumulated respective N respective products received from the accumulator circuit output of each of the N NPUs, performing an activation function on the result to generate a respective output layer node value.

Plain English Translation

This invention relates to neural network processing, specifically to computing output layer node values in a recurrent neural network (RNN). The problem addressed is the cost of generating output layer node values from hidden layer node values at each time step when the work is spread across multiple neural processing units (NPUs) and activation function units (AFUs). Building on claim 15, this claim specifies how step (h) is performed. Each of the N NPUs accumulates N products in its accumulator, producing N accumulated results in total. The AFUs receive these results from the accumulator outputs and apply an activation function to each one, yielding one output layer node value per result for the second time step. The process repeats for subsequent time steps, with the context layer node values from one time step feeding the hidden layer computation of the next. Because the N accumulations and activations proceed in parallel, the sequential dependency of the RNN lies between time steps rather than within one.

Claim 17

Original Legal Text

17. The method of claim 14, further comprising: said (a) reading, from the output buffer, values of the context layer nodes associated with a first time step, comprises: reading, by the N NPUs, the N values of the context layer nodes from the output buffer into the rotater; and said (c) generating values of the hidden layer nodes associated with the second time step based on the values of the input layer nodes read from the RAM and the values of the context layer nodes read from the output buffer comprises: by the N NPUs: rotating the N values of the context layer nodes through the rotater for provision to the arithmetic unit of each of the N NPUs; and accumulating, into each of the N accumulator circuits, the N values of the context layer nodes.

Plain English Translation

This invention relates to neural processing units (NPUs) for efficient computation in neural networks, particularly for handling context layer nodes in recurrent neural networks (RNNs). The problem addressed is the efficient management of context layer data during sequential processing, where values from one time step must be reused in the next. The method uses N NPUs and an output buffer that holds the context layer node values from the first (earlier) time step. In step (a), the NPUs read the N context layer node values from the output buffer into the rotater. In step (c), the values are rotated through the rotater so that the arithmetic unit of each NPU receives every value in turn, and each NPU accumulates the N context layer node values into its own accumulator circuit. The hidden layer node values for the second time step are then generated from the input layer node values read from RAM together with these accumulated context values. Because the context values travel from the output buffer through the rotater rather than through RAM, all N NPUs reuse them without additional memory accesses.
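Read literally, claim 17 has each accumulator collect all N context values as they pass through the rotater. The following sketch models just that behaviour; the function name and rotation direction are assumptions, and no weights appear because this claim does not mention any being applied to the context values.

```python
from typing import List


def accumulate_context(output_buffer: List[float]) -> List[float]:
    """Rotate N context values past N NPUs, accumulating in each NPU."""
    n = len(output_buffer)
    rotater = list(output_buffer)   # step (a): load buffer into the rotater
    acc = [0.0] * n                 # one accumulator circuit per NPU
    for _ in range(n):
        for j in range(n):
            acc[j] += rotater[j]    # NPU j adds the word it currently holds
        # Hand each word to the next NPU (direction is an assumption).
        rotater = [rotater[-1]] + rotater[:-1]
    return acc


print(accumulate_context([1.0, 2.0, 3.0]))
```

Each accumulator receives the values in a different order, but after N rotations all of them hold the same total.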

Claim 18

Original Legal Text

18. A computer program product encoded in at least one non-transitory computer usable medium for use with a computing device, the computer program product comprising: computer usable program code embodied in said medium, for specifying a neural network unit (NNU) that performs calculations for a recurrent neural network (RNN) having input layer nodes, hidden layer nodes, output layer nodes and context layer nodes, wherein each of the input layer nodes, hidden layer nodes, output layer nodes, and context layer nodes is implemented in circuitry and configured to perform an arithmetic operation, the computer usable program code comprising: first program code for specifying an array of neural processing units (NPU), at least one random access memory (RAM), and an output buffer; and the array of NPUs: (a) read, from the output buffer, values of the context layer nodes associated with a first time step; (b) read, from the RAM, values of the input layer nodes associated with a second time step subsequent to the first time step; (c) generate values of the hidden layer nodes associated with the second time step based on the values of the input layer nodes read from the RAM and the values of the context layer nodes read from the output buffer; (d) output the hidden layer node values associated with the second time step to the output buffer rather than to the RAM; (e) read, from the output buffer, the hidden layer node values associated with the second time step; (f) generate values of the context layer nodes associated with the second time step based on the hidden layer node values read from the output buffer; (g) output the context layer node values associated with the second time step to the hidden layer nodes rather than to the RAM; (h) generate values of the output layer nodes associated with the second time step using the hidden layer node values associated with the second time step; (i) write the output layer node values associated with the second time step to the RAM; and (j) repeat (a) through (i) for a sequence of time steps.

Plain English Translation

This invention relates to hardware acceleration for recurrent neural networks (RNNs). RNNs process sequential data by carrying state between time steps through hidden and context layer nodes, and hardware implementations can bottleneck on the memory traffic this creates. Claim 18 covers a computer program product: program code, stored on a non-transitory medium, that specifies a neural network unit (NNU) whose layer nodes are implemented in circuitry. The specified NNU includes an array of neural processing units (NPUs), at least one random access memory (RAM), and an output buffer. For each time step, the NPUs (a) read the context layer values of the previous time step from the output buffer and (b) read the input layer values of the current time step from RAM. They (c) generate the hidden layer values and (d) write them to the output buffer rather than to RAM. They then (e) read the hidden values back from the output buffer, (f) generate the context layer values, and (g) feed those back to the hidden layer nodes, again without going through RAM. Finally, they (h) compute the output layer values from the hidden values and (i) write them to RAM. Step (j) repeats (a) through (i) for a sequence of time steps. Only the inputs and the final outputs pass through RAM; the recurrent state stays in the output buffer, which reduces memory accesses.
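To make the data flow of steps (a) through (j) concrete, here is a minimal Python model that separates what lives in the output buffer from what goes to RAM. It is a sketch under stated assumptions, not the claimed hardware: the names `run_rnn`, `w_in`, `w_ctx`, and `w_out`, the zero initial context, the way inputs and context are combined (weighted input plus context, then an activation), and the use of a separate local variable to hold the hidden values while the buffer is reused are all choices made for illustration.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def run_rnn(ram_inputs: List[Vector], w_in: Matrix, w_ctx: Matrix, w_out: Matrix,
            activation: Callable[[float], float] = math.tanh) -> List[Vector]:
    """Model the claimed loop; weights and combination rule are assumptions."""
    output_buffer: Vector = [0.0] * len(w_in)   # assumed initial context of zero
    ram_outputs: List[Vector] = []
    for x in ram_inputs:                        # (j) repeat per time step; (b) x comes from RAM
        context = list(output_buffer)           # (a) context read from the output buffer
        pre = [a + c for a, c in zip(matvec(w_in, x), context)]
        hidden = [activation(p) for p in pre]   # (c) hidden values for this step
        output_buffer = hidden                  # (d) hidden goes to the buffer, not RAM
        hidden_fb = list(output_buffer)         # (e) hidden read back from the buffer
        context_next = matvec(w_ctx, hidden_fb) # (f) context values for this step
        output = [activation(v) for v in matvec(w_out, hidden_fb)]  # (h)
        ram_outputs.append(output)              # (i) only outputs are written to RAM
        output_buffer = context_next            # (g) context fed back, bypassing RAM
    return ram_outputs


identity = lambda v: v
print(run_rnn([[1.0], [1.0]], [[1.0]], [[1.0]], [[2.0]], activation=identity))
```

With identity activation and unit weights, the second output is larger than the first because the context carried in the buffer adds the previous hidden value to the new input.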

Claim 19

Original Legal Text

19. The computer program product of claim 18, further comprising: the array of NPUs comprises N NPUs, each comprising a multiplexed register, an arithmetic unit, and an accumulator circuit, wherein the accumulator circuit has an output and the arithmetic unit that performs operations on inputs, and wherein N is an integer value; the arithmetic unit receives an output of the multiplexed register and an output of the accumulator circuit, and the arithmetic unit generates a result provided to the accumulator circuit; the output buffer is N words wide and is configured to hold N of the context/hidden layer node values; and the N multiplexed registers are arranged to form an N-word hardware rotater that receives the N words of the output buffer.

Plain English Translation

This invention relates to a neural processing unit (NPU) architecture for the neural network unit specified in claim 18. The array contains N NPUs, where N is an integer, and each NPU has a multiplexed register, an arithmetic unit, and an accumulator circuit. The arithmetic unit receives the output of the multiplexed register and the output of the accumulator circuit, and provides its result back to the accumulator. The output buffer is N words wide and holds N context or hidden layer node values. The N multiplexed registers are arranged to form an N-word hardware rotater that receives the N words of the output buffer. Once the buffer contents are loaded, the rotater passes each word from NPU to NPU, so every NPU can process all N values without re-reading them from memory. This supports the recurrent computation of claim 18 while keeping the feedback path between time steps out of RAM.

Patent Metadata

Filing Date

Unknown

Publication Date

February 4, 2020

Inventors

G. GLENN HENRY

TERRY PARKS

KYLE T. O'BRIEN
