Synchronization Scheduler of Distributed Neural Network Training

Published: February 16, 2021

Assignee: not available in the USPTO data we have

Inventors: Adam Procter, Vikram Saletore, Deepthi Karkada, Meenakshi Arunachalam

Technical Abstract

Patent Claims

25 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A neural network training system, comprising: a chip implementing at least one node in a cluster of systems training the neural network; and a scheduler to: conduct a first timing measurement of a blockage timing of a first window of the training of the neural network, the blockage timing to measure a time that processing is impeded at layers of the neural network during the first window of the training due to synchronization of one or more synchronizing parameters of the layers; based upon the first timing measurement, determine whether to modify a synchronization barrier policy to include a synchronization barrier to impede synchronization of one or more synchronizing parameters of one of the layers during a second window of the training; and impede the synchronization of the one or more synchronizing parameters of the one of the layers during the second window if the synchronization barrier policy is modified to include the synchronization barrier.

Plain English Translation

This invention addresses synchronization stalls in distributed neural network training, where processing at a layer can be held up while that layer's parameters are synchronized across nodes. The system includes a chip that implements at least one node in a cluster of systems training the neural network, together with a scheduler. The scheduler takes a first timing measurement of the blockage timing for a first window of training, that is, the time during which processing at the layers is impeded because one or more of their synchronizing parameters are being synchronized. Based on this measurement, the scheduler decides whether to modify a synchronization barrier policy by adding a synchronization barrier, which would impede synchronization of the parameters of one of the layers during a second training window. If the policy is modified to include the barrier, the system impedes synchronization of that layer's parameters during the second window.
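The measure, decide, impede cycle of claim 1 can be sketched as a small policy object. This is a minimal illustration, not the patented implementation: the names `Scheduler`, `example_rule`, `end_of_window`, and `may_synchronize` are hypothetical, a barrier is modeled as a pair of layer indices, and the decision rule is an assumption, since claim 1 does not say how the scheduler decides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set, Tuple

# (layer that must finish synchronizing first, layer whose synchronization is impeded)
Barrier = Tuple[int, int]


def example_rule(blockage: Dict[int, float]) -> Optional[Barrier]:
    """Assumed decision rule (not from the claim): make the least-blocked
    layer wait for the most-blocked one, if any blockage was measured."""
    if len(blockage) < 2 or max(blockage.values()) <= 0.0:
        return None
    worst = max(blockage, key=blockage.get)
    least = min(blockage, key=blockage.get)
    return (worst, least) if worst != least else None


@dataclass
class Scheduler:
    # Hypothetical pluggable rule: per-layer blockage seconds -> barrier or None.
    decide: Callable[[Dict[int, float]], Optional[Barrier]]
    policy: Set[Barrier] = field(default_factory=set)

    def end_of_window(self, blockage: Dict[int, float]) -> None:
        """Use one window's timing measurement to possibly modify the policy."""
        barrier = self.decide(blockage)
        if barrier is not None:
            self.policy.add(barrier)

    def may_synchronize(self, layer: int, finished: Set[int]) -> bool:
        """In a later window, impede `layer` until every layer it waits on
        under the policy has finished synchronizing."""
        return all(first in finished
                   for first, waiting in self.policy if waiting == layer)
```

A training loop would call `end_of_window` with the timings from the first window and consult `may_synchronize` before starting each layer's parameter synchronization in the second window.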

Claim 2

Original Legal Text

2. The system of claim 1 , wherein the one or more synchronizing parameters of the layers include one or more of weights or biases of the layers, and the scheduler is to conduct the first timing measurement to determine a respective blockage timing of each of the layers of the neural network during forward propagation of the first window.

Plain English Translation

This claim narrows the system of claim 1 in two ways. First, the synchronizing parameters include one or more of the weights or biases of the layers. Second, the scheduler's first timing measurement is taken per layer: it determines a respective blockage timing for each layer of the neural network during forward propagation of the first training window. In other words, the scheduler records how long each layer's forward pass is held up while that layer's parameters are synchronized. These per-layer timings feed the scheduler's decision on whether to modify the synchronization barrier policy.
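One way to picture the per-layer measurement is to time how long each layer's forward step waits for its parameters to finish synchronizing. The sketch below assumes that each layer exposes a `threading.Event` that is set when its weights and biases are synchronized; that signaling mechanism, and the names `forward_with_blockage`, `sync_done`, and `compute`, are illustrative assumptions rather than details from the claim.

```python
import threading
import time
from typing import Callable, Dict, List


def forward_with_blockage(sync_done: List[threading.Event],
                          compute: Callable[[int], None]) -> Dict[int, float]:
    """Run a forward pass layer by layer and return, for each layer,
    the seconds spent blocked waiting for its parameter synchronization."""
    blockage: Dict[int, float] = {}
    for layer, done in enumerate(sync_done):
        start = time.perf_counter()
        done.wait()  # forward step is impeded until this layer's parameters are synchronized
        blockage[layer] = time.perf_counter() - start
        compute(layer)  # hypothetical callback that runs the layer's forward computation
    return blockage
```

The returned dictionary gives a respective blockage timing for each layer, which is the kind of input a barrier decision could use.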

Claim 3

Original Legal Text

3. The system of claim 2 , wherein the scheduler is to determine that the synchronization barrier policy is to be modified to include the synchronization barrier, and the scheduler is to: modify the synchronization barrier policy to add the synchronization barrier to be between another of the layers, that has a longest blockage timing from among the respective blockage timings, and the one of the layers; and during the second window of the training and based upon the synchronization barrier policy, prevent synchronization of the one or more synchronizing parameters of the one of the layers until the another of the layers has completed synchronization of one or more synchronizing parameters of the another of the layers.

Plain English Translation

This claim builds on claim 2 and covers the case where the scheduler determines that the synchronization barrier policy is to be modified. Using the per-layer blockage timings, the scheduler identifies the layer with the longest blockage timing and adds a synchronization barrier between that layer and the layer whose synchronization is to be impeded. During the second training window, the scheduler applies the policy by preventing the impeded layer from synchronizing its parameters until the layer with the longest blockage timing has completed synchronizing its own. In effect, the most-blocked layer's synchronization is given priority over the other layer's.
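The barrier placement described here can be sketched as follows. The claim does not say how the waiting layer is chosen, so the sketch takes it as an input. The names `add_barrier_for_longest` and `release_order` are hypothetical, and the simulation assumes, for simplicity, that synchronizations run one at a time.

```python
from typing import Dict, List, Set, Tuple

Barrier = Tuple[int, int]  # (layer that synchronizes first, layer that waits)


def add_barrier_for_longest(blockage: Dict[int, float], waiting_layer: int,
                            policy: Set[Barrier]) -> Barrier:
    """Add a barrier between the layer with the longest blockage timing
    and `waiting_layer`, so that `waiting_layer` synchronizes afterwards."""
    longest = max(blockage, key=blockage.get)
    if longest == waiting_layer:
        raise ValueError("a layer cannot wait on itself")
    barrier = (longest, waiting_layer)
    policy.add(barrier)
    return barrier


def release_order(requests: List[int], policy: Set[Barrier]) -> List[int]:
    """Simulate one window: layers request synchronization in `requests`
    order, but a layer is held back until the layers it waits on are done."""
    done: Set[int] = set()
    order: List[int] = []
    pending = list(requests)
    while pending:
        for layer in pending:
            if all(first in done for first, waiting in policy if waiting == layer):
                break
        else:
            raise RuntimeError("policy cannot be satisfied by the requesting layers")
        pending.remove(layer)
        order.append(layer)
        done.add(layer)
    return order
```

For instance, if layers request synchronization in the order 2, 1, 0 and layer 0 has the longest blockage timing, a barrier with layer 2 as the waiting layer defers layer 2 until layer 0 has finished, while layer 1 is unaffected.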

Claim 4

Original Legal Text

4. The system of claim 3 , wherein the scheduler is to: conduct a second timing measurement of another blockage timing during the second window of the training of the neural network, the another blockage timing to measure a time that processing is impeded at the layers of the neural network during the second window of the training due to synchronization of one or more synchronizing parameters of the layers; determine whether a second total blockage timing of the second timing measurement is greater than a first total blockage timing of the first timing measurement; and remove the synchronization barrier from the synchronization barrier policy when the second total blockage timing is determined to be greater than the first total blockage timing.

Plain English Translation

This claim adds a feedback step to the system of claim 3. After a synchronization barrier has been added, the scheduler takes a second timing measurement during the second training window, again measuring the time that processing at the layers is impeded by synchronization of their parameters. It then determines whether the second total blockage timing is greater than the first total blockage timing measured in the first window. If the second total is greater, meaning blockage increased with the barrier in place, the scheduler removes the barrier from the synchronization barrier policy. The policy thus adapts from window to window: a barrier that is followed by more total blockage is rolled back.
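The rollback check is a single comparison. The sketch below assumes that a window's total blockage timing is the sum of its per-layer blockage timings, which the claim does not spell out; `review_barrier` and its parameters are hypothetical names.

```python
from typing import Dict, Set, Tuple

Barrier = Tuple[int, int]  # (layer that synchronizes first, layer that waits)


def review_barrier(first_window: Dict[int, float],
                   second_window: Dict[int, float],
                   barrier: Barrier,
                   policy: Set[Barrier]) -> bool:
    """Remove `barrier` from `policy` when the second window's total
    blockage is greater than the first window's. Returns True if removed."""
    first_total = sum(first_window.values())    # assumed: total = sum over layers
    second_total = sum(second_window.values())
    if second_total > first_total:
        policy.discard(barrier)
        return True
    return False
```

Note that an equal total leaves the barrier in place, matching the claim's "greater than" condition.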

Claim 5

Original Legal Text

5. The system of claim 1 , wherein the scheduler is to: determine that a synchronization barrier is to be implemented between first and second layers of the layers; modify the synchronization barrier policy to add the synchronization barrier between the first and second layers; and during another window of the training, stop synchronization of one or more synchronizing parameters of the second layer until synchronization of one or more synchronizing parameters of the first layer is completed.

Plain English Translation

This claim describes how the scheduler of claim 1 applies a barrier between two specific layers. The scheduler determines that a synchronization barrier is to be implemented between a first layer and a second layer, and modifies the synchronization barrier policy to add that barrier. During another window of training, the scheduler stops synchronization of the second layer's parameters until synchronization of the first layer's parameters has completed. The barrier therefore imposes an order on parameter synchronization: the first layer finishes before the second layer starts.
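Enforcing a barrier between a first and a second layer can be illustrated with one thread per layer and a completion event per layer. This is a sketch under stated assumptions: `run_syncs` and `sync_fn` are hypothetical names, every layer named in the policy must appear in `layers`, and a cyclic policy would deadlock.

```python
import threading
from typing import Callable, List, Set, Tuple

Barrier = Tuple[int, int]  # (first layer, second layer that waits for it)


def run_syncs(layers: List[int], policy: Set[Barrier],
              sync_fn: Callable[[int], None]) -> None:
    """Synchronize all layers concurrently, except that a second layer does
    not start until the first layer of each of its barriers has completed."""
    done = {layer: threading.Event() for layer in layers}

    def worker(layer: int) -> None:
        for first, second in policy:
            if second == layer:
                done[first].wait()  # synchronization of the second layer is stopped here
        sync_fn(layer)              # hypothetical call that synchronizes the layer's parameters
        done[layer].set()

    threads = [threading.Thread(target=worker, args=(layer,)) for layer in layers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With the policy `{(0, 1)}`, layer 1's synchronization always begins after layer 0's has completed, regardless of the order in which the threads are started.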

Claim 6

Original Legal Text

6. The system of claim 1 , wherein the scheduler is to: maintain a history data structure including a plurality of elements, each element of the history data structure including a respective total blockage timing during a respective window of the training and a synchronization barrier added based upon the respective window; and remove synchronization barriers of the synchronization barriers from the synchronization barrier policy based upon a comparison of total blockage timings during windows of the training to the respective total blockage timings maintained in the history data structure.

Plain English Translation

This claim adds history-based removal of barriers to the system of claim 1. The scheduler maintains a history data structure with multiple elements. Each element records the total blockage timing measured during a particular training window together with the synchronization barrier that was added based on that window. The scheduler compares the total blockage timings of training windows against the totals stored in the history, and removes barriers from the synchronization barrier policy based on that comparison. The claim does not fix the comparison rule; claim 4 gives one example, in which a barrier is removed when total blockage is greater after the barrier was added than before.
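A history structure of this kind could look like the sketch below. The names `HistoryEntry` and `prune` are hypothetical, and the removal rule is an assumption borrowed from claim 4: a barrier is removed when the current window's total blockage is greater than the total recorded for the window that led to the barrier.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Barrier = Tuple[int, int]  # (layer that synchronizes first, layer that waits)


@dataclass(frozen=True)
class HistoryEntry:
    total_blockage: float  # total blockage timing of the recorded window
    barrier: Barrier       # barrier added based on that window


def prune(history: List[HistoryEntry], policy: Set[Barrier],
          current_total: float) -> List[Barrier]:
    """Compare the current window's total blockage with each recorded total
    and remove barriers that were followed by more blockage (assumed rule)."""
    removed: List[Barrier] = []
    for entry in history:
        if entry.barrier in policy and current_total > entry.total_blockage:
            policy.discard(entry.barrier)
            removed.append(entry.barrier)
    return removed
```

Keeping the barrier and its baseline total in the same element lets the scheduler judge each barrier against the measurement that motivated it.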

Claim 7

Original Legal Text

7. A scheduling apparatus for a neural network, comprising: a substrate; and logic coupled to the substrate and implemented at least partly in one or more of configurable logic or fixed-functionality logic hardware, the logic to: conduct a first timing measurement of a blockage timing of a first window of training of the neural network, the blockage timing to measure a time that processing is impeded at layers of the neural network during the first window of the training due to synchronization of one or more synchronizing parameters of the layers; based upon the first timing measurement, determine whether to modify a synchronization barrier policy to include a synchronization barrier to impede synchronization of one or more synchronizing parameters of one of the layers during a second window of the training; and impede the synchronization of the one or more synchronizing parameters of the one of the layers during the second window if the synchronization barrier policy is modified to include the synchronization barrier.

Plain English Translation

This claim covers a scheduling apparatus for a neural network, made up of a substrate and logic coupled to the substrate, with the logic implemented at least partly in configurable logic or fixed-functionality logic hardware. The logic performs the same operations as the scheduler of claim 1. It measures the blockage timing of a first training window, that is, the time that processing at the layers is impeded by synchronization of their parameters. Based on that measurement, it determines whether to modify a synchronization barrier policy to include a barrier that impedes synchronization of one layer's parameters during a second window. If the policy is modified to include the barrier, the logic impedes that synchronization during the second window.

Claim 8

Original Legal Text

8. The apparatus of claim 7 , wherein the one or more synchronizing parameters of the layers include one or more of weights or biases of the layers, and the logic is to conduct the first timing measurement to determine a respective blockage timing of each of the layers of the neural network during forward propagation of the first window.

Plain English Translation

This claim narrows the apparatus of claim 7 in the same way claim 2 narrows claim 1. The synchronizing parameters include one or more of the weights or biases of the layers, and the logic takes the first timing measurement per layer, determining a respective blockage timing for each layer of the neural network during forward propagation of the first training window. The weights and biases are the data being synchronized; the timing measurement only records how long each layer is blocked by that synchronization.

Claim 9

Original Legal Text

9. The apparatus of claim 8 , wherein the logic is to determine that the synchronization barrier policy is to be modified to include the synchronization barrier, and the logic is to: modify the synchronization barrier policy to add the synchronization barrier to be between another of the layers, that has a longest blockage timing from among the respective blockage timings, and the one of the layers; and during the second window of the training and based upon the synchronization barrier policy, prevent synchronization of the one or more synchronizing parameters of the one of the layers until the another of the layers has completed synchronization of one or more synchronizing parameters of the another of the layers.

Plain English Translation

This claim is the apparatus counterpart of claim 3. When the logic determines that the synchronization barrier policy is to be modified, it uses the per-layer blockage timings to find the layer with the longest blockage timing and adds a synchronization barrier between that layer and the layer whose synchronization is to be impeded. During the second training window, the logic applies the policy by preventing the impeded layer from synchronizing its parameters until the layer with the longest blockage timing has completed its own synchronization.

Claim 10

Original Legal Text

10. The apparatus of claim 9 , wherein the logic is to: conduct a second timing measurement of another blockage timing during the second window of the training of the neural network, the another blockage timing to measure a time that processing is impeded at the layers of the neural network during the second window of the training due to synchronization of one or more synchronizing parameters of the layers; determine whether a second total blockage timing of the second timing measurement is greater than a first total blockage timing of the first timing measurement; and remove the synchronization barrier from the synchronization barrier policy when the second total blockage timing is determined to be greater than the first total blockage timing.

Plain English Translation

This claim is the apparatus counterpart of claim 4. With the barrier of claim 9 in place, the logic takes a second timing measurement during the second training window, measuring the time that processing at the layers is impeded by synchronization of their parameters. It determines whether the second total blockage timing is greater than the first total blockage timing from the first window. If it is greater, the logic removes the synchronization barrier from the synchronization barrier policy, so a barrier that is followed by more total blockage does not stay in effect.

Claim 11

Original Legal Text

11. The apparatus of claim 7 , wherein the logic is to: determine that a synchronization barrier is to be implemented between first and second layers of the layers; modify the synchronization barrier policy to add the synchronization barrier between the first and second layers; and during another window of the training, stop synchronization of one or more synchronizing parameters of the second layer until synchronization of one or more synchronizing parameters of the first layer is completed.

Plain English Translation

This claim is the apparatus counterpart of claim 5. The logic determines that a synchronization barrier is to be implemented between a first layer and a second layer of the neural network, and modifies the synchronization barrier policy to add that barrier. During another window of training, the logic stops synchronization of the second layer's parameters until synchronization of the first layer's parameters has completed, so the two layers synchronize in a fixed order.

Claim 12

Original Legal Text

12. The apparatus of claim 7 , wherein the logic is to: maintain a history data structure including a plurality of elements, each element of the history data structure including a respective total blockage timing during a respective window of the training and a synchronization barrier added based upon the respective window; and remove synchronization barriers of the synchronization barriers from the synchronization barrier policy based upon a comparison of total blockage timings during windows of the training to the respective total blockage timings maintained in the history data structure.

Plain English Translation

This claim is the apparatus counterpart of claim 6. The logic maintains a history data structure with multiple elements, each recording the total blockage timing of a particular training window and the synchronization barrier that was added based on that window. The logic compares the total blockage timings of training windows with the totals stored in the history and, based on that comparison, removes barriers from the synchronization barrier policy. The claim leaves the exact comparison rule open.

Claim 13

Original Legal Text

13. A method of training a neural network, comprising: conducting a first timing measurement of a blockage timing of a first window of the training of the neural network, the blockage timing to measure a time that processing is impeded at layers of the neural network during the first window of the training due to synchronization of one or more synchronizing parameters of the layers; based upon the first timing measurement, determining whether to modify a synchronization barrier policy to include a synchronization barrier to impede synchronization of one or more synchronizing parameters of one of the layers during a second window of the training; and impeding the synchronization of the one or more synchronizing parameters of the one of the layers during the second window if the synchronization barrier policy is modified to include the synchronization barrier.

Plain English Translation

This claim restates claim 1 as a method of training a neural network. The method measures the blockage timing of a first training window, that is, the time that processing at the layers is impeded by synchronization of one or more of their synchronizing parameters. Based on that first timing measurement, the method determines whether to modify a synchronization barrier policy to include a synchronization barrier that would impede synchronization of one layer's parameters during a second training window. If the policy is modified to include the barrier, synchronization of that layer's parameters is impeded during the second window. The barrier does not skip synchronization; it delays one layer's synchronization, as the dependent claims describe, until another layer's synchronization has completed.

Claim 14

Original Legal Text

14. The method of claim 13 , wherein the one or more synchronizing parameters of the layers include one or more of weights or biases of the layers, and the conducting includes measuring a respective blockage timing of each of the layers of the neural network during forward propagation of the first window.

Claim 15

Original Legal Text

15. The method of claim 14 , wherein the determining includes determining that the synchronization barrier policy is to be modified to include the synchronization barrier, the method further including: modifying the synchronization barrier policy to add the synchronization barrier to be between another of the layers, that has a longest blockage timing from among the respective blockage timings, and the one of the layers; and during the second window of the training and based upon the synchronization barrier policy, preventing synchronization of the one or more synchronizing parameters of the one of the layers until the another of the layers has completed synchronization of one or more synchronizing parameters of the another of the layers.

Claim 16

Original Legal Text

16. The method of claim 15 , further comprising: conducting a second timing measurement of another blockage timing during the second window of the training of the neural network, the another blockage timing to measure a time that processing is impeded at the layers of the neural network during the second window of the training due to synchronization of one or more synchronizing parameters of the layers; determining whether a second total blockage timing of the second timing measurement is greater than a first total blockage timing of the first timing measurement; and removing the synchronization barrier from the synchronization barrier policy when the second total blockage timing is determined to be greater than the first total blockage timing.

Claim 17

Original Legal Text

17. The method of claim 13 , further comprising: determining that a synchronization barrier is to be implemented between first and second layers of the layers; modifying the synchronization barrier policy to add the synchronization barrier between the first and second layers; and during another window of the training, stopping synchronization of one or more synchronizing parameters of the second layer until synchronization of one or more synchronizing parameters of the first layer is completed.

Plain English Translation

This claim is the method counterpart of claim 5. The method determines that a synchronization barrier is to be implemented between a first layer and a second layer of the neural network, and modifies the synchronization barrier policy to add that barrier. During another window of training, synchronization of the second layer's parameters is stopped until synchronization of the first layer's parameters has completed. The barrier thus fixes the order in which the two layers synchronize.

Claim 18

Original Legal Text

18. The method of claim 13 , further comprising: maintaining a history data structure including a plurality of elements, each element of the history data structure including a respective total blockage timing during a respective window of the training and a synchronization barrier added based upon the respective window; and removing synchronization barriers of the synchronization barriers from the synchronization barrier policy based upon a comparison of total blockage timings during windows of the training to the respective total blockage timings maintained in the history data structure.

Claim 19

Original Legal Text

19. At least one computer readable storage medium comprising a set of instructions, which when executed, cause a computing system to: conduct a first timing measurement of a blockage timing of a first window of training of a neural network, the blockage timing to measure a time that processing is impeded at layers of the neural network during the first window of the training due to synchronization of one or more synchronizing parameters of the layers; based upon the first timing measurement, determine whether to modify a synchronization barrier policy to include a synchronization barrier to impede synchronization of one or more synchronizing parameters of one of the layers during a second window of the training; and impede the synchronization of the one or more synchronizing parameters of the one of the layers during the second window if the synchronization barrier policy is modified to include the synchronization barrier.

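Plain English Translation

This independent claim covers at least one computer readable storage medium holding instructions that cause a computing system to perform the scheduling. The system measures the blockage timing of a first window of neural network training, meaning the time that processing is impeded at the layers during that window because one or more of their parameters are being synchronized. Based on that measurement, it determines whether to modify a synchronization barrier policy to include a barrier that impedes synchronization of one layer's parameters during a second training window. If the policy is modified to include the barrier, the system impedes that synchronization during the second window.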

Claim 20

Original Legal Text

20. The at least one computer readable storage medium of claim 19, wherein the one or more synchronizing parameters of the layers include one or more of weights or biases of the layers, and the instructions, when executed, cause the computing system to conduct the first timing measurement to determine a respective blockage timing of each of the layers of the neural network during forward propagation of the first window.

Plain English Translation

This invention relates to synchronization in distributed neural network training. The problem addressed is time lost when processing at a layer is blocked while that layer's parameters are synchronized across computing nodes. This claim narrows the storage medium of claim 19 in two ways. First, the synchronizing parameters of the layers include one or more of the layers' weights or biases. Second, the first timing measurement determines a separate blockage timing for each layer of the neural network during forward propagation of the first training window. These per-layer timings are what the scheduler uses to decide whether to add a synchronization barrier.
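The per-layer measurement can be sketched as timing how long each layer waits for its parameters during a forward pass. This is a hedged illustration, not the patent's implementation: `measure_forward_blockage` and the `wait_for_sync` hook are hypothetical names, and real communication is replaced by simulated delays.

```python
import time
from typing import Callable, Dict, List


def measure_forward_blockage(
    layers: List[str],
    wait_for_sync: Callable[[str], None],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Return the time each layer spent blocked waiting on synchronization.

    `wait_for_sync(layer)` is a hypothetical hook that returns once the
    layer's weights and biases have been synchronized.
    """
    blockage = {}
    for layer in layers:  # forward propagation visits layers in order
        start = clock()
        wait_for_sync(layer)  # processing is impeded while this call blocks
        blockage[layer] = clock() - start
    return blockage


# Usage with simulated delays (seconds) instead of real communication.
simulated_delay = {"layer1": 0.03, "layer2": 0.0, "layer3": 0.01}
timings = measure_forward_blockage(
    list(simulated_delay), lambda name: time.sleep(simulated_delay[name])
)
total_blockage = sum(timings.values())
```

The injectable `clock` argument is a design choice for testability: a fake clock makes the measurement deterministic without changing the measuring logic.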

Claim 21

Original Legal Text

21. The at least one computer readable storage medium of claim 20, wherein the synchronization barrier policy is to be modified to include the synchronization barrier, and wherein the instructions, when executed, cause the computing system to: modify the synchronization barrier policy to add the synchronization barrier to be between another of the layers, that has a longest blockage timing from among the respective blockage timings, and the one of the layers; and during the second window of the training and based upon the synchronization barrier policy, prevent synchronization of the one or more synchronizing parameters of the one of the layers until the another of the layers has completed synchronization of one or more synchronizing parameters of the another of the layers.

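Plain English Translation

This claim builds on claim 20, in which a blockage timing is measured for each layer. The synchronization barrier policy is modified to add a barrier between two layers: the layer with the longest measured blockage timing and the layer whose synchronization is to be held back. During the second training window, the policy prevents the held-back layer's parameters from synchronizing until the layer with the longest blockage timing has completed synchronization of its own parameters.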
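The selection step in claim 21 amounts to finding the layer with the longest blockage timing and making another layer wait for it. The sketch below assumes the caller already knows which layer is to be held back, since the claim does not say how that layer is chosen. `add_barrier_for_longest` and the sample timings are hypothetical.

```python
from typing import Dict, Set, Tuple

Barrier = Tuple[str, str]  # (layer that syncs first, layer that waits)


def add_barrier_for_longest(blockage: Dict[str, float],
                            delayed_layer: str,
                            policy: Set[Barrier]) -> Barrier:
    """Make `delayed_layer` wait for the layer with the longest blockage."""
    # Assumption: the delayed layer itself is excluded from the search.
    candidates = {name: t for name, t in blockage.items() if name != delayed_layer}
    longest = max(candidates, key=candidates.get)
    barrier = (longest, delayed_layer)
    policy.add(barrier)
    return barrier


blockage = {"layer1": 0.030, "layer2": 0.002, "layer3": 0.010}
policy: Set[Barrier] = set()
barrier = add_barrier_for_longest(blockage, delayed_layer="layer3", policy=policy)
```

With these sample timings, `layer1` is the most blocked layer, so `layer3` is made to wait until `layer1` has finished synchronizing.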

Claim 22

Original Legal Text

22. The at least one computer readable storage medium of claim 21, wherein the instructions, when executed, cause the computing system to: conduct a second timing measurement of another blockage timing during the second window of the training of the neural network, the another blockage timing to measure a time that processing is impeded at the layers of the neural network during the second window of the training due to synchronization of one or more synchronizing parameters of the layers; determine whether a second total blockage timing of the second timing measurement is greater than a first total blockage timing of the first timing measurement; and remove the synchronization barrier from the synchronization barrier policy when the second total blockage timing is determined to be greater than the first total blockage timing.

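Plain English Translation

This claim builds on claim 21 by checking whether the added barrier helped. The system conducts a second timing measurement during the second training window, again measuring the time that processing is impeded at the layers due to parameter synchronization. It determines whether the second window's total blockage timing is greater than the first window's. When it is greater, the system removes the synchronization barrier from the synchronization barrier policy.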
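The check in claim 22 reduces to one comparison between the two windows' totals. As a rough sketch, with the hypothetical helper `review_barrier` and made-up timing values:

```python
from typing import Set, Tuple

Barrier = Tuple[str, str]  # (layer that syncs first, layer that waits)


def review_barrier(first_total: float, second_total: float,
                   barrier: Barrier, policy: Set[Barrier]) -> bool:
    """Remove `barrier` if total blockage grew in the second window.

    Returns True when the barrier was removed.
    """
    if second_total > first_total:
        policy.discard(barrier)
        return True
    return False


barrier = ("layer1", "layer3")

# Blockage got worse after the barrier was added, so the barrier is removed.
worse_policy = {barrier}
removed = review_barrier(0.042, 0.055, barrier, worse_policy)

# Blockage improved, so the barrier stays in the policy.
better_policy = {barrier}
kept = not review_barrier(0.042, 0.030, barrier, better_policy)
```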

Claim 23

Original Legal Text

23. The at least one computer readable storage medium of claim 19, wherein the instructions, when executed, cause the computing system to: determine that a synchronization barrier is to be implemented between first and second layers of the layers; modify the synchronization barrier policy to add the synchronization barrier between the first and second layers; and during another window of the training, stop synchronization of one or more synchronizing parameters of the second layer until synchronization of one or more synchronizing parameters of the first layer is completed.

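Plain English Translation

This claim builds on claim 19. The system determines that a synchronization barrier is to be implemented between a first layer and a second layer and modifies the synchronization barrier policy to add it. During another training window, synchronization of the second layer's parameters is stopped until synchronization of the first layer's parameters is completed.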

Claim 24

Original Legal Text

24. The at least one computer readable storage medium of claim 19, wherein the instructions, when executed, cause the computing system to: maintain a history data structure including a plurality of elements, each element of the history data structure including a respective total blockage timing during a respective window of the training and a synchronization barrier added based upon the respective window; and remove synchronization barriers of the synchronization barriers from the synchronization barrier policy based upon a comparison of total blockage timings during windows of the training to the respective total blockage timings maintained in the history data structure.

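Plain English Translation

This claim builds on claim 19 by adding a record of past barrier decisions. The system maintains a history data structure with multiple elements, each recording the total blockage timing measured during one training window and the synchronization barrier added based on that window. It compares total blockage timings from training windows against the totals stored in the history and, based on that comparison, removes synchronization barriers from the synchronization barrier policy.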

Claim 25

Original Legal Text

25. The at least one computer readable storage medium of claim 19, wherein the instructions, when executed, cause the computing system to implement the synchronization barrier to stop synchronization of gradients of the one or more synchronizing parameters of the one of the layers until synchronization of gradients of one or more synchronizing parameters of another of the layers is complete during the second window.

Plain English Translation

This invention relates to distributed machine learning systems, specifically to gradient synchronization during neural network training across multiple computing nodes. The problem addressed is time lost when gradient synchronization blocks processing at the layers. In this claim, the synchronization barrier applies to gradients: during the second training window, synchronization of the gradients of one layer's parameters is stopped until synchronization of the gradients of another layer's parameters is complete. The barrier therefore fixes the order in which the two layers' gradients are synchronized. This is particularly relevant in large-scale distributed training, where synchronization delays can significantly affect performance.
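The gradient barrier can be illustrated with two threads and an event, where the second layer's gradient synchronization waits until the first layer's has finished. This is a toy sketch under stated assumptions: `sync_gradients` is a stand-in for a real collective operation, and all function and layer names are hypothetical.

```python
import threading
from typing import List

sync_log: List[str] = []        # order in which gradient syncs completed
first_done = threading.Event()  # set once layer1's gradients are synchronized


def sync_gradients(layer: str) -> None:
    # Stand-in for a real collective such as an all-reduce (assumption).
    sync_log.append(layer)


def sync_first_layer() -> None:
    sync_gradients("layer1")
    first_done.set()            # release the barrier


def sync_second_layer() -> None:
    first_done.wait()           # barrier: block until layer1 has finished
    sync_gradients("layer2")


# Start the gated layer first to show that the barrier enforces the order.
second = threading.Thread(target=sync_second_layer)
first = threading.Thread(target=sync_first_layer)
second.start()
first.start()
first.join()
second.join()
```

Even though the second layer's thread starts first, `sync_log` ends up as `["layer1", "layer2"]` because the event is only set after the first layer's synchronization has been logged.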

Patent Metadata

Filing Date: Unknown

Publication Date: February 16, 2021

Inventors: Adam Procter, Vikram Saletore, Deepthi Karkada, Meenakshi Arunachalam
