Described are noise suppression techniques applicable to various systems including automatic speech processing systems in digital audio pre-processing. The noise suppression techniques utilize a machine-learning framework trained on cues pertaining to reference clean and noisy speech signals, and a corresponding synthetic noisy speech signal combining the clean and noisy speech signals. The machine-learning technique is further used to process audio signals in real time by extracting and analyzing cues pertaining to noisy speech to dynamically generate an appropriate gain mask, which may eliminate the noise components from the input audio signal. The audio signal pre-processed in such a manner may be applied to an automatic speech processing engine for corresponding interpretation or processing. The machine-learning technique may enable extraction of cues associated with clean automatic speech processing features, which may be used by the automatic speech processing engine for various automatic speech processing.
Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.
1. A method for noise suppression, comprising: receiving, by a first processor communicatively coupled with a first memory, first noisy speech, the first noisy speech obtained using two or more microphones; extracting, by the first processor, one or more first cues from the first noisy speech, the one or more first cues including cues associated with noise suppression and automatic speech processing; and creating clean automatic speech processing features using a mapping and the extracted one or more first cues, the clean automatic speech processing features being for use in automatic speech processing and the mapping being provided by a process including: receiving, by a second processor communicatively coupled with a second memory, clean speech and noise; producing, by the second processor, second noisy speech using the clean speech and the noise; extracting, by the second processor, one or more second cues from the second noisy speech, the one or more second cues including cues associated with noise suppression and noisy automatic speech processing; extracting clean automatic speech processing cues from the clean speech; and generating, by the second processor, the mapping from the one or more second cues to the clean automatic speech processing cues, the generating including at least one machine-learning technique.
The method has a run-time stage and a training stage. At run time, a processor receives noisy speech captured by two or more microphones and extracts "cues" from it related to both noise suppression and automatic speech processing. It then applies a mapping to those cues to produce clean automatic speech processing features. The mapping comes from the training stage, in which a second processor receives clean speech and noise, combines them into synthetic noisy speech, extracts cues from that noisy speech, and extracts clean automatic speech processing cues from the clean speech. Using at least one machine-learning technique (e.g., a neural network), it generates a mapping from the noisy-speech cues to the clean cues. The resulting clean features are intended for use in automatic speech processing.
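The training stage's step of producing synthetic noisy speech from clean speech and noise can be sketched as follows. This is a minimal illustration, not the patented implementation: the claim only says the noisy speech is produced "using the clean speech and the noise", so additive mixing at a chosen signal-to-noise ratio, the function name `mix_at_snr`, and the `snr_db` parameter are all assumptions made for the example.

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it to `clean`.

    Hypothetical helper: additive mixing at a target SNR is an assumption,
    not something the claim specifies.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    if clean_power == 0 or noise_power == 0:
        raise ValueError("signals must have nonzero power")
    # Noise power implied by the requested SNR, and the gain that reaches it.
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]

# Toy signals: a sine wave as "clean speech", an alternating sequence as "noise".
clean = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [0.3 * (-1) ** t for t in range(100)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Because the clean signal is known, each synthetic noisy frame comes paired with its clean target, which is what lets a mapping from noisy cues to clean cues be learned.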
2. The method of claim 1, wherein the automatic speech processing comprises automatic speech recognition.
This is the noise suppression method of Claim 1 in which the automatic speech processing is automatic speech recognition (ASR). The clean features are produced before the recognition engine runs, which is intended to improve the accuracy of converting spoken words into text.
3. The method of claim 1, wherein the automatic speech processing comprises one or more of automatic speech recognition, language recognition, keyword recognition, speech confirmation, emotion detection, voice sensing, and speaker recognition.
The noise suppression method from Claim 1 is used in automatic speech processing, which includes one or more of: automatic speech recognition (converting speech to text), language recognition (identifying the language spoken), keyword recognition (detecting specific words), speech confirmation (verifying spoken commands), emotion detection (identifying the speaker's emotional state), voice sensing (detecting the presence of speech), and speaker recognition (identifying the speaker). By removing noise, these applications can operate more reliably.
4. The method of claim 1, wherein receiving, by the second processor, the clean speech and the noise comprises receiving predetermined reference clean speech and predetermined reference noise from a reference database.
In the noise suppression method of Claim 1, the clean speech and noise used to train the mapping are predetermined reference samples from a reference database. This database contains known clean speech and noise examples that are used to train the machine learning model for noise reduction.
5. The method of claim 1, wherein the clean speech and noise are each obtained using at least two microphones, the one or more first and second cues each including at least one inter-microphone level difference (ILD) cues and inter-microphone phase difference (IPD) cues.
In the noise suppression method of Claim 1, the clean speech and the noise used for training are each captured using at least two microphones, as is the noisy speech processed at run time. This allows spatial information to be used in the noise suppression process. The extracted cues include at least one of inter-microphone level difference (ILD) cues and inter-microphone phase difference (IPD) cues. These cues capture spatial characteristics of the sound sources, which helps distinguish speech from noise.
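To make the two spatial cues concrete, the sketch below computes an ILD and an IPD for a single time-frequency bin from two microphones. The claim names the cues but does not give formulas, so the definitions used here (power ratio in decibels for ILD, phase of the cross-spectrum for IPD), the function name `ild_ipd`, and the `eps` guard are assumptions based on common usage.

```python
import cmath
import math

def ild_ipd(primary_bin, secondary_bin, eps=1e-12):
    """Return (ILD in dB, IPD in radians) for one time-frequency bin.

    Assumed definitions: ILD is the power ratio between microphones in dB,
    IPD is the phase of primary * conj(secondary).
    """
    p1 = abs(primary_bin) ** 2
    p2 = abs(secondary_bin) ** 2
    # eps avoids log(0) and division by zero for silent bins.
    ild_db = 10.0 * math.log10((p1 + eps) / (p2 + eps))
    # cmath.phase returns an angle in [-pi, pi].
    ipd = cmath.phase(primary_bin * secondary_bin.conjugate())
    return ild_db, ipd

# One bin: the primary microphone is twice as loud and leads by 0.3 rad.
mic1 = cmath.rect(2.0, 0.5)
mic2 = cmath.rect(1.0, 0.2)
ild, ipd = ild_ipd(mic1, mic2)
# ild is about 6.02 dB, ipd is about 0.3 rad
```

A source close to the primary microphone (typically the talker) tends to produce a larger ILD than distant noise, which is one reason such cues can help separate speech from noise.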
6. The method of claim 4, wherein the automatic speech processing comprises one or more of automatic speech recognition, language recognition, keyword recognition, speech confirmation, emotion detection, voice sensing, and speaker recognition.
Building on Claim 4, the automatic speech processing in the method that uses predetermined reference speech and noise includes one or more of: automatic speech recognition (ASR), language recognition, keyword recognition, speech confirmation, emotion detection, voice sensing, and speaker recognition.
7. The method of claim 1, wherein the one or more first cues and the one or more second cues each further include at least one of energy at channel cues, voice activity detection (VAD) cues, spatial cues, frequency cues, Wiener gain mask estimates, pitch-based cues, periodicity-based cues, noise estimates, and context cues.
In the noise suppression method from Claim 1, the extracted cues (both from the noisy speech and the synthesized noisy speech) include, in addition to any other cues, one or more of: energy at channel cues, voice activity detection (VAD) cues, spatial cues, frequency cues, Wiener gain mask estimates, pitch-based cues, periodicity-based cues, noise estimates, and context cues. These cues provide a richer representation of the audio signal, improving the machine learning model's ability to suppress noise.
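One of the listed cues, the Wiener gain mask estimate, can be illustrated from per-band power values and a noise estimate. The claim does not define how the estimate is computed; the sketch below assumes additive, uncorrelated noise, for which the Wiener gain reduces to (noisy power minus noise power) divided by noisy power. The function name `wiener_gain_mask` and the `floor` parameter are hypothetical.

```python
def wiener_gain_mask(noisy_power, noise_power_estimate, floor=0.1):
    """Per-band gain G = max(floor, (P_noisy - P_noise) / P_noisy).

    Assumes additive, uncorrelated noise. The floor limits how far a band
    can be attenuated; it is an illustrative choice, not from the source.
    """
    gains = []
    for p_noisy, p_noise in zip(noisy_power, noise_power_estimate):
        if p_noisy <= 0.0:
            gains.append(floor)
            continue
        # Estimated speech power cannot be negative.
        speech_power = max(p_noisy - p_noise, 0.0)
        gains.append(max(floor, speech_power / p_noisy))
    return gains

noisy_power = [4.0, 1.0, 0.5]       # per-band power of the noisy frame
noise_estimate = [1.0, 1.0, 0.25]   # per-band noise power estimate
mask = wiener_gain_mask(noisy_power, noise_estimate)
# mask == [0.75, 0.1, 0.5]
```

Bands dominated by speech keep a gain near 1, while the band where the noise estimate accounts for all of the power falls to the floor.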
8. The method of claim 1, wherein the at least one machine-learning technique includes one or more of a neural network, regression tree, a nonlinear transform, a linear transform, and a Gaussian Mixture Model (GMM).
In the noise suppression method from Claim 1, the machine-learning technique used to generate the mapping includes one or more of: a neural network, a regression tree, a nonlinear transform, a linear transform, and a Gaussian Mixture Model (GMM). These models learn the relationship between noisy speech cues and clean speech features.
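As a small illustration of the simplest listed option, a linear transform, the sketch below fits a mapping from a scalar noisy cue to a scalar clean cue by ordinary least squares and then applies it to a new noisy cue. Reducing each cue to a single number, the least-squares criterion, and the names `fit_linear_mapping` and `apply_mapping` are assumptions for the example; the claim does not say how any of the listed models is trained.

```python
def fit_linear_mapping(noisy_cues, clean_cues):
    """Least-squares fit of clean ~ slope * noisy + intercept for one scalar cue."""
    n = len(noisy_cues)
    if n < 2 or n != len(clean_cues):
        raise ValueError("need at least two paired training examples")
    mean_x = sum(noisy_cues) / n
    mean_y = sum(clean_cues) / n
    var_x = sum((x - mean_x) ** 2 for x in noisy_cues)
    if var_x == 0:
        raise ValueError("noisy cues must not all be identical")
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(noisy_cues, clean_cues))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def apply_mapping(mapping, noisy_cue):
    """Estimate the clean cue for a new noisy cue using the learned mapping."""
    slope, intercept = mapping
    return slope * noisy_cue + intercept

# Training stage: cues from synthetic noisy speech paired with clean-speech cues.
train_noisy = [1.0, 2.0, 3.0, 4.0]
train_clean = [0.5, 1.5, 2.5, 3.5]
mapping = fit_linear_mapping(train_noisy, train_clean)

# Run-time stage: map a cue extracted from real noisy speech.
estimate = apply_mapping(mapping, 5.0)
# estimate == 4.5
```

The same train-then-apply split holds for the other listed techniques; only the form of the learned function changes.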
9. The method of claim 1, wherein the generating applies the at least one machine-learning technique to the clean speech and the second noisy speech.
In the noise suppression method from Claim 1, the machine-learning technique is applied to both the clean speech and the synthesized noisy speech when the mapping is generated, so the model learns the transformation from paired noisy and clean examples.
10. A system for noise suppression, comprising: a first frequency analysis module, executed by at least one processor, that is configured to receive first noisy speech, the first noisy speech being each obtained using at least two microphones; a second frequency analysis module, executed by the at least one processor, that is configured to receive clean speech and noise; a combination module, executed by the at least one processor, that is configured to produce second noisy speech using the clean speech and the noise; a first cue extraction module, executed by the at least one processor, that is configured to extract one or more first cues from the first noisy speech, the one or more first cues including cues associated with noise suppression and automatic speech processing; a second cue extraction module, executed by the at least one processor, that is configured to extract one or more second cues from the second noisy speech, the one or more second cues including cues associated with noise suppression and noisy automatic speech processing; a third cue extraction module, executed by the at least one processor, that is configured to extract clean automatic speech processing cues from the clean speech; and a learning module, executed by the at least one processor, that is configured to generate a mapping from the one or more second cues associated with the noise suppression cues and the noisy automatic speech processing cues to the clean automatic speech processing cues, the generating including at least one machine-learning technique; and a modification module, executed by the at least one processor, that is configured to create clean automatic speech processing features using the mapping and the extracted one or more first cues, the clean automatic speech processing features being for use in automatic speech processing.
A noise suppression system contains frequency analysis modules for both noisy and clean/noise audio. A combination module creates synthesized noisy speech. Cue extraction modules extract features from both real and synthetic noisy speech and the clean speech. A learning module, using machine learning, generates a mapping between noisy speech cues and clean speech features. A modification module applies this mapping to incoming noisy speech, creating clean speech features for use in automatic speech processing. The noisy speech input is obtained using at least two microphones.
11. The system of claim 10, wherein the automatic speech processing comprises automatic speech recognition.
This is the system of Claim 10 in which the automatic speech processing is automatic speech recognition.
12. The system of claim 10, wherein the automatic speech processing comprises one or more of automatic speech recognition, language recognition, keyword recognition, speech confirmation, emotion detection, voice sensing, and speaker recognition.
The system in Claim 10 performs automatic speech processing, which includes one or more of: automatic speech recognition, language recognition, keyword recognition, speech confirmation, emotion detection, voice sensing, and speaker recognition.
13. The system of claim 10, wherein the second frequency analysis module is configured to receive the clean speech and the noise from a reference database, the clean speech and noise being predetermined reference clean speech and predetermined reference noise.
In the system from Claim 10, the clean speech and noise used for training are predetermined reference samples retrieved from a reference database.
14. The system of claim 10, wherein the at least one machine-learning technique includes one or more of a neural network, regression tree, a non-linear transform, a linear transform, and a Gaussian Mixture Model (GMM).
In the system from Claim 10, the machine-learning technique employed includes one or more of: a neural network, regression tree, a non-linear transform, a linear transform, and a Gaussian Mixture Model (GMM).
15. The system of claim 10, wherein the one or more first cues and the one or more second cues each include at least one of ILD cues and IPD cues.
In the system from Claim 10, the extracted cues include at least one of inter-microphone level difference (ILD) cues and inter-microphone phase difference (IPD) cues, which are derived from signals captured by at least two microphones.
16. The system of claim 10, wherein the one or more first cues and the one or more second cues each include at least one of energy at channel cues, VAD cues, spatial cues, frequency cues, Wiener gain mask estimates, pitch-based cues, periodicity-based cues, noise estimates, and context cues.
In the system from Claim 10, the extracted cues include one or more of: energy at channel cues, voice activity detection (VAD) cues, spatial cues, frequency cues, Wiener gain mask estimates, pitch-based cues, periodicity-based cues, noise estimates, and context cues.
17. The system of claim 14, wherein the at least one machine-learning techniques each include one or more of a neural network, regression tree, a non-linear transform, a linear transform, and a GMM.
The system from Claim 14 utilizes machine learning techniques that include one or more of: a neural network, regression tree, a non-linear transform, a linear transform, and a Gaussian Mixture Model (GMM).
18. The method of claim 1, wherein the first processor communicatively coupled with the first memory are included in a cloud-based computing environment.
In the noise suppression method from Claim 1, the processor and memory used for processing the first noisy speech are part of a cloud-based computing environment.
October 4, 2013
May 2, 2017