End-Of-Turn Detection in Spoken Dialogues

Published: March 23, 2021

Assignee: not available in the USPTO data we have

Inventors: Lazaros Polymenakos, Dimitrios B. Dimitriadis, Zakaria Aldeneh, Emily Mower Provost

Patent Claims

20 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A system, comprising: a memory that stores computer executable components; a processor that executes the computer executable components stored in the memory, wherein the computer executable components comprise: a speech receiving component that receives a spoken dialogue from a first entity; and a speech processing component that employs a neural network that concurrently processes a first classifier and a second classifier using acoustic cues from the spoken dialogue to predict a source of a subsequent spoken dialogue, wherein; the first classifier generates a first prediction of an intention of the spoken dialogue, the second classifier generates a second prediction of a type of turn of the spoken dialogue, and the neural network combines the first prediction and the second prediction using a minimizing joint loss function to predict whether the source of the subsequent spoken dialogue will be the first entity or another entity.

Plain English Translation

This invention relates to a speech processing system that predicts who will speak next in a conversation: the current speaker (the first entity) or another entity. The system includes a memory that stores executable components and a processor that executes them. A speech receiving component receives spoken dialogue from the first entity. A speech processing component then applies a neural network that processes two classifiers concurrently, both driven by acoustic cues from the spoken dialogue. The first classifier predicts the intention of the spoken dialogue, and the second predicts the type of turn (claim 6 lists a turn hold, a turn switch, a smooth switch, and an overlapping switch). The neural network combines the two predictions by minimizing a joint loss function and outputs whether the subsequent spoken dialogue will come from the first entity or from another entity. A dialogue system such as a virtual assistant or customer service bot could use this prediction to decide when to respond and when to keep listening.
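The two-classifier arrangement described above can be illustrated with a minimal sketch. It assumes a single shared layer feeding two output heads, which is one common way to build such a network; the claim does not specify the architecture. The class name `TwoHeadNetwork`, the layer sizes, and the random untrained weights are all hypothetical.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, inputs):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def random_layer(rng, n_out, n_in):
    """Hypothetical untrained parameters, for illustration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

class TwoHeadNetwork:
    """Shared layer feeding an intention head and a turn-type head."""

    def __init__(self, n_features, n_hidden, n_intentions, n_turn_types, seed=0):
        rng = random.Random(seed)
        self.shared = random_layer(rng, n_hidden, n_features)
        self.intention_head = random_layer(rng, n_intentions, n_hidden)
        self.turn_head = random_layer(rng, n_turn_types, n_hidden)

    def forward(self, acoustic_features):
        # Both heads read the same hidden representation of the acoustic cues.
        hidden = [math.tanh(h) for h in linear(*self.shared, acoustic_features)]
        intention_probs = softmax(linear(*self.intention_head, hidden))
        turn_probs = softmax(linear(*self.turn_head, hidden))
        return intention_probs, turn_probs

# Four made-up acoustic feature values for one utterance.
net = TwoHeadNetwork(n_features=4, n_hidden=8, n_intentions=3, n_turn_types=2)
intention_probs, turn_probs = net.forward([0.2, -0.1, 0.7, 0.05])
```

Each head returns a probability distribution over its own label set. A trained version would learn the weights rather than drawing them at random.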

Claim 2

Original Legal Text

2. The system of claim 1 , wherein the neural network is a multi-task neural network, and wherein the system further comprises a network optimizing component that optimizes the multi-task neural network by employing a plurality of speech labels to predict the source of the subsequent spoken dialogue.

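Plain English Translation

This claim narrows the system of claim 1 by specifying that the neural network is a multi-task neural network. The system also includes a network optimizing component that optimizes this network using a plurality of speech labels, so that it predicts the source of the subsequent spoken dialogue.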

Claim 3

Original Legal Text

3. The system of claim 2 , wherein the plurality of speech labels comprises an optimizing data set.

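Plain English Translation

This claim narrows the system of claim 2 by specifying that the plurality of speech labels used to optimize the multi-task neural network comprises an optimizing data set.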

Claim 4

Original Legal Text

4. The system of claim 1 , wherein the minimizing joint loss function comprises a first loss function for the first prediction and a second loss function for the second prediction.

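Plain English Translation

This claim narrows the system of claim 1 by specifying that the joint loss function being minimized comprises two parts: a first loss function for the first prediction (the intention of the spoken dialogue) and a second loss function for the second prediction (the type of turn).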

Claim 5

Original Legal Text

5. The system of claim 1 , wherein the speech processing component predicts the source of the subsequent spoken dialogue in real time during a communication session comprising the spoken dialogue.

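Plain English Translation

This claim narrows the system of claim 1 by specifying that the speech processing component predicts the source of the subsequent spoken dialogue in real time, during the communication session that contains the spoken dialogue.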

Claim 6

Original Legal Text

6. The system of claim 1 , wherein the type of turn is selected from a group consisting of a turn hold, a turn switch, a smooth switch, and an overlapping switch.

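Plain English Translation

This claim narrows the system of claim 1 by specifying that the type of turn is one of four categories: a turn hold, a turn switch, a smooth switch, or an overlapping switch.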
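The four turn types named in claim 6 can be written as an enumeration. The sketch below also maps each type to a next-speaker outcome, on the assumption that a turn hold means the first entity keeps speaking and any kind of switch hands the floor to another entity. That mapping is a plausible reading and is not stated in the claim; the names `TurnType` and `next_speaker` are hypothetical.

```python
from enum import Enum

class TurnType(Enum):
    """The four turn types listed in claim 6."""
    TURN_HOLD = "turn hold"
    TURN_SWITCH = "turn switch"
    SMOOTH_SWITCH = "smooth switch"
    OVERLAPPING_SWITCH = "overlapping switch"

def next_speaker(turn_type):
    """Assumed mapping: a hold keeps the floor, any switch hands it over."""
    if turn_type is TurnType.TURN_HOLD:
        return "first entity"
    return "another entity"

predicted = next_speaker(TurnType.SMOOTH_SWITCH)
```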

Claim 7

Original Legal Text

7. The system of claim 1 , wherein the acoustic cues comprise timing of the spoken dialogue.

Plain English Translation

This claim narrows the system of claim 1 by specifying that the acoustic cues fed to the neural network include the timing of the spoken dialogue. Timing could include temporal patterns such as pauses, speech rate, and overlapping speech. In the system of claim 1, these cues drive the two classifiers that predict the intention of the dialogue and the type of turn, and the combined predictions determine whether the first entity or another entity is predicted to speak next. The claim does not specify how timing is measured.
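As an illustration of a timing cue, the sketch below measures silent runs in a sequence of per-frame voice activity flags. It assumes the flags come from some upstream voice activity detector, which is not shown, and that each frame lasts 10 ms. The function names and the frame length are hypothetical, and the patent does not prescribe this computation.

```python
def silent_runs_ms(voiced_frames, frame_ms=10):
    """Duration in milliseconds of every silent run, in order of occurrence."""
    runs, current = [], 0
    for is_voiced in voiced_frames:
        if is_voiced:
            if current:
                runs.append(current * frame_ms)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current * frame_ms)
    return runs

def trailing_pause_ms(voiced_frames, frame_ms=10):
    """Duration in milliseconds of the silence at the end of the input so far."""
    silent = 0
    for is_voiced in reversed(voiced_frames):
        if is_voiced:
            break
        silent += 1
    return silent * frame_ms

# Made-up activity: speech, a short pause, more speech, then a longer silence.
frames = [True] * 30 + [False] * 20 + [True] * 50 + [False] * 40
pauses = silent_runs_ms(frames)
trailing = trailing_pause_ms(frames)
```

Values like these could serve as input features to the neural network alongside other acoustic cues.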

Claim 8

Original Legal Text

8. The system of claim 1 , wherein the acoustic cues comprise a cue selected from the group consisting of intonation, pitch change, speaking rate, and pause.

Plain English Translation

This claim narrows the system of claim 1 by specifying that the acoustic cues include at least one of intonation, pitch change, speaking rate, and pause. Intonation is the rise and fall of pitch over an utterance, which can distinguish, for example, a question from a statement. Pitch change is variation in the fundamental frequency of the voice, which can indicate stress or emphasis. Speaking rate is the speed at which words are spoken. Pauses are silent intervals between words or phrases. In the system of claim 1, the neural network uses these cues to predict the intention of the spoken dialogue and the type of turn, and from those predictions whether the first entity or another entity will speak next.
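Two of the cues described above, speaking rate and pitch change, can be sketched as simple feature computations. The sketch assumes that a syllable count and a per-frame fundamental frequency track are already available from an upstream analyzer, and that a value of 0 marks an unvoiced frame. Both function names and both conventions are hypothetical and are not taken from the patent.

```python
def speaking_rate(syllable_count, voiced_seconds):
    """Syllables per second of voiced speech."""
    if voiced_seconds <= 0:
        return 0.0
    return syllable_count / voiced_seconds

def final_pitch_change(f0_hz, tail=5):
    """Mean pitch of the last `tail` voiced frames minus the mean of the earlier ones."""
    voiced = [f for f in f0_hz if f > 0]  # 0 marks unvoiced frames (assumed convention)
    if len(voiced) <= tail:
        return 0.0
    head, end = voiced[:-tail], voiced[-tail:]
    return sum(end) / len(end) - sum(head) / len(head)

# Made-up values: 12 syllables in 3 seconds, and a pitch that falls at the end.
rate = speaking_rate(syllable_count=12, voiced_seconds=3.0)
f0_track = [200.0] * 10 + [0.0] * 2 + [150.0] * 5
drop = final_pitch_change(f0_track)
```

A negative result from `final_pitch_change` means the pitch fell toward the end of the utterance.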

Claim 9

Original Legal Text

9. The system of claim 1 , wherein the other entity is a computerized spoken dialog system.

Plain English Translation

This claim narrows the system of claim 1 to the case where the other entity is a computerized spoken dialog system. The conversation is then between the first entity, for example a human user, and an automated system that interprets spoken input and generates spoken responses. The prediction of claim 1 becomes a decision about whether the first entity will keep speaking or the dialog system will be the source of the next spoken dialogue. Applications could include customer service, virtual assistants, and automated information retrieval.

Claim 10

Original Legal Text

10. A computer-implemented method, comprising: receiving, by a system operatively coupled to a processor, a spoken dialogue from a first entity; and predicting, by the system, a source of a subsequent spoken dialogue by employing a neural network that concurrently processes a first classifier and a second classifier using acoustic cues from the spoken dialogue, wherein: the first classifier generates a first prediction of an intention of the spoken dialogue, the second classifier generates a second prediction of a type of turn of the spoken dialogue, and the neural network combines the first prediction and the second prediction using a minimizing joint loss function to predict whether the source of the subsequent spoken dialogue will be the first entity or another entity.

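Plain English Translation

This claim covers the method counterpart of the system of claim 1. A system coupled to a processor receives spoken dialogue from a first entity and predicts the source of the subsequent spoken dialogue using a neural network that processes two classifiers concurrently, both driven by acoustic cues from the spoken dialogue. The first classifier predicts the intention of the spoken dialogue, and the second predicts the type of turn. The neural network combines the two predictions by minimizing a joint loss function to predict whether the subsequent spoken dialogue will come from the first entity or from another entity.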

Claim 11

Original Legal Text

11. The computer-implemented method of claim 10 , wherein the neural network is a multi-task neural network, and wherein the computer-implemented method further comprises optimizing, by the system, the multi-task neural network by employing a plurality of speech labels to predict the source of the subsequent spoken dialogue.

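Plain English Translation

This claim narrows the method of claim 10 by specifying that the neural network is a multi-task neural network. The method also includes optimizing this network using a plurality of speech labels, so that it predicts the source of the subsequent spoken dialogue.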

Claim 12

Original Legal Text

12. The computer-implemented method of claim 11 , wherein the plurality of speech labels comprises an optimizing data set.

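Plain English Translation

This claim narrows the method of claim 11 by specifying that the plurality of speech labels used to optimize the multi-task neural network comprises an optimizing data set.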

Claim 13

Original Legal Text

13. The computer-implemented method of claim 10 , wherein the minimizing joint loss function comprises a first loss function for the first prediction and a second loss function for the second prediction.

Plain English Translation

This claim narrows the method of claim 10 by specifying the structure of the joint loss function. The joint loss function that the neural network minimizes comprises a first loss function for the first prediction (the intention of the spoken dialogue) and a second loss function for the second prediction (the type of turn). Minimizing both together is a form of multi-task learning: a single model with multiple outputs has its parameters adjusted so that errors on both predictions are reduced at once, rather than fitting one prediction while neglecting the other. The claim does not specify how the two loss functions are combined or weighted.
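A joint loss of this shape can be sketched as follows. The sketch assumes cross-entropy for each prediction and a weighted sum to combine them. Both are common choices in multi-task learning, but neither is stated in the claim; the weights and function names are hypothetical.

```python
import math

def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_index])

def joint_loss(intention_probs, intention_label, turn_probs, turn_label,
               intention_weight=1.0, turn_weight=1.0):
    """Weighted sum of the two per-task losses (assumed combination rule)."""
    first = cross_entropy(intention_probs, intention_label)   # first loss function
    second = cross_entropy(turn_probs, turn_label)            # second loss function
    return intention_weight * first + turn_weight * second

# Made-up predictions: the correct intention has probability 0.5,
# and the correct turn type has probability 0.75.
loss = joint_loss([0.5, 0.5], 0, [0.25, 0.75], 1)
```

With equal weights, the loss here is `-log(0.5) - log(0.75)`. Training would adjust the shared network parameters to reduce this single value, which pushes both predictions toward their labels at once.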

Claim 14

Original Legal Text

14. The computer-implemented method of claim 10 , wherein the predicting the source of the subsequent spoken dialogue occurs in real time during a communication session comprising the spoken dialogue.

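Plain English Translation

This claim narrows the method of claim 10 by specifying that the prediction of the source of the subsequent spoken dialogue occurs in real time, during the communication session that contains the spoken dialogue.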

Claim 15

Original Legal Text

15. The computer-implemented method of claim 10 , wherein the type of turn is selected from a group consisting of a turn hold, a turn switch, a smooth switch, and an overlapping switch.

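Plain English Translation

This claim narrows the method of claim 10 by specifying that the type of turn is one of four categories: a turn hold, a turn switch, a smooth switch, or an overlapping switch.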

Claim 16

Original Legal Text

16. The computer-implemented method of claim 10 , wherein the acoustic cues comprise timing of the spoken dialogue.

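Plain English Translation

This claim narrows the method of claim 10 by specifying that the acoustic cues used by the neural network include the timing of the spoken dialogue.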

Claim 17

Original Legal Text

17. The computer-implemented method of claim 10 , wherein the acoustic cues comprise a cue selected from the group consisting of intonation, pitch change, speaking rate, and pause.

Plain English Translation

This claim narrows the method of claim 10 by specifying that the acoustic cues include at least one of intonation, pitch change, speaking rate, and pause. In the method of claim 10, the neural network uses these cues to predict the intention of the spoken dialogue and the type of turn, and combines those predictions to determine whether the first entity or another entity will speak next. Cues of this kind can help distinguish statements, questions, emphasis, and deliberate pauses. Applications could include virtual assistants and voice-controlled interfaces.

Claim 18

Original Legal Text

18. The computer-implemented method of claim 10 , wherein the other entity is a computerized spoken dialog system.

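Plain English Translation

This claim narrows the method of claim 10 to the case where the other entity is a computerized spoken dialog system.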

Claim 19

Original Legal Text

19. A computer program product facilitating predicting a source of a subsequent spoken dialogue, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to: receive, by the processor, a spoken dialogue from a first entity; and predict, by the processor, the source of the subsequent spoken dialogue by employing a neural network that concurrently processes a first classifier and a second classifier using acoustic cues from the spoken dialogue, wherein: the first classifier generates a first prediction of an intention of the spoken dialogue, the second classifier generates a second prediction of a type of turn of the spoken dialogue, and the neural network combines the first prediction and the second prediction using a minimizing joint loss function to predict whether the source of the subsequent spoken dialogue will be the first entity or another entity.

Plain English Translation

This claim covers a computer program product: a computer readable storage medium holding program instructions that a processor executes to predict the source of a subsequent spoken dialogue. The instructions cause the processor to receive spoken dialogue from a first entity and to apply a neural network that processes two classifiers concurrently using acoustic cues from that dialogue. The first classifier predicts the intention of the spoken dialogue, and the second predicts the type of turn (for example, a turn hold or a turn switch, as listed in claims 6 and 15). The neural network combines the two predictions by minimizing a joint loss function to predict whether the subsequent dialogue will come from the first entity or from another entity. The claim mirrors the system of claim 1 and the method of claim 10. Applications could include virtual assistants, customer service bots, and interactive voice response systems.

Claim 20

Original Legal Text

20. The computer program product of claim 19 , wherein the neural network is a multi-task neural network, and wherein the program instructions are further executable by the processor to cause the processor to optimize, by the processor, the multi-task neural network by employing a plurality of speech labels to predict the source of the subsequent spoken dialogue.

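Plain English Translation

This claim narrows the computer program product of claim 19 by specifying that the neural network is a multi-task neural network. The program instructions also cause the processor to optimize this network using a plurality of speech labels, so that it predicts the source of the subsequent spoken dialogue.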

Patent Metadata

Filing Date

Unknown

Publication Date

March 23, 2021

Inventors

Lazaros Polymenakos

Dimitrios B. Dimitriadis

Zakaria Aldeneh

Emily Mower Provost
