US-11276249

Method and system for video action classification by mixing 2D and 3D features

Published: March 15, 2022
Assignee: not available in USPTO data we have
Inventors: not available in USPTO data we have
Technical Abstract

A method, system, and computer program product provide for video action classification by selecting a first video frame and a first plurality of video frames from a received video to process the first video frame with a 2D convolutional neural network processing pathway to extract spatial features classifying the first video frame, and to process the first plurality of video frames with a 3D convolutional neural network processing pathway to extract spatiotemporal features classifying the first plurality of video frames so that the spatial features are combined with the spatiotemporal features to generate a classification label for the video action.

Patent Claims
17 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A computer-implemented method for classifying video action, the method comprising: receiving, by an information handling system comprising a processor and a memory, a video for action analysis; loading, by the information handling system, a plurality of processing pathways comprising a 2D convolutional neural network processing pathway and a 3D convolutional neural network processing pathway that is formed by inflating the 2D convolutional neural network processing pathway; selecting, by the information handling system, a first video frame and a first plurality of video frames from the video; processing, by the information handling system, the first video frame with the 2D convolutional neural network processing pathway to extract spatial features classifying the first video frame; processing, by the information handling system, the first plurality of video frames with the 3D convolutional neural network processing pathway to extract spatiotemporal features classifying the first plurality of video frames; and combining, by the information handling system, the spatial features with the spatiotemporal features to generate a classification label for the video action.

Plain English Translation

Video analysis and classification. This invention addresses the problem of classifying actions in video data by using both spatial and temporal information. A computer system receives a video for analysis and loads multiple processing pathways: a 2D convolutional neural network (CNN) pathway and a 3D CNN pathway. The 3D CNN pathway is formed from the 2D CNN pathway by a process called "inflation," which lets it process multiple frames at once. The system selects a single video frame and a group of video frames from the video. The 2D CNN pathway processes the single frame to extract spatial features, and the 3D CNN pathway processes the group of frames to extract spatiotemporal features that capture motion and change over time. Finally, the spatial and spatiotemporal features are combined to produce a classification label for the action in the video.
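The claim says the 3D pathway is "formed by inflating" the 2D pathway but does not say how. The sketch below shows one common way to do this, offered as an assumption rather than as the patented method: each 2D kernel is repeated along a new time axis and scaled by the temporal depth. The names `inflate_kernel`, `response_2d`, `response_3d`, and `time_depth` are hypothetical.

```python
def inflate_kernel(kernel_2d, time_depth):
    """Inflate an (H x W) kernel into a (T x H x W) kernel.

    Assumption: weights are copied along the new time axis and scaled by
    1/T, so a clip of T identical frames gives the same response as the
    2D kernel gives on one frame. The patent claims do not specify this.
    """
    if time_depth < 1:
        raise ValueError("time_depth must be at least 1")
    return [
        [[w / time_depth for w in row] for row in kernel_2d]
        for _ in range(time_depth)
    ]


def response_2d(kernel_2d, patch):
    """Dot product of a 2D kernel with a same-sized image patch."""
    return sum(
        w * x
        for k_row, p_row in zip(kernel_2d, patch)
        for w, x in zip(k_row, p_row)
    )


def response_3d(kernel_3d, clip):
    """Dot product of a 3D kernel with a same-sized clip (list of patches)."""
    return sum(response_2d(k, p) for k, p in zip(kernel_3d, clip))


kernel = [[1.0, 0.0], [0.0, -1.0]]
patch = [[3.0, 5.0], [2.0, 1.0]]
inflated = inflate_kernel(kernel, time_depth=4)
static_clip = [patch] * 4  # four identical frames
```

Under this assumed scheme, `response_3d(inflated, static_clip)` equals `response_2d(kernel, patch)`, which is why pretrained 2D weights can serve as a reasonable starting point for the 3D pathway.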

Claim 2

Original Legal Text

2. The computer-implemented method of claim 1, further comprising initializing, by the information handling system, the 2D convolutional neural network processing pathway and the 3D convolutional neural network processing pathway with pretrained weights.

Plain English Translation

This claim adds to the video action classification method of claim 1 a step of initializing both the 2D convolutional neural network (CNN) pathway and the 3D CNN pathway with pretrained weights. Starting from weights learned in earlier training, rather than from random values, gives the network useful features from the outset and can shorten training. The claim does not specify the source of the pretrained weights.

Claim 3

Original Legal Text

3. The computer-implemented method of claim 2, further comprising training, by the information handling system, a final fully connected layer of the 2D convolutional neural network processing pathway and the 3D convolutional neural network processing pathway with the pretrained weights for a first specified epoch training period.

Plain English Translation

This claim builds on claim 2 by adding a first training stage for the video action classifier. With the 2D and 3D convolutional neural network (CNN) pathways initialized from pretrained weights, the system trains a final fully connected layer of the two pathways for a first specified number of epochs. Because this stage recites training of the final fully connected layer only, the pretrained layers supply the features while the classification layer adapts to the video action task. The claim does not state how long the first training period is.

Claim 4

Original Legal Text

4. The computer-implemented method of claim 3, further comprising: training, after the first specified epoch training period, all layers of the 2D convolutional neural network processing pathway and the 3D convolutional neural network processing pathway with the pretrained weights for a second specified epoch training period.

Plain English Translation

This claim builds on claim 3 by adding a second training stage. After the first specified epoch training period, in which a final fully connected layer is trained, the system trains all layers of both the 2D and 3D convolutional neural network (CNN) pathways for a second specified epoch training period, again starting from the pretrained weights. The result is a two-stage schedule: the final fully connected layer is trained first, and the whole network is then fine-tuned. The claim does not specify the length of either training period.
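Claims 3 and 4 together describe a two-stage training schedule. The sketch below is a framework-free illustration of which layers would be updated in each epoch. It treats the final fully connected layer as a single shared classifier, which is one reading of the claims, and the layer names and epoch counts are hypothetical.

```python
def trainable_layers(epoch, layers, head_name, head_only_epochs):
    """Return the layer names updated during the given (0-based) epoch.

    Stage 1 (epochs before head_only_epochs): only the final fully
    connected layer is trained; pretrained layers stay fixed.
    Stage 2 (remaining epochs): every layer in both pathways is trained.
    """
    if head_name not in layers:
        raise ValueError("head_name must be one of the layers")
    if epoch < head_only_epochs:
        return [head_name]
    return list(layers)


# Hypothetical layer names for the two pathways plus the shared classifier.
LAYERS = [
    "conv2d_stage1",
    "conv2d_stage2",
    "conv3d_stage1",
    "conv3d_stage2",
    "final_fc",
]

# Assumed schedule: 2 epochs of head-only training, then 2 epochs of full training.
schedule = [
    trainable_layers(e, LAYERS, head_name="final_fc", head_only_epochs=2)
    for e in range(4)
]
```

In a real training loop, the returned names would decide which parameters receive gradient updates in that epoch; how that is wired up depends on the framework in use.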

Claim 5

Original Legal Text

5. The computer-implemented method of claim 1, where the spatial features at each stage of the 2D convolutional neural network processing pathway are fused with corresponding spatiotemporal features at each stage of the 3D convolutional neural network processing pathway.

Plain English Translation

This claim builds on the method of claim 1 and specifies where the two pathways exchange information. The 2D convolutional neural network (CNN) pathway extracts spatial features from a single video frame, and the 3D CNN pathway extracts spatiotemporal features from a group of frames. At each stage of the 2D pathway, the spatial features are fused with the corresponding spatiotemporal features at the same stage of the 3D pathway. Fusion therefore happens at multiple levels of abstraction rather than only at the final layer. The claim does not specify the fusion operation; element-wise addition and concatenation are common choices.
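As a rough illustration of stage-wise fusion, the sketch below adds each stage's 2D feature vector to every time step of the matching 3D feature. Broadcast addition is an assumption made here for simplicity, since the claim only says the features are "fused", and the function names and feature shapes are hypothetical.

```python
def fuse_stage(spatial, spatiotemporal):
    """Fuse one stage's features by broadcast addition.

    spatial:        list of C floats (one frame, 2D pathway)
    spatiotemporal: list of T lists of C floats (T time steps, 3D pathway)
    Assumption: the 2D feature is added to every time step. The claim only
    says the features are "fused"; concatenation would be equally valid.
    """
    for step in spatiotemporal:
        if len(step) != len(spatial):
            raise ValueError("channel counts must match to fuse by addition")
    return [[s + x for s, x in zip(spatial, step)] for step in spatiotemporal]


def fuse_all_stages(spatial_stages, spatiotemporal_stages):
    """Apply fuse_stage to each pair of corresponding stages."""
    if len(spatial_stages) != len(spatiotemporal_stages):
        raise ValueError("both pathways need the same number of stages")
    return [
        fuse_stage(s, st)
        for s, st in zip(spatial_stages, spatiotemporal_stages)
    ]


# Two stages; stage 1 has 2 channels, stage 2 has 3 channels, T = 2 time steps.
spatial_stages = [[1.0, 2.0], [0.5, 0.5, 0.5]]
spatiotemporal_stages = [
    [[10.0, 20.0], [30.0, 40.0]],
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
]
fused = fuse_all_stages(spatial_stages, spatiotemporal_stages)
```

Addition requires matching channel counts at each stage, which fits naturally with a 3D pathway inflated from the 2D one; concatenation would drop that requirement at the cost of wider features.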

Claim 6

Original Legal Text

6. The computer-implemented method of claim 1, where combining the spatial features with the spatiotemporal features comprises concatenating a final spatial feature from the 2D convolutional neural network processing pathway with a final spatiotemporal feature from the 3D convolutional neural network processing pathway to connect to a final, fully connected layer that is used for classifying the video action of the video.

Plain English Translation

This invention relates to video action recognition using a hybrid neural network architecture that combines spatial and spatiotemporal features. The problem addressed is the challenge of accurately classifying actions in videos by leveraging both spatial and temporal information. Traditional approaches often rely solely on spatial features or temporal features, which may not capture the full complexity of dynamic actions. The method involves processing a video through two parallel neural network pathways: a 2D convolutional neural network (CNN) for extracting spatial features and a 3D CNN for extracting spatiotemporal features. The 2D CNN analyzes individual frames to capture spatial patterns, while the 3D CNN processes sequences of frames to capture temporal dynamics. The final spatial feature from the 2D CNN and the final spatiotemporal feature from the 3D CNN are concatenated into a combined feature vector. This combined feature vector is then fed into a fully connected layer, which performs the final classification of the video action. By integrating both spatial and spatiotemporal information, the method improves the accuracy and robustness of action recognition in videos.
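The concatenate-then-classify step can be sketched in a few lines of plain Python. The feature values, weights, and labels below are hypothetical and chosen only to make the arithmetic easy to follow; picking the highest-scoring label is an assumed decision rule, not something the claim recites.

```python
def classify(spatial_feature, spatiotemporal_feature, weights, biases, labels):
    """Concatenate the two final features and apply one fully connected layer.

    weights has one row per label; each row has
    len(spatial_feature) + len(spatiotemporal_feature) entries.
    Returns the label with the highest score, plus all scores.
    """
    combined = list(spatial_feature) + list(spatiotemporal_feature)
    scores = []
    for row, bias in zip(weights, biases):
        if len(row) != len(combined):
            raise ValueError("weight row length must match the combined feature")
        scores.append(sum(w * x for w, x in zip(row, combined)) + bias)
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best], scores


# Hypothetical values for illustration only.
spatial = [1.0, 0.0]          # final 2D-pathway feature
spatiotemporal = [0.0, 2.0]   # final 3D-pathway feature
weights = [
    [1.0, 0.0, 0.0, 0.0],     # "standing" weights the spatial part
    [0.0, 0.0, 0.0, 1.0],     # "running" weights the spatiotemporal part
]
biases = [0.0, 0.0]
label, scores = classify(
    spatial, spatiotemporal, weights, biases, ["standing", "running"]
)
```

Because the fully connected layer sees one combined vector, each label's score can draw on spatial evidence, temporal evidence, or both.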

Claim 7

Original Legal Text

7. An information handling system comprising: one or more processors; a memory coupled to at least one of the processors; a set of instructions stored in the memory and executed by at least one of the processors to classify video action, wherein the set of instructions are executable to perform actions of: receiving, by the system, a video for action analysis; loading, by the system, a plurality of processing pathways comprising a 2D convolutional neural network processing pathway and a 3D convolutional neural network processing pathway that is formed by inflating the 2D convolutional neural network processing pathway; selecting, by the system, a first video frame and a first plurality of video frames from the video; processing, by the system, the first video frame with the 2D convolutional neural network processing pathway to extract spatial features classifying the first video frame; processing, by the system, the first plurality of video frames with the 3D convolutional neural network processing pathway to extract spatiotemporal features classifying the first plurality of video frames; and combining, by the system, the spatial features with the spatiotemporal features to generate a classification label for the video action.

Plain English Translation

The invention relates to video action recognition systems that analyze video content to classify actions. Traditional methods often struggle to effectively capture both spatial and temporal features in video data, leading to inaccurate action classification. This system addresses the problem by combining 2D and 3D convolutional neural networks (CNNs) to extract complementary features. The system includes processors and memory storing instructions for video action classification. Upon receiving a video, the system loads multiple processing pathways, including a 2D CNN for spatial feature extraction and a 3D CNN derived by "inflating" the 2D CNN to capture temporal dynamics. The system processes individual video frames through the 2D CNN to extract spatial features and sequences of frames through the 3D CNN to extract spatiotemporal features. These features are then combined to generate a classification label for the action depicted in the video. The approach leverages the strengths of both 2D and 3D CNNs, improving accuracy in action recognition tasks.
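The claims require selecting "a first video frame" and "a first plurality of video frames" but do not say how they are chosen. The sketch below assumes a fixed-stride clip and uses the clip's middle frame as the single frame; the function name `select_inputs` and its parameters are hypothetical.

```python
def select_inputs(num_frames, clip_len, stride=1, start=0):
    """Pick frame indices for the two pathways.

    Returns (single_index, clip_indices). Assumption: the clip is sampled at
    a fixed stride and the single frame is the clip's middle frame; the
    claims only require "a first video frame" and "a first plurality".
    """
    if clip_len < 2:
        raise ValueError("a plurality needs at least two frames")
    if stride < 1:
        raise ValueError("stride must be at least 1")
    clip = [start + i * stride for i in range(clip_len)]
    if start < 0 or clip[-1] >= num_frames:
        raise ValueError("clip does not fit inside the video")
    return clip[len(clip) // 2], clip


# A 100-frame video, 8-frame clip, every second frame, starting at frame 10.
single, clip = select_inputs(num_frames=100, clip_len=8, stride=2, start=10)
```

The single index would feed the 2D pathway and the clip indices would feed the 3D pathway; other sampling choices, such as random or uniformly spaced frames, would satisfy the claim language equally well.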

Claim 8

Original Legal Text

8. The information handling system of claim 7, wherein the set of instructions are executable to initialize the 2D convolutional neural network processing pathway and the 3D convolutional neural network processing pathway with pretrained weights.

Plain English Translation

This claim builds on the information handling system of claim 7, which classifies video action using a 2D convolutional neural network (CNN) pathway and a 3D CNN pathway formed by inflating the 2D pathway. Here, the stored instructions are also executable to initialize both pathways with pretrained weights, so the network starts from previously learned features rather than from random values. The claim does not specify the source of the pretrained weights. This is the system counterpart of method claim 2.

Claim 9

Original Legal Text

9. The information handling system of claim 8, wherein the set of instructions are executable to train a final fully connected layer of the 2D convolutional neural network processing pathway and the 3D convolutional neural network processing pathway with the pretrained weights for a first specified epoch training period.

Plain English Translation

This claim builds on the information handling system of claim 8. In addition to initializing the 2D and 3D convolutional neural network (CNN) pathways with pretrained weights, the stored instructions are executable to train a final fully connected layer of the two pathways for a first specified epoch training period. This is the system counterpart of the first training stage in method claim 3: the final fully connected layer is trained first, starting from the pretrained weights.

Claim 10

Original Legal Text

10. The information handling system of claim 9, wherein the set of instructions are executable to train, after the first specified epoch training period, all layers of the 2D convolutional neural network processing pathway and the 3D convolutional neural network processing pathway with the pretrained weights for a second specified epoch training period.

Plain English Translation

This claim builds on the information handling system of claim 9. After the first specified epoch training period, in which a final fully connected layer is trained, the stored instructions are executable to train all layers of both the 2D and 3D convolutional neural network (CNN) pathways for a second specified epoch training period, starting from the pretrained weights. This is the system counterpart of method claim 4 and completes a two-stage schedule: the final fully connected layer first, then the whole network.

Claim 11

Original Legal Text

11. The information handling system of claim 7, where the spatial features at each stage of the 2D convolutional neural network processing pathway are fused with corresponding spatiotemporal features at each stage of the 3D convolutional neural network processing pathway.

Plain English Translation

This claim builds on the information handling system of claim 7. The 2D convolutional neural network (CNN) pathway extracts spatial features and the 3D CNN pathway extracts spatiotemporal features, and the spatial features at each stage of the 2D pathway are fused with the corresponding spatiotemporal features at each stage of the 3D pathway. Spatial and spatiotemporal information is therefore combined progressively across stages rather than only at the final layer. The claim does not specify the fusion operation. This is the system counterpart of method claim 5.

Claim 12

Original Legal Text

12. The information handling system of claim 7, wherein the set of instructions are executable to provide combine the spatial features with the spatiotemporal features by concatenating a final spatial feature from the 2D convolutional neural network processing pathway with a final spatiotemporal feature from the 3D convolutional neural network processing pathway to connect to a final, fully connected layer that is used for classifying the video action of the video.

Plain English Translation

This invention relates to video action recognition systems that combine spatial and spatiotemporal features for improved classification. The system addresses the challenge of accurately identifying actions in video sequences by leveraging both spatial and temporal information. A video is processed through two parallel pathways: a 2D convolutional neural network (CNN) that extracts spatial features from individual frames and a 3D CNN that captures spatiotemporal features across multiple frames. The final spatial feature from the 2D CNN and the final spatiotemporal feature from the 3D CNN are concatenated to form a combined feature representation. This combined feature is then fed into a fully connected layer for final action classification. The system enhances recognition accuracy by integrating complementary spatial and temporal information, making it suitable for applications requiring precise action detection in video data. The invention improves upon traditional methods that rely solely on spatial or temporal features by fusing both types of information for more robust classification.

Claim 13

Original Legal Text

13. A computer program product stored in a computer readable storage medium, comprising computer instructions that, when executed by an information handling system comprising a processor and a memory, causes the system to classify video action by: receiving, by the system, a video for action analysis; loading, by the system, a plurality of processing pathways comprising a 2D convolutional neural network processing pathway and a 3D convolutional neural network processing pathway that is formed by inflating the 2D convolutional neural network processing pathway; selecting, by the system, a first video frame and a first plurality of video frames from the video; processing, by the system, the first video frame with the 2D convolutional neural network processing pathway to extract spatial features classifying the first video frame; processing, by the system, the first plurality of video frames with the 3D convolutional neural network processing pathway to extract spatiotemporal features classifying the first plurality of video frames; and combining, by the system, the spatial features with the spatiotemporal features to generate a classification label for the video action.

Plain English Translation

This invention relates to video action classification using a hybrid neural network architecture. The problem addressed is the need for accurate and efficient analysis of actions in video data, which requires capturing both spatial and temporal features. The solution involves a computer program product that implements a dual-pathway neural network system. The system includes a 2D convolutional neural network (CNN) pathway for extracting spatial features from individual video frames and a 3D CNN pathway for extracting spatiotemporal features from sequences of video frames. The 3D CNN is formed by "inflating" the 2D CNN, so it shares the 2D CNN's structure but operates on sequences of frames rather than on a single frame. The system processes a video by first selecting a single frame and a group of frames. The single frame is analyzed by the 2D CNN to extract spatial features, while the group of frames is analyzed by the 3D CNN to extract spatiotemporal features. These features are then combined to generate a classification label for the action depicted in the video.

Claim 14

Original Legal Text

14. The computer program product of claim 13, further comprising computer instructions that, when executed by the system, causes the system to initialize the 2D convolutional neural network processing pathway and the 3D convolutional neural network processing pathway with pretrained weights.

Plain English Translation

This claim builds on the computer program product of claim 13, which classifies video action using a 2D convolutional neural network (CNN) pathway and a 3D CNN pathway formed by inflating the 2D pathway. The added instructions cause the system to initialize both pathways with pretrained weights, so the network starts from previously learned features rather than from random values. The claim does not specify the source of the pretrained weights. This is the program-product counterpart of method claim 2.

Claim 15

Original Legal Text

15. The computer program product of claim 14, further comprising computer instructions that, when executed by the system, causes the system to train a final fully connected layer of the 2D convolutional neural network processing pathway and the 3D convolutional neural network processing pathway with the pretrained weights for a first specified epoch training period.

Plain English Translation

This claim builds on the computer program product of claim 14. After the 2D and 3D convolutional neural network (CNN) pathways are initialized with pretrained weights, the added instructions cause the system to train a final fully connected layer of the two pathways for a first specified epoch training period. In this first stage the classification layer adapts to the video action task, starting from the pretrained weights. This is the program-product counterpart of method claim 3.

Claim 16

Original Legal Text

16. The computer program product of claim 15, further comprising computer instructions that, when executed by the system, causes the system to train, after the first specified epoch training period, all layers of the 2D convolutional neural network processing pathway and the 3D convolutional neural network processing pathway with the pretrained weights for a second specified epoch training period.

Plain English Translation

This claim builds on the computer program product of claim 15. After the first specified epoch training period, in which a final fully connected layer is trained, the added instructions cause the system to train all layers of both the 2D and 3D convolutional neural network (CNN) pathways for a second specified epoch training period, starting from the pretrained weights. The claim does not recite freezing one pathway while training the other; the first stage covers the final fully connected layer, and the second stage covers every layer of both pathways. This is the program-product counterpart of method claim 4.

Claim 17

Original Legal Text

17. The computer program product of claim 13, where the spatial features at each stage of the 2D convolutional neural network processing pathway are fused with corresponding spatiotemporal features at each stage of the 3D convolutional neural network processing pathway.

Plain English Translation

This claim builds on the computer program product of claim 13. The 2D convolutional neural network (CNN) pathway extracts spatial features from a single video frame, and the 3D CNN pathway extracts spatiotemporal features from a group of frames. At each stage of processing, the spatial features from the 2D pathway are fused with the corresponding spatiotemporal features from the 3D pathway, so the network draws on both kinds of features at every level of abstraction rather than only at the final layer. The claim does not specify the fusion operation. This is the program-product counterpart of method claim 5.

Patent Metadata

Filing Date

May 14, 2020

Publication Date

March 15, 2022

Citation & reuse

Analysis on this page is generated by Patentable — an AI-powered patent intelligence platform. AI-generated summaries, explanations, FAQs, and analysis may be reused with attribution and a visible link back to the canonical URL below. Patent abstracts and claims are USPTO public domain.

Cite as: Patentable. “Method and system for video action classification by mixing 2D and 3D features” (US-11276249). https://patentable.app/patents/US-11276249

© 2026 Nomic Interactive Technology LLC. Machine-readable context available at /api/llm-context/US-11276249. See llms.txt for full attribution policy.