US-9813721

Layer-based video encoding

Published November 7, 2017

Assignee: not available in the USPTO data we have

Inventors: not available in the USPTO data we have

Technical Abstract

A technique for encoding a video signal generates multiple layers and multiple corresponding masks for each of a set of blocks of the video signal. Each of the layers for a given block is a rendition of that block, and each of the masks distinguishes pixels of the respective layer that are relevant in reconstructing the block from pixels that are not. The encoder applies lossy compression to each of the layers and transmits the lossily compressed layers and a set of the masks to a decoder, such that the decoder may reconstruct the respective block from the layers and the mask(s).

Patent Claims

19 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A method of encoding video signals, the method comprising: operating electronic encoding circuitry to express a portion of a video signal as a set of blocks, each of the set of blocks including a two-dimensional array of pixels of the video signal; for each current block of the set of blocks, (i) generating multiple layers, each layer including a two-dimensional array of pixels and providing a rendition of the current block, (ii) for each layer generated for the current block, generating an associated mask that identifies (a) one set of pixels that are to be used in reconstructing the current block and (b) another set of pixels that are not to be used in reconstructing the current block, and (iii) compressing each of the layers using a lossy compression procedure, and providing the compressed layers and a set of the masks for each of the set of blocks to a video decoder for reconstructing the portion of the video signal, wherein compressing each of the layers for the current block includes (i) calculating a residual block for each layer generated for the current block, each residual block representing a difference between the current block and a prediction of the current block, and (ii) applying the lossy compression procedure to each residual block, wherein the mask generated for each layer is a pixel-wise, 1-bit mask that has a first value for each pixel of the respective layer to be used in reconstructing the current block and a second value for each pixel of the respective layer not to be used in reconstructing the current block, wherein, for each residual block, applying the lossy compression procedure to the residual block includes performing a DCT (Discrete Cosine Transform) operation on the residual block, wherein the DCT operation (i) receives, as input, the residual block and the mask for that residual block and (ii) generates, as output, a set of DCT coefficients that are based on both the residual block and the mask, wherein performing the DCT 
operation on each residual block includes performing multiple radix-2 butterfly operations, each radix-2 butterfly operation receiving a pair of inputs and generating a pair of outputs, wherein, for one of the radix-2 butterfly operations, the pair of inputs represents a pair of pixels of the residual block, and wherein, when generating the pair of outputs from the pair of pixels, the radix-2 butterfly operation performs the acts of: (i) detecting that both of the pair of pixels are masked pixels, and (ii) in response to detecting, providing zeros for both of the pair of outputs.

Plain English Translation

A video encoding method divides a video signal into blocks of pixels. For each block, it generates multiple "layers" representing the block, and a mask for each layer. The mask identifies which pixels in the layer should be used to reconstruct the block and which should not. Each layer is compressed using lossy compression. The compressed layers and masks are sent to a video decoder. The lossy compression calculates a residual block (the difference between the original block and a prediction of it) for each layer. The mask is a 1-bit per pixel mask. Lossy compression uses DCT (Discrete Cosine Transform) on the residual block, using both the residual block and its mask as input to generate DCT coefficients. The DCT includes radix-2 butterfly operations. If a butterfly operation receives two masked pixels as input, it outputs zeros.

Claim 2

Original Legal Text

2. The method of claim 1 , wherein providing the set of masks to the video decoder includes transmitting the set of masks to the video decoder in a losslessly compressed form.

Plain English Translation

The video encoding method described previously sends the set of masks to the video decoder in a losslessly compressed form. The mask data is compressed to reduce its size for transmission, but the compression loses no information, so the decoder receives exactly the mask values that the encoder generated (unlike the layer data, which *is* lossily compressed).

Claim 3

Original Legal Text

3. The method of claim 2 , wherein the set of masks includes N masks, wherein providing the set of masks to the video decoder includes, for each current block, transmitting N−1 masks, wherein an N-th mask that is not transmitted to the video decoder for the current block is computable from the masks that are transmitted to the video decoder.

Plain English Translation

The video encoding method previously described, where masks are sent to the decoder in losslessly compressed form, transmits N-1 masks for each block, instead of all N masks. The N-th mask can be calculated by the decoder using the other N-1 masks that were transmitted. This reduces the amount of data transmitted because the decoder can infer the missing mask.

Claim 4

Original Legal Text

4. The method of claim 3 , wherein the mask that is not transmitted to the video decoder for the current block is computable by performing a pixel-wise logical NOR'ing operation on all of the masks that are transmitted to the video decoder.

Plain English Translation

In the video encoding method where one mask is not transmitted and is calculated by the decoder, the missing mask is calculated by performing a pixel-wise logical NOR operation on all the transmitted masks. This means for each pixel, if all the transmitted masks have a "0" value, the calculated mask will have a "1" value; otherwise, it will have a "0" value.
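The NOR rule from claim 4 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `derive_missing_mask` is hypothetical, masks are modeled as nested lists of 0/1, and a value of 1 is assumed to mean "use this pixel".

```python
from typing import List

Mask = List[List[int]]  # 1-bit mask per pixel; 1 is assumed to mean "use this pixel"


def derive_missing_mask(transmitted: List[Mask]) -> Mask:
    """Compute the untransmitted mask as the pixel-wise NOR of the transmitted ones."""
    rows, cols = len(transmitted[0]), len(transmitted[0][0])
    return [
        # NOR: the result is 1 only where every transmitted mask is 0.
        [0 if any(m[r][c] for m in transmitted) else 1 for c in range(cols)]
        for r in range(rows)
    ]


# Three layers (N = 3): two masks are transmitted, the third is derived.
mask_1 = [[1, 0, 0], [0, 0, 0]]
mask_2 = [[0, 1, 0], [0, 0, 1]]
mask_3 = derive_missing_mask([mask_1, mask_2])
```

The derivation only works if the masks partition the block, that is, if each pixel is marked "use" in exactly one of the N masks. That property is an assumption of this sketch.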

Claim 5

Original Legal Text

5. The method of claim 1 , wherein applying the lossy compression procedure to each residual block includes applying the mask for that residual block as input to the lossy compression procedure to generate a compressed residual block that is based on both the residual block and the mask.

Plain English Translation

In the described video encoding method, where lossy compression is applied to a residual block, the mask for that residual block is used as input to the lossy compression process. This allows the compression algorithm to take the mask into account when compressing the residual block, potentially improving compression efficiency and visual quality by prioritizing unmasked pixels. The compressed residual block therefore depends on both the original residual pixel data and the mask.

Claim 6

Original Legal Text

6. The method of claim 1 , wherein performing the DCT operation on each residual block includes: identifying a set of masked pixels, for which the mask has the second value; and substituting, in place of values of the set of masked pixels in the residual block, a set of alternative pixel values as inputs to the DCT operation, wherein computing the set of DCT coefficients is based in part on the set of alternative pixel values and results in DCT coefficients that are more compressible by entropy encoding than those that would be produced by performing the DCT operation based on the values of the set of masked pixels.

Plain English Translation

In the video encoding method performing DCT on residual blocks, masked pixels (those with a "second value" in the mask, meaning they're not to be used in reconstruction) are replaced with alternative pixel values *before* the DCT operation. These replacement values are chosen to make the DCT coefficients more compressible by entropy encoding. So instead of performing DCT on the original residual block with some pixels masked out, the masked pixels are *substituted* with other values specifically to improve compression.
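To see why substitution can help, consider the following sketch on a single row of residual values. The claim does not say how the alternative values are chosen; filling masked pixels with the mean of the unmasked pixels is an assumption made here purely for illustration, and `dct_ii`, `fill_masked`, and `count_significant` are hypothetical helper names.

```python
import math
from typing import List


def dct_ii(x: List[float]) -> List[float]:
    """Naive, unnormalized 1-D DCT-II, used only for illustration."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def fill_masked(row: List[float], mask: List[int]) -> List[float]:
    """Replace masked pixels (mask == 0) with the mean of the unmasked ones.

    The mean fill is an assumed strategy, not one stated in the claim.
    """
    kept = [v for v, m in zip(row, mask) if m == 1]
    fill = sum(kept) / len(kept) if kept else 0.0
    return [v if m == 1 else fill for v, m in zip(row, mask)]


def count_significant(coeffs: List[float], eps: float = 1e-9) -> int:
    """Count coefficients that are not numerically zero."""
    return sum(1 for c in coeffs if abs(c) > eps)


residual_row = [5.0, 5.0, 200.0, 5.0]  # the third pixel is masked
mask_row = [1, 1, 0, 1]

plain = dct_ii(residual_row)
filled = dct_ii(fill_masked(residual_row, mask_row))
```

In this example the masked outlier spreads energy across every coefficient of `plain`, while `filled` keeps only the DC term. Since the decoder discards masked pixels anyway, the substituted value does not affect the reconstructed block.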

Claim 7

Original Legal Text

7. The method of claim 1 , wherein, for another of the radix-2 butterfly operations, the pair of inputs includes a first input representing a first pixel of the residual block and a second input representing a second pixel of the residual block, and wherein, when computing the pair of outputs from the pair of inputs, the radix-2 butterfly operation performs the acts of: (i) detecting that the first pixel is an unmasked pixel and that the second pixel is a masked pixel; and (ii) in response to detecting, generating the pair of outputs using a value of the first pixel for both the first input and the second input.

Plain English Translation

Within the radix-2 butterfly operations used during the DCT in the described video encoding method, if one input pixel is unmasked and the other is masked, the butterfly operation uses the value of the unmasked pixel for *both* inputs when calculating the outputs. This effectively duplicates the unmasked pixel's value in the calculation.
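The mask-aware butterfly behavior in claims 1 and 7 can be sketched as follows. The sum/difference form of the butterfly (no scaling or twiddle factors), the function name `masked_butterfly`, and the handling of the mirrored case (first pixel masked, second unmasked) are assumptions of this sketch; the claims only spell out the both-masked case and the first-unmasked, second-masked case.

```python
from typing import Tuple


def masked_butterfly(
    a: float, b: float, a_used: bool, b_used: bool
) -> Tuple[float, float]:
    """Radix-2 butterfly (sum, difference) that takes pixel masks into account."""
    if not a_used and not b_used:
        # Claim 1: both pixels masked, so both outputs are zero.
        return 0.0, 0.0
    if a_used and not b_used:
        # Claim 7: reuse the unmasked value for both inputs.
        b = a
    elif b_used and not a_used:
        # Mirrored case, assumed here to be handled symmetrically.
        a = b
    return a + b, a - b


both_masked = masked_butterfly(3.0, 7.0, False, False)
one_masked = masked_butterfly(3.0, 7.0, True, False)
none_masked = masked_butterfly(3.0, 7.0, True, True)
```

When one input is masked, the difference output becomes zero, so the masked pixel contributes no high-frequency energy to later stages of the transform.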

Claim 8

Original Legal Text

8. The method of claim 2 , wherein the method further comprises providing a merge mode to the video decoder for any of the set of blocks, the merge mode directing the video decoder to combine layers when reconstructing the respective block using one of (i) a selection process in which values of one layer replace values of another layer or (ii) a blending process in which values of multiple layers are combined.

Plain English Translation

The video encoding method (where masks are sent losslessly) provides a "merge mode" to the video decoder for any of the blocks. The merge mode directs the decoder how to combine the layers of a block during reconstruction: either by a selection process, in which pixel values of one layer replace those of another, or by a blending process, in which the pixel values of multiple layers are combined to produce the final reconstructed block.
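A decoder-side sketch of the two merge modes might look like this. The mode names `"select"` and `"blend"`, the function `merge_layers`, the equal-weight average, and the use of the overlay layer's mask to decide where merging applies are all assumptions for illustration; the claim does not fix any of them.

```python
from typing import List

Block = List[List[float]]
Mask = List[List[int]]  # 1 is assumed to mean "use this pixel"


def merge_layers(base: Block, overlay: Block, overlay_mask: Mask, mode: str) -> Block:
    """Combine two decoded layers where the overlay's mask is set."""
    out = [row[:] for row in base]
    for r, mask_row in enumerate(overlay_mask):
        for c, used in enumerate(mask_row):
            if used != 1:
                continue
            if mode == "select":
                # Selection: the overlay value replaces the base value.
                out[r][c] = overlay[r][c]
            elif mode == "blend":
                # Blending: equal-weight average (the weights are an assumption).
                out[r][c] = (base[r][c] + overlay[r][c]) / 2
            else:
                raise ValueError(f"unknown merge mode: {mode}")
    return out


base = [[0.0, 0.0]]
overlay = [[10.0, 10.0]]
overlay_mask = [[1, 0]]

selected = merge_layers(base, overlay, overlay_mask, "select")
blended = merge_layers(base, overlay, overlay_mask, "blend")
```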

Claim 9

Original Legal Text

9. The method of claim 1 , wherein generating multiple layers of video data includes, for each current block: generating multiple predictions of the current block; identifying different groups of predictions from among the multiple predictions; for each group of predictions, calculating a smallest absolute difference (SAD), the SAD being a minimum difference between the current block and the respective group of predictions across all pixel locations of the current block; identifying the group for which the lowest SAD difference was calculated; and using the identified group as a source of the layers for the current block, such that each of the predictions in the identified group forms a respective one of the layers for the current block.

Plain English Translation

In the described video encoding method, generating multiple video data layers involves generating multiple predictions of the current block. The method then identifies different groups of these predictions and, for each group, calculates the smallest absolute difference (SAD) between the original block and the predictions in that group. The group with the *lowest* SAD is selected, and the predictions within this group are used as the source for the layers of the current block, with each prediction forming a separate layer.

Claim 10

Original Legal Text

10. The method of claim 9 , wherein generating multiple layers of video data further includes, for each of the predictions, calculating a map of pixel-wise differences between the current block and the respective prediction, and wherein calculating the SAD between the current block and the respective group includes: for each pixel in the current block, identifying the minimum of the pixel-wise differences across all maps of pixel-wise differences in the respective group; and summing the identified minimum pixel-wise differences across all pixel locations of the current block.

Plain English Translation

In the video encoding method that generates multiple layers by using predictions and calculating SAD, a map of pixel-wise differences is calculated between the current block and each prediction. When calculating the SAD between the current block and a group of predictions, the *minimum* pixel-wise difference across all maps within the group is identified *for each pixel*. These minimum differences are then summed across all pixel locations in the block to produce the SAD for that group.
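The group SAD from claims 9 and 10 can be sketched as below. Forming groups as all fixed-size combinations of predictions and using absolute values for the pixel-wise differences are assumptions of this sketch, and the names `diff_map`, `group_sad`, and `best_group` are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Block = List[List[int]]


def diff_map(block: Block, prediction: Block) -> Block:
    """Map of pixel-wise absolute differences between block and prediction."""
    return [
        [abs(b - p) for b, p in zip(block_row, pred_row)]
        for block_row, pred_row in zip(block, prediction)
    ]


def group_sad(block: Block, group: Sequence[Block]) -> int:
    """Sum, over all pixel locations, of the minimum difference within the group."""
    maps = [diff_map(block, p) for p in group]
    rows, cols = len(block), len(block[0])
    return sum(min(m[r][c] for m in maps) for r in range(rows) for c in range(cols))


def best_group(
    block: Block, predictions: Sequence[Block], group_size: int = 2
) -> Tuple[Tuple[int, ...], int]:
    """Return (indices of the group with the lowest SAD, that SAD)."""
    scored = [
        (idx, group_sad(block, [predictions[i] for i in idx]))
        for idx in combinations(range(len(predictions)), group_size)
    ]
    return min(scored, key=lambda item: item[1])


block = [[10, 20], [30, 40]]
predictions = [
    [[10, 20], [0, 0]],    # exact on the top row only
    [[0, 0], [30, 40]],    # exact on the bottom row only
    [[11, 21], [31, 41]],  # off by one everywhere
]
best_indices, best_sad = best_group(block, predictions)
```

Here predictions 0 and 1 are each poor on their own but complement each other, so their group reaches a SAD of 0 and beats any group that contains the uniformly close prediction 2.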

Claim 11

Original Legal Text

11. The method of claim 10 , further comprising, for each pixel location of the current block, identifying a best prediction, the best prediction being the prediction that resulted in the minimum pixel-wise difference identified across all maps of pixel-wise differences in the identified group with the lowest SAD, wherein generating the mask associated with each layer of the current block includes: setting the mask to the first value at all pixel locations for which the prediction on which the respective layer is based produced the best prediction; and setting the mask to the second value at all pixel locations for which the prediction on which the respective layer is based produced a worse prediction than does another prediction.

Plain English Translation

Building upon the described SAD-based layer generation, the video encoding method also identifies the "best prediction" for each pixel location, which is the prediction that resulted in the minimum pixel-wise difference. The mask for each layer is generated by setting it to a "first value" (meaning "use this pixel") at locations where the layer's prediction was the "best prediction" and to a "second value" (meaning "don't use this pixel") where it was worse than another prediction.
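Mask generation from the per-pixel best prediction can be sketched as follows. The function name `masks_from_best_prediction` is hypothetical, 1 stands for the "first value" and 0 for the "second value", and ties are broken in favor of the earlier prediction, which is an assumption the claim does not address.

```python
from typing import List, Sequence

Block = List[List[int]]
Mask = List[List[int]]


def masks_from_best_prediction(block: Block, group: Sequence[Block]) -> List[Mask]:
    """Build one 1-bit mask per layer, set where that layer's prediction is best."""
    rows, cols = len(block), len(block[0])
    masks = [[[0] * cols for _ in range(rows)] for _ in group]
    for r in range(rows):
        for c in range(cols):
            diffs = [abs(block[r][c] - p[r][c]) for p in group]
            # index() returns the first minimum, so ties go to the earlier layer.
            best = diffs.index(min(diffs))
            masks[best][r][c] = 1
    return masks


block = [[10, 20], [30, 40]]
group = [
    [[10, 20], [0, 0]],   # best on the top row
    [[0, 0], [30, 40]],   # best on the bottom row
]
layer_masks = masks_from_best_prediction(block, group)
```

Because each pixel is assigned to exactly one layer, the resulting masks partition the block.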

Claim 12

Original Legal Text

12. The method of claim 9 , wherein generating the multiple predictions includes generating both intra-frame predictions and inter-frame predictions.

Plain English Translation

In the video encoding method that generates multiple predictions as candidate layers, the predictions include both intra-frame predictions (based on data within the current frame) and inter-frame predictions (based on data from other frames). Layer generation can therefore exploit both spatial and temporal redundancy.

Claim 13

Original Legal Text

13. The method of claim 1 , wherein generating multiple layers of video data includes, for each current block, providing the current block as both a first layer and a second layer, and wherein generating the mask associated with each layer of the current block includes: identifying a first set of pixels in the current block that each have an inter-frame motion vector that falls within a first range; identifying a second set of pixels in the current block that each have an inter-frame motion vector that falls within a second range, the second range not overlapping with the first range; setting the mask associated with the first layer to the first value at pixel locations of all of the first set of pixels and to the second value at pixel locations of all other pixels in the current block; and setting the mask associated with the second layer to the first value at pixel locations of all of the second set of pixels and to the second value at pixel locations of all other pixels in the current block.

Plain English Translation

In this video encoding method, two layers are created for each block using the original block data as both the first and second layers. The mask for the first layer is set to "use" for pixels with motion vectors falling within a first range, and "don't use" for all other pixels. The mask for the second layer is set to "use" for pixels with motion vectors falling within a *different*, non-overlapping second range, and "don't use" for all other pixels. The masks separate the layers based on motion vector characteristics.
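A minimal sketch of the motion-vector-based masks follows. Defining each range as a half-open interval on the motion vector magnitude is an assumption of this sketch (the claim only requires two non-overlapping ranges), and the names `in_range` and `masks_from_motion` are hypothetical.

```python
import math
from typing import List, Tuple

Mask = List[List[int]]
MotionField = List[List[Tuple[float, float]]]  # per-pixel (dx, dy)
Range = Tuple[float, float]                    # half-open interval [low, high)


def in_range(magnitude: float, rng: Range) -> bool:
    low, high = rng
    return low <= magnitude < high


def masks_from_motion(
    motion: MotionField, first_range: Range, second_range: Range
) -> Tuple[Mask, Mask]:
    """Build the two layer masks from per-pixel motion vector magnitudes."""
    magnitudes = [[math.hypot(dx, dy) for dx, dy in row] for row in motion]
    first = [[1 if in_range(m, first_range) else 0 for m in row] for row in magnitudes]
    second = [[1 if in_range(m, second_range) else 0 for m in row] for row in magnitudes]
    return first, second


# Left column is static; right column is moving.
motion = [
    [(0.0, 0.0), (3.0, 4.0)],
    [(0.0, 0.0), (1.0, 0.0)],
]
first_mask, second_mask = masks_from_motion(
    motion, first_range=(0.0, 0.5), second_range=(0.5, math.inf)
)
```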

Claim 14

Original Legal Text

14. The method of claim 13 , wherein the first set of pixels represents a static overlay in a foreground of the current block and the second set of pixels represents moving video content in a background of the current block.

Plain English Translation

Building on the previous description, the first set of pixels (with motion vectors in the first range) represent a *static* overlay in the foreground of the current block. The second set of pixels (with motion vectors in the second, non-overlapping range) represent *moving* video content in the background of the current block. The two layers essentially separate static foreground elements from dynamic background elements.

Claim 15

Original Legal Text

15. The method of claim 1 , wherein generating multiple layers of video data includes, for each current block, providing the current block as both a first layer and a second layer, and wherein generating the mask associated with each layer of the current block includes: distinguishing a set of foreground pixels from a set of background pixels using edge detection; setting the mask associated with the first layer to the first value at pixel locations of all of the set of foreground pixels and to the second value at pixel locations of all of the other pixels; and setting the mask associated with the second layer to the first value at pixel locations of all of the set of background pixels and to the second value at pixel locations of all of the other pixels.

Plain English Translation

This video encoding method creates two layers for each block, both containing the original block data. It uses edge detection to distinguish foreground and background pixels. The mask for the first layer is set to "use" for foreground pixels and "don't use" for background pixels. The mask for the second layer is set to "use" for background pixels and "don't use" for foreground pixels, effectively separating the foreground and background into different layers.

Claim 16

Original Legal Text

16. The method of claim 15 , further comprising directing the video decoder to preserve anti-aliased content around detected edges by blending the first layer with the second layer in vicinities of the detected edges.

Plain English Translation

The described video encoding method, which uses edge detection to separate foreground and background layers, also directs the video decoder to *blend* the first layer with the second layer *near the detected edges* to preserve anti-aliased content. This avoids harsh transitions and maintains smoother edges in the reconstructed video. The blending applies only in the vicinity of edges, not across the whole block.

Claim 17

Original Legal Text

17. The method of claim 1 , wherein generating multiple layers of video data includes, for each current block, providing the current block as both a first layer and a second layer, and wherein generating the mask associated with each layer of the current block includes: distinguishing a first set of pixels having a first color from a second set of pixels having a second color; setting the mask associated with the first layer to the first value at locations of all of the first set of pixels and to the second value at pixel locations of all other pixels; and setting the mask associated with the second layer to the first value at pixel locations of all of the second set of pixels and to the second value at pixel locations of all other pixels.

Plain English Translation

This video encoding method creates two layers for each block, each containing the original block data. It distinguishes between pixels of a first color and pixels of a second color. The mask for the first layer is set to "use" for pixels of the first color and "don't use" otherwise. The mask for the second layer is set to "use" for pixels of the second color and "don't use" otherwise, separating the block into layers based on color.
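The color-based split can be sketched as below, assuming pixels are RGB tuples and that a pixel belongs to a color set only on an exact match. Both are simplifications chosen for illustration; a practical encoder would likely tolerate some color variation. The name `masks_from_colors` is hypothetical.

```python
from typing import List, Tuple

Color = Tuple[int, int, int]
Mask = List[List[int]]


def masks_from_colors(
    block: List[List[Color]], first_color: Color, second_color: Color
) -> Tuple[Mask, Mask]:
    """Mark pixels matching each color with 1 in the corresponding layer mask."""
    first = [[1 if px == first_color else 0 for px in row] for row in block]
    second = [[1 if px == second_color else 0 for px in row] for row in block]
    return first, second


WHITE, BLACK = (255, 255, 255), (0, 0, 0)
text_block = [
    [WHITE, BLACK],
    [BLACK, WHITE],
]
white_mask, black_mask = masks_from_colors(text_block, WHITE, BLACK)
```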

Claim 18

Original Legal Text

18. An apparatus for encoding video signals, the apparatus comprising electronic encoding circuitry constructed and arranged to: express a portion of a video signal as a set of blocks, each of the set of blocks including a two-dimensional array of pixels of the video signal; for each current block of the set of blocks, (i) generate multiple layers, each layer including a two-dimensional array of pixels and providing a rendition of the current block, (ii) for each layer generated for the current block, generate an associated mask that identifies (a) one set of pixels that are to be used in reconstructing the current block and (b) another set of pixels that are not to be used in reconstructing the current block, and (iii) compress each of the layers using a lossy compression procedure, and provide the compressed layers and a set of the masks for each of the set of blocks to a video decoder for reconstructing the portion of the video signals, wherein the electronic encoding circuitry constructed and arranged to compress each of the layers for the current block is further constructed and arranged to (i) calculate a residual block for each layer generated for the current block, each residual block representing a difference between the current block and a prediction of the current block, and (ii) applying the lossy compression procedure to each residual block, wherein the mask generated for each layer is a pixel-wise, 1-bit mask that has a first value for each pixel of the respective layer to be used in reconstructing the current block and a second value for each pixel of the respective layer not to be used in reconstructing the current block, wherein, for each residual block, the electronic encoding circuitry constructed and arranged to apply the lossy compression procedure to the residual block is further constructed and arranged to perform a DCT (Discrete Cosine Transform) operation on the residual block, wherein the DCT operation is configured to (i) receive, as 
input, the residual block and the mask for that residual block and (ii) generate, as output, a set of DCT coefficients that are based on both the residual block and the mask, wherein the DCT operation is further configured to perform multiple radix-2 butterfly operations on each residual block, each radix-2 butterfly operation receiving a pair of inputs and generating a pair of outputs, wherein, for one of the radix-2 butterfly operations, the pair of inputs represents a pair of pixels of the residual block, and wherein, when configured to generate the pair of outputs from the pair of pixels, the radix-2 butterfly operation is further configured to: (i) detect that both of the pair of pixels are masked pixels, and (ii) in response to detecting, provide zeros for both of the pair of outputs.

Plain English Translation

An apparatus (hardware system) for encoding video signals includes electronic circuitry that performs the same video encoding steps as described in method claim 1. It divides the video into blocks, generates multiple layers and masks for each block, compresses the layers using lossy compression, and provides the compressed data to a video decoder. Specifically, the circuitry calculates residual blocks, uses 1-bit masks, and performs DCT (Discrete Cosine Transform) operations that leverage the masks. If a radix-2 butterfly operation in the DCT receives two masked pixels, it outputs zeros.

Claim 19

Original Legal Text

19. A non-transitory, computer-readable medium including instructions which, when executed by electronic encoding circuitry, cause the electronic encoding circuitry to perform a method for encoding video signals, the method comprising: operating electronic encoding circuitry to express a portion of a video signal as a set of blocks, each of the set of blocks including a two-dimensional array of pixels of the video signal; for each current block of the set of blocks, (i) generating multiple layers, each layer including a two-dimensional array of pixels and providing a rendition of the current block, (ii) for each layer generated for the current block, generating an associated mask that identifies (a) one set of pixels that are to be used in reconstructing the current block and (b) another set of pixels that are not to be used in reconstructing the current block, and (iii) compressing each of the layers using a lossy compression procedure, and providing the compressed layers and a set of the masks for each of the set of blocks to a video decoder for reconstructing the portion of the video signal, wherein compressing each of the layers for the current block includes (i) calculating a residual block for each layer generated for the current block, each residual block representing a difference between the current block and a prediction of the current block, and (ii) applying the lossy compression procedure to each residual block, wherein the mask generated for each layer is a pixel-wise, 1-bit mask that has a first value for each pixel of the respective layer to be used in reconstructing the current block and a second value for each pixel of the respective layer not to be used in reconstructing the current block, wherein, for each residual block, applying the lossy compression procedure to the residual block includes performing a DCT (Discrete Cosine Transform) operation on the residual block, wherein the DCT operation (i) receives, as input, the residual block and the 
mask for that residual block and (ii) generates, as output, a set of DCT coefficients that are based on both the residual block and the mask, wherein performing the DCT operation on each residual block includes performing multiple radix-2 butterfly operations, each radix-2 butterfly operation receiving a pair of inputs and generating a pair of outputs, wherein, for one of the radix-2 butterfly operations, the pair of inputs represents a pair of pixels of the residual block, and wherein, when generating the pair of outputs from the pair of pixels, the radix-2 butterfly operation performs the acts of: (i) detecting that both of the pair of pixels are masked pixels, and (ii) in response to detecting, providing zeros for both of the pair of outputs.

Plain English Translation

A non-transitory computer-readable medium (like a flash drive or hard drive) stores instructions that, when executed by electronic encoding circuitry, cause the circuitry to perform the video encoding method described in claim 1. The instructions cause the circuitry to divide video into blocks, create multiple layers and masks, compress the layers using lossy compression based on a DCT, and provide the compressed layers and a set of the masks to a decoder. The process uses 1-bit masks and radix-2 butterfly operations that output zeros when both input pixels are masked.

Classification Codes (CPC)

Cooperative Patent Classification codes for this invention.

H04N

Patent Metadata

Filing Date

November 20, 2014

Publication Date

November 7, 2017
