Citation
SNR scalability by coefficient refinement for hybrid video coding

Material Information

Title:
SNR scalability by coefficient refinement for hybrid video coding
Creator:
Rao, Kaustubh M
Publication Date:
2007
Language:
English
Physical Description:
xii, 117 leaves ; 28 cm

Subjects

Subjects / Keywords:
Video compression ( lcsh )
Image processing -- Digital techniques ( lcsh )
Image processing -- Digital techniques ( fast )
Video compression ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Bibliography:
Includes bibliographical references (leaves 115-117).
General Note:
Department of Electrical Engineering
Statement of Responsibility:
by Kaustubh M. Rao.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
259272735 ( OCLC )
ocn259272735
Classification:
LD1193.E54 2007m R36 ( lcc )

Full Text
SNR SCALABILITY BY COEFFICIENT REFINEMENT
FOR HYBRID VIDEO CODING
By
Kaustubh M. Rao
B.S., Visveswaraiah Technological University, India 2004
A thesis submitted to the
University of Colorado Denver
in partial fulfillment
of the requirements for the degree of
Master of Science in Electrical Engineering
2007


This thesis for the Master of Science
degree by
Kaustubh M. Rao
has been approved
by
Dr. Tim Lei
Dr. Renjeng Su


Rao, Kaustubh M. (M.S., Electrical Engineering)
SNR Scalability by Coefficient Refinement for Hybrid Video Coding
Thesis directed by Associate Professor Dr. Joseph Beaini
ABSTRACT
The ability of a system to display imagery at various picture-quality levels is
known as SNR scalability. This feature is already in use for video streaming in multicast
and unicast applications: depending on the bit-rate of the various heterogeneous networks,
the communication framework can choose a signal representation and display the imagery
at the bit rate of the compressed stream.
In the present work, video, which is a 3-dimensional signal with the spatial axes x
and y and the time axis t, is divided and processed in the form of blocks, or more precisely
macroblocks (MBs) of 16 x 16 pixels each. Video coding that follows a DPCM/DCT coding
scheme is termed hybrid video coding. Hybrid video coding uses motion estimation and
compensation in an effort to reduce the redundancy between the pixels of one particular
frame and those of previous or subsequent frames.

Coefficient refinement corresponds to changing the Quantization Parameter (QP)
and coding the MBs accordingly. Separate QPs are used for the base layer and the
enhancement layer, yielding two separate layers of video bit-streams. The aim of a hybrid
video codec is to minimize the prediction-error energy, in other words the energy of the
transform coefficients, by removing the temporal and the spatial redundancies; the
enhancement layer refines the coefficients by coding, in the transform domain, the
difference left between the two quantizers.

This research reuses the same side information, that is, the header information, the
coding-mode information and the motion-vector information computed for one layer, and
replicates that information for the other layers as well. This corresponds to a considerable
saving in bit-rate while maintaining more or less the same visual quality.


DEDICATION
To my family and friends...
Kaustubh M. Rao


ACKNOWLEDGEMENT
I would like to express my deepest and most sincere thanks to Dr. Joseph Beaini at the
University of Colorado at Denver and Health Sciences Center. His enthusiasm and
energy inspired the present work, which would not have been possible without him.
The discussions with him helped me dig deeper to achieve what I was looking for.
I am deeply indebted to my family for their incredible support and encouragement
from start to finish; they never failed to show their support in every way.
I would also like to thank the Department of Electrical Engineering for allowing me
to proceed with my topic of interest in this research.
And finally, big thanks to all my friends who have been with me and seen both the
best and the worst of me.


TABLE OF CONTENTS
Figures..................................................................x
Tables.................................................................xii
Chapter
1. Video Compression.....................................................1
1.1 History............................................................3
1.2 Principle of Compression systems...................................5
1.3 Video Codec.......................................................12
1.3.1 DPCM Model........................................................13
1.3.2 DCT Model.........................................................20
2. Overview of Standards: MPEG-2 to H.264 ..............................23
2.1 MPEG-2............................................................23
2.1.1 MPEG-2 Systems....................................................23
2.1.2 MPEG-2 Profiles and levels........................................28
2.1.2.1 Non-Scalable coding modes.......................................30
2.1.2.2 Scalability.....................................................31
2.2 MPEG-4..........................................................38
2.2.1 Simple Profile....................................................41
2.2.1.1 Coding Efficiency Tools.........................................46
2.2.1.2 Transmission Efficiency Tools...................................47


2.2.2 Scalable Video coding.....................................................50
2.2.2.1 Spatial Scalability.....................................................51
2.2.2.2 Temporal Scalability...................................................52
2.2.2.3 Fine Granular Scalability...............................................53
2.3 H.264 ....................................................................57
2.3.1 Baseline Profile........................................................61
2.3.1.1 I-slice.................................................................61
2.3.1.2 P-Slice.................................................................63
2.3.1.3 Context Adaptive Variable Length Coding.................................68
2.3.1.4 ASO and Redundant Slices................................................68
2.3.2 Main Profile..............................................................71
2.3.2.1 B-slices................................................................71
2.3.2.2 Weighted Prediction.....................................................72
2.3.2.3 CABAC...................................................................72
2.3.2.4 Support for Interlace pictures..........................................76
2.3.3 Extended Main Profile.....................................................77
2.3.3.1 Data Partitioning........................................................77
2.3.3.2 SP and SI slices........................................................78
3. SNR Scalability by Hybrid Coefficient Refinement............................81
3.1 Algorithm.................................................................83
3.1.1 Video Coding using Lagrangian optimization................................92


3.2 Simulation....................................................99
3.3 Results......................................................100
Appendices:
Appendix-A: Distortion measures..................................113
Bibliography.....................................................115


LIST OF FIGURES
Figure
1.1(a) Lossless predictive codec............................................7
1.1 (b) Lossy predictive codec..............................................8
1.2 Concept of Intra and Inter pictures....................................10
1.3 Optic Flow Axis........................................................11
1.4 Video Codec model......................................................12
1.5 Block matching by moving blocks over the respective search area........16
1.6 Principle of gradient matching.........................................17
2.1 Transport Stream Details...............................................25
2.2 Regeneration of clock at the decoder...................................27
2.3 Block Diagram of two layer SNR Scalable encoder based on DCT...........34
2.4 Two layer SNR Scalable encoder with drift at base layer................36
2.5 Two Layer SNR Scalable encoder with no drift at base and the enhancement
layer............................................................37
2.6 Tools & objects for coding of rectangular frames......................41
2.7 I-VOP encoding and decoding stages....................................42
2.8 P-VOP encoding and decoding stages....................................43
2.9 One or Four vectors per MB............................................46
2.10 Video Packet Structure ...............................................48
2.11 Scalable Coding- a general concept...................................50


2.12(a) Temporal enhancement of P-VOP prediction options.....................52
2.12(b) Temporal enhancement of B-VOP prediction options.....................53
2.13 FGS Encoder structure...................................................54
2.14 FGS Decoder.............................................................54
2.15 H.264 Encoder...........................................................57
2.16 Comparison of MPEG-2 and H.264/AVC Coding layers.......................59
2.17 Profiles in H.264 and sequence of NAL units............................61
2.18 4 x 4 luma prediction modes.............................................62
2.19 MB Partitions for Tree structured MC....................................64
2.20 Interpolation of luma half pel positions................................66
2.21 Interpolation of luma samples for quarter pel positions.................67
2.22 Slice groups: interleaved and dispersed.................................70
2.23 Slice group: Foreground and Background..................................70
2.24 Sub ranges of symbol probabilities......................................73
2.25 Decoding mechanism in CABAC.............................................76
2.26 Switching streams between I and P slices................................79
3.1 SNR Scalable encoder.....................................................85
3.2 SNR Scalable decoder.....................................................86
3.3 Graphical interpretation of RD curve....................................90


LIST OF TABLES
Table
2.1 MPEG-2 Profiles.........................................................29
2.2 MPEG-2 Levels...........................................................29
2.3 Allowed Combinations of levels/profiles of MPEG-2.......................29
2.4 Applications of SNR Scalability.........................................31
2.5 Applications of Spatial Scalability.....................................32
2.6 Applications of Temporal Scalability....................................32
2.7 MPEG-4 Profiles for natural video.......................................40
2.8 MPEG-4 Profiles for synthetic or hybrid video...........................40
2.9 DC scalar values depending on QP range..................................44
2.10 Nine modes of operation for a 4 x 4 set of pixels......................63
2.11 Four modes of operation for a 16 x 16 pixel MB.........................63
2.12 Mapping of MB to a slice group.........................................69
2.13 Probability of symbols and sub-range allocated to it...................73
2.14 Encoding procedure for a vector sequence..............................74
2.15 Decoding procedure for a vector sequence..............................75


1. Introduction to Video Compression
A pixel or a pel or a picture element is considered to be the basic building unit for
an image or a video. A digital image is composed of a finite number of elements
each of which has a particular location and value. These elements are known as
pixels. [1]
An image is a 2-dimensional function defined as f(x,y) where x and y are the spatial
coordinates and the function f(x,y) denotes the gray level or the intensity at that
particular point. A video signal, in turn, is a 3-dimensional signal where, along with the
spatial co-ordinates (x, y), we have the time co-ordinate as well. Thus, a video signal is
denoted f(x, y, t), which means that a video signal can be imagined as a succession of
images along a time axis.
The analog video signal generated by the camera is captured as an RGB signal, i.e. in
the form of the primary colors Red, Green and Blue. The color components give rise to
different color spaces accommodating different standards: NTSC, PAL and SECAM.
In the PAL system, the color space is YUV, where Y represents the luminance and U
and V represent the chrominance components. The YUV color space can be generated
from the gamma-corrected RGB, denoted R'G'B', by the following equations. [2]
Y = 0.299R' + 0.587G' + 0.114B'
U = -0.147R' - 0.289G' + 0.436B' = 0.492(B' - Y)                    (1.1)
V = 0.615R' - 0.515G' - 0.100B' = 0.877(R' - Y)
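As a small illustrative sketch (not part of the original text), the conversion of equation 1.1 can be written directly in Python; the function name and the test value are arbitrary.

```python
def rgb_to_yuv(r, g, b):
    """Convert gamma-corrected R'G'B' (range 0..1) to Y, U, V per equation 1.1."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)   # equivalently -0.147R' - 0.289G' + 0.436B'
    v = 0.877 * (r - y)   # equivalently  0.615R' - 0.515G' - 0.100B'
    return y, u, v

# Example: pure white has full luminance and zero chrominance.
print(rgb_to_yuv(1.0, 1.0, 1.0))   # (1.0, 0.0, 0.0) up to rounding
```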


Converting an analog video signal to a digital one involves filtering, sampling and
quantizing. The filtering is employed to avoid the aliasing effects that would otherwise
appear after sampling. The Nyquist sampling rate is twice the bandwidth, i.e. 10-11 MHz,
since the PAL system bandwidth is about 5 MHz. After sampling, the samples are
quantized to an 8-bit resolution. The need for video compression becomes clear when we
calculate the total bit rate required for one second of video.
The number of lines in a frame is 576 and each line has 720 pixels in a CCIR
601/625 video signal, with 25 frames per second. Thus, the total number of pixels
becomes,
720 x 576 x 25 = 10,368,000. [2]
The total bit-rate at an 8-bit resolution comes to 10,368,000 x 8 x 2 =
165,888,000 bits/sec (the factor of 2 accounts for the separate count of luminance and
chrominance samples). This mind-boggling number for one second of video needs to
come down, and that is where video compression comes in. Thus, we
can enumerate the advantages of compression as follows:
A smaller amount of storage is required for a given amount of source
material.
Compression reduces the bandwidth required for real time transfers between
the media and network.
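Returning to the bit-rate arithmetic above, the numbers quoted in the text can be reproduced in a few lines of Python; this is only a worked check, with the factor of 2 kept exactly as assumed there.

```python
# Uncompressed bit rate for CCIR 601/625 video as computed in the text.
width, height, fps, bit_depth = 720, 576, 25, 8
pixels_per_second = width * height * fps              # 10,368,000
bits_per_second = pixels_per_second * bit_depth * 2   # x2: luminance + chrominance samples
print(f"{bits_per_second:,} bits/s")                  # 165,888,000 bits/s
```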


1.1 History of Compression
Work on compression started with H.120, the first digital video coding standard, from
the then standardization committee CCITT (Comité Consultatif International de
Télégraphie et Téléphonie), now known as ITU-T. This standard was approved in 1984
and was originally a conditional replenishment (CR) coder with Differential Pulse Code
Modulation (DPCM), scalar quantization and variable-length coding. In 1988, the second
version of this standard introduced the concepts of motion compensation and background
prediction. Its operational bit rates were 1544 and 2048 Kbits/s.
H.261 came in as a huge success under the auspices of ITU-T, with widely used bit
rates of 80-320 Kbits/s. This standard was approved in 1991 and later revised in
1993 to accommodate target bit rates varying from 64 to 2048 Kbits/s. H.261 had
features such as an 8 x 8 DCT, scalar quantization, and two-dimensional run-level
variable-length entropy coding.
The next standard that came into the fray, and was highly successful, was the JPEG
standard, developed by the Joint Photographic Experts Group, a joint project of the
ITU-T and ISO standards organizations, and approved in 1992. The JPEG standard is
essentially the H.261 toolset applied in the INTRA coding mode, with quantizer
reconstruction and entropy encoding. It is a continuous-tone, still-picture coding
standard. JPEG also has further modes of operation: progressive coding, baseline
coding and lossless coding.


MPEG-1 (Moving Picture Experts Group) was the next successful video codec
standard, a project under ISO, approved in 1993, with a video quality level equivalent to
VHS or better and operating bit rates of about 1.5 Mbits/s (within a range of
1-2 Mbits/s). MPEG-1 brought in features like half-pixel motion accuracy and
bi-directionally predicted frames. H.261 outperforms MPEG-1 in the bit-rate range of
1 Mbits/s or below, but for bit rates above 1 Mbits/s MPEG-1 is much better.
MPEG-2 was the next successful video codec standard and now forms the heart
of broadcast-quality digital television, for both standard-definition and high-definition
television. This standard was approved in 1994, was developed as a joint effort between
the ITU-T and ISO organizations, and came with a step up in bit rate and picture quality.
MPEG-2 brought features like higher bit rates, better picture quality, efficient handling of
interlaced-scan pictures, and hierarchical scalability of bit usage. This was the first time
scalability was introduced as a profile. The target bit-rate range was 4-30 Mbits/s.
The next in the line of development was MPEG-3, which was intended for HDTV; its
video aspects were adopted within the MPEG-2 standard itself. (MPEG-3 should not be
confused with MP3, the common name for MPEG Audio Layer III.)
Due to the requirements of PSTN and mobile applications, ITU-T then started to
work on low-bit-rate applications and introduced a new video codec, named H.263, in
succession to H.261. The performance of H.263 in low-bit-rate applications is still a big
topic of research. The first version of H.263 had features such as variable block-size
motion compensation, overlapped block motion compensation, picture-extrapolating
motion vectors, three-dimensional run-level-last variable-length coding, and more
efficient header information. Its target bit rate was 10-2048 Kbits/s, and at lower bit
rates H.263 outperforms H.261. The first version of the standard was approved in
early 1996.
H.263+, the second version of H.263, was introduced with features like an improved
PB-frames mode, where a PB frame consists of one P-picture and one B-picture coded
as a single unit. The P-picture is predicted from the last decoded P-picture, and the
B-picture is predicted from both the last decoded P-picture and the P-picture currently
being decoded. Apart from the PB-frames mode, two different kinds of scans for the
transform coefficients and the use of different variable-length coding tables were
introduced. This version also gives more error resilience compared to the original H.263.
H.264/AVC (Advanced Video Coding) is the standard that ITU-T and ISO developed
together as a team; it is discussed in detail in Chapter 2.
1.2 Principle of Compression Systems
The initial bandwidth reduction techniques included interlacing, which gave a good
compression efficiency of 2:1. The main disadvantage of the interlacing technique for
digital compression is the artifacts introduced by entangling the vertical and the temporal
information, which reduce the usable vertical resolution. Also, much of the complexity
involved in MPEG-2 systems results from the handling of interlaced images.
For color images having an RGB format, the information has to be passed through
the same channels used for monochrome images, so compression was very much
essential for these images. RGB images had to be converted to the YUV format, where
Y represents the luminance and U and V the chrominance components. In this way
human visual perception is built into the compression scheme, which helps in reducing
the redundant information.
Compression essentially involves two different schemes, based on the concepts of
lossless and lossy compression. Lossless compression schemes involve the removal of
redundant information, using techniques such as Huffman coding and run-length coding.
Lossless compression alone does not help a great deal, though, because removing only
the redundant information does not change the bit-rate or bandwidth requirements very
much.
Lossy compression schemes remove the information of lesser relevance so as to achieve
the required data reduction and meet the bit-rate and bandwidth requirements. Combined
with a model of human visual perception, this gives a good compression ratio and is what
is actually used in video compression. All real program material has two components:
the novel and the redundant information. The novel component is the entropy, which is
the true information of the signal. An ideal encoder would extract all the entropy and
send only this information content to the decoder.


The decoder predicts the rest of the signal and combines it with the received entropy
to produce an exact replica of the transmitted signal. However, such a coder is idealized:
it would be very complex and would introduce a very long delay when exploiting
temporal redundancy.
In all real signals at least part of the signal is obvious from what has gone before, or in
other words can be predicted from previous samples. An efficient codec predicts the
obvious parts so that only the remainder needs to be sent.
Figure 1.1 (a) Lossless Predictive Codec [4]


Figure 1.1 (b) Lossy Predictive Codec [4]
Figure 1.1 shows a predictive codec with identical predictors in both the encoder and
the decoder. The prediction error, or residual, is sent to the decoder, where it is used to
cancel the prediction error and gives a perfect lossless replication of the source.
Figure 1.1(b) shows a predictive codec with a lossy compression feature, where the
residual is subjected to further lossy compression; this must, however, take place within
the encoder loop so that we do not encounter drift problems, as discussed later.
The MPEG compression scheme is not a single compression format but a range of
standardized coding tools which, when combined flexibly, suit a range of applications.
For this purpose MPEG-2 and MPEG-4 are divided into several profiles and levels,
where each profile can be implemented at a different level depending on the image
format of the input picture. The profiles can be understood as the sets of tools aimed at
particular applications, and the levels as the extent (image formats and sizes) to which
those applications can be implemented. The entropy coding is performed in such a way
that the coding method used is signalled in the compressed data, so that the decoder can
automatically handle whatever the coder decided to do. Both MPEG-2 and MPEG-4 use
this feature in their standards.
For a digital video format, what matters most, apart from the bit rate of the system, is
the compression ratio, or the amount of compression performed (e.g. 2:1, 5:1 and so on).
If increasing the compression factor reduces the subjective quality of the video, it is
better to reduce the bit rate by filtering and lose some resolution rather than suffer the
compression artifacts.
Video is coded in two separate forms, in terms of picture content and frame prediction:
intra and inter coding. Intra-coding (intra = within) refers to coding which exploits the
spatial redundancy, or redundancy within the picture. Inter-coding (inter = between)
refers to coding exploiting temporal redundancy. Intra coding may be used on its own,
as in JPEG, or as a part of MPEG along with inter-coding. Intra coding involves the
analysis of the spatial frequencies in an image and forms the basis of transform coding.
These transforms produce coefficients that describe the magnitude of each spatial
frequency; most of the coefficients are zero or close to zero and can be omitted, which
results in a bit-rate reduction.


Figure 1.2 Concept of Intra and Inter pictures: (a) spatial or intra-coding exploits
redundancy within a picture; (b) temporal or inter-coding exploits redundancy between
pictures [4]
Figure 1.2(a) shows that the individual pictures are coded without any reference to any
previous pictures. Inter coding relies on identifying the similarities between two
successive pictures: at the decoder end, the next picture can be created by sending only
the picture difference. These picture differences increase when objects are in motion, an
effect which can be reduced using motion compensation.
Motion compensation is the process of measuring the motion of objects from one
picture to the next, so that the coder can allow for that motion when looking for
redundancies between pictures. Objects in motion lie on an optic flow axis, whereas still
objects move only along the time axis. The optic flow axis is not necessarily parallel to
the time axis; it is defined as the locus of a moving object as it takes on various screen
positions. The advantage of the optic flow axis is obtained when the appearance of a
moving object is deformed, or the object moves into a shadow, or it rotates: if we can
locate the optic flow axis of the moving object, a significant coding gain can be expected.
Figure 1.3 Optic Flow axis [4]
In a motion-compensated coder, a reference picture is sent to the decoder and a copy is
also stored locally; the stored picture is compared with the next incoming picture to find
the motion vectors for that picture. The reference picture is then shifted according to the
motion vectors so that the prediction error, or residual, is largely cancelled out [3]. The
video frames are divided into blocks called macroblocks (MBs), and the process of
finding the optic flow axis is called motion estimation: determining which part of the
reference picture is best suited to predict the current MB based on some energy criterion,
as explained in detail in Chapter 3. Motion estimation and motion compensation are
explained in detail in the next section as part of the DPCM model.


1.3 Video Codec
The block diagram of a video codec is shown below.
Figure 1.4 Video Codec model [3]
The standard codec model uses a block-based approach where the incoming frames are
divided into blocks of 16 x 16 pixels each. The codec comprises three functional units:
a temporal model, a spatial model, and an entropy encoder.
The input to the temporal model is an uncompressed video sequence. The temporal
model attempts to reduce temporal redundancy by comparing neighboring frames and
recreating the present frame from an estimate. The residual frame is created by
subtracting the predicted frame from the present frame. Thus, the output of the temporal
model consists of a residual frame and a set of motion vectors which describe the motion
compensation. The residual frame is sent to the spatial model, which removes the spatial
redundancies by transforming the residual frame to a transform domain and quantizing
the transformed coefficients; this discards a lot of spatial-frequency information, because
most of the coefficients are near zero. The output of the spatial model is thus a set of
quantized coefficients, which are passed on to the entropy encoder, where the
transformed coefficients are encoded along with the motion vectors. This coding reduces
the statistical redundancy of the quantized transform coefficients and the motion vectors.
The above video codec is based on the DPCM/DCT model. DPCM stands for
Differential Pulse Code Modulation, a concept borrowed from telecommunication
systems in which only the first signal sample is encoded directly and every other sample
is encoded as the difference from the previous sample. The DPCM aspect of the video
codec is the predictive coding used in the form of ME/MC, or Motion Estimation/Motion
Compensation, and the DCT aspect, which stands for the Discrete Cosine Transform, is
the use of the transform domain and the entropy coding. We now discuss the DPCM/DCT
model in detail.
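As a toy illustration of the DPCM idea just described (a sketch added for clarity, not taken from the thesis), the first sample is coded as-is and every later sample as a difference from its predecessor:

```python
def dpcm_encode(samples):
    """Return the first sample followed by successive differences."""
    encoded = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        encoded.append(cur - prev)
    return encoded

def dpcm_decode(encoded):
    """Reverse the encoding by accumulating the differences."""
    decoded = [encoded[0]]
    for diff in encoded[1:]:
        decoded.append(decoded[-1] + diff)
    return decoded

signal = [100, 102, 105, 104, 104]
assert dpcm_decode(dpcm_encode(signal)) == signal   # lossless round trip
print(dpcm_encode(signal))                          # [100, 2, 3, -1, 0]
```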
1.3.1 DPCM
The process of motion estimation involves taking a previous frame (or, in some cases, a
future frame as we will see later) as a predictor frame and subtracting it from the current
frame to obtain a residual frame. The energy of this residual frame tells us how much
energy is left to code. The residual frame then undergoes motion compensation, i.e. the
motion of the objects is compensated through the use of motion vectors. We will first
discuss the various aspects of motion estimation and then talk in detail about motion
compensation.
a) Motion Estimation:
Motion Estimation forms a very important part of a video codec. There are three
different motion estimation techniques which are generally found in various
applications, namely: [4]
1. Block Matching
2. Gradient Matching
3. Phase correlation.
We now look at an overview of these three ME techniques.
Block Matching: A block of pixels from the current frame is moved over a search area
in the reference frame looking for matching pixels; once the best match is obtained, the
displacement needed to obtain it is used as the basis for a motion vector. If there is a
match over the whole of the block, the moving area must be at least as big as the block;
if, on the other hand, the edge of the moving area cuts through the block, correlation is
obtained only over the moving part of the block.
Simply put, block matching involves a tremendous amount of computation: the most
obvious approach is to compare the block of pixels against every possible position in the
reference frame. To perform the estimation more efficiently we can use hierarchical
approaches, and a very generic hierarchical method of performing motion estimation is
as follows. In the first stage of matching, the motion range covered is large but the result
is coarse; in later stages the range covered is smaller and the result more accurate. The
displacement obtained in the first stage, after heavy filtering and subsampling, is used as
the starting point for the second stage, which works at a finer level of detail, and so on;
the last stage gives the result at the required accuracy. This method, however, can only
measure motion to the nearest pixel. If more accuracy is needed, interpolators are used:
for the intended accuracy the interpolators create sub-pixel sample values, and the
residual energy is computed by subtracting these from the reference frame pixels. This
is the approach used in H.264/AVC, which uses quarter-pixel precision, whereas most
other motion estimation algorithms use half-pixel precision. A detailed study of motion
estimation algorithms based on block matching, and their comparison, is given in [18].
The following figure gives an idea of the motion estimation algorithm.


Figure 1.5 Block matching performed by moving the block of pixels to all pixel
locations within the search area, with correlation measured at each position [4]
Gradient Matching: At some point in a picture, the function of brightness with
respect to distance across the screen will have a certain slope, known as the spatial
luminance gradient. If the associated picture is moving, the slope will traverse a
fixed point on the screen and the result will be that the brightness will change with
respect to time. This is a temporal luminance gradient and the following figure
explains the principle [4]. Thus motion speed can be found using the formula given
by


Figure 1.6 Principle of gradient matching [4]
Displacement (in pixels) = Temporal luminance difference / Spatial luminance difference ..........1.2
However, this method works only if the gradient remains essentially constant over the
displacement distance, which is generally not the case in real video. The spatial gradient
can vary for different reasons, for example when the object moves into shade or when
the camera itself changes position.
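A toy numerical sketch of equation 1.2 (illustrative only, and valid only under the constant-gradient assumption just discussed; the test values are arbitrary) might look as follows:

```python
def gradient_displacement(prev_value, cur_value, spatial_gradient):
    """Equation 1.2: displacement (pixels) = temporal difference / spatial gradient.

    prev_value, cur_value: luminance at the same screen position in successive frames.
    spatial_gradient: local luminance slope (grey levels per pixel), assumed constant.
    """
    return (cur_value - prev_value) / spatial_gradient

# A ramp with slope 10 levels/pixel that brightened by 20 levels at a fixed point
# corresponds to a displacement of about 2 pixels (sign convention ignored here).
print(gradient_displacement(100, 120, 10.0))   # 2.0
```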
Phase correlation: The main principle of phase correlation is to perform a discrete
Fourier transform on two successive fields and then subtract the phases of the
corresponding spectral components. These phase differences are then subjected to a
reverse transform, which directly gives the positions corresponding to the motion
between the fields. In practical systems the phase correlation stage is followed by a
matching stage, similar to block matching, to get optimal results; however, this adds to
the complexity of the motion estimation algorithm.
Thus, this section gives us an insight into the motion estimation part of the DPCM
model. We now look at the motion compensation methods.
b) Motion Compensation: The selected best matching region in the reference
frame is subtracted from the current MB to produce a residual MB which is
encoded and transmitted together with a motion vector describing the position of
the best matching region. Often there are variations of motion estimation and
compensation where, instead of the previous frame, a future frame (in display order) is a
better match to the pixel block. If a future frame is chosen, it has to be encoded before
the present frame. Similarly, the intra/inter distinction appears here: wherever the current
frame has no suitable reference frame, we can code the current frame as an intra frame.
Also, a 16 x 16 pixel block does not always follow integer-pixel motion; it may need
half- or quarter-pixel motion compensation which, as we will see, is the case in H.264.
A better prediction is obtained when the motion compensation is performed with
quarter-pixel precision.
Another adaptation to the block size in motion compensation is that for flat and
homogeneous regions of a frame we could prefer a bigger block size whereas for
areas with heavy motions we could use smaller block sizes, thus resulting in more
accurate results in the form of decreased residual energy.


To summarize block-matching motion estimation and motion compensation, the
following steps [3] are performed on each and every block of M x N samples in the
current frame, where M is the number of rows and N the number of columns (a small
code sketch follows the list):
Search an area in the reference frame (past or future frames, previously
coded and transmitted) to find a matching M x N sample region. This is
carried out by comparing the M x N block in the current frame with some or
all of the possible M x N regions of the search region and finding the region
with the best match. A popular criterion is the energy in the residual
formed by subtracting the candidate from the current M x N block, so that
the candidate region that minimizes the residual energy is chosen as the best
match. This process of finding the best match is known as motion
estimation.
The chosen candidate region becomes the predictor for the current M x N
block and is subtracted from the current block to form a residual M x N
block (motion compensation).
The residual block is encoded and transmitted and the offset between the
current block and the position of the candidate region (motion vector) is also
transmitted.
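The sketch below condenses these steps into a minimal full-search example (an illustration, not the codec used in this thesis); it assumes integer-pixel accuracy and uses the sum of squared differences as the residual-energy criterion.

```python
import numpy as np

def full_search(current_block, reference, top, left, search_range=7):
    """Find the motion vector minimizing residual energy within +/- search_range pixels."""
    M, N = current_block.shape
    best_mv, best_energy = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + M > reference.shape[0] or x + N > reference.shape[1]:
                continue  # candidate region falls outside the reference frame
            candidate = reference[y:y + M, x:x + N].astype(np.int32)
            energy = int(np.sum((current_block.astype(np.int32) - candidate) ** 2))
            if best_energy is None or energy < best_energy:
                best_mv, best_energy = (dy, dx), energy
    return best_mv

# Toy frames: the current frame is the reference shifted, so the best match for the
# block at (16, 16) lies at offset (-2, -3) in the reference and the residual is zero.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, (64, 64), dtype=np.uint8)
current = np.roll(reference, shift=(2, 3), axis=(0, 1))
top, left = 16, 16
block = current[top:top + 16, left:left + 16]
dy, dx = full_search(block, reference, top, left)
# Motion compensation: subtract the chosen candidate region to form the residual block.
residual = block.astype(np.int32) - reference[top + dy:top + dy + 16, left + dx:left + dx + 16]
print((dy, dx), int(np.abs(residual).sum()))   # (-2, -3) 0
```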


1.3.2 DCT
The DCT aspect of a DPCM/DCT video codec is the transform domain and the
entropy coding being used. The basic principle of transform coding is to map the
pixel values into a set of linear transform coefficients, which are subsequently
quantized and encoded. By applying an inverse transformation on the decoded
transform coefficients, it is possible to reconstruct the image with some loss. It must
be noted that the loss is not due to the process of transformation and inverse
transformation, but due to quantization alone. The DCT is one such orthogonal
transform, with performance close to that of the Karhunen-Loeve Transform, which is
the optimal transform in terms of the energy packed into the retained coefficients.
Let us look at the different transforms and why DCT was preferred over all other
transforms.
The basic requirements for a transformation technique are as follows:
1) The coefficients in the transform domain should be decorrelated.
2) A small number of transform coefficients should contain most of the energy,
which then allows us to quantize and use the transform coefficients efficiently.
3) Computationally, the transform kernel should be symmetric and separable.
A number of transformation techniques are present like: DCT (Discrete Cosine
Transform), DFT (Discrete Fourier Transform), KLT (Karhunen-Loeve Transform),
DWT (Discrete Wavelet Transform), Discrete Haar Transform and Discrete


Hadamard Transform. Of all these transforms the most efficient, in terms of energy
packing, is the KLT [19]. However, the KLT has practical limitations: the transform is
image dependent, and no fast computation method exists for it (such as the butterfly
method, the FFT, for the DFT). These limitations are avoided by using the DCT or the
DFT. Of these two, the DCT emerged as the better transform because it is an orthogonal
transform consisting purely of cosine terms, whereas the DFT consists of both sine and
cosine terms. This means that the DCT uses real computations, as opposed to the
complex computations of the DFT, and it produces fewer blocking artifacts than the
DFT [19]. Also, the DCT has energy compaction properties close to those of the KLT,
making it a very popular choice of transformation technique.
The DCT transform operates on a block of N x N samples and creates Y, an N x N
block of coefficients. The action of the DCT can be explained in terms of a
transformation matrix A where the forward DCT of an N x N sample block is given
by,
Y = A X A^T ........................................................1.3
and the inverse DCT (IDCT) by
X = A^T Y A ........................................................1.4
where X is a matrix of samples, Y is a matrix of coefficients and A is an N x N
transform matrix. The elements of A are:


A_ij = C_i cos[ (2j + 1) i π / 2N ],
where C_i = √(1/N) for i = 0, and C_i = √(2/N) for i > 0 ........................1.5

Y_xy = C_x C_y Σ_{i=0}^{N-1} Σ_{j=0}^{N-1} X_ij cos[ (2j + 1) y π / 2N ] cos[ (2i + 1) x π / 2N ] ........1.6

X_ij = Σ_{x=0}^{N-1} Σ_{y=0}^{N-1} C_x C_y Y_xy cos[ (2j + 1) y π / 2N ] cos[ (2i + 1) x π / 2N ] ........1.7
Since the DCT is separable and offers a simpler implementation than the DFT, it turned
out to be the obvious choice of transformation technique for implementation in the
standards.
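A small numerical sketch of equations 1.3-1.5 (added for illustration; the block size and test data are arbitrary) builds the transform matrix A and verifies that the inverse transform recovers the block, confirming that only quantization introduces loss.

```python
import numpy as np

def dct_matrix(N=8):
    """Transform matrix A of equation 1.5: A[i, j] = C_i * cos((2j + 1) * i * pi / (2N))."""
    A = np.zeros((N, N))
    for i in range(N):
        Ci = np.sqrt(1.0 / N) if i == 0 else np.sqrt(2.0 / N)
        for j in range(N):
            A[i, j] = Ci * np.cos((2 * j + 1) * i * np.pi / (2 * N))
    return A

A = dct_matrix(8)
X = np.random.default_rng(1).integers(0, 256, (8, 8)).astype(float)   # sample block
Y = A @ X @ A.T       # forward DCT, equation 1.3
X_rec = A.T @ Y @ A   # inverse DCT, equation 1.4
print(np.allclose(X, X_rec))   # True: the transform/inverse pair itself is lossless
```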


2. Overview of Standards: MPEG-2 to H.264
ITU-T, one of the main standardization bodies, standardized the H.261 video-conferencing
standard and has since come up with further standards such as H.262, H.263 and the
more recent H.264/AVC. Similarly, ISO, as another standardization body, came up with
standards like MPEG-1, MPEG-2, MP3 and MPEG-4. The two standardization bodies,
ITU-T and ISO, also formed a common team, named the JVT (Joint Video Team), which
developed the H.264/AVC standard.
2.1 MPEG-2
MPEG-2 provided the industry's first real-time audio-visual coding of high-quality
moving pictures. The goal of MPEG-2 Systems was to include error resilience for
broadcasting and ATM networks, and it is designed to deliver multiple programs
simultaneously.
2.1.1 MPEG-2 Systems
MPEG-2 Systems defines two different streams, Program Streams and Transport
Streams, which are described as follows:
An elementary stream is an output of a single MPEG audio or video coder and is an
endless near real time signal. This elementary stream is broken down into data
blocks of manageable size forming a packetized elementary stream (PES). The data
blocks need header information to identify the start of the packet which must
include time stamps as packetizing disrupts the time axis. Program streams have


variable-sized packets with headers and find use in DVDs and optical and hard drives.
Transport streams can accommodate PES packets from several programs on a single
transport system; the PES packets are further subdivided into short fixed-size packets, so
that multiple programs encoded with different clocks can be carried. The details of a
transport stream packet are shown in the following figure [4]:


Figure 2.1 Transport Stream Details
(A transport packet is always 188 bytes long. Its header carries a sync byte, error flag,
start indicator, priority, 13-bit PID, scrambling control, adaptation-field control and a
4-bit continuity count, optionally followed by an adaptation field with PCR, OPCR,
splice countdown, private data and extension fields.)
The transport stream is based upon packets of constant size, which allows for
interleaving and error correction codes, thus easing these requirements on the higher
layers. A transport packet always begins with a header followed by a payload. The
header may be extended, in which case the payload becomes correspondingly smaller.
The header begins with a sync byte, which is a unique pattern detected by the
demultiplexer. A transport stream may contain many different elementary streams, each
identified by a unique 13-bit packet identification code, or PID, included in the header.
A demultiplexer looking for a particular elementary stream simply checks the PID of
every packet, decodes the packets matching that PID and rejects the rest. In a multiplex
there are many packets of other programs in between successive packets of a given PID.
To help the demultiplexer, the packet header contains a continuity count: a 4-bit value
which increments with each new packet having a given PID. [4]
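A minimal sketch of reading these header fields from a single transport packet is shown below (illustrative only; it assumes a well-formed 188-byte packet, and the test packet is hypothetical).

```python
def parse_ts_header(packet: bytes):
    """Extract basic fields from the 4-byte header of an MPEG-2 transport packet."""
    assert len(packet) == 188 and packet[0] == 0x47, "expected 188 bytes starting with sync byte 0x47"
    payload_unit_start = bool(packet[1] & 0x40)          # start-of-PES indicator
    pid = ((packet[1] & 0x1F) << 8) | packet[2]          # 13-bit packet identifier
    continuity_counter = packet[3] & 0x0F                # 4-bit count, increments per PID
    return pid, payload_unit_start, continuity_counter

# Hypothetical test packet: sync byte, PID 0x0100, payload start set, continuity counter 5.
pkt = bytes([0x47, 0x41, 0x00, 0x15]) + bytes(184)
print(parse_ts_header(pkt))   # (256, True, 5)
```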
The clock reference, known as the Program Clock Reference (PCR), reflects the
synchronization mechanism used in MPEG-2 systems. A decoder running from a
transport stream has to genlock to the encoder, and the transport stream provides a
mechanism that allows this to be done independently for each program, since the
programs in a multiplex need not share a common clock. In a program stream, where the
decoder needs only one clock, the corresponding mechanism is called the System Clock
Reference (SCR).
Figure 2.2 gives a clear picture of how the SCR/PCR works; the main idea is to
re-create at the decoder a 27 MHz clock which is synchronous with that of the encoder.


Figure 2.2 Regeneration of the 27 MHz clock at the decoder [4]
The figure [4] explains the synchronization mechanism by which the encoder clock is
recreated at the decoder from the PCR or the SCR. The transport stream multiplexer
samples a counter and places the state of the count in an extended packet header as a
PCR. The demultiplexer selects only the PIDs of the required program and extracts the
PCRs from the packets in which they were inserted. In a program stream, the count is
placed in a pack header as an SCR, which the decoder can identify. [4]


The PCR/SCR codes are used to control a numerically locked loop (NLL), which is
similar to a phase-locked loop (PLL) except that the comparison is made on the state of a
48-bit binary counter rather than on phase. The NLL contains a 27 MHz VCXO
(voltage-controlled crystal oscillator) which drives a 48-bit counter, as in the encoder.
The state of this counter is compared with the contents of the PCR/SCR, and the
difference is used to modify the VCXO frequency. When the loop reaches lock, the
decoder counter arrives at the same value as is contained in the PCR/SCR and no further
change of the VCXO occurs. Loop filtering is used to remove the phase disturbances
introduced by jitter during transmission of the transport stream. The lock-up time, i.e.
the time required to lock to a particular counter value when changing programs, can be
reduced if the decoder counter is jammed to the value of the first PCR received in the
new program. Once the 27 MHz clock is available at the decoder, it is divided down to
90 kHz, which drives the main time-stamp mechanism. MPEG-2 also sets limits on the
maximum amount of jitter to be accepted in a real transport stream. [4]
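The clock-recovery idea can be illustrated with a toy simulation (a simplified sketch, not the NLL specified by MPEG-2): a locally estimated 27 MHz frequency is steered toward the rate implied by the received PCR values.

```python
def recover_clock(pcr_samples, interval=0.1, alpha=0.1, initial_freq=26_999_000.0):
    """Toy clock recovery: low-pass filter the frequency implied by successive PCR values.

    A real decoder steers a 27 MHz VCXO inside a numerically locked loop; this sketch only
    shows the idea of nudging a local frequency estimate toward the encoder's clock rate.
    """
    freq = initial_freq
    for prev, cur in zip(pcr_samples, pcr_samples[1:]):
        measured = (cur - prev) / interval   # counts per second seen in the arriving PCRs
        freq += alpha * (measured - freq)    # loop filter: smooth out transmission jitter
    return freq

# Ideal encoder counter values sampled every 0.1 s at exactly 27 MHz.
encoder_counts = [i * 27_000_000 * 0.1 for i in range(100)]
print(round(recover_clock(encoder_counts)))   # approaches 27,000,000 Hz
```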
2.1.2 MPEG-2 Profiles and Levels
A profile is defined as a subset of the entire bit-stream syntax defined by the MPEG-2
specification. Within the bounds imposed by the syntax of a given profile, it is still
possible to encompass very large variations in the performance of encoders and decoders,
depending upon the values taken by the parameters in the bit-stream. For instance, it is
possible to specify frame sizes as large as 2^14 samples by 2^14 lines. In order to
regulate this, the concept of a level is introduced in the specification; levels are defined
for each profile. A level is thus defined as a set of constraints on the parameters within
the bit-stream [2]. The following tables give us an insight into the MPEG-2 profiles and
levels.
Table 2.1 MPEG-2 Profiles [2]
Profile | Typical application | Features
SIMPLE | Broadcast | No B-pictures, no scalability, 4:2:0
MAIN | DSM, Broadcast | All picture types, no scalability, 4:2:0
SNR scalable | ATM networks | Two-layer SNR coding, 4:2:0
SPATIAL scalable | HDTV | Two-layer SS coding, 4:2:0
HIGH | Special applications | Three-level hybrid coding, 4:2:0
Table 2.2 MPEG-2 Levels [2]
Level | Format | Frame rate (Hz) | Compressed data rate
LOW | SIF | 30 | < 4 Mbit/s
MAIN | CCIR-601 | 30 | < 15 Mbit/s
HIGH 1440 | 1440 x 1250 | 60 | < 60 Mbit/s
HIGH | 1920 x 1250 | 60 | < 80 Mbit/s
Table 2.3 Allowed combinations of level/profile [2]
Level | Allowed profiles
LOW | Main, SNR Scalable
MAIN | Simple, Main, SNR Scalable, High
HIGH 1440 | Main, Spatial Scalable, High
HIGH | Main, High


As seen from the above tables, MPEG-2 was the first standard to introduce the concept
of scalability.
2.1.2.1 Non-Scalable coding modes
The biggest feature of non-scalable coding within the MPEG-2 framework is the
accommodation of interlaced video coding within the standard. This gives rise to
possibilities of inter-picture prediction between fields as well as between frames as a
whole. A frame is composed of a top field and a bottom field, where the top field
consists of the odd horizontal lines of the frame and the bottom field consists of the
even horizontal lines.
The concept of B-pictures, or bi-directionally predicted pictures, is also introduced in
MPEG-2. A macroblock consists of 16 x 16 pixels, and five different prediction modes
are introduced which can be applied to P and B frames.
The coding modes are listed below:
Frame prediction for frame pictures
Field prediction for field pictures
Field prediction for frame pictures
16x8 motion compensation for field pictures
Dual Prime for P pictures.
These coding modes represent the complete coding methodology for interlaced
video in MPEG-2. A very interesting feature of MPEG-2 is the introduction of


concealment motion vectors for error resilience of video transmission, where intra
MBs carry the motion vectors. We will discuss the scalability profile in detail rather
than the other profiles, since it is directly related to the thesis work.
2.1.2.2 Scalability
The scalability profile was first introduced with the MPEG-2 standard, where scalability
meant a layered coding approach. The concept of scalability is to have a codec generate
two different bit-streams: the more important information in a layer called the base
layer, and the residual information, or the layer carrying the enhancement information,
in an enhancement layer. The base layer is to be made available under all conditions,
and when network congestion allows we can use the enhancement layer, which provides
a better video. Different types of scalability profiles are defined, as the following tables
show.
Table 2.4 Applications of SNR scalability [2]
Base layer | Enhancement layer | Application
ITU-R 601 | Same resolution and format as lower layer | Two-quality service for standard TV
High Definition | Same resolution and format as lower layer | Two-quality service for HDTV
4:2:0 High Definition | 4:2:2 chroma simulcast | Video production/distribution


Table 2.5 Applications of spatial scalability [2]
Base layer | Enhancement layer | Application
Progressive (30 Hz) | Progressive (30 Hz) | CIF/QCIF compatibility or scalability
Interlace (30 Hz) | Interlace (30 Hz) | HDTV/SDTV scalability
Progressive (30 Hz) | Interlace (30 Hz) | ISO/IEC 11172-2 compatibility with this specification
Interlace (30 Hz) | Progressive (30 Hz) | Migration to HR progressive HDTV
Table 2.6 Applications of temporal scalability [2]
Base layer | Enhancement layer | Higher (combined) layer | Application
Progressive (30 Hz) | Progressive (30 Hz) | Progressive (60 Hz) | Migration to HR progressive HDTV
Interlace (30 Hz) | Interlace (30 Hz) | Progressive (60 Hz) | Migration to HR progressive HDTV
These tables above show us the applications of various types of scalabilities like
Spatial Scalability, Temporal Scalability and SNR Scalability. There is also the
concept of data partitioning.
Spatial Scalability involves generating two different spatial resolution video streams
from a single video source such that the base layer is coded by itself to provide the
basic spatial layer, and the enhancement layer is coded using the spatially interpolated
base layer and carries the full spatial resolution of the input video source.


Temporal Scalability works on the principle of generating two different video bit
streams, with the base layer corresponding to a basic temporal rate and the enhancement
layer coded with temporal prediction with respect to the base; when combined, they
provide the full temporal resolution at the decoder.
Another type of scalability, more or less a form of data partitioning, splits the bit stream
of a single-layer video into two parts or layers. The first layer carries the headers, the
motion vectors and the lower-order DCT coefficients, whereas the second layer carries
the higher-order DCT coefficients. These scalabilities can further be used
together to create hybrid scalabilities like SNR and Spatial Scalability or SNR and
Temporal Scalability and so on.
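A toy sketch of the data-partitioning idea mentioned above (illustrative only; the breakpoint and coefficient values are arbitrary) simply splits a zig-zag-ordered coefficient list into the two layers:

```python
def partition_coefficients(zigzag_coeffs, breakpoint=6):
    """Split zig-zag ordered coefficients into two data partitions.

    The first partition would travel with the headers and motion vectors (low-order
    coefficients); the second carries the remaining high-order coefficients.
    """
    return zigzag_coeffs[:breakpoint], zigzag_coeffs[breakpoint:]

coeffs = [34, -12, 7, 3, 0, -2, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # toy 4 x 4 block in zig-zag order
low, high = partition_coefficients(coeffs)
print(low)    # [34, -12, 7, 3, 0, -2]
print(high)   # [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```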
We will discuss the SNR Scalability for MPEG-2 in more detail.
SNR Scalability is defined as the ability of a communication system to display the
imagery at different quality levels. [8] This corresponds to generating two different
video bit streams having the same spatial-temporal resolutions but having different
video qualities. The base layer provides a basic video quality, and the enhancement
layer is coded to enhance the base layer. When added to the base layer, the enhancement
layer generates a higher-quality reproduction of the input video signal and thus increases
the SNR of the base signal, hence the name.


Figure 2.3 Typical block diagram of a two layer SNR scalable encoder based on
DCT [2]
Let us discuss the math associated with the DCT based SNR Scalable encoder.
According to the figure, the difference between the input pixel block X and its
motion-compensated prediction Y is transformed into coefficients T(X-Y). After
quantization these coefficients can be represented as T(X-Y) - Q, where Q is the
quantization distortion introduced. The quantized coefficients after the inverse DCT
reconstruct the prediction error and are added to the motion compensated prediction
to reconstruct a locally decoded pixel block Z.
In transform coding the interframe error signal X - Y becomes

T(X - Y) ..........................................2.1

and after the quantization distortion Q is introduced to the transform coefficients this becomes

T(X - Y) - Q ......................................2.2

After the inverse transformation, the reconstructed prediction error can be written as

T^-1[T(X - Y) - Q] ................................2.3

Since the inverse DCT is a linear transform, this can be written as

T^-1 T(X - Y) - T^-1(Q) ...........................2.4

and since T^-1 T = 1, we obtain

X - Y - T^-1(Q) ...................................2.5

When this error is added to the motion-compensated prediction Y, the decoded block becomes

Z = Y + X - Y - T^-1(Q) = X - T^-1(Q) .............2.6

Thus, T^-1(Q) = X - Z .............................2.7

which is what gets coded by the second-layer encoder; it is the inverse transform of the
base-layer quantization distortion. Applying the orthonormality of the transform once
more, the final result is

T(X - Z) = T T^-1(Q) = Q ..........................2.8
This is exactly why this scheme is also called coefficient amplitude scalability or
quantization noise scalability. Thus, as Figure 2.3 shows, the second layer is basically a
re-quantizer without much complexity.
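A small numerical sketch of this re-quantization view of SNR scalability (an illustration with arbitrary step sizes, not the encoder of the figure) is given below: the base layer quantizes the coefficients coarsely and the enhancement layer quantizes the resulting quantization error with a finer step.

```python
import numpy as np

def quantize(coeffs, step):
    """Uniform quantization: return the reconstructed values (index * step)."""
    return np.round(coeffs / step) * step

rng = np.random.default_rng(2)
coeffs = rng.normal(0, 20, 64)                     # toy transform coefficients

base_step, enh_step = 16.0, 4.0                    # the enhancement layer uses the finer step
base_rec = quantize(coeffs, base_step)             # base-layer reconstruction
enh_rec = quantize(coeffs - base_rec, enh_step)    # enhancement layer re-quantizes the error Q
combined = base_rec + enh_rec                      # reconstruction from base + enhancement

def mse(x):
    return float(np.mean(x ** 2))

print(mse(coeffs - base_rec), mse(coeffs - combined))   # the combined error is smaller
```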
The problem with this SNR scalable design is that, since the prediction is made from the
base layer, whose picture quality is not as good as that of the combined base plus
enhancement layers, we get a poorer estimate of the actual picture [2]. A better
arrangement is for the prediction to be made from the picture given by the sum of both
layers, while the enhancement layer still encodes the quantization distortion of the base
layer. This type of SNR scalable encoder is shown in Figure 2.4.
Fig. 2.4 A two-layer SNR scalable encoder with drift at the base layer [2]
In this encoder, Qb and Qe are the base-layer and enhancement-layer quantization step
sizes respectively. The quantization distortion of the base layer is re-quantized with a
finer precision (Qe < Qb), and the result is fed back to the prediction loop to form the
coding loop of the enhancement layer.
However, a drift error creeps in if both bit streams are not available at the decoder. If
the base-layer bit-stream is decoded by itself, then, owing to the loss of the differential
coefficients, the decoded picture in this layer will suffer from picture drift. The drift is
carried along until an I frame arrives, which flushes it out, since an I frame is not coded
by prediction. To get drift-free pictures in both layers, the coupling between the two
layers should be removed. This can be achieved by not feeding the enhancement data
back into the base-layer prediction loop; however, this results in an increase in bit rate,
as the enhancement layer would then be intra-coded [2]. This is illustrated in Figure 2.5.
Fig. 2.5 A two-layer SNR scalable encoder with no drift at the base and the
enhancement layer [2]
The figure shows how prediction is performed in such an encoder using a leaky
prediction method, which removes the drift errors from both the base and the
enhancement layers. In this case the prediction in the second layer takes a proportion (a)
from the inter-frame loop and (1 - a) from the intra-frame path. Since the data in the
prediction loop are the transform-coefficient quantization distortions, the motion
compensation is performed in the frequency domain. The standard, however, corresponds
to the encoder of Figure 2.4, and in practice the drift problem is dealt with by using more
I pictures.


2.2 MPEG-4
MPEG-4 is an improvement over the previous MPEG-2 standard in terms of compression
efficiency, giving better compression for the same visual quality, and of flexibility,
giving more options for more applications. The MPEG-4 encoder/decoder is based on the
DPCM/DCT coding model, supported by further features that enable reliable
transmission, better compression efficiency, coding of separate objects within a frame,
and animation of face and body models.
The salient features of MPEG-4 could be listed as follows: [3]
Efficient compression of progressive and interlaced natural video
sequences (compression of sequences of rectangular video frames). The core
compression tools are based on the ITU-T H.263 standard and can outperform
MPEG-1 and MPEG-2 video compression. Optional additional tools further
improve video compression efficiency.
Coding of video objects, a new concept introduced for MPEG-4, enables
independent coding of foreground and background objects in a video scene.
Support for effective transmission over practical networks: error resilience
tools help a decoder to recover from transmission errors and maintain a
video connection in an error-prone network environment, and scalable
coding tools help to support flexible transmission at a range of coded
bit rates.
Coding of still texture (image data). This means, for example, that still
images can be coded and transmitted within the same framework as moving
video sequences. Texture coding tools may also be useful in conjunction
with animation-based rendering.
Coding of animated visual objects such as 2-D and 3-D polygonal meshes,
animated faces and animated human bodies.
Coding for specialist applications such as studio quality video. In this type
of application, visual quality is perhaps more important than high
compression.
The MPEG-4 standard specifies profiles and corresponding tools and objects. A tool is
a subset of coding functions that supports a specific feature, such as basic video
coding, interlaced video or coding of object shapes; an object is a video element, such
as a sequence of rectangular frames, a sequence of arbitrarily shaped regions or a
still image, that is coded using one or more tools. A profile is then a set of object
types that a CODEC is expected to be capable of handling. As in MPEG-2, there are
also levels, which place a restriction on the maximum performance of a CODEC: the
amount of buffer memory, the decoded frame size, the processing rate required to
decode the frames, and the number of video objects in a frame.
The tables below list the MPEG-4 Visual profiles and the features that go along with
each profile.
Table 2.7 MPEG-4 Visual Profiles for coding natural video [3]

MPEG-4 Visual Profile           Main Features
Simple                          Low-complexity coding of rectangular video frames
Advanced Simple                 Coding of rectangular frames with improved efficiency and support for interlaced video
Advanced Real-Time Simple       Coding of rectangular frames for real-time streaming
Core                            Basic coding of arbitrary-shaped video objects
Main                            Feature-rich coding of video objects
Advanced Coding Efficiency      Highly efficient coding of video objects
N-Bit                           Coding of video objects with sample resolutions other than 8 bits
Simple Scalable                 Scalable coding of rectangular video frames
Fine Granular Scalability       Advanced scalable coding of rectangular frames
Core Scalable                   Scalable coding of video objects
Scalable Texture                Scalable coding of still texture
Advanced Scalable Texture       Scalable still texture with improved efficiency and object-based features
Advanced Core                   Combines features of the Simple, Core and Advanced Scalable Texture profiles
Simple Studio                   Object-based coding of high-quality video sequences
Core Studio                     Object-based coding of high-quality video with improved compression efficiency
Table 2.8 MPEG-4 Visual Profiles for coding synthetic or hybrid video [3]

MPEG-4 Visual Profile           Main Features
Basic Animated Texture          2D mesh coding with still texture
Simple Face Animation           Animated human face models
Simple Face and Body Animation  Animated face and body models
Hybrid                          Combines features of the Simple, Core, Basic Animated Texture and Simple Face Animation profiles
The Simple, Advanced Simple and Advanced Real-Time Simple profiles cover the
coding of rectangular frames. The scalability functionality of MPEG-2 is split in
MPEG-4 into the Simple Scalable and Fine Granular Scalability profiles, which are
used for scalable coding of rectangular frames (the focus of our research).
The Core, Main and other profiles focus on the coding of video objects. There are
also studio-quality coding profiles, which are divided into profiles for natural video
and profiles for synthetic or computer-generated images.
2.2.1 Simple Profile:
The coding of rectangular frames is the central aspect of the Simple Profile, and most
applications are based on it; each rectangular frame is treated as one VOP (video
object plane). The Simple Profile builds on the earlier DPCM/DCT video codec
models, with additional tools that improve coding efficiency and transmission
efficiency. The basic tools and objects for coding of rectangular frames are shown in
the following figure:
Figure 2.6 Tools and objects for coding of rectangular frames [3]
Thus, from the figure, a simple profile would consist of tools and objects as
follows:
I-VOP (Intra-coded rectangular VOP; progressive video format)
P-VOP (Inter-coded rectangular VOP, progressive video format)
A short header mode for compatibility with the H.263 codec
Compression efficiency tools consisting of four motion vectors per MB,
unrestricted motion vectors and intra prediction
Transmission efficiency tools consisting of Data Partitioning, video packets,
and reversible Variable Length Coding.
We first cover the basics of the Simple Profile and then move on to the scalable
profiles, which form the crux of this research.
The best way to understand the I-VOP and the P-VOP is through the following
figures:

Figure 2.7 I-VOP encoding and decoding stages [3]
Figure 2.8 P-VOP encoding and decoding stages [3]
A rectangular I-VOP is coded in intra mode, i.e. without any prediction from another
frame. The DCT and IDCT blocks follow the methodology discussed in Chapter 1,
with the forward DCT applied to each 8 x 8 block during encoding and the IDCT
applied during decoding. For the quantization process, the MPEG-4 Visual standard
specifies the rescaling of quantized transform coefficients; the DC component and the
AC coefficients are rescaled in separate ways. The rescaling is controlled by a
quantization parameter (QP) ranging from 1 to 31, where a higher QP means a coarser
quantization step and thus larger distortion. The DC coefficient in an intra-coded MB
is rescaled as follows:
DC = DCQ · dc_scaler .......................................... (2.9)


where DCQ is the quantized coefficient, DC is the rescaled coefficient, and dc_scaler
is given by the table below:
Table 2.9 Values of the dc_scaler parameter depending on QP range [3]

Block Type   1 <= QP <= 4   5 <= QP <= 8    9 <= QP <= 24   25 <= QP <= 31
Luma         8              2·QP            QP + 8          (2·QP) − 16
Chroma       8              (QP + 13)/2     (QP + 13)/2     QP − 6
The other transform coefficients are rescaled as follows:
|F| = QP · (2·|FQ| + 1)          (if QP is odd and FQ ≠ 0)
|F| = QP · (2·|FQ| + 1) − 1      (if QP is even and FQ ≠ 0)
F = 0                            (if FQ = 0)
.......................................... (2.10)
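To make the rescaling rules concrete, the following Python sketch illustrates Equations 2.9 and 2.10 and Table 2.9. It is an assumption-level illustration with hypothetical helper names, not the normative MPEG-4 procedure.

    def dc_scaler(qp, block_type="luma"):
        # dc_scaler as tabulated in Table 2.9 (assumes 1 <= qp <= 31)
        if block_type == "luma":
            if qp <= 4:
                return 8
            if qp <= 8:
                return 2 * qp
            if qp <= 24:
                return qp + 8
            return 2 * qp - 16
        # chroma
        if qp <= 4:
            return 8
        if qp <= 24:
            return (qp + 13) // 2
        return qp - 6

    def rescale_intra_dc(dc_q, qp, block_type="luma"):
        # Equation 2.9: DC = DC_Q * dc_scaler
        return dc_q * dc_scaler(qp, block_type)

    def rescale_ac(f_q, qp):
        # Equation 2.10: the sign is carried separately from the magnitude
        if f_q == 0:
            return 0
        mag = qp * (2 * abs(f_q) + 1)
        if qp % 2 == 0:          # an even QP subtracts 1
            mag -= 1
        return mag if f_q > 0 else -mag

    # Example: QP = 10, quantized DC = 5, quantized AC = -3
    print(rescale_intra_dc(5, 10))   # 5 * (10 + 8) = 90
    print(rescale_ac(-3, 10))        # -(10 * 7 - 1) = -69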
The reorder block signifies the zigzag scan of the DCT coefficients prior to the
encoding process. This ordering clusters the significant coefficients at the start of the
array so that the remaining zeros form long runs, which makes entropy encoding of
the (run, level) pairs of DCT coefficients efficient.
The RLE and RLD blocks in Figure 2.7 operate on the array of reordered coefficients:
each non-zero coefficient is encoded as a triplet (run, level, last), where run is the
number of zeros preceding the coefficient, level is the value of the coefficient, and
last indicates whether this is the final non-zero coefficient in the zigzag-scanned
array. The VLE and VLD blocks perform variable length encoding and decoding of
the side information, such as the motion vectors and header information, and of the
(run, level, last) triplets. In addition, an I-VOP contains a VOP header and the coded
MBs, and each MB carries its own header information, for example signalling
changes in quantization.
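As an illustration of this triplet coding, the following sketch (hypothetical function name, ignoring the subsequent VLC table lookup) converts a zigzag-ordered coefficient array into (run, level, last) symbols.

    def run_level_last(zigzag_coeffs):
        # run   = number of zeros preceding the non-zero coefficient
        # level = value of the non-zero coefficient
        # last  = 1 only for the final non-zero coefficient in the block
        triplets = []
        run = 0
        for c in zigzag_coeffs:
            if c == 0:
                run += 1
            else:
                triplets.append([run, c, 0])
                run = 0
        if triplets:
            triplets[-1][2] = 1  # mark the last non-zero coefficient
        return [tuple(t) for t in triplets]

    # Example block: 10, 0, 6, 0, 0, 3, then zeros
    print(run_level_last([10, 0, 6, 0, 0, 3] + [0] * 58))
    # [(0, 10, 0), (1, 6, 0), (2, 3, 1)]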
A P-VOP is coded using prediction from a previously encoded I- or P-VOP. Figure
2.8 illustrates P-VOP coding. Motion estimation is based on blocks of 16 x 16 pixels,
and the offset between the current region and the compensation region in the
reference frame (the motion vector) may have half-pixel resolution. The predicted
samples at sub-pixel positions are calculated using bilinear interpolation between
samples at integer-pixel positions. Once the motion vector is selected, the chosen
region is subtracted from the current region and the residual macroblock is
transformed using the DCT, quantized, reordered and entropy encoded. The quantized
residual is also rescaled and inverse transformed to provide a local copy for further
motion-compensated prediction. The macroblocks within a P-VOP can be encoded
using an inter mode or an intra mode.
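The half-pixel prediction mentioned above can be illustrated with a small sketch (an assumption-level illustration with hypothetical names, not the normative MPEG-4 interpolation code): bilinear interpolation between the four surrounding integer-position samples.

    def bilinear_half_pel(a, b, c, d, dx, dy, rounding=0):
        # a b      a, b, c, d: integer-position samples around the half-pel point
        # c d      dx, dy in {0, 1}: half-sample offsets in x and y
        if dx == 0 and dy == 0:
            return a
        if dy == 0:                       # horizontal half position
            return (a + b + 1 - rounding) // 2
        if dx == 0:                       # vertical half position
            return (a + c + 1 - rounding) // 2
        return (a + b + c + d + 2 - rounding) // 4   # diagonal half position

    # Half-pel sample midway between 100, 104, 96 and 100:
    print(bilinear_half_pel(100, 104, 96, 100, dx=1, dy=1))   # 100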
The short header mode, mentioned in Figure 2.6, provides compatibility between
MPEG-4 Visual and the H.263 video standard. An I- or P-VOP encoded in short
header mode corresponds to the I-picture or P-picture coding mode of baseline H.263,
so an H.263-compliant codec is able to decode such an I-VOP or P-VOP.
2.2.1.1 Coding efficiency tools
From Figure 2.6, the coding efficiency features include four motion vectors per
macroblock, unrestricted motion vectors and intra prediction. [3]
Four motion vectors per macroblock:
Motion compensation tends to be more effective with smaller block sizes. The
default block size for motion compensation is 16 x 16 samples (luma) and 8 x 8
samples (chroma), resulting in one motion vector per macroblock. This tool gives the
encoder the option of a smaller motion compensation block size, 8 x 8 samples
(luma) and 4 x 4 samples (chroma), giving four motion vectors per macroblock. This
is illustrated in the figure below.
Figure 2.9 (One or four vectors per MB) [3]
This mode can be more effective at minimizing the energy of the motion-
compensated residual, particularly in areas of complex motion or near the boundaries
of moving objects. There is an increased overhead in sending four motion vectors
instead of one, so the encoder may choose between one and four motion vectors on a
macroblock-by-macroblock basis.
Unrestricted Motion Vectors:
It may happen that the best match for a given macroblock lies outside the boundaries
of the reference VOP, i.e. a better match is obtained from a region that extends
beyond the reference picture into an extrapolated border region. This feature allows
motion vectors to point outside the boundary of the reference VOP. The UMV mode
can improve motion compensation efficiency, especially when objects move in and
out of the reference VOP.
2.2.1.2 Transmission efficiency tools
From Figure 2.6, we see that the standard has considered the transmission of video
over error-prone channels. When a transmission error such as a bit error or a packet
error occurs, the decoder loses synchronization and starts to decode some or all of the
information after the error incorrectly. The decoded VOP is distorted as a result, and
this distortion propagates spatially within the VOP and temporally to subsequent
VOPs that are predicted from the damaged VOP, since the distorted area may be used
as a prediction reference. We therefore need a feature that marks points in the bit
stream at which the decoder can resynchronize after an error. Such a
resynchronization marker is a uniquely decodable binary code inserted in the bit
stream. When the decoder detects an error, a suitable recovery mechanism is to scan
the bit stream until the next resynchronization marker is found. In the short header
mode, this marker can be part of the VOP header.
Video packet:
A video packet (there may be one or more per VOP) consists of a resynchronization
marker, a packet header carrying the macroblock number and quantization parameter,
an optional header extension signalled by a header extension code (HEC) flag, and a
series of coded macroblocks. The structure of a video packet is shown in Figure 2.10.
Sync | Header | HEC (header extension) | Macroblock data | Sync
Figure 2.10 A Video packet structure [3]
The video packet tool assists in error recovery: a decoder can resume decoding at the
start of the next video packet, so that an error does not propagate beyond the
boundary of the packet in which it occurred. In addition, predictive coding such as
differential coding of the quantization parameter, prediction of motion vectors, and
intra DC/AC prediction does not cross video packet boundaries, which prevents
errors in motion vector data from propagating into another video packet [3].
Data Partitioning
With this feature the encoder partitions the coded data within a video packet to
reduce the impact of transmission errors. The packet is divided into two parts: the
first part contains the coding mode information for each macroblock together with
the DC coefficient of each block, and the second part, placed after a further
resynchronization marker, contains the AC coefficients. The first part is the more
important one: if the second part is lost, the decoder can still reconstruct an
approximation of each macroblock from the coding modes and DC coefficients
carried in the first part.
Reversible VLCs:
As another transmission efficiency tool, reversible VLCs are codes that can be
decoded in both the forward and the reverse direction, reducing the picture area
affected by a transmission error. The decoder starts to decode a video packet in the
forward direction and, if an error is detected (through a bit-stream syntax violation),
the packet is decoded in the reverse direction starting from the next
resynchronization marker. In this way the damage can be restricted to the region
around the erroneous MB.
This concludes the discussion of the Simple Profile of the MPEG-4 standard.
2.2.2. Scalable Video Coding:
The basic idea of scalable video coding is that a decoder can decode only a part of
the coded bit stream, which is organized into several layers: a base layer and one or
more enhancement layers. The following figure illustrates the scalable coding
concept.
Figure 2.11 Scalable coding: a general concept [3]
From the figure, one decoder decodes a bit stream consisting only of the base layer,
which provides a basic-quality version of the video sequence, whereas another
decoder decodes all of the layers sent to it. This covers two situations: either the
network conditions force a decoder to decode only the base layer, with the
enhancement layers decodable when more bit-rate is available, or a complex decoder
decodes all the bit streams, including the enhancement layers and the base layer,
while a simple decoder decodes only the base layer.
MPEG-4 Visual supports a variety of Scalable coding types listed as:
Spatial Scalability
Temporal Scalability
Fine Granular Scalability
2.2.2.1 Spatial Scalability:
This scalability type operates in the spatial domain: the base layer corresponds to a
reduced-resolution version of each coded frame, and decoding the bit stream with
both the base and the enhancement layers gives an output at the higher resolution.
To encode the video into two spatial layers we follow these steps (a sketch of this
two-layer structure follows the list below): [3]
Subsample each input video frame horizontally and vertically.
Encode the reduced-resolution frame to form the base layer.
Decode the base layer and up-sample to the original resolution to form a
prediction frame.
Subtract the full resolution frame from this prediction frame.
Encode the difference (residual) to form the enhancement layer,
and at the decoder end...
Decode the base layer and up-sample to the original resolution
Decode the enhancement layer.
Add the decoded residual from the enhancement layer to the decoded base
layer to form the output frame.
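A minimal sketch of this two-layer structure is given below, using numpy with simple 2x decimation and sample replication as stand-in filters and rounding as a stand-in for base layer coding; the standard itself specifies particular subsampling and up-sampling filters.

    import numpy as np

    def encode_spatial_layers(frame):
        # Base layer: subsample by 2 in each direction and "code" it
        # (coding loss is represented here by rounding to multiples of 4).
        base = frame[::2, ::2]
        base_coded = (base // 4) * 4

        # Up-sample the decoded base layer by sample replication and
        # code the full-resolution residual as the enhancement layer.
        upsampled = np.repeat(np.repeat(base_coded, 2, axis=0), 2, axis=1)
        enh_residual = frame - upsampled
        return base_coded, enh_residual

    def decode_spatial_layers(base_coded, enh_residual=None):
        upsampled = np.repeat(np.repeat(base_coded, 2, axis=0), 2, axis=1)
        if enh_residual is None:          # base layer only: up-sampled low-resolution picture
            return upsampled
        return upsampled + enh_residual   # base + enhancement: full quality

    frame = np.arange(64, dtype=np.int32).reshape(8, 8)
    b, e = encode_spatial_layers(frame)
    print(np.array_equal(decode_spatial_layers(b, e), frame))   # True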
In the enhancement layer, an I-VOP is encoded without any prediction, as a complete
frame by itself. In a P-VOP of the enhancement layer, the decoded, up-sampled base
layer VOP is used as the prediction, and the difference between the prediction and
the current VOP is coded without motion compensation, so no motion vectors need
to be sent. For a B-VOP in the enhancement layer, the prediction is done in both
directions: the backward prediction is the decoded, up-sampled base layer VOP
without motion compensation, whereas the forward prediction uses the previous VOP
in the enhancement layer with motion-compensated prediction.
2.2.2.2 Temporal Scalability
Here the base layer produces a bit stream with a lower frame rate, and the
enhancement layer, when combined with the base layer, produces a higher temporal
rate, i.e. a higher frame rate. The following figures illustrate temporal scalability.
Figure 2.12a Temporal enhancement of P-VOP prediction options [3]
Figure 2.12b Temporal enhancement of B-VOP prediction options [3]
2.2.2.3 Fine Granular Scalability (FGS)
The FGS scheme creates two bit streams: the base layer is encoded in a non-scalable
mode at the lower bound of the bit-rate range, whereas the enhancement layer codes
the difference between the original and the reconstructed picture using bit-plane
coding of the DCT coefficients. [5] The enhancement layer bit stream can be
truncated to any number of bits per picture after encoding is completed, so the
decoder can always decode the base layer plus whatever portion of the enhancement
layer it receives. The enhancement layer thus provides a video quality proportional to
the number of bits decoded. The encoder and the decoder are shown in the figures
below.
Figure 2.13 FGS Encoder Structure [5]
Figure 2.14 FGS Decoder Structure [5]
In DCT-based run-level coding, the run corresponds to the number of zeros preceding
a non-zero coefficient and the level indicates the absolute value of the non-zero
coefficient; if a 2-D VLC table is used, these are coded as (run, level) pairs and a
separate EOB (End of Block) symbol is used. Bit-plane coding, in contrast, treats
each DCT coefficient as a binary number rather than a decimal number. An 8 x 8
DCT block gives 64 coefficients, and in bit-plane coding these 64 coefficients are
encoded bit-plane by bit-plane. For a better understanding, consider an example. [5]
Assume that the absolute values and the sign bits after zigzag ordering are given as
follows:
10, 0, 6, 0, 0, 3, 0, 2, 2, 0, 0, 2, 0, 0, 1, 0 ..., 0, 0 (absolute)
0, x, 1, x, x, 1, x, 0, 0, x, x, 1, x, x, 0, x ..., x, x (sign)
Now, the maximum non-zero coefficient is 10, which requires 4 bits to represent
(1010). Thus we have 4 bit-planes from which to form the (RUN, EOP) symbols.
Writing every value in binary format, the four bit-planes are formed as follows:
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0   (MSB, weight 2^3)
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0   (MSB-1, weight 2^2)
1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, ..., 0, 0   (MSB-2, weight 2^1)
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ..., 0, 0   (MSB-3, weight 2^0)
Converting these four bit-planes into (RUN, EOP) symbols, we have
(0, 1) (MSB)
(2, 1) (MSB-1)
(0, 0), (1, 0), (2, 0), (1, 0), (0, 0), (2, 1) (MSB-2)
(5, 0), (8, 1) (MSB-3)
These are the four bit-planes, which are VLC coded; the sign information for the
coefficients is also sent along with the bit-planes. Returning to the encoder and
decoder structures shown in Figures 2.13 and 2.14, the top portion of these block
diagrams represents the FGS enhancement encoding scheme. This portion contains a
"Find Maximum" block which determines the maximum of the DCT coefficients and
hence the number of bit-planes required for FGS encoding. This block is necessary
because the maximum number of bit-planes required can differ between the
luminance and the chroma components. After this, the bit-planes are VLC encoded
and form the enhancement stream.
The decoder decodes the base layer and the enhancement layer, which could be
truncated. If the enhancement layer is truncated, the accuracy of the coefficients is
reduced.
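The bit-plane conversion in the example above can be reproduced with a short sketch (a hypothetical helper that ignores the subsequent VLC stage and the sign coding):

    def bit_planes_to_run_eop(abs_coeffs):
        # Convert absolute zigzag-ordered values into (RUN, EOP) symbols,
        # one list per bit-plane, starting from the most significant plane.
        max_val = max(abs_coeffs)
        n_planes = max(1, max_val.bit_length())
        planes = []
        for p in range(n_planes - 1, -1, -1):        # MSB first
            bits = [(v >> p) & 1 for v in abs_coeffs]
            symbols, run = [], 0
            for i, b in enumerate(bits):
                if b == 0:
                    run += 1
                else:
                    eop = 1 if not any(bits[i + 1:]) else 0
                    symbols.append((run, eop))
                    run = 0
            planes.append(symbols)
        return planes

    coeffs = [10, 0, 6, 0, 0, 3, 0, 2, 2, 0, 0, 2, 0, 0, 1, 0]
    for plane in bit_planes_to_run_eop(coeffs):
        print(plane)
    # [(0, 1)]
    # [(2, 1)]
    # [(0, 0), (1, 0), (2, 0), (1, 0), (0, 0), (2, 1)]
    # [(5, 0), (8, 1)]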
As a development of FGS, features like frequency weighting and selective
enhancement can be added. Frequency weighting applies different weights to
different frequency components, so that some bits of those coefficients are placed in
the stream ahead of the other components. Selective enhancement applies different
weighting to different spatial locations of a frame, so that more bit-planes of some
parts of a frame are placed in the bit stream ahead of other parts of the frame.
2.3 H.264
H.264 is the first of the third generation of video coding schemes, the first generation
being H.120, H.261 and MPEG-1 and the second being H.263, MPEG-2 and MPEG-4.
Like its predecessors, the standard specifies what the decoder must do, rather than
how the encoding is performed.
Figure 2.15 H.264 Encoder Structure [3]
H.264 works on a block of pixels at a time: a 16 x 16 pixel macroblock (MB) of
luminance samples together with the corresponding 8 x 8 blocks of the chroma
components.
Within each picture the MBs are arranged into slices, where each slice contains MBs
in a particular scan order; the standard allows flexibility in the arrangement of MBs
within a slice. Also, within H.264, prediction can be made from two separate lists of
reference pictures, list 0 and list 1.
An I slice consists of I macroblocks, which are predicted from previously coded
samples of the same slice using the intra prediction modes shown in the encoder
structure of Figure 2.15. Intra prediction can be performed either for a complete
16 x 16 MB or for 4 x 4 blocks of luma samples, with the associated chroma samples.
A P slice may contain P and I macroblock types. P macroblocks are predicted using
inter prediction from reference pictures. An inter-coded MB may be divided into
macroblock partitions of 16 x 8, 8 x 16 or 8 x 8 luma samples (or left as a single
16 x 16 partition), with the chroma samples partitioned correspondingly.
B slices contain B and I macroblocks; B macroblocks can be predicted from both lists
of pictures, one reference from list 0 and another from list 1. There are also SP and
SI slices, used for special purposes, which we discuss as part of the profiles. A major
difference between this standard and all previous ones is the separation into a
Network Abstraction Layer and a Video Coding Layer, which allows more flexible
packaging of the elementary streams. The following figure illustrates this concept.
Figure 2.16 Comparison of MPEG-2 and H.264/AVC coding layers [4]
Within the NAL, the coded video data are organized into NAL units, packets each
containing an integer number of bytes. A NAL unit starts with a one-byte header,
which signals the type of the contained data; the remaining bytes are the payload.
NAL units are classified into VCL NAL units, containing coded slices or coded slice
partitions, and non-VCL NAL units, which contain associated additional information.
A set of consecutive NAL units with specific properties is referred to as an access
unit, and the decoding of an access unit results in exactly one decoded picture. A set
of consecutive access units with certain properties is referred to as a coded video
sequence. [6]
The Video Coding Layer is the coded representation of the pictures and corresponds
to the payload of a video packet. The payload is formed using one of the profiles the
standard defines. H.264 defines three main profiles:
Baseline profile: This profile is generally used as a low-complexity, low
latency profile which provides the basic functionality. The typical
applications would involve interactive ones like mobile video telephony and
video conferencing. This profile has features like Intra and Inter Coding and
Context Adaptive Variable Length Coding (CAVLC). Also, there are
features like Slice Groupings, and Redundant Slices to provide error
resilience for transmission of data.
Main profile: This is a high complexity, high latency profile with studio
quality levels fit for applications like DVD, HDTV and broadcast
applications. This profile does not incorporate any error resilient features
though. This profile supports features like Interlacing, and Context Adaptive
Binary Arithmetic Coding apart from including aspects like B-slices and
Weighted Prediction.
Extended Main profile: This profile is especially developed to incorporate
error resilient features so as to use this profile in extreme conditions like
wireless channels where the channel conditions require the source to be
robust. This profile would have additional features on top of the Baseline
features, such as SP and SI slices and Data Partitioning.
All the profiles are compatible with the Baseline profile. The following figure gives a
complete picture of the compatibility between the profiles and the features of each.
Figure 2.17 Profiles in H.264 [3]
2.3.1 Baseline Profile:
2.3.1.1 I-slice:
An I-slice contains only MBs that are predicted from previously coded samples
within the same slice. The prediction block P is formed from previously encoded and
reconstructed samples and is subtracted from the current block prior to encoding. For
the luma samples, P is formed for each 4 x 4 block or for the complete 16 x 16
macroblock, with nine different prediction modes for a 4 x 4 luma block and four
modes for a 16 x 16 block. The mode is chosen so as to minimize the difference
between P and the block to be encoded.
From the figure below, for a 4 x 4 luma block, the samples A-M have been previously
encoded and reconstructed and are available for prediction.
Figure 2.18 4 x 4 luma prediction modes [3]
Table 2.10 The nine prediction modes for a 4 x 4 block of luma samples [3]

Mode 0 (Vertical)              The upper samples A, B, C, D are extrapolated vertically.
Mode 1 (Horizontal)            The left samples I, J, K, L are extrapolated horizontally.
Mode 2 (DC)                    All samples in P are predicted by the mean of samples A...D and I...L.
Mode 3 (Diagonal Down-Left)    The samples are interpolated at a 45-degree angle between the lower-left and the upper-right.
Mode 4 (Diagonal Down-Right)   The samples are extrapolated at a 45-degree angle down and to the right.
Mode 5 (Vertical-Right)        Extrapolation at an angle of approximately 26.6 degrees to the left of vertical (width/height = 1/2).
Mode 6 (Horizontal-Down)       Extrapolation at an angle of approximately 26.6 degrees below horizontal.
Mode 7 (Vertical-Left)         Extrapolation at an angle of approximately 26.6 degrees to the right of vertical.
Mode 8 (Horizontal-Up)         Interpolation at an angle of approximately 26.6 degrees above horizontal.
Similarly, a 16 x 16 luma macroblock has four prediction modes, listed in the
following table.

Table 2.11 The four prediction modes for a 16 x 16 luma macroblock [3]

Mode 0 (Vertical)      Extrapolation from the upper samples (H)
Mode 1 (Horizontal)    Extrapolation from the left samples (V)
Mode 2 (DC)            Mean of the upper and left-hand samples (H + V)
Mode 3 (Plane)         A linear "plane" function is fitted to the upper and left-hand samples H and V; this works well in areas of smoothly varying luminance
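As an illustration of how these directional modes form the prediction block P, the following sketch covers only the vertical, horizontal and DC modes for a 4 x 4 block (hypothetical helper names; the remaining modes follow the same pattern with different extrapolation directions, and the rounding here is a simplification).

    import numpy as np

    def intra_4x4_predict(mode, above, left):
        # above: samples A..D from the row above the block
        # left:  samples I..L from the column to the left
        if mode == 0:                       # Vertical: copy A..D down each column
            return np.tile(above, (4, 1))
        if mode == 1:                       # Horizontal: copy I..L across each row
            return np.tile(left.reshape(4, 1), (1, 4))
        if mode == 2:                       # DC: mean of A..D and I..L
            dc = int(round((above.sum() + left.sum()) / 8.0))
            return np.full((4, 4), dc)
        raise NotImplementedError("directional modes 3-8 omitted in this sketch")

    above = np.array([100, 102, 104, 106])
    left = np.array([98, 97, 96, 95])
    print(intra_4x4_predict(2, above, left))   # 4 x 4 block filled with the mean (100)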
2.3.1.2 P-slice:
A P-slice is defined as having inter-coded MBs predicted from samples in
previously coded pictures, or intra-coded MBs or Skipped MBs. For the Skipped
Even if the 8 x 8 mode is selected, the MB prediction can be further divided into
sub-partitions of 8 x 8, 4 x 8, 8 x 4 and 4 x 4 samples. These sizes apply to the
luminance blocks; the chroma components use block sizes halved in each dimension,
since they have half the horizontal and vertical resolution. Thus, if the luma partition
is 16 x 8, the corresponding chroma partition is 8 x 4. In this way the macroblock is
broken down into manageable sizes for motion estimation and motion compensation,
giving more flexibility. This creates a trade-off: for each MB partition the choice of
block size and its motion vector must be signalled, which adds overhead to the
information to be sent, but such flexibility provides a better way to compensate for
the motion of the objects within a MB. The next step is to define the motion vectors,
the method of obtaining them for each block, and their transmission.
Motion Vector prediction: Motion vectors have quarter-sample resolution, and the
current partition or sub-partition block is predicted from an area of the same size in a
reference picture. The partition or sub-partition sizes are chosen according to the
energy concentration of the residual within the macroblock. Each block is searched
for a matching area in previously coded pictures, and the best match is selected
according to a criterion such as MSE (Mean Square Error) or MAD (Mean Absolute
Difference). Since luma and chroma samples do not exist at sub-sample positions in
the reference picture, they must be interpolated from the nearby integer samples. The
interpolation is performed using a six-tap FIR filter with coefficients (1/32, -5/32,
5/8, 5/8, -5/32, 1/32) for half-pel precision. The interpolation process is illustrated in
the following figures:


Figure 2.20 Interpolation of luma half-pel positions. [3]
The half-pel sample b is calculated from the six horizontally adjacent integer samples
E, F, G, H, I and J using the FIR filter, and all other half-pel positions are calculated
similarly. Once the half-sample values are available, the quarter-pel values are
obtained by averaging neighbouring integer- and half-sample values and rounding.
The following figure illustrates the interpolation for quarter-pel resolution.
Figure 2.21 Interpolation of luma samples for Quarter pel resolution [3]
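A sketch of the half- and quarter-sample interpolation described above is given below (hypothetical helper names; the weights (1, -5, 20, 20, -5, 1) with a final division by 32 are the same filter as (1/32, -5/32, 5/8, 5/8, -5/32, 1/32) written with integer arithmetic).

    def half_pel(e, f, g, h, i, j):
        # Six-tap FIR filter applied to the six neighbouring integer samples;
        # the result is rounded and clipped to the 8-bit sample range.
        val = (e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5
        return max(0, min(255, val))

    def quarter_pel(p, q):
        # Quarter-sample values are the rounded average of the two nearest
        # integer/half-sample positions.
        return (p + q + 1) >> 1

    # Half-pel sample b between G = 100 and H = 104, with a flat neighbourhood:
    b = half_pel(100, 100, 100, 104, 104, 104)
    print(b, quarter_pel(100, b))    # 102 101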
Motion vector prediction:
A predicted vector MVP is calculated from previously coded motion vectors, and
MVD, the difference between the current vector and the predicted vector, is encoded
and transmitted. The method of obtaining MVP depends on the motion compensation
partition size and on the availability of nearby vectors. The following rules are used
to decide MVP (a small sketch of the median rule follows the list below): [3]
For transmitted partitions excluding 16x8 and 8x16 partition sizes, MVP is
the median of the motion vectors for partitions A, B and C.
For 16x8 partitions, MVP for the upper 16 x 8 partition is predicted from B
and MVP for the lower 16x8 partition is predicted from A.
For 8x16 partitions, MVP for the left 8x16 partition is predicted from A
and MVP for the right 8x16 partition is predicted from C.
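A minimal sketch of the default median rule (the first bullet above) is given below; it uses a hypothetical helper and ignores the partition-size special cases and the availability rules.

    def median_mv_predictor(mv_a, mv_b, mv_c):
        # Component-wise median of the neighbouring motion vectors
        # A (left), B (above) and C (above-right); each vector is (mvx, mvy).
        def median3(x, y, z):
            return sorted((x, y, z))[1]
        return (median3(mv_a[0], mv_b[0], mv_c[0]),
                median3(mv_a[1], mv_b[1], mv_c[1]))

    mvp = median_mv_predictor((2, -1), (4, 0), (3, 5))
    mvd = (3 - mvp[0], 1 - mvp[1])    # MVD = current MV (3, 1) minus MVP
    print(mvp, mvd)                   # (3, 0) (0, 1)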
2.3.1.3 Context Adaptive Variable Length Coding (CAVLC):
To explain the concept of CAVLC, assume that the transformed coefficients have
been converted to a 1-D array by a zigzag scan. Every non-zero coefficient is then
associated with a run variable, which counts the number of zero coefficients
preceding it. Among the higher frequencies there are typically several trailing
coefficients of magnitude one (of either sign); their number (up to three), together
with the total number of non-zero coefficients, is coded with one of a set of code
tables, the choice of table depending on the number of non-zero coefficients in the
neighbouring blocks. The values of the remaining coefficients are then coded using
adaptive Rice codes, where the adaptivity is given by a varying suffix size that adapts
to the frequency range of the coefficients; several code tables are used, and the
choice among them is made according to the value of the previously encoded
coefficient. The sum of the runs is computed and encoded with one of 15 tables,
depending upon the number of non-zero coefficients in the block, and the individual
runs are then coded with one of seven code tables. [7]
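To make the first CAVLC step concrete, the following sketch (a simplified, hypothetical illustration; the real coder then maps these values through context-selected VLC tables) extracts the total number of non-zero coefficients, the trailing ±1s and the zero runs from a zigzag-scanned block:

    def cavlc_symbols(zigzag_coeffs):
        nonzero = [c for c in zigzag_coeffs if c != 0]
        total_coeffs = len(nonzero)

        # Count trailing +/-1 coefficients (at most three are signalled specially).
        trailing_ones = 0
        for c in reversed(nonzero):
            if abs(c) == 1 and trailing_ones < 3:
                trailing_ones += 1
            else:
                break

        # Zero runs preceding each non-zero coefficient (the real coder scans
        # them in reverse order; they are listed here in forward order).
        runs, run = [], 0
        for c in zigzag_coeffs:
            if c == 0:
                run += 1
            else:
                runs.append(run)
                run = 0
        return total_coeffs, trailing_ones, runs

    print(cavlc_symbols([0, 3, 0, 1, -1, -1, 0, 1, 0, 0]))
    # (5, 3, [1, 1, 0, 0, 1])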
2.3.1.4 Arbitrary Slice Order and Redundant Slices:
These are transmission-robustness features, primarily meant to improve performance
over error-prone channels. ASO is defined to be in use if the first MB in any slice of
a decoded frame has a smaller MB address than the first MB of a previously decoded
slice in the same picture. [3]
A slice group is a collection of macroblocks and may contain one or more slices.
Within each slice group the MBs are arranged in raster scan order, but the mapping
of macroblocks to slice groups can be chosen flexibly; this feature is known as
Flexible Macroblock Ordering (FMO). The allocation of MBs to slice groups follows
a map type as defined in the table below.
Table 2.12 Macroblock to slice group map types [3]

Type  Name         Description
0     Interleaved  Run-lengths of MBs are assigned to each slice group in turn
1     Dispersed    MBs in each slice group are dispersed throughout the picture
2     Foreground   All but the last slice group are defined as rectangular regions within the picture; the last slice group contains all MBs not contained in any other slice group (the background)
3     Box-out      A box is created starting from the centre of the frame and contains group 0; all other MBs are in group 1
4     Raster scan  Group 0 contains MBs in raster scan order from the top-left; all others are in group 1
5     Wipe         Group 0 contains MBs in vertical scan order from the top-left; all others are in group 1
6     Explicit     A parameter, slice_group_id, is sent for each MB to indicate its slice group
Figure 2.22 Slice groups: Interleaved and Dispersed maps [3]
A redundant picture is an alternative representation of a coded picture. The decoder
normally uses the primary picture and discards the redundant picture when the
primary picture arrives correctly; if the primary picture is lost during transmission,
the decoder uses the redundant picture to reconstruct the frame.
Figure 2.23 Slice Groups: Foreground and Background map [3]
2.3.2 Main Profile
The Main profile uses all the features of the Baseline profile except the error
resilience features ASO, FMO and RS. The additional features of the Main profile
are as follows:
B-slices
Weighted Prediction
CABAC
Support for interlaced pictures.
2.3.2.1 B-slices:
The MBs in a B-slice are predicted from reference pictures in two lists, list 0 and
list 1, using previously coded pictures that may be before or after the current picture
in display order. Unless weighted prediction is used, a B-slice forms its prediction
according to the following equation:
pred(i, j) = (pred0(i, j) + pred1(i, j) + 1) >> 1 .......................................... (2.9)
where the reference blocks are taken from pictures in list 0 and list 1 and have the
same size as the current partition or sub-macroblock partition. Two motion-
compensated reference blocks are created, one from each list, and each sample of the
prediction block is calculated as the rounded average of the corresponding list
samples; pred0(i, j) and pred1(i, j) are the samples from list 0 and list 1 respectively.
Once the bi-directional sample pred(i, j) is obtained, it is subtracted from the current
sample to give a motion-compensated residual, which then undergoes a DCT
transformation followed by entropy coding.
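A sketch of this default bi-prediction (the rounded average in the equation above), assuming the two motion-compensated reference blocks have already been fetched:

    import numpy as np

    def biprediction(pred0, pred1):
        # pred(i,j) = (pred0(i,j) + pred1(i,j) + 1) >> 1  -- rounded average
        return (pred0.astype(np.int32) + pred1.astype(np.int32) + 1) >> 1

    pred0 = np.array([[100, 102], [104, 106]], dtype=np.uint8)
    pred1 = np.array([[ 96, 103], [105, 110]], dtype=np.uint8)
    print(biprediction(pred0, pred1))
    # [[ 98 103]
    #  [105 108]]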
2.3.2.2 Weighted Prediction:
The Main profile allows the use of weighted prediction, i.e. a linear transformation of
one or two predictions of the form Xp = a·Xr + c for P pictures and
Xp = a·Xr,1 + b·Xr,2 + c for B pictures, where Xp and Xr denote the predicted and
reference signals and a, b and c are the weights. [7] This feature is helpful for
changes in intensity such as fades, where one scene fades into another. The choice of
the weights can depend on the temporal location of the reference picture: a reference
picture far from the current picture receives a smaller weight, and a reference picture
near the current picture receives a larger weight.
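A sketch of the weighted-prediction idea (simplified and illustrative only; the actual H.264 syntax signals integer weights and offsets with a log-denominator, which is omitted here):

    def weighted_pred_p(x_ref, a, c):
        # P-slice form: Xp = a * Xr + c
        return a * x_ref + c

    def weighted_pred_b(x_ref0, x_ref1, a, b, c):
        # B-slice form: Xp = a * Xr,1 + b * Xr,2 + c
        return a * x_ref0 + b * x_ref1 + c

    # Fade towards black: the reference sample 120 is scaled down before prediction.
    print(weighted_pred_p(120, a=0.5, c=0))             # 60.0
    print(weighted_pred_b(120, 60, a=0.7, b=0.3, c=0))  # 102.0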
2.3.2.3 CABAC (Context-based Adaptive Binary Arithmetic Coding)
The main disadvantage of CAVLC is that assigning each symbol a codeword with an
integral number of bits gives suboptimal compression, because the optimal number
of bits for a symbol is generally a fraction. The best way to approach the theoretical
limit given in [17] is to use arithmetic coding: an arithmetic coder converts a
sequence of symbols into a fractional number and thus approaches the optimal
number of bits required to represent each symbol.
The fundamental idea of an arithmetic coder is to represent a message by an interval
of real numbers between 0 and 1, subdivided according to the cumulative
probabilities of the symbols, which add up to 1. As the message grows, the interval
needed to represent it becomes smaller and the number of bits needed to specify that
interval increases. To illustrate the principle, consider the symbols
{-2, -1, 0, 1, 2, EOF} with the probabilities listed in the table below:
Table 2.13 Probability of each symbol and the sub-range allocated to it [3]

Vector   Probability   log2(1/P)   Sub-range
-2       0.1           3.32        0 - 0.1
-1       0.2           2.32        0.1 - 0.3
0        0.4           1.32        0.3 - 0.7
1        0.2           2.32        0.7 - 0.9
2        0.1           3.32        0.9 - 1.0
where the total range 0 to 1.0 is divided into these sub-ranges:
0 | (-2) | 0.1 | (-1) | 0.3 | (0) | 0.7 | (1) | 0.9 | (2) | 1.0
Figure 2.24 Sub-ranges of the symbol probabilities [3]
If we have to encode {0,-1,0,2} then the following table gives us a procedure of
encoding:
Table 2.14 Encoding procedure for the vector sequence (0, -1, 0, 2) [3]

Step  Encoding procedure                                          Sub-range       New range
1     Set the initial range.                                                      0 - 1.0
2     First data symbol is (0); find its sub-range.               0.3 - 0.7
3     Set the new range (1) to this sub-range.                                    0.3 - 0.7
4     Next data symbol is (-1); find its sub-range L to H         0.1 - 0.3
      (this sub-range lies within the interval 0 - 1).
5     Set the new range (2) to this sub-range within the                          0.34 - 0.42
      previous range (0.34 is 10% and 0.42 is 30% of the
      previous range).
6     Next data symbol is (0); find its sub-range.                0.3 - 0.7
7     Set the new range (3) within the previous range                             0.364 - 0.396
      (0.364 is 30% and 0.396 is 70% of the previous range).
8     Next data symbol is (2); find its sub-range.                0.9 - 1.0
9     Set the new range (4) within the previous range                             0.3928 - 0.396
      (0.3928 is 90% and 0.396 is 100% of the previous range).
The decoding process, on the other hand, proceeds as shown in Table 2.15 below:
Table 2.15 Decoding procedure for a received number [3]

Step  Decoding procedure                                              Sub-range        Decoded symbol
1     Set the initial range.                                          0 - 1
2     Find the sub-range in which the received number falls;          0.3 - 0.7        (0)
      this indicates the first data symbol.
3     Set the new range (1) to this sub-range.                        0.3 - 0.7
4     Find the sub-range of the new range in which the received       0.34 - 0.42      (-1)
      number falls; this indicates the second data symbol.
5     Set the new range (2) to this sub-range within the              0.34 - 0.42
      previous range.
6     Find the sub-range in which the received number falls and       0.364 - 0.396    (0)
      decode the third data symbol.
7     Find the sub-range in which the received number falls and       0.3928 - 0.396   (2)
      decode the fourth data symbol.
The following figure illustrates the decoding process using arithmetic coding.
Figure 2.25 Decoding mechanism in CABAC [3]
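The interval-subdivision mechanism in Tables 2.14 and 2.15 can be reproduced with a short sketch (a simplified, non-binary arithmetic coder used only to illustrate the principle; CABAC itself works on binarized symbols with adaptive context models):

    # Sub-ranges from Table 2.13
    RANGES = {-2: (0.0, 0.1), -1: (0.1, 0.3), 0: (0.3, 0.7), 1: (0.7, 0.9), 2: (0.9, 1.0)}

    def arith_encode(symbols):
        low, high = 0.0, 1.0
        for s in symbols:
            s_low, s_high = RANGES[s]
            width = high - low
            low, high = low + width * s_low, low + width * s_high
        return low, high            # any number in [low, high) identifies the sequence

    def arith_decode(value, count):
        symbols, low, high = [], 0.0, 1.0
        for _ in range(count):
            pos = (value - low) / (high - low)
            for s, (s_low, s_high) in RANGES.items():
                if s_low <= pos < s_high:
                    symbols.append(s)
                    width = high - low
                    low, high = low + width * s_low, low + width * s_high
                    break
        return symbols

    low, high = arith_encode([0, -1, 0, 2])
    print(round(low, 4), round(high, 4))        # 0.3928 0.396, as in Table 2.14
    print(arith_decode(0.394, 4))               # [0, -1, 0, 2]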
2.3.2.4 Support for interlaced pictures:
The Main profile of H.264 provides support for interlaced video, first seen in
MPEG-2 and again supported by this standard. Interlacing separates a video frame
into two fields on the basis of the even and odd lines of pixels. The corresponding
coding feature is known as MB-AFF, or Macroblock-Adaptive Frame/Field coding:
the current slice is processed in units of either a complete frame or two separate
fields, and the encoder can decide whether to code each MB pair as two frame
macroblocks or two field macroblocks, selecting the optimum coding mode for each
region of the picture. Using this feature results in a better coding gain and provides a
better basis for error concealment. [7]
2.3.3 Extended Main
This profile was introduced primarily to provide more error robustness in the
transmission mechanism of the standard. It incorporates all of the Baseline profile
tools, along with B-slice coding and weighted prediction, and adds features such as
Data Partitioning and SP and SI slices, all meant to support video over error-prone
channels.
2.3.3.1 Data Partitioning:
Under this feature, the coded data of a slice is split into separate partitions: the first
partition contains the slice header and the header data for each macroblock in the
slice, the second partition contains the residual data for intra and SI slice
macroblocks, and the third partition contains the coded residual data for inter-coded
macroblocks.
The first partition is the most important: since it carries the header data for every
macroblock, losing it to transmission errors makes it very difficult to reconstruct the
coded macroblocks. If, on the other hand, an error hits the second or the third
partition, the remaining partitions can still be decoded successfully. Each partition is
transmitted as one packet by the NAL.
2.3.3.2 SP and SI slices:
These coded slice types enable efficient switching between video streams and
random access into bit-streams coded at various bit rates. When the same video
material is coded at several bit rates, the decoder normally decodes the highest
bit-rate stream; when network conditions force it to fall back, it has to switch from
the high bit-rate stream to a lower bit-rate one. This is where the switching slices SP
and SI are used.
SP slices provide switching between similarly coded sequences without the
increased bit-rate penalty of I-slices.
(Streams A and B, with frames A0-A4 and B0-B4 and a switch point between them; upper panel: P slices / I slice / P slices, lower panel: P slices / SP slices / P slices.)
Figure 2.26 Switching streams between I and P slices [3]
From the figure, picture A2 can be decoded using reference picture A1, and picture
B2 can be decoded using reference picture B1. Switching can be implemented by
introducing a slice AB2 that can be decoded using the motion-compensated reference
picture A1 to produce the decoded frame B2. Thus, the decoded picture B2 can be
obtained either from B1 (within stream B) or from A1 (when switching from stream
A). The only overhead is the introduction of the switching SP slice.
SI slices follow the same switching principle but use the nine 4 x 4 intra prediction
modes discussed for the Baseline profile: the prediction is obtained by 4 x 4 intra
prediction from previously decoded samples of the reconstructed frame. This slice
type can be used to switch from one sequence to a completely different sequence, in
which case no inter-frame prediction could be used.
3. SNR Scalability by Hybrid Coefficient Refinement
SNR Scalability denotes the ability of a communication system to display imagery at
several quality levels. [8] This form of scalability is most useful for video
transmission over heterogeneous networks with varying bandwidth, where the
decoder has to choose between the different available bit rates and decodes the video
at the quality level available at that bit rate. The layer with the lowest quality level is
known as the base layer, and subsequent layers with higher qualities are known as
enhancement layers. The history of SNR Scalability starts with the MPEG-2
standard, where an SNR scalability profile was first introduced. SNR scalability in
MPEG-2 is described in Chapter 2, but is summarized again here: the base layer's
quantization error is requantized with a finer precision and transmitted as an
enhancement layer, and this refined error is fed back into the prediction loop to
obtain a copy of the transmitted picture. The disadvantage of this strategy is a
perceptible visual degradation arising from drift: the reference for motion
compensation at the encoder depends on the high-quality data fed back into the
coding loop, whereas at the decoder it may depend on the base layer alone. With drift
causing such perceptible degradation, this strategy results in sub-optimal coding
efficiency. SNR scalability, or quality scalability, was next addressed in the MPEG-4
Visual standard. This standard, as described in Chapter 2, introduced the concept of
Fine Granular Scalability, in which the transformed coefficients are coded on a
bit-plane basis. The base layer uses a non-scalable coding technique, while the
enhancement layer is formed from the residue between the original and the base
layer and is bit-plane coded without any motion compensation. The crux of the
scheme is that the enhancement layer can be truncated at any point, which gives it a
very good rate-control mechanism. In spite of this, the scheme has two main
disadvantages: motion compensation is not exploited in the enhancement layer, and
if the enhancement layer is truncated, drift occurs because the reference frames are
not fully reconstructed. [5]
The next standard to describe SNR scalability was H.263 with Annex O. [9] There,
the enhancement layer is created by coding the residual between the original and the
reconstructed base layer signal in the spatial domain. In addition, the coding mode
for each MB is decided and signalled on every layer, raising the bit-rate and the
amount of side information generated.
As part of these standardization efforts, many SNR scalability techniques have been
developed. In the following sections we describe our algorithm, its implementation
in the reference software JM 13.0, and the results obtained.
3.1 Algorithm
The algorithm we have worked on is built on the H.264 standard, so all the profiles,
levels and tools discussed in the previous chapter apply. The video is first broken
down into MBs, and these MBs are transformed into the frequency domain using the
DCT, converting the spatial samples into transform coefficients. The transform has a
good energy compaction property, concentrating most of the signal energy into a few
coefficients. This makes the frequency domain a better place than the spatial domain
to form the coding error, because of the excellent de-correlation of the data in the
frequency domain. The transform coefficients of the second layer are predicted from
the base layer coefficients themselves, and the difference is losslessly encoded.
Thus, given the two sets of transformed coefficients, the prediction error difference
in the transform domain is given by
ΔX = Q{T{E_HQL}} − Q{T{E_BL}} .......................................... (3.1)
   = X − X_BL .......................................... (3.2)
where E with different subscripts denotes the prediction error in the respective layer
(HQL for the high-quality enhancement layer, BL for the base layer), T represents
the transform and Q the quantization being applied. Equation 3.1 offers better energy
compaction and provides a better means of obtaining the prediction error than
working in the spatial domain, because of the low correlation between Q{T{E_HQL}}
and Q{T{E_BL}}. Our aim is to minimize the energy of this prediction error, and this
is supported by the characteristics of H.264, which exploits the temporal and spatial
redundancies to reduce the prediction error. One major obstacle is the large overhead
in the form of side information: headers, data containing coding mode decisions and
motion vector information. This side information grows with the number of layers
involved. As an example, for the QCIF-size video Foreman coded with an IPP GOP
structure at 30 fps, the average amount of side information per frame, normalized by
the total number of bits spent per frame, is 7% for an I frame and 63% for a P frame.
[10] This amount of information becomes very significant when several layers are
used to display the imagery.
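The following Python sketch illustrates the refinement idea behind Equation 3.1 in a heavily simplified form: scalar quantization stands in for the full H.264 transform/quantization chain, the refinement is formed by requantizing the base layer's transform-domain coding error with the finer enhancement quantizer, and all names and step sizes are hypothetical.

    import numpy as np

    def quantize(coeffs, step):
        return np.round(coeffs / step).astype(np.int32)

    def dequantize(indices, step):
        return indices * step

    # Transformed prediction error of one block (already in the transform domain).
    t_error = np.array([52.0, -13.0, 6.5, 0.8, -2.2, 0.1, 0.0, 0.0])

    qp_base, qp_enh = 16.0, 4.0                 # the enhancement quantizer is finer
    x_base = quantize(t_error, qp_base)         # base layer coefficients
    base_recon = dequantize(x_base, qp_base)    # what the base layer reconstructs

    # Refinement coded in the enhancement layer, while the base layer's
    # motion vectors and mode decisions are simply reused.
    refinement = quantize(t_error - base_recon, qp_enh)
    enh_recon = base_recon + dequantize(refinement, qp_enh)

    print(x_base)        # [ 3 -1  0  0  0  0  0  0]
    print(refinement)    # small-amplitude values, cheap to entropy code
    print(enh_recon)     # noticeably closer to t_error than base_recon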
Figures 3.1 and 3.2 below show the SNR scalable encoder and decoder. In our
research we investigate a strategy in which motion estimation and motion
compensation are performed only once, and the side information generated for that
particular layer is reused for all the other layers; this reduces the bit-rate
tremendously. It may, however, degrade the performance of the layers for which the
optimization has not been done, typically by less than 1 dB. [10] In some techniques,
such as [11], the Rate-Distortion (RD) optimization is performed on all the layers,
and the MBs are then coded according to the coding mode decisions made by
minimizing the Lagrangian cost between the target macroblock and the macroblock
in the reference picture.
Figure 3.1 SNR Scalable Encoder [10]
Figure 3.2 SNR Scalable Decoder [10]
To verify the RD-optimized results we need ways of measuring distortion. A detailed
discussion of distortion measures is given in Appendix A, including how well these
metrics correlate with human visual perception and how effective they are in
verifying subjective results.
In our research framework, the encoder decides, based on the application, which
layer should be used for the motion compensation and which layer should be
optimized. The encoder simplifies the mode selection method compared to [12], but
along the same lines: the best trade-off between coding cost and introduced
distortion is sought, with the SATD (Sum of Absolute Transformed Differences) as
the decision criterion. This optimal
solution is obtained by linearly combining the coding cost and the distortion and
jointly minimizing them using Lagrange's functional minimization:
J = SATD + λ·R .......................................... (3.3)
with the rate R as a constraint and the Lagrangian multiplier λ as an optimization
parameter. The SATD for a 4 x 4 pel block is given by
SATD = Σ_{i,j=0}^{3} | Th{E}(i,j) | .......................................... (3.4)
where Th{ } is the 2-D Hadamard transform, and the definition of the prediction
error is
E(i,j) = X(i,j) − X̂(i,j) .......................................... (3.5)
with the original and predicted samples X and X̂ respectively at position (i,j) in the
frame. This minimization is carried out over all intra and inter MB coding modes,
with reference frames used for the inter modes. To avoid the drift error seen in
MPEG-2, the enhancement and base layers have their own reference picture buffers,
and prediction, transform and quantization are performed independently in each
layer. It is, however, mandatory for both layers to operate with the same coding
mode decisions, since the two sets of coefficients are subtracted from each other.
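A sketch of the SATD and Lagrangian cost in Equations 3.3-3.5 for one 4 x 4 block follows (hypothetical helper names; some implementations additionally normalize the Hadamard sum by a factor of 2, which is omitted here):

    import numpy as np

    # 4 x 4 Hadamard matrix (rows are orthogonal +/-1 patterns)
    H4 = np.array([[1,  1,  1,  1],
                   [1, -1,  1, -1],
                   [1,  1, -1, -1],
                   [1, -1, -1,  1]])

    def satd_4x4(block, prediction):
        error = block.astype(np.int64) - prediction.astype(np.int64)   # Eq. 3.5
        transformed = H4 @ error @ H4                                  # 2-D Hadamard transform
        return int(np.abs(transformed).sum())                          # Eq. 3.4

    def lagrangian_cost(block, prediction, rate_bits, lam):
        return satd_4x4(block, prediction) + lam * rate_bits           # Eq. 3.3: J = SATD + lambda*R

    x = np.arange(16).reshape(4, 4)
    x_hat = x.copy(); x_hat[:, 0] += 1          # small prediction error in the first column
    print(satd_4x4(x, x_hat))                   # 16
    print(lagrangian_cost(x, x_hat, rate_bits=20, lam=0.85))   # 33.0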
Throughout this discussion we have referred to Rate-Distortion (RD), which
expresses the trade-off between the number of bits used to represent the source (rate)
and the reproduction quality achieved (distortion). For the design of a lossy
compression system, the essential trade-off is therefore its R-D performance. To
determine the RD performance we obtain an operational rate-distortion curve, found
by plotting, for each rate, the distortion achieved by the best encoder/decoder pair at
that rate. These operating points are directly achievable with the specific parameter
choices of the implementation; because the number of operating points is discrete,
the operational RD curve is the convex hull of the set of all operating points.
To illustrate the idea of multiple operating points: anyone who has used still JPEG
compression knows that different rate-quality targets can be selected and the image
can still be decoded. When all these rate-quality points are plotted, the operational
RD curve is the convex hull of the resulting set of points. Similarly, in video coding
each frame or scene requires a different rate to achieve a given perceptual quality,
and the encoder needs to control the selection of coding parameters to enable proper
transmission. In all the standards, the encoding task of selecting the best operating
point from a discrete set of options agreed upon a priori by a fixed decoding rule is
often referred to as syntax-constrained optimization, and the selected choice is
communicated by the encoder to the decoder as part of the side information. The
problem, then, is to find the optimal quantizer, or operating point, x(i) for each
coding unit i such that [13]
Σ_{i=1}^{N} r_{i,x(i)} ≤ R_budget .......................................... (3.6)
and some metric f(d_{1,x(1)}, d_{2,x(2)}, ..., d_{N,x(N)}) is minimized, where
R_budget is the total bit budget. This formulation of the problem is termed
budget-constrained allocation. For example, if we are interested in minimizing the
total (or average) distortion, the metric becomes
f(d_{1,x(1)}, d_{2,x(2)}, ..., d_{N,x(N)}) = Σ_{i=1}^{N} d_{i,x(i)} .......................................... (3.7)
Apart from budget-constrained allocation there are other problem formulations, such
as buffer-constrained or delay-constrained allocation, which seek the optimal
quantizers subject to different constraints such as buffer occupancy or buffer size.
These formulations are generally solved using Lagrangian optimization or dynamic
programming. We look at Lagrangian optimization in detail, since we use this
technique in our solution.
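As a preview of how Lagrangian optimization resolves the budget-constrained allocation above, the following toy sketch uses made-up (rate, distortion) operating points; in a real encoder these points come from trial encodings of each coding unit.

    # Each coding unit offers a discrete set of (rate, distortion) operating points.
    units = [
        [(10, 90.0), (20, 40.0), (40, 15.0)],     # unit 1
        [( 8, 70.0), (16, 35.0), (30, 12.0)],     # unit 2
    ]

    def allocate(units, lam):
        # Independently pick, for every unit, the point minimizing D + lambda * R.
        # Sweeping lambda traces out points on the convex hull of the R-D curve.
        choice = [min(points, key=lambda p: p[1] + lam * p[0]) for points in units]
        total_rate = sum(r for r, _ in choice)
        total_dist = sum(d for _, d in choice)
        return choice, total_rate, total_dist

    for lam in (0.5, 2.0, 6.0):
        print(lam, allocate(units, lam))
    # A small lambda favours high rate / low distortion; a large lambda does the opposite.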