Literature DB >> 36268535

OpenVVC Decoder Parameterized and Interfaced Synchronous Dataflow (PiSDF) Model: Tile Based Parallelism.

Naouel Haggui^1,2, Wassim Hamidouche¹, Fatma Belghith², Nouri Masmoudi², Jean-François Nezan¹.

Abstract

The emergence of the new video coding standard, Versatile Video Coding (VVC), has resulted in a 40-50% coding gain over its predecessor HEVC for the same visual quality. However, this is accompanied by a sharp increase in computational complexity. The emergence of the VVC standard and the increase in video resolution have exceeded the capacity of single-core architectures. This fact has led researchers to use multicore architectures for the implementation of video standards and to use the parallelism of these architectures for real-time applications. With the strong growth in both areas, video coding and multicore architecture, there is a great need for a design methodology that facilitates the exploration of heterogeneous multicore architectures, which automatically generates optimized code for these architectures in order to reduce time to market. In this context, this paper aims to use the methodology based on data flow modeling associated with the PREESM software. This paper shows how the software has been used to model a complete standard VVC video decoder using Parameterized and Interfaced Synchronous Dataflow (PiSDF) model. The proposed model takes advantage of the parallelism strategies of the OpenVVC decoder and in particular the tile-based parallelism. Experimental results show that the speed of the VVC decoder in PiSDF is slightly higher than the OpenVVC decoder handwritten in C/C++ languages, by up to 11% speedup on a 24-core processor. Thus, the proposed decoder outperforms the state-of-the-art dataflow decoders based on the RVC-CAL model.

© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022, Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Entities: Chemical

Keywords: Dataflow modeling; OpenVVC; PiSDF; Tiles; VVC

Year: 2022 PMID： 36268535 PMCID： PMC9569024 DOI： 10.1007/s11265-022-01819-7

Source DB: PubMed Journal: J Signal Process Syst ISSN： 1939-8115

Introduction

According to Cisco statistics, video content now accounts for about 82% of the global Internet traffic [1]. This can be explained by the massive use of social media, YouTube, video streaming sites such as Netflix, and video conferencing that has been used extensively since the appearance of Covid-19 for remote work. Due to this fact, the storage requirement of the video traffic and its energy footprint are increasing. Thus, raising the need for a more efficient codec than High Efficiency Video Coding (HEVC) [2]. In 2020, a new video coding standard called Versatile Video Coding (VVC) [3] has been released. This latter offers between 40 and 50% of bitrate reduction compared to its predecessor HEVC thanks to several enhancement methodologies and new tools including Multiple Transform Selection (MTS), Adaptive Loop Filter (ALF), new intra/inter prediction tools, etc. Those tools improve the efficiency of the encoding process, but come with an increase in computational complexity. In order to overcome this problem, a lot of research has been carried out to reduce the computational complexity on both encoder and decoder sides. Some of the optimizations in the VVC standards were based on the use of machine learning, especially in the encoder parts (intra/inter prediction and partitioning) [4], others explored the parallelism between tasks. In fact, since the advent of Moore’s Law, the industry has been moving towards the creation of multicore hardware using parallelism strategies to cope with the increase in computational complexity of modern applications. In the VVC standard, there are three main levels of parallelism: the data level, the frame level, and the high-level parallelism including both tile and slice features. The last few years have seen the emergence of new software decoders compliant with the VVC standard, like the Fraunhofer Heinrich Hertz Institute’s decoder called VVdec [5] and the open software decoder called OpenVVC [6]. These decoders are developed to offer real time decoding capabilities over different platforms. The OpenVVC decoder use the three levels of parallelism resulting in a decoding performance that overcome both the VTM MPEG reference software and the VVdec decoder. Implementing complex signal or image processing applications, especially video coding applications, on embedded multicore architectures is challenging, creating the need for design frameworks and methodologies to accelerate the design process. In this context, Dataflow modeling has been used extensively in the generation of efficient multicore implementations for embedded systems. Dataflow is a powerful model to explore data dependencies and to reveal existing parallelism levels. A single dataflow model gives the possibility to target several heterogeneous architectures (multicore, many-core, and Programmable Logic architectures), using efficiently the available parallelism on the hardware and optimizing the memory allocations. In this context, this paper aims at using the Parallel and Real-time Embedded Executives Scheduling Method (PREESM) [7] tool to create a Parameterized and Interfaced Synchronous Dataflow (PiSDF) [8] model of the VVC decoder in order to automatically generate a multicore algorithm optimizing the execution of tiles on multicore architectures. This work investigates the performance of the parameterized dataflow model (PiSDF) versus the state-of-the-art dynamic dataflow models RVC-CAL. As a matter of fact, RVC-CAL models showed inferior performance compared to C/C++ decoders, which could be explained by either huge time spent on scheduling at runtime or by the data movement overhead. To the best of our knowledge, this is the first work that presents a dataflow implementation for a full decoder compliant with the VVC standard. The rest of the paper is organized as follows. Section 2 gives an overview of dataflow modeling, the VVC standard and introduces the OpenVVC decoder and its parallelism strategies, Sect. 3 is dedicated to the related work. Section 4 introduces the proposed dataflow model for the OpenVVC decoder. In Sect. 5, a comparison between the OpenVVC decoder and the proposed dataflow model while exploring tile-level parallelism is provided. Finally, Sect. 6 is devoted to the conclusion and future works.

Background

Dataflow Modeling

A dataflow model is defined as a set of actors exchanging data information through First In First Out (FIFO)s. The interactions between actors are regulated by a Model of Computation (MoC) that specifies which scheduling strategies can be utilized to fire actors. These interactions influence the system behavior of a dataflow model. In reality, actors are defined as a collection of firing rules that specify the circumstances in which an actor may fire. Dataflow modeling is a methodology that proved to be efficient in computing data dependencies, scheduling and exploring parallelism between the tasks of a process. Moreover, it gives an additional insight into the application and facilitates the detection of any missing items or inadequate details. This last has been used extensively for signal and image processing applications especially while using Digital Signal Processor (DSP) architecture. In addition, Dataflow is widely used for Multiprocessor System-on-Chips (MPSoC) design and programming [9]. There are several toolchains that can be used to create a dataflow model such as SynDEx [10], Open RVC-CAL Compiler (Orcc) [11], and PREESM. In this work, the used tool is PREESM. Compared to the other tools, PREESM allows both the automatic scheduling of tasks and the automatic generation of functional code for heterogeneous multicore embedded systems [7]. In essence, it is an open source rapid prototyping tool that explores the design space of a target system in a way that minimizes its cost and ensures compliance with various constraints, most often latency, throughput, memory and power consumption. In addition, it boosts pluggable features that fit different targets [7]. The PREESM scheduling principle is based on fast scheduling methodology of Kwok [12]. PREESM plugins are generally focused on latency-dominated systems, i.e., systems that ensure throughput constraints are met while fulfilling the latency constraint [13]. Between two application iterations, PREESM uses barriers in order to synchronize all cores of an architecture. Moreover, the code generation is designed to produce self-timed code [14], i.e. static code for each core with automated communication between the cores, cache management and synchronization. PREESM includes different optimizations, one of them is an advanced memory optimization based on a memory exclusion graph [15]. This optimization serves to avoid the preservation of FIFO memory spaces unnecessary for the correct system execution. The simulation of PREESM workflow provides the system designer with a gantt chart that shows the partitioning of the tasks according to the used parallel architecture.

Overview of VVC Decoder

The decoder side of VVC standard has seen a raise in computing complexity approximately two times higher than HEVC standard in All Intra (AI) configuration. Figure 1 presents the structure of the VVC decoder. A decoder takes a bitstream as input and provides a decoded video sequence as output. The VVC decoder is mainly composed by three main processes: entropy decoding, block reconstruction, and the in-loop filters. The entropy decoding is based on the use of Context Adaptive Binary Arithmetic Coding (CABAC) [16]. This latter serves at decoding the binary syntax into syntax elements that feed all other decoder parts. The block reconstruction that is composed from inter/intra prediction, inverse transform and quantization is responsible for constructing the different regions of a frame. In fact, on the encoder side, a frame is divided into different blocks based on the use of tools included in the partitioning part of the encoder. The final step in the decoding process is the in-loop filters. The filters used in the VVC decoder are the inverse Luma Mapping with Chroma Scaling (LMCS), the Deblocking Filter (DBF), the Sample Adaptive Offset (SAO) filter and the ALF filter. Once a picture is entirely decoded it will be stored in the Decoded Picture Buffer (DPB) to be used if needed in the decoding of the next picture (case of inter prediction). More details about VVC standard could be found in [17]. There are three possible parallelism levels in the decoding process, they consist in data level, frame level and high level parallelism: slices/tiles and Wavefront Parallel Processing (WPP). Data level is based on the use of Single Instruction Multiple Data (SIMD) instructions. SIMD optimizations are generally used in applications that perform the same operation on multiple data points simultaneously which means operations that use vectors and matrices. In video coding field, there are several tasks that could take benefits from the SIMD instructions such as ALF filter, the transforms, etc. Data level parallelism has shown its efficiency in minimizing the computational complexity of the video coding process. Authors in [18] have leveraged the use of SIMD instructions to minimize the decoding time of the HEVC decoder. Effectively, the experimental results showed that the optimized decoder performs more than 4 times faster than the HM 4.0 decoder. Frame-level parallelism relies on frames that can be independently encoded/decoded. In this case, each frame is associated with a thread with a synchronisation module to manage the inter dependencies. In video coding standards, a frame can be divided into different regions. Some of these regions are called tiles and slices. The partitioning of an image into tiles and slices was first introduced in HEVC [19]. Tiles are rectangular regions of a frame containing entire CTUs. The prediction dependencies between tiles boundaries are broken and the entropy decoding is reinitialized for each tile. This fact, enables different large regions of a frame to be processed independently.

Figure 1

VVC decoder block diagram.

OpenVVC Decoder

The OpenVVC decoder has been developed in IETR laboratory at INSA Rennes. The decoder was created based on the VVC standard and with the use of C programming language. It has been compiled as a cross-platform library and implemented as a dynamic library in FFmpeg player [20]. OpenVVC is capable of real time decoding and supports all the parallelism levels previously detailed in Sect. 2.2: data level parallelism, frame level parallelism, and tile parallelism. The decoding process of the OpenVVC decoder follows the same process of the VVC decoder. It starts with parsing the global parameters sets which are the Sequence Parameter Set (SPS), Picture Parameter Set (PPS), Picture header (PH), and Slice header (SH). This process helps gathering information from the different stages of a frame (slice, tile, sequence, picture) in order to decode it. After that, block reconstruction is applied at the CU level. When all CUs in the CTU are fully reconstructed, the DBF filter is applied at CTU level. Finally, the SAO, ALF and CC-ALF filters are applied at a CTU line level. Applying the filters at the CTU lines improves frame-level parallelism in the inter-configuration compared to processing the in-loop filters after the entire frame is reconstructed. Compared to the other available decoders developed based on the VVC standard such as VTM decoder and the Fraunhofer’s decoder VVdeC, OpenVVC offers higher decoding rate with lower memory consumption especially in AI configuration. More detailed information about OpenVVC and the most consuming part of the decoding process could be found in [6].

Related Work

Usually, the most used approaches in creating a parallel algorithm for multicore processors are OpenMP [21], OpenCL [22] and pthread. The development of decoders using these approaches requires months of algorithmic development and then it is necessary to study the parallelization on each of the target architectures, which can be fast on homogeneous architectures with shared memories with the use of OpenMP for example, but much longer on heterogeneous architectures like FPGA and MPSoC. Besides, these approaches are based on the use of C/C++ programming languages which have many shortcomings and are not suitable for hardware design. As a result, implementing complex signal or image processing applications, especially video coding applications, on embedded multicore architectures has become increasingly difficult. Due to this fact, many researchers have focused on creating a library based on dataflow oriented language CAL [23] for the video coding components. This library is called Reconfigurable Video Coding (RVC) [24]. This framework provides a new specification formalism for the design of video codecs in a way that promotes flexibility and reuse [25]. RVC-CAL [26] actor language has been used in the modeling of several video coding standards such as MPEG-4 [27] and HEVC [28]. Authors in [25] have demonstrated that using a dataflow model facilitates code generation for hardware devices. Studies in [28] and [29] used RVC-CAL dataflow models in the creation and implementation of a multicore algorithm for the HEVC decoder. Although the proposed dataflow model for the HEVC decoder does not exceed the performance of other HEVC compliant decoders such as OpenHEVC [30], the advantages of RVC-CAL remain undeniable. Indeed, RVC-CAL code is cross-platform, and compilers such as Orcc [11] and OpenDF [31] can automatically produce many languages from a single description, including C, C++, C-HLS, Verilog, etc. Since the creation of the hardware is ultimately complete and the validation of the application can be performed in a software context, this capability is crucial for hardware developers because it speeds up prototyping. Dataflow modeling has also been tested on some parts of the latest video coding standard VVC and proved to be very efficient. For instance, in [32] a dataflow implementation for MTS concept has been created. The result showed that with choosing coarse grain granularity for x86 architecture, the dataflow model offers a speed-up close to the theoretical result. In [33], a comparison between a dataflow implementation for the MTS concept and the OpenMP method showed that, with the use of the dataflow model, a better speedup was achieved. This evidence proves that, compared to the state-of-the-art approaches used to create a multicore algorithm for embedded systems, dataflow modeling could achieve better performance in a much shorter development time. In addition, dataflow circuits are necessary for efficient C-to-circuit translation of any software program, as they are capable of handling variable latencies and erratic memory dependencies. Many HLS compilers have included dataflow modeling to generate an FPGA design in which all tasks are pipelined and executed concurrently [34, 35].

The Proposed Model

The development of a multicore algorithm such as the OpenVVC decoder is time-consuming because it requires a great deal of knowledge of the software and the target architecture. Added to that, when modifying the target architecture, other constraints must be taken into account. To reduce time to market, dataflow modeling is commonly used. It facilitates the mapping of tasks according to the available cores and the fast generation of a multicore algorithm. Indeed, dataflow modeling is widely used to implement an application on a multicore architecture. This stems from the fact that it is efficient for calculating data dependencies and scheduling tasks without having a deep knowledge about the hardware. From this context, this paper aims at studying the efficiency of dataflow modeling in the exploration of tile parallelism. For the purpose of creating a dataflow model for the OpenVVC decoder that explores the parallelism between tiles with the use of PREESM, the following steps must be performed: After implementing the OpenVVC as an external library in PREESM and identifying the functions responsible of decoding the tiles, an algorithm graph (Fig. 2) is created. The Fig. 2 presents the global algorithm graph of the proposed model. This last is composed of 18 FIFOs and 12 actors in total. Some actors are responsible for the duplication or initialization of variables and others for the decoding process. The latter are the following five actors:As the tile parallelism is the focus of this paper, the decoding of the picture ( actor) itself has been decomposed into several actors. The Fig. 3 represents the algorithm graph associated with the actor . The latter is composed of 25 FIFOs and 13 actors in total. The main actors in this graph are:To ensure the tile parallelism, the tiles actor is duplicated based on the information provided by the parameter. For example, if the is equal to N, the tiles actor will be duplicated N times and those N tasks will be executed in parallel. In other words, the duplication process is done automatically according to the production and consumption rate of each actor. If an actor A has a production rate equal to N for example while actor B has a consumption rate equal to one, PREESM will duplicate actor B N times.

Figure 2

PiSDF model for the OpenVVC Decoder.

Figure 3

The inside of the actor.

Implement the OpenVVC project as an external library. Disable the use of all functions developed using pthread in the OpenVVC project to explore parallelism between tiles and between frames. Provide the necessary inputs for PREESM. Essentially, in order to create the dataflow model and create the multicore algorithm, three elements need to be created first. The algorithm graph: a graph presenting a set of actors exchanging data through FIFOs and aims to provide a clear and complete description of the process. The actor description could be hierarchical ( i.e. an actor could contain other actors). The algorithm graph is created using a type of dataflow model known as PiSDF. The architecture graph: the kind of architecture graph used in PREESM is called System Level Architecture Model (S-LAM). It is a set of cores linked to a shared memory in order to communicate. The scenario: an item of PREESM which includes all the information necessary for the execution of the workflow which is a set of tasks responsible for the generation of the code, the display of the Gantt chart and the use of optimizations. The scenario contains the path of the algorithm graph, the path of the architecture graph, and the execution time of each actor. : this actor serves at associating the bitstream to be decoded to the decoder and initializing the decoder and the dumuxer. Then it will send the information about the initialized decoder and dumuxer to the other actors through the variable . : this actor is responsible for extracting parts of the bitstream more specifically, the Network Abstraction Layer (NAL) units. The extracted information will be sent to the actor via pu variable which is a vector of NAL units. : this actor passes the sub-stream to the input of the decoder and starts decoding the picture. : this actor polls the decoder in order to see if the next image in the display order is available and can leave the decoder, and places the next image in the display order on the output. : this actor closes the file and resets the dumuxer to zero. : selects the slices to be decoded. : decodes the tiles of a frame. : decodes other types of NAL units (that don’t contain tiles information). PiSDF model for the OpenVVC Decoder. The inside of the actor.

Experimental Results

This section presents the experimental setup and the different results obtained using the proposed dataflow model in AI configuration.

Experimental Setup

The experiments were performed using PREESM version 3.20.0 on Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz with 24 cores under Ubuntu 18.04 and using various video sequences that are detailed in Tables 1, and 2. The video sequences used are FHD (class B, 19201080 pixels) and UHD (class A, 38402160 pixels) included in the Common Test Conditions (CTC) [36], as these are the sequences that have the highest resolution and take a long time in the decoding process. Actually, with the grown complexity of VVC standards there is a raise of critical issue basically for high resolution sequences. Moreover, with high resolution comes high granularity of actors, which is an important factor to achieve a good performance while using x86 architecture, as proven in previous works [32, 33]. To be precise, in [32] a study was made to investigate the effect of the granularity level on the performance of an application using the x86 architecture. The study showed that it is important to use high granularity when using x86 architecture to get a good result. Based on this study, high resolution sequences are used so that each actor performs a large number of computations and the synchronization time becomes negligible compared to the execution time of the actors. In addition, while decoding a tile, each thread requires its own filter buffers and local context structure. The main frame buffer and global context structure, on the other hand, are shared by all threads, which may result in a buffer overhead. But since the sizes of the filter buffers and the local context structure are both small in comparison to the size of the main image buffer, buffer overhead is assumed to be negligible for high-resolution sequences [37]. The used video sequences are encoded with three tile configurations (four horizontal by three vertical splits), , and . The same encoder configuration, without tiles, is used as a reference to calculate the BD-Rate objective measurement [38]. The BD-Rate is an objective measure applied in video compression to compare the bitrate distortion performance or compression efficiency of two different video codecs or different parameters of the same video codec over a range of bitrate or quality values. Tables 1 and 2 show that as the number of tiles increases, there is an increase in the BD-Rate. Hence, a bitrate overheat at same quality. In fact, the study of [39] showed that the quality of coding decreases with the increase of the number of independent regions in the image. This is due to the fact that intra prediction dependencies across tiles boundaries are broken and that the entropy coding is initialized at each tile, resulting in a loss of coding quality compared to the unpartitioned sequence coding.

Table 1

Benchmarks of UHD (38402160) sequences used in the experiment.

	BD-Rate
Sequence	12 Tiles \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(6\times 2)$$\end{document}(6×2)	24 Tiles \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(6\times 4)$$\end{document}(6×4)
DaylightRoad2	0.88%	1.26%
Campfire	0.84%	1.33%
CatRobot	0.91%	1.76%
ParkRunning	0.29%	0.48%
Tango2	1.79%	2.84%
FoodMarket	1.32%	2.13%

Table 2

Benchmarks of FHD (1920x1080) sequences used in the experiment.

	BD-Rate
Sequence	12 Tiles \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(4\times 3)$$\end{document}(4×3)	24 Tiles \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(6\times 4)$$\end{document}(6×4)
BQTerrace	0.77%	1.27%
Cactus	1.08%	1.79%
BasketballDrive	1.80%	3.08%
MarketPlace	1.10%	1.84%
RitualDance	1.49%	2.45%

Benchmarks of UHD (38402160) sequences used in the experiment. Benchmarks of FHD (1920x1080) sequences used in the experiment. In order to provide a fair comparison between the openVVC decoder and the proposed dataflow model, the frame level parallelism is disabled in the original OpenVVC decoder. The comparison will be made based only on tiles level parallelism. The bitstreams used in this work are generated with VTM12.0 encoder. The input of an encoder is a configuration file that contains various information about the sequence to be encoded such as the number of frames, the type of configuration (AI, Random Access, Low Delay), etc. In this paper, the configuration file has been modified to allow the decomposition of a frame into tiles and to set the number of tiles and the location of their boundaries. Actually, tiles offer flexible classification of CTUs. Besides, tiles provide an excellent coding efficiency and a favored correlation of pixels over slices as they do not have a header information [40]. The performance of both OpenVVC decoder and the proposed model are assessed at various bit-rates, obtained with QP values of {22, 27, 32, 37} following the CTC.

Results Analysis

Figures 4 and 5 showcase the speed-up of both the proposed model and the OpenVVC decoder while using UHD and FHD sequences encoded with 12 tiles respectively. Results showed that speed gains of 6 and 7 are achieved using 12 cores for a bitstream consisting of 12 tiles for FHD and UHD sequences, respectively. For FHD and UHD sequences with 24 tiles, speed gains of 9 and 11 on 24 cores are achieved respectively as illustrated in Figs. 6 and 7). For the results obtained using FHD sequences, the speeds of the proposed model and the OpenVVC decoder shown in Figs. 5 and 7 are not too close using four cores. However, they are closer using UHD sequences, which can be explained by the fact that as the resolution of the sequence increases, the actor has more computational process, so the granularity increases and the synchronization time becomes negligible compared to the actor execution time. Ideally, for a fully parallelizable process, a speedup equal to 12 is expected while using 12-core architecture. However, the decoding process is not 100% parallelizable. In reality, many parts of the decoding process are more complex than decoding a tile and are performed sequentially, making them more time consuming. For example, according to the study presented in [6], loop filters account for more than half of the decoding complexity in an AI configuration. This fact is one of the main reasons that explain the difference between the theoretical and experimental results, since the loop filters are executed sequentially after the decoding of a frame. Another factor that explains the discrepancy between the theoretical case and the obtained result is the fact that the tiles are not of equal sizes. For example, Fig. 8 shows the size inequality between 12 tiles for a class A sequence. An image is a set of CTUs and the details of an image change from one pixel to another. Consequently, the amount of detail in each CTU is different. It is thereby different for each tile. The decoding time is therefore not the same for all tiles as shown in Fig. 9. In fact, this figure details the decoding time in percentage for each tile presented in a frame of a class B sequence (BQTerrace). This decoding time imbalance between tiles has less effect on UHD sequence than on FHD sequence as the former’s frame size is larger and it grows with the number of tile contained in the frame.

Figure 4

Speed-up of the proposed model and of the OpenVVC decoder while using UHD sequences encoded with 12 tiles.

Figure 5

Speed-up of the proposed model and of the OpenVVC decoder while using FHD sequences encoded with 12 tiles.

Figure 6

Speed-up of the proposed model and of the OpenVVC decoder while using UHD sequences encoded with 24 tiles.

Figure 7

Speed-up of the proposed model and of the OpenVVC decoder while using FHD sequences encoded with 24 tiles.

Figure 8

Tiles partitioning for Class A ().

Figure 9

Per tile decoding time in percentage for a frame of BQTerrace sequence.

Although the experimental result of the proposed dataflow model is very close to that of the OpenVVC decoder and does not significantly outperform it, dataflow modeling remains a design methodology of great importance. The state-of-the-art approaches used for the creation of a multicore algorithm necessitate a lot of development time compared to dataflow modeling as they require deep knowledge of both hardware and software. For example, as real-time codecs typically offer fine-grain parallelism, architectures like Field Programmable Gate Arrays (FPGA) are perfect targets for implementation. However, the programming languages used to program FPGA, such as VHDL or Verilog, require digital design expertise and result in an increase in time to market [41]. In a nutshell, using dataflow modeling saves development time. Also, compared to other approaches, dataflow modeling presents a suitable framework that allows switching from one architecture to another in a flexible way, as a dataflow model can target different architectures. Indeed, a dataflow model could target both shared and distributed memory architectures without changing the developed model. However, in the case of the OpenVVC decoder that was developed using pthread and with the knowledge of software and hardware, there is a possibility of presenting memory storage problems when switching from one architecture to another. Indeed, due to the large number of memory allocations included in OpenVVC, portability on systems with heterogeneous architecture can be challenging. Speed-up of the proposed model and of the OpenVVC decoder while using UHD sequences encoded with 12 tiles. Speed-up of the proposed model and of the OpenVVC decoder while using FHD sequences encoded with 12 tiles. Speed-up of the proposed model and of the OpenVVC decoder while using UHD sequences encoded with 24 tiles. Speed-up of the proposed model and of the OpenVVC decoder while using FHD sequences encoded with 24 tiles. Tiles partitioning for Class A (). Per tile decoding time in percentage for a frame of BQTerrace sequence. Decoding time on single core of UHD sequences, QP = 22. One other major result up for debate is the time spent by the dataflow decoder in the decoding process on a single core. In fact, previous dataflow models for decoders have shown promising performance on multicore, but on a single core, the performance is degraded compared to the reference decoders. In [29] a DSP based decoder conforming to the HEVC standard has been implemented using the RVC-CAL model. The presented RVC-CAL decoder is 50% less efficient than HM9.0 and five times less efficient than OpenHEVC, while using a single core. Similarly, in [28], a dataflow model for a full decoder that explores parallelism between the YUV component showed a decrease in framerate (FPS) compared to the reference, and the OpenHEVC decoder outperformed both the reference and the presented model while using a single core architecture. Nevertheless, the PiSDF model presented for the OpenVVC decoder outperformed the reference software decoder (VTM12.0) and has a similar decoding time as OpenVVC on monocore as shown in Table 3. This result proves the efficiency of the PiSDF model compared to the decoder models that have been implemented using RVC-CAL. In fact, dynamic dataflow models such as RVC-CAL perform scheduling at runtime which slows down the decoding process. However, parameterized models such as PiSDF used in this work are predictable, allowing optimizations at compile time. Consequently, the scheduling time is negligible compared to the decoding process.

Table 3

Decoding time on single core of UHD sequences, QP = 22.

Sequence	Decoding time (s)
	VTM12.0	OpenVVC	Dataflow model
DaylightRoad2	42.62 ± 0.205	15.39 ± 0.375	15.10 ± 0.042
Campfire	34.32 ± 0.571	11.55 ± 0.405	11.65 ± 0.67
CatRobot	34.65 ± 0.265	11.15 ± 0.091	10.95 ± 0.04
ParkRunning	35.86 ± 0.078	12.97 ± 0.078	12.71 ± 0.046
Tango2	23.66 ± 0.054	8.32 ± 0.053	8.07 ± 0.082
FoodMarket	21.49 ± 0.06	7.32 ± 0.059	7.19 ± 0.037

Conclusion

This paper investigates the efficiency of using dataflow modeling in creating a model for a complete decoder based on the VVC standard. The tool used to create the dataflow model is called PREESM. This tool allows the automatic scheduling of tasks according to the number of used cores and the automatic generation of multicore algorithms. The proposed model is created to make explicit and explore the parallelism between tiles that are a rectangular region of a frame encompassing entire CTUs. Parallel tile decoding is allowed due to the fact that prediction dependencies between tile boundaries are broken and entropy decoding is initialized for each tile. The results showed that, unlike the state-of-the-art dataflow decoders based on the RVC-CAL model, the proposed model still performs well on a single-core architecture compared to decoders developed with C/C++ languages. In addition, it achieves a competitive speed-up to that of the OpenVVC decoder. Whereas the speed improvement is not very significant, the relevance dataflow comes from the fact that one model can target different architectures and dataflow modeling is an effective framework that facilitates the transition from software to hardware. Moreover, creating a multicore algorithm using dataflow modeling requires a shorter development time compared to state-of-the-art approaches such as OpenMP and OpenCL. In the near future, we aim to adopt task-based code generation and execution, a method proposed in [42], to automatically balance the differences of tile decoding times and improve our proposed model. It is also planned to improve the proposed model to support frame-level parallelism. Furthermore, it is planned to implement the proposed dataflow model on heterogeneous architectures such as ARM BIGlittle, FPGA, and Kalray.

2 in total

1. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems.

Authors: John E Stone; David Gohara; Guochun Shi
Journal: Comput Sci Eng Date: 2010-05 Impact factor: 2.080

2. DeepQTMT: A Deep Learning Approach for Fast QTMT-based CU Partition of Intra-mode VVC.

Authors: Tianyi Li; Mai Xu; Runzhi Tang; Ying Chen; Qunliang Xing
Journal: IEEE Trans Image Process Date: 2021-05-31 Impact factor: 10.856

2 in total