FPGA Implementation of Ultra-High Speed and Configurable Architecture of Direct/Inverse Discrete Wavelet Packet Transform Using Shared Parallel FIR Filters



State of the art
Wavelet transforms are often regarded as the successor to the Fourier transform, since they allow a signal to be analyzed jointly in the time and frequency domains.
From a historical point of view, the wavelet field was founded by J. Morlet, who in 1980 developed a new transform called the "Wavelet Transform (WT)" to overcome the limitations of the various versions of the Fourier transform when moving a signal analysis from the time domain to the frequency domain [1,2]. The numerical application of the wavelet transform was born with the work of Morlet and Grossman [3], while the idea of the wavelet seems to originate in the work of Gabor and Neumann in the late 1940s. Mallat [1,4], Meyer [5], Chui [6], Daubechies [7] and others contributed the pioneering work reported in the early monographs.
The discrete version of the wavelet transform, called the "Discrete Wavelet Transform (DWT)", combines the notions of time and frequency scaling using a compact kernel (a small wave, hence "wavelet"). Mallat [1] generalized the discrete wavelet transform to introduce more flexibility in the time scale of data analysis, which led to the concept of the Discrete Wavelet Packet Transform (DWPT). [...] restrains the use of DWPT in many application domains, which leads us to develop a new hardware architecture that improves the performance of hardware implementations of DWPT/IDWPT. The major challenges for modern systems can be summarized as: (i) high throughput rates (growing volumes of processed data, especially in multimedia applications) and (ii) low-cost hardware (a limited amount of hardware resources) while still providing high performance.

Related works
Abundant research can be found in the literature on hardware implementations of the wavelet packet transform. The first work on a hardware implementation of the DWT was recorded in 1994 [8], when Denk and Parhi described an orthonormal DWT architecture that used a lattice Quadrature Mirror Filter (QMF) structure and digit-serial processing techniques. Similarly, Hatem et al. [9] proposed a parallel/sequential QMF architecture in VLSI for the DWT, to reduce the overall number of multipliers. Research in this area developed from there.
Since the DWPT retains all the advantages of the discrete wavelet transform, it has become an essential standard tool for signal, image and video processing. For this reason, the DWPT has attracted a lot of research attention. Notably, most implementations of DWPT are based on the concept of FIR (Finite Impulse Response) filter banks [10]. Many works in the literature address the implementation of DWPT/IDWPT on hardware devices such as processors, GPUs (Graphics Processing Units) or FPGAs, to take full advantage of the power and flexibility offered by the DWPT. In this context, the first implementation of the DWPT on a processor appeared in 1999 [11], but in recent years the FPGA has become the most popular target technology for DWPT and IDWPT. However, the implementation of DWPT/IDWPT (based on FIR filter banks) has many caveats, which make it hard to achieve a high throughput rate and low area consumption simultaneously.
To meet these challenges, many works implement the DWPT or IDWPT on FPGA. In the work of Wu and Hu [12], the authors implemented DWPT/IDWPT using Embedded Instruction Codes (EIC) and a fixed number of multipliers and adders for the symmetric filters (two multipliers and four adders). In [13], an IDWPT architecture was presented based on the classical recursive pyramid algorithm (RPA), using the lifting transformation and polyphase decomposition. In [14], a fast and configurable DWPT was presented based on FIFO inputs, dual-port memory, multipliers and adders.
An influential architecture of DWPT, implementing the tree algorithm, was presented in [15]; its FPGA implementation was discussed in [16]. In [17], the same authors developed pipelined architectures of the direct and inverse DWPT based on two modes: serial-word mode and parallel-word mode. In [18], Farahani and Eshgho presented a word-serial pipeline architecture of the DWPT/IDWPT transforms using a parallel filter structure. To speed up the transform, the same authors updated their architectures by using high-pass and low-pass filters. In addition, they implemented the wavelet packet transform on FPGA using three types of multipliers [20].
To increase performance, a VLSI architecture of DWPT based on a frame-partition architecture was presented in [21]. In [22], the authors implemented a flexible architecture of DWPT and IDWPT based on a register interface and a multiplexing structure.
An algebraic-integer architecture of the Daubechies DWPT was presented in [23]; this architecture was developed to compute the one- and two-dimensional Daubechies order-6 wavelet (Db6) transform with an adder-less design.

Contributions
This rich literature on DWPT and IDWPT reveals the importance of wavelet transforms in current and future applications, which motivates us to dig deeper into the DWPT/IDWPT research area. Furthermore, most recent approaches offer advantages in terms of flexibility, scalability, reliability, or low cost, but not all of them together within the same architecture.
Our aim is therefore to develop a new generic and configurable architecture (at synthesis and post-synthesis) of DWPT and IDWPT that provides ultra-high-speed data processing with minimum hardware usage. We proposed our first pipelined and configurable architecture of DWPT in 2015 [24] and of IDWPT in 2016 [25], based on the Mallat binary tree scheme [4] and without any parallelization. We successfully transformed the exponential DWPT/IDWPT algorithm, based on the classic Mallat binary tree, into a linear algorithm that preserves the sample processing speed while reducing hardware usage through smart management and interleaving of data along the depth of the transform.
To increase the throughput, we propose in this paper massively pipelined-parallel architectures of DWPT (P-DWPT architecture) and of IDWPT (P-IDWPT architecture). Moreover, we propose a massively pipelined-parallel DWPT-IDWPT architecture with extensive sharing of resources: the core of the architecture shares hardware resources between the P-DWPT and P-IDWPT transforms, and also shares computational resources (multipliers and adders) between the approximation and detail functions of the filter banks.
These architectures are generic and fully configurable with respect to the tree depth, the filter order, the quantization, and the selected processing method (P-DWPT or P-IDWPT).
To evaluate the performance of our architectures, we study the effect of four parameters (degree of parallelization, transformation depth of DWPT and/or IDWPT, order of the low-pass and high-pass filters, and quantization order) on the operating frequency or throughput rate (computational performance) and on the resource consumption (hardware cost). Furthermore, we compare our synthesis results with those existing in the literature.

Paper organization
The rest of this paper is organized as follows. Section II presents a brief overview of wavelet packet transform concepts and the main challenge of parallelization in our context. Section III presents our proposed pipeline-parallel architectures and the related design paradigms. We then summarize the results obtained for these architectures on FPGA. Finally, we close with the comparison, the conclusion and the perspectives.

Background
From a mathematical point of view, the wavelet packet transform decomposes a signal into two sub-signals: a detail signal and an approximation signal. Several methods exist to compute the transform; the best known is the multi-resolution (or multi-rate) analysis algorithm proposed by Mallat [1], called the "Mallat binary tree" or "pyramid algorithm" of the wavelet packet transform. This multi-resolution analysis is based on a cascade of low-pass and high-pass digital FIR filters. In Wavelet Transform (WT) theory, each signal $x(t)$ can be represented by projecting it onto a series of scaled and translated functions $\psi_{s,\tau}(t)$. The original function $\psi(t)$ is called the "Mother Wavelet", and all scaled and translated functions are obtained by translation and dilation of the mother function, as presented in equation (1):

$$\psi_{s,\tau}(t) = \frac{1}{\sqrt{|s|}}\,\psi\!\left(\frac{t-\tau}{s}\right) \qquad (1)$$

where $s \in \mathbb{R}^{*}$ is the scale and $\tau \in \mathbb{R}$ is the translation.
The value of $s$ directly affects the bandwidth of the mother wavelet. When $s$ is less than 1 ($s < 1$), the wavelet variance decreases and the basis function contracts (used to analyze high-frequency signals). For $s \geq 1$, the wavelet variance increases and the basis function stretches (used to analyze low-frequency signals).
The Continuous Wavelet Transform (CWT) of $x(t)$ is defined as:

$$W(s,\tau) = \int_{-\infty}^{+\infty} x(t)\,\psi^{*}_{s,\tau}(t)\,dt \qquad (2)$$

where $*$ denotes complex conjugation.
From wavelet theory, the CWT is highly redundant, which makes it impractical. To solve this problem, in 1992 Daubechies [26] derived the discrete wavelet transform (DWT) from the CWT by discretizing the scale and translation variables (taking $s = s_0^{-j}$ and $\tau = k\,\tau_0\,s_0^{-j}$), as shown in equation (3):

$$\psi_{j,k}(t) = s_0^{j/2}\,\psi\!\left(s_0^{j}\,t - k\,\tau_0\right) \qquad (3)$$

where $j, k \in \mathbb{Z}$ and the number of wavelet coefficients is finite. To discretize the scale and translation variables we use dyadic sampling, that is, we usually choose $s_0 = 2$ and $\tau_0 = 1$ (which gives dyadic sampling in time). Discretizing the translation and dilation/contraction parameters of the wavelet in (2) in this way leads to the DWT presented in (4):

$$\psi_{j,k}(t) = 2^{j/2}\,\psi\!\left(2^{j}\,t - k\right) \qquad (4)$$

Another important function, the "scaling function" $\phi(t)$, is similar to the wavelet function $\psi(t)$. According to multiresolution wavelet theory, the signal is decomposed into finer and finer detail (in multiresolution steps). As seen in equation (5), $\phi(t)$ also has two integer subscripts, $j$ and $k$:

$$\phi_{j,k}(t) = 2^{j/2}\,\phi\!\left(2^{j}\,t - k\right) \qquad (5)$$

where $j$ specifies the magnitude $2^{j/2}$ as well as the scale $2^{j}$ of the function, and $k$ specifies the position (integer location, translation or shift) of the function.
In fact, in signal analysis theory the DWT of equation (4) can be implemented in a different way, using the concept of a non-uniform filter bank. The filtering of the input signal $x(t)$ is done iteratively and generates two separate sub-signals with different spectra: the upper half of the spectrum contains the high-frequency component of the input signal, which is analyzed by the wavelet function, and the lower half contains the low-frequency component of the same input signal, which is passed on to the next stage. The coefficients of the first filtered signal are called detail coefficients, and those of the second are called smooth coefficients. Classically, both the low-pass digital filter $h$ and the high-pass digital filter $g$ are obtained from the scaling function and its corresponding mother wavelet.
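We assume that $h$ and $g$ are non-recursive FIR filters of length $L$; their transfer functions can be represented as follows:

$$H(z) = \sum_{n=0}^{L-1} h(n)\,z^{-n} \qquad (6)$$

$$G(z) = \sum_{n=0}^{L-1} g(n)\,z^{-n} \qquad (7)$$

where $z^{-1}$ denotes a delay of one sampling period.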
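As a minimal sketch of what the transfer functions (6) and (7) compute in the time domain, the direct-form FIR below evaluates $y[n] = \sum_k c(k)\,x[n-k]$ for a coefficient list. The function name `fir_filter`, the zero initial state and the Haar pair used as the example filters are illustrative assumptions, not details of the architecture described in this paper.

```python
import math


def fir_filter(x, coeffs):
    """Direct-form FIR: y[n] = sum_k coeffs[k] * x[n - k], zero initial state."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:  # samples before the start are taken as zero
                acc += c * x[n - k]
        y.append(acc)
    return y


# Haar pair, used here only as an example of a low-pass / high-pass couple
h = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # low-pass
g = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # high-pass

x = [4.0, 4.0, 4.0, 4.0]
low = fir_filter(x, h)   # a constant input passes through the low-pass branch
high = fir_filter(x, g)  # and is cancelled by the high-pass branch after the first sample
```

Once the filter state has filled up, the constant input appears only in `low`, while `high` is zero, which is the low-pass/high-pass split that the filter bank relies on.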
The orthogonal multiresolution decomposition of the DWT can be carried out efficiently using the Mallat tree algorithm, or pyramid algorithm [1]. This structure constitutes a bank of QMF analysis filters, where the scaling functions and wavelet functions are realized using the scale relations (6) and (7) and can be given as: [...] forms the set of scaling functions and their corresponding wavelets. The suffix [...] denotes the number of wavelets and is called the multiplicity.
According to wavelet theory based on the filter bank concept, any arbitrary signal $x(t)$ can be expanded into a sum of scaling and wavelet functions. The discrete wavelet transform of a target signal $x(t) \in L^{2}(\mathbb{R})$ is given by:

DWPT under focus
The DWPT decomposition at each resolution level, based on the Mallat binary tree, can be represented as a tree. Figure 1 illustrates a general DWPT based on the filter bank concept.
The input signal $x_{0,0}[n]$ at the input level is decomposed into a low-frequency signal (smooth or approximation coefficients $x_{1,0}[n]$) and a high-frequency signal (detail coefficients $x_{1,1}[n]$) at level 1, using the low-pass and high-pass filters respectively. The data path is then down-sampled by a factor of two (giving half the size of the original signal), and so on for the other levels (the number of levels equals the depth of the Mallat tree algorithm). Overall, writing $x_{l,p}[n]$ for the coefficients of node $p$ at level $l$, the wavelet coefficients at the different levels are derived as follows:

$$x_{l,2p}[n] = \sum_{m=0}^{L-1} h(m)\,x_{l-1,p}[2n-m] \qquad (11)$$

$$x_{l,2p+1}[n] = \sum_{m=0}^{L-1} g(m)\,x_{l-1,p}[2n-m] \qquad (12)$$

where $p = 0, 1, \dots, (2^{(l-1)} - 1)$ and $h(n)$ and $g(n)$ are the low-pass and high-pass filters, respectively. Considering the non-recursive FIR (Finite Impulse Response) implementation of length $L$ for the $h$ and $g$ filters, the corresponding transfer functions can be represented as in equations (6) and (7). The number of filter coefficients depends on the mother wavelet, whose choice in turn depends on the application and its required properties. Examples are the Daubechies family, where the filter length is $2N$ (the order $N$ is a strictly positive integer), and the Coiflets family, where the filter length is $6N$ (the order $N$ is a strictly positive integer).
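The tree decomposition described above can be sketched in software as follows. This is a behavioral illustration only: the function names `analysis_step` and `dwpt`, the periodic extension of the signal at its borders, and the Haar coefficients are assumptions made for the example, and the sub-bands are returned in natural tree order.

```python
import math


def analysis_step(x, h, g):
    """Split one node: filter with h and g, keep every second output sample.

    Uses y[n] = sum_m c[m] * x[(2n - m) mod N]; the periodic extension is an
    assumption made here to avoid border effects.
    """
    N = len(x)
    lo = [sum(h[m] * x[(2 * n - m) % N] for m in range(len(h))) for n in range(N // 2)]
    hi = [sum(g[m] * x[(2 * n - m) % N] for m in range(len(g))) for n in range(N // 2)]
    return lo, hi


def dwpt(x, h, g, depth):
    """Full wavelet packet tree: returns the 2**depth sub-bands of the last level."""
    nodes = [list(x)]
    for _ in range(depth):
        next_nodes = []
        for node in nodes:  # every node is split, unlike in the plain DWT
            lo, hi = analysis_step(node, h, g)
            next_nodes += [lo, hi]
        nodes = next_nodes
    return nodes


h = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # Haar low-pass (example choice)
g = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # Haar high-pass (example choice)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # length must be divisible by 2**depth
subbands = dwpt(x, h, g, depth=2)
```

With a depth of 2 the eight input samples end up in four sub-bands of two samples each, so the total number of coefficients per level stays equal to the input length.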

IDWPT under focus
From wavelet theory, the original signal decomposed by the direct DWPT is reconstructed using the inverse wavelet packet transform (IDWPT). As in the direct transform, the reconstruction is performed iteratively. This means that, for each pair of coefficients at level $l+1$ of the tree, we can calculate the wavelet packet coefficients at the previous level, as shown in equation (13). A sample-level representation of equation (13) is given in Figure 2.
For example, to reconstruct the coefficient $x_{l+1,0}$ at level $l+1$, we use the approximation coefficient $x_{l,0}$ and the detail coefficient $x_{l,1}$ at the previous level $l$. Reconstruction is performed by inserting zeros and convolving the result with the reconstruction filters. To reconstruct the original signal perfectly (as it was before the DWPT), it is sufficient to use QMF filters that satisfy the following relations: [...] where $\bar{h}$ and $\bar{g}$ are the low-pass and high-pass reconstruction filters for the IDWPT.
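The zero-insertion and filtering step can be illustrated with a one-level round trip. The sketch below assumes orthonormal filters with periodic signal extension, in which case the reconstruction filters are simply the time-reversed analysis filters; this choice, the function names `analysis_step` and `synthesis_step`, and the Haar coefficients are assumptions made for the example, not the relations used in this paper.

```python
import math


def analysis_step(x, h, g):
    """Decomposition: y[n] = sum_m c[m] * x[(2n - m) mod N] for c = h and c = g."""
    N = len(x)
    lo = [sum(h[m] * x[(2 * n - m) % N] for m in range(len(h))) for n in range(N // 2)]
    hi = [sum(g[m] * x[(2 * n - m) % N] for m in range(len(g))) for n in range(N // 2)]
    return lo, hi


def synthesis_step(lo, hi, h, g):
    """Reconstruction written as the adjoint of analysis_step.

    This is equivalent to inserting zeros between the coefficients and
    filtering with the time-reversed filters (valid for orthonormal h, g
    of equal length).
    """
    N = 2 * len(lo)
    x = [0.0] * N
    for n in range(len(lo)):
        for m in range(len(h)):
            x[(2 * n - m) % N] += h[m] * lo[n] + g[m] * hi[n]
    return x


h = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # Haar low-pass (example choice)
g = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # Haar high-pass (example choice)

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
lo, hi = analysis_step(x, h, g)
x_rec = synthesis_step(lo, hi, h, g)  # matches x up to rounding error
```

Applying `synthesis_step` pairwise from the last level back to the root undoes a full packet tree in the same way.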

Parallelization challenge of DWPT and IDWPT
To reach our goal of a high-performance hardware architecture for DWPT and IDWPT, we first considered the classic way of parallelizing the Mallat binary tree, as shown in figures 3 and 4, which show a three-level decomposition tree and reconstruction tree respectively, with a theoretical degree of parallelism P and a P-parallel filter bank of wavelet functions able to process P samples in each time slot. This classic P-parallel architecture is unrealizable because the number of reconstruction and/or decomposition filters grows exponentially with the depth. More precisely, the number of filters needed to implement this architecture is around $P \times (2^{Depth+1} - 1)$. For example, with a degree of parallelism $P = 16$ and $Depth = 5$, we need 1008 low/high-pass filters for DWPT and another 1008 low/high-pass filters for IDWPT. Implementing 2016 filters is impractical.
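The filter count quoted above follows directly from the formula $P \times (2^{Depth+1} - 1)$. The short sketch below reproduces the arithmetic; the function name `classic_filter_count` is a hypothetical helper introduced only for this illustration.

```python
def classic_filter_count(parallel_degree, depth):
    """Approximate filter count of the directly parallelized tree: P * (2**(depth + 1) - 1)."""
    return parallel_degree * (2 ** (depth + 1) - 1)


# Growth with depth for P = 16: the count roughly doubles at every extra level
for depth in range(1, 6):
    print(depth, classic_filter_count(16, depth))

dwpt_filters = classic_filter_count(16, 5)  # 1008 filters for DWPT
total_filters = 2 * dwpt_filters            # 2016 filters for DWPT plus IDWPT
```

The doubling per level is what makes the classic scheme impractical for deep trees, and it is the cost that a one-filter-per-stage organization avoids.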

Our Parallel-Pipeline architectures with sharing resources
Considering the diagrams in figures 1 and 2, a closer look at the data flow in the classic Mallat binary tree reveals some regularities:
- At a given stage of the Mallat binary tree in Figure 1, the data rate processed by any filter is half that of any filter in the adjacent stage on the input side (down-sampling by a factor of 2 from stage to stage).
- Likewise, in Figure 2 the data rate processed by any filter is half that of any filter in the adjacent stage on the output side (up-sampling by a factor of 2 from stage to stage).
- In addition, the amount of data processed at level $k$ is $2^{Depth-k}$ times the amount processed at level 1. Thus, the total amount of data to be processed at level $k$ is the same as in the first or last level ($2^{-k} \times 2^{k} = 1$).
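The last regularity can be written as a one-line balance. Assuming an input rate normalized to 1 and using $k$ for the level index (both notational assumptions made for this illustration), level $k$ of the decomposition tree contains $2^{k}$ sub-bands, each running at $2^{-k}$ times the input rate:

```latex
\underbrace{2^{k}}_{\text{sub-bands at level } k}
\;\times\;
\underbrace{2^{-k}}_{\text{rate per sub-band}}
\;=\; 1
\qquad \text{for every } k = 0, 1, \dots, Depth
```

Under this reading, a single filter per level running at the input rate sees as much data as all the filters of that level in the classic tree.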
Based on this strong regularity, which allows a high throughput rate with fewer hardware resources, we build a parallel-pipeline architecture of DWPT and IDWPT derived from the Mallat binary tree, as shown in section III. To reach our objective of high throughput with low hardware consumption in our parallel architecture of DWPT and IDWPT, we have to modify the high-pass/low-pass FIR filters.
To reduce area consumption, instead of using two filters we merge the functionality of the low-pass and high-pass filters into a single filter block. Furthermore, we modify this single block to process P samples in two clock cycles. Figure 5 presents our P-parallel modified transposed FIR filter.
This new architecture is linearized by analogy with coding theory, where the serial transposed FIR filter is like a Single-Input Single-Output (SISO) system and the parallel transposed FIR filter is like a Multiple-Input Multiple-Output (MIMO) system. This heterogeneous architecture therefore processes P input samples, and consequently produces P output samples, in each clock cycle. The main difference from the filters proposed in [24,25] lies in the handling of the low-pass and high-pass filter coefficients. Furthermore, this filter must be fed with correctly scheduled data, through a smart shift of data between the stages, to serve P samples in each clock cycle.
Respecting the order of the data in the buffers is the most sensitive point and the core of the move from serial to parallel FIR filter operation. This role of managing and interleaving data is devoted to the key block of our model, which we call the "buffer block" and which sits between the filters of different levels. The structure of a single buffer depends on whether the direct or the inverse DWPT is implemented.
In the architectures proposed below, we respect the following constraints:
- Our architecture is based on the Mallat binary tree, where we use filter bank theory to decompose and reconstruct the original signal.
- The degree of parallelism and the characteristics of the target board (technology) must be taken into account so as not to exceed the available resources.
- The degree of parallelism must respect the dyadic rule, that is, $P = 2^{p}$, $\forall p \in \mathbb{N}^{+}$, where $p$ directly affects the data management between the levels of the transformation tree and simplifies the up/down-sampling operation.
- To synthesize our proposed hardware architecture, we used the Altera Quartus premium lite edition software, targeting an Altera FPGA of the Cyclone V family with a speed grade of -7.
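To make the idea of one filter block serving both the low-pass and the high-pass function more concrete, here is a behavioral sketch of a serial (P = 1) transposed FIR whose coefficient set alternates between $h$ and $g$ on consecutive cycles. Doubling every partial-sum register so that the two interleaved streams do not mix is an assumption of this sketch (a standard time-multiplexing technique), not a description of the block in Figure 5, and the name `shared_transposed_fir` is hypothetical.

```python
from collections import deque


def shared_transposed_fir(x, h, g):
    """Time-multiplexed transposed FIR: even cycles use h, odd cycles use g.

    Each input sample is presented on two consecutive cycles. Every
    partial-sum register is two cycles deep, so the low-pass and high-pass
    partial sums stay separated.
    """
    L = len(h)
    delays = [deque([0.0, 0.0]) for _ in range(L - 1)]  # one 2-deep register per tap boundary
    low, high = [], []
    for cycle in range(2 * len(x)):
        sample = x[cycle // 2]
        coef = h if cycle % 2 == 0 else g
        delayed = [d.popleft() for d in delays]  # values written two cycles earlier
        # tap i: product plus the delayed partial sum coming from tap i + 1
        sums = [coef[i] * sample + (delayed[i] if i < L - 1 else 0.0) for i in range(L)]
        for i in range(L - 1):
            delays[i].append(sums[i + 1])
        (low if cycle % 2 == 0 else high).append(sums[0])
    return low, high


# Integer example filters chosen for readability (unnormalized Haar-like pair)
low, high = shared_transposed_fir([1, 2, 3, 4], h=[1, 1], g=[1, -1])
```

Here `low` holds the running sums and `high` the running differences of the input, exactly what two separate FIR filters would produce, but with a single set of multipliers and adders used on alternate cycles.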

First proposed architecture: Parallel-Pipeline architecture of DWPT with sharing resources
This part presents a P-parallel DWPT architecture (figure 6), which provides high throughput by using the modified transposed P-parallel digital FIR filters. As mentioned above, this modified filter, or H/G block, presented in Figure 5, ensures that the same amount of data is processed at any stage of the original Mallat binary tree.
In line with our plan to reduce hardware usage, the functionality of the high-pass and low-pass filters is provided by the H/G blocks. Consequently, instead of using two separate, similar filters (a high-pass filter and a low-pass filter), we propose a single filter used in an alternating process. This alternating process (on consecutive clock cycles) takes a sample for $h(n)$, then for $g(n)$, and so on. The critical point in this model is to manage the data correctly between filters and between levels; this role is devoted to specific buffers located between the filters of different stages. The structure of the buffer block is shown in figure 7.

A. Buffer Block structure of DWPT
The parameter $k$ in figure 7 is the number of the stage in which the buffer is implemented. In each level $k$ in figure 6, the buffer block is built from two types of shift register (a $2^{k-1}$-position shift for each one): a "fast shift buffer", which takes data from the previous stage, and a "slow shift buffer", which takes its data from its own stage filter. Overall, the buffer size depends on the parallel degree and the level at which it is implemented.
The "fast shift buffer" is faster than the "slow shift buffer": it performs a P-shift on each clock cycle, while the "slow shift buffer" performs a P-shift every two clock cycles. To handle this difference, we use an enable signal that manages the slower shift rate in level $k$. To control the data transfer from the fast buffer to the slow buffer, we insert a counter that activates this enable signal every $2^{k}$ cycles. This transfer operation is performed P times at each stage. It implements the down-sampling by two ($2\downarrow$) and synchronizes the output data (sample selection and ordering) of stage $(k-1)$ with the input data of stage $k$ (in keeping with the concept of the Mallat tree).
In this new scheme, the fast buffer (and hence the slow buffer too) stores, in interleaved order, the filtered data of the whole stage. Every $2^{k}$ cycles, only half of the samples can be transferred from the fast buffer to the slow buffer. Furthermore, each sample that flows out of the slow buffer to the next stage must be presented twice (which the lower rate of the slow buffer guarantees) so that it is processed by both the $h(n)$ and $g(n)$ filters. To manage the control signals and the data interleaved in the different stages, we developed a control unit dedicated to generating and managing the control signals related to stage $k$.
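At a purely behavioral level, the two effects of the buffer block described above (only half of the samples move on, and each of them is presented twice) can be sketched as below. The function name `buffer_block` is hypothetical, and the sketch deliberately ignores the interleaving of several sub-bands and the P-wide data words handled by the real shift registers.

```python
def buffer_block(samples):
    """Behavioral model: down-sample by 2, then hold each kept sample for 2 cycles."""
    kept = samples[::2]  # fast side: one sample out of two is transferred
    out = []
    for s in kept:
        out += [s, s]    # slow side: the same sample is seen by h(n), then by g(n)
    return out


stream = [10, 11, 12, 13]
to_next_stage = buffer_block(stream)
```

For an even-length stream the output has the same length as the input, which reflects the constant data rate from stage to stage.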

B. Synthesis results
As mentioned, this architecture is fully configurable at synthesis through three parameters: the wavelet scale (the tree depth), the order of the filters, and the filter coefficient quantization (generic parameters in the VHDL-RTL model). Moreover, it is partially configurable after synthesis: the filter coefficient values corresponding to different wavelet families (with the same filter order) can be changed. This means that we can change the type of wavelet without resynthesizing our architecture (the coefficients are loaded dynamically after synthesis), which was not the case for the previous works.
The resources consumed by our proposed architecture are given as 2-tuples, where the first element stands for logic elements and the second for logic registers. The data processing rate of our architecture, given in megasamples per second, depends on the clock frequency (around 200 MHz after synthesis with quantization order 5, and around 100 MHz with quantization order 16) and on the number of input signals.
In tables 1, 2 and 3, the configuration parameters are presented as 3-tuples (depth of the DWPT tree, filter order, quantization), together with the synthesis results for the 4-, 8- and 16-parallel DWPT architectures. Note that, despite its high rate, our architecture does not require any memory or DSP blocks.
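The relation between clock frequency, parallel degree and data rate can be checked with a few lines of arithmetic. The sketch assumes that P samples are accepted on every clock cycle, so that the throughput is simply P times the clock frequency; the function name `throughput_msps` is a hypothetical helper, and 200 MHz and 100 MHz are the approximate frequencies quoted above.

```python
def throughput_msps(parallel_degree, clock_mhz):
    """Throughput in megasamples per second, assuming P samples per clock cycle."""
    return parallel_degree * clock_mhz


# Nominal rates for the three parallel degrees at roughly 200 MHz and 100 MHz
for p in (4, 8, 16):
    print(p, throughput_msps(p, 200.0), throughput_msps(p, 100.0))

# Clock frequency implied by the 3159.52 megasamples figure quoted in the
# conclusion for P = 16
implied_clock_mhz = 3159.52 / 16
```

Under this assumption the figure reported for P = 16 corresponds to a clock of about 197.47 MHz, in line with the "around 200 MHz" stated for quantization order 5.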

Second proposed architecture: Parallel-Pipeline architecture of IDWPT with sharing resources
A new P-parallel architecture of IDWPT is proposed in this part; it provides ultra-high-speed data processing at low resource consumption. The reconstruction process of the P-parallel IDWPT architecture is simply the reversed form of the P-parallel DWPT. Considering the diagram in Figure 4, we observe that any filter of level $k$ processes $P \times 2^{(Depth-k)}$ times the amount of data processed at level 1. Thus, the total amount of data processed by all the filters of level $k$ is the same as in level 1 ($P \times 2^{(Depth-k)} \times 2^{-(Depth-k)} = P$). The same remark was made for the DWPT in the previous part (III.1 a). Moreover, we notice that any filter in a given level processes half as much data as its neighbor on the output side and twice as much as its neighbor on the input side. This implies that the tree structure is built from repeated functional blocks and that the complexity of the filters is the same for all stages.
To eliminate the exponential growth of the number of filters with the depth, we linearize and serialize the filtered data in the different stages. Consequently, instead of using $P \times 2^{k}/2$ low-pass filters and $P \times 2^{k}/2$ high-pass filters to filter the data in stage $k$, we implement only one modified filter in each stage. As shown in figure 10, the number of modified transposed FIR filter banks increases linearly with the depth. Each block acts as both the high-pass and the low-pass filter, which are related to $h(n)$ and $g(n)$ in equation (16).
The structure of the modified FIR filter is the same as that presented in figure 5. The only difference is the coefficients of the FIR filters, which are now those of $\bar{h}(n)$ and $\bar{g}(n)$. This filter provides the same functionality as the one presented by Mallat in his binary tree; we also reduce its area by merging the low-pass and high-pass reconstruction filter functions into a single block. The filtering process is similar to that used in the decomposition part (sub-section III-1): we take a sample for the $\bar{h}(n)$ filter, then for the $\bar{g}(n)$ filter, in an alternating process (on consecutive cycles), and so on.
Consequently, this modified block processes one sample in two clock cycles. To ensure correct interleaving and management of the data in the different stages, we developed a key block of our model, the "Buffer Block", which sits between the filters of different levels.

A. Buffer Block structure of IDWPT
The most important point in our P-parallel IDWPT (Figure 10) is, as previously, the link between neighboring levels, which does not require any reorganization of the data set. We use the same concept of buffer blocks as illustrated in Figure 6, so the size of the buffer shown in figure 11 depends on the stage in which it is implemented. The buffer block is built from two sub-blocks ($k$ indicates the stage in which the buffer block is implemented).
In each level $k$ in figure 10, the buffer block is built from two types of shift register (a $2^{k-1}$-position shift for each one): a "fast shift buffer", which takes data from the previous stage, and a "slow shift buffer", which takes its data from its own stage filter. Overall, the buffer size depends on the parallel degree and the stage in which it is implemented.
The "fast shift buffer" is faster than the "slow shift buffer": it performs a P-shift on each clock cycle, while the "slow shift buffer" performs a P-shift every two clock cycles. To handle this difference, we use an enable signal that manages the slower shift rate in level $k$. To control the data transfer from the fast buffer to the slow buffer, we insert a counter that activates this enable signal every $2^{k}$ cycles. This transfer operation is performed P times at each stage.
The remainder of this part presents the synthesis results of our P-parallel IDWPT architecture based on the modified P-parallel FIR filter structure.
This P-parallel IDWPT architecture is fully configurable at synthesis through three parameters: the wavelet scale (the tree depth), the order of the filters, and the filter coefficient quantization (generic parameters in the VHDL-RTL model). Moreover, it is partially configurable after synthesis through the filter coefficient values corresponding to different wavelet families (with the same filter order). The resources consumed by our proposed architecture are given as 2-tuples, where the first element stands for logic elements and the second for logic registers. To generate the different counters and the control signals related to stage $k$, we also propose a control unit block, like the one in figure 8.

Figure 11. Structure of the IDWPT buffer block in stage k with degree of parallelization P=4
Table 4. Implementation results of 4-parallel IDWPT architecture
Table 5. Implementation results of 8-parallel IDWPT architecture

[...] the DWPT and IDWPT transforms on the same FPGA chip. We can certainly implement both of our architectures (P-parallel DWPT and P-parallel IDWPT) on the same board to provide the transformation functionality under the same working constraints, with ultra-high throughput and low resource consumption (and consequently low power consumption). We notice a strong regularity in the architectures of the P-parallel DWPT (Figure 6) and the P-parallel IDWPT (Figure 10), particularly in the modified FIR filters of the different levels, where the structure of the modified FIR filter is the same.

In this scheme, we present a new pipeline-parallel architecture of DWPT/IDWPT with shared P-parallel FIR filters. Instead of implementing 2*Depth P-parallel FIR filters, this architecture uses half of that hardware while preserving the ultra-high throughput rate.
The buffer blocks are the same as those presented in the previous parts for the P-parallel DWPT and P-parallel IDWPT. The Mux_Data and DeMux_Data blocks manage the data between the buffers and the shared P-parallel FIR filter in P-parallel DWPT or P-parallel IDWPT mode. The Mux_Filter blocks are multiplexers used to select the transform direction and to manage the loading of the FIR filter coefficients after synthesis.
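The role of the Mux_Filter selection and the saving in filter blocks can be illustrated with a small behavioral sketch. The function names `mux_filter` and `filter_blocks` are hypothetical, and taking the reconstruction coefficients as the time-reversed analysis coefficients is an assumption that holds for orthonormal wavelets; in the actual design the coefficients are loaded after synthesis.

```python
def mux_filter(direction, h, g):
    """Return the (low-pass, high-pass) coefficient pair for the requested direction."""
    if direction == "DWPT":
        return list(h), list(g)
    if direction == "IDWPT":
        # assumption: reconstruction filters are the time-reversed analysis filters
        return list(reversed(h)), list(reversed(g))
    raise ValueError("direction must be 'DWPT' or 'IDWPT'")


def filter_blocks(depth, shared):
    """Number of P-parallel FIR blocks: 2 * depth when separate, depth when shared."""
    return depth if shared else 2 * depth


low_rec, high_rec = mux_filter("IDWPT", [1, 2], [3, 4])
separate = filter_blocks(4, shared=False)
shared = filter_blocks(4, shared=True)
```

For a depth of 4, the separate DWPT and IDWPT pipelines need eight filter blocks in total, while the shared organization needs four, which is the halving described above.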

A. Synthesis results
To evaluate the performance of our P-parallel DWPT/IDWPT architecture, we implement only the blocks of P-parallel FIR filters, which gives the percentage of resources saved by using shared P-parallel FIR filters.
The compilation results for area consumption, for different values of the configuration parameters (tree depth, filter order, coefficient quantization in number of bits, and parallel degree), are presented in table 7. From table 7, we notice that the resource consumption is the same for the same filter order (which is expected). Moreover, by using half as many P-parallel FIR filters in the P-parallel DWPT/IDWPT, we save from 3 to 5 % of the resources for depth = 4 and filter order = 16, which is significant, especially for large depths and high filter orders (like the discrete Meyer wavelet). This is a very important result because it is classically too difficult to implement both parallel architectures simultaneously on the same board.

Comparison
To evaluate the performance of each architecture, we make a comparison chart (

Conclusion
In this work, we have proposed three P-parallel/pipeline configurable architectures of the Direct Discrete Wavelet Packet Transform (DWPT) and Inverse Discrete Wavelet Packet Transform (IDWPT), based on the Mallat binary tree and the filter bank concept, which provide high throughput with minimal hardware resources. The problem addressed is to increase the amount of data processed per clock cycle and to decrease the total hardware resources used.
To solve these problems, we developed a P-parallel modified FIR filter, based on transposed FIR filters, that shares hardware resources between the low-pass and high-pass filters (by merging their functionality). The resulting data path keeps the critical path short, which allows a high operating frequency; the synthesis results indicate that our architecture provides very high-speed data processing with minimum resources. For example, for a parallel degree P = 16, depth = 2, filter order = 2 and quantization order = 5, we reach a throughput of 3159.52 megasamples per second. We also developed a new architecture that implements both DWPT and IDWPT on the same programmable board, preserving the high throughput while further reducing hardware resource consumption.
The proposed architectures (IDWPT and DWPT) are fully configurable at synthesis as a function of the parallel degree P, the depth (number of tree stages), the filter order and the filter coefficient quantization (generic parameters in the VHDL-RTL model). Furthermore, they are reconfigurable after synthesis by loading, during operation, the filter coefficients of a different wavelet family, which gives the DWPT/IDWPT transforms high flexibility.

Perspectives
In this work, we presented several parallel versions of an FPGA hardware implementation of DWPT/IDWPT. This work is still in progress: we are building another generation of DWPT/IDWPT implementations that can provide higher performance and be used in new application domains.

Figure 12. Proposed data-path diagram of both P-parallel DWPT and IDWPT architecture.

Works compared: Tze-Yun et al. [28], Marino et al. [29], Mohanty et al. [30], Madishetty et al. [23], Wang et al. [27], Wu et al. [12], Meihua et al. [