Empirical and Statistical Evaluation of the Effectiveness of Four Lossless Data Compression Algorithms

Data compression is the process of reducing the size of a file in order to reduce storage space and communication cost. The evolution of technology and the digital age has led to an unparalleled use of digital files in the current decade. This growth has increased the amount of data transmitted over various communication channels, which has prompted the need to examine current lossless data compression algorithms and check their level of effectiveness, so as to minimise the bandwidth required for communication and data transfer. Four lossless data compression algorithms were selected for implementation: the Lempel-Ziv-Welch algorithm, the Shannon-Fano algorithm, the Adaptive Huffman algorithm and Run-Length encoding. These algorithms were chosen for their similarities, particularly in application areas. Their efficiency and effectiveness were evaluated using a set of predefined performance metrics, namely compression ratio, compression factor, compression time, saving percentage, entropy and code efficiency. The algorithms were implemented in the NetBeans Integrated Development Environment using Java as the programming language. Based on statistical analysis using boxplots and ANOVA, and on a comparison of the four algorithms, the Lempel-Ziv-Welch algorithm was the most efficient and effective according to the metrics used for evaluation.
Keywords: Data compression, lossless, evaluation, entropy, algorithm


I. INTRODUCTION
It is very important that data and information sent through various means of communication be compressed into a reduced yet compact form. Data compression is the process of reducing the size of data into a smaller yet compact form. It is also a way of shrinking large stores of data so as to reduce communication cost. Data compression, which is also known as source coding, revolves around reducing the number of bits in a file relative to its original state.
There are two forms of data compression. Lossless data compression exploits redundancy in data, for example text data, to represent it in a compact form without any loss of data. Lossy data compression allows some data to be lost during the process of compression.
In the 1970s, with the advent of the Internet and, subsequently, online storage, software compression came to life with Huffman encoding (invented by David Huffman while he was studying information theory at MIT), which is similar to Shannon-Fano coding but differs in that its probability tree is built bottom-up rather than top-down (Mohammed and Ibrahiem, 2007). In 1977, Abraham Lempel and Jacob Ziv came up with the Lempel-Ziv algorithm, which was the first algorithm to use a dictionary in compressing data (Arup, et al., 2013). Since then, many variants of the Lempel-Ziv algorithm have emerged, including LZ77, LZ78, LZMA and LZX, most of which faded after their invention.
The advent of these various compression techniques calls for an evaluation of the Lempel-Ziv-Welch algorithm, the Shannon-Fano algorithm, the Adaptive Huffman algorithm and Run-Length encoding, in order to properly test their efficiency and effectiveness.
Against this backdrop, this work aims to provide comprehensive details on the effectiveness and efficiency of these algorithms based on the metrics selected for their evaluation.

A. Entropy Based Encoding
This type of lossless data compression algorithm tallies the number of occurrences of each character or symbol in the original document. Each unique character is then represented by a new symbol generated by the algorithm. The length of each newly generated symbol depends on how often the corresponding symbol occurs in the original document (Kodituwakku and Amarasinghe, 2015). Entropy-based encoding also relies on statistical information about the source file, namely the rate of occurrence of each character (Manas, et al., 2012). An example of this type of algorithm is Shannon-Fano encoding.
Entropy is the randomness of occurrence of a set of strings at a particular time.
Entropy can be defined as (Wang, 2011):

$$H(S) = -\sum_{s \in S} P(s) \log_2 P(s) \qquad (1)$$

where S is the set of probable states and P(s) is the likelihood of state s.
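As a minimal sketch of the entropy definition above, the following Java snippet estimates the entropy of a byte sequence in bits per symbol. It assumes that each byte value is one state and that P(s) is estimated from relative frequencies; the class and method names are illustrative and are not taken from the implementation evaluated in this work.

```java
import java.nio.charset.StandardCharsets;

final class EntropySketch {
    /** Shannon entropy in bits per symbol, treating each byte value as one state. */
    static double entropy(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }
        double h = 0.0;
        for (long c : counts) {
            if (c == 0) {
                continue; // states that never occur contribute nothing
            }
            double p = (double) c / data.length;   // estimated P(s)
            h -= p * (Math.log(p) / Math.log(2));  // -P(s) * log2 P(s)
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] sample = "BOOKKEPPER".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("H = %.4f bits per symbol%n", entropy(sample));
    }
}
```

A file made of a single repeated symbol gives an entropy of 0, while a file in which all 256 byte values are equally likely gives the maximum of 8 bits per symbol.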

B. Adaptive Huffman Encoding
The Huffman encoding algorithm was invented by David Huffman in 1951. It is an entropy-based algorithm used mainly for lossless data compression, in which fixed-length character codes are replaced with variable-length codes. Huffman encoding uses the probability of occurrence of each symbol in the source document to create a code word for each character (Tamanna and Sonia, 2014). The Adaptive Huffman algorithm, a branch of Huffman encoding, builds its tree bottom-up while the character occurrences are being counted (Pooja, et al., 2015).
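The bottom-up tree construction can be illustrated with a short sketch. Note that this is static Huffman coding, in which all frequencies are counted before the tree is built; it is not the adaptive variant evaluated in this work, which updates the tree while symbols are read. The class name and the choice to return only code lengths are assumptions made for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

final class StaticHuffmanSketch {
    private static final class Node {
        final long weight;
        final char symbol;      // meaningful only for leaves
        final Node left;
        final Node right;

        Node(long weight, char symbol, Node left, Node right) {
            this.weight = weight;
            this.symbol = symbol;
            this.left = left;
            this.right = right;
        }
    }

    /** Returns the Huffman code length (in bits) of every distinct character. */
    static Map<Character, Integer> codeLengths(String text) {
        Map<Character, Long> freq = new TreeMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1L, Long::sum);
        }
        PriorityQueue<Node> queue =
                new PriorityQueue<>((a, b) -> Long.compare(a.weight, b.weight));
        for (Map.Entry<Character, Long> e : freq.entrySet()) {
            queue.add(new Node(e.getValue(), e.getKey(), null, null));
        }
        Map<Character, Integer> lengths = new HashMap<>();
        if (queue.isEmpty()) {
            return lengths;
        }
        if (queue.size() == 1) {
            lengths.put(queue.peek().symbol, 1); // a lone symbol still needs one bit
            return lengths;
        }
        // Bottom-up: repeatedly merge the two least frequent subtrees.
        while (queue.size() > 1) {
            Node a = queue.poll();
            Node b = queue.poll();
            queue.add(new Node(a.weight + b.weight, '\0', a, b));
        }
        assign(queue.poll(), 0, lengths);
        return lengths;
    }

    private static void assign(Node node, int depth, Map<Character, Integer> lengths) {
        if (node.left == null) {
            lengths.put(node.symbol, depth); // leaf depth equals code length
            return;
        }
        assign(node.left, depth + 1, lengths);
        assign(node.right, depth + 1, lengths);
    }

    public static void main(String[] args) {
        System.out.println(codeLengths("AAAABBC"));
    }
}
```

Frequent symbols end up near the root and therefore receive short codes, which is how fixed-length codes are replaced by variable-length ones.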

C. Shannon Fano Coding
The Shannon-Fano data compression algorithm is named after Claude Shannon and Robert Fano, following their efforts to create an encoding procedure that generates a binary code tree in a top-down manner (Kannan and Murugan, 2012). The algorithm, which is entropy based and similar to Huffman encoding, evaluates how often each character recurs and allocates a code word of corresponding length.
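The top-down construction can be sketched as follows: symbols are sorted by decreasing frequency, the list is split into two parts whose total frequencies are as equal as possible, one part receives the bit 0 and the other the bit 1, and each part is split again recursively. The class name, the tie-breaking rule and the string representation of code words are assumptions of this sketch rather than details of the evaluated implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShannonFanoSketch {
    /** Builds a Shannon-Fano code word (as a string of '0'/'1') for each character. */
    static Map<Character, String> buildCodes(String text) {
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray()) {
            freq.merge(c, 1, Integer::sum);
        }
        List<Map.Entry<Character, Integer>> symbols = new ArrayList<>(freq.entrySet());
        // Most frequent first; the sort is stable, so ties stay in character order.
        symbols.sort((a, b) -> b.getValue() - a.getValue());

        Map<Character, String> codes = new HashMap<>();
        if (symbols.isEmpty()) {
            return codes;
        }
        if (symbols.size() == 1) {
            codes.put(symbols.get(0).getKey(), "0");
            return codes;
        }
        split(symbols, 0, symbols.size(), "", codes);
        return codes;
    }

    /** Recursively splits symbols[from, to) into two halves of near-equal weight. */
    private static void split(List<Map.Entry<Character, Integer>> symbols,
                              int from, int to, String prefix,
                              Map<Character, String> codes) {
        if (to - from == 1) {
            codes.put(symbols.get(from).getKey(), prefix);
            return;
        }
        long total = 0;
        for (int i = from; i < to; i++) {
            total += symbols.get(i).getValue();
        }
        long running = 0;
        long bestDiff = Long.MAX_VALUE;
        int cut = from + 1;
        for (int i = from; i < to - 1; i++) {
            running += symbols.get(i).getValue();
            long diff = Math.abs(total - 2 * running); // |left weight - right weight|
            if (diff < bestDiff) {
                bestDiff = diff;
                cut = i + 1;
            }
        }
        split(symbols, from, cut, prefix + "0", codes);
        split(symbols, cut, to, prefix + "1", codes);
    }

    public static void main(String[] args) {
        System.out.println(buildCodes("AAAABBC"));
    }
}
```

Because the split is decided greedily from the top, the resulting codes are not guaranteed to be optimal, unlike codes built bottom-up by Huffman's method.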

D. Dictionary Based Encoding
This approach is also known as substitution encoding. It maintains a data structure called a "dictionary", which contains strings. During compression, the encoder matches substrings of the original file against the strings in the dictionary (Manas, et al., 2012). If a match is found, the encoder replaces the substring with a reference to the dictionary.

E. Lempel Ziv Welch
Lempel-Ziv-Welch is named after Abraham Lempel and Jacob Ziv, who developed the LZ78 algorithm in 1978, and Terry Welch, who modified it in 1984 for implementation in high-performance disk controllers (Pooja, et al., 2015). It is a substitution compression algorithm that builds a dynamic dictionary of strings and substitutes each matching substring in the original file with the corresponding string in the dictionary. The string in the dictionary acts as a reference to the substring in the original document.
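The dictionary substitution described above can be sketched in Java as follows. The sketch assumes 8-bit input characters, an unbounded dictionary and integer output codes; a practical implementation would also pack the codes into a fixed or variable bit width and limit the dictionary size. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LzwSketch {
    /** Compresses text (characters 0..255 only) into a list of dictionary codes. */
    static List<Integer> compress(String text) {
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            dict.put(String.valueOf((char) i), i); // initial single-character entries
        }
        int nextCode = 256;
        List<Integer> out = new ArrayList<>();
        String w = "";
        for (char c : text.toCharArray()) {
            if (c > 255) {
                throw new IllegalArgumentException("sketch supports 8-bit characters only");
            }
            String wc = w + c;
            if (dict.containsKey(wc)) {
                w = wc;                     // keep extending the current match
            } else {
                out.add(dict.get(w));       // emit reference to the longest match
                dict.put(wc, nextCode++);   // learn the new string
                w = String.valueOf(c);
            }
        }
        if (!w.isEmpty()) {
            out.add(dict.get(w));
        }
        return out;
    }

    /** Rebuilds the original text, reconstructing the same dictionary on the fly. */
    static String decompress(List<Integer> codes) {
        if (codes.isEmpty()) {
            return "";
        }
        Map<Integer, String> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            dict.put(i, String.valueOf((char) i));
        }
        int nextCode = 256;
        String w = dict.get(codes.get(0));
        StringBuilder sb = new StringBuilder(w);
        for (int i = 1; i < codes.size(); i++) {
            int k = codes.get(i);
            String entry;
            if (dict.containsKey(k)) {
                entry = dict.get(k);
            } else if (k == nextCode) {
                entry = w + w.charAt(0);    // code refers to the entry being defined
            } else {
                throw new IllegalArgumentException("invalid code: " + k);
            }
            sb.append(entry);
            dict.put(nextCode++, w + entry.charAt(0));
            w = entry;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> codes = compress("TOBEORNOTTOBEORTOBEORNOT");
        System.out.println(codes);
        System.out.println(decompress(codes));
    }
}
```

The decoder never receives the dictionary; it rebuilds the same entries in the same order as the encoder, which is why only the code sequence has to be stored or transmitted.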

F. Run Length Encoding
Run-Length encoding can be regarded as the simplest lossless data compression algorithm. It processes a document as a sequence of "runs" and "non-runs" (Shrusti, et al., 2013). It simply counts the number of times a character occurs consecutively in the source file; for example, BOOKKEPPER will be encoded as 1B2O2K1E2P1E1R (Sebastian, 2003).
Arup, et al. (2013) presented a paper whose objective was to examine the performance of various lossless data compression algorithms on different test files. Various metrics were used to determine the level of performance of each algorithm. Three lossless data compression algorithms, namely Huffman encoding, Shannon-Fano and Lempel-Ziv-Welch (LZW), were implemented and examined. Based on the performance evaluation metrics used (compression ratio, compression factor, entropy and code efficiency), LZW was said to be slower, while Shannon-Fano had a higher average decompression time. It was concluded that the performance of the algorithms varies depending on the metric considered. It was recommended that more lossy and lossless data compression algorithms be examined in future, and that they also be tested on larger test files.
Barath, et al. (2013) designed software called "Sun Zip", developed in the Java programming language, with the aim of reducing the number of bits and bytes used to represent a character. The software works by reducing the bit representation of the source file, which lessens the disk storage space required for the data and thereby eases transmission over a network. It was noted that other third-party software such as WinRAR and WinZip poses some disadvantages and difficulties. The software was developed using a lossless data compression algorithm, namely Huffman encoding. The major drawbacks identified in the existing third-party software were data insecurity, higher compression time and monopoly in file extension.
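Returning to Run-Length encoding itself, the counting scheme illustrated by the BOOKKEPPER example can be sketched in a few lines of Java. The count-then-character output format follows that example; the class name is an assumption, and the textual format would be ambiguous for inputs that themselves contain digits.

```java
final class RunLengthSketch {
    /** Encodes each run of identical characters as <count><character>. */
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int run = 1;
            while (i + run < text.length() && text.charAt(i + run) == c) {
                run++; // extend the run while the same character repeats
            }
            out.append(run).append(c);
            i += run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "BOOKKEPPER";
        String encoded = encode(source);
        // Prints: BOOKKEPPER (10) -> 1B2O2K1E2P1E1R (14)
        System.out.println(source + " (" + source.length() + ") -> "
                + encoded + " (" + encoded.length() + ")");
    }
}
```

The example also shows the weakness of the method: with few repeated characters, the 10-character input grows to 14 characters, because every run of length one costs two output characters.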

III. RELATED WORKS
SubhamastanRao, et al. (2011) observed that speed (processing time) is the main challenge when data compression and encryption are performed as separate processes. The paper focused on the need to combine the two processes, thereby lessening these challenges. The idea behind the combination was to add a pseudo-random shuffle to data compression. The nodes in the Huffman tree are shuffled to produce a single mapping of the Huffman table. Decompression cannot be done once the Huffman table is encrypted, so simultaneous encryption and compression is achieved.
The challenges facing separate compression and encryption range from low speed and higher cost to longer processing time on the computer. These challenges were the main reason for combining the compression and encryption algorithms. The execution time of both processes was reduced drastically, and the new algorithm was deemed as good as other common algorithms such as DES and RC5.
The approach improved speed and also provided more security. Further enhancement of the approach was encouraged to achieve more efficiency, and the algorithm was said to be prone to security attacks.
Hanaa, et al. (2015) observed that images contain multiple redundancies arising from high correlation between pixels, which occupy a lot of space. Many algorithms have been designed and developed to compress images. The research was based on analysing the image compression algorithms and identifying their advantages and shortfalls. Its main objective was to find a way of reducing the amount of power consumed by redundant images.
In the source data, three major types of data redundancy were observed: spatial redundancy, temporal redundancy and spectral redundancy. The stages involved in image compression included a mapper, a quantizer and entropy encoding. The performance metrics used to measure the efficiency of image compression were image quality, compression ratio, power consumption and speed of compression, the last of which can be divided into computational complexity and memory resources. The evaluation concluded that SPIHT is the best technique owing to its compactness and its generation of a low bit rate. Adaptation of SPIHT for Wireless Multimedia Sensor Networks (WMSN) was encouraged as an area for further research.
Suarjaya (2012) proposed a new data compression algorithm, "J BIT ENCODING" (JBE), which manipulates every bit in a source file to minimise the data size without losing any information. The algorithm was considered to be a lossless data compression algorithm. It was also compared with other algorithms to measure its effectiveness and efficiency.
The other algorithms used for the comparison were Run-Length encoding, the Burrows-Wheeler transform, Move to Front (MTF) and Arithmetic coding. The proposed algorithm and the other four algorithms were tested with five different data files. The results were inconclusive owing to the hybrid nature of the test files used; for example, document content included audio, text and video. The author recognised the need for more review of and research into the J Bit encoding algorithm.
Lempel-Ziv-Welch, which was "incorporated as the Standard of the consultative committee on International telegraphy and telephony", was implemented with a slight modification. Simrandeep and Sulochana (2012) designed the dictionary of the algorithm based on a "content addressable memory array". The Xilinx ISE simulation tool was used to derive accurate performance measures. The algorithm, which was evaluated using a finite state machine technique, achieved a compression rate of 30.3% with a 60.25% reduction in disk storage. The resulting Lempel-Ziv-Welch data compression algorithm assigned 5 bits to each character instead of 7 bits. Various test data were used for the analysis.
Pooja, et al. (2015) proposed a two-stage data compression algorithm, OLZWH, which used both the Lempel-Ziv-Welch and Adaptive Huffman encoding algorithms at the optimal level. In the algorithm, dictionaries are formed for input character symbols in two modes: a set of indices and a set of ASCII codes. OLZW was applied to the set of indices, while Adaptive Huffman was applied to the ASCII codes. The analysis was, however, unclear, as there was no detailed explanation or statistical interpretation of the results obtained.

IV. DATA COMPRESSION EVALUATION TECHNIQUES/METRICS
Various performance evaluation metrics were used to evaluate the four lossless data compression algorithms.

A. Compression Ratio
This was calculated as the ratio of the compressed file size to the original file size.
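As a hedged illustration, the size-based metrics can be computed as below. The compression ratio follows the definition given above. The definitions of compression factor (original size divided by compressed size) and saving percentage (relative size reduction times 100) are the commonly used ones and are assumed here, since they are not spelled out in this section. The file sizes in `main` are hypothetical.

```java
final class CompressionMetricsSketch {
    /** Compressed size divided by original size; below 1 means the file shrank. */
    static double compressionRatio(long originalBytes, long compressedBytes) {
        return (double) compressedBytes / originalBytes;
    }

    /** Assumed definition: original size divided by compressed size. */
    static double compressionFactor(long originalBytes, long compressedBytes) {
        return (double) originalBytes / compressedBytes;
    }

    /** Assumed definition: percentage of the original size that was saved. */
    static double savingPercentage(long originalBytes, long compressedBytes) {
        return 100.0 * (originalBytes - compressedBytes) / originalBytes;
    }

    public static void main(String[] args) {
        long original = 1000;   // hypothetical size in bytes
        long shrunk = 400;      // hypothetical output that is smaller
        long grown = 1400;      // hypothetical output that is larger
        System.out.printf("ratio=%.2f factor=%.2f saving=%.1f%%%n",
                compressionRatio(original, shrunk),
                compressionFactor(original, shrunk),
                savingPercentage(original, shrunk));
        System.out.printf("ratio=%.2f factor=%.2f saving=%.1f%%%n",
                compressionRatio(original, grown),
                compressionFactor(original, grown),
                savingPercentage(original, grown));
    }
}
```

When an algorithm expands the data, the ratio exceeds 1 (that is, 100%) and the saving percentage becomes negative.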

D. Compression and Decompression Time
This measures the time taken by each algorithm to compress a file of a particular size and to decompress the same file back to its original form. The time is measured in nanoseconds (ns).
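One way to take such measurements in Java is with `System.nanoTime()`, as in the sketch below. The helper name and the placeholder workload are assumptions; in practice the `Runnable` would wrap a call to the compression or decompression routine under test, and several repetitions are advisable because the first runs on the JVM include warm-up effects.

```java
final class TimingSketch {
    /** Returns the elapsed wall-clock time of one run of the task, in nanoseconds. */
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1_000_000]; // placeholder input
        // Placeholder workload standing in for a compression routine.
        Runnable workload = () -> {
            long sum = 0;
            for (byte b : data) {
                sum += b;
            }
            if (sum < 0) {
                System.out.println("unreachable for an all-zero array");
            }
        };
        for (int run = 1; run <= 3; run++) {
            System.out.println("run " + run + ": " + timeNanos(workload) + " ns");
        }
    }
}
```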

E. Entropy
Generally, entropy refers to disorder or uncertainty. Entropy is used when the data compression algorithm is based on statistical information about the source file. Two kinds of event occur in a source document: events that occur rarely and events that occur repeatedly. Entropy can be calculated (Kodituwakku and Amarasinghe, 2015) as:

$$H(S) = -\sum_{s \in S} P(s) \log_2 P(s)$$

where S is the set of probable states and P(s) is the likelihood of state s.

F. Code Efficiency
Code efficiency can be defined as the ratio, expressed as a percentage, between the entropy of the source file and the average code length of the source file. It can be calculated as:

$$E = \frac{H(S)}{L} \times 100$$

where E is the code efficiency, H(S) is the entropy and L is the average code length. Source: Kodituwakku and Amarasinghe, 2015.

G. Average Code Length
This can be defined as the average number of bits needed to represent a single code word. When the length of each code word in the source file is known, the average code length can be calculated as (Kodituwakku and Amarasinghe, 2015):

$$L = \sum_{i} p_i \, l_i$$

where p_i is the likelihood of occurrence of a particular symbol and l_i is the length of the code word for that symbol.
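A small worked sketch ties entropy, average code length and code efficiency together. It assumes the symbol probabilities sum to 1 and that code efficiency is the entropy divided by the average code length, expressed as a percentage. The probabilities and code lengths in `main` are hypothetical values chosen for illustration.

```java
final class CodeEfficiencySketch {
    /** Entropy in bits per symbol for a probability distribution p. */
    static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0) {
                h -= pi * (Math.log(pi) / Math.log(2));
            }
        }
        return h;
    }

    /** Average code length: sum over symbols of probability times code length. */
    static double averageCodeLength(double[] p, int[] lengths) {
        if (p.length != lengths.length) {
            throw new IllegalArgumentException("one code length per symbol is required");
        }
        double total = 0.0;
        for (int i = 0; i < p.length; i++) {
            total += p[i] * lengths[i];
        }
        return total;
    }

    /** Code efficiency as a percentage: entropy over average code length. */
    static double codeEfficiency(double[] p, int[] lengths) {
        return 100.0 * entropy(p) / averageCodeLength(p, lengths);
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.25, 0.25};        // hypothetical symbol probabilities
        int[] variable = {1, 2, 2};            // e.g. code words 0, 10, 11
        int[] fixed = {2, 2, 2};               // fixed-length 2-bit code
        System.out.printf("variable-length: L=%.2f, E=%.1f%%%n",
                averageCodeLength(p, variable), codeEfficiency(p, variable));
        System.out.printf("fixed-length:    L=%.2f, E=%.1f%%%n",
                averageCodeLength(p, fixed), codeEfficiency(p, fixed));
    }
}
```

For this distribution the variable-length code has an average length of 1.5 bits, which equals the entropy, so its efficiency is 100%; the fixed-length code averages 2 bits and reaches only 75%.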

V. IMPLEMENTATION, FINDINGS AND RESULTS
This section provides an analysis of the four lossless compression algorithms using various metrics for performance evaluation.
Going by the results in Table 1, the Run-Length compression algorithm did not work well with the test data. Run-Length encoding works well on repeated characters, and since all the test data contain few repeated values, the compressed data grew larger than the original, which is not the desired result. The compression ratio and compression factor fall outside the desired range, while the saving percentage is negative throughout. In File 1, the compressed file was almost double the size of the original.
The Lempel-Ziv-Welch data compression algorithm makes use of a dynamic dictionary. The results in Table 2 show a very good compression ratio. File 10 of Table 2 gives a saving percentage of 78.19%. All the compressed files are smaller than the originals, in contrast to Run-Length encoding, which increased their size. The lowest saving percentage is 26.59%. The compression ratio and compression factor of all files are quite good. The saving percentage remains positive for pictures and graphics. The compression time is also satisfactory. With this algorithm, communication cost and storage space will be reduced.
The implementation of the Adaptive Huffman algorithm, as shown in Table 3, uses a dynamic tree for the traversal of nodes and gives a relatively average saving percentage. The saving percentage for the text document was as high as 63.53%. The algorithm does not work well with tabs, as the compression of .docx files shows a low saving percentage: for example, File 3 has 0.21% while File 8 has -0.13%. The Adaptive Huffman compression ratio for picture and audio files is very high, as shown in File 11 to File 15. The saving percentage for audio is slightly higher than that for pictures. Adaptive Huffman reduces the size of compressed data, which helps to reduce communication cost and storage space.
Shannon-Fano, which is closely related to the Huffman algorithm, is known not to have better code efficiency than Adaptive Huffman. The results in Table 4 give a compression ratio above 100% for all files, which is not efficient. The saving percentage is also negative. The compression factor is very low, while the entropy is in the range of 7.0 to 8.0 bits per character. The algorithm does not work well with the test data.
The four lossless data compression algorithms, whose results are shown in Tables 1 to 4, were compared based on their saving percentage, compression ratio, compression time, entropy and code efficiency. The comparison in Table 5 and its graphical representation in Figure 1 show that Lempel-Ziv-Welch clearly has a better saving percentage than the other algorithms, although Adaptive Huffman has a better saving percentage for Text2.pdf only.
Table 7 shows the Analysis of Variance (ANOVA), which was used to deduce that there is a significant difference between the mean values of the algorithms. The boxplot shows the Lempel-Ziv-Welch algorithm with a better saving percentage.
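For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be computed as in the sketch below. It compares the variation between group means with the variation within groups. The sample values are hypothetical and are not the measurements of this study, and the sketch stops at the F statistic: obtaining a p-value additionally requires the F distribution, which is not implemented here.

```java
final class AnovaSketch {
    /** One-way ANOVA F statistic for k groups of observations. */
    static double fStatistic(double[][] groups) {
        int k = groups.length;
        int n = 0;
        double grandSum = 0.0;
        for (double[] g : groups) {
            for (double v : g) {
                grandSum += v;
                n++;
            }
        }
        double grandMean = grandSum / n;

        double ssBetween = 0.0; // variation of group means around the grand mean
        double ssWithin = 0.0;  // variation of observations around their group mean
        for (double[] g : groups) {
            double mean = 0.0;
            for (double v : g) {
                mean += v;
            }
            mean /= g.length;
            ssBetween += g.length * (mean - grandMean) * (mean - grandMean);
            for (double v : g) {
                ssWithin += (v - mean) * (v - mean);
            }
        }
        double msBetween = ssBetween / (k - 1); // k - 1 degrees of freedom
        double msWithin = ssWithin / (n - k);   // n - k degrees of freedom
        return msBetween / msWithin;
    }

    public static void main(String[] args) {
        // Hypothetical saving percentages for three algorithms on three files each.
        double[][] groups = {
            {1.0, 2.0, 3.0},
            {2.0, 3.0, 4.0},
            {5.0, 6.0, 7.0}
        };
        System.out.printf("F = %.2f%n", fStatistic(groups));
    }
}
```

A large F value indicates that the group means differ by more than the spread within each group would explain.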

A. Comparison Based on Compression Ratio
The closer the compression ratio is to "1%", the more efficient the algorithm is. In the results shown in Table 6 and their graphical representation in Figure 2, the Lempel-Ziv-Welch algorithm has a better compression ratio for all test data except Text2.pdf, where the Adaptive Huffman algorithm has a better compression ratio. It can be deduced that Lempel-Ziv-Welch has a better compression ratio than the other algorithms.

B. Comparison Based On Compression Time
In the results shown in Table 7 and their graphical representation in Figure 3, Adaptive Huffman has the better compression time. Its average compression time of 524,325,363.1 nanoseconds is regarded as the best. The Lempel-Ziv-Welch algorithm, which has a better compression ratio and saving percentage, has the worst average compression time, at 8,374,475,588 nanoseconds.

C. Comparison between Original and Compressed File Sizes
In the comparison shown in Table 8 and its graphical representation in Figure 4, the original file sizes are compared with their corresponding compressed file sizes. The Lempel-Ziv-Welch algorithm produces smaller compressed files than the other algorithms for all test files.

VI. CONCLUSION
A study and evaluation of four lossless data compression algorithms was carried out. The algorithms were implemented and tested with test data of different sizes. All four algorithms were compared to determine their level of efficiency and effectiveness. Based on the analysis of the results and their graphical representation, and considering the compression factor, compression ratio, saving percentage and ability to compress audio and graphics files effectively, the Lempel-Ziv-Welch algorithm, which is dictionary based, is considered to be the most effective and efficient of the four data compression algorithms evaluated. The results and values are very good and acceptable. Identifying an efficient and effective compression algorithm in turn allows optimal usage of storage space and a reduction in communication cost. This work contributes to computer science by identifying an efficient data compression algorithm.
A system should be put in place that recognises a file type and subsequently assigns it to a suitable data compression algorithm. Research should be directed towards context-mixing algorithms such as PAQ, which is efficient in its compression ratio but slow owing to its use of multiple statistical models; its speed should be improved. Compression via substring enumeration (CSE) should also be researched further to improve its level of efficiency.

Figure 4: Graphical comparison of original file size against compressed file size.