Satoshi Yoshida and Takuya Kida
Graduate School of Information Science and Technology, Hokkaido University
Compressed Pattern Matching
Compressed data is searched directly by a program that works on the compressed data.
Variable-to-Fixed length (VF) codes have attracted attention from the viewpoint of compressed pattern matching for the past few years.

Classification of codes by the length of the input unit and of the codeword:
Input text   Compressed text   Code
Fixed        Fixed             FF code
Variable     Fixed             VF code
Fixed        Variable          FV code
Variable     Variable          VV code

WCTA 2010, October 14th, 2010
Trade-off in the codeword length (a longer codeword needs more memory and time):
                                       Short codeword   Long codeword
Size of parse tree                     small            large
Compression ratio                      low              high
Cost to construct/hold                 low              high
Preparation cost of pattern matching   low              high
Goal: a compression method with a high compression ratio and fast pattern matching.
Approach: apply another compression method after VF coding.
VF coding: STVF coding [Kida2009]. Other coding: Range coder [Martin1979].
Input text -> VF coding -> intermediate output -> other coding -> output
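The two-stage pipeline can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the first stage is assumed to have already produced a list of fixed-length codewords, and zlib is used purely as a stand-in for the second ("other coding") stage, not the range coder evaluated in the talk.

```python
import zlib

def pack_codewords(codes, bits):
    """Pack fixed-length codewords (each `bits` wide) into bytes, MSB first."""
    acc = 0
    for code in codes:
        if not 0 <= code < (1 << bits):
            raise ValueError("codeword does not fit in the given width")
        acc = (acc << bits) | code
    total_bits = bits * len(codes)
    padding = (-total_bits) % 8  # zero bits needed to reach a byte boundary
    return (acc << padding).to_bytes((total_bits + padding) // 8, "big")

def unpack_codewords(data, bits, count):
    """Inverse of pack_codewords; `count` is the number of codewords stored."""
    acc = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << bits) - 1
    return [(acc >> (total_bits - bits * (i + 1))) & mask for i in range(count)]

def second_stage_compress(codes, bits):
    # Assumption: zlib stands in for the second-stage coder.
    return zlib.compress(pack_codewords(codes, bits))

def second_stage_decompress(blob, bits, count):
    # Undo the second stage only; the result is still VF coded text,
    # which is what the pattern matcher would run on.
    return unpack_codewords(zlib.decompress(blob), bits, count)
```

Decoding only the second stage returns the intermediate codeword sequence, so the first-stage text never has to be restored before searching.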
After decoding the compressed text with the range coder, we get the STVF coded text. We then run the pattern matching algorithm on it.
STVF (short codeword) + Range coder: Range coder (decoding) -> intermediate output (STVF coded) -> pattern matching on STVF
STVF (long codeword): pattern matching on STVF
Which is faster?
Compression ratio: STVF(12) + range coder is slightly better than STVF(16).
Pattern matching time: slower, because of the range coder decompression.
Compression time: almost the same (slightly faster!).
Decompression time: almost the same.
Range coder: proposed by G. N. N. Martin in 1979.
G. N. N. Martin, Range encoding: an algorithm for removing redundancy from a digitised message, 1979.
A variation of arithmetic coding [Rissanen, Langdon 1979].
It encodes using integers instead of real numbers.
Encoding is faster than arithmetic coding.
The compression ratio is better than that of Huffman codes.
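The idea of encoding with integers instead of real numbers can be illustrated with a small sketch. It is a simplification under stated assumptions: it uses a static frequency model and Python's arbitrary-precision integers, so the whole message becomes one large integer, and it omits the fixed-width renormalization that a practical range coder relies on. All function names are hypothetical.

```python
from collections import Counter

def build_model(message):
    """Static model: symbol frequencies, cumulative frequencies, and their total."""
    freq = Counter(message)
    cum, running = {}, 0
    for symbol in sorted(freq):
        cum[symbol] = running
        running += freq[symbol]
    return freq, cum, running

def range_encode(message, freq, cum, total):
    # After k symbols, [low, low + width) is a sub-range of [0, total**k).
    low, width = 0, 1
    for symbol in message:
        low = low * total + width * cum[symbol]
        width = width * freq[symbol]
    return low  # any integer in the final range identifies the message

def range_decode(code, length, freq, cum, total):
    low, width = 0, 1
    out = []
    for k in range(length):
        # Bring the code down to the scale total**(k + 1) used at this step.
        scaled = code // total ** (length - k - 1)
        target = (scaled - low * total) // width
        symbol = next(s for s in cum if cum[s] <= target < cum[s] + freq[s])
        out.append(symbol)
        low = low * total + width * cum[symbol]
        width = width * freq[symbol]
    return "".join(out)
```

Frequent symbols shrink the range less, so they cost fewer bits in the final integer; only integer multiplication, addition, and division are involved.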
STVF coding: proposed by T. Kida in 2009.
T. Kida, Suffix tree based VF-coding for compressed pattern matching, DCC2009, 2009.
A VF coding that uses a pruned suffix tree as its parse tree.
It achieves a higher compression ratio than the basic VF code.
Suffix tree: P. Weiner, Linear pattern matching algorithms, SWAT1973, 1973.
A tree structure representing all suffixes of a string.
Each branch is labeled by a nonempty string.
Each inner node has at least two children.
The labels of the branches leaving an inner node begin with different characters.
[Figure: the suffixes of an example string and its suffix tree]
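The properties listed above can be checked on a small example. The sketch below is a naive quadratic construction, not Weiner's linear-time algorithm: it builds an uncompacted suffix trie and then merges every chain of single-child nodes into one branch. The example string "banana" and the end marker "$" are assumptions chosen for illustration.

```python
def suffix_trie(text):
    """Insert every suffix of `text` into a character trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def compact(node):
    """Merge chains of single-child nodes so each branch carries a string label."""
    tree = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:
            (next_ch, grandchild), = child.items()
            label += next_ch
            child = grandchild
        tree[label] = compact(child)
    return tree

def count_leaves(tree):
    if not tree:
        return 1
    return sum(count_leaves(child) for child in tree.values())

def inner_nodes_branch(tree, is_root=True):
    """True if every inner node other than the root has at least two children."""
    if not tree:
        return True
    if not is_root and len(tree) < 2:
        return False
    return all(inner_nodes_branch(child, False) for child in tree.values())

# The end marker "$" makes every suffix end at its own leaf.
tree = compact(suffix_trie("banana$"))
```

Because the branches of one node come from distinct characters in the trie, their labels necessarily begin with different characters.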
Make a compact parse tree.
[Figure: the suffix tree of the string S]
Make a compact parse tree.
[Figure: the parse tree with its codewords]
[Figure: encoding with the parse tree, showing the input and the output codeword sequence]
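The way a VF code turns variable-length substrings into fixed-length codewords can be sketched as follows. The phrase list is hand-picked and hypothetical; in STVF coding the phrases come from a pruned suffix tree of the input, which this sketch does not build. Parsing here is a greedy longest match against the phrase list.

```python
def vf_encode(text, phrases):
    """Greedy longest-match parse; the code of a phrase is its list index."""
    index = {phrase: code for code, phrase in enumerate(phrases)}
    longest = max(len(p) for p in phrases)
    codes, pos = [], 0
    while pos < len(text):
        for length in range(min(longest, len(text) - pos), 0, -1):
            code = index.get(text[pos:pos + length])
            if code is not None:
                codes.append(code)
                pos += length
                break
        else:
            raise ValueError(f"no phrase matches at position {pos}")
    return codes

def vf_decode(codes, phrases):
    return "".join(phrases[code] for code in codes)

def to_bits(codes, width):
    """Render each codeword with the same fixed width in bits."""
    return " ".join(format(code, f"0{width}b") for code in codes)

# Hypothetical dictionary; it contains every single character of the
# alphabet {a, b, c}, so any text over that alphabet can be parsed.
PHRASES = ["a", "b", "c", "ab", "abc", "ca"]
```

With six phrases, a codeword length of 3 bits is enough, and every codeword boundary in the output falls at a multiple of 3 bits, which is what makes pattern matching on the coded text convenient.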
STVF coded text is represented by a regular collage system. A collage system is a formal system for representing a string [Kida 2003]; it is a general framework that captures the essence of compressed pattern matching.
With the collage system, we can systematically derive an (Aho-Corasick type) pattern matching algorithm on STVF code.
The pattern matching algorithm runs in O(n + m^2) time and O(D + m^2) space.
[Kida 2003]: Collage system: a unifying framework for compressed pattern matching
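The core idea of matching on the coded text, processing one codeword per step instead of one character, can be sketched for a single pattern. This is an illustration under stated assumptions, not the algorithm of the talk: it uses a KMP-style automaton rather than an Aho-Corasick machine, the phrase list is hypothetical, and the table is precomputed naively, so it does not meet the O(n + m^2) time bound.

```python
def kmp_step(pattern, state, ch):
    """Next automaton state: longest pattern prefix that is a suffix of
    pattern[:state] + ch."""
    candidate = pattern[:state] + ch
    for k in range(min(len(pattern), len(candidate)), -1, -1):
        if candidate.endswith(pattern[:k]):
            return k

def build_jump_table(pattern, phrases):
    """table[code][state] = (state after reading the phrase,
    number of pattern occurrences that end inside the phrase)."""
    m = len(pattern)
    table = []
    for phrase in phrases:
        row = []
        for start in range(m + 1):
            state, hits = start, 0
            for ch in phrase:
                state = kmp_step(pattern, state, ch)
                if state == m:
                    hits += 1
            row.append((state, hits))
        table.append(row)
    return table

def count_occurrences(codes, table):
    """Scan the codeword sequence with one table lookup per codeword."""
    state, total = 0, 0
    for code in codes:
        state, hits = table[code][state]
        total += hits
    return total

# Hypothetical phrase dictionary; a codeword is an index into this list.
PHRASES = ["a", "b", "ab", "ba", "aba"]
```

Occurrences that span a codeword boundary are still found, because the automaton state is carried from one codeword to the next.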
Compression methods: STVF coding; STVF coding + Range coder
Data: English text (Brown corpus, 6.8 MB, |Σ| = 96)
Environment:
CPU: Intel Xeon processor, 3.00GHz, dual core
Memory: 2G
OS: Red Hat Enterprise Linux ES Release 4
Codeword length: l = 8-16 bits
We compared compression ratios, compression times, decompression times, and pattern matching times between the two methods.
[Figure: compression ratio (%) versus codeword length (8 to 16 bits) for STVF and STVF + Range coder; the values 50.6% and 51.3% are marked on the plot]
[Figure: compression time (sec) versus codeword length (8 to 16 bits) for STVF and STVF + Range coder]
[Figure: decompression time (sec) versus codeword length (8 to 16 bits) for STVF and STVF + Range coder]
[Figure: pattern matching time (sec) versus pattern length (5 to 50) for STVF(16), STVF(12), and STVF(12) + Range coder]
After decoding the compressed text with the range coder, we get the STVF coded text. We then run the pattern matching algorithm on it.
STVF (short codeword) + Range coder: Range coder (decoding) -> intermediate output (STVF coded) -> pattern matching on STVF
STVF (long codeword): pattern matching on STVF
Which is faster?
Compression ratio: STVF(12) + range coder is slightly better than STVF(16) (51.3% vs. 50.6%).
Pattern matching time: slower, because of the range coder decompression.
Compression time: almost the same (slightly faster!).
Decompression time: almost the same.
We investigated the performance of the combination of STVF coding and a range coder.
The combination costs almost nothing in compression and decompression times.
Since range coder decoding was slower than we expected, we could not improve the pattern matching speed.
Future work:
Combine STVF coding with other methods whose decompression is fast, such as gzip.
Implement the Set-Horspool algorithm* (Boyer-Moore type) to improve the pattern matching speed when the pattern is long.
* G. Navarro and M. Raffinot, Flexible pattern matching in strings, 2007
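To show why a Boyer-Moore type algorithm helps with long patterns, here is a minimal sketch of the single-pattern Horspool algorithm on plain text. It is not the Set-Horspool variant named above (which handles a set of patterns), and it is not adapted to STVF coded text; the function name is hypothetical.

```python
def horspool_search(text, pattern):
    """Return the start positions of all occurrences of `pattern` in `text`."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift for each character: distance from its rightmost occurrence
    # (excluding the last pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    positions, pos = [], 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            positions.append(pos)
        # Characters absent from the pattern allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return positions
```

The longer the pattern, the larger the typical shift, so fewer text positions are inspected.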