9.1. Snappy Module – Unpacking iWork 13 files.

This is not a full implementation of the Snappy compression protocol. It’s a minimal implementation, enough to unpack iWork ‘13 files.

9.1.1. The iWork ‘13 use of Snappy

https://github.com/obriensp/iWorkFileFormat

https://github.com/obriensp/iWorkFileFormat/blob/master/Docs/index.md

“Components are serialized into .iwa (iWork Archive) files, a custom format consisting of a Protobuf stream wrapped in a Snappy stream.

“Snappy Compression

“Snappy is a compression format created by Google aimed at providing decent compression ratios at high speeds. IWA files are stored in Snappy’s framing format, though they do not adhere rigorously to the spec. In particular, they do not include the required Stream Identifier chunk, and compressed chunks do not include a CRC-32C checksum.

“The stream is composed of contiguous chunks prefixed by a 4 byte header. The first byte indicates the chunk type, which in practice is always 0 for iWork, indicating a Snappy compressed chunk. The next three bytes are interpreted as a 24-bit little-endian integer indicating the length of the chunk. The 4 byte header is not included in the chunk length.

9.1.3. Implementation

Module docstring.

"""Read snappy-compressed IWA files used for Numbers '13 workbooks.

This is a variation on the "official" snappy protocol. The CRC checksums
are not used by iWork '13. This is not a full implementation, just
a decoder for iWork snappy-compressed IWA files.

See https://code.google.com/p/snappy/
"""

Some overheads

import logging
import sys
snappy.bytes_int(seq)

Decode a sequence of bytes into an integer.

Parameters:seq – sequence of bytes; the entire sequence is consumed.
Returns:integer
def bytes_int( seq ):
    """8-bit encoded integer as sequence of 1 to 4 bytes, little-endian.
    """
    shift= 0
    v= 0
    for b in seq:
        v += b<<shift
        shift += 8
    return v
snappy.varint(stream)

Decode varint-encoded sequence of bytes.

Parameters:seq – sequence of bytes; consume bytes to decode the int.
Returns:integer
def varint( stream ):
    """7-bit encoded integer as a sequence of bytes, little-endian.
    MSB is used to indicate if more bytes are part of this value.

    >>> varint( iter([0xfe, 0xff, 0x7f]) )
    2097150
    """
    b= next(stream)
    shift= 0
    v = (b & 0x7F) # <<shift to be pedantic
    while b & 0x80 != 0:
        b= next(stream)
        shift += 7
        v += (b & 0x7F)<<shift
    return v

The snappy protocol has two levels.

  • The LZ77 decoder which expands the tags to create the data. There are four kinds of tags.

    • 0b00: literal

      Literals are uncompressed data stored directly in the byte stream. The literal length is stored differently depending on the length of the literal:

      • For literals up to and including 60 bytes in length, the upper six bits of the tag byte contain (len-1). The literal follows immediately thereafter in the bytestream.
      • For longer literals, the (len-1) value is stored after the tag byte, little-endian. The upper six bits of the tag byte describe how many bytes are used for the length; 60, 61, 62 or 63 for 1-4 bytes, respectively. The literal itself follows after the length.
    • 0b01: Copy with 1-byte offset

      These elements can encode lengths between [4..11] bytes and offsets between [0..2047] bytes. (len-4) occupies three bits and is stored in bits [2..4] of the tag byte. The offset occupies 11 bits, of which the upper three are stored in the upper three bits ([5..7]) of the tag byte, and the lower eight are stored in a byte following the tag byte.

    • 0b10: Copy with a 2-byte offset

      These elements can encode lengths between [1..64] and offsets from [0..65535]. (len-1) occupies six bits and is stored in the upper six bits ([2..7]) of the tag byte. The offset is stored as a little-endian 16-bit integer in the two bytes following the tag byte.

    • 0b11: Copy with a 4-byte offset

      These are like the copies with 2-byte offsets (see previous subsection), except that the offset is stored as a 32-bit integer instead of a 16-bit integer (and thus will occupy four bytes).

  • The higher-level framing protocol.

    • type “0” (Compressed Data) frame with a three-byte length.
    • Other types are possible in principle. Numbers ‘13 doesn’t use them.
class snappy.Snappy

Implement the two-level snappy protocol used by Numbers ‘13.

  • The LZ77 decoder which expands the tags to create the data.
  • The higher-level framing protocol with just one kind of frame, type “0” (Compressed Data) with a three-byte length.
class Snappy:
    def __init__( self ):
        self.log= logging.getLogger( self.__class__.__qualname__ )
Snappy.lz77(frame)

The LZ77 decoder. This locates the varint size header. That’s followed by a sequence of tags. The literal tag has data. The other three tags repeat previously output bytes.

We’re building a bytearray buffer from the input. This means copying literals into the buffer. It also means copying part of the buffer into the buffer to add one of the three kinds of copies.

Because of the framing protocol, we’re limited to a buffer of only 64K bytes.

def lz77( self, frame ):
    """Decode one frame of a Snappy LZ77-encoded stream.

    Get the tags, data and emit the resulting uncompressed bytes for this frame.

    There are four types of tags:

    0b00 - Literal - the balance of the tag specifies the length.
    0b01 - Copy 1-byte offset - repeat previous bytes from the output buffer.
    0b10 - Copy 2-byte offset - repeat previous bytes
    0b11 - Copy 4-byte offset - repeat previous bytes

    :param frame: One frame from a Snappy file.
    :returns: buffer of bytes for this frame.
    """
    buffer= bytearray()
    stream= iter( frame )
    # The size of the uncompressed data in this frame.
    size= varint( stream )
    self.log.debug( "  LZ77 size {0}".format(size) )
    # Build the uncompressed buffer.
    while len(buffer) < size:
        hdr= int(next(stream))
        tag_upper, element_type = hdr >> 2, hdr & 0b11

        if element_type == 0b00: # Literal
            if tag_upper < 60:
                size_elt= tag_upper
            else:
                size_elt= bytes_int( next(stream) for i in range(tag_upper - 59) )
            bytes= [next(stream) for b in range(size_elt+1)]
            self.log.debug(
                "{0:08b} {1} {2} = {3!r}".format(
                hdr, element_type, size_elt, bytes) )
            buffer.extend( bytes )
        else: # Some kind of copy
            # Copy -- gather bytes based on offset, stow into buffer based on length
            if element_type == 0b01: # Copy with 1-byte offset
                length, offset_hi = tag_upper & 0b111, (tag_upper & 0b111000)>>3
                offset_lo= next(stream)
                offset= (offset_hi<<8)+offset_lo
                self.log.debug(
                    "{0:08b} {1:8b} {2} {3} = {4!r} {5!r}".format(
                    hdr, offset_lo, element_type, length, (offset_hi, offset_lo), offset) )
                length += 4
            elif element_type == 0b10: # Copy with 2-byte offset
                offset= bytes_int( next(stream) for i in range(2) )
                length= tag_upper
                self.log.debug( "{0:08b} {1} {2} {3}".format(hdr, element_type, length, offset) )
                length += 1
            elif element_type == 0b11: # Copy with 4-byte offset
                offset= bytes_int( next(stream) for i in range(4) )
                length= tag_upper
                self.log.debug(
                    "{0:08b} {1} {2} {3}".format(
                    hdr, element_type, length, offset) )
                length += 1
            else:
                raise Exception( "Logic Problem" )
            # Extend buffer with the copied bytes.
            # Handle RLE feature, if necessary.
            copy= buffer[-offset:]
            if offset < length:
                repeat= copy[:]
                while len(copy) < length:
                    copy += repeat
            buffer.extend( copy[:length] )
    assert len(buffer) == size, "len(buffer) {0} != size {1}".format(len(buffer),size)
    return buffer
Snappy.decompress(file_object)

The Framing protocol required to decode. The frames contain up to 64K of compressed data. This defines a sequence of windows over the stream of data.

def decompress( self, file_object ):
    """Decompress a snappy file object. Locate each frame in the snappy
    framing protocol that's used by iWork (not precisely as specified
    by Google.) For each frame, do the LZ77 expansion on the frame's bytes to
    build the uncompressed data.

    Frames have a 4-byte header. Byte 0 is frame type, only type 0 (Compressed Data)
    is supported. Bytes 1-3 are a 24-bit size for the frame.
    Practically, it's limited to 65536 bytes.

    The CRC32 is omitted for iWork files

    ..  todo:: yield iterable byte stream for use in higher-levels of the protocol.

        It's not *required* to materialize the entire data buffer as a single object.
        The intent of the framing is to limit the size of the buffer required.

    Note that we could provide ``file_object`` file directly to ``lz77()`` function because
    lz77 protocol starts with the target uncompressed size at the front of the frame.
    We don't **actually** need to read the frame here.
    """
    data= bytearray()
    header= file_object.read(4)
    while header:
        # The Snappy framing format: type 0 (Compressed Data) with a 24-bit size.
        # The CRC32 is omitted for iWork files
        type_frame, size_frame = header[0], bytes_int(header[1:4])
        assert type_frame == 0, "Unsupported Snappy Frame {0}".format(type_frame)
        self.log.debug( "Frame type {0} size {1}".format( type_frame, size_frame ) )
        frame= file_object.read( size_frame )
        data.extend( self.lz77( frame ) )
        header= file_object.read(4)
    return data