9.2. Protobuf Module – Unpacking iWork ‘13 files.

This is not a full implementation of the Protobuf object representation. It is a minimal protobuf parser, just enough to unpack iWork ‘13 files.

9.2.1. The iWork ‘13 use of protobuf

https://github.com/obriensp/iWorkFileFormat

https://github.com/obriensp/iWorkFileFormat/blob/master/Docs/index.md

“Components are serialized into .iwa (iWork Archive) files, a custom format consisting of a Protobuf stream wrapped in a Snappy stream.

“Protobuf

“The uncompressed IWA contains the Component’s objects, serialized consecutively in a Protobuf stream. Each object begins with a varint representing the length of the ArchiveInfo message, followed by the ArchiveInfo message itself. The ArchiveInfo includes a variable number of MessageInfo messages describing the encoded Payloads that follow, though in practice iWork files seem to only have one payload message per ArchiveInfo.

“Payload

“The format of the payload is determined by the type field of the associated MessageInfo message. The iWork applications manually map these integer values to their respective Protobuf message types, and the mappings vary slightly between Keynote, Pages and Numbers. This information can be recovered by inspecting the TSPRegistry class at runtime.

“TSPRegistry

“The mapping between an object’s MessageInfo.type and its respective Protobuf message type must be extracted from the iWork applications at runtime. Attaching to Keynote via a debugger and inspecting [TSPRegistry sharedRegistry] shows:

“A full list of the type mappings can be found here.”

https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes

9.2.2. Message .proto files.

We require two of the files from this project to map the internal code numbers to protobuf message names: Numbers.json and Common.json.

These files are incorporated into this module as separate *.json files. See Installation via setup.py for more information on these files.

9.2.3. IWA Structure

Each object in an IWA begins with an ArchiveInfo message, preceded by a varint that gives the ArchiveInfo length.

message ArchiveInfo {
    optional uint64 identifier = 1;
    repeated MessageInfo message_infos = 2;
}

Within the ArchiveInfo are one or more MessageInfo messages. In practice there appears to be only one.

message MessageInfo {
    required uint32 type = 1;
    repeated uint32 version = 2 [packed = true];
    required uint32 length = 3;
    repeated FieldInfo field_infos = 4;
    repeated uint64 object_references = 5 [packed = true];
    repeated uint64 data_references = 6 [packed = true];
}

The ArchiveInfo, with its embedded MessageInfo, is followed by the payload. The MessageInfo length field gives the payload size. The payload must be decoded to get the actual data of interest.
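The ``[packed = true]`` fields of MessageInfo (``version``, ``object_references``, ``data_references``) arrive as a single length-delimited field whose body is a run of varints. The parser in this module leaves such a field as raw bytes. A minimal sketch of decoding one, assuming a hypothetical helper ``decode_varints`` that is not part of this module, and using made-up sample bytes:

```python
def decode_varints(raw):
    """Decode a packed run of base-128 varints into a list of ints.

    Hypothetical helper for illustration; not part of this module.
    ``raw`` is any iterable of ints in range(256), e.g. bytes or a tuple.
    """
    values = []
    value, shift = 0, 0
    for byte in raw:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7              # continuation bit set: more bytes follow
        else:
            values.append(value)    # last byte of this varint
            value, shift = 0, 0
    if shift:
        raise ValueError("truncated varint at end of packed field")
    return values

# Made-up body of a packed field holding the values 1, 0, 5.
print(decode_varints((0x01, 0x00, 0x05)))
# 300 needs two bytes (0xAC 0x02); it is followed here by a one-byte 1.
print(decode_varints(bytes([0xAC, 0x02, 0x01])))
```

Which fields are packed can only be known from the ``.proto`` definition; the wire format marks them as ordinary length-delimited fields.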

9.2.4. Implementation

Module docstring.

"""Read protobuf-serialized messages from IWA files used for Numbers '13 workbooks.

https://developers.google.com/protocol-buffers/

https://developers.google.com/protocol-buffers/docs/encoding

Requires :file:`Numbers.json` and :file:`Common.json` from the installation
directory.
"""

Some Overheads

import logging
import sys
import os
import json
from collections import defaultdict, ChainMap
from stingray.snappy import varint

class protobuf.Message

A definition of a generic protobuf message. The class defines the message instances and also has static methods that build the field mapping from a buffer of bytes.

We don’t use subclasses of Message. The proper way to use Protobuf is to compile .proto files into Message class definitions.

class Message:
    """Generic protobuf message built from sequence of bytes.

    :ivar name_: the protobuf message name.
    :ivar fields: a dict that maps field numbers to field values.
        The contained message objects are **not** parsed, but left as
        raw bytes.
    """
    def __init__( self, name, bytes ):
        self.name_= name
        self.fields= Message.parse_protobuf(bytes)
    def __repr__( self ):
        return "{0}({1})".format( self.name_, self.fields )
    def __getitem__( self, index ):
        return self.fields.get(index,[])
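Because ``__getitem__`` falls back to an empty list, a caller can loop over an absent repeated field without a guard. Indexing ``[0]`` on a missing field still raises ``IndexError``. A small sketch of that behavior, using a hypothetical stand-in class ``FieldBag`` that copies only the lookup logic and made-up field values:

```python
class FieldBag:
    """Hypothetical stand-in that mirrors only Message.__getitem__."""
    def __init__(self, fields):
        self.fields = fields

    def __getitem__(self, index):
        # Missing field numbers yield an empty list, not a KeyError.
        return self.fields.get(index, [])

info = FieldBag({1: [6000], 3: [12]})

print(info[1][0])                    # a field that is present
print(list(info[4]))                 # an absent repeated field: empty list
print(next(iter(info[4]), None))     # safe "first value or None" idiom
```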

Message.parse_protobuf_iter(message_bytes)

An iterative parser for the top-level (name, value) pairs in the protobuf stream, where each name is a field number. This yields all of the pairs that are parsed. This is a static method used when building message instances.

@staticmethod
def parse_protobuf_iter( message_bytes ):
    """Parse a protobuf stream, iterating over the name-value pairs that are present.
    This does NOT recursively descend through contained sub-messages.
    It does only the top-level message.
    """
    bytes_iter= iter(message_bytes)
    while True:
        try:
            item= varint( bytes_iter )
        except StopIteration:
            item= None
            break
        field_number, wire_type = item >> 3, item & 0b111
        if wire_type == 0b000:   # varint representation
            item_size= None # varint sizes vary, need something for debug message
            field_value = varint( bytes_iter )
        elif wire_type == 0b001: # 64-bit == 8-byte
            item_size= 8
            field_value = tuple( next(bytes_iter) for i in range(8) )
        elif wire_type == 0b010: # varint length and then content
            item_size= varint( bytes_iter )
            field_value = tuple( next(bytes_iter) for i in range(item_size) )
        elif wire_type == 0b101: # 32-bit == 4-byte
            item_size= 4
            field_value = tuple( next(bytes_iter) for i in range(4) )
        else:
            raise Exception( "Unsupported {0}: {1}, {2}".format(bin(item), field_number, wire_type) )
        Message.log.debug(
            '{0:b}, field {1}, type {2}, size {3}, = {4}'.format(
            item, field_number, wire_type, item_size, field_value) )
        yield field_number, field_value
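The key decoding can be checked by hand against the two standard examples from the protobuf encoding documentation: ``08 96 01`` (field 1, varint 150) and ``12 07 74 65 73 74 69 6e 67`` (field 2, the string "testing"). The sketch below uses ``read_varint``, a stand-in that we assume behaves like ``stingray.snappy.varint`` (reads one base-128 varint from an iterator of ints); it is not the module's own function.

```python
def read_varint(byte_iter):
    """Stand-in for stingray.snappy.varint (assumed behavior)."""
    value, shift = 0, 0
    while True:
        byte = next(byte_iter)
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:         # high bit clear: last byte of the varint
            return value
        shift += 7

def split_key(key):
    """Split a decoded key into (field_number, wire_type)."""
    return key >> 3, key & 0b111

# Example 1: varint field.
stream = iter(bytes.fromhex("089601"))
field_number, wire_type = split_key(read_varint(stream))
value = read_varint(stream)
print(field_number, wire_type, value)      # 1 0 150

# Example 2: length-delimited field.
stream = iter(bytes.fromhex("120774657374696e67"))
field_number, wire_type = split_key(read_varint(stream))
size = read_varint(stream)
body = bytes(next(stream) for _ in range(size))
print(field_number, wire_type, body)       # 2 2 b'testing'
```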

Message.parse_protobuf(message_bytes)

Create a bag in the form of a mapping {name: [value, value, value], ...}, where each name is a field number. This contains the top-level fields, including the raw bytes that can be used to parse lower-level messages.

@staticmethod
def parse_protobuf( message_bytes ):
    """Creates a bag of name-value pairs. Names can repeat, so values are
    an ordered list.
    """
    bag= defaultdict( list )
    for name, value in Message.parse_protobuf_iter( message_bytes ):
        bag[name].append( value )
    return dict(bag)
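The values in the bag for fixed-width and length-delimited fields are tuples of ints, not numbers or strings. Turning them into usable values is left to the caller, who must know the field's declared type from the ``.proto`` file. A minimal sketch using the standard ``struct`` module; the helper names ``as_double``, ``as_float`` and ``as_text`` are hypothetical and not part of this module. Protobuf fixed-width values are little-endian.

```python
import struct

def as_double(raw):
    """Interpret an 8-item tuple (wire type 1) as a little-endian double."""
    return struct.unpack("<d", bytes(raw))[0]

def as_float(raw):
    """Interpret a 4-item tuple (wire type 5) as a little-endian float."""
    return struct.unpack("<f", bytes(raw))[0]

def as_text(raw):
    """Interpret a length-delimited tuple (wire type 2) as UTF-8 text."""
    return bytes(raw).decode("utf-8")

# Tuples shaped like the values parse_protobuf_iter() yields.
print(as_double((0, 0, 0, 0, 0, 0, 0xF0, 0x3F)))   # 1.0
print(as_float((0, 0, 0x80, 0x3F)))                # 1.0
print(as_text((0x74, 0x65, 0x73, 0x74)))           # test
```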

A class-level logger. We don’t want a logger for each instance, since we’ll create many Message instances.

Message.log= logging.getLogger( Message.__qualname__ )

class protobuf.Archive_Reader

A Reader for IWA archives. This requires that the archive has been processed by the snappy.Snappy decompressor.

class Archive_Reader:
    """Read and yield Archive entries from MessageInfo.type and payload.
    Resolves the ID into a protobuf message name.

    Mapping from types to messages

    -   https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes/Numbers.json

    -   https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes/Common.json
    """
    def __init__( self ):
        self.tsp_names= self._tsp_name_map()
        self.log= logging.getLogger( self.__class__.__qualname__ )

Mapping from internal code numbers to protobuf message class names. This requires Numbers.json and Common.json from the installation directory.

@staticmethod
def _tsp_name_map():
    """Build the TSPRegistry map from messageInfo.type to message proto
    """
    def load_map( filename ):
        """JSON documents have string keys: these must be converted to int."""
        installed= os.path.dirname(__file__)
        with open( os.path.join(installed, filename) ) as source:
            raw= json.load( source )
        return dict( (int(key), value) for key, value in raw.items() )

    tsp_names= ChainMap(
        load_map("Numbers.json"),
        load_map("Common.json"),
    )
    return tsp_names
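Two details of this mapping are worth seeing in isolation: JSON object keys are always strings, so they must be converted to ``int`` before lookup, and a ``ChainMap`` searches its maps in order, so an entry in the first map hides the same key in the second. A small sketch with made-up documents; the message names below are placeholders, not real iWork type names.

```python
import json
from collections import ChainMap

def int_keys(document):
    """Parse a JSON object and convert its string keys to int."""
    return {int(key): value for key, value in json.loads(document).items()}

# Placeholder data standing in for Numbers.json and Common.json.
numbers = int_keys('{"1": "NumbersSpecific", "2": "OnlyInNumbers"}')
common = int_keys('{"1": "CommonFallback", "3": "OnlyInCommon"}')

names = ChainMap(numbers, common)

print(names[1])     # NumbersSpecific (first map wins)
print(names[3])     # OnlyInCommon (falls through to the second map)
print(4 in names)   # False (an unknown code raises KeyError on lookup)
```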

Archive_Reader.make_message(messageInfo, payload)

Create the payload message from a MessageInfo instance and the payload bytes.

def make_message( self, messageInfo, payload ):
    name= self.tsp_names[messageInfo[1][0]]
    return Message( name, payload )

Archive_Reader.archive_iter(data)

Iterate through all messages in this IWA archive. Locate the ArchiveInfo, MessageInfo and Payload. Parse the payload to create the final message that’s associated with the ID in the ArchiveInfo.

def archive_iter( self, data ):
    """Iterate through the iWork protobuf-serialized archive:
    Locate the ArchiveInfo object.
    Each ArchiveInfo contains a MessageInfo(s).
    Each MessageInfo describes a payload.
    It appears that there's only one MessageInfo per ArchiveInfo
    even though the ``.proto`` file indicates multiple as possible.

    Yield a sequence of iWork archived messages as pairs:
        identifier, Message object built from the payload.
    """
    protobuf= iter(data)
    while True:
        try:
            size= varint( protobuf )
        except StopIteration:
            size= None
            break
        self.log.debug( "{0} bytes".format(size) )
        message_bytes= [ next(protobuf) for i in range(size) ]
        archiveInfo= Message( "ArchiveInfo", message_bytes)
        archiveInfo.identifier = archiveInfo[1][0]
        archiveInfo.message_infos = [
            Message("MessageInfo", mi) for mi in archiveInfo[2] ]
        self.log.debug( " ArchiveInfo identifier={0} message_infos={1}".format(
            archiveInfo.identifier, archiveInfo.message_infos) )
        messageInfo_0= archiveInfo.message_infos[0]
        messageInfo_0.length= messageInfo_0[3][0]
        self.log.debug( "   MessageInfo length={0}".format(messageInfo_0.length) )
        payload_raw= [ next(protobuf) for i in range(messageInfo_0.length) ]
        self.log.debug( "     Payload {0!r}".format(payload_raw) )
        message= self.make_message( messageInfo_0, payload_raw )
        yield archiveInfo.identifier, message
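The framing that ``archive_iter`` walks (varint size, ArchiveInfo, payload, repeated) can be exercised without a real Numbers file. The sketch below builds a synthetic two-object stream and walks it again. It is a simplified, self-contained re-implementation for illustration, not the module's code: every function name here is hypothetical, only wire types 0 and 2 are handled, and the identifiers, type codes and payload bytes are made up.

```python
from itertools import islice

def encode_varint(value):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        low, value = value & 0x7F, value >> 7
        if value:
            out.append(low | 0x80)      # more bytes follow
        else:
            out.append(low)
            return bytes(out)

def read_varint(byte_iter):
    """Read one varint from an iterator of ints."""
    value, shift = 0, 0
    while True:
        byte = next(byte_iter)
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value
        shift += 7

def take(byte_iter, count):
    """Read exactly ``count`` bytes or fail loudly on truncated input."""
    chunk = bytes(islice(byte_iter, count))
    if len(chunk) != count:
        raise ValueError("truncated stream")
    return chunk

def top_level_fields(raw):
    """Bag of {field_number: [values]} for wire types 0 and 2 only."""
    fields, stream = {}, iter(raw)
    while True:
        try:
            key = read_varint(stream)
        except StopIteration:
            return fields
        number, wire_type = key >> 3, key & 0b111
        if wire_type == 0:
            value = read_varint(stream)
        elif wire_type == 2:
            value = take(stream, read_varint(stream))
        else:
            raise ValueError("unsupported wire type {0}".format(wire_type))
        fields.setdefault(number, []).append(value)

def walk_archive(data):
    """Yield (identifier, type code, payload bytes) for each object."""
    stream = iter(data)
    while True:
        try:
            size = read_varint(stream)
        except StopIteration:
            return
        archive_info = top_level_fields(take(stream, size))
        message_info = top_level_fields(archive_info[2][0])
        length = message_info[3][0]
        yield archive_info[1][0], message_info[1][0], take(stream, length)

def field(number, wire_type, body):
    """Encode one field: key varint followed by the already-encoded body."""
    return encode_varint(number << 3 | wire_type) + body

def make_object(identifier, type_code, payload):
    """Build size + ArchiveInfo + payload for one synthetic object."""
    message_info = (field(1, 0, encode_varint(type_code))
                    + field(3, 0, encode_varint(len(payload))))
    archive_info = (field(1, 0, encode_varint(identifier))
                    + field(2, 2, encode_varint(len(message_info)) + message_info))
    return encode_varint(len(archive_info)) + archive_info + payload

data = make_object(1, 1, b"\x08\x01") + make_object(300, 6000, b"")
for identifier, type_code, payload in walk_archive(data):
    print(identifier, type_code, payload)
```

Reading fixed-size chunks through a helper such as ``take`` reports a truncated stream as a ``ValueError``. Calling ``next()`` directly inside a generator lets ``StopIteration`` escape, which Python 3.7 and later convert to a ``RuntimeError``.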