.. #!/usr/bin/env python3 .. _`protobuf`: ############################################################### Protobuf Module -- Unpacking iWork 13 files. ############################################################### This is not a full implementation of Protobuf object representation. This is a minimal implementation of protobuf parsing, enough to unpack iWork '13 files. .. py:module:: protobuf The iWork '13 use of protobuf =============================================== https://github.com/obriensp/iWorkFileFormat https://github.com/obriensp/iWorkFileFormat/blob/master/Docs/index.md "Components are serialized into .iwa (iWork Archive) files, a custom format consisting of a Protobuf stream wrapped in a Snappy stream. "Protobuf "The uncompresed IWA contains the Component's objects, serialized consecutively in a Protobuf stream. Each object begins with a varint representing the length of the ArchiveInfo message, followed by the ArchiveInfo message itself. The ArchiveInfo includes a variable number of MessageInfo messages describing the encoded Payloads that follow, though in practice iWork files seem to only have one payload message per ArchiveInfo. "Payload "The format of the payload is determined by the type field of the associated MessageInfo message. The iWork applications manually map these integer values to their respective Protobuf message types, and the mappings vary slightly between Keynote, Pages and Numbers. This information can be recovered by inspecting the TSPRegistry class at runtime. "TSPRegistry" "The mapping between an object's MessageInfo.type and its respective Protobuf message type must by extracted from the iWork applications at runtime. Attaching to Keynote via a debugger and inspecting [TSPRegistry sharedRegistry] shows: "A full list of the type mappings can be found here." https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes Message ``.proto`` files. - Table details https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Messages/Proto/TSTArchives.proto - Numbers details https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Messages/Proto/TNArchives.proto - Calculating Engine details https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Messages/Proto/TSCEArchives.proto - Structure (i.e., TreeNode, perhaps more relevant for Keynote) https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Messages/Proto/TSKArchives.proto We require two of the files from this project to map the internal code numbers - https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes/Numbers.json - https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes/Common.json These files are incorporated into this module as separate :file:`*.json` files. See :ref:`installation` for more information on these files. protobuf =============================================== For more information on protobuf, see the following: https://developers.google.com/protocol-buffers/ https://developers.google.com/protocol-buffers/docs/encoding http://en.wikipedia.org/wiki/Protocol_Buffers IWA Structure ================ Each IWA has an ArchiveInfo message. .. parsed-literal:: message ArchiveInfo { optional uint64 identifier = 1; repeated MessageInfo message_infos = 2; } Within the ArchiveInfo is a MessageInfo message. .. parsed-literal:: message MessageInfo { required uint32 type = 1; repeated uint32 version = 2 [packed = true]; required uint32 length = 3; repeated FieldInfo field_infos = 4; repeated uint64 object_references = 5 [packed = true]; repeated uint64 data_references = 6 [packed = true]; } The MessageInfo is followed by the payload. That must be decoded to get the actual data of interest. Implementation =============== Module docstring. :: """Read protobuf-serialized messages from IWA files used for Numbers '13 workbooks. https://developers.google.com/protocol-buffers/ https://developers.google.com/protocol-buffers/docs/encoding Requires :file:`Numbers.json` and :file:`Common.json` from the installation directory. """ Some Overheads :: import logging import sys import os import json from collections import defaultdict, ChainMap from stingray.snappy import varint .. py:class:: Message A definition of a generic protobuf message. This is both an instance and it also has staticmethods that build instances from a buffer of bytes. We don't use subclasses of ``Message``. The proper way to use Protobuf is to compile ``.proto`` files into Message class definitions. :: class Message: """Generic protobuf message built from sequence of bytes. :ivar name_: the protobuf message name. :ivar fields: a dict that maps field numbers to field values. The contained message objects are **not** parsed, but left as raw bytes. """ def __init__( self, name, bytes ): self.name_= name self.fields= Message.parse_protobuf(bytes) def __repr__( self ): return "{0}({1})".format( self.name_, self.fields ) def __getitem__( self, index ): return self.fields.get(index,[]) .. py:method:: Message.parse_protobuf_iter( message_bytes ) An iterative parser for the top-level (name, value) pairs in the protobuf stream. This yields all of the pairs that are parsed. This a static method which builds message instances. :: @staticmethod def parse_protobuf_iter( message_bytes ): """Parse a protobuf stream, iterating over the name-value pairs that are present. This does NOT recursively descend through contained sub-messages. It does only the top-level message. """ bytes_iter= iter(message_bytes) while True: try: item= varint( bytes_iter ) except StopIteration: item= None break field_number, wire_type = item >> 3, item & 0b111 if wire_type == 0b000: # varint representation item_size= None # varint sizes vary, need something for debug message field_value = varint( bytes_iter ) elif wire_type == 0b001: # 64-bit == 8-byte item_size= 8 field_value = tuple( next(bytes_iter) for i in range(8) ) elif wire_type == 0b010: # varint length and then content item_size= varint( bytes_iter ) field_value = tuple( next(bytes_iter) for i in range(item_size) ) elif wire_type == 0b101: # 32-byte == 4-byte item_size= 4 field_value = tuple( next(bytes_iter) for i in range(4) ) else: raise Exception( "Unsupported {0}: {1}, {2}", bin(item), field_number, wire_type ) Message.log.debug( '{0:b}, field {1}, type {2}, size {3}, = {4}'.format( item, field_number, wire_type, item_size, field_value) ) yield field_number, field_value .. py:method:: Message.parse_protobuf( message_bytes ) Create a bag in the form of a mapping ``{name: [value,value,value], ... }``. This will contain the top-level identifiers and the bytes that could be used to parse lower-level messages. :: @staticmethod def parse_protobuf( message_bytes ): """Creates a bag of name-value pairs. Names can repeat, so values are an ordered list. """ bag= defaultdict( list ) for name, value in Message.parse_protobuf_iter( message_bytes ): bag[name].append( value ) return dict(bag) A class-level logger. We don't want a logger for each instance, since we'll create many ``Message`` instances. :: Message.log= logging.getLogger( Message.__class__.__qualname__ ) .. py:class:: Archive_Reader A Reader for IWA archives. This requires that the archive has been processed by the :py:class:`snappy.Snappy` decompressor. :: class Archive_Reader: """Read and yield Archive entries from MessageInfo.type and payload. Resolves the ID into a protobuf message name. Mapping from types to messages - https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes/Numbers.json - https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes/Common.json """ def __init__( self ): self.tsp_names= self._tsp_name_map() self.log= logging.getLogger( self.__class__.__qualname__ ) Mapping from internal code numbers of protobuf message class names. This requires :file:`Numbers.json` and :file:`Common.json` from the installation directory. :: @staticmethod def _tsp_name_map(): """Build the TSPRegistry map from messageInfo.type to message proto """ def load_map( filename ): """JSON documents have string keys: these must be converted to int.""" installed= os.path.dirname(__file__) with open( os.path.join(installed, filename) ) as source: raw= json.load( source ) return dict( (int(key), value) for key, value in raw.items() ) tsp_names= ChainMap( load_map("Numbers.json"), load_map("Common.json"), ) return tsp_names .. py:method:: Archive_Reader.make_message( messageInfo, payload ) Create the payload message from a MessageInfo instance and the payload bytes. :: def make_message( self, messageInfo, payload ): name= self.tsp_names[messageInfo[1][0]] return Message( name, payload ) .. py:method:: Archive_Reader.archive_iter( data ) Iterate through all messages in this IWA archive. Locate the ArchiveInfo, MessageInfo and Payload. Parse the payload to create the final message that's associated with the ID in the ArchiveInfo. :: def archive_iter( self, data ): """Iterate through the iWork protobuf-serialized archive: Locate the ArchiveInfo object. Each ArchiveInfo contains a MessageInfo(s). Each MessageInfo describes a payload. It appears that there's only one MessageInfo per ArchiveInfo even though the ``.proto`` file indicates multiple as possible. Yield a sequence of iWork archived messages as pairs: identifier, Message object built from the payload. """ protobuf= iter(data) while True: try: size= varint( protobuf ) except StopIteration: size= None break self.log.debug( "{0} bytes".format(size) ) message_bytes= [ next(protobuf) for i in range(size) ] archiveInfo= Message( "ArchiveInfo", message_bytes) archiveInfo.identifier = archiveInfo[1][0] archiveInfo.message_infos = [ Message("MessageInfo", mi) for mi in archiveInfo[2] ] self.log.debug( " ArchiveInfo identifier={0} message_infos={1}".format( archiveInfo.identifier, archiveInfo.message_infos) ) messageInfo_0= archiveInfo.message_infos[0] messageInfo_0.length= messageInfo_0[3][0] self.log.debug( " MessageInfo length={0}".format(messageInfo_0.length) ) payload_raw= [ next(protobuf) for i in range(messageInfo_0.length) ] self.log.debug( " Payload {0!r}".format(payload_raw) ) message= self.make_message( messageInfo_0, payload_raw ) yield archiveInfo.identifier, message