9.2. Protobuf Module – Unpacking iWork ‘13 files
This is not a full implementation of the Protobuf object representation. It is a minimal protobuf parser, just enough to unpack iWork ‘13 files.
9.2.1. The iWork ‘13 use of protobuf
https://github.com/obriensp/iWorkFileFormat
https://github.com/obriensp/iWorkFileFormat/blob/master/Docs/index.md
“Components are serialized into .iwa (iWork Archive) files, a custom format consisting of a Protobuf stream wrapped in a Snappy stream.
“Protobuf
“The uncompressed IWA contains the Component’s objects, serialized consecutively in a Protobuf stream. Each object begins with a varint representing the length of the ArchiveInfo message, followed by the ArchiveInfo message itself. The ArchiveInfo includes a variable number of MessageInfo messages describing the encoded Payloads that follow, though in practice iWork files seem to only have one payload message per ArchiveInfo.
“Payload
“The format of the payload is determined by the type field of the associated MessageInfo message. The iWork applications manually map these integer values to their respective Protobuf message types, and the mappings vary slightly between Keynote, Pages and Numbers. This information can be recovered by inspecting the TSPRegistry class at runtime.
“TSPRegistry”
“The mapping between an object’s MessageInfo.type and its respective Protobuf message type must be extracted from the iWork applications at runtime. Attaching to Keynote via a debugger and inspecting [TSPRegistry sharedRegistry] shows:
“A full list of the type mappings can be found here.”
Message .proto files.
Table details
Numbers details
Calculating Engine details
Structure (i.e., TreeNode, perhaps more relevant for Keynote)
We require two of the files from this project to map the internal code numbers to protobuf message names:
- https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes/Numbers.json
- https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes/Common.json
These files are incorporated into this module as separate *.json files.
See Installation via setup.py for more information on these files.
9.2.2. protobuf
For more information on protobuf, see the following:
https://developers.google.com/protocol-buffers/
https://developers.google.com/protocol-buffers/docs/encoding
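The two encoding rules this module relies on are the base-128 varint and the field key, which packs a field number and a wire type into one varint. The following is a minimal sketch of both; `read_varint` and `split_key` are hypothetical names used for illustration, and `stingray.snappy.varint` is only assumed to behave like `read_varint`.

```python
def read_varint(byte_iter):
    """Decode one base-128 varint from an iterator of ints (hypothetical helper)."""
    result = 0
    shift = 0
    while True:
        byte = next(byte_iter)            # StopIteration signals end of input
        result |= (byte & 0x7F) << shift  # low 7 bits carry data, least significant group first
        if byte & 0x80 == 0:              # high bit clear: last byte of this varint
            return result
        shift += 7


def split_key(key):
    """Split a decoded field key into (field_number, wire_type)."""
    return key >> 3, key & 0b111


# 0xAC 0x02 is the two-byte varint encoding of 300.
value = read_varint(iter(b"\xac\x02"))
# 0x12 is the key for field 2 with wire type 2 (length-delimited).
field_number, wire_type = split_key(0x12)
print(value, field_number, wire_type)
```

Iterating over a `bytes` object yields ints in Python 3, which is why the helper can work on a plain iterator.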
9.2.3. IWA Structure
Each object in an IWA begins with an ArchiveInfo message.
message ArchiveInfo {
    optional uint64 identifier = 1;
    repeated MessageInfo message_infos = 2;
}
Within the ArchiveInfo are one or more MessageInfo messages.
message MessageInfo {
    required uint32 type = 1;
    repeated uint32 version = 2 [packed = true];
    required uint32 length = 3;
    repeated FieldInfo field_infos = 4;
    repeated uint64 object_references = 5 [packed = true];
    repeated uint64 data_references = 6 [packed = true];
}
The ArchiveInfo is followed by the payload, whose size is given by the MessageInfo length field. The payload must be decoded to get the actual data of interest.
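To make the layout concrete, here is a sketch that builds one synthetic object record: a varint length, an ArchiveInfo containing a single MessageInfo, then the payload. The helper names, the identifier, and the type code are all made up for illustration; only the field numbers come from the two message definitions above.

```python
def encode_varint(value):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)   # more bytes follow
        else:
            out.append(low)          # final byte
            return bytes(out)


def varint_field(number, value):
    """Key with wire type 0, then the varint value."""
    return encode_varint((number << 3) | 0) + encode_varint(value)


def bytes_field(number, content):
    """Key with wire type 2, then a varint length, then the content."""
    return encode_varint((number << 3) | 2) + encode_varint(len(content)) + content


def encode_object(identifier, type_code, payload):
    """Build: varint(len(ArchiveInfo)) + ArchiveInfo + payload."""
    message_info = varint_field(1, type_code) + varint_field(3, len(payload))
    archive_info = varint_field(1, identifier) + bytes_field(2, message_info)
    return encode_varint(len(archive_info)) + archive_info + payload


# A two-byte payload: field 1, wire type 0, value 42.
record = encode_object(identifier=1, type_code=2, payload=b"\x08\x2a")
print(record.hex())
```

Note that the payload sits outside the ArchiveInfo, so a reader has to parse the MessageInfo before it knows how many payload bytes to consume.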
9.2.4. Implementation
Module docstring.
"""Read protobuf-serialized messages from IWA files used for Numbers '13 workbooks.
https://developers.google.com/protocol-buffers/
https://developers.google.com/protocol-buffers/docs/encoding
Requires :file:`Numbers.json` and :file:`Common.json` from the installation
directory.
"""
Some Overheads
import logging
import sys
import os
import json
from collections import defaultdict, ChainMap
from stingray.snappy import varint
class protobuf.Message
A definition of a generic protobuf message. An instance holds one parsed message; the class also has static methods that build instances from a buffer of bytes.
We don’t use subclasses of Message. The proper way to use Protobuf is to compile .proto files into Message class definitions.
class Message:
    """Generic protobuf message built from sequence of bytes.
    :ivar name_: the protobuf message name.
    :ivar fields: a dict that maps field numbers to field values.
        The contained message objects are **not** parsed, but left as
        raw bytes.
    """
    def __init__( self, name, bytes ):
        self.name_= name
        self.fields= Message.parse_protobuf(bytes)
    def __repr__( self ):
        return "{0}({1})".format( self.name_, self.fields )
    def __getitem__( self, index ):
        return self.fields.get(index,[])
Message.parse_protobuf_iter(message_bytes)
An iterative parser for the top-level (field number, value) pairs in the protobuf stream. This yields all of the pairs that are parsed. This is a static method which is used to build message instances.
    @staticmethod
    def parse_protobuf_iter( message_bytes ):
        """Parse a protobuf stream, iterating over the name-value pairs that are present.
        This does NOT recursively descend through contained sub-messages.
        It does only the top-level message.
        """
        bytes_iter= iter(message_bytes)
        while True:
            try:
                item= varint( bytes_iter )
            except StopIteration:
                item= None
                break
            # The key packs the field number above a 3-bit wire type.
            field_number, wire_type = item >> 3, item & 0b111
            if wire_type == 0b000: # varint representation
                item_size= None # varint sizes vary, need something for debug message
                field_value = varint( bytes_iter )
            elif wire_type == 0b001: # 64-bit == 8-byte
                item_size= 8
                field_value = tuple( next(bytes_iter) for i in range(8) )
            elif wire_type == 0b010: # varint length and then content
                item_size= varint( bytes_iter )
                field_value = tuple( next(bytes_iter) for i in range(item_size) )
            elif wire_type == 0b101: # 32-bit == 4-byte
                item_size= 4
                field_value = tuple( next(bytes_iter) for i in range(4) )
            else:
                raise Exception( "Unsupported {0}: {1}, {2}".format(
                    bin(item), field_number, wire_type ) )
            Message.log.debug(
                '{0:b}, field {1}, type {2}, size {3}, = {4}'.format(
                item, field_number, wire_type, item_size, field_value) )
            yield field_number, field_value
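The parser yields fixed-width and length-delimited values as raw tuples of ints, because the wire format alone does not say whether eight bytes are a double or a fixed64, or whether a length-delimited field is text or a sub-message. That decision comes from the relevant .proto definition. A sketch of the conversions a caller might apply, assuming the field types are already known (the helper names are hypothetical, not part of this module):

```python
import struct


def as_double(raw):
    """Interpret an 8-byte wire type 1 value as a little-endian IEEE 754 double."""
    return struct.unpack("<d", bytes(raw))[0]


def as_fixed32(raw):
    """Interpret a 4-byte wire type 5 value as a little-endian unsigned int."""
    return struct.unpack("<I", bytes(raw))[0]


def as_text(raw):
    """Interpret a wire type 2 value as a UTF-8 string."""
    return bytes(raw).decode("utf-8")


# Tuples shaped like the values parse_protobuf_iter yields.
raw_double = tuple(struct.pack("<d", 1.5))
raw_fixed32 = (0x2A, 0, 0, 0)
raw_text = tuple("Sheet 1".encode("utf-8"))

print(as_double(raw_double), as_fixed32(raw_fixed32), as_text(raw_text))
```

Protobuf stores fixed-width values in little-endian byte order, hence the `<` prefix in each format string.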
Message.parse_protobuf(message_bytes)
Create a bag in the form of a mapping {name: [value, value, value], ... }. This will contain the top-level identifiers and the bytes that could be used to parse lower-level messages.
    @staticmethod
    def parse_protobuf( message_bytes ):
        """Creates a bag of name-value pairs. Names can repeat, so values are
        an ordered list.
        """
        bag= defaultdict( list )
        for name, value in Message.parse_protobuf_iter( message_bytes ):
            bag[name].append( value )
        return dict(bag)
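The bag keeps every field as a list, so a caller indexes `[0]` for a singular field and keeps the whole list for a repeated one. A small sketch of that access pattern, using hand-written pairs in place of parser output (`make_bag` is a hypothetical stand-in for `parse_protobuf`):

```python
from collections import defaultdict


def make_bag(pairs):
    """Group (field_number, value) pairs; repeated numbers keep arrival order."""
    bag = defaultdict(list)
    for number, value in pairs:
        bag[number].append(value)
    return dict(bag)   # a plain dict, so a missing key does not create an entry


# Field 2 appears twice, as a repeated field would.
pairs = [(1, 7), (2, (8, 1)), (2, (8, 2))]
bag = make_bag(pairs)

singular = bag[1][0]        # singular field: take the first value
repeated = bag[2]           # repeated field: keep the whole list
absent = bag.get(9, [])     # absent field: empty list, as Message.__getitem__ returns
print(singular, repeated, absent)
```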
A class-level logger. We don’t want a logger for each instance, since we’ll create many Message instances.
Message.log= logging.getLogger( Message.__qualname__ )
class protobuf.Archive_Reader
A Reader for IWA archives. This requires that the archive has been processed by the snappy.Snappy decompressor.
class Archive_Reader:
    """Read and yield Archive entries from MessageInfo.type and payload.
    Resolves the ID into a protobuf message name.
    Mapping from types to messages
    - https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes/Numbers.json
    - https://github.com/obriensp/iWorkFileFormat/blob/master/iWorkFileInspector/iWorkFileInspector/Persistence/MessageTypes/Common.json
    """
    def __init__( self ):
        self.tsp_names= self._tsp_name_map()
        self.log= logging.getLogger( self.__class__.__qualname__ )
Mapping from internal code numbers to protobuf message class names. This requires Numbers.json and Common.json from the installation directory.
    @staticmethod
    def _tsp_name_map():
        """Build the TSPRegistry map from messageInfo.type to message proto
        """
        def load_map( filename ):
            """JSON documents have string keys: these must be converted to int."""
            installed= os.path.dirname(__file__)
            with open( os.path.join(installed, filename) ) as source:
                raw= json.load( source )
            return dict( (int(key), value) for key, value in raw.items() )
        tsp_names= ChainMap(
            load_map("Numbers.json"),
            load_map("Common.json"),
        )
        return tsp_names
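Two details of this lookup are easy to miss: JSON object keys are always strings, so they must be converted before they can match the integer `MessageInfo.type`, and a `ChainMap` searches its mappings in order, so a code defined in both files resolves to the first one listed. The sketch below shows both with inline JSON; the type codes and message names are invented placeholders, not entries copied from the real files.

```python
import json
from collections import ChainMap


def load_map(document):
    """Parse a JSON object and convert its string keys to int."""
    raw = json.loads(document)
    return {int(key): value for key, value in raw.items()}


# Placeholder documents standing in for Numbers.json and Common.json.
numbers_json = '{"1": "Example.NumbersDocument", "2": "Example.Sheet"}'
common_json = '{"1": "Example.Shadowed", "200": "Example.CommonDocument"}'

tsp_names = ChainMap(load_map(numbers_json), load_map(common_json))

print(tsp_names[1])                    # found in the first mapping, which wins
print(tsp_names[200])                  # falls through to the second mapping
print(tsp_names.get(999, "unknown"))   # in neither mapping
```

Listing the application-specific map first means it can override the common one without either file being modified.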
Archive_Reader.make_message(messageInfo, payload)
Create the payload message from a MessageInfo instance and the payload bytes.
    def make_message( self, messageInfo, payload ):
        # Field 1 of MessageInfo is the type code; look up its message name.
        name= self.tsp_names[messageInfo[1][0]]
        return Message( name, payload )
Archive_Reader.archive_iter(data)
Iterate through all messages in this IWA archive. Locate the ArchiveInfo, MessageInfo and Payload. Parse the payload to create the final message that’s associated with the ID in the ArchiveInfo.
    def archive_iter( self, data ):
        """Iterate through the iWork protobuf-serialized archive:
        Locate the ArchiveInfo object.
        Each ArchiveInfo contains a MessageInfo(s).
        Each MessageInfo describes a payload.
        It appears that there's only one MessageInfo per ArchiveInfo
        even though the ``.proto`` file indicates multiple as possible.
        Yield a sequence of iWork archived messages as pairs:
        identifier, Message object built from the payload.
        """
        protobuf= iter(data)
        while True:
            try:
                size= varint( protobuf )
            except StopIteration:
                size= None
                break
            self.log.debug( "{0} bytes".format(size) )
            message_bytes= [ next(protobuf) for i in range(size) ]
            archiveInfo= Message( "ArchiveInfo", message_bytes)
            archiveInfo.identifier = archiveInfo[1][0]
            archiveInfo.message_infos = [
                Message("MessageInfo", mi) for mi in archiveInfo[2] ]
            self.log.debug( " ArchiveInfo identifier={0} message_infos={1}".format(
                archiveInfo.identifier, archiveInfo.message_infos) )
            messageInfo_0= archiveInfo.message_infos[0]
            messageInfo_0.length= messageInfo_0[3][0]
            self.log.debug( " MessageInfo length={0}".format(messageInfo_0.length) )
            payload_raw= [ next(protobuf) for i in range(messageInfo_0.length) ]
            self.log.debug( " Payload {0!r}".format(payload_raw) )
            message= self.make_message( messageInfo_0, payload_raw )
            yield archiveInfo.identifier, message
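The whole loop can be exercised without an iWork file by feeding it a hand-built record. The following is a self-contained sketch of the same steps, not the module's own code: `read_varint`, `parse_fields`, and `iter_objects` are hypothetical stand-ins, only wire types 0 and 2 are handled, and the sample bytes describe an invented object with identifier 1, type code 2, and a two-byte payload.

```python
from collections import defaultdict
from itertools import islice


def read_varint(byte_iter):
    """Decode one base-128 varint; StopIteration means the input is exhausted."""
    result, shift = 0, 0
    while True:
        byte = next(byte_iter)
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7


def parse_fields(message_bytes):
    """Return {field_number: [values]} for the top level of one message."""
    bag = defaultdict(list)
    source = iter(message_bytes)
    while True:
        try:
            key = read_varint(source)
        except StopIteration:
            return dict(bag)
        number, wire_type = key >> 3, key & 0b111
        if wire_type == 0:                       # varint value
            bag[number].append(read_varint(source))
        elif wire_type == 2:                     # length-delimited, kept as raw bytes
            size = read_varint(source)
            bag[number].append(bytes(islice(source, size)))
        else:
            raise ValueError("unsupported wire type {0}".format(wire_type))


def iter_objects(data):
    """Yield (identifier, type_code, payload_fields) for each object in the stream."""
    source = iter(data)
    while True:
        try:
            size = read_varint(source)           # length of the ArchiveInfo
        except StopIteration:
            return
        archive_info = parse_fields(bytes(islice(source, size)))
        identifier = archive_info[1][0]
        message_info = parse_fields(archive_info[2][0])   # first MessageInfo only
        type_code, length = message_info[1][0], message_info[3][0]
        payload = bytes(islice(source, length))  # payload follows the ArchiveInfo
        yield identifier, type_code, parse_fields(payload)


# Two copies of the same invented record, back to back.
data = bytes.fromhex("080801120408021802082a") * 2
for identifier, type_code, fields in iter_objects(data):
    print(identifier, type_code, fields)
```

Sharing one iterator across the length, the ArchiveInfo, and the payload is what keeps the reader positioned correctly for the next object.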