.. #!/usr/bin/env python3 .. _`sheets`: ###################################### Sheet Module -- Sheet and Row Access ###################################### .. py:module:: sheet A *Sheet* is a generator of *Row* objects. A *Row* is a sequence of :py:class:`cell.Cell` instances, identified by position within the row. We have three variations on :py:class:`sheet.Sheet`. - A simple :py:class:`sheet.Sheet` lacks a schema. (This corresponds with :py:func:`csv.reader`.) For workbooks with a well-known physical format, the schema can be optional. Each :py:class:`sheet.Row` object can be built eagerly and accessed by position. - A sheet with a schema. There are two variations. - :py:class:`sheet.EmbeddedSchemaSheet` contains a schema. This could be a simple as column titles in the first row. (This corresponds to :py:class:`csv.DictReader`.) Or it could be considerably more complex. - :py:class:`sheet.ExternalSchemaSheet` requires an external schema. This schema may be simply a list of column titles supplied externally. More often, the schema is a complete physical format description for Fixed or COBOL format files. A known physical format (like a workbook) can build :py:class:`sheet.Row` objects eagerly with or without a schema. In the case of COBOL and fixed-format files, however, a :py:class:`sheet.Row` cannot be built eagerly. It must be a lazy object which only builds :py:class:`cell.Cell` as needed. See :ref:`cobol` for details. Get Embedded Schema Use Case =============================== For an :py:class:`sheet.EmbeddedSchemaSheet`, the application (or Workbook) must do a three-step dance to get the data using schema that is embedded in the sheet. 1. Build a :py:class:`sheet.EmbeddedSchemaSheet` with an "embedded schema loader" class. (For example, :py:class:`schema.loader.HeadingRowSchemaLoader`.) The loader partitions rows into two sets: header and data. 2. Load the schema from the sheet. The :py:class:`sheet.Sheet` will build an object of the loader class and use it to gather the schema information. The schema loading may involve skipping irrelevant rows or combining multi-line headings or anything else required to parse the schema. 3. Get the rows from the sheet. This will, also, invoke the attached loader to filter rows so that the header is not seen as data. The code might look like this: .. parsed-literal:: with *open* as wb: sheet = EmbeddedSchemaSheet( workbook, 'Sheet1', HeadingRowSchemaLoader ) counts= process_sheet( sheet ) pprint.pprint( counts ) The idea is to simply access a sheet with column titles, no matter how complex the column titles turn out to be. Get External Schema Use Case =============================== For an :py:class:`sheet.ExternalSchemaSheet`, the application (or Workbook) must do a four-step dance to get data using schema. 1. Build a :py:class:`schema.loader.ExternalSchemaLoader` as a schema loader. This loader will require a source workbook, sheet name and a reader object. 2. Get the Schema object from the loader. 3. Build a :py:class:`sheet.ExternalSchemaSheet` with the Schema object. 4. Get the rows from the sheet. And yes, the external source, is another spreadsheet! Worse, the external source could be a fixed file or workbook for which a meta-schema is required to read the schema. The code might look like this: .. parsed-literal:: with *open schema* as swb: esl = ExternalSchemaLoader( swb, sheet_name='Schema' ) schema = esl.load() with *open data* as wb: sheet = ExternalSchemaSheet( wb, 'Sheet1', schema ) counts= process_sheet( sheet ) pprint.pprint( counts ) The idea is to get a schema and then use the schema to access data. Get Rows Use Case ====================== The essential job of a :py:class:`sheet.Sheet` is to produce :py:class:`sheet.Row` instances. A row is a sequence of :py:class:`cell.Cell` instances. Note that :py:mod:`csv` is eager about building a row from the source data. This isn't universally appropriate. COBOL files require lazy construction of the row's cells. A :py:class:`schema.Schema` can transform a sequence row into a dictionary row or a named tuple row. The :py:attr:`schema.Attribute.name` becomes the key for this row-as-dictionary. We specifically delegate the row-as-dictionary interpretation to the :py:class:`schema.Schema`, and avoid doing it in the :py:class:`sheet.Sheet`. This is because most workbook schemata are flat. However, a COBOL schema can have a very complex structure, making the row-as-dictionary too simplistic to be useful. As noted above, there are two candidate implementations of a Row. - **Eager**. Appropriate for most (but not all) Physical Formats. The idea is to apply the schema immediately to create the row as a tuple of cells. :py:mod:`csv` does this, and it can be applied to other workbook formats. It can be applied to simple, flat Fixed format files. - **Lazy**. This is more appropriate for Fixed format files and COBOL format files. Specifically, the data conversion, redefines and repeating group issues force us to wait for cell access rather than immediately create all possible cells. Indeed, for COBOL files with REDEFINES definitions, some of the cells cannot be built eagerly; application logic must determine which attributes are valid or invalid. Note that the API is the same. The implementation differs. Here's our prototypical code. .. parsed-literal:: def process_sheet( sheet ): counts= defaultdict( int ) for row in sheet.rows(): #\ *row is a sequence of* Cell *instances* print( repr(c) for c in row ) counts['read'] += 1 return counts Ultimately, the sequence nature of a row is unsatisfying. We'll have to wait until :ref:`schema` to extend this into something useful. Sheet Identification ===================== For CSV and TAB files, as well as COBOL and Flat files, there is one anonymous "sheet" that is the entire workbook. For XLS, XLSX, and ODS formats, however, there are sheets within the workbook. For Numbers, there are "pages" or "workspaces" that have multiple tables. Each Numbers **table** is -- effectively -- a :py:class:`sheet.Sheet`. The intermediate organization level, "workspace", is an additional detail. We handle this in the following way. - One anonymous sheet has a name either of ``None`` or the basename of the file. - Simple sheets have names which are simple strings. - Numbers workspaces with sheets have names which are two-tuples of workspace ("sheet") and table name. Model ======= .. code-block:: none http://yuml.me/diagram/scruffy;dir:td/class/ #sheet, [Workbook]<>-n[Sheet], [Sheet]<>-n[Row], [Row]^[LazyRow], [LazyRow]-gets->[Workbook], [Sheet]^[EmbeddedSchemaSheet], [Sheet]^[ExternalSchemaSheet], [EmbeddedSchemaSheet]->[SchemaLoader]. .. image:: sheet.png Overheads ========== Sheet and Row are essentially lazy sequences. :: """stingray.sheet -- Defines Row as a collection of Cells and Sheet as a collection of Rows. """ from collections import Sequence There are two "implicit" dependencies, also. A row depends on details of an :py:class:`schema.Attribute` and a :py:class:`workbook.base.Workbook`. However, there's no real need to present a formal import for this. The Attribute and Workbook are simply opaque objects passed around as arguments. Sheet Class ============= .. py:class:: Sheet An iterator over the rows of data in a workbook. Subclasses implement different bindings for the sheet's schema information. This is largely abstract, since there's no schema binding available. There are subclasses which have a schema binding. See :py:class:`sheet.ExternalSchemaSheet` and :py:class:`sheet.EmbeddedSchemaSheet`. .. py:attribute:: workbook The :py:class:`Workbook` which contains this Sheet. .. py:attribute:: name The name of this sheet. :: class Sheet: """An iterator over rows. A binding to a workbook. A subclass of Sheet will be bound to a schema. """ def __init__( self, workbook, sheet_name ): self.workbook, self.name= workbook, sheet_name def __repr__( self ): return "{0}({1!r},{2!r})".format( self.__class__.__qualname__, self.workbook, self.name ) def rows( self ): """Iterate through the rows of this sheet. This is a convenient interface for ``self.workbook.rows_of(self)`` """ return self.workbook.rows_of( self ) Row Class ============= .. py:class:: Row A single row in Sheet; a sequence of :py:class:`cell.Cell` instances. A Sheet produces this simple row-as-list. A Schema can transform this into row-as-dict or some even more elaborate structure. A row depends on details of an :py:class:`schema.Attribute` and a :py:class:`workbook.base.Workbook`. This feels circular. But this Sheet/Row schema definition is really just a convenient wrapper around the Workbook details. The :py:class:`cell.Cell` conversions are handled by the :py:class:`workbook.base.Workbook`. Some Workbooks have cell content identified by position. Some Workbooks have cell content identified by size, offset and encoding. Therefore, we must provide the Attribute details to the Workbook to get the Cell's value. .. py:attribute:: sheet The :py:class:`Sheet` which contains this row. .. py:attribute:: data The sequence of :py:class:`Cell` values for this row. :: class Row( Sequence ): """Eager Row: a tuple of Cell values.""" def __init__( self, sheet, *data ): """Build another Row. :param sheet: the containing sheet. :param *data: the various Cell values in this row """ self.sheet= sheet self.data= data def cell( self, attribute ): """Get a specific cell, based on a schema Attribute. :param attribute: The attribute's value to return. """ return self.sheet.workbook.row_get( self, attribute ) Basic Sequence features :: def __len__( self ): return len(self.data) def __iter__( self ): return iter(self.data) def __contains__( self, cell ): return any( cell.value == d.value for d in self.data ) def __getitem__( self, index ): return self.data[index] To approach the :py:class:`csv.DictReader` API (without the eager processing), we need make the ``Row`` API slightly more fluent with a ``by_name()`` method. .. parsed-literal:: def by_name( self, name ): attr= self.sheet.schema.get_name(name) return self.cell( attr ) Note that the presumption in this interface is that the Attribute is sufficiently detailed to specify a single :py:class:`cell.Cell`. For non-COBOL workbooks, this is perfectly true. For COBOL, however, there are groups and occurs clauses, meaning that a single Attribute can represent multiple :py:class:`cell.Cell` instances. Which one do we mean? And how do we specify this selection? - The :py:meth:`sheet.Row.cell` method can return a structure with all the values. Ordinary Python can then pick apart the instances. This requires working up the DDE hierarchy to locate all of the applicable "occurs" by to construct the proper dimensionality of an attribute. It also means getting all of the values to create a tuple or nested tuple-of-tuple structure for the various dimensions. Eager processing isn't going to work out well. - The :py:class:`schema.Attribute.index` method selects data from the row in the workbook. This applies the indices to the Attribute to compute the required offset into the source data. We're constrained by the laziness requirement of COBOL to lean toward the this implementation. .. py:class:: LazyRow When we can't eagerly build all :py:class:`cell.Cell` instances for a given row, this class provides the proper API. A COBOL REDEFINES clause may make the bytes invalid in all but one of the aliases for an attribute. Also, there's no formal ``NULL`` value in COBOL, so optional fields can have invalid data. Further, we may have Occurs Depending On. This means we can't set size and offset until we can access actual data. For these reasons, we have a :py:class:`sheet.LazyRow`, which conforms to the interface for a :py:class:`Row`, but isn't an actual sequence. No data is processed until the :py:meth:`LazyRow.__getitem__` method is used. .. py:attribute:: sheet The :py:class:`Sheet` to which this row belongs. .. py:attribute:: _state The worksheet's internal state information, required to perform lazy extraction of the cell values. The LazyRow superclass doesn't use this. A subclass may need it. :: class LazyRow( Sequence ): """Lazy Row: a tuple-like sequence of Cell values.""" def __init__( self, sheet, **state ): """Build another Row. :param sheet: the containing sheet. :param **state: worksheet-specific state value to save. """ self.sheet= sheet self._state= state super().__init__() def __repr__( self ): return "LazyRow(sheet={0!r}, state={1!r})".format( self.sheet, self._state ) def cell( self, attribute ): """Get a specific cell, based on a schema Attribute. :param attribute: The attribute's value to return. """ return self.sheet.workbook.row_get( self, attribute ) Basic Sequence features :: def __len__( self ): return len(self.sheet.schema) def __iter__( self ): for attribute in self.sheet.schema: try: yield self.sheet.workbook.row_get( self, attribute ) except Exception as e: yield None def __contains__( self, cell ): for attribute in self.sheet.schema: try: col= self.sheet.workbook.row_get( self, attribute ) except Exception as e: pass if col.value == cell.value: return True def __getitem__( self, index ): attribute= self.sheet.schema[index] return self.sheet.workbook.row_get( self, attribute ) To approach the :py:class:`csv.DictReader` API (without the eager processing), we could make the ``Row`` API slightly more fluent with a ``by_name()`` method. .. parsed-literal:: def by_name( self, name ): attr= self.sheet.schema.get_name(name) return self.cell(attr) This isn't implemented, because it doesn't seem very helpful. ExternalSchemaSheet Class ========================== .. py:class:: ExternalSchemaSheet A Sheet bound to a schema can be used to fetch data. This is a concrete subclass of :py:class:`Sheet`. A Sheet with an external schema can have one of two sources for the bound schema. - An external sheet that doesn't have row headers to embed the schema information. In this case, an eager Workbook Row can eagerly create a Sequence of :py:class:`cell.Cell` instances. The Schema information can be associated by position. - A Sheet that is really a COBOL or Fixed format file. In this case, the Workbook cannot create a sequence of :py:class:`cell.Cell` instances. Instead, the Sheet (which has schema information) must provide a LazyRow with deferred Cell conversions. :: class ExternalSchemaSheet( Sheet ): """A Sheet with an external Schema.""" def __init__( self, workbook, sheet_name, schema ): """Initialize a sheet for processing. :param workbook: the containing workbook :param sheet_name: the specific sheet to locate within the Workbook :param schema: the :py:class:`schema.Schema` schema definition. """ super().__init__( workbook, sheet_name ) self.schema= schema def rows( self ): """Iterate through the rows of this sheet.""" return self.workbook.rows_of( self ) EmbeddedSchemaSheet Class ========================== .. py:class:: EmbeddedSchemaSheet A sheet bound to a schema can be used to fetch data. This is a concrete subclass of :py:class:`Sheet`. A sheet with an embedded schema must also have a :py:class:`schema.loader.SchemaLoader` class provided. The loader is invoked to build the :py:class:`schema.Schema` object that's bound to the sheet. The :py:class:`schema.loader.SchemaLoader` is also used to return the rest of the rows; those that weren't used to build the schema. .. py:attribute:: loader The :py:class:`Loader` used to build schema from rows in this sheet. :: class EmbeddedSchemaSheet( ExternalSchemaSheet ): """A Sheet with a Schema embedded in it.""" def __init__( self, workbook, sheet_name, loader_class ): """Initialize a sheet for processing. :param workbook: the containing workbook :param sheet_name: the specific sheet to locate within the Workbook :param loader_class: the :py:class:`schema.loader.SchemaLoader` schema loader to load the schema from the sheet. Apply the loader to the given sheet of the workbook to get schema and rows. """ s = Sheet( workbook, sheet_name ) self.loader = loader_class( s ) schema= self.loader.schema() super().__init__( workbook, sheet_name, schema=schema ) def rows( self ): """The parser will skip over the headers.""" return self.loader.rows() Since the rows are already properly encoded as :py:class:`cell.Cell` instances, no further processing is required by the Sheet or the Loader. Rows of a Sheet ================== Note that the :py:mod:`csv` design pattern for each row involves two subclasses with the same method names but different results. One returns a ``dict`` of cells, keyed by field names, the other returns a ``list`` of cells, indexed by position. The dict-based processing has the advantage of clarity: cells are named row['cell']. It has the disadvantage of not coping well with duplicate column names or data which breaks first normal form. We can't follow the :mod:`csv` design pattern. Instead we do the following. - A :py:class:`sheet.Row` is a sequence of :py:class:`cell.Cell` instances. It may be lazy or it may be eager. - To use names, a :py:class:`schema.Schema` must be used to fetch :py:class:`cell.Cell` instances from the :py:class:`sheet.Row` object. The schema translates names to positions. - To create dict-like access to :py:class:`cell.Cell` instances, the :py:class:`schema.Schema` can be turned into a dictionary. The row itself is not a dictionary, just the schema. The row is still a Sequence. This "schema-as-dict" can still be used with a properly lazy :py:class:`sheet.Row` to create :py:class:`cell.Cell` instances. We need the lazy evaluation of a row that fetches data based on :py:class:`schema.Attribute` details in order to cope with COBOL ``REDEFINES``. It also allows us to cope with the unfortunately common problem of duplicate column names in conventional spreadsheets. We can have application programming which looks like this to process a Row as sequence: .. parsed-literal:: for row in sheet.rows(): row[i] # instance of Cell sheet.schema[i].name # name attribute of Schema Attribute Row as dict is a common alternative. If we have unique column names in the schema, We can than use application programming that looks like this. .. parsed-literal:: schema_dict = dict((a.name, a) for a in sheet.schema) for row in sheet.rows(): row.cell(schema_dict['name']) # instance of Cell row_as_dict= dict( (a.name, row.cell(a)) for a in sheet.schema) row_as_dict['name'] # instance of Cell This handles the COBOL case, where rows must be lazy. This includes the ``REDEFINES`` and occurs clauses. This assures proper packed decimal conversion of redefined fields.