..    #!/usr/bin/env python3

..  _`schema_loader`:

#####################################################################
Schema Loader Module -- Load Embedded or External Schema
#####################################################################

..  py:module:: schema.loader

A *Schema Loader* loads the attributes of a schema from a source document.
There are a variety of sources.

-   The first row of a sheet within a workbook. This version has to be injected
    into workbook processing so that the first row is separated from the data rows.

-   A separate sheet of a workbook. This version requires a sheet name.

-   A separate workbook. This, too, requires a named sheet.

-   COBOL Code. We'll set this aside as a subclass so complex it requires
    its own module.

A schema loader is paired with a specific kind of :py:class:`sheet.Sheet`.

A workbook requires a schema, which requires a schema loader.
A schema loader depends on a meta-workbook. Ideally that meta-workbook
has an embedded schema, but it may have an external schema, meaning we could
have a meta-schema required to load the schema for the application data. Sheesh.

First, let's hope that doesn't happen.
Second, the circularity is resolved by making it the responsibility of the
application to handle schema loading.

Embedded Schema Use Case
===============================

A :py:class:`sheet.EmbeddedSchemaSheet` requires a loader class. The loader will

1.  Be built with the sheet as an argument.

2.  Be interrogated for the schema.

3.  Be interrogated for the rows.

The most typical case is the single-header-row case.
In some cases, the loader is actually a rather sophisticated parser that
partitions the data into the embedded schema and the data rows.

..  parsed-literal::

    with Workbook( name ) as wb:
        sheet = wb.sheet( 'Sheet2',
            stingray.sheet.EmbeddedSchemaSheet,
            loader_class= stingray.schema.loader.HeadingRowSchemaLoader )
        for row in sheet.rows():
            *process the row*

External Schema Use Case
===============================

A :py:class:`sheet.ExternalSchemaSheet` requires a schema.

In the typical case, the external schema file has an embedded meta-schema.
The first row has appropriate column names. This requires a subclass of
:py:class:`schema.loader.ExternalSchemaLoader` to properly map the names that were
found onto the attributes of the :py:class:`schema.Attribute` class.

When the embedded meta-schema has unusual names, then a builder must be defined
to map the names that are found in the schema and build
a :py:class:`schema.Attribute` instance.

..  parsed-literal::

    with open_workbook( schema_name ) as schema_wb:
        esl= stingray.schema.loader.ExternalSchemaLoader( schema_wb, "Schema" )
        schema= esl.schema()
    with Workbook( name, schema=schema ) as wb:
        sheet = wb.sheet( 'Sheet2',
            stingray.sheet.ExternalSchemaSheet,
            schema= schema )
        counts= process_sheet( sheet )
        pprint.pprint( counts )

Manual Schema Use Case
===============================

Also, a manually-defined :py:class:`schema.Schema` can be built
rather than being loaded.

..  parsed-literal::

    schema= stingray.schema.Schema(
        stingray.schema.Attribute( name='Column #1' ),
        stingray.schema.Attribute( name='Key' ),
        stingray.schema.Attribute( name='Value' ),
        stingray.schema.Attribute( name='Etc.' ),
    )

Model
======

..  code-block:: none

    http://yuml.me/diagram/scruffy;/class/
    #schema-loader,
    [Schema]<>-[Attribute],
    [SchemaLoader]-builds->[Schema],
    [SchemaLoader]^[HeadingRowSchemaLoader],
    [SchemaLoader]^[ExternalSchemaLoader],
    [ExternalSchemaLoader]-reads->[Workbook],
    [HeadingRowSchemaLoader]-reads->[Sheet].

..  image:: schema_loader.png
    :width: 6in

Overheads
===============

We depend on :py:mod:`schema`, :py:mod:`cell` and :py:mod:`sheet`.
::

    """stingray.schema.loader -- Loads a Schema from a row of a Sheet or
    from a separate Sheet.

    This is extended to load COBOL schema from DDE files.
    """

    from stingray.schema import Schema, Attribute
    import stingray.cell
    import stingray.sheet
    import warnings

No Schema Exception
====================

In some circumstances, we can't load a schema. The most common situation is
a :py:class:`HeadingRowSchemaLoader` which is applied to an empty workbook sheet.
No rows means no schema.

::

    class NoSchemaFound( Exception ):
        pass

The default behavior is to simply write a warning for an empty sheet.
The lack of a schema also means there's no data, and 99% of the time,
ignoring an empty sheet is desirable.

Schema Loader
=================

..  py:class:: SchemaLoader

A Schema Loader has one mandatory contract: It must load the schema.

A subclass may add a second contract. For example, an embedded schema
loader will also return the non-schema rows.

..  py:attribute:: sheet

    The :py:class:`Sheet` associated with this schema.

..  py:attribute:: row_iter

    An iterator over the rows of this sheet; used to pick rows that
    belong to the header, separate from the rows that belong to data.

::

    class SchemaLoader:
        """Locate schema information. Subclasses handle all of the variations on
        schema representation.
        """
        def __init__( self, sheet ):
            """A simple :py:class:`Sheet` instance."""
            self.sheet= sheet
            self.row_iter= iter( self.sheet.rows() )
        def schema( self ):
            """Scan the sheet to get the schema.

            :return: a :py:class:`Schema` object."""
            # Abstract: each subclass must provide its own schema-loading logic.
            raise NotImplementedError
        def rows( self ):
            """Iterate all (or remaining) rows."""
            return self.row_iter

Embedded Schema Loader
===========================

..  py:class:: HeadingRowSchemaLoader

In many cases, the schema is first-row column titles or something similar.
As we noted above, :py:class:`csv.DictReader` supports this
simple case.

All other cases have to be handled with something a bit more sophisticated.
The :py:class:`schema.loader.SchemaLoader` can be further subclassed to provide for more
complex schema definitions buried in the rows of a sheet.

This means that we must make the schema parsing an application-provided plug-in that
the Workbook uses when instantiating each Sheet.

::

    class HeadingRowSchemaLoader( SchemaLoader ):
        """Read just the first row of a sheet to get embedded
        schema information."""
        def schema( self ):
            """Try to get the schema from row one. Remaining rows are data.
            If the sheet is empty, emit a warning and return ``None``.
            """
            try:
                # Consuming row one here leaves only data rows in self.row_iter.
                row_1= next( self.row_iter )
                attributes = ( dict(name=c.to_str()) for c in row_1 )
                schema = Schema( *(Attribute(**col) for col in attributes) )
                return schema
            except StopIteration:
                warnings.warn( "Empty sheet: no schema present" )
                return None

We'll open a :py:class:`sheet.Sheet` with a specific loader.

..  parsed-literal::

    sheet= stingray.sheet.EmbeddedSchemaSheet(
        self.wb, 'The_Name',
        loader_class=HeadingRowSchemaLoader )

..  py:class:: NonBlankHeadingRowSchemaLoader

In many cases, we'd like to suppress the empty rows that are an inevitable feature
of workbook sheets. Note that this doesn't work well for COBOL or Fixed format files,
since an "empty" row may be difficult to discern.

::

    class NonBlankHeadingRowSchemaLoader( HeadingRowSchemaLoader ):
        def __init__( self, sheet ):
            """A simple :py:class:`Sheet` instance."""
            self.sheet= sheet
            self.row_iter= self.non_blank( self.sheet.rows() )
        def non_blank( self, rows ):
            """Yield only the rows that have at least one non-empty cell."""
            for r in rows:
                if all( c.is_empty() for c in r ):
                    continue
                yield r

External Schema Loader
==========================

..  py:class:: ExternalSchemaLoader

In some cases, the data workbook is described by a separate schema workbook, or a
separate sheet within the data workbook. In these cases,
the other sheet (or file) must be parsed to locate schema information.

In the case of a fixed format file, we must examine a separate file to load schema
information.
This additional schema file may be in COBOL notation, leading
to a more complex parser. See :ref:`cobol_loader`.

The layout of the schema, of course, will be highly variable, so the "meta-schema"
must be adjusted to the actual file.

Note, also, that the schema loader is -- itself -- a typical schema-based reader.
It has a number of common features.

1.  A dictionary-based "builder", :py:meth:`schema.loader.ExternalSchemaLoader.build_attr`,
    to handle Logical Layout. This transforms the input "raw" dictionary
    of :py:class:`cell.Cell` instances to an application dictionary of
    proper Python objects. See :ref:`developer`.

2.  An iterator, :py:meth:`schema.loader.ExternalSchemaLoader.attr_dict_iter`,
    that provides "raw" dictionaries from each row (based on the schema)
    to the builder to create application dictionaries.

3.  The overall function, :py:meth:`schema.loader.ExternalSchemaLoader.schema`,
    that iterates over application objects built from application dictionaries.

..  py:attribute:: workbook

    The overall Workbook that we're parsing to locate schema information.

..  py:attribute:: sheet

    A specific sheet within that workbook.

::

    class ExternalSchemaLoader( SchemaLoader ):
        """Open a workbook file in a well-known format.
        Build a schema with attribute name, offset, size and type
        information. The type is a string that names the type of cell
        to create.

        The meta-schema must be embedded as the first line of the schema sheet.

        The assumed meta-schema is the following::

            Schema(
                Attribute("name",create="TextCell"),
                Attribute("offset",create="NumberCell"),
                Attribute("size",create="NumberCell"),
                Attribute("type",create="TextCell"),
            )

        If the meta-schema has different names, then a subclass with a different
        :py:meth:`build_attr` is required to map the actual source columns to the
        attributes of a :py:class:`Attribute`.

        Offsets are typically 1-based.
        """
        def __init__( self, workbook, sheet_name='Sheet1' ):
            self.workbook, self.sheet_name = workbook, sheet_name
            # The schema sheet carries its own meta-schema in its heading row.
            self.sheet= self.workbook.sheet( self.sheet_name,
                stingray.sheet.EmbeddedSchemaSheet,
                loader_class= HeadingRowSchemaLoader )

..  py:method:: ExternalSchemaLoader.build_attr( row )

There's potential for a great deal of variability in schema definition.
Consequently, this ``build_attr`` method is merely a sample that
covers one common case.

::

        # Offsets in the schema sheet are 1-based; subtract ``base`` to get 0-based.
        base= 1
        type_to_cell = {
            'text': "TextCell",
            'number': "NumberCell",
            'date': "DateCell",
            'boolean': "BooleanCell",
        }

        @staticmethod
        def build_attr( row ):
            """Build application dictionary from raw dictionary.
            Missing columns fall back to ``None`` (or a text cell for the type).
            """
            try:
                offset= row['offset'].to_int()-ExternalSchemaLoader.base
            except KeyError:
                offset= None
            try:
                size= row['size'].to_int()
            except KeyError:
                size= None
            try:
                type_name= row['type'].to_str()
                create= ExternalSchemaLoader.type_to_cell[type_name]
            except KeyError:
                create= stingray.cell.TextCell
            return dict(
                name= row['name'].to_str(),
                offset= offset,
                size= size,
                create= create,
            )

Schema loading involves a process of

1.  Iterating through the source rows as dictionaries.

    -   Build each raw row as a source dictionary.

    -   Build a standardized attr dictionary from the source dictionary.
        This mapping, implemented by :py:meth:`schema.loader.ExternalSchemaLoader.build_attr`,
        is subject to a great deal of change without notice.

2.  Building each :py:class:`schema.Attribute` from the dictionary.

..  py:method:: ExternalSchemaLoader.attr_dict_iter( sheet )

    Iterate over application dicts based on raw dicts built by the schema of the sheet.

::

        def attr_dict_iter( self, sheet ):
            """Iterate over application dicts based on raw dicts
            built by the schema of the sheet."""
            # Called through ``self`` so a subclass's build_attr override is used.
            return (
                self.build_attr(r)
                for r in sheet.schema.rows_as_dict_iter(sheet) )

..  py:method:: ExternalSchemaLoader.schema( )

    Scan a file to get the schema.

    :return: a :py:class:`Schema` object

::

        def schema( self ):
            """Scan a file to get the schema.

            :return: a :py:class:`Schema` object."""
            # Every row of the schema sheet is consumed here; no data rows remain.
            self.row_iter= iter( [] )
            source_dict = self.attr_dict_iter( self.sheet )
            schema= Schema( *(Attribute(**row) for row in source_dict) )
            return schema

Worst-Case Loader
====================

..  py:class:: BareExternalSchemaLoader

This is a degenerate case loader where the schema sheet (or file) doesn't have
an embedded schema on line one of the sheet.

::

    class BareExternalSchemaLoader( ExternalSchemaLoader ):
        """Open a workbook file in a well-known format.
        Apply a schema parser to the given sheet (or file) to build a schema.

        The meta-schema is hard-coded in this class because the given sheet
        has no headers.
        """
        # Named ``meta_schema`` so it does not shadow the inherited schema() method.
        meta_schema= Schema(
            Attribute("name",create="TextCell"),
            Attribute("offset",create="NumberCell"),
            Attribute("size",create="NumberCell"),
            Attribute("type",create="TextCell"),
        )
        def __init__( self, workbook, sheet_name='Sheet1' ):
            self.workbook, self.sheet_name = workbook, sheet_name
            self.sheet= self.workbook.sheet( self.sheet_name,
                stingray.sheet.ExternalSchemaSheet,
                schema= self.meta_schema )

Parsing and Loading a COBOL Schema
=====================================

One logical extension to this is to parse COBOL DDEs to create a schema that allows us to
process a COBOL file (in EBCDIC) directly as if it were a simple workbook.

We'll delegate that to :ref:`cobol_loader`, since it's considerably more complex than
simply loading rows from a sheet of a workbook.
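
As a rough illustration of why name, offset and size are enough to treat a
fixed-format record like a row of a sheet, the following sketch slices one record
into named fields. It does not use stingray: ``Attr``, ``build_attr`` and
``slice_record`` are hypothetical stand-ins written for this example, and the
rows are plain dictionaries of strings rather than :py:class:`cell.Cell` instances.
The offset adjustment mirrors the ``base= 1`` convention shown earlier. Real COBOL
data adds EBCDIC and packed-decimal complications that this sketch ignores.

..  code-block:: python

    from collections import namedtuple

    # Hypothetical stand-in for schema.Attribute; not the stingray class.
    Attr = namedtuple("Attr", ["name", "offset", "size"])

    BASE = 1  # assumption: the schema sheet gives 1-based offsets

    def build_attr(row):
        """Map one raw meta-schema row (a plain dict) to an Attr with a 0-based offset."""
        return Attr(
            name=row["name"],
            offset=int(row["offset"]) - BASE,
            size=int(row["size"]),
        )

    def slice_record(record, schema):
        """Cut a fixed-width record into a dict of field name to raw text."""
        return {a.name: record[a.offset:a.offset + a.size] for a in schema}

    # Rows as they might appear on a schema sheet with name/offset/size columns.
    schema_rows = [
        {"name": "key", "offset": "1", "size": "4"},
        {"name": "value", "offset": "5", "size": "6"},
    ]
    schema = [build_attr(r) for r in schema_rows]
    fields = slice_record("K001 42.50", schema)

Here ``fields["key"]`` is the first four characters of the record and
``fields["value"]`` is the remaining six, padding included. Converting that raw
text into numbers or dates is the job of the cell type named in the schema.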