7. Schema Loader Module – Load Embedded or External Schema

A Schema Loader loads the attributes of a schema from a source document. There are a variety of sources.

  • The first row of a sheet within a workbook. This version has to be injected into workbook processing so that the first row is separated from the data rows.
  • A separate sheet of a workbook. This version requires a sheet name.
  • A separate workbook. This, too, requires a named sheet.
  • COBOL Code. We’ll set this aside as a subclass so complex it requires it’s own module.

A schema loader is paired with a specific kind of sheet.Sheet.

A workbook requires a schema, which requires a schema loader. A schema loader depends on a meta-workbook. Ideally that meta-workbook has an emedded schema, but it may have an external schema, meaning we could have a meta-schema required load the schema for the application data. Sheesh.

First, let’s hope that doesn’t happen. Second, the circularity is resolved by making it the responsibility of the the application to handle schema loading.

7.1. Embedded Schema Use Case

A sheet.EmbeddedSchemaSheet requires a loader class. The loader will

  1. Be built with the sheet as an argument.
  2. Be interrogated for the schema.
  3. Be interrogated for the rows.

The most typical case is the single-header-row case.

In some cases, the loader is actually a a rather sophisticated parser that paritions the data into the embedded schema and the data rows.

with Workbook( name ) as wb:
    sheet = self.wb.sheet( 'Sheet2',
        stingray.sheet.EmbeddedSchemaSheet,
        loader_class= stingray.schema.loader.HeadingRowSchemaLoader )

    for row in sheet.rows():
        process the row

7.2. External Schema Use Case

A sheet.ExternalSchemaSheet requires a schema.

In the typical case, the external schema file has an emedded meta-schema. The first row has appropriate column names. This requires a subclass of schema.loader.ExternalSchemaLoader to properly map the names that were found onto the attributes of the schema.Attribute class.

When the embedded meta-schema has unusual names, then a builder must be defined to map the names that are found in the schema and build an schema.Attribute instance.

with open_workbook( schema_name ) as schema_wb:
    esl= stingray.schema.loader.ExternalSchemaLoader( schema_wb, "Schema" )
    schema= esl.schema()
with Workbook( name, schema=schema ) as wb:
    sheet = self.wb.sheet( 'Sheet2',
        stingray.sheet.ExternalSchemaSheet,
        schema= schema )
    counts= process_sheet( sheet )
    pprint.pprint( counts )

7.3. Manual Schema Use Case

Also, a manually-defined schema.Schema can be built rather than being loaded.

schema= stingray.schema.Schema(
    stingray.schema.Attribute( name='Column #1' ),
    stingray.schema.Attribute( name='Key' ),
    stingray.schema.Attribute( name='Value' ),
    stingray.schema.Attribute( name='Etc.' ),
)

7.4. Model

http://yuml.me/diagram/scruffy;/class/
#schema-loader,
[Schema]<>-[Attribute],
[SchemaLoader]-builds->[Schema],
[SchemaLoader]^[HeadingRowSchemaLoader],
[SchemaLoader]^[ExternalSchemaLoader],
[ExternalSchemaLoader]-reads->[Workbook],
[HeadingRowSchemaLoader]-reads->[Sheet].
_images/schema_loader.png

7.5. Overheads

We depend on schema, cell and sheet.

"""stingray.schema.loader -- Loads a Schema from a row of a Sheet or
from a separate Sheet.  This is extended to load COBOL schema
from DDE files.
"""

from stingray.schema import Schema, Attribute
import stingray.cell
import stingray.sheet
import warnings

7.6. No Schema Exception

In some circumstances, we can’t load a schema. The most common situation is a HeadingRowSchemaLoader which is applied to an empty workbook sheet. No rows means no schema.

class NoSchemaFound( Exception ):
    pass

The default behavior is to simply write a warning for an empty sheet. The lack of a schema means there’s no data, also, and 99% of the time, silently ignoring an empty sheet is desirable.

7.7. Schema Loader

class schema.loader.SchemaLoader

A Schema Loader has one mandatory contract: It must load the schema.

A subclass may add a second contract, For example, an embedded schema loader will also return the non-schema rows.

sheet

The Sheet associated with this schema.

row_iter

An iterator over the rows of this sheet; used to pick rows that belong to the header, separate from the rows that belong to data.

class SchemaLoader:
    """Locate schema information.  Subclasses handle
    all of the variations on schema representation.
    """
    def __init__( self, sheet ):
        """A simple :py:class:`Sheet` instance."""
        self.sheet= sheet
        self.row_iter= iter( self.sheet.rows() )
    def schema( self ):
        """Scan the sheet to get the schema.
        :return: a :py:class:`Schema` object."""
        return NotImplemented
    def rows( self ):
        """Iterate all (or remaining) rows."""
        return self.row_iter

7.8. Embedded Schema Loader

class schema.loader.HeadingRowSchemaLoader

In many cases, the schema is first-row column titles or something similar. As we noted above, csv.DictReader supports this simple case.

All other cases have to be handled with something a bit more sophisticated. The schema.loader.SchemaLoader can be further subclassed to provide for more complex schema definitions buried in the rows of a sheet.

This means that we must make the schema parsing an application-provided plug-in that the Workbook uses when instantiating each Sheet.

class HeadingRowSchemaLoader( SchemaLoader ):
    """Read just the first row of a sheet to get embedded
    schema information."""
    def schema( self ):
        """Try to get the schema from row one.  Remaining rows are data.
        If the sheet is empty, emit a warning and return ``None``.
        """
        try:
            row_1= next( self.row_iter )
            attributes = (
                dict(name=c.to_str()) for c in row_1
            )
            schema = Schema(
                *(Attribute(**col) for col in attributes)
            )
            return schema
        except StopIteration:
            warnings.warn( "Empty sheet: no schema present" )

We’ll open a sheet.Sheet with a specific loader.

sheet= stingray.sheet.EmbeddedSchemaSheet(
    self.wb, 'The_Name',
    loader_class=HeadingRowSchemaLoader )
class schema.loader.NonBlankHeadingRowSchemaLoader

In many cases, we’d like to suppress the empty rows that are an inevitable feature of workbook sheets.

Note that this doesn’t work well for COBOL or Fixed format files, since an “empty” row may be difficult to discern.

class NonBlankHeadingRowSchemaLoader( HeadingRowSchemaLoader ):
    def __init__( self, sheet ):
        """A simple :py:class:`Sheet` instance."""
        self.sheet= sheet
        self.row_iter= self.non_blank( self.sheet.rows() )
    def non_blank( self, rows ):
        for r in rows:
            if all( c.is_empty() for c in r ):
                continue
            yield r

7.9. External Schema Loader

class schema.loader.ExternalSchemaLoader

In some cases, the data workbook is described by a separate schema workbook, or a separate sheet within the data workbook. In these cases, the other sheet (or file) must be parsed to locate schema information.

In the case of a fixed format file, we must examine a separate file to load schema information. This additional schems file may be in COBOL notation, leading to a more complex parser. See COBOL Loader Module – Parse COBOL Source to Load a Schema.

The layout of the schema, of course, will be highly variable, so the “meta-schema” must be adjusted to the actual file.

Note, also, that the schema loader is – itself – a typical of schema-based reader. It has a number of common features.

  1. A dictionary-based “builder”, schema.loader.ExternalSchemaLoader.build_attr(), to handle Logical Layout. This transforms the input “raw” dictionary of cell.Cell instances to an application dictionary of proper Python objects. See The Stingray Developer’s Guide.
  2. An iterator, schema.loader.ExternalSchemaLoader.attr_dict_iter(), that provides “raw” dictionaries from each row (based on the schema) to the builder to create application dictionaries.
  3. The overall function, schema.loader.ExternalSchemaLoader.schema(), that iterates over application objects built from application dictionaries.
workbook

The overall Workbook that we’re parsing to locate schema information.

Sheet

A specific sheet within that workbook.

class ExternalSchemaLoader( SchemaLoader ):
    """Open a workbook file in a well-known format.
    Build a schema with attribute name, offset, size  and type
    information.  The type is a string that names the
    type of cell to create.

    The meta-schema must be embedded as the first line of the schema sheet.

    The assumed meta-schema is the following::

        Schema(
            Attribute("name",create="TextCell"),
            Attribute("offset",create="NumberCell"),
            Attribute("size",create="NumberCell"),
            Attribute("type",create="TextCell"),
        )

    If the meta-schema has different names, then a subclass with
    a different :py:meth:`build_attr` is required to map the actual
    source columns to the attributes of a :py:class:`Attribute`.

    Offsets are typically 1-based.
    """
    def __init__( self, workbook, sheet_name='Sheet1' ):
        self.workbook, self.sheet_name = workbook, sheet_name
        self.sheet= self.workbook.sheet( self.sheet_name, stingray.sheet.EmbeddedSchemaSheet,
        loader_class= HeadingRowSchemaLoader )
ExternalSchemaLoader.build_attr(row)

There’s potential for a great deal of variability in schema definition. Consequently, this build_attr method is merely a sample that covers one common case.

base= 1
type_to_cell = {
    'text': "TextCell",
    'number': "NumberCell",
    'date': "DateCell",
    'boolean': "BooleanCell",
    }
@staticmethod
def build_attr( row ):
    """Build application dictionary from raw dictionary.
    """
    try:
        offset= row['offset'].to_int()-ExternalSchemaLoader.base
    except KeyError:
        offset= None
    try:
        size= row['size'].to_int()
    except KeyError:
        size= None
    try:
        type_name= row['type'].to_str()
        create= ExternalSchemaLoader.type_to_cell[type_name]
    except KeyError:
        create= stingray.cell.TextCell
    return dict(
        name= row['name'].to_str(),
        offset= offset,
        size= size,
        create= create,
    )

Schema loading involves a process of

  1. Iterating through the source rows as dictionaries.
    • Build each raw row as a source dictionary.
    • Build an standardized attr dictionary from the source dictionary. This mapping, implemented by schema.loader.ExternalSchemaLoader.build_attr() is subject to a great deal of change without notice.
  2. Building each schema.Attribute from the dictionary.
ExternalSchemaLoader.attr_dict_iter(sheet)

Iterate over application dicts based on raw dicts built by the schema of the sheet.

def attr_dict_iter( self, sheet ):
    """Iterate over application dicts based on raw dicts
    built by the schema of the sheet."""
    return (
        ExternalSchemaLoader.build_attr(r)
        for r in sheet.schema.rows_as_dict_iter(sheet)
    )
ExternalSchemaLoader.schema()

Scan a file to get the schema.

Returns:a Schema object
def schema( self ):
    """Scan a file to get the schema.
    :return: a :py:class:`Schema` object."""
    self.row_iter= iter( [] )
    source_dict = self.attr_dict_iter( self.sheet )
    schema= Schema(
        *(Attribute(**row) for row in source_dict)
    )
    return schema

7.10. Worst-Case Loader

class schema.loader.BareExternalSchemaLoader

This is a degenerate case loader where the schema sheet (or file) doesn’t have an embedded schema on line one of the sheet.

class BareExternalSchemaLoader( SchemaLoader ):
    """Open a workbook file in a well-known format.  Apply a schema parser
    to the given sheet (or file) to build a schema.

    The meta-schema is hard-coded in this class because the given
    sheet has no headers.
    """
    schema= Schema(
            Attribute("name",create="TextCell"),
            Attribute("offset",create="NumberCell"),
            Attribute("size",create="NumberCell"),
            Attribute("type",create="TextCell"),
        )
    def __init__( self, workbook, sheet_name='Sheet1' ):
        self.workbook, self.sheet_name = workbook, sheet_name
        self.sheet= self.workbook.sheet( self.sheet_name, stingray.sheet.ExternalSchemaSheet,
        schema= self.schema )

7.11. Parsing and Loading a COBOL Schema

One logical extension to this is to parse COBOL DDE’s to create a schema that allows us to process a COBOL file (in EBCDIC) directly as if it were a simple workbook.

We’ll delegate that to COBOL Loader Module – Parse COBOL Source to Load a Schema, since it’s considerably more complex than simply loading rows from a sheet of a workbook.