11. The Stingray Developer’s Guide

We use Stingray to work with data files where the schema is either external or complex (or both). We can tackle this question:

How do we process a file simply and consistently?

Or, more concretely,

How do we make a program independent of Physical Format and Logical Layout?

We can also use Stingray to answer questions about files, the schema those files purport to use, and the data on those files. Specifically, we can tackle this question:

How do we assure that a file and an application use the same schema?

There are two sides to schema use:

  • Application Use. Given a data file, we need to bind schema information to that file. For a workbook, the schema may be in the column headings of the sheet. Or it may be in a separate sheet, or a separate workbook. For a COBOL file, the schema is always in a separate COBOL-language data definition.
  • Schema Conformance. Given a data file, how do we confirm that some schema matches the file? This is the data quality assurance question.

We do need to note the following.

If it was simple, we wouldn’t need this package, would we?

11.1. Concepts

As noted in Introduction, there are three levels of schema that need to be bound to a file.

  • Physical Format. We can make this transparent to our applications. See Workbook Package – Uniform Wrappers for Workbooks and The COBOL Package for details. Everything is a workbook with a fairly limited API.

  • Logical Layout. This is how an application program will make use of the data found in a file. Sometimes the Logical layout information can be embedded in a file: it might be the top row of a sheet in a workbook, for example. Sometimes the logical layout can be separate: a COBOL data definition, or perhaps a metadata sheet in a workbook.

    We can’t make the logical layout transparent. Our applications and files both need to agree on a logical layout. The column names in the metadata, for example, must be agreed to. The order or position of the columns, however, need not be fixed.

  • Conceptual Content. A single conceptual schema may be implemented by a number of physical formats and logical layouts. An application should be able to tolerate variability in the logical layout as long as it matches the expected conceptual content.

    Since we’re working in Python, the conceptual schema is often a class definition. It might also be a namedtuple or a types.SimpleNamespace, as the sketch below suggests.
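
For example, here is a minimal sketch of a conceptual schema expressed as a typing.NamedTuple. The class name and attributes are hypothetical; they simply mirror the builder-function examples later in this chapter.

from typing import NamedTuple

class Contact(NamedTuple):
    """Conceptual schema: the attributes the application needs,
    independent of any physical format or logical layout."""
    name: str
    address: str
    zip: str
    phone: str

A builder function's job, then, is to produce something that can populate a structure like this, no matter how the source file happens to be laid out.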

We’ll tackle the schema binding in several pieces.

We’ll look at some demonstration programs in Stingray Demo Applications.

11.2. Binding a Schema to a File

We’re only going to bind two levels of schema to a file. The conceptual schema would require some kind of formal ontology, something that’s rarely available.

Logical Layout. We rely on a schema object, schema.Schema, to manage the logical layout of a file.

A workbook has two ways to bind a logical layout to data values. Our sheet.Sheet subclass hierarchy requires a schema object. We’ll load the schema using a schema.loader component.

  • Embedded. This is commonly seen as column titles within the sheet, or some variation on that theme. The common case of column titles is handled by a built-in schema loader, schema.loader.HeadingRowSchemaLoader. Other variations are managed by building different schema loaders.
  • External. This is a separate sheet or separate file. In this case, we can start with schema.loader.BareExternalSchemaLoader to read an external schema. In the case of COBOL files, there’s a separate cobol.loader.COBOLSchemaLoader that parses COBOL source to create a usable schema.

Physical Format. Generally, a file name provides a hint as to the physical file format. .csv, .xls, .xlsx, .xlsm, .ods, .numbers describe the physical format.

Our cell.Cell, sheet.Sheet, and workbook.base.Workbook classes handle many physical format details nicely.
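
Here is a minimal sketch of the embedded-schema binding, using the workbook, sheet, and schema loader names from this chapter. The module import paths, the file name, and the rows() iteration are assumptions, not a definitive recipe.

# Module paths are assumed to follow the package names used in this chapter.
import stingray.workbook as workbook
import stingray.sheet as sheet
import stingray.schema.loader as loader

# Hypothetical file name; the .xlsx extension identifies the physical format.
with workbook.open_workbook( "sample.xlsx" ) as source:
    for name in source.sheets():
        # Bind the embedded (heading-row) schema to each sheet.
        live_sheet = source.sheet( name,
            sheet.EmbeddedSchemaSheet,
            loader_class=loader.HeadingRowSchemaLoader )
        for row in live_sheet.rows():
            pass  # Application-specific row processing goes here.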

11.3. Data Attribute Mapping – Using a Schema

Using a schema is the heart of the semantic problem.

We want to have just one application that is adaptable to a number of closely-related data file schemata. Ideally, there’s a single schema, but as a practical matter, there are often several similar schemata.

We need to provide three pieces of information, minimally.

  • Target attribute within our application.
  • Target data type conversion.
  • Source attribute based on attribute name or position.

We could use a number of different encodings for this relationship. We could write it as properties, or XML, or some other notation.

However, that triple is essentially a Python assignment statement with target, to_type and source. The simplest description is the following:

target = row.cell( schema['source'] ).to_type()

There is a tiny bit of boilerplate in this assignment statement. When using repeating groups, items with duplicated column names, or REDEFINES clauses, the “boilerplate” expands to some additional code required to locate the source data.

For multiple attributes, our use case is a Builder Function:

def build_record( aRow ):
    return dict(
        name = aRow.cell( schema['some column'] ).to_str(),
        address = aRow.cell( schema['another column'] ).to_str(),
        zip = aRow.cell( schema['zip'] ).to_digit_str(5),
        phone = aRow.cell( schema['phone'] ).to_digit_str(),
    )

The idea is to build a single function that defines the application-specific mapping from a row in a file, given the logical layout information buried in the schema definition.

Of course, the schema can lie, and the application can misuse the data. Those are inevitable (and therefore insoluble) problems. This is why we must write customized software to handle these data sources.

In the case of variant schemata, we would like something like this.

def build_record_1( aRow ):
    return dict(
        name = aRow.cell( schema['some column'] ).to_str(),
        address = aRow.cell( schema['another column'] ).to_str(),
        zip = aRow.cell( schema['zip'] ).to_digit_str(5),
        phone = aRow.cell( schema['phone'] ).to_digit_str(),
    )

def build_record_2( aRow ):
    return dict(
        name = aRow.cell( schema['variant column'] ).to_str(),
        address = aRow.cell( schema['something different'] ).to_str(),
        zip = aRow.cell( schema['zip'] ).to_digit_str(5),
        phone = aRow.cell( schema['phone'] ).to_digit_str(),
    )

We can then define a handy factory which picks a builder based on the schema version.

make_builder(args)

Create a builder object from the args.

Parameters: args – schema version
Returns: appropriate builder function for the schema

def make_builder( args ):
    return eval( 'build_record_{0}'.format(args.layout) )

The make_builder() function selects one of the available builders based on a run-time option.

11.3.1. Adding Fluency

In the case where there are no repeating groups, nor REDEFINES clauses, we can make our API slightly more fluent by building a row dictionary from row and schema. This kind of eager cell processing is risky for COBOL files. It often works, however, for well-known spreadsheet files.

row_dict = dict(
    (a.name, row.cell(a)) for a in schema )

This allows the following target, to_type and source triple.

target = row['source'].to_type()

This parallels the csv module’s DictReader class.
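
For comparison, here is the analogous csv.DictReader pattern, which also hides column positions behind column names. The file name and column name are hypothetical.

import csv

# Hypothetical file and column name, shown only for the parallel with csv.
with open( "sample.csv", newline='' ) as source:
    reader = csv.DictReader( source )
    for row in reader:
        name = row['some column']   # Access by column name, not by position.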

11.4. Data Transformation

In the Cell Module – Data Element Containers and Conversions chapter, we noted that there are two parts to data handling: Capture and Conversion. Capture is part of parsing the physical format. Conversion is part of the final application, and has nothing to do with the schema that describes the data source.

A target data type transformation (or conversion) could be considerably more complex than the trivial case of decoding a floating-point number to a digit string. It could involve any combination of filtering, cleansing, conforming to an existing database, or rewriting.

Here’s a much more complex Builder Function.

def build_record_3( aRow ):
    if aRow['flag'].is_empty():
        return None
    zip_str = aRow['zip'].to_str()
    if '-' not in zip_str:
        if len(zip_str) <= 5:
            zip= aRow['zip'].to_digit_str(5)
        else:
            zip= aRow['zip'].to_digit_str(9)
    else:
        zip= zip_str.replace('-','')
    return dict(
        name = aRow['variant column'].to_str(),
        address = aRow['different column'].to_str(),
        zip = zip,
        phone = aRow['phone'].to_digit_str(),
    )

This shows filtering and cleansing operations. Yes, it’s complex. Indeed, it’s complex enough that alternative domain-specific languages (properties, XSLT, and the like) are much less clear than the Python code.

11.5. Stingray Application Design

Generally, there are two kinds of testing. Conventional unit testing applies to our application to be sure the schemata are used properly. Beyond that, we have data quality testing.

For application unit testing, our programs should be decomposed into three tiers of processing.

  • Row-Level. Individual Python objects built from one row of the input. This involves our builder functions.
  • Sheet-Level. Collections of Python objects built from all rows of a sheet. This involves sheet processing functions.
  • Workbook-Level. Collections of sheets.

Each of these tiers should be tested independently.

In Unit Level Validation for Application and Data, we’ll look at how we validate that the input files have the expected schema. This is a kind of Data Quality testing. It can use the unit testing framework, but it applies to data, not applications.

11.5.1. Row-Level Processing

Row-level processing is centered on a suite of builder functions. These handle the detailed mapping from variant logical layouts to a single, standardized conceptual schema.

A builder function should create a simple dictionary or types.SimpleNamespace. Each dictionary key is a standardized attribute name used by the internal Python class definitions.

Q. Why not build the final Python objects from the source row?

A. The source row needs to be validated to see if a valid object can be built. It seems simpler to map the logical layout in the source document to a standardized structure that matches the conceptual schema. This standardized structure can be validated. Then the Python object is built from that structure.

This follows the design patterns from the Django project where a ModelForm is used to validate data before attempting to build a Model instance.

Here’s the three-part operation: Build, Validate and Construct.

def builder_1( row ):
    return dict(
        key = row.cell( row.sheet.schema['field'] ).to_type(),
    )

def is_valid( row_dict ):
    # All present or accounted for? A placeholder check: no missing values.
    state = all( value is not None for value in row_dict.values() )
    return state

def construct_object( row_dict ):
    return App_Object( **row_dict )

The validation rules rarely change. The object construction doesn’t really need to be a separate function; it can often be a simple class name.

Our sheet processing can include a function like this. We’ll look at this below.

builder= make_builder( args )
for row in sheet:
    intermediate= builder( row )
    if is_valid(intermediate):
        yield construct_object(intermediate)
    else:
        log.error( row )

The builder, however, varies with the file’s actual schema. We need to pick the builder based on a “logical layout” command-line option. Something like the following is used to make an application flexible with respect to layout.

def make_builder( args ):
    if args.layout in ("1", "D", "d"):
        return builder_1
    elif args.layout == "2":
        return builder_2
    else:
        raise Exception( "Unknown layout value: {0}".format(args.layout) )

The builders are tested individually. They are subject to considerable change. New builders are created frequently.

The validation should be common to all logical layouts. It’s not subject to much variation. The validation and object construction don’t have the change velocity that builders have.

11.5.2. Configuration Options

We might want to package all builders in a separate module. This provides a focused location for change that leaves the application untouched when handling Yet Another Logical Layout.

def make_builder( args ):
    builder_name = 'builder_{0}'.format( args.layout )
    namespace = {}
    with open( 'builders.py' ) as source:
        exec( source.read(), namespace )  # Python 3 replacement for execfile()
    return namespace[builder_name]

Or

def make_builder( args ):
    import builders
    return getattr( builders, 'builder_{0}'.format( args.layout ) )

This allows a single application to be separated into invariant portions and the builders which need to be tweaked when the source file layouts change.

11.5.3. Sheet-Level Processing

The next higher layer is sheet-level processing. For the most part, a sheet contains rows of a single logical type. In exceptional cases, multiple types may be coincidentally bound into a single sheet. We’ll return to the multiple-types-per-sheet issue below.

For the single-type-per-sheet, we have a processing function like the following.

process_sheet(sheet, builder)

Process the given sheet using the given builder.

def process_sheet( sheet, builder=builder_1 ):
    counts= defaultdict( int )
    object_iter = (
        builder(row)
        for row in sheet.schema.rows_as_dict_iter(sheet) )
    for source in object_iter:
        counts['read'] += 1
        if is_valid(source):
            counts['processed'] += 1
            # The real processing
            obj= make_app_object( source )
            obj.save()
        else:
            counts['rejected'] += 1
    return counts

This kind of sheet-processing function is tested two ways. First, with a test fixture that provides specific rows based on requirements, edge cases and other “white-box” considerations.

It is also tested with a “live file”. The counts can be checked. This actually tests the file as much as it tests the sheet-processing function.
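
Here is a minimal sketch of such a live-file test, reusing the process_sheet() function shown above. The test class, sample file name, sheet name, and module import paths are assumptions.

import unittest
import stingray.workbook as workbook
import stingray.sheet as sheet
import stingray.schema.loader as loader

class Test_Process_Sheet_Live( unittest.TestCase ):
    def test_should_balance_counts( self ):
        # Hypothetical sample file shipped with the test suite.
        with workbook.open_workbook( "sample.xlsx" ) as source:
            live_sheet = source.sheet( "Sheet1",
                sheet.EmbeddedSchemaSheet,
                loader_class=loader.HeadingRowSchemaLoader )
            counts = process_sheet( live_sheet )
        # Every row read must be either processed or rejected.
        self.assertEqual( counts['read'],
            counts['processed'] + counts['rejected'] )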

11.5.4. Workbook Processing

The overall processing of a given workbook input looks like this.

process_workbook(source, builder)

Process all sheets of the workbook using the given builder.

def process_workbook( source, builder ):
    for name in source.sheets():
        # Sheet filter?  Or multi-way elif switch?
        current_sheet= source.sheet( name,
            sheet.EmbeddedSchemaSheet,
            loader_class=schema.loader.HeadingRowSchemaLoader )
        counts= process_sheet( current_sheet, builder )
        pprint.pprint( counts )

This makes two claims about the workbook.

  • All sheets in the workbook have the same schema rules. In this example, it’s an embedded schema in each sheet and the schema is the heading row. We could easily use a sheet.ExternalSchemaSheet and an external schema.
  • A single process_sheet() function is appropriate for all sheets.

If a workbook doesn’t meet these criteria, then a (potentially) more complex workbook processing function is needed. A sheet filter is usually necessary.

Sheet name filtering is subject to the same kind of change as the builders. Each variant logical layout may also have a variation in sheet names. It helps to separate the sheet filter functions in the same way builders are separated. New functions are added with remarkable regularity.

def sheet_filter_1( name ):
    return re.match( r'pattern', name )

Or perhaps something like this, which uses a shell file-name pattern instead of a more sophisticated regular expression.

def sheet_filter_2( name ):
    return fnmatch.fnmatch( name, 'pattern' )
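
A sheet filter like this plugs into the workbook processing loop. Here is a minimal sketch that combines the process_workbook() and process_sheet() functions shown above with a filter; the function name and the sheet_filter parameter are assumptions.

def process_workbook_filtered( source, builder, sheet_filter=sheet_filter_1 ):
    """Like process_workbook(), but only processes sheets accepted by the filter."""
    for name in source.sheets():
        if not sheet_filter( name ):
            continue   # Skip sheets that don't belong to this logical layout.
        current_sheet= source.sheet( name,
            sheet.EmbeddedSchemaSheet,
            loader_class=schema.loader.HeadingRowSchemaLoader )
        counts= process_sheet( current_sheet, builder )
        pprint.pprint( counts )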

11.5.5. Command-Line Interface

We have an optional argument for verbosity and a positional argument that provides all the files to process.

def parse_args():
    parser= argparse.ArgumentParser()
    parser.add_argument( 'file', nargs='+' )
    parser.add_argument( '-l', '--layout' )
    parser.add_argument( '-v', '--verbose', dest='verbosity',
        default=logging.INFO, action='store_const', const=logging.DEBUG )
    return parser.parse_args()

The overall main program looks something like this.

if __name__ == "__main__":
    logging.basicConfig( stream=sys.stderr )
    args= parse_args()
    logging.getLogger().setLevel( args.verbosity )
    builder= make_builder( args )
    try:
        for filename in args.file:
            with workbook.open_workbook( filename ) as source:
                process_workbook( source, builder )
        status= 0
    except Exception as e:
        logging.exception( e )
        status= 3
    logging.shutdown()
    sys.exit( status )

This main program switch allows us to test the various functions (process_workbook(), process_sheet(), the builders, etc.) in isolation.

It also allows us to reuse these functions to build larger (and more complete) applications from smaller components.

In Stingray Demo Applications we’ll look at two demonstration applications, as well as a unit test.

11.6. Variant Records and COBOL REDEFINES

Ideally, a data source is in “first normal form”: all the rows are a single type of data. We can apply a Build, Validate, Construct sequence simply.

In too many cases, a data source has multiple types of data. In COBOL files, it’s common to have header records or trailer records which are summaries of the details sandwiched in the middle.

Similarly, a spreadsheet may be populated with summary rows that must be discarded or handled separately. We might, for example, write the summary to a different destination and use it to confirm that all rows were properly processed.

Because of the COBOL REDEFINES clause, there may be invalid data. We can’t assume that all attributes have valid data. This makes our processing slightly different because we can’t necessarily do eager evaluation of each row of data.

We’ll look at a number of techniques for handling variant records.

11.6.1. Trivial Filtering

When using an embedded schema loader based on schema.loader.HeadingRowSchemaLoader, we can extend that loader to reject rows.

The schema.loader.HeadingRowSchemaLoader.rows() method can do simple filtering. This is most appropriate for excluding blank rows or summary rows from a spreadsheet.
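
Here is a minimal sketch of such an extension. The subclass name is hypothetical, and the blank-row test (every cell in the row answers is_empty()) is an assumption about how rows and cells can be inspected.

import stingray.schema.loader as loader

class Skip_Blank_Rows_Loader( loader.HeadingRowSchemaLoader ):
    """Heading-row schema loader that also rejects blank data rows."""
    def rows( self ):
        for row in super().rows():
            # Hypothetical blank-row test: every cell in the row is empty.
            if all( cell.is_empty() for cell in row ):
                continue
            yield row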

11.6.2. Multiple Passes and Filters

When we have multiple data types within a single sheet, we can process this data using Multiple Passes and Filters. Each pass uses different filters to separate the various types of data.

This relies on an eager production of an intermediate object from the raw data. This may not work well for COBOL files.

The multiple-pass option looks like this. Each pass applies a filter and then does the appropriate processing.

def process_sheet_filter_1( sheet ):
    counts= defaultdict( int )
    object_iter = (
        builder(row)
        for row in sheet.schema.rows_as_dict_iter(sheet) )
    for source in object_iter:
        counts['read'] += 1
        if filter_1(source):
            counts['filter_1'] += 1
            processing_1(source)
        else:
            counts['rejected'] += 1
    return counts

Each filter is a simple boolean function like this.

def filter_1( source ):
    return source['column'] == some_value  # or any more complex condition

The conditions may be trivial: source['column'] == value. The conditions may be more complex. It’s good to encapsulate them as distinct, named functions.

We could make the filter function and processing function an argument to a generic process_sheet() function. Sometimes this is a good simplification, sometimes it leads to extra complexity of little value.

11.6.3. One Pass and Switch

When we have multiple data types within a single sheet, we can make a single pass over the data, using an if-elif “switch” to distinguish the different types of rows. Each type of row is handled separately.

This relies on an eager production of an intermediate object from the raw data. This may not work well for COBOL files.

The one-pass option looks like this. A “switch” function is used to discriminate each kind of row that is found in the sheet.

def process_sheet_switch( sheet ):
    counts= defaultdict( int )
    object_iter = (
        builder(row)
        for row in sheet.schema.rows_as_dict_iter(sheet) )
    for source in object_iter:
        counts['read'] += 1
        if switch_1(source):
            processing_1(source)
            counts['switch_1'] += 1
        elif switch_2(source):
            processing_2(source)
            counts['switch_2'] += 1
        # elif ... additional switch functions as needed ...
        else:
            counts['rejected'] += 1
    return counts

Each switch function is a simple boolean function like this.

def switch_1( source ):
    return source['column'] == some_value  # or any more complex condition

The conditions may be trivial: source['column'] == value. The conditions may be more complex. It’s good to encapsulate them as distinct, named functions.

We may be able to build a simple mapping from switch function to process function.

switch_process = (
    (switch_1, processing_1),
    (switch_2, processing_2),
)

This allows us to write a generic sheet processing function.

def process_sheet_switch( sheet, mapping ):
    counts= defaultdict( int )
    object_iter = (
        builder(row)
        for row in sheet.schema.rows_as_dict_iter(sheet) )
    for source in object_iter:
        counts['read'] += 1
        processed= None
        for switch, process in mapping:
            if switch(source):
                processed= switch.__name__
                process( source )
                counts[processed] += 1
        if not processed:
            counts['rejected'] += 1
    return counts

This can more easily be extended by adding to the switch_process mapping.

11.6.4. Lazy Switch Processing

The above two examples rely on building an iterator over intermediate objects. The object_iter builds objects eagerly. This may not always work for COBOL files. Here’s a variation that might be helpful.

We’ll decompose the builders so that the switch is applied first. Then the builder and processing can depend on the switch.

switch_build_process = (
    (switch_1, builder_1, processing_1),
    (switch_2, builder_2, processing_2),
)

This structure can be used with the following generic sheet processing.

def process_sheet_switch( sheet, mapping ):
    counts= defaultdict( int )
    for row in sheet.schema.rows(sheet):
        counts['read'] += 1
        processed= None
        for switch, builder, process in mapping:
            if switch(row):
                processed= switch.__name__
                source= builder( row )
                process( source )
                counts[processed] += 1
        if not processed:
            counts['rejected'] += 1
    return counts

This is slightly more complex. It has the advantage of not attempting to process a row until some initial sanity check has been done. Once the switch function determines the type of the row, an appropriate builder can be invoked and the row processed.

In many cases, the processing starts with more complex data quality validation. If so, that can be added to the mapping. It would become a switch-builder-validator-process mapping that decomposes each step of the processing pipeline.
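
Here is a minimal sketch of that extended mapping, reusing the hypothetical switch, builder, validation, and processing names from the earlier sections.

from collections import defaultdict

switch_build_validate_process = (
    (switch_1, builder_1, is_valid, processing_1),
    (switch_2, builder_2, is_valid, processing_2),
)

def process_sheet_lazy( sheet, mapping ):
    counts= defaultdict( int )
    for row in sheet.schema.rows(sheet):
        counts['read'] += 1
        processed= None
        for switch, builder, validate, process in mapping:
            if switch(row):
                processed= switch.__name__
                source= builder( row )
                if validate( source ):
                    process( source )
                    counts[processed] += 1
                else:
                    counts['invalid'] += 1
        if not processed:
            counts['rejected'] += 1
    return counts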

11.7. Big Data Performance

We’ve broken our processing down into separate steps which work with generic Python data structures. The idea is that we can use multiprocessing to spread the pipeline into separate processors or cores.

Each stage of the Build, Validate, Construct sequence can be decomposed. We can read raw data from the source file, apply a switch and put the raw “Row” objects into a processing queue.

A builder process can dequeue row objects from the processing queue, apply a builder, and put objects into a validation queue.

A validator process can dequeue built objects (dictionaries, for example) and validate them. Invalid objects can be written to a queue for logging. Valid objects can be written to another queue for processing.

The final processing queue will get a sequence of valid objects all of a single type. The final processing might involve (slow) database transactions, and there may need to be multiple worker processes fetching from this queue.
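
Here is a minimal sketch of one stage of such a pipeline, using multiprocessing.Process and multiprocessing.Queue. The builder stand-in, queue names, and sentinel convention are assumptions; a real pipeline also needs error handling and logging.

import multiprocessing

def build_intermediate( row ):
    # Stand-in for one of the builder functions from earlier sections.
    return row

def builder_worker( row_queue, built_queue ):
    """Dequeue raw rows, build intermediate objects, enqueue them for validation."""
    while True:
        row = row_queue.get()
        if row is None:              # Sentinel: no more input.
            built_queue.put( None )
            break
        built_queue.put( build_intermediate( row ) )

if __name__ == "__main__":
    row_queue = multiprocessing.Queue()
    built_queue = multiprocessing.Queue()
    worker = multiprocessing.Process( target=builder_worker,
        args=(row_queue, built_queue) )
    worker.start()
    # A reader process would put raw Row objects into row_queue here.
    row_queue.put( None )            # Signal end of input.
    worker.join()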

11.8. File Naming and External Schema

Some data management discipline is also essential to be sure that the schema and file match up properly. Naming conventions and standardized directory structures are essential for working with external schema.

11.8.1. Well Known Formats

For well-known physical formats (.csv, .xls, .xlsx, .xlsm, .ods, .numbers) the filename extension describes the physical format. Additional information is required to determine the Logical Layout.

The schema may be loaded from column headers, in which case the binding is handled via an embedded schema loader. If the schema.loader.HeadingRowSchemaLoader is used, no more information is required. If a customized schema loader is used (because the headings are not trivially the first row of a sheet), then we must – somehow – bind the application to its external schema. The filename extension doesn’t really help with this. The best choice is to use a configuration file of some kind.

If the schema is external, and we’re working with a well-known physical format, then we have an even more complex binding issue. The schema will often require a customized schema loader. Each file must be associated with a schema file and a schema loader class name. File naming conventions won’t help. This, too, should rely on a configuration file.
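
A minimal sketch of such a configuration, written here as a simple Python mapping from file-name pattern to schema details. The patterns, file names, and loader class names are assumptions; a real project might keep this in JSON or an INI file instead.

# Hypothetical configuration: bind each file-name pattern to its external
# schema file and the schema loader class needed to read that schema.
SCHEMA_BINDINGS = {
    'sales_*.xlsx': dict(
        schema_file= 'sales_schema.xlsx',
        loader_class= 'BareExternalSchemaLoader',
    ),
    'rates_*.csv': dict(
        schema_file= 'rates_schema.xlsx',
        loader_class= 'CustomHeadingSchemaLoader',
    ),
}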

11.8.2. Fixed Formats and COBOL

For fixed-format files, the filename extension does not describe the physical layout. Further, a fixed format schema must combine logical layout and physical format into a single description.

For fixed format files, the following conventions help bind a file to its schema.

  • The data file extension is the base name of a schema file, for example mydata.someschema. Do not use .dat or something else uninformative.
  • Schema files must be either a DDE file or a workbook in a well-known format, for example someschema.cob or someschema.xlsx.

Examples. We might see the following file names.

september_2001.exchange_1
november_2011.some_dde_name
october_2011.some_dde_name
exchange_1.xls
some_dde_name.cob

The september_2001.exchange_1 file is a fixed format file which requires the exchange_1.xls metadata workbook.

The november_2011.some_dde_name and october_2011.some_dde_name files are fixed format files which require the some_dde_name.cob metadata.

11.8.3. External Schema Workbooks

A workbook with an external schema sheet must adhere to a few conventions to be usable. These rules form the basis for the schema.loader.BareExternalSchemaLoader class. To change the rules, extend that class.

  • The standard sheet name "Schema" defines the appropriate sheet.
  • There must be an internal meta-schema on line one of the sheet that provides the expected column names.
  • The column names “name”, “offset”, “size”, “type” are used.
  • Only “name” is required in general.
  • For fixed format files, “offset”, “size” and “type” will also be required.
  • Additional columns are allowed, but will be ignored.
  • Type definitions are “text”, “number”, “date” and “boolean”.
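
For example, a hypothetical "Schema" sheet describing a fixed-format file might contain rows like these (the field names are invented):

name        offset  size  type
ACCOUNT-ID  1       10    text
BALANCE     11      12    number
OPEN-DATE   23      8     date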

11.9. Binding a Schema to an Application

We would like to be sure that our application’s expectations for a schema are aligned with the schema actually present. An application has several ways to bind its schema information.

  • Implicitly. The code simply mentions specific columns (either by name or position).
  • Explicitly. The code has a formal “requires” check to be sure that the schema provided by the input file actually matches the schema required by the application.

Explicit schema binding parallels the configuration management issue of module dependency. A file can be said to provide a given schema, and an application can be said to require a given schema.

Sadly, we don’t always have a pithy summary of a schema. We can’t trivially examine a file to be sure it conforms to a schema. In the case of well-known file formats with an embedded schema, we can do a test like this to determine if the schema is what we expect.

all( req in schema for req in ('some', 'list', 'of', 'columns') )

This covers 80% of the use cases. For all other cases, we don’t have a reliable way to confirm the file’s schema.
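
Here is a minimal sketch of that explicit check, wrapping the membership test above in a function that logs a warning. The required column names are hypothetical.

import logging

logger = logging.getLogger( __name__ )

# Hypothetical list of the column names this application requires.
REQUIRED_COLUMNS = ('name', 'address', 'zip', 'phone')

def schema_matches( schema, required=REQUIRED_COLUMNS ):
    """True if every required column name is present in the schema."""
    missing = [req for req in required if req not in schema]
    if missing:
        logger.warning( "Schema is missing columns: %s", missing )
    return not missing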

If we don’t implement this, we’re left with implicit schema binding in our applications. In short, we have to include data validity checks, a debugging log, and some kind of warning technique.

11.10. Schema Version Numbering

XSDs can have version numbers. This is very cool.

See http://www.xfront.com/Versioning.pdf for detailed discussion of how to represent schema version information.

Databases, however, lack version numbering in the schema. This leads to potential compatibility issues between application programs that expect version 3 of the schema and an older database that implements version 2 of the schema.

Our file schemata, similarly, don’t have a tidy, unambiguous numbering.

For external schema, we can embed the version in the file names. We might want to use something like this: econometrics_vendor_1.2. This identifies the generic type of data, the source for that file, and the schema version number.

Within a SQL database, we can easily use the schema name to carry version information. We could have a name_version kind of convention for all schema, allowing an application to confirm schema compatibility with a trivial SQL query.

For embedded schema, however, we have no easy way to handle schema identification and version numbering. We’re forced to build an algorithm to examine the actual names in the embedded schema to deduce the version.

This problem with embedded schema leads to using data profiling to reason out what the file is. This may devolve to a manual examination of the data profiling results to allow a human to determine the schema. Then, once the schema has been identified, command-line options can be used to bind the schema to file for correct processing.

11.11. Frequently Asked Questions

11.11.1. Junk Data

For inexplicable reasons, we can wind up with files that are damaged in some way.

“there is a 65-byte “header” at the start of the file, what would be the best (least hacky) way to skip over the first 65 bytes?”

This is one of the reasons why we use both a file name and an open file object as arguments for opening a workbook.

import stingray.cobol

with open("file_with_junk.some_schema","rb") as cobol:
    cobol.seek( 65 )  # Skip the 65-byte header.
    wb = stingray.cobol.EBCDIC_File( cobol.name, file_object=cobol )

This should permit skipping past the junk.