11. The Stingray Developer’s Guide¶
We use Stingray to work with data files where the schema is either external or complex (or both). We can tackle this question:
How do we process a file simply and consistently?
Or, more concretely,
How do we make a program independent of Physical Format and Logical Layout?
We can also use Stingray to answer questions about files, the schema those files purport to use, and the data on those files. Specifically, we can tackle this question:
How do we assure that a file and an application use the same schema?
There are two sides to schema use:
- Application Use. Given a data file, we need to bind schema information to that file. For a workbook, the schema may be in the column headings of the sheet. Or it may be in a separate sheet, or a separate workbook. For a COBOL file, the schema is always in a separate COBOL-language data definition.
- Schema Conformance. Given a data file, how do we confirm that some schema matches the file? This is the data quality assurance question.
We do need to note the following:
If it were simple, we wouldn’t need this package, would we?
11.1. Concepts¶
As noted in Introduction, there are three levels of schema that need to be bound to a file.
Physical Format. We can make this transparent to our applications. See Workbook Package – Uniform Wrappers for Workbooks and The COBOL Package for details. Everything is a workbook with a fairly limited API.
Logical Layout. This is how an application program will make use of the data found in a file. Sometimes the Logical layout information can be embedded in a file: it might be the top row of a sheet in a workbook, for example. Sometimes the logical layout can be separate: a COBOL data definition, or perhaps a metadata sheet in a workbook.
We can’t make the logical layout transparent. Our applications and files both need to agree on a logical layout. The column names in the metadata, for example, must be agreed to. The order or position of the columns, however, need not be fixed.
Conceptual Content. A single conceptual schema may be implemented by a number of physical formats and logical layouts. An application should be able to tolerate variability in the logical layout as long as it matches the expected conceptual content.
Since we’re working in Python, the conceptual schema is often a class definition. It might also be a namedtuple or a types.SimpleNamespace.
We’ll tackle the schema binding in several pieces.
- File Schema. Binding a Schema to a File describes some preliminary operational steps that make Stingray work more simply and reliably.
- Processing. These are the basic concepts: Data Attribute Mapping – Using a Schema and Data Transformation.
- Application Design Patterns. These are more complex issues. We can then dig into Stingray application programming in stingray Application Design, Variant Records and COBOL REDEFINES, and Big Data Performance.
- Data Management. File Naming and External Schema, Binding a Schema to an Application, and Schema Version Numbering.
- FAQ. Some other design questions. Frequently Asked Questions.
We’ll look at some demonstration programs in Stingray Demo Applications.
11.2. Binding a Schema to a File¶
We’re only going to bind two levels of schema to a file. The conceptual schema would require some kind of formal ontology, something that’s rarely available.
Logical Layout. We rely on a schema, schema.Schema, to manage the logical layout of a file.
A workbook has two ways to bind a logical layout to data values. Our sheet.Sheet subclass hierarchy requires a schema object. We’ll load that schema using one of the schema.loader components.
- Embedded. This is commonly seen as column titles within the sheet, or any variation on that theme. The common case of column titles is handled by a built-in schema loader, schema.loader.HeadingRowSchemaLoader. Other variations are managed by building different schema loaders.
- External. This is a separate sheet or a separate file. In this case, we can start with schema.loader.BareExternalSchemaLoader to read an external schema. In the case of COBOL files, there’s a separate cobol.loader.COBOLSchemaLoader that parses COBOL source to create a usable schema.
Physical Format. Generally, a file name provides a hint as to the physical file format: .csv, .xls, .xlsx, .xlsm, .ods, and .numbers all describe the physical format. Our cell.Cell, sheet.Sheet, and workbook.base.Workbook classes handle many physical format details nicely.
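Putting the two bindings together, opening a well-known workbook and attaching an embedded heading-row schema might look like the following sketch. This is illustrative only: the file name "sample.xlsx", the sheet name "Sheet1", and the column name 'some column' are placeholders, and the import paths are assumptions based on the package layout used elsewhere in this guide.
# Import paths are assumptions; the calls mirror examples later in this chapter.
import stingray.workbook as workbook
import stingray.sheet as sheet
import stingray.schema.loader as loader

with workbook.open_workbook( "sample.xlsx" ) as source:
    the_sheet = source.sheet( "Sheet1",
        sheet.EmbeddedSchemaSheet,
        loader_class=loader.HeadingRowSchemaLoader )
    for row in the_sheet.schema.rows_as_dict_iter( the_sheet ):
        print( row['some column'].to_str() )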
11.3. Data Attribute Mapping – Using a Schema¶
Using a schema is the heart of the semantic problem.
We want to have just one application that is adaptable to a number of closely-related data file schemata. Ideally there’s just one schema, but as a practical matter, there are often several similar schemata.
We need to provide three pieces of information, minimally.
- Target attribute within our application.
- Target data type conversion.
- Source attribute based on attribute name or position.
We could use a number of different encodings for this relationship. We could write it as properties, or XML, or some other notation.
However, that triple is essentially a Python assignment statement with target, to_type and source. The simplest description is the following:
target = row.cell( schema['source'] ).to_type()
There is a tiny bit of boilerplate in this assignment statement. When using repeating groups, items with duplicated column names, or REDEFINES clauses, the “boilerplate” expands to some additional code required to locate the source data.
For multiple attributes, this is our use case: a Builder Function:
def build_record( aRow ):
    return dict(
        name = aRow.cell( schema['some column'] ).to_str(),
        address = aRow.cell( schema['another column'] ).to_str(),
        zip = aRow.cell( schema['zip'] ).to_digit_str(5),
        phone = aRow.cell( schema['phone'] ).to_digit_str(),
    )
The idea is to build a single function that defines the application-specific mapping from a row in a file, given the logical layout information buried in the schema definition.
Of course, the schema can lie, and the application can misuse the data. Those are inevitable (and therefore insoluble) problems. This is why we must write customized software to handle these data sources.
In the case of variant schemata, we would like something like this.
def build_record_1( aRow ):
    return dict(
        name = aRow.cell( schema['some column'] ).to_str(),
        address = aRow.cell( schema['another column'] ).to_str(),
        zip = aRow.cell( schema['zip'] ).to_digit_str(5),
        phone = aRow.cell( schema['phone'] ).to_digit_str(),
    )

def build_record_2( aRow ):
    return dict(
        name = aRow.cell( schema['variant column'] ).to_str(),
        address = aRow.cell( schema['something different'] ).to_str(),
        zip = aRow.cell( schema['zip'] ).to_digit_str(5),
        phone = aRow.cell( schema['phone'] ).to_digit_str(),
    )
We can then define a handy factory which picks a builder based on the schema version.
make_builder(args)¶
    Create a builder object from the args.
    Parameters: args – schema version
    Returns: appropriate builder function for the schema
def make_builder( args ):
return eval( 'build_record_{0}'.format(args.layout) )
The make_builder() function selects one of the available builders based on a run-time option.
11.3.1. Adding Fluency¶
In the case where there are no repeating groups, nor REDEFINES clauses, we can make our API slightly more fluent by building a row dictionary from row and schema. This kind of eager cell processing is risky for COBOL files. It often works, however, for well-known spreadsheet files.
row_dict = dict(
(a.name, row.cell(a)) for a in schema )
This allows the following target, to_type and source triple.
target = row['source'].to_type()
This parallels the csv module’s DictReader class.
11.4. Data Transformation¶
In the Cell Module – Data Element Containers and Conversions chapter, we noted that there are two parts to data handling: Capture and Conversion. Capture is part of parsing the physical format. Conversion is part of the final application, and has nothing to do with the schema that describes the data source.
A target data type transformation (or conversion) could be considerably more complex than the trivial case of decoding a floating-point number to a digit string. It could involve any combination of filtering, cleansing, conforming to an existing database, or rewriting.
Here’s a much more complex Builder Function.
def build_record_3( aRow ):
    if aRow['flag'].is_empty():
        return None
    zip_str = aRow['zip'].to_str()
    if '-' not in zip_str:
        if len(zip_str) <= 5:
            zip= aRow['zip'].to_digit_str(5)
        else:
            zip= aRow['zip'].to_digit_str(9)
    else:
        zip= zip_str.replace('-','')
    return dict(
        name = aRow['variant column'].to_str(),
        address = aRow['different column'].to_str(),
        zip = zip,
        phone = aRow['phone'].to_digit_str(),
    )
This shows filtering and cleansing operations. Yes, it’s complex. Indeed, it’s complex enough that alternative domain-specific languages (i.e., properties, XSLT, etc.) are much less clear than the Python.
11.5. Stingray Application Design¶
Generally, there are two kinds of testing. Conventional unit testing applies to our application to be sure the schemata are used properly. Beyond that, we have data quality testing.
For application unit testing, our programs should be decomposed into three tiers of processing.
- Row-Level. Individual Python objects built from one row of the input. This involves our builder functions.
- Sheet-Level. Collections of Python objects built from all rows of a sheet. This involves sheet processing functions.
- Workbook-Level. Collections of sheets.
Each of these tiers should be tested independently.
In Unit Level Validation for Application and Data, we’ll look at how we validate that the input files have the expected schema. This is a kind of Data Quality testing. It can use the unit testing framework, but it applies to data, not applications.
11.5.1. Row-Level Processing¶
Row-level processing is centered on a suite of builder functions. These handle the detailed mapping from variant logical layouts to a single, standardized conceptual schema.
A builder function should create a simple dictionary or types.SimpleNamespace. Each dictionary key is one of the standardized attribute names used by internal Python class definitions.
Q. Why not build the final Python objects from the source row?
A. The source row needs to be validated to see if a valid object can be built. It seems simpler to map the logical layout in the source document to a standardized structure that matches the conceptual schema. This standardized structure can be validated. Then the Python object is built from that structure.
This follows the design patterns from the Django project, where a ModelForm is used to validate data before attempting to build a Model instance.
Here’s the three-part operation: Build, Validate and Construct.
def builder_1( row ):
    return dict(
        key = row.cell( row.sheet.schema['field'] ).to_type(),
    )

def is_valid( row_dict ):
    # All present or accounted for?
    return state

def construct_object( row_dict ):
    return App_Object( **row_dict )
The validation rules rarely change. The object construction doesn’t really need to be a separate function; it can often be a simple class name.
Our sheet processing can include a function like this. We’ll look at this below.
builder= make_builder( args )
for row in sheet:
intermediate= builder( row )
if is_valid(intermediate):
yield construct_object(intermediate)
else:
log.error( row )
The builder, however, varies with the file’s actual schema. We need to pick the builder based on a “logical layout” command-line option. Something like the following is used to make an application flexible with respect to layout.
def make_builder( args ):
    if args.layout in ("1", "D", "d"):
        return builder_1
    elif args.layout == "2":
        return builder_2
    else:
        raise Exception( "Unknown layout value: {0}".format(args.layout) )
The builders are tested individually. They are subject to considerable change. New builders are created frequently.
The validation should be common to all logical layouts. It’s not subject to much variation. The validation and object construction doesn’t have the change velocity that builders have.
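As an illustration of that common validation, the function might simply confirm that the standardized keys are present and non-empty. This is a minimal sketch; the required key names are placeholders taken from the builder examples above.
REQUIRED_KEYS = ('name', 'address', 'zip', 'phone')   # illustrative key names

def is_valid( row_dict ):
    # All present or accounted for?  Every required key must be present and non-empty.
    return all( row_dict.get(k) not in (None, '') for k in REQUIRED_KEYS )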
11.5.2. Configuration Options¶
We might want to package all builders in a separate module. This provides a focused location for change that leaves the application untouched when handling Yet Another Logical Layout.
def make_builder( args ):
    builder_name = 'builder_{0}'.format( args.layout )
    namespace = {}
    exec( open( 'builders.py' ).read(), namespace )
    return namespace[builder_name]
Or
def make_builder( args ):
import builders
return eval('builders.builder_{0}'.format( args.layout ))
This allows a single application to be separated into invariant portions and the builders which need to be tweaked when the source file layouts change.
11.5.3. Sheet-Level Processing¶
The next higher layer is sheet-level processing. For the most part, sheets are rows of a single logical type. In exceptional cases, a sheet may have multiple types coincidentally bound into a single sheet. We’ll return to the multiple-types-per-sheet issue below.
For the single-type-per-sheet, we have a processing function like the following.
process_sheet(sheet, builder)¶
    Process the given sheet using the given builder.
def process_sheet( sheet, builder=builder_1 ):
    counts= defaultdict( int )
    object_iter = (
        builder(row)
        for row in sheet.schema.rows_as_dict_iter(sheet) )
    for source in object_iter:
        counts['read'] += 1
        if is_valid(source):
            counts['processed'] += 1
            # The real processing
            obj= make_app_object( source )
            obj.save()
        else:
            counts['rejected'] += 1
    return counts
This kind of sheet processing function is tested two ways. First, with a test fixture that provides specific rows based on requirements, edge cases, and other “white-box” considerations.
It is also tested with a “live file”: the counts can be checked. This actually tests the file as much as it tests the sheet processing function.
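A live-file test might look like the following sketch. The file name, sheet name, expected invariant, and import paths are assumptions for illustration; the workbook and loader calls follow the examples in this chapter.
import unittest
import stingray.workbook as workbook
import stingray.sheet as sheet
import stingray.schema.loader as loader

class Test_Live_File( unittest.TestCase ):
    def test_should_process_sample_file( self ):
        with workbook.open_workbook( "sample.xlsx" ) as source:
            the_sheet = source.sheet( "Sheet1",
                sheet.EmbeddedSchemaSheet,
                loader_class=loader.HeadingRowSchemaLoader )
            counts = process_sheet( the_sheet, builder_1 )
        # Every row read was either processed or rejected.
        self.assertEqual( counts['read'], counts['processed'] + counts['rejected'] )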
11.5.4. Workbook Processing¶
The overall processing of a given workbook input looks like this.
process_workbook(source, builder)¶
    Process all sheets of the workbook using the given builder.
def process_workbook( source, builder ):
    for name in source.sheets():
        # Sheet filter?  Or multi-way elif switch?
        sheet= source.sheet( name,
            sheet.EmbeddedSchemaSheet,
            loader_class=schema.loader.HeadingRowSchemaLoader )
        counts= process_sheet( sheet, builder )
        pprint.pprint( counts )
This makes two claims about the workbook.
- All sheets in the workbook have the same schema rules. In this example, it’s an embedded schema in each sheet, and the schema is the heading row. We could easily use a sheet.ExternalSchemaSheet and an external schema.
- A single process_sheet() function is appropriate for all sheets.
If a workbook doesn’t meet these criteria, then a (potentially) more complex workbook processing function is needed. A sheet filter is usually necessary.
Sheet name filtering is also subject to the same kind of change that builders are subject to. Each variant logical layout may also have a variation in sheet names. It helps to separate the sheet filter functions in the same way builders are separated. New functions are added with remarkable regularity.
def sheet_filter_1( name ):
    return re.match( r'pattern', name )
Or, perhaps something like this that uses a shell file-name pattern instead of a more sophisticated regular expression.
def sheet_filter_2( name ):
    return fnmatch.fnmatch( name, 'pattern' )
11.5.5. Command-Line Interface¶
We have an optional argument for verbosity and a positional argument that provides all the files to profile.
def parse_args():
parser= argparse.ArgumentParser()
parser.add_argument( 'file', nargs='+' )
parser.add_argument( '-l', '--layout' )
parser.add_argument( '-v', '--verbose', dest='verbosity',
default=logging.INFO, action='store_const', const=logging.DEBUG )
return parser.parse_args()
The overall main program looks something like this.
if __name__ == "__main__":
    logging.basicConfig( stream=sys.stderr )
    args= parse_args()
    logging.getLogger().setLevel( args.verbosity )
    builder= make_builder( args )
    try:
        for file in args.file:
            with workbook.open_workbook( file ) as source:
                process_workbook( source, builder )
        status= 0
    except Exception as e:
        logging.exception( e )
        status= 3
    logging.shutdown()
    sys.exit( status )
This main program switch allows us to test the various functions (process_workbook(), process_sheet(), the builders, etc.) in isolation.
It also allows us to reuse these functions to build larger (and more complete) applications from smaller components.
In Stingray Demo Applications we’ll look at two demonstration applications, as well as a unit test.
11.6. Variant Records and COBOL REDEFINES¶
Ideally, a data source is in “first normal form”: all the rows are a single type of data. We can apply a Build, Validate, Construct sequence simply.
In too many cases, a data source has multiple types of data. In COBOL files, it’s common to have header records or trailer records which are summaries of the details sandwiched in the middle.
Similarly, a spreadsheet may be populated with summary rows that must be discarded or handled separately. We might, for example, write the summary to a different destination and use it to confirm that all rows were properly processed.
Because of the COBOL REDEFINES clause, there may be invalid data. We can’t assume that all attributes have valid data. This makes our processing slightly different because we can’t necessarily do eager evaluation of each row of data.
We’ll look at a number of techniques for handling variant records.
11.6.1. Trivial Filtering¶
When using an Embedded Schema Loader based on schema.loader.HeadingRowSchemaLoader
we can extend this loader to reject rows.
The schema.loader.HeadingRowSchemaLoader.rows()
method can do simple filtering.
This is most appropriate for excluding blank rows or summary rows from a spreadsheet.
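A minimal sketch of such an extension follows. The blank-row test is an assumption for illustration, as is the idea that iterating a row yields its Cell objects; the only structural claim taken from this guide is that the loader’s rows() method can be overridden to skip unwanted rows.
class SkipBlankRows( schema.loader.HeadingRowSchemaLoader ):
    """Reject rows in which every cell is empty (illustrative test)."""
    def rows( self ):
        for row in super().rows():
            # Assumes a row can be iterated to yield its Cell objects.
            if all( cell.is_empty() for cell in row ):
                continue
            yield row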
11.6.2. Multiple Passes and Filters¶
When we have multiple data types within a single sheet, we can process this data using Multiple Passes and Filters. Each pass uses different filters to separate the various types of data.
This relies on an eager production of an intermediate object from the raw data. This may not work well for COBOL files.
The multiple-pass option looks like this. Each pass applies a filter and then does the appropriate processing.
def process_sheet_filter_1( sheet ):
    counts= defaultdict( int )
    object_iter = (
        builder(row)
        for row in sheet.schema.rows_as_dict_iter(sheet) )
    for source in object_iter:
        counts['read'] += 1
        if filter_1(source):
            counts['filter_1'] += 1
            processing_1(source)
        else:
            counts['rejected'] += 1
    return counts
Each filter is a simple boolean function like this.
def filter_1( source ):
    return some condition
The conditions may be trivial: source['column'] == value. The conditions may be more complex. It’s good to encapsulate them as distinct, named functions.
We could make the filter function and the processing function arguments to a generic process_sheet() function, as shown in the sketch below. Sometimes this is a good simplification; sometimes it leads to extra complexity of little value.
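Here is what that parameterization might look like. This is a sketch only, reusing the names from the examples above, with a distinct function name so it doesn’t clobber the earlier process_sheet().
def process_sheet_generic( sheet, builder, filter_func, process_func ):
    counts= defaultdict( int )
    object_iter = (
        builder(row)
        for row in sheet.schema.rows_as_dict_iter(sheet) )
    for source in object_iter:
        counts['read'] += 1
        if filter_func(source):
            # Count accepted rows under the filter's name.
            counts[filter_func.__name__] += 1
            process_func(source)
        else:
            counts['rejected'] += 1
    return counts

A pass then becomes counts= process_sheet_generic( sheet, builder_1, filter_1, processing_1 ).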
11.6.3. One Pass and Switch¶
When we have multiple data types within a single sheet, we can make a single pass over the data, using an if-elif “switch” to distinguish the different types of rows. Each type of row is handled separately.
This relies on an eager production of an intermediate object from the raw data. This may not work well for COBOL files.
The one-pass option looks like this. A “switch” function is used to discriminate each kind of row that is found in the sheet.
def process_sheet_switch( sheet ):
    counts= defaultdict( int )
    object_iter = (
        builder(row)
        for row in sheet.schema.rows_as_dict_iter(sheet) )
    for source in object_iter:
        counts['read'] += 1
        if switch_1(source):
            processing_1(source)
            counts['switch_1'] += 1
        elif switch_2(source):
            processing_2(source)
            counts['switch_2'] += 1
        # elif ... additional switch functions ...
        else:
            counts['rejected'] += 1
    return counts
Each switch function is a simple boolean function like this.
def switch_1( source ):
    return some condition
The conditions may be trivial: source['column'] == value. The conditions may be more complex. It’s good to encapsulate them as distinct, named functions.
We may be able to build a simple mapping from switch function to process function.
switch_process = (
    (switch_1, processing_1),
    (switch_2, processing_2),
)
This allows us to write a generic sheet processing function.
def process_sheet_switch( sheet, mapping ):
    counts= defaultdict( int )
    object_iter = (
        builder(row)
        for row in sheet.schema.rows_as_dict_iter(sheet) )
    for source in object_iter:
        counts['read'] += 1
        processed= None
        for switch, process in mapping:
            if switch(source):
                processed= switch.__name__
                process( source )
        if processed:
            counts[processed] += 1
        else:
            counts['rejected'] += 1
    return counts
This can more easily be extended by adding to the switch_process mapping.
11.6.4. Lazy Switch Processing¶
The above two examples rely on building an iterator over intermediate objects. The object_iter generator builds objects eagerly. This may not always work for COBOL files. Here’s a variation that might be helpful.
We’ll decompose the builders so that the switch is applied first. Then the builder and processing can depend on the switch.
switch_build_process = (
    (switch_1, builder_1, processing_1),
    (switch_2, builder_2, processing_2),
)
This structure can be used with the following generic sheet processing.
def process_sheet_switch( sheet, mapping ):
    counts= defaultdict( int )
    for row in sheet.schema.rows(sheet):
        counts['read'] += 1
        processed= None
        for switch, builder, process in mapping:
            if switch(row):
                processed= switch.__name__
                source= builder( row )
                process( source )
        if processed:
            counts[processed] += 1
        else:
            counts['rejected'] += 1
    return counts
This is slightly more complex. It has the advantage of not attempting to process a row unless some initial sanity check has been done. Once the switch function determines the type of the row, then an appropriate builder can be invoked and the row processed.
In many cases, the processing starts with more complex data quality validation. If so, that can be added to the mapping. It would become a switch-builder-validator-process mapping that decomposes each step of the processing pipeline.
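A sketch of that four-part mapping, extending the generic loop above, might look like this. The validator functions are illustrative placeholders, not part of the earlier examples.
switch_build_validate_process = (
    (switch_1, builder_1, is_valid_1, processing_1),
    (switch_2, builder_2, is_valid_2, processing_2),
)

def process_sheet_pipeline( sheet, mapping ):
    counts= defaultdict( int )
    for row in sheet.schema.rows(sheet):
        counts['read'] += 1
        processed= None
        for switch, builder, validate, process in mapping:
            if switch(row):
                processed= switch.__name__
                source= builder( row )
                if validate( source ):
                    process( source )
                else:
                    counts['invalid'] += 1
        if processed:
            counts[processed] += 1
        else:
            counts['rejected'] += 1
    return counts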
11.7. Big Data Performance¶
We’ve broken our processing down into separate steps which work with generic Python data structures. The idea is that we can use multiprocessing to spread the pipeline into separate processors or cores.
Each stage of the Build, Validate, Construct sequence can be decomposed. We can read raw data from the source file, apply a switch and put the raw “Row” objects into a processing queue.
A builder process can dequeue row objects from the processing queue, apply a builder, and put objects into a validation queue.
A validator process can dequeue built objects (dictionaries, for example) and validate them. Invalid objects can be written to a queue for logging. Valid objects can be written to another queue for processing.
The final processing queue will get a sequence of valid objects all of a single type. The final processing might involve (slow) database transactions, and there may need to be multiple worker processes fetching from this queue.
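A simplified two-stage sketch using the standard library’s multiprocessing module is shown below. The queue wiring, sentinel handling, and single-worker layout are assumptions for illustration; builder, is_valid, process, and log are the functions described above, and whether row dictionaries can be pickled across process boundaries depends on the workbook implementation.
import multiprocessing

def builder_worker( row_queue, built_queue, builder ):
    # Dequeue raw rows, apply the builder, enqueue intermediate objects.
    for row in iter( row_queue.get, None ):
        built_queue.put( builder( row ) )
    built_queue.put( None )     # pass the sentinel downstream

def final_worker( built_queue ):
    # Dequeue built objects, validate, and do the (slow) final processing.
    for source in iter( built_queue.get, None ):
        if is_valid( source ):
            process( source )
        else:
            log.error( source )

def process_sheet_parallel( sheet, builder ):
    row_queue = multiprocessing.Queue()
    built_queue = multiprocessing.Queue()
    build_proc = multiprocessing.Process(
        target=builder_worker, args=(row_queue, built_queue, builder) )
    final_proc = multiprocessing.Process(
        target=final_worker, args=(built_queue,) )
    build_proc.start()
    final_proc.start()
    for row in sheet.schema.rows_as_dict_iter( sheet ):
        row_queue.put( row )
    row_queue.put( None )       # sentinel: no more rows
    build_proc.join()
    final_proc.join()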
11.8. File Naming and External Schema¶
Some data management discipline is also essential to be sure that the schema and file match up properly. Naming conventions and standardized directory structures are essential for working with external schema.
11.8.1. Well Known Formats¶
For well-known physical formats (.csv, .xls, .xlsx, .xlsm, .ods, .numbers) the filename extension describes the physical format. Additional information is required to determine the Logical Layout.
The schema may be loaded from column headers, in which case the binding is handled via an embedded schema loader. If the schema.loader.HeadingRowSchemaLoader is used, no more information is required. If a customized schema loader is used (because the headings are not trivially the first row of a sheet), then we must – somehow – bind the application to the external schema. The filename extension doesn’t really help with this. The best choice is to use a configuration file of some kind.
If the schema is external, and we’re working with a well-known physical format, then we have an even more complex binding issue. The schema will often require a customized schema loader. Each file must be associated with a schema file and a schema loader class name. File naming conventions won’t help. This, too, should rely on a configuration file.
11.8.2. Fixed Formats and COBOL¶
For fixed-format files, the filename extension does not describe the physical layout. Further, a fixed format schema must combine logical layout and physical format into a single description.
For fixed format files, the following conventions help bind a file to its schema.
- The data file extension is the base name of a schema file: mydata.someschema. Do not use .dat or something else uninformative.
- Schema files must be either a DDE file or a workbook in a well-known format: someschema.cob or someschema.xlsx.
Examples. We might see the following file names.
september_2001.exchange_1
november_2011.some_dde_name
october_2011.some_dde_name
exchange_1.xls
some_dde_name.cob
The september_2001.exchange_1 file is a fixed format file which requires the exchange_1.xls metadata workbook. The november_2011.some_dde_name and october_2011.some_dde_name files are fixed format files which require the some_dde_name.cob metadata. A sketch of resolving the schema file from a data file name follows.
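Under this convention, resolving the schema file from a data file name is a small exercise in path manipulation. A sketch, assuming the schema workbook or DDE sits in the same directory as the data file and that the preferred extensions are checked in a fixed order:
from pathlib import Path

def schema_for( data_file_name ):
    # "september_2001.exchange_1" -> schema base name "exchange_1"
    base = Path( data_file_name ).suffix.lstrip('.')
    for ext in ('.cob', '.xlsx', '.xls'):
        candidate = Path( data_file_name ).with_name( base + ext )
        if candidate.exists():
            return candidate
    raise FileNotFoundError( "No schema found for {0}".format(data_file_name) )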
11.8.3. External Schema Workbooks¶
A workbook with an external schema sheet must adhere to a few conventions to be usable. These rules form the basis for the schema.loader.BareExternalSchemaLoader class. To change the rules, extend that class. An illustrative meta-schema example follows the rules below.
- The standard sheet name "Schema" defines the appropriate sheet.
- There must be an internal meta-schema on line one of the sheet that provides the expected column names.
- The column names “name”, “offset”, “size”, “type” are used.
- Only “name” is required in general.
- For fixed format files, “offset”, “size” and “type” will also be required.
- Additional columns are allowed, but will be ignored.
- Type definitions are “text”, “number”, “date” and “boolean”.
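For a fixed-format file, the rows of such a "Schema" sheet might look like the following. The field names, offsets, and sizes are purely illustrative; the rows are shown here as Python dictionaries keyed by the required column names.
schema_sheet_rows = [
    {'name': 'account', 'offset': 0,  'size': 10, 'type': 'text'},
    {'name': 'balance', 'offset': 10, 'size': 12, 'type': 'number'},
    {'name': 'opened',  'offset': 22, 'size': 8,  'type': 'date'},
    {'name': 'active',  'offset': 30, 'size': 1,  'type': 'boolean'},
]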
11.9. Binding a Schema to an Application¶
We would like to be sure that our application’s expectations for a schema are aligned with the schema actually present. An application has several ways to bind its schema information.
- Implicitly. The code simply mentions specific columns (either by name or position).
- Explicitly. The code has a formal “requires” check to be sure that the schema provided by the input file actually matches the schema required by the application.
Explicit schema binding parallels the configuration management issue of module dependency. A file can be said to provide a given schema and an application requires a given schema.
Sadly, we don’t always have a pithy summary of a schema. We can’t trivially examine a file to be sure it conforms to a schema. In the case of well-known file formats with an embedded schema, we can do a test like this to determine if the schema is what we expect.
all( req in schema for req in ('some', 'list', 'of', 'columns') )
This covers 80% of the use cases. For all other cases, we don’t have a reliable way to confirm the file’s schema.
If we don’t implement this, we’re left with implicit schema binding in our applications. In short, we have to include data validity checks, a debugging log, and some kind of warning technique.
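An explicit “requires” check can be as small as the following sketch, which combines the membership test above with a warning log. The required column names are the placeholders from the earlier snippet, and the attribute iteration follows the row_dict example in this chapter.
import logging

logger = logging.getLogger( __name__ )

REQUIRED_COLUMNS = ('some', 'list', 'of', 'columns')   # illustrative names

def schema_matches( schema, required=REQUIRED_COLUMNS ):
    # Warn, rather than fail, when the file's schema lacks an expected column.
    names = set( attr.name for attr in schema )
    missing = [col for col in required if col not in names]
    if missing:
        logger.warning( "Schema is missing columns: %s", missing )
    return not missing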
11.10. Schema Version Numbering¶
XSDs can have version numbers. This is very cool.
See http://www.xfront.com/Versioning.pdf for detailed discussion of how to represent schema version information.
Databases, however, lack version numbering in the schema. This leads to potential compatibility issues between application programs that expect version 3 of the schema and an older database that implements version 2 of the schema.
Our file schemata, similarly, don’t have a tidy, unambiguous version numbering.
For external schema, we can embed the version in the file names. We might want to use something like this: econometrics_vendor_1.2. This identifies the generic type of data, the source for that file, and the schema version number.
Within a SQL database, we can easily use the schema name to carry version information. We could have a name_version kind of convention for all schemata, allowing an application to confirm schema compatibility with a trivial SQL query.
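For example, with a DB-API connection, the check could be a single catalog query. This is a sketch only: the schema name is an assumption, and the catalog view and parameter style vary by database.
def schema_version_ok( connection, expected='econometrics_3' ):
    # Assumes an ANSI information_schema and qmark-style parameters.
    cursor = connection.cursor()
    cursor.execute(
        "SELECT schema_name FROM information_schema.schemata"
        " WHERE schema_name = ?",
        (expected,) )
    return cursor.fetchone() is not None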
For embedded schema, however, we have no easy way to handle schema identification and version numbering. We’re forced to build an algorithm to examine the actual names in the embedded schema to deduce the version.
This problem with embedded schema leads to using data profiling to reason out what the file is. This may devolve to a manual examination of the data profiling results to allow a human to determine the schema. Then, once the schema has been identified, command-line options can be used to bind the schema to the file for correct processing.
11.11. Frequently Asked Questions¶
11.11.1. Junk Data¶
For inexplicable reasons, we can wind up with files that are damaged in some way.
“there is a 65-byte “header” at the start of the file, what would be the best (least hacky) way to skip over the first 65 bytes?”
This is one of the reasons why we use both a file name and an open file object as arguments for opening a workbook.
with open("file_with_junk.some_schema","rb") as cobol:
    cobol.seek( 65 )
    wb = stingray.cobol.EBCDIC_File( cobol.name, file_object=cobol )
This should permit skipping past the junk.