The Stingray Schema-Based File Reader

The Stingray Reader tackles four fundamental issues in processing a file:

  • How are the bytes organized? What is the Physical Format?
  • How are the data objects organized? What is the Logical Layout?
  • What do the bytes mean? What is the Conceptual Content?
  • How can we assure ourselves that our applications will work with this file?

The difficulty is that a schema is not always bound to a given file, nor is it clearly bound to an application program. Two examples of this separation between schema and content:

  • We might have a spreadsheet where there aren’t even column titles.
  • We might have a pure data file (for example from a legacy COBOL program) which is described by a separate schema.

One goal of good software is to cope reasonably well with variability of user-supplied inputs. Providing data by spreadsheet is often the most desirable choice for users. In some cases, it’s the only acceptable choice. Since spreadsheets are tweaked manually, they may not have a simple, fixed schema or logical layout.

A workbook (the container of individual “spread sheets”) can be encoded in any of a number of physical formats: XLS, CSV, XLSX, and ODS, to name a few. We would like our applications to be independent of these physical formats, so that we can focus on the logical layout.

Data supplied in the form of a workbook can suffer from numerous data quality issues. We need to be assured that a file actually conforms to a required schema.

The TODO List

Todo

Test hashable interface of Cell

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cell.rst, line 211.)

Todo

Support Occurs Depending On

Read files with variable length records – they have record headers and possibly block headers.
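As a rough illustration of what reading such records involves, the sketch below iterates over records that each begin with a record header. It assumes the common IBM convention of a 4-byte Record Descriptor Word: a 2-byte big-endian length that includes the header itself, followed by 2 reserved bytes. Block headers are not handled. The function name rdw_records is hypothetical and is not part of Stingray.

```python
import io
import struct
from typing import BinaryIO, Iterator


def rdw_records(source: BinaryIO) -> Iterator[bytes]:
    """Yield the payload of each record in a stream of RDW-prefixed records.

    Assumption: each record starts with a 4-byte header holding a 2-byte
    big-endian length (header included) and 2 reserved bytes.
    """
    while True:
        header = source.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated record header")
        length, _reserved = struct.unpack(">HH", header)
        if length < 4:
            raise ValueError(f"invalid record length {length}")
        payload = source.read(length - 4)
        if len(payload) != length - 4:
            raise ValueError("truncated record body")
        yield payload


# Two records of different lengths, built in memory for the example.
sample = struct.pack(">HH", 7, 0) + b"ABC" + struct.pack(">HH", 6, 0) + b"XY"
records = list(rdw_records(io.BytesIO(sample)))
```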

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol.rst, line 347.)

Todo

Support Occurs Depending On

In order to fetch data for ODO, the attribute offsets and sizes cannot all be computed in advance during parsing.

They must be computed lazily during data fetching.
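A minimal sketch of the lazy approach, assuming a counter field stored as ASCII display digits at a known offset. The Field class and the helper functions are hypothetical stand-ins, not Stingray classes. The point is that the size of the table, and therefore the offset of whatever follows it, is only known once a record is in hand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    """Hypothetical field description; not Stingray's DDE or Attribute."""
    name: str
    offset: int                            # offset of the first occurrence
    size: int                              # bytes in one occurrence
    occurs: int = 1                        # fixed OCCURS count
    depends_on: Optional["Field"] = None   # ODO counter field, if any


def occurrences(field: Field, record: bytes) -> int:
    """Use the fixed count, or read the counter from live data for ODO."""
    if field.depends_on is None:
        return field.occurs
    counter = field.depends_on
    raw = record[counter.offset:counter.offset + counter.size]
    return int(raw.decode("ascii"))        # assumes ASCII display digits


def total_size(field: Field, record: bytes) -> int:
    """Total bytes occupied by all occurrences in this particular record."""
    return field.size * occurrences(field, record)


# 05 ITEM-COUNT PIC 99.
# 05 ITEM PIC X(3) OCCURS 1 TO 5 DEPENDING ON ITEM-COUNT.
item_count = Field("ITEM-COUNT", offset=0, size=2)
item = Field("ITEM", offset=2, size=3, depends_on=item_count)

record = b"02AAABBBZ"
tail_offset = item.offset + total_size(item, record)  # computed at fetch time
```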

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol.rst, line 396.)

Todo

Fix dump()

  1. Doesn’t work properly in the presence of redefines. Indeed, it’s not clear what it might mean – the data might be invalid when there’s a redefines. Only a hex dump might be sensible.
  2. This approach doesn’t seem to make sense for nested OCCURS. How do we display a more complex structure?

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol.rst, line 587.)

Todo

refactor

Move the following sections to cobol_defs

9.6. Essential Class Definitions

  • 9.6.1. Usage Strategy Hierarchy
  • 9.6.2. Allocation Strategy Hierarchy
  • 9.6.3. Occurs Strategy Hierarchy
  • 9.6.4. Location Calculation Strategy
  • 9.6.5. DDE Class

9.7. DDE Preparation Processing

9.8. Set Size and Offset

9.9. Dump a Record

Put a “from stingray.cobol.cobol_defs import *” statement here.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_defs.rst, line 283.)

Todo

Support Occurs Depending On

The Allocation strategy hierarchy does not yet handle ODO.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_defs.rst, line 483.)

Todo

Support Occurs Depending On

For a successor, we should use the predecessor in the refers_to field to track down the offset of the predecessor.

Our offset is predecessor offset + predecessor total size.

The predecessor may have to do some thinking to get its total size or offset because of an Occurs Depending On situation.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_defs.rst, line 602.)

Todo

Support Occurs Depending On

Computing the total size of an Occurs Depending On item requires a record with live data.

Otherwise, the total size is trivially computed from the DDE definition.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_defs.rst, line 623.)

Todo

Support Occurs Depending On

For the first item in a group, we should use the group parent in the dde field to track down the offset of the group we’re a member of.

Our offset is the group offset, since we’re first.

The group may have to do some thinking to get its predecessor’s total size or offset because of an Occurs Depending On situation.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_defs.rst, line 656.)

Todo

Rewrite for ODO

Computing the total size of an Occurs Depending On item requires a record with live data.

Otherwise, the total size is trivially computed from the DDE definition.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_defs.rst, line 681.)

Todo

Support Occurs Depending On

The number attribute must be derived EITHER from the definition or a data record. We need to bind this to a record.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_defs.rst, line 708.)

Todo

Refactor Occurs.number()

There’s a disconnect here because OCCURS works on the DDE. Data access works on the Attribute. There’s no linkage from DDE to Attribute.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_defs.rst, line 713.)

Todo

refactor setSizeAndOffset()

Refactor setSizeAndOffset() into the Allocation class methods to remove isinstance() nonsense.
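One possible shape for this refactoring is sketched below: each allocation strategy places its own item, so the driving loop never inspects types. All of the names here (AllocationSketch, Sequential, Redefines, Item, set_size_and_offset) are illustrative assumptions and do not reflect the current Stingray class definitions. The sketch also assumes a redefining item is no larger than the item it redefines.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class AllocationSketch:
    """Hypothetical strategy base: decides where an item is placed."""

    def place(self, item: "Item", next_free: int) -> int:
        """Set item.offset and return the next free offset."""
        raise NotImplementedError


class Sequential(AllocationSketch):
    """The item follows whatever was placed before it."""

    def place(self, item: "Item", next_free: int) -> int:
        item.offset = next_free
        return next_free + item.size


class Redefines(AllocationSketch):
    """The item overlays an earlier item and consumes no new space."""

    def __init__(self, target: "Item") -> None:
        self.target = target

    def place(self, item: "Item", next_free: int) -> int:
        item.offset = self.target.offset
        return next_free


@dataclass
class Item:
    name: str
    size: int
    allocation: AllocationSketch = field(default_factory=Sequential)
    offset: Optional[int] = None


def set_size_and_offset(items: List[Item], start: int = 0) -> int:
    """Place every item by delegating to its strategy; no isinstance()."""
    next_free = start
    for item in items:
        next_free = item.allocation.place(item, next_free)
    return next_free


a = Item("A", 4)
b = Item("B", 4, Redefines(a))   # B REDEFINES A
c = Item("C", 2)
total = set_size_and_offset([a, b, c])
```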

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_defs.rst, line 1154.)

Todo

Support Occurs Depending On

Finish setSizeAndOffset() for ODO situations. Get occurs.number from the data record when necessary.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_defs.rst, line 1159.)

Todo

Support Occurs Depending On

The syntax is more complex: OCCURS [int TO] int [TIMES] DEPENDING [ON] name
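A minimal sketch of recognizing this clause with a regular expression follows. It is not the Stingray parser, which works on tokens; the names ODO_PATTERN, OccursClause and parse_odo are hypothetical. The optional words are treated exactly as the syntax line above shows them.

```python
import re
from typing import NamedTuple, Optional

ODO_PATTERN = re.compile(
    r"""
    OCCURS\s+
    (?:(?P<low>\d+)\s+TO\s+)?      # optional lower bound: int TO
    (?P<high>\d+)                  # required count or upper bound
    (?:\s+TIMES)?                  # optional noise word
    \s+DEPENDING
    (?:\s+ON)?                     # optional noise word
    \s+(?P<name>[A-Za-z0-9][A-Za-z0-9-]*)
    """,
    re.VERBOSE | re.IGNORECASE,
)


class OccursClause(NamedTuple):
    low: Optional[int]
    high: int
    depends_on: str


def parse_odo(text: str) -> Optional[OccursClause]:
    """Return the parsed clause, or None when there is no DEPENDING ON."""
    match = ODO_PATTERN.search(text)
    if match is None:
        return None
    low = match.group("low")
    return OccursClause(
        low=int(low) if low is not None else None,
        high=int(match.group("high")),
        depends_on=match.group("name").upper(),
    )


clause = parse_odo("05 ITEM PIC X(3) OCCURS 1 TO 10 TIMES DEPENDING ON ITEM-COUNT.")
```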

This leads to variable positions for items which follow the occurs clause, based on the name value.

This means that the offset is not necessarily fixed when there’s a complex ODO. We’ll have to make offset (and size) a property that has one of two strategies.

  • Statically Located. The base case where offsets are static.
  • Variably Located. The complex ODO situation where there’s an ODO in the record. All ODO “depends on” fields become part of the offset calculation. This means we need an index for depends on clauses.

The technical buzzphrase is “a data item following, but not subordinate to, a variable-length table in the same level-01 record.”

See http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp?topic=%2Fcom.ibm.aix.cbl.doc%2Ftptbl27.htm

These are the “Appendix D, Complex ODO” rules.

What has to be done to Stingray is this.

  1. There are three species of relationships between DDE elements: Predecessor/Successor, Parent/Child (or Group/Elementary), and Redefines. Currently, the pred/succ relationship is implied by the parent having a sequence of children. We can’t easily find a predecessor without a horrible O(n) search.
  2. There are two strategies for doing offset/size calculations.
       • Statically Located. What’s in place now: get the information from the attribute. This has to be extracted into a top-level DDE object.
       • Variably Located. A calculation based on live data so that ODO works. This object will be plugged into the top-level DDE when an ODO is found.
  3. The current static calculation can still be done if there is no ODO. As soon as an ODO shows up, then the calculation strategy switches to the Variably Located instance.

The existing record dump and other features should still work for statically located records. For Variably Located records, the calculations all raise an exception or return None or something awkward like that.

The offset calculation can be seen as a recursive trip “up” the tree following redefines, predecessor and parent relationships (in that order) to calculate the size of everything prior to the element in question. We could make offset and total size into properties which do this recursive calculation.
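The recursive trip can be sketched as follows. This is only an illustration under several assumptions: the Node class is hypothetical and is not Stingray’s DDE, the ODO counter is stored as ASCII display digits, a redefining item is no larger than its target, and only the offset of the first occurrence of an item is computed. Each node keeps an explicit predecessor link, so no search through the parent’s children is needed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical DDE-like element with explicit relationship links."""
    name: str
    picture_size: int = 0                  # elementary size; unused for groups
    occurs: int = 1
    depends_on: Optional["Node"] = None
    redefines: Optional["Node"] = None
    parent: Optional["Node"] = None
    predecessor: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child; link it to the last child that occupies space."""
        child.parent = self
        placed = [c for c in self.children if c.redefines is None]
        if placed:
            child.predecessor = placed[-1]
        self.children.append(child)
        return child

    def count(self, record: bytes) -> int:
        if self.depends_on is None:
            return self.occurs
        start = self.depends_on.offset(record)
        raw = record[start:start + self.depends_on.picture_size]
        return int(raw.decode("ascii"))

    def size(self, record: bytes) -> int:
        """Size of one occurrence; groups sum their non-redefining children."""
        if self.children:
            return sum(c.total_size(record)
                       for c in self.children if c.redefines is None)
        return self.picture_size

    def total_size(self, record: bytes) -> int:
        return self.size(record) * self.count(record)

    def offset(self, record: bytes) -> int:
        """Follow redefines, then predecessor, then parent, in that order."""
        if self.redefines is not None:
            return self.redefines.offset(record)
        if self.predecessor is not None:
            return (self.predecessor.offset(record)
                    + self.predecessor.total_size(record))
        if self.parent is not None:
            return self.parent.offset(record)
        return 0


rec = Node("REC")
count = rec.add(Node("ITEM-COUNT", picture_size=1))
item = rec.add(Node("ITEM", picture_size=2, depends_on=count))
tail = rec.add(Node("TAIL", picture_size=3))
alt = rec.add(Node("TAIL-ALT", picture_size=3, redefines=tail))
```

With the record b"3AABBCCEND", the counter is 3, so TAIL starts at offset 7; with b"1AAEND", it starts at offset 3.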

The “size” of an elementary item is still simply based on the picture. For group items, however, size becomes based on total size which, in turn, may be based on ODO data.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_loader.rst, line 292.)

Todo

88-level items could create boolean-valued properties.
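One way this might look is sketched below: each 88-level condition name becomes a read-only property that is true when the parent field holds one of the listed values. The condition helper and the AccountRow class are hypothetical, and VALUE ranges (THRU) are not handled.

```python
def condition(field_name: str, *values: str) -> property:
    """Build a read-only boolean property for an 88-level condition name."""
    def test(self) -> bool:
        return getattr(self, field_name) in values
    return property(test)


class AccountRow:
    """Hypothetical row for:
        05 STATUS PIC X.
           88 ACTIVE VALUE 'A'.
           88 CLOSED VALUE 'C' 'X'.
    """
    active = condition("status", "A")
    closed = condition("status", "C", "X")

    def __init__(self, status: str) -> None:
        self.status = status


row = AccountRow("X")
is_closed = row.closed
```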

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_loader.rst, line 349.)

Todo

Support Occurs Depending On

Enable the OCCURS Format 2 parsing. Fix unit tests.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_loader.rst, line 646.)

Todo

Handle more complex VALUE clause.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_loader.rst, line 789.)

Todo

Support Occurs Depending On

The set size and offset is constrained by the presence of an Occurs Depending On. What do we do in the RecordFactory.decorate() for ODO records? Do we return a status code? Raise an exception? Just ignore it?

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/cobol_loader.rst, line 967.)

Todo

Index by name and path, also.

The simple positional schema isn’t really appropriate for all purposes. For COBOL and fixed format files with external schema, we often must process things lazily by field name.

This is unlike spreadsheets where we can process all fields eagerly and in order.
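A minimal sketch of a schema that supports positional, by-name and by-path lookup is shown here. The Attr and IndexedSchema names, and the dotted path convention, are assumptions made for illustration and are not Stingray’s Schema API. Because COBOL allows the same elementary name under different groups, a name maps to a list and an ambiguous name is rejected.

```python
from typing import Dict, Iterable, List, NamedTuple, Union


class Attr(NamedTuple):
    name: str     # elementary name, possibly repeated in other groups
    path: str     # dotted path from the record root, assumed unique
    offset: int
    size: int


class IndexedSchema:
    """Hypothetical schema with position, name and path indexes."""

    def __init__(self, attributes: Iterable[Attr]) -> None:
        self.attributes: List[Attr] = list(attributes)
        self.by_path: Dict[str, Attr] = {a.path: a for a in self.attributes}
        self.by_name: Dict[str, List[Attr]] = {}
        for attr in self.attributes:
            self.by_name.setdefault(attr.name, []).append(attr)

    def get(self, key: Union[int, str]) -> Attr:
        if isinstance(key, int):
            return self.attributes[key]
        if key in self.by_path:
            return self.by_path[key]
        matches = self.by_name.get(key, [])
        if len(matches) != 1:
            raise KeyError(f"{key!r} matches {len(matches)} attributes")
        return matches[0]


schema = IndexedSchema([
    Attr("NAME", "CUSTOMER.NAME", 0, 10),
    Attr("ZIP", "CUSTOMER.HOME.ZIP", 10, 5),
    Attr("ZIP", "CUSTOMER.WORK.ZIP", 15, 5),
])

record = b"JANE DOE  1234567890"
work_zip = schema.get("CUSTOMER.WORK.ZIP")
value = record[work_zip.offset:work_zip.offset + work_zip.size]
```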

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/schema.rst, line 322.)

Todo

Support Occurs Depending On

The offset may not be constant.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/schema.rst, line 364.)

Todo

Test an EBCDIC file in V format with Occurs Depending On to show the combination.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/testing/cobol.rst, line 401.)

Todo

Test EXTERNAL, GLOBAL as Skipped Words, too.

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/testing/cobol_loader.rst, line 936.)

Todo

Implement Numbers_Workbook

(The original entry is located in /Users/slott/Documents/Projects/Stingray-4.1/source/workbook.rst, line 951.)
