10.2.1. COBOL Package – Extend Schema to Handle EBCDIC¶
The COBOL package is a (large) Python __init__.py
module which
includes much of the public API for working with COBOL files.
This module extends Stingray in several directions.
- A new
schema.Attribute
subclass,cobol.RepeatingAttribute
. - A handy
cobol.dump()
function. - The hierarchy of classes based on
cobol.COBOL_File
which provide more sophisticated COBOL-based workbooks.
Within the package we have the cobol.loader
module which parses DDE’s
to create a schema.
10.2.1.1. Module Overheads¶
We depend on cell
, schema
, and workbook
.
We’ll also import one class definition from cobol.defs
.
"""stingray.cobol -- Extend the core Stingray definitions to handle COBOL
DDE's and COBOL files, including packed decimal and EBCDIC data.
"""
import codecs
import struct
import decimal
import warnings
import pprint
import logging
import stingray.schema
import stingray.sheet
from stingray.workbook.fixed import Fixed_Workbook
from stingray.cobol.defs import TextCell
10.2.1.2. RepeatingAttribute Subclasses of Attribute¶
Two new schema.Attribute
subclasses are required to carry all the
additional attribute information developed during COBOL DDE parsing.
An attribute that has an OCCURS
clause (or who’s parent has an OCCURS
clause)
can accept an cobol.RepeatingAttribute.index()
method to provide index values used to compute
effective offsets.
There are two variants.
- The initial, immutable,
cobol.RepeatingAttribute
as parsed. - A working
cobol.IndexedAttribute
. This is a subclass ofcobol.RepeatingAttribute
and it contains partial or complete indexing. Partial indexing means that a tuple is built bycobol.COBOL_File.row_get()
. Full indexing means that a singleCell
can be built.
http://yuml.me/diagram/scruffy;/class/
#cobol.attribute,
[Attribute]^[RepeatingAttribute],
[Schema]<>-[Attribute],
[Fixed_Workbook]-uses->[Attribute],
[Fixed_Workbook]^[COBOL_File],
[COBOL_File]-uses->[RepeatingAttribute].
In order to fetch data for an ODO OCCURS
element, the attribute offsets and sizes
cannot all be computed during parsing.
They must be computed lazily during data fetching. The cobol.ODO_LazyRow
class handles the Occurs Depending On situation.
Here are the attributes inherited from schema.Attribute
.
-
cobol.
name
¶ The attribute name. Typically always available for most kinds of schema.
-
cobol.
create
¶ Cell class to create. If omitted, the class-level
Attribute.default_cell
will be used. By default, this refers tocell.TextCell
.
-
cobol.
position
¶ Optional sequential position. This is set by the
schema.Schema
that contains this object.
The additional values commonly provided by simple fixed format file schemata. These can’t be treated as simple values, however, since they’re clearly changed based on the ODO issues.
-
cobol.
size
¶ Size within the buffer.
These two properties over overridden by the cobol.IndexedAttribute
subclass;
this is created by the cobol.RepeatingAttribute.index()
method.
The superclass versions are simple a delegation to the DDE.
If cobol.RepeatingAttribute.index()
is used, the subclass object is built
where these values come from the index
method results.
-
cobol.
dimensionality
¶ A tuple of DDE’s that defines the dimensionality pushed down to this item through the COBOL DDE hierarchy.
This meay be set by the
cobol.RepeatingAttribute.index()
method.
-
cobol.
offset
¶ Optional offset into a buffer. This may be statically defined, or it may be dynamic because of variably-located data supporting the Occurs Depends On.
This meay be set by the
cobol.RepeatingAttribute.index()
method.
This subclass introduces yet more attribute-like properties that simply delegate to the DDE.
-
cobol.
dde
¶ A weakref to a
cobol.defs.DDE
object.
-
cobol.
path
¶ The
"."
-separated path from top-level name to this element’s name.
-
cobol.
usage
¶ The original DDE.usage object, an instance of
cobol.defs.Usage
-
cobol.
redefines
¶ The original DDE.allocation object, an instance of
cobol.defs.Allocation
-
cobol.
picture
¶ The original DDE.picture object, an instance of
cobol.loader.Picture
-
cobol.
size_scale_precision
¶ The original DDE.sizeScalePrecision object, a tuple with size, scale and precision derived from the picture.
-
class
cobol.
RepeatingAttribute
¶ An attribute with dimensionality. Not all COBOL items repeat.
class RepeatingAttribute( stingray.schema.Attribute ):
"""An attribute with dimensionality. Not all COBOL items repeat.
An "OCCURS" clause will define repeating values.
An "OCCURS DEPENDING ON" clause may define variably located values.
"""
default_cell= TextCell
def __init__(self, name, dde, offset=None, size=None, create=None, position=None, **kw):
self.dde= dde
self.name, self.size, self.create, self.position = name, size, create, position
if not self.create:
self.create= self.default_cell
if offset is not None:
warnings.warn( "Offset {0} is ignored; {1} used".format(offset, self.dde().offset), stacklevel=2 )
self.__dict__.update( kw )
def __repr__( self ):
dim= ", ".join( map( repr, self.dimensionality ) )
return "Attribute( name={0.name!r}, position={0.position}, offset={0.offset}, size={0.size}, dimensionality=({1}) )".format(
self, dim )
-
RepeatingAttribute.
index
(*values)¶ If the number of index values matches the dimensionality, we’ll return a tweaked attribute which has just the offset required and a dimensionality of
tuple()
.If the number of index values is insufficient, we’ll return a tweaked attribute with which has the starting offset and the dimensions left otherwise unspecified.
If the number of index values is excessive, we’ll attempt to pop from an empty list.
Note that
cobol.RepeatingAttribute.index()
is applied incrementally when the application supplies some of the indices.- First, an application can supply some of the indices, creating
cobol.IndexedAttribute
with an initial offset. - Second, the
COBOL_File
will supply any remaining indices, creating yet more temporarycobol.IndexedAttribute
based on the initial offset.
- First, an application can supply some of the indices, creating
def index( self, *values ):
""""Apply possibly incomplete index values to an attribute.
We do this by cloning this attribute and setting a modified
dimensionality and offset.
:param values: 0-based index values. Yes, legacy COBOL language is 1-based.
For Python applications, zero-based makes more sense.
:returns: A :py:class:`cobol.IndexedAttribute` copy, with modified offset
and dimensionality that can be used with :py:meth:`COBOL_File.row_get`.
"""
assert values, "Missing index values"
# Original values for a RepeatingAttribute
# Modified values for an IndexedAttribute
offset= self.offset
dim_list= list(self.dimensionality)
# Apply given index values.
val_list= list(values)
while val_list:
index= val_list.pop(0)
dim= dim_list.pop(0)
offset += dim.size * index
# Build new subclass object with indexes applied.
clone= IndexedAttribute( self, offset, dim_list )
return clone
With this, a row.cell(schema.get('name').index(i))
will compute a proper offset.
We “clone” the attribute to assure that each time we apply (or don’t apply) the index, nothing stateful will have happened to the original immutable attribute definition.
Note that an incomplete set of index values forces the underlying
workbook to create a Python tuple (or tuple of tuples) structure to
contain all the requested values. See cobol.COBOL_File.row_get()
.
The additional properties which are simply shortcuts so that a
generic cobol.RepeatingAttribute
has access to the DDE details.
@property
def dimensionality(self):
"""tuple of parent DDE's. Baseline value; no indexes applied."""
return self.dde().dimensionality
@property
def offset(self):
"""Baseline value; no indexes applied."""
return self.dde().offset
@property
def path(self):
return self.dde().pathTo()
@property
def usage(self):
return self.dde().usage
@property
def redefines(self):
return self.dde().allocation
@property
def picture(self):
return self.dde().picture
@property
def size_scale_precision(self):
return self.dde().sizeScalePrecision
-
class
cobol.
IndexedAttribute
¶ The IndexedAttribute is a subclass of
cobol.RepeatingAttribute
with (some) indices applied. Since this inherits thecobol.RepeatingAttribute.index()
method, we can apply indices incrementally.This class is not built directly, but only created by
cobol.RepeatingAttribute.index()
with some (or all) indices applied.
class IndexedAttribute( RepeatingAttribute ):
"""An attribute with dimensionality and indexes applied.
This must be built from a :py:class:`cobol.RepeatingAttribute`. It will copy
some attributes in an effort to somewhat improve efficiency.
"""
default_cell= TextCell
def __init__(self, base, offset, dimensionality ):
self.dde= base.dde
self.name, self.size, self.create, self.position = base.name, base.size, base.create, base.position
self._offset= offset
self._dimensionality= dimensionality
@property
def dimensionality(self):
"""tuple of DDE's; Set by ``attribute.index()``."""
return self._dimensionality
@property
def offset(self):
"""Set by ``attribute.index()``."""
return self._offset
10.2.1.3. COBOL LazyRow¶
The sheet.LazyRow
class is blissfully unaware of the need to compute
sizes and offsets for COBOL.
-
class
cobol.
ODO_LazyRow
¶ This subclass of
sheet.LazyRow
to provide add the feature to recompute sizes and offsets in the case of a variable-located DDE due to an Occurs Depending On.
class ODO_LazyRow( stingray.sheet.LazyRow ):
"""If the DDE is variably-located, tweak the sizes and offsets."""
def __init__( self, sheet, **state ):
"""Build the row from the bytes.
:param sheet: the containing sheet.
:param **state: worksheet-specific state value to save.
"""
super().__init__( sheet, **state )
for dde in self.sheet.schema.info.get('dde',[]):
if dde.variably_located:
dde.setSizeAndOffset(self)
self._size= dde.totalSize
else:
self._size= len(self._state['data'])
10.2.1.4. Dump a Record¶
-
cobol.
dump_iter
(aDDE, aRow)¶ To support dumping raw data from a record, this will iterate through all items in an original DDE. It will a five-tuple with (dde, attribute, indices, bytes, Cell) for each DDE.
If the DDE does not have an OCCURS clause, the indices will be an empty tuple. Otherwise, each individual combination will be yielded. For big, nested tables, this may turn out to be a lot of combinations.
The bytes is the raw bytes for non-FILLER and non-group elements.
The Cell will be a Cell object, either with valid data or an
cobol.defs.ErrorCell
.
def dump_iter( aDDE, aRow ):
"""Yields iterator over tuples of (dde, attribute, indices, bytes, Cell)"""
def expand_dims( dimensionality, partial=() ):
if not dimensionality:
yield partial
return
top = dimensionality[0]
rest= dimensionality[1:]
for i in range(top):
for e in expand_dims( rest, partial+(i,) ):
yield e
attr= aDDE.attribute() # Final size and offset details
if aDDE.dimensionality:
for indices in expand_dims( aDDE.dimensionality ):
yield aDDE, aDDE.attribute, indices, aRow.cell(attr,indices).raw, aRow.cell(attr,indices)
elif aDDE.picture and aDDE.name != "FILLER":
yield aDDE, aDDE.attribute(), (), aRow.cell(attr).raw, aRow.cell(attr)
else: # FILLER or group level without a picture: no data is available
yield aDDE, aDDE.attribute, (), None, None
for child in aDDE.children:
#pprint.pprint( child )
for details in dump_iter( child, aRow ):
yield details
-
cobol.
dump
(schema, row)¶ Dump data from a record, driven by the original DDE structure.
def dump( schema, aRow ):
print( "{:45s} {:3s} {:3s} {!s} {!s}".format("Field", "Pos", "Sz", "Raw", "Cell" ) )
for record in schema.info['dde']:
for aDDE, attr, indices, raw_bytes, cell in dump_iter(record, aRow):
print( "{:45s} {:3d} {:3d} {!r} {!s}".format(
aDDE.indent*' '+str(aDDE), aDDE.offset, aDDE.size,
raw_bytes, cell) )
10.2.1.5. COBOL “Workbook” Files¶
A COBOL file is – in effect – a single-sheet workbook with an external schema.
It looks, then, a lot like workbook.Fixed_Workbook
.
- A pure character file, encoded UNICODE characters in some standard encoding like UTF-8 or UTF-16. This cannot include COMP or COMP-3 fields because the codec would make a mess of the bit patterns.
- An EBCDIC-encoded byte file. This can include COMP or COMP-3 fields.
- An ASCII-encoded byte file. This can include COMP or COMP-3 fields. While this may exist, it seems to be very rare. We don’t implement it.
Note that each cell creation involves two features. This leads to a kind of Double Dispatch algorithm.
- The cell type.
cobol.defs.TextCell
,cobol.defs.NumberDisplayCell
,cobol.defs.NumberComp3Cell
orcobol.defs.NumberCompCell
. - The workbook encoding type. Character or EBCDIC (or ASCII).
The issue here is we’re stuck with a complex “double-dispatch” problem.
Each workbook subclass needs to implement methods for get_text
, number_display
,
number_comp
and number_comp3
.
The conversions, while tied to the workbook encoding, aren’t properly tied to stateful sheet and row processing in the workbook. They’re just bound to the encoding. Consequently, we can make them static methods, possibly even making this a mixin strategy.
The common use case looks like this.
- The application uses
row.cell( schema[n] )
to fetch acell.Cell
. Thecobol.ODO_LazyRow.cell()
method is simplysheet.workbook.row_get( buffer, attribute )
. It applies the cell type (via the schema item’s attribute) and the raw data in the row’s buffer. - The workbook
row_get( buffer, attribute )
has to do the following.- Convert the buffer into a proper value based on the
attribute
type information and the worksheet-specific methods for unpacking the various types of data. The variouscobol
Cell subclasses can refer to the proper conversion methods. - Create the required
cell.Cell
based on theattribute.create
function. Seeschema.Attribute
.
- Convert the buffer into a proper value based on the
There’s a less common use case to extract a subset of row bytes to populate a separate 01-level definition that’s not tied to the Workbook’s schema.
- The application uses
subrow= row.data( schema[n], other_schema )
to fetch some bytes that can be used to create a new LazyRow tied to a different schema. - The application uses
subrow.cell( subschema[m] )
to fetch acell.Cell
. This doesn’t go back to the original workbook, it goes to this “subrow” of the workbook.
http://yuml.me/diagram/scruffy;/class/
#cobol,
[Fixed_Workbook]^[COBOL_File],
[COBOL_File]^[Character_File],
[COBOL_File]^[EBCDIC_File].
10.2.1.5.1. COBOL File¶
-
class
cobol.
COBOL_File
¶ This class introduces the expanded version of
row_get
that honors a schema attribute with dimensionality.
class COBOL_File( Fixed_Workbook ):
"""A COBOL "workbook" file which uses :py:class:`cobol.RepeatingAttribute` and
creates COBOL Cell values. This is an abstraction which
lacks specific decoding methods.
This is a :py:class:`Fixed_Workbook`: a file with fixed-sized, no-punctuation fields.
A schema is required to parse the attributes.
The rows are defined as :py:class:`cobol.ODO_LazyRow` instances so that
bad data can be gracefully skipped over and Occurs Depending On offsets
can be properly calculated.
"""
row_class= ODO_LazyRow
-
COBOL_File.
row_get_index
(row, attr, *index)¶ Returning a particular Cell from a row, however, is more interesting for COBOL because the Attribute may contains an “OCCURS” clause. In which case, we may need to assemble a tuple of values.
If there is dimensionality, then take the top-level dimension (
dim[0]
) and use it as an iterator to fetch data based on the rest of the dimensions (dim[1:]
).This can assemble a recursive tuple-of-tuples if there are multiple levels of dimensionality.
If too few index values are provided, a tuple of results is built around the missing values.
If enough values are provided, a single result object will be built.
Note
Performance
This is the most-used method. Removing the if-statement would be a huge improvement.
def row_get_index( self, row, attr, *index ):
"""Emit a nested-tuple structure of Cell values using the given index values.
:param row: the source Row.
:param attr: the :py:class:`cobol.RepeatingAttribute`
with the original tuple of dimensions,
or a :py:class:`cobol.IndexedAttribute` which has
an offset and partial dimensions.
:param index: optional tuple of index values to use.
Instead of ``row_get( schema.get('name').index(i) )``
we can use ``row_get_index( schema.get('name'), i )``
:returns: a (possibly nested) tuple of Cell values matching the dims that lacked
index values.
"""
if attr.dimensionality and index:
# ``attr.index()`` probably not previously used.
# Apply all remaining values and get the resulting item.
final= attr.index( *index )
return self.row_get( row, final )
elif attr.dimensionality:
# ``attr.index()`` previously used with partial arg values.
# Build composite result.
d= attr.dimensionality[0].occurs.number(row)
result= []
for i in range(d):
sub= attr.index(i)
result.append( self.row_get( row, sub ) )
return tuple(result)
else:
# Doesn't belong here, delegate.
return self.row_get( row, attr )
-
COBOL_File.
row_get
(row, attr)¶ The API method will get data from a row described by an attribute. If the attribute has dimensions, then indices are used or multiple values are returned by
cobol.COBOL_File.row_get_index()
.If the attribute is has no dimensions, then it’s simply pulled from the source row.
Note
Performance
This is the most-used method. Removing the if-statement would be a huge improvement.
def row_get( self, row, attr ):
"""Create a Cell(s) from the row's data.
:param row: The current Row
:param attr: The desired Attribute; possibly tweaked to
have an offset and partial dimensions. Or possibly the original.
:returns: A single Cell or a nested tuple of Cells if indexes
were not provided.
"""
if attr.dimensionality:
return self.row_get_index( row, attr )
else:
extract= row._state['data'][attr.offset:attr.offset+attr.size]
return attr.create( extract, self, attr=attr )
Note that this depends on the superclass, which depends ordinary Unicode/ASCII line breaks. This will not work for EBCDIC files, which may lack appropriate line break characters. For that, we’ll need to use specific physical format parsing helpers based on the Z/OS RECFM parameter used to define the file.
-
COBOL_File.
subrow
(subschema, text_cell)¶ In some COBOL files, there can be 01-level “subrecords” buried within an 01-level record.
We can use
wb.subrow(subschema, row.cell(schema_header_dict['GENERIC-FIELD']))
to map a particular field (‘GENERIC-FIELD’) to an entire 01-level schema, creating a “subrow” from a single field within the parent row.
def subrow( self, subschema, text_cell ):
"""Build a row-like object from a single field.
:param subschema: a schema built from an 01-level DDE.
:param text_cell: a specific text cell to use.
"""
subrow = self.row_class(
stingray.sheet.ExternalSchemaSheet( self, "", subschema ),
data= text_cell.raw,
)
return subrow
10.2.1.5.2. Character File¶
-
class
cobol.
Character_File
¶ This is subclass of
COBOL_File
that handles COBOL data parsing where the underlying file is text. Since the file is text, Python handles any OS-level bytes-to-text conversions.
class Character_File( COBOL_File ):
"""A COBOL "workbook" file with decoding functions for
proper character data.
"""
The following functions are used to do data conversions for COBOL Character files.
Text is easy, Python’s io.open
has already handled this.
@staticmethod
def text( buffer, attr ):
"""Extract a text field's value."""
return buffer
Numeric data with usage DISPLAY
is essentially text. In some cases, the
picture has V
, which means that we must handle this implicit decimal point.
The “display” feature is the COBOL default: everything is plain text.
Note
The core rule for character files
Leading separate sign is the default for character files.
COBOL can support other kinds of signs. This conversion doesn’t.
@staticmethod
def number_display( buffer, attr ):
"""Extract a numeric field's value.
Based on leading, separate sign.
"""
final, alpha, length, scale, precision, signed, dec_sign = attr.size_scale_precision
try:
display=buffer.strip()
if precision != 0 and dec_sign == 'V':
display= display[:-precision]+"."+display[-precision:]
return decimal.Decimal( display )
except Exception:
Character_File.log.debug( "Can't process {0!r} from {1!r}".format(display,buffer) )
raise
COMP-3 in proper character files may not make any sense at all. A codec would make a hash of the bit patterns required. However, we’ve defined the method here so that it can be used by the EBCDIC subclass trivially.
We’re going to build an ASCII version of the number by decoding the bytes into a mutable bytearray and decorating them with decimal point and sign. This is demonstrably faster and avoids object creation to the extent possible.
@staticmethod
def unpack( buffer ):
"""Include ' ' position for leading sign character.
Trailing sign field will be 48+0xd for negative.
48+0xf is "unsigned" and 48+0xc is positive.
"""
yield 32 # ord(b' ')
for n in buffer:
yield 48+(n>>4) # ord(b'0')
yield 48+(n&0x0f)
@staticmethod
def number_comp3( buffer, attr ):
"""Decode comp-3, packed decimal values.
Each byte is two decimal digits.
Last byte has a digit plus sign information: 0xd is <0, 0xf is unsigned, and 0xc >=0.
"""
final, alpha, length, scale, precision, signed, dec_sign = attr.size_scale_precision
#print( repr(buffer), "from", repr(display) )
digits = bytearray( Character_File.unpack( buffer ) )
# Proper sign in front; replace trailing sign with space.
digits[0]= 45 if digits[-1]==48+0xd else 32 # ord(b'-'), ord(b' ')
digits[-1]= 32 # ord(' ')
# Add decimal place if needed.
if precision:
digits[-precision:]= digits[-precision-1:-1] # Shift digits to right.
digits[-precision-1]= 46 # Insert ord(b'.')
try:
return decimal.Decimal( digits.decode("ASCII") )
except Exception:
Character_File.log.debug( "Can't process {0!r} from {1!r}".format(digits,buffer) )
raise
COMP in proper character files may not make any sense, either. A codec would make a hash of the bit patterns required. Again, we’ve defined it here because that’s relatively simple to extend.
We’re simply going to unpack big-endian bytes.
@staticmethod
def number_comp( buffer, attr ):
"""Decode comp, binary values."""
final, alpha, length, scale, precision, signed, dec_sign = attr.size_scale_precision
if length <= 4:
sc, bytes = '>h', 2
elif length <= 9:
sc, bytes = '>i', 4
else:
sc, bytes = '>q', 8
n= struct.unpack( sc, buffer )
return decimal.Decimal( n[0] )
Class-level logger
Character_File.log= logging.getLogger( Character_File.__qualname__ )
10.2.1.5.3. EBCDIC File¶
The EBCDIC files require specific physical “Record Format” (RECFM) assistance. These classes define a number of Z/OS RECFM conversion. We recognize four actual RECFM’s plus an additional special case.
- F - Fixed.
- FB - Fixed Blocked.
- V - Variable, data must have the RDW word preserved.
- VB - Variable Blocked, data must have BDW and RDW words.
- N - Variable, but no BDW or RDW words. This involves some buffer management magic to recover the records properly.
Note
IBM z/Architecture mainframes are all big-endian
-
class
cobol.
RECFM_Parser
¶ This class hierarchy breaks up EBCDIC files into records.
class RECFM_Parser:
"""Parse a physical file format."""
def record_iter( self ):
"""Return each physical record, stripped of headers."""
raise NotImplementedError
def used( self, bytes ):
"""The number of bytes actually consumed.
Only really relevant for RECFM_N subclass to handle variable-length
records with no RDW/BDW overheads.
"""
pass
-
class
cobol.
RECFM_F
¶ Simple fixed-length records. No header words.
class RECFM_F(RECFM_Parser):
"""Parse RECFM=F; the lrecl is the length of each record."""
def __init__( self, source, lrecl=None ):
"""
:param source: the file
:param lrecl: the record length.
"""
super().__init__()
self.source= source
self.lrecl= lrecl
def record_iter( self ):
data= self.source.read(self.lrecl)
while len(data) != 0:
yield data
data= self.source.read(self.lrecl)
def rdw_iter( self ):
"""Yield rows with RDW, effectively RECFM_V format."""
for row in self.record_iter():
yield struct.pack( ">H2x", len(row)+4 )+row
-
class
cobol.
RECFM_FB
¶ Simple fixed-blocked records. No header words.
class RECFM_FB( RECFM_F ):
"""Parse RECFM=FB; the lrecl is the length of each record.
It's not clear that there's any difference between F and FB.
"""
pass
-
class
cobol.
RECFM_V
¶ Variable-length records. Each record has an RDW header word with the length.
class RECFM_V(RECFM_Parser):
"""Parse RECFM=V; the lrecl is a maximum, which we ignore."""
def __init__( self, source, lrecl=None ):
"""
:param source: the file
:param lrecl: a maximum, but it's ignored.
"""
super().__init__()
self.source= source
def record_iter( self ):
"""Iterate over records, stripped of RDW's."""
for rdw, row in self._data_iter():
yield row
def rdw_iter( self ):
"""Iterate over records which include the 4-byte RDW."""
for rdw, row in self._data_iter():
yield rdw+row
def _data_iter( self ):
rdw= self.source.read(4)
while len(rdw) != 0:
size = struct.unpack( ">H2x", rdw )[0]
data= self.source.read( size-4 )
yield rdw, data
rdw= self.source.read(4)
We might want to implement the RECFM_Parser.used()
method to compare the number of bytes
used against the RDW size.
-
class
cobol.
RECFM_VB
¶ Variable-length, blocked records. Each block has a BDW; each record has an RDW header word. These BDW and RDW describe the structure of the file.
class RECFM_VB(RECFM_Parser):
"""Parse RECFM=VB; the lrecl is a maximum, which we ignore."""
def __init__( self, source, lrecl=None ):
"""
:param source: the file
:param lrecl: a maximum, but it's ignored.
"""
super().__init__()
self.source= source
def record_iter( self ):
"""Iterate over records, stripped of RDW's."""
for rdw, row in self._data_iter():
yield row
def rdw_iter( self ):
"""Iterate over records which include the 4-byte RDW."""
for rdw, row in self._data_iter():
yield rdw+row
def bdw_iter( self ):
"""Iterate over blocks, which include 4-byte BDW and records with 4-byte RDW's."""
bdw= self.source.read(4)
while len(bdw) != 0:
blksize = struct.unpack( ">H2x", bdw )[0]
block_data= self.source.read( blksize-4 )
yield bdw+block_data
bdw= self.source.read(4)
def _data_iter( self ):
bdw= self.source.read(4)
while len(bdw) != 0:
blksize = struct.unpack( ">H2x", bdw )[0]
block_data= self.source.read( blksize-4 )
offset= 0
while offset != len(block_data):
assert offset+4 < len(block_data), "Corrupted Data Block {!r}".format(block_data)
rdw= block_data[offset:offset+4]
size= struct.unpack( ">H2x", rdw )[0]
yield rdw, block_data[offset+4:offset+size]
offset += size
bdw= self.source.read(4)
We might want to implement a generic RECFM_Parser.used()
method to compare the number of bytes
used against the RDW size and raise an exception in the event of a mismatch.
-
class
cobol.
RECFM_N
¶ Variable-length records without RDW’s. Exasperating because we have to feed bytes to the buffer as needed until the record is complete.
class RECFM_N:
"""Parse RECFM=V without RDW (or RECFM=VB without BDW or RDW).
The lrecl is ignored.
"""
def __init__( self, source, lrecl=None ):
"""
:param source: the file
:param lrecl: a maximum, but it's ignored.
"""
super().__init__()
self.source= source
self.buffer= self.source.read( 32768 )
def record_iter( self ):
while len(self.buffer) != 0:
yield self.buffer
# What if used() is not called? This will loop forever!
def used( self, bytes ):
#print( "Consumed {0} Bytes".format(bytes) )
self.buffer= self.buffer[bytes:]+self.source.read(32768-bytes)
-
class
cobol.
EBCDIC_File
¶ This subclass handles EBCDIC conversion and COMP-3 packed decimal numbers. For this to work, the schema needs to use slightly different Cell-type conversions.
Otherwise, this is similar to processing simple character data.
class EBCDIC_File( Character_File ):
"""A COBOL "workbook" file with decoding functions for
EBCDIC data. If a file_object is provided, it must be
opened in byte mode, and no decoder can be used.
"""
decoder= codecs.getdecoder('cp037')
def __init__( self, name, file_object=None, schema=None, RECFM="N" ):
"""Prepare the workbook for reading.
:param name: File name
:param file_object: Optional file-like object. If omitted, the named file is opened.
The object must be opened in byte mode; no decoder should be used.
:param schema: The schema to use.
:param RECFM: The legacy Z/OS RECFM to use. This must be one
of "F", "FB", "V", "VB". This is translated to an appropriate
RECFM class: RECFM_F, RECFM_FB, RECFM_V, or RECFM_VB.
"""
super().__init__( name, file_object, schema )
if self.file_obj:
self.the_file= None
self.wb= self.file_obj
else:
self.the_file = open( name, 'rb' )
self.wb= self.the_file
self.schema= schema
parser_class= {
"F" : RECFM_F,
"FB": RECFM_FB,
"V" : RECFM_V,
"VB": RECFM_VB,
"N": RECFM_N,
}[RECFM]
self.parser= parser_class(self.wb, schema.lrecl())
-
EBCDIC_File.
rows_of
(sheet)¶ We must extend the
workbook.Character_File.rows_of()
method to deal with two issues:If the schema depends on a variably located DDE, then we need to do the
cobol.defs.setSizeAndOffset()
function using the DDE. This is done automagically by thecobol.ODO_LazyRow
object.The legacy Z/OS RECFM details.
- We might have F or FB files, which are simply long runs of EBCDIC bytes with no line breaks. The LRECL must match the DDE.
- We might have V (or VB) which have 4-byte header on each row (plus a 4-byte header on each block.) The LRECL doesn’t matter.
- We can tolerate the awful situation where it’s variable length (Occurs Depending On) but there are no RECFM=V or RECFM=VB header words. We call this RECFM=N. We fetch an oversized buffer and push back bytes beyond the end of the record.
This means that the
super().rows_of( sheet )
has been replaced with a RECFM-aware byte-parser. This byte parser may involve a back-and-forth to handle RECFM=N. In the case of RECFM=N, we provide an overly-large buffer (32768 bytes) and after any size and offset calculations, therow._size
shows how many bytes were actually used.
def rows_of( self, sheet ):
"""Iterate through all "rows" of this "sheet".
Really, this means all records of this COBOL file.
Note the handshake with RECFM parser to show how many
bytes were really needed. For RECFM_N, this is important.
For other RECFM, this is ignored.
:py:class:`cobol.ODO_LazyRow` may adjust the schema
if it has an Occurs Depending On.
"""
for data in self.parser.record_iter():
row= ODO_LazyRow( sheet, data=data )
self.parser.used(sheet.schema.lrecl())
yield row
The following functions are used to do data conversions for COBOL EBCDIC files. Text requires using a codec to translate EBCDIC-encoded characters.
@staticmethod
def text( buffer, attr ):
"""Extract a text field's value."""
text, size = EBCDIC_File.decoder(buffer)
return text
When a number usage is DISPLAY
, it’s text:
we simply convert the bytes from EBCDIC to Unicode
and treat them more-or-less like a text field.
Note the subtlety around “Signed” display fields. The last byte will include a sign in addition to the digit.
- The last EBCDIC character might be ‘xF1’ to ‘xF9’ which is unsigned.
- The last EBCDIC character might be ‘xC1’ to ‘xC9’ which is positive.
- The last EBCDIC character might be ‘xD1’ to ‘xD9’ which is negative.
Really.
@staticmethod
def number_display( buffer, attr ):
"""Extract a numeric field's value."""
if attr.size_scale_precision.signed:
# Fiddle bits to make EBCDIC char from signed digit.
last_digit = bytes( [(buffer[-1] & 0x0F) | 0xF0] )
sign = '-' if buffer[-1] >> 4 == 0xD else ''
text, size = EBCDIC_File.decoder(last_digit if len(buffer) == 1 else buffer[:-1] + last_digit)
return Character_File.number_display( sign+text, attr )
else:
text, size = EBCDIC_File.decoder(buffer)
return Character_File.number_display( text, attr )
10.2.1.5.4. ASCII File¶
We could define a subclass for files encoded in ASCII which contain COMP and COMP-3 values.
This is left as a future extension.