.. #!/usr/bin/env python3 .. _`cobol_init`: ####################################################### COBOL Package -- Extend Schema to Handle EBCDIC ####################################################### The COBOL package is a (large) Python ``__init__.py`` module which includes much of the public API for working with COBOL files. This module extends Stingray in several directions. - A new :py:class:`schema.Attribute` subclass, :py:class:`cobol.RepeatingAttribute`. - A handy :py:func:`cobol.dump` function. - The hierarchy of classes based on :py:class:`cobol.COBOL_File` which provide more sophisticated COBOL-based workbooks. Within the package we have the :py:mod:`cobol.loader` module which parses DDE's to create a schema. Module Overheads ================= .. py:module:: cobol We depend on :py:mod:`cell`, :py:mod:`schema`, and :py:mod:`workbook`. We'll also import one class definition from :py:mod:`cobol.defs`. :: """stingray.cobol -- Extend the core Stingray definitions to handle COBOL DDE's and COBOL files, including packed decimal and EBCDIC data. """ import codecs import struct import decimal import warnings import pprint import logging import stingray.schema import stingray.sheet from stingray.workbook.fixed import Fixed_Workbook from stingray.cobol.defs import TextCell RepeatingAttribute Subclasses of Attribute =========================================== Two new :py:class:`schema.Attribute` subclasses are required to carry all the additional attribute information developed during COBOL DDE parsing. An attribute that has an ``OCCURS`` clause (or who's parent has an ``OCCURS`` clause) can accept an :py:meth:`cobol.RepeatingAttribute.index` method to provide index values used to compute effective offsets. There are two variants. - The initial, immutable, :py:class:`cobol.RepeatingAttribute` as parsed. - A working :py:class:`cobol.IndexedAttribute`. This is a subclass of :py:class:`cobol.RepeatingAttribute` and it contains partial or complete indexing. Partial indexing means that a tuple is built by :py:meth:`cobol.COBOL_File.row_get`. Full indexing means that a single ``Cell`` can be built. .. code-block:: none http://yuml.me/diagram/scruffy;/class/ #cobol.attribute, [Attribute]^[RepeatingAttribute], [Schema]<>-[Attribute], [Fixed_Workbook]-uses->[Attribute], [Fixed_Workbook]^[COBOL_File], [COBOL_File]-uses->[RepeatingAttribute]. .. image:: cobol_attribute.png In order to fetch data for an ODO ``OCCURS`` element, the attribute offsets and sizes cannot **all** be computed during parsing. They must be computed lazily during data fetching. The :py:class:`cobol.ODO_LazyRow` class handles the Occurs Depending On situation. Here are the attributes inherited from :py:class:`schema.Attribute`. .. py:attribute:: name The attribute name. Typically always available for most kinds of schema. .. py:attribute:: create Cell class to create. If omitted, the class-level :py:data:`Attribute.default_cell` will be used. By default, this refers to :py:class:`cell.TextCell`. .. py:attribute:: position Optional sequential position. This is set by the :py:class:`schema.Schema` that contains this object. The additional values commonly provided by simple fixed format file schemata. These can't be treated as simple values, however, since they're clearly changed based on the ODO issues. .. py:attribute:: size Size within the buffer. These two properties over overridden by the :py:class:`cobol.IndexedAttribute` subclass; this is created by the :py:meth:`cobol.RepeatingAttribute.index` method. The superclass versions are simple a delegation to the DDE. If :py:meth:`cobol.RepeatingAttribute.index` is used, the subclass object is built where these values come from the ``index`` method results. .. py:attribute:: dimensionality A tuple of DDE's that defines the dimensionality pushed down to this item through the COBOL DDE hierarchy. This meay be set by the :py:meth:`cobol.RepeatingAttribute.index` method. .. py:attribute:: offset Optional offset into a buffer. This may be statically defined, or it may be dynamic because of variably-located data supporting the Occurs Depends On. This meay be set by the :py:meth:`cobol.RepeatingAttribute.index` method. This subclass introduces yet more attribute-like properties that simply delegate to the DDE. .. py:attribute:: dde A weakref to a :py:class:`cobol.defs.DDE` object. .. py:attribute:: path The ``"."``-separated path from top-level name to this element's name. .. py:attribute:: usage The original DDE.usage object, an instance of :py:class:`cobol.defs.Usage` .. py:attribute:: redefines The original DDE.allocation object, an instance of :py:class:`cobol.defs.Allocation` .. py:attribute:: picture The original DDE.picture object, an instance of :py:class:`cobol.loader.Picture` .. py:attribute:: size_scale_precision The original DDE.sizeScalePrecision object, a tuple with size, scale and precision derived from the picture. .. py:class:: RepeatingAttribute An attribute with dimensionality. Not all COBOL items repeat. :: class RepeatingAttribute( stingray.schema.Attribute ): """An attribute with dimensionality. Not all COBOL items repeat. An "OCCURS" clause will define repeating values. An "OCCURS DEPENDING ON" clause may define variably located values. """ default_cell= TextCell def __init__(self, name, dde, offset=None, size=None, create=None, position=None, **kw): self.dde= dde self.name, self.size, self.create, self.position = name, size, create, position if not self.create: self.create= self.default_cell if offset is not None: warnings.warn( "Offset {0} is ignored; {1} used".format(offset, self.dde().offset), stacklevel=2 ) self.__dict__.update( kw ) def __repr__( self ): dim= ", ".join( map( repr, self.dimensionality ) ) return "Attribute( name={0.name!r}, position={0.position}, offset={0.offset}, size={0.size}, dimensionality=({1}) )".format( self, dim ) .. py:method:: RepeatingAttribute.index( *values ) If the number of index values matches the dimensionality, we'll return a tweaked attribute which has just the offset required and a dimensionality of ``tuple()``. If the number of index values is insufficient, we'll return a tweaked attribute with which has the starting offset and the dimensions left otherwise unspecified. If the number of index values is excessive, we'll attempt to pop from an empty list. Note that :py:meth:`cobol.RepeatingAttribute.index` is applied incrementally when the application supplies some of the indices. - First, an application can supply some of the indices, creating :py:class:`cobol.IndexedAttribute` with an initial offset. - Second, the :py:class:`COBOL_File` will supply any remaining indices, creating yet more temporary :py:class:`cobol.IndexedAttribute` based on the initial offset. :: def index( self, *values ): """"Apply possibly incomplete index values to an attribute. We do this by cloning this attribute and setting a modified dimensionality and offset. :param values: 0-based index values. Yes, legacy COBOL language is 1-based. For Python applications, zero-based makes more sense. :returns: A :py:class:`cobol.IndexedAttribute` copy, with modified offset and dimensionality that can be used with :py:meth:`COBOL_File.row_get`. """ assert values, "Missing index values" # Original values for a RepeatingAttribute # Modified values for an IndexedAttribute offset= self.offset dim_list= list(self.dimensionality) # Apply given index values. val_list= list(values) while val_list: index= val_list.pop(0) dim= dim_list.pop(0) offset += dim.size * index # Build new subclass object with indexes applied. clone= IndexedAttribute( self, offset, dim_list ) return clone With this, a ``row.cell(schema.get('name').index(i))`` will compute a proper offset. We "clone" the attribute to assure that each time we apply (or don't apply) the index, nothing stateful will have happened to the original immutable attribute definition. Note that an incomplete set of index values forces the underlying workbook to create a Python tuple (or tuple of tuples) structure to contain all the requested values. See :py:meth:`cobol.COBOL_File.row_get`. The additional properties which are simply shortcuts so that a generic :py:class:`cobol.RepeatingAttribute` has access to the DDE details. :: @property def dimensionality(self): """tuple of parent DDE's. Baseline value; no indexes applied.""" return self.dde().dimensionality @property def offset(self): """Baseline value; no indexes applied.""" return self.dde().offset @property def path(self): return self.dde().pathTo() @property def usage(self): return self.dde().usage @property def redefines(self): return self.dde().allocation @property def picture(self): return self.dde().picture @property def size_scale_precision(self): return self.dde().sizeScalePrecision .. py:class:: IndexedAttribute The IndexedAttribute is a subclass of :py:class:`cobol.RepeatingAttribute` with (some) indices applied. Since this inherits the :py:meth:`cobol.RepeatingAttribute.index` method, we can apply indices incrementally. This class is not built directly, but only created by :py:meth:`cobol.RepeatingAttribute.index` with some (or all) indices applied. :: class IndexedAttribute( RepeatingAttribute ): """An attribute with dimensionality and indexes applied. This must be built from a :py:class:`cobol.RepeatingAttribute`. It will copy some attributes in an effort to somewhat improve efficiency. """ default_cell= TextCell def __init__(self, base, offset, dimensionality ): self.dde= base.dde self.name, self.size, self.create, self.position = base.name, base.size, base.create, base.position self._offset= offset self._dimensionality= dimensionality @property def dimensionality(self): """tuple of DDE's; Set by ``attribute.index()``.""" return self._dimensionality @property def offset(self): """Set by ``attribute.index()``.""" return self._offset COBOL LazyRow ============== The :py:class:`sheet.LazyRow` class is blissfully unaware of the need to compute sizes and offsets for COBOL. .. py:class:: ODO_LazyRow This subclass of :py:class:`sheet.LazyRow` to provide add the feature to recompute sizes and offsets in the case of a variable-located DDE due to an Occurs Depending On. :: class ODO_LazyRow( stingray.sheet.LazyRow ): """If the DDE is variably-located, tweak the sizes and offsets.""" def __init__( self, sheet, **state ): """Build the row from the bytes. :param sheet: the containing sheet. :param **state: worksheet-specific state value to save. """ super().__init__( sheet, **state ) for dde in self.sheet.schema.info.get('dde',[]): if dde.variably_located: dde.setSizeAndOffset(self) self._size= dde.totalSize else: self._size= len(self._state['data']) Dump a Record =============== .. py:function:: dump_iter( aDDE, aRow ) To support dumping raw data from a record, this will iterate through all items in an original DDE. It will a five-tuple with (dde, attribute, indices, bytes, Cell) for each DDE. If the DDE does not have an OCCURS clause, the indices will be an empty tuple. Otherwise, each individual combination will be yielded. For big, nested tables, this may turn out to be a lot of combinations. The bytes is the raw bytes for non-FILLER and non-group elements. The Cell will be a Cell object, either with valid data or an :py:class:`cobol.defs.ErrorCell`. :: def dump_iter( aDDE, aRow ): """Yields iterator over tuples of (dde, attribute, indices, bytes, Cell)""" def expand_dims( dimensionality, partial=() ): if not dimensionality: yield partial return top = dimensionality[0] rest= dimensionality[1:] for i in range(top): for e in expand_dims( rest, partial+(i,) ): yield e attr= aDDE.attribute() # Final size and offset details if aDDE.dimensionality: for indices in expand_dims( aDDE.dimensionality ): yield aDDE, aDDE.attribute, indices, aRow.cell(attr,indices).raw, aRow.cell(attr,indices) elif aDDE.picture and aDDE.name != "FILLER": yield aDDE, aDDE.attribute(), (), aRow.cell(attr).raw, aRow.cell(attr) else: # FILLER or group level without a picture: no data is available yield aDDE, aDDE.attribute, (), None, None for child in aDDE.children: #pprint.pprint( child ) for details in dump_iter( child, aRow ): yield details .. py:function:: dump( schema, row ) Dump data from a record, driven by the original DDE structure. :: def dump( schema, aRow ): print( "{:45s} {:3s} {:3s} {!s} {!s}".format("Field", "Pos", "Sz", "Raw", "Cell" ) ) for record in schema.info['dde']: for aDDE, attr, indices, raw_bytes, cell in dump_iter(record, aRow): print( "{:45s} {:3d} {:3d} {!r} {!s}".format( aDDE.indent*' '+str(aDDE), aDDE.offset, aDDE.size, raw_bytes, cell) ) COBOL "Workbook" Files ======================== A COBOL file is -- in effect -- a single-sheet workbook with an external schema. It looks, then, a lot like :py:class:`workbook.Fixed_Workbook`. - A pure character file, encoded UNICODE characters in some standard encoding like UTF-8 or UTF-16. This cannot include COMP or COMP-3 fields because the codec would make a mess of the bit patterns. - An EBCDIC-encoded byte file. This can include COMP or COMP-3 fields. - An ASCII-encoded byte file. This can include COMP or COMP-3 fields. While this may exist, it seems to be very rare. We don't implement it. Note that each cell creation involves two features. This leads to a kind of **Double Dispatch** algorithm. - The cell type. :py:class:`cobol.defs.TextCell`, :py:class:`cobol.defs.NumberDisplayCell`, :py:class:`cobol.defs.NumberComp3Cell` or :py:class:`cobol.defs.NumberCompCell`. - The workbook encoding type. Character or EBCDIC (or ASCII). The issue here is we're stuck with a complex "double-dispatch" problem. Each workbook subclass needs to implement methods for ``get_text``, ``number_display``, ``number_comp`` and ``number_comp3``. The conversions, while tied to the workbook encoding, aren't properly tied to stateful sheet and row processing in the workbook. They're just bound to the encoding. Consequently, we can make them static methods, possibly even making this a mixin strategy. The common use case looks like this. 1. The application uses :code:`row.cell( schema[n] )` to fetch a :py:class:`cell.Cell`. The :py:meth:`cobol.ODO_LazyRow.cell` method is simply ``sheet.workbook.row_get( buffer, attribute )``. It applies the cell type (via the schema item's attribute) and the raw data in the row's buffer. 2. The workbook ``row_get( buffer, attribute )`` has to do the following. - Convert the buffer into a proper value based on the ``attribute`` type information **and** the worksheet-specific methods for unpacking the various types of data. The various :py:mod:`cobol` Cell subclasses can refer to the proper conversion methods. - Create the required :py:class:`cell.Cell` based on the ``attribute.create`` function. See :class:`schema.Attribute`. There's a less common use case to extract a subset of row bytes to populate a separate 01-level definition that's not tied to the Workbook's schema. 1. The application uses ``subrow= row.data( schema[n], other_schema )`` to fetch some bytes that can be used to create a new LazyRow tied to a different schema. 2. The application uses ``subrow.cell( subschema[m] )`` to fetch a :py:class:`cell.Cell`. This doesn't go back to the original workbook, it goes to this "subrow" of the workbook. .. code-block:: none http://yuml.me/diagram/scruffy;/class/ #cobol, [Fixed_Workbook]^[COBOL_File], [COBOL_File]^[Character_File], [COBOL_File]^[EBCDIC_File]. .. image:: cobol_file.png :width: 6in COBOL File -------------- .. py:class:: COBOL_File This class introduces the expanded version of ``row_get`` that honors a schema attribute with dimensionality. :: class COBOL_File( Fixed_Workbook ): """A COBOL "workbook" file which uses :py:class:`cobol.RepeatingAttribute` and creates COBOL Cell values. This is an abstraction which lacks specific decoding methods. This is a :py:class:`Fixed_Workbook`: a file with fixed-sized, no-punctuation fields. A schema is required to parse the attributes. The rows are defined as :py:class:`cobol.ODO_LazyRow` instances so that bad data can be gracefully skipped over and Occurs Depending On offsets can be properly calculated. """ row_class= ODO_LazyRow .. py:method:: COBOL_File.row_get_index( row, attr, *index ) Returning a particular Cell from a row, however, is more interesting for COBOL because the Attribute may contains an "OCCURS" clause. In which case, we may need to assemble a tuple of values. If there is dimensionality, then take the top-level dimension (``dim[0]``) and use it as an iterator to fetch data based on the rest of the dimensions (``dim[1:]``). This can assemble a recursive tuple-of-tuples if there are multiple levels of dimensionality. If too few index values are provided, a tuple of results is built around the missing values. If enough values are provided, a single result object will be built. .. note:: Performance This is the most-used method. Removing the if-statement would be a huge improvement. :: def row_get_index( self, row, attr, *index ): """Emit a nested-tuple structure of Cell values using the given index values. :param row: the source Row. :param attr: the :py:class:`cobol.RepeatingAttribute` with the original tuple of dimensions, or a :py:class:`cobol.IndexedAttribute` which has an offset and partial dimensions. :param index: optional tuple of index values to use. Instead of ``row_get( schema.get('name').index(i) )`` we can use ``row_get_index( schema.get('name'), i )`` :returns: a (possibly nested) tuple of Cell values matching the dims that lacked index values. """ if attr.dimensionality and index: # ``attr.index()`` probably not previously used. # Apply all remaining values and get the resulting item. final= attr.index( *index ) return self.row_get( row, final ) elif attr.dimensionality: # ``attr.index()`` previously used with partial arg values. # Build composite result. d= attr.dimensionality[0].occurs.number(row) result= [] for i in range(d): sub= attr.index(i) result.append( self.row_get( row, sub ) ) return tuple(result) else: # Doesn't belong here, delegate. return self.row_get( row, attr ) .. py:method:: COBOL_File.row_get( row, attr ) The API method will get data from a row described by an attribute. If the attribute has dimensions, then indices are used or multiple values are returned by :py:meth:`cobol.COBOL_File.row_get_index`. If the attribute is has no dimensions, then it's simply pulled from the source row. .. note:: Performance This is the most-used method. Removing the if-statement would be a huge improvement. :: def row_get( self, row, attr ): """Create a Cell(s) from the row's data. :param row: The current Row :param attr: The desired Attribute; possibly tweaked to have an offset and partial dimensions. Or possibly the original. :returns: A single Cell or a nested tuple of Cells if indexes were not provided. """ if attr.dimensionality: return self.row_get_index( row, attr ) else: extract= row._state['data'][attr.offset:attr.offset+attr.size] return attr.create( extract, self, attr=attr ) Note that this depends on the superclass, which depends ordinary Unicode/ASCII line breaks. This will not work for EBCDIC files, which may lack appropriate line break characters. For that, we'll need to use specific physical format parsing helpers based on the Z/OS RECFM parameter used to define the file. .. py:method:: COBOL_File.subrow( subschema, text_cell ) In some COBOL files, there can be 01-level "subrecords" buried within an 01-level record. We can use ``wb.subrow(subschema, row.cell(schema_header_dict['GENERIC-FIELD']))`` to map a particular field ('GENERIC-FIELD') to an entire 01-level schema, creating a "subrow" from a single field within the parent row. :: def subrow( self, subschema, text_cell ): """Build a row-like object from a single field. :param subschema: a schema built from an 01-level DDE. :param text_cell: a specific text cell to use. """ subrow = self.row_class( stingray.sheet.ExternalSchemaSheet( self, "", subschema ), data= text_cell.raw, ) return subrow Character File ----------------- .. py:class:: Character_File This is subclass of :py:class:`COBOL_File` that handles COBOL data parsing where the underlying file is text. Since the file is text, Python handles any OS-level bytes-to-text conversions. :: class Character_File( COBOL_File ): """A COBOL "workbook" file with decoding functions for proper character data. """ The following functions are used to do data conversions for COBOL Character files. Text is easy, Python's ``io.open`` has already handled this. :: @staticmethod def text( buffer, attr ): """Extract a text field's value.""" return buffer Numeric data with usage ``DISPLAY`` is essentially text. In some cases, the picture has ``V``, which means that we must handle this implicit decimal point. The "display" feature is the COBOL default: everything is plain text. .. note:: The core rule for character files Leading separate sign is the default for character files. COBOL can support other kinds of signs. This conversion doesn't. :: @staticmethod def number_display( buffer, attr ): """Extract a numeric field's value. Based on leading, separate sign. """ final, alpha, length, scale, precision, signed, dec_sign = attr.size_scale_precision try: display=buffer.strip() if precision != 0 and dec_sign == 'V': display= display[:-precision]+"."+display[-precision:] return decimal.Decimal( display ) except Exception: Character_File.log.debug( "Can't process {0!r} from {1!r}".format(display,buffer) ) raise COMP-3 in proper character files may not make any sense at all. A codec would make a hash of the bit patterns required. However, we've defined the method here so that it can be used by the EBCDIC subclass trivially. We're going to build an ASCII version of the number by decoding the bytes into a mutable bytearray and decorating them with decimal point and sign. This is demonstrably faster and avoids object creation to the extent possible. :: @staticmethod def unpack( buffer ): """Include ' ' position for leading sign character. Trailing sign field will be 48+0xd for negative. 48+0xf is "unsigned" and 48+0xc is positive. """ yield 32 # ord(b' ') for n in buffer: yield 48+(n>>4) # ord(b'0') yield 48+(n&0x0f) @staticmethod def number_comp3( buffer, attr ): """Decode comp-3, packed decimal values. Each byte is two decimal digits. Last byte has a digit plus sign information: 0xd is <0, 0xf is unsigned, and 0xc >=0. """ final, alpha, length, scale, precision, signed, dec_sign = attr.size_scale_precision #print( repr(buffer), "from", repr(display) ) digits = bytearray( Character_File.unpack( buffer ) ) # Proper sign in front; replace trailing sign with space. digits[0]= 45 if digits[-1]==48+0xd else 32 # ord(b'-'), ord(b' ') digits[-1]= 32 # ord(' ') # Add decimal place if needed. if precision: digits[-precision:]= digits[-precision-1:-1] # Shift digits to right. digits[-precision-1]= 46 # Insert ord(b'.') try: return decimal.Decimal( digits.decode("ASCII") ) except Exception: Character_File.log.debug( "Can't process {0!r} from {1!r}".format(digits,buffer) ) raise COMP in proper character files may not make any sense, either. A codec would make a hash of the bit patterns required. Again, we've defined it here because that's relatively simple to extend. We're simply going to unpack big-endian bytes. :: @staticmethod def number_comp( buffer, attr ): """Decode comp, binary values.""" final, alpha, length, scale, precision, signed, dec_sign = attr.size_scale_precision if length <= 4: sc, bytes = '>h', 2 elif length <= 9: sc, bytes = '>i', 4 else: sc, bytes = '>q', 8 n= struct.unpack( sc, buffer ) return decimal.Decimal( n[0] ) Class-level logger :: Character_File.log= logging.getLogger( Character_File.__qualname__ ) EBCDIC File --------------- The EBCDIC files require specific physical "Record Format" (RECFM) assistance. These classes define a number of Z/OS RECFM conversion. We recognize four actual RECFM's plus an additional special case. - F - Fixed. - FB - Fixed Blocked. - V - Variable, data must have the RDW word preserved. - VB - Variable Blocked, data must have BDW and RDW words. - N - Variable, but no BDW or RDW words. This involves some buffer management magic to recover the records properly. .. note:: IBM z/Architecture mainframes are all big-endian .. py:class:: RECFM_Parser This class hierarchy breaks up EBCDIC files into records. :: class RECFM_Parser: """Parse a physical file format.""" def record_iter( self ): """Return each physical record, stripped of headers.""" raise NotImplementedError def used( self, bytes ): """The number of bytes actually consumed. Only really relevant for RECFM_N subclass to handle variable-length records with no RDW/BDW overheads. """ pass .. py:class:: RECFM_F Simple fixed-length records. No header words. :: class RECFM_F(RECFM_Parser): """Parse RECFM=F; the lrecl is the length of each record.""" def __init__( self, source, lrecl=None ): """ :param source: the file :param lrecl: the record length. """ super().__init__() self.source= source self.lrecl= lrecl def record_iter( self ): data= self.source.read(self.lrecl) while len(data) != 0: yield data data= self.source.read(self.lrecl) def rdw_iter( self ): """Yield rows with RDW, effectively RECFM_V format.""" for row in self.record_iter(): yield struct.pack( ">H2x", len(row)+4 )+row .. py:class:: RECFM_FB Simple fixed-blocked records. No header words. :: class RECFM_FB( RECFM_F ): """Parse RECFM=FB; the lrecl is the length of each record. It's not clear that there's any difference between F and FB. """ pass .. py:class:: RECFM_V Variable-length records. Each record has an RDW header word with the length. :: class RECFM_V(RECFM_Parser): """Parse RECFM=V; the lrecl is a maximum, which we ignore.""" def __init__( self, source, lrecl=None ): """ :param source: the file :param lrecl: a maximum, but it's ignored. """ super().__init__() self.source= source def record_iter( self ): """Iterate over records, stripped of RDW's.""" for rdw, row in self._data_iter(): yield row def rdw_iter( self ): """Iterate over records which include the 4-byte RDW.""" for rdw, row in self._data_iter(): yield rdw+row def _data_iter( self ): rdw= self.source.read(4) while len(rdw) != 0: size = struct.unpack( ">H2x", rdw )[0] data= self.source.read( size-4 ) yield rdw, data rdw= self.source.read(4) We might want to implement the :py:meth:`RECFM_Parser.used` method to compare the number of bytes used against the RDW size. .. py:class:: RECFM_VB Variable-length, blocked records. Each block has a BDW; each record has an RDW header word. These BDW and RDW describe the structure of the file. :: class RECFM_VB(RECFM_Parser): """Parse RECFM=VB; the lrecl is a maximum, which we ignore.""" def __init__( self, source, lrecl=None ): """ :param source: the file :param lrecl: a maximum, but it's ignored. """ super().__init__() self.source= source def record_iter( self ): """Iterate over records, stripped of RDW's.""" for rdw, row in self._data_iter(): yield row def rdw_iter( self ): """Iterate over records which include the 4-byte RDW.""" for rdw, row in self._data_iter(): yield rdw+row def bdw_iter( self ): """Iterate over blocks, which include 4-byte BDW and records with 4-byte RDW's.""" bdw= self.source.read(4) while len(bdw) != 0: blksize = struct.unpack( ">H2x", bdw )[0] block_data= self.source.read( blksize-4 ) yield bdw+block_data bdw= self.source.read(4) def _data_iter( self ): bdw= self.source.read(4) while len(bdw) != 0: blksize = struct.unpack( ">H2x", bdw )[0] block_data= self.source.read( blksize-4 ) offset= 0 while offset != len(block_data): assert offset+4 < len(block_data), "Corrupted Data Block {!r}".format(block_data) rdw= block_data[offset:offset+4] size= struct.unpack( ">H2x", rdw )[0] yield rdw, block_data[offset+4:offset+size] offset += size bdw= self.source.read(4) We might want to implement a generic :py:meth:`RECFM_Parser.used` method to compare the number of bytes used against the RDW size and raise an exception in the event of a mismatch. .. py:class:: RECFM_N Variable-length records without RDW's. Exasperating because we have to feed bytes to the buffer as needed until the record is complete. :: class RECFM_N: """Parse RECFM=V without RDW (or RECFM=VB without BDW or RDW). The lrecl is ignored. """ def __init__( self, source, lrecl=None ): """ :param source: the file :param lrecl: a maximum, but it's ignored. """ super().__init__() self.source= source self.buffer= self.source.read( 32768 ) def record_iter( self ): while len(self.buffer) != 0: yield self.buffer # What if used() is not called? This will loop forever! def used( self, bytes ): #print( "Consumed {0} Bytes".format(bytes) ) self.buffer= self.buffer[bytes:]+self.source.read(32768-bytes) .. py:class:: EBCDIC_File This subclass handles EBCDIC conversion and COMP-3 packed decimal numbers. For this to work, the schema needs to use slightly different Cell-type conversions. Otherwise, this is similar to processing simple character data. :: class EBCDIC_File( Character_File ): """A COBOL "workbook" file with decoding functions for EBCDIC data. If a file_object is provided, it must be opened in byte mode, and no decoder can be used. """ decoder= codecs.getdecoder('cp037') def __init__( self, name, file_object=None, schema=None, RECFM="N" ): """Prepare the workbook for reading. :param name: File name :param file_object: Optional file-like object. If omitted, the named file is opened. The object must be opened in byte mode; no decoder should be used. :param schema: The schema to use. :param RECFM: The legacy Z/OS RECFM to use. This must be one of "F", "FB", "V", "VB". This is translated to an appropriate RECFM class: RECFM_F, RECFM_FB, RECFM_V, or RECFM_VB. """ super().__init__( name, file_object, schema ) if self.file_obj: self.the_file= None self.wb= self.file_obj else: self.the_file = open( name, 'rb' ) self.wb= self.the_file self.schema= schema parser_class= { "F" : RECFM_F, "FB": RECFM_FB, "V" : RECFM_V, "VB": RECFM_VB, "N": RECFM_N, }[RECFM] self.parser= parser_class(self.wb, schema.lrecl()) .. py:method:: EBCDIC_File.rows_of( sheet ) We must extend the :py:meth:`workbook.Character_File.rows_of` method to deal with two issues: - If the schema depends on a variably located DDE, then we need to do the :py:func:`cobol.defs.setSizeAndOffset` function using the DDE. This is done automagically by the :py:class:`cobol.ODO_LazyRow` object. - The legacy Z/OS RECFM details. * We might have F or FB files, which are simply long runs of EBCDIC bytes with no line breaks. The LRECL must match the DDE. * We might have V (or VB) which have 4-byte header on each row (plus a 4-byte header on each block.) The LRECL doesn't matter. * We can tolerate the awful situation where it's variable length (Occurs Depending On) but there are no RECFM=V or RECFM=VB header words. We call this RECFM=N. We fetch an oversized buffer and push back bytes beyond the end of the record. This means that the ``super().rows_of( sheet )`` has been replaced with a RECFM-aware byte-parser. This byte parser may involve a back-and-forth to handle RECFM=N. In the case of RECFM=N, we provide an overly-large buffer (32768 bytes) and after any size and offset calculations, the ``row._size`` shows how many bytes were actually used. :: def rows_of( self, sheet ): """Iterate through all "rows" of this "sheet". Really, this means all records of this COBOL file. Note the handshake with RECFM parser to show how many bytes were really needed. For RECFM_N, this is important. For other RECFM, this is ignored. :py:class:`cobol.ODO_LazyRow` may adjust the schema if it has an Occurs Depending On. """ for data in self.parser.record_iter(): row= ODO_LazyRow( sheet, data=data ) self.parser.used(sheet.schema.lrecl()) yield row The following functions are used to do data conversions for COBOL EBCDIC files. Text requires using a codec to translate EBCDIC-encoded characters. :: @staticmethod def text( buffer, attr ): """Extract a text field's value.""" text, size = EBCDIC_File.decoder(buffer) return text When a number usage is ``DISPLAY``, it's text: we simply convert the bytes from EBCDIC to Unicode and treat them more-or-less like a text field. Note the subtlety around "Signed" display fields. The last byte will include a sign in addition to the digit. - The last EBCDIC character might be '\xF1' to '\xF9' which is unsigned. - The last EBCDIC character might be '\xC1' to '\xC9' which is positive. - The last EBCDIC character might be '\xD1' to '\xD9' which is negative. Really. :: @staticmethod def number_display( buffer, attr ): """Extract a numeric field's value.""" if attr.size_scale_precision.signed: # Fiddle bits to make EBCDIC char from signed digit. last_digit = bytes( [(buffer[-1] & 0x0F) | 0xF0] ) sign = '-' if buffer[-1] >> 4 == 0xD else '' text, size = EBCDIC_File.decoder(last_digit if len(buffer) == 1 else buffer[:-1] + last_digit) return Character_File.number_display( sign+text, attr ) else: text, size = EBCDIC_File.decoder(buffer) return Character_File.number_display( text, attr ) ASCII File ------------------ We could define a subclass for files encoded in ASCII which contain COMP and COMP-3 values. This is left as a future extension.