..    #!/usr/bin/env python3

..  _`cobol_init`: 

#######################################################
COBOL Package -- Extend Schema to Handle EBCDIC
#######################################################

The COBOL package is a (large) Python ``__init__.py`` module which
includes much of the public API for working with COBOL files.

This module extends Stingray in several directions.

-   A new :py:class:`schema.Attribute` subclass, :py:class:`cobol.RepeatingAttribute`.

-   A handy :py:func:`cobol.dump` function.

-   The hierarchy of classes based on :py:class:`cobol.COBOL_File` which provide
    more sophisticated COBOL-based workbooks.
    
Within the package we have the :py:mod:`cobol.loader` module which parses DDE's
to create a schema. 

Module Overheads
=================

..  py:module:: cobol

We depend on :py:mod:`cell`, :py:mod:`schema`, and :py:mod:`workbook`.
We'll also import one class definition from :py:mod:`cobol.defs`.

::

    """stingray.cobol -- Extend the core Stingray definitions to handle COBOL
    DDE's and COBOL files, including packed decimal and EBCDIC data.
    """
    import codecs
    import struct
    import decimal
    import warnings
    import pprint
    import logging

    import stingray.schema
    import stingray.sheet
    from stingray.workbook.fixed import Fixed_Workbook


    from stingray.cobol.defs import TextCell

RepeatingAttribute Subclasses of Attribute
===========================================

Two new :py:class:`schema.Attribute` subclasses are required to carry all the 
additional attribute information developed during COBOL DDE parsing.  

An attribute that has an ``OCCURS`` clause (or whose parent has an ``OCCURS`` clause)
supports the :py:meth:`cobol.RepeatingAttribute.index` method, which accepts the index values used to compute
effective offsets.

There are two variants.

-   The initial, immutable, :py:class:`cobol.RepeatingAttribute` as parsed.

-   A working :py:class:`cobol.IndexedAttribute`. This is a subclass of 
    :py:class:`cobol.RepeatingAttribute` and it contains partial or complete
    indexing. Partial indexing means that a tuple is built by 
    :py:meth:`cobol.COBOL_File.row_get`. Full indexing means that a single
    ``Cell`` can be built.

..  code-block:: none

    http://yuml.me/diagram/scruffy;/class/
    #cobol.attribute,
    [Attribute]^[RepeatingAttribute],
    [Schema]<>-[Attribute],
    [Fixed_Workbook]-uses->[Attribute],
    [Fixed_Workbook]^[COBOL_File],
    [COBOL_File]-uses->[RepeatingAttribute].

..  image:: cobol_attribute.png

In order to fetch data for an ODO ``OCCURS`` element, the attribute offsets and sizes
cannot **all** be computed during parsing. 
They must be computed lazily during data fetching. The :py:class:`cobol.ODO_LazyRow` 
class handles the Occurs Depending On situation.
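
A minimal sketch of why these offsets depend on the data. The layout is hypothetical
(a two-digit ``DISPLAY`` count followed by five-byte items), and ``odo_layout`` is an
illustrative name, not part of the Stingray API.

```python
def odo_layout(record, count_size=2, item_size=5):
    """Split a record whose table length depends on a leading count field.

    Hypothetical layout: a DISPLAY count, then ``count`` fixed-size items,
    then whatever fields follow the table.
    """
    count = int(record[:count_size].decode("ascii"))
    table_start = count_size
    table_end = table_start + count * item_size
    items = [
        record[table_start + i * item_size: table_start + (i + 1) * item_size]
        for i in range(count)
    ]
    # Fields after the table are "variably located": they start at table_end,
    # which is only known once the count has been read from this record.
    return items, table_end


items, tail_offset = odo_layout(b"03AAAAABBBBBCCCCCTAIL")
```

The offset of the trailing field changes from record to record, which is why it
cannot be fixed when the DDE is parsed.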

Here are the attributes inherited from :py:class:`schema.Attribute`.

..  py:attribute:: name

    The attribute name. Available for most kinds of schema.
    
..  py:attribute:: create

    Cell class to create.  If omitted, the class-level
    :py:data:`Attribute.default_cell` will be used.
    By default, this refers to :py:class:`cell.TextCell`.
    
..  py:attribute:: position

    Optional sequential position. This is set by the :py:class:`schema.Schema`
    that contains this object.

Additional values are commonly provided by simple fixed-format file schemata.
Here they can't be treated as simple values, however, since they
change based on the ODO issues.
   
..  py:attribute:: size

    Size within the buffer.

The next two properties are overridden by the :py:class:`cobol.IndexedAttribute` subclass,
which is created by the :py:meth:`cobol.RepeatingAttribute.index` method. 
The superclass versions simply delegate to the DDE. 
If :py:meth:`cobol.RepeatingAttribute.index` is used, a subclass object is built
whose values come from the ``index`` method results.

..  py:attribute:: dimensionality

    A tuple of DDE's that defines the dimensionality pushed down to this
    item through the COBOL DDE hierarchy.

    This may be set by the :py:meth:`cobol.RepeatingAttribute.index` method.

..  py:attribute:: offset

    Optional offset into a buffer. This may be statically defined,
    or it may be dynamic because of variably-located data supporting
    an Occurs Depending On.
    
    This may be set by the :py:meth:`cobol.RepeatingAttribute.index` method.
     
This subclass introduces yet more attribute-like properties that simply
delegate to the DDE.

..  py:attribute:: dde

    A weakref to a :py:class:`cobol.defs.DDE` object.

..  py:attribute:: path

    The ``"."``-separated path from top-level name to this element's name.
    
..  py:attribute:: usage

    The original DDE.usage object, an instance of :py:class:`cobol.defs.Usage`
    
..  py:attribute:: redefines

    The original DDE.allocation object, an instance of :py:class:`cobol.defs.Allocation`
    
..  py:attribute:: picture

    The original DDE.picture object, an instance of :py:class:`cobol.loader.Picture`
    
..  py:attribute:: size_scale_precision

    The original DDE.sizeScalePrecision object, a tuple with size, scale and precision derived
    from the picture.


..  py:class:: RepeatingAttribute

    An attribute with dimensionality. Not all COBOL items repeat.


::


    class RepeatingAttribute( stingray.schema.Attribute ):
        """An attribute with dimensionality. Not all COBOL items repeat.
        
        An "OCCURS" clause will define repeating values. 
        An "OCCURS DEPENDING ON" clause may define variably located values. 
        """
        default_cell= TextCell
        def __init__(self, name, dde, offset=None, size=None, create=None, position=None, **kw):
            self.dde= dde
            self.name, self.size, self.create, self.position = name, size, create, position
            if not self.create:
                self.create= self.default_cell
            if offset is not None:
                warnings.warn( "Offset {0} is ignored; {1} used".format(offset, self.dde().offset), stacklevel=2 )
            self.__dict__.update( kw )
        def __repr__( self ):
            dim= ", ".join( map( repr, self.dimensionality ) )
            return "Attribute( name={0.name!r}, position={0.position}, offset={0.offset}, size={0.size}, dimensionality=({1}) )".format( 
                self, dim )

..  py:method:: RepeatingAttribute.index( *values )

    If the number of index values matches the dimensionality, we'll return a tweaked
    attribute which has just the offset required and a dimensionality of ``tuple()``.

    If the number of index values is insufficient, we'll return a tweaked attribute
    which has the starting offset and the remaining dimensions left unspecified.

    If the number of index values is excessive, we'll attempt to pop from an empty
    list, which raises an ``IndexError``.

    Note that :py:meth:`cobol.RepeatingAttribute.index` is applied incrementally when the application supplies some
    of the indices.

    -   First, an application can supply some of the indices, creating
        :py:class:`cobol.IndexedAttribute` with an initial offset.

    -   Second, the :py:class:`COBOL_File` will supply any remaining indices,
        creating yet more temporary  :py:class:`cobol.IndexedAttribute` based on the initial offset.
 
::    

        def index( self, *values ):
            """Apply possibly incomplete index values to an attribute.
            We do this by cloning this attribute and setting a modified 
            dimensionality and offset.
            
            :param values: 0-based index values.  Yes, legacy COBOL language is 1-based.
                For Python applications, zero-based makes more sense.
            :returns: A :py:class:`cobol.IndexedAttribute` copy, with modified offset
            and dimensionality that can be used with :py:meth:`COBOL_File.row_get`.
            """
            assert values, "Missing index values"
            # Original values for a RepeatingAttribute
            # Modified values for an IndexedAttribute
            offset= self.offset
            dim_list= list(self.dimensionality)
            # Apply given index values.
            val_list= list(values)
            while val_list:
                index= val_list.pop(0)
                dim= dim_list.pop(0)
                offset += dim.size * index
            # Build new subclass object with indexes applied.
            clone= IndexedAttribute( self, offset, dim_list )
            return clone

With this, a ``row.cell(schema.get('name').index(i))`` will compute a proper offset.

We "clone" the attribute to assure that each time we apply (or don't apply)
the index, nothing stateful will have happened to the original immutable attribute
definition. 

Note that an incomplete set of index values forces the underlying 
workbook to create a Python tuple (or tuple of tuples) structure to
contain all the requested values. See :py:meth:`cobol.COBOL_File.row_get`.
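
The offset arithmetic can be tried in isolation. This is a minimal sketch, assuming
each dimension only needs a ``size``; the ``Dim`` tuple and ``apply_index`` function
are hypothetical stand-ins for the DDE and the ``index`` method.

```python
from collections import namedtuple

# Hypothetical stand-in for a DDE with an OCCURS clause; only ``size`` matters.
Dim = namedtuple("Dim", ["name", "size"])


def apply_index(offset, dimensionality, *values):
    """Return (new_offset, remaining_dims) after applying 0-based index values."""
    dims = list(dimensionality)
    for index in values:
        dim = dims.pop(0)  # Raises IndexError if too many values are supplied.
        offset += dim.size * index
    return offset, tuple(dims)


outer = Dim("MONTH", 40)  # Each MONTH occurrence spans 40 bytes.
inner = Dim("WEEK", 10)   # Each WEEK occurrence spans 10 bytes.

full = apply_index(100, (outer, inner), 2, 3)  # Both indices supplied.
partial = apply_index(100, (outer, inner), 2)  # One dimension left open.
```

With both indices the result has no dimensions left, so a single value can be fetched.
With one index, the remaining dimension is what forces a tuple of values.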

The additional properties are simply shortcuts so that a 
generic :py:class:`cobol.RepeatingAttribute` has access to the DDE details.

::

        @property
        def dimensionality(self):
            """tuple of parent DDE's. Baseline value; no indexes applied."""
            return self.dde().dimensionality
        @property
        def offset(self):
            """Baseline value; no indexes applied."""
            return self.dde().offset
        @property
        def path(self):
            return self.dde().pathTo()
        @property
        def usage(self):
            return self.dde().usage
        @property
        def redefines(self):
            return self.dde().allocation
        @property
        def picture(self):
            return self.dde().picture
        @property
        def size_scale_precision(self):
            return self.dde().sizeScalePrecision

..  py:class:: IndexedAttribute

    The IndexedAttribute is a subclass of :py:class:`cobol.RepeatingAttribute` 
    with (some) indices applied. Since this inherits the :py:meth:`cobol.RepeatingAttribute.index`
    method, we can apply indices incrementally.

    This class is not built directly, but only created by :py:meth:`cobol.RepeatingAttribute.index`
    with some (or all) indices applied.

::

    class IndexedAttribute( RepeatingAttribute ):
        """An attribute with dimensionality and indexes applied.
        This must be built from a :py:class:`cobol.RepeatingAttribute`. It will copy
        some attributes in an effort to somewhat improve efficiency.
        """
        default_cell= TextCell
        def __init__(self, base, offset, dimensionality ):
            self.dde= base.dde
            self.name, self.size, self.create, self.position = base.name, base.size, base.create, base.position
            self._offset= offset
            self._dimensionality= dimensionality
        @property
        def dimensionality(self):
            """tuple of DDE's; Set by ``attribute.index()``."""
            return self._dimensionality
        @property
        def offset(self):
            """Set by ``attribute.index()``."""
            return self._offset

COBOL LazyRow
==============

The :py:class:`sheet.LazyRow` class is blissfully unaware of the need to compute
sizes and offsets for COBOL.


..  py:class:: ODO_LazyRow

    This subclass of :py:class:`sheet.LazyRow` adds the ability to recompute sizes
    and offsets in the case of a variably-located DDE due to an Occurs Depending On.

::

    class ODO_LazyRow( stingray.sheet.LazyRow ):
        """If the DDE is variably-located, tweak the sizes and offsets."""
        
        def __init__( self, sheet, **state ):
            """Build the row from the bytes.
            
            :param sheet: the containing sheet.
            :param **state: worksheet-specific state value to save.
            """
            super().__init__( sheet, **state )
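            ddes= self.sheet.schema.info.get('dde',[])
            for dde in ddes:
                if dde.variably_located:
                    dde.setSizeAndOffset(self) 
                self._size= dde.totalSize
            if not ddes:
                # No DDE available: fall back to the physical length of the data.
                # (A ``for``/``else`` clause would run even when DDE's exist.)
                self._size= len(self._state['data'])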


Dump a Record
===============

..  py:function:: dump_iter( aDDE, aRow )

    To support dumping raw data from a record, this will iterate through all items
    in an original DDE. It will yield a five-tuple with (dde, attribute, indices, bytes, Cell)
    for each DDE.

    If the DDE does not have an OCCURS clause, the indices will be an empty tuple.
    Otherwise, each individual combination will be yielded. For big, nested tables, this
    may turn out to be a lot of combinations.

    The bytes is the raw bytes for non-FILLER and non-group elements. 

    The Cell will be a Cell object, either with valid data or an :py:class:`cobol.defs.ErrorCell`.

    
::

    def dump_iter( aDDE, aRow ):
        """Yields iterator over tuples of (dde, attribute, indices, bytes, Cell)"""
        def expand_dims( dimensionality, partial=() ):
            if not dimensionality: 
                yield partial
                return
            top = dimensionality[0].occurs.number(aRow) # Occurrence count, as in ``row_get_index``
            rest= dimensionality[1:]
            for i in range(top):
                for e in expand_dims( rest, partial+(i,) ):
                    yield e
        attr= aDDE.attribute() # Final size and offset details
        if aDDE.dimensionality: 
            for indices in expand_dims( aDDE.dimensionality ):
                yield aDDE, attr, indices, aRow.cell(attr,indices).raw, aRow.cell(attr,indices)
        elif aDDE.picture and aDDE.name != "FILLER":
            yield aDDE, attr, (), aRow.cell(attr).raw, aRow.cell(attr)
        else: # FILLER or group level without a picture: no data is available
            yield aDDE, attr, (), None, None
        for child in aDDE.children:
            #pprint.pprint( child )
            for details in dump_iter( child, aRow ):
                yield details
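
The expansion of index combinations can also be expressed with ``itertools.product``.
This is a sketch of an alternative to the recursive helper, assuming the occurrence
counts are already known as plain integers; ``index_combinations`` is a hypothetical name.

```python
from itertools import product


def index_combinations(*occurs):
    """Return every 0-based index tuple for the given occurrence counts."""
    return list(product(*(range(n) for n in occurs)))


# A table that occurs 2 times, each holding an item that occurs 3 times.
combos = index_combinations(2, 3)

# No OCCURS clause at all: a single, empty index tuple.
scalar = index_combinations()
```

The number of combinations is the product of the occurrence counts, which is why
big nested tables produce long dumps.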

..  py:function:: dump( schema, row )

    Dump data from a record, driven by the original DDE structure.

::

    def dump( schema, aRow ):
        print( "{:45s} {:3s} {:3s} {!s} {!s}".format("Field", "Pos", "Sz", "Raw", "Cell" ) )
        for record in schema.info['dde']:
            for aDDE, attr, indices, raw_bytes, cell in dump_iter(record, aRow):
                print( "{:45s} {:3d} {:3d} {!r} {!s}".format(
                    aDDE.indent*'  '+str(aDDE), aDDE.offset, aDDE.size, 
                    raw_bytes, cell) )
        

COBOL "Workbook" Files
========================

A COBOL file is -- in effect -- a single-sheet workbook with an external schema.
It looks, then, a lot like :py:class:`workbook.Fixed_Workbook`.  
There are three kinds of COBOL files to consider.

-   A pure character file, with Unicode characters in some standard encoding
    like UTF-8 or UTF-16.  This cannot include COMP or COMP-3 fields because
    the codec would make a mess of the bit patterns.

-   An EBCDIC-encoded byte file.  This can include COMP or COMP-3 fields.

-   An ASCII-encoded byte file.  This can include COMP or COMP-3 fields.  
    While this may exist, it seems to be very rare. We don't implement it.

Note that each cell creation involves two features. This leads to a kind of **Double Dispatch** algorithm.  

-   The cell type.  :py:class:`cobol.defs.TextCell`, 
    :py:class:`cobol.defs.NumberDisplayCell`, 
    :py:class:`cobol.defs.NumberComp3Cell` or :py:class:`cobol.defs.NumberCompCell`.

-   The workbook encoding type.  Character or EBCDIC (or ASCII).

This leaves us with a "double-dispatch" problem.
Each workbook subclass needs to implement the methods ``text``, ``number_display``,
``number_comp`` and ``number_comp3``.  

The conversions, while tied to the workbook encoding, aren't properly tied to
stateful sheet and row processing in the workbook.  They're just bound to the 
encoding.  Consequently, we can make them static methods, possibly even 
making this a mixin strategy.
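
The shape of this double dispatch can be sketched with a few small classes.
All of the class names below are hypothetical; they only illustrate how a cell type
picks the method and the workbook encoding supplies the implementation.
The ``cp037`` codec is one common EBCDIC code page and is an assumption here.

```python
import decimal


class CharacterSketch:
    """Conversions for data that is already text."""
    @staticmethod
    def text(buffer):
        return buffer

    @staticmethod
    def number_display(buffer):
        return decimal.Decimal(buffer.strip())


class EBCDICSketch:
    """Conversions for EBCDIC bytes (code page 037 assumed)."""
    @staticmethod
    def text(buffer):
        return buffer.decode("cp037")

    @staticmethod
    def number_display(buffer):
        return decimal.Decimal(buffer.decode("cp037").strip())


class TextCellSketch:
    """First dispatch: the cell type chooses the ``text`` conversion."""
    def __init__(self, raw, workbook):
        self.raw = raw
        self.value = workbook.text(raw)  # Second dispatch: the encoding.


class NumberCellSketch:
    """First dispatch: the cell type chooses ``number_display``."""
    def __init__(self, raw, workbook):
        self.raw = raw
        self.value = workbook.number_display(raw)


name = TextCellSketch(b"\xc1\xc2", EBCDICSketch)
amount = NumberCellSketch(b"\xf1\xf2\xf3", EBCDICSketch)
plain = NumberCellSketch(" 42", CharacterSketch)
```

Because the conversions carry no row or sheet state, static methods are enough.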

The common use case looks like this.

1.  The application uses :code:`row.cell( schema[n] )` to fetch a :py:class:`cell.Cell`.
    The :py:meth:`cobol.ODO_LazyRow.cell` method is simply ``sheet.workbook.row_get( buffer, attribute )``.  
    It applies the cell type (via the schema item's attribute) and the raw data in the row's buffer.

2.  The workbook ``row_get( buffer, attribute )`` has to do the following.

    -   Convert the buffer into a proper value based on the ``attribute`` type
        information **and** the worksheet-specific methods for unpacking the 
        various types of data.  The various :py:mod:`cobol` Cell subclasses
        can refer to the proper conversion methods.
    
    -   Create the required :py:class:`cell.Cell` based on the ``attribute.create`` function.
        See :class:`schema.Attribute`.
    
There's a less common use case to extract a subset of row bytes to populate a 
separate 01-level definition that's not tied to the Workbook's schema.

1.  The application uses ``subrow= row.data( schema[n], other_schema )`` to fetch some bytes that can
    be used to create a new LazyRow tied to a different schema.

2.  The application uses ``subrow.cell( subschema[m] )`` to fetch a :py:class:`cell.Cell`.
    This doesn't go back to the original workbook, it goes to this "subrow" of the
    workbook.

..  code-block:: none

    http://yuml.me/diagram/scruffy;/class/
    #cobol,
    [Fixed_Workbook]^[COBOL_File],
    [COBOL_File]^[Character_File],
    [COBOL_File]^[EBCDIC_File].

..  image:: cobol_file.png
    :width: 6in
    
COBOL File
--------------

..  py:class:: COBOL_File

    This class introduces the expanded version of ``row_get`` that honors
    a schema attribute with dimensionality.

::

    class COBOL_File( Fixed_Workbook ):
        """A COBOL "workbook" file which uses  :py:class:`cobol.RepeatingAttribute` and
        creates COBOL Cell values.  This is an abstraction which
        lacks specific decoding methods.
        
        This is a :py:class:`Fixed_Workbook`: a file with fixed-sized, no-punctuation fields.
        A schema is required to parse the attributes.
        
        The rows are defined as :py:class:`cobol.ODO_LazyRow` instances so that
        bad data can be gracefully skipped over and Occurs Depending On offsets
        can be properly calculated.
        """
        row_class= ODO_LazyRow

..  py:method:: COBOL_File.row_get_index( row, attr, *index )

    Returning a particular Cell from a row, however, is more interesting for COBOL
    because the Attribute may contain an "OCCURS" clause, in which case we may need
    to assemble a tuple of values.

    If there is dimensionality, then take the top-level dimension (``dim[0]``) and
    use it as an iterator to fetch data based on the rest of the dimensions (``dim[1:]``).

    This can assemble a recursive tuple-of-tuples if there are multiple levels
    of dimensionality. 

    If too few index values are provided, a tuple of results is built around the missing values.

    If enough values are provided, a single result object will be built.

..  note:: Performance

    This is the most-used method. Removing the if-statement would be
    a huge improvement.


::

        def row_get_index( self, row, attr, *index ):
            """Emit a nested-tuple structure of Cell values using the given index values.
            :param row: the source Row.
            :param attr: the  :py:class:`cobol.RepeatingAttribute`
                with the original tuple of dimensions,
                or a :py:class:`cobol.IndexedAttribute` which has 
                an offset and partial dimensions. 
            :param index: optional tuple of index values to use.
                Instead of ``row_get( schema.get('name').index(i) )``
                we can use ``row_get_index( schema.get('name'), i )``
            :returns: a (possibly nested) tuple of Cell values matching the dims that lacked
                index values.
            """
            if attr.dimensionality and index:
                # ``attr.index()`` probably not previously used.
                # Apply all remaining values and get the resulting item.
                final= attr.index( *index )
                return self.row_get( row, final )
            elif attr.dimensionality:
                # ``attr.index()`` previously used with partial arg values.
                # Build composite result.
                d= attr.dimensionality[0].occurs.number(row)
                result= []
                for i in range(d):
                    sub= attr.index(i)
                    result.append( self.row_get( row, sub ) )
                return tuple(result)
            else:
                # Doesn't belong here, delegate.
                return self.row_get( row, attr ) 
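
The recursion that builds a tuple-of-tuples can be sketched without any workbook.
This assumes each dimension is a hypothetical ``(occurs, stride)`` pair and that every
element has the same ``size``; the real method works from DDE objects instead.

```python
def nested_get(data, offset, dims, size):
    """Fetch one value, or a nested tuple of values for unindexed dimensions."""
    if not dims:
        # All dimensions resolved: slice out a single value.
        return data[offset:offset + size]
    (occurs, stride), rest = dims[0], dims[1:]
    # Walk the top-level dimension and recurse on the remaining ones.
    return tuple(
        nested_get(data, offset + i * stride, rest, size)
        for i in range(occurs)
    )


data = b"AABBCCDDEEFF"
# Two outer occurrences of 6 bytes, each with three inner occurrences of 2 bytes.
everything = nested_get(data, 0, ((2, 6), (3, 2)), 2)
# Outer index 1 already applied (offset 6): one level of tuple remains.
second_row = nested_get(data, 6, ((3, 2),), 2)
# Both indices applied (offset 6 + 2): a single value.
single = nested_get(data, 8, (), 2)
```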
                
..  py:method:: COBOL_File.row_get( row, attr )

    The API method will get data from a row described by an attribute.
    If the attribute has dimensions, then indices are used or multiple values are returned
    by :py:meth:`cobol.COBOL_File.row_get_index`.

    If the attribute has no dimensions, then it's simply pulled from the source row.

..  note:: Performance

    This is the most-used method. Removing the if-statement would be
    a huge improvement.
    
:: 

        def row_get( self, row, attr ):
            """Create a Cell(s) from the row's data.
            :param row: The current Row
            :param attr: The desired Attribute; possibly tweaked to 
                have an offset and partial dimensions. Or possibly the original.
            :returns: A single Cell or a nested tuple of Cells if indexes
                were not provided.
            """ 
            if attr.dimensionality:
                return self.row_get_index( row, attr )
            else:
                extract= row._state['data'][attr.offset:attr.offset+attr.size]
                return attr.create( extract, self, attr=attr ) 

Note that this depends on the superclass, which depends on ordinary Unicode/ASCII line breaks.
This will not work for EBCDIC files, which may lack appropriate line break characters.
For that, we'll need to use specific physical format parsing helpers based on the 
z/OS RECFM parameter used to define the file.

..  py:method:: COBOL_File.subrow( subschema, text_cell )

    In some COBOL files, there can be 01-level "subrecords" buried within an 01-level record.

    We can use ``wb.subrow(subschema, row.cell(schema_header_dict['GENERIC-FIELD']))``
    to map a particular field ('GENERIC-FIELD') to an entire 01-level schema, creating
    a "subrow" from a single field within the parent row.

::

        def subrow( self, subschema, text_cell ):
            """Build a row-like object from a single field.
            
            :param subschema: a schema built from an 01-level DDE.
            :param text_cell: a specific text cell to use.
            """
            subrow = self.row_class(
                stingray.sheet.ExternalSchemaSheet( self, "", subschema ),
                data= text_cell.raw,
            )
            return subrow


Character File
-----------------


..  py:class:: Character_File

    This is a subclass of :py:class:`COBOL_File` that handles COBOL data parsing
    where the underlying file is text. Since the file is text, Python handles
    any OS-level bytes-to-text conversions.

::

    class Character_File( COBOL_File ):
        """A COBOL "workbook" file with decoding functions for
        proper character data.
        """

The following functions are used to do data conversions for COBOL Character files.  
Text is easy: Python's ``io.open`` has already handled this.

::

        @staticmethod
        def text( buffer, attr ): 
            """Extract a text field's value."""
            return buffer 

Numeric data with usage ``DISPLAY`` is essentially text. In some cases, the
picture has ``V``, which means that we must handle this implicit decimal point.
The "display" feature is the COBOL default: everything is plain text.

..  note:: The core rule for character files

    Leading separate sign is the default for character files.
    
    COBOL can support other kinds of signs. This conversion doesn't.
    
    
::

        @staticmethod
        def number_display( buffer, attr ):
            """Extract a numeric field's value.
            Based on leading, separate sign.
            """
            final, alpha, length, scale, precision, signed, dec_sign = attr.size_scale_precision
            try:
                display=buffer.strip()
                if precision != 0 and dec_sign == 'V':
                    display= display[:-precision]+"."+display[-precision:]
                return decimal.Decimal( display )
            except Exception:
                Character_File.log.debug( "Can't process {0!r} from {1!r}".format(display,buffer) )
                raise
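
The implied decimal point handling can be exercised on its own. This is a minimal
sketch, assuming a leading separate sign and a field that is zero-padded to at least
``precision`` digits; ``display_to_decimal`` is a hypothetical helper name.

```python
import decimal


def display_to_decimal(text, precision):
    """Convert DISPLAY text with an implied ``V`` decimal point."""
    display = text.strip()
    if precision:
        # Insert the decimal point ``precision`` digits from the right.
        display = display[:-precision] + "." + display[-precision:]
    return decimal.Decimal(display)


price = display_to_decimal(" 12345", 2)     # PIC 999V99
refund = display_to_decimal("-00150", 2)    # Leading separate sign.
count = display_to_decimal("0042", 0)       # No implied decimal point.
```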

COMP-3 in proper character files may not make any sense at all.  
A codec would make a hash of the bit patterns required.  
However, we've defined the method here so that it can be used by the EBCDIC subclass
trivially.

We're going to build an ASCII version of the number by decoding the bytes into
a mutable bytearray and decorating them with decimal point and sign. This is 
demonstrably faster and avoids object creation to the extent possible.


::

        @staticmethod
        def unpack( buffer ):
            """Include ' ' position for leading sign character.
            Trailing sign field will be 48+0xd for negative.
            48+0xf is "unsigned" and 48+0xc is positive.
            """
            yield 32 # ord(b' ')
            for n in buffer:
                yield 48+(n>>4) # ord(b'0')
                yield 48+(n&0x0f)

        @staticmethod
        def number_comp3( buffer, attr ):
            """Decode comp-3, packed decimal values.
    
            Each byte is two decimal digits.
    
            Last byte has a digit plus sign information: 0xd is <0, 0xf is unsigned, and 0xc >=0.
            """
            final, alpha, length, scale, precision, signed, dec_sign = attr.size_scale_precision
            #print( repr(buffer), "from", repr(display) )
            digits = bytearray( Character_File.unpack( buffer ) )
            # Proper sign in front; replace trailing sign with space.
            digits[0]= 45 if digits[-1]==48+0xd else 32 # ord(b'-'), ord(b' ')
            digits[-1]= 32 # ord(' ') 
            # Add decimal place if needed.
            if precision:
                digits[-precision:]= digits[-precision-1:-1] # Shift digits to right.
                digits[-precision-1]= 46 # Insert ord(b'.')
            try:
                return decimal.Decimal( digits.decode("ASCII") )
            except Exception:
                Character_File.log.debug( "Can't process {0!r} from {1!r}".format(digits,buffer) )
                raise
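
For comparison, here is a sketch of the same decoding built from digit nibbles rather
than a decorated bytearray. It is a different technique from the method above, shown
only to make the packed layout concrete; ``comp3_to_decimal`` is a hypothetical name,
and only the ``0xd`` sign nibble is treated as negative, as described above.

```python
import decimal


def comp3_to_decimal(buffer, precision=0):
    """Decode packed decimal bytes: two nibbles per byte, last nibble is the sign."""
    nibbles = []
    for byte in buffer:
        nibbles.append(byte >> 4)     # High-order digit.
        nibbles.append(byte & 0x0F)   # Low-order digit (or the sign).
    sign_nibble = nibbles.pop()
    sign = 1 if sign_nibble == 0x0D else 0
    # The (sign, digits, exponent) form builds the value exactly.
    return decimal.Decimal((sign, tuple(nibbles), -precision))


positive = comp3_to_decimal(b"\x12\x3c", 2)   # Digits 1 2 3, sign 0xc.
negative = comp3_to_decimal(b"\x01\x23\x4d")  # Digits 0 1 2 3 4, sign 0xd.
```

A nibble above 9 in a digit position makes ``decimal.Decimal`` raise ``ValueError``,
which is a reasonable signal for corrupt data.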
        
COMP in proper character files may not make any sense, either. 
A codec would make a hash of the bit patterns required.  
Again, we've defined it here because that's relatively simple to extend.

We're simply going to unpack big-endian bytes.

::

        @staticmethod
        def number_comp( buffer, attr ):
            """Decode comp, binary values."""
            final, alpha, length, scale, precision, signed, dec_sign = attr.size_scale_precision
            if length <= 4:
                sc, bytes = '>h', 2
            elif length <= 9:
                sc, bytes = '>i', 4
            else:
                sc, bytes = '>q', 8
            n= struct.unpack( sc, buffer )
            return decimal.Decimal( n[0] )
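
The big-endian unpacking can be checked with ``int.from_bytes``, which gives the same
answer as ``struct.unpack`` for these sizes. This is a sketch; ``comp_size`` and
``comp_to_int`` are hypothetical helpers that mirror the digit thresholds used above.

```python
import struct


def comp_size(digits):
    """Bytes of storage for a COMP field with the given number of digits."""
    if digits <= 4:
        return 2
    elif digits <= 9:
        return 4
    return 8


def comp_to_int(buffer, signed=True):
    """Interpret the bytes as a big-endian binary integer."""
    return int.from_bytes(buffer, byteorder="big", signed=signed)


small = comp_to_int(b"\x00\x2a")
negative = comp_to_int(b"\xff\xfe")
same_as_struct = struct.unpack(">h", b"\xff\xfe")[0]
```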
        
Class-level logger

::

    Character_File.log= logging.getLogger( Character_File.__qualname__ )
    
EBCDIC File
---------------

The EBCDIC files require specific physical "Record Format" (RECFM) assistance.
These classes define a number of z/OS RECFM conversions. We recognize four
actual RECFM's plus an additional special case.

-   F - Fixed.

-   FB - Fixed Blocked.

-   V - Variable, data must have the RDW word preserved.

-   VB - Variable Blocked, data must have BDW and RDW words.

-   N - Variable, but no BDW or RDW words. This involves some buffer management
    magic to recover the records properly.

..  note::  IBM z/Architecture mainframes are all big-endian

..  py:class:: RECFM_Parser

    This class hierarchy breaks up EBCDIC files into records. 


::

    class RECFM_Parser:
        """Parse a physical file format."""
        def record_iter( self ):
            """Return each physical record, stripped of headers."""
            raise NotImplementedError
        def used( self, bytes ):
            """The number of bytes actually consumed.
            Only really relevant for RECFM_N subclass to handle variable-length
            records with no RDW/BDW overheads.
            """
            pass

..  py:class:: RECFM_F

    Simple fixed-length records. No header words.

::

    class RECFM_F(RECFM_Parser):
        """Parse RECFM=F; the lrecl is the length of each record."""
        def __init__( self, source, lrecl=None ):
            """
            :param source: the file
            :param lrecl: the record length.
            """
            super().__init__()
            self.source= source
            self.lrecl= lrecl
        def record_iter( self ):
            data= self.source.read(self.lrecl)
            while len(data) != 0:
                yield data
                data= self.source.read(self.lrecl)
        def rdw_iter( self ):
            """Yield rows with RDW, effectively RECFM_V format."""
            for row in self.record_iter():
                yield struct.pack( ">H2x", len(row)+4 )+row
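
Fixed-length reading can be tried against an in-memory file. This is a minimal sketch
using ``io.BytesIO`` as the source; ``fixed_records`` and ``with_rdw`` are hypothetical
names that imitate ``record_iter`` and ``rdw_iter``.

```python
import io
import struct


def fixed_records(source, lrecl):
    """Yield successive ``lrecl``-byte records until the source is exhausted."""
    return iter(lambda: source.read(lrecl), b"")


def with_rdw(record):
    """Prefix a record with a 4-byte RDW: big-endian length, then two zero bytes."""
    return struct.pack(">H2x", len(record) + 4) + record


records = list(fixed_records(io.BytesIO(b"AAAABBBBCC"), 4))
wrapped = with_rdw(b"AAAA")
```

Note that a short final record is still yielded; the reader does not check that the
file length is a multiple of the record length.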

..  py:class:: RECFM_FB

    Simple fixed-blocked records. No header words.

::

    class RECFM_FB( RECFM_F ):
        """Parse RECFM=FB; the lrecl is the length of each record.
        
        It's not clear that there's any difference between F and FB.
        """
        pass
        
..  py:class:: RECFM_V

    Variable-length records. Each record has an RDW header word with the length.

::

    class RECFM_V(RECFM_Parser):
        """Parse RECFM=V; the lrecl is a maximum, which we ignore."""
        def __init__( self, source, lrecl=None ):
            """
            :param source: the file
            :param lrecl: a maximum, but it's ignored.
            """
            super().__init__()
            self.source= source
        def record_iter( self ):
            """Iterate over records, stripped of RDW's."""
            for rdw, row in self._data_iter():
                yield row
        def rdw_iter( self ):
            """Iterate over records which include the 4-byte RDW."""
            for rdw, row in self._data_iter():
                yield rdw+row        
        def _data_iter( self ):
            rdw= self.source.read(4)
            while len(rdw) != 0:
                size = struct.unpack( ">H2x", rdw )[0]
                data= self.source.read( size-4 )
                yield rdw, data
                rdw= self.source.read(4)
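
A round trip through RDW headers shows the framing. This is a sketch under the
assumption of well-formed data, with ``io.BytesIO`` standing in for the file;
``with_rdw`` and ``read_rdw_records`` are hypothetical names.

```python
import io
import struct


def with_rdw(payload):
    """Prefix a payload with its RDW; the length includes the 4 header bytes."""
    return struct.pack(">H2x", len(payload) + 4) + payload


def read_rdw_records(source):
    """Yield each payload, stripped of its RDW."""
    rdw = source.read(4)
    while rdw:
        (size,) = struct.unpack(">H2x", rdw)
        yield source.read(size - 4)
        rdw = source.read(4)


stream = io.BytesIO(with_rdw(b"ABC") + with_rdw(b"HELLO"))
records = list(read_rdw_records(stream))
```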
                
We might want to implement the :py:meth:`RECFM_Parser.used` method to compare the number of bytes
used against the RDW size.

..  py:class:: RECFM_VB

    Variable-length, blocked records. Each block has a BDW; each record has an RDW header word.
    These BDW and RDW describe the structure of the file.

::

    class RECFM_VB(RECFM_Parser):
        """Parse RECFM=VB; the lrecl is a maximum, which we ignore."""
        def __init__( self, source, lrecl=None ):
            """
            :param source: the file
            :param lrecl: a maximum, but it's ignored.
            """
            super().__init__()
            self.source= source
        def record_iter( self ):
            """Iterate over records, stripped of RDW's."""
            for rdw, row in self._data_iter():
                yield row
        def rdw_iter( self ):
            """Iterate over records which include the 4-byte RDW."""
            for rdw, row in self._data_iter():
                yield rdw+row        
        def bdw_iter( self ):
            """Iterate over blocks, which include 4-byte BDW and records with 4-byte RDW's."""
            bdw= self.source.read(4)
            while len(bdw) != 0:
                blksize = struct.unpack( ">H2x", bdw )[0]
                block_data= self.source.read( blksize-4 )
                yield bdw+block_data
                bdw= self.source.read(4)
        def _data_iter( self ):
            bdw= self.source.read(4)
            while len(bdw) != 0:
                blksize = struct.unpack( ">H2x", bdw )[0]
                block_data= self.source.read( blksize-4 )
                offset= 0
                while offset != len(block_data): 
                    assert offset+4 < len(block_data), "Corrupted Data Block {!r}".format(block_data)
                    rdw= block_data[offset:offset+4]
                    size= struct.unpack( ">H2x", rdw )[0]
                    yield rdw, block_data[offset+4:offset+size]
                    offset += size
                bdw= self.source.read(4)
                
We might want to implement a generic :py:meth:`RECFM_Parser.used` method to compare the number of bytes
used against the RDW size and raise an exception in the event of a mismatch.
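
The nesting of RDW's inside a BDW can be sketched in isolation. This is a simplified stand-in for the class above, assuming well-formed blocks; ``vb_records`` and ``descriptor`` are hypothetical names used only here.

..  code-block:: python

    import io
    import struct

    def vb_records(source):
        """Yield record payloads from a file-like object of RECFM=VB blocks."""
        bdw = source.read(4)
        while len(bdw) != 0:
            blksize = struct.unpack(">H2x", bdw)[0]      # block length, BDW included
            block = source.read(blksize - 4)
            offset = 0
            while offset < len(block):
                rdw = block[offset:offset + 4]
                size = struct.unpack(">H2x", rdw)[0]     # record length, RDW included
                yield block[offset + 4:offset + size]
                offset += size
            bdw = source.read(4)

    def descriptor(length):
        """Build a 4-byte BDW or RDW (hypothetical helper for sample data)."""
        return struct.pack(">H2x", length)

    rec1 = descriptor(4 + 3) + b"\xC1\xC2\xC3"
    rec2 = descriptor(4 + 2) + b"\xF1\xF2"
    block = descriptor(4 + len(rec1) + len(rec2)) + rec1 + rec2
    payloads = list(vb_records(io.BytesIO(block + block)))

Each block here is 17 bytes: a 4-byte BDW, a 7-byte record, and a 6-byte record. The sketch does no validation, so a corrupted RDW would raise ``struct.error`` or misalign the remaining records.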

..  py:class:: RECFM_N

    Variable-length records without RDW's. This is exasperating because we have to keep
    feeding bytes into the buffer until the record is complete.

::

    class RECFM_N:
        """Parse RECFM=V without RDW (or RECFM=VB without BDW or RDW).
        The lrecl is ignored.
        """
        def __init__( self, source, lrecl=None ):
            """
            :param source: the file
            :param lrecl: a maximum, but it's ignored.
            """
            super().__init__()
            self.source= source
            self.buffer= self.source.read( 32768 )
        def record_iter( self ):
            while len(self.buffer) != 0:
                yield self.buffer
                # What if used() is not called? This will loop forever!
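        def used( self, bytes ):
            #print( "Consumed {0} Bytes".format(bytes) )
            # Drop the consumed bytes, then refill with the same number of bytes
            # so the buffer does not grow beyond its original size.
            self.buffer= self.buffer[bytes:]+self.source.read(bytes)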

..  py:class:: EBCDIC_File

    This subclass handles EBCDIC conversion and COMP-3
    packed decimal numbers.  For this to work, the schema needs to use slightly different Cell-type conversions.  

    Otherwise, this is similar to processing simple character data.


::

    class EBCDIC_File( Character_File ):
        """A COBOL "workbook" file with decoding functions for
        EBCDIC data. If a file_object is provided, it must be 
        opened in byte mode, and no decoder can be used.
        """
        decoder= codecs.getdecoder('cp037')
        def __init__( self, name, file_object=None, schema=None, RECFM="N" ):
            """Prepare the workbook for reading.
            :param name: File name
            :param file_object: Optional file-like object.  If omitted, the named file is opened.
                The object must be opened in byte mode; no decoder should be used.
            :param schema: The schema to use.
            :param RECFM: The legacy z/OS RECFM to use. This must be one
                of "F", "FB", "V", "VB", or "N" (the default). This is translated to an appropriate
                RECFM class: RECFM_F, RECFM_FB, RECFM_V, RECFM_VB, or RECFM_N.
            """
            super().__init__( name, file_object, schema )
            if self.file_obj:
                self.the_file= None
                self.wb= self.file_obj
            else:
                self.the_file = open( name, 'rb' )
                self.wb= self.the_file
            self.schema= schema
            parser_class= {
                "F" : RECFM_F, 
                "FB": RECFM_FB, 
                "V" : RECFM_V,
                "VB": RECFM_VB,
                "N":  RECFM_N,
                }[RECFM]
            self.parser= parser_class(self.wb, schema.lrecl())

..  py:method:: EBCDIC_File.rows_of( sheet )

    We must extend the :py:meth:`workbook.Character_File.rows_of` method to deal with 
    two issues:

    -   If the schema depends on a variably located DDE, then we need to do the 
        :py:func:`cobol.defs.setSizeAndOffset` function using the DDE.
        This is done automagically by the :py:class:`cobol.ODO_LazyRow` object.
    
    -   The legacy z/OS RECFM details. 

        *   We might have F or FB files, which are simply
            long runs of EBCDIC bytes with no line breaks.
            The LRECL must match the DDE.
        
        *   We might have V (or VB) files, which have a 4-byte header on each row (plus, for VB, a 4-byte header on each block).
            The LRECL doesn't matter.
        
        *   We can tolerate the awful situation where it's variable length (Occurs Depending On)
            but there are no RECFM=V or RECFM=VB header words. We call this RECFM=N.
            We fetch an oversized buffer and push back bytes beyond the end of the record.
    
        This means that ``super().rows_of( sheet )`` has been replaced with a RECFM-aware
        byte parser. This byte parser may involve a back-and-forth to handle RECFM=N.
        In the case of RECFM=N, we provide an oversized buffer (32768 bytes); after
        any size and offset calculations, ``row._size`` shows how many bytes were
        actually used.

::

        def rows_of( self, sheet ):
            """Iterate through all "rows" of this "sheet". 
            Really, this means all records of this COBOL file.
            
            Note the handshake with RECFM parser to show how many
            bytes were really needed.  For RECFM_N, this is important.
            For other RECFM, this is ignored.
            
            :py:class:`cobol.ODO_LazyRow` may adjust the schema 
            if it has an Occurs Depending On.
            """
            for data in self.parser.record_iter():
                row= ODO_LazyRow( sheet, data=data )
                self.parser.used(sheet.schema.lrecl())
                yield row
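
The RECFM=N handshake can be sketched with a toy reader. This is only an illustration under invented assumptions: ``UndelimitedReader`` is a hypothetical class, the buffer is 4 bytes instead of 32768, and the record layout (first byte holds the total record length) is made up so that the consumer has a way to compute a size.

..  code-block:: python

    import io

    class UndelimitedReader:
        """Toy stand-in: offer an oversized buffer, then drop what the consumer used."""
        def __init__(self, source, bufsize=32768):
            self.source = source
            self.buffer = source.read(bufsize)
        def record_iter(self):
            # Yields the current buffer; loops forever if used() is never called.
            while len(self.buffer) != 0:
                yield self.buffer
        def used(self, count):
            # Discard consumed bytes and refill by the same amount (a choice made for this sketch).
            self.buffer = self.buffer[count:] + self.source.read(count)

    # Hypothetical layout: the first byte of each record is its total length.
    reader = UndelimitedReader(io.BytesIO(b"\x03ab\x02c\x04xyz"), bufsize=4)
    records = []
    for buffer in reader.record_iter():
        size = buffer[0]                 # the consumer works out the real size
        records.append(buffer[:size])
        reader.used(size)                # tell the reader how much to push past

The consumer, not the reader, decides where each record ends. That is why ``rows_of`` must report a size back to the parser after building each row.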

The following functions are used to do data conversions for COBOL EBCDIC files.  
Text requires using a codec to translate EBCDIC-encoded characters.
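
A quick check of the codec's behavior, using only the standard library. The sample bytes are invented for illustration; the decoder returns both the decoded text and the number of bytes consumed.

..  code-block:: python

    import codecs

    decoder = codecs.getdecoder("cp037")

    # EBCDIC (code page 037) bytes for the letters H, E, L, L, O.
    text, consumed = decoder(b"\xC8\xC5\xD3\xD3\xD6")

    # EBCDIC digits occupy 0xF0 through 0xF9.
    digits, _ = decoder(b"\xF1\xF2\xF3")

Here ``text`` is ``"HELLO"``, ``consumed`` is 5, and ``digits`` is ``"123"``.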

::

        @staticmethod
        def text( buffer, attr ): 
            """Extract a text field's value."""
            text, size = EBCDIC_File.decoder(buffer)
            return text

When a number's usage is ``DISPLAY``, it's text: 
we simply convert the bytes from EBCDIC to Unicode
and treat them more-or-less like a text field.

Note the subtlety around "Signed" display fields. The last byte
includes a sign in addition to the digit: the high-order nibble carries
the sign and the low-order nibble carries the digit.

-   The last EBCDIC character might be ``'\xF0'`` to ``'\xF9'``, which is unsigned.
    
-   The last EBCDIC character might be ``'\xC0'`` to ``'\xC9'``, which is positive.

-   The last EBCDIC character might be ``'\xD0'`` to ``'\xD9'``, which is negative.
    
Really.
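
The sign handling can be tried out in isolation. This sketch assumes integer fields only (it ignores any implied decimal point) and recognizes only the ``0xD`` zone as negative; ``zoned_to_int`` is a hypothetical helper, not part of this module.

..  code-block:: python

    import codecs

    decode_cp037 = codecs.getdecoder("cp037")

    def zoned_to_int(buffer):
        """Convert signed zoned-decimal EBCDIC bytes to an int (sketch only)."""
        zone = buffer[-1] >> 4                               # sign nibble of the last byte
        last_digit = bytes([(buffer[-1] & 0x0F) | 0xF0])     # rewrite it as a plain digit
        text, _ = decode_cp037(buffer[:-1] + last_digit)
        return -int(text) if zone == 0xD else int(text)

    positive = zoned_to_int(b"\xF1\xF2\xC3")    # zone 0xC on the last byte
    negative = zoned_to_int(b"\xF1\xF2\xD3")    # zone 0xD on the last byte
    unsigned = zoned_to_int(b"\xF1\xF2\xF3")    # zone 0xF on the last byte

The three calls give 123, -123, and 123 respectively.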


::

        @staticmethod
        def number_display( buffer, attr ):
            """Extract a numeric field's value."""
            if attr.size_scale_precision.signed:
                # The high nibble of the last byte carries the sign: 0xD is negative.
                # Force that nibble to 0xF so the byte decodes as an ordinary EBCDIC digit.
                last_digit = bytes( [(buffer[-1] & 0x0F) | 0xF0] )
                sign = '-' if buffer[-1] >> 4 == 0xD else ''
                # buffer[:-1] is empty for a one-byte field, so no special case is needed.
                text, size = EBCDIC_File.decoder(buffer[:-1] + last_digit)
                return Character_File.number_display( sign+text, attr )
            else:
                text, size = EBCDIC_File.decoder(buffer)
                return Character_File.number_display( text, attr )        

ASCII File
------------------

We could define a subclass for files encoded in ASCII which contain COMP and COMP-3 values.

This is left as a future extension.