.. #!/usr/bin/env python3 .. _`cells`: ######################################################## Cell Module -- Data Element Containers and Conversions ######################################################## .. py:module:: cell The point of a :py:class:`cell.Cell` is two-fold: - **Capture**. That is, decode the information in the source file into a Python object that represents the source spreadsheet value. For XLS or XLSX formats, there are a variety of cell data types. For CSV, all cells are text. For a fixed format file, we may have to exploit the physical format information to properly decode the bytes or characters. We may even have to cope with EBCDIC or packed decimal conversions. - **Convert**. Provide the cell value coerced to another Python type. This is rather open-ended; it's hard to provide a comprehensive list of types to which data can be converted. Capture Use Case ==================== The **Capture** use case is defined by our physical formats. We have several places to look for source data formats: - :py:mod:`xlrd`. - XSLX (ECMA 376). - Apple Numbers. ``xlrd`` identifies the follow cell types found in XLS workbooks. .. csv-table:: :header: "Type symbol","Type number","Python value" :widths: 33, 5, 55 XL_CELL_EMPTY,0,"empty string ''" XL_CELL_TEXT,1,"a Unicode string" XL_CELL_NUMBER,2,"float" XL_CELL_DATE,3,"float" XL_CELL_BOOLEAN,4,"int; 1 means TRUE, 0 means FALSE" XL_CELL_ERROR,5,"int representing internal Excel codes; for a text representation, refer to the supplied dictionary error_text_from_code" XL_CELL_BLANK,6,"empty string ''. Note: this type will appear only when open_workbook(..., formatting_info=True) is used." An XLSX (per ECMA 376, section 18.18.11) or ODS provides a similar list of cell types. The data is always encoded as a proper string that can be converted (if necessary) based on the type code. .. csv-table:: :header: "Enumeration Value","Description" :widths: 22, 55 ``b`` (Boolean),Cell containing a boolean. ``d`` (Date),Cell contains a date in the ISO 8601 format. ``e`` (Error),Cell containing an error. ``inlineStr`` (Inline String),"Cell containing an (inline) rich string, i.e., one not in the shared string table. If this cell type is used, then the cell value is in the is element rather than the v element in the cell (c element)." ``n`` (Number),Cell containing a number. ``s`` (Shared String),Cell containing a shared string. ``str`` (String),Cell containing a formula string. Dates formatted as strings are -- always -- a problem. There's no generic solution. An application may need to write suitable extensions to handle this. See below under `Conversion Functions`_. We should depend on the :py:mod:`locale` module to provide proper format strings for converting between date and string. An Apple Numbers spreadsheet appears to have the following cell types. Some of the tags appear to have no value, so their purpose is unclear. .. csv-table:: :header: "Tag","Description" :widths: 22, 55 ``d`` (Date),Cell containing a date ``f`` (Formula),Cell containing a formula ``g`` (Empty),Two or more empty cells ``n`` (Number),Cell containing a number ``o`` (?),One empty cell ``s`` (?), ``t`` (Text),Cell containing text ``pm`` (Popup Menu),A popup menu of otherc cell values This provides a reasonably complete set of source data types. They vary by physical format. They all have a common history, starting with VisiCalc and including MultiPlan. Here's a summary: .. csv-table:: :header: "Type","XLRD","XLSX","Numbers" :widths: 22,22,22,22 Empty,Two Varieties,No,Two Varieties Text,Yes,Three Varieties,Yes Number,Yes,Yes,Yes Date,Two Encodings,Yes,Yes Boolean,Yes,Yes,No Error,Yes,Yes,No Convert Use Cases ======================= There are several use cases for output conversion (or "transformation"). - Trivial. ``float``, ``str`` or an empty cell. Essentially, we're using the captured data type directly. - Easy. ``float`` or ``str`` to ``decimal``. Generally, currency fields are stored as ``float`` in the workbook; this needs to be coverted to ``decimal`` to be useful. - Obscure. ``datetime`` based on ``float``. ``xlrd`` handles this elegantly. - Variable. ``datetime`` based on ``str``. The variability becomes rather complex. It's also application-specific, since it depends on the source of the data. - Horrible. Digit strings. US Zip codes. Social Security Numbers. Phone numbers without punctuation. These are digit strings which a spreadsheet application may transform to a floating point number; these need to be rebuilt as proper digit strings with leading zeroes. We'd like code that looks like these examples: .. parsed-literal:: "{0} has {1}".format( foo.to_str(), foo.to_float() ) today= bar.to_datetime().date() zip_code.to_digit_str(5) This **Convert** aspect of a :py:class:`cell.Cell` is part of *using* the logical layout. We'll address that under :ref:`developer`, below. Model ====== .. code-block:: none http://yuml.me/diagram/scruffy;/class/ #cell, [Cell]^[EmptyCell], [Cell]^[TextCell], [Cell]^[NumberCell], [Cell]^[FloatDateCell], [Cell]^[BooleanCell], [Cell]^[ErrorCell]. .. image:: cell.png :width: 6in Circularity ============ Note that a Cell depends on a Workbook. And a Workbook (via a Sheet and a Row) depends on a Cell. In some languages, we'd be obligated to define interfaces so that these two classes could depend on each other sensibly. In Python, however, we don't need to create an elaborate web of dependencies. Overheads =========== Here are the module docstring and imports. :: """stingray.cell -- Defines Cell as the atomic data element in a sheet of a workbook. A cell has a value, it's part of a workbook. """ import locale import decimal import datetime import time from collections import Hashable A version string. :: __version__ = "4.4.8" Just to be sure that any locale-based processing will actually work, we establish a default locale. :: locale.setlocale(locale.LC_ALL, '') Cell ======= .. py:class:: Cell The :py:class:`cell.Cell` class hierarchy extends this base class. Note that we have a relatively short list of built-in conversions. For more complex, application-specific conversions, the raw :py:attr:`value` is available as a property. .. py:attribute:: value The raw data, often a String from a workbook. May also be a sequence of bytes for COBOL. .. py:attribute:: workbook The :py:class:`workbook.base.Workbook` that created this Cell. This is largely used for Excel date conversions, but there could be other context needs for lazy access to data. :: class Cell( Hashable ): """A class hierarchy for each kind of Cell. """ def __init__( self, value=None, workbook=None ): """Build a new Cell; the atomic data element of a workbook. :param value: Raw value, generally a string for most workbooks. :param workbook: Parent workbook, required for some conversions and for lazy access to data. """ self._value, self.workbook = value, workbook def __repr__( self ): return "{0}({1!r})".format( self.__class__.__name__, self._value ) def is_empty( self ): return self._value is None def to_int( self ): return NotImplemented def to_float( self ): return NotImplemented def to_decimal( self, digits=None ): return NotImplemented def to_str( self ): return NotImplemented def to_datetime( self, format=None ): return NotImplemented def to_digit_str( self, len=5 ): return NotImplemented One feature of a cell that's required when we do data profiling is to create a usable hash from the cell class and raw data value. :: def __hash__( self ): return hash(self._value) ^ hash(self.__class__) def __eq__( self, other ): return self.__class__ == other.__class__ and self._value == other._value def __ne__( self, other ): return self.__class__ != other.__class__ or self._value != other._value We make a token effort at making a cell more-or-less immutable. This makes it hashable. :: @property def value( self ): return self._value .. todo:: Unit test cases for the hashable interface of Cell EmptyCell ============ .. py:class:: EmptyCell An ``EmptyCell`` implements empty cells. :py:mod:`xlrd` may report them as a type ``XL_CELL_EMPTY``. A Numbers spreadsheet may use the ```` or ```` tag. :: class EmptyCell( Cell ): """The *value* will be '', but we ignore that.""" def is_empty( self ): return True def to_int( self ): return None def to_float( self ): return None def to_decimal( self, digits=None ): return None def to_str( self ): return None def to_datetime( self, format=None ): return None def to_digit_str( self, len=None ): return None TextCell ============ .. py:class:: TextCell A ``TextCell`` implements the cells with text values. :py:mod:`xlrd` may report them as a type ``XL_CELL_TEXT``. It's often possible to interpret the text as some other value, so the conversions make reasonable attempts at that. This is used for CSV workbooks as well as XLS workbooks. This is the default type for Fixed format files, also. Note that COBOL files will explicitly have bytes values, not string values. :: class TextCell( Cell ): """A Cell which contains a Python string value.""" def to_int( self ): return int( self.value ) def to_float( self ): return float( self.value ) def to_decimal( self, digits=0 ): return decimal.Decimal( self.value ) def to_str( self ): return self.value def to_datetime( self, format=None ): if format is None: try: format = locale.nl_langinfo(locale.D_FMT) except AttributeError as e: # Windows format = "%x" return datetime.datetime.strptime(self.value,format) def to_digit_str( self, length=5 ): txt= "{0:0>{length}d}".format(int(self.value), length=length) return txt NumberCell ============ .. py:class:: NumberCell A ``NumberCell`` implements the cells with a float value. :py:mod:`xlrd` may report them as a type ``XL_CELL_NUMBER``. A variety of conversions make sense for a number value. :: class NumberCell( Cell ): """A cell which contains a Python float value.""" def to_int( self ): return int( self.value ) def to_float( self ): if isinstance(self.value,float): return self.value # likely, it's Decimal! return float(self.value) def to_decimal( self, digits=0 ): if isinstance(self.value,float): fmt= "{0:0.{digits}f}" return decimal.Decimal( fmt.format(self.value, digits=digits) ) elif isinstance(self.value,decimal.Decimal): return self.value else: return decimal.Decimal(self.value) def to_str( self ): return str(self.value) def to_datetime( self, format=None ): assert format is None, "Format is not used." return self.workbook.float_to_date(self.value) #try: # dt= xlrd.xldate_as_tuple(self.value, self.workbook.datemode) #except xlrd.xldate.XLDateAmbiguous as e: # ex= ValueError( "Ambiguous Date: {0}".format(self.value) ) # raise ex from e #return datetime.datetime(*dt) def to_digit_str( self, length=5 ): txt= "{0:0>{length}d}".format(int(self.value), length=length) return txt FloatDateCell =============== .. py:class:: FloatDateCell A ``FloatDateCell`` implements the cells with ``XL_CELL_DATE``. Since the conversions are all identical to number, we simply inherit the features of a number. :: class FloatDateCell( NumberCell ): """A cell which contains a float value that is actually an Excel date.""" pass Other formats have other kinds of date cells that aren't simply dressed-up floating-point numbers. BooleanCell ============ .. py:class:: BooleanCell A ``BooleanCell`` implements the cells with ``XL_CELL_BOOLEAN``. Since the conversions are all identical to number, we simply inherit the features. :: class BooleanCell( NumberCell ): """A cell which contains a boolean value.""" pass ErrorCell ============ .. py:class:: ErrorCell An ``ErrorCell`` implements the cells with ``XL_CELL_ERROR``. The only sensible conversion is :py:meth:`ErrorCell.to_str` which reports the error string for the cell. :: class ErrorCell( Cell ): """A cell which contains an error code.""" def to_int( self ): raise ValueError( self.value ) def to_float( self ): raise ValueError( self.value ) def to_decimal( self, digits=0 ): raise ValueError( self.value ) def to_str( self ): return self.value def to_datetime( self, format=None ): raise ValueError( self.value ) def to_digit_str( self, length=5 ): raise ValueError( self.value ) DateCell ============ .. py:class:: DateCell A ``DateCell`` implements a cell with a proper date-time value. This is a value which did not come from a workbook float value. This could be a parsed string, for example. A variety of conversions make sense for a proper date value. :: class DateCell( Cell ): """A cell which contains a proper :py:mod:`datetime` value.""" def to_int( self ): return int(self.to_float()) def to_float( self ): return self.workbook.date_to_float(self.value) #timetuple= self.value.timetuple()[:6] #xl= xlrd.xldate.xldate_from_datetime_tuple( # timetuple, # self.workbook.datemode) #return xl def to_decimal( self, digits=0 ): fmt= "{0:0.{digits}f}" return decimal.Decimal( fmt.format(self.to_float(),digits=digits) ) def to_str( self ): return str(self.value) def to_datetime( self, format=None ): return self.value def to_digit_str( self, length=5 ): fmt= "{{0:0>{0}d}}".format(length) return fmt.format( self.to_int() ) For Apple Nnumbers, the ```` cells have a native date format. This is a unique feature, since :py:mod:`xlrd`, XLSX and ODS don't have a proper date cell value. Conversion Functions ======================= .. todo:: Refactor these into the :py:mod:`schema` module. These functions are used to define schema, not process Cell objects *per se*. The idea here is to create some functions that can be used to build Schema attributes that handle proper date conversions. .. py:function:: date_from_string(format) A closure based on a format string that returns a single-argument conversion function. :param format: the format string to use. :: def date_from_string( format ): def the_conversion( string ): return datetime.datetime.strptime( string, format ) return the_conversion This forms a factory for the :py:class:`cell.DateCell` class. .. py:function:: datecell_from_string(format) A closure based on a format string that returns a single-argument conversion function. :param format: the format string to use. :: def datecell_from_string( format ): dt_conv= date_from_string( format ) def the_conversion( string, workbook ): return DateCell( dt_conv( string ), workbook ) return the_conversion This could be used like this in a schema definition. .. parsed-literal:: d = Attribute( name="mm-dd-yy", size=\ *n*, offset=\ *m*, create=stingray.cell.datecell_from_string("%m/%d/%y") ) .. py:function:: date_from_float(workbook) A closure based on an XLS workbook's datemode setting that returns a single-argument conversion function. :param workbook: the xlrd workbook with the required datemode. :: def date_from_float(workbook): return workbook.float_to_date Once the definition of this was somewhat more complex. It was dependent on :py:mod:`xlrd`; We've refactored it out of here to isolate all :py:mod:`xlrd` dependencies properly. .. parsed-literal:: def date_from_float( workbook ): def the_conversion( value ): try: dt= xlrd.xldate_as_tuple(value, workbook.datemode) except xlrd.xldate.XLDateAmbiguous as e: ex= ValueError( "Ambiguous Date: {0!r}".format(value) ) raise ex from e return datetime.datetime(\*dt) return the_conversion This function can be used to convert a raw float to a more useful :py:class:`cell.DateCell` instance. This is only sensible in the XLRD context where dates have peculiar conversion rules. .. parsed-literal:: float2date= stingray.cell.date_from_float(workbook) d = Attribute( name="mm-dd-yy", size=\ *n*, offset=\ *m*, create=lambda x, w: stingray.cell.DateCell( float2date(x), w ) )