4. Cell Module – Data Element Containers and Conversions¶
The point of a cell.Cell
is two-fold:
- Capture. That is, decode the information in the source file into a Python object that represents the source spreadsheet value. For XLS or XLSX formats, there are a variety of cell data types. For CSV, all cells are text. For a fixed format file, we may have to exploit the physical format information to properly decode the bytes or characters. We may even have to cope with EBCDIC or packed decimal conversions.
- Convert. Provide the cell value coerced to another Python type. This is rather open-ended; it’s hard to provide a comprehensive list of types to which data can be converted.
4.1. Capture Use Case¶
The Capture use case is defined by our physical formats. We have several places to look for source data formats:
xlrd
.- XSLX (ECMA 376).
- Apple Numbers.
xlrd
identifies the follow cell types found in XLS workbooks.
Type symbol | Type number | Python value |
---|---|---|
XL_CELL_EMPTY | 0 | empty string ‘’ |
XL_CELL_TEXT | 1 | a Unicode string |
XL_CELL_NUMBER | 2 | float |
XL_CELL_DATE | 3 | float |
XL_CELL_BOOLEAN | 4 | int; 1 means TRUE, 0 means FALSE |
XL_CELL_ERROR | 5 | int representing internal Excel codes; for a text representation, refer to the supplied dictionary error_text_from_code |
XL_CELL_BLANK | 6 | empty string ‘’. Note: this type will appear only when open_workbook(..., formatting_info=True) is used. |
An XLSX (per ECMA 376, section 18.18.11) or ODS provides a similar list of cell types. The data is always encoded as a proper string that can be converted (if necessary) based on the type code.
Enumeration Value | Description |
---|---|
b (Boolean) |
Cell containing a boolean. |
d (Date) |
Cell contains a date in the ISO 8601 format. |
e (Error) |
Cell containing an error. |
inlineStr (Inline String) |
Cell containing an (inline) rich string, i.e., one not in the shared string table. If this cell type is used, then the cell value is in the is element rather than the v element in the cell (c element). |
n (Number) |
Cell containing a number. |
s (Shared String) |
Cell containing a shared string. |
str (String) |
Cell containing a formula string. |
Dates formatted as strings are – always – a problem. There’s no generic solution. An application may need to write suitable extensions to handle this. See below under Conversion Functions.
We should depend on the locale
module to provide proper format strings for converting between date and string.
An Apple Numbers spreadsheet appears to have the following cell types. Some of the tags appear to have no value, so their purpose is unclear.
Tag | Description |
---|---|
d (Date) |
Cell containing a date |
f (Formula) |
Cell containing a formula |
g (Empty) |
Two or more empty cells |
n (Number) |
Cell containing a number |
o (?) |
One empty cell |
s (?) |
|
t (Text) |
Cell containing text |
pm (Popup Menu) |
A popup menu of otherc cell values |
This provides a reasonably complete set of source data types. They vary by physical format. They all have a common history, starting with VisiCalc and including MultiPlan. Here’s a summary:
Type | XLRD | XLSX | Numbers |
---|---|---|---|
Empty | Two Varieties | No | Two Varieties |
Text | Yes | Three Varieties | Yes |
Number | Yes | Yes | Yes |
Date | Two Encodings | Yes | Yes |
Boolean | Yes | Yes | No |
Error | Yes | Yes | No |
4.2. Convert Use Cases¶
There are several use cases for output conversion (or “transformation”).
- Trivial.
float
,str
or an empty cell. Essentially, we’re using the captured data type directly. - Easy.
float
orstr
todecimal
. Generally, currency fields are stored asfloat
in the workbook; this needs to be coverted todecimal
to be useful. - Obscure.
datetime
based onfloat
.xlrd
handles this elegantly. - Variable.
datetime
based onstr
. The variability becomes rather complex. It’s also application-specific, since it depends on the source of the data. - Horrible. Digit strings. US Zip codes. Social Security Numbers. Phone numbers without punctuation. These are digit strings which a spreadsheet application may transform to a floating point number; these need to be rebuilt as proper digit strings with leading zeroes.
We’d like code that looks like these examples:
"{0} has {1}".format( foo.to_str(), foo.to_float() )
today= bar.to_datetime().date()
zip_code.to_digit_str(5)
This Convert aspect of a cell.Cell
is part of using the logical layout.
We’ll address that under The Stingray Developer’s Guide, below.
4.3. Model¶
http://yuml.me/diagram/scruffy;/class/
#cell,
[Cell]^[EmptyCell],
[Cell]^[TextCell],
[Cell]^[NumberCell],
[Cell]^[FloatDateCell],
[Cell]^[BooleanCell],
[Cell]^[ErrorCell].
4.4. Circularity¶
Note that a Cell depends on a Workbook.
And a Workbook (via a Sheet and a Row) depends on a Cell.
In some languages, we’d be obligated to define interfaces so that these two classes could depend on each other sensibly. In Python, however, we don’t need to create an elaborate web of dependencies.
4.5. Overheads¶
Here are the module docstring and imports.
"""stingray.cell -- Defines Cell as the atomic data element in a sheet
of a workbook.
A cell has a value, it's part of a workbook.
"""
import locale
import decimal
import datetime
import time
from collections import Hashable
A version string.
__version__ = "4.4.8"
Just to be sure that any locale-based processing will actually work, we establish a default locale.
locale.setlocale(locale.LC_ALL, '')
4.6. Cell¶
-
class
cell.
Cell
¶ The
cell.Cell
class hierarchy extends this base class. Note that we have a relatively short list of built-in conversions. For more complex, application-specific conversions, the rawvalue
is available as a property.-
value
¶ The raw data, often a String from a workbook. May also be a sequence of bytes for COBOL.
-
workbook
¶ The
workbook.base.Workbook
that created this Cell. This is largely used for Excel date conversions, but there could be other context needs for lazy access to data.
-
class Cell( Hashable ):
"""A class hierarchy for each kind of Cell.
"""
def __init__( self, value=None, workbook=None ):
"""Build a new Cell; the atomic data element of a workbook.
:param value: Raw value, generally a string for most workbooks.
:param workbook: Parent workbook, required for some
conversions and for lazy access to data.
"""
self._value, self.workbook = value, workbook
def __repr__( self ):
return "{0}({1!r})".format(
self.__class__.__name__, self._value )
def is_empty( self ):
return self._value is None
def to_int( self ): return NotImplemented
def to_float( self ): return NotImplemented
def to_decimal( self, digits=None ): return NotImplemented
def to_str( self ): return NotImplemented
def to_datetime( self, format=None ): return NotImplemented
def to_digit_str( self, len=5 ): return NotImplemented
One feature of a cell that’s required when we do data profiling is to create a usable hash from the cell class and raw data value.
def __hash__( self ):
return hash(self._value) ^ hash(self.__class__)
def __eq__( self, other ):
return self.__class__ == other.__class__ and self._value == other._value
def __ne__( self, other ):
return self.__class__ != other.__class__ or self._value != other._value
We make a token effort at making a cell more-or-less immutable. This makes it hashable.
@property
def value( self ):
return self._value
Todo
Unit test cases for the hashable interface of Cell
4.7. EmptyCell¶
-
class
cell.
EmptyCell
¶ An
EmptyCell
implements empty cells.xlrd
may report them as a typeXL_CELL_EMPTY
. A Numbers spreadsheet may use the<o>
or<g>
tag.
class EmptyCell( Cell ):
"""The *value* will be '', but we ignore that."""
def is_empty( self ): return True
def to_int( self ): return None
def to_float( self ): return None
def to_decimal( self, digits=None ): return None
def to_str( self ): return None
def to_datetime( self, format=None ): return None
def to_digit_str( self, len=None ): return None
4.8. TextCell¶
-
class
cell.
TextCell
¶ A
TextCell
implements the cells with text values.xlrd
may report them as a typeXL_CELL_TEXT
. It’s often possible to interpret the text as some other value, so the conversions make reasonable attempts at that.This is used for CSV workbooks as well as XLS workbooks. This is the default type for Fixed format files, also.
Note that COBOL files will explicitly have bytes values, not string values.
class TextCell( Cell ):
"""A Cell which contains a Python string value."""
def to_int( self ):
return int( self.value )
def to_float( self ):
return float( self.value )
def to_decimal( self, digits=0 ):
return decimal.Decimal( self.value )
def to_str( self ):
return self.value
def to_datetime( self, format=None ):
if format is None:
try:
format = locale.nl_langinfo(locale.D_FMT)
except AttributeError as e:
# Windows
format = "%x"
return datetime.datetime.strptime(self.value,format)
def to_digit_str( self, length=5 ):
txt= "{0:0>{length}d}".format(int(self.value), length=length)
return txt
4.9. NumberCell¶
-
class
cell.
NumberCell
¶ A
NumberCell
implements the cells with a float value.xlrd
may report them as a typeXL_CELL_NUMBER
. A variety of conversions make sense for a number value.
class NumberCell( Cell ):
"""A cell which contains a Python float value."""
def to_int( self ):
return int( self.value )
def to_float( self ):
if isinstance(self.value,float):
return self.value
# likely, it's Decimal!
return float(self.value)
def to_decimal( self, digits=0 ):
if isinstance(self.value,float):
fmt= "{0:0.{digits}f}"
return decimal.Decimal( fmt.format(self.value, digits=digits) )
elif isinstance(self.value,decimal.Decimal):
return self.value
else:
return decimal.Decimal(self.value)
def to_str( self ):
return str(self.value)
def to_datetime( self, format=None ):
assert format is None, "Format is not used."
return self.workbook.float_to_date(self.value)
#try:
# dt= xlrd.xldate_as_tuple(self.value, self.workbook.datemode)
#except xlrd.xldate.XLDateAmbiguous as e:
# ex= ValueError( "Ambiguous Date: {0}".format(self.value) )
# raise ex from e
#return datetime.datetime(*dt)
def to_digit_str( self, length=5 ):
txt= "{0:0>{length}d}".format(int(self.value), length=length)
return txt
4.10. FloatDateCell¶
-
class
cell.
FloatDateCell
¶ A
FloatDateCell
implements the cells withXL_CELL_DATE
. Since the conversions are all identical to number, we simply inherit the features of a number.
class FloatDateCell( NumberCell ):
"""A cell which contains a float value that is actually an Excel date."""
pass
Other formats have other kinds of date cells that aren’t simply dressed-up floating-point numbers.
4.11. BooleanCell¶
-
class
cell.
BooleanCell
¶ A
BooleanCell
implements the cells withXL_CELL_BOOLEAN
. Since the conversions are all identical to number, we simply inherit the features.
class BooleanCell( NumberCell ):
"""A cell which contains a boolean value."""
pass
4.12. ErrorCell¶
-
class
cell.
ErrorCell
¶ An
ErrorCell
implements the cells withXL_CELL_ERROR
. The only sensible conversion isErrorCell.to_str()
which reports the error string for the cell.
class ErrorCell( Cell ):
"""A cell which contains an error code."""
def to_int( self ):
raise ValueError( self.value )
def to_float( self ):
raise ValueError( self.value )
def to_decimal( self, digits=0 ):
raise ValueError( self.value )
def to_str( self ):
return self.value
def to_datetime( self, format=None ):
raise ValueError( self.value )
def to_digit_str( self, length=5 ):
raise ValueError( self.value )
4.13. DateCell¶
-
class
cell.
DateCell
¶ A
DateCell
implements a cell with a proper date-time value. This is a value which did not come from a workbook float value.This could be a parsed string, for example.
A variety of conversions make sense for a proper date value.
class DateCell( Cell ):
"""A cell which contains a proper :py:mod:`datetime` value."""
def to_int( self ):
return int(self.to_float())
def to_float( self ):
return self.workbook.date_to_float(self.value)
#timetuple= self.value.timetuple()[:6]
#xl= xlrd.xldate.xldate_from_datetime_tuple(
# timetuple,
# self.workbook.datemode)
#return xl
def to_decimal( self, digits=0 ):
fmt= "{0:0.{digits}f}"
return decimal.Decimal( fmt.format(self.to_float(),digits=digits) )
def to_str( self ):
return str(self.value)
def to_datetime( self, format=None ):
return self.value
def to_digit_str( self, length=5 ):
fmt= "{{0:0>{0}d}}".format(length)
return fmt.format( self.to_int() )
For Apple Nnumbers, the <d>
cells have a native date format.
This is a unique feature, since xlrd
, XLSX and ODS don’t have a
proper date cell value.
4.14. Conversion Functions¶
Todo
Refactor these into the schema
module.
These functions are used to define schema, not process Cell objects per se.
The idea here is to create some functions that can be used to build Schema attributes that handle proper date conversions.
-
cell.
date_from_string
(format)¶ A closure based on a format string that returns a single-argument conversion function.
Parameters: format – the format string to use.
def date_from_string( format ):
def the_conversion( string ):
return datetime.datetime.strptime( string, format )
return the_conversion
This forms a factory for the cell.DateCell
class.
-
cell.
datecell_from_string
(format)¶ A closure based on a format string that returns a single-argument conversion function.
Parameters: format – the format string to use.
def datecell_from_string( format ):
dt_conv= date_from_string( format )
def the_conversion( string, workbook ):
return DateCell( dt_conv( string ), workbook )
return the_conversion
This could be used like this in a schema definition.
d = Attribute( name="mm-dd-yy", size=n, offset=m, create=stingray.cell.datecell_from_string("%m/%d/%y") )
-
cell.
date_from_float
(workbook)¶ A closure based on an XLS workbook’s datemode setting that returns a single-argument conversion function.
Parameters: workbook – the xlrd workbook with the required datemode.
def date_from_float(workbook):
return workbook.float_to_date
Once the definition of this was somewhat more complex. It was dependent on
xlrd
; We’ve refactored it out of here to isolate all xlrd
dependencies properly.
def date_from_float( workbook ): def the_conversion( value ): try: dt= xlrd.xldate_as_tuple(value, workbook.datemode) except xlrd.xldate.XLDateAmbiguous as e: ex= ValueError( "Ambiguous Date: {0!r}".format(value) ) raise ex from e return datetime.datetime(*dt) return the_conversion
This function can be used to convert a raw float to a more useful
cell.DateCell
instance. This is only sensible in the XLRD context
where dates have peculiar conversion rules.
float2date= stingray.cell.date_from_float(workbook) d = Attribute( name="mm-dd-yy", size=n, offset=m, create=lambda x, w: stingray.cell.DateCell( float2date(x), w ) )