2012-08-14 by Stefan Urbanek

Data Types: From Storage to Analysis

What is the data type of 10? It depends on who you are and what you are going to do with it. I would expect my software friends to say that it is an "integer". Why might this information not be sufficient, or not even relevant? How do analysts see the data?

Storage Data Type

If we say "data type", engineers would name types they know from typed programming languages: small integer, double precision float, character. This data type comes from how the data are stored in memory. The type specifies what operations can be done with the data stored at that particuliar place and how much memory is taken. To add two integers on an Intel processor there is an instruction called ADD, to add two floats there is a different instruction called FADD (Dear kids: this used to be on a separate chip in PCs!). To add an integer with an float, there has to be a conversion done. Database people would say decimal, date or string. Same as with memory data types, each type has it's allowed set of operations and size it takes in the database. They both are of one kinds of data types: storage data types.

A storage data type, as the name suggests, is used by software (a compiler, a database system) to know how much memory a value of that type takes and to select the appropriate operations (or operation variants).

Concrete vs. Generic

The number of storage data types and the ways they are differentiated can be exhausting. To name a few:

  • The C language has more than 25 concrete numeric types, differentiated by floatness, size and sign
  • PostgreSQL has 9 numeric types, differentiated by size and floatness
  • NumPy differentiates not only by size and sign, but also by byte order

Do I need all that information about the data type when working with data? In most cases I don't; it is information for the machine, not for me as a data analyst/scientist. There are cases when knowing about storage data types is handy, such as optimisation (of memory consumption, for example) or error prevention through type checking in typed languages.

For simplification, some tools use generic data types and hide the concrete storage type: integer, float (or real), string, ... No storage size, no byte order. Those are low-level details.

For reading the data, no input from the user is required, as a short int is an integer and a double is a real. The problem with generic data types is that there might be multiple options for how to store a generic integer.
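
For example (an illustrative mapping, not taken from any particular tool), one generic type may correspond to several concrete storage types:

    # Illustrative only: one generic type, several possible concrete storage types
    generic_to_concrete = {
        "integer": ["int16", "int32", "int64"],
        "float":   ["float32", "float64"],
        "string":  ["varchar(n)", "text"],
    }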

Analytical Data Types

When doing data analysis I think about variable values and what I can do with them. In data analysis, adding two integers or two floats is the same thing: it is just a + b. There is only one kind of addition: + (remember ADD and FADD?). However, there are numbers for which addition has no meaning, such as adding two invoice numbers or two years together.

To specify how values should be treated during data analysis, there is another kind of data type: the analytical data type, also called the variable type. They are:

Set (or Nominal Variable)
Values represent categories, like colors or contract types. Fields of this type might contain numbers, for example group numbers, but those numbers have no mathematical interpretation. For example, adding the years 1989 and 2012 together has no meaning.
Ordered Set (or Ordinal Variable)
Similar to the set type, but the values can be put into a meaningful order.
Flag (or Binary)
Special case of the set type where values can be only one of two states, such as 1 or 0, ‘yes’ or ‘no’, ‘true’ or ‘false’.
Discrete
Set of integers - values can be ordered and one can perform arithmetic operations on them, such as: 1 apple + 2 apples = 3 apples.
Range
A numerical value, such as a financial amount or a temperature.

The analytical data types are distinct from storage data types. Take just an integer, for example: it can come from a set with no meaningful arithmetic operations (an ID, a year), it can be a discrete number (a count of something), or it can be a flag whose two values happen to be 40 and 50. An integer as a set can be ordered, like a set of product sizes, or unordered, like IDs or category numbers where the categories are ordered by their names instead.

In addition to the types mentioned above, it is sometimes useful to specify that a tool or algorithm should simply ignore a field/column/variable. For that purpose a typeless analytical data type might be used.

Here is an example of storage and analytical data types:
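
For illustration, consider a made-up invoice-like record, described as a list of (name, storage type, analytical type) entries (the field names are invented for this example):

    # Illustrative only: made-up fields with their storage and analytical types
    fields = [
        # (name,          storage type, analytical type)
        ("invoice_no",    "integer",    "set"),          # identifier, no arithmetic
        ("year",          "integer",    "set"),          # adding years has no meaning
        ("product_size",  "string",     "ordered set"),  # S < M < L
        ("is_paid",       "integer",    "flag"),         # 0 or 1
        ("quantity",      "integer",    "discrete"),     # counts can be added
        ("amount",        "float",      "range"),        # financial amount
        ("note",          "string",     "typeless"),     # ignored by the analysis
    ]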

The idea behind analytical data types is described, for example, in a nice introductory data mining book [1] and also in [2]. [1] further differentiates measures into interval-scaled variables and ratio-scaled variables. Interestingly, [2] describes the "set", which they call a "categorical variable", as a "generalization of the binary in that it can take on more than two states", not the other way around.

[1] Max Bramer: Principles of Data Mining, Springer-Verlag London Limited 2007, p. 12.

[2] Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques, Elsevier 2006, p. 392.

Keep the metadata with you

As data are passed through algorithms and blocks of processing code, the data types (along with other relevant metadata) should be passed with them. Data types can in some cases be guessed from the data stream, explicitly provided by a user, or sometimes reflected (as in a database). It is good to keep them, even if it is not always possible to maintain accuracy or compatibility of data types between data sources and targets.

If done right, even after a couple of transformations one can say to an analytical-metadata-aware function/algorithm: "get averages of this dataset" and it will understand it as "get averages of the amounts in this dataset".
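
As a rough sketch (not Brewery code), such a metadata-aware function might simply pick the fields whose analytical type makes averaging meaningful:

    # A rough sketch, not Brewery code: average only fields whose analytical
    # type is "range". `records` is a list of dicts, `fields` is a list of
    # (name, storage type, analytical type) tuples.
    def averages(records, fields):
        result = {}
        for name, storage_type, analytical_type in fields:
            if analytical_type == "range":
                values = [r[name] for r in records if r.get(name) is not None]
                result[name] = sum(values) / len(values) if values else None
        return result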

Basic metadata that should be considered when creating data processing or data analysis interfaces are:

  • number of fields
  • field names (as an analyst I prefer to refer to a field by name rather than by index, as the field position might sometimes differ among source chunks)
  • field order (for tabular data it is implicit, for document-based databases it should be specified)
  • storage data types (at least generic, concrete if available or possible)
  • analytical data type

The minimal metadata structure for a dataset, relevant both to the analysts who use the data and to the engineers who prepare it, would therefore be a list of tuples: (name, storage type, analytical type).

Conclusion

Typeless programming languages allow programmers to focus on structuring the data and remove the need to fiddle with the physical storage implementation. Hiding concrete storage types from data analysts allows them to focus on the properties of their data that are relevant to the analysis. Less burden on the mind definitely helps the thinking process.

Nevertheless, there are more kinds...

Links

Data Brewery documentation of metadata structures.

2012-07-27 by Stefan Urbanek

Using Pandas as Brewery Backend

UPDATE: Added info about caching.

The first time I looked at Pandas (a Python data analysis framework) I thought: that would be a great backend/computation engine for Data Brewery.

To recap the core principle of Brewery: it is a flow-based data streaming framework with processing nodes connected by pipes. A typical node can have one or multiple inputs and an output. Source nodes have no inputs, target nodes have no outputs.

The current Brewery implementation uses one thread per node (it was written back when Python was new to me and I did not know about the GIL and such things). It can be considered just a prototype...

I have had this idea in mind for quite some time. However, coming from the database world, the only potential implementation I saw was through database tables, with nodes performing SQL operations on them. I was not happy with requiring an SQL database server for data processing, not to mention the speed and the function set (well, ok, pandas is missing the non-numeric stuff).

Here is a draft of the idea of how to implement data transfer between nodes in Brewery using tables. The requirements are:

  • follow the data modeller's workflow
  • do not rewrite data – I want to be able to see what the result was at each step
  • have some kind of provenance (where does this field come from?)

See larger image on imgur.

A table represents a pipe: each pipe field is mapped to a table column. If a node performs only field operations, then the table can be shared between nodes. If a node affects rows, then a new table should be considered. Every "pipe" can be cached, and the stream can be re-run from the cached point if the computation takes longer than desired during the model development process.

Pandas offers a structure called DataFrame, which holds data in a tabular form consisting of a series of Series (fancier array objects). Each Series represents the collection of one field's values for an analytical/computational step. Nodes that share the same field structure and the same records can share the Series, which can be grouped into a table/DataFrame.

A node can:

  • create completely new field structure (source node, aggregation, join, ...)
  • add a field (various derive/compute nodes)
  • remove a field (field filter, field replacement)

Just adding or removing a field does not affect the Series, therefore nodes can simply point to the Series they "need". Aggregation or join nodes generate not only a new field structure, they affect the number and representation of records as well, therefore their field Series differ from the respective source Series (compare: "year" in invoices and "year" in a summary of invoices). For those kinds of nodes a new table/DataFrame should be created.
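
A minimal sketch of the sharing idea (plain Python and pandas, not Brewery code): node outputs are represented as dictionaries of Series, so field-only operations reuse the existing Series objects instead of copying data.

    # Minimal sketch, not Brewery code: node outputs as dicts of pandas Series
    import pandas as pd

    source_output = {
        "year":   pd.Series([2010, 2011, 2012]),
        "amount": pd.Series([100.0, 250.0, 80.0]),
    }

    # A "derive" node adds a field: existing Series are reused, one is added
    derived_output = dict(source_output)
    derived_output["amount_eur"] = derived_output["amount"] * 0.8

    # A "field filter" node removes a field: again, nothing is copied
    filtered_output = {k: v for k, v in derived_output.items() if k != "year"}

    # The shared Series are the same objects, not copies
    assert filtered_output["amount"] is source_output["amount"]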

Sampling or selection nodes can generate an additional Series with boolean values based on the selection. Each node can have a hidden input column representing that selection.
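
Continuing the sketch above, the selection itself can be just a boolean Series used as a mask:

    # Sketch: a boolean Series acts as the hidden "selection" column
    import pandas as pd

    amount = pd.Series([100.0, 250.0, 80.0])
    selection = amount > 90       # output of a selection node
    selected = amount[selection]  # records passed on to the next node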

There are a couple of things I am missing so far. One is a DataFrame that would be a "view" of another DataFrame – that is, a DataFrame that does not copy the Series, only references them. Another is more custom metadata for a table column (a DataFrame Series), including the "analytical data type" (I will write about this later as it is not crucial in this case). They might be there already, I just have not discovered them yet.

I am not an expert in Pandas, I have just started exploring the framework. It looks very promising for this kind of problem.

2012-04-13 by Stefan Urbanek

Data Streaming Basics in Brewery

How do you build and run a data analysis stream? Why streams? I am going to talk about how to use Brewery from the command line and from Python scripts.

Brewery is a Python framework and a way of analysing and auditing data. Its basic principle is a flow of structured data through processing and analysing nodes. This architecture makes the data streaming process more transparent, understandable and maintainable.

You might want to use brewery when you:

  • want to learn more about data
  • encounter unknown datasets and/or you do not know what you have in your datasets
  • do not know exactly how to process your data and want to play around without getting lost
  • want to create alternative analysis paths and compare them
  • measure data quality and feed data quality results into the data processing process

There are many approaches to and ways of doing data analysis. Brewery brings a certain workflow to the analyst:

  1. examine data
  2. prototype a stream (you can use data sampling so as not to overheat the machine)
  3. see results and refine stream, create alternatives (at the same time)
  4. repeat 3. until satisfied

Brewery makes steps 2 and 3 easy: quick prototyping, alternative branching, comparison. It tries to keep the analyst's workflow clean and understandable.

Building and Running a Stream

There are two ways to create a stream: programmatically in Python, or from the command line with no Python knowledge required. Both ways have two alternatives: one is quick and simple but with a limited feature set, the other is full-featured but more verbose.

The two programmatic alternatives for creating a stream are basic construction and "HOM" or forking construction. The two command-line ways to run a stream are run and pipe. We are now going to look at them more closely.

A note regarding the Zen of Python: this does not go against "There should be one – and preferably only one – obvious way to do it." There is only one way: the raw construction. The others are higher-level ways, or ways used in different environments.

In the examples below we are going to demonstrate a simple linear (no branching) stream that reads a CSV file, performs a very basic audit and "pretty prints" the result. The stream looks like this:

    csv_source → audit → pretty_printer

Command line

Brewery comes with a command-line utility brewery which can run streams without the need to write a single line of Python code. Again, there are two ways of describing a stream: JSON-based and a plain linear pipe.

The simple usage is with brewery pipe command:

brewery pipe csv_source resource=data.csv audit pretty_printer

The pipe command expects a list of nodes and attribute=value pairs for node configuration. If no source node is specified, CSV on standard input is used. If no target node is specified, CSV on standard output is assumed:

cat data.csv | brewery pipe audit

The actual stream, with the implicit nodes, is:

    csv_source (standard input) → audit → csv_target (standard output)

The JSON way is more verbose but full-featured: you can create complex processing streams with many branches. stream.json:

    {
        "nodes": { 
            "source": { "type":"csv_source", "resource": "data.csv" },
            "audit":  { "type":"audit" },
            "target": { "type":"pretty_printer" }
        },
        "connections": [
            ["source", "audit"],
            ["audit", "target"]
        ]
    }

And run:

$ brewery run stream.json

To list all available nodes do:

$ brewery nodes

To get more information about a particular node, pass its name to brewery nodes:

$ brewery nodes string_strip

Note that data streaming from the command line is more limited than the Python way. You might not get access to nodes and node features that require the Python language, such as Python storage type nodes or functions.

Higher order messaging

The preferred programmatic way of creating streams is through higher order messaging (HOM), which is, in this case, just a fancy name for pretending to do something while in fact we are only preparing the stream.

This way of creating a stream is more readable and maintainable. It is easier to insert nodes into the stream and to create forks without losing the overall picture of the stream. It might not be suitable for very complex streams, though. Here is an example:

    b = brewery.create_builder()
    b.csv_source("data.csv")
    b.audit()
    b.pretty_printer()

When this piece of code is executed, nothing actually happens to the data stream. The stream is only being prepared and you can run it at any time:

    b.stream.run()

What actually happens? The builder b is a kind of empty object that accepts almost anything and then tries to find a node that corresponds to the method called. The node is instantiated, added to the stream and connected to the previous node.
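
A rough sketch of how such a builder could work (this is not the actual Brewery implementation, just an illustration of the idea): __getattr__ turns unknown method calls into "create a node, add it, connect it to the previous one".

    # Rough sketch, not the actual Brewery implementation
    class SketchBuilder(object):
        def __init__(self, stream, node_catalog):
            self.stream = stream              # object with add() and connect()
            self.node_catalog = node_catalog  # maps method names to node classes
            self.last_node = None

        def __getattr__(self, node_name):
            def create_node(*args, **kwargs):
                node = self.node_catalog[node_name](*args, **kwargs)
                self.stream.add(node)
                if self.last_node is not None:
                    self.stream.connect(self.last_node, node)
                self.last_node = node
                return self
            return create_node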

You can also create a branched stream:

    b = brewery.create_builder()
    b.csv_source("data.csv")
    b.audit()

    f = b.fork()
    f.csv_target("audit.csv")

    b.pretty_printer()

Basic Construction

This is the lowest-level way of creating a stream and it allows full customisation and control. In the basic construction method the programmer prepares all the node instance objects and connects them explicitly, node by node. It might be too verbose; however, it is meant to be used by applications that construct streams either through a user interface or from some form of stream description. All the other methods build on this one.

    from brewery import Stream
    from brewery.nodes import CSVSourceNode, AuditNode, PrettyPrinterNode

    stream = Stream()

    # Create pre-configured node instances
    src = CSVSourceNode("data.csv")
    stream.add(src)

    audit = AuditNode()
    stream.add(audit)

    printer = PrettyPrinterNode()
    stream.add(printer)

    # Connect nodes: source -> target
    stream.connect(src, audit)
    stream.connect(audit, printer)

    stream.run()

It is also possible to pass the nodes as a dictionary and the connections as a list of (source, target) tuples:

    stream = Stream(nodes, connections)
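
For illustration (the exact shape of the arguments is an assumption here, mirroring the JSON stream description above), the nodes and connections might be prepared like this:

    # Assumption: nodes keyed by name, connections referring to those names,
    # as in the JSON stream description shown earlier
    nodes = {
        "source":  CSVSourceNode("data.csv"),
        "audit":   AuditNode(),
        "printer": PrettyPrinterNode()
    }
    connections = [("source", "audit"), ("audit", "printer")]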

Future plans

What would be lovely to have in brewery?

Probing and data quality indicators – tools for simple data probing and an easy way of creating data quality indicators. This will allow something like "test-driven development", but for data. This is the next step.

Stream optimisation – merge multiple nodes into a single processing unit before running the stream. Might be done in the near future.

Backend-based nodes and related data transfer between backend nodes – for example, two SQL nodes might pass data through a database table instead of the built-in data pipe, or two numpy/scipy-based nodes might use a numpy/scipy structure to pass data and avoid unnecessary streaming. Not very soon, but in the foreseeable future.

Stream compilation – compile a stream into an optimised script. Not too soon, but I would like to have that one.

Last, but not least: currently there is a little performance cost because of the nature of the Brewery implementation. This penalty will be explained in another blog post, but to make a long story short, it has to do with threads, the Python GIL and the non-optimised stream graph. There is no prediction for this one, as it might be addressed step by step. Also, some Python 3 features look promising, such as yield from in Python 3.3 (PEP 380).

Links

2012-04-04 by Stefan Urbanek

Brewery 0.8 Released

I'm glad to announce a new release of Brewery – a stream-based data auditing and analysis framework for Python.

There are quite a few updates, to mention the notable ones:

  • new brewery runner with commands run and graph
  • new nodes: pretty printer node (for your terminal pleasure), generator function node
  • many CSV updates and fixes

I have added several simple how-to examples, such as aggregation of a remote CSV, a basic audit of a CSV, and how to use a generator function. Feedback and questions are welcome. I'll help you.

Note that there are a couple of changes that break compatibility; however, the affected code can be updated very easily. I apologize for the inconvenience, but until 1.0 such changes might happen more frequently. On the other hand, I will try to make them as painless as possible.

Full listing of news, changes and fixes is below.

Version 0.8

News

  • Changed license to MIT
  • Created new brewery runner commands: 'run' and 'graph':
    • 'brewery run stream.json' will execute the stream
    • 'brewery graph stream.json' will generate graphviz data
  • Nodes: Added pretty printer node - textual output as a formatted table
  • Nodes: Added source node for a generator function
  • Nodes: added analytical type to derive field node
  • Preliminary implementation of data probes (just concept, API not decided yet for 100%)
  • CSV: added empty_as_null option to read empty strings as Null values
  • Nodes can be configured with node.configure(dictionary, protected). If 'protected' is True, then protected attributes (specified in node info) can not be set with this method.
  • added node identifier to the node reference doc
  • added create_logger
  • added experimental retype feature (works for CSV only at the moment)
  • Mongo Backend - better handling of record iteration

Changes

  • CSV: resource is now explicitly named argument in CSV*Node
  • CSV: convert fields according to field storage type (instead of all-strings)
  • Removed fields getter/setter (now implementation is totally up to stream subclass)
  • AggregateNode: rename aggregates to measures, added measures as public node attribute
  • moved errors to brewery.common
  • removed field_name(), now str(field) should be used
  • use named blogger 'brewery' instead of the global one
  • better debug-log labels for nodes (node type identifier + python object ID)

WARNING: Compatibility break:

  • deprecate __node_info__ and use plain node_info instead
  • Stream.update() now takes nodes and connections as two separate arguments

Fixes

  • added SQLSourceNode, added option to keep fields instead of dropping them in FieldMap and FieldMapNode (patch by laurentvasseur @ bitbucket)
  • better traceback handling on node failure (now actually the traceback is displayed)
  • return list of field names as string representation of FieldList
  • CSV: fixed output of zero numeric value in CSV (was empty string)

Links

  • github sources: https://github.com/Stiivi/brewery
  • Documentation: http://packages.python.org/brewery/
  • Mailing List: http://groups.google.com/group/databrewery/
  • Submit issues here: https://github.com/Stiivi/brewery/issues
  • IRC channel: #databrewery on irc.freenode.net

If you have any questions, comments, requests, do not hesitate to ask.

2011-06-25 by Stefan Urbanek

Brewery 0.7 Released

A new small release is out, with quite a nice addition of documentation. It does not bring too many new features, but it contains a refactoring towards a better package structure, which breaks some compatibility.

Documentation updates

Framework Changes

  • added soft (optional) dependencies on backend libraries. Exception with useful information will be raised when functionality that depends on missing package is used. Example: “Exception: Optional package ‘sqlalchemy’ is not installed. Please install the package from http://www.sqlalchemy.org/ to be able to use: SQL streams. Recommended version is > 0.7”
  • field related classes and functions were moved from ‘ds’ module to ‘metadata’ and included in brewery top-level: Field, FieldList, expand_record, collapse_record
  • added probes

Deprecated functions

Streams

  • new node: DeriveNode - derive new field with callables or string formula (python expression)
  • new SelectNode implementation: accepts callables or string with python code
  • former SelectNode renamed to FunctionSelectNode

Enjoy!

Links
