Data Brewery

analytical data streams & online analytical processing Python frameworks

    by Stefan Urbanek
  • Cubes 0.9.1: Ranges, denormalization and query cell

    The new minor release of Cubes – light-weight Python OLAP framework – brings range cuts, denormalization with the slicer tool and cells in /report query, together with fixes and important changes.

    See the second part of this post for the full list.

    Range Cuts

    Range cuts were implemented in the SQL Star Browser. They are used as follows:

    Python:

    cut = RangeCut("date", [2010], [2012,5,10])
    cut_hi = RangeCut("date", None, [2012,5,10])
    cut_low = RangeCut("date", [2010], None)
    

    To specify a range in slicer server where keys are sortable:

        http://localhost:5000/aggregate?cut=date:2004-2005
        http://localhost:5000/aggregate?cut=date:2004,2-2005,5,1
    

    Open ranges:

        http://localhost:5000/aggregate?cut=date:2010-
        http://localhost:5000/aggregate?cut=date:2004,1,1-
        http://localhost:5000/aggregate?cut=date:-2005,5,10
        http://localhost:5000/aggregate?cut=date:-2012,5
    

    Denormalization with slicer Tool

    Now it is possible to denormalize tour data with the slicer tool. You do not have to denormalize using python script. Data are denormalized in a way how denormalized browser would expect them to be. You can tune the process using command line switches, if you do not like the defaults.

    Denormalize all cubes in the model:

    $ slicer denormalize slicer.ini
    

    Denormalize only one cube::

    $ slicer denormalize -c contracts slicer.ini
    

    Create materialized denormalized view with indexes::

    $ slicer denormalize --materialize --index slicer.ini
    

    Example slicer.ini:

    [workspace]
    denormalized_view_prefix = mft_
    denormalized_view_schema = denorm_views
    
    # This switch is used by the browser:
    use_denormalization = yes
    

    For more information see Cubes slicer tool documentation

    Cells in Report

    Use cell to specify all cuts (type can be range, point or set):

    {
        "cell": [
            {
                "dimension": "date",
                "type": "range",
                "from": [2010,9],
                "to": [2011,9]
            }
        ],
        "queries": {
            "report": {
                "query": "aggregate",
                "drilldown": {"date":"year"}
            }
        }
    }
    

    For more information see the slicer server documentation.

    New Features

    • cut_from_string(): added parsing of range and set cuts from string; introduced requirement for key format: Keys should now have format “alphanumeric character or underscore” if they are going to be converted to strings (for example when using slicer HTTP server)
    • cut_from_dict(): create a cut (of appropriate class) from a dictionary description
    • Dimension.attribute(name): get attribute instance from name
    • added exceptions: CubesError, ModelInconsistencyError, NoSuchDimensionError, NoSuchAttributeError, ArgumentError, MappingError, WorkspaceError and BrowserError

    StarBrowser:

    • implemented RangeCut conditions

    Slicer Server:

    • /report JSON now accepts cell with full cell description as dictionary, overrides URL parameters

    Slicer tool:

    • denormalize option for (bulk) denormalization of cubes (see the the slicer documentation for more information)

    Changes

    • important: all /report JSON requests should now have queries wrapped in the key queries. This was originally intended way of use, but was not correctly implemented. A descriptive error message is returned from the server if the key queries is not present. Despite being rather a bug-fix, it is listed here as it requires your attention for possible change of your code.
    • warn when no backend is specified during slicer context creation

    Fixes

    • Better handling of missing optional packages, also fixes #57 (now works without slqalchemy and without werkzeug as expected)
    • see change above about /report and queries
    • push more errors as JSON responses to the requestor, instead of just failing with an exception

    Links

    Sources can be found on github. Read the documentation.

    Join the Google Group for discussion, problem solving and announcements.

    Submit issues and suggestions on github

    IRC channel #databrewery on irc.freenode.net

    If you have any questions, comments, requests, do not hesitate to ask.

  • Tags: announcement release cubes olap by Stefan Urbanek
  • Cubes 0.9 Released

    The new version of Cubes – light-weight Python OLAP framework – brings new StarBrowser, which we discussed in previous blog posts:

    The new SQL backend is written from scratch, it is much cleaner, transparent, configurable and open for future extensions. Also allows direct browsing of star/snowflake schema without denormalization, therefore you can use Cubes on top of a read-only database. See DenormalizedMapper and SnowflakeMapper for more information.

    Just to name a few new features: multiple aggregated computations (min, max,…), cell details, optional/configurable denormalization.

    Important Changes

    Summary of most important changes that might affect your code:

    Slicer: Change all your slicer.ini configuration files to have [workspace] section instead of old [db] or [backend]. Depreciation warning is issued, will work if not changed.

    Model: Change dimensions in model to be an array instead of a dictionary. Same with cubes. Old style: "dimensions" = { "date" = ... } new style: "dimensions" = [ { "name": "date", ... } ]. Will work if not changed, just be prepared.

    Python: Use Dimension.hierarchy() instead of Dimension.default_hierarchy.

    New Features

    • slicer_context() - new method that holds all relevant information from configuration. can be reused when creating tools that work in connected database environment
    • added Hierarchy.all_attributes() and .key_attributes()
    • Cell.rollup_dim() - rolls up single dimension to a specified level. this might later replace the Cell.rollup() method
    • Cell.drilldown() - drills down the cell
    • create_workspace(backend,model, **options) - new top-level method for creating a workspace by specifying backend name. Easier to create browsers (from possible browser pool) programmatically. The backend name might be full module name path or relative to the cubes.backends, for example sql.star for new or sql.browser for old SQL browser.
    • get_backend() - get backend by name

    • AggregationBrowser.cell_details(): New method returning values of attributes representing the cell. Preliminary implementation, return value might change.

    • AggregationBrowser.cut_details(): New method returning values of attributes representing a single cut. Preliminary implementation, return value might change.

    • Dimension.validate() now checks whether there are duplicate attributes

    • Cube.validate() now checks whether there are duplicate measures or details

    SQL backend:

    • new StarBrowser implemented:
      • StarBrowser supports snowflakes or denormalization (optional)
      • for snowflake browsing no write permission is required (does not have to be denormalized)
    • new DenormalizedMapper for mapping logical model to denormalized view
    • new SnowflakeMapper for mapping logical model to a snowflake schema
    • ddl_for_model() - get schema DDL as string for model
    • join finder and attribute mapper are now just Mapper - class responsible for finding appropriate joins and doing logical-to-physical mappings
    • coalesce_attribute() - new method for coalescing multiple ways of describing a physical attribute (just attribute or table+schema+attribute)
    • dimension argument was removed from all methods working with attributes (the dimension is now required attribute property)
    • added create_denormalized_view() with options: materialize, create_index, keys_only

    Slicer tool/server:

    • slicer ddl - generate schema DDL from model
    • slicer test - test configuration and model against database and report list of issues, if any
    • Backend options are now in [workspace], removed configurability of custom backend section. Warning are issued when old section names [db] and [backend] are used
    • server responds to /details which is a result of AggregationBrowser.cell_details()

    Examples:

    • added simple Flask based web example - dimension aggregation browser

    Changes

    • in Model: dimension and cube dictionary specification during model initialization is depreciated, list should be used (with explicitly mentioned attribute “name”) — important
    • important: Now all attribute references in the model (dimension attributes, measures, …) are required to be instances of Attribute() and the attribute knows it’s dimension
    • removed hierarchy argument from Dimension.all_attributes() and .key_attributes()
    • renamed builder to denormalizer
    • Dimension.default_hierarchy is now depreciated in favor of Dimension.hierarchy() which now accepts no arguments or argument None - returning default hierarchy in those two cases
    • metadata are now reused for each browser within one workspace - speed improvement.

    Fixes

    • Slicer version should be same version as Cubes: Original intention was to have separate server, therefore it had its own versioning. Now there is no reason for separate version, moreover it can introduce confusion.
    • Proper use of database schema in the Mapper

    Links

    Sources can be found on github. Read the documentation.

    Join the Google Group for discussion, problem solving and announcements.

    Submit issues and suggestions on github

    IRC channel #databrewery on irc.freenode.net

    If you have any questions, comments, requests, do not hesitate to ask.

  • Tags: cubes announcement release by Stefan Urbanek
  • Star Browser, Part 3: Aggregations and Cell Details

    Last time I was talking about joins and denormalisation in the Star Browser. This is the last part about the star browser where I will describe the aggregation and what has changed, compared to the old browser.

    The Star Browser is new aggregation browser in for the Cubes – lightweight Python OLAP Framework. Next version v0.9 will be released next week.

    Aggregation

    sum is not the only aggregation. The new browser allows to have other aggregate functions as well, such as min, max.

    You can specify the aggregations for each measure separately:

    {
        "name": "amount",
        "aggregations": ["sum", "min", "max"]
    }
    

    The resulting aggregated attribute name will be constructed from the measure name and aggregation suffix, for example the mentioned amount will have three aggregates in the result: amount_sum, amount_min and amount_max.

    Source code reference: see StarQueryBuilder.aggregations_for_measure

    Aggregation Result

    Result of aggregation is a structure containing: summary - summary for the aggregated cell, drilldown - drill down cells, if was desired, and total_cell_count - total cells in the drill down, regardless of pagination.

    Cell Details

    When we are browsing the cube, the cell provides current browsing context. For aggregations and selections to happen, only keys and some other internal attributes are necessary. Those can not be presented to the user though. For example we have geography path (country, region) as ['sk', 'ba'], however we want to display to the user Slovakia for the country and Bratislava for the region. We need to fetch those values from the data store. Cell details is basically a human readable description of the current cell.

    For applications where it is possible to store state between aggregation calls, we can use values from previous aggregations or value listings. Problem is with web applications - sometimes it is not desirable or possible to store whole browsing context with all details. This is exact the situation where fetching cell details explicitly might come handy.

    Note: The Original browser added cut information in the summary, which was ok when only point cuts were used. In other situations the result was undefined and mostly erroneous.

    The cell details are now provided separately by method AggregationBrowser.cell_details(cell) which has Slicer HTTP equivalent /details or {"query":"detail", ...} in /report request. The result is a list of

    For point cuts, the detail is a list of dictionaries for each level. For example our previously mentioned path ['sk', 'ba'] would have details described as:

    [
        {
            "geography.country_code": "sk",
            "geography.country_name": "Slovakia",
            "geography.something_more": "..."
            "_key": "sk",
            "_label": "Slovakia"
        },
        {
            "geography.region_code": "ba",
            "geography.region_name": "Bratislava",
            "geography.something_even_more": "...",
            "_key": "ba",
            "_label": "Bratislava"
        }
    ]
    

    You might have noticed the two redundant keys: _key and _label - those contain values of a level key attribute and level label attribute respectively. It is there to simplify the use of the details in presentation layer, such as templates. Take for example doing only one-dimensional browsing and compare presentation of “breadcrumbs”:

    labels = [detail["_label"] for detail in cut_details]
    

    Which is equivalent to:

    levels = dimension.hierarchy.levels()
    labels = []
    for i, detail in enumerate(cut_details):
        labels.append(detail[level[i].label_attribute.full_name()])
    

    Note that this might change a bit: either full detail will be returned or just key and label, depending on an option argument (not yet decided).

    Pre-aggregation

    The Star Browser is being created with SQL pre-aggregation in mind. This is not possible in the old browser, as it is not flexible enough. It is planned to be integrated when all basic features are finished.

    Proposed access from user’s perspective will be through configuration options: use_preaggregation, preaggregation_prefix, preaggregation_schema and a method for cube pre-aggregation will be available through the slicer tool.

    Summary

    The new browser has better internal structure resulting in increased flexibility for future extensions. It fixes not so good architectural decisions of the old browser.

    New and fixed features:

    • direct star/snowflake schema browsing
    • improved mappings - more transparent and understandable process
    • ability to explicitly specify database schemas
    • multiple aggregations

    The new backend sources are here and the mapper is here.

    To do

    To be done in the near future:

    • DDL generator for denormalized schema, corresponding logical schema and physical schema
    • explicit list of attributes to be selected (instead of all)
    • selection of aggregations per-request (now all specified in model are used)

    Links

    See also Cubes at github, Cubes Documentation, Mailing List and Submit issues. Also there is an IRC channel #databrewery on irc.freenode.net

  • Tags: cubes olap by Stefan Urbanek
  • Star Browser, Part 2: Joins and Denormalization

    Last time I was talking about how logical attributes are mapped to the physical table columns in the Star Browser. Today I will describe how joins are formed and how denormalization is going to be used.

    The Star Browser is new aggregation browser in for the Cubes – lightweight Python OLAP Framework.

    Star, Snowflake, Master and Detail

    Star browser supports a star:

    … and snowflake database schema:

    The browser should know how to construct the star/snowflake and that is why you have to specify the joins of the schema. The join specification is very simple:

    "joins" = [
        { "master": "fact_sales.product_id", "detail": "dim_product.id" }
    ]
    

    Joins support only single-column keys, therefore you might have to create surrogate keys for your dimensions.

    As in mappings, if you have specific needs for explicitly mentioning database schema or any other reason where table.column reference is not enough, you might write:

    "joins" = [
        { 
            "master": "fact_sales.product_id",
            "detail": {
                "schema": "sales",
                "table": "dim_products",
                "column": "id"
            }
    ]
    

    What if you need to join same table twice? For example, you have list of organizations and you want to use it as both: supplier and service consumer. It can be done by specifying alias in the joins:

    "joins" = [
        {
            "master": "contracts.supplier_id", 
            "detail": "organisations.id",
            "alias": "suppliers"
        },
        {
            "master": "contracts.consumer_id", 
            "detail": "organisations.id",
            "alias": "consumers"
        }
    ]
    

    In the mappings you refer to the table by alias specified in the joins, not by real table name:

    "mappings": {
        "supplier.name": "suppliers.org_name",
        "consumer.name": "consumers.org_name"
    }
    

    Relevant Joins and Denormalization

    The new mapper joins only tables that are relevant for given query. That is, if you are browsing by only one dimension, say product, then only product dimension table is joined.

    Joins are slow, expensive and the denormalization can be helpful:

    The old browser is based purely on the denormalized view. Despite having a performance gain, it has several disadvantages. From the join/performance perspective the major one is, that the denormalization is required and it is not possible to browse data in a database that was “read-only”. This requirements was also one unnecessary step for beginners, which can be considered as usability problem.

    Current implementation of the Mapper and StarBrowser allows denormalization to be integrated in a way, that it might be used based on needs and situation:

    It is not yet there and this is what needs to be done:

    • function for denormalization - similar to the old one: will take cube and view name and will create denormalized view (or a table)
    • make mapper accept the view and ignore joins

    Goal is not just to slap denormalization in, but to make it a configurable alternative to default star browsing. From user’s perspective, the workflow will be:

    1. browse star/snowflake until need for denormalization arises
    2. configure denormalization and create denormalized view
    3. browse the denormalized view

    The proposed options are: use_denormalization, denormalized_view_prefix, denormalized_view_schema.

    The Star Browser is half-ready for the denormalization, just few changes are needed in the mapper and maybe query builder. These changes have to be compatible with another, not-yet-included feature: SQL pre-aggregation.

    Conclusion

    The new way of joining is very similar to the old one, but has much more cleaner code and is separated from mappings. Also it is more transparent. New feature is the ability to specify a database schema. Planned feature to be integrated is automatic join detection based on foreign keys.

    In the next post (the last post in this series) about the new StarBrowser, I am going to explain aggregation improvements and changes.

    Links

    Relevant source code is this one (github).

    See also Cubes at github, Cubes Documentation, Mailing List and Submit issues. Also there is an IRC channel #databrewery on irc.freenode.net

  • Tags: cubes olap by Stefan Urbanek
  • Star Browser, Part 1: Mappings

    Star Browser is new aggregation browser in for the Cubes – lightweight Python OLAP Framework. I am going to talk briefly about current state and why new browser is needed. Then I will describe in more details the new browser: how mappings work, how tables are joined. At the end I will mention what will be added soon and what is planned in the future.

    Originally I wanted to write one blog post about this, but it was too long, so I am going to split it into three:

    Why new browser?

    Current denormalized browser is good, but not good enough. Firstly, it has grown into a spaghetti-like structure inside and adding new features is quite difficult. Secondly, it is not immediately clear what is going on inside and not only new users are getting into troubles. For example the mapping of logical to physical is not obvious; denormalization is forced to be used, which is good at the end, but is making OLAP newbies puzzled.

    The new browser, called StarBrowser. is half-ready and will fix many of the old decisions with better ones.

    Mapping

    Cubes provides an analyst’s view of dimensions and their attributes by hiding the physical representation of data. One of the most important parts of proper OLAP on top of the relational database is the mapping of physical attributes to logical.

    First thing that was implemented in the new browser is proper mapping of logical attributes to physical table columns. For example, take a reference to an attribute name in a dimension product. What is the column of what table in which schema that contains the value of this dimension attribute?

    There are two ways how the mapping is being done: implicit and explicit. The simplest, straightforward and most customizable is the explicit way, where the actual column reference is provided in the model description:

    "mappings": {
        "product.name": "dm_products.product_name"
    }
    

    If it is in different schema or any part of the reference contains a dot:

    "mappings": {
        "product.name": {
                "schema": "sales",
                "table": "dm_products",
                "column": "product_name"
            }
    }
    

    Disadvantage of the explicit way is it’s verbosity and the fact that developer has to write more metadata, obviously.

    Both, explicit and implicit mappings have ability to specify default database schema (if you are using Oracle, PostgreSQL or any other DB which supports schemas).

    The mapping process process is like this:

    Implicit Mapping

    With implicit mapping one can match a database schema with logical model and does not have to specify additional mapping metadata. Expected structure is star schema with one table per (denormalized) dimension.

    Basic rules:

    • fact table should have same name as represented cube
    • dimension table should have same name as the represented dimension, for example: product (singular)
    • references without dimension name in them are expected to be in the fact table, for example: amount, discount (see note below for simple flat dimensions)
    • column name should have same name as dimension attribute: name, code, description
    • if attribute is localized, then there should be one column per localization and should have locale suffix: description_en, description_sk, description_fr (see below for more information)

    This means, that by default product.name is mapped to the table product and column name. Measure amount is mapped to the table sales and column amount

    What about dimensions that have only one attribute, like one would not have a full date but just a year? In this case it is kept in the fact table without need of separate dimension table. The attribute is treated in by the same rule as measure and is referenced by simple year. This is applied to all dimensions that have only one attribute (representing key as well). This dimension is referred to as flat and without details.

    Note for advanced users: this behavior can be disabled by setting simplify_dimension_references to False in the mapper. In that case you will have to have separate table for the dimension attribute and you will have to reference the attribute by full name. This might be useful when you know that your dimension will be more detailed.

    Localization

    Despite localization taking place first in the mapping process, we talk about it at the end, as it might be not so commonly used feature. From physical point of view, the data localization is very trivial and requires language denormalization - that means that each language has to have its own column for each attribute.

    In the logical model, some of the attributes may contain list of locales that are provided for the attribute. For example product category can be in English, Slovak or German. It is specified in the model like this:

    attributes = [{
        "name" = "category",
        "locales" = [en, sk, de],
    }]
    

    During the mapping process, localized logical reference is created first:

    In short: if attribute is localizable and locale is requested, then locale suffix is added. If no such localization exists then default locale is used. Nothing happens to non-localizable attributes.

    For such attribute, three columns should exist in the physical model. There are two ways how the columns should be named. They should have attribute name with locale suffix such as category_sk and category_en (underscore because it is more common in table column names), if implicit mapping is used. You can name the columns as you like, but you have to provide explicit mapping in the mapping dictionary. The key for the localized logical attribute should have .locale suffix, such as product.category.sk for Slovak version of category attribute of dimension product. Here the dot is used because dots separate logical reference parts.

    Customization of the Implicit

    The implicit mapping process has a little bit of customization as well:

    • dimension table prefix: you can specify what prefix will be used for all dimension tables. For example if the prefix is dim_ and attribute is product.name then the table is going to be dim_product.
    • fact table prefix: used for constructing fact table name from cube name. Example: having prefix ft_ all fact attributes of cube sales are going to be looked up in table ft_sales
    • fact table name: one can explicitly specify fact table name for each cube separately

    The Big Picture

    Here is the whole mapping schema, after localization:

    Links

    The commented mapper source is here.

  • Tags: cubes sql by Stefan Urbanek
  • Cubes Backend Progress and Comparison

    I’ve been working on a new SQL backend for cubes called StarBrowser. Besides new features and fixes, it is going to be more polished and maintainable.

    Current Backend Comparison

    In the following table you can see comparison of backends (or rather aggregation browsers). Current backend is sql.browser which reqiures denormalized table as a source. Future preferred backend will be sql.star.

    Document link at Google Docs.

    Star Browser state

    More detailed description with schemas and description of what is happening behind will be published once the browser will be useable in most of the important features (that is, no sooner than drill-down is implemented). Here is a peek to the new browser features.

    • separated attribute mapper - doing the logical-to-physical mapping. or in other words: knows what column in which table represents what dimension attribute or a measure
    • more intelligent join building - uses only joins that are relevant to the retrieved attributes, does not join the whole star/snowflake if not necessary
    • allows tables to be stored in different database schemas (previously everything had to be in one schema)

    There is still some work to be done, including drill-down and ordering of results.

    You can try limited feature set of the browser by using sql.star backend name. Do not expect much at this time, however if you find a bug, I would be glad if report it through github issues. The source is in the cubes/backends/sql/star.py and cubes/backends/sql/common.py (or here).

    New and improved

    Here is a list of features you can expect (not yet fully implemented, if at all started):

    • more SQL aggregation types and way to specify what aggregations should be used by-default for each measure
    • DDL schema generator for: denormalized table, logical model - star schema, physical model
    • model tester - tests whether all attributes and joins are valid in the physical model

    Also the new implementation of star browser will allow easier integration of pre-aggregated store (planned) and various other optimisations.

  • Tags: cubes sql by Stefan Urbanek
  • Data Streaming Basics in Brewery

    How to build and run a data analysis stream? Why streams? I am going to talk about how to use brewery from command line and from Python scripts.

    Brewery is a Python framework and a way of analysing and auditing data. Basic principle is flow of structured data through processing and analysing nodes. This architecture allows more transparent, understandable and maintainable data streaming process.

    You might want to use brewery when you:

    • want to learn more about data
    • encounter unknown datasets and/or you do not know what you have in your datasets
    • do not know exactly how to process your data and you want to play-around without getting lost
    • want to create alternative analysis paths and compare them
    • measure data quality and feed data quality results into the data processing process

    There are many approaches and ways how to the data analysis. Brewery brings a certain workflow to the analyst:

    1. examine data
    2. prototype a stream (can use data sampling, not to overheat the machine)
    3. see results and refine stream, create alternatives (at the same time)
    4. repeat 3. until satisfied

    Brewery makes the steps 2. and 3. easy - quick prototyping, alternative branching, comparison. Tries to keep the analysts workflow clean and understandable.

    Building and Running a Stream

    There are two ways to create a stream: programmatic in Python and command-line without Python knowledge requirement. Both ways have two alternatives: quick and simple, but with limited feature set. And the other is full-featured but is more verbose.

    The two programmatic alternatives to create a stream are: basic construction and “HOM” or forking construction. The two command line ways to run a stream: run and pipe. We are now going to look closer at them.

    Note regarding Zen of Python: this does not go against “There should be one – and preferably only one – obvious way to do it.” There is only one way: the raw construction. The others are higher level ways or ways in different environments.

    In our examples below we are going to demonstrate simple linear (no branching) stream that reads a CSV file, performs very basic audit and “pretty prints” out the result. The stream looks like this:

    Command line

    Brewery comes with a command line utility brewery which can run streams without needing to write a single line of python code. Again there are two ways of stream description: json-based and plain linear pipe.

    The simple usage is with brewery pipe command:

    brewery pipe csv_source resource=data.csv audit pretty_printer
    

    The pipe command expects list of nodes and attribute=value pairs for node configuration. If there is no source pipe specified, CSV on standard input is used. If there is no target pipe, CSV on standard output is assumed:

    cat data.csv | brewery pipe audit
    

    The actual stream with implicit nodes is:

    The json way is more verbose but is full-featured: you can create complex processing streams with many branches. stream.json:

        {
            "nodes": { 
                "source": { "type":"csv_source", "resource": "data.csv" },
                "audit":  { "type":"audit" },
                "target": { "type":"pretty_printer" }
            },
            "connections": [
                ["source", "audit"],
                ["audit", "target"]
            ]
        }
    

    And run:

    $ brewery run stream.json
    

    To list all available nodes do:

    $ brewery nodes
    

    To get more information about a node, run brewery nodes <node_name>:

    $ brewery nodes string_strip
    

    Note that data streaming from command line is more limited than the python way. You might not get access to nodes and node features that require python language, such as python storage type nodes or functions.

    Higher order messaging

    Preferred programming way of creating streams is through higher order messaging (HOM), which is, in this case, just fancy name for pretending doing something while in fact we are preparing the stream.

    This way of creating a stream is more readable and maintainable. It is easier to insert nodes in the stream and create forks while not losing picture of the stream. Might be not suitable for very complex streams though. Here is an example:

        b = brewery.create_builder()
        b.csv_source("data.csv")
        b.audit()
        b.pretty_printer()
    

    When this piece of code is executed, nothing actually happens to the data stream. The stream is just being prepared and you can run it anytime:

        b.stream.run()
    

    What actually happens? The builder b is somehow empty object that accepts almost anything and then tries to find a node that corresponds to the method called. Node is instantiated, added to the stream and connected to the previous node.

    You can also create branched stream:

        b = brewery.create_builder()
        b.csv_source("data.csv")
        b.audit()
    
        f = b.fork()
        f.csv_target("audit.csv")
    
        b.pretty_printer()
    

    Basic Construction

    This is the lowest level way of creating the stream and allows full customisation and control of the stream. In the basic construction method the programmer prepares all node instance objects and connects them explicitly, node-by-node. Might be a too verbose, however it is to be used by applications that are constructing streams either using an user interface or from some stream descriptions. All other methods are using this one.

        from brewery import Stream
        from brewery.nodes import CSVSourceNode, AuditNode, PrettyPrinterNode
    
        stream = Stream()
    
        # Create pre-configured node instances
        src = CSVSourceNode("data.csv")
        stream.add(src)
    
        audit = AuditNode()
        stream.add(audit)
    
        printer = PrettyPrinterNode()
        stream.add(printer)
    
        # Connect nodes: source -> target
        stream.connect(src, audit)
        stream.connect(audit, printer)
    
        stream.run()
    

    It is possible to pass nodes as dictionary and connections as list of tuples (source, target):

        stream = Stream(nodes, connections)
    

    Future plans

    What would be lovely to have in brewery?

    Probing and data quality indicators – tools for simple data probing and easy way of creating data quality indicators. Will allow something like “test-driven-development” but for data. This is the next step.

    Stream optimisation – merge multiple nodes into single processing unit before running the stream. Might be done in near future.

    Backend-based nodes and related data transfer between backend nodes – For example, two SQL nodes might pass data through a database table instead of built-in data pipe or two numpy/scipy-based nodes might use numpy/scipy structure to pass data to avoid unnecessary streaming. Not very soon, but foreseeable future.

    Stream compilation – compile a stream to an optimised script. Not too soon, but like to have that one.

    Last, but not least: Currently there is little performance cost because of the nature of brewery implementation. This penalty will be explained in another blog post, however to make long story short, it has to do with threads, Python GIL and non-optimalized stream graph. There is no future prediction for this one, as it might be included step-by-step. Also some Python 3 features look promising, such as yield from in Python 3.3 (PEP 308).

    Links

  • Tags: brewery by Stefan Urbanek
  • Brewery 0.8 Released

    I’m glad to announce new release of Brewery – stream based data auditing and analysis framework for Python.

    There are quite a few updates, to mention the notable ones:

    • new brewery runner with commands run and graph
    • new nodes: pretty printer node (for your terminal pleasure), generator function node
    • many CSV updates and fixes

    Added several simple how-to examples, such as: aggregation of remote CSV, basic audit of a CSV, how to use a generator function. Feedback and questions are welcome. I’ll help you.

    Note that there are couple changes that break compatibility, however they can be updated very easily. I apologize for the inconvenience, but until 1.0 the changes might happen more frequently. On the other hand, I will try to make them as painless as possible.

    Full listing of news, changes and fixes is below.

    Version 0.8

    News

    • Changed license to MIT
    • Created new brewery runner commands: ‘run’ and ‘graph’:
      • ‘brewery run stream.json’ will execute the stream
      • ‘brewery graph stream.json’ will generate graphviz data
    • Nodes: Added pretty printer node - textual output as a formatted table
    • Nodes: Added source node for a generator function
    • Nodes: added analytical type to derive field node
    • Preliminary implementation of data probes (just concept, API not decided yet for 100%)
    • CSV: added empty_as_null option to read empty strings as Null values
    • Nodes can be configured with node.configure(dictionary, protected). If ‘protected’ is True, then protected attributes (specified in node info) can not be set with this method.

    • added node identifier to the node reference doc

    • added create_logger

    • added experimental retype feature (works for CSV only at the moment)

    • Mongo Backend - better handling of record iteration

    Changes

    • CSV: resource is now explicitly named argument in CSV*Node
    • CSV: convert fields according to field storage type (instead of all-strings)
    • Removed fields getter/setter (now implementation is totally up to stream subclass)
    • AggregateNode: rename aggregates to measures, added measures as public node attribute
    • moved errors to brewery.common
    • removed field_name(), now str(field) should be used
    • use named blogger ‘brewery’ instead of the global one
    • better debug-log labels for nodes (node type identifier + python object ID)

    WARNING: Compatibility break:

    • depreciate __node_info__ and use plain node_info instead
    • Stream.update() now takes nodes and connections as two separate arguments

    Fixes

    • added SQLSourceNode, added option to keep ifelds instead of dropping them in FieldMap and FieldMapNode (patch by laurentvasseur @ bitbucket)
    • better traceback handling on node failure (now actually the traceback is displayed)
    • return list of field names as string representation of FieldList
    • CSV: fixed output of zero numeric value in CSV (was empty string)

    Links

    If you have any questions, comments, requests, do not hesitate to ask.

  • Tags: brewery release announcement by Stefan Urbanek
  • Cubes 0.8 Released

    Another minor release of Cubes - Light Weight Python OLAP framework is out. Main change is that backend is no longer hard-wired in the Slicer server and can be selected through configuration file.

    There were lots of documentation changes, for example the reference was separated from the rest of docs. Hello World! example was added.

    The news, changes and fixes are:

    New Features

    • Started writing StarBrowser - another SQL aggregation browser with different approach (see code/docs)

    Slicer Server:

    • added configuration option modules under [server] to load additional modules
    • added ability to specify backend module
    • backend configuration is in [backend] by default, for SQL it stays in [db]
    • added server config option for default prettyprint value (useful for demontration purposes)

    Documentation:

    • Changed license to MIT + small addition. Please refer to the LICENSE file.
    • Updated documentation - added missing parts, made reference more readable, moved class and function reference docs from descriptive part to reference (API) part.
    • added backend documentation
    • Added “Hello World!” example

    Changed Features

    • removed default SQL backend from the server
    • moved worskpace creation into the backend module

    Fixes

    • Fixed create_view to handle not materialized properly (thanks to deytao)
    • Slicer tool header now contains #!/usr/bin/env python

    Links

    If you have any questions, comments, requests, do not hesitate to ask.

  • Tags: cubes announcement olap by Stefan Urbanek
  • Cubes goes MIT license with small addition for SaaS

    Cubes - The Lightweight Python OLAP Framework is now licensed under the MIT license with small addition. The full license is as follows:

    Copyright (c) 2011-2012 Stefan Urbanek, see AUTHORS for more details

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    • The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    • If your version of the Software supports interaction with it remotely through a computer network, the above copyright notice and this permission notice shall be accessible to all users.

    THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

    The addition says, that if you use it as part of software as a service (SaaS) you have to provide the copyright notice in an about, legal info, credits or some similar kind of page or info box. That’s all.

    May it be like that? :-)

    Updated Cubes sources are here, as usual.

    Enjoy.

  • Tags: cubes opensource legal