2012-05-01 by Stefan Urbanek

Star Browser, Part 2: Joins and Denormalization

Last time I was talking about how logical attributes are mapped to the physical table columns in the Star Browser. Today I will describe how joins are formed and how denormalization is going to be used.

The Star Browser is new aggregation browser in for the Cubes – lightweight Python OLAP Framework.

Star, Snowflake, Master and Detail

Star browser supports a star:

... and snowflake database schema:

The browser should know how to construct the star/snowflake and that is why you have to specify the joins of the schema. The join specification is very simple:

"joins" = [
    { "master": "fact_sales.product_id", "detail": "dim_product.id" }
]

Joins support only single-column keys, therefore you might have to create surrogate keys for your dimensions.

As in mappings, if you have specific needs for explicitly mentioning database schema or any other reason where table.column reference is not enough, you might write:

"joins" = [
    { 
        "master": "fact_sales.product_id",
        "detail": {
            "schema": "sales",
            "table": "dim_products",
            "column": "id"
        }
]

What if you need to join same table twice? For example, you have list of organizations and you want to use it as both: supplier and service consumer. It can be done by specifying alias in the joins:

"joins" = [
    {
        "master": "contracts.supplier_id", 
        "detail": "organisations.id",
        "alias": "suppliers"
    },
    {
        "master": "contracts.consumer_id", 
        "detail": "organisations.id",
        "alias": "consumers"
    }
]

In the mappings you refer to the table by alias specified in the joins, not by real table name:

"mappings": {
    "supplier.name": "suppliers.org_name",
    "consumer.name": "consumers.org_name"
}

Relevant Joins and Denormalization

The new mapper joins only tables that are relevant for given query. That is, if you are browsing by only one dimension, say product, then only product dimension table is joined.

Joins are slow, expensive and the denormalization can be helpful:

The old browser is based purely on the denormalized view. Despite having a performance gain, it has several disadvantages. From the join/performance perspective the major one is, that the denormalization is required and it is not possible to browse data in a database that was "read-only". This requirements was also one unnecessary step for beginners, which can be considered as usability problem.

Current implementation of the Mapper and StarBrowser allows denormalization to be integrated in a way, that it might be used based on needs and situation:

It is not yet there and this is what needs to be done:

  • function for denormalization - similar to the old one: will take cube and view name and will create denormalized view (or a table)
  • make mapper accept the view and ignore joins

Goal is not just to slap denormalization in, but to make it a configurable alternative to default star browsing. From user's perspective, the workflow will be:

  1. browse star/snowflake until need for denormalization arises
  2. configure denormalization and create denormalized view
  3. browse the denormalized view

The proposed options are: use_denormalization, denormalized_view_prefix, denormalized_view_schema.

The Star Browser is half-ready for the denormalization, just few changes are needed in the mapper and maybe query builder. These changes have to be compatible with another, not-yet-included feature: SQL pre-aggregation.

Conclusion

The new way of joining is very similar to the old one, but has much more cleaner code and is separated from mappings. Also it is more transparent. New feature is the ability to specify a database schema. Planned feature to be integrated is automatic join detection based on foreign keys.

In the next post (the last post in this series) about the new StarBrowser, I am going to explain aggregation improvements and changes.

Links

Relevant source code is this one (github).

See also Cubes at github, Cubes Documentation, Mailing List and Submit issues. Also there is an IRC channel #databrewery on irc.freenode.net

2012-04-30 by Stefan Urbanek

Star Browser, Part 1: Mappings

Star Browser is new aggregation browser in for the Cubes – lightweight Python OLAP Framework. I am going to talk briefly about current state and why new browser is needed. Then I will describe in more details the new browser: how mappings work, how tables are joined. At the end I will mention what will be added soon and what is planned in the future.

Originally I wanted to write one blog post about this, but it was too long, so I am going to split it into three:

Why new browser?

Current denormalized browser is good, but not good enough. Firstly, it has grown into a spaghetti-like structure inside and adding new features is quite difficult. Secondly, it is not immediately clear what is going on inside and not only new users are getting into troubles. For example the mapping of logical to physical is not obvious; denormalization is forced to be used, which is good at the end, but is making OLAP newbies puzzled.

The new browser, called StarBrowser. is half-ready and will fix many of the old decisions with better ones.

Mapping

Cubes provides an analyst's view of dimensions and their attributes by hiding the physical representation of data. One of the most important parts of proper OLAP on top of the relational database is the mapping of physical attributes to logical.

First thing that was implemented in the new browser is proper mapping of logical attributes to physical table columns. For example, take a reference to an attribute name in a dimension product. What is the column of what table in which schema that contains the value of this dimension attribute?

There are two ways how the mapping is being done: implicit and explicit. The simplest, straightforward and most customizable is the explicit way, where the actual column reference is provided in the model description:

"mappings": {
    "product.name": "dm_products.product_name"
}

If it is in different schema or any part of the reference contains a dot:

"mappings": {
    "product.name": {
            "schema": "sales",
            "table": "dm_products",
            "column": "product_name"
        }
}

Disadvantage of the explicit way is it's verbosity and the fact that developer has to write more metadata, obviously.

Both, explicit and implicit mappings have ability to specify default database schema (if you are using Oracle, PostgreSQL or any other DB which supports schemas).

The mapping process process is like this:

Implicit Mapping

With implicit mapping one can match a database schema with logical model and does not have to specify additional mapping metadata. Expected structure is star schema with one table per (denormalized) dimension.

Basic rules:

  • fact table should have same name as represented cube
  • dimension table should have same name as the represented dimension, for example: product (singular)
  • references without dimension name in them are expected to be in the fact table, for example: amount, discount (see note below for simple flat dimensions)
  • column name should have same name as dimension attribute: name, code, description
  • if attribute is localized, then there should be one column per localization and should have locale suffix: description_en, description_sk, description_fr (see below for more information)

This means, that by default product.name is mapped to the table product and column name. Measure amount is mapped to the table sales and column amount

What about dimensions that have only one attribute, like one would not have a full date but just a year? In this case it is kept in the fact table without need of separate dimension table. The attribute is treated in by the same rule as measure and is referenced by simple year. This is applied to all dimensions that have only one attribute (representing key as well). This dimension is referred to as flat and without details.

Note for advanced users: this behavior can be disabled by setting simplify_dimension_references to False in the mapper. In that case you will have to have separate table for the dimension attribute and you will have to reference the attribute by full name. This might be useful when you know that your dimension will be more detailed.

Localization

Despite localization taking place first in the mapping process, we talk about it at the end, as it might be not so commonly used feature. From physical point of view, the data localization is very trivial and requires language denormalization - that means that each language has to have its own column for each attribute.

In the logical model, some of the attributes may contain list of locales that are provided for the attribute. For example product category can be in English, Slovak or German. It is specified in the model like this:

attributes = [{
    "name" = "category",
    "locales" = [en, sk, de],
}]

During the mapping process, localized logical reference is created first:

In short: if attribute is localizable and locale is requested, then locale suffix is added. If no such localization exists then default locale is used. Nothing happens to non-localizable attributes.

For such attribute, three columns should exist in the physical model. There are two ways how the columns should be named. They should have attribute name with locale suffix such as category_sk and category_en (underscore because it is more common in table column names), if implicit mapping is used. You can name the columns as you like, but you have to provide explicit mapping in the mapping dictionary. The key for the localized logical attribute should have .locale suffix, such as product.category.sk for Slovak version of category attribute of dimension product. Here the dot is used because dots separate logical reference parts.

Customization of the Implicit

The implicit mapping process has a little bit of customization as well:

  • dimension table prefix: you can specify what prefix will be used for all dimension tables. For example if the prefix is dim_ and attribute is product.name then the table is going to be dim_product.
  • fact table prefix: used for constructing fact table name from cube name. Example: having prefix ft_ all fact attributes of cube sales are going to be looked up in table ft_sales
  • fact table name: one can explicitly specify fact table name for each cube separately

The Big Picture

Here is the whole mapping schema, after localization:

Links

The commented mapper source is here.

2012-04-29 by Stefan Urbanek

Cubes Backend Progress and Comparison

I've been working on a new SQL backend for cubes called StarBrowser. Besides new features and fixes, it is going to be more polished and maintainable.

Current Backend Comparison

In the following table you can see comparison of backends (or rather aggregation browsers). Current backend is sql.browser which reqiures denormalized table as a source. Future preferred backend will be sql.star.

Document link at Google Docs.

Star Browser state

More detailed description with schemas and description of what is happening behind will be published once the browser will be useable in most of the important features (that is, no sooner than drill-down is implemented). Here is a peek to the new browser features.

  • separated attribute mapper - doing the logical-to-physical mapping. or in other words: knows what column in which table represents what dimension attribute or a measure
  • more intelligent join building - uses only joins that are relevant to the retrieved attributes, does not join the whole star/snowflake if not necessary
  • allows tables to be stored in different database schemas (previously everything had to be in one schema)

There is still some work to be done, including drill-down and ordering of results.

You can try limited feature set of the browser by using sql.star backend name. Do not expect much at this time, however if you find a bug, I would be glad if report it through github issues. The source is in the cubes/backends/sql/star.py and cubes/backends/sql/common.py (or here).

New and improved

Here is a list of features you can expect (not yet fully implemented, if at all started):

  • more SQL aggregation types and way to specify what aggregations should be used by-default for each measure
  • DDL schema generator for: denormalized table, logical model - star schema, physical model
  • model tester - tests whether all attributes and joins are valid in the physical model

Also the new implementation of star browser will allow easier integration of pre-aggregated store (planned) and various other optimisations.

2012-03-09 by Stefan Urbanek

Cubes 0.8 Released

Another minor release of Cubes - Light Weight Python OLAP framework is out. Main change is that backend is no longer hard-wired in the Slicer server and can be selected through configuration file.

There were lots of documentation changes, for example the reference was separated from the rest of docs. Hello World! example was added.

The news, changes and fixes are:

New Features

  • Started writing StarBrowser - another SQL aggregation browser with different approach (see code/docs)

Slicer Server:

  • added configuration option modules under [server] to load additional modules
  • added ability to specify backend module
  • backend configuration is in [backend] by default, for SQL it stays in [db]
  • added server config option for default prettyprint value (useful for demontration purposes)

Documentation:

  • Changed license to MIT + small addition. Please refer to the LICENSE file.
  • Updated documentation - added missing parts, made reference more readable, moved class and function reference docs from descriptive part to reference (API) part.
  • added backend documentation
  • Added "Hello World!" example

Changed Features

  • removed default SQL backend from the server
  • moved worskpace creation into the backend module

Fixes

  • Fixed create_view to handle not materialized properly (thanks to deytao)
  • Slicer tool header now contains #!/usr/bin/env python

Links

  • github sources: https://github.com/Stiivi/cubes
  • Documentation: http://packages.python.org/cubes/
  • Mailing List: http://groups.google.com/group/cubes-discuss
  • Submit issues here: https://github.com/Stiivi/cubes/issues
  • IRC channel: #databrewery on irc.freenode.net

If you have any questions, comments, requests, do not hesitate to ask.

2012-02-23 by Stefan Urbanek

Cubes goes MIT license with small addition for SaaS

Cubes - The Lightweight Python OLAP Framework is now licensed under the MIT license with small addition. The full license is as follows:

Copyright (c) 2011-2012 Stefan Urbanek, see AUTHORS for more details

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

  • The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

  • If your version of the Software supports interaction with it remotely through a computer network, the above copyright notice and this permission notice shall be accessible to all users.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

The addition says, that if you use it as part of software as a service (SaaS) you have to provide the copyright notice in an about, legal info, credits or some similar kind of page or info box. That's all.

May it be like that? :-)

Updated Cubes sources are here, as usual.

Enjoy.

Previous Page | Next Page