analytical data streams & online analytical processing Python frameworks
The new minor release of Cubes – light-weight Python
OLAP framework –
brings range cuts,
denormalization
with the slicer tool and cells in /report query, together with fixes and
important changes.
See the second part of this post for the full list.
Range cuts were implemented in the SQL Star Browser. They are used as follows:
Python:
cut = RangeCut("date", [2010], [2012,5,10])
cut_hi = RangeCut("date", None, [2012,5,10])
cut_low = RangeCut("date", [2010], None)
To specify a range in slicer server where keys are sortable:
http://localhost:5000/aggregate?cut=date:2004-2005
http://localhost:5000/aggregate?cut=date:2004,2-2005,5,1
Open ranges:
http://localhost:5000/aggregate?cut=date:2010-
http://localhost:5000/aggregate?cut=date:2004,1,1-
http://localhost:5000/aggregate?cut=date:-2005,5,10
http://localhost:5000/aggregate?cut=date:-2012,5
Now it is possible to denormalize tour data with the slicer tool. You do not have to denormalize using python script. Data are denormalized in a way how denormalized browser would expect them to be. You can tune the process using command line switches, if you do not like the defaults.
Denormalize all cubes in the model:
$ slicer denormalize slicer.ini
Denormalize only one cube::
$ slicer denormalize -c contracts slicer.ini
Create materialized denormalized view with indexes::
$ slicer denormalize --materialize --index slicer.ini
Example slicer.ini:
[workspace] denormalized_view_prefix = mft_ denormalized_view_schema = denorm_views # This switch is used by the browser: use_denormalization = yes
For more information see Cubes slicer tool documentation
Use cell to specify all cuts (type can be range, point or set):
{
"cell": [
{
"dimension": "date",
"type": "range",
"from": [2010,9],
"to": [2011,9]
}
],
"queries": {
"report": {
"query": "aggregate",
"drilldown": {"date":"year"}
}
}
}
For more information see the slicer server documentation.
StarBrowser:
Slicer Server:
/report JSON now accepts cell with full cell description as dictionary,
overrides URL parametersSlicer tool:
denormalize option for (bulk) denormalization of cubes (see the the slicer
documentation for more information)/report JSON requests should now have queries wrapped in the key
queries. This was originally intended way of use, but was not correctly
implemented. A descriptive error message is returned from the server if the
key queries is not present. Despite being rather a bug-fix, it is listed
here as it requires your attention for possible change of your code./report and queriesSources can be found on github. Read the documentation.
Join the Google Group for discussion, problem solving and announcements.
Submit issues and suggestions on github
IRC channel #databrewery on irc.freenode.net
If you have any questions, comments, requests, do not hesitate to ask.
Last time I was talking about joins and denormalisation in the Star Browser. This is the last part about the star browser where I will describe the aggregation and what has changed, compared to the old browser.
The Star Browser is new aggregation browser in for the Cubes – lightweight Python OLAP Framework. Next version v0.9 will be released next week.
sum is not the only aggregation. The new browser allows to have other aggregate functions as well, such as min, max.
You can specify the aggregations for each measure separately:
{
"name": "amount",
"aggregations": ["sum", "min", "max"]
}
The resulting aggregated attribute name will be constructed from the measure
name and aggregation suffix, for example the mentioned amount will have
three aggregates in the result: amount_sum, amount_min and amount_max.
Source code reference: see StarQueryBuilder.aggregations_for_measure
Result of aggregation is a structure containing: summary - summary for the
aggregated cell, drilldown - drill down cells, if was desired, and
total_cell_count - total cells in the drill down, regardless of pagination.
When we are browsing the cube, the cell provides current browsing context. For
aggregations and selections to happen, only keys and some other internal
attributes are necessary. Those can not be presented to the user though. For
example we have geography path (country, region) as ['sk', 'ba'],
however we want to display to the user Slovakia for the country and
Bratislava for the region. We need to fetch those values from the data
store. Cell details is basically a human readable description of the current
cell.
For applications where it is possible to store state between aggregation calls, we can use values from previous aggregations or value listings. Problem is with web applications - sometimes it is not desirable or possible to store whole browsing context with all details. This is exact the situation where fetching cell details explicitly might come handy.
Note: The Original browser added cut information in the summary, which was ok when only point cuts were used. In other situations the result was undefined and mostly erroneous.
The cell details are now provided separately by method
AggregationBrowser.cell_details(cell) which has Slicer HTTP equivalent
/details or {"query":"detail", ...} in /report request. The result is
a list of
For point cuts, the detail is a list of dictionaries for each level. For
example our previously mentioned path ['sk', 'ba'] would have details
described as:
[
{
"geography.country_code": "sk",
"geography.country_name": "Slovakia",
"geography.something_more": "..."
"_key": "sk",
"_label": "Slovakia"
},
{
"geography.region_code": "ba",
"geography.region_name": "Bratislava",
"geography.something_even_more": "...",
"_key": "ba",
"_label": "Bratislava"
}
]
You might have noticed the two redundant keys: _key and _label - those
contain values of a level key attribute and level label attribute
respectively. It is there to simplify the use of the details in presentation
layer, such as templates. Take for example doing only one-dimensional
browsing and compare presentation of “breadcrumbs”:
labels = [detail["_label"] for detail in cut_details]
Which is equivalent to:
levels = dimension.hierarchy.levels()
labels = []
for i, detail in enumerate(cut_details):
labels.append(detail[level[i].label_attribute.full_name()])
Note that this might change a bit: either full detail will be returned or just key and label, depending on an option argument (not yet decided).
The Star Browser is being created with SQL pre-aggregation in mind. This is not possible in the old browser, as it is not flexible enough. It is planned to be integrated when all basic features are finished.
Proposed access from user’s perspective will be through configuration options:
use_preaggregation, preaggregation_prefix, preaggregation_schema and
a method for cube pre-aggregation will be available through the slicer tool.
The new browser has better internal structure resulting in increased flexibility for future extensions. It fixes not so good architectural decisions of the old browser.
New and fixed features:
The new backend sources are here and the mapper is here.
To be done in the near future:
See also Cubes at github, Cubes Documentation, Mailing List and Submit issues. Also there is an IRC channel #databrewery on irc.freenode.net
Last time I was talking about how logical attributes are mapped to the physical table columns in the Star Browser. Today I will describe how joins are formed and how denormalization is going to be used.
The Star Browser is new aggregation browser in for the Cubes – lightweight Python OLAP Framework.
Star browser supports a star:

… and snowflake database schema:

The browser should know how to construct the star/snowflake and that is why you have to specify the joins of the schema. The join specification is very simple:
"joins" = [
{ "master": "fact_sales.product_id", "detail": "dim_product.id" }
]
Joins support only single-column keys, therefore you might have to create surrogate keys for your dimensions.
As in mappings, if you have specific needs for explicitly mentioning database
schema or any other reason where table.column reference is not enough, you
might write:
"joins" = [
{
"master": "fact_sales.product_id",
"detail": {
"schema": "sales",
"table": "dim_products",
"column": "id"
}
]
What if you need to join same table twice? For example, you have list of organizations and you want to use it as both: supplier and service consumer. It can be done by specifying alias in the joins:
"joins" = [
{
"master": "contracts.supplier_id",
"detail": "organisations.id",
"alias": "suppliers"
},
{
"master": "contracts.consumer_id",
"detail": "organisations.id",
"alias": "consumers"
}
]
In the mappings you refer to the table by alias specified in the joins, not by real table name:
"mappings": {
"supplier.name": "suppliers.org_name",
"consumer.name": "consumers.org_name"
}

The new mapper joins only tables that are relevant for given query. That is, if you are browsing by only one dimension, say product, then only product dimension table is joined.
Joins are slow, expensive and the denormalization can be helpful:

The old browser is based purely on the denormalized view. Despite having a performance gain, it has several disadvantages. From the join/performance perspective the major one is, that the denormalization is required and it is not possible to browse data in a database that was “read-only”. This requirements was also one unnecessary step for beginners, which can be considered as usability problem.
Current implementation of the Mapper and StarBrowser allows denormalization to be integrated in a way, that it might be used based on needs and situation:

It is not yet there and this is what needs to be done:
Goal is not just to slap denormalization in, but to make it a configurable alternative to default star browsing. From user’s perspective, the workflow will be:
The proposed options are: use_denormalization, denormalized_view_prefix,
denormalized_view_schema.
The Star Browser is half-ready for the denormalization, just few changes are needed in the mapper and maybe query builder. These changes have to be compatible with another, not-yet-included feature: SQL pre-aggregation.
The new way of joining is very similar to the old one, but has much more cleaner code and is separated from mappings. Also it is more transparent. New feature is the ability to specify a database schema. Planned feature to be integrated is automatic join detection based on foreign keys.
In the next post (the last post in this series) about the new StarBrowser, I am going to explain aggregation improvements and changes.
Relevant source code is this one (github).
See also Cubes at github, Cubes Documentation, Mailing List and Submit issues. Also there is an IRC channel #databrewery on irc.freenode.net
Another minor release of Cubes - Light Weight Python OLAP framework is out. Main change is that backend is no longer hard-wired in the Slicer server and can be selected through configuration file.
There were lots of documentation changes, for example the reference was separated from the rest of docs. Hello World! example was added.
The news, changes and fixes are:
modules under [server] to load additional
modulesprettyprint value (useful for
demontration purposes)If you have any questions, comments, requests, do not hesitate to ask.
What is inside the Cubes Python OLAP Framework? Here is a brief overview of the core modules, their purpose and functionality.
The lightweight framework Cubes is composed of four public modules:

Logical model describes the data from user’s or analyst’s perspective: data how they are being measured, aggregated and reported. Model is independent of physical implementation of data. This physical independence makes it easier to focus on data instead on ways of how to get the data in understandable form.
Cubes model is described by:

There is a utility function provided for loading the model from a JSON file: load_model.
The model module object are capable of being localized (see Model Localization for more information). The cubes provides localization at the metadata level (the model) and functionality to have localization at the data level.
See also: Model Documentation
Core of the Cubes analytics functionality is the aggregation browser. The browser module contains utility classes and functions for the browser to work.

The module components are:
The types of cuts:
[2010] for year level of Date dimension, [2010,1,7] for full date point.Example of point cut effect:

The module provides couple utility functions:
path_from_string - construct a dimension path (point) from a stringstring_from_path - get a string representation of a dimension path (point)string_from_cuts and cuts_from_string are for conversion between string and list of cuts. (Currently only list of point cuts are supported in the string representation)The aggregation browser can:
aggregate(cell))aggregate(cell, drilldown="date"))facts(cell))fact(id))There is convenience function report(cell, report) that can be implemented by backend in more efficient way to get multiple aggregation queries in single call.
More about aggregated browsing can be found in the Cubes documentation.
Actual aggregation is provided by the backends. The backend should implement aggregation browser interface.

Cubes comes with built-in ROLAP backend which uses SQL database through SQLAlchemy. The backend has two major components:
There was an attempt to write a Mongo DB backend, but it does not work any more, it is included in the sources only as reminder, that there should be a mongo backend sometime in the future.
Anyone can write a backend. If you are interested, drop me a line.
Cubes comes with Slicer - a WSGI HTTP OLAP server with API for most of the cubes framework functionality. The server is based on the Werkzeug framework.

Intended use of the slicer is basically as follows:
For more information, please refer to the Cubes Slicer server documentation.
There are plenty things to get improved, of course. Current focus is not on performance, but on achieving simple usability.
The Cubes sources can be found on Github: https://github.com/stiivi/cubes . There is also a IRC channel #databrewery on irc.freenode.net (I try to be there during late evening CET). Issues can be reported on the github project page.
If you have any questions, suggestions, recommendations, just let me know.
I am glad to announce new minor release of Cubes - Light Weight Python OLAP framework for multidimensional data aggregation and browsing. The news, changes and fixes are:
Slicer server:
Slicer server:
Added tutorials in tutorials/ with models in tutorials/models/ and data in tutorials/data/:
If you have any questions, comments, requests, do not hesitate to ask.
In this Cubes OLAP how-to we are going to learn:
In the previous tutorial we learned how to use model descriptions in a JSON file and how to do physical to logical mappings.
Data used are similar as in the second tutorial, manually modified IBRD Balance Sheet taken from The World Bank. Difference between second tutorial and this one is added two columns: category code and sub-category code. They are simple letter codes for the categories and subcategories.
Some dimensions can have multiple levels forming a hierarchy. For example dates have year, month, day; geography has country, region, city; product might have category, subcategory and the product.
Note: Cubes supports multiple hierarchies, for example for date you might have year-month-day or year-quarter-month-day. Most dimensions will have one hierarchy, thought.
In our example we have the item dimension with three levels of hierarchy: category, subcategory and
line item:

The levels are defined in the model:
"levels": [
{
"name":"category",
"label":"Category",
"attributes": ["category"]
},
{
"name":"subcategory",
"label":"Sub-category",
"attributes": ["subcategory"]
},
{
"name":"line_item",
"label":"Line Item",
"attributes": ["line_item"]
}
]
You can see a slight difference between this model description and the previous one: we didn’t just specify
level names and didn’t let cubes to fill-in the defaults. Here we used explicit description of each level.
name is level identifier, label is human-readable label of the level that can be
used in end-user applications and attributes is list of attributes that belong to the level.
The first attribute, if not specified otherwise, is the key attribute of the level.
Other level description attributes are key and label_attribute. The
key specifies attribute name which contains key for the level. Key is an id number, code or
anything that uniquely identifies the dimension level. label_attribute is name of an attribute
that contains human-readable value that can be displayed in user-interface elements such as tables or
charts.
In this how-to we are going to skip all off-topic code, such as data initialization. The full example can
be found in the tutorial sources with suffix
03.
In short we need:
model_03.json) prepared with appropriate mappingsDrill-down is an action that will provide more details about data. Drilling down through a dimension hierarchy will expand next level of the dimension. It can be compared to browsing through your directory structure.
We create a function that will recursively traverse a dimension hierarchy and will print-out aggregations (count of records in this example) at the actual browsed location.
Attributes
dimensionPath is list of dimension points (keys) at each level. It is like file-system path.
def drill_down(cell, dimension, path = []):
Get dimension’s default hierarchy. Cubes supports multiple hierarchies, for example for date you might have year-month-day or year-quarter-month-day. Most dimensions will have one hierarchy, thought.
hierarchy = dimension.default_hierarchy
Base path is path to the most detailed element, to the leaf of a tree, to the fact. Can we go deeper in the hierarchy?
if hierarchy.path_is_base(path):
return
Get the next level in the hierarchy. levels_for_path returns list of levels according to
provided path. When drilldown is set to True then one more level is returned.
levels = hierarchy.levels_for_path(path,drilldown=True)
current_level = levels[-1]
We need to know name of the level key attribute which contains a path component. If the model does not explicitly specify key attribute for the level, then first attribute will be used:
level_key = dimension.attribute_reference(current_level.key)
For prettier display, we get name of attribute which contains label to be displayed for the current level. If there is no label attribute, then key attribute is used.
level_label = dimension.attribute_reference(current_level.label_attribute)
We do the aggregation of the cell… Think of ls $CELL command in commandline, where
$CELL is a directory name. In this function we can think of $CELL to be same as
current working directory (pwd)
result = browser.aggregate(cell, drilldown=[dimension])
for record in result.drilldown:
print "%s%s: %d" % (indent, record[level_label], record["record_count"])
...
And now the drill-down magic. First, construct new path by key attribute value appended to the current path:
drill_path = path[:] + [record[level_key]]
Then get a new cell slice for current path:
drill_down_cell = cell.slice(dimension, drill_path)
And do recursive drill-down:
drill_down(drill_down_cell, dimension, drill_path)
The function looks like this:

Working function example 03 can be found in the tutorial
sources.
Get the full cube (or any part of the cube you like):
cell = browser.full_cube()
And do the drill-down through the item dimension:
drill_down(cell, cube.dimension("item"))
The output should look like this:
a: 32
da: 8
Borrowings: 2
Client operations: 2
Investments: 2
Other: 2
dfb: 4
Currencies subject to restriction: 2
Unrestricted currencies: 2
i: 2
Trading: 2
lo: 2
Net loans outstanding: 2
nn: 2
Nonnegotiable, nonintrest-bearing demand obligations on account of subscribed capital: 2
oa: 6
Assets under retirement benefit plans: 2
Miscellaneous: 2
Premises and equipment (net): 2
Note that because we have changed our source data, we see level codes instead of level names. We will fix that later. Now focus on the drill-down.
See that nice hierarchy tree?
Now if you slice the cell through year 2010 and do the exact same drill-down:
cell = cell.slice("year", [2010])
drill_down(cell, cube.dimension("item"))
you will get similar tree, but only for year 2010 (obviously).
Codes and ids are good for machines and programmers, they are short, might follow some scheme, easy to handle in scripts. Report users have no much use of them, as they look cryptic and have no meaning for the first sight.
Our source data contains two columns for category and for subcategory: column with code and column with label for user interfaces. Both columns belong to the same dimension and to the same level. The key column is used by the analytical system to refer to the dimension point and the label is just decoration.
Levels can have any number of detail attributes. The detail attributes have no analytical meaning and are just ignored during aggregations. If you want to do analysis based on an attribute, make it a separate dimension instead.
So now we fix our model by specifying detail attributes for the levels:

The model description is:
"levels": [
{
"name":"category",
"label":"Category",
"label_attribute": "category_label",
"attributes": ["category", "category_label"]
},
{
"name":"subcategory",
"label":"Sub-category",
"label_attribute": "subcategory_label",
"attributes": ["subcategory", "subcategory_label"]
},
{
"name":"line_item",
"label":"Line Item",
"attributes": ["line_item"]
}
]
}
Note the label_attribute keys. They specify which attribute contains label to be displayed.
Key attribute is by-default the first attribute in the list. If one wants to use some other attribute it
can be specified in key_attribute.
Because we added two new attributes, we have to add mappings for them:
"mappings": { "item.line_item": "line_item",
"item.subcategory": "subcategory",
"item.subcategory_label": "subcategory_label",
"item.category": "category",
"item.category_label": "category_label"
}
In the example tutorial, which can be found in the Cubes sources under tutorial/ directory,
change the model file from model/model_03.json to model/model_03-labels.json
and run the code again. Or fix the file as specified above.
Now the result will be:
Assets: 32
Derivative Assets: 8
Borrowings: 2
Client operations: 2
Investments: 2
Other: 2
Due from Banks: 4
Currencies subject to restriction: 2
Unrestricted currencies: 2
Investments: 2
Trading: 2
Loans Outstanding: 2
Net loans outstanding: 2
Nonnegotiable: 2
Nonnegotiable, nonintrest-bearing demand obligations on account of subscribed capital: 2
Other Assets: 6
Assets under retirement benefit plans: 2
Miscellaneous: 2
Premises and equipment (net): 2
Try to remove the last level line_item from the model file and see what happens. Code still works, but displays only two levels. What does that mean? If metadata - logical model - is used properly in an application, then application can handle most of the model changes without any application modifications. That is, if you add new level or remove a level, there is no need to change your reporting application.
Next: slicing and dicing or slicer server, not sure yet.
If you have any questions, suggestions, comments, let me know.
Book: Star Schema – The Complete ReferenceWell written book - very understandable even for a beginner, despite being focused on more advanced specialists. Explains multi-dimensional database design: star schemas, snowflakes, fact tables, dimensions, aggregated data browsing and more.
I am happy to announce another release of Cubes - Python OLAP framework for multidimensional data aggregation and browsing.
This release, besides some new features, renames Cuboid to more appropriate Cell. This introduces backward python API incompatibility.
Main source repository has changed to Github https://github.com/Stiivi/cubes
Model Initialization Defaults:
default and dimension will be considered flatNote: This initialization defaults might be moved into a separate utility function/class that will populate incomplete model (see Issue #8 )
Slicer server:
view, added view_prefix and view_suffix, the cube view name will be constructed by concatenating view prefix + cube name + view suffixThis change is considered final and therefore we can mark it is as API version 1.
Links:
If you have any questions, comments, requests, do not hesitate to ask.
New version of Cubes - Python OLAP framework and server - was released.
Cubes is a framework for:
star and snowflake schema denormalisation
alias to joins in model description, see joins documentation for more information.log a and log_level configuration options (under [server])format= parameter to /facts, accepts json and csvfields= parameter to /facts - comma separated list of returned fields in CSV (see API)json_record_limit in [server] section), CSV can return whole dataset and will do it iteratively (we do not want to consume all of our memory, do we?)Also many bugs were fixed, including localization in fact(s) retrieval and pagination. Sharing of single SQLAlchemy engine and model within server thread was added for performance reasons.
Enjoy.