2012-10-05 by Stefan Urbanek
After a while, here is an update to Cubes - Python Lightweight OLAP framework for multidimensional modeling. There are some changes included that were mentioned in the EruoPython talk such as table_rows
and cross_table
.
I recommend to look at updated examples in the Github repository. The Flask example is now "real" example instead of "sandbox" example and you can see how to generate a simple table for dimension hierarchy browsing.
There is also a more complex example with star-like schema dataset in the cubes-examples github repository. Follow the instructions in README files how to make it running.
There are some backward incompatible changes in this release – until 1.0 the "point" releases will contain this kind of changes, as it is still evolving. You can find more information below.
Quick Summary
- Way how model is constructed has changed. Designated methods are
create_model()
or load_model()
- Dimension defition can have a "template". For example:
{
"name": "contract_date",
"template": "date"
}
- added
table_rows()
and cross_table()
to aggregation result for more convenient table creation. The table_rows
takes care of providing appropriate dimension key and label for browsed level.
- added
simple_model(cube_name, dimension_names, measures)
Incompatibilities: use create_model()
instead of Model(**dict)
, if you
were using just load_model()
, you are fine.
New Features
- To address issue #8
create_model(dict)
was added as replacement for Model(**dict)
. Model()
from
now on will expect correctly constructed model objects. create_model()
will
be able to handle various simplifications and defaults during the
construction process.
- added
info
attribute to all model objects. It can be used to store custom,
application or front-end specific information
- preliminary implementation of
cross_table()
(interface might be changed)
AggregationResult.table_rows()
- new method that iterates through
drill-down rows and returns a tuple with key, label, path, and rest of the
fields.
- dimension in model description can specify another template dimension – all
properties from the template will be inherited in the new dimension. All
dimension properties specified in the new dimension completely override the
template specification
- added
point_cut_for_dimension
- added
simple_model(cube_name, dimensions, measures)
– creates a single-cube
model with flat dimensions from a list of dimension names and measures from
a list of measure names. For example:
model = simple_model("contracts", ["year","contractor", "type"], ["amount"])
Slicer Server:
/cell
– return cell details (replaces /details
)
Changes
- creation of a model from dictionary through
Model(dict)
is depreciated, use
create_model(dict)
instead. All initialization code will be moved there.
Depreciation warnings were added. Old functionality retained for the time
being. (important)
- Replaced
Attribute.full_name()
with Attribute.ref()
- Removed
Dimension.attribute_reference()
as same can be achieved with
dim(attr).ref()
AggregationResult.drilldown
renamed to AggregationResults.cells
(important)
Planned Changes:
str(Attribute)
will return ref() instead of attribute name as it is more
useful
Fixes
- order of dimensions is now preserved in the Model
Links
Sources can be found on github.
Read the documentation.
Join the Google Group for discussion, problem solving and announcements.
Submit issues and suggestions on github
IRC channel #databrewery on irc.freenode.net
If you have any questions, comments, requests, do not hesitate to ask.
2012-06-12 by Stefan Urbanek
Cubes is going to EuroPython 2012.
EDIT: Added "Need help?".
There are going to be two sessions. First there will be talk introducing to light-weight OLAP with Cubes and Slicer, on Friday at 9:45 in room Tagliatelle (add to calendar). Afterwards there will be longer, more in-depth and hands-on training about Slicing and Dicing with Cubes on Friday at 14:30 in room Pizza Napoli (add to calendar)
In the talk I will introduce the framework and explain reasons for it's existence. Then I will dig into architecture, features and briefly show examples how to use it for slicing and dicing. Newbies are welcome.
The training will go into more details and the participants will learn:
- how to prepare data for aggregated browsing - star and snowflake schemas
- how to create a logical model, define cubes, dimensions and hierarchies
- how to browse aggregated data and how to slice and dice cubes from within
Python
- how to create a WSGI OLAP server ("in 15 minutes" style) for aggregated data
browsing and how to use it in your web application for providing (browsable)
data to end-user reports
- how to provide localized reporting
If the time permits, we can look at the anatomy of the framework and see how
to implement a backend for another kind of data store.
I will be focusing on the existing SQL (relational OLAP) backend.
Customized examples
You might use the training session (and not only the session) to solve your
problem - just bring your own sample data, if you like.
Do you have any data that you would like to slice and dice? Have a database
schema and do not know how to create a logical model? You can send me a data
sample or a schema, so I can prepare examples based on problem you are
solving.
Please, do not send any confidential data or schemas under NDA.
Need help?
If you have any questions or would like to help with your data: from data
preparation, through data modeling to slicing and dicing. You can grab me
during the whole event. If you can not find me, just tweet me:
@Stiivi.
Participation
If anyone is interested in participating in the project, he is welcome. Here are some features that are either out of scope of my skills and I would like to cooperate with someone more professional, or I do not have available resources to do that:
I am also very open to new feature suggestions and feature change requests.
Just little note: Cubes is meant to be small and simple. At least for now.
There are plenty of complex and feature-rich solutions out there. If we can
make new, more complex features as non-critical, optional plug-ins, that would
be great.
Links and Calendar Events
You can add the talks to your calendar by following the links:
2012-05-29 by Stefan Urbanek
The new minor release of Cubes – light-weight Python
OLAP framework –
brings range cuts,
denormalization
with the slicer tool and cells in /report
query, together with fixes and
important changes.
See the second part of this post for the full list.
Range Cuts
Range cuts were implemented in the SQL Star Browser. They are used as follows:
Python:
cut = RangeCut("date", [2010], [2012,5,10])
cut_hi = RangeCut("date", None, [2012,5,10])
cut_low = RangeCut("date", [2010], None)
To specify a range in slicer server where keys are sortable:
http://localhost:5000/aggregate?cut=date:2004-2005
http://localhost:5000/aggregate?cut=date:2004,2-2005,5,1
Open ranges:
http://localhost:5000/aggregate?cut=date:2010-
http://localhost:5000/aggregate?cut=date:2004,1,1-
http://localhost:5000/aggregate?cut=date:-2005,5,10
http://localhost:5000/aggregate?cut=date:-2012,5
Denormalization with slicer Tool
Now it is possible to denormalize tour data with the slicer tool. You do not
have to denormalize using python script. Data are denormalized in a way how
denormalized browser would expect them to be. You can tune the process using
command line switches, if you do not like the defaults.
Denormalize all cubes in the model:
$ slicer denormalize slicer.ini
Denormalize only one cube::
$ slicer denormalize -c contracts slicer.ini
Create materialized denormalized view with indexes::
$ slicer denormalize --materialize --index slicer.ini
Example slicer.ini
:
[workspace]
denormalized_view_prefix = mft_
denormalized_view_schema = denorm_views
# This switch is used by the browser:
use_denormalization = yes
For more information see Cubes slicer tool documentation
Cells in Report
Use cell
to specify all cuts (type can be range
, point
or set
):
{
"cell": [
{
"dimension": "date",
"type": "range",
"from": [2010,9],
"to": [2011,9]
}
],
"queries": {
"report": {
"query": "aggregate",
"drilldown": {"date":"year"}
}
}
}
For more information see the slicer server
documentation.
New Features
- cut_from_string(): added parsing of range and set cuts from string;
introduced requirement for key format: Keys should now have format
"alphanumeric character or underscore" if they are going to be converted to
strings (for example when using slicer HTTP server)
- cut_from_dict(): create a cut (of appropriate class) from a dictionary
description
- Dimension.attribute(name): get attribute instance from name
- added exceptions: CubesError, ModelInconsistencyError, NoSuchDimensionError,
NoSuchAttributeError, ArgumentError, MappingError, WorkspaceError and
BrowserError
StarBrowser:
- implemented RangeCut conditions
Slicer Server:
/report
JSON now accepts cell
with full cell description as dictionary,
overrides URL parameters
Slicer tool:
denormalize
option for (bulk) denormalization of cubes (see the the slicer
documentation for more information)
Changes
- important: all
/report
JSON requests should now have queries wrapped in the key
queries
. This was originally intended way of use, but was not correctly
implemented. A descriptive error message is returned from the server if the
key queries
is not present. Despite being rather a bug-fix, it is listed
here as it requires your attention for possible change of your code.
- warn when no backend is specified during slicer context creation
Fixes
- Better handling of missing optional packages, also fixes #57 (now works
without slqalchemy and without werkzeug as expected)
- see change above about
/report
and queries
- push more errors as JSON responses to the requestor, instead of just failing
with an exception
Links
Sources can be found on github.
Read the documentation.
Join the Google Group for discussion, problem solving and announcements.
Submit issues and suggestions on github
IRC channel #databrewery on irc.freenode.net
If you have any questions, comments, requests, do not hesitate to ask.
2012-05-12 by Stefan Urbanek
Last time I was talking about joins and
denormalisation in the Star
Browser. This is the last part about the star browser where I will describe the aggregation and what has changed, compared to the old browser.
The Star Browser is new aggregation browser in for the Cubes – lightweight
Python OLAP Framework. Next version v0.9 will be released next week.
Aggregation
sum is not the only aggregation. The new browser allows to have other
aggregate functions as well, such as min, max.
You can specify the aggregations for each measure separately:
{
"name": "amount",
"aggregations": ["sum", "min", "max"]
}
The resulting aggregated attribute name will be constructed from the measure
name and aggregation suffix, for example the mentioned amount will have
three aggregates in the result: amount_sum
, amount_min
and amount_max
.
Source code reference: see StarQueryBuilder.aggregations_for_measure
Aggregation Result
Result of aggregation is a structure containing: summary
- summary for the
aggregated cell, drilldown
- drill down cells, if was desired, and
total_cell_count
- total cells in the drill down, regardless of pagination.
Cell Details
When we are browsing the cube, the cell provides current browsing context. For
aggregations and selections to happen, only keys and some other internal
attributes are necessary. Those can not be presented to the user though. For
example we have geography path (country
, region
) as ['sk', 'ba']
,
however we want to display to the user Slovakia
for the country and
Bratislava
for the region. We need to fetch those values from the data
store. Cell details is basically a human readable description of the current
cell.
For applications where it is possible to store state between aggregation
calls, we can use values from previous aggregations or value listings. Problem
is with web applications - sometimes it is not desirable or possible to store
whole browsing context with all details. This is exact the situation where
fetching cell details explicitly might come handy.
Note: The Original browser added cut information in the summary, which was ok
when only point cuts were used. In other situations the result was undefined
and mostly erroneous.
The cell details are now provided separately by method
AggregationBrowser.cell_details(cell)
which has Slicer HTTP equivalent
/details
or {"query":"detail", ...}
in /report
request. The result is
a list of
For point cuts, the detail is a list of dictionaries for each level. For
example our previously mentioned path ['sk', 'ba']
would have details
described as:
[
{
"geography.country_code": "sk",
"geography.country_name": "Slovakia",
"geography.something_more": "..."
"_key": "sk",
"_label": "Slovakia"
},
{
"geography.region_code": "ba",
"geography.region_name": "Bratislava",
"geography.something_even_more": "...",
"_key": "ba",
"_label": "Bratislava"
}
]
You might have noticed the two redundant keys: _key
and _label
- those
contain values of a level key attribute and level label attribute
respectively. It is there to simplify the use of the details in presentation
layer, such as templates. Take for example doing only one-dimensional
browsing and compare presentation of "breadcrumbs":
labels = [detail["_label"] for detail in cut_details]
Which is equivalent to:
levels = dimension.hierarchy.levels()
labels = []
for i, detail in enumerate(cut_details):
labels.append(detail[level[i].label_attribute.full_name()])
Note that this might change a bit: either full detail will be returned or just
key and label, depending on an option argument (not yet decided).
Pre-aggregation
The Star Browser is being created with SQL pre-aggregation in mind. This is
not possible in the old browser, as it is not flexible enough. It is planned
to be integrated when all basic features are finished.
Proposed access from user's perspective will be through configuration options:
use_preaggregation
, preaggregation_prefix
, preaggregation_schema
and
a method for cube pre-aggregation will be available through the slicer tool.
Summary
The new browser has better internal structure resulting in increased
flexibility for future extensions. It fixes not so good architectural
decisions of the old browser.
New and fixed features:
- direct star/snowflake schema browsing
- improved mappings - more transparent and understandable process
- ability to explicitly specify database schemas
- multiple aggregations
The new backend sources are
here
and the mapper is
here.
To do
To be done in the near future:
- DDL generator for denormalized schema, corresponding logical schema and
physical schema
- explicit list of attributes to be selected (instead of all)
- selection of aggregations per-request (now all specified in model are used)
Links
See also Cubes at github,
Cubes Documentation,
Mailing List
and Submit issues. Also there is an
IRC channel #databrewery on
irc.freenode.net
2012-05-01 by Stefan Urbanek
Last time I was talking about how logical attributes are mapped to the
physical table columns in the
Star Browser. Today I will describe how joins are formed and how
denormalization is going to be used.
The Star Browser is new aggregation browser in for the
Cubes – lightweight Python OLAP Framework.
Star, Snowflake, Master and Detail
Star browser supports a star:
... and snowflake database schema:
The browser should know how to construct the star/snowflake and that is why
you have to specify the joins of the schema. The join specification is very
simple:
"joins" = [
{ "master": "fact_sales.product_id", "detail": "dim_product.id" }
]
Joins support only single-column keys, therefore you might have to create
surrogate keys for your dimensions.
As in mappings, if you have specific needs for explicitly mentioning database
schema or any other reason where table.column
reference is not enough, you
might write:
"joins" = [
{
"master": "fact_sales.product_id",
"detail": {
"schema": "sales",
"table": "dim_products",
"column": "id"
}
]
What if you need to join same table twice? For example, you have list of
organizations and you want to use it as both: supplier and service consumer.
It can be done by specifying alias in the joins:
"joins" = [
{
"master": "contracts.supplier_id",
"detail": "organisations.id",
"alias": "suppliers"
},
{
"master": "contracts.consumer_id",
"detail": "organisations.id",
"alias": "consumers"
}
]
In the mappings you refer to the table by alias specified in the joins, not by
real table name:
"mappings": {
"supplier.name": "suppliers.org_name",
"consumer.name": "consumers.org_name"
}
Relevant Joins and Denormalization
The new mapper joins only tables that are relevant for given query. That is,
if you are browsing by only one dimension, say product, then only product
dimension table is joined.
Joins are slow, expensive and the denormalization can be
helpful:
The old browser is based purely on the denormalized view. Despite having a
performance gain, it has several disadvantages. From the
join/performance perspective the major one is, that the denormalization is
required and it is not possible to browse data in a database that was
"read-only". This requirements was also one unnecessary step for beginners,
which can be considered as usability problem.
Current implementation of the Mapper and StarBrowser allows
denormalization to be integrated in a way, that it might be used based on
needs and situation:
It is not yet there and this is what needs to be done:
- function for denormalization - similar to the old one: will take cube and
view name and will create denormalized view (or a table)
- make mapper accept the view and ignore joins
Goal is not just to slap denormalization in, but to make it a configurable
alternative to default star browsing. From user's perspective, the workflow
will be:
- browse star/snowflake until need for denormalization arises
- configure denormalization and create denormalized view
- browse the denormalized view
The proposed options are: use_denormalization
, denormalized_view_prefix
,
denormalized_view_schema
.
The Star Browser is half-ready for the denormalization, just few changes are
needed in the mapper and maybe query builder. These changes have to be
compatible with another, not-yet-included feature: SQL pre-aggregation.
Conclusion
The new way of joining is very similar to the old one, but has much more
cleaner code and is separated from mappings. Also it is more transparent. New
feature is the ability to specify a database schema. Planned feature to be
integrated is automatic join detection based on foreign keys.
In the next post (the last post in this series) about the new StarBrowser, I am going to
explain aggregation improvements and changes.
Links
Relevant source code is this one (github).
See also Cubes at github,
Cubes Documentation,
Mailing List
and Submit issues. Also there is an
IRC channel #databrewery on
irc.freenode.net