2011-11-24 by Stefan Urbanek

Cubes Tutorial 2 - Model and Mappings

In the first tutorial we talked about how to construct model programmatically and how to do basic aggregations.

In this tutorial we are going to learn:

  • how to use model description file
  • why and how to use logical to physical mappings

Data used are the same as in the first tutorial, IBRD Balance Sheet taken from The World Bank. However, for purpose of this tutorial, the file was little bit manually edited: the column “Line Item” is split into two: Subcategory and Line Item to provide two more levels to total of three levels of hierarchy.

Logical Model

The Cubes framework uses a logical model. Logical model describes the data from user’s or analyst’s perspective: data how they are being measured, aggregated and reported. Model creates an abstraction layer therefore making reports independent of physical structure of the data. More information can be found in the framework documentation

The model description file is a JSON file containing a dictionary:

{
    "dimensions": [  ...  ],
    "cubes": { ... }
}

First we define the dimensions. They might be shared by multiple cubes, therefore they belong to the model space. There are two dimensions: item and year in our dataset. The year dimension is flat, contains only one level and has no details. The dimension item has three levels: category, subcategory and line item. It looks like this:

We define them as:

{
    "dimensions": [
        {"name":"item",
         "levels": ["category", "subcategory", "line_item"]
        },
        {"name":"year"}
    ],
    "cubes": {...}
}

The levels of our tutorial dimensions are simple, with no details. There is little bit of implicit construction going on behind the scenes of dimension initialization, but that will be described later. In short: default hierarchy is created and for each level single attribute is created with the same name as the level.

Next we define the cubes. The cube is in most cases specified by list of dimensions and measures:

{
    "dimensions": [...],
    "cubes": [
        {
            "name": "irbd_balance",
            "dimensions": ["item", "year"],
            "measures": ["amount"]
        }
    ]
}

And we are done: we have dimensions and a cube. Well, almost done: we have to tell the framework, which attributes are going to be used.

Attribute Naming

As mentioned before, cubes uses logical model to describe the data used in the reports. To assure consistency with dimension attribute naming, cubes uses sheme: dimension.attribute for non-flat dimensions. Why? Firstly, it decreases doubt to which dimension the attribute belongs. Secondly the item.category will always be item.category in the report, regardless of how the field will be named in the source and in which table the field exists.

Imagine a snowflake schema: fact table in the middle with references to multiple tables containing various dimension data. There might be a dimension spanning through multiple tables, like product category in one table, product subcategory in another table. We should not care about what table the attribute comes from, we should care only that the attribute is called category and belongs to a dimension product for example.

Another reason is, that in localized data, the analyst will use item.category_label and appropriate localized physical attribute will be used. Just to name few reasons.

Knowing the naming scheme we have following cube attribute names:

  • year (it is flat dimension)
  • item.category
  • item.subcategory
  • item.line_item

Problem is, that the table does not have the columns with the names. That is what mapping is for: maps logical attributes in the model into physical attributes in the table.

Mapping

The source table looks like this:

We have to tell how the dimension attributes are mapped to the table columns. It is a simple dictionary where keys are dimension attribute names and values are physical table column names.

{
    ...
    "cubes": [
        {
            "name":"irbd_balance",
            ...
            "mappings": { "item.line_item": "line_item",
                          "item.subcategory": "subcategory",
                          "item.category": "category" }
        }
    ]
}

Note: The mapping values might be backend specific. They are physical table column names for the current implementation of the SQL backend.

Full model looks like this:

{
    "dimensions": [
        {"name":"item",
         "levels": ["category", "subcategory", "line_item"]
        },
        {"name":"year"}
    ],
    "cubes": [
        {
            "name":"irbd_balance",
            "dimensions": ["item", "year"],
            "measures": ["amount"],
            "mappings": { "item.line_item": "line_item",
                          "item.subcategory": "subcategory",
                          "item.category": "category" }
        }
        ]
    }
}

Example

Now we have the model, saved for example in the models/model_02.json. Let’s do some preparation:

Define table names and a view name to be used later. The view is going to be used as logical abstraction.

FACT_TABLE = "ft_irbd_balance"
FACT_VIEW = "vft_irbd_balance"

Load the data, as in the previous example, using the tutorial helper function (again, do not use that in production):

engine = sqlalchemy.create_engine('sqlite:///:memory:')
tutorial.create_table_from_csv(engine, 
                      "data/IBRD_Balance_Sheet__FY2010-t02.csv", 
                      table_name=FACT_TABLE, 
                      fields=[
                            ("category", "string"), 
                            ("subcategory", "string"), 
                            ("line_item", "string"),
                            ("year", "integer"), 
                            ("amount", "integer")],
                      create_id=True

                        )
connection = engine.connect()

The new data sheet is in the github repository.

Load the model, get the cube and specify where cube’s source data comes from:

workspace = cubes.Workspace()
workspace.import_model("models/model_02.json")
cube = workspace.cube("irbd_balance")
cube.fact = FACT_TABLE

We have to prepare the logical structures used by the browser. Currenlty provided is simple data denormalizer: creates one wide view with logical column names (optionally with localization). Following code initializes the denomralizer and creates a view for the cube:

dn = cubes.backends.sql.SQLDenormalizer(cube, connection)

dn.create_view(FACT_VIEW)

And from this point on, we can continue as usual:

browser = cubes.backends.sql.SQLBrowser(cube, connection, view_name = FACT_VIEW)

cell = cubes.Cell(cube)
result = browser.aggregate(cell)

print "Record count: %d" % result.summary["record_count"]
print "Total amount: %d" % result.summary["amount_sum"]

The tutorial sources can be found in the Cubes github repository. Requires current git clone.

Next: Drill-down through deep hierarchy.

If you have any questions, suggestions, comments, let me know.