2011-03-23 by Stefan Urbanek

Brew data from Scraper Wiki

A new subproject has sprouted in Brewery: Opendata. The new package will contain wrappers for various open data services with APIs for structured data. The first wrapper is for Scraper Wiki, and it comes with two new classes: ScraperWikiDataSource for plain data reading and ScraperWikiSourceNode for stream processing.

Example with ScraperWikiDataSource: copy data from a Scraper Wiki source into a local database. The table will be created automatically and replaced according to the data structure in the source:

from brewery.opendata import *
from brewery.ds import *

# Source: a Scraper Wiki scraper; target: a table in a local PostgreSQL database
src = ScraperWikiDataSource("seznam_insolvencnich_spravcu")
target = SQLDataTarget(url = "postgres://localhost/sandbox", table = "swiki_data",
                        create = True, replace = True)
src.initialize()
# Propagate field (column) metadata from the source to the target
target.fields = src.fields
target.initialize()

# Copy all records
for row in src.rows():
    target.append(row)

src.finalize()
target.finalize()
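
The same source can feed anything that follows this small protocol. Below is a minimal sketch of a hypothetical target class (not part of Brewery) that writes the rows into a CSV file; it assumes only what the example above already shows: a fields attribute set by the caller, where each field object has a name, plus the initialize(), append() and finalize() calls.

import csv

class SimpleCSVTarget(object):
    """Illustrative target: write appended rows into a CSV file."""

    def __init__(self, path):
        self.path = path
        self.fields = None               # set by the caller, as in the example above

    def initialize(self):
        self.file = open(self.path, "wb")            # binary mode for the Python 2 csv module
        self.writer = csv.writer(self.file)
        # header row from field names (assumes each field object has a name attribute)
        self.writer.writerow([field.name for field in self.fields])

    def append(self, row):
        self.writer.writerow(row)

    def finalize(self):
        self.file.close()

Plugging it in is then just a matter of replacing the SQLDataTarget line with, say, target = SimpleCSVTarget("swiki_data.csv"); the rest of the copy loop stays the same.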

Another example uses streams: a simple completeness audit report of the source data. The fail threshold is set to 10%.

The stream looks like this:

[Diagram: Scraper Wiki simple example stream]

  1. feed data from the Scraper Wiki source into a data audit node
  2. based on a value threshold, generate a new textual field that states whether the data passed or failed the completeness test (there should be no more than 10% of empty values)
  3. print a formatted report

And the source code for the stream set-up is:

# Nodes of the stream: Scraper Wiki source, data audit, value threshold test
# and a formatted report printer
nodes = {
    "source": ScraperWikiSourceNode("seznam_insolvencnich_spravcu"),
    "audit": AuditNode(distinct_threshold = None),
    "threshold": ValueThresholdNode(),
    "print": FormattedPrinterNode(),
}

connections = [
                ("source", "audit"),
                ("audit", "threshold"),
                ("threshold", "print")
              ]

# Report layout: field name, ratio of empty values, pass/fail status and
# number of distinct values
nodes["print"].header = u"field                            nulls     status   distinct\n" \
                         "------------------------------------------------------------"
nodes["print"].format = u"{field_name:<30.30} {null_record_ratio: >7.2%} {null_record_ratio_bin:>10} {distinct_count:>10}"

# Mark a field as "fail" when more than 10% of its records are empty
nodes["threshold"].thresholds = [ ["null_record_ratio", 0.10] ]
nodes["threshold"].bin_names = ("ok", "fail")

stream = Stream(nodes, connections)

try:
    stream.run()
except StreamRuntimeError, e:
    e.print_exception()

Output:

field                            nulls     status   distinct
------------------------------------------------------------
cp_S                             0.00%         ok         84
cp_TP                           31.00%       fail         66
datumNarozeni                   18.00%       fail         83
denPozastaveni                 100.00%       fail          1
denVzniku                        0.00%         ok         91
denZaniku                      100.00%       fail          1
dne                             99.00%       fail          2
dobaPlatnosti                  100.00%       fail          1
...

nazev                           82.00%       fail         19
okres_S                          5.00%         ok         38
okres_TP                        38.00%       fail         35
...

In this example you can see how successful your scraper is, or how complete the provided data are. This simple stream helps you fine-tune your scraping method.

A possible use, besides during development, would be to integrate the stream into an automated process to get feedback on how complete your daily/monthly processing was.
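
For instance, a small wrapper around the stream built above could run the audit after each scraping job and keep a dated copy of the report. This is only a sketch: the run_audit() function is hypothetical, and it captures the printed report by redirecting stdout, since the example itself only prints to standard output.

import datetime
import sys

def run_audit(stream, report_dir = "reports"):
    """Run the audit stream and save the printed report into a dated file."""
    report_path = "%s/audit-%s.txt" % (report_dir, datetime.date.today().isoformat())
    report = open(report_path, "w")

    # Redirect stdout so the formatted printer output ends up in the report file
    old_stdout = sys.stdout
    sys.stdout = report
    try:
        stream.run()
    except StreamRuntimeError, e:
        e.print_exception()
    finally:
        sys.stdout = old_stdout
        report.close()

Scheduling run_audit(stream) from cron, or calling it at the end of the daily scraping job itself, would then leave behind a dated completeness report after every run.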

In one of the following posts I will show you how to do a “join” (in the SQL sense) between datasets, for example how to enrich data from Scraper Wiki with details you have stored in a CSV file or in another scraper.