Brew data from Scraper Wiki
A new subproject has sprouted in Brewery: Opendata. The package will contain wrappers for various open data services that provide APIs for structured data. The first wrapper is for Scraper Wiki, and it comes with two new classes: ScraperWikiDataSource for plain data reading and ScraperWikiSourceNode for stream processing.
Example with ScraperWikiDataSource: copy data from a Scraper Wiki source into a local database. The table will be created automatically, and replaced if it already exists, according to the data structure in the source:
from brewery.opendata import *
from brewery.ds import *

# Source: a Scraper Wiki scraper; target: a table in a local PostgreSQL database
src = ScraperWikiDataSource("seznam_insolvencnich_spravcu")
target = SQLDataTarget(url="postgres://localhost/sandbox", table="swiki_data",
                       create=True, replace=True)

# Initialize the source and reuse its field metadata for the target table
src.initialize()
target.fields = src.fields
target.initialize()

# Copy all rows, then clean up
for row in src.rows():
    target.append(row)

src.finalize()
target.finalize()
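If you want to check what the wrapper discovered before copying, you can inspect the source field metadata right after initialization. This is just a small sketch: the field attributes used below (name, storage_type) are my assumption of the usual Brewery field metadata, not something documented specifically for the Scraper Wiki wrapper.

from brewery.opendata import *

src = ScraperWikiDataSource("seznam_insolvencnich_spravcu")
src.initialize()

# List the fields detected in the scraper's data store
# (the name and storage_type attributes are an assumption)
for field in src.fields:
    print("%s: %s" % (field.name, field.storage_type))

src.finalize()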
Another example, this time using streams: a simple completeness audit report of the source data. The fail threshold is set to 10%.
The stream looks like this:

- feed data from the Scraper Wiki scraper into a data audit node
- based on a value threshold, generate a new textual field that states whether the data passed or failed the completeness test (there should be no more than 10% empty values)
- print a formatted report
 
And the source code for the stream set-up is:
nodes = {
    "source": ScraperWikiSourceNode("seznam_insolvencnich_spravcu"),
    "audit": AuditNode(distinct_threshold=None),
    "threshold": ValueThresholdNode(),
    "print": FormattedPrinterNode(),
}

connections = [
    ("source", "audit"),
    ("audit", "threshold"),
    ("threshold", "print")
]

# Report layout: field name, share of empty values, pass/fail status, distinct count
nodes["print"].header = u"field                            nulls     status   distinct\n" \
                         "------------------------------------------------------------"
nodes["print"].format = u"{field_name:<30.30} {null_record_ratio: >7.2%} {null_record_ratio_bin:>10} {distinct_count:>10}"

# A field fails when more than 10% of its values are empty; the threshold node
# stores the bin name ("ok" or "fail") in the null_record_ratio_bin field
nodes["threshold"].thresholds = [ ["null_record_ratio", 0.10] ]
nodes["threshold"].bin_names = ("ok", "fail")

stream = Stream(nodes, connections)

try:
    stream.run()
except StreamRuntimeError as e:
    e.print_exception()
Output:
field                            nulls     status   distinct
------------------------------------------------------------
cp_S                             0.00%         ok         84
cp_TP                           31.00%       fail         66
datumNarozeni                   18.00%       fail         83
denPozastaveni                 100.00%       fail          1
denVzniku                        0.00%         ok         91
denZaniku                      100.00%       fail          1
dne                             99.00%       fail          2
dobaPlatnosti                  100.00%       fail          1
...
nazev                           82.00%       fail         19
okres_S                          5.00%         ok         38
okres_TP                        38.00%       fail         35
...
In this example you can see how successful your scraper is, or how complete the provided data are. This simple stream helps you fine-tune your scraping method.
Besides using it during development, you could integrate the stream into an automated process to get feedback on how complete your daily or monthly processing was.
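One rough sketch of such an integration, reusing the nodes and connections defined above: wrap the stream run in a small function and call it from a scheduled job (a daily cron entry, for instance). The exit-code convention below is my own assumption for the cron use case, not anything Brewery prescribes:

import sys

def run_audit(stream):
    # Run the audit stream; translate success or failure into a shell exit code
    # (the 0/1 convention is an assumption, not part of Brewery)
    try:
        stream.run()
        return 0
    except StreamRuntimeError as e:
        e.print_exception()
        return 1

# In the scheduled script, build the stream exactly as above and then:
# sys.exit(run_audit(Stream(nodes, connections)))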
In one of the following posts I will show you how to do a “join” (in the SQL sense) between datasets, for example how to enrich data from Scraper Wiki with details you have stored in a CSV file or in another scraper.