Postprocessing

Specifications can contain postprocessing operations that modify the extracted data after extraction has finished.

Postprocessors are functions that take a mapping and return a mapping. As with transformers, a mapping from names to postprocessor callables has to be passed to the load_spec function so that the names used in the specification can be resolved.
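
The name lookup described above can be pictured with a small stand-alone sketch. This is not the library's implementation: apply_postprocessors, add_year_label and the sample data are hypothetical names made up for illustration, and the sketch assumes that the named postprocessors are applied in the order they are listed.

```python
from typing import Any, Callable, Mapping, Sequence

# A postprocessor takes a mapping and returns a mapping.
Postprocessor = Callable[[Mapping[str, Any]], Mapping[str, Any]]


def apply_postprocessors(
    data: Mapping[str, Any],
    names: Sequence[str],
    registry: Mapping[str, Postprocessor],
) -> Mapping[str, Any]:
    """Hypothetical helper: resolve each name and apply it in listed order."""
    for name in names:
        try:
            func = registry[name]
        except KeyError:
            raise ValueError(f"unknown postprocessor: {name}") from None
        data = func(data)
    return data


def add_year_label(data):
    # Illustrative postprocessor: returns a new mapping with one extra key.
    return {**data, "label": f"{data['title']} ({data['year']})"}


registry = {"year_label": add_year_label}
result = apply_postprocessors(
    {"title": "The Shining", "year": 1980}, ["year_label"], registry
)
```

Keeping the specification limited to names, with the callables supplied separately, means the specification itself can stay plain data.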

For example, to add a key by combining the director name with the title:

def add_director_title(data):
    data["director_title"] = "%(director)s's '%(title)s'" % data
    return data

postprocessors = {"director_title": add_director_title}

rules = [
    {
        "key": "title",
        "extractor": {"path": "//title//text()"}
    },
    {
        "key": "director",
        "extractor": {"path": "//div[@class='director']//a/text()"}
    }
]

spec = load_spec(
    {"rules": rules, "post": ["director_title"]},
    postprocessors=postprocessors
)
data = spec.scrape(document, doctype="html")

# data:
{
    "title": "The Shining",
    "director": "Stanley Kubrick",
    "director_title": "Stanley Kubrick's 'The Shining'"
}
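
Because a postprocessor is a plain function, it can be exercised without a document or a specification. The sketch below repeats add_director_title from the example so that it runs on its own, and adds a hypothetical variant, add_director_title_safe, that skips the derived key when an input is missing. It assumes that a rule which matches nothing may leave its key out of the extracted data; that behaviour is an assumption here, not something stated above.

```python
def add_director_title(data):
    # Same function as in the example above.
    data["director_title"] = "%(director)s's '%(title)s'" % data
    return data


def add_director_title_safe(data):
    # Hypothetical variant: only add the key when both inputs are present,
    # and return a new mapping instead of modifying the input.
    if "director" in data and "title" in data:
        return {**data, "director_title": "%(director)s's '%(title)s'" % data}
    return data


full = add_director_title({"title": "The Shining", "director": "Stanley Kubrick"})
partial = add_director_title_safe({"title": "The Shining"})

# The original version raises KeyError when "director" is absent.
try:
    add_director_title({"title": "The Shining"})
    raised = False
except KeyError:
    raised = True
```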