Preprocessing

Specifications can contain preprocessing operations, which modify the tree before data extraction starts.

Preprocessors are functions that take the root node of a tree and return the node to use as the root in extraction operations. As with transformers, the load_spec function must be given a map for looking up preprocessor callables by name.

For example, to gather all the person names from the document, we can use a preprocessor that selects the relevant a tags and adds a class attribute to them, which we can later use in path queries:

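def mark_people_links(root):
    # Tag every link that points to a person page so that the rules can select it by class.
    for anchor in root.xpath("//a[starts-with(@href, '/people/')]"):
        anchor.attrib["class"] = "person"
    # Return the (modified) root so that extraction continues on the same tree.
    return root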

preprocessors = {"mark_people": mark_people_links}

rules = [
    {
        "key": "title",
        "extractor": {"path": "//title//text()"}
    },
    {
        "key": "people",
        "extractor": {"foreach": "//a[@class='person']", "path": "./text()"}
    }
]

spec = load_spec(
    {"pre": ["mark_people"], "rules": rules},
    preprocessors=preprocessors
)
data = spec.scrape(document, doctype="html")

# data:
{
    "title": "The Shining",
    "people": ["Stanley Kubrick", "Jack Nicholson", "Shelley Duvall"]
}
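
A preprocessor does not have to return the node it was given; it can also return a different node, which then becomes the root for extraction. The following is a minimal sketch of that contract, assuming only the standard library's ElementTree rather than the library's own tree type. The names select_main and apply_preprocessors are hypothetical, and applying the names in list order is an assumption about how a "pre" list behaves, not a description of the library's internals:

```python
import xml.etree.ElementTree as ET


def select_main(root):
    """Hypothetical preprocessor: narrow the tree to the main content node."""
    main = root.find(".//div[@id='main']")
    # Fall back to the original root if the node is missing.
    return main if main is not None else root


def mark_people_links(root):
    """Same marking idea as above, written against ElementTree."""
    for anchor in root.iter("a"):
        if anchor.get("href", "").startswith("/people/"):
            anchor.set("class", "person")
    return root


# Name-to-callable map, in the shape that is passed to load_spec.
preprocessors = {"select_main": select_main, "mark_people": mark_people_links}


def apply_preprocessors(root, names, registry):
    """Illustrative stand-in (assumption): run each named preprocessor in order."""
    for name in names:
        root = registry[name](root)
    return root


document = """<html><body>
<div id="nav"><a href="/people/1">Ignored Person</a></div>
<div id="main"><a href="/people/2">Stanley Kubrick</a><a href="/films/3">The Shining</a></div>
</body></html>"""

root = apply_preprocessors(
    ET.fromstring(document), ["select_main", "mark_people"], preprocessors
)
people = [a.text for a in root.findall(".//a[@class='person']")]
```

Because select_main runs first in this sketch, the link in the navigation block is outside the new root and is never marked, so only the link inside the main block ends up in people.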