Lower-level functions

Piculet also provides a lower-level API where you can run the stages separately. For example, if the same document will be scraped multiple times with different rules, calling the scrape function repeatedly will cause the document to be parsed into a DOM tree repeatedly. Instead, you can create the DOM tree once, and run extraction rules against this tree multiple times.

Also, this API uses code to express the specification (instead of strings) and therefore development tools can help better in writing the rules by showing error indicators and suggesting autocompletions.

Building the tree

The DOM tree can be created from the document using the build_tree function:

>>> from piculet import build_tree
>>> root = build_tree(document)

If the document needs to be converted from HTML to XML, you can use the html_to_xhtml function:

>>> from piculet import html_to_xhtml
>>> converted = html_to_xhtml(document)
>>> root = build_tree(converted)

If lxml is available, you can use the lxml_html parameter for building the tree without converting an HTML document into XHTML:

>>> root = build_tree(document, lxml_html=True)

Note

Note that if you use the lxml.html builder, there might be differences about how the tree is built compared to the piculet conversion method, and the path queries for preprocessing and extraction might need changes.

Preprocessing

Preprocessors are functions that take an element in the DOM tree as parameter and modify the tree. The preprocessors registry contains preprocessor generators which take the parameters other than the element to apply the operation to, and return a function that expects the element:

>>> from piculet import preprocessors
>>> remove_ads = preprocessors.remove(path='//div[@class="ad"]')
>>> remove_ads(root)

Warning

The preprocessing functions assume that the root of the tree doesn’t change.

Data extraction

The API for data extraction has a one-to-one correspondance with the specification mapping.

Path extractors are applied to elements to extract the value for a single data item.

>>> from piculet import Path
>>> path = Path('//span[@class="year"]/text()', transform=int)
>>> path(root)
1980

The sep parameter can be used concatenate using a separator string:

>>> path = Path('//table[@class="cast"]/tr/td[1]/a/text()', sep=", ")
>>> path(root)
'Jack Nicholson, Shelley Duvall'

You can use the chain utility function to generate chained transformers:

>>> from piculet import chain
>>> path = Path(
...     '//span[@class="year"]/text()',
...     transform=chain(int, lambda x: x // 100 + 1),
... )
>>> path(root)
20

Every item in the result mapping is generated by a Rule in the API. Rules are applied to elements to extract data items in the result mapping, so their basic function is to associate the keys with the values.

>>> from piculet import Rule
>>> rule = Rule(
...     key="year",
...     value=Path('//span[@class="year"]/text()', transform=int),
... )
>>> rule(root)
{'year': 1980}

Items extractors are applied to elements to extract subitems for a data item. Basically, they are rule collections.

>>> from piculet import Items
>>> rules = [
...     Rule(
...         key="title",
...         value=Path('//title/text()'),
...     ),
...     Rule(
...         key="year",
...         value=Path('//span[@class="year"]/text()', transform=int),
...     ),
... ]
>>> items = Items(rules)
>>> items(root)
{'title': 'The Shining', 'year': 1980}

Items extractors act both as the top level extractor that gets applied to the root of the tree, and also as an extractor for any rule with subitems.

An extractor can have a foreach parameter if it will be multi-valued:

>>> rules = [
...     Rule(
...         key="genres",
...         value=Path(
...             foreach='//ul[@class="genres"]/li',
...             path="./text()",
...             transform=str.lower,
...         ),
...     ),
... ]
>>> items = Items(rules)
>>> items(root)
{'genres': ['horror', 'drama']}

The key parameter of a rule can be an extractor in which case it can be used to extract the key value from content. A rule can also have a foreach parameter for generating multiple items in one rule. These features will work as they are described in the data extraction section.

A more complete example with transformations is given below. Again note that the specification is exactly the same as given in the corresponding mapping example in the data extraction chapter.

>>> rules = [
...     Rule(
...         key="cast",
...         value=Items(
...             foreach='//table[@class="cast"]/tr',
...             rules=[
...                 Rule(
...                     key="name",
...                     value=Path("./td[1]/a/text()"),
...                 ),
...                 Rule(
...                     key="character",
...                     value=Path("./td[2]/text()"),
...                 ),
...              ],
...              transform=lambda x: "%(name)s as %(character)s" % x
...         ),
...     ),
... ]
>>> Items(rules)(root)
{'cast': ['Jack Nicholson as Jack Torrance',
  'Shelley Duvall as Wendy Torrance']}

A rule can have a section parameter as described in the data extraction chapter:

>>> rules = [
...     Rule(
...         key="director",
...         value=Items(
...             section='//div[@class="director"]//a',
...             rules=[
...                 Rule(
...                     key="name",
...                     value=Path("./text()"),
...                 ),
...                 Rule(
...                     key="link",
...                     value=Path("./@href"),
...                 ),
...             ],
...         ),
...     ),
... ]
>>> Items(rules)(root)
{'director': {'name': 'Stanley Kubrick', 'link': '/people/1'}}