Lower-level functions

Piculet also provides a lower-level API where you can run the stages separately. For example, if the same document will be scraped multiple times with different rules, calling the scrape function repeatedly will cause the document to be parsed into a DOM tree repeatedly. Instead, you can create the DOM tree once and run extraction rules against this tree multiple times.

Also, this API uses classes to express the specification and therefore development tools can help better in writing the rules by showing error indicators and suggesting autocompletions.

Building the tree

The DOM tree can be created from the document using the build_tree function:

>>> from piculet import build_tree
>>> root = build_tree(document)

If the document needs to be converted from HTML to XML, you can use the html_to_xhtml function:

>>> from piculet import html_to_xhtml
>>> converted = html_to_xhtml(document)
>>> root = build_tree(converted)

If lxml is available, you can use the lxml_html parameter for building the tree without converting an HTML document into XHTML:

>>> root = build_tree(document, lxml_html=True)

Note

Note that if you use the lxml.html builder, there might be differences about how the tree is built compared to the piculet conversion method and the path queries for preprocessing and extraction might need changes.

Preprocessing

The tree can be modified using the preprocess function:

>>> from piculet import preprocess
>>> ops = [{"op": "remove", "path": '//div[class="ad"]'}]
>>> preprocess(root, ops)

Data extraction

The class-based API to data extraction has a one-to-one correspondance with the specification mapping. A Rule object corresponds to a key-value pair in the items list. Its value is produced by an extractor. In the simple case, an extractor is a Path object which is a combination of a path, a reducer, and a transformer.

>>> from piculet import Path, Rule, reducers, transformers
>>> extractor = Path('//span[@class="year"]/text()',
...                  reduce=reducers.first,
...                  transform=transformers.int)
>>> rule = Rule(key="year", extractor=extractor)
>>> rule.extract(root)
{'year': 1980}

An extractor can have a foreach attribute if it will be multi-valued:

>>> extractor = Path(foreach='//ul[@class="genres"]/li',
...                  path="./text()",
...                  reduce=reducers.first,
...                  transform=transformers.lower)
>>> rule = Rule(key="genres", extractor=extractor)
>>> rule.extract(root)
{'genres': ['horror', 'drama']}

The key attribute of a rule can be an extractor in which case it can be used to extract the key value from content. A rule can also have a foreach attribute for generating multiple items in one rule. These features will work as they are described in the data extraction section.

A Rules object contains a collection of rule objects and it corresponds to the “items” part in the specification mapping. It acts both as the top level extractor that gets applied to the root of the tree, and also as an extractor for any rule with subrules.

>>> from piculet import Rules
>>> rules = [Rule(key="title",
...               extractor=Path("//title/text()")),
...          Rule(key="year",
...               extractor=Path('//span[@class="year"]/text()',
...               transform=transformers.int))]
>>> Rules(rules).extract(root)
{'title': 'The Shining', 'year': 1980}

A more complete example with transformations is below. Again note that, the specification is exactly the same as given in the corresponding mapping example in the data extraction chapter.

>>> rules = [
...     Rule(key="cast",
...          extractor=Rules(
...              foreach='//table[@class="cast"]/tr',
...              rules=[
...                  Rule(key="name",
...                       extractor=Path("./td[1]/a/text()")),
...                  Rule(key="character",
...                       extractor=Path("./td[2]/text()"))
...              ],
...              transform=lambda x: "%(name)s as %(character)s" % x
...          ))
... ]
>>> Rules(rules).extract(root)
{'cast': ['Jack Nicholson as Jack Torrance',
  'Shelley Duvall as Wendy Torrance']}

A rules object can have a section attribute as described in the data extraction chapter:

>>> rules = [
...     Rule(key="director",
...          extractor=Rules(
...              section='//div[@class="director"]//a',
...              rules=[
...                  Rule(key="name",
...                       extractor=Path("./text()")),
...                  Rule(key="link",
...                       extractor=Path("./@href"))
...              ]))
... ]
>>> Rules(rules).extract(root)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}