Piculet

Copyright (C) 2014-2019 H. Turgut Uyar <uyar@tekir.org>

Piculet is a module for extracting data from XML or HTML documents using XPath queries. It consists of a single source file with no dependencies other than the standard library, which makes it very easy to integrate into applications. It also provides a command line interface.

PyPI: https://pypi.org/project/piculet/
Repository: https://github.com/uyar/piculet
Documentation: https://piculet.readthedocs.io/

Piculet has been tested with Python 2.7, Python 3.4+, and compatible versions of PyPy. You can install the latest version using pip:

pip install piculet

Overview

Scraping a document consists of three stages:

  1. Building a DOM tree out of the document. This is a straightforward operation for an XML document. For an HTML document, Piculet will first try to convert it into XHTML and then build the tree from that.
  2. Preprocessing the tree. This is an optional stage. In some cases it might be helpful to do some changes on the tree to simplify the extraction process.
  3. Extracting data out of the tree.

The preprocessing and extraction stages are expressed as part of a scraping specification. The specification is a mapping which can be stored in a file format that can represent a mapping, such as JSON or YAML. Details about the specification are given in later chapters.

Command Line Interface

Installing Piculet creates a script named piculet which can be used to invoke the command line interface:

$ piculet -h
usage: piculet [-h] [--debug] command ...

The scrape command extracts data out of a document as described by a specification file:

$ piculet scrape -h
usage: piculet scrape [-h] -s SPEC [--html] document

The location of the document can be given as a file path or a URL. For example, say you want to extract some data from the file shining.html. An example specification is given in movie.json. Download both of these files and run the command:

$ piculet scrape -s movie.json shining.html

This should print the following output:

{
  "cast": [
    {
      "character": "Jack Torrance",
      "link": "/people/2",
      "name": "Jack Nicholson"
    },
    {
      "character": "Wendy Torrance",
      "link": "/people/3",
      "name": "Shelley Duvall"
    }
  ],
  "director": {
    "link": "/people/1",
    "name": "Stanley Kubrick"
  },
  "genres": [
    "Horror",
    "Drama"
  ],
  "language": "English",
  "review": "Fantastic movie. Definitely recommended.",
  "runtime": "144 minutes",
  "title": "The Shining",
  "year": 1980
}

For HTML documents, the --html option has to be used. If the document address starts with http:// or https://, the content will be taken from the given URL. For example, to extract some data from the Wikipedia page for David Bowie, download the wikipedia.json file and run the command:

$ piculet scrape -s wikipedia.json --html "https://en.wikipedia.org/wiki/David_Bowie"

This should print the following output:

{
  "birthplace": "Brixton, London, England",
  "born": "1947-01-08",
  "name": "David Bowie",
  "occupation": [
    "Singer",
    "songwriter",
    "actor"
  ]
}

In the same command, change the name part of the URL to Merlene_Ottey and you will get similar data for Merlene Ottey. Note that since the markup used in Wikipedia pages for persons varies, the kinds of data you get with this specification will also vary.

Piculet can also be used as a simplistic HTML to XHTML converter by invoking it with the h2x command. This command takes a file name as input and prints the converted content, as in piculet h2x foo.html. If the file name is given as - it will read the content from the standard input, so it can be used as part of a pipe: cat foo.html | piculet h2x -

Using in programs

The scraping operation can also be invoked programmatically using the scrape_document function. Note that this function prints its output and doesn’t return anything:

from piculet import scrape_document

url = "https://en.wikipedia.org/wiki/David_Bowie"
spec = "wikipedia.json"
scrape_document(url, spec, content_format="html")

YAML support

To use YAML for specification, Piculet has to be installed with YAML support:

pip install piculet[yaml]

Note that this will install an external module for parsing YAML files, so Piculet will no longer depend solely on the standard library.

The YAML version of the configuration example above can be found in movie.yaml.

Data extraction

This section explains how to write the specification for extracting data from a document. We’ll scrape the following HTML content for the movie “The Shining” in our examples:

<html>
    <head>
        <meta charset="utf-8"/>
        <title>The Shining</title>
    </head>
    <body>
        <h1>The Shining (<span class="year">1980</span>)</h1>
        <ul class="genres">
            <li>Horror</li>
            <li>Drama</li>
        </ul>
        <div class="director">
            <h3>Director:</h3>
            <p><a href="/people/1">Stanley Kubrick</a></p>
        </div>
        <table class="cast">
            <tr>
                <td><a href="/people/2">Jack Nicholson</a></td>
                <td>Jack Torrance</td>
            </tr>
            <tr>
                <td><a href="/people/3">Shelley Duvall</a></td>
                <td>Wendy Torrance</td>
            </tr>
        </table>
        <div class="info">
            <h3>Runtime:</h3>
            <p>144 minutes</p>
        </div>
        <div class="info">
            <h3>Language:</h3>
            <p>English</p>
        </div>
        <div class="review">
            <em>Fantastic</em> movie.
            Definitely recommended.
        </div>
    </body>
</html>

Instead of the scrape_document function that reads the content and the specification from files, we’ll use the scrape function that works directly on the content and the specification map:

>>> from piculet import scrape

Assuming the HTML document above is saved as shining.html, let’s get its content:

>>> with open("shining.html") as f:
...     document = f.read()

The scrape function assumes that the document is in XML format. So if any conversion is needed, it has to be done before calling this function. [1] After building the DOM tree, the function will apply the extraction rules to the root element of the tree, and return a mapping where each item is generated by one of the rules.
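
If the content is plain HTML rather than XML/XHTML, it can be converted first, for example with the html_to_xhtml function described in the lower-level functions section. A minimal sketch, assuming a variable html_content that holds raw HTML text:

from piculet import html_to_xhtml

converted = html_to_xhtml(html_content)  # html_content: hypothetical raw HTML string
# the converted text can then be passed to scrape() along with a specification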

Note

Piculet uses the ElementTree module for building and querying XML trees. However, it will make use of the lxml package if it is installed. The scrape function also takes an optional lxml_html parameter which, when set, uses the HTML builder from the lxml package, building the tree without converting the HTML into XHTML first.

The specification mapping contains two keys: the pre key is for specifying the preprocessing operations (these are covered in the preprocessing chapter), and the items key is for specifying the rules that describe how to extract the data:

spec = {"pre": [...], "items": [...]}

The items list contains item mappings, where each item has a key and a value description. The key specifies the key for the item in the output mapping and the value specifies how to extract the data to set as the value for that item. Typically, a value specifier consists of a path query and a reducing function. The query is applied to the root and a list of strings is obtained. Then, the reducing function converts this list into a single string. [2]

For example, to get the title of the movie from the example document, we can write:

>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}

The //title/text() path generates the list ['The Shining'] and the reducing function first selects the first element from that list.

Note

By default, the XPath queries are limited by what ElementTree supports (plus the text() and @attr clauses which are added by Piculet). However, if the lxml package is installed a much wider range of XPath constructs can be used.

Multiple items can be collected in a single invocation:

>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "reduce": "first"
...             }
...         },
...         {
...             "key": "year",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining', 'year': '1980'}

If a path doesn’t match any element in the tree, the item will be excluded from the output. Note that in the following example, the “foo” key doesn’t get included:

>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "reduce": "first"
...              }
...         },
...         {
...             "key": "foo",
...             "value": {
...                 "path": "//foo/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}

Reducing

Piculet contains a few predefined reducing functions. Other than the first reducer used in the examples above, a very common reducer is concat which will concatenate the selected strings:

>>> spec = {
...     "items": [
...         {
...             "key": "full_title",
...             "value": {
...                 "path": "//h1//text()",
...                 "reduce": "concat"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'full_title': 'The Shining (1980)'}

concat is the default reducer, i.e. if no reducer is given, the strings will be concatenated:

>>> spec = {
...     "items": [
...         {
...             "key": "full_title",
...             "value": {
...                 "path": "//h1//text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'full_title': 'The Shining (1980)'}

If you want to get rid of extra whitespace, you can use the clean reducer. After concatenating the strings, this will remove leading and trailing whitespace and replace multiple whitespace with a single space:

>>> spec = {
...     "items": [
...         {
...             "key": "review",
...             "value": {
...                 "path": '//div[@class="review"]//text()',
...                 "reduce": "clean"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'review': 'Fantastic movie. Definitely recommended.'}

In this example, the concat reducer would have produced the value '\n            Fantastic movie.\n            Definitely recommended.\n        '

As explained above, if a path query doesn't match any element, the item gets automatically excluded. That means Piculet doesn't apply the reducing function when the result of the path query is an empty list. Therefore, reducing functions can safely assume that the path result is a non-empty list.

If you want to use a custom reducer, you have to register it first. The name it is registered under (the first parameter) has to be a valid Python identifier:

>>> from piculet import reducers
>>> reducers.register("second", lambda x: x[1])
>>> spec = {
...     "items": [
...         {
...             "key": "year",
...             "value": {
...                 "path": "//h1//text()",
...                 "reduce": "second"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'year': '1980'}

Transforming

After the reduction operation, you can apply a transformation to the resulting string. A transformation function must take a string as its parameter and can return a value of any type. Piculet contains several predefined transformers: int, float, bool, len, lower, upper, capitalize. For example, to get the year of the movie as an integer:

>>> spec = {
...     "items": [
...         {
...             "key": "year",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "reduce": "first",
...                 "transform": "int"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'year': 1980}

If you want to use a custom transformer, you have to register it first:

>>> from piculet import transformers
>>> transformers.register("year25", lambda x: int(x) + 25)
>>> spec = {
...     "items": [
...         {
...             "key": "25th_year",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "reduce": "first",
...                 "transform": "year25"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'25th_year': 2005}

Multi-valued items

Data with multiple values can be created by using a foreach key in the value specifier. This is a path expression to select elements from the tree. [3] The path and reducing function will be applied to each selected element and the obtained values will be the members of the resulting list. For example, to get the genres of the movie, we can write:

>>> spec = {
...     "items": [
...         {
...             "key": "genres",
...             "value": {
...                 "foreach": '//ul[@class="genres"]/li',
...                 "path": "./text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'genres': ['Horror', 'Drama']}

If the foreach key doesn't match any element, the item will be excluded from the result:

>>> spec = {
...     "items": [
...         {
...             "key": "foos",
...             "value": {
...                 "foreach": '//ul[@class="foos"]/li',
...                 "path": "./text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{}

If a transformation is specified, it will be applied to every element in the resulting list:

>>> spec = {
...     "items": [
...         {
...             "key": "genres",
...             "value": {
...                 "foreach": '//ul[@class="genres"]/li',
...                 "path": "./text()",
...                 "reduce": "first",
...                 "transform": "lower"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'genres': ['horror', 'drama']}

Subrules

Nested structures can be created by writing subrules as value specifiers. If the value specifier is a mapping that contains an items key, then this will be interpreted as a subrule and the generated mapping will be the value for the key.

>>> spec = {
...     "items": [
...         {
...             "key": "director",
...             "value": {
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": '//div[@class="director"]//a/text()',
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "link",
...                         "value": {
...                             "path": '//div[@class="director"]//a/@href',
...                             "reduce": "first"
...                         }
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}

Subrules can be combined with lists:

>>> spec = {
...     "items": [
...         {
...             "key": "cast",
...             "value": {
...                 "foreach": '//table[@class="cast"]/tr',
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": "./td[1]/a/text()",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "link",
...                         "value": {
...                             "path": "./td[1]/a/@href",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "character",
...                         "value": {
...                             "path": "./td[2]/text()",
...                             "reduce": "first"
...                         }
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast': [{'character': 'Jack Torrance',
   'link': '/people/2',
   'name': 'Jack Nicholson'},
  {'character': 'Wendy Torrance',
   'link': '/people/3',
   'name': 'Shelley Duvall'}]}

Items generated by subrules can also be transformed. The transformation function is always applied as the last step in a "value" definition, but note that transformers for subitems take mappings (as opposed to strings) as their parameter.

>>> transformers.register("stars", lambda x: "%(name)s as %(character)s" % x)
>>> spec = {
...     "items": [
...         {
...             "key": "cast",
...             "value": {
...                 "foreach": '//table[@class="cast"]/tr',
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": "./td[1]/a/text()",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "character",
...                         "value": {
...                             "path": "./td[2]/text()",
...                             "reduce": "first"
...                         }
...                     }
...                 ],
...                 "transform": "stars"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast': ['Jack Nicholson as Jack Torrance',
  'Shelley Duvall as Wendy Torrance']}

Generating keys from content

You can generate items where the key value also comes from the content. For example, consider how you would get the runtime and the language of the movie. Instead of writing a separate item for each h3 element under a div with the "info" class, we can write a single item that selects these divs and uses the h3 text as the key. The elements are selected using a foreach specification in the item, which causes a new item to be generated for each selected element. To obtain the key value, we can use paths, reducers, and also transformers that will be applied to the selected element:

>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": {
...                 "path": "./h3/text()",
...                 "reduce": "first"
...             },
...             "value": {
...                 "path": "./p/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'Language:': 'English', 'Runtime:': '144 minutes'}

The normalize reducer concatenates the strings, converts the result to lowercase, replaces spaces with underscores, and strips other non-alphanumeric characters:

>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": {
...                 "path": "./h3/text()",
...                 "reduce": "normalize"
...             },
...             "value": {
...                 "path": "./p/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'language': 'English', 'runtime': '144 minutes'}

You can also give a plain string for the key instead of a path and reducer. In this case, the elements will still be traversed, but only the last one will set the final value for the item. This is only appropriate if you are sure that a single element matches the foreach path.
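
For example, the following sketch uses the literal key "info" with the two "info" divs in the example document; based on the behavior described above, the value from the last matching element is the one that remains:

>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": "info",
...             "value": {
...                 "path": "./p/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'info': 'English'}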

Sections

The specification also provides the ability to define sections within the document. An element can be selected as the root of a section such that the XPath queries in that section will be relative to that root. This can be used to make XPath expressions shorter and also constrain the search in the tree. For example, the “director” example above can also be written using sections:

>>> spec = {
...     "section": '//div[@class="director"]//a',
...     "items": [
...         {
...             "key": "director",
...             "value": {
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": "./text()",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "link",
...                         "value": {
...                             "path": "./@href",
...                             "reduce": "first"
...                         }
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}
[1] Note that the example document is already in XML format.
[2] This means that the query has to end with either text() or some attribute value as in @attr. And the reducing function should be implemented so that it takes a list of strings and returns a string.
[3] This implies that the foreach query should not end in text() or @attr.

Preprocessing

In addition to extraction rules, specifications can contain preprocessing operations, which allow the tree to be modified before data extraction starts. Such operations may be needed to simplify data extraction or to avoid postprocessing the collected data.

The syntax for writing preprocessing operations is as follows:

rules = {
    "pre": [
        {
            "op": "...",
            ...
        },
        {
            "op": "...",
            ...
        }
    ],
    "items": [ ... ]
}

Every preprocessing operation has a name, given as the value of the "op" key. The other keys in the mapping are specific to the operation. The operations are applied in the order they appear in the list.

The predefined preprocessing operations are explained below.

Removing elements

This operation removes from the tree all elements (and their subtrees) that are selected by a given XPath query:

{"op": "remove", "path": "..."}

Setting element attributes

This operation selects all elements by a given XPath query and sets an attribute for these elements to a given value:

{"op": "set_attr", "path": "...", "name": "...", "value": "..."}

The attribute “name” can be a literal string or an extractor as described in the data extraction chapter. Similarly, the attribute “value” can be given as a literal string or an extractor.

Setting element text

This operation selects all elements by a given XPath query and sets their texts to a given value:

{"op": "set_text", "path": "...", "text": "..."}

The “text” can be a literal string or an extractor.

Lower-level functions

Piculet also provides a lower-level API where you can run the stages separately. For example, if the same document will be scraped multiple times with different rules, calling the scrape function repeatedly will cause the document to be parsed into a DOM tree repeatedly. Instead, you can create the DOM tree once and run extraction rules against this tree multiple times.

Also, this API expresses the specification using classes, so development tools can give better assistance when writing the rules, for example by showing error indicators and suggesting autocompletions.

Building the tree

The DOM tree can be created from the document using the build_tree function:

>>> from piculet import build_tree
>>> root = build_tree(document)

If the document needs to be converted from HTML to XHTML, you can use the html_to_xhtml function:

>>> from piculet import html_to_xhtml
>>> converted = html_to_xhtml(document)
>>> root = build_tree(converted)

If lxml is available, you can use the lxml_html parameter for building the tree without converting an HTML document into XHTML:

>>> root = build_tree(document, lxml_html=True)

Note

Note that if you use the lxml.html builder, the tree might be built differently than with Piculet's own conversion method, and the path queries used in preprocessing and extraction might need adjustments.

Preprocessing

The tree can be modified using the preprocess function:

>>> from piculet import preprocess
>>> ops = [{"op": "remove", "path": '//div[@class="ad"]'}]
>>> preprocess(root, ops)

Data extraction

The class-based API for data extraction has a one-to-one correspondence with the specification mapping. A Rule object corresponds to a key-value pair in the items list. Its value is produced by an extractor. In the simple case, an extractor is a Path object, which is a combination of a path, a reducer, and a transformer.

>>> from piculet import Path, Rule, reducers, transformers
>>> extractor = Path('//span[@class="year"]/text()',
...                  reduce=reducers.first,
...                  transform=transformers.int)
>>> rule = Rule(key="year", extractor=extractor)
>>> rule.extract(root)
{'year': 1980}

An extractor can have a foreach attribute if it will be multi-valued:

>>> extractor = Path(foreach='//ul[@class="genres"]/li',
...                  path="./text()",
...                  reduce=reducers.first,
...                  transform=transformers.lower)
>>> rule = Rule(key="genres", extractor=extractor)
>>> rule.extract(root)
{'genres': ['horror', 'drama']}

The key attribute of a rule can be an extractor, in which case the key value will be extracted from the content. A rule can also have a foreach attribute for generating multiple items from one rule. These features work as described in the data extraction section.
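
For example, the runtime/language example from the data extraction chapter could be written with the class-based API roughly as below. This is a sketch that assumes Rule accepts a foreach keyword and an extractor as its key, as described above; it should produce the same result as the mapping-based version:

>>> rule = Rule(foreach='//div[@class="info"]',
...             key=Path("./h3/text()", reduce=reducers.normalize),
...             extractor=Path("./p/text()", reduce=reducers.first))
>>> rule.extract(root)
{'language': 'English', 'runtime': '144 minutes'}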

A Rules object contains a collection of rules and corresponds to the "items" part of the specification mapping. It acts both as the top-level extractor that gets applied to the root of the tree and as the extractor for any rule with subrules.

>>> from piculet import Rules
>>> rules = [Rule(key="title",
...               extractor=Path("//title/text()")),
...          Rule(key="year",
...               extractor=Path('//span[@class="year"]/text()',
...               transform=transformers.int))]
>>> Rules(rules).extract(root)
{'title': 'The Shining', 'year': 1980}

A more complete example with transformations is given below. Note again that this specification mirrors the corresponding mapping example in the data extraction chapter.

>>> rules = [
...     Rule(key="cast",
...          extractor=Rules(
...              foreach='//table[@class="cast"]/tr',
...              rules=[
...                  Rule(key="name",
...                       extractor=Path("./td[1]/a/text()")),
...                  Rule(key="character",
...                       extractor=Path("./td[2]/text()"))
...              ],
...              transform=lambda x: "%(name)s as %(character)s" % x
...          ))
... ]
>>> Rules(rules).extract(root)
{'cast': ['Jack Nicholson as Jack Torrance',
  'Shelley Duvall as Wendy Torrance']}

A rules object can have a section attribute as described in the data extraction chapter:

>>> rules = [
...     Rule(key="director",
...          extractor=Rules(
...              section='//div[@class="director"]//a',
...              rules=[
...                  Rule(key="name",
...                       extractor=Path("./text()")),
...                  Rule(key="link",
...                       extractor=Path("./@href"))
...              ]))
... ]
>>> Rules(rules).extract(root)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}

History

1.0.1 (2019-02-07)

  • Accept both .yaml and .yml as valid YAML file extensions.
  • Documentation fixes.

1.0 (2018-05-25)

  • Bumped version to 1.0.

1.0b7 (2018-03-21)

  • Dropped support for Python 3.3.
  • Fixes for handling Unicode data in HTML for Python 2.
  • Added registry for preprocessors.

1.0b6 (2018-01-17)

  • Support for writing specifications in YAML.

1.0b5 (2018-01-16)

  • Added a class-based API for writing specifications.
  • Added predefined transformation functions.
  • Removed callables from specification maps. Use the new API instead.
  • Added support for registering new reducers and transformers.
  • Added support for defining sections in document.
  • Refactored XPath evaluation method in order to parse path expressions once.
  • Preprocessing will be done only once when the tree is built.
  • Concatenation is now the default reducing operation.

1.0b4 (2018-01-02)

  • Added “–version” option to command line arguments.
  • Added option to force the use of lxml’s HTML builder.
  • Fixed the error where non-truthy values would be excluded from the result.
  • Added support for transforming node text during preprocess.
  • Added separate preprocessing function to API.
  • Renamed the “join” reducer as “concat”.
  • Renamed the “foreach” keyword for keys as “section”.
  • Removed some low level debug messages to substantially increase speed.

1.0b3 (2017-07-25)

  • Removed the caching feature.

1.0b2 (2017-06-16)

  • Added helper function for getting cache hash keys of URLs.

1.0b1 (2017-04-26)

  • Added optional value transformations.
  • Added support for custom reducer callables.
  • Added command-line option for scraping documents from local files.

1.0a2 (2017-04-04)

  • Added support for Python 2.7.
  • Fixed lxml support.

1.0a1 (2016-08-24)

  • First release on PyPI.
