Data extraction

This section explains how to write the specification for extracting data from a document. We’ll scrape the following HTML content for the movie “The Shining” in our examples:

<html>
    <head>
        <meta charset="utf-8"/>
        <title>The Shining</title>
    </head>
    <body>
        <h1>The Shining (<span class="year">1980</span>)</h1>
        <ul class="genres">
            <li>Horror</li>
            <li>Drama</li>
        </ul>
        <div class="director">
            <h3>Director:</h3>
            <p><a href="/people/1">Stanley Kubrick</a></p>
        </div>
        <table class="cast">
            <tr>
                <td><a href="/people/2">Jack Nicholson</a></td>
                <td>Jack Torrance</td>
            </tr>
            <tr>
                <td><a href="/people/3">Shelley Duvall</a></td>
                <td>Wendy Torrance</td>
            </tr>
        </table>
        <div class="info">
            <h3>Runtime:</h3>
            <p>144 minutes</p>
        </div>
        <div class="info">
            <h3>Language:</h3>
            <p>English</p>
        </div>
        <div class="review">
            <em>Fantastic</em> movie.
            Definitely recommended.
        </div>
    </body>
</html>

Instead of the scrape_document function, which reads the content and the specification from files, we’ll use the scrape function, which works directly on the content and the specification mapping:

>>> from piculet import scrape

Assuming the HTML document above is saved as shining.html, let’s get its content:

>>> with open("shining.html") as f:
...     document = f.read()

The scrape function assumes that the document is in XML format, so any conversion that is needed has to be done before calling it. [1] After building the DOM tree, the function applies the extraction rules to the root element of the tree and returns a mapping where each item is generated by one of the rules.

Note

Piculet uses the ElementTree module for building and querying XML trees. However, it will make use of the lxml package if it’s installed. The scrape function also takes an optional lxml_html parameter for using the HTML builder from the lxml package, which builds the tree without converting the HTML into XML first.
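
For instance, assuming the lxml package is installed and that lxml_html is a boolean flag (an assumption; check the signature of scrape in your version), a call on raw HTML content could look like:

result = scrape(html_document, spec, lxml_html=True)

where html_document and spec are placeholder names for the HTML content and a specification mapping as described below.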

The specification mapping contains two keys: the pre key is for specifying the preprocessing operations (these will be covered in the next section), and the items key is for specifying the rules that describe how to extract the data:

spec = {"pre": [...], "items": [...]}

The items list contains item mappings, where each item has a key and a value description. The key specifies the key of the item in the output mapping, and the value describes how to extract the data that will be stored under that key. Typically, a value specifier consists of a path query and a reducing function. The query is applied to the root and produces a list of strings; the reducing function then converts this list into a single string. [2]

For example, to get the title of the movie from the example document, we can write:

>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}

The //title/text() path generates the list ['The Shining'] and the reducing function first selects the first element from that list.
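
Conceptually, a reducer is just a callable that takes the list of strings produced by the path and returns a single string. The first reducer behaves roughly like the following sketch (not Piculet’s actual implementation):

>>> first = lambda strings: strings[0]
>>> first(['The Shining'])
'The Shining'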

Note

By default, the XPath queries are limited by what ElementTree supports (plus the text() and @attr clauses which are added by Piculet). However, if the lxml package is installed a much wider range of XPath constructs can be used.

Multiple items can be collected in a single invocation:

>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "reduce": "first"
...             }
...         },
...         {
...             "key": "year",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining', 'year': '1980'}

If a path doesn’t match any element in the tree, the item will be excluded from the output. Note that in the following example, the “foo” key doesn’t get included:

>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "reduce": "first"
...             }
...         },
...         {
...             "key": "foo",
...             "value": {
...                 "path": "//foo/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}

Reducing

Piculet contains a few predefined reducing functions. Besides the first reducer used in the examples above, another very common reducer is concat, which concatenates the selected strings:

>>> spec = {
...     "items": [
...         {
...             "key": "full_title",
...             "value": {
...                 "path": "//h1//text()",
...                 "reduce": "concat"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'full_title': 'The Shining (1980)'}

concat is the default reducer, i.e. if no reducer is given, the strings will be concatenated:

>>> spec = {
...     "items": [
...         {
...             "key": "full_title",
...             "value": {
...                 "path": "//h1//text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'full_title': 'The Shining (1980)'}

If you want to get rid of extra whitespace, you can use the clean reducer. After concatenating the strings, it removes leading and trailing whitespace and collapses runs of whitespace into single spaces:

>>> spec = {
...     "items": [
...         {
...             "key": "review",
...             "value": {
...                 "path": '//div[@class="review"]//text()',
...                 "reduce": "clean"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'review': 'Fantastic movie. Definitely recommended.'}

In this example, the concat reducer would have produced the value '\n            Fantastic movie.\n            Definitely recommended.\n        '.
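
In other words, clean is roughly equivalent to the following sketch (not Piculet’s actual implementation):

>>> clean = lambda strings: " ".join("".join(strings).split())
>>> clean(['  Fantastic', ' movie.\n', 'Definitely recommended.  '])
'Fantastic movie. Definitely recommended.'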

As explained above, if a path query doesn’t match any element, the item gets automatically excluded. That means Piculet doesn’t try to apply the reducing function to the result of the path query if it’s an empty list. Therefore, reducing functions can safely assume that the path result is a non-empty list.

If you want to use a custom reducer, you have to register it first. The name used to refer to it in the specification (the first parameter of register) has to be a valid Python identifier:

>>> from piculet import reducers
>>> reducers.register("second", lambda x: x[1])
>>> spec = {
...     "items": [
...         {
...             "key": "year",
...             "value": {
...                 "path": "//h1//text()",
...                 "reduce": "second"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'year': '1980'}

Transforming

After the reduction operation, you can apply a transformation to the resulting string. A transformation function must take a string as its parameter and can return a value of any type. Piculet contains several predefined transformers: int, float, bool, len, lower, upper, capitalize. For example, to get the year of the movie as an integer:

>>> spec = {
...     "items": [
...         {
...             "key": "year",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "reduce": "first",
...                 "transform": "int"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'year': 1980}

If you want to use a custom transformer, you have to register it first:

>>> from piculet import transformers
>>> transformers.register("year25", lambda x: int(x) + 25)
>>> spec = {
...     "items": [
...         {
...             "key": "25th_year",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "reduce": "first",
...                 "transform": "year25"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'25th_year': 2005}

Multi-valued items

Data with multiple values can be extracted by adding a foreach key to the value specifier. This is a path expression that selects elements from the tree. [3] The path and the reducing function are applied to each selected element, and the obtained values become the members of the resulting list. For example, to get the genres of the movie, we can write:

>>> spec = {
...     "items": [
...         {
...             "key": "genres",
...             "value": {
...                 "foreach": '//ul[@class="genres"]/li',
...                 "path": "./text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'genres': ['Horror', 'Drama']}

If the foreach path doesn’t match any element, the item will be excluded from the result:

>>> spec = {
...     "items": [
...         {
...             "key": "foos",
...             "value": {
...                 "foreach": '//ul[@class="foos"]/li',
...                 "path": "./text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{}

If a transformation is specified, it will be applied to every element in the resulting list:

>>> spec = {
...     "items": [
...         {
...             "key": "genres",
...             "value": {
...                 "foreach": '//ul[@class="genres"]/li',
...                 "path": "./text()",
...                 "reduce": "first",
...                 "transform": "lower"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'genres': ['horror', 'drama']}

Subrules

Nested structures can be created by writing subrules as value specifiers. If the value specifier is a mapping that contains an items key, it will be interpreted as a subrule, and the mapping it generates will be the value for the key:

>>> spec = {
...     "items": [
...         {
...             "key": "director",
...             "value": {
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": '//div[@class="director"]//a/text()',
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "link",
...                         "value": {
...                             "path": '//div[@class="director"]//a/@href',
...                             "reduce": "first"
...                         }
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}

Subrules can be combined with lists:

>>> spec = {
...     "items": [
...         {
...             "key": "cast",
...             "value": {
...                 "foreach": '//table[@class="cast"]/tr',
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": "./td[1]/a/text()",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "link",
...                         "value": {
...                             "path": "./td[1]/a/@href",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "character",
...                         "value": {
...                             "path": "./td[2]/text()",
...                             "reduce": "first"
...                         }
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast': [{'character': 'Jack Torrance',
   'link': '/people/2',
   'name': 'Jack Nicholson'},
  {'character': 'Wendy Torrance',
   'link': '/people/3',
   'name': 'Shelley Duvall'}]}

Items generated by subrules can also be transformed. The transformation function is always applied as the last step of a value definition, but note that transformers for subrules take mappings (as opposed to strings) as their parameter:

>>> transformers.register("stars", lambda x: "%(name)s as %(character)s" % x)
>>> spec = {
...     "items": [
...         {
...             "key": "cast",
...             "value": {
...                 "foreach": '//table[@class="cast"]/tr',
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": "./td[1]/a/text()",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "character",
...                         "value": {
...                             "path": "./td[2]/text()",
...                             "reduce": "first"
...                         }
...                     }
...                 ],
...                 "transform": "stars"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast': ['Jack Nicholson as Jack Torrance',
  'Shelley Duvall as Wendy Torrance']}

Generating keys from content

You can generate items where the key also comes from the content. For example, consider how you would get the runtime and the language of the movie. Instead of writing a separate item for each h3 element under an “info” class div, we can write a single item that selects these divs and uses the h3 text as the key. The elements are selected by putting a foreach specification on the item itself, which causes a new item to be generated for each selected element. To obtain the key, we can use paths, reducers, and also transformers, which will be applied to the selected element:

>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": {
...                 "path": "./h3/text()",
...                 "reduce": "first"
...             },
...             "value": {
...                 "path": "./p/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'Language:': 'English', 'Runtime:': '144 minutes'}

The normalize reducer concatenates the strings, converts the result to lowercase, replaces spaces with underscores, and strips other non-alphanumeric characters:

>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": {
...                 "path": "./h3/text()",
...                 "reduce": "normalize"
...             },
...             "value": {
...                 "path": "./p/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'language': 'English', 'runtime': '144 minutes'}

You could also give a plain string for the key instead of a path and a reducer. In this case, the matching elements would still be traversed, but since every element would produce the same key, only the last one would determine the final value for the item. This is only acceptable if you are sure that a single element matches the foreach path.
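
For example, using the literal key "info" with the same foreach path would, according to the behaviour described above, keep only the value coming from the last matching div:

>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": "info",
...             "value": {
...                 "path": "./p/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'info': 'English'}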

Sections

The specification also provides the ability to define sections within the document. An element can be selected as the root of a section such that the XPath queries in that section will be relative to that root. This can be used to make XPath expressions shorter and also constrain the search in the tree. For example, the “director” example above can also be written using sections:

>>> spec = {
...     "section": '//div[@class="director"]//a',
...     "items": [
...         {
...             "key": "director",
...             "value": {
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": "./text()",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "link",
...                         "value": {
...                             "path": "./@href",
...                             "reduce": "first"
...                         }
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}
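
Sections can also be combined with foreach lists. Assuming that, as in the director example, all queries inside a section are evaluated relative to the section root, the cast example could be restricted to the cast table roughly as follows (a sketch, with the link items left out for brevity):

>>> spec = {
...     "section": '//table[@class="cast"]',
...     "items": [
...         {
...             "key": "cast",
...             "value": {
...                 "foreach": "./tr",
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": "./td[1]/a/text()",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "character",
...                         "value": {
...                             "path": "./td[2]/text()",
...                             "reduce": "first"
...                         }
...                     }
...                 ]
...             }
...         }
...     ]
... }

Running scrape(document, spec) with this specification should produce the same name and character data as the earlier cast example.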

[1] Note that the example document is already in XML format.
[2] This means that the query has to end with either text() or some attribute value as in @attr. And the reducing function should be implemented so that it takes a list of strings and returns a string.
[3] This implies that the foreach query should not end in text() or @attr.