Data extraction

This section explains how to write the specification for extracting data from a document. We’ll scrape the following HTML content for the movie “The Shining” in our examples:

<html>
    <head>
        <meta charset="utf-8"/>
        <title>The Shining</title>
    </head>
    <body>
        <h1>The Shining (<span class="year">1980</span>)</h1>
        <ul class="genres">
            <li>Horror</li>
            <li>Drama</li>
        </ul>
        <div class="director">
            <h3>Director:</h3>
            <p><a href="/people/1">Stanley Kubrick</a></p>
        </div>
        <table class="cast">
            <tr>
                <td><a href="/people/2">Jack Nicholson</a></td>
                <td>Jack Torrance</td>
            </tr>
            <tr>
                <td><a href="/people/3">Shelley Duvall</a></td>
                <td>Wendy Torrance</td>
            </tr>
        </table>
        <div class="info">
            <h3>Runtime:</h3>
            <p>144 minutes</p>
        </div>
        <div class="info">
            <h3>Language:</h3>
            <p>English</p>
        </div>
        <div class="review">
            <em>Fantastic</em> movie.
            Definitely recommended.
        </div>
    </body>
</html>

Assuming the HTML document above is saved as shining.html, let’s get its content:

>>> with open("shining.html") as f:
...     document = f.read()

We’ll use the scrape function to extract data from the document:

>>> from piculet import scrape

This function assumes that the document is in XML format. So, if any conversion is needed, it has to be done before calling this function. [1]
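
If the document is not well-formed XML, it has to be converted before the call. As a minimal sketch, assuming the installed version of Piculet provides the html_to_xhtml helper for this conversion:

>>> from piculet import html_to_xhtml  # assumed helper for HTML -> XHTML conversion
>>> converted = html_to_xhtml(document)  # not needed here: shining.html is already XML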

After building the DOM tree, the function will apply the extraction rules to the root element of the tree, and return a mapping where each item is generated by one of the rules.

Note

Piculet uses the ElementTree module for building and querying XML trees. However, it will make use of the lxml package if it’s installed. The scrape function also takes an optional lxml_html parameter; when this is set, the tree is built using the HTML builder from the lxml package, so the HTML doesn’t have to be converted into XML first.
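
For instance, with lxml installed, the tree can be built directly from the HTML document (a sketch, assuming lxml_html is a boolean flag):

>>> spec = {"items": [{"key": "title", "value": {"path": "//title/text()"}}]}
>>> scrape(document, spec, lxml_html=True)
{'title': 'The Shining'}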

The specification contains two keys: pre for specifying the preprocessing operations (these will be covered in the next chapter), and items for specifying the rules that describe how to extract the data:

spec = {"pre": [...], "items": [...]}

The items list contains item mappings, where each item has a key and a value description. The key specifies the key for the item in the output mapping, and the value specifies how to extract the data to set as the value for that item. Typically, a value specifier is a path query. This query is applied to the root and the resulting list of strings is concatenated into a single string.

Note

This means that the query has to end with either text() or an attribute selector such as @attr.

For example, to get the title of the movie from the example document, we can write:

>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}

The //title/text() path generates the list ['The Shining'], and concatenating its members produces the resulting string.
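
Attribute values can be selected the same way. For example, to get the link for the director of the movie:

>>> spec = {
...     "items": [
...         {
...             "key": "director_link",
...             "value": {
...                 "path": '//div[@class="director"]//a/@href'
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director_link': '/people/1'}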

Note

By default, XPath queries are limited by what ElementTree supports (plus a few additions by Piculet). However, if the lxml package is installed, a much wider range of XPath constructs can be used.
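
For instance, assuming lxml is installed, a query can use XPath functions such as starts-with(), which the baseline ElementTree support lacks:

>>> spec = {
...     "items": [
...         {
...             "key": "director",
...             "value": {
...                 "path": '//a[starts-with(@href, "/people/1")]/text()'
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director': 'Stanley Kubrick'}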

Multiple items can be collected in a single invocation:

>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()"
...             }
...         },
...         {
...             "key": "year",
...             "value": {
...                 "path": '//span[@class="year"]/text()'
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining', 'year': '1980'}

If a path doesn’t match any element in the tree, the item will be excluded from the output. Note that in the following example, there’s no “foo” key in the result:

>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()"
...             }
...         },
...         {
...             "key": "foo",
...             "value": {
...                 "path": "//foo/text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}

You can specify a string to use as the separator when concatenating the strings selected by the query:

>>> spec = {
...     "items": [
...         {
...             "key": "cast_names",
...             "value": {
...                 "path": '//table[@class="cast"]/tr/td[1]/a/text()',
...                 "sep": ", "
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast_names': 'Jack Nicholson, Shelley Duvall'}

Transforming

After getting the string value, you can apply a transformation to it. The transformation function must take a string as its parameter, and can return a value of any type. Piculet contains several predefined transformers. For example, to get the year of the movie as an integer:

>>> spec = {
...     "items": [
...         {
...             "key": "year",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "transform": "int"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'year': 1980}

If you want to use a custom transformer, you have to register it first:

>>> from piculet import transformers
>>> transformers.underscore = lambda s: s.replace(" ", "_")
>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "transform": "underscore"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The_Shining'}

You can chain transformers using the | symbol:

>>> transformers.century = lambda x: x // 100 + 1
>>> spec = {
...     "items": [
...         {
...             "key": "century",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "transform": "int|century"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'century': 20}

Shorthand notation

To make the specification more concise, you can write the value as a single string that combines the path and transform operations, separated by the | symbol:

>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": "//title/text()"
...         },
...         {
...             "key": "year",
...             "value": '//span[@class="year"]/text() | int'
...         },
...         {
...             "key": "century",
...             "value": '//span[@class="year"]/text() | int | century'
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining', 'year': 1980, 'century': 20}

Note

After this point, the examples will generally use the shorthand notation.

Multi-valued items

Data with multiple values can be extracted by using a foreach key in the value specifier. This is a path expression that selects multiple elements from the tree.

Note

This implies that the foreach query should not end in text() or @attr.

The path query will be applied to each selected element, and the obtained values will be the members of the resulting list. For example, to get the genres of the movie, we can write:

>>> spec = {
...     "items": [
...         {
...             "key": "genres",
...             "value": {
...                 "foreach": '//ul[@class="genres"]/li',
...                 "path": "./text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'genres': ['Horror', 'Drama']}

If the foreach key doesn’t match any element, the item will be excluded from the result:

>>> spec = {
...     "items": [
...         {
...             "key": "foos",
...             "value": {
...                 "foreach": '//ul[@class="foos"]/li',
...                 "path": "./text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{}

If a transformation is specified, it will be applied to every element in the resulting list:

>>> spec = {
...     "items": [
...         {
...             "key": "genres",
...             "value": {
...                 "foreach": '//ul[@class="genres"]/li',
...                 "path": "./text()",
...                 "transform": "lower"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'genres': ['horror', 'drama']}

Subitems

Nested structures can be created by writing subrules as value specifiers. If the value specifier is a mapping that contains an items key, then this will be interpreted as a subrule, and the generated mapping will be the value for the key.

>>> spec = {
...     "items": [
...         {
...             "key": "director",
...             "value": {
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": '//div[@class="director"]//a/text()'
...                     },
...                     {
...                         "key": "link",
...                         "value": '//div[@class="director"]//a/@href'
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director': {'name': 'Stanley Kubrick', 'link': '/people/1'}}

Subitems can be combined with multi-valued items to generate lists of mappings:

>>> spec = {
...     "items": [
...         {
...             "key": "cast",
...             "value": {
...                 "foreach": '//table[@class="cast"]/tr',
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": "./td[1]/a/text()"
...                     },
...                     {
...                         "key": "link",
...                         "value": "./td[1]/a/@href"
...                     },
...                     {
...                         "key": "character",
...                         "value": "./td[2]/text()"
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast': [{'name': 'Jack Nicholson',
   'link': '/people/2',
   'character': 'Jack Torrance'},
  {'name': 'Shelley Duvall',
   'link': '/people/3',
   'character': 'Wendy Torrance'}]}

Subitems can also be transformed. The transformation function is always applied as the last step in a “value” definition; therefore, transformers for subitems take mappings (as opposed to strings) as their parameter.

>>> transformers.stars = lambda x: "%(name)s as %(character)s" % x
>>> spec = {
...     "items": [
...         {
...             "key": "cast",
...             "value": {
...                 "foreach": '//table[@class="cast"]/tr',
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": "./td[1]/a/text()"
...                     },
...                     {
...                         "key": "character",
...                         "value": "./td[2]/text()"
...                     }
...                 ],
...                 "transform": "stars"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast': ['Jack Nicholson as Jack Torrance',
  'Shelley Duvall as Wendy Torrance']}

Generating keys from content

You can generate items where the key also comes from the content. For example, consider how you would get the runtime and the language of the movie. Instead of writing a separate item for each h3 element under an “info” class div, we can write a single item that selects these divs and uses the h3 text as the key. The divs are selected using a foreach specification in the item, which causes a new item to be generated for each selected element. To get the key value, we can use paths and transformers that will be applied to the selected element:

>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": {
...                 "path": "./h3/text()"
...             },
...             "value": {
...                 "path": "./p/text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'Runtime:': '144 minutes', 'Language:': 'English'}

The keys in this output still contain colons and uppercase letters. The predefined normalize transformer converts the string to lowercase, replaces spaces with underscores, and strips non-alphanumeric characters:

>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": {
...                 "path": "./h3/text()",
...                 "transform": "normalize"
...             },
...             "value": {
...                 "path": "./p/text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'runtime': '144 minutes', 'language': 'English'}

Sections

The specification also provides the ability to define sections within the document. An element can be selected as the root of a section such that the XPath queries in that section will be relative to that root. This can be used to make XPath expressions shorter, and to constrain the search in the tree. For example, the “director” example above can also be written using sections:

>>> spec = {
...     "section": '//div[@class="director"]//a',
...     "items": [
...         {
...             "key": "director",
...             "value": {
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": "./text()"
...                     },
...                     {
...                         "key": "link",
...                         "value": "./@href"
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director': {'name': 'Stanley Kubrick', 'link': '/people/1'}}

[1] Note that the example document is already in XML format.