Data extraction
This section explains how to write the specification for extracting data from a document. We’ll scrape the following HTML content for the movie “The Shining” in our examples:
<html>
<head>
<meta charset="utf-8"/>
<title>The Shining</title>
</head>
<body>
<h1>The Shining (<span class="year">1980</span>)</h1>
<ul class="genres">
<li>Horror</li>
<li>Drama</li>
</ul>
<div class="director">
<h3>Director:</h3>
<p><a href="/people/1">Stanley Kubrick</a></p>
</div>
<table class="cast">
<tr>
<td><a href="/people/2">Jack Nicholson</a></td>
<td>Jack Torrance</td>
</tr>
<tr>
<td><a href="/people/3">Shelley Duvall</a></td>
<td>Wendy Torrance</td>
</tr>
</table>
<div class="info">
<h3>Runtime:</h3>
<p>144 minutes</p>
</div>
<div class="info">
<h3>Language:</h3>
<p>English</p>
</div>
<div class="review">
<em>Fantastic</em> movie.
Definitely recommended.
</div>
</body>
</html>
Assuming the HTML document above is saved as shining.html, let’s get its content:
>>> with open("shining.html") as f:
...     document = f.read()
We’ll use the scrape function to extract data from the document:
>>> from piculet import scrape
This function assumes that the document is in XML format. So, if any conversion is needed, it has to be done before calling this function. [1]
After building the DOM tree, the function will apply the extraction rules to the root element of the tree, and return a mapping where each item is generated by one of the rules.
Note
Piculet uses the ElementTree module for building and querying XML trees. However, it will make use of the lxml package if it’s installed. The scrape function takes an optional lxml_html parameter which will use the HTML builder from the lxml package, thereby building the tree without converting HTML into XML first.
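As a rough illustration of the tree-building step, the sketch below parses a shortened copy of the example document with the standard library's ElementTree module. It is not Piculet's own code; the `SNIPPET` name and the shortened markup are made up for this illustration. It also shows why input that is not well-formed XML has to be converted first: an unclosed tag makes the parser fail.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the example document (illustration only).
SNIPPET = """<html>
<head><title>The Shining</title></head>
<body><h1>The Shining (<span class="year">1980</span>)</h1></body>
</html>"""

root = ET.fromstring(SNIPPET)  # root element of the tree: <html>
title = root.find("./head/title").text
year = root.find('.//span[@class="year"]').text

# Markup with an unclosed <br> is valid HTML but not well-formed XML.
try:
    ET.fromstring("<p>line one<br>line two</p>")
    parse_failed = False
except ET.ParseError:
    parse_failed = True
```

Note that plain ElementTree paths select elements, so the text is read from the `.text` attribute here rather than through a `text()` step.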
The specification contains two keys: pre for specifying the preprocessing operations (these will be covered in the next chapter), and items for specifying the rules that describe how to extract the data:
spec = {"pre": [...], "items": [...]}
The items list contains item mappings, where each item has a key and a value description. The key specifies the key for the item in the output mapping, and the value specifies how to extract the data to set as the value for that item.
Typically, a value specifier is a path query.
This query is applied to the root and the resulting list of strings
is concatenated into a single string.
Note
This means that the query has to end with either text() or some attribute value as in @attr.
For example, to get the title of the movie from the example document, we can write:
>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}
The //title/text() path generates the list ['The Shining'], and concatenation generates the resulting string.
Note
By default, XPath queries are limited by what ElementTree supports (plus a few additions by Piculet). However, if the lxml package is installed, a much wider range of XPath constructs can be used.
Multiple items can be collected in a single invocation:
>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()"
...             }
...         },
...         {
...             "key": "year",
...             "value": {
...                 "path": '//span[@class="year"]/text()'
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining', 'year': '1980'}
If a path doesn’t match any element in the tree, the item will be excluded from the output. Note that in the following example, there’s no “foo” key in the result:
>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()"
...             }
...         },
...         {
...             "key": "foo",
...             "value": {
...                 "path": "//foo/text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}
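The exclusion behavior can be pictured with a small standard-library sketch. The `extract_text` helper and the `rules` list are hypothetical names for this illustration and do not reflect Piculet's internals; the paths also omit `text()` because plain ElementTree paths select elements only.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<html><head><title>The Shining</title></head><body/></html>"
)


def extract_text(element, path):
    """Return the concatenated text of the matches, or None if nothing matches."""
    matches = element.findall(path)
    if not matches:
        return None
    return "".join(m.text or "" for m in matches)


# Hypothetical rule list: (key, ElementTree path) pairs.
rules = [("title", ".//title"), ("foo", ".//foo")]

result = {}
for key, path in rules:
    value = extract_text(root, path)
    if value is not None:  # unmatched items are left out entirely
        result[key] = value
```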
You can specify a string to use as separator when concatenating the texts selected by the query:
>>> spec = {
...     "items": [
...         {
...             "key": "cast_names",
...             "value": {
...                 "path": '//table[@class="cast"]/tr/td[1]/a/text()',
...                 "sep": ", "
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast_names': 'Jack Nicholson, Shelley Duvall'}
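To see what the separator changes, here is a minimal sketch of the "select, then join" step using only the standard library. It assumes the cast table is parsed on its own (the `CAST` snippet is a copy made for this illustration) and is not how Piculet is implemented.

```python
import xml.etree.ElementTree as ET

# Copy of the cast table from the example document (illustration only).
CAST = """<table class="cast">
<tr><td><a href="/people/2">Jack Nicholson</a></td><td>Jack Torrance</td></tr>
<tr><td><a href="/people/3">Shelley Duvall</a></td><td>Wendy Torrance</td></tr>
</table>"""

root = ET.fromstring(CAST)

# The query yields a list of strings, one per matched element.
names = [a.text for a in root.findall("./tr/td[1]/a")]

without_sep = "".join(names)  # plain concatenation, no separator
with_sep = ", ".join(names)   # same list joined with ", "
```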
Transforming
After getting the string value, you can apply a transformation to it. The transformation function must take a string as parameter, and can return a value of any type. Piculet contains several predefined transformers.
For example, to get the year of the movie as an integer:
>>> spec = {
...     "items": [
...         {
...             "key": "year",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "transform": "int"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'year': 1980}
If you want to use a custom transformer, you have to register it first:
>>> from piculet import transformers
>>> transformers.underscore = lambda s: s.replace(" ", "_")
>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "transform": "underscore"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The_Shining'}
You can chain transformers using the | symbol:
>>> transformers.century = lambda x: x // 100 + 1
>>> spec = {
...     "items": [
...         {
...             "key": "century",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "transform": "int|century"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'century': 20}
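Chaining amounts to feeding the output of one function into the next, from left to right. The sketch below shows the idea in plain Python; the `REGISTRY` dictionary and the `apply_chain` helper are hypothetical and stand in for whatever lookup Piculet performs internally.

```python
# Hypothetical registry mapping transformer names to functions.
REGISTRY = {
    "int": int,
    "century": lambda year: year // 100 + 1,
}


def apply_chain(value, chain):
    """Apply the named transformers to the value, left to right."""
    for name in chain.split("|"):
        value = REGISTRY[name.strip()](value)
    return value


century = apply_chain("1980", "int|century")
```

The order matters: `century` expects an integer, so `int` has to come first in the chain.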
Shorthand notation
To make the specification more concise, you can write the value as a single string to combine the path and transform operations by splitting them with the | symbol:
>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": "//title/text()"
...         },
...         {
...             "key": "year",
...             "value": '//span[@class="year"]/text() | int'
...         },
...         {
...             "key": "century",
...             "value": '//span[@class="year"]/text() | int | century'
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining', 'year': 1980, 'century': 20}
Note
After this point, the examples will generally use the shorthand notation.
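One way to think about the shorthand is as a string that expands into the long form. The following sketch assumes a naive split on the | symbol; `expand_shorthand` is a hypothetical helper, not a Piculet function, and Piculet's actual parsing may differ.

```python
def expand_shorthand(value):
    """Split 'path | t1 | t2' into a long-form value mapping."""
    path, *names = [part.strip() for part in value.split("|")]
    spec = {"path": path}
    if names:
        spec["transform"] = "|".join(names)
    return spec


year_spec = expand_shorthand('//span[@class="year"]/text() | int | century')
title_spec = expand_shorthand("//title/text()")
```

A naive split like this cannot tell a transformer separator from the XPath union operator, which is also written as |, so the sketch only covers paths that do not use unions.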
Multi-valued items
Data with multiple values can be created by using a foreach key in the value specifier. This is a path expression to select elements from the tree.
Note
This implies that the foreach query should not end in text() or @attr.
The path query will be applied to each selected element, and the obtained values will be the members of the resulting list. For example, to get the genres of the movie, we can write:
>>> spec = {
...     "items": [
...         {
...             "key": "genres",
...             "value": {
...                 "foreach": '//ul[@class="genres"]/li',
...                 "path": "./text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'genres': ['Horror', 'Drama']}
If the foreach key doesn’t match any element, the item will be excluded from the result:
>>> spec = {
...     "items": [
...         {
...             "key": "foos",
...             "value": {
...                 "foreach": '//ul[@class="foos"]/li',
...                 "path": "./text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{}
If a transformation is specified, it will be applied to every element in the resulting list:
>>> spec = {
...     "items": [
...         {
...             "key": "genres",
...             "value": {
...                 "foreach": '//ul[@class="genres"]/li',
...                 "path": "./text()",
...                 "transform": "lower"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'genres': ['horror', 'drama']}
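The three steps (select elements, extract a value from each, transform each value) can be sketched with the standard library as follows. This is an illustration under the assumption that the genre list is parsed on its own; the `BODY` snippet is made up for the example and the code is not Piculet's implementation.

```python
import xml.etree.ElementTree as ET

# Copy of the genre list from the example document (illustration only).
BODY = """<body>
<ul class="genres">
<li>Horror</li>
<li>Drama</li>
</ul>
</body>"""

root = ET.fromstring(BODY)

# "foreach" step: select the elements to iterate over.
selected = root.findall('.//ul[@class="genres"]/li')

# "path" step: one string per selected element.
raw = [li.text for li in selected]

# "transform" step: applied to every member of the list, not to the list.
genres = [str.lower(text) for text in raw]
```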
Subitems
Nested structures can be created by writing subrules as value specifiers. If the value specifier is a mapping that contains an items key, then this will be interpreted as a subrule, and the generated mapping will be the value for the key.
>>> spec = {
...     "items": [
...         {
...             "key": "director",
...             "value": {
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": '//div[@class="director"]//a/text()'
...                     },
...                     {
...                         "key": "link",
...                         "value": '//div[@class="director"]//a/@href'
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director': {'name': 'Stanley Kubrick', 'link': '/people/1'}}
Subitems can be combined with multi-values:
>>> spec = {
...     "items": [
...         {
...             "key": "cast",
...             "value": {
...                 "foreach": '//table[@class="cast"]/tr',
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": "./td[1]/a/text()"
...                     },
...                     {
...                         "key": "link",
...                         "value": "./td[1]/a/@href"
...                     },
...                     {
...                         "key": "character",
...                         "value": "./td[2]/text()"
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast': [{'name': 'Jack Nicholson',
           'link': '/people/2',
           'character': 'Jack Torrance'},
          {'name': 'Shelley Duvall',
           'link': '/people/3',
           'character': 'Wendy Torrance'}]}
Subitems can also be transformed. The transformation function is always applied as the last step in a “value” definition, therefore transformers for subitems take mappings (as opposed to strings) as parameter.
>>> transformers.stars = lambda x: "%(name)s as %(character)s" % x
>>> spec = {
...     "items": [
...         {
...             "key": "cast",
...             "value": {
...                 "foreach": '//table[@class="cast"]/tr',
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": "./td[1]/a/text()"
...                     },
...                     {
...                         "key": "character",
...                         "value": "./td[2]/text()"
...                     }
...                 ],
...                 "transform": "stars"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast': ['Jack Nicholson as Jack Torrance',
          'Shelley Duvall as Wendy Torrance']}
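The point that a subitem transformer receives a mapping rather than a string can be made concrete with a standard-library sketch. The `CAST` snippet and the loop below are illustrative assumptions and not Piculet's code; only the `stars` formatting mirrors the example above.

```python
import xml.etree.ElementTree as ET

# Copy of the cast table from the example document (illustration only).
CAST = """<table class="cast">
<tr><td><a href="/people/2">Jack Nicholson</a></td><td>Jack Torrance</td></tr>
<tr><td><a href="/people/3">Shelley Duvall</a></td><td>Wendy Torrance</td></tr>
</table>"""

root = ET.fromstring(CAST)


def stars(item):
    # Receives the whole subitem mapping, not a string.
    return "%(name)s as %(character)s" % item


cast = []
for row in root.findall("./tr"):
    # Build the subitem mapping for this row first...
    item = {
        "name": row.find("./td[1]/a").text,
        "character": row.find("./td[2]").text,
    }
    # ...then apply the transformer as the last step.
    cast.append(stars(item))
```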
Generating keys from content
You can generate items where the key value also comes from the content. For example, consider how you would get the runtime and the language of the movie. Instead of writing a separate item for each h3 element under an “info” class div, we can write only one item that will select these divs and use the h3 text as the key.
These elements can be selected using foreach specifications in the items. This will cause a new item to be generated for each selected element. To get the key value, we can use paths and transformers that will be applied to the selected element:
>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": {
...                 "path": "./h3/text()"
...             },
...             "value": {
...                 "path": "./p/text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'Runtime:': '144 minutes', 'Language:': 'English'}
The normalize transformer converts the string to lowercase, replaces spaces with underscores, and strips non-alphanumeric characters:
>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": {
...                 "path": "./h3/text()",
...                 "transform": "normalize"
...             },
...             "value": {
...                 "path": "./p/text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'runtime': '144 minutes', 'language': 'English'}
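For readers who want a feel for what such a normalization does, here is a minimal sketch that follows the three steps described above. `normalize_key` is a hypothetical function written for this illustration; the predefined normalize transformer may differ in details, for example in how it treats non-ASCII letters.

```python
import re


def normalize_key(text):
    """Lowercase, turn spaces into underscores, drop other non-alphanumerics."""
    text = text.lower().replace(" ", "_")
    # Keep ASCII letters, digits and the underscores introduced above.
    return re.sub(r"[^a-z0-9_]", "", text)


runtime_key = normalize_key("Runtime:")
two_word_key = normalize_key("Box Office:")
```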
Sections
The specification also provides the ability to define sections within the document. An element can be selected as the root of a section such that the XPath queries in that section will be relative to that root. This can be used to make XPath expressions shorter, and to constrain the search in the tree. For example, the “director” example above can also be written using sections:
>>> spec = {
...     "section": '//div[@class="director"]//a',
...     "items": [
...         {
...             "key": "director",
...             "value": {
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": "./text()"
...                     },
...                     {
...                         "key": "link",
...                         "value": "./@href"
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director': {'name': 'Stanley Kubrick', 'link': '/people/1'}}
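The effect of a section is similar to selecting an element once and then querying relative to it. The sketch below shows that idea with the standard library's ElementTree; the `BODY` snippet is a copy made for the illustration and the code does not represent how Piculet resolves sections.

```python
import xml.etree.ElementTree as ET

# Copy of the director block from the example document (illustration only).
BODY = """<body>
<div class="director">
<h3>Director:</h3>
<p><a href="/people/1">Stanley Kubrick</a></p>
</div>
</body>"""

root = ET.fromstring(BODY)

# Select the section root once...
section = root.find('.//div[@class="director"]//a')

# ...then read everything relative to it, without repeating the long path.
director = {"name": section.text, "link": section.get("href")}
```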
[1] Note that the example document is already in XML format.