Data extraction¶
This section explains how to write the specification for extracting data from a document. We’ll scrape the following HTML content for the movie “The Shining” in our examples:
<html>
  <head>
    <meta charset="utf-8"/>
    <title>The Shining</title>
  </head>
  <body>
    <h1>The Shining (<span class="year">1980</span>)</h1>
    <ul class="genres">
      <li>Horror</li>
      <li>Drama</li>
    </ul>
    <div class="director">
      <h3>Director:</h3>
      <p><a href="/people/1">Stanley Kubrick</a></p>
    </div>
    <table class="cast">
      <tr>
        <td><a href="/people/2">Jack Nicholson</a></td>
        <td>Jack Torrance</td>
      </tr>
      <tr>
        <td><a href="/people/3">Shelley Duvall</a></td>
        <td>Wendy Torrance</td>
      </tr>
    </table>
    <div class="info">
      <h3>Runtime:</h3>
      <p>144 minutes</p>
    </div>
    <div class="info">
      <h3>Language:</h3>
      <p>English</p>
    </div>
    <div class="review">
      <em>Fantastic</em> movie.
      Definitely recommended.
    </div>
  </body>
</html>
Instead of the scrape_document function that reads the content and the specification from files, we'll use the scrape function that works directly on the content and the specification map:
>>> from piculet import scrape
Assuming the HTML document above is saved as shining.html, let's get its content:
>>> with open("shining.html") as f:
...     document = f.read()
The scrape function assumes that the document is in XML format, so if any conversion is needed, it has to be done before calling this function. [1] After building the DOM tree, the function will apply the extraction rules to the root element of the tree, and return a mapping where each item is generated by one of the rules.
Note

Piculet uses the ElementTree module for building and querying XML trees. However, it will make use of the lxml package if it's installed. The scrape function also takes an optional lxml_html parameter which, when set, makes it use the HTML builder from the lxml package, thereby building the tree without first converting the HTML into XML.
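For instance, using a specification of the kind described below, and assuming lxml is installed and that lxml_html is passed as a boolean keyword argument (our reading of the description above, not a confirmed signature), raw HTML could be scraped without a prior conversion step:

>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec, lxml_html=True)  # assumption: lxml_html is a boolean keyword
{'title': 'The Shining'}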
The specification mapping contains two keys: the pre key is for specifying the preprocessing operations (these will be covered in the next section), and the items key is for specifying the rules that describe how to extract the data:
spec = {"pre": [...], "items": [...]}
The items list contains item mappings, where each item has a key and a value description. The key specifies the key for the item in the output mapping, and the value specifies how to extract the data to set as the value for that item. Typically, a value specifier consists of a path query and a reducing function: the query is applied to the root and produces a list of strings; the reducing function then converts this list into a single string. [2]
For example, to get the title of the movie from the example document, we can write:
>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}
The //title/text() path generates the list ['The Shining'], and the first reducing function selects the first element of that list.
Note

By default, the XPath queries are limited by what ElementTree supports (plus the text() and @attr clauses which are added by Piculet). However, if the lxml package is installed, a much wider range of XPath constructs can be used.
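For example, with lxml installed, a query using the standard XPath contains() function (which plain ElementTree doesn't support) should also work. This is a sketch, assuming lxml's XPath engine is picked up automatically as described in the note above:

>>> spec = {
...     "items": [
...         {
...             "key": "runtime",
...             "value": {
...                 "path": '//div[contains(@class, "info")]/p/text()',
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'runtime': '144 minutes'}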
Multiple items can be collected in a single invocation:
>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "reduce": "first"
...             }
...         },
...         {
...             "key": "year",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining', 'year': '1980'}
If a path doesn’t match any element in the tree, the item will be excluded from the output. Note that in the following example, the “foo” key doesn’t get included:
>>> spec = {
...     "items": [
...         {
...             "key": "title",
...             "value": {
...                 "path": "//title/text()",
...                 "reduce": "first"
...             }
...         },
...         {
...             "key": "foo",
...             "value": {
...                 "path": "//foo/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}
Reducing¶
Piculet contains a few predefined reducing functions. Besides the first reducer used in the examples above, a very common reducer is concat, which concatenates the selected strings:
>>> spec = {
...     "items": [
...         {
...             "key": "full_title",
...             "value": {
...                 "path": "//h1//text()",
...                 "reduce": "concat"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'full_title': 'The Shining (1980)'}
concat is the default reducer, i.e. if no reducer is given, the strings will be concatenated:
>>> spec = {
...     "items": [
...         {
...             "key": "full_title",
...             "value": {
...                 "path": "//h1//text()"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'full_title': 'The Shining (1980)'}
If you want to get rid of extra whitespace, you can use the clean reducer. After concatenating the strings, it removes leading and trailing whitespace and replaces runs of whitespace with a single space:
>>> spec = {
...     "items": [
...         {
...             "key": "review",
...             "value": {
...                 "path": '//div[@class="review"]//text()',
...                 "reduce": "clean"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'review': 'Fantastic movie. Definitely recommended.'}
In this example, the concat reducer would have produced the value '\n      Fantastic movie.\n      Definitely recommended.\n    '.
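You can check this directly; note that the exact whitespace in the output depends on the indentation of shining.html as shown above:

>>> spec = {
...     "items": [
...         {
...             "key": "review",
...             "value": {
...                 "path": '//div[@class="review"]//text()',
...                 "reduce": "concat"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'review': '\n      Fantastic movie.\n      Definitely recommended.\n    '}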
As explained above, if a path query doesn't match any element, the item gets automatically excluded. That means Piculet doesn't try to apply the reducing function when the result of the path query is an empty list; reducing functions can therefore safely assume that the path result is a non-empty list.
If you want to use a custom reducer, you have to register it first. The name to use in the specifier (the first parameter of register) has to be a valid Python identifier:
>>> from piculet import reducers
>>> reducers.register("second", lambda x: x[1])
>>> spec = {
...     "items": [
...         {
...             "key": "year",
...             "value": {
...                 "path": "//h1//text()",
...                 "reduce": "second"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'year': '1980'}
Transforming¶
After the reduction operation, you can apply a transformation to the resulting string. A transformation function must take a string as parameter and can return a value of any type. Piculet contains several predefined transformers: int, float, bool, len, lower, upper, capitalize. For example, to get the year of the movie as an integer:
>>> spec = {
...     "items": [
...         {
...             "key": "year",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "reduce": "first",
...                 "transform": "int"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'year': 1980}
If you want to use a custom transformer, you have to register it first:
>>> from piculet import transformers
>>> transformers.register("year25", lambda x: int(x) + 25)
>>> spec = {
...     "items": [
...         {
...             "key": "25th_year",
...             "value": {
...                 "path": '//span[@class="year"]/text()',
...                 "reduce": "first",
...                 "transform": "year25"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'25th_year': 2005}
Multi-valued items¶
Data with multiple values can be created by using a foreach key in the value specifier. This is a path expression for selecting elements from the tree. [3] The path and reducing function will be applied to each selected element, and the obtained values will be the members of the resulting list. For example, to get the genres of the movie, we can write:
>>> spec = {
...     "items": [
...         {
...             "key": "genres",
...             "value": {
...                 "foreach": '//ul[@class="genres"]/li',
...                 "path": "./text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'genres': ['Horror', 'Drama']}
If the foreach key doesn't match any element, the item will be excluded from the result:
>>> spec = {
...     "items": [
...         {
...             "key": "foos",
...             "value": {
...                 "foreach": '//ul[@class="foos"]/li',
...                 "path": "./text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{}
If a transformation is specified, it will be applied to every element in the resulting list:
>>> spec = {
...     "items": [
...         {
...             "key": "genres",
...             "value": {
...                 "foreach": '//ul[@class="genres"]/li',
...                 "path": "./text()",
...                 "reduce": "first",
...                 "transform": "lower"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'genres': ['horror', 'drama']}
Subrules¶
Nested structures can be created by writing subrules as value specifiers. If the value specifier is a mapping that contains an items key, it will be interpreted as a subrule, and the generated mapping will be the value for the key:
>>> spec = {
...     "items": [
...         {
...             "key": "director",
...             "value": {
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": '//div[@class="director"]//a/text()',
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "link",
...                         "value": {
...                             "path": '//div[@class="director"]//a/@href',
...                             "reduce": "first"
...                         }
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}
Subrules can be combined with lists:
>>> spec = {
...     "items": [
...         {
...             "key": "cast",
...             "value": {
...                 "foreach": '//table[@class="cast"]/tr',
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": "./td[1]/a/text()",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "link",
...                         "value": {
...                             "path": "./td[1]/a/@href",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "character",
...                         "value": {
...                             "path": "./td[2]/text()",
...                             "reduce": "first"
...                         }
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast': [{'character': 'Jack Torrance',
           'link': '/people/2',
           'name': 'Jack Nicholson'},
          {'character': 'Wendy Torrance',
           'link': '/people/3',
           'name': 'Shelley Duvall'}]}
Items generated by subrules can also be transformed. The transformation function is always applied as the last step of a value definition; note, however, that transformers for subitems take mappings (as opposed to strings) as their parameter:
>>> transformers.register("stars", lambda x: "%(name)s as %(character)s" % x)
>>> spec = {
...     "items": [
...         {
...             "key": "cast",
...             "value": {
...                 "foreach": '//table[@class="cast"]/tr',
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": "./td[1]/a/text()",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "character",
...                         "value": {
...                             "path": "./td[2]/text()",
...                             "reduce": "first"
...                         }
...                     }
...                 ],
...                 "transform": "stars"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'cast': ['Jack Nicholson as Jack Torrance',
          'Shelley Duvall as Wendy Torrance']}
Generating keys from content¶
You can generate items where the key also comes from the content. For example, consider how you would get the runtime and the language of the movie. Instead of writing a separate item for each h3 element under an "info" class div, we can write a single item that selects these divs and uses the h3 text as the key. The elements are selected using a foreach specification in the item, which causes a new item to be generated for each selected element. To get the key value, we can use paths, reducers (and also transformers) that will be applied to the selected element:
>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": {
...                 "path": "./h3/text()",
...                 "reduce": "first"
...             },
...             "value": {
...                 "path": "./p/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'Language:': 'English', 'Runtime:': '144 minutes'}
The normalize reducer concatenates the strings, converts the result to lowercase, replaces spaces with underscores, and strips other non-alphanumeric characters:
>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": {
...                 "path": "./h3/text()",
...                 "reduce": "normalize"
...             },
...             "value": {
...                 "path": "./p/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'language': 'English', 'runtime': '144 minutes'}
You can also give a string instead of a path and reducer for the key. In this case, the elements will still be traversed, but since every traversed element produces the same key, only the last one will set the final value for the item. This is OK if you are sure that only one element matches the foreach path; a sketch of the overwriting behavior is shown below.
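For instance, the following sketch uses the literal key "info" for both matched divs; assuming the overwriting behavior described above, the value from the last matched div (the language) wins:

>>> spec = {
...     "items": [
...         {
...             "foreach": '//div[@class="info"]',
...             "key": "info",
...             "value": {
...                 "path": "./p/text()",
...                 "reduce": "first"
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'info': 'English'}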
Sections¶
The specification also provides the ability to define sections within the document. An element can be selected as the root of a section such that the XPath queries in that section will be relative to that root. This can be used to make XPath expressions shorter and also constrain the search in the tree. For example, the “director” example above can also be written using sections:
>>> spec = {
...     "section": '//div[@class="director"]//a',
...     "items": [
...         {
...             "key": "director",
...             "value": {
...                 "items": [
...                     {
...                         "key": "name",
...                         "value": {
...                             "path": "./text()",
...                             "reduce": "first"
...                         }
...                     },
...                     {
...                         "key": "link",
...                         "value": {
...                             "path": "./@href",
...                             "reduce": "first"
...                         }
...                     }
...                 ]
...             }
...         }
...     ]
... }
>>> scrape(document, spec)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}
[1] Note that the example document is already in XML format.
[2] This means that the query has to end with either text() or some attribute value as in @attr. And the reducing function should be implemented so that it takes a list of strings and returns a string.
[3] This implies that the foreach query should not end in text() or @attr.