Data extraction
Note
The example uses an HTML document with XPath queries. JSON documents with JMESPath queries work the same way conceptually; only the query-language details differ.
This section explains how to write the rules for extracting the data. We’ll scrape the following HTML content for the movie “The Shining” in our examples:
<html>
<head>
<title>The Shining</title>
</head>
<body>
<h1>The Shining (<span class="year">1980</span>)</h1>
<ul class="genres">
<li>Horror</li>
<li>Drama</li>
</ul>
<div class="director">
<h3>Director:</h3>
<p><a href="/people/1">Stanley Kubrick</a></p>
</div>
<table class="cast">
<tr>
<td><a href="/people/2">Jack Nicholson</a></td>
<td>Jack Torrance</td>
</tr>
<tr>
<td><a href="/people/3">Shelley Duvall</a></td>
<td>Wendy Torrance</td>
</tr>
</table>
<div class="info">
<h3>Country</h3>
<p>United States</p>
</div>
<div class="info">
<h3>Language</h3>
<p>English</p>
</div>
</body>
</html>
Assuming the HTML document above is saved as shining.html,
let’s get its contents:
from pathlib import Path
document = Path("shining.html").read_text()
Rules
Each rule specifies the name of a piece of data and how its value will be extracted. In the simplest case, the extractor uses a path query.
For example, to get the title of the movie from the example document, we can write the following rule:
rule = {"key": "title", "extractor": {"path": "//title/text()"}}
Next, we use this rule in a specification
that we load using the load_spec function:
from piculet import load_spec
spec = load_spec({"rules": [rule]})
Now that we have the document and the specification,
we can use the scrape method
to extract data from the document:
data = spec.scrape(document, doctype="html")
# data:
{"title": "The Shining"}
The XPath query must return a list of text nodes. These are joined to produce the value. For example:
rule = {"key": "full_title", "extractor": {"path": "//h1//text()"}}
spec = load_spec({"rules": [rule]})
data = spec.scrape(document, doctype="html")
# data:
{"full_title": "The Shining (1980)"}
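To make the joining step concrete, here is a minimal sketch that uses only the standard library rather than Piculet itself. It assumes the text pieces are concatenated without a separator, which is consistent with the output shown above. `xml.etree.ElementTree` is an XML parser with limited XPath support, so `itertext()` stands in for the `//h1//text()` query.

```python
import xml.etree.ElementTree as ET

# The <h1> element from the example document, as a well-formed fragment.
h1 = ET.fromstring('<h1>The Shining (<span class="year">1980</span>)</h1>')

# Stand-in for //h1//text(): every text node under <h1>, in document order.
texts = list(h1.itertext())

# Joining the pieces produces a single string value.
full_title = "".join(texts)

print(texts)
print(full_title)
```

The query matches three separate text nodes here, which is why a path ending in `/text()` alone would miss the year inside the span.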
Multiple items can be collected in a single invocation:
rules = [
    {"key": "title", "extractor": {"path": "//title/text()"}},
    {"key": "country", "extractor": {"path": "//div[@class='info'][1]/p/text()"}}
]
spec = load_spec({"rules": rules})
data = spec.scrape(document, doctype="html")
# data:
{"title": "The Shining", "country": "United States"}
If a rule doesn’t produce a value, the item will be excluded from the output.
Note that in the following example, there’s no foo key in the result:
rules = [
    {"key": "title", "extractor": {"path": "//title/text()"}},
    {"key": "foo", "extractor": {"path": "//foo/text()"}}
]
spec = load_spec({"rules": rules})
data = spec.scrape(document, doctype="html")
# data:
{"title": "The Shining"}
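The behavior of a rule list can be approximated with a few lines of standard-library Python. This is only a conceptual sketch, not Piculet's implementation: `run_rules` is a hypothetical helper, rules are simplified to `(key, path)` pairs, and `findtext` on a trimmed copy of the document replaces the `text()` queries.

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed version of the example document.
DOCUMENT = """
<html>
  <head><title>The Shining</title></head>
  <body>
    <div class="info"><h3>Country</h3><p>United States</p></div>
  </body>
</html>
""".strip()


def run_rules(root, rules):
    """Hypothetical helper: map each key to the text found at its path."""
    data = {}
    for key, path in rules:
        value = root.findtext(path)  # None when nothing matches
        if value is not None:
            data[key] = value  # keys without a value are left out
    return data


root = ET.fromstring(DOCUMENT)
rules = [
    ("title", ".//title"),
    ("country", ".//div[@class='info']/p"),
    ("foo", ".//foo"),  # matches nothing in the document
]
data = run_rules(root, rules)
print(data)
```

Leaving out missing keys, instead of storing `None`, means the caller can tell "not found" apart from an empty value by checking for the key.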
Transforming results
Extractors can apply transformations to the values they have obtained. Each transformation has a name and an associated function. We tell the extractor to apply a function by listing its name in the extractor's transforms. To match the transformer names to their functions, a lookup map has to be provided when the specification is loaded.
For example, the following rule for the movie year would produce a string:
{"key": "year", "extractor": {"path": "//span[@class='year']/text()"}}
To convert that value to an integer,
let’s define and use an int transformer:
rule = {
    "key": "year",
    "extractor": {
        "path": "//span[@class='year']/text()",
        "transforms": ["int"]
    }
}
transformers = {"int": int}
spec = load_spec({"rules": [rule]}, transformers=transformers)
data = spec.scrape(document, doctype="html")
# data:
{"year": 1980}
Multiple transformations are applied in the order they are listed:
rule = {
    "key": "title",
    "extractor": {
        "path": "//title/text()",
        "transforms": ["remove_spaces", "titlecase"]
    }
}
transformers = {
    "titlecase": str.title,
    "remove_spaces": lambda s: s.replace(" ", "")
}
spec = load_spec({"rules": [rule]}, transformers=transformers)
data = spec.scrape(document, doctype="html")
# data:
{"title": "Theshining"}
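The order matters because each transformer receives the output of the previous one. The following sketch shows the idea in plain Python; `apply_transforms` is a hypothetical helper written for illustration and is not part of Piculet's API.

```python
from functools import reduce

# Lookup map from transformer names to functions, as in the example above.
transformers = {
    "titlecase": str.title,
    "remove_spaces": lambda s: s.replace(" ", ""),
}


def apply_transforms(value, names, lookup):
    """Hypothetical helper: pass the value through each named function in order."""
    return reduce(lambda acc, name: lookup[name](acc), names, value)


first = apply_transforms("The Shining", ["remove_spaces", "titlecase"], transformers)
second = apply_transforms("The Shining", ["titlecase", "remove_spaces"], transformers)

print(first)   # Theshining
print(second)  # TheShining
```

Removing the spaces first leaves a single word, so `str.title` lowercases the inner "S". Reversing the order keeps both capitals.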
Multivalued results
Data with multiple values can be created by using a foreach key
in the extractor.
Its value should be a path expression that selects elements from the tree.
After the elements are selected, the query in the path key
will be applied to each element,
and the obtained values will be collected in the resulting list.
For example, to get the genres of the movie, we can write:
rule = {
    "key": "genres",
    "extractor": {
        "foreach": "//ul[@class='genres']/li",
        "path": "./text()"
    }
}
spec = load_spec({"rules": [rule]})
data = spec.scrape(document, doctype="html")
# data:
{"genres": ["Horror", "Drama"]}
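The two-step process, selecting elements first and then querying each one, can be sketched with the standard library. This is an approximation under stated assumptions: `xml.etree.ElementTree` does not support `text()`, so the `.text` attribute of each element stands in for the `./text()` query, and the fragment is a cut-down copy of the example document.

```python
import xml.etree.ElementTree as ET

# The genres list from the example document, wrapped in a <body> element.
body = ET.fromstring(
    '<body><ul class="genres"><li>Horror</li><li>Drama</li></ul></body>'
)

# Step 1 (foreach): select the elements to iterate over.
elements = body.findall(".//ul[@class='genres']/li")

# Step 2 (path): run the per-element query and collect the values.
genres = [element.text for element in elements]

print(genres)
```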
If the foreach key doesn’t match any element, the item will be excluded
from the result:
rules = [
    {
        "key": "title",
        "extractor": {"path": "//title/text()"}
    },
    {
        "key": "foos",
        "extractor": {
            "foreach": "//ul[@class='foos']/li",
            "path": "./text()"
        }
    }
]
spec = load_spec({"rules": rules})
data = spec.scrape(document, doctype="html")
# data:
{"title": "The Shining"}
If a transformation is specified, it will be applied to each element in the resulting list:
rule = {
    "key": "genres",
    "extractor": {
        "foreach": "//ul[@class='genres']/li",
        "path": "./text()",
        "transforms": ["lower"]
    }
}
transformers = {"lower": str.lower}
spec = load_spec({"rules": [rule]}, transformers=transformers)
data = spec.scrape(document, doctype="html")
# data:
{"genres": ["horror", "drama"]}
Subrules
Nested structures can be created by writing subrules as extractors.
If the extractor contains a rules key instead of a path,
then this will be interpreted as a subrule,
and the generated mapping will be the value for the key.
rule = {
    "key": "director",
    "extractor": {
        "rules": [
            {
                "key": "name",
                "extractor": {"path": "//div[@class='director']//a/text()"}
            },
            {
                "key": "link",
                "extractor": {"path": "//div[@class='director']//a/@href"}
            }
        ]
    }
}
spec = load_spec({"rules": [rule]})
data = spec.scrape(document, doctype="html")
# data:
{"director": {"name": "Stanley Kubrick", "link": "/people/1"}}
Extractors can select a different node as the root before applying the query.
This can improve readability and performance.
The root key has to be a path that selects the root for the operation.
If it returns multiple nodes, the first one will be selected.
The above rule is equivalent to:
rule = {
    "key": "director",
    "extractor": {
        "root": "//div[@class='director']//a",
        "rules": [
            {
                "key": "name",
                "extractor": {"path": "./text()"}
            },
            {
                "key": "link",
                "extractor": {"path": "./@href"}
            }
        ]
    }
}
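The effect of moving the root can be illustrated outside Piculet with a standard-library sketch. It assumes a simplified path (`/p/a` instead of `//a`) and uses `.text` and `.get("href")` in place of the `text()` and `@href` queries, since `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# The director block from the example document, wrapped in a <body> element.
body = ET.fromstring(
    '<body><div class="director"><h3>Director:</h3>'
    '<p><a href="/people/1">Stanley Kubrick</a></p></div></body>'
)

# Without a root: every query starts again from the top of the document.
absolute = {
    "name": body.findtext(".//div[@class='director']/p/a"),
    "link": body.find(".//div[@class='director']/p/a").get("href"),
}

# With a root: select the <a> element once (find returns the first match),
# then run short queries relative to it.
anchor = body.find(".//div[@class='director']/p/a")
relative = {"name": anchor.text, "link": anchor.get("href")}

print(absolute)
print(relative)
```

Both mappings are identical; the rooted version repeats less of the path and walks the document only once.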
Subrules can be combined with multivalues:
rule = {
    "key": "cast",
    "extractor": {
        "foreach": "//table[@class='cast']/tr",
        "rules": [
            {"key": "name", "extractor": {"path": "./td[1]/a/text()"}},
            {"key": "character", "extractor": {"path": "./td[2]/text()"}}
        ]
    }
}
spec = load_spec({"rules": [rule]})
data = spec.scrape(document, doctype="html")
# data:
{
    "cast": [
        {"name": "Jack Nicholson", "character": "Jack Torrance"},
        {"name": "Shelley Duvall", "character": "Wendy Torrance"}
    ]
}
Moving the root takes place before selecting the elements using foreach.
The rule given above is equivalent to:
rule = {
    "key": "cast",
    "extractor": {
        "root": "//table[@class='cast']",
        "foreach": "./tr",
        "rules": [
            {"key": "name", "extractor": {"path": "./td[1]/a/text()"}},
            {"key": "character", "extractor": {"path": "./td[2]/text()"}}
        ]
    }
}
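The equivalence of the two cast rules can be checked with a small standard-library sketch. It is an approximation, not Piculet code: `extract_row` is a hypothetical helper that plays the part of the two subrules, and `findtext` replaces the `text()` queries.

```python
import xml.etree.ElementTree as ET

# The cast table from the example document, wrapped in a <body> element.
body = ET.fromstring(
    '<body><table class="cast">'
    '<tr><td><a href="/people/2">Jack Nicholson</a></td><td>Jack Torrance</td></tr>'
    '<tr><td><a href="/people/3">Shelley Duvall</a></td><td>Wendy Torrance</td></tr>'
    '</table></body>'
)


def extract_row(row):
    """Hypothetical helper: apply the two subrules to one table row."""
    return {
        "name": row.findtext("./td[1]/a"),
        "character": row.findtext("./td[2]"),
    }


# foreach applied from the top of the document.
direct = [extract_row(row) for row in body.findall(".//table[@class='cast']/tr")]

# root first, then foreach relative to the new root.
table = body.find(".//table[@class='cast']")
rooted = [extract_row(row) for row in table.findall("./tr")]

print(direct == rooted)
```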
Subitems can also be transformed. Transformation functions are always applied as the last step of an extraction, so the first transformer receives the generated mapping as its argument.
rule = {
    "key": "cast",
    "extractor": {
        "foreach": "//table[@class='cast']/tr",
        "rules": [
            {"key": "name", "extractor": {"path": "./td[1]/a/text()"}},
            {"key": "character", "extractor": {"path": "./td[2]/text()"}}
        ],
        "transforms": ["stars"]
    }
}
transformers = {"stars": lambda x: "%(name)s as %(character)s" % x}
spec = load_spec({"rules": [rule]}, transformers=transformers)
data = spec.scrape(document, doctype="html")
# data:
{
    "cast": [
        "Jack Nicholson as Jack Torrance",
        "Shelley Duvall as Wendy Torrance"
    ]
}
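Because the first transformer receives a mapping and may return a different type, any later transformer has to accept whatever the previous one produced. The sketch below illustrates this in plain Python; the `upper` transformer and the `apply_transforms` helper are hypothetical additions for illustration and do not appear in the example above.

```python
# Mappings as the subrules would generate them, one per table row.
cast = [
    {"name": "Jack Nicholson", "character": "Jack Torrance"},
    {"name": "Shelley Duvall", "character": "Wendy Torrance"},
]

transformers = {
    "stars": lambda x: "%(name)s as %(character)s" % x,  # mapping -> str
    "upper": str.upper,                                  # str -> str
}


def apply_transforms(value, names, lookup):
    """Hypothetical helper: pass the value through each named function in order."""
    for name in names:
        value = lookup[name](value)
    return value


# "stars" sees the mapping; "upper" sees the string that "stars" returned.
billed = [apply_transforms(item, ["stars", "upper"], transformers) for item in cast]
print(billed)
```

Listing `upper` before `stars` would fail in this sketch, since `str.upper` cannot be applied to a dictionary.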
Generating keys from content
You can generate items where the key also comes from the content.
For example, consider how you would get the country and the language
of the movie.
Instead of writing a separate item for each h3 element
under a div element with an info class,
we can write only one item that will select these divs
and use the h3 text as the key.
This method requires locating the elements that contain both the key
and the value (in this example, the div).
These elements will be selected using a foreach specification.
Key and value extractors will be applied to each selected element.
rule = {
    "foreach": "//div[@class='info']",
    "key": {"path": "./h3/text()"},
    "extractor": {"path": "./p/text()"}
}
spec = load_spec({"rules": [rule]})
data = spec.scrape(document, doctype="html")
# data:
{"Country": "United States", "Language": "English"}
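The same idea can be sketched with the standard library: locate each container element, then read both the key and the value from inside it. This is only an approximation of what the rule describes, with `findtext` standing in for the `text()` queries on a cut-down copy of the document.

```python
import xml.etree.ElementTree as ET

# The two info blocks from the example document, wrapped in a <body> element.
body = ET.fromstring(
    '<body>'
    '<div class="info"><h3>Country</h3><p>United States</p></div>'
    '<div class="info"><h3>Language</h3><p>English</p></div>'
    '</body>'
)

data = {}
for div in body.findall(".//div[@class='info']"):  # foreach
    key = div.findtext("./h3")                     # key extractor
    value = div.findtext("./p")                    # value extractor
    data[key] = value

print(data)
```

A single loop covers any number of info blocks, so new blocks in the page need no new rules.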
Like values, keys can also be transformed:
rule = {
    "foreach": "//div[@class='info']",
    "key": {"path": "./h3/text()", "transforms": ["lower"]},
    "extractor": {"path": "./p/text()"}
}
transformers = {"lower": str.lower}
spec = load_spec({"rules": [rule]}, transformers=transformers)
data = spec.scrape(document, doctype="html")
# data:
{"country": "United States", "language": "English"}