Piculet¶
Copyright (C) 2014-2019 H. Turgut Uyar <uyar@tekir.org>
Piculet is a module for extracting data from XML or HTML documents using XPath queries. It consists of a single source file with no dependencies other than the standard library, which makes it very easy to integrate into applications. It also provides a command line interface.
PyPI: https://pypi.org/project/piculet/
Repository: https://github.com/uyar/piculet
Documentation: https://piculet.readthedocs.io/
Piculet has been tested with Python 2.7, Python 3.4+, and compatible versions of PyPy. You can install the latest version using pip:
pip install piculet
Overview¶
Scraping a document consists of three stages:
- Building a DOM tree out of the document. This is a straightforward operation for an XML document. For an HTML document, Piculet will first try to convert it into XHTML and then build the tree from that.
- Preprocessing the tree. This is an optional stage. In some cases it might be helpful to make some changes to the tree to simplify the extraction process.
- Extracting data out of the tree.
The preprocessing and extraction stages are expressed as part of a scraping specification. The specification is a mapping which can be stored in a file format that can represent a mapping, such as JSON or YAML. Details about the specification are given in later chapters.
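As a rough illustration of these three stages, the following sketch uses only the standard library's ElementTree module rather than Piculet itself. The sample document and the decision to remove a div with class "ad" are assumptions made for the example; this is not how Piculet is implemented internally.

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample document (made up for this sketch).
DOCUMENT = """
<html>
  <body>
    <h1>The Shining (<span class="year">1980</span>)</h1>
    <div class="ad">Buy now</div>
  </body>
</html>
"""

# Stage 1: build a tree out of the document.
root = ET.fromstring(DOCUMENT)

# Stage 2: preprocess the tree, here by removing advertisement blocks.
body = root.find("body")
for ad in body.findall('div[@class="ad"]'):
    body.remove(ad)

# Stage 3: extract data out of the tree.
year_text = root.find('.//span[@class="year"]').text
result = {"year": int(year_text)}
print(result)
```

In Piculet, stages two and three are described declaratively in the specification instead of being written by hand like this.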
Command Line Interface¶
Installing Piculet creates a script named piculet which can be used to invoke the command line interface:
$ piculet -h
usage: piculet [-h] [--debug] command ...
The scrape command extracts data out of a document as described by a specification file:
$ piculet scrape -h
usage: piculet scrape [-h] -s SPEC [--html] document
The location of the document can be given as a file path or a URL. For example, say you want to extract some data from the file shining.html. An example specification is given in movie.json. Download both of these files and run the command:
$ piculet scrape -s movie.json shining.html
This should print the following output:
{
"cast": [
{
"character": "Jack Torrance",
"link": "/people/2",
"name": "Jack Nicholson"
},
{
"character": "Wendy Torrance",
"link": "/people/3",
"name": "Shelley Duvall"
}
],
"director": {
"link": "/people/1",
"name": "Stanley Kubrick"
},
"genres": [
"Horror",
"Drama"
],
"language": "English",
"review": "Fantastic movie. Definitely recommended.",
"runtime": "144 minutes",
"title": "The Shining",
"year": 1980
}
For HTML documents, the --html option has to be used. If the document address starts with http:// or https://, the content will be taken from the given URL. For example, to extract some data from the Wikipedia page for David Bowie, download the wikipedia.json file and run the command:
$ piculet scrape -s wikipedia.json --html "https://en.wikipedia.org/wiki/David_Bowie"
This should print the following output:
{
"birthplace": "Brixton, London, England",
"born": "1947-01-08",
"name": "David Bowie",
"occupation": [
"Singer",
"songwriter",
"actor"
]
}
In the same command, change the name part of the URL to Merlene_Ottey and you will get similar data for Merlene Ottey. Note that since the markup used in Wikipedia pages for persons varies, the kinds of data you get with this specification will also vary.
Piculet can be used as a simplistic HTML to XHTML converter by invoking it with the h2x command. This command takes the file name as input and prints the converted content, as in piculet h2x foo.html. If the input file name is given as "-", it will read the content from the standard input and can therefore be used as part of a pipe:
cat foo.html | piculet h2x -
Using in programs¶
The scraping operation can also be invoked programmatically using the scrape_document function. Note that this function prints its output and doesn’t return anything:
from piculet import scrape_document
url = "https://en.wikipedia.org/wiki/David_Bowie"
spec = "wikipedia.json"
scrape_document(url, spec, content_format="html")
YAML support¶
To use YAML for specification, Piculet has to be installed with YAML support:
pip install piculet[yaml]
Note that this will install an external module for parsing YAML files, so Piculet will no longer depend only on the standard library.
The YAML version of the specification example above can be found in movie.yaml.
Data extraction¶
This section explains how to write the specification for extracting data from a document. We’ll scrape the following HTML content for the movie “The Shining” in our examples:
<html>
<head>
<meta charset="utf-8"/>
<title>The Shining</title>
</head>
<body>
<h1>The Shining (<span class="year">1980</span>)</h1>
<ul class="genres">
<li>Horror</li>
<li>Drama</li>
</ul>
<div class="director">
<h3>Director:</h3>
<p><a href="/people/1">Stanley Kubrick</a></p>
</div>
<table class="cast">
<tr>
<td><a href="/people/2">Jack Nicholson</a></td>
<td>Jack Torrance</td>
</tr>
<tr>
<td><a href="/people/3">Shelley Duvall</a></td>
<td>Wendy Torrance</td>
</tr>
</table>
<div class="info">
<h3>Runtime:</h3>
<p>144 minutes</p>
</div>
<div class="info">
<h3>Language:</h3>
<p>English</p>
</div>
<div class="review">
<em>Fantastic</em> movie.
Definitely recommended.
</div>
</body>
</html>
Instead of the scrape_document function that reads the content and the specification from files, we’ll use the scrape function that works directly on the content and the specification map:
>>> from piculet import scrape
Assuming the HTML document above is saved as shining.html, let’s get its content:
>>> with open("shining.html") as f:
... document = f.read()
The scrape function assumes that the document is in XML format. So if any conversion is needed, it has to be done before calling this function. [1] After building the DOM tree, the function will apply the extraction rules to the root element of the tree, and return a mapping where each item is generated by one of the rules.
Note
Piculet uses the ElementTree module for building and querying XML trees. However, it will make use of the lxml package if it’s installed. The scrape function takes an optional lxml_html parameter which will use the HTML builder from the lxml package, thereby building the tree without converting HTML into XML first.
The specification mapping contains two keys: the pre key is for specifying the preprocessing operations (these will be covered in the next section), and the items key is for specifying the rules that describe how to extract the data:
spec = {"pre": [...], "items": [...]}
The items list contains item mappings, where each item has a key and a value description. The key specifies the key for the item in the output mapping and the value specifies how to extract the data to set as the value for that item. Typically, a value specifier consists of a path query and a reducing function. The query is applied to the root and a list of strings is obtained. Then, the reducing function converts this list into a single string. [2]
For example, to get the title of the movie from the example document, we can write:
>>> spec = {
... "items": [
... {
... "key": "title",
... "value": {
... "path": "//title/text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}
The //title/text() path generates the list ['The Shining'] and the reducing function first selects the first element from that list.
Note
By default, the XPath queries are limited by what ElementTree supports (plus the text() and @attr clauses which are added by Piculet). However, if the lxml package is installed, a much wider range of XPath constructs can be used.
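ElementTree on its own does not accept text() or @attr as the final step of a path, which is why Piculet adds them. The sketch below shows one way such clauses could be layered on top of ElementTree. The query_strings helper is hypothetical and is not Piculet's code; it handles only paths that end in /text() or /@attr and does not cover forms such as //text().

```python
import xml.etree.ElementTree as ET


def query_strings(element, path):
    """Hypothetical helper: evaluate a path ending in /text() or /@attr."""
    if path.startswith("/"):
        # ElementTree only accepts relative paths on an element.
        path = "." + path
    head, _, tail = path.rpartition("/")
    if tail == "text()":
        return [el.text for el in element.findall(head) if el.text]
    if tail.startswith("@"):
        name = tail[1:]
        return [el.get(name) for el in element.findall(head)
                if el.get(name) is not None]
    raise ValueError("path must end with text() or @attr")


sample = ET.fromstring(
    "<html><head><title>The Shining</title></head>"
    '<body><p><a href="/people/1">Stanley Kubrick</a></p></body></html>'
)
titles = query_strings(sample, "//title/text()")
links = query_strings(sample, "//a/@href")
print(titles, links)
```

Splitting on the last slash is a simplification: a predicate that itself contains a slash would need a real path parser.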
Multiple items can be collected in a single invocation:
>>> spec = {
... "items": [
... {
... "key": "title",
... "value": {
... "path": "//title/text()",
... "reduce": "first"
... }
... },
... {
... "key": "year",
... "value": {
... "path": '//span[@class="year"]/text()',
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'title': 'The Shining', 'year': '1980'}
If a path doesn’t match any element in the tree, the item will be excluded from the output. Note that in the following example, the “foo” key doesn’t get included:
>>> spec = {
... "items": [
... {
... "key": "title",
... "value": {
... "path": "//title/text()",
... "reduce": "first"
... }
... },
... {
... "key": "foo",
... "value": {
... "path": "//foo/text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}
Reducing¶
Piculet contains a few predefined reducing functions. Other than the first reducer used in the examples above, a very common reducer is concat which will concatenate the selected strings:
>>> spec = {
... "items": [
... {
... "key": "full_title",
... "value": {
... "path": "//h1//text()",
... "reduce": "concat"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'full_title': 'The Shining (1980)'}
concat is the default reducer, i.e. if no reducer is given, the strings will be concatenated:
>>> spec = {
... "items": [
... {
... "key": "full_title",
... "value": {
... "path": "//h1//text()"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'full_title': 'The Shining (1980)'}
If you want to get rid of extra whitespace, you can use the clean reducer. After concatenating the strings, this will remove leading and trailing whitespace and replace each run of whitespace with a single space:
>>> spec = {
... "items": [
... {
... "key": "review",
... "value": {
... "path": '//div[@class="review"]//text()',
... "reduce": "clean"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'review': 'Fantastic movie. Definitely recommended.'}
In this example, the concat reducer would have produced the value:
'\n Fantastic movie.\n Definitely recommended.\n '
As explained above, if a path query doesn’t match any element, the item gets automatically excluded. That is, Piculet doesn’t try to apply the reducing function to the result of the path query if it’s an empty list. Therefore, reducing functions can safely assume that the path result is a non-empty list.
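To make the behavior of these reducers concrete, here is a minimal sketch of first, concat and clean written as plain functions over a list of strings, following the descriptions above. These are illustrative re-implementations, not Piculet's own code, and apply_reducer is a hypothetical helper that only demonstrates the "skip empty results" rule.

```python
import re


def first(strings):
    """Select the first string."""
    return strings[0]


def concat(strings):
    """Join all strings without a separator."""
    return "".join(strings)


def clean(strings):
    """Concatenate, collapse whitespace runs, and strip both ends."""
    return re.sub(r"\s+", " ", concat(strings)).strip()


def apply_reducer(strings, reducer):
    """Return (included, value); an empty query result skips the reducer."""
    if not strings:
        return False, None
    return True, reducer(strings)


review = ["\n ", "Fantastic", " movie.\n Definitely recommended.\n "]
print(apply_reducer(review, clean))
print(apply_reducer([], first))
```

Because the empty case is handled before the reducer is called, first can index position 0 without any guard of its own.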
If you want to use a custom reducer, you have to register it first. The name for the specifier (the first parameter) has to be a valid Python identifier.
>>> from piculet import reducers
>>> reducers.register("second", lambda x: x[1])
>>> spec = {
... "items": [
... {
... "key": "year",
... "value": {
... "path": "//h1//text()",
... "reduce": "second"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'year': '1980'}
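Reducers are only guaranteed a non-empty list, so a reducer like the lambda above would raise an IndexError if the path matched just one string. A more defensive variant might look like the following sketch. The nth factory is a hypothetical name, and falling back to an empty string is an assumption; how Piculet treats such a value is not covered here.

```python
def nth(index, default=""):
    """Build a reducer that picks strings[index], or default if the list is too short."""
    def reducer(strings):
        return strings[index] if len(strings) > index else default
    return reducer


second = nth(1)
print(second(["The Shining (", "1980", ")"]))
print(repr(second(["only one string"])))
```

A function built this way could then be registered under a name in the same manner as the lambda in the example above.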
Transforming¶
After the reduction operation, you can apply a transformation to the resulting string. A transformation function must take a string as parameter and can return any value of any type. Piculet contains several predefined transformers: int, float, bool, len, lower, upper, capitalize. For example, to get the year of the movie as an integer:
>>> spec = {
... "items": [
... {
... "key": "year",
... "value": {
... "path": '//span[@class="year"]/text()',
... "reduce": "first",
... "transform": "int"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'year': 1980}
If you want to use a custom transformer, you have to register it first:
>>> from piculet import transformers
>>> transformers.register("year25", lambda x: int(x) + 25)
>>> spec = {
... "items": [
... {
... "key": "25th_year",
... "value": {
... "path": '//span[@class="year"]/text()',
... "reduce": "first",
... "transform": "year25"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'25th_year': 2005}
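Since a transformer is just a function from a string to any value, it can be written and tested on its own before being registered. As a sketch, the hypothetical minutes function below turns a reduced value such as the runtime from the example document into an integer; the name and the decision to raise on missing digits are assumptions made for illustration.

```python
import re


def minutes(text):
    """Extract the leading number of minutes from a string like '144 minutes'."""
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError("no number found in %r" % text)
    return int(match.group())


print(minutes("144 minutes"))
```

Keeping the function separate from the registration call makes it easy to unit test without building a document tree.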
Multi-valued items¶
Data with multiple values can be created by using a foreach key in the value specifier. This is a path expression to select elements from the tree. [3] The path and reducing function will be applied to each selected element and the obtained values will be the members of the resulting list. For example, to get the genres of the movie, we can write:
>>> spec = {
... "items": [
... {
... "key": "genres",
... "value": {
... "foreach": '//ul[@class="genres"]/li',
... "path": "./text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'genres': ['Horror', 'Drama']}
If the foreach key doesn’t match any element, the item will be excluded from the result:
>>> spec = {
... "items": [
... {
... "key": "foos",
... "value": {
... "foreach": '//ul[@class="foos"]/li',
... "path": "./text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{}
If a transformation is specified, it will be applied to every element in the resulting list:
>>> spec = {
... "items": [
... {
... "key": "genres",
... "value": {
... "foreach": '//ul[@class="genres"]/li',
... "path": "./text()",
... "reduce": "first",
... "transform": "lower"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'genres': ['horror', 'drama']}
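The order of operations for a multi-valued item (select elements, then query, reduce and transform each one) can be sketched with plain ElementTree. The extract_each helper is hypothetical, the element's own text stands in for the "./text()" path, and skipping elements without text is an assumption rather than documented Piculet behavior.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<body><ul class="genres"><li>Horror</li><li>Drama</li></ul></body>'
)


def extract_each(root, foreach, reduce, transform=None):
    """Hypothetical helper mirroring the foreach / path / reduce / transform order."""
    values = []
    for element in root.findall(foreach):
        # The element's own text stands in for the "./text()" query.
        strings = [element.text] if element.text else []
        if not strings:
            continue  # assumption: elements without text contribute nothing
        value = reduce(strings)
        values.append(transform(value) if transform else value)
    return values


genres = extract_each(root, './/ul[@class="genres"]/li', lambda s: s[0], str.lower)
foos = extract_each(root, './/ul[@class="foos"]/li', lambda s: s[0])
print(genres, foos)
```

An empty list from such a helper corresponds to the case where the item is left out of the result.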
Subrules¶
Nested structures can be created by writing subrules as value specifiers. If the value specifier is a mapping that contains an items key, then this will be interpreted as a subrule and the generated mapping will be the value for the key.
>>> spec = {
... "items": [
... {
... "key": "director",
... "value": {
... "items": [
... {
... "key": "name",
... "value": {
... "path": '//div[@class="director"]//a/text()',
... "reduce": "first"
... }
... },
... {
... "key": "link",
... "value": {
... "path": '//div[@class="director"]//a/@href',
... "reduce": "first"
... }
... }
... ]
... }
... }
... ]
... }
>>> scrape(document, spec)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}
Subrules can be combined with lists:
>>> spec = {
... "items": [
... {
... "key": "cast",
... "value": {
... "foreach": '//table[@class="cast"]/tr',
... "items": [
... {
... "key": "name",
... "value": {
... "path": "./td[1]/a/text()",
... "reduce": "first"
... }
... },
... {
... "key": "link",
... "value": {
... "path": "./td[1]/a/@href",
... "reduce": "first"
... }
... },
... {
... "key": "character",
... "value": {
... "path": "./td[2]/text()",
... "reduce": "first"
... }
... }
... ]
... }
... }
... ]
... }
>>> scrape(document, spec)
{'cast': [{'character': 'Jack Torrance',
'link': '/people/2',
'name': 'Jack Nicholson'},
{'character': 'Wendy Torrance',
'link': '/people/3',
'name': 'Shelley Duvall'}]}
Items generated by subrules can also be transformed. The transformation function is always applied as the last step in a “value” definition. Note, however, that transformers for subrules take a mapping (as opposed to a string) as their parameter.
>>> transformers.register("stars", lambda x: "%(name)s as %(character)s" % x)
>>> spec = {
... "items": [
... {
... "key": "cast",
... "value": {
... "foreach": '//table[@class="cast"]/tr',
... "items": [
... {
... "key": "name",
... "value": {
... "path": "./td[1]/a/text()",
... "reduce": "first"
... }
... },
... {
... "key": "character",
... "value": {
... "path": "./td[2]/text()",
... "reduce": "first"
... }
... }
... ],
... "transform": "stars"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'cast': ['Jack Nicholson as Jack Torrance',
'Shelley Duvall as Wendy Torrance']}
Generating keys from content¶
You can generate items where the key also comes from the content. For example, consider how you would get the runtime and the language of the movie. Instead of writing a separate item for each h3 element under an “info” class div, we can write only one item that will select these divs and use the h3 text as the key. These elements can be selected using a foreach specification in the item. This will cause a new item to be generated for each selected element. To get the key value, we can use paths and reducers (and also transformers) that will be applied to the selected element:
>>> spec = {
... "items": [
... {
... "foreach": '//div[@class="info"]',
... "key": {
... "path": "./h3/text()",
... "reduce": "first"
... },
... "value": {
... "path": "./p/text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'Language:': 'English', 'Runtime:': '144 minutes'}
The normalize reducer concatenates the strings, converts the result to lowercase, replaces spaces with underscores and strips other non-alphanumeric characters:
>>> spec = {
... "items": [
... {
... "foreach": '//div[@class="info"]',
... "key": {
... "path": "./h3/text()",
... "reduce": "normalize"
... },
... "value": {
... "path": "./p/text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'language': 'English', 'runtime': '144 minutes'}
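The normalization described above can be approximated in a few lines. This is a sketch based only on that description, not Piculet's implementation; in particular, treating only ASCII letters, digits and underscores as keepable characters is an assumption.

```python
import re


def normalize(strings):
    """Concatenate, lowercase, turn spaces into underscores, drop other symbols."""
    text = "".join(strings).lower().replace(" ", "_")
    # Assumption: anything outside a-z, 0-9 and underscore is stripped.
    return re.sub(r"[^a-z0-9_]", "", text)


print(normalize(["Runtime:"]))
print(normalize(["Release ", "Date:"]))
```

Keys produced this way are convenient to use as identifiers in the resulting mapping.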
You could also give a string instead of a path and reducer for the key. In this case, the elements would still be traversed, but only the last one would set the final value for the item. This could be OK if you are sure that there is only one element that matches the foreach path of the key.
Sections¶
The specification also provides the ability to define sections within the document. An element can be selected as the root of a section such that the XPath queries in that section will be relative to that root. This can be used to make XPath expressions shorter and also constrain the search in the tree. For example, the “director” example above can also be written using sections:
>>> spec = {
... "section": '//div[@class="director"]//a',
... "items": [
... {
... "key": "director",
... "value": {
... "items": [
... {
... "key": "name",
... "value": {
... "path": "./text()",
... "reduce": "first"
... }
... },
... {
... "key": "link",
... "value": {
... "path": "./@href",
... "reduce": "first"
... }
... }
... ]
... }
... }
... ]
... }
>>> scrape(document, spec)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}
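The idea of a section root can be illustrated with ElementTree alone: select one element first, then run short relative queries against it. In this sketch the div itself is chosen as the section root, which differs slightly from the specification above, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<body><div class="director"><h3>Director:</h3>'
    '<p><a href="/people/1">Stanley Kubrick</a></p></div>'
    '<div class="info"><h3>Runtime:</h3><p>144 minutes</p></div></body>'
)

# Select the section root once.
section = root.find('.//div[@class="director"]')

# Queries below are relative to the section, so they cannot match
# elements in other parts of the document.
anchor = section.find(".//a")
director = {"name": anchor.text, "link": anchor.get("href")}
headings = [h3.text for h3 in section.findall("h3")]
print(director, headings)
```

Restricting the search this way keeps the "Runtime:" heading of the other div out of the results.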
[1] Note that the example document is already in XML format.
[2] This means that the query has to end with either text() or some attribute value as in @attr. And the reducing function should be implemented so that it takes a list of strings and returns a string.
[3] This implies that the foreach query should not end in text() or @attr.
Preprocessing¶
Other than extraction rules, specifications can also contain preprocessing operations which allow modifications on the tree before starting data extraction. Such operations can be needed to make data extraction simpler or to remove the need for some postprocessing operations on the collected data.
The syntax for writing preprocessing operations is as follows:
rules = {
"pre": [
{
"op": "...",
...
},
{
"op": "...",
...
}
],
"items": [ ... ]
}
Every preprocessing operation item has a name which is given as the value of the “op” key. The other items in the mapping are specific to the operation. The operations are applied in the order in which they are written in the operations list.
The predefined preprocessing operations are explained below.
Removing elements¶
This operation removes from the tree all the elements (and their subtrees) that are selected by a given XPath query:
{"op": "remove", "path": "..."}
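For readers curious about what such a removal involves, here is a sketch using plain ElementTree. It is not Piculet's implementation. ElementTree elements have no parent pointers, so the hypothetical remove_elements helper first builds a child-to-parent map.

```python
import xml.etree.ElementTree as ET


def remove_elements(root, path):
    """Remove every element matched by path, together with its subtree."""
    # Map each child to its parent, since ElementTree does not store parents.
    parents = {child: parent for parent in root.iter() for child in parent}
    for element in root.findall(path):
        parent = parents.get(element)
        if parent is not None:
            # Note: this also drops any tail text that follows the element.
            parent.remove(element)


tree = ET.fromstring(
    '<body><div class="ad"><p>Buy now</p></div><p>Keep me</p></body>'
)
remove_elements(tree, './/div[@class="ad"]')
print(ET.tostring(tree, encoding="unicode"))
```

The root element itself has no parent in the map and is therefore never removed by this helper.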
Setting element attributes¶
This operation selects all elements by a given XPath query and sets an attribute for these elements to a given value:
{"op": "set_attr", "path": "...", "name": "...", "value": "..."}
The attribute “name” can be a literal string or an extractor as described in the data extraction chapter. Similarly, the attribute “value” can be given as a literal string or an extractor.
Setting element text¶
This operation selects all elements by a given XPath query and sets their texts to a given value:
{"op": "set_text", "path": "...", "text": "..."}
The “text” can be a literal string or an extractor.
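The effect of these two operations with literal values can be sketched with ElementTree as follows. The set_attr and set_text functions here are stand-ins written for illustration, not Piculet's own functions, and the extractor form of "name", "value" and "text" is not covered.

```python
import xml.etree.ElementTree as ET


def set_attr(root, path, name, value):
    """Set the attribute name to value on every element matched by path."""
    for element in root.findall(path):
        element.set(name, value)


def set_text(root, path, text):
    """Replace the leading text of every element matched by path."""
    for element in root.findall(path):
        # Only the text before the first child changes; children are kept.
        element.text = text


tree = ET.fromstring(
    '<body><a href="/people/1">Stanley Kubrick</a>'
    '<span class="year">1980</span></body>'
)
set_attr(tree, ".//a", "rel", "nofollow")
set_text(tree, './/span[@class="year"]', "n/a")
print(ET.tostring(tree, encoding="unicode"))
```

Changes like these happen before extraction, so later path queries see the modified attributes and text.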
Lower-level functions¶
Piculet also provides a lower-level API where you can run the stages separately. For example, if the same document will be scraped multiple times with different rules, calling the scrape function repeatedly will cause the document to be parsed into a DOM tree repeatedly. Instead, you can create the DOM tree once and run extraction rules against this tree multiple times.
Also, this API uses classes to express the specification, so development tools can offer more help when writing rules, for example by showing error indicators and suggesting completions.
Building the tree¶
The DOM tree can be created from the document using the build_tree function:
>>> from piculet import build_tree
>>> root = build_tree(document)
If the document needs to be converted from HTML to XML, you can use the html_to_xhtml function:
>>> from piculet import html_to_xhtml
>>> converted = html_to_xhtml(document)
>>> root = build_tree(converted)
If lxml is available, you can use the lxml_html parameter for building the tree without converting an HTML document into XHTML:
>>> root = build_tree(document, lxml_html=True)
Note
If you use the lxml.html builder, the tree might be built differently than with Piculet’s own conversion method, so the path queries for preprocessing and extraction might need changes.
Preprocessing¶
The tree can be modified using the preprocess function:
>>> from piculet import preprocess
>>> ops = [{"op": "remove", "path": '//div[@class="ad"]'}]
>>> preprocess(root, ops)
Data extraction¶
The class-based API for data extraction has a one-to-one correspondence with the specification mapping. A Rule object corresponds to a key-value pair in the items list. Its value is produced by an extractor. In the simple case, an extractor is a Path object which is a combination of a path, a reducer, and a transformer.
>>> from piculet import Path, Rule, reducers, transformers
>>> extractor = Path('//span[@class="year"]/text()',
... reduce=reducers.first,
... transform=transformers.int)
>>> rule = Rule(key="year", extractor=extractor)
>>> rule.extract(root)
{'year': 1980}
An extractor can have a foreach attribute if it will be multi-valued:
>>> extractor = Path(foreach='//ul[@class="genres"]/li',
... path="./text()",
... reduce=reducers.first,
... transform=transformers.lower)
>>> rule = Rule(key="genres", extractor=extractor)
>>> rule.extract(root)
{'genres': ['horror', 'drama']}
The key attribute of a rule can be an extractor, in which case it can be used to extract the key value from content. A rule can also have a foreach attribute for generating multiple items in one rule. These features work as described in the data extraction section.
A Rules object contains a collection of rule objects and it corresponds to the “items” part in the specification mapping. It acts both as the top level extractor that gets applied to the root of the tree, and also as an extractor for any rule with subrules.
>>> from piculet import Rules
>>> rules = [Rule(key="title",
... extractor=Path("//title/text()")),
... Rule(key="year",
... extractor=Path('//span[@class="year"]/text()',
... transform=transformers.int))]
>>> Rules(rules).extract(root)
{'title': 'The Shining', 'year': 1980}
A more complete example with transformations is below. Note again that the specification is exactly the same as the one given in the corresponding mapping example in the data extraction chapter.
>>> rules = [
... Rule(key="cast",
... extractor=Rules(
... foreach='//table[@class="cast"]/tr',
... rules=[
... Rule(key="name",
... extractor=Path("./td[1]/a/text()")),
... Rule(key="character",
... extractor=Path("./td[2]/text()"))
... ],
... transform=lambda x: "%(name)s as %(character)s" % x
... ))
... ]
>>> Rules(rules).extract(root)
{'cast': ['Jack Nicholson as Jack Torrance',
'Shelley Duvall as Wendy Torrance']}
A Rules object can have a section attribute as described in the data extraction chapter:
>>> rules = [
... Rule(key="director",
... extractor=Rules(
... section='//div[@class="director"]//a',
... rules=[
... Rule(key="name",
... extractor=Path("./text()")),
... Rule(key="link",
... extractor=Path("./@href"))
... ]))
... ]
>>> Rules(rules).extract(root)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}
History¶
1.0.1 (2019-02-07)¶
- Accept both .yaml and .yml as valid YAML file extensions.
- Documentation fixes.
1.0 (2018-05-25)¶
- Bumped version to 1.0.
1.0b7 (2018-03-21)¶
- Dropped support for Python 3.3.
- Fixes for handling Unicode data in HTML for Python 2.
- Added registry for preprocessors.
1.0b6 (2018-01-17)¶
- Support for writing specifications in YAML.
1.0b5 (2018-01-16)¶
- Added a class-based API for writing specifications.
- Added predefined transformation functions.
- Removed callables from specification maps. Use the new API instead.
- Added support for registering new reducers and transformers.
- Added support for defining sections in document.
- Refactored XPath evaluation method in order to parse path expressions once.
- Preprocessing will be done only once when the tree is built.
- Concatenation is now the default reducing operation.
1.0b4 (2018-01-02)¶
- Added “--version” option to command line arguments.
- Added option to force the use of lxml’s HTML builder.
- Fixed the error where non-truthy values would be excluded from the result.
- Added support for transforming node text during preprocess.
- Added separate preprocessing function to API.
- Renamed the “join” reducer as “concat”.
- Renamed the “foreach” keyword for keys as “section”.
- Removed some low level debug messages to substantially increase speed.
1.0b3 (2017-07-25)¶
- Removed the caching feature.
1.0b2 (2017-06-16)¶
- Added helper function for getting cache hash keys of URLs.
1.0b1 (2017-04-26)¶
- Added optional value transformations.
- Added support for custom reducer callables.
- Added command-line option for scraping documents from local files.
1.0a2 (2017-04-04)¶
- Added support for Python 2.7.
- Fixed lxml support.
1.0a1 (2016-08-24)¶
- First release on PyPI.