Piculet¶
Piculet is a module for extracting data from XML or HTML documents using XPath queries. It consists of a single source file with no dependencies other than the standard library, which makes it very easy to integrate into applications. It also provides a command line interface.
Piculet is used for the parsers of the IMDbPY project.
Getting started¶
Piculet works with Python 3.7 and later versions.
You can install it using pip:
pip install piculet
Installing Piculet creates a script named piculet which can be used to invoke the command line interface:
$ piculet -h
usage: piculet [-h] [--version] [--html] (-s SPEC | --h2x)
For example, say you want to extract some data from the file shining.html. An example specification is given in movie.json. Download both of these files and run the command:
$ cat shining.html | piculet -s movie.json
Getting help¶
The documentation is available at: https://tekir.org/piculet/
The source code can be obtained from: https://github.com/uyar/piculet
License¶
Copyright (C) 2014-2022 H. Turgut Uyar <uyar@tekir.org>
Piculet is released under the LGPL license, version 3 or later. Read the included LICENSE.txt file for details.
Contents¶
Overview¶
Scraping a document consists of three stages:
- Building a DOM tree out of the document. This is a straightforward operation for an XML document. For an HTML document, Piculet will first try to convert it into XHTML, and then build the tree from that.
- Preprocessing the tree. This is an optional stage. In some cases it might be helpful to do some changes on the tree to simplify the extraction process.
- Extracting data out of the tree.
The preprocessing and extraction stages are expressed as part of a scraping specification. The specification is a mapping which can be stored in a file format that can represent a mapping, such as JSON or YAML. Details about the specification are given in later chapters.
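For instance, a minimal specification file could look like the following. This is only an illustrative sketch, not the actual movie.json used in the examples:
{
  "items": [
    {"key": "title", "value": {"path": "//title/text()"}},
    {"key": "year", "value": {"path": "//span[@class=\"year\"]/text()", "transform": "int"}}
  ]
}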
Command-line interface¶
The command-line interface reads the document from the standard input. After downloading the example files shining.html and movie.json, run the command:
$ cat shining.html | piculet -s movie.json
This should print the following output:
{
"cast": [
{
"character": "Jack Torrance",
"link": "/people/2",
"name": "Jack Nicholson"
},
{
"character": "Wendy Torrance",
"link": "/people/3",
"name": "Shelley Duvall"
}
],
"director": {
"link": "/people/1",
"name": "Stanley Kubrick"
},
"genres": [
"Horror",
"Drama"
],
"language": "English",
"review": "Fantastic movie. Definitely recommended.",
"runtime": "144 minutes",
"title": "The Shining",
"year": 1980
}
For HTML documents, the --html option has to be used.
For example, to extract some data from the Wikipedia page for David Bowie,
download the wikipedia.json file and run the command:
$ curl -s "https://en.wikipedia.org/wiki/David_Bowie" | piculet -s wikipedia.json --html
This should print the following output:
{
"birthplace": "Brixton, London, England",
"born": "1947-01-08",
"name": "David Bowie",
"occupation": [
"Singer",
"songwriter",
"actor"
]
}
In the same command, change the name part of the URL to Merlene_Ottey
and you will get similar data for Merlene Ottey.
Note that since the markup used in Wikipedia pages for persons varies,
the kinds of data you get with this specification will also vary.
Piculet can also be used as a simplistic HTML to XHTML converter by invoking it with the --h2x option:
$ cat foo.html | piculet --h2x
YAML support¶
To use YAML for specification, Piculet has to be installed with YAML support:
pip install piculet[yaml]
Note that this will install an external module for parsing YAML files.
The YAML version of the configuration example above can be found in movie.yaml.
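If you load a YAML specification yourself from Python code, the result is the same kind of mapping that the scrape function expects. A minimal sketch, using PyYAML here purely for illustration (any YAML parser that produces a plain mapping will do):
>>> import yaml                      # PyYAML, assumed to be installed
>>> from piculet import scrape
>>> with open("shining.html") as f:
...     document = f.read()
>>> with open("movie.yaml") as f:
...     spec = yaml.safe_load(f)     # loads into the same mapping structure as the JSON spec
>>> data = scrape(document, spec)    # same result as the command line example above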
Data extraction¶
This section explains how to write the specification for extracting data from a document. We’ll scrape the following HTML content for the movie “The Shining” in our examples:
<html>
<head>
<meta charset="utf-8"/>
<title>The Shining</title>
</head>
<body>
<h1>The Shining (<span class="year">1980</span>)</h1>
<ul class="genres">
<li>Horror</li>
<li>Drama</li>
</ul>
<div class="director">
<h3>Director:</h3>
<p><a href="/people/1">Stanley Kubrick</a></p>
</div>
<table class="cast">
<tr>
<td><a href="/people/2">Jack Nicholson</a></td>
<td>Jack Torrance</td>
</tr>
<tr>
<td><a href="/people/3">Shelley Duvall</a></td>
<td>Wendy Torrance</td>
</tr>
</table>
<div class="info">
<h3>Runtime:</h3>
<p>144 minutes</p>
</div>
<div class="info">
<h3>Language:</h3>
<p>English</p>
</div>
<div class="review">
<em>Fantastic</em> movie.
Definitely recommended.
</div>
</body>
</html>
Assuming the HTML document above is saved as shining.html, let's get its content:
>>> with open("shining.html") as f:
... document = f.read()
We'll use the scrape function to extract data from the document:
>>> from piculet import scrape
This function assumes that the document is in XML format. So, if any conversion is needed, it has to be done before calling this function. [1]
After building the DOM tree, the function will apply the extraction rules to the root element of the tree, and return a mapping where each item is generated by one of the rules.
Note
Piculet uses the ElementTree module for building and querying XML trees. However, it will make use of the lxml package if it's installed. The scrape function takes an optional lxml_html parameter which will use the HTML builder from the lxml package, thereby building the tree without converting HTML into XML first.
The specification contains two keys: pre for specifying the preprocessing operations (these will be covered in the next chapter), and items for specifying the rules that describe how to extract the data:
spec = {"pre": [...], "items": [...]}
The items list contains item mappings, where each item has a key and a value description. The key specifies the key for the item in the output mapping, and the value specifies how to extract the data to set as the value for that item. Typically, a value specifier is a path query. This query is applied to the root and the resulting list of strings is concatenated into a single string.
Note
This means that the query has to end with either text() or some attribute value as in @attr.
For example, to get the title of the movie from the example document, we can write:
>>> spec = {
... "items": [
... {
... "key": "title",
... "value": {
... "path": "//title/text()"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}
The //title/text() path generates the list ['The Shining'], and concatenation generates the resulting string.
Note
By default, XPath queries are limited by what ElementTree supports (plus a few additions by Piculet). However, if the lxml package is installed, a much wider range of XPath constructs can be used.
Multiple items can be collected in a single invocation:
>>> spec = {
... "items": [
... {
... "key": "title",
... "value": {
... "path": "//title/text()"
... }
... },
... {
... "key": "year",
... "value": {
... "path": '//span[@class="year"]/text()'
... }
... }
... ]
... }
>>> scrape(document, spec)
{'title': 'The Shining', 'year': '1980'}
If a path doesn’t match any element in the tree, the item will be excluded from the output. Note that in the following example, there’s no “foo” key in the result:
>>> spec = {
... "items": [
... {
... "key": "title",
... "value": {
... "path": "//title/text()"
... }
... },
... {
... "key": "foo",
... "value": {
... "path": "//foo/text()"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}
You can specify a string to use as separator when concatenating the texts selected by the query:
>>> spec = {
... "items": [
... {
... "key": "cast_names",
... "value": {
... "path": '//table[@class="cast"]/tr/td[1]/a/text()',
... "sep": ", "
... }
... }
... ]
... }
>>> scrape(document, spec)
{'cast_names': 'Jack Nicholson, Shelley Duvall'}
Transforming¶
After getting the string value, you can apply a transformation to it. The transformation function must take a string as its parameter, and can return a value of any type. Piculet contains several predefined transformers.
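The predefined transformers are plain callables collected in the transformers namespace (the full list is in the API section), so they can also be tried out directly:
>>> from piculet import transformers
>>> transformers.lower("HORROR")
'horror'
>>> transformers.int("1980")
1980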
For example, to get the year of the movie as an integer:
>>> spec = {
... "items": [
... {
... "key": "year",
... "value": {
... "path": '//span[@class="year"]/text()',
... "transform": "int"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'year': 1980}
If you want to use a custom transformer, you have to register it first:
>>> from piculet import transformers
>>> transformers.underscore = lambda s: s.replace(" ", "_")
>>> spec = {
... "items": [
... {
... "key": "title",
... "value": {
... "path": "//title/text()",
... "transform": "underscore"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'title': 'The_Shining'}
You can chain transformers using the | symbol:
>>> transformers.century = lambda x: x // 100 + 1
>>> spec = {
... "items": [
... {
... "key": "century",
... "value": {
... "path": '//span[@class="year"]/text()',
... "transform": "int|century"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'century': 20}
Shorthand notation¶
To make the specification more concise, you can write the value as a single string that combines the path and transform operations, separated by the | symbol:
>>> spec = {
... "items": [
... {
... "key": "title",
... "value": "//title/text()"
... },
... {
... "key": "year",
... "value": '//span[@class="year"]/text() | int'
... },
... {
... "key": "century",
... "value": '//span[@class="year"]/text() | int | century'
... }
... ]
... }
>>> scrape(document, spec)
{'title': 'The Shining', 'year': 1980, 'century': 20}
Note
After this point, the examples will generally use the shorthand notation.
Multi-valued items¶
Data with multiple values can be created by using a foreach key in the value specifier. This is a path expression to select elements from the tree.
Note
This implies that the foreach query should not end in text() or @attr.
The path function will be applied to each selected element, and the obtained values will be the members of the resulting list. For example, to get the genres of the movie, we can write:
>>> spec = {
... "items": [
... {
... "key": "genres",
... "value": {
... "foreach": '//ul[@class="genres"]/li',
... "path": "./text()"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'genres': ['Horror', 'Drama']}
If the foreach key doesn't match any element, the item will be excluded from the result:
>>> spec = {
... "items": [
... {
... "key": "foos",
... "value": {
... "foreach": '//ul[@class="foos"]/li',
... "path": "./text()"
... }
... }
... ]
... }
>>> scrape(document, spec)
{}
If a transformation is specified, it will be applied to every element in the resulting list:
>>> spec = {
... "items": [
... {
... "key": "genres",
... "value": {
... "foreach": '//ul[@class="genres"]/li',
... "path": "./text()",
... "transform": "lower"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'genres': ['horror', 'drama']}
Subitems¶
Nested structures can be created by writing subrules as value specifiers. If the value specifier is a mapping that contains an items key, then this will be interpreted as a subrule, and the generated mapping will be the value for the key.
>>> spec = {
... "items": [
... {
... "key": "director",
... "value": {
... "items": [
... {
... "key": "name",
... "value": '//div[@class="director"]//a/text()'
... },
... {
... "key": "link",
... "value": '//div[@class="director"]//a/@href'
... }
... ]
... }
... }
... ]
... }
>>> scrape(document, spec)
{'director': {'name': 'Stanley Kubrick', 'link': '/people/1'}}
Subitems can be combined with multi-values:
>>> spec = {
... "items": [
... {
... "key": "cast",
... "value": {
... "foreach": '//table[@class="cast"]/tr',
... "items": [
... {
... "key": "name",
... "value": "./td[1]/a/text()"
... },
... {
... "key": "link",
... "value": "./td[1]/a/@href"
... },
... {
... "key": "character",
... "value": "./td[2]/text()"
... }
... ]
... }
... }
... ]
... }
>>> scrape(document, spec)
{'cast': [{'name': 'Jack Nicholson',
'link': '/people/2',
'character': 'Jack Torrance'},
{'name': 'Shelley Duvall',
'link': '/people/3',
'character': 'Wendy Torrance'}]}
Subitems can also be transformed. The transformation function is always applied as the last step in a “value” definition; therefore, transformers for subitems take mappings (as opposed to strings) as parameters.
>>> transformers.stars = lambda x: "%(name)s as %(character)s" % x
>>> spec = {
... "items": [
... {
... "key": "cast",
... "value": {
... "foreach": '//table[@class="cast"]/tr',
... "items": [
... {
... "key": "name",
... "value": "./td[1]/a/text()"
... },
... {
... "key": "character",
... "value": "./td[2]/text()"
... }
... ],
... "transform": "stars"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'cast': ['Jack Nicholson as Jack Torrance',
'Shelley Duvall as Wendy Torrance']}
Generating keys from content¶
You can generate items where the key value also comes from the content. For example, consider how you would get the runtime and the language of the movie. Instead of writing a separate item for each h3 element under an “info” class div, we can write only one item that will select these divs and use the h3 text as the key. These elements can be selected using foreach specifications in the items. This will cause a new item to be generated for each selected element. To get the key value, we can use paths and transformers that will be applied to the selected element:
>>> spec = {
... "items": [
... {
... "foreach": '//div[@class="info"]',
... "key": {
... "path": "./h3/text()"
... },
... "value": {
... "path": "./p/text()"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'Runtime:': '144 minutes', 'Language:': 'English'}
The normalize transformer converts the string to lowercase, replaces spaces with underscores, and strips non-alphanumeric characters:
>>> spec = {
... "items": [
... {
... "foreach": '//div[@class="info"]',
... "key": {
... "path": "./h3/text()",
... "transform": "normalize"
... },
... "value": {
... "path": "./p/text()"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'runtime': '144 minutes', 'language': 'English'}
Sections¶
The specification also provides the ability to define sections within the document. An element can be selected as the root of a section such that the XPath queries in that section will be relative to that root. This can be used to make XPath expressions shorter, and to constrain the search in the tree. For example, the “director” example above can also be written using sections:
>>> spec = {
... "section": '//div[@class="director"]//a',
... "items": [
... {
... "key": "director",
... "value": {
... "items": [
... {
... "key": "name",
... "value": "./text()"
... },
... {
... "key": "link",
... "value": "./@href"
... }
... ]
... }
... }
... ]
... }
>>> scrape(document, spec)
{'director': {'name': 'Stanley Kubrick', 'link': '/people/1'}}
[1] Note that the example document is already in XML format.
Preprocessing¶
In addition to extraction rules, specifications can also contain preprocessing operations, which allow modifications on the tree before data extraction starts. Such operations can be needed to make data extraction simpler, or to remove the need for some postprocessing operations on the collected data.
The syntax for writing preprocessing operations is as follows:
rules = {
"pre": [
{
"op": "...",
...
},
{
"op": "...",
...
}
],
"items": [ ... ]
}
Every preprocessing operation item has a name which is given as the value of the “op” key. The other items in the mapping are specific to the operation. The operations are applied in the order they are written in the operations list.
The predefined preprocessing operations are explained below.
Removing elements¶
This operation removes from the tree all the elements (and their subtrees) that are selected by a given XPath query:
{"op": "remove", "path": "..."}
Setting element attributes¶
This operation selects all elements by a given XPath query and sets an attribute for these elements to a given value:
{"op": "set_attr", "path": "...", "name": "...", "value": "..."}
The attribute “name” can be a literal string or an extractor as described in the data extraction chapter. Similarly, the attribute “value” can be given as a literal string or an extractor.
Setting element text¶
This operation selects all elements by a given XPath query and sets their texts to a given value:
{"op": "set_text", "path": "...", "text": "..."}
The “text” can be a literal string or an extractor.
Lower-level functions¶
Piculet also provides a lower-level API where you can run the stages separately. For example, if the same document will be scraped multiple times with different rules, calling the scrape function repeatedly will cause the document to be parsed into a DOM tree repeatedly. Instead, you can create the DOM tree once, and run extraction rules against this tree multiple times.
Also, this API uses code to express the specification (instead of strings) and therefore development tools can help better in writing the rules by showing error indicators and suggesting autocompletions.
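As a quick preview of the pieces described below, here is a minimal sketch that builds the tree for the example document once and then applies two different extractors to it:
>>> from piculet import build_tree, Items, Path, Rule
>>> root = build_tree(document)
>>> get_title = Items([Rule(key="title", value=Path("//title/text()"))])
>>> get_year = Items([
...     Rule(key="year", value=Path('//span[@class="year"]/text()', transform=int)),
... ])
>>> get_title(root)
{'title': 'The Shining'}
>>> get_year(root)
{'year': 1980}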
Building the tree¶
The DOM tree can be created from the document using the build_tree function:
>>> from piculet import build_tree
>>> root = build_tree(document)
If the document needs to be converted from HTML to XML, you can use the html_to_xhtml function:
>>> from piculet import html_to_xhtml
>>> converted = html_to_xhtml(document)
>>> root = build_tree(converted)
If lxml is available, you can use the lxml_html parameter for building the tree without converting an HTML document into XHTML:
>>> root = build_tree(document, lxml_html=True)
Note
Note that if you use the lxml.html builder, there might be differences in how the tree is built compared to the Piculet conversion method, and the path queries for preprocessing and extraction might need changes.
Preprocessing¶
Preprocessors are functions that take an element in the DOM tree as a parameter and modify the tree. The preprocessors registry contains preprocessor generators which take the parameters other than the element to apply the operation to, and return a function that expects the element:
>>> from piculet import preprocessors
>>> remove_ads = preprocessors.remove(path='//div[@class="ad"]')
>>> remove_ads(root)
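The other generators in the registry are used the same way. A sketch, assuming that set_text accepts the same keyword names as the corresponding specification operation (path and text):
>>> clear_reviews = preprocessors.set_text(path='//div[@class="review"]', text="")
>>> clear_reviews(root)  # assumed signature; blanks the text of matching elements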
Warning
The preprocessing functions assume that the root of the tree doesn’t change.
Data extraction¶
The API for data extraction has a one-to-one correspondence with the specification mapping.
Path extractors are applied to elements to extract the value for a single data item.
>>> from piculet import Path
>>> path = Path('//span[@class="year"]/text()', transform=int)
>>> path(root)
1980
The sep parameter can be used to concatenate the selected texts using a separator string:
>>> path = Path('//table[@class="cast"]/tr/td[1]/a/text()', sep=", ")
>>> path(root)
'Jack Nicholson, Shelley Duvall'
You can use the chain utility function to generate chained transformers:
>>> from piculet import chain
>>> path = Path(
... '//span[@class="year"]/text()',
... transform=chain(int, lambda x: x // 100 + 1),
... )
>>> path(root)
20
Every item in the result mapping is generated by a Rule in the API. Rules are applied to elements to extract data items in the result mapping, so their basic function is to associate the keys with the values.
>>> from piculet import Rule
>>> rule = Rule(
... key="year",
... value=Path('//span[@class="year"]/text()', transform=int),
... )
>>> rule(root)
{'year': 1980}
Items extractors are applied to elements to extract subitems for a data item. Basically, they are rule collections.
>>> from piculet import Items
>>> rules = [
... Rule(
... key="title",
... value=Path('//title/text()'),
... ),
... Rule(
... key="year",
... value=Path('//span[@class="year"]/text()', transform=int),
... ),
... ]
>>> items = Items(rules)
>>> items(root)
{'title': 'The Shining', 'year': 1980}
Items extractors act both as the top level extractor that gets applied to the root of the tree, and also as an extractor for any rule with subitems.
An extractor can have a foreach parameter if it will be multi-valued:
>>> rules = [
... Rule(
... key="genres",
... value=Path(
... foreach='//ul[@class="genres"]/li',
... path="./text()",
... transform=str.lower,
... ),
... ),
... ]
>>> items = Items(rules)
>>> items(root)
{'genres': ['horror', 'drama']}
The key parameter of a rule can be an extractor, in which case it can be used to extract the key value from content. A rule can also have a foreach parameter for generating multiple items in one rule. These features will work as they are described in the data extraction section.
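As a sketch, the key-from-content example from that section can be written with rule objects like this (the expected result mirrors the earlier mapping-based example):
>>> from piculet import transformers
>>> rules = [
...     Rule(
...         foreach='//div[@class="info"]',
...         key=Path("./h3/text()", transform=transformers.normalize),
...         value=Path("./p/text()"),
...     ),
... ]
>>> Items(rules)(root)
{'runtime': '144 minutes', 'language': 'English'}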
A more complete example with transformations is given below. Again note that the specification is exactly the same as given in the corresponding mapping example in the data extraction chapter.
>>> rules = [
... Rule(
... key="cast",
... value=Items(
... foreach='//table[@class="cast"]/tr',
... rules=[
... Rule(
... key="name",
... value=Path("./td[1]/a/text()"),
... ),
... Rule(
... key="character",
... value=Path("./td[2]/text()"),
... ),
... ],
... transform=lambda x: "%(name)s as %(character)s" % x
... ),
... ),
... ]
>>> Items(rules)(root)
{'cast': ['Jack Nicholson as Jack Torrance',
'Shelley Duvall as Wendy Torrance']}
A rule can have a section parameter as described in the data extraction chapter:
>>> rules = [
... Rule(
... key="director",
... value=Items(
... section='//div[@class="director"]//a',
... rules=[
... Rule(
... key="name",
... value=Path("./text()"),
... ),
... Rule(
... key="link",
... value=Path("./@href"),
... ),
... ],
... ),
... ),
... ]
>>> Items(rules)(root)
{'director': {'name': 'Stanley Kubrick', 'link': '/people/1'}}
API¶
Module for scraping XML and HTML documents using XPath queries.
It consists of this single source file with no dependencies other than the standard library, which makes it very easy to integrate into applications.
For more information, please refer to the documentation: https://tekir.org/piculet/
class piculet.Extractor¶
Bases: abc.ABC
An abstract base extractor.
This class handles the common extraction operations such as transforming obtained raw values and handling multi-valued data.
static from_desc(desc)¶
Create an extractor from a description.

class piculet.HTMLNormalizer(*, skip_tags: Iterable[str] = (), skip_attrs: Iterable[str] = ())¶
Bases: html.parser.HTMLParser
HTML to XHTML converter.
In addition to converting the document to valid XHTML, this will also remove unwanted tags and attributes, along with all comments and DOCTYPE declarations.
Parameters:
- skip_tags – Tags to remove.
- skip_attrs – Attributes to remove.
VOID_ELEMENTS = frozenset({'link', 'embed', 'input', 'col', 'keygen', 'track', 'nextid', 'hr', 'bgsound', 'image', 'img', 'command', 'param', 'source', 'isindex', 'br', 'basefont', 'base', 'wbr', 'area', 'meta', 'menuitem', 'frame'})¶
Tags to treat as self-closing.
error(message: str) → None¶
Ignore errors.

class piculet.Items(rules: Sequence[Callable[[lxml.etree.Element], Mapping[KT, VT_co]]], *, section: Optional[str] = None, transform: Optional[Callable[[Mapping[KT, VT_co]], Any]] = None, foreach: Optional[str] = None)¶
Bases: piculet.Extractor
An extractor that can get multiple pieces of data from an element.
Parameters:
- rules – Functions for generating the items from the element.
- section – XPath expression for selecting the root element for queries.
- transform – Function for transforming the raw data items.
- foreach – XPath expression for selecting multiple subelements.

class piculet.Path(path: str, *, sep: Optional[str] = None, transform: Optional[Callable[[str], Any]] = None, foreach: Optional[str] = None)¶
Bases: piculet.Extractor
An extractor that can get a single piece of data from an element.
Parameters:
- path – XPath expression for getting the raw data values.
- sep – Separator for joining the raw data values.
- transform – Function for transforming the raw data.
- foreach – XPath expression for selecting multiple subelements.

class piculet.Rule(key: Union[str, Callable[[lxml.etree.Element], str]], value: piculet.Extractor, *, foreach: Optional[str] = None)¶
Bases: object
A data generator.
Parameters:
- key – Name to distinguish the data.
- value – Extractor that will generate the data.
- foreach – XPath expression for generating multiple data items.
static from_desc(desc)¶
Create a rule from a description.

piculet.build_tree(document: str, *, html: bool = False) → lxml.etree.Element¶
Build a tree from an XML document.
Parameters:
- document – XML document to build the tree from.
- html – Whether the document is in HTML format.
Returns: Root element of the XML tree.

piculet.chain(*functions)¶
Chain functions to apply the output of one as the input of the next.

piculet.html_to_xhtml(document: str, *, skip_tags: Iterable[str] = (), skip_attrs: Iterable[str] = ()) → str¶
Convert an HTML document to XHTML.
Parameters:
- document – HTML document to convert.
- skip_tags – Tags to exclude from the output.
- skip_attrs – Attributes to exclude from the output.

piculet.preprocessor(desc)¶
Create a preprocessor from a description.

piculet.preprocessors = namespace(remove=<function _remove>, set_attr=<function _set_attr>, set_text=<function _set_text>)¶
Predefined preprocessors.

piculet.scrape(document: str, spec: Mapping[KT, VT_co], *, html: bool = False) → Mapping[KT, VT_co]¶
Extract data from a document.
Parameters:
- document – Document to scrape.
- spec – Preprocessing and extraction specification.
- html – Whether to use the HTML builder.

piculet.transformers = namespace(bool=<method-wrapper '__call__' of type object>, capitalize=<method 'capitalize' of 'str' objects>, clean=<function <lambda>>, float=<method-wrapper '__call__' of type object>, int=<method-wrapper '__call__' of type object>, len=<built-in function len>, lower=<method 'lower' of 'str' objects>, lstrip=<method 'lstrip' of 'str' objects>, normalize=<function <lambda>>, rstrip=<method 'rstrip' of 'str' objects>, strip=<method 'strip' of 'str' objects>, upper=<method 'upper' of 'str' objects>)¶
Predefined transformers.
Changes¶
2.0.0a2 (unreleased)¶
- Drop support for Python 3.6.
- Revert API to OOP style.
- Move type annotations from stub into source.
2.0.0a1 (2019-07-23)¶
- Remove reducing functions; selected texts will always be concatenated (using an optional separator).
- Convert string normalization and cleaning into transformers.
- Add support for chaining transformers.
- Change chaining symbol from “->” to “|”.
2.0.0a0 (2019-06-28)¶
- Drop support for Python 2 and 3.4.
- Add support for absolute XPath queries in ElementTree.
- Add support for XPath queries that start with a parent axis in ElementTree.
- Add shorthand notation for path extractors in specification.
- Cache compiled XPath expressions.
- Remove HTML charset detection.
- Command line operations now read only from stdin.
- Simplify CLI commands.
1.0.1 (2019-02-07)¶
- Accept both .yaml and .yml as valid YAML file extensions.
- Documentation fixes.
1.0 (2018-05-25)¶
- Bumped version to 1.0.
1.0b7 (2018-03-21)¶
- Dropped support for Python 3.3.
- Fixes for handling Unicode data in HTML for Python 2.
- Added registry for preprocessors.
1.0b6 (2018-01-17)¶
- Support for writing specifications in YAML.
1.0b5 (2018-01-16)¶
- Added a class-based API for writing specifications.
- Added predefined transformation functions.
- Removed callables from specification maps. Use the new API instead.
- Added support for registering new reducers and transformers.
- Added support for defining sections in document.
- Refactored XPath evaluation method in order to parse path expressions once.
- Preprocessing will be done only once when the tree is built.
- Concatenation is now the default reducing operation.
1.0b4 (2018-01-02)¶
- Added "--version" option to command line arguments.
- Added option to force the use of lxml’s HTML builder.
- Fixed the error where non-truthy values would be excluded from the result.
- Added support for transforming node text during preprocess.
- Added separate preprocessing function to API.
- Renamed the “join” reducer as “concat”.
- Renamed the “foreach” keyword for keys as “section”.
- Removed some low level debug messages to substantially increase speed.
1.0b3 (2017-07-25)¶
- Removed the caching feature.
1.0b2 (2017-06-16)¶
- Added helper function for getting cache hash keys of URLs.
1.0b1 (2017-04-26)¶
- Added optional value transformations.
- Added support for custom reducer callables.
- Added command-line option for scraping documents from local files.
1.0a2 (2017-04-04)¶
- Added support for Python 2.7.
- Fixed lxml support.
1.0a1 (2016-08-24)¶
- First release on PyPI.