API

Module for scraping XML and HTML documents using XPath queries.

It consists of a single source file with no dependencies other than the standard library, so it is easy to integrate into applications.

For more information, please refer to the documentation: https://tekir.org/piculet/

class piculet.Extractor

Bases: abc.ABC

An abstract base extractor.

This class handles the common extraction operations such as transforming obtained raw values and handling multi-valued data.

static from_desc(desc)

Create an extractor from a description.

class piculet.HTMLNormalizer(*, skip_tags: Iterable[str] = (), skip_attrs: Iterable[str] = ())

Bases: html.parser.HTMLParser

HTML to XHTML converter.

In addition to converting the document to valid XHTML, this will also remove unwanted tags and attributes, along with all comments and DOCTYPE declarations.

Parameters:
  • skip_tags – Tags to remove.
  • skip_attrs – Attributes to remove.

VOID_ELEMENTS = frozenset({'link', 'embed', 'input', 'col', 'keygen', 'track', 'nextid', 'hr', 'bgsound', 'image', 'img', 'command', 'param', 'source', 'isindex', 'br', 'basefont', 'base', 'wbr', 'area', 'meta', 'menuitem', 'frame'})

Tags to treat as self-closing.

error(message: str) → None

Ignore errors.
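
The conversion described for HTMLNormalizer can be illustrated with a minimal sketch built on html.parser.HTMLParser. This is not Piculet's implementation: the names MiniNormalizer, to_xhtml, and VOID are hypothetical, the void element set is a small subset chosen for the example, and dropping the contents of a skipped tag along with the tag itself is an assumption.

```python
from html import escape
from html.parser import HTMLParser

# Small subset of void elements, chosen for this sketch only.
VOID = frozenset({"br", "hr", "img", "input", "meta", "link"})


class MiniNormalizer(HTMLParser):
    """Hypothetical, simplified HTML-to-XHTML converter."""

    def __init__(self, *, skip_tags=(), skip_attrs=()):
        super().__init__(convert_charrefs=True)
        self.skip_tags = frozenset(skip_tags)
        self.skip_attrs = frozenset(skip_attrs)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            if tag not in VOID:
                self._skip_depth += 1
            return
        if self._skip_depth:
            return
        # Valueless attributes (e.g. "disabled") get their name as value.
        kept = "".join(
            f' {name}="{escape(value or name)}"'
            for name, value in attrs
            if name not in self.skip_attrs
        )
        closing = "/>" if tag in VOID else ">"
        self.parts.append(f"<{tag}{kept}{closing}")

    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            if tag not in VOID and self._skip_depth:
                self._skip_depth -= 1
            return
        if not self._skip_depth and tag not in VOID:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(escape(data, quote=False))

    # Comments and DOCTYPE declarations are dropped because the default
    # handle_comment and handle_decl methods do nothing.


def to_xhtml(document, **options):
    normalizer = MiniNormalizer(**options)
    normalizer.feed(document)
    normalizer.close()  # flush any buffered text
    return "".join(normalizer.parts)


source = '<!DOCTYPE html><p class="x" id="a">Hi<br><!-- note --><script>x()</script></p>'
result = to_xhtml(source, skip_tags=("script",), skip_attrs=("class",))
print(result)  # <p id="a">Hi<br/></p>
```

The sketch keeps output in a list of string fragments and joins them at the end, which avoids repeated string concatenation while the parser streams events.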

class piculet.Items(rules: Sequence[Callable[[lxml.etree.Element], Mapping[KT, VT_co]]], *, section: Optional[str] = None, transform: Optional[Callable[[Mapping[KT, VT_co]], Any]] = None, foreach: Optional[str] = None)

Bases: piculet.Extractor

An extractor that can get multiple pieces of data from an element.

Parameters:
  • rules – Functions for generating the items from the element.
  • section – XPath expression for selecting the root element for queries.
  • transform – Function for transforming the raw data items.
  • foreach – XPath expression for selecting multiple subelements.
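
How the rules, section, and transform parameters might fit together can be sketched with the standard library's xml.etree.ElementTree. The helper extract_items below is hypothetical and is not Piculet's code. It assumes that each rule returns a mapping, that the mappings are merged in order, and that a missing section yields None. The foreach parameter is left out.

```python
import xml.etree.ElementTree as ET
from typing import Any, Callable, Dict, Mapping, Optional, Sequence

RuleFunc = Callable[[ET.Element], Mapping[str, Any]]


def extract_items(
    element: ET.Element,
    rules: Sequence[RuleFunc],
    *,
    section: Optional[str] = None,
    transform: Optional[Callable[[Mapping[str, Any]], Any]] = None,
) -> Any:
    """Hypothetical helper: merge the mappings produced by each rule."""
    # When a section is given, run the rules against that subelement.
    root = element if section is None else element.find(section)
    if root is None:
        return None
    data: Dict[str, Any] = {}
    for rule in rules:
        data.update(rule(root))
    return data if transform is None else transform(data)


doc = ET.fromstring(
    "<movie><info><title>The Shining</title><year>1980</year></info></movie>"
)
rules = [
    lambda el: {"title": el.findtext("title")},
    lambda el: {"year": int(el.findtext("year"))},
]
movie = extract_items(doc, rules, section="./info")
print(movie)  # {'title': 'The Shining', 'year': 1980}
```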

class piculet.Path(path: str, *, sep: Optional[str] = None, transform: Optional[Callable[[str], Any]] = None, foreach: Optional[str] = None)

Bases: piculet.Extractor

An extractor that can get a single piece of data from an element.

Parameters:
  • path – XPath expression for getting the raw data values.
  • sep – Separator for joining the raw data values.
  • transform – Function for transforming the raw data.
  • foreach – XPath expression for selecting multiple subelements.
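
The interplay of path, sep, and transform can be sketched as follows. The helper extract_path is hypothetical and uses xml.etree.ElementTree, which supports only a subset of XPath, so it selects elements and reads their text rather than using a text() step. Joining with an empty string when sep is None and returning None when nothing matches are assumptions, and foreach is omitted.

```python
import xml.etree.ElementTree as ET
from typing import Any, Callable, Optional


def extract_path(
    element: ET.Element,
    path: str,
    *,
    sep: Optional[str] = None,
    transform: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Hypothetical helper: collect text, join it, then transform it."""
    # Gather the raw text values of all elements matched by the path.
    values = [node.text for node in element.findall(path) if node.text]
    if not values:
        return None
    raw = (sep or "").join(values)
    return raw if transform is None else transform(raw)


root = ET.fromstring(
    "<movie><title>The Shining</title>"
    "<genre>Horror</genre><genre>Drama</genre><year> 1980 </year></movie>"
)
genres = extract_path(root, "./genre", sep=", ")
year = extract_path(root, "./year", transform=lambda s: int(s.strip()))
print(genres)  # Horror, Drama
print(year)    # 1980
```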

class piculet.Rule(key: Union[str, Callable[[lxml.etree.Element], str]], value: piculet.Extractor, *, foreach: Optional[str] = None)

Bases: object

A data generator that pairs a key with the extractor producing its value.

Parameters:
  • key – Name to distinguish the data.
  • value – Extractor that will generate the data.
  • foreach – XPath expression for generating multiple data items.

static from_desc(desc)

Create a rule from a description.

piculet.build_tree(document: str, *, html: bool = False) → lxml.etree.Element

Build a tree from an XML document.

Parameters:
  • document – XML document to build the tree from.
  • html – Whether the document is in HTML format.

Returns:
  Root element of the XML tree.

piculet.chain(*functions)

Chain functions to apply the output of one as the input of the next.
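
Function chaining of this kind can be sketched with functools.reduce. The name chain_sketch is hypothetical and this is not Piculet's code. It assumes the functions are applied left to right, which is how the description above reads.

```python
from functools import reduce
from typing import Any, Callable


def chain_sketch(*functions: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Compose functions left to right (illustrative only)."""

    def chained(value: Any) -> Any:
        # Feed the running result into each function in order.
        return reduce(lambda acc, func: func(acc), functions, value)

    return chained


clean_int = chain_sketch(str.strip, int)
print(clean_int("  42 "))  # 42
```

With no functions at all, the sketch returns its input unchanged, which makes it safe to build a chain from a possibly empty list of steps.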

piculet.html_to_xhtml(document: str, *, skip_tags: Iterable[str] = (), skip_attrs: Iterable[str] = ()) → str

Convert an HTML document to XHTML.

Parameters:
  • document – HTML document to convert.
  • skip_tags – Tags to exclude from the output.
  • skip_attrs – Attributes to exclude from the output.

piculet.preprocessor(desc)

Create a preprocessor from a description.

piculet.preprocessors = namespace(remove=<function _remove>, set_attr=<function _set_attr>, set_text=<function _set_text>)

Predefined preprocessors.

piculet.scrape(document: str, spec: Mapping[KT, VT_co], *, html: bool = False) → Mapping[KT, VT_co]

Extract data from a document.

Parameters:
  • document – Document to scrape.
  • spec – Preprocessing and extraction specification.
  • html – Whether to use the HTML builder.

piculet.transformers = namespace(bool=<method-wrapper '__call__' of type object>, capitalize=<method 'capitalize' of 'str' objects>, clean=<function <lambda>>, float=<method-wrapper '__call__' of type object>, int=<method-wrapper '__call__' of type object>, len=<built-in function len>, lower=<method 'lower' of 'str' objects>, lstrip=<method 'lstrip' of 'str' objects>, normalize=<function <lambda>>, rstrip=<method 'rstrip' of 'str' objects>, strip=<method 'strip' of 'str' objects>, upper=<method 'upper' of 'str' objects>)

Predefined transformers.
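
The listing above shows a namespace that maps short names to ordinary callables such as int, len, and str.strip. A minimal sketch of that pattern, using types.SimpleNamespace, might look like the following. The names transformers_sketch and apply_named are hypothetical, and looking a transformer up by name is an assumption about how such a namespace could be used. The clean and normalize entries are omitted because their behavior is not described here.

```python
from types import SimpleNamespace

# Hypothetical stand-in mirroring a few entries from the listing above.
transformers_sketch = SimpleNamespace(
    int=int,
    float=float,
    len=len,
    lower=str.lower,
    strip=str.strip,
    upper=str.upper,
)


def apply_named(name: str, value: str):
    """Look up a transformer by name and apply it to a raw value."""
    return getattr(transformers_sketch, name)(value)


print(apply_named("strip", "  Kubrick "))  # Kubrick
print(apply_named("int", "1980"))          # 1980
```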