API

Piculet is a module for extracting data from HTML/XML and JSON documents. The queries are written in XPath for HTML/XML, and in JMESPath for JSON.

The documentation is available at: https://piculet.readthedocs.io/

class piculet.Collector(*, root: piculet.Query | None = None, foreach: piculet.Query | None = None, transforms: list[str] = <factory>, _transforms: list[collections.abc.Callable[[typing.Any], typing.Any]] = <factory>, rules: list[piculet.Rule] = <factory>)

Bases: Extractor

An extractor that collects multiple pieces of data.

extract(node: _Element | dict[str, Any]) -> dict[str, Any] | None

Extract data from a node.

rules: list[Rule]

Rules to apply to a node to collect the data.

class piculet.Extractor(*, root: piculet.Query | None = None, foreach: piculet.Query | None = None, transforms: list[str] = <factory>, _transforms: list[collections.abc.Callable[[typing.Any], typing.Any]] = <factory>)

Bases: object

Base class for extractors.

foreach: Query | None = None

Query to select the nodes for producing multiple results.

root: Query | None = None

Query to select the root node to extract the data from.

transforms: list[str]

Names of transform functions to apply to the obtained data.

class piculet.Picker(*, root: piculet.Query | None = None, foreach: piculet.Query | None = None, transforms: list[str] = <factory>, _transforms: list[collections.abc.Callable[[typing.Any], typing.Any]] = <factory>, path: piculet.Query)

Bases: Extractor

An extractor that produces a single value.

extract(node: _Element | dict[str, Any]) -> Any

Extract data from a node.

path: Query

Query to apply to a node to extract the value.

class piculet.Query(path: str)

Bases: object

A query based on XPath or JMESPath.

Expressions starting with / or ./ are assumed to be XPath, and all others are assumed to be JMESPath.
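The prefix rule above can be sketched in a few lines. This is only an illustration of the rule as stated, not Piculet's actual implementation; `query_language` is a hypothetical helper name that does not exist in the library.

```python
def query_language(path: str) -> str:
    """Classify a path expression by the prefix rule described above.

    Hypothetical helper for illustration; not part of Piculet's API.
    """
    # str.startswith accepts a tuple of prefixes to test against.
    return "xpath" if path.startswith(("/", "./")) else "jmespath"


print(query_language("//title/text()"))  # xpath
print(query_language("./td[1]/text()"))  # xpath
print(query_language("movie.title"))     # jmespath
print(query_language("td[1]/text()"))    # jmespath
```

One consequence of the rule, as the last call shows: a relative XPath written without the leading ./ would be treated as JMESPath, so relative XPath expressions need the explicit ./ prefix.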

apply(node: _Element | dict[str, Any]) -> Any

Apply this query to a node.

If this is an XPath query, it is expected to return a list of text values, which are concatenated into a single string.
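As a rough sketch of the concatenation step, assuming the text pieces are joined with no separator (the source does not say otherwise, and it does not specify what happens when nothing matches). `concatenate_texts` is a hypothetical name used only here.

```python
def concatenate_texts(texts: list[str]) -> str:
    """Join the text pieces matched by an XPath query.

    Assumption: plain concatenation without a separator.
    """
    return "".join(texts)


# Pieces that an expression such as //p//text() could match
# in the fragment <p>Hello, <b>world</b>!</p>
parts = ["Hello, ", "world", "!"]
print(concatenate_texts(parts))  # Hello, world!
```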

get(node: _Element | dict[str, Any]) -> _Element | dict[str, Any]

Get the first node matched by applying this query to a node.

path: str

Path expression to apply to nodes.

select(node: _Element | dict[str, Any]) -> list[_Element | dict[str, Any]]

Get all nodes matched by applying this query to a node.

class piculet.Rule(*, key: str | Picker, extractor: Picker | Collector, foreach: Query | None = None)

Bases: object

A rule that generates key-value pairs from a node.

apply(root: _Element | dict[str, Any]) -> dict[str, Any] | None

Apply this rule to a node.

extractor: Picker | Collector

Extractor to produce the value.

foreach: Query | None = None

Query to select the nodes for producing multiple key-value pairs.

key: str | Picker

Name of the key, or an extractor that produces the key.
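The way key, extractor, and foreach fit together can be illustrated with plain functions over JSON-like dictionaries. This is a conceptual sketch under stated assumptions, not Piculet's code: `apply_rule` is a hypothetical function, ordinary callables stand in for Picker and Query objects, and skipping pairs whose key or value is None is an assumed behavior.

```python
from typing import Any, Callable, Optional, Union

Node = dict[str, Any]


def apply_rule(
    root: Node,
    *,
    key: Union[str, Callable[[Node], Any]],
    extractor: Callable[[Node], Any],
    foreach: Optional[Callable[[Node], list[Node]]] = None,
) -> dict[str, Any]:
    """Produce key-value pairs from a node (hypothetical sketch)."""
    # Without foreach, the rule works on the root node only.
    nodes = foreach(root) if foreach is not None else [root]
    pairs: dict[str, Any] = {}
    for node in nodes:
        # A string key is used as is; otherwise the key is extracted.
        name = key if isinstance(key, str) else key(node)
        value = extractor(node)
        # Assumption: pairs with a missing key or value are dropped.
        if name is not None and value is not None:
            pairs[name] = value
    return pairs


root = {"links": [{"rel": "home", "href": "/"}, {"rel": "docs", "href": "/docs"}]}

# Fixed key: a single pair.
fixed = apply_rule(root, key="first", extractor=lambda n: n["links"][0]["href"])
print(fixed)  # {'first': '/'}

# Extracted key with foreach: one pair per selected node.
result = apply_rule(
    root,
    key=lambda n: n["rel"],
    extractor=lambda n: n["href"],
    foreach=lambda n: n["links"],
)
print(result)  # {'home': '/', 'docs': '/docs'}
```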

class piculet.Spec(*, root: piculet.Query | None = None, foreach: piculet.Query | None = None, transforms: list[str] = <factory>, _transforms: list[collections.abc.Callable[[typing.Any], typing.Any]] = <factory>, rules: list[piculet.Rule] = <factory>, pre: list[str] = <factory>, post: list[str] = <factory>, _pre: list[collections.abc.Callable[[lxml.etree._Element | dict[str, typing.Any]], lxml.etree._Element | dict[str, typing.Any]]] = <factory>, _post: list[collections.abc.Callable[[dict[str, typing.Any]], dict[str, typing.Any]]] = <factory>)

Bases: Collector

A scraping specification.

extract(root: _Element | dict[str, Any])

Extract data from a node.

post: list[str]

Names of postprocessor functions.

postprocess(data: dict[str, Any]) -> dict[str, Any]

Apply the postprocessors to the collected data.

pre: list[str]

Names of preprocessor functions.

preprocess(root: _Element | dict[str, Any]) -> _Element | dict[str, Any]

Apply the preprocessors to the root node.

scrape(document: str | _Element | dict[str, Any], *, doctype: Literal['html', 'xml', 'json']) -> dict[str, Any]

Scrape a document.
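Taken together, the Spec methods suggest a pipeline: build a tree, preprocess the root, extract the data, then postprocess the result. The sketch below shows that flow for a JSON document using only the standard library. The ordering is inferred from the method names rather than confirmed by the source, and `scrape_json`, `unwrap`, `collect`, and `add_decade` are hypothetical names invented for this example.

```python
import json
from typing import Any, Callable

Data = dict[str, Any]


def scrape_json(
    document: str,
    *,
    pre: list[Callable[[Data], Data]],
    extract: Callable[[Data], Data],
    post: list[Callable[[Data], Data]],
) -> Data:
    """Assumed pipeline: parse, preprocess, extract, postprocess."""
    root = json.loads(document)  # stand-in for building a JSON tree
    for preprocess in pre:
        root = preprocess(root)
    data = extract(root)
    for postprocess in post:
        data = postprocess(data)
    return data


def unwrap(root: Data) -> Data:
    # Preprocessor: descend into the part of the document of interest.
    return root["payload"]


def collect(root: Data) -> Data:
    # Stand-in for rule-based extraction.
    return {"title": root["title"], "year": root["year"]}


def add_decade(data: Data) -> Data:
    # Postprocessor: derive a new field from the collected data.
    return {**data, "decade": data["year"] // 10 * 10}


document = '{"payload": {"title": "Example", "year": 1987}}'
result = scrape_json(document, pre=[unwrap], extract=collect, post=[add_decade])
print(result)  # {'title': 'Example', 'year': 1987, 'decade': 1980}
```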

piculet.build_tree(document: str, doctype: Literal['html', 'xml', 'json']) -> _Element | dict[str, Any]

Convert a document to a tree.

piculet.deserialize(value: Any, type_: Type[T], **kwargs) -> T

Generate an object from a dictionary.

piculet.load_spec(content: typing.Mapping[str, typing.Any], *, type_: type = <class 'piculet.Spec'>, transformers: typing.Mapping[str, collections.abc.Callable[[typing.Any], typing.Any]] | None = None, preprocessors: typing.Mapping[str, collections.abc.Callable[[lxml.etree._Element | dict[str, typing.Any]], lxml.etree._Element | dict[str, typing.Any]]] | None = None, postprocessors: typing.Mapping[str, collections.abc.Callable[[dict[str, typing.Any]], dict[str, typing.Any]]] | None = None) -> Spec

Deserialize a mapping into a scraping specification.
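A mapping passed to load_spec might look like the following. This is a sketch that assumes the mapping keys mirror the field names documented above (rules, key, extractor, path, foreach, transforms) and that a path given as a plain string is turned into a Query; the XPath expressions, the `to_int` transform name, and the `transform_names` helper are hypothetical. The calls to Piculet itself are left as comments so the block runs without the library installed.

```python
# Hypothetical spec content; keys mirror the documented field names.
SPEC_CONTENT = {
    "rules": [
        {"key": "title", "extractor": {"path": "//title/text()"}},
        {
            "key": "year",
            "extractor": {
                "path": "//span[@class='year']/text()",
                "transforms": ["to_int"],
            },
        },
        {
            "key": "cast",
            "extractor": {
                "foreach": "//table[@id='cast']//tr",
                "rules": [
                    {"key": "name", "extractor": {"path": "./td[1]/text()"}},
                ],
            },
        },
    ],
}

# Transform names used in the spec are resolved through this mapping.
TRANSFORMERS = {"to_int": int}


def transform_names(extractor: dict) -> set[str]:
    """Collect every transform name used in a spec mapping (helper for this example)."""
    names = set(extractor.get("transforms", []))
    for rule in extractor.get("rules", []):
        names |= transform_names(rule["extractor"])
    return names


# Check that every transform name in the spec has a registered function.
unknown = transform_names(SPEC_CONTENT) - set(TRANSFORMERS)
print(sorted(unknown))  # []

# With Piculet installed, the documented signatures suggest:
# from piculet import load_spec
# spec = load_spec(SPEC_CONTENT, transformers=TRANSFORMERS)
# data = spec.scrape(document, doctype="html")
```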

piculet.serialize(value: Any, **kwargs) -> Any

Generate a dictionary from an object.
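The relationship between serialize and deserialize can be pictured as a round trip between an object and a dictionary. The sketch below uses standard library dataclasses as a stand-in; it does not show how Piculet implements these functions, and `PickerSketch`, `to_dict`, `from_dict`, and the `strip` transform name are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class PickerSketch:
    """Hypothetical stand-in with two of the fields documented for Picker."""

    path: str
    transforms: list[str] = field(default_factory=list)


def to_dict(value: Any) -> dict[str, Any]:
    # Object to dictionary, analogous in spirit to serialize.
    return asdict(value)


def from_dict(data: dict[str, Any], type_: type) -> Any:
    # Dictionary to object, analogous in spirit to deserialize.
    return type_(**data)


original = PickerSketch(path="//title/text()", transforms=["strip"])
data = to_dict(original)
print(data)  # {'path': '//title/text()', 'transforms': ['strip']}

restored = from_dict(data, PickerSketch)
print(restored == original)  # True
```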