API¶
Piculet is a module for extracting data from HTML/XML and JSON documents. The queries are written in XPath for HTML/XML, and in JMESPath for JSON.
The documentation is available on: https://piculet.readthedocs.io/
- class piculet.Collector(*, root: ~piculet.Query | None = None, foreach: ~piculet.Query | None = None, transforms: list[str] = <factory>, _transforms: list[~collections.abc.Callable[[~typing.Any], ~typing.Any]] = <factory>, rules: list[~piculet.Rule] = <factory>)¶
Bases:
ExtractorAn extractor that collects multiple pieces of data.
- extract(node: _Element | dict[str, Any]) dict[str, Any] | None¶
Extract data from a node.
- class piculet.Extractor(*, root: ~piculet.Query | None = None, foreach: ~piculet.Query | None = None, transforms: list[str] = <factory>, _transforms: list[~collections.abc.Callable[[~typing.Any], ~typing.Any]] = <factory>)¶
Bases:
objectBase class for extractors.
- transforms: list[str]¶
Names of transform functions to apply to the obtained data.
- class piculet.Picker(*, root: ~piculet.Query | None = None, foreach: ~piculet.Query | None = None, transforms: list[str] = <factory>, _transforms: list[~collections.abc.Callable[[~typing.Any], ~typing.Any]] = <factory>, path: ~piculet.Query)¶
Bases:
ExtractorAn extractor that produces a single value.
- extract(node: _Element | dict[str, Any]) Any¶
Extract data from a node.
- class piculet.Query(path: str)¶
Bases:
objectA query based on XPath or JMESPath.
Expressions starting with
/or./are assumed to be XPath. and others are assumed to be JMESPath.- apply(node: _Element | dict[str, Any]) Any¶
Apply this query to a node.
If this is an XPath query, it should return a list of texts, which will be concatenated.
- get(node: _Element | dict[str, Any]) _Element | dict[str, Any]¶
Get the first node matched by applying this query to a node.
- path: str¶
Path expression to apply to nodes.
- select(node: _Element | dict[str, Any]) list[_Element | dict[str, Any]]¶
Get all nodes matched by applying this query to a node.
- class piculet.Rule(*, key: str | Picker, extractor: Picker | Collector, foreach: Query | None = None)¶
Bases:
objectA rule that generates key-value pairs from a node.
- apply(root: _Element | dict[str, Any]) dict[str, Any] | None¶
Apply this rule to a node.
- class piculet.Spec(*, root: ~piculet.Query | None = None, foreach: ~piculet.Query | None = None, transforms: list[str] = <factory>, _transforms: list[~collections.abc.Callable[[~typing.Any], ~typing.Any]] = <factory>, rules: list[~piculet.Rule] = <factory>, pre: list[str] = <factory>, post: list[str] = <factory>, _pre: list[~collections.abc.Callable[[~lxml.etree._Element | dict[str, ~typing.Any]], ~lxml.etree._Element | dict[str, ~typing.Any]]] = <factory>, _post: list[~collections.abc.Callable[[dict[str, ~typing.Any]], dict[str, ~typing.Any]]] = <factory>)¶
Bases:
CollectorA scraping specification.
- extract(root: _Element | dict[str, Any])¶
Extract data from a node.
- post: list[str]¶
Names of postprocessor functions.
- postprocess(data: dict[str, Any]) dict[str, Any]¶
Apply the postprocessors to the collected data.
- pre: list[str]¶
Names of preprocessor functions.
- preprocess(root: _Element | dict[str, Any]) _Element | dict[str, Any]¶
Apply the preprocessors to the root node.
- scrape(document: str | _Element | dict[str, Any], *, doctype: Literal['html', 'xml', 'json']) dict[str, Any]¶
Scrape a document.
- piculet.build_tree(document: str, doctype: Literal['html', 'xml', 'json']) _Element | dict[str, Any]¶
Convert a document to a tree.
- piculet.deserialize(value: Any, type_: Type[T], **kwargs) T¶
Generate an object from a dictionary.
- piculet.load_spec(content: ~typing.Mapping[str, ~typing.Any], *, type_: type = <class 'piculet.Spec'>, transformers: ~typing.Mapping[str, ~collections.abc.Callable[[~typing.Any], ~typing.Any]] | None = None, preprocessors: ~typing.Mapping[str, ~collections.abc.Callable[[~lxml.etree._Element | dict[str, ~typing.Any]], ~lxml.etree._Element | dict[str, ~typing.Any]]] | None = None, postprocessors: ~typing.Mapping[str, ~collections.abc.Callable[[dict[str, ~typing.Any]], dict[str, ~typing.Any]]] | None = None) Spec¶
Deserialize a mapping into a scraping specification.
- piculet.serialize(value: Any, **kwargs) Any¶
Generate a dictionary from an object.