OverviewΒΆ

The processing of a document consists of several stages:

  1. Building a tree out of the document.

  2. (Optional) Preprocessing the tree.

  3. Extracting data out of the tree.

  4. (Optional) Postprocessing the collected data.

The extraction process generates a dictionary according to a specification that describes what the names of the keys will be and how their values will be extracted. The specification is itself a dictionary with the following keys:

pre

The list of names of preprocessors to apply to the document tree before extraction.

rules (required)

The list of rules to apply to the document tree to extract the data.

post

The list of names of postprocessors to apply to the obtained data after extraction.