Data extraction
===============

.. note::

   The example uses an HTML document in combination with XPath queries.
   JSON documents in combination with JMESPath queries conceptually work
   the same way, differing only in XPath/JMESPath related details.

This section explains how to write the rules for extracting the data.
We'll scrape the following HTML content for the movie "The Shining"
in our examples:

.. literalinclude:: ../tests/shining.html
   :language: html

Assuming the HTML document above is saved as :file:`shining.html`,
let's get its contents:

.. code-block:: python

   from pathlib import Path

   document = Path("shining.html").read_text()

Rules
-----

A specification contains a list of rules.
Each rule specifies the name of a piece of data (its key)
and how its value will be extracted (its extractor).
In the simplest case, the extractor is a path query.

For example, to get the title of the movie from the example document,
we can write the following rule:

.. code-block:: python

   rule = {"key": "title", "extractor": {"path": "//title/text()"}}

Next, we use this rule in a specification
that we load using the :func:`load_spec <piculet.load_spec>` function:

.. code-block:: python

   from piculet import load_spec

   spec = load_spec({"rules": [rule]})

Now that we have the document and the specification,
we can use the :meth:`scrape <piculet.Spec.scrape>` method
to extract data from the document:

.. code-block:: python

   data = spec.scrape(document, doctype="html")

   # data:
   {"title": "The Shining"}

The XPath query must be written so that it returns a list of text strings.
These strings are joined to produce the value.
For example:

.. code-block:: python

   rule = {"key": "full_title", "extractor": {"path": "//h1//text()"}}

   spec = load_spec({"rules": [rule]})
   data = spec.scrape(document, doctype="html")

   # data:
   {"full_title": "The Shining (1980)"}
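
The joining step can be pictured with the standard library alone.
The following is a minimal sketch, not Piculet's implementation:
the ``fragment`` markup is a hypothetical stand-in for the heading
in :file:`shining.html`, and joining without a separator is an assumption
based on the output shown above.

.. code-block:: python

   import xml.etree.ElementTree as ET

   # Hypothetical heading markup; the real document may differ.
   fragment = '<h1>The Shining (<span class="year">1980</span>)</h1>'
   heading = ET.fromstring(fragment)

   # itertext() yields every text node under the element in document order,
   # comparable to what a query like //h1//text() selects.
   texts = list(heading.itertext())

   # Assumption: the pieces are joined without a separator.
   full_title = "".join(texts)

   # texts: ["The Shining (", "1980", ")"]
   # full_title: "The Shining (1980)"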

Multiple items can be collected in a single invocation:

.. code-block:: python

   rules = [
       {"key": "title", "extractor": {"path": "//title/text()"}},
       {"key": "country", "extractor": {"path": "//div[@class='info'][1]/p/text()"}}
   ]

   spec = load_spec({"rules": rules})
   data = spec.scrape(document, doctype="html")

   # data:
   {"title": "The Shining", "country": "United States"}

If a rule doesn't produce a value, the item will be excluded from the output.
Note that in the following example, there's no ``foo`` key in the result:

.. code-block:: python

   rules = [
       {"key": "title", "extractor": {"path": "//title/text()"}},
       {"key": "foo", "extractor": {"path": "//foo/text()"}}
   ]

   spec = load_spec({"rules": rules})
   data = spec.scrape(document, doctype="html")

   # data:
   {"title": "The Shining"}
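
Because missing items are simply absent, code that consumes the result
should probably not index optional keys directly.
A minimal sketch, using a plain dictionary copied from the output above
in place of a real ``scrape`` call:

.. code-block:: python

   # Copied from the example output above, not produced by running Piculet.
   data = {"title": "The Shining"}

   title = data["title"]  # present in this example
   foo = data.get("foo")  # None, because the rule produced no value
   foo_or_default = data.get("foo", "n/a")  # fall back to a default

Whether ``None`` or a default is the better fallback depends on
how the extracted data is used afterwards.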

Transforming results
--------------------

Extractors can apply transformations to the values they obtain.
Each transformation has a name and an associated function.
We tell the extractor to apply a function by listing its name
under the ``transforms`` key of the extractor.
To match the transformer names to their functions,
a lookup map has to be provided when the specification is loaded.

For example, the following rule for the movie year would produce a string::

  {"key": "year", "extractor": {"path": "//span[@class='year']/text()"}}

To convert that value to an integer,
let's define and use an ``int`` transformer:

.. code-block:: python

   rule = {
       "key": "year",
       "extractor": {
           "path": "//span[@class='year']/text()",
           "transforms": ["int"]
       }
   }

   transformers = {"int": int}
   spec = load_spec({"rules": [rule]}, transformers=transformers)
   data = spec.scrape(document, doctype="html")

   # data:
   {"year": 1980}

Multiple transformations are applied in the order they are listed:

.. code-block:: python

   rule = {
       "key": "title",
       "extractor": {
           "path": "//title/text()",
           "transforms": ["remove_spaces", "titlecase"]
       }
   }

   transformers = {
       "titlecase": str.title,
       "remove_spaces": lambda s: s.replace(" ", "")
   }
   spec = load_spec({"rules": [rule]}, transformers=transformers)
   data = spec.scrape(document, doctype="html")

   # data:
   {"title": "Theshining"}
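
The order matters because each function receives the output
of the previous one.
The sketch below mimics that chaining in plain Python;
``apply_transforms`` is a hypothetical helper written for illustration
and is not part of Piculet.

.. code-block:: python

   from functools import reduce

   transformers = {
       "titlecase": str.title,
       "remove_spaces": lambda s: s.replace(" ", ""),
   }

   def apply_transforms(value, names):
       """Feed the value through the named functions, left to right."""
       return reduce(lambda acc, name: transformers[name](acc), names, value)

   listed = apply_transforms("The Shining", ["remove_spaces", "titlecase"])
   swapped = apply_transforms("The Shining", ["titlecase", "remove_spaces"])

   # listed: "Theshining"
   # swapped: "TheShining"

Removing the spaces first leaves a single word, so ``str.title``
lowercases the inner ``S``; swapping the two names preserves it.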

Multivalued results
-------------------

Data with multiple values can be created by using a ``foreach`` key
in the extractor.
Its value should be a path expression that selects elements from the tree.
After the elements are selected, the query in the ``path`` key
is applied *to each element*,
and the obtained values are collected in the resulting list.
For example, to get the genres of the movie, we can write:

.. code-block:: python

   rule = {
       "key": "genres",
       "extractor": {
           "foreach": "//ul[@class='genres']/li",
           "path": "./text()"
       }
   }

   spec = load_spec({"rules": [rule]})
   data = spec.scrape(document, doctype="html")

   # data:
   {"genres": ["Horror", "Drama"]}

If the ``foreach`` key doesn't match any element, the item will be excluded
from the result:

.. code-block:: python

   rules = [
       {
           "key": "title",
           "extractor": {"path": "//title/text()"}
       },
       {
           "key": "foos",
           "extractor": {
               "foreach": "//ul[@class='foos']/li",
               "path": "./text()"
           }
       }
   ]

   spec = load_spec({"rules": rules})
   data = spec.scrape(document, doctype="html")

   # data:
   {"title": "The Shining"}

If a transformation is specified, it will be applied to *each element*
in the resulting list:

.. code-block:: python

   rule = {
       "key": "genres",
       "extractor": {
           "foreach": "//ul[@class='genres']/li",
           "path": "./text()",
           "transforms": ["lower"]
       }
   }

   transformers = {"lower": str.lower}
   spec = load_spec({"rules": [rule]}, transformers=transformers)
   data = spec.scrape(document, doctype="html")

   # data:
   {"genres": ["horror", "drama"]}
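
Conceptually, ``foreach`` selects the elements, ``path`` extracts a value
from each of them, and the transforms run once per value.
The following standard-library sketch imitates that flow;
it is not how Piculet is implemented, and ``fragment`` is a hypothetical
stand-in for the genre list in :file:`shining.html`.

.. code-block:: python

   import xml.etree.ElementTree as ET

   # Hypothetical markup; the real document may differ.
   fragment = (
       '<div><ul class="genres">'
       "<li>Horror</li><li>Drama</li>"
       "</ul></div>"
   )
   tree = ET.fromstring(fragment)

   # Step 1 ("foreach"): select the elements to iterate over.
   items = tree.findall(".//ul[@class='genres']/li")

   # Steps 2 and 3 ("path", then "transforms"): one value per element.
   genres = [str.lower(item.text) for item in items]

   # genres: ["horror", "drama"]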

Subrules
--------

Nested structures can be created by writing subrules as extractors.
If the extractor contains a ``rules`` key instead of a ``path`` key,
its contents are interpreted as subrules,
and the mapping they generate becomes the value for the key.

.. code-block:: python

   rule = {
       "key": "director",
       "extractor": {
           "rules": [
               {
                   "key": "name",
                   "extractor": {"path": "//div[@class='director']//a/text()"}
               },
               {
                   "key": "link",
                   "extractor": {"path": "//div[@class='director']//a/@href"}
               }
           ]
       }
   }

   spec = load_spec({"rules": [rule]})
   data = spec.scrape(document, doctype="html")

   # data:
   {"director": {"name": "Stanley Kubrick", "link": "/people/1"}}

Extractors can select a different node as the root before applying the query.
This can improve readability and performance.
The ``root`` key has to be a path that selects the root for the operation.
If it returns multiple nodes, the first one will be selected.
The above rule is equivalent to:

.. code-block:: python

   rule = {
       "key": "director",
       "extractor": {
           "root": "//div[@class='director']//a",
           "rules": [
               {
                   "key": "name",
                   "extractor": {"path": "./text()"}
               },
               {
                   "key": "link",
                   "extractor": {"path": "./@href"}
               }
           ]
       }
   }
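
The effect of ``root`` can be imitated with the standard library:
pick the first matching node once, then run the remaining queries
relative to it.
This is only a sketch under assumptions, not Piculet's implementation,
and ``fragment`` is a hypothetical stand-in for the director markup
in :file:`shining.html`.

.. code-block:: python

   import xml.etree.ElementTree as ET

   # Hypothetical markup; the real document may differ.
   fragment = (
       '<body><div class="director">'
       '<a href="/people/1">Stanley Kubrick</a>'
       "</div></body>"
   )
   tree = ET.fromstring(fragment)

   # find() returns only the first match, similar to how "root"
   # selects the first node when the path matches several.
   link_node = tree.find(".//div[@class='director']/a")

   # The subrules then only need to look at the selected node.
   director = {"name": link_node.text, "link": link_node.get("href")}

   # director: {"name": "Stanley Kubrick", "link": "/people/1"}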

Subrules can be combined with multivalued results:

.. code-block:: python

   rule = {
       "key": "cast",
       "extractor": {
           "foreach": "//table[@class='cast']/tr",
           "rules": [
               {"key": "name", "extractor": {"path": "./td[1]/a/text()"}},
               {"key": "character", "extractor": {"path": "./td[2]/text()"}}
           ]
       }
   }

   spec = load_spec({"rules": [rule]})
   data = spec.scrape(document, doctype="html")

   # data:
   {
     "cast": [
       {"name": "Jack Nicholson", "character": "Jack Torrance"},
       {"name": "Shelley Duvall", "character": "Wendy Torrance"}
     ]
   }

Moving the root takes place before selecting the elements using ``foreach``.
The rule given above is equivalent to:

.. code-block:: python

   rule = {
       "key": "cast",
       "extractor": {
           "root": "//table[@class='cast']",
           "foreach": "./tr",
           "rules": [
               {"key": "name", "extractor": {"path": "./td[1]/a/text()"}},
               {"key": "character", "extractor": {"path": "./td[2]/text()"}}
           ]
       }
   }

Subitems can also be transformed.
The transformation functions are always applied as the last step
in an extraction,
so the first transformer receives the generated mapping as its argument.

.. code-block:: python

   rule = {
       "key": "cast",
       "extractor": {
           "foreach": "//table[@class='cast']/tr",
           "rules": [
               {"key": "name", "extractor": {"path": "./td[1]/a/text()"}},
               {"key": "character", "extractor": {"path": "./td[2]/text()"}}
           ],
           "transforms": ["stars"]
       }
   }

   transformers = {"stars": lambda x: "%(name)s as %(character)s" % x}
   spec = load_spec({"rules": [rule]}, transformers=transformers)
   data = spec.scrape(document, doctype="html")

   # data:
   {
     "cast": [
       "Jack Nicholson as Jack Torrance",
       "Shelley Duvall as Wendy Torrance"
     ]
   }
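
A lambda works here, but a named function can be easier to document
and to test on its own.
The sketch below assumes, as described above, that the transformer
receives the generated mapping as a dictionary-like object;
the ``sample`` value is copied from the earlier cast output.

.. code-block:: python

   def stars(item):
       """Format one cast mapping as '<name> as <character>'."""
       return "{name} as {character}".format_map(item)

   transformers = {"stars": stars}

   # Sample mapping copied from the earlier example output.
   sample = {"name": "Jack Nicholson", "character": "Jack Torrance"}
   credit = stars(sample)

   # credit: "Jack Nicholson as Jack Torrance"

The ``transformers`` map can then be passed to ``load_spec``
in the same way as in the example above.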

Generating keys from content
----------------------------

You can generate items whose keys also come from the content.
For example, consider how you would get the country and the language
of the movie.
Instead of writing a separate item for each ``h3`` element
under a ``div`` element with an ``info`` class,
we can write a single item that selects these divs
and uses the ``h3`` text as the key.

This method requires locating the elements that contain both the key
and the value (in this example, the ``div`` elements).
These elements are selected using a ``foreach`` specification.
The key and value extractors are then applied to each selected element.

.. code-block:: python

   rule = {
       "foreach": "//div[@class='info']",
       "key": {"path": "./h3/text()"},
       "extractor": {"path": "./p/text()"}
   }

   spec = load_spec({"rules": [rule]})
   data = spec.scrape(document, doctype="html")

   # data:
   {"Country": "United States", "Language": "English"}

Like values, keys can also be transformed:

.. code-block:: python

   rule = {
       "foreach": "//div[@class='info']",
       "key": {"path": "./h3/text()", "transforms": ["lower"]},
       "extractor": {"path": "./p/text()"}
   }

   transformers = {"lower": str.lower}
   spec = load_spec({"rules": [rule]}, transformers=transformers)
   data = spec.scrape(document, doctype="html")

   # data:
   {"country": "United States", "language": "English"}
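
The same idea can be sketched with the standard library:
iterate over the containing elements and build one key/value pair
from each.
This is an illustration only, not Piculet's implementation,
and ``fragment`` is a hypothetical stand-in for the ``info`` blocks
in :file:`shining.html`.

.. code-block:: python

   import xml.etree.ElementTree as ET

   # Hypothetical markup; the real document may differ.
   fragment = (
       "<body>"
       '<div class="info"><h3>Country</h3><p>United States</p></div>'
       '<div class="info"><h3>Language</h3><p>English</p></div>'
       "</body>"
   )
   tree = ET.fromstring(fragment)

   # One pair per selected div: the h3 text (lowercased) is the key,
   # the p text is the value.
   info = {
       div.find("h3").text.lower(): div.find("p").text
       for div in tree.findall(".//div[@class='info']")
   }

   # info: {"country": "United States", "language": "English"}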