Overview

Scraping a document consists of three stages:

  1. Building a DOM tree out of the document. This is a straightforward operation for an XML document. For an HTML document, Piculet will first try to convert it into XHTML, and then build the tree from that.
  2. Preprocessing the tree. This stage is optional. In some cases it helps to make some changes to the tree first to simplify the extraction process.
  3. Extracting data out of the tree.
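
The three stages above can be sketched with nothing but the Python standard library. This is only an illustration of the idea, not Piculet's own API: the sample document, the `<ad>` element, and the function names `build_tree`, `preprocess`, and `extract` are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (well-formed XML, so stage 1 is simple).
DOCUMENT = """
<movie>
  <title>The Shining</title>
  <info>
    <year>1980</year>
  </info>
  <ad>Buy tickets now!</ad>
</movie>
"""


def build_tree(content):
    """Stage 1: parse the document into an element tree."""
    return ET.fromstring(content)


def preprocess(root):
    """Stage 2 (optional): simplify the tree, here by dropping <ad> elements."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "ad":
                parent.remove(child)
    return root


def extract(root):
    """Stage 3: pull the wanted values out of the tree."""
    return {
        "title": root.findtext("title"),
        "year": int(root.findtext("info/year")),
    }


root = preprocess(build_tree(DOCUMENT))
data = extract(root)
print(data)  # {'title': 'The Shining', 'year': 1980}
```

Keeping the stages as separate functions mirrors the description: the preprocessing step can be skipped without touching the other two.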

The preprocessing and extraction stages are expressed as part of a scraping specification. The specification is a mapping, so it can be stored in any file format that can represent a mapping, such as JSON or YAML. Details about the specification are given in later chapters.
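
To make the "specification as a mapping" idea concrete, here is a minimal sketch in which a mapping is loaded from JSON text and then drives the extraction. The format used here (output key mapped to an ElementTree path) is a hypothetical one invented for this example; it is not Piculet's specification format, which is described in later chapters. The `scrape` function and the sample document are likewise assumptions.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical specification: each output key maps to an ElementTree path.
# This is NOT Piculet's format, only a mapping that can be stored as JSON.
SPEC_JSON = '{"title": "title", "director": "crew/director"}'

# Hypothetical sample document.
DOCUMENT = (
    "<movie>"
    "<title>The Shining</title>"
    "<crew><director>Stanley Kubrick</director></crew>"
    "</movie>"
)


def scrape(content, spec):
    """Build the tree, then extract one value per key in the mapping."""
    root = ET.fromstring(content)
    return {key: root.findtext(path) for key, path in spec.items()}


spec = json.loads(SPEC_JSON)  # the stored text becomes a plain dict
result = scrape(DOCUMENT, spec)
print(json.dumps(result, indent=2, sort_keys=True))
```

Because the specification is plain data, the same `scrape` function would work unchanged if the mapping came from a YAML file instead.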

Command-line interface

The command-line interface reads the document from the standard input. After downloading the example files shining.html and movie.json, run the command:

$ cat shining.html | piculet -s movie.json

This should print the following output:

{
  "cast": [
    {
      "character": "Jack Torrance",
      "link": "/people/2",
      "name": "Jack Nicholson"
    },
    {
      "character": "Wendy Torrance",
      "link": "/people/3",
      "name": "Shelley Duvall"
    }
  ],
  "director": {
    "link": "/people/1",
    "name": "Stanley Kubrick"
  },
  "genres": [
    "Horror",
    "Drama"
  ],
  "language": "English",
  "review": "Fantastic movie. Definitely recommended.",
  "runtime": "144 minutes",
  "title": "The Shining",
  "year": 1980
}

For HTML documents, the --html option has to be used. For example, to extract some data from the Wikipedia page for David Bowie, download the wikipedia.json file and run the command:

$ curl -s "https://en.wikipedia.org/wiki/David_Bowie" | piculet -s wikipedia.json --html

This should print the following output:

{
  "birthplace": "Brixton, London, England",
  "born": "1947-01-08",
  "name": "David Bowie",
  "occupation": [
    "Singer",
    "songwriter",
    "actor"
  ]
}

In the same command, change the name part of the URL to Merlene_Ottey and you will get similar data for Merlene Ottey. Note that because the markup Wikipedia uses in pages about people varies, the kinds of data you get with this specification will also vary.

Piculet can also be used as a simplistic HTML-to-XHTML converter by invoking it with the --h2x option:

$ cat foo.html | piculet --h2x
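
As a rough idea of what a simplistic HTML-to-XHTML conversion involves, the sketch below uses the standard library's `html.parser` to close void elements, quote attribute values, escape text, and close any tags left open. It is an assumption-laden illustration and not Piculet's implementation, whose output may differ; the class name `XHTMLWriter`, the function `html_to_xhtml`, and the sample input are made up for this example, and comments and doctypes are simply dropped.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never have content or a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class XHTMLWriter(HTMLParser):
    """Collect well-formed markup while parsing lenient HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "alt") get their own name as value.
        rendered = "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )
        if tag in VOID:
            self.parts.append(f"<{tag}{rendered}/>")
        else:
            self.parts.append(f"<{tag}{rendered}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray closing tags and closing tags of void elements.
        if tag in VOID or tag not in self.open_tags:
            return
        # Close everything still open inside this element, then the element.
        while self.open_tags:
            current = self.open_tags.pop()
            self.parts.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        self.parts.append(escape(data))

    def close(self):
        super().close()
        # Close whatever the document left open.
        while self.open_tags:
            self.parts.append(f"</{self.open_tags.pop()}>")


def html_to_xhtml(content):
    writer = XHTMLWriter()
    writer.feed(content)
    writer.close()
    return "".join(writer.parts)


xhtml = html_to_xhtml('<p class=note>Tom & Jerry<br>Line two<img src="a.png" alt>')
print(xhtml)
```

The result can then be handed to an XML parser, which is the point of converting to XHTML before building the tree.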

YAML support

To write specifications in YAML, Piculet has to be installed with YAML support:

$ pip install "piculet[yaml]"

Note that this will install an external module for parsing YAML files.

The YAML version of the specification example above can be found in movie.yaml.