5.7. Load HTML

This section describes procedures that can be used to import data from HTML pages.

Scraping Data from HTML Pages

apoc.load.html('url',{name: jquery, name2: jquery}, config) YIELD value

Loads an HTML page and returns the result as a map.

This procedure provides a convenient API for extracting data from a page using DOM, CSS, and jQuery-like selectors. It relies on the jsoup library.

CALL apoc.load.html(url, {name: <css/dom query>, name2: <css/dom query>}, {config}) YIELD value

The procedure yields a map. Each key is a name from the query map, and each value is the list of DOM elements matched by that selector, i.e.

{name: <list of elements>, name2: <list of elements>}
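To work with the matched elements one at a time, the list under each key can be unwound into rows. The following is a minimal sketch, not a definitive recipe: it assumes each element is itself a map exposing `tagName`, `text`, and `attributes` keys, which may differ between APOC versions, and the key name `h2` is simply the name chosen in the query map.

```cypher
// Sketch: turn the list stored under the "h2" key into one row per element.
// Assumption: each element map has tagName, text and attributes keys.
CALL apoc.load.html("https://en.wikipedia.org/", {h2: "h2"}) YIELD value
UNWIND value.h2 AS heading
RETURN heading.tagName    AS tag,
       heading.text       AS text,
       heading.attributes AS attributes
```

Inspecting a single element first (for example with `RETURN value.h2[0]`) is a simple way to confirm which keys your APOC version actually returns.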

5.7.1. Config

The config parameter is optional; its default value is an empty map.

charset

Default: UTF-8

baseUri

Default: "". It is used to resolve relative paths.
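Both options can be passed together in the config map. The sketch below is illustrative only: the selector `a[href]` and the key name `anchors` are choices made for this example, the element keys `text` and `attributes` are assumed, and whether `href` values come back resolved against the base URI may depend on the APOC version in use.

```cypher
// Sketch: pass charset and a base URI in the optional config map.
// Assumption: element maps expose text and attributes keys.
CALL apoc.load.html(
  "https://en.wikipedia.org/",
  {anchors: "a[href]"},
  {charset: "UTF-8", baseUri: "https://en.wikipedia.org/"}
) YIELD value
UNWIND value.anchors AS anchor
RETURN anchor.text AS text, anchor.attributes.href AS href
LIMIT 10
```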

5.7.2. Example with real data

The examples below use the Wikipedia home page.

CALL apoc.load.html("https://en.wikipedia.org/",{metadata:"meta", h2:"h2"})

You will get this result:

[Image: apoc.load.htmlall]

CALL apoc.load.html("https://en.wikipedia.org/",{links:"link"})

You will get this result:

[Image: apoc.load.htmllinks]

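The `link` elements returned by the query above carry their useful data in attributes rather than in text. As a hedged sketch, assuming each element map has an `attributes` key holding the tag's attributes, the stylesheet links could be picked out like this:

```cypher
// Sketch: keep only <link> elements whose rel attribute is "stylesheet".
// Assumption: attributes is a map of attribute name to attribute value.
CALL apoc.load.html("https://en.wikipedia.org/", {links: "link"}) YIELD value
UNWIND value.links AS link
WITH link
WHERE link.attributes.rel = "stylesheet"
RETURN link.attributes.href AS href
```

The same filter could also be expressed in the selector itself (for example `link[rel=stylesheet]`), which keeps the filtering inside jsoup.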
CALL apoc.load.html("https://en.wikipedia.org/",{metadata:"meta", h2:"h2"}, {charset: "UTF-8"})

You will get this result:

[Image: apoc.load.htmlconfig]