Downloading an Archive with exoskeleton

| last update: 2020-07-08

While writing my dissertation I stumbled upon an online government archive with millions of documents in the public domain. A quick search yielded dozens of relevant texts, yet there is no research that makes systematic use of them. This might be due to one problem: no one applied OCR to those PDF files. My search only returned matches if the keywords appeared in the title.

Fortunately, the documents are at least organized in collections and have a creation date. This made it possible to develop heuristics that reduced the number of potentially relevant files to tens of thousands. Still, I have neither the time nor the interest to read them all.

Making those documents searchable would be a major improvement. I would have to download all files that could be relevant and apply OCR to them, while keeping them organized. This meant writing a crawler. That crawler would have to be polite, as maxing out the connection might bring down the archive’s servers, and some of the files are large. Therefore, I decided to download at most two files per minute. Working through 100,000 files at this rate means the crawler has to run for roughly 833 hours, or about 35 days straight. Some problems are bound to occur in that month, so the crawler must be able to recover from errors and tenaciously continue its work.

The exoskeleton library was built for this, quite literally: I found myself writing similar code for different steps of this task, so I decided to turn it into a library that makes my code leaner and more coherent.

This post describes how to use exoskeleton, a Python library with a MariaDB database backend, to create a crawler with the aforementioned properties and persistence. Follow-up posts will describe how I made the collection searchable and applied machine learning to detect relevant documents.

Installation and Other Preparations

Laptop versus Server

The first choice is where to run the bot. The operating system can be Linux, Mac, or Windows, as exoskeleton is written in Python. I could have used my laptop, since exoskeleton is tolerant of interruptions and can pick up the job again without losing progress. I chose a server because it runs without interruptions and therefore finishes faster.

Install versus Docker

The project needs a MariaDB database, and there is an SQL script to create it. Exoskeleton itself can be installed via pip. Just follow the instructions on the project page.

Optional: Create an Info Page

Despite my intentions and precautions, the bot might cause problems for the site I crawl. The operators might also simply be interested in why their site is being crawled.

Therefore, I created a small info page on my personal website. It explains in two sentences what the bot is supposed to do and contains my contact details. The URL of this page is included in the user agent string of the bot.
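
For illustration, a user agent string of the following form points site operators to such an info page. The bot name and URL here are placeholders, not the values I actually used:


# Placeholder example of a descriptive user agent string that links
# to the info page (the name and URL are made up):
BOT_USER_AGENT = 'DissertationArchiveBot/0.1 (+https://www.example.org/bot-info.html)'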

Gathering Search Results

The site has a built-in search function. Good search queries narrow down the number of documents to scrape.

The search result pages link to detail pages of the documents that contain metadata like title, creation date, collection, and more. These detail pages also contain a link to the document itself.

So the first step is to scrape the search results to get a list of detail pages. A possible problem could be that one and the same document shows up in different search queries under different URLs and gets downloaded multiple times. However, I checked, and the URL always has the form https://www.example.com/details-uniqueDocID.html. If the URL contained dynamic GET parameters like https://www.example.com/details-uniqueDocID.html?search-term=foo, it would be necessary to strip those, for example with urllib.parse from the Python standard library. As the URL never changes here, exoskeleton takes care of this automatically and prevents duplicate downloads.
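
For completeness, stripping such GET parameters with the standard library could look roughly like this sketch (the URL is the made-up example from above):


# Sketch: remove GET parameters from a URL with urllib.parse
from urllib.parse import urlparse, urlunparse

url = 'https://www.example.com/details-uniqueDocID.html?search-term=foo'
parts = urlparse(url)
# keep scheme, host and path; drop params, query and fragment
clean_url = urlunparse((parts.scheme, parts.netloc, parts.path, '', '', ''))
print(clean_url)  # https://www.example.com/details-uniqueDocID.html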

First, set up exoskeleton and run the script once to check the settings.


#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# get-detail-pages.py : listing 1 of 4

"""
Crawls the SERP pages to get a list of links to the pages
with the document detail view.
"""

import logging

from bs4 import BeautifulSoup
import exoskeleton

logging.basicConfig(level=logging.DEBUG)

exo = exoskeleton.Exoskeleton(
    database_settings={'database': 'example',
                       'username': 'admin',
                       'passphrase': ''},
    target_directory='/home/user/example-files'
)

# continued below

Now we have to extend this script to loop over the search results. The results are not all shown on one page; they are paginated. So we find all links to documents on the current page, find the link to the next page, and repeat with that one. We are done once no next page is defined.

Job Management in Exoskeleton

Some searches yield a couple of hundred search result pages. As the crawler is deliberately slow, something might go wrong while scraping them. Starting all over would not create duplicates, but it would cost time. To avoid this, exoskeleton has job management functionality. A job consists of four things:

  • a name to reference it
  • a start URL
  • the current URL / the URL last crawled
  • a state (done or not done)

So each time the bot moves to the next search result page, it updates the current URL of the job. If it fails on page 799 of 800, it can see on restart that it already reached page 799 and continue from there.
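
In code, the whole job lifecycle boils down to the handful of calls used in the listings below. A sketch with placeholder URLs:


# Sketch of the job lifecycle (placeholder URLs):
exo.job_define_new('Search Keyword X',                       # name of the job
                   'https://www.example.com/search-page-1')  # start URL
page_to_crawl = exo.jobs.get_current_url('Search Keyword X') # pick up where we left off
exo.job_update_current_url('Search Keyword X',               # store progress after each page
                           'https://www.example.com/search-page-2')
exo.job_mark_as_finished('Search Keyword X')                 # state: done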

The analysis of the pages to extract the detail page URLs is done with Beautiful Soup. Its select function makes it possible to select elements with CSS selector syntax.
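
As a tiny illustration (the HTML snippet is made up, but the selector is the one used in the listing below):


# Minimal Beautiful Soup example using a CSS selector (made-up HTML):
from bs4 import BeautifulSoup

html = '<td class="views-field-label"><a href="/details-42.html">Doc</a></td>'
links = BeautifulSoup(html, 'lxml').select('td.views-field-label a')
print(links[0]['href'])  # prints: /details-42.html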

It might later be relevant to see in which search query a document showed up. For this, the label functionality is used to attach labels like “search keyword X” when the detail page is added to the queue. If the document shows up in another search, it will, as described, not be downloaded again, but any new label will be added. So a document that shows up in five different search result sets ends up with five different labels.
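
As a sketch of that behavior with a made-up URL: queuing the same detail page from two different searches stores it only once, but both job labels end up attached to it.


# Sketch: the same URL queued twice is stored once, the labels accumulate
exo.add_save_page_code('https://www.example.com/details-42.html',
                       labels_version={'Search Keyword X', 'detail page'})
exo.add_save_page_code('https://www.example.com/details-42.html',
                       labels_version={'Search Keyword Y', 'detail page'})
# => one stored page carrying the labels of both searches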

The result of those deliberations is this function which extends the script above:


# continued from above
# get-detail-pages.py : listing 2 of 4

def crawl(job: str):
    "Loop until next_page is undefined."

    # all links are relative, so we have to extend them
    url_base = 'https://www.example.com'

    while True:

        try:
            # pick up where we left off
            page_to_crawl = exo.jobs.get_current_url(job)
        except RuntimeError:
            logging.info('The job is already marked as finished.')
            break

        # get the page content and parse it with lxml
        soup = BeautifulSoup(exo.return_page_code(page_to_crawl), 'lxml')

        # extract all relevant links to detail pages
        urls = soup.select("td.views-field-label a")

        if urls:
            # loop over all URLs and add the base
            # then add them to the queue
            for i in urls:
                full_url = f"{url_base}{i['href']}"

                # Print found URL as visual progress indicator:
                print(full_url)
                # Add the job name and 'detail page' as labels
                exo.add_save_page_code(full_url,
                                       labels_version={job, 'detail page'},
                                       prettify_html=True)

        # check whether a next page is defined for the SERPs
        next_page = soup.select("li.pager-next a")

        if next_page:
            # next_page is defined: prepend the base URL
            next_page = f"{url_base}{next_page[0]['href']}"
            # store the progress in the job and wait politely
            exo.job_update_current_url(job, next_page)
            exo.random_wait()
        else:
            # next page is *not* defined
            exo.job_mark_as_finished(job)
            break  # finish the while loop
    print('done')

# continued below

To start the crawl, define one or more jobs with the first search result page as the initial URL, then pass their names to the crawl function.


# continued from above
# get-detail-pages.py : listing 3 of 4

exo.job_define_new('Search Keyword X',
                   'https://www.example.com/search-view?keyword=X')
crawl('Search Keyword X')

exo.job_define_new('Search Keyword Y',
                   'https://www.example.com/search-view?keyword=Y')
crawl('Search Keyword Y')

# continued below

If a job is already marked as done, that is recognized right at the start of the loop.

Now that all URLs of the detail pages are in the queue, they have to be downloaded into the database so the metadata and the link to the document can be scraped. This is just a single command:


# continued from above
# get-detail-pages.py : listing 4 of 4

exo.process_queue()

Scraping Metadata

The archive kindly provides a lot of metadata for each document. This information has to be extracted as it is surely relevant. However, as every project is different, exoskeleton has no built-in functionality to store it. So the database has to be extended with a separate table:


-- change to the name of your database:
USE exoskeleton;

CREATE TABLE docDetails (
    pageUrl VARCHAR(1023)
    ,docTitle VARCHAR(1023)
    ,docType VARCHAR(1023)
    ,collection VARCHAR(1023)
    ,attachmentUrl VARCHAR(1023)
    -- left out a dozen more fields
    ) ENGINE=InnoDB;

Now a second script fills those fields by analyzing all downloaded detail pages. The docDetails table lives in the exoskeleton database, and we want to add the file downloads to the queue. Therefore, this script loads exoskeleton as before. However, some libraries have to be added:

  • userprovided: a sister project of exoskeleton that provides functionality to convert the dates found on the detail pages.
  • re for regular expressions.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# analyze-detail-page.py : listing 1 of 4

"""
The first script already downloaded all detail pages.
This script analyses them and stores all information
in a custom table.
Then the actual file is added to the download queue.
"""

import logging
import re

from bs4 import BeautifulSoup

import exoskeleton
import userprovided

logging.basicConfig(level=logging.DEBUG)

exo = exoskeleton.Exoskeleton(
    database_settings={'database': 'example',
                       'username': 'admin',
                       'passphrase': ''},
    target_directory='/home/user/example-files'
)

# continued below

The analysis can again be done with Beautiful Soup. There is just one caveat: some fields are not available for all documents. If Beautiful Soup tries to access a missing element, it fails with an exception. That has to be handled. This results in a function:


# continued from above
# analyze-detail-page.py : listing 2 of 4

def analyze_page(page_content: str,
                 url: str,
                 file_labels: set):
    "Extract the metadata from a detail page and queue the file download."

    soup = BeautifulSoup(page_content, 'lxml')

    # Store the result set in a dictionary
    rs = dict()

    rs['pageUrl'] = url

    docTitle = soup.find("h1", class_="documentFirstHeading")
    rs['docTitle'] = docTitle.string if docTitle is not None else ''

    try:
        docType = soup.find("div", class_='field-name-field-taxonomy-doc-type').a
        rs['docType'] = docType.string if docType.string != '' else None
    except AttributeError:
        # the field is missing on this detail page
        rs['docType'] = None

    try:
        collection = soup.find("div", class_='field-name-field-collection').a
        rs['collection'] = collection.string if collection.string != '' else None
    except AttributeError:
        rs['collection'] = None

    try:
        docPublicationDate = soup.find(
            "div", class_='field-name-field-pub-date').find_all('div')[2]
        date_text = docPublicationDate.get_text()
        rs['docPublicationDate'] = (userprovided.long_to_short_date(date_text)
                                    if date_text != '' else None)
    except (AttributeError, IndexError):
        rs['docPublicationDate'] = None

    try:
        contentType = soup.find(
            "div", class_='field-name-field-content-type').find_all('div')[2]
        rs['contentType'] = (contentType.get_text()
                             if contentType.get_text() != '' else None)
    except (AttributeError, IndexError):
        rs['contentType'] = None

    try:
        attachment = str(soup.find(
            "div", class_='field-name-field-file').find_all('div')[2])
        # get_text() would throw away the HTML code we need for the regex
        PATTERN = re.compile(
            r'href="(?P<href>.*?)" type="(?P<type>.*?); length=(?P<length>\d*)')
        rs['attachmentUrl'] = re.search(PATTERN, attachment).group('href')
    except Exception:
        logging.warning('Exception while trying to find the file.', exc_info=True)
        return  # do NOT write results for this page

    # Construct a query
    sql = ("INSERT INTO docDetails (pageUrl, docTitle, docType, " +
           "collection, attachmentUrl) " +
           "VALUES (%s, %s, %s, %s, %s);")

    # Insert results into the custom table
    exo.cur.execute(sql, (rs['pageUrl'],
                    rs['docTitle'],
                    rs['docType'],
                    rs['collection'],
                    rs['attachmentUrl'])
                    )
    # Queue the file for download:
    exo.add_file_download(rs['attachmentUrl'],
                          labels_master=file_labels)
    return

# continued below

The actual function is much longer as there are more fields; most of it follows the same pattern as the fields above. Now all that remains is looping over the already downloaded detail pages. We identify them by the label ‘detail page’ we gave them in the first step and mark the ones we analyzed by giving them a new label. Again, some custom SQL is inevitable.


# continued from above
# analyze-detail-page.py : listing 3 of 4

# Get the UUIDs of all detail pages via the label.
# processed_only=True restricts this to pages that have
# already been downloaded, because a UUID can also
# reference an item still waiting in the queue. If you run
# this script in parallel with the one that downloads the
# detail pages, run it again once all pages are downloaded:
detail_pages = exo.version_uuids_by_label('detail page',
                                          processed_only=True)
# Loop over the pages found
for page in detail_pages:
    # get the content / source code of the page
    exo.cur.execute('SELECT pageContent ' +
                    'FROM fileContent ' +
                    'WHERE versionID = %s;',
                    (page, ))
    page_content = exo.cur.fetchone()
    page_content = page_content[0]

    # Get the page URL which is currently linked
    # to the downloaded page:
    exo.cur.execute('SELECT url ' +
                    'FROM fileMaster ' +
                    'WHERE id = ( ' +
                    '    SELECT fileMasterID ' +
                    '    FROM fileVersions ' +
                    '    WHERE id = %s)',
                    (page, ))
    page_url = exo.cur.fetchone()
    page_url = page_url[0]

    # which labels are attached?
    page_labels = exo.labels.version_labels_by_uuid(page)
    # Remove the label 'detail page' from the set as we want
    # to attach these labels to the new queue item, i.e.
    # the file we download:
    page_labels.remove('detail page')

    # Analyze the page and add the file to the queue:
    if page_content is not None:
        analyze_page(page_content, page_url, page_labels)
        # To mark the detail page as processed, exchange the label:
        exo.labels.remove_labels_from_uuid(page, 'detail page')
        exo.labels.assign_labels_to_uuid(page, 'processed detail page')

    # mark progress visually
    print('.')


# continued below

Downloading All Files

All documents have been added to the queue. Downloading them is again just a single command, extending the script above.


# continued from above
# analyze-detail-page.py : listing 4 of 4

exo.process_queue()

Now exoskeleton downloads all files. As this will take some time, you could activate progress reports via email.
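
If I remember correctly, the reports are configured through additional parameters of the Exoskeleton constructor. The parameter and key names in this sketch are assumptions from memory, so check the exoskeleton documentation for the exact names:


# ASSUMPTION: mail_settings / mail_behavior and their keys are taken from
# memory and may differ; consult the exoskeleton documentation.
exo = exoskeleton.Exoskeleton(
    database_settings={'database': 'example',
                       'username': 'admin',
                       'passphrase': ''},
    target_directory='/home/user/example-files',
    mail_settings={'server': 'smtp.example.com',
                   'port': 587,
                   'username': 'bot@example.com',
                   'passphrase': 'secret',
                   'recipient': 'me@example.com',
                   'sender': 'bot@example.com'},
    mail_behavior={'send_start_msg': True,
                   'send_finish_msg': True,
                   'milestone_num': 1000}  # mail a report every 1,000 items
)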