Downloading an Archive with exoskeleton
Rüdiger Voigt | last update: 2020-07-08
While writing my dissertation I stumbled upon an online government archive with millions of documents in the public domain. A quick search yielded dozens of relevant texts, yet there is no research making systematic use of them. This might be due to one problem: nobody has applied OCR to those PDF files. My search therefore only returned matches if the keywords appeared in the title.
Fortunately, the documents are at least organized in collections and have a creation date. This made it possible to develop heuristics that reduced the number of potentially relevant files to tens of thousands. Still, I have neither the time nor the interest to read them all.
Making those documents searchable would be a major improvement. I would have to download all files that could be relevant and apply OCR to them, while keeping them organized. This meant writing a crawler. That crawler would have to be polite, as maxing out the connection might bring down the archive’s servers. Some files are large, so I decided to download at most two files per minute. Working through 100,000 files at this rate means the crawler needs to run for roughly 833 hours, or almost 35 days straight. Problems are bound to occur in such a month, so the crawler must be able to recover from errors and tenaciously continue its work.
The exoskeleton library was built for exactly this task: I found myself writing similar code for its different steps, so I decided to turn that code into a library, which makes my scripts leaner and more coherent.
This post describes how to use exoskeleton, a Python library with a MariaDB database backend, to create a crawler with the aforementioned properties and persistence. Follow-up posts will describe how I made the collection searchable and applied machine learning to detect relevant documents.
Install and other Preparations
Laptop versus Server
The first choice is where to run the bot. The operating system can be Linux, macOS, or Windows, as exoskeleton is written in Python. I could have used my laptop, since exoskeleton tolerates interruptions and can pick up the job again without losing progress. I chose a server because it runs without interruptions and therefore finishes faster.
Install versus Docker
The project needs a MariaDB database. There is an SQL script to create it. Exoskeleton itself can be installed via pip. Just follow the instructions on the project page.
Optional: Create an Info Page
Despite my intentions and precautions, the bot might inadvertently cause problems for the site I crawl. The operators might also simply be interested in why their site is being crawled.
Therefore, I created a small info page on my personal website. It explains in two sentences what the bot is supposed to do and contains my contact details. The URL of this page is included in the bot’s user agent string.
Gathering Search Results
The site has a built-in search function. Good search queries narrow down the number of documents to scrape.
The search result pages link to detail pages of the documents that contain metadata like title, creation date, collection, and more. These detail pages also contain a link to the document itself.
So the first step is to scrape the search results to get a list of detail pages. A possible problem: the same document might show up in different search queries under different URLs and could then be downloaded multiple times. However, I checked, and the URL always has the form https://www.example.com/details-uniqueDocID.html. If the URL contained dynamic GET parameters like https://www.example.com/details-uniqueDocID.html?search-term=foo, it would be necessary to strip them, for example with urllib.parse from the Python standard library. As the URL is stable here, exoskeleton automatically recognizes repeated URLs and prevents duplicate downloads.
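If stripping such parameters were necessary, a minimal sketch using only the standard library could look like this (the URL is of course just a stand-in):
from urllib.parse import urlsplit, urlunsplit

def strip_query(url: str) -> str:
    "Return the URL without query string and fragment."
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

print(strip_query('https://www.example.com/details-uniqueDocID.html?search-term=foo'))
# prints: https://www.example.com/details-uniqueDocID.html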
First set up exoskeleton and run it to check the settings.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# get-detail-pages.py : listing 1 of 4
"""
Crawls the SERP-pages to get a list of links of pages with
the document detail view.
"""
import logging
from bs4 import BeautifulSoup
import exoskeleton
logging.basicConfig(level=logging.DEBUG)
exo = exoskeleton.Exoskeleton(
    database_settings={'database': 'example',
                       'username': 'admin',
                       'passphrase': ''},
    target_directory='/home/user/example-files'
)
# continued below
Now we have to extend this script to loop over the search results. Not all results are shown on one page; they are paginated. So: find all links to documents on the current page, find the link to the next page, and repeat there. We are done once no next page is defined.
Job Management in Exoskeleton
Some searches yield a couple of hundred search result pages. As the crawler is deliberately slow, something might go wrong while scraping them. Starting all over would not create duplicates, but it would cost time. To avoid this, exoskeleton offers job management functionality. A job consists of four things:
- a name to reference it
- a start URL
- the current URL / the URL last crawled
- a state (done or not done)
Each time the bot moves on to the next search result page, it updates the job’s current URL. If it fails on page 799 of 800, it can look up on restart that it had already reached page 799 and continue from there.
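Expressed in code, the job lifecycle boils down to four calls, all of which reappear in the scripts further below. This is just a sketch with a made-up job name and URLs, not part of the actual script:
# Sketch of the job lifecycle (hypothetical job name and URLs):
exo.job_define_new('Example Job',                        # name
                   'https://www.example.com/serp?page=1')  # start URL
# the URL to crawl next; raises RuntimeError once the job is finished:
page_to_crawl = exo.jobs.get_current_url('Example Job')
# store progress after moving on to the next result page:
exo.job_update_current_url('Example Job',
                           'https://www.example.com/serp?page=2')
# mark the job as done once there is no next page:
exo.job_mark_as_finished('Example Job')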
The analysis of the pages to extract the detail page URLs will be done with Beautiful Soup. Its select function makes it possible to select elements using CSS selectors.
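As a quick illustration of what select does, here is a tiny made-up snippet that mimics the archive’s result table:
from bs4 import BeautifulSoup

html = """<table><tr>
<td class="views-field-label"><a href="/details-0001.html">Doc 1</a></td>
<td class="views-field-label"><a href="/details-0002.html">Doc 2</a></td>
</tr></table>"""
soup = BeautifulSoup(html, 'lxml')
# select() takes a CSS selector and returns a list of matching tags;
# tag attributes are accessed like dictionary keys:
for link in soup.select("td.views-field-label a"):
    print(link['href'])
# prints: /details-0001.html and /details-0002.html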
It might later be relevant to see in which search queries a document showed up. For this, labels like “search keyword X” are attached when the detail page is added to the queue. If the document shows up in another search, it is, as described above, not downloaded again, but any new label is still added. So a document that appears in five different search result sets ends up with five different labels.
The result of these deliberations is the following function, which extends the script above:
# continued from above
# get-detail-pages.py : listing 2 of 4
def crawl(job: str):
    "Loop until no next page is defined for the job."
    # all links are relative, so we have to extend them
    url_base = 'https://www.example.com'
    while True:
        try:
            # pick up where we left off
            page_to_crawl = exo.jobs.get_current_url(job)
        except RuntimeError:
            logging.info('The job is already marked as finished.')
            break
        # get the page content and parse it with lxml
        soup = BeautifulSoup(exo.return_page_code(page_to_crawl), 'lxml')
        # extract all relevant links to detail pages
        urls = soup.select("td.views-field-label a")
        if urls:
            # loop over all URLs, add the base,
            # then add them to the queue
            for i in urls:
                full_url = f"{url_base}{i['href']}"
                # Print the found URL as a visual progress indicator:
                print(full_url)
                # Add the name of the job as a label
                exo.add_save_page_code(full_url,
                                       labels_version={job, 'detail page'},
                                       prettify_html=True)
        # check whether a next page is defined for the SERPs
        next_page = soup.select("li.pager-next a")
        if next_page:
            # next page is defined: add the base and
            # store it as the job's current URL
            next_page = f"{url_base}{next_page[0]['href']}"
            exo.job_update_current_url(job, next_page)
            exo.random_wait()
        else:
            # next page is *not* defined: mark the job as finished
            exo.job_mark_as_finished(job)
            break  # leave the while loop
    # ends up here once no next page is defined
    print('done')
# continued below
To start the crawl, define one or more jobs with the first search result page as the initial URL and pass their names to the crawl function.
# continued from above
# get-detail-pages.py : listing 3 of 4
exo.job_define_new('Search Keyword X',
                   'https://www.example.com/search-view?keyword=X')
crawl('Search Keyword X')

exo.job_define_new('Search Keyword Y',
                   'https://www.example.com/search-view?keyword=Y')
crawl('Search Keyword Y')
# continued below
If a job is already done, that is recognized first thing in the loop.
Now that all URLs of the detail pages are in the queue, they have to be downloaded into the database to scrape the metadata and the link to the document. This is just a single command:
# continued from above
# get-detail-pages.py : listing 4 of 4
exo.process_queue()
Scraping Metadata
The archive kindly provides a lot of metadata for each document. This information has to be extracted, as it will surely be relevant later. However, as every project is different, exoskeleton has no built-in functionality to store such information. So the database has to be extended with a separate table:
-- change to the name of your database:
USE exoskeleton;
CREATE TABLE docDetails (
    pageUrl VARCHAR(1023)
    ,docTitle VARCHAR(1023)
    ,docType VARCHAR(1023)
    ,collection VARCHAR(1023)
    ,attachmentUrl VARCHAR(1023)
    -- left out a dozen more fields
) ENGINE=InnoDB;
Now a second script fills those fields by analyzing all downloaded detail pages. The docDetails table lives in the exoskeleton database, and we want to add the file downloads to the queue. Therefore, this script loads exoskeleton as before. However, two more libraries are needed:
- userprovided: a sister project of exoskeleton which provides functionality to convert the dates found on the detail pages.
- re: the standard library module for regular expressions.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# analyze-detail-page.py : listing 1 of 4
"""
The first script already downloaded all detail pages.
This script analyses them and stores all information
in a custom table.
Then the actual file is added to the download queue.
"""
import logging
import re
from bs4 import BeautifulSoup
import exoskeleton
import userprovided
logging.basicConfig(level=logging.DEBUG)
exo = exoskeleton.Exoskeleton(
    database_settings={'database': 'example',
                       'username': 'admin',
                       'passphrase': ''},
    target_directory='/home/user/example-files'
)
# continued below
The analysis can again be done with Beautiful Soup. There is just one caveat: some fields are not available for every document. If Beautiful Soup tries to access a missing field, it fails with an exception, which has to be handled.
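To make the failure mode concrete: if a field is missing, soup.find() returns None, and accessing an attribute such as .a on that result raises an AttributeError. A tiny illustration:
from bs4 import BeautifulSoup

snippet = BeautifulSoup('<p>no document type on this page</p>', 'lxml')
missing = snippet.find("div", class_='field-name-field-taxonomy-doc-type')
print(missing)  # None - the field does not exist here
try:
    print(missing.a)
except AttributeError as exc:
    # 'NoneType' object has no attribute 'a'
    print(exc)
Handling this for each field results in the following function: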
# continued from above
# analyze-detail-page.py : listing 2 of 4
def analyze_page(page_content: str,
                 url: str,
                 file_labels: set):
    "Extract metadata from a detail page, store it, and queue the file."
    soup = BeautifulSoup(page_content, 'lxml')
    # Store the result set in a dictionary
    rs = dict()
    rs['pageUrl'] = url

    docTitle = soup.find("h1", class_="documentFirstHeading")
    rs['docTitle'] = docTitle.string if docTitle is not None else ''

    try:
        docType = soup.find(
            "div", class_='field-name-field-taxonomy-doc-type').a
        rs['docType'] = docType.string if docType.string != '' else None
    except Exception:
        rs['docType'] = None

    try:
        collection = soup.find(
            "div", class_='field-name-field-collection').a
        rs['collection'] = collection.string if collection.string != '' else None
    except Exception:
        rs['collection'] = None

    try:
        docPublicationDate = soup.find(
            "div", class_='field-name-field-pub-date').find_all('div')[2]
        rs['docPublicationDate'] = (
            userprovided.long_to_short_date(docPublicationDate.get_text())
            if docPublicationDate.get_text() != '' else None)
    except Exception:
        rs['docPublicationDate'] = None

    try:
        contentType = soup.find(
            "div", class_='field-name-field-content-type').find_all('div')[2]
        rs['contentType'] = (contentType.get_text()
                             if contentType.get_text() != '' else None)
    except Exception:
        rs['contentType'] = None

    try:
        attachment = str(soup.find(
            "div", class_='field-name-field-file').find_all('div')[2])
        # get_text() would throw away the HTML code we need here
        PATTERN = re.compile(
            r'href="(?P<href>.*)" type="(?P<type>.*); length=(?P<length>\d*)')
        rs['attachmentUrl'] = re.search(PATTERN, attachment).group('href')
    except Exception:
        logging.warning('exception while trying to find the file',
                        exc_info=True)
        return  # do NOT write results for this page

    # Construct a query
    sql = ("INSERT INTO docDetails (pageUrl, docTitle, docType, " +
           "collection, attachmentUrl) " +
           "VALUES (%s, %s, %s, %s, %s);")
    # Insert the results into the custom table
    exo.cur.execute(sql, (rs['pageUrl'],
                          rs['docTitle'],
                          rs['docType'],
                          rs['collection'],
                          rs['attachmentUrl']))
    # Queue the actual file for download:
    exo.add_file_download(rs['attachmentUrl'],
                          labels_master=file_labels)
    return
# continued below
The actual function is much longer, as there are more fields; most of it is copy and paste of the patterns above. Now all that remains is looping over the already downloaded detail pages. We identify them by the label ‘detail page’ that we assigned in the first step, and we mark those we have analyzed by giving them a new label. Again, some custom SQL is inevitable.
# continued from above
# analyze-detail-page.py : listing 3 of 4
# Get the UUIDs of all detail pages via their label.
# processed_only means only those which have already been
# downloaded, because otherwise the UUID might reference an
# item still waiting in the queue. If you run this script in
# parallel with the one that downloads the detail pages, you
# have to run it again once all pages are downloaded:
detail_pages = exo.version_uuids_by_label('detail page',
                                          processed_only=True)

# Loop over the pages found
for page in detail_pages:
    # get the content / source code of the page
    exo.cur.execute('SELECT pageContent ' +
                    'FROM fileContent ' +
                    'WHERE versionID = %s;',
                    page)
    page_content = exo.cur.fetchone()
    page_content = page_content[0]
    # Get the page URL which is currently linked
    # to the downloaded page:
    exo.cur.execute('SELECT url ' +
                    'FROM fileMaster ' +
                    'WHERE id = ( ' +
                    '    SELECT fileMasterID ' +
                    '    FROM fileVersions ' +
                    '    WHERE id = %s);',
                    page)
    page_url = exo.cur.fetchone()
    page_url = page_url[0]
    # Which labels are attached to the page?
    page_labels = exo.labels.version_labels_by_uuid(page)
    # Remove the label 'detail page' from the set, as we want
    # to attach the remaining labels to the new queue item, i.e.
    # the file we download:
    page_labels.remove('detail page')
    # Analyze the page and add the file to the queue:
    if page_content is not None:
        analyze_page(page_content, page_url, page_labels)
    # To mark the detail page as processed, exchange the label:
    exo.labels.remove_labels_from_uuid(page, 'detail page')
    exo.labels.assign_labels_to_uuid(page, 'processed detail page')
    # mark progress visually
    print('.')
# continued below
Downloading all Files
All documents have been added to the queue. Downloading them is, once again, just a single command that extends the script above.
# continued from above
# analyze-detail-page.py : listing 4 of 4
exo.process_queue()
Now exoskeleton downloads all files. As this will take some time, you could activate progress reports via email.
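The mail configuration itself is not shown in the listings above. As a rough sketch only: I assume here that the exoskeleton constructor accepts mail settings alongside the database settings, with parameter and key names roughly like the ones below. The exact names may differ between versions, so please check the project documentation before using this.
# Rough sketch - parameter and key names are assumptions,
# check the exoskeleton documentation for your version:
exo = exoskeleton.Exoskeleton(
    database_settings={'database': 'example',
                       'username': 'admin',
                       'passphrase': ''},
    target_directory='/home/user/example-files',
    mail_settings={'recipient': 'me@example.org',
                   'sender': 'bot@example.org'},
    mail_behavior={'send_start_msg': True,
                   'send_finish_msg': True}
)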