Scraping the Europotato database
# Hint: Check "Python" on the top right corner to see this pad with pretty colors and highlights
# However, one has to be registered to access the "Export Data" link. There seems to be no interface for registering, as it appears to be a manual approval process by the site admins.
# So we recall the data hacker's mantra -- it's easier to ask for forgiveness than permission -- and set out to write a script to reclaim this information in a structured format that we can work on.
Downloading the HTML pages
# The site provides a handy index with the list of varieties at http://www.europotato.org/varietyindex.php.
# Usually, we'd consider writing a script to fetch all the links to download. However, since the index has just a handful of pages, we can open them in the browser and use a mass downloader to find the links and download them. DownThemAll is a really handy Firefox extension for this. After fetching every page, we end up with a directory of more than 5600 HTML files. From here, we'll write a script that reads each of them and dumps the information into a structured format (CSV and/or JSON).
# Each page contains a set of key-value pairs, which we'll go through in the script below to generate the dataset.
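# As a rough sketch of the "dump into CSV and/or JSON" step, here is one way to write a list of
# key-value entries using only the standard library. This is not the original script: the sample
# entries, the field names and the helper names (to_csv, to_json) are hypothetical placeholders,
# and the code assumes Python 3.

```python
import csv
import io
import json

# Hypothetical placeholder entries; real ones would come from the parsed pages
entries = [
    {"name": "Variety A", "maturity": "early"},
    {"name": "Variety B", "tuber_shape": "oval"},
]

# Pages do not all share the same fields, so collect every key we see,
# keeping the order in which each key first appears
fieldnames = []
for entry in entries:
    for key in entry:
        if key not in fieldnames:
            fieldnames.append(key)


def to_csv(entries, fieldnames):
    """Return the entries as CSV text; fields missing from an entry become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buf.getvalue()


def to_json(entries):
    """Return the entries as pretty-printed JSON text."""
    return json.dumps(entries, indent=2)


csv_text = to_csv(entries, fieldnames)
json_text = to_json(entries)
```

# Writing csv_text and json_text to files is then a plain open(...).write(...) away.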
Parsing the HTML content
# Here's the documented script to parse the downloaded HTML files. You can also find this at the Cqrrelations FTP server at /share/datasets/europotato.
# Script to parse the HTML from europotato.org's potato varieties list
# Copyleft 2015 Ricardo Lafuente
# Released under the GPL version 3 or later.
# See the full license at http://www.gnu.org/licenses/gpl.html
# Keys to ignore; these are fields in the page that we just won't include
IGNORE_KEYS = ["plant_material_maintained_as", "sample_status", "test_conditions", "plant_health_directive_ec77/93,_requirements"]
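# A small sketch of how such an ignore list might be applied. The keys above look like page labels
# that were lowercased with spaces replaced by underscores, so the normalize_key helper below assumes
# that rule; both the helper and the sample pairs are hypothetical and not taken from the original script.

```python
IGNORE_KEYS = ["plant_material_maintained_as", "sample_status", "test_conditions",
               "plant_health_directive_ec77/93,_requirements"]


def normalize_key(label):
    """Turn a page label into a key: lowercase, spaces to underscores (assumed rule)."""
    return label.strip().lower().replace(" ", "_")


# Hypothetical (label, value) pairs as they might be read from one page
raw_pairs = [
    ("Sample status", "placeholder"),
    ("Hypothetical field", "placeholder"),
]

# Keep only the pairs whose normalized key is not in the ignore list
filtered = {
    normalize_key(label): value
    for label, value in raw_pairs
    if normalize_key(label) not in IGNORE_KEYS
}
```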
# Local dir where the HTML files can be found
# List to store the parsed entries
# List of keys; we specify here the ones that we'll add ourselves, and later add the keys that
# os.listdir only returns filenames, but we want paths
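# For illustration, a minimal sketch of turning the bare filenames from os.listdir into usable paths.
# The function name list_file_paths and its html_dir parameter are hypothetical.

```python
import os


def list_file_paths(html_dir):
    """Return the sorted full paths of the regular files inside html_dir."""
    paths = []
    for name in os.listdir(html_dir):
        # os.listdir gives "page.html"; joining with the directory gives a path we can open
        path = os.path.join(html_dir, name)
        if os.path.isfile(path):
            paths.append(path)
    return sorted(paths)
```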
# if it's not in order. Let's use the OrderedDict for this, which otherwise works identically to your usual
# Python dict
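# A quick illustration of the OrderedDict behaviour mentioned above: keys come back in the order
# they were inserted. The field names and values are hypothetical placeholders. (Since Python 3.7
# a plain dict also keeps insertion order, but OrderedDict makes the intent explicit.)

```python
from collections import OrderedDict

entry = OrderedDict()
entry["name"] = "placeholder"
entry["hypothetical_field"] = "placeholder"
entry["another_field"] = "placeholder"

# Iteration follows insertion order, so columns stay in a predictable order
keys_in_order = list(entry.keys())

# Lookups and assignments work just like with a regular dict
entry["name"] = "updated placeholder"
```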
# using the power of regular expressions (see https://xkcd.com/208/)