Scraping the Europotato database
# Hint: Check "Python" in the top right corner to see this pad with pretty colors and highlights
# The European Cultivated Potato Database ("Europotato") contains information on thousands of varieties and strains of potatoes.
# However, one has to be registered to access the "Export Data" link. There seems to be no interface for registering, as it appears to be a manual approval process by the site admins.
# So we recall the data hacker's mantra -- it's easier to ask for forgiveness than permission -- and set out to write a script to reclaim this information in a structured format that we can work with.
Downloading the HTML pages
# The site provides a handy index with the list of varieties at http://www.europotato.org/varietyindex.php.
# Usually, we'd consider writing a script to fetch all the links to download. However, since the index has just a handful of pages, we can just access them with the browser and use a mass downloader
# to find the links and download them. DownThemAll is a really handy extension for Firefox to accomplish this. After fetching every page, we end up with a directory with some 5600+ HTML files. From
# here, we'll write a script that will read each of them and dump the information into a structured format (CSV and/or JSON).
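# If you'd rather script this step as well, here's a minimal sketch of how the fetching could be done
# with requests and BeautifulSoup. It assumes the index links point at display_description.php pages
# (the same pattern used for the URL field further down) and that a single index URL is enough;
# check both assumptions against the real index pages before relying on it.
import os
import re
import requests
from bs4 import BeautifulSoup

OUT_DIR = "new-html"  # same directory the parsing script below reads from
if not os.path.isdir(OUT_DIR):
    os.makedirs(OUT_DIR)

# The index may be split over several pages; list them all here
index_urls = ["http://www.europotato.org/varietyindex.php"]

for index_url in index_urls:
    index_soup = BeautifulSoup(requests.get(index_url).content, "lxml")
    # Variety pages are served by display_description.php
    for link in index_soup.find_all("a", href=re.compile("display_description")):
        variety_url = "http://www.europotato.org/" + link["href"].lstrip("./")
        outfile = os.path.join(OUT_DIR, link.text.strip() + ".html")
        with open(outfile, "wb") as f:
            f.write(requests.get(variety_url).content)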
Page Structure
# Each page contains a set of key-value pairs, which we'll go through in the script below to generate the dataset.
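# To make that concrete, here's a tiny made-up example of the kind of two-column table row the
# script looks for, and how it maps onto a key-value pair (the real markup and field names may differ):
from bs4 import BeautifulSoup

sample = "<table><tr><td>Colour of skin:</td><td>Yellow [12]</td></tr></table>"
row = BeautifulSoup(sample, "lxml").find("tr")
key_cell, value_cell = row.find_all("td")
print("%s -> %s" % (key_cell.text.strip(":"), value_cell.text))
# prints: Colour of skin -> Yellow [12]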
Parsing the HTML content
# Here's the documented script to parse the downloaded HTML files. You can also find this at the Cqrrelations FTP server at /share/datasets/europotato.
#!/usr/bin/env python
# Script to parse the HTML from europotato.org's potato varieties list
# Copyleft 2015 Ricardo Lafuente
# Released under the GPL version 3 or later.
# See the full license at http://www.gnu.org/licenses/gpl.html
# Keys to ignore; these are fields in the page that we just won't include
IGNORE_KEYS = ["plant_material_maintained_as", "sample_status", "test_conditions", "plant_health_directive_ec77/93,_requirements"]
# Local dir where the HTML files can be found
source_dir = "new-html/"
# List to store the parsed entries
results = []
# List of keys; we specify here the ones that we'll add ourselves, and later add the keys that
# we find in the pages. This is necessary for CSV generation
all_keys = ['title', 'name', 'number_of_sources', 'url']
# go through each of the HTML files
import os
for filename in os.listdir(source_dir):
# os.listdir only returns filenames, but we want paths
filename = os.path.join(source_dir, filename)
# read the file with BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(filename, 'r').read(), "lxml")
# normal Python dicts are unordered; since we have a large number of fields, it would be a mess
# if they weren't kept in order. Let's use OrderedDict for this, which otherwise works identically to your usual
# Python dict
from collections import OrderedDict
result = OrderedDict()
# Get the title from the filename
title = os.path.basename(filename).replace('.html', '')
result['title'] = title
# Make a "slug" field for urls and other identifiers where spaces and uppercase shouldn't be used
result['name'] = title.replace(" ", "-").lower().replace("(", "").replace(")", "")
# All fields are inside tables, so we'll look into those
tables = soup.find_all('table')
for table in tables:
# Go through each row
for r in table.find_all("tr"):
# The fields we're looking for are all in rows with two columns; disregard everything else
if len(r.find_all("td")) != 2:
# "continue" will stop the current loop and go to the next iteration; handy for
# setting conditions in order to process further or not
continue
# If we got here, it means that we have a row with 2 cols; turn it into a key-value pair
k, v = r.find_all("td")
k = k.text.strip(":")
# Also slugify key names
k = k.lower().replace(" ", "_").replace("'", "")
v = v.text.strip()
# Make sure it's not an empty cell or a key that we should ignore
if not k or k in ("-", "_") or k in IGNORE_KEYS:
continue
# For values, remove the source markers like "[57, 58]" at the end of the line,
# using the power of regular expressions (see https://xkcd.com/208/)
import re
v = re.sub(r'\[[0-9, ]+\]$', '', v).strip()
# If there are source markers remaining in the middle of the string, it means there's more
# than one value from different sources; we separate these with a slash
v = re.sub(r'\[[0-9, ]+\]', '/', v).strip()
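# For example, a (made-up) value like "White [1] Cream [2, 3]" would first lose the trailing
# "[2, 3]" and then come out of the second substitution as "White / Cream"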
# Add the current key to the list of keys
if k not in all_keys:
all_keys.append(k)
# And add it to our result dict
result[k] = v
# Find images on the page
image_hits = soup.find_all('img', src=re.compile("create_thumbnail"))
if image_hits:
result['images'] = []
for img in image_hits:
# Parse the "src" attribute so that we end up with a nice URL
imgpath = img['src'].split('=')[-1].strip('.')
url = 'http://www.europotato.org' + imgpath
result['images'].append(url)
# Add a field with the number of sources found in this page
if result.get('data_source'):
ds = result['data_source']
result['number_of_sources'] = ds.count('/') + 1
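# e.g. a (made-up) data_source of "NL / GB / DE" has two slashes, so it counts as 3 sources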
# Add the URL field
result['url'] = "http://www.europotato.org/display_description.php?variety_name=%s" % title
# All done -- add this entry to the results list
results.append(result)
print result['name']
# Uncomment when debugging to parse just the first 30 results;
# otherwise we'd have to wait quite a bit every time we run this
# if len(results) > 30:
# break
# Create the JSON file
import json
contents = json.dumps(results, indent=2)
of = open("potato-varieties.json", "w")
of.write(contents)
of.close()
# Create CSV file
# Note that we use unicodecsv instead of the csv module; this is a quick and somewhat dirty way
# to deal with Unicode errors when creating CSV. These errors should really be tackled with proper
# encoding and decoding of strings, but I always get those mixed up, so I ended up taking the lazy route.
import unicodecsv as csv
csvwriter = csv.DictWriter(open('potato-varieties.csv', 'wb'), fieldnames=all_keys, quoting=csv.QUOTE_MINIMAL)
csvwriter.writeheader()
for row in results:
# Don't include the image list on the CSV
if row.get('images'):
del row['images']
csvwriter.writerow(row)
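# Once both files exist, a quick sanity check is to load them back. This is just a usage sketch,
# reading the file names written above.
import json
import unicodecsv as csv

varieties = json.load(open("potato-varieties.json"))
print("%d varieties parsed" % len(varieties))

for row in csv.DictReader(open("potato-varieties.csv", "rb")):
    print(row["name"])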