Scraping the Europotato database
# Hint: Check "Python" in the top right corner to see this pad with pretty colors and highlights
# The European Cultivated Potato Database ("Europotato") contains information on thousands of varieties and strains of potatoes.
# However, one has to be registered to access the "Export Data" link. There seems to be no interface for registering, as it appears to be a manual approval process by the site admins.
# So we recall the data hacker's mantra -- it's easier to ask for forgiveness than permission -- and set out to write a script to reclaim this information in a structured format that we can work with.
Downloading the HTML pages
# The site provides a handy index with the list of varieties at http://www.europotato.org/varietyindex.php.
# Usually, we'd consider writing a script to fetch all the links to download. However, since the index has just a handful of pages, we can just access them with the browser and use a mass downloader
# to find the links and download them. DownThemAll is a really handy extension for Firefox to accomplish this. After fetching every page, we end up with a directory with some 5600+ HTML files. From
# here, we'll write a script that will read each of them and dump the information into a structured format (CSV and/or JSON).
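# If you'd rather script this step as well, here's a minimal sketch of how the fetching could be done
# with requests and BeautifulSoup. It assumes the index links point at display_description.php pages
# (the same pattern used for the URL field further down) and that a single index URL is enough;
# check both assumptions against the real index pages before relying on it.
import os
import re
import requests
from bs4 import BeautifulSoup

OUT_DIR = "new-html"  # same directory the parsing script below reads from
if not os.path.isdir(OUT_DIR):
    os.makedirs(OUT_DIR)

# The index may be split over several pages; list them all here
index_urls = ["http://www.europotato.org/varietyindex.php"]

for index_url in index_urls:
    index_soup = BeautifulSoup(requests.get(index_url).content, "lxml")
    # Variety pages are served by display_description.php
    for link in index_soup.find_all("a", href=re.compile("display_description")):
        variety_url = "http://www.europotato.org/" + link["href"].lstrip("./")
        outfile = os.path.join(OUT_DIR, link.text.strip() + ".html")
        with open(outfile, "wb") as f:
            f.write(requests.get(variety_url).content)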
Page Structure
# Each page contains a set of key-value pairs, which we'll go through in the script below to generate the dataset.
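# To make that concrete, here's a tiny made-up example of the kind of two-column table row the
# script looks for, and how it maps onto a key-value pair (the real markup and field names may differ):
from bs4 import BeautifulSoup

sample = "<table><tr><td>Colour of skin:</td><td>Yellow [12]</td></tr></table>"
row = BeautifulSoup(sample, "lxml").find("tr")
key_cell, value_cell = row.find_all("td")
print("%s -> %s" % (key_cell.text.strip(":"), value_cell.text))
# prints: Colour of skin -> Yellow [12]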
Parsing the HTML content
# Here's the documented script to parse the downloaded HTML files. You can also find this at the Cqrrelations FTP server at /share/datasets/europotato.
#!/usr/bin/env python
# Script to parse the HTML from europotato.org's potato varieties list
# Copyleft 2015 Ricardo Lafuente
# Released under the GPL version 3 or later.
# See the full license at http://www.gnu.org/licenses/gpl.html
# Keys to ignore; these are fields in the page that we just won't include
IGNORE_KEYS = ["plant_material_maintained_as", "sample_status", "test_conditions", "plant_health_directive_ec77/93,_requirements"]
# Local dir where the HTML files can be found
source_dir = "new-html/"
# List to store the parsed entries
results = []
# List of keys; we specify here the ones that we'll add ourselves, and later add the keys that
# we find in the pages. This is necessary for CSV generation
all_keys = ['title', 'name', 'number_of_sources', 'url']
# go through each of the HTML files
import os
for filename in os.listdir(source_dir):
# os.listdir only returns filenames, but we want paths
filename = os.path.join(source_dir, filename)
# read the file with BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(filename, 'r').read(), "lxml")
# normal Python dicts are unordered; since we have a large number of fields, it would be a mess
# if they weren't kept in order. Let's use OrderedDict for this, which otherwise works identically to your usual
# Python dict
from collections import OrderedDict
result = OrderedDict()
# Get the title from the filename
title = os.path.basename(filename).replace('.html', '')
result['title'] = title
# Make a "slug" field for urls and other identifiers where spaces and uppercase shouldn't be used
result['name'] = title.replace(" ", "-").lower().replace("(", "").replace(")", "")
# All fields are inside tables, so we'll look into those
tables = soup.find_all('table')
for table in tables:
# Go through each row
for r in table.find_all("tr"):
# The fields we're looking for are all in rows with two columns; disregard everything else
if len(r.find_all("td")) != 2:
# "continue" will stop the current loop and go to the next iteration; handy for
# setting conditions in order to process further or not
continue
# If we got here, it means that we have a row with 2 cols; turn it into a key-value pair
k, v = r.find_all("td")
k = k.text.strip(":")
# Also slugify key names
k = k.lower().replace(" ", "_").replace("'", "")
v = v.text.strip()
# Make sure it's not an empty cell or a key that we should ignore
if not k or k in ("-", "_") or k in IGNORE_KEYS:
continue
# For values, remove the source markers like "[57, 58]" at the end of the line,
# using the power of regular expressions (see https://xkcd.com/208/)
import re
v = re.sub(r'\[[0-9, ]+\]$', '', v).strip()
# If there are source markers remaining in the middle of the string, it means there's more
# than one value from different sources; we separate these with a slash
v = re.sub(r'\[[0-9, ]+\]', '/', v).strip()
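# For example, a (made-up) value like "White [1] Cream [2, 3]" would first lose the trailing
# "[2, 3]" and then come out of the second substitution as "White / Cream"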
# Add the current key to the list of keys
if k not in all_keys:
all_keys.append(k)
# And add it to our result dict
result[k] = v
# Find images on the page
image_hits = soup.find_all('img', src=re.compile("create_thumbnail"))
if image_hits:
result['images'] = []
for img in image_hits:
# Parse the "src" attribute so that we end up with a nice URL
imgpath = img['src'].split('=')[-1].strip('.')
url = 'http://www.europotato.org' + imgpath
result['images'].append(url)
# Add a field with the number of sources found in this page
if result.get('data_source'):
ds = result['data_source']
result['number_of_sources'] = ds.count('/') + 1
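# e.g. a (made-up) data_source of "NL / GB / DE" has two slashes, so it counts as 3 sources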
# Add the URL field
result['url'] = "http://www.europotato.org/display_description.php?variety_name=%s" % title
# All done -- add this entry to the results list
results.append(result)
print result['name']
# Uncomment when debugging to parse just the first 30 results;
# otherwise we'd have to wait quite a bit every time we run this
# if len(results) > 30:
# break
# Create the JSON file
import json
contents = json.dumps(results, indent=2)
of = open("potato-varieties.json", "w")
of.write(contents)
of.close()
# Create CSV file
# Note that we use unicodecsv instead of the csv module; this is a quick and somewhat dirty way
# to deal with Unicode errors when creating CSV. These errors should really be tackled with proper
# encoding and decoding of strings, but I always get those mixed up, so I ended up taking the lazy route.
import unicodecsv as csv
csvwriter = csv.DictWriter(open('potato-varieties.csv', 'wb'), fieldnames=all_keys, quoting=csv.QUOTE_MINIMAL)
csvwriter.writeheader()
for row in results:
# Don't include the image list on the CSV
if row.get('images'):
del row['images']
csvwriter.writerow(row)
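# Once both files exist, a quick sanity check is to load them back. This is just a usage sketch,
# reading the file names written above.
import json
import unicodecsv as csv

varieties = json.load(open("potato-varieties.json"))
print("%d varieties parsed" % len(varieties))

for row in csv.DictReader(open("potato-varieties.csv", "rb")):
    print(row["name"])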