#TODO(ricardo) - Scraping europotato into a CSV


# Being developed at [[EuropotatoScraping]]

## Turning descriptors into numbers


# The European Cultivated Potato Database is almost entirely non-numerical. Here is an example of a well-populated potato entry, for an obscure German potato variety.
potato = { "title": "Orienta", "name": "orienta", "higher_taxon": "Solanaceae", "genus": "Solanum L.", "pedigree": "E 77 330 x E 76 56", "breeder": "Nordkartoffel-Zuchtgesellschaft", "breeders_rights": "", "synonyms": "", "national_list": "1991", "country_of_origin": "GERMANY", "data_source": "CZEHBROD / DEU416", "flower_colour": "White", "flower_frequency": "Occasional to frequent", "foliage_cover": "Moderate to good", "maturity": "Intermediate", "primary_tuber_flesh_colour": "Yellow", "tuber_eye_depth": "Shallow", "tuber_shape": "Long to oval", "tuber_skin_colour": "White to yellow", "tuber_skin_texture": "Smooth to intermediate", "dormancy_period": "Medium to long", "early_harvest_yield_potential": "High", "growth_cracking": "Low", "hollow_heart_tendency": "Low", "internal_rust_spot": "Infrequent to medium", "resistance_to_external_damage": "Resistant", "secondary_growth": "Medium", "tuber_shape_uniformity": "Uniform", "tuber_size": "Large", "tubers_per_plant": "Medium", "after_cooking_blackening": "Trace to little", "cooking_type_/_411_cooked_texture": "Fairly firm (multi-purpose type)", "enzymic_browning": "Little", "starch_content": "Low to medium", "taste": "Moderate to good", "field_immunity_to_wart_races": "Race 1", "resistance_to_late_blight_on_foliage": "Medium", "resistance_to_late_blight_on_tubers": "Medium to high", "resistance_to_stem_canker_(rhizoctonia_solani)": "High", "wart_(synchytrium_endobioticum)": "Field immune", "resistance_to_blackleg_(erwinia_spp.)": "High", "resistance_to_common_scab_(streptomyces_scabies)": "Medium", "resistance_to_potato_leaf_roll_virus": "Low", "resistance_to_potato_virus_a": "Very high", "resistance_to_potato_virus_y_(strain_not_specified)": "High to very high", "resistance_to_globodera_rostochiensis_race_1": "High", "number_of_sources": 2, "url": "http://www.europotato.org/display_description.php?variety_name=Orienta" }, # Here is an exmple of a slightly less populated entry: potato2 = { "title": "Hudson", "name": "hudson", "higher_taxon": "Solanaceae", "genus": "Solanum L.", "pedigree": "NIF-1 x 56 N 18-4", "breeder": "", "breeders_rights": "", "synonyms": "", "national_list": "1972", "country_of_origin": "UNITED STATES", "data_source": "DEU001 / NEIKER / VRI RUSSIA", "maturity": "Intermediate / Early to intermediate", "primary_tuber_flesh_colour": "White / Yellow", "tuber_eye_depth": "Medium", "tuber_shape": "Oval to round / Very long", "tuber_skin_colour": "White to yellow", "yield_potential": "Very high", "crisp_suitability": "Moderate to good", "dry_matter_content": "Low", "starch_content": "Very low to low / Low", "resistance_to_dry_rot_(fusarium_spp.)": "High", "resistance_to_late_blight_on_foliage": "Very low to low", "resistance_to_late_blight_on_tubers": "Very low to low", "susceptibility_to_wart_races": "Race 3 / Race 2 / Race 3 / Race 2", "wart_(synchytrium_endobioticum)": "Susceptible", "resistance_to_globodera_rostochiensis_race_1": "Very high", "number_of_sources": 3, "url": "http://www.europotato.org/display_description.php?variety_name=Hudson" }

# Note that some fields are orderable: "Low" must come before "High", "Infrequent" before "Frequent". A forward slash separates contradictory data from multiple sources, and the word "to" specifies a range. The goal, then, is to convert orderable descriptions into numbers that preserve the intuitive ordering.
# I wrote a function to extract and normalize the values from a field, returning them as a list of individual terms:
def get_terms(colval):
    ret = []
    for values in colval.split("/"):
        values = values.replace("-", " to ")
        for term in values.split(" to"):
            ret.append(term.strip().lower())
    return ret
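# As a quick illustration (my example, not taken from the scraped data), a field combining a slash and a range flattens out like this:
get_terms("Very low to low / Low")   # -> ['very low', 'low', 'low']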
# In order to convert these text descriptions into numbers, I created an index of all of the different values that occurred in each column.
# I used Python's CSV parser to process the data that Ricardo assembled.
import csv
r = csv.reader(open(CSV_PATH))
# Find the column names
cols = r.next()
# Initialize a mapping of column name to set of values
colterms = dict([(X, set()) for X in cols])
# For each line of the CSV, add the terms to a set
for line in r:
    for colname, colval in zip(cols, line):
        for term in get_terms(colval):
            colterms[colname].add(term)
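# For instance (illustrative, not copied from the real output), the index for one column might end up as:
#   colterms["taste"] -> set(["", "poor", "moderate", "good", "very good"])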
# As a quick pruning, I limited my interest to columns with fewer than 15 distinct terms.
MAX_N_TERMS = 15
comparable_columns = [colname for (colname, cterms) in colterms.items()
                      if len(cterms) < MAX_N_TERMS]
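# To eyeball what survives the pruning, a quick report helps (a helper I am adding here; it is not part of the original notes):
for colname in sorted(comparable_columns):
    print "%2d terms: %s" % (len(colterms[colname]), colname)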
# With this data (as well as a reverse mapping from term to the columns in which that term appears), I manually created a `term2number.json' file, excerpted below, that mapped terms to correctly-ordered numeric values:
term2number = {
    "": None,
    "very good": 5,
    "very late": 5,
    "spreading": 2,
    "long": 5,
    "oval": 3,
    "very high": 6,
    "yes": 2,
    "very spreading": 1,
    "little": 3,
    "slow": 1,
    "very long": 6,
    "susceptible": 2,
    "fast": 7,
    "very deep": 6,
    "late": 4,
    "few": 2,
    "intermediate": 3,
    "low": 2,
    "semi erect": 3,
    "very short": 1,
    "very early": 1,
    [...]
}
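# Before converting, it is worth checking that the mapping is complete (a sanity check I would add here; it is not in the original notes):
for colname in comparable_columns:
    for term in colterms[colname]:
        assert term in term2number, "unmapped term %r in column %s" % (term, colname)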
# With this mapping, I was able to produce a numerical version of the CSV.
r = csv.reader(open(CSV_PATH))
cols = r.next()
# Blank numerical array
arr = []
# Loop through the CSV
for line in r:
    row = []
    # For each row, loop through column names and values
    for colname, colval in zip(cols, line):
        if colname in comparable_columns:
            # Transform each term into a number and accumulate the values in the `vals' list
            vals = []
            for term in get_terms(colval):
                vals.append(term2number[term])
            if None in vals:
                row.append(None)
            else:
                # Take the average of all reported values
                row.append(sum(vals) / float(len(vals)))
    arr.append(row)
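# To make the averaging concrete (my worked example; "low" is 2 in the excerpt above, and I am assuming "medium" maps to 3):
#   get_terms("Low to medium") -> ['low', 'medium']
#   (term2number["low"] + term2number["medium"]) / 2.0 -> (2 + 3) / 2.0 = 2.5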
# The new, numerical data looks like this in CSV form. Note the fractional results where multiple values were averaged, and the large amount of blank space:
Sanna,,,5.0,4.0,3.0,3.0,2.5,5.0,2.0,6.0,4.0,2.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Karolin,1.0,,3.5,3.3333333333333335,3.0,2.0,2.3333333333333335,2.5,4.0,4.0,,4.0,,3.5,4.5,3.0,1.5,2.0,3.5,4.0,2.0,4.5,4.0,2.5,4.0,1.5,4.0,3.5,4.5,3.0,3.0,4.0,4.0,5.0,2.0,5.0,5.0,2.6666666666666665,5.666666666666667,5.0,5.0,5.666666666666667,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Arran Signet,1.0,,,,2.0,,4.0,,,,,,,,,,,,,,,,,,,,,,,,1.5,,1.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Amuretc,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
CIP 5 374080 5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
B 71,,,,,,,,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
FL 1954,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3436 GL,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Orienta,,,4.0,3.5,3.0,2.0,4.0,2.5,2.0,,,3.0,,,4.5,2.0,2.0,3.0,4.0,,4.0,5.0,4.0,2.5,,3.0,,3.5,,4.0,,4.5,,5.0,2.0,,5.0,4.0,2.0,6.0,,5.5,5.0,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Stamm 927 76,,,,,,,,,,5.0,2.0,2.0,,,,,,,,,,,,,,,,,,4.0,,4.0,,,,,,,,,,,,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

## Finding correlations from numbers


# Now that we have a numerical transformation of the europotato database, we can run our correlations.
# We'll use numpy to perform the computations.
import numpy as np
# import scipy.stats

# Initialize an empty NxN matrix to store the correlation result for each pair of columns, where N is the number of columns
correlations = np.zeros((len(comparable_columns), len(comparable_columns)))
# ...and another empty matrix to keep track of how many samples were compared for each correlation result
nsamples = np.zeros((len(comparable_columns), len(comparable_columns)), dtype=int)
# Loop through every pair of comparable columns
for idx1, colname1 in enumerate(comparable_columns):
    for idx2, colname2 in enumerate(comparable_columns):
        # Since there's so much blank data, we compute an intersection of rows that have values in column one *and* two
        intersection = np.array([(row[idx1], row[idx2]) for row in arr
                                 if (row[idx1] is not None) and (row[idx2] is not None)])
        # Store the length of this intersection in the `nsamples' matrix
        nsamples[idx1, idx2] = len(intersection)
        # Compute the correlation: we can use numpy's correlation coefficients, or scipy's Kendall tau coefficient
        if len(intersection) > 0:
            correlations[idx1, idx2] = np.corrcoef(intersection[:,0], intersection[:,1])[0,1]
            # correlations[idx1, idx2] = scipy.stats.stats.kendalltau(intersection[:,0], intersection[:,1])[0]
# Now we have a matrix, `correlations', relating each column to every other, with values from -1 to 1, where -1 is a perfect inverse correlation, 0 indicates no correlation, and 1 a perfect positive correlation.
# Randomly sampling the results, the first strong value I found was a negative correlation between "adaptability" and "taste": -0.917662935482
# It turned out that there were only three potatoes with values for both "adaptability" and "taste," which hardly qualifies as statistically significant, but I very much liked the suggestion that the two should be mutually incompatible. Truth in numbers!
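# A follow-up sketch (my addition, not part of the original analysis): rank the strongest off-diagonal correlations together with their sample counts, so that coefficients computed over a handful of potatoes can be discounted.
MIN_SAMPLES = 10    # arbitrary cut-off, my choice
pairs = []
for idx1 in range(len(comparable_columns)):
    for idx2 in range(idx1 + 1, len(comparable_columns)):
        if nsamples[idx1, idx2] >= MIN_SAMPLES:
            pairs.append((abs(correlations[idx1, idx2]),
                          comparable_columns[idx1],
                          comparable_columns[idx2],
                          nsamples[idx1, idx2]))
pairs.sort(reverse=True)
for strength, col1, col2, n in pairs[:10]:
    print "%.3f  %s ~ %s  (n=%d)" % (strength, col1, col2, n)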

## Visualizing the matrix of correlations

# It's hard to make any sense of this 85x85 matrix without a way to look at the 7225 values.
# I would have plotted it in IPython, but my IPython pylab graphics backend was misconfigured and I was too lazy to debug it, so I dumped the data to JSON and wrote a simple HTML page to look at the results.
import json
json.dump({
    "columns": comparable_columns,
    "correlations": correlations.tolist(),
    "nsamples": nsamples.tolist(),
}, open(CORR_OUT, 'w'), indent=4)
# In HTML, my approach was to load the dumped JSON and make absolutely-positioned <div> elements for every correlation value:
// Fetch the dumped JSON (the filename is assumed here; synchronous is fine for a one-off hack)
var req = new XMLHttpRequest();
req.open("GET", "correlations.json", false);
req.send();
var correlations = JSON.parse(req.responseText);

correlations.correlations.forEach(function(row, r_idx) {
    row.forEach(function(val, c_idx) {
        var $cell = document.createElement("div");
        // Position and size the cell based on column and row index
        // (position, width and height set explicitly so the cells render as a grid)
        $cell.style.position = "absolute";
        $cell.style.width = "10px";
        $cell.style.height = "10px";
        $cell.style.left = (c_idx * 10) + "px";
        $cell.style.top = (r_idx * 10) + "px";
        // Set the CSS background color to red for positive correlations,
        // and blue for negative correlations
        if (val > 0) {
            $cell.style.backgroundColor = "rgb(" + Math.floor(val * 255) + ",0,0)";
        } else if (val < 0) {
            $cell.style.backgroundColor = "rgb(0,0," + Math.floor(-val * 255) + ")";
        }
        document.body.appendChild($cell);
    });
});
# With labels, the result looks like this. The diagonal line is every column's perfect positive correlation with itself: http://i.imgur.com/HxmvLya.png?1

## Categorising the columns (JM)


# Naturally, the large number of columns presented in the correlation matrix presents some ambiguities
# - you can get an idea of positive, weak, absent and negative correlations, but the arbitrary order of the columns results in less immediate understanding - WE SHOULD SORT THE COLUMNS!!!
# A fan of ontologies and taxonomies, I use the following approach for categorising things (feel free to use my framework, it's good for everything, though designed for public affairs and monitoring):
#    Project (to focus on later when stabbing at user profiles)
#    Issue
#    Event
#    Group
#    Individual
#    Organisation

# The categories in the database focus on #Issues, because the database concentrates on qualities which contribute to the economic success of the potato.
# I split these issues into two main themes: Resistance and Content.
# This is because the categories highlighted a split between (I) the inputs and outputs which determined the health of the crop, and (II) the outcomes of the sales-ready product.
# The subcategories became theme/subtheme pairs such as 'resistance environment' and 'content yield'; the full list is in the CSV linked below.
# For this approach there was a need for bundling across an x/y axis. Therefore things which may be interdependent (perhaps a 'host' category exists because of poor 'environmental' conditions) are ignored. Also, this breakdown was done in expediency, with only cursory knowledge of the topics (after all, this is only to shift columns for legibility).
# This is naturally responsive to the existing categorisation; if personal insight were greater, perhaps it could have been less passive. If reordering the data provided more insight, the approach could naturally be refined.
# The file with the columns and my folksonomy can be found here: http://cqrrelations.lan/share/datasets/europotato/potato_categories_by_JM.csv

# They can be added to the data via a join.
# Either a 'two stage' column approach could be taken with the data columns, or the columns could be renamed to include a prefix, say 'resistance environment' or 'content yield', before sorting the columns at the correct stage.
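# A minimal sketch of that join (my addition; it assumes the categories CSV has two columns, the original column name followed by its category, which I have not verified):
catreader = csv.reader(open("potato_categories_by_JM.csv"))
col2cat = dict((colname, category) for (colname, category) in catreader)
# Sort the comparable columns by category first, then by name, so related
# columns sit next to each other in the correlation matrix
sorted_columns = sorted(comparable_columns,
                        key=lambda colname: (col2cat.get(colname, ""), colname))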

#TODO(???) - "... profit"