Pattern for Python

Other software: R (not good at tokenization), SAS (not free), Python

* easy to learn
* elegant syntax
* fun for string manipulation
* lots of libraries: sklearn (classifiers for machine learning), pandas (data tables)

* web mining:
    Google, Twitter, FB (doesn't break into someone's profile)
    HTML parser (relevant parts of webpage), crawler (follow links until no more data is found)
* Natural language processing
    POS tagging
    sentiment analysis
    en, nl, fr, it, es
    -> string classification
* Machine learning
    SVM (Support Vector Machines), neural networks (trying to mimic the human brain)
* web apps & visualisation
    ex. network of politicians on Twitter

-> you need a license key to avoid being thrown off
-> you can get free keys as well, but they are less powerful
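
The crawler idea from the web mining item above (follow links until no more data is found) can be sketched with the standard library only. This is not Pattern's own API; the `PAGES` dict is a made-up stand-in for the web so the sketch runs without a network connection or license key.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def crawl(pages, start):
    """Breadth-first crawl over an in-memory {url: html} dict."""
    seen, queue = [start], deque([start])
    while queue:
        url = queue.popleft()
        # Unknown pages count as empty, so the crawl stops when links run out.
        for link in extract_links(pages.get(url, "")):
            if link not in seen:
                seen.append(link)
                queue.append(link)
    return seen


# Hypothetical pages, for illustration only.
PAGES = {
    "a.html": '<p><a href="b.html">b</a> <a href="c.html">c</a></p>',
    "b.html": '<a href="a.html">back</a>',
    "c.html": '<a href="d.html">d</a>',
}
print(crawl(PAGES, "a.html"))
```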

>>> from pattern.web import Twitter
>>> from pattern.en import sentiment
>>> for tweet in Twitter(language="en").search("#NVA"):
...     print(tweet.text, sentiment(tweet.text))
classifier: can predict a category for a given 'thing' (text, image...)
ex. sentiment in text / face in a photograph

text classifiers: count words
some words occur in positive/negative tweets
machine learning classifiers do this work: no preconceived notions of what is pos/neg, this is learned from the data
-> not interested in word order anymore, but in how many times a word occurs

training document (data)-> classifier -> bag-of-words (frequency)
> words are 'features'
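
A minimal sketch of a bag-of-words, assuming plain word counts are the features. The function name and the example sentence are made up for illustration; Pattern has its own document model, which is not used here.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercased word occurs; word order is discarded."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


bag = bag_of_words("Good movie, really good acting")
print(bag["good"], bag["movie"])
```

Two texts with the same words in a different order give the same bag, which is exactly the "not interested in word order" point.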

you can also use 'bag of lemmas': did/done --> to do / better level of abstraction
you can look at word bigrams (2 consecutive words) or character trigrams (n-grams; efficient in authorship attribution), tokenization (Goed! = goed + !), average word length
-> whatever improves your system
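
The alternative features above can be sketched in a few lines. The helper names are hypothetical and the tokenizer is a simple regular expression, not Pattern's tokenizer.

```python
import re


def tokenize(text):
    """Split into words and separate punctuation marks: 'Goed!' -> 'Goed', '!'."""
    return re.findall(r"\w+|[^\w\s]", text)


def word_ngrams(tokens, n=2):
    """Consecutive runs of n tokens (n=2 gives word bigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3):
    """Consecutive runs of n characters (n=3 gives character trigrams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def average_word_length(tokens):
    """Mean length of the alphanumeric tokens; 0.0 if there are none."""
    words = [t for t in tokens if t.isalnum()]
    return sum(len(w) for w in words) / len(words) if words else 0.0


tokens = tokenize("Goed! Heel goed.")
print(tokens)
print(word_ngrams(tokens))
print(char_ngrams("goed"))
print(average_word_length(tokens))
```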

support vector machine
hyperplane: best possible separation between two groups
looks at the nearest data points (support vectors) to judge the margin
-> decisions can be based on a 10000-dimensional problem (a task for computers)
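
Once a hyperplane is known, classifying a document means checking on which side of it the document falls. The sketch below shows only that decision step for a linear classifier, not the SVM training that finds the hyperplane. The weights and bias are made-up numbers, with one dimension per word.

```python
def decision_value(weights, bias, features):
    """Signed score w.x + b; words without a weight contribute 0."""
    return sum(weights.get(word, 0.0) * count
               for word, count in features.items()) + bias


def classify(weights, bias, features):
    """Positive side of the hyperplane -> 'positive', otherwise 'negative'."""
    return "positive" if decision_value(weights, bias, features) >= 0 else "negative"


# Hypothetical weights, for illustration only (not learned from data).
weights = {"good": 1.5, "great": 2.0, "bad": -1.8, "boring": -1.2}
bias = -0.1

print(classify(weights, bias, {"good": 2, "boring": 1}))
print(classify(weights, bias, {"bad": 1}))
```

With a vocabulary of 10000 words the `weights` dict would simply have 10000 entries; the decision rule stays the same.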

measure system on data we've never seen before
set aside a small portion (10%) of the annotated training data = gold standard ("this is correct")
build classifier on remaining 90%
can compare the prediction of the classifier with the class assigned by the human annotator

10-fold cross validation:
    1000 tweets -> split in sets of 100
    train on 800 tweets + 100 as test set (10%) + 100 as validation set (10%) -> rotate the roles of these sets of 100 tweets
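
The rotation described above can be sketched as follows, assuming the data divides evenly into folds and has already been shuffled. The function name is hypothetical and the numbers 0..999 stand in for the 1000 tweets.

```python
def rotate_folds(items, k=10):
    """Yield (train, validation, test) splits, shifting the roles each round."""
    fold_size = len(items) // k  # leftover items are ignored in this sketch
    folds = [items[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for i in range(k):
        test_index = i
        validation_index = (i + 1) % k
        train = [item
                 for j, fold in enumerate(folds)
                 if j not in (test_index, validation_index)
                 for item in fold]
        yield train, folds[validation_index], folds[test_index]


tweets = list(range(1000))  # stand-ins for 1000 annotated tweets
for train, validation, test in rotate_folds(tweets):
    print(len(train), len(validation), len(test))
```

Every tweet ends up in the test set exactly once, so the final score can be averaged over the 10 rounds.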

compare to baseline scores
ex. tomorrow's weather: 'it will be the same as today' gives 50% accuracy
-> look at precision & recall
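
A minimal sketch of these scores, assuming gold-standard and predicted labels are given as two lists of equal length. The label lists are invented, and the baseline used here is "always predict the most frequent class".

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of items where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def precision_recall(gold, predicted, positive):
    """Precision and recall for one class, treated as the 'positive' class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def majority_baseline(gold):
    """Predict the most frequent gold label for every item."""
    most_common = Counter(gold).most_common(1)[0][0]
    return [most_common] * len(gold)


# Made-up labels, for illustration only.
gold = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]

print(accuracy(gold, predicted))
print(accuracy(gold, majority_baseline(gold)))
print(precision_recall(gold, predicted, "pos"))
```

In this toy example the classifier's accuracy equals the baseline's, which is why the baseline comparison matters.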

Some classifiers can build a decision tree from your data
you get insight into the structure of your data
-> they don't work as well as Support Vector Machines (which are black boxes, can use combinations of vectors in multiple ways)

Process of training
ex stockmarket - music reviews
for annotation you need to know the context --> information about annotation process? 
-> depends on complexity of classification problem
-> same document would be annotated by 2 different people; if there is disagreement, I will not use it as training data
-> link to such a protocol?
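
The rule of dropping documents the two annotators disagree on can be sketched as below. The document ids and labels are invented, and each annotator's work is assumed to be a dict from document id to label.

```python
def keep_agreed(labels_a, labels_b):
    """Keep only documents that both annotators gave the same label."""
    return {doc: label
            for doc, label in labels_a.items()
            if doc in labels_b and labels_b[doc] == label}


def agreement_rate(labels_a, labels_b):
    """Fraction of shared documents with identical labels."""
    shared = [doc for doc in labels_a if doc in labels_b]
    if not shared:
        return 0.0
    return sum(labels_a[doc] == labels_b[doc] for doc in shared) / len(shared)


# Hypothetical annotations, for illustration only.
annotator_a = {"t1": "left", "t2": "right", "t3": "left"}
annotator_b = {"t1": "left", "t2": "left", "t3": "left"}

print(keep_agreed(annotator_a, annotator_b))
print(agreement_rate(annotator_a, annotator_b))
```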

For annotator's protocol: "Best tasks are binary: yes/no"


1/ Download books from Gutenberg, in csv file
selection of sentences with author names -> insert non-identified sentence
-> good results
-> sentence is very little information to base authorship attribution on
best to balance the number of words per author; 1000-10000 words necessary

2/ left-right sentiment in tweets
a lot of disagreement between annotators (also for irony, 'I will bomb the airport', CIA has a lot of problems with it)

3/ Bible/Koran/Torah as training data
fed in speeches of Obama (Torah) / Osama (Koran) / Malcolm X (Bible)

4/ sentiment analysis on SMS messages sent on 9/11

5/ Wikipedia
had csv issues, the API from Pattern doesn't give the edit history
does it find the user who wrote a WP article?
'war on terror' has more than 10000 edits, going back to 2002 
-> wanted to run sentiment analysis, see if subjective edits have been removed

6/ datasets of contest
discussed the methods
understand annotation context

M: Abstracts analysis has been taken over by Pharmaceutical Company
Software to find lies & pedophiles, sounds usable for marketing proposals
-> research projects have commercial use
is this discussed in academia?

-> Fundamental research (most interesting), no applications until 200 years from now
-> Applied research (no interest from industry, interest for society)
-> Industry commands

Ex. AMICA: 2nd stage, application for society
can use same technology for marketeer project, but not use same data/code
-> they rebuild the technology from scratch, using very different data / for the AMICA project they signed a non-disclosure agreement for data/code & they are not developing a 'product' that can be plugged in
-> no transfer of public money to industrial market

What is good for society?
as if tools were aimed at one specific & unique function (which is not the case)
ethical advisory board follows up their technical development; focus group discussions with youngsters/peers...
ex gmail: "there is something reading along"

Application of model on relationships - if you do not conform to this model, then you're reported to the moderator
model = subjective, based on precise choices that are not always clear
-> each difference, anything that does not conform to the model, will disappear -> leads to normalisation/formatting
-> no neutral technology, will flatten everything out