Other software: R (not good at tokenization), SAS (not free), Python
Python
* easy to learn
* elegant syntax
* fun for string manipulation
* lots of libraries: sklearn (classifiers for machine learning), pandas (data analysis)
Pattern
* web mining: Google, Twitter, FB (doesn't break into anyone's profile); HTML parser (extracts the relevant parts of a web page); crawler (follows links until no more data is found)
* natural language processing: POS tagging, sentiment analysis (en, nl, fr, it, es) -> string classification
* machine learning: SVM (Support Vector Machines), neural networks (trying to mimic the human brain)
* web apps & visualisation, e.g. a network of politicians on Twitter
Submodules: pattern.web -> needs a license key to avoid being thrown off -> you can get free keys as well, though less powerful
example:
>>> from pattern.web import Twitter
>>> from pattern.en import sentiment
>>> for tweet in Twitter(language="en").search("#NVA"):
...     print(tweet.text, sentiment(tweet.text))
(sentiment() returns a (polarity, subjectivity) pair)
A classifier can predict the category for a given 'thing' (text, image, ...), e.g. sentiment in a text / a face in a photograph
text classifiers: count words
* some words occur in positive/negative tweets
* machine learning classifiers do this work / no preconceived notions of what is pos/neg; it is retrieved from the data
-> no longer interested in word order, but in how many times a word occurs
training documents (data) -> classifier -> bag-of-words (word frequencies) -> words are 'features'
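A minimal sketch of the bag-of-words idea above, using only the standard library (the tokenization by whitespace is a simplification for illustration):

```python
from collections import Counter

def bag_of_words(document):
    """Turn a document into a bag-of-words: word order is discarded,
    only the frequency of each word (= feature) is kept."""
    tokens = document.lower().split()
    return Counter(tokens)

features = bag_of_words("Good movie really good acting")
# 'good' counts twice because case is folded; order plays no role
```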
you can also use a 'bag of lemmas': did/done --> to do / a better level of abstraction
you can look at word bigrams (2 consecutive words) or character trigrams (n-grams) (efficient in authorship attribution), tokenization (Goed! = goed + !), average word lengths -> whatever improves your system
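The n-gram features mentioned above can be sketched in a few lines of plain Python:

```python
def char_ngrams(text, n=3):
    """Character n-grams: overlapping slices of n characters.
    Character trigrams (n=3) are efficient features for authorship attribution."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_bigrams(text):
    """Word bigrams: pairs of 2 consecutive words."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

char_ngrams("goed!")            # ['goe', 'oed', 'ed!']
word_bigrams("whatever improves your system")
```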
support vector machine
hyperplane: the best possible separation between two groups; looks at the nearest neighbours to judge the error margin -> decisions can be based on 10,000 dimensions: a hard problem for computers
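A toy sketch of an SVM text classifier with scikit-learn (the sklearn library mentioned above); the four training documents and their labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# tiny invented training set: bag-of-words features, two classes
docs = ["good great fun", "wonderful great movie",
        "bad terrible boring", "awful bad acting"]
labels = ["pos", "pos", "neg", "neg"]

vectorizer = CountVectorizer()           # word counts as features
X = vectorizer.fit_transform(docs)
clf = LinearSVC().fit(X, labels)         # finds the separating hyperplane

prediction = clf.predict(vectorizer.transform(["great fun movie"]))
```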
Evaluation
* measure the system on data it has never seen before
* set aside a small portion of annotated training data = gold standard ("this is correct")
* build the classifier on the remaining 90%
* compare the predictions of the classifier with the class assigned by the human annotator
10-fold cross-validation: 1000 tweets -> split into 10 sets of 100; train on 800 tweets + 10% test set + 10% validation set -> shift the roles of these sets of 100 tweets
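The cross-validation loop above is what scikit-learn's `cross_val_score` automates; a sketch on deliberately trivial, invented data (20 duplicated tweets, so each of the 10 folds is perfectly separable):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# invented, trivially separable data: 10 positive + 10 negative "tweets"
docs = ["good great fun"] * 10 + ["bad terrible boring"] * 10
labels = ["pos"] * 10 + ["neg"] * 10

X = CountVectorizer().fit_transform(docs)
# cv=10 -> 10 folds: train on 18 documents, test on 2, shift roles
scores = cross_val_score(LinearSVC(), X, labels, cv=10)
mean_accuracy = scores.mean()
```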
compare to baseline scores, e.g. tomorrow's weather: 'it will be the same as today' already gives 50% accuracy -> also look at precision & recall
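The point about baselines can be made concrete: a majority-class baseline can score well on accuracy while precision and recall expose it. A small stdlib sketch (the weather data is invented):

```python
def precision_recall(gold, predicted, target):
    """Precision and recall for one target class."""
    tp = sum(1 for g, p in zip(gold, predicted) if p == target and g == target)
    fp = sum(1 for g, p in zip(gold, predicted) if p == target and g != target)
    fn = sum(1 for g, p in zip(gold, predicted) if p != target and g == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# baseline: always predict 'rain' ("same as today")
gold = ["rain", "rain", "sun", "sun", "rain", "sun"]
baseline = ["rain"] * len(gold)
# accuracy is 50%, but the baseline never finds 'sun':
precision_recall(gold, baseline, "sun")   # (0.0, 0.0)
```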
Some classifiers can shape a decision tree from your data: you get insight into the structure of your data -> but they don't work as well as Support Vector Machines (black boxes, which can use combinations of features in multiple ways)
Process of training, e.g. stock market - music reviews
for annotation you need to know the context --> information about the annotation process? -> depends on the complexity of the classification problem
-> the same document would be annotated by 2 different people - if there is disagreement, I will not use it as training data -> link to such a protocol?
For annotator's protocol: "Best tasks are binary: yes/no"
1/ Download books from Gutenberg; in a csv file a selection of sentences with author names -> insert a non-identified sentence -> good results -> a sentence is very little information to base authorship attribution on; best to balance the amount of words per author, 1000-10000 words necessary
2/ left-right sentiment in tweets: a lot of disagreement between annotators (also for irony, e.g. 'I will bomb the airport' - the CIA has a lot of problems with it)
3/ Bible/Koran/Thora as training data, fed the speeches of Obama (Thora) / Osama (Koran) / Malcolm X (Bible)
4/ sentiment analysis on sms sent on 9/11
5/ Wikipedia: had csv issues; the API from Pattern doesn't give the edit history - does it find the user who wrote the WP article? 'war on terror' has more than 10,000 edits, going back to 2002 -> wanted to run sentiment analysis to see if subjective edits have been removed
6/ datasets of a contest: discussed the methods, understand the annotation context
Discussion
M: abstract analysis has been taken over by a pharmaceutical company; software to find lies & paedophiles sounds usable for marketing purposes -> research projects have commercial uses; is this discussed in academia?
-> fundamental research (most interesting; no applications until 200 years from now)
-> applied research (no interest from industry, but of interest for society)
-> research commissioned by industry
Ex. AMICA: 2nd stage, an application for society; you can use the same technology for a marketing project, but not the same data/code -> they rebuild the technology again: use very different data / for the AMICA project they signed a non-disclosure agreement for data/code & they are not developing a 'product' that can be plugged in -> no transfer of public money to the industrial market
What is good for society? as if tools aim at a specific & unique function (which is not the case); an ethical advisory board follows up on their technical development; focus group discussions with youngsters/peers... e.g. Gmail: "there is something reading along"
Application of a model to relationships - if you do not conform to this model, then you're reported to a moderator; the model = subjective, based on precise choices that are not always clear -> every difference, everything non-conforming to the model, will disappear -> leads to normalisation/formatting -> no neutral technology; it will flatten everything out