Morning discussion: Week 2
Wednesday 21 Jan

The Bible works:
how changing the training set changes the result
a simple CSV with all the text.. (the Bible, ALL OF THE BIBLE)

--> checking whether the speeches of Obama/Osama/Malcolm X correlate with the style of the Bible/Torah/Quran
[[info on classifiers and parsers]]

if you change SVM() to KNN(), suddenly everybody is in the Torah.
'bag of words' is not necessarily about words but about features, which can be anything (e.g. when you look at images)
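A minimal sketch of this experiment in pattern.vector, the library used here; the filenames are made up, and swapping SVM() for KNN() reproduces the everybody-in-the-Torah effect:

    from pattern.vector import Document, SVM, KNN

    # train one labelled Document per scripture (hypothetical local text files)
    classifier = SVM()  # try: classifier = KNN()
    for label, path in (('bible', 'bible.txt'),
                        ('torah', 'torah.txt'),
                        ('quran', 'quran.txt')):
        classifier.train(Document(open(path).read(), type=label))

    # which scripture is a speech stylistically closest to?
    speech = open('speech.txt').read()  # e.g. a Malcolm X speech
    print(classifier.classify(Document(speech)))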


Other example: identifying the authors of Gutenberg texts
same code, but the texts were broken up into sentences; did not get to results
Roel is working on utopian & dystopian literature (with the aim of checking the business model of CLiPS): same approach, but the result gives 'None'


Wikipedia history
classify authorship -> text / date / handle of the username
predict who wrote an article after analyzing the edits of the different authors.
policing and profiling mentality in algorithms: is the Wikipedia text NPOV ( http://en.wikipedia.org/wiki/Wikipedia:Neutral_point_of_view ) or not?
automatic bot to normalize wiki articles.
Maybe more interesting to train based on ADDED vs. REMOVED (a police bot / spam bot)
Qualify the censors
Also: modality / sentiment -- could you question the NPOV (neutrality) of Wikipedia text?
the NPOV page of Wikipedia has 0.3 neutrality sentiment! :)
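A quick sketch of that measurement, assuming pattern.web's Wikipedia wrapper and pattern's lexicon-based sentiment(), which returns polarity and subjectivity:

    from pattern.web import Wikipedia
    from pattern.en import sentiment

    # fetch the NPOV policy page and score its full text
    article = Wikipedia().search('Wikipedia:Neutral point of view')
    polarity, subjectivity = sentiment(article.plaintext())
    print(polarity, subjectivity)  # the 0.3 quoted above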


Getting sentimental
-> real-time writing tool that grades your sentimentality! 'it seems you're getting a bit sentimental'

'sentiment' is based on a simple annotated dictionary
e.g. with 'Mussolini':
'Benito Mussolini' is 0.26 subjective.
-> sentiment / positiveness analysis of a certain Wikipedia article
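A toy version of that writing tool, using pattern's subjectivity score; the 0.5 threshold is an arbitrary choice for the sketch:

    from pattern.en import sentiment

    while True:
        line = input('> ')  # raw_input on Python 2
        polarity, subjectivity = sentiment(line)
        if subjectivity > 0.5:
            print("it seems you're getting a bit sentimental (%.2f subjective)" % subjectivity)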


Kafka's "The Trial"

ex. search for 'free'
'K was living in a free country'
'free' is 0.8 positive.. and 0.666 objective; that's how the sentences are categorized...
-> the rating is between -1 and 1 for subjective - objective / negative - positive
-> has been analysed in the context of the sentences in which the word 'free' occurs
-> depends on who the annotators are... the scores show their perspective; the analysis is only as good as the protocol for determining sentiment
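A sketch of the 'free' search: split the novel into sentences with pattern's tokenizer, keep those containing 'free', and print the per-sentence scores (trial.txt is a hypothetical local copy of the novel):

    from pattern.en import tokenize, sentiment

    for sentence in tokenize(open('trial.txt').read()):
        if 'free' in sentence.lower():
            polarity, subjectivity = sentiment(sentence)
            print(polarity, subjectivity, sentence)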

what kind of knowledge is produced with what kind of tool


the annotation problem of the PAN database
based on necessary averaging.
(it's the question of what kind of sentiment you are looking at while annotating.) 
how can you show this problem in an exercise?
how can we open up this annotation (evaluation) process?
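A toy illustration of the averaging problem, with invented scores: three annotators can disagree completely and still average out to 'neutral':

    scores = {'annotator A': 0.9, 'annotator B': -0.8, 'annotator C': 0.0}

    mean = sum(scores.values()) / len(scores)
    spread = max(scores.values()) - min(scores.values())
    print('mean: %.2f' % mean)      # looks almost neutral
    print('spread: %.2f' % spread)  # but the annotators barely agree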

tools for anti-harassment, trying to find out who the author is
what counts as harassment? to whom?
a practical example that would show the absurdity of defining what counts as hurtful statistically.
-> depends on who is speaking in what context - not generizable
-> the tool lack an awareness of context

'there is no free lunch' ('for each aim you have to construct your own classifier') vs. 'neutrality, objectivity'
normality gets reinforced
make the annotator's point of view clear
-> by doing this, accept you're living in a world that is statistically controlled

In the README on the data/training data, no date is mentioned
-> points of view depend on the news context
-> what if we use the same data in 10 years? it still carries authority
(How could annotation data sets carry something of their construction?)

examples of exercises:
annotation process that creates links back to context
so you can unpack how we got there; the data reveals these traces

ex. start with the positive sentences, finish with the negative ones -> rereading a text, looking at the meaning (see the sketch below)
showing the normative structure
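A sketch of that rereading exercise: reorder a text from its most positive to its most negative sentence, as pattern's polarity lexicon sees it (input.txt is hypothetical):

    from pattern.en import tokenize, sentiment

    sentences = tokenize(open('input.txt').read())
    for sentence in sorted(sentences, key=lambda s: sentiment(s)[0], reverse=True):
        print(sentence)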


-> make a system that lets people watch a sitcom & comment, in % of being upset
Speaking in bags of words

-> find most positive sentence on the internet
(or generate this most positive sentence)
The most positive sentence in J.G. Ballard's CRASH / "Why I want to fuck Ronald Reagan" http://sensitiveskinmagazine.com/ronald-reagan/
the first one: He saw Reagan in a complex rear-end collision , dying a stylized death that expressed Vaughan 's obsession with Reagan 's genital organs , like his obsession with the exquisite transits of the screen actress 's pubis across the vinyl seat covers of hired limousines.
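The same trick reduced to a single winner, which is presumably how a sentence like the one above gets picked out (crash.txt is a hypothetical local copy of the text):

    from pattern.en import tokenize, sentiment

    sentences = tokenize(open('crash.txt').read())
    # the sentence with the highest polarity score
    print(max(sentences, key=lambda s: sentiment(s)[0]))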

-> revisiting harassment mining
annotators vs harassers and harassed

-> understanding algorithms & showing their limitations
-> writing tools
ex autocomplete - generalising/randomising
Generating simple english from a "normal" input
-> real-time writing tool that grades your sentimentality! 'it seems you're getting a bit sentimental/paedophile/lying/Kafkaesque... '


-> Feedback loops, reverse engineering
To attain the totally positive text / totally subjective text (playing on the reductivity -- what remains when constrained)
Example of the ESP game (image annotation that converges on common descriptions); it is generative: the reduction machine becomes a production machine
another ex. with images/art history : http://algorithmicarthistory.tumblr.com/
Critique of 'hidden layers' in neural networks
http://cs.stanford.edu/people/karpathy/deepimagesent/ // image recognition through neural networks 
http://arxiv.org/pdf/1412.1897v1.pdf // how the unknown deeper layers of neural networks can be somehow tricked

mh feedback result: This constabulary survey documents the findings of the vicious probe into an allegation made by Mohamed aluminium Fayed of Dixieland to death penalty the Princess of Cambria and his boy Dodi Al Fayed .

-> Vocabulary styles: automatic profanities.. automatic academics as well?
lexicons in Pattern:
academic
profanity
1000 basic English
they needed a list of profanities for the paedophilia project..
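A sketch of the lexicon idea: flag the words of a text that fall outside a chosen word list (basic_english.txt is a hypothetical one-word-per-line lexicon; a profanity or academic list would work the same way):

    lexicon = set(w.strip().lower() for w in open('basic_english.txt'))

    text = "The classifier reinforces normality"
    for word in text.lower().split():
        if word not in lexicon:
            print(word)  # candidate for flagging / simplification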

-> Paul Otlet's text - The Time Machine
at the time seen as progressive and world-changing.
seen now as racist, colonialist, paternalist.
the classifiers have clearly changed.
-> how to follow the classifiers across time
compare with other texts of that time, to see where it is progressive, to put it in context
time-machine: is this 
'how to make the experiment fail interestingly'
'not only use bag of words, but also associations between words'

----> http://www.clips.ua.ac.be/cgi-bin/stylenedemo.html

-> Writing tool, sentiment as a reader
Different sources, responses to it
1984 and its influence ... fear/dystopia as an idea
and how people nowadays use it to refer to it, as a reading feedback loop
-> there is a norm for sentiment
-> compare it to the feeling you have when you read/write