Guy De Pauw, CLiPS, Antwerp
Tuesday 20 Jan 2015

Datasets available:
(in Dutch)

Datasets need to be prepared. Look at the data with the task in mind.
1. Sentiment mining
2. Age prediction
3. Gender prediction
4. Personality prediction
5. Level of education prediction
(where's number 6, mystery... error detection ;)
7. Deception detection
8. Authorship attribution
9. ...

Data annotation:
Annotation used to build classifiers for (from? for?) raw data.

Python packages allow one to get into text mining very quickly. It's all well-documented online.

Evaluation: How do you know it is working? Testing is important.
(esp. when you make changes, how do you know if it gets "better")
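
A minimal sketch of what "testing" could look like for a classifier: compare predicted labels with hand-annotated (gold) labels on held-out data and compute accuracy, precision and recall. The function name `evaluate` and the toy labels are assumptions for illustration, not something shown in the talk.

```python
def evaluate(gold, predicted, positive):
    """Compare predicted labels with gold labels for one 'positive' class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    correct = sum(g == p for g, p in pairs)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy data (invented): gold annotations vs. classifier output.
gold = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]
result = evaluate(gold, predicted, positive="pos")
```

Running the same evaluation before and after a change, on the same held-out data, is one way to tell whether the change made things "better".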

"no result can also be a result"

At 17:00 we do Pecha Kucha short presentations: 20 slides, 20 seconds each
End with: Beyond the word ... thought provoking other areas that need another week

Psycholinguistics grounded in love for language, not for computers

Knowledge from Data (Chomsky on linguistics? might not be necessary)

Rules from data.

African language learning/applying techniques using technologies 

Why do computers have problems using language?
Elementary, my dear Watson!

Fresh Prince: Google Translated - https://www.youtube.com/watch?v=LMkJuDVJdTw 
Computers have no concept of the world. 'world knowledge'  computers are the new hiphop

IBM's Watson

"we'll be laughing at language prediction/technology(?) for a few more decades"

Example: grammatically incorrect, but spelled correctly:
It plane lee marks four my revue 
Miss steaks aye can knot sea.

Natural Language processing is taking off

People generate large amounts of text; the aim is to take this (unstructured) language data and look at it in a structured way.

* Most information is in unstructured data (text)
* Most data is in digital form
* Big Data (too big to handle with conventional means)
* Accuracy levels need to be raised.

Fundamental Problems:
how to go from form to meaning?    
how do we represent meaning?

Three levels of knowledge:

Objective knowledge linking to ontologies
Example of automatic linking to an ontology (eg Wikipedia) of the "Who, what where, when, ..." -- so-called objective knowledge within a text ("non-disputable" text)
text analytics:
'named entity recognition': knows that Liechtenstein is name of a place
time-attributes : "former", "later", "had", "in 1249", (donat)"ed"..

(curious: the subject "the German Army" is categorized / linked as abstract language, or to a particular reference / historical notion of "the German army" situated in history... or merely as "noun phrase")

Subjective knowledge
For about 10 years now, with the presence of social media
reveals the opinion of the author of the text

examples: unique, most interesting...

Authorship, author attributes (style, gender, time) = profiling
2 camps
* content words: what are people talking about
* function words: the, and, on,... -> can be indicative of someone's profile (age/gender); their use falls beyond the conscious control of the author
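
A minimal sketch of the function-word idea: turn a text into relative frequencies of a few small words, which could then serve as features for a profiling classifier. The word list is a tiny assumed sample, and the name `function_word_profile` is hypothetical; real systems use much longer, language-specific lists.

```python
import re
from collections import Counter

# Assumed, deliberately tiny function-word list (English), for illustration only.
FUNCTION_WORDS = ["the", "and", "on", "a", "of", "i", "you", "she", "he", "it"]

def function_word_profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in FUNCTION_WORDS}

profile = function_word_profile("She told him that she loved the cat and the dog")
```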

Slide : Fig3. Words, phrases, and topics most highly distinguishing females and males
source?? from social media (maybe tweets)
factors : correlation strength, relative frequency,  (can't read the right one)
slide source: http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0073791&representation=PDF

TF-IDF (terms that are prominent in one category but not in another). Looking for differences.
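
A minimal sketch of TF-IDF over two categories, assuming each category is treated as one bag of tokens: term frequency within the category times log(number of categories / number of categories containing the term). Words shared by every category score 0; words specific to one category score higher. The function name `tf_idf` and the toy texts are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict mapping a label to a list of tokens. Returns label -> {term: score}."""
    n_docs = len(docs)
    df = Counter()  # in how many categories does each term occur?
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for label, tokens in docs.items():
        tf = Counter(tokens)
        scores[label] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

# Toy data (invented): two categories that share some words.
docs = {
    "a": "the match was great the team won".split(),
    "b": "the dress was great the shop closed".split(),
}
scores = tf_idf(docs)
```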

How to build a classifier for gender. 

"Gender" is a matter of small words
relational vs informational language:
women use pronouns
men use determiners and quantifiers
(Proposal of an alternative approach looking at helping words as an indicator of style)

'this is not exact science, classifier will always make mistakes' :-)
(what counts as a mistake?) 

Language is always ambiguous


Pretty little girls' school (more than six possible meanings; who's pretty? --> computers have no context)

She told him that she loved him
She only told him that she loved him
She told only him that she loved him

"co-reference resolution": what does 'they' refer to?
The mayors prohibited the students from demonstrating, because they preached the revolution / because they feared violence

From form to meaning
Text input -> Meaning output
% refers to accuracy

* Tokenization
Identifying meaningful units in a text
works well in Pattern

* Lemmatization (98%)
reducing word forms to their dictionary item (is, been, was, be => to be), also reduction of plurals to singular

* Part of speech taggers -> 2 out of 100 words will be incorrect/works well
(Determiner, Noun, Adjective)... Nouns are particularly interesting for 'objective' knowledge, adjectives for 'subjective' knowledge (e.g. sentiment).

* Shallow Parsing, Modality/negation (95%)
Identifying subjects; who does what to who
* Word Sense Disambiguation (70%)
Bank, Can, ...
* Semantic Role Labeling (65%)

* Named-Entity Recognition
Persons, locations

* Co-reference Resolution (50% = dramatic)
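
A minimal sketch of the first two steps of the pipeline above (tokenization and lemmatization), written with the standard library only. This is not how Pattern does it; the regular expression, the tiny lookup table `LEMMAS` and the naive plural rule are assumptions for illustration, and the plural rule is wrong for words like "this".

```python
import re

# Hypothetical lookup table; real lemmatizers rely on large lexicons and rules.
LEMMAS = {"is": "be", "was": "be", "been": "be", "are": "be"}

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def lemmatize(token):
    """Map a token to a dictionary form: lookup table first, then a naive plural rule."""
    word = token.lower()
    if word in LEMMAS:
        return LEMMAS[word]
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]  # naive: "mayors" -> "mayor"
    return word

tokens = tokenize("The mayors feared violence.")
lemmas = [lemmatize(t) for t in tokens]
```

Each later step (tagging, parsing, co-reference) builds on the output of the earlier ones, so errors made here propagate further down the pipeline.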

GOAL: mine knowledge from text

How can we represent meaning?

Textmining = shallow understanding
look at specific types of information

www.biograph.be
"BioGraph provides a web service for discovery of biomedical relations and exploring functional hypotheses."

Problem: there are too many subfields in biomedicine; nobody can be aware of all of them.
With text mining, pharmaceutical companies can now keep up with new scientific discoveries.

Definition of text-mining (Marti ...) that describes it as creating *new knowledge* (not just extracting existing knowledge).
(Strange blurring of lines between making an abstract and creation of new knowledge; related is a self-fulfillment of a shallow concept of knowledge and the sense of "ever more knowledge" being created / information overload )

A hot topic: Deception

"opinion spam", e.g. on TripAdvisor

Personality affects success in deception ;-)


First one = SPAM or second one = SPAM?
The language of lies
(in what context?)

Liars use 
- fewer exclusive words. (but, except, without, exclude)
- fewer self- and other-references
- fewer tentative words
- fewer time-related words
- more space-related words
- more negative words
- more negations
- more motion verbs
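
A minimal sketch of how cues like those in the list above could be turned into numbers: count how often words from each cue category occur, relative to text length. Only the "exclusive" words come from the slide; the other word lists, the name `cue_rates` and the example sentence are assumptions for illustration. A rate on its own says nothing about deception; it would only be one feature among many for a classifier.

```python
import re

CUES = {
    "exclusive": {"but", "except", "without", "exclude"},  # from the list above
    "negation": {"no", "not", "never"},                    # assumed word list
    "self_reference": {"i", "me", "my", "mine"},           # assumed word list
}

def cue_rates(text):
    """Return, per cue category, the share of tokens that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        name: sum(token in words for token in tokens) / total
        for name, words in CUES.items()
    }

rates = cue_rates("I never stayed there, but my room was not clean")
```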

paper about lies:

How much influence would context have on 'lies'? (what counts as a lie when and where?)

The question of whether 400 true positive reviews from TripAdvisor may be correctly taken as true (how to verify?) And what counts as positive? And what is false? Or: what type of falseness are we revealing?

"Explorative deception experiment"
      false like      false dislike
(question of why write reviews about imaginary subjects?) -> check paper

Spam classifiers do the same as 'text categorisation'
collection of spam mail + collection of real mail
machine learning finds which information is specific to each class
-> gives you a classifier
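
A minimal sketch of the recipe above (a collection of spam + a collection of real mail -> classifier). The notes do not say which learning algorithm was meant; a multinomial naive Bayes with add-one smoothing is used here as an assumption because it fits in a few lines. The function names and the toy mails are invented.

```python
import math
from collections import Counter

def train(labelled_docs):
    """labelled_docs: list of (label, text). Returns per-class word counts and doc counts."""
    word_counts, doc_counts = {}, Counter()
    for label, text in labelled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the class with the highest log-probability under naive Bayes."""
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)  # class prior
        n_words = sum(counts.values())
        for word in text.lower().split():
            # add-one smoothing so unseen words do not zero out the score
            score += math.log((counts[word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (invented).
training = [
    ("spam", "win money now"),
    ("spam", "cheap money offer"),
    ("ham", "meeting at noon"),
    ("ham", "lunch at noon tomorrow"),
]
word_counts, doc_counts = train(training)
label_1 = classify("cheap money", word_counts, doc_counts)
label_2 = classify("meeting tomorrow at noon", word_counts, doc_counts)
```

The same two functions work for any text categorisation task; only the labels and the training collections change.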

'Amica' Automatic Monitoring for Cyberspace Application
-> detecting grooming by pedophiles
Collected stories Philip K. Dick (Minority Report: PreCrime) 
--> this link does not work!?

Paedophile detection
Grooming detector
profile information vs writing style (function words, pronouns...) -> suspicious profiles sent to moderator
(The example of a 50-year-old man with the profile of a 14-year-old girl)
What if a pedophile learns to write like a 14-year-old girl?
'machine learning techniques: we don't really know what is going on'
(Whether you're talking about Justin Bieber or Mozart, it doesn't matter; it's how you speak about it)
(Question of oversight -- if the techniques are so opaque, how are they evaluated ... )

Being aware of being a target. Obfuscation tools for pedophiles (ref. AdNauseam)?

Who cares about moderation?

Ref: Facebook moderation on pornography (doesn't take motivation into account at all)
Problematics as semantic possibilities drift into morality

"the creation of new knowledge"
creating a flag vs warning that someone is a pedophile...
collaboration with sociologists
are pedophiles stakeholders in this development process?

is this technology desirable?
is this technology possible?
--> keep this friction for a working group in the coming days?

Trend of surveillance technologies justified by hot-button topics - terrorism, paedophilia etc but ultimately (covertly?) directed towards marketing/commercial uses.

"Deep Blue Optimism"
IBM chess computer beat Kasparov

capacity vs capability
capacity does not equal capability

Splitting etherpad --> follow :
[[CLipS part 2]]
(can't we split to a CLiPS part 2 etherpad for clarity?)