Guy De Pauw, CLiPS, Antwerp, Tuesday 20 Jan 2015
Datasets available: (in Dutch)
Datasets need to be prepared. Look at the data with the task in mind. Tasks:
1. Sentiment mining
2. Age prediction
3. Gender prediction
4. Personality prediction
5. Level of education prediction
(where's number 6, mystery... error detection ;)
7. Deception detection
8. Authorship attribution
9. ...
Data annotation: Annotation is used to build classifiers for (from? for?) raw data.
Python packages allow one to get into text mining very quickly. It's all well-documented online.
Evaluation: How do you know it is working? Testing is important (esp. when you make changes: how do you know if it gets "better"?)
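One way to check whether a change made things "better" is to score the old and the new version on the same held-out, hand-labelled test set. A minimal sketch, assuming made-up labels and a hypothetical helper `accuracy`; nothing here comes from the talk itself:

```python
def accuracy(predictions, gold):
    """Share of predictions that match the hand-annotated (gold) labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Made-up labels, for illustration only.
gold = ["pos", "neg", "pos", "neg", "pos"]
before = ["pos", "pos", "pos", "neg", "neg"]  # output of the old version
after = ["pos", "neg", "pos", "neg", "neg"]   # output after a change

score_before = accuracy(before, gold)
score_after = accuracy(after, gold)
print(score_before, score_after)
```

Because both versions are scored on the same test items, the two numbers are directly comparable; an unchanged score is also informative.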
"no result can also be a result"
At 17h we do Pecha Kucha short presentations: 20 slides, 20 seconds each. End with: Beyond the word ... thought-provoking other areas that need another week
Psycholinguistics grounded in love for language, not for computers
Knowledge from Data (Chomsky on linguistics? might not be necessary)
Spellcheckers (what did you mean?) do not look beyond the word in a dictionary
Speech synthesis (text to speech)
IBM's Watson, Siri
"we'll be laughing at language prediction/technology(?) for a few more decades"
Example, grammatically incorrect but with every word spelled correctly: It plane lee marks four my revue Miss steaks aye can knot sea.
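A minimal sketch of why a checker that only looks words up in a dictionary accepts this sentence. The word list `TOY_DICTIONARY` and the helper `unknown_words` are made up for illustration; a real spellchecker would use a full lexicon:

```python
import re

# Toy word list (an assumption for illustration, not a real lexicon).
TOY_DICTIONARY = {
    "it", "plane", "lee", "marks", "four", "my", "revue",
    "miss", "steaks", "aye", "can", "knot", "sea",
}

def unknown_words(sentence, dictionary=TOY_DICTIONARY):
    """Return the words that a word-by-word dictionary lookup would flag."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in dictionary]

sentence = "It plane lee marks four my revue Miss steaks aye can knot sea."
flagged = unknown_words(sentence)
print(flagged)  # nothing is flagged: every word exists on its own
```

Catching these errors requires looking at the neighbouring words, not just at each word in isolation.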
Natural Language Processing is taking off
"Gisting": people generate large amounts of text; take this (unstructured) language data and look at it in a structured way
Issues
Possibilities
* Most information is in unstructured data (text)
* Most data is in digital form
* Big Data (too big to handle with conventional means)
Problems
* Accuracy levels need to be raised.
Fundamental Problems: how to go from form to meaning? how do we represent meaning?
Three levels of knowledge:
Objective knowledge linking to ontologies
Example of automatic linking to an ontology (e.g. Wikipedia) of the "who, what, where, when, ..." -- so-called objective knowledge within a text ("non-disputable" text)
Text analytics:
* 'named entity recognition': knows that Liechtenstein is the name of a place
* time attributes: "former", "later", "had", "in 1249", (donat)"ed", ...
(curious: is the subject "the German Army" categorized / linked as abstract language, or to a particular reference / historical notion of "the German army" situated in history... or merely as a "noun phrase"?)
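The two kinds of annotation mentioned above (place names and time attributes) can be sketched with a lookup table and a regular expression. This is a toy under stated assumptions: `GAZETTEER`, `TIME_WORDS`, `annotate` and the example sentence are all made up for illustration, and real systems link entities to an ontology rather than to a hand-written table:

```python
import re

# Toy lookup table (an assumption; stands in for an ontology such as Wikipedia).
GAZETTEER = {"Liechtenstein": "PLACE"}
# Time cues taken from the notes: "former", "later", and year phrases like "in 1249".
TIME_WORDS = {"former", "later"}
YEAR_PATTERN = re.compile(r"\bin \d{4}\b")

def annotate(text):
    """Return (surface form, label) pairs found by lookup and pattern matching."""
    annotations = []
    for name, label in GAZETTEER.items():
        if name in text:
            annotations.append((name, label))
    for match in YEAR_PATTERN.finditer(text):
        annotations.append((match.group(0), "TIME"))
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() in TIME_WORDS:
            annotations.append((word, "TIME"))
    return annotations

# Hypothetical sentence, not taken from the talk.
example = "The castle in Liechtenstein was later donated in 1249."
found = annotate(example)
print(found)
```

A lookup like this cannot answer the question raised above about "the German Army": it only labels strings it already knows.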
Subjective knowledge: for the past 10 years, the presence of social media has revealed the opinion of the author of a text
examples: unique, most interesting...
Meta-knowledge
Authorship, author attributes (style, gender, time) = profiling
2 camps
* content words: what are people talking about
* function words: the, and, on, ... -> can be indicative of someone's profile (age/gender); they fall beyond the conscious control of the authors
"Gender" is a matter of small words: relational vs informational language. Women use pronouns; men use determiners and quantifiers. (Proposal of an alternative approach looking at helping words as an indicator of style)
'this is not exact science, classifier will always make mistakes' :-) (what counts as a mistake?)
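The function-word idea above can be sketched as a small feature extractor: count how often pronouns, determiners and quantifiers occur relative to all tokens. The word lists and the helper `function_word_profile` are assumptions for illustration, not the lists used in the talk, and the resulting rates are features for a classifier, not a verdict:

```python
import re
from collections import Counter

# Small illustrative word lists (assumptions). Ambiguous words such as
# "that" are left out because a plain word list cannot tell their uses apart.
PRONOUNS = {"i", "you", "he", "she", "we", "they", "me", "him", "her", "us", "them"}
DETERMINERS = {"the", "a", "an", "these", "those"}
QUANTIFIERS = {"some", "many", "few", "all", "most", "several", "every"}

def function_word_profile(text):
    """Relative frequency of each function-word group among all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        "pronouns": sum(counts[w] for w in PRONOUNS) / total,
        "determiners": sum(counts[w] for w in DETERMINERS) / total,
        "quantifiers": sum(counts[w] for w in QUANTIFIERS) / total,
    }

profile = function_word_profile("She told him that she loved him")
print(profile)
```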
Pretty little girls' school (over six possible meanings; who's pretty? --> computers lack context)
She told him that she loved him
She only told him that she loved him
She told only him that she loved him
etc.
Inference, "co-reference resolution": what does 'they' refer to? The mayors prohibited the students from demonstrating, because they preached the revolution / they feared violence
From form to meaning. Pipeline: Text input -> Meaning output (% refers to accuracy)
* Tokenization: identifying meaningful units in a text; works well in Pattern
* Lemmatization (98%): reducing word forms to their dictionary item (is, been, was, be => to be), also reduction of plurals to singular
* Part-of-speech taggers (Determiner, Noun, Adjective, ...): 2 out of 100 words will be incorrect, so this works well. Nouns are particularly interesting for 'objective' knowledge, adjectives for 'subjective' knowledge (e.g. sentiment).
* Shallow Parsing, Modality/negation (95%): identifying subjects; who does what to whom
* Word Sense Disambiguation (70%): Bank, Can, ...
* Semantic Role Labeling (65%)
* Named-Entity Recognition: persons, locations
* Co-reference Resolution (50% = dramatic)
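The first two steps of the pipeline can be sketched with the standard library alone. This is a toy, assuming a hand-written lemma table `LEMMAS` and a naive plural rule; it is not how Pattern or any other toolkit implements these steps:

```python
import re

# Toy lemma table (an assumption; real lemmatizers use full lexicons
# and part-of-speech information).
LEMMAS = {"is": "be", "was": "be", "were": "be", "been": "be", "are": "be"}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def lemmatize(token):
    """Map a token to its dictionary form: table lookup, then a naive plural rule."""
    lower = token.lower()
    if lower in LEMMAS:
        return LEMMAS[lower]
    # Naive plural stripping; wrong for words such as "lens".
    if len(lower) > 3 and lower.endswith("s") and not lower.endswith("ss"):
        return lower[:-1]
    return lower

tokens = tokenize("The banks were closed, but the plan was clear.")
lemmas = [lemmatize(t) for t in tokens]
print(tokens)
print(lemmas)
```

Even this toy shows where errors come from: "closed" is left unchanged and "lens" would be cut to "len", which is why the later, harder steps in the list inherit mistakes from the earlier ones.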
GOAL: mine knowledge from text
How can we represent meaning?
Text mining = shallow understanding: look at specific types of information
www.biograph.be "BioGraph provides a web service for discovery of biomedical relations and exploring functional hypotheses."
Problem: too many subfields in biomedicine; nobody can be aware of them all. Now pharmaceutical companies can instead keep up with new scientific discoveries.
Definition of text-mining (Marti ...) that describes it as creating *new knowledge* (not just extracting existing knowledge). (Strange blurring of lines between making an abstract and creation of new knowledge; related is a self-fulfillment of a shallow concept of knowledge and the sense of "ever more knowledge" being created / information overload )
A hot topic: Deception
"Opinion spam", e.g. TripAdvisor
Personality affects success in deception ;-)
VOTE FOR SPAM!
First one = SPAM or second one = SPAM? The language of lies (in what context?)
Liars use
- fewer exclusive words (but, except, without, exclude)
- fewer self- and other-references
- fewer tentative words
- fewer time-related words
- more space-related words
- more negative words
- more negations
- more motion verbs
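Cues like these are typically turned into numbers by counting how often words from each category occur in a text. A minimal sketch, assuming tiny word lists: only the exclusive words come from the notes above, the other two lists and the helper `cue_rates` are placeholders, and the review text is made up:

```python
import re

# Only the "exclusive" list comes from the notes; the others are
# illustrative placeholders (assumptions).
CUES = {
    "exclusive": {"but", "except", "without", "exclude"},
    "negation": {"no", "not", "never", "none"},
    "self_reference": {"i", "me", "my", "mine", "myself"},
}

def cue_rates(text):
    """Share of tokens that fall into each cue category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        name: sum(token in words for token in tokens) / total
        for name, words in CUES.items()
    }

# Made-up review text, for illustration only.
rates = cue_rates("I loved my stay, but the room was not clean.")
print(rates)
```

The rates say nothing on their own; they only become evidence when compared across reviews that are known to be truthful or deceptive.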
How much influence would context have on 'lies'? (what counts as a lie when and where?)
The question of whether 400 true positive reviews from TripAdvisor may be correctly taken as true (how to verify?) And what counts as positive? And what is false? Or: what type of falseness are we revealing? https://www.mturk.com/mturk/welcome
"Explorative deception experiment"
true like true dislike
false like false dislike (question of why write reviews about imaginary subjects?) -> check paper
Spam classifiers do the same as 'text categorisation': a collection of spam mail + a collection of real mail; machine learning adds information to specific classes -> gives you a classifier
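The recipe above (two labelled collections in, a classifier out) can be sketched with a small Naive Bayes model. Naive Bayes is one common choice for text categorisation, swapped in here for illustration; the notes do not say which learner was used, and the training mails and the names `train` and `classify` are made up:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(labelled_docs):
    """Count words per class from (text, label) pairs."""
    word_counts = {}        # label -> Counter of words seen in that class
    doc_counts = Counter()  # label -> number of documents
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training mails, for illustration only.
training = [
    ("win money now", "spam"),
    ("cheap pills win prize", "spam"),
    ("meeting notes for tuesday", "ham"),
    ("lunch on tuesday", "ham"),
]
model = train(training)
print(classify("win a prize now", model))
print(classify("notes for lunch", model))
```

Swapping the two collections for truthful and deceptive reviews gives the same kind of classifier for opinion spam; only the labels and the training texts change.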
'Amica': Automatic Monitoring for Cyberspace Application, http://amicaproject.be/ -> detecting grooming by pedophiles
Collected stories, Philip K. Dick (Minority Report: PreCrime) --> this link does not work!?
Paedophile detection
Grooming detector: profile information vs writing style (function words, pronouns...) -> suspicious profiles are sent to a moderator
(The example of a 50-year-old man with the profile of a 14-year-old girl)
What if a pedophile learns to write like a 14-year-old girl?
'machine learning techniques: we don't really know what is going on'
(Whether you're talking about Justin Bieber or Mozart, it doesn't matter; it's how you speak about it)
(Question of oversight -- if the techniques are so opaque, how are they evaluated ... )
Being aware of being a target. Obfuscation tools for pedophiles (ref. AdNauseam)?
Who cares about moderation?
Ref: Facebook moderation on pornography (doesn't take motivation into account at all)
Problematics as semantic possibilities drift into morality
"the creation of new knowledge": creating a flag vs warning that someone is a pedophile...
Collaboration with sociologists
Are pedophiles stakeholders in this development process?
is this technology desirable? is this technology possible? --> keep this friction for a work group next days?
Trend of surveillance technologies justified by hot-button topics - terrorism, paedophilia etc but ultimately (covertly?) directed towards marketing/commercial uses.