How to take context into account?
How to think about the circularity of these algorithms?

irony detection
Author profiling ->

We start with a 'generic' task: "This task is about predicting an author's demographics from her writing. Concretely: age, gender and personality traits."

Looking for prepared data, where to start: Gijs
--> no gender input, but:

1) Extraversion (x) (sociable vs shy)
2) Neuroticism (n) (neurotic vs calm)
3) Agreeableness (a) (friendly vs uncooperative)
4) Conscientiousness (c) (organized vs careless)
5) Openness (o) (insightful vs unimaginative).
--> e.g. neuroticism seems to correlate with conscientiousness (shy = conscious?)
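
One way to check a hunch like this is to compute a correlation between two trait columns. A minimal sketch, assuming each author comes with one numeric score per trait under the letter codes listed above. The scores in `toy_authors` are hypothetical placeholders, not values from the corpus, and `trait_correlation` is a helper name made up for this note.

```python
from math import sqrt

# Letter codes as used in the list of traits above.
TRAITS = {
    "x": "Extraversion",
    "n": "Neuroticism",
    "a": "Agreeableness",
    "c": "Conscientiousness",
    "o": "Openness",
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def trait_correlation(authors, code_1, code_2):
    """Correlate two traits across a list of per-author score dicts."""
    return pearson([a[code_1] for a in authors], [a[code_2] for a in authors])


# Hypothetical toy scores, only here to make the sketch runnable.
toy_authors = [
    {"x": 0.1, "n": 0.4, "a": 0.2, "c": 0.3, "o": 0.5},
    {"x": 0.3, "n": 0.1, "a": 0.4, "c": 0.2, "o": 0.1},
    {"x": 0.2, "n": 0.5, "a": 0.1, "c": 0.5, "o": 0.3},
]

if __name__ == "__main__":
    r = trait_correlation(toy_authors, "n", "c")
    print(f"{TRAITS['n']} vs {TRAITS['c']}: r = {r:.2f}")
```

A high value of r on the real scores would only say that the two columns move together in this dataset; it would not say why.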

Using the NEO PI-R, a tool to measure these traits.
In psychology, the Big Five personality traits are five broad domains or dimensions of personality that are used to describe human personality. The theory based on the Big Five factors is called the five-factor model (FFM).[1] The five factors are openness, conscientiousness, extraversion, agreeableness, and neuroticism. Acronyms commonly used to refer to the five traits collectively are OCEAN, NEOAC, or CANOE. Cath FS

Our dataset: PAN-AP-13 corpus - Author Profiling Shared Task

Trying to understand the where/what/how of this dataset

Larger context: "uncovering plagiarism, authorship, and social software misuse"

Author Profiling:

"Authorship analysis deals with the classification of texts into classes based on the stylistic choices of their authors. Beyond the author identification and author verification tasks where the style of individual authors is examined, author profiling distinguishes between classes of authors studying their sociolect aspect, that is, how language is shared by people. This helps in identifying profiling aspects such as gender, age, native language, or personality type. Author profiling is a problem of growing importance in applications in forensics, security, and marketing. E.g., from a forensic linguistics perspective one would like being able to know the linguistic profile of the author of a harassing text message (language used by a certain type of people) and identify certain characteristics (language as evidence). Similarly, from a marketing viewpoint, companies may be interested in knowing, on the basis of the analysis of blogs and online product reviews, the demographics of people that like or dislike their products. The focus is on author profiling in social media since we are mainly interested in everyday language and how it reflects basic social and personality processes"
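
To make "stylistic choices" a bit more concrete for ourselves: a rough sketch of the kind of surface features such a classifier could be fed. This is our own illustration, not the method used in the shared task. The feature names, the `style_features` helper and the short `FUNCTION_WORDS` list are all assumptions chosen for the example.

```python
import re

# Illustrative, far from complete list of English function words (an assumption).
FUNCTION_WORDS = frozenset({"the", "and", "i", "you", "of", "to", "a", "in"})


def style_features(text):
    """Turn one text into a few simple style measurements."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words)
    if n == 0:
        return {
            "n_words": 0,
            "avg_word_len": 0.0,
            "type_token_ratio": 0.0,
            "function_word_rate": 0.0,
        }
    return {
        "n_words": n,
        # Mean number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / n,
        # Distinct words divided by total words (vocabulary variety).
        "type_token_ratio": len(set(words)) / n,
        # Share of words that appear in FUNCTION_WORDS.
        "function_word_rate": sum(w in FUNCTION_WORDS for w in words) / n,
    }


if __name__ == "__main__":
    print(style_features("The cat and the dog"))
```

Each author's texts would become one such row of numbers, and a classifier would then try to separate the classes (age group, gender, trait level) on those rows. Which features actually carry signal is exactly the open question.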

From the Readme:

"Moreover, documents from authors who pretend to be minors have been included (e.g., documents composed of chat lines of sexual predators). For any doubt or problem, please get in touch with us."

"Social media" = chat messages? Different conversations. It seems mixed ... Gijs finds spam and other types of messages.