READ ME


Feature: pat(t)ernalism


pattern.en.paternalism is a suggested addition to the Pattern natural language processing toolkit for Python.
This feature detects whether a text could be considered paternalistic.

Motivation


"Machine-learning algorithms that partially automate data processing still need to be trained for every new form, or every new kind of topic the algorithm might deal with. [...] Such work of alignment is not a bug—it is the condition of possibility for keeping humans and automation working in the same world."
http://www.publicbooks.org/nonfiction/justice-for-data-janitors

We started to work on this feature during the worksession Cqrrelations (http://www.cqrrelations.constantvzw.org), where a group of artists, researchers, programmers and designers came together to work through impure, missing, invisible, broken or suspicious data. As we slowly got to grips with the actual practice of data-mining, and more specifically with the text-mining software package Pattern (http://www.clips.ua.ac.be/pattern), the process and consequences of annotating data sources appeared undervalued and underdocumented. Annotation in the context of text-mining is the 'scoring' of large amounts of data that can then be used for 'training' algorithms. This scored data becomes a 'truth' or 'Gold Standard' against which the algorithm is trained and tested.
We were interested in the paradoxical situation of annotation, where human input is both considered a source of truth and made invisible, since human 'scoring' of data is the reference for machine learning. From our shared experience in cultural and software research, it seemed that annotation always implies a subjective process. Reading sources is a durational process and involves a clear positioning of the opinionated annotator, which we wanted to both experience and challenge. Our decision to work with a contested 'polarity' such as paternalism was, of course, deliberate.

We wanted to:


Use cases


Nudging

Nudging theory (or Nudge) is a concept in behavioural science, political theory and economics which argues that positive reinforcement and indirect suggestions aimed at non-forced compliance can influence the motives, incentives and decision making of groups and individuals at least as effectively as, if not more effectively than, direct instruction, legislation or enforcement. A documented threat to behavioural politics such as nudging arises when nudging attempts are perceived as paternalistic. It is therefore relevant to detect patterns of language that could be considered paternalistic.
http://ethicsofnudge.com/tag/paternalism/

Process

Some meta-notes on the annotation process:


The Removal of Pascal


We started by establishing a classifier for detecting pat(t)ernalism. While we were training our K-Nearest Neighbour algorithm, results seemed skewed towards a few French terms in the sources, most notably 'autre'. Closer scrutiny revealed that the term was part of a quote from Blaise Pascal used in 'How to Observe Morals and Manners' by Harriet Martineau, one of the sources used. Since our algorithm was not performing according to our expectations, we decided to remove the passage that created the unwanted result. This is the passage that was removed:

"Une différente coutume donnera d'autres principes naturels. Cela se voit par expérience; et s'il y en a d'ineffaçables à la coutume, il y en a aussi de la coutume ineffaçables à la nature."

"A different usage, would result in distinct natural causality. One can experience it; furthermore if some are impossible to abolish from usage, usage impossible to erase from nature also exists."

http://www.gutenberg.org/files/18269/18269-h/18269-h.htm#p_92
A different custom will cause different natural principles. This is seen in experience; and if there are some natural principles ineradicable by custom, there are also some customs opposed to nature, ineradicable by nature, or by a second custom.

Unfortunately, The Removal of Pascal did not improve the performance of our algorithm.
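
To make the K-Nearest Neighbour step more concrete, here is a minimal sketch in plain Python. It is not the code used during the worksession and it does not use Pattern's own classes: the bag-of-words vectors, the cosine similarity, the function names and the toy training paragraphs are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, training, k=3):
    """Return the majority label among the k most similar training paragraphs."""
    vector = bag_of_words(text)
    ranked = sorted(training, key=lambda item: cosine(vector, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data, invented for this example: (vector, label) pairs,
# where 1 stands for "paternalistic" and 0 for "not paternalistic".
training = [
    (bag_of_words("you must do this for your own good"), 1),
    (bag_of_words("we know what is best for you"), 1),
    (bag_of_words("it is for your own protection that we forbid it"), 1),
    (bag_of_words("the mine produced coal and iron"), 0),
    (bag_of_words("the machine was built in 1940"), 0),
    (bag_of_words("textile mills used water power"), 0),
]

print(knn_classify("this rule is for your own good", training))
```

In a setup like this, a word that occurs in only a handful of training paragraphs (as 'autre' did in our sources) can pull unrelated texts towards those paragraphs, because any shared word raises the similarity score.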

Definitions of paternalism

From: https://en.wikipedia.org/wiki/Paternalism
Paternalism (or parentalism) is behavior, by a person, organization or state, which limits some person or group's liberty or autonomy for that person's or group's own good. Paternalism can also imply that the behavior is against or regardless of the will of a person, or also that the behavior expresses an attitude of superiority.
The word paternalism is from the Latin pater for father, though paternalism should be distinguished from patriarchy. Some, such as John Stuart Mill, think paternalism to be appropriate towards children: "It is, perhaps, hardly necessary to say that this doctrine is meant to apply only to human beings in the maturity of their faculties. We are not speaking of children, or of young persons below the age which the law may fix as that of manhood or womanhood." Paternalism towards adults is sometimes thought to treat them as if they were children.
Examples of paternalism include laws requiring the use of motorcycle helmets, a parent forbidding their children to engage in dangerous activities, and a psychiatrist confiscating sharp objects from someone who is suicidally depressed.

From: https://fr.wikipedia.org/wiki/Paternalisme (translated from French)
Paternalism is a political doctrine that holds it to be morally desirable for a private or public agent to be able to decide on behalf of another for that person's own good. This doctrine is opposed to liberalism.
For example, when the state forbids agents to smoke or drink, it pursues a paternalistic policy. From a liberal point of view, one cannot seek to do good to an individual against their will.
Paternalism is an attitude that consists in behaving like a father towards other people over whom one exercises, or attempts to exercise, authority. This attitude may be deliberate, or involuntary and unconscious.
The term is used in particular in fields such as economics, morals and politics. One then speaks of economic, moral, political or social paternalism, etc.
The paternalistic attitude amounts to treating adults as children. A paternalist infantilises those over whom he exercises, or seeks to exercise, authority. Conversely, it may be because these people are already childlike that a paternalistic tendency arises in response.

From: https://nl.wikipedia.org/wiki/Paternalisme (translated from Dutch)
Paternalism refers to an attitude or policy comparable to the hierarchical family pattern in which the father (pater in Latin) stands at the head of the family and makes decisions for the other family members (wife and children), even if that decision is not in accordance with what they wish.
Paternalism is the conduct of a government towards the people, or of a dominant people in foreign territory (a colony or former colony), or of a person in authority, acting as a father or guardian who means well for the people, his children or wards, but gives them no significant influence over their own affairs.

From: http://dexonline.ro/definitie/paternalism (the Romanian Wikipedia has no entry for paternalism; translated from Romanian)
Paternalism, neuter noun. 1. (Political economy) A conception denoting the interest that employers show in the welfare of workers or in the family atmosphere within the enterprise; relations between employers and workers characterised by mutual affection, authority and respect. 2. Excessive protection, shielding or guardianship of one's own child. From the French paternalisme.

Selected sources

The annotators selected the following 20 sources. From these sources, 600 paragraphs were selected. For Gutenberg sources, paragraphs were automatically scraped from Gutenberg. For Wikipedia sources, annotators copy-pasted the paragraphs into a spreadsheet by hand. Paragraph titles and graphic elements were ignored.

Gutenberg project

  1. J. B. Bury, The Idea Of Progress, 1920, http://www.gutenberg.org/cache/epub/4557/pg4557.txt
  2. Maud Churton Braby, Modern Marriage and How To Bear It, 1908, https://www.gutenberg.org/files/31529/31529-0.txt
  3. Harriet Martineau, How to Observe Morals and Manners, 1838, http://www.gutenberg.org/cache/epub/33944/pg33944.txt
  4. Irwin Edman, Human Traits and their Social Significance, 1920, http://www.gutenberg.org/cache/epub/22306/pg22306.txt
  5. James Hayden Tufts, The Ethics of Cooperation, 1918, http://www.gutenberg.org/cache/epub/29508/pg29508.txt
  6. James Harvey Robinson, The Mind in the Making: The Relation of Intelligence to Social Reform, 1921, http://www.gutenberg.org/cache/epub/8077/pg8077.txt
  7. Helen Kendrick Johnson, Woman And The Republic, 1897, https://www.gutenberg.org/cache/epub/7300/pg7300.txt
  8. Charles Darwin, On the Origin of species, 1859, http://www.gutenberg.org/cache/epub/1228/pg1228.txt
  9. Emma Goldman, Anarchism and other essays, 1910, http://www.gutenberg.org/cache/epub/2162/pg2162.txt
  10. John F. Hume, The Abolitionists (Together With Personal Memories Of The Struggle For Human Rights), 1830-1864, http://www.gutenberg.org/cache/epub/13176/pg13176.txt

Wikipedia

  1. Mining : https://en.wikipedia.org/wiki/Mining
  2. Textile Industry : https://en.wikipedia.org/wiki/Textile_industry
  3. History of computing hardware : https://en.wikipedia.org/wiki/History_of_computing_hardware
  4. Marissa Mayer : https://en.wikipedia.org/wiki/Marissa_Mayer
  5. Larry Page : https://en.wikipedia.org/wiki/Larry_Page
  6. Liberty : https://en.wikipedia.org/wiki/Liberty
  7. Choice : https://en.wikipedia.org/wiki/Choice
  8. Sabotage : http://en.wikipedia.org/wiki/Sabotage
  9. Social Darwinism : http://en.wikipedia.org/wiki/Social_Darwinism
  10. Anarchism : https://en.wikipedia.org/wiki/Anarchism
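
The scraping script for the Gutenberg sources is not documented here. As a hedged illustration of the paragraph-splitting step only, the sketch below splits a plain-text file into paragraphs on blank lines. The function name and the sample text are invented, and the sketch does not remove titles or the Project Gutenberg header and footer.

```python
import re


def split_paragraphs(text):
    """Split plain text into paragraphs on blank lines, joining wrapped lines."""
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [p for p in paragraphs if p]


# Invented sample in the line-wrapped style of a Gutenberg .txt file.
sample = (
    "CHAPTER I\r\n\r\n"
    "First paragraph,\r\nwrapped over two lines.\r\n\r\n\r\n"
    "Second paragraph.\r\n"
)

for paragraph in split_paragraphs(sample):
    print(paragraph)
```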

Annotation file

name: main-the-annotator-paragraphs-[ID-number].ods
example: main-the-annotator-paragraphs-005.ods

Columns:

A : unique ID
B : URL of the source
C : title of the source
D : year of publication
E : paragraph (content)
F : the ID number of the annotator
G : classification (-1/0/1/x)
H : comment
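
For further processing, the column layout above can be mapped onto named fields. The sketch below assumes the spreadsheet has been exported from .ods to CSV without a header row; the field names and the sample row are hypothetical and only follow the column order A to H.

```python
import csv
import io
from collections import namedtuple

# Hypothetical field names, in the order of columns A-H.
Row = namedtuple("Row", "uid url title year paragraph annotator label comment")


def read_annotations(handle):
    """Yield one Row per CSV record, skipping records with fewer than 8 columns."""
    for record in csv.reader(handle):
        if len(record) >= 8:
            yield Row(*record[:8])


# Invented sample row standing in for a CSV export of the annotation file.
sample = io.StringIO(
    '17,https://en.wikipedia.org/wiki/Liberty,Liberty,,'
    '"A sample paragraph, with a comma.",005,0,\n'
)

rows = list(read_annotations(sample))
print(rows[0].annotator, rows[0].label)
```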

Instructions for annotators


About the annotators

Annotator 001 (f, 1982) is a French author living in Belgium. She is currently researching the life of Anna Kavan, and is interested in digital writing.
Annotator 002 (f, 1969) is a Dutch designer/artist living in Belgium. She is a feminist and interested in tools, practice and Free Software.
Annotator 003 (m, 1990) is a Dutch artist living in The Netherlands. He is interested in infrastructures and networks.
Annotator 004 (f, 1988) is a French artist living in The Netherlands. She is interested in the physical location of the web, and enjoys the act of making web pages.
Annotator 005 (f, 1989) is a Dutch designer living in The Netherlands. She is interested in language philosophy and computational linguistics.
Annotator 006 (f, 1991) is a Romanian curator living in The Netherlands. She is interested in the conditional aspect of (web) interfaces.
Annotator 007 (m, 1982) is a Hungarian researcher living in Spain. He is interested in collaborative production practices and cybernetics as an ideological formation.
Annotator 008 (m, 1984) wishes that he was 007. Why? Because 7, 8, 9... No, 008 has English as his mother tongue and has been annotating considerably in professional and non-professional contexts (including building ontological frameworks since 2008 (fuck me, that was a long time ago)). Besides, the themes of the reading material in 008's remit are close to a 'natural reading environment'. He has been permitted to retire at 250, unless somebody is able to catch up with him!
Annotator 009 (f, 1975) is a French researcher/teacher living in France. She is interested in bots.

Meta-mining


Legend

x = noise
d = disagreement
n = not annotated
p = annotated by 1 person
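
How the individual annotations of a group were merged into these codes is not documented here. The following sketch shows one possible set of rules and is an assumption throughout: in particular, treating any 'x' vote as noise and returning 'p' instead of the single annotator's label are guesses, and the function name is hypothetical.

```python
def merge_labels(labels):
    """Combine one paragraph's labels from a group of annotators into one code."""
    given = [label for label in labels if label not in (None, "")]
    if not given:
        return "n"        # not annotated
    if len(given) == 1:
        return "p"        # annotated by 1 person
    if "x" in given:
        return "x"        # assumption: any noise vote marks the paragraph as noise
    if len(set(given)) == 1:
        return given[0]   # all annotators gave the same classification
    return "d"            # disagreement


print(merge_labels(["1", "1", "1"]))
print(merge_labels(["1", "0", ""]))
```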

Results

Group A (001, 004, 007)
    174 paragraphs classified
    18 disagreements 
    7 paragraphs were noise
    
Group B (002, 005, 008)
    55 paragraphs classified
    21 disagreements
    2 paragraphs were noise

Group C (003, 006, 009)
    61 paragraphs classified
    10 disagreements
    2 paragraphs were noise

Totals:
244 paragraphs were classified and used for training
Annotators disagreed on whether a paragraph was paternalist on 49 occasions.

Annotator disagreement rate: 20.08% (49 of 244)
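
The rate can be re-derived from the figures listed above. A short check in Python (variable names are illustrative):

```python
# Disagreements per group and the number of classified paragraphs, as listed above.
disagreements = {"A": 18, "B": 21, "C": 10}
classified = 244

total_disagreements = sum(disagreements.values())
rate = 100 * total_disagreements / classified

print(total_disagreements)   # 49
print(f"{rate:.2f}%")        # 20.08%
```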


Result lists

List of raw annotation data
../share/the-annotator/all-annotations-abc.html

Classified as paternalistic
List of paragraphs that were classified as paternalistic,
combined with the notes that were taken during the annotation process
../share/the-annotator/paternalism-classifications.html


Disagreement of classification
List of paragraphs that the annotators disagreed on, and that were therefore not taken into account in the training,
combined with the notes that were taken during the annotation process
../share/the-annotator/disagreement-list-selection.html


PAN'13 Training Corpus for Author Profiling Task
================================================

Corpus description
------------------

The corpus consists of XML documents containing conversations (in HTML format) about many different topics, grouped by author and labeled with the author's language, gender and age group.

There are two languages (English and Spanish), two genders (male and female) and three age groups (10s: 13-17, 20s: 23-27 and 30s: 33-47).

Each author is represented by a separate XML file whose name encodes language, gender and age group to make file handling easier. The files are grouped by language in two separate folders, EN and ES.

Each XML document name is formatted as:

UUID_lang_agegroup_gender.xml

For example:

303232a213161ece822fe69176d48e58_en_20s_female.xml
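
As a small illustration, a file name in this format can be split back into its parts. The function and field names below are hypothetical:

```python
from collections import namedtuple

AuthorFile = namedtuple("AuthorFile", "uuid lang age_group gender")


def parse_name(filename):
    """Split 'UUID_lang_agegroup_gender.xml' into its four components."""
    stem = filename.rsplit(".", 1)[0]
    uuid, lang, age_group, gender = stem.split("_")
    return AuthorFile(uuid, lang, age_group, gender)


info = parse_name("303232a213161ece822fe69176d48e58_en_20s_female.xml")
print(info.lang, info.age_group, info.gender)
```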

And each XML file is formatted as follows:

<author lang="lang_code" gender="gender_code" age_group="age_group">
        <conversations count="number_of_conversations_in_file">
                <conversation id="UUID">
                        [Original HTML Content of the conversation]
                </conversation>

                <conversation id="UUID">
                        [Original HTML Content of the conversation]
                </conversation>

                ....

        </conversations>
</author>
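
A file with this structure can be read with Python's standard xml.etree.ElementTree module. The sketch below uses an invented sample document with made-up conversation ids, and it assumes that the HTML inside each conversation is escaped (or wrapped in CDATA) so that the file is well-formed XML; files with raw, unbalanced HTML would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Invented sample following the structure above; ids and content are made up.
sample = """<author lang="en" gender="female" age_group="20s">
    <conversations count="2">
        <conversation id="a1">&lt;p&gt;First conversation&lt;/p&gt;</conversation>
        <conversation id="b2">&lt;p&gt;Second conversation&lt;/p&gt;</conversation>
    </conversations>
</author>"""

author = ET.fromstring(sample)

# Map each conversation id to its (unescaped) HTML content.
conversations = {
    c.get("id"): (c.text or "").strip() for c in author.iter("conversation")
}

print(author.get("lang"), author.get("gender"), author.get("age_group"))
print(len(conversations))
```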


The English corpus comprises 236,600 authors (files), with 413,564 conversations and 180,809,187 words. The Spanish corpus comprises 75,900 authors (files), with 126,453 conversations and 21,824,198 words.

The distribution of the training data is:

LANG  AGE_GROUP  GENDER  N. OF AUTHORS (FILES)
----------------------------------------------
EN    10s        MALE                    8,600
                 FEMALE                  8,600
      20s        MALE                   42,900
                 FEMALE                 42,900
      30s        MALE                   66,800
                 FEMALE                 66,800
----------------------------------------------
ES    10s        MALE                    1,250
                 FEMALE                  1,250
      20s        MALE                   21,300
                 FEMALE                 21,300
      30s        MALE                   15,400
                 FEMALE                 15,400
----------------------------------------------
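
The per-language totals can be derived from the rows of this table. A short check in Python (the dictionary layout is illustrative):

```python
# Authors per gender for each (language, age group), as listed in the table;
# male and female counts are identical in every row pair.
authors_per_gender = {
    ("EN", "10s"): 8600, ("EN", "20s"): 42900, ("EN", "30s"): 66800,
    ("ES", "10s"): 1250, ("ES", "20s"): 21300, ("ES", "30s"): 15400,
}

totals = {}
for (lang, _age_group), count in authors_per_gender.items():
    totals[lang] = totals.get(lang, 0) + 2 * count  # male + female

print(totals)
```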

Moreover, documents from authors who pretend to be minors have been included (e.g., documents composed of chat lines of sexual predators).

If you have any doubts or problems, please get in touch with us.



Terms Of Use

Confidentiality 
You cannot send the database to any other party, nor disclose to anyone else the information contained within it as well as its structure. 

Anonymisation 
We have anonymised the dataset. You cannot attempt to reverse this by linking individual records with specific users. This means that you also should not link individual data with any other information about an individual that you may have. This implies that you cannot attempt to contact any individuals either. 

Non-Commercial License
We grant you a non-commercial license to use the data. You can only use it for academic research that does not earn revenue, and your research also cannot be in collaboration with any commercial entities. 

Recoverability 
The license to use and store our data is recoverable, which means that myPersonality may, at its sole discretion, ask you to cease using it and to delete it from any storage you have.