1 | Dates | Topic | Classes / Lecture content | Readings (A: critical, B: helpful, C: details) | Tutorials content (see website for up-to-date and links) | Homework content (see website for up-to-date and links) | Extra resources and pointers (optional) |
---|---|---|---|---|---|---|---|
2 | 1/10, 1/12, 1/17 | Block 1 - Introduction to language analyses | Intro (what's in this course and what is not); overview of basic skills to get going (SSH, Google Colab, databases); how does language work (basic intro to language stats); Motivation: case studies of language analyses in the social sciences -- the highlight reel of the literature; Publishing in social sciences vs. computational linguistics; Introduction to DLATK database tables and flow; What is a word? Tokenization - Simple 1gram extraction with DLATK. | A** - Eichstaedt Psych Methods - Overview Intro / course blueprint; B - The secret life of pronouns (helpful light reading on the side); B - Iliev - Overview - text analysis in psychology (general overview) | Tutorial 1: basic Linux commands (home folder, edit files, permissions, running R and Python from the command line); Tutorial 2: SSH and tunneling; Tutorial 3: Introduction to Sequel Pro; Tutorial 4: LinkedIn Learning with Sequel Pro: SQL essentials; Tutorial 5: Working with SQL & tweets (please do all of these by Tuesday 9/22) | HW1: Intermediate SQL exercises (due Thursday, 9/25, before class) -- will come out Thursday, 9/17 | Intro to git; Intro to R; Regex tutorial; Intro to Personality (reading: the power of personality) |
3 | 1/19, 1/24, 1/26, 1/31 | Block 2 - Closed-vocabulary language analyses (dictionaries) | Lecture on DICTIONARIES: Overview of the psychological text analysis literature (LIWC, DICTION, General Inquirer); Acknowledging the advances of the 1960s! Lasswell, General Inquirer; Precision/recall // sensitivity/specificity, and how they apply to dictionaries; controlling for multiple comparisons (Bonferroni vs. Benjamini-Hochberg); Sources of error, validation of dictionaries, use of weighted dictionaries based on supervised machine learning; sentiment analysis, NRC dictionaries; Introduction to language confusion matrices // Lecture on EMOTIONS: Sentiment; ANEW and LabMT through hand annotation; Mood states; NRC emotion dictionaries; Valence and arousal (already in GI) | (still A - Eichstaedt from Block 1); A** - Kern - DLATK psych methods (more detail on the use of DLATK and parameter choices) -- focus on first sections for now; A** - Pennebaker - Psychological aspects of natural language use: Our words, our selves (mostly LIWC, nicely structured by outcome category); A - Tausczik - The Psychological Meaning of Words - GREAT APPENDIX (mostly LIWC); B - Mehl - Text analysis Handbook (overlaps with the previous introductions, has some worked examples and a helpful perspective); C - DLATK technical paper (the CS paper that introduces DLATK and explains its class structure) // TBD (further B and C papers): 1 labMT paper and the San Diego LabMT paper (where it misfires); Limitations of LIWC (the Jessie Sun paper); Jaidka 2020 PNAS; Mehl JPSP paper on first person pronouns => negative affectivity | ## week 2 Tutorial 6: Introduction to Jupyter; Tutorial 7: Getting your data into a non-local database (with R, with Python, via command line, with your SQL GUI); Tutorials 7 and 8: DLATK tutorial (1gram extraction, extracting dictionaries (weighted / unweighted), correlating dictionaries, both Pearson and Spearman); Tutorial: Pulling DLATK tables into R, reproducing DLATK analysis; making scatter plots / histograms with language frequencies; How to interpret dictionary correlations (?); Write up dictionary-based results for publication; Tutorial: how to get a dictionary into DLATK; Importing ONE standard dictionary into DLATK (ANEW) | HW2: Dictionaries in DLATK (due 10/1). Extract unweighted dictionaries (LIWC 2015, General Inquirer) on study data set, run correlations. Extract weighted dictionaries (ANEW, labMT, Warriners) on study data set, run correlations. #### your own dictionary project #### Make your own dictionary, upload to server. Extract it from blog data. Run correlations with demographics / occupation. Also run the LIWC dictionary that is most related. Get the words that determine counts in your dictionary, correlate with outcome. Pull 100 examples and determine precision of your dictionary // HW3: DLATK and R. Plot cumulative dictionary composition in R (Johannes gives code); make scatter plots; Maybe: hypothesize some association with a Big Five outcome, and then produce a language confusion matrix (true/false positives); Write up your findings for publication according to template (methods, results, supplement {validation table, language confusion matrix, density plot}, interpretation using Tausczik & Pennebaker) | |
4 | 2/2, 2/7 | Block 3 - Open-vocab language analyses | 1-to-3 grams, extraction (occurrence thresholds, pointwise mutual information); Statistical power considerations (what can you do with what kind of data set?); More involved language analyses: dose response, the interpersonal circumplex; Part of speech tags; ?: TF-IDF as a way to summarize language from different categories / authors (e.g., Biden vs. Trump speeches) | (still A - Eichstaedt from Block 1); (still A - Kern from Block 2); A - Schwartz - The Open-Vocab Approach (the paper that really introduced the open vocab approach to psychology); A - PNAS - police body camera footage!!; (TODO: find good reading re: lexical hypothesis) | Extract 1to3grams from study data set - one time with low occurrence threshold, one time with high - one time with high PMI, one time with low (using the -f flag during feature extraction to speed up operations); counting number of features in MySQL; subsample the exercise data to 1,000 users -- play with occurrence thresholds until you have something that works // MAKING 1to3gram word clouds with good OCC and PMI choices // explore dose-response results in R: - plot language features as a function of age in R (subsample equal men and women) - plot correlation of language features "!", "!!", "!!!" etc. with Big Five, and with age // extract part of speech tags with DLATK; run code to batch different POS tags, print word clouds | HW4: Open vocab analyses. Dose response: think of another language feature to explore // Part of speech extraction: find all the adjectives that are associated with Big Five traits, age, and gender. Consult original publication. Write up your findings for publication | |
5 | 2/9, 2/14 | Block 4 - Clustering (topics, word2vec) | supervised vs. unsupervised methods; Latent Semantic Analysis; Latent Dirichlet Allocation topic modeling; what are plate diagrams?; examples of supervised topic modeling; word2vec: the idea, cosine distances; Difference between semantic relatedness and substitutability; ?: Brown clusters or some other word2vec-related clusters; ?: perhaps -- extract both sentiment and topics, to get topic sentiment; WORD EMBEDDING TUTORIAL: https://www.youtube.com/watch?v=Eku_pbZ3-Mw&feature=emb_logo | LSA paper: TODO one introductory, the Pennebaker psych methods LSA paper; TODO: accessible intro to LDA; TODO: Tay paper that uses word2vec to describe situations; skim: reading tea leaves LDA paper | extract 2,000 WWBP FB topics; run correlations on training data set // make your own topics! model LDA topics on study data set in MALLET (?); import them into DLATK; extract them for the study data set (or a subset thereof) // apply LSA (TODO: with DLATK? or in R?) // extract word2vec for different users; ?:: get an average position of users in the word2vec space | HW5: LDA topics // make your own LDA topics on another dataset that we give you // compare Big Five language correlates obtained for LSA, 2,000 LDA topics, Brown clusters. Which one of these do you find the most helpful? // add a description of the methods/results/supplement that covers LDA topics to your previous publication write-up. Get it all ready for publication! | Intro to contextual embeddings |
6 | 2/16, 2/21, 2/23, 2/28 | Block 5 - Intro to Machine learning | Introduction: cross-validation, penalization, dimensionality reduction, what are hyperparameters?; prediction of Michal; classification (LR / SVM) vs. regression (ridge regression); ?:: learning a person language model, applying it to counties; GUEST lecture: Andy Schwartz on deep learning; PENALIZED REGRESSION TUTORIAL: https://www.youtube.com/watch?v=nQ4G45AbHyU&feature=emb_logo | Park JPSP paper on Big5 prediction; Yarkoni - prediction as a method | prediction with DLATK with different feature sets (1grams, 1to3grams, LIWC, topics, word2vec) for different outcomes (classification for gender and high/low extraversion [divide by thirds, remove middle third], regression for continuous outcomes [age/personality scores]); TODO: how can we allow students to set hyperparameters easily -- from command line, not by editing hyperparameter search strings in regressionPredictor.py - run examples where hyperparameters are overfitting and underfitting, to see fall-off in prediction accuracies // exercise: give them a 400 user subset -- what predictions can they obtain, with what features? // either here or next week:: Reproduce the Twitter predicts heart disease paper (2015, psych science) -- Shrinidhi, see 2018 replication, Appendix 1 (jeichstaedt.com/pubs -- it has the step by step in there) ?? work with basic transformers in DLATK ?? | HW6: Machine learning application; make a results table with the different prediction accuracies; write prediction results up for publication in a social science journal | Optional: Add transformers in DLATK |
7 | TBD | ETHICS | what are the main faux pas that have happened in this space? why are they bad?; positive ethics vs. negative ethics; deontological vs. consequentialist ethics; the 5 criteria in bioethics; Cambridge Analytica and microtargeting; Facebook influence experiment; using NLP to find bias in police interactions | story about CA?; algorithmic bias science paper; a paper from Maarten Sap? | Kramer et al. PNAS | ||
8 | 3/2, 3/7, 3/9, 3/14, 3/16 | Block 6 - Team projects (yay!) | Probably: ethics; Probably: biases; Probably: cursing and speech acts; How do you get your own language data? -- geotagging flow chart - Twitter data sets off the shelf -- County 10% from WWBP GitHub - writing prompts - scraping website (Erowid corpus?) - political speeches - specific subreddits for your construct of interest | | work with different kinds of language data - pull it if necessary, upload to server, import to database, wrangle into shape. - we'll do a few examples (assume data is very big: upload CSVs to server, import into SQL non-locally) // reproduce heart disease analyses on the Twitter data set | Final Team Project: Pick 1-2 datasets, or bring your own, execute the entire pipeline on it (it needs some type of label for prediction). - language dataset descriptives - model the right number of LDA topics - extract LIWC2015 - extract sentiment - run machine learning against outcomes / labels - visualize all results. Write methods/results/DISCUSSION/supplement up for publication! Give it a minimal introduction | |
9 | |||||||
10 | Unsure where: | advanced topics in the literature--temporal prediction of mood states | emotion diversity vs. absolute level papers | ||||
11 | not sure where: different correl fields | ||||||
12 | Anscombe transforms? Other transforms? | ||||||
13 | |||||||
14 | discussion of biases: presentation, desirability | ||||||
15 | talk about feature importance measures | ||||||
16 | why don't we stem? where might it help? | ||||||
17 | correlation vs causation: http://www.tylervigen.com/spurious-correlations | ||||||
18 | Cursing on social media | ||||||
19 | where to put the Because Internet reading? | ||||||
20 | unicode, emojis | ||||||
21 | reading: reading tea leaves | ||||||
22 | MANIC patients upper case | ||||||
23 | |||||||
24 | Open vocab: | ||||||
25 | ADD – spreadsheet discussion of ER depression – correlation vs. LR | ||||||
26 | |||||||
27 | |||||||
28 | |||||||
29 | |||||||
30 | Spatial aggregation units -- | ||||||
31 | Add google search query data? Discussion of DMA? |
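
Sketch for Block 2 (controlling for multiple comparisons, Bonferroni vs. Benjamini-Hochberg): a minimal pure-Python illustration of how the two corrections decide which dictionary correlations count as significant. This is an assumption-laden sketch, not course code: the function names and the p-values are hypothetical, and it does not use DLATK.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag p-values at or below alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag p-values with the Benjamini-Hochberg step-up rule (controls the FDR)."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is flagged as significant.
    flagged = [False] * m
    for idx in order[:cutoff_rank]:
        flagged[idx] = True
    return flagged


# Hypothetical p-values, e.g. from correlating six dictionary categories with one outcome.
p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.74]

print("Bonferroni keeps:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg keeps:", sum(benjamini_hochberg(p_values)))
```

With these made-up values Bonferroni keeps 2 of the 6 tests and Benjamini-Hochberg keeps 3, which shows why the less conservative correction is often preferred when many language features are tested at once.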