PSYCH290 - ONGOING schedule and content - syllabus sheet


1	Dates	Topic	Classes / Lecture content	Readings (A: critical, B: helpful, C: details)	Tutorials content (see website for up-to-date and links)	Homework content (see website for up-to-date and links)	Extra resources and pointers (optional)
2	1/10, 1/12, 1/17	Block 1 - Introduction to language analyses	Intro (what's in this course and what is not) overview of basic skills to get going (SSH, Google Colab, databases) how does language work (basic intro to language stats) Motivation: case studies of language analyses in the social sciences -- the highlight reel of the literature Publishing in social sciences vs. computational linguistics Introduction to DLATK database tables and flow. What is a word? Tokenization - Simple 1gram extraction with DLATK.	A** - Eichstaedt Psych Methods - Overview Intro / course blueprint B - The Secret Life of Pronouns (helpful light reading on the side) B - Iliev - Overview - text analysis in psychology (general overview)	Tutorial 1: basic Linux commands (home folder, edit files, permissions, running R and Python from the command line). Tutorial 2: SSH and tunneling Tutorial 3: Introduction to Sequel Pro Tutorial 4: LinkedIn Learning with Sequel Pro: SQL essentials Tutorial 5: Working with SQL & tweets }} please do by Tuesday 9/22	HW1: Intermediate SQL exercises (due Thursday, 9/25, before class) -- will come out Thursday, 9/17	Intro to git Intro to R Regex tutorial Intro to Personality (reading: the power of personality)
3	1/19, 1/24, 1/26, 1/31	Block 2 - Closed-vocabulary language analyses (dictionaries)	Lecture on DICTIONARIES Overview of the psychological text analysis literature (LIWC, DICTION, General Inquirer) Acknowledging the advances of the 1960s! Lasswell, General Inquirer Precision/recall // sensitivity/specificity, and how they apply to dictionaries Controlling for multiple comparisons (Bonferroni vs. Benjamini-Hochberg) Sources of error, Validation of dictionaries, use of weighted dictionaries based on supervised machine learning sentiment analysis, NRC dictionaries Introduction to Language Confusion matrices // Lecture on EMOTIONS Sentiment ANEW and LabMT through hand annotation Mood states #NRC emotion dictionaries Valence and arousal (already in GI)	(still A - Eichstaedt from Block 1) A - Kern - DLATK psych methods (more detail on the use of DLATK and parameter choices) -- focus on first sections for now A - Pennebaker - Psychological aspects of natural language use: Our words, our selves (mostly LIWC, nicely structured by outcome category) A - Tausczik - The Psychological Meaning of Words - GREAT APPENDIX (mostly LIWC) B - Mehl - Text analysis Handbook (overlaps with the previous introductions, has some worked examples and helpful perspective) C - DLATK technical paper (the CS paper that introduces DLATK and explains its class structure) // TBD (further B and C papers): 1 labMT paper and the San Diego LabMT paper (where it misfires) Limitations of LIWC (the Jessie Sun paper) Jaidka 2020 PNAS Mehl JPSP paper on first person pronouns => negative affectivity	## week 2 Tutorial 6: Introduction to Jupyter Tutorial 7: Getting your data into a non-local database (with R, with Python, via command line, with your SQL GUI) Tutorials 7 and 8 DLATK tutorial (1gram extraction, extracting dictionaries (weighted / unweighted), correlating dictionaries, both Pearson and Spearman) Tutorial Pulling DLATK tables into R, reproducing DLATK analysis. 
making scatter plots / histograms with language frequencies How to interpret dictionary correlations (?) Write up dictionary-based results for publication Tutorial : how to get a dictionary into DLATK Importing ONE standard dictionary into DLATK (ANEW)	HW2: Dictionaries in DLATK (due 10/1). extract unweighted dictionaries (LIWC 2015, General Inquirer) on study data set, run correlations. extract weighted dictionaries (ANEW, labMT, Warriner) on study data set, run correlations. #### your own dictionary project #### Make your own dictionary, upload to server. Extract it from blog data. Run correlations with demographics / occupation Also run LIWC dictionary that is most related. Get the words that determine counts in your dictionary, correlate with outcome Pull 100 examples and determine precision of your dictionary ------------------------------------------------- HW3: DLATK and R plot cumulative dictionary composition in R (Johannes gives code) make scatter plots Maybe: hypothesize some association with a Big Five outcome, and then produce a language confusion matrix (true/false positives) Write up your findings for publication according to template (methods, results, supplement {validation table, language confusion matrix, density plot}), interpretation using Tausczik & Pennebaker
4	2/2, 2/7	Block 3 - Open-vocab language analyses	1-to-3 grams, extraction (occurrence thresholds, pointwise mutual information) Statistical power considerations (what can you do with what kind of data set?) More involved language analyses: dose response, the interpersonal circumplex Part of speech tags ?: TFIDF as a way to summarize language from different categories / authors (e.g., Biden vs. Trump speeches)	(still A - Eichstaedt from Block 1) (Still A - Kern from Block 2) A - Schwartz - The Open-Vocab Approach - (the paper that really introduced the open vocab approach to psychology) A - PNAS - police body camera footage!! (TODO: find good reading re: lexical hypothesis)	Extract 1to3grams from study data set - one time with low occurrence threshold, one time with high - one time with high PMI, one time with low (using the -f flag during feature extraction to speed up operations) counting number of features in MySQL subsample the exercise data to 1,000 users -- play with occurrence thresholds until you have something that works // MAKING 1to3gram word-clouds with good OCC and PMI choices // explore dose-response results in R: - plot language features as a function of age in R (subsample equal men and women) - plot correlation of language features "!", "!!", "!!!" etc. with Big Five, and with Age // extract part of speech tags with DLATK run code to batch different POS tags, print word clouds	HW4: Open vocab analyses. dose response: Think of another language feature to explore // part of speech extraction find all the adjectives that are associated with Big Five trait, age, and gender. Consult original publication. Write up your findings for publication
5	2/9, 2/14	Block 4 - Clustering (topics, word2vec)	supervised vs. unsupervised methods Latent Semantic Analysis Latent Dirichlet Allocation topic modeling what are plate diagrams? examples of supervised topic modeling word2vec: the idea, cosine distances Difference between semantic relatedness and substitutability ?: Brown topics or some other word2vec-related clusters ?: perhaps -- extract both sentiment and topics, to get topic sentiment WORD EMBEDDING TUTORIAL: https://www.youtube.com/watch?v=Eku_pbZ3-Mw&feature=emb_logo //	LSA paper: TODO one introductory, the Pennebaker psych methods LSA paper TODO: accessible intro to LDA TODO: Tay paper that uses word2vec to describe situations skim: reading tea leaves LDA paper	extract 2,000 WWBP FB topics run correlations on training data set // make your own topics! model LDA topics on study data set in MALLET (?) import them into DLATK extract them for the study data set (or a subset thereof) // apply LSA (TODO: with DLATK? or in R?) // extract word2vec for different users ?:: get an average position of users in the word2vec space	HW5: LDA topics // make your own LDA topics on another dataset that we give you // compare Big Five language correlates obtained for LSA, 2,000 LDA topics, Brown clusters. which one of these do you find the most helpful? // add a description of the methods/results/supplement that covers LDA topics to your previous publication write up. Get it all ready for publication!	Intro to contextual embeddings
6	2/16, 2/21, 2/23, 2/28	Block 5 - Intro to Machine learning	Introduction: cross-validation, penalization, dimensionality reduction, what are hyperparameters? prediction of Michal classification (LR / SVM) vs. regression (ridge regression) ?:: learning a person language model, applying it to counties GUEST lecture: Andy Schwartz on deep learning PENALIZED REGRESSION TUTORIAL: https://www.youtube.com/watch?v=nQ4G45AbHyU&feature=emb_logo	Park JPSP paper on Big5 prediction Yarkoni - prediction as a method	// prediction with DLATK with different feature sets (1grams, 1to3grams, LIWC, topics, word2vec) for different outcomes (classification for gender and high/low extraversion [divide by thirds, remove middle third], regression for continuous outcomes (age/personality scores)) TODO: how can we allow students to set hyperparameters easily -- from command line, not by editing hyperparameter search strings in regressionPredictor.py - run examples where hyperparameters are overfitting and underfitting, to see fall off in prediction accuracies // exercise: give them a 400 user subset -- what predictions can they obtain, with what features? // either here or next week:: Reproduce the Twitter predicts heart disease paper (2015, Psych Science) -- Shrinidhi, see 2018 replication, Appendix 1 (jeichstaedt.com/pubs -- it has the step by step in there) ?? work with basic transformers in DLATK ??	HW6: Machine learning application make a results table with the different prediction accuracies write prediction results up for publication in a social science journal	Optional: Add transformers in DLATK
7	TBD	ETHICS	what are the main faux pas that have happened in this space? why are they bad? positive ethics vs. negative ethics deontological vs. consequentialist ethics the 5 criteria in bioethics Cambridge Analytica and microtargeting Facebook influence experiment using NLP to find bias in police interactions	story about CA? algorithmic bias science paper a paper from Maarten Sap?		Kramer et al. PNAS
8	3/2, 3/7, 3/9, 3/14, 3/16	Block 6 - Team projects (yay!)	Probably: ethics Probably: biases Probably: cursing and speech acts How do you get your own language data? -- geotagging flow chart - Twitter data sets off the shelf -- County 10% from WWBP GitHub - writing prompts - scraping websites (erowid corpus?) - political speeches - specific subreddits for your construct of interest		work with different kinds of language data - pull it if necessary, upload to server, import to database. wrangle into shape. - we'll do a few examples (assume data is very big: upload csv's to server, import into SQL non-locally) // reproduce heart disease analyses on the Twitter data set	Final Team Project Pick 1-2 datasets, or bring your own, execute the entire pipeline on it. < it needs some type of label for prediction > - language dataset descriptives - model the right number of LDA topics - extract LIWC2015 - extract sentiment - run machine learning against outcomes / labels - visualize all results write methods/results/DISCUSSION/supplement up for publication! Give it a minimal introduction
9
10		Unsure where:	advanced topics in the literature--temporal prediction of mood states			emotion diversity vs. absolute level papers
11			not sure where: different correl fields
12			Anscombe transforms? Other transforms?
13
14				discussion of biases: presentation, desirability
15			talk about feature importance measures
16			why don't we stem? where might it help?
17			correlation vs causation: http://www.tylervigen.com/spurious-correlations
18			Cursing on social media
19			where to put the Because Internet reading?
20			unicode, emojis
21			reading: reading tea leaves
22			MANIC patients upper case
23
24			Open vocab:
25			ADD – spreadsheet discussion of ER depression – correlation vs. LR
26
27
28
29
30			Spatial aggression units --
31			Add google search query data? Discussion of DMA?
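
Sketch for the Block 2 lecture item on controlling for multiple comparisons (Bonferroni vs. Benjamini-Hochberg), e.g. when many dictionary categories are correlated with one outcome. This is a minimal illustration in plain Python, not DLATK's implementation; the function names and the p-values are made up for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate).

    Sort the p-values, find the largest rank k with p_(k) <= (k / m) * alpha,
    and reject every hypothesis ranked k or lower.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, one per dictionary category (illustration only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]

print("Bonferroni rejects:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg rejects:", sum(benjamini_hochberg(p_values)))
```

Bonferroni compares every p-value against the same strict threshold (alpha / m), while Benjamini-Hochberg relaxes the threshold with rank, so it never rejects fewer hypotheses than Bonferroni on the same p-values.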