- title: algorithms
- date: 7/14/14-9/3/14
- time: M & W 10am - 1pm
- affiliation: Columbia University, Lede Program
- instructors: Jonathan Soma, Chris Wiggins
- location: 607c Pulitzer Hall *
Multiliteracies in algorithms: functional literacy, critical literacy, and rhetorical literacy. Within critical literacy, a strong emphasis will be on knowing what is possible. For algorithms, this usually means computational complexity -- the study of how the time needed to run an algorithm grows as the problem size (e.g., the number of data points) grows. For algorithms dealing with data, we will study how this leads to a trade-off between speed and accuracy. Within functional literacy, we will build on Python's tools for learning from data, including scikit-learn. Rhetorical literacy will be the anchor for the class, as our primary interest is in producing technology-enabled journalism.
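To make "how the time grows as the problem size grows" concrete, here is a minimal sketch that counts the steps two search strategies take on lists of different sizes. The function names are invented for this illustration; the point is only that one count grows with the list length and the other with its logarithm.

```python
def linear_search_steps(items, target):
    """Return the number of comparisons a left-to-right scan makes."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(sorted_items, target):
    """Return the number of probes binary search makes on a sorted list."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


for n in (1_000, 1_000_000):
    data = list(range(n))
    worst_case = n - 1  # the last element is the slowest case for a linear scan
    print(n, linear_search_steps(data, worst_case), binary_search_steps(data, worst_case))
```

Multiplying the list size by 1,000 multiplies the linear scan's work by 1,000 but adds only about ten probes to the binary search.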
"every piece of digital technology embeds within it a model of the world, and acts as an argument for that model." --mark hansen
- What is an algorithm?
- Algorithms in computer science (searching, sorting, clustering)
- Algorithms in real life
- Algorithmic thinking
- Step after step
- Reductions/Black boxes
- Multiliteracies
- Functional literacy
- Rhetorical literacy
- Critical literacy
- Summary of projects
- Documentation
- Agile vs Waterfall
- Analysis of algorithm
- Computationally (Functionally)
- Correctness, Termination, Time, Space
- Generality
- Critically (Nick Diakopoulos)
- Prioritization
- Classification
- Association
- Filtering
- Examples of algorithms in journalism
- QuakeBot
- Narrative Science/Automated Insights
- Projects from last class
- ISO 3103 (2)
- Royal Society of Chemistry
- Orwell
- Automated Insights (2)
- Algorithmic Accountability Reporting by Nick Diakopoulos
- Introduction to first in-class project: building a democrat detector
Course tools: scikit-learn, pandas, nltk, capitolwords.org's API (you will need to register for a key)
-Week Inspiration: Diakopoulos Report
Focus: modeling: predictive and interpretable
- tools:
- data journalism and reproducibility
- upshot on github
- e.g., rangel charity
- e.g., world cup
- reminder: same bostock as in d3
- also producing tools, e.g., statement for getting congressional press statements
- why open source?
- many eyes
- BUT this doesn't mean no bugs. cf., heartbleed
- overfitting (cf., Einstein's "Everything should be made as simple as possible, but not simpler.")
- discussion of nifty projects
- sentiment analysis: it's a thing
- example of a sentiment analysis as a service company
- hedonometer: example of a sentiment analysis research project
- more on naive bayes
- importance of probability
-Week Inspiration: Nifty project on authorship detection
- algorithms that learn from data to model the world (i.e., machine learning)
- the role of optimization in those algos
- representation (e.g., documents)
- examples: reading aloud the authorship nifty assignment
- another example: bag of words
- introduce naive Bayes
- introduce probability and Bayes rule
- go through naive Bayes
- show how it's a graphical model (pictures, organizing stories in your head, a chance to talk about complexity)
- say but don't show how you could do this with priors and for multiclass
- talk about other classification algorithms
- how do you decide which algorithm or priors are "best"?
- digression on meaning of modeling and desiderata of models
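The bag-of-words representation and naive Bayes steps in the outline above can be sketched in plain Python. This is a minimal illustration with made-up training sentences and invented helper names, not the class's actual democrat detector; in practice scikit-learn's `MultinomialNB` plays this role.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Represent a document as word counts, ignoring word order."""
    return Counter(text.lower().split())


def train(labeled_docs):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        bag = bag_of_words(text)
        word_counts[label].update(bag)
        vocabulary.update(bag)
    return class_docs, word_counts, vocabulary


def predict(text, model):
    """Return the class with the highest log posterior (up to a constant)."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word, count in bag_of_words(text).items():
            if word not in vocabulary:
                continue  # skip words never seen in training
            # add-one (Laplace) smoothing avoids log(0)
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += count * math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training sentences, for illustration only.
docs = [
    ("health care for working families", "D"),
    ("protect medicare and social security", "D"),
    ("cut taxes and reduce spending", "R"),
    ("lower taxes smaller government", "R"),
]
model = train(docs)
print(predict("reduce taxes", model))
```

The "naive" part is the sum over words: each word contributes to the score independently of every other word, which is rarely true of real language but keeps training and prediction fast.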
-Week Inspiration: what is Bayes theorem
- k-nearest neighbors (predicting from examples)
- back to 'Naive Bayes' and Bayes rule
- 'being Bayesian'
- critical literacy
- why this classifier? what else is possible?
- computational complexity: what is realistic?
- what assumptions are made?
- what is "good" modeling -- see Leo (an allusion to C.P. Snow's "The Two Cultures")
- rhetorical literacy: try something else!
- random forests
- decision trees
- e.g., in ProPublica's message machine
- iris image as simple decision tree
- SVMs
- explore scikit-learn's classification algorithms
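As one example of "what else is possible" besides naive Bayes, k-nearest neighbors predicts directly from stored examples. The sketch below assumes a tiny hand-typed set of iris-like measurements (not loaded from the course CSV) and an invented function name.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hand-typed (petal length, petal width) values in cm, for illustration only.
points = [(1.4, 0.2), (1.5, 0.2), (1.3, 0.3), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

print(knn_predict(points, labels, (1.6, 0.25)))
print(knn_predict(points, labels, (4.6, 1.3)))
```

There is no training step at all; the cost moves to prediction time, where every query is compared against every stored example. That is a useful case for the "computational complexity: what is realistic?" question.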
- introduction to unsupervised learning
- normalization via standard score
- preprocessing at command line
- more on data journalism
- supervised learning
- kmeans movie
- useful resources to learn more
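For the "normalization via standard score" item above, a minimal sketch: subtract the mean and divide by the standard deviation so that columns measured in different units become comparable before clustering. The function name and the numbers are invented for illustration.

```python
from statistics import mean, pstdev


def standard_scores(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("all values are identical; the standard score is undefined")
    return [(v - mu) / sigma for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up numbers
z = standard_scores(heights_cm)
print([round(x, 3) for x in z])
```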
Possibly useful: Bayes Rule
- supervised learning/classification with probability modeling
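A small worked example of Bayes' rule, P(H | E) = P(E | H) P(H) / P(E), may help here. The numbers are hypothetical and the function name is invented for this sketch.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) via Bayes' rule for a binary hypothesis H and evidence E."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% of documents belong to class H; a flag fires on
# 90% of them and on 5% of everything else.
p = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(round(p, 3))
```

Even with a flag that fires on 90% of true cases, most flagged documents are not in class H, because the class is rare to begin with. This is the "importance of probability" in one line of arithmetic.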
Focus: Exploratory data analysis, iterative algorithms (and therefore fast-vs-accurate)
opening questions:
- how can journalists be disciplined while facing deadlines?
- hard with deadlines; cf., "The Goat Must Be Fed: Why digital tools are missing in most newsrooms", by the Duke Reporters' Lab, May 2014
- hard even for professional developers; cf., commit logs from last night
- growing awareness is already leading to a novel field, and novel curricula. cf., the software carpentry movement.
- note that you ignore good software carpentry at your peril. cf., "How to lose $172,222 a second for 45 minutes"
- should the relationship between journalist and story end when story is published? (cf., "The leaked New York Times innovation report is one of the key documents of this media age", Joshua Benton, Nieman Journalism Lab)
- see also this summary/table of contents
- example of journalist engaging audience
- example of journalist turning relations with readers into new stories
new matters:
- (supervised) regression and (over-)fitting
- document clustering in kmeans
- 'GMM' (Gaussian/Normal/Bell curve mixture modeling)
- explanation
- image of pseudocode from ESL
- demo
- actual code for GMM
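As a rough companion to the GMM items above, here is a minimal sketch of expectation-maximization for a one-dimensional mixture of Gaussians. It is not the ESL pseudocode or the class demo; the function names, starting values, and data are assumptions made for this illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def fit_gmm_1d(data, mus, sigmas, weights, n_iter=50):
    """Run EM for a one-dimensional Gaussian mixture from given starting values."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate weights, means, and standard deviations
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor to avoid collapse
    return mus, sigmas, weights


# Made-up data with two visible clumps, near 0 and near 10.
data = [-0.4, 0.1, 0.3, -0.2, 0.5, 9.6, 10.2, 10.1, 9.8, 10.4]
mus, sigmas, weights = fit_gmm_1d(data, mus=[1.0, 8.0], sigmas=[2.0, 2.0], weights=[0.5, 0.5])
print([round(m, 2) for m in mus], [round(w, 2) for w in weights])
```

Unlike k-means, each point belongs to every cluster with some probability (its "responsibility"), so cluster boundaries are soft.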
- dimensionality reduction via PCA
- try something else in scikit-learn from among their clustering algorithms! Try changing number of clusters! Go play!
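Before playing with scikit-learn's versions, it can help to see the iterative idea behind k-means written out. This is a minimal sketch of Lloyd's algorithm with invented names and made-up points; `n_iter` caps the work, which is where the fast-vs-accurate trade-off shows up.

```python
import math


def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


# Made-up 2-D points forming two loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids, assignments)
```

Different starting centroids can lead to different final clusters, so the result depends on a choice the reader of a story never sees.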
thoughts on UNIX and algorithms in your life:
- too many aliases, mathbabe post
- example: code to introduce people to each other
- example of pipes for word counting
- killall is useful
- some example aliases
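The word-counting pipe mentioned above has a close Python analogue. This minimal sketch assumes a simple definition of "word" (runs of letters and apostrophes) and an invented function name; the sample sentence is made up.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase words, similar in spirit to `sort | uniq -c`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


sample = "the goat must be fed and the goat must be fed again"
for word, count in word_counts(sample).most_common(3):
    print(count, word)
```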
-Week Inspiration: Krugman busts out probability
- Python test
- KMeans coding - in-class version, my version
- Homework
- Google one-grams
- solutions and test scripts
- now go nuts! be free!
- Related: Zipf's law: why?
(note: lots of room for critical literacy here)
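Zipf's law says that a word's frequency is roughly proportional to one over its rank, which appears as a slope near -1 on a log-log plot. The sketch below estimates that slope by least squares; it uses synthetic counts rather than the Google one-grams, and the function name is invented.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


# Synthetic counts that follow frequency = 1000 / rank exactly (not real data).
synthetic = [1000 / rank for rank in range(1, 51)]
print(round(loglog_slope(synthetic), 3))
```

On real word counts the slope will only be approximately -1, and how far it departs is itself a question worth asking.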
- Input, Output, Precision, Determinism, Finiteness, Correctness, Generality
- Prioritization, Classification, Association, Filtering
What is a story? What's in a story?
- Generative vs. descriptive
- Plotto, the Master Book of All Plots (2) - preconditions, postconditions
- Aarne-Thompson Classification System
- Vladimir Propp: Plot Elements, Dramatis Personae
- Claude Lévi-Strauss: Structuralism, The Structural Study of Myth
- Conflict
Cinderella tales, examples: 1, 2, 3
NYT: Mike Brown's autopsy, PWC fined, Germany + the American Old West, Palin and Oil, Iraq retakes dam
- Narrative Science on Forbes, examples: 1, 2, 3, 4
- The Future of Journalism? (CJR)
- Can an Algorithm Write a Better News Story Than a Human Reporter?
- Notes on Narrative Science and Automated Insights
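One simple way to think about generated stories is as a template with slots filled from structured data, plus a few rules for word choice. The sketch below is a toy under that assumption; it is not how Narrative Science, Automated Insights, or QuakeBot are actually implemented, and the field names and thresholds are invented.

```python
def earthquake_story(quake):
    """Fill a fixed sentence template from a dictionary of structured facts."""
    if quake["magnitude"] >= 6.0:
        adjective = "strong"
    elif quake["magnitude"] >= 4.0:
        adjective = "moderate"
    else:
        adjective = "minor"
    return (
        f"A {adjective} magnitude {quake['magnitude']:.1f} earthquake was reported "
        f"{quake['distance_miles']} miles from {quake['place']} on {quake['date']}."
    )


# Hypothetical record; the field names and thresholds are invented for this sketch.
record = {"magnitude": 4.4, "distance_miles": 6, "place": "Exampleville", "date": "Monday"}
print(earthquake_story(record))
```

The thresholds that turn a number into "strong" or "minor" are editorial decisions written as code, which is where critical literacy comes in.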
What's your angle? Trends, correlations, inflection points
ProPublica's Opportunity Gap
Writeup: How To Edit 52,000 Stories at Once
- Stuyvesant High: ProPublica, Big Apple Ed, Open House Packet, IB Times, NY Post
- Brooklyn Tech: ProPublica, Big Apple Ed, Technology Analysis
- William Cullen Bryant: ProPublica, Big Apple Ed, Wikipedia
- Harvey Milk: ProPublica, Big Apple Ed
- Networks
- introduction and examples in data journalism
- from NYT: oscar net
- from bostock:
- lots of work from gilad lotan, e.g., recent media analysis around gaza
- critical literacy: how do you reduce human interactions into a graph?
- centralities (find 'important' nodes)
- functional literacy
- critical literacy: does choice of centrality matter?
- graph drawing/graph visualization
- critical literacy: does graph drawing mean anything? what are the axes?
- graph partitioning/community detection
- example in python by wiggins and hofman, using latent variable model, a special case of factor analysis
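To explore the question "does choice of centrality matter?", here is a minimal sketch that computes two centralities on a small made-up network stored as adjacency sets. The function names mirror ones networkx provides (`degree_centrality`, `closeness_centrality`), but this plain-Python version is only an illustration and assumes an undirected, connected graph.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def closeness_centrality(graph):
    """Inverse of the average shortest-path distance to reachable nodes."""
    scores = {}
    for source in graph:
        distance = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        total = sum(distance.values())
        scores[source] = (len(distance) - 1) / total if total else 0.0
    return scores


# Made-up network: two hubs with three followers each, joined by one bridge node.
graph = {
    "hub1": {"a", "b", "c", "bridge"},
    "a": {"hub1"}, "b": {"hub1"}, "c": {"hub1"},
    "bridge": {"hub1", "hub2"},
    "hub2": {"bridge", "d", "e", "f"},
    "d": {"hub2"}, "e": {"hub2"}, "f": {"hub2"},
}
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
print(max(degree, key=degree.get), max(closeness, key=closeness.get))
```

In this toy network the hubs have the most connections, but the bridge is closest to everyone on average, so the "most important" node depends on which definition is chosen.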
things we'll use today:
deep thoughts/tangents:
- data journalism is not qualitatively different from other journalism. they're both awesome because they involve thinking clearly. they're both limited by subjective choices, including design choices and process choices.
- great quote related to the above, from a post by a stats grad student about a MOOC on data driven journalism.
I loved some of the language that came up, such as "backgrounding the data" -- analogous to checking out your sources to see how much you can trust them -- or "interrogating the data," including coming prepared to the "data interview" to ask thorough, thoughtful questions. I'd love to see a Statistics 101 course taught from this perspective. Statisticians do these things all the time, but our terminology and approach seem alien and confusing the first few times you see them. "Thinking like a journalist" and "thinking like a statistician" are not all that different, and the former might be a much more approachable path to the latter.
- lior's awesome angry blog: even grownups (people with PhDs) working on very, very important problems can do very, very bad statistics.
- stacks: networks of theorems
- networks in corporate boards:
Possibly useful
- networkx is like flying by importing antigravity.
- 2003 review article
- cathy's 2012 blog post based on a lecture from john kelly of morningside analytics
- Social Network Analysis as a method in the Data Journalistic toolkit by Adriana Homolova - Academia.edu
- Social Network Analysis for Journalists Using the Twitter API
- gephi (I don't actually use it but it's very widely used and very pretty)
- cytoscape: similar, but invented by biologists.
- Graphs
- No class! (Labor day)
- Demos
- the site
- their tutorial on inference
- their tutorial on scikit-learn
- 1-page algorithm cheat sheet
- (longer) user guide
- many examples
- Reading Machines, Stephen Ramsay, 2011
- NBA Census: https://raw.githubusercontent.com/ledeprogram/courses/master/algorithms/NBA-Census-10.14.2013.csv
- Iris data: https://raw.githubusercontent.com/ledeprogram/courses/master/algorithms/data/iris.csv
- Authorship data: https://raw.githubusercontent.com/ledeprogram/courses/master/algorithms/data/books/book-data.csv
- Mystery books: 1 2 3 4 5