algorithms

title: algorithms
date: 7/14/14-9/3/14
time: M & W 10am - 1pm
affiliation: Columbia University, Lede Program
instructors: Jonathan Soma, Chris Wiggins
location: 607c Pulitzer Hall *

Multiliteracies in algorithms: functional literacy, critical literacy, and rhetorical literacy. Within critical literacy, a strong emphasis will be on knowing what is possible. For algorithms, this usually means computational complexity -- the study of how the time needed to run an algorithm grows as the problem size (e.g., the number of data points) grows. For algorithms dealing with data, we will study how this leads to a trade-off between speed and accuracy. Within functional literacy, we will build on Python's tools for learning from data, including scikit-learn. Rhetorical literacy will be the anchor for the class, as our primary interest is in producing technology-enabled journalism.
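A minimal sketch of what "how the time grows as the problem size grows" means in practice. It counts comparisons for two ways of finding an item in a sorted list; the function names are made up for this illustration and are not part of the course materials.

```python
def linear_search_steps(items, target):
    """Count comparisons made by scanning left to right."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(items, target):
    """Count comparisons made by repeatedly halving a sorted list."""
    steps = 0
    low, high = 0, len(items) - 1
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps


for n in (10, 1_000, 100_000):
    data = list(range(n))
    last = n - 1  # the last element is the worst case for the linear scan
    print(n, linear_search_steps(data, last), binary_search_steps(data, last))
```

The linear scan's count grows in proportion to n, while the halving strategy grows roughly with log2(n). That gap is the kind of thing computational complexity describes.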

"every piece of digital technology embeds within it a model of the world, and acts as an argument for that model." --mark hansen

Schedule and notes:

Week 1: Intro to Algorithms

What is an algorithm?
- Algorithms in computer science (searching, sorting, clustering)
- Algorithms in real life
Algorithmic thinking
- Step after step
- Reductions/Black boxes
Multiliteracies
- Functional literacy
- Rhetorical literacy
- Critical literacy
Summary of projects
- Documentation
- Agile vs Waterfall
Analysis of algorithm
- Computationally (Functionally)
  - Correctness, Termination, Time, Space
  - Generality
- Critically (Nick Diakopoulos)
  - Prioritization
  - Classification
  - Association
  - Filtering
Examples of algorithms in journalism
- QuakeBot
- Narrative Science/Automated Insights
- Projects from last class

Day One Links

ISO 3103 (2)
Royal Society of Chemistry
Orwell
Automated Insights (2)
Algorithmic Accountability Reporting by Nick Diakopoulos

Wednesday 7/16

Introduction to first in-class project: building a democrat detector

Course tools: scikit-learn, pandas, nltk, capitolwords.org's API (you will need to register for a key)

-Week Inspiration: Diakopoulos Report

Week 2: Supervised learning

Focus: modeling: predictive and interpretable

tools:
- scikit-learn
- nltk
- pandas
data journalism and reproducibility
- upshot on github
  - e.g., rangel charity
  - e.g., world cup
  - reminder: same bostock as in d3
  - also producing tools, e.g., statement for getting congressional press statements
why open source?
- many eyes
- BUT this doesn't mean no bugs. cf., heartbleed
overfitting (cf., Einstein's "Everything should be made as simple as possible, but not simpler.")
discussion of nifty projects
- sentiment analysis: it's a thing
- example of a sentiment analysis as a service company
- hedonometer: example of a sentiment analysis research project
more on naive bayes
- example of naive bayes in bash script on enron email dataset
- example of naive bayes in awk
- this is not how spam works, though some people think it is
importance of probability

-Week Inspiration: Nifty project on authorship detection

Monday 7/21

overview/concepts:

algorithms that learn from data to model the world (i.e., machine learning)
the role of optimization in those algos
representation (e.g., documents)
examples: reading aloud the authorship nifty assignment
another example: bag of words
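A minimal sketch of the bag-of-words representation mentioned above, assuming the simplest possible tokenizer (lowercase, letters and apostrophes only). The example sentence is invented; `bag_of_words` is a hypothetical helper, not a function from nltk or scikit-learn.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


doc = "The senator thanked the chair, and the chair thanked the senator."
bag = bag_of_words(doc)
print(bag.most_common(3))
```

Note what the representation throws away: word order. Two documents with the same words in a different order get the same bag, which is a modeling choice worth questioning.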

math:

introduce naive Bayes
introduce probability and Bayes rule
go through naive Bayes
show how it's a graphical model (pictures, organizing stories in your head, a chance to talk about complexity)
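A sketch of naive Bayes built directly on Bayes rule, P(class | words) proportional to P(class) times the product of P(word | class). The "naive" part is treating words as independent given the class. The training data below is invented, and the add-one smoothing is one common choice among several, not necessarily what was used in class.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for words, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
    return word_counts, class_counts


def predict(words, word_counts, class_counts):
    """Return the class with the highest log posterior (up to a constant)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # log P(class): the prior
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            # log P(word | class) with add-one (Laplace) smoothing
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy data, for illustration only.
training = [
    ("budget vote senate".split(), "politics"),
    ("senate bill vote".split(), "politics"),
    ("goal match striker".split(), "sports"),
    ("match referee goal".split(), "sports"),
]
word_counts, class_counts = train(training)
print(predict("vote on the bill".split(), word_counts, class_counts))
```

Working in logs avoids multiplying many tiny probabilities together, and smoothing keeps a single unseen word from zeroing out a class.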

extensions:

say but don't show how you could do this with priors and for multiclass
talk about other classification algorithms
how do you decide what algorithm or priors are "best"?
digression on meaning of modeling and desiderata of models

Fun data to play with

-Week Inspiration: what is Bayes theorem

Wednesday 7/23

k-nearest neighbors (predicting from examples)
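A minimal from-scratch sketch of k-nearest neighbors: to label a new point, find the k closest labeled examples and take a majority vote. The measurements and labels are invented, and `knn_predict` is a hypothetical helper written for this note.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k closest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented two-feature measurements, for illustration only.
points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
labels = ["small", "small", "small", "large", "large", "large"]
print(knn_predict(points, labels, (1.6, 0.4)))
```

There is no training step at all: every prediction scans the whole dataset, so the cost per query grows with the number of examples.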

Week 3: Probability and statistics

Monday 7/28

back to 'Naive Bayes' and Bayes rule
'being Bayesian'
critical literacy
- why this classifier? what else is possible?
- computational complexity: what is realistic?
- what assumptions are made?
- what is "good" modeling -- see Leo (an allusion to CP Snow's "The Two Cultures")
rhetorical literacy: try something else!
- random forests
- decision trees
  - e.g., in ProPublica's message machine
  - iris image as simple decision tree
- SVMs
- explore scikit-learn's classification algorithms
introduction to unsupervised learning
- normalization via standard score
- preprocessing at command line
more on data journalism
- NYT LIRR example, 2012
- Lucia de Berk
- forensic bioinformatics
unsupervised learning
- kmeans movie
useful resources to learn more
- free book "triplets" aka ESL
- map of algorithms, including k-means, GMM; kNN, NB, decision trees
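A small sketch of the "normalization via standard score" item above: subtract the mean and divide by the standard deviation, so every feature ends up on a comparable scale. The numbers are invented, and using the population standard deviation is an assumption made here for simplicity.

```python
import statistics


def standard_scores(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


heights_cm = [198, 201, 206, 211, 185]  # invented numbers, for illustration only
z = standard_scores(heights_cm)
print([round(score, 2) for score in z])
```

This matters most for distance-based methods such as kNN and k-means, where a feature measured in large units would otherwise dominate the distance.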

Possibly useful: Bayes Rule

Wednesday 7/30

supervised learning/classification with probability modeling

Week 4: Unsupervised learning

Focus: Exploratory data analysis, iterative algorithms (and therefore fast-vs-accurate)

Monday 8/4

opening questions:

how can journalists be disciplined while facing deadlines?
- hard with deadlines; cf., "The Goat Must Be Fed: Why digital tools are missing in most newsrooms", by the Duke Reporters' Lab, May 2014
- hard even for professional developers; cf., commit logs from last night
- growing awareness is already leading to novel field, and novel curricula. cf., the software carpentry movement.
- note that you ignore good software carpentry at your peril. cf., "How to lose $172,222 a second for 45 minutes"
should the relationship between journalist and story end when the story is published? (cf., "The leaked New York Times innovation report is one of the key documents of this media age", Joshua Benton, Nieman Journalism Lab)
- see also this summary/table of contents
- example of journalist engaging audience
- example of journalist turning relations with readers into new stories

new matters:

bayes, naively
(supervised) regression and (over-)fitting
- explanation
- code
document clustering in kmeans
- code
- note: uses TFIDF
- related example in kmeans: digits
'GMM' (Gaussian/Normal/Bell curve mixture modeling)
- explanation
- image of pseudocode from ESL
- demo
  - explanation
  - code
- actual code for GMM
dimensionality reduction via PCA
try something else in scikit-learn from among their clustering algorithms! Try changing number of clusters! Go play!
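The document clustering example above notes that it uses TFIDF. Here is a hedged sketch of the idea, using raw counts for term frequency and log(N / df) for inverse document frequency. This is one textbook variant; scikit-learn applies a smoothed formula, so its numbers will differ. The documents are invented.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {word: weight} dict per document.

    Uses tf = raw count and idf = log(N / df), a simple textbook variant.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for words in tokenized for word in set(words))
    weights = []
    for words in tokenized:
        tf = Counter(words)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


docs = [
    "the bill passed the senate",
    "the senate debated the budget",
    "the striker scored the goal",
]
for row in tfidf(docs):
    print({word: round(weight, 2) for word, weight in row.items()})
```

A word that appears in every document gets weight zero, while a word unique to one document gets the largest weight. That is why TFIDF vectors tend to cluster better than raw word counts.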

thoughts on UNIX and algorithms in your life:

too many aliases, mathbabe post
example: code to introduce people to each other
example of pipes for word counting
killall is useful
some example aliases
- gugc for better git discipline
- repo for better repository discipline
- mypy for dealing with multiple python installs

-Week Inspiration: Krugman busts out probability

Wednesday 8/6

Python test
KMeans coding - in-class version, my version
Homework
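For reference alongside the k-means exercise, a from-scratch sketch of the standard iteration (often called Lloyd's algorithm): assign each point to its nearest center, move each center to the mean of its points, repeat until nothing changes. This is not the in-class version; the points and starting centers are invented.

```python
import math


def kmeans(points, centers, max_iterations=100):
    """Alternate assignment and update steps until the centers stop moving."""
    for iteration in range(1, max_iterations + 1):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
            clusters[nearest].append(point)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:  # nothing moved, so we have converged
            break
        centers = new_centers
    return centers, clusters, iteration


# Invented 2-D points forming two loose groups, for illustration only.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters, iterations = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers, iterations)
```

The answer depends on the starting centers, so in practice people rerun with several random starts and keep the best result. More restarts cost more time, which is the fast-vs-accurate trade-off in miniature.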

Week 5: Nifty projects

Monday 8/11

Google one-grams
- solutions and test scripts
- now go nuts! be free!
Related: Zipf's law: why?
- in word counts (from Peter Norvig)
- in neuroscience
- in general
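Zipf's law says a word's frequency is roughly proportional to 1 / rank, so rank times count should stay roughly constant. A minimal sketch for checking that on any text; the sample string is a stand-in far too short to show the law, and `rank_frequency` is a hypothetical helper.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


# Any long text works; this short sample is only a stand-in.
sample = "the cat and the dog and the bird saw the cat"
for rank, word, count in rank_frequency(sample):
    # On a large corpus, rank * count should be roughly constant.
    print(rank, word, count, rank * count)
```

To try it for real, read in one of the books from the data sets section, or the one-gram counts, and plot log(count) against log(rank); a roughly straight line is the signature.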

Wednesday 8/13

Twitter sentiment mapping

(note: lots of room for critical literacy here)

Week 6: Algorithmic story generation

Monday 8/18

Input, Output, Precision, Determinism, Finiteness, Correctness, Generality
Prioritization, Classification, Association, Filtering

Quakebot: on Source, on Slate

Storytelling

What is a story? What's in a story?

Generative vs. descriptive
Plotto, the Master Book of All Plots (2) - preconditions, postconditions
Aarne-Thompson Classification System
Vladimir Propp: Plot Elements, Dramatis Personae
Claude Lévi-Strauss: Structuralism, The Structural Study of Myth
Conflict

Cinderella tales, examples: 1, 2, 3

NYT: Mike Brown's autopsy, PWC fined, Germany + the American Old West, Palin and Oil, Iraq retakes dam

Narrative Science (and Automated Insights)

Narrative Science on Forbes, examples: 1, 2, 3, 4
The Future of Journalism? (CJR)
Can an Algorithm Write a Better News Story Than a Human Reporter?
Notes on Narrative Science and Automated Insights

What's your angle? Trends, correlations, inflection points

ProPublica's Opportunity Gap

Writeup: How To Edit 52,000 Stories at Once

Stuyvesant High: ProPublica, Big Apple Ed, Open House Packet, IB Times, NY Post
Brooklyn Tech: ProPublica, Big Apple Ed, Technology Analysis
William Cullen Bryant: ProPublica, Big Apple Ed, Wikipedia
Harvey Milk: ProPublica, Big Apple Ed

Wednesday 8/20

For reference

Our notes

Week 7: Networks and graphs

Monday 8/25

Networks
- introduction and examples in data journalism
  - from NYT: oscar net
  - from bostock:
  - lots of work from gilad lotan, e.g., recent media analysis around gaza
  - critical literacy: how do you reduce human interactions into a graph?
- centralities (find 'important' nodes)
  - functional literacy
  - critical literacy: does choice of centrality matter?
- graph drawing/graph visualization
  - critical literacy: does graph drawing mean anything? what are the axes?
- graph partitioning/community detection
  - example in python by wiggins and hofman, using latent variable model, a special case of factor analysis
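A sketch of the simplest centrality from the list above, degree centrality: the fraction of other nodes each node is directly connected to. It is written from scratch so the arithmetic is visible; networkx offers `degree_centrality` for real work. The edge list is invented.

```python
def degree_centrality(edges):
    """Fraction of the other nodes that each node is directly connected to."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


# Invented "who talked to whom" edges, for illustration only.
edges = [("ana", "ben"), ("ana", "cy"), ("ana", "dee"), ("ben", "cy"), ("dee", "eli")]
centrality = degree_centrality(edges)
for node, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(node, round(score, 2))
```

On the critical literacy question: here dee ties with ben and cy by degree, yet dee is the only route to eli. A measure based on shortest paths, such as betweenness, would rank dee above them, so the choice of centrality does change who looks "important".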

things we'll use today:

example notebook
cool python thingy featuring the much studied 1977 zachary karate graph

deep thoughts/tangents:

data journalism is not qualitatively different from other journalism. they're both awesome because they involve thinking clearly. they're both limited by subjective choices, including design choices and process choices.
great quote related to the above, from a post by a stats grad student about a MOOC on data driven journalism.

I loved some of the language that came up, such as "backgrounding the data" -- analogous to checking out your sources to see how much you can trust them -- or "interrogating the data," including coming prepared to the "data interview" to ask thorough, thoughtful questions. I'd love to see a Statistics 101 course taught from this perspective. Statisticians do these things all the time, but our terminology and approach seem alien and confusing the first few times you see them. "Thinking like a journalist" and "thinking like a statistician" are not all that different, and the former might be a much more approachable path to the latter.

lior's awesome angry blog: even grownups (people with PhDs) working on very, very important problems can do very, very bad statistics.
- example of ire
stacks: networks of theorems
networks in corporate boards:
- theyrule
- muckety
  - help in muckety
  - about
  - e.g., the google

Possibly useful

networkx is like flying by importing antigravity.
2003 review article
cathy's 2012 blog post based on a lecture from john kelly of morningside analytics
Social Network Analysis as a method in the Data Journalistic toolkit by Adriana Homolova - Academia.edu
Social Network Analysis for Journalists Using the Twitter API
gephi (I don't actually use it but it's very widely used and very pretty)
cytoscape: similar, but invented by biologists.

Wednesday 8/27

Graphs

Week 8: Final project demo

Monday 9/1

No class! (Labor day)

Wednesday 9/3

Demos

additional resources

scikit-learn

a book

some fun data, none of which has an API

Readings

Reading Machines, Stephen Ramsay, 2011

Python

warning on upgrades

Data sets

NBA Census: https://raw.githubusercontent.com/ledeprogram/courses/master/algorithms/NBA-Census-10.14.2013.csv
Iris data: https://raw.githubusercontent.com/ledeprogram/courses/master/algorithms/data/iris.csv
Authorship data: https://raw.githubusercontent.com/ledeprogram/courses/master/algorithms/data/books/book-data.csv
Mystery books: 1 2 3 4 5

