{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "ISRC Python Workshop: Text Analytics in Python\n", "\n", "___Text Analytics in Python___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "@author: Zhiya Zuo\n", "\n", "@email: zhiya-zuo@uiowa.edu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Text data has become one of the most important data sources across many areas. Doctors' notes, students' comments/feedback, product reviews, as well as social media text (tweets/Facebook posts), are all valuable resources. In this workshop, we will use a toy dataset to go through a common text analytics procedure in Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's first load a toy dataset (I did a little bit of preprocessing) used in the paper: \n", "\n", "> Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).\n", "\n", "This is actually the test set (used for algorithm evaluation), but we will use it as an illustration of common text analytics procedures." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:02:34.663292Z", "start_time": "2018-04-03T19:02:34.249602Z" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:02:35.089928Z", "start_time": "2018-04-03T19:02:35.034226Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class_index title \\\n", "0 3 Fears for T N pension after talks \n", "1 4 The Race is On: Second Private Team Sets Launc... \n", "\n", " description class_name \n", "0 Unions representing workers at Turner Newall... Business \n", "1 SPACE.com - TORONTO, Canada -- A second\\team o... Sci/Tech " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('sample-data/ag_news.csv')\n", "df.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are four columns: \n", "- `class_index` and `class_name`, which annotate the content type\n", "- `title` and `description` of each news piece" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's combine the title and description into a column called `content` and drop the unnecessary columns." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:03:21.954463Z", "start_time": "2018-04-03T19:03:21.932703Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class_index class_name content\n", "0 3 Business Fears for T N pension after talks. Unions repr...\n", "1 4 Sci/Tech The Race is On: Second Private Team Sets Launc..." ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['content'] = df['title'] + '. ' + df['description']\n", "df.drop(columns=['title', 'description'], inplace=True)\n", "df.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Document Representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While we as humans can understand text, machines can hardly work with text as it is. A classic way to represent text data mathematically is to convert it into [vector spaces](https://en.wikipedia.org/wiki/Vector_space_model). Specifically, each document is stored as a vector, where each element is a ___weight___ for one term. A simple and commonly used approach is to tokenize documents and use term frequencies or [term frequency-inverse document frequency (TFIDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) as weights." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Bag-of-Words (BOW)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While more sophisticated approaches are available, the BOW representation is still very popular due to its simplicity. In this representation, we ignore word order and treat each document as ___a bag of words___. While this is a very strong assumption, it still makes sense -- we can understand a sentence even if the words are randomly ordered. 
For example, we can easily understand the following sentence:\n", "\n", "> sitting a chair is cat there on" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Python, we can use a `numpy.ndarray` or `list` to save this information:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:05:48.985630Z", "start_time": "2018-04-03T19:05:48.980703Z" } }, "outputs": [ { "data": { "text/plain": [ "['sitting', 'a', 'chair', 'is', 'cat', 'there', 'on']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence = ['sitting', 'a', 'chair', 'is', 'cat', 'there', 'on']\n", "sentence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Tokenize" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now apply this to our toy dataset. In English text, we can simply split each document by spaces. We also usually want to remove punctuation because it does not provide valuable information. Last but not least, we may want to exclude some very common words (e.g., `the`, `we`, `you`, `is`, etc.), called [___stop words___](https://en.wikipedia.org/wiki/Stop_words)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:08:07.456272Z", "start_time": "2018-04-03T19:08:07.451984Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~\n" ] } ], "source": [ "from string import punctuation\n", "print(punctuation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If there are no special needs, we can simply use a common stop word list from `nltk`." 
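The order-invariance behind the bag-of-words assumption is easy to see in code: a bag is just a multiset of tokens, so any reordering produces the same bag. Here is a minimal sketch using the standard library's `collections.Counter` (an illustration added here, not part of the original notebook):

```python
from collections import Counter

# A bag of words is a multiset: counts matter, order does not.
tokens = ['there', 'is', 'a', 'cat', 'sitting', 'on', 'a', 'chair']
shuffled = ['sitting', 'a', 'chair', 'a', 'is', 'cat', 'there', 'on']

bag = Counter(tokens)
print(bag == Counter(shuffled))  # True: reordering does not change the bag
print(bag.most_common(1))        # [('a', 2)]
```

This is exactly the property that lets us discard word order when building document vectors.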
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:09:21.140831Z", "start_time": "2018-04-03T19:09:19.949867Z" } }, "outputs": [ { "data": { "text/plain": [ "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.corpus import stopwords\n", "eng_stopwords = stopwords.words('english')\n", "eng_stopwords[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can convert the contents into bags of words. To do this, we can use [`nltk`'s tokenize module](http://www.nltk.org/api/nltk.tokenize.html)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:10:59.896227Z", "start_time": "2018-04-03T19:10:59.893125Z" } }, "outputs": [], "source": [ "import nltk" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:11:00.209289Z", "start_time": "2018-04-03T19:11:00.199449Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class_index class_name content\n", "0 3 Business Fears for T N pension after talks. Unions repr...\n", "1 4 Sci/Tech The Race is On: Second Private Team Sets Launc..." ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(2)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:11:04.224280Z", "start_time": "2018-04-03T19:11:00.712523Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fears; for; t; n; pension; after; talks; .; unions; representing; workers; at; turner; newall; say; they; are; 'disappointed; '; after; talks; with; stricken; parent; firm; federal; mogul; .\n" ] } ], "source": [ "# also convert them to lower case\n", "bow = [nltk.word_tokenize(content.lower()) for content in df['content'].values]\n", "# show the first 1\n", "print('; '.join(bow[0]))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:12:46.084360Z", "start_time": "2018-04-03T19:12:46.080095Z" } }, "outputs": [ { "data": { "text/plain": [ "7600" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(bow)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:12:53.252026Z", "start_time": "2018-04-03T19:12:53.247033Z" } }, "outputs": [ { "data": { "text/plain": [ "['fears',\n", " 'for',\n", " 't',\n", " 'n',\n", " 'pension',\n", " 'after',\n", " 'talks',\n", " '.',\n", " 'unions',\n", " 'representing',\n", " 'workers',\n", " 'at',\n", " 'turner',\n", " 'newall',\n", " 'say',\n", " 'they',\n", " 'are',\n", " \"'disappointed\",\n", " \"'\",\n", " 'after',\n", " 'talks',\n", " 'with',\n", " 'stricken',\n", " 'parent',\n", " 'firm',\n", " 'federal',\n", " 'mogul',\n", " '.']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bow[0]" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "Let's remove the stop words and punctuation. Note that there are some very short words of length 1; I typically remove them as well if they do not carry special meaning. Also, I will remove pure numbers using `str.isdigit`. Note that this may not catch every numeric string, but for simplicity we will just go with it. See more discussion of this issue [here](https://stackoverflow.com/questions/354038/how-do-i-check-if-a-string-is-a-number-float?page=1&tab=votes#tab-top)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:12:31.540518Z", "start_time": "2018-04-03T19:12:31.537528Z" } }, "outputs": [], "source": [ "min_length = 3 # define the custom minimum word length" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:14:32.306745Z", "start_time": "2018-04-03T19:14:31.652246Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fears; n; pension; talks; unions; representing; workers; turner; newall; say; 'disappointed; talks; stricken; parent; firm; federal; mogul\n" ] } ], "source": [ "bow = [[w for w in d if w not in punctuation and w not in eng_stopwords and not w.isdigit()] for d in bow]\n", "print('; '.join(bow[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, sometimes we need further data cleaning to remove punctuation inside words. In this example, we want to remove the quote in the word \"disappointed\". 
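To make the `str.isdigit` caveat mentioned above concrete: it only recognizes strings of plain digit characters, so decimals and negative numbers slip through the filter. A hedged sketch of a more permissive check via `float` (the `is_number` helper is my own illustration, not from the notebook):

```python
def is_number(token):
    """True if the token parses as a float ('3.14', '-7', '1e3', ...)."""
    try:
        float(token)
        return True
    except ValueError:
        return False

print('2004'.isdigit())   # True: plain digits are caught
print('3.14'.isdigit())   # False: the decimal point breaks isdigit
print('-7'.isdigit())     # False: so does the minus sign
print(is_number('3.14'), is_number('-7'))  # True True
print(is_number('cat'))   # False
```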
In this case, we can utilize [`string.translate`](https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate):" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:15:39.553891Z", "start_time": "2018-04-03T19:15:39.339770Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fears; pension; talks; unions; representing; workers; turner; newall; say; disappointed; talks; stricken; parent; firm; federal; mogul\n" ] } ], "source": [ "# do not translate anything, except for removing all punctuations\n", "trans = str.maketrans('', '', punctuation)\n", "bow = [[w.translate(trans).strip() for w in d] for d in bow]\n", "bow = [[w for w in d if len(w) >= min_length] for d in bow]\n", "print('; '.join(bow[0]))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:15:56.753465Z", "start_time": "2018-04-03T19:15:56.747851Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "race; second; private; team; sets; launch; date; human; spaceflight; spacecom; spacecom; toronto; canada; secondteam; rocketeers; competing; million; ansari; prize; contest; forprivately; funded; suborbital; space; flight; officially; announced; firstlaunch; date; manned; rocket\n", "---------------\n", "company; wins; grant; study; peptides; company; founded; chemistry; researcher; university; louisville; grant; develop; method; producing; better; peptides; short; chains; amino; acids; building; blocks; proteins\n", "---------------\n", "prediction; unit; helps; forecast; wildfires; barely; dawn; mike; fitzpatrick; starts; shift; blur; colorful; maps; figures; endless; charts; already; knows; day; bring; lightning; strike; places; expects; winds; pick; moist; places; dry; flames; roar\n", "---------------\n", "calif; aims; limit; farmrelated; smog; southern; california; smogfighting; agency; 
went; emissions; bovine; variety; friday; adopting; nation; first; rules; reduce; air; pollution; dairy; cow; manure\n", "---------------\n" ] } ], "source": [ "for i in range(1, 5):\n", " print('; '.join(bow[i]))\n", " print('---------------')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Stemming/Lemmatization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, it is noteworthy that one word can take different forms. For example, ___run___ can appear as ___run___, ___runs___, ___ran___, and ___running___. While the forms differ, they mean the same thing. There are two common methods to reduce a word back to its root form." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first one is called ___stemming___, where each word is reduced to its \"stem\". For example, the stem of the word ___fly___ will be ___fli___." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:18:20.629846Z", "start_time": "2018-04-03T19:18:20.621641Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fly : fli\n", "papers : paper\n", "communication : commun\n", "community : commun\n" ] } ], "source": [ "from nltk import PorterStemmer\n", "stemmer = PorterStemmer()\n", "for w in ['fly', 'papers', 'communication', 'community']:\n", " print(w, ': ', stemmer.stem(w))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While this seems okay, it is sometimes hard to ___reverse___ the stemming result (e.g., both \"communication\" and \"community\" are transformed to \"commun\", although they mean very different things). A second choice is ___lemmatization___, which reduces a word to its [___lemma___](https://en.wikipedia.org/wiki/Lemma_(morphology)), i.e., its canonical or dictionary form." 
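The irreversibility of stemming can be made concrete: once distinct words collapse to the same stem, the original forms cannot be recovered. A small stdlib sketch built from the `PorterStemmer` outputs printed just above:

```python
from collections import defaultdict

# Stems copied from the PorterStemmer output above.
stems = {'communication': 'commun', 'community': 'commun', 'papers': 'paper'}

# Inverting the mapping shows the ambiguity: one stem, several words.
inverse = defaultdict(list)
for word, stem in stems.items():
    inverse[stem].append(word)

print(inverse['commun'])  # ['communication', 'community'] -- cannot tell apart
print(inverse['paper'])   # ['papers']
```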
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:20:00.787775Z", "start_time": "2018-04-03T19:19:58.489186Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "flies : fly\n", "papers : paper\n", "communication : communication\n", "communities : community\n" ] } ], "source": [ "from nltk import WordNetLemmatizer\n", "lemmatizer = WordNetLemmatizer()\n", "for w in ['flies', 'papers', 'communication', 'communities']:\n", " print(w, ': ', lemmatizer.lemmatize(w))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For simplicity, let's just use stemming." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:21:05.035022Z", "start_time": "2018-04-03T19:20:59.956056Z" } }, "outputs": [], "source": [ "bow = [[stemmer.stem(w) for w in d] for d in bow]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:21:05.043110Z", "start_time": "2018-04-03T19:21:05.036895Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "race; second; privat; team; set; launch; date; human; spaceflight; spacecom; spacecom; toronto; canada; secondteam; rocket; compet; million; ansari; prize; contest; forpriv; fund; suborbit; space; flight; offici; announc; firstlaunch; date; man; rocket\n", "------------------------------------------------------------------------------------------------------------------------------------------------------\n", "compani; win; grant; studi; peptid; compani; found; chemistri; research; univers; louisvil; grant; develop; method; produc; better; peptid; short; chain; amino; acid; build; block; protein\n", "------------------------------------------------------------------------------------------------------------------------------------------------------\n", "predict; unit; help; forecast; wildfir; bare; dawn; mike; fitzpatrick; start; shift; 
blur; color; map; figur; endless; chart; alreadi; know; day; bring; lightn; strike; place; expect; wind; pick; moist; place; dri; flame; roar\n", "------------------------------------------------------------------------------------------------------------------------------------------------------\n", "calif; aim; limit; farmrel; smog; southern; california; smogfight; agenc; went; emiss; bovin; varieti; friday; adopt; nation; first; rule; reduc; air; pollut; dairi; cow; manur\n", "------------------------------------------------------------------------------------------------------------------------------------------------------\n" ] } ], "source": [ "for i in range(1, 5):\n", " print('; '.join(bow[i]))\n", " print('---------------'*10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Vector Space Model (VSM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have bags of words, we can create vectors based on these tokens. Instead of using the raw tokens, it is easier and more efficient to use integer indices: each unique token is mapped to an integer id. For example, if `race` were the first word in the vocabulary, the id `0` would map to `race`." 
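Before handing things over to `gensim`, it may help to see what a token-to-id dictionary and a sparse count vector look like when built by hand. The sketch below is a simplified stand-in for `gensim.corpora.Dictionary` and its `doc2bow` method (the toy documents and variable names are my own), with TF-IDF weights computed using the common log(N/df) variant, one of several formulations:

```python
import math
from collections import Counter

docs = [['race', 'team', 'launch', 'race'],
        ['team', 'grant', 'study']]

# Map each unique token to an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

# A "doc2bow"-style sparse vector: sorted (token_id, count) pairs.
bows = [sorted((token2id[t], c) for t, c in Counter(doc).items())
        for doc in docs]
print(bows)  # [[(0, 2), (1, 1), (2, 1)], [(1, 1), (3, 1), (4, 1)]]

# TF-IDF: term count times log(N / document frequency).
N = len(docs)
df = Counter(tid for bow in bows for tid, _ in bow)
tfidf = [[(tid, c * math.log(N / df[tid])) for tid, c in bow] for bow in bows]
# 'team' (id 1) appears in both documents, so its weight drops to 0.0.
```

In the real pipeline, `dictionary.doc2bow(bow[0])` produces the same kind of (id, count) pairs.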
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this task, I like to use [`gensim`](https://radimrehurek.com/gensim/index.html), which has a library of very well written and convenient APIs, especially for [topic modeling](https://en.wikipedia.org/wiki/Topic_model) and [word2vec](https://rare-technologies.com/word2vec-tutorial/) algorithms:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:25:41.068099Z", "start_time": "2018-04-03T19:25:40.120687Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dictionary(17958 unique tokens: ['disappoint', 'fear', 'feder', 'firm', 'mogul']...)\n" ] } ], "source": [ "import gensim\n", "dictionary = gensim.corpora.Dictionary(bow)\n", "print(dictionary)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:26:14.668636Z", "start_time": "2018-04-03T19:26:14.638436Z" }, "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "{'disappoint': 0,\n", " 'fear': 1,\n", " 'feder': 2,\n", " 'firm': 3,\n", " 'mogul': 4,\n", " 'newal': 5,\n", " 'parent': 6,\n", " 'pension': 7,\n", " 'repres': 8,\n", " 'say': 9,\n", " 'stricken': 10,\n", " 'talk': 11,\n", " 'turner': 12,\n", " 'union': 13,\n", " 'worker': 14,\n", " 'announc': 15,\n", " 'ansari': 16,\n", " 'canada': 17,\n", " 'compet': 18,\n", " 'contest': 19,\n", " 'date': 20,\n", " 'firstlaunch': 21,\n", " 'flight': 22,\n", " 'forpriv': 23,\n", " 'fund': 24,\n", " 'human': 25,\n", " 'launch': 26,\n", " 'man': 27,\n", " 'million': 28,\n", " 'offici': 29,\n", " 'privat': 30,\n", " 'prize': 31,\n", " 'race': 32,\n", " 'rocket': 33,\n", " 'second': 34,\n", " 'secondteam': 35,\n", " 'set': 36,\n", " 'space': 37,\n", " 'spacecom': 38,\n", " 'spaceflight': 39,\n", " 'suborbit': 40,\n", " 'team': 41,\n", " 'toronto': 42,\n", " 'acid': 43,\n", " 'amino': 44,\n", " 'better': 45,\n", " 'block': 46,\n", " 'build': 47,\n", " 'chain': 48,\n", " 
'chemistri': 49,\n", " 'compani': 50,\n", " 'develop': 51,\n", " 'found': 52,\n", " 'grant': 53,\n", " 'louisvil': 54,\n", " 'method': 55,\n", " 'peptid': 56,\n", " 'produc': 57,\n", " 'protein': 58,\n", " 'research': 59,\n", " 'short': 60,\n", " 'studi': 61,\n", " 'univers': 62,\n", " 'win': 63,\n", " 'alreadi': 64,\n", " 'bare': 65,\n", " 'blur': 66,\n", " 'bring': 67,\n", " 'chart': 68,\n", " 'color': 69,\n", " 'dawn': 70,\n", " 'day': 71,\n", " 'dri': 72,\n", " 'endless': 73,\n", " 'expect': 74,\n", " 'figur': 75,\n", " 'fitzpatrick': 76,\n", " 'flame': 77,\n", " 'forecast': 78,\n", " 'help': 79,\n", " 'know': 80,\n", " 'lightn': 81,\n", " 'map': 82,\n", " 'mike': 83,\n", " 'moist': 84,\n", " 'pick': 85,\n", " 'place': 86,\n", " 'predict': 87,\n", " 'roar': 88,\n", " 'shift': 89,\n", " 'start': 90,\n", " 'strike': 91,\n", " 'unit': 92,\n", " 'wildfir': 93,\n", " 'wind': 94,\n", " 'adopt': 95,\n", " 'agenc': 96,\n", " 'aim': 97,\n", " 'air': 98,\n", " 'bovin': 99,\n", " 'calif': 100,\n", " 'california': 101,\n", " 'cow': 102,\n", " 'dairi': 103,\n", " 'emiss': 104,\n", " 'farmrel': 105,\n", " 'first': 106,\n", " 'friday': 107,\n", " 'limit': 108,\n", " 'manur': 109,\n", " 'nation': 110,\n", " 'pollut': 111,\n", " 'reduc': 112,\n", " 'rule': 113,\n", " 'smog': 114,\n", " 'smogfight': 115,\n", " 'southern': 116,\n", " 'varieti': 117,\n", " 'went': 118,\n", " 'also': 119,\n", " 'appar': 120,\n", " 'area': 121,\n", " 'artist': 122,\n", " 'audac': 123,\n", " 'british': 124,\n", " 'campaign': 125,\n", " 'children': 126,\n", " 'copyright': 127,\n", " 'depart': 128,\n", " 'dfe': 129,\n", " 'download': 130,\n", " 'educ': 131,\n", " 'emi': 132,\n", " 'end': 133,\n", " 'feel': 134,\n", " 'find': 135,\n", " 'gener': 136,\n", " 'got': 137,\n", " 'gover': 138,\n", " 'happen': 139,\n", " 'hope': 140,\n", " 'ignor': 141,\n", " 'illeg': 142,\n", " 'indoctrin': 143,\n", " 'industri': 144,\n", " 'inspir': 145,\n", " 'intent': 146,\n", " 'letter': 147,\n", " 'littl': 148,\n", " 
'make': 149,\n", " 'manifesto': 150,\n", " 'music': 151,\n", " 'musician': 152,\n", " 'musicth': 153,\n", " 'negoti': 154,\n", " 'next': 155,\n", " 'open': 156,\n", " 'ostens': 157,\n", " 'pedant': 158,\n", " 'perhap': 159,\n", " 'popular': 160,\n", " 'recent': 161,\n", " 'school': 162,\n", " 'similar': 163,\n", " 'skill': 164,\n", " 'someth': 165,\n", " 'suppos': 166,\n", " 'thing': 167,\n", " 'unfortun': 168,\n", " 'use': 169,\n", " 'variou': 170,\n", " 'well': 171,\n", " 'write': 172,\n", " 'wrote': 173,\n", " '18yearold': 174,\n", " 'accord': 175,\n", " 'admit': 176,\n", " 'antiviru': 177,\n", " 'arrest': 178,\n", " 'author': 179,\n", " 'captur': 180,\n", " 'cluley': 181,\n", " 'confirm': 182,\n", " 'consult': 183,\n", " 'custodi': 184,\n", " 'five': 185,\n", " 'germani': 186,\n", " 'graham': 187,\n", " 'infect': 188,\n", " 'isrespons': 189,\n", " 'jaschan': 190,\n", " 'least': 191,\n", " 'led': 192,\n", " 'loos': 193,\n", " 'may': 194,\n", " 'microsoft': 195,\n", " 'month': 196,\n", " 'netski': 197,\n", " 'network': 198,\n", " 'one': 199,\n", " 'percent': 200,\n", " 'polic': 201,\n", " 'portscan': 202,\n", " 'preced': 203,\n", " 'program': 204,\n", " 'publish': 205,\n", " 'reward': 206,\n", " 'roundup': 207,\n", " 'said': 208,\n", " 'sasser': 209,\n", " 'selfconfess': 210,\n", " 'senior': 211,\n", " 'sixmonthviru': 212,\n", " 'somethingexpert': 213,\n", " 'sopho': 214,\n", " 'staggeri': 215,\n", " 'sven': 216,\n", " 'taken': 217,\n", " 'technolog': 218,\n", " 'terror': 219,\n", " 'therewer': 220,\n", " 'theteenag': 221,\n", " 'variant': 222,\n", " 'viru': 223,\n", " 'virus': 224,\n", " 'war': 225,\n", " 'wednesday': 226,\n", " 'whosaid': 227,\n", " 'worm': 228,\n", " 'wormsass': 229,\n", " 'base': 230,\n", " 'bloom': 231,\n", " 'clientdiscov': 232,\n", " 'could': 233,\n", " 'cours': 234,\n", " 'direct': 235,\n", " 'distribut': 236,\n", " 'distributioni': 237,\n", " 'encrypt': 238,\n", " 'entir': 239,\n", " 'file': 240,\n", " 'filter': 241,\n", " 'fingerprint': 
242,\n", " 'foaf': 243,\n", " 'foaffil': 244,\n", " 'foafkey': 245,\n", " 'foafloaf': 246,\n", " 'friend': 247,\n", " 'gpgopenpgp': 248,\n", " 'higher': 249,\n", " 'ident': 250,\n", " 'identit': 251,\n", " 'includ': 252,\n", " 'interest': 253,\n", " 'key': 254,\n", " 'keydistributionwhat': 255,\n", " 'keyfingerpr': 256,\n", " 'keyfingerprint': 257,\n", " 'level': 258,\n", " 'lot': 259,\n", " 'mean': 260,\n", " 'needto': 261,\n", " 'new': 262,\n", " 'pgp': 263,\n", " 'popul': 264,\n", " 'properti': 265,\n", " 'simpl': 266,\n", " 'social': 267,\n", " 'socialnetwork': 268,\n", " 'sourc': 269,\n", " 'thi': 270,\n", " 'think': 271,\n", " 'though': 272,\n", " 'weboftrust': 273,\n", " 'whitelist': 274,\n", " 'within': 275,\n", " 'would': 276,\n", " 'your': 277,\n", " 'chief': 278,\n", " 'email': 279,\n", " 'fraud': 280,\n", " 'phish': 281,\n", " 'scam': 282,\n", " 'squad': 283,\n", " 'target': 284,\n", " 'warn': 285,\n", " 'wiltshir': 286,\n", " '36000': 287,\n", " '65m': 288,\n", " 'card': 289,\n", " 'dedic': 290,\n", " 'estim': 291,\n", " 'net': 292,\n", " 'recov': 293,\n", " 'save': 294,\n", " 'stolen': 295,\n", " 'two': 296,\n", " 'year': 297,\n", " 'angel': 298,\n", " 'brcmo': 299,\n", " 'broadcom': 300,\n", " 'corp': 301,\n", " 'current': 302,\n", " 'format': 303,\n", " 'group': 304,\n", " 'highspe': 305,\n", " 'inc': 306,\n", " 'instrument': 307,\n", " 'lo': 308,\n", " 'propos': 309,\n", " 'reuter': 310,\n", " 'speed': 311,\n", " 'standard': 312,\n", " 'stmicroelectron': 313,\n", " 'stmpa': 314,\n", " 'texa': 315,\n", " 'thursday': 316,\n", " 'time': 317,\n", " 'txnn': 318,\n", " 'wireless': 319,\n", " 'aaplo': 320,\n", " 'appl': 321,\n", " 'began': 322,\n", " 'bundl': 323,\n", " 'comput': 324,\n", " 'creat': 325,\n", " 'cut': 326,\n", " 'design': 327,\n", " 'discount': 328,\n", " 'featur': 329,\n", " 'final': 330,\n", " 'flagship': 331,\n", " 'graphic': 332,\n", " 'let': 333,\n", " 'motion': 334,\n", " 'pro': 335,\n", " 'realtim': 336,\n", " 'ship': 337,\n", " 
'softwar': 338,\n", " 'tuesday': 339,\n", " 'unveil': 340,\n", " 'user': 341,\n", " 'video': 342,\n", " 'videoedit': 343,\n", " 'amsterdam': 344,\n", " 'battleground': 345,\n", " 'beat': 346,\n", " 'digit': 347,\n", " 'dutch': 348,\n", " 'europ': 349,\n", " 'free': 350,\n", " 'latest': 351,\n", " 'local': 352,\n", " 'market': 353,\n", " 'record': 354,\n", " 'retail': 355,\n", " 'servic': 356,\n", " 'shop': 357,\n", " 'song': 358,\n", " '100km': 359,\n", " 'ant': 360,\n", " 'australia': 361,\n", " 'coloni': 362,\n", " 'discov': 363,\n", " 'giant': 364,\n", " 'hit': 365,\n", " 'insect': 366,\n", " 'melbourn': 367,\n", " 'speci': 368,\n", " 'super': 369,\n", " 'threaten': 370,\n", " 'claim': 371,\n", " 'collaps': 372,\n", " 'dolphin': 373,\n", " 'keep': 374,\n", " 'pod': 375,\n", " 'reli': 376,\n", " 'scientist': 377,\n", " 'socialit': 378,\n", " 'achiev': 379,\n", " 'adolesc': 380,\n", " 'due': 381,\n", " 'enorm': 382,\n", " 'growth': 383,\n", " 'massiv': 384,\n", " 'monster': 385,\n", " 'rex': 386,\n", " 'size': 387,\n", " 'spurt': 388,\n", " 'teenag': 389,\n", " 'tyrannosauru': 390,\n", " 'beneath': 391,\n", " 'billion': 392,\n", " 'ganymed': 393,\n", " 'ici': 394,\n", " 'interior': 395,\n", " 'irregular': 396,\n", " 'jet': 397,\n", " 'jupit': 398,\n", " 'lab': 399,\n", " 'largest': 400,\n", " 'lump': 401,\n", " 'lumpi': 402,\n", " 'mass': 403,\n", " 'moon': 404,\n", " 'propuls': 405,\n", " 'rock': 406,\n", " 'shell': 407,\n", " 'support': 408,\n", " 'surfac': 409,\n", " 'capabl': 410,\n", " 'demonstr': 411,\n", " 'draw': 412,\n", " 'esa': 413,\n", " 'european': 414,\n", " 'express': 415,\n", " 'futur': 416,\n", " 'imag': 417,\n", " 'interplanetari': 418,\n", " 'joint': 419,\n", " 'mar': 420,\n", " 'mission': 421,\n", " 'nasa': 422,\n", " 'part': 423,\n", " 'pave': 424,\n", " 'pictur': 425,\n", " 'relay': 426,\n", " 'rover': 427,\n", " 'way': 428,\n", " 'although': 429,\n", " 'begin': 430,\n", " 'biolog': 431,\n", " 'chemic': 432,\n", " 'clue': 433,\n", " 
'content': 434,\n", " 'cradl': 435,\n", " 'debat': 436,\n", " 'describ': 437,\n", " 'evidenti': 438,\n", " 'fossil': 439,\n", " 'issu': 440,\n", " 'layer': 441,\n", " 'life': 442,\n", " 'mcloughlin': 443,\n", " 'nicola': 444,\n", " 'oxford': 445,\n", " 'sediment': 446,\n", " 'spawn': 447,\n", " 'spirit': 448,\n", " 'stem': 449,\n", " 'western': 450,\n", " 'whether': 451,\n", " 'ago': 452,\n", " 'analyst': 453,\n", " 'bruis': 454,\n", " 'compar': 455,\n", " 'earn': 456,\n", " 'long': 457,\n", " 'miss': 458,\n", " 'per': 459,\n", " 'rise': 460,\n", " 'server': 461,\n", " 'share': 462,\n", " 'shot': 463,\n", " 'storag': 464,\n", " 'updat': 465,\n", " 'biggest': 466,\n", " 'even': 467,\n", " 'headcount': 468,\n", " 'hire': 469,\n", " 'ibm': 470,\n", " 'plan': 471,\n", " 'sinc': 472,\n", " 'code': 473,\n", " 'craft': 474,\n", " 'earli': 475,\n", " 'get': 476,\n", " 'glass': 477,\n", " 'look': 478,\n", " 'oper': 479,\n", " 'provid': 480,\n", " 'skin': 481,\n", " 'still': 482,\n", " 'sun': 483,\n", " 'system': 484,\n", " 'view': 485,\n", " 'appli': 486,\n", " 'chip': 487,\n", " 'electr': 488,\n", " 'fault': 489,\n", " 'fuse': 490,\n", " 'heal': 491,\n", " 'identifi': 492,\n", " 'repair': 493,\n", " 'someday': 494,\n", " 'bid': 495,\n", " 'bill': 496,\n", " 'case': 497,\n", " 'deni': 498,\n", " 'elig': 499,\n", " 'everyday': 500,\n", " 'googl': 501,\n", " 'ipo': 502,\n", " 'minimum': 503,\n", " 'particip': 504,\n", " 'peopl': 505,\n", " 'point': 506,\n", " 'process': 507,\n", " 'public': 508,\n", " 'stranglehold': 509,\n", " 'street': 510,\n", " 'underwrit': 511,\n", " 'usual': 512,\n", " 'wall': 513,\n", " 'annoy': 514,\n", " 'attitud': 515,\n", " 'broker': 516,\n", " 'charl': 517,\n", " 'decad': 518,\n", " 'francisco': 519,\n", " 'iconoclast': 520,\n", " 'liedtk': 521,\n", " 'low': 522,\n", " 'michael': 523,\n", " 'price': 524,\n", " 'rival': 525,\n", " 'san': 526,\n", " 'sch': 527,\n", " 'schwab': 528,\n", " 'shoe': 529,\n", " 'stock': 530,\n", " 'stone': 531,\n", " 
'tabl': 532,\n", " 'tri': 533,\n", " 'turn': 534,\n", " 'wingtip': 535,\n", " 'compon': 536,\n", " 'cyber': 537,\n", " 'fail': 538,\n", " 'grid': 539,\n", " 'movement': 540,\n", " 'news': 541,\n", " 'power': 542,\n", " 'reach': 543,\n", " 'secur': 544,\n", " 'sluggish': 545,\n", " 'vulner': 546,\n", " '826': 547,\n", " 'giddi': 548,\n", " 'gold': 549,\n", " 'individu': 550,\n", " 'medal': 551,\n", " 'medley': 552,\n", " 'minut': 553,\n", " 'phelp': 554,\n", " 'touch': 555,\n", " 'world': 556,\n", " 'arguabl': 557,\n", " 'best': 558,\n", " 'bicep': 559,\n", " 'bodi': 560,\n", " 'choos': 561,\n", " 'cornerback': 562,\n", " 'deion': 563,\n", " 'easi': 564,\n", " 'fat': 565,\n", " 'field': 566,\n", " 'finess': 567,\n", " 'footbal': 568,\n", " 'foxborough': 569,\n", " 'game': 570,\n", " 'hardli': 571,\n", " 'huge': 572,\n", " 'impli': 573,\n", " 'lack': 574,\n", " 'law': 575,\n", " 'much': 576,\n", " 'ounc': 577,\n", " 'physic': 578,\n", " 'play': 579,\n", " 'ridicul': 580,\n", " 'sander': 581,\n", " 'see': 582,\n", " 'shut': 583,\n", " 'side': 584,\n", " 'soften': 585,\n", " 'tougher': 586,\n", " 'upper': 587,\n", " 'adjust': 588,\n", " 'appear': 589,\n", " 'care': 590,\n", " 'catcher': 591,\n", " 'climb': 592,\n", " 'continu': 593,\n", " 'dwindl': 594,\n", " 'enter': 595,\n", " 'highli': 596,\n", " 'jason': 597,\n", " 'kelli': 598,\n", " 'like': 599,\n", " 'major': 600,\n", " 'monitor': 601,\n", " 'pawtucket': 602,\n", " 'plate': 603,\n", " 'readi': 604,\n", " 'red': 605,\n", " 'remain': 606,\n", " 'seen': 607,\n", " 'shoppach': 608,\n", " 'sox': 609,\n", " 'toward': 610,\n", " 'tripl': 611,\n", " 'uncertain': 612,\n", " 'varitek': 613,\n", " 'week': 614,\n", " 'attend': 615,\n", " 'babi': 616,\n", " 'boy': 617,\n", " 'dangelo': 618,\n", " 'david': 619,\n", " 'famili': 620,\n", " 'fenway': 621,\n", " 'good': 622,\n", " 'home': 623,\n", " 'imagin': 624,\n", " 'last': 625,\n", " 'mighti': 626,\n", " 'morn': 627,\n", " 'night': 628,\n", " 'old': 629,\n", " 'ortiz': 
630,\n", " 'park': 631,\n", " 'rest': 632,\n", " 'sleep': 633,\n", " 'son': 634,\n", " 'spent': 635,\n", " 'sure': 636,\n", " 'yesterday': 637,\n", " 'belichick': 638,\n", " 'bryant': 639,\n", " 'caught': 640,\n", " 'cha': 641,\n", " 'decis': 642,\n", " 'easier': 643,\n", " 'eye': 644,\n", " 'gessner': 645,\n", " 'jen': 646,\n", " 'noth': 647,\n", " 'patten': 648,\n", " 'quot': 649,\n", " 'receiv': 650,\n", " 'ricki': 651,\n", " 'central': 652,\n", " 'charg': 653,\n", " 'cleveland': 654,\n", " 'hafner': 655,\n", " 'indian': 656,\n", " 'lead': 657,\n", " 'martinez': 658,\n", " 'minnesota': 659,\n", " 'mount': 660,\n", " 'pull': 661,\n", " 'run': 662,\n", " 'saturday': 663,\n", " 'travi': 664,\n", " 'twin': 665,\n", " 'victor': 666,\n", " 'canadian': 667,\n", " 'citi': 668,\n", " 'confront': 669,\n", " 'constabl': 670,\n", " 'defend': 671,\n", " 'demand': 672,\n", " 'die': 673,\n", " 'involv': 674,\n", " 'offic': 675,\n", " 'press': 676,\n", " 'resign': 677,\n", " 'sister': 678,\n", " 'slam': 679,\n", " 'vancouv': 680,\n", " 'violent': 681,\n", " '50m': 682,\n", " 'affair': 683,\n", " 'aid': 684,\n", " 'associ': 685,\n", " 'cash': 686,\n", " 'decid': 687,\n", " 'extramarit': 688,\n", " 'gay': 689,\n", " 'gov': 690,\n", " 'governor': 691,\n", " 'harass': 692,\n", " 'jame': 693,\n", " 'mcgreevey': 694,\n", " 'push': 695,\n", " 'settlement': 696,\n", " 'sexual': 697,\n", " 'sought': 698,\n", " 'told': 699,\n", " 'armor': 700,\n", " 'back': 701,\n", " 'ceasefir': 702,\n", " 'echo': 703,\n", " 'explos': 704,\n", " 'fight': 705,\n", " 'gunfir': 706,\n", " 'holi': 707,\n", " 'intend': 708,\n", " 'iraq': 709,\n", " 'najaf': 710,\n", " 'rattl': 711,\n", " 'roll': 712,\n", " 'sunday': 713,\n", " 'tank': 714,\n", " 'temporari': 715,\n", " 'throughout': 716,\n", " 'troop': 717,\n", " 'vehicl': 718,\n", " 'breath': 719,\n", " 'celebr': 720,\n", " 'cure': 721,\n", " 'frail': 722,\n", " 'franc': 723,\n", " 'french': 724,\n", " 'gasp': 725,\n", " 'heavili': 726,\n", " 'homili': 
727,\n", " 'hundr': 728,\n", " 'john': 729,\n", " 'lourd': 730,\n", " 'mani': 731,\n", " 'mari': 732,\n", " 'miracul': 733,\n", " 'openair': 734,\n", " 'paul': 735,\n", " 'pilgrim': 736,\n", " 'polish': 737,\n", " 'pope': 738,\n", " 'sever': 739,\n", " 'shrine': 740,\n", " 'struggl': 741,\n", " 'thousand': 742,\n", " 'virgin': 743,\n", " 'wheelchair': 744,\n", " 'chavez': 745,\n", " 'defeat': 746,\n", " 'govern': 747,\n", " 'oil': 748,\n", " 'possibl': 749,\n", " 'prepar': 750,\n", " 'recal': 751,\n", " 'turmoil': 752,\n", " 'venezuela': 753,\n", " 'vote': 754,\n", " 'act': 755,\n", " 'activ': 756,\n", " 'call': 757,\n", " 'duti': 758,\n", " 'employ': 759,\n", " 'guard': 760,\n", " 'job': 761,\n", " 'preserv': 762,\n", " 'protect': 763,\n", " 'provis': 764,\n", " 'reemploy': 765,\n", " 'reserv': 766,\n", " 'right': 767,\n", " 'strengthen': 768,\n", " 'uniform': 769,\n", " 'userra': 770,\n", " 'anywher': 771,\n", " 'attack': 772,\n", " 'dare': 773,\n", " 'iran': 774,\n", " 'iranian': 775,\n", " 'israel': 776,\n", " 'militari': 777,\n", " 'missil': 778,\n", " 'report': 779,\n", " 'state': 780,\n", " 'tehran': 781,\n", " 'afghan': 782,\n", " 'afghanistan': 783,\n", " 'airplan': 784,\n", " 'armi': 785,\n", " 'calm': 786,\n", " 'capit': 787,\n", " 'deadli': 788,\n", " 'dispatch': 789,\n", " 'far': 790,\n", " 'fli': 791,\n", " 'interven': 792,\n", " 'kabul': 793,\n", " 'nato': 794,\n", " 'outbreak': 795,\n", " 'retak': 796,\n", " 'violenc': 797,\n", " 'warlord': 798,\n", " 'west': 799,\n", " 'arizona': 800,\n", " 'dback': 801,\n", " 'diamondback': 802,\n", " 'fourhitt': 803,\n", " 'inning': 804,\n", " 'johnson': 805,\n", " 'lose': 806,\n", " 'met': 807,\n", " 'ninegam': 808,\n", " 'ninth': 809,\n", " 'randi': 810,\n", " 'slide': 811,\n", " 'steve': 812,\n", " 'streak': 813,\n", " 'took': 814,\n", " 'trachsel': 815,\n", " 'york': 816,\n", " 'adult': 817,\n", " 'amongstyleconsci': 818,\n", " 'apparel': 819,\n", " 'backtoschool': 820,\n", " 'buyer': 821,\n", " 'couldb': 
822,\n", " 'fall': 823,\n", " 'fashion': 824,\n", " 'grade': 825,\n", " 'sell': 826,\n", " 'student': 827,\n", " 'teen': 828,\n", " 'theirbacktoschool': 829,\n", " 'tighterhold': 830,\n", " 'tough': 831,\n", " 'vie': 832,\n", " 'wallet': 833,\n", " 'young': 834,\n", " 'afterthought': 835,\n", " 'amid': 836,\n", " 'bush': 837,\n", " 'charley': 838,\n", " 'coastal': 839,\n", " 'earthquak': 840,\n", " 'fire': 841,\n", " 'flood': 842,\n", " 'hurrican': 843,\n", " 'made': 844,\n", " 'polit': 845,\n", " 'postdisast': 846,\n", " 'presid': 847,\n", " 'scene': 848,\n", " 'sort': 849,\n", " 'storm': 850,\n", " 'struck': 851,\n", " 'three': 852,\n", " 'tour': 853,\n", " 'visit': 854,\n", " 'wreckag': 855,\n", " 'cent': 856,\n", " 'china': 857,\n", " 'chines': 858,\n", " 'custom': 859,\n", " 'fell': 860,\n", " 'ftcom': 861,\n", " 'impos': 862,\n", " 'internet': 863,\n", " 'messag': 864,\n", " 'mobil': 865,\n", " 'multimedia': 866,\n", " 'oneyear': 867,\n", " 'phone': 868,\n", " 'portal': 869,\n", " 'sent': 870,\n", " 'sohucom': 871,\n", " 'spam': 872,\n", " 'suspens': 873,\n", " 'uslist': 874,\n", " 'anaheim': 875,\n", " 'boston': 876,\n", " 'darin': 877,\n", " 'detroit': 878,\n", " 'doubl': 879,\n", " 'eighth': 880,\n", " 'erstad': 881,\n", " 'goahead': 882,\n", " 'lift': 883,\n", " 'percentag': 884,\n", " 'tiger': 885,\n", " 'victori': 886,\n", " 'wildcard': 887,\n", " 'atlanta': 888,\n", " 'brave': 889,\n", " 'cardin': 890,\n", " 'drew': 891,\n", " 'injuri': 892,\n", " 'lineup': 893,\n", " 'loui': 894,\n", " 'outfield': 895,\n", " 'quadricep': 896,\n", " 'sore': 897,\n", " 'caraca': 898,\n", " 'elector': 899,\n", " 'extend': 900,\n", " 'histor': 901,\n", " 'hugo': 902,\n", " 'leftw': 903,\n", " 'number': 904,\n", " 'poll': 905,\n", " 'prolong': 906,\n", " 'referendum': 907,\n", " 'venezuelan': 908,\n", " 'competit': 909,\n", " 'consum': 910,\n", " 'countri': 911,\n", " 'dell': 912,\n", " 'dello': 913,\n", " 'exit': 914,\n", " 'hong': 915,\n", " 'kong': 916,\n", " 'left': 
917,\n", " 'lowend': 918,\n", " 'maker': 919,\n", " 'monday': 920,\n", " 'overal': 921,\n", " 'segment': 922,\n", " 'stiff': 923,\n", " 'accus': 924,\n", " 'beij': 925,\n", " 'chineseamerican': 926,\n", " 'espionag': 927,\n", " 'media': 928,\n", " 'soon': 929,\n", " 'spi': 930,\n", " 'taiwan': 931,\n", " 'trial': 932,\n", " 'anoth': 933,\n", " 'championship': 934,\n", " 'nonfactor': 935,\n", " 'player': 936,\n", " 'rank': 937,\n", " 'triumph': 938,\n", " 'wood': 939,\n", " 'afp': 940,\n", " 'alaska': 941,\n", " 'deploy': 942,\n", " 'enhanc': 943,\n", " 'f15e': 944,\n", " 'fighter': 945,\n", " 'firepow': 946,\n", " 'forc': 947,\n", " 'korea': 948,\n", " 'korean': 949,\n", " 'peninsula': 950,\n", " 'south': 951,\n", " 'squadron': 952,\n", " 'batter': 953,\n", " 'host': 954,\n", " 'leagu': 955,\n", " 'curfew': 956,\n", " 'dissid': 957,\n", " 'eas': 958,\n", " 'emerg': 959,\n", " 'follow': 960,\n", " 'indefinit': 961,\n", " 'maldiv': 962,\n", " 'parliament': 963,\n", " 'put': 964,\n", " 'resid': 965,\n", " 'restiv': 966,\n", " 'riot': 967,\n", " 'round': 968,\n", " 'session': 969,\n", " 'busi': 970,\n", " 'ceski': 971,\n", " 'czech': 972,\n", " 'disentagl': 973,\n", " 'fixedlin': 974,\n", " 'thedealcom': 975,\n", " 'vodafon': 976,\n", " 'want': 977,\n", " '4wk': 978,\n", " 'briefli': 979,\n", " 'data': 980,\n", " 'dip': 981,\n", " 'dollar': 982,\n", " 'economi': 983,\n", " 'euro': 984,\n", " 'fan': 985,\n", " 'fourweek': 986,\n", " 'health': 987,\n", " 'london': 988,\n", " 'profittak': 989,\n", " 'slightli': 990,\n", " 'steep': 991,\n", " 'weak': 992,\n", " 'worri': 993,\n", " 'kaleko': 994,\n", " 'kept': 995,\n", " 'older': 996,\n", " 'problem': 997,\n", " 'promot': 998,\n", " 'realiz': 999,\n", " ...}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dictionary.token2id" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mapping of tokens" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { 
"ExecuteTime": { "end_time": "2018-04-03T19:26:29.081726Z", "start_time": "2018-04-03T19:26:29.076804Z" } }, "outputs": [ { "data": { "text/plain": [ "(0, 101)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dictionary.token2id['disappoint'], dictionary.token2id['california']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upon creation of a dictionary that maps words to integers (and vice versa), we can transform our documents into bags of words. Each document becomes a list of tuples that contain token indices and frequencies." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:27:02.065526Z", "start_time": "2018-04-03T19:27:01.845541Z" }, "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "[(0, 1),\n", " (1, 1),\n", " (2, 1),\n", " (3, 1),\n", " (4, 1),\n", " (5, 1),\n", " (6, 1),\n", " (7, 1),\n", " (8, 1),\n", " (9, 1),\n", " (10, 1),\n", " (11, 2),\n", " (12, 1),\n", " (13, 1),\n", " (14, 1)]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus = [dictionary.doc2bow(d) for d in bow]\n", "corpus[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### TFIDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes raw term frequencies can be misleading. Just as with stop words, words that occur frequently across the whole corpus may not be informative. On the other hand, a word that shows up often in only a small subset of documents can provide valuable information about the contents of those texts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to resolve this potential problem is called term frequency-inverse document frequency (TFIDF), which is a product of TF and IDF.\n", "
A common weighting scheme is $TFIDF(t, d, D) = freq_{t,d} \\times \\log_2 \\dfrac{N_D}{N_t}$, where $N_D$ is the total number of documents and $N_t$ is the number of documents that contain the term $t$. See [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency) for more details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can try this on our data with [`gensim.models.tfidfmodel`](https://radimrehurek.com/gensim/models/tfidfmodel.html):" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:30:54.489780Z", "start_time": "2018-04-03T19:30:54.399261Z" } }, "outputs": [], "source": [ "from gensim.models import TfidfModel\n", "model = TfidfModel(corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apply the model to one document:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:30:55.322744Z", "start_time": "2018-04-03T19:30:55.317566Z" } }, "outputs": [ { "data": { "text/plain": [ "[(0, 0.23464966537701473),\n", " (1, 0.21340476186532015),\n", " (2, 0.16098999782417597),\n", " (3, 0.17617388045540608),\n", " (4, 0.3380922768382177),\n", " (5, 0.35558360355081436),\n", " (6, 0.25022339107653035),\n", " (7, 0.2521409920896115),\n", " (8, 0.22971412643239364),\n", " (9, 0.1150431833134092),\n", " (10, 0.38548522406671576),\n", " (11, 0.31781564552907887),\n", " (12, 0.3081906563223163),\n", " (13, 0.17230810874356683),\n", " (14, 0.18769468048518562)]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model[corpus[0]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Topic Modeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Latent Dirichlet Allocation (LDA)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very commonly used dimensionality reduction technique\n", "
family is called ___topic modeling___. It assumes that each document is a mixture of topics, where each topic is a mixture of terms. One of the most successful algorithms is [___latent Dirichlet allocation___](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA), whose corresponding paper is:\n", "> Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan), 993-1022." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LDA is a generative model that reverse-engineers the process of document generation. It can be represented as a probabilistic graphical model:\n", "![lda](https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The generative process can be described as follows:\n", "- For each topic $k$, sample a multinomial distribution $\\phi_k$ over words from the Dirichlet prior with parameter $\\beta$\n", "- For each document $m$, sample a multinomial distribution $\\theta_m$ over topics from the Dirichlet prior with parameter $\\alpha$\n", "    - For each word $n$ in $m$:\n", "        - Sample a topic $z_{m,n}$ from the corresponding topic distribution parameterized by $\\theta_m$\n", "        - Sample a word $w_{m,n}$ from the corresponding topic $z_{m,n}$'s word distribution parameterized by $\\phi_{z_{m,n}}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Parameters in LDA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generally, we need to control two hyperparameters of an LDA model:\n", "- Topic-word Dirichlet prior $\\beta$\n", "- Document-topic Dirichlet prior $\\alpha$\n", "\n", "The selection of these parameters is application-dependent. Heuristically, people often choose $\\alpha=\\dfrac{50}{K}$ and $\\beta=0.01$, as described in\n", "\n", "> Griffiths, T. L., and Steyvers, M. 2004.\n", "> 
“Finding Scientific Topics,” Proceedings of the National Academy of Sciences (101:Supplement 1), National Academy of Sciences, pp. 5228–5235.\n", "\n", "It is also possible to infer these two hyperparameters given the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The selection of $K$ totally depends on the context. It is also possible to select a topic number based on quantitative measures of topic modeling quality, but this is beyond the scope of this tutorial.\n", "\n", "For our toy sample set, we will just select $K=4$ because there are 4 labels: " ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:42:53.362356Z", "start_time": "2018-04-03T19:42:53.356128Z" } }, "outputs": [ { "data": { "text/plain": [ "array(['Business', 'Sci/Tech', 'Sports', 'World'], dtype=object)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.class_name.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Run LDA!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thanks to the convenient APIs by `gensim`, we can easily run [LDA in Python](https://radimrehurek.com/gensim/models/ldamodel.html):" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:46:13.129939Z", "start_time": "2018-04-03T19:44:39.239135Z" } }, "outputs": [], "source": [ "from gensim.models import LdaModel\n", "lda = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=10, \n", " minimum_probability=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Analysis on LDA results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the output of LDA. 
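\n",
 "\n",
 "One note on the API: `show_topics` formats each topic as a single string of weighted words, while `show_topic(k)` returns `(word, probability)` pairs directly. As a small self-contained sketch (the weights below are made-up values, not from our model), such a formatted string can be parsed back into pairs:\n",
 "\n",
 "```python\n",
 "# Parse a gensim show_topics()-style string (weights here are hypothetical)\n",
 "topic_str = '0.012*\"new\" + 0.009*\"reuter\" + 0.009*\"said\"'\n",
 "pairs = []\n",
 "for term in topic_str.split(' + '):\n",
 "    weight, word = term.split('*')\n",
 "    pairs.append((word.strip('\"'), float(weight)))\n",
 "print(pairs)  # [('new', 0.012), ('reuter', 0.009), ('said', 0.009)]\n",
 "```\n",
 "\n",
 "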
First, we can check whether the topics make sense:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:46:45.406370Z", "start_time": "2018-04-03T19:46:45.398364Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.012*\"new\" + 0.009*\"reuter\" + 0.009*\"said\" + 0.008*\"price\" + 0.007*\"oil\" + 0.007*\"inc\" + 0.007*\"compani\" + 0.006*\"stock\" + 0.006*\"share\" + 0.006*\"report\"\n", "------------------------------------------------------------------------------------------------------------------------\n", "0.007*\"quot\" + 0.007*\"compani\" + 0.007*\"new\" + 0.006*\"said\" + 0.005*\"servic\" + 0.005*\"microsoft\" + 0.005*\"plan\" + 0.004*\"million\" + 0.004*\"say\" + 0.004*\"deal\"\n", "------------------------------------------------------------------------------------------------------------------------\n", "0.010*\"game\" + 0.008*\"win\" + 0.007*\"team\" + 0.007*\"first\" + 0.007*\"new\" + 0.006*\"season\" + 0.005*\"world\" + 0.005*\"one\" + 0.004*\"year\" + 0.004*\"final\"\n", "------------------------------------------------------------------------------------------------------------------------\n", "0.009*\"said\" + 0.007*\"reuter\" + 0.007*\"kill\" + 0.006*\"iraq\" + 0.006*\"presid\" + 0.004*\"offici\" + 0.004*\"two\" + 0.004*\"minist\" + 0.004*\"say\" + 0.004*\"afp\"\n", "------------------------------------------------------------------------------------------------------------------------\n" ] } ], "source": [ "for _, topic_str in lda.show_topics():\n", "    print(topic_str)\n", "    print('------------'*10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While we probably cannot say the topics are perfect, they are okay. In the order printed above, we can interpret the topics as: business, sci/tech, sports, and world."
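,
 "\n",
 "\n",
 "To assign a document to its single dominant topic, we can take the highest-probability entry of its topic distribution (a list of `(topic id, probability)` tuples, which is what gensim's `get_document_topics` returns). A minimal self-contained sketch with illustrative probabilities:\n",
 "\n",
 "```python\n",
 "# A topic distribution shaped like gensim's get_document_topics output\n",
 "# (probabilities here are illustrative, not from our model)\n",
 "doc_topics = [(0, 0.89), (1, 0.01), (2, 0.01), (3, 0.09)]\n",
 "dominant_topic = max(doc_topics, key=lambda pair: pair[1])[0]\n",
 "print(dominant_topic)  # 0\n",
 "```"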
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each document, we can check its topic distribution:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:49:19.069553Z", "start_time": "2018-04-03T19:49:19.061953Z" } }, "outputs": [ { "data": { "text/plain": [ "[(0, 0.89377004), (1, 0.012440922), (2, 0.012430134), (3, 0.08135887)]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "i = 1000\n", "lda.get_document_topics(corpus[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that topic 0, which we interpreted as \"business\", dominates this document. We can check whether this makes sense:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2018-04-03T19:50:15.220687Z", "start_time": "2018-04-03T19:50:15.202603Z" } }, "outputs": [ { "data": { "text/plain": [ "class_index 3\n", "class_name Business\n", "content Albertsons #39; 2Q Profit Falls 36 Percent. Pe...\n", "Name: 1000, dtype: object" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[i]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, LDA can be used in many situations, such as information retrieval, document clustering and labeling, and even image analysis! Here we just mention the simplest use case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we went through a simple procedure, from preprocessing raw texts to modeling topics in the resulting bag-of-words corpus. Along the way we used many terms, such as ___stop words___, ___bag of words___, and ___stemming___. However, these are only a small part of text analytics. There is a lot more to explore.\n", "
Below I list some materials on text analytics in Python that I like and hope will be useful:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Gensim tutorial](https://radimrehurek.com/gensim/tutorial.html)\n", "- [NLTK book](http://www.nltk.org/book/)\n", "- [Coursera text mining course](https://www.coursera.org/learn/python-text-mining)\n", "- [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html) (a package, not a tutorial)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }