{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Processing\n", "\n", "## Capturing Text Data\n", "\n", "### Plain Text" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hieroglyphic writing dates from c. 3000 BC, and is composed of hundreds of symbols. A hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. Hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art.\n", "\n" ] } ], "source": [ "import os\n", "\n", "# Read in a plain text file\n", "with open(os.path.join(\"data\", \"hieroglyph.txt\"), \"r\") as f:\n", " text = f.read()\n", " print(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tabular Data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr><th></th><th>publisher</th><th>title</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>Livemint</td><td>fed's charles plosser sees high bar for change...</td></tr>\n",
" <tr><th>1</th><td>IFA Magazine</td><td>us open: stocks fall after fed official hints ...</td></tr>\n",
" <tr><th>2</th><td>IFA Magazine</td><td>fed risks falling 'behind the curve', charles ...</td></tr>\n",
" <tr><th>3</th><td>Moneynews</td><td>fed's plosser: nasty weather has curbed job gr...</td></tr>\n",
" <tr><th>4</th><td>NASDAQ</td><td>plosser: fed may have to accelerate tapering pace</td></tr>\n",
" </tbody>\n", "</table>
" ], "text/plain": [ " publisher title\n", "0 Livemint fed's charles plosser sees high bar for change...\n", "1 IFA Magazine us open: stocks fall after fed official hints ...\n", "2 IFA Magazine fed risks falling 'behind the curve', charles ...\n", "3 Moneynews fed's plosser: nasty weather has curbed job gr...\n", "4 NASDAQ plosser: fed may have to accelerate tapering pace" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Extract text column from a dataframe\n", "df = pd.read_csv(os.path.join(\"data\", \"news.csv\"))\n", "df.head()[['publisher', 'title']]\n", "\n", "# Convert text column to lowercase\n", "df['title'] = df['title'].str.lower()\n", "df.head()[['publisher', 'title']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Online Resource" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"success\": {\n", " \"total\": 1\n", " },\n", " \"contents\": {\n", " \"quotes\": [\n", " {\n", " \"quote\": \"When you win, say nothing. When you lose, say less.\",\n", " \"author\": \"Paul Brown\",\n", " \"length\": \"51\",\n", " \"tags\": [\n", " \"inspire\",\n", " \"losing\",\n", " \"running\",\n", " \"winning\"\n", " ],\n", " \"category\": \"inspire\",\n", " \"title\": \"Inspiring Quote of the day\",\n", " \"date\": \"2018-05-09\",\n", " \"id\": null\n", " }\n", " ],\n", " \"copyright\": \"2017-19 theysaidso.com\"\n", " }\n", "}\n", "When you win, say nothing. When you lose, say less. 
\n", "-- Paul Brown\n" ] } ], "source": [ "import requests\n", "import json\n", "\n", "# Fetch data from a REST API\n", "r = requests.get(\n", " \"https://quotes.rest/qod.json\")\n", "res = r.json()\n", "print(json.dumps(res, indent=4))\n", "\n", "# Extract relevant object and field\n", "q = res[\"contents\"][\"quotes\"][0]\n", "print(q[\"quote\"], \"\\n--\", q[\"author\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " Hacker News
\n", " [... raw HTML of the Hacker News front page, truncated ...]
\n", " \n", "\n" ] } ], "source": [ "import requests\n", "\n", "# Fetch a web page\n", "r = requests.get(\"https://news.ycombinator.com\")\n", "print(r.text)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " Hacker News\n", " \n", " Hacker News\n", " new | comments | show | ask | jobs | submit \n", " login\n", " \n", " \n", "\n", " \n", " 1. Google Duplex: An AI System for Accomplishing Real World Tasks Over the Phone (googleblog.com)\n", " 1023 points by ivank 9 hours ago | hide | 432 comments \n", " \n", " \n", " 2. A Short Introduction to the Art of Programming – Edsgar W. Dijkstra [pdf] (utexas.edu)\n", " 27 points by Rescis 1 hour ago | hide | 4 comments \n", " \n", " \n", " 3. You can now run Linux apps on Chrome OS (techcrunch.com)\n", " 357 points by willsinclair 6 hours ago | hide | 153 comments \n", " \n", " \n", " 4. The Cutting Room Floor: Unearthing Unused Content from Video Games (tcrf.net)\n", " 64 points by indescions_2018 5 hours ago | hide | 7 comments \n", " \n", " \n", " 5. Eshell as a main shell (bitbucket.io)\n", " 21 points by taeric 3 hours ago | hide | 1 comment \n", " \n", " \n", " 6. Yubico and Microsoft Introduce Passwordless Login (yubico.com)\n", " 183 points by guitarbill 9 hours ago | hide | 138 comments \n", " \n", " \n", " 7. LispKit: framework for Lisp-based extension/scripting languages for macOS apps (github.com)\n", " 41 points by ingve 4 hours ago | hide | 3 comments \n", " \n", " \n", " 8. Dynimize: Speed Up MySQL with CPU Performance Virtualization (dynimize.com)\n", " 49 points by nwrk 5 hours ago | hide | 13 comments \n", " \n", " \n", " 9. Android P (blog.google)\n", " 179 points by alanfranzoni 7 hours ago | hide | 145 comments \n", " \n", " \n", " 10. ls | grep “echo ${data}” – Why/how does this work? 
(zerobin.net)\n", " 40 points by indigodaddy 5 hours ago | hide | 28 comments \n", " \n", " \n", " 11. A Recycled IP Address Caused Me to Pirate Books by Accident (nickjanetakis.com)\n", " 213 points by nickjj 13 hours ago | hide | 78 comments \n", " \n", " \n", " 12. Anarchists in the Spanish Civil War (2002) (isreview.org)\n", " 40 points by dgarceran 5 hours ago | hide | 11 comments \n", " \n", " \n", " 13. A Plane That Accidentally Circumnavigated the World (2014) (medium.com)\n", " 180 points by SeoxyS 13 hours ago | hide | 33 comments \n", " \n", " \n", " 14. Mozilla Global Sprint 2018 (github.com)\n", " 178 points by robterthaddeus 7 hours ago | hide | 24 comments \n", " \n", " \n", " 15. Blitz Esports (YC S15) is hiring a front end engineer – build apps for gamers (medium.com)\n", " 1 hour ago | hide \n", " \n", " \n", " 16. How to Find Investors and Get Email Intros (atrium.co)\n", " 55 points by meredithah 8 hours ago | hide | 15 comments \n", " \n", " \n", " 17. Building a Progressive Web App in React, using Firestore for offline support (truthlabs.com)\n", " 33 points by sconstantinides 4 hours ago | hide | 3 comments \n", " \n", " \n", " 18. The Difficulty of Faking Data (1999) [pdf] (kkuniyuk.com)\n", " 23 points by tontonius 4 hours ago | hide | 2 comments \n", " \n", " \n", " 19. iOS 11.4 to Disable USB Port After 7 Days: What It Means for Mobile Forensics (elcomsoft.com)\n", " 466 points by Artemis2 11 hours ago | hide | 321 comments \n", " \n", " \n", " 20. Write Emails Faster with Smart Compose in Gmail (blog.google)\n", " 153 points by devhxinc 9 hours ago | hide | 158 comments \n", " \n", " \n", " 21. Stewart Brand Changed the World, Twice (nytimes.com)\n", " 75 points by tysone 13 hours ago | hide | 16 comments \n", " \n", " \n", " 22. Great technology should improve life, not distract from it (wellbeing.google)\n", " 149 points by panarky 7 hours ago | hide | 82 comments \n", " \n", " \n", " 23. 
Google’s ML Kit makes it easy to add AI smarts to iOS and Android apps (techcrunch.com)\n", " 171 points by coloneltcb 8 hours ago | hide | 21 comments \n", " \n", " \n", " 24. Fake news was illegal in 17th century colonial Massachusetts (mass.gov)\n", " 144 points by jimschley 9 hours ago | hide | 152 comments \n", " \n", " \n", " 25. Conversations with a six-year-old on functional programming (byorgey.wordpress.com)\n", " 1865 points by weatherlight 1 day ago | hide | 266 comments \n", " \n", " \n", " 26. Amazon’s Fake Review Economy (buzzfeed.com)\n", " 224 points by jonbaer 10 hours ago | hide | 99 comments \n", " \n", " \n", " 27. How Mapbox Is Winning Over Developers to Challenge Google's Mapping Dominance (forbes.com)\n", " 289 points by coloneltcb 12 hours ago | hide | 95 comments \n", " \n", " \n", " 28. Superconducting Optoelectronic Neurons I: General Principles (arxiv.org)\n", " 9 points by indescions_2018 5 hours ago | hide | 1 comment \n", " \n", " \n", " 29. Introducing extended line endings support in Notepad (microsoft.com)\n", " 242 points by dEnigma 9 hours ago | hide | 143 comments \n", " \n", " \n", " 30. 
Defector: WikiLeaks ‘Will Lie to Your Face’ (thedailybeast.com)\n", " 9 points by eplanit 1 hour ago | hide | 2 comments \n", " \n", " More\n", " \n", "\n", "Guidelines\n", " | FAQ\n", " | Support\n", " | API\n", " | Security\n", " | Lists\n", " | Bookmarklet\n", " | Legal\n", " | Apply to YC\n", " | ContactSearch:\n", " \n", " \n", " \n", " \n", "\n" ] } ], "source": [ "import re\n", "\n", "# Remove HTML tags using RegEx\n", "pattern = re.compile(r'<.*?>') # tags look like <...>\n", "print(pattern.sub('', r.text)) # replace them with blank" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " \n", " \n", " Hacker News\n", " \n", " Hacker News\n", " new | comments | show | ask | jobs | submit \n", " login\n", " \n", " \n", "\n", " \n", " 1. Google Duplex: An AI System for Accomplishing Real World Tasks Over the Phone (googleblog.com)\n", " 1023 points by ivank 9 hours ago | hide | 432 comments \n", " \n", " \n", " 2. A Short Introduction to the Art of Programming – Edsgar W. Dijkstra [pdf] (utexas.edu)\n", " 27 points by Rescis 1 hour ago | hide | 4 comments \n", " \n", " \n", " 3. You can now run Linux apps on Chrome OS (techcrunch.com)\n", " 357 points by willsinclair 6 hours ago | hide | 153 comments \n", " \n", " \n", " 4. The Cutting Room Floor: Unearthing Unused Content from Video Games (tcrf.net)\n", " 64 points by indescions_2018 5 hours ago | hide | 7 comments \n", " \n", " \n", " 5. Eshell as a main shell (bitbucket.io)\n", " 21 points by taeric 3 hours ago | hide | 1 comment \n", " \n", " \n", " 6. Yubico and Microsoft Introduce Passwordless Login (yubico.com)\n", " 183 points by guitarbill 9 hours ago | hide | 138 comments \n", " \n", " \n", " 7. LispKit: framework for Lisp-based extension/scripting languages for macOS apps (github.com)\n", " 41 points by ingve 4 hours ago | hide | 3 comments \n", " \n", " \n", " 8. 
Dynimize: Speed Up MySQL with CPU Performance Virtualization (dynimize.com)\n", " 49 points by nwrk 5 hours ago | hide | 13 comments \n", " \n", " \n", " 9. Android P (blog.google)\n", " 179 points by alanfranzoni 7 hours ago | hide | 145 comments \n", " \n", " \n", " 10. ls | grep “echo ${data}” – Why/how does this work? (zerobin.net)\n", " 40 points by indigodaddy 5 hours ago | hide | 28 comments \n", " \n", " \n", " 11. A Recycled IP Address Caused Me to Pirate Books by Accident (nickjanetakis.com)\n", " 213 points by nickjj 13 hours ago | hide | 78 comments \n", " \n", " \n", " 12. Anarchists in the Spanish Civil War (2002) (isreview.org)\n", " 40 points by dgarceran 5 hours ago | hide | 11 comments \n", " \n", " \n", " 13. A Plane That Accidentally Circumnavigated the World (2014) (medium.com)\n", " 180 points by SeoxyS 13 hours ago | hide | 33 comments \n", " \n", " \n", " 14. Mozilla Global Sprint 2018 (github.com)\n", " 178 points by robterthaddeus 7 hours ago | hide | 24 comments \n", " \n", " \n", " 15. Blitz Esports (YC S15) is hiring a front end engineer – build apps for gamers (medium.com)\n", " 1 hour ago | hide \n", " \n", " \n", " 16. How to Find Investors and Get Email Intros (atrium.co)\n", " 55 points by meredithah 8 hours ago | hide | 15 comments \n", " \n", " \n", " 17. Building a Progressive Web App in React, using Firestore for offline support (truthlabs.com)\n", " 33 points by sconstantinides 4 hours ago | hide | 3 comments \n", " \n", " \n", " 18. The Difficulty of Faking Data (1999) [pdf] (kkuniyuk.com)\n", " 23 points by tontonius 4 hours ago | hide | 2 comments \n", " \n", " \n", " 19. iOS 11.4 to Disable USB Port After 7 Days: What It Means for Mobile Forensics (elcomsoft.com)\n", " 466 points by Artemis2 11 hours ago | hide | 321 comments \n", " \n", " \n", " 20. Write Emails Faster with Smart Compose in Gmail (blog.google)\n", " 153 points by devhxinc 9 hours ago | hide | 158 comments \n", " \n", " \n", " 21. 
Stewart Brand Changed the World, Twice (nytimes.com)\n", " 75 points by tysone 13 hours ago | hide | 16 comments \n", " \n", " \n", " 22. Great technology should improve life, not distract from it (wellbeing.google)\n", " 149 points by panarky 7 hours ago | hide | 82 comments \n", " \n", " \n", " 23. Google’s ML Kit makes it easy to add AI smarts to iOS and Android apps (techcrunch.com)\n", " 171 points by coloneltcb 8 hours ago | hide | 21 comments \n", " \n", " \n", " 24. Fake news was illegal in 17th century colonial Massachusetts (mass.gov)\n", " 144 points by jimschley 9 hours ago | hide | 152 comments \n", " \n", " \n", " 25. Conversations with a six-year-old on functional programming (byorgey.wordpress.com)\n", " 1865 points by weatherlight 1 day ago | hide | 266 comments \n", " \n", " \n", " 26. Amazon’s Fake Review Economy (buzzfeed.com)\n", " 224 points by jonbaer 10 hours ago | hide | 99 comments \n", " \n", " \n", " 27. How Mapbox Is Winning Over Developers to Challenge Google's Mapping Dominance (forbes.com)\n", " 289 points by coloneltcb 12 hours ago | hide | 95 comments \n", " \n", " \n", " 28. Superconducting Optoelectronic Neurons I: General Principles (arxiv.org)\n", " 9 points by indescions_2018 5 hours ago | hide | 1 comment \n", " \n", " \n", " 29. Introducing extended line endings support in Notepad (microsoft.com)\n", " 242 points by dEnigma 9 hours ago | hide | 143 comments \n", " \n", " \n", " 30. 
Defector: WikiLeaks ‘Will Lie to Your Face’ (thedailybeast.com)\n", " 9 points by eplanit 1 hour ago | hide | 2 comments \n", " \n", " More\n", " \n", "\n", "Guidelines\n", " | FAQ\n", " | Support\n", " | API\n", " | Security\n", " | Lists\n", " | Bookmarklet\n", " | Legal\n", " | Apply to YC\n", " | ContactSearch:\n", " \n", " \n", " \n", " \n", "\n" ] } ], "source": [ "from bs4 import BeautifulSoup\n", "\n", "# Remove HTML tags using Beautiful Soup library\n", "soup = BeautifulSoup(r.text, \"html5lib\")\n", "print(soup.get_text())" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "\n", " 1.
Google Duplex: An AI System for Accomplishing Real World Tasks Over the Phone (googleblog.com)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Find all articles\n", "summaries = soup.find_all(\"tr\", class_=\"athing\")\n", "summaries[0]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Google Duplex: An AI System for Accomplishing Real World Tasks Over the Phone'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Extract title\n", "summaries[0].find(\"a\", class_=\"storylink\").get_text().strip()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "30 Article summaries found. Sample:\n", "Google Duplex: An AI System for Accomplishing Real World Tasks Over the Phone\n" ] } ], "source": [ "# Find all articles, extract titles\n", "articles = []\n", "summaries = soup.find_all(\"tr\", class_=\"athing\")\n", "for summary in summaries:\n", " title = summary.find(\"a\", class_=\"storylink\").get_text().strip()\n", " articles.append((title))\n", "\n", "print(len(articles), \"Article summaries found. Sample:\")\n", "print(articles[0])" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Normalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Case Normalization" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?\n" ] } ], "source": [ "# Sample text\n", "text = \"The first time you see The Second Renaissance it may look boring. 
Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?\"\n", "print(text)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "the first time you see the second renaissance it may look boring. look at it at least twice and definitely watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?\n" ] } ], "source": [ "# Convert to lowercase\n", "text = text.lower() \n", "print(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Punctuation Removal" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "the first time you see the second renaissance it may look boring look at it at least twice and definitely watch part 2 it will change your view of the matrix are the human people the ones who started the war is ai a bad thing \n" ] } ], "source": [ "import re\n", "\n", "# Remove punctuation characters\n", "text = re.sub(r\"[^a-zA-Z0-9]\", \" \", text) \n", "print(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenization" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']\n" ] } ], "source": [ "# Split text into tokens (words)\n", "words = text.split()\n", "print(words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NLTK: Natural 
Language ToolKit" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import os\n", "import nltk\n", "nltk.data.path.append(os.path.join(os.getcwd(), \"nltk_data\"))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.\n" ] } ], "source": [ "# Another sample text\n", "text = \"Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.\"\n", "print(text)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Dr.', 'Smith', 'graduated', 'from', 'the', 'University', 'of', 'Washington', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']\n" ] } ], "source": [ "from nltk.tokenize import word_tokenize\n", "\n", "# Split text into words using NLTK\n", "words = word_tokenize(text)\n", "print(words)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Dr. 
Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']\n" ] } ], "source": [ "from nltk.tokenize import sent_tokenize\n", "\n", "# Split text into sentences\n", "sentences = sent_tokenize(text)\n", "print(sentences)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n" ] } ], "source": [ "# List 
stop words\n", "from nltk.corpus import stopwords\n", "print(stopwords.words(\"english\"))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']\n" ] } ], "source": [ "# Reset text\n", "text = \"The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?\"\n", "\n", "# Normalize it\n", "text = re.sub(r\"[^a-zA-Z0-9]\", \" \", text.lower())\n", "\n", "# Tokenize it\n", "words = text.split()\n", "print(words)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']\n" ] } ], "source": [ "# Remove stop words (build the set once for fast membership tests)\n", "stop_words = set(stopwords.words(\"english\"))\n", "words = [w for w in words if w not in stop_words]\n", "print(words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentence Parsing" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(S\n", " (NP I)\n", " (VP\n", " (VP (V shot) (NP (Det an) (N elephant)))\n", " (PP (P in) (NP (Det my) (N pajamas)))))\n", "(S\n", " (NP I)\n", " (VP\n", " (V shot)\n", " (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))\n" ] } 
], "source": [ "import nltk\n", "\n", "# Define a custom grammar\n", "my_grammar = nltk.CFG.fromstring(\"\"\"\n", "S -> NP VP\n", "PP -> P NP\n", "NP -> Det N | Det N PP | 'I'\n", "VP -> V NP | VP PP\n", "Det -> 'an' | 'my'\n", "N -> 'elephant' | 'pajamas'\n", "V -> 'shot'\n", "P -> 'in'\n", "\"\"\")\n", "parser = nltk.ChartParser(my_grammar)\n", "\n", "# Parse a sentence (the grammar is ambiguous, so two trees are printed)\n", "sentence = word_tokenize(\"I shot an elephant in my pajamas\")\n", "for tree in parser.parse(sentence):\n", "    print(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stemming & Lemmatization\n", "\n", "### Stemming" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['first', 'time', 'see', 'second', 'renaiss', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definit', 'watch', 'part', '2', 'chang', 'view', 'matrix', 'human', 'peopl', 'one', 'start', 'war', 'ai', 'bad', 'thing']\n" ] } ], "source": [ "from nltk.stem.porter import PorterStemmer\n", "\n", "# Reduce words to their stems (reuse a single stemmer instance)\n", "stemmer = PorterStemmer()\n", "stemmed = [stemmer.stem(w) for w in words]\n", "print(stemmed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lemmatization" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'started', 'war', 'ai', 'bad', 'thing']\n" ] } ], "source": [ "from nltk.stem.wordnet import WordNetLemmatizer\n", "\n", "# Reduce words to their root form (words are treated as nouns by default)\n", "lemmatizer = WordNetLemmatizer()\n", "lemmed = [lemmatizer.lemmatize(w) for w in words]\n", "print(lemmed)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['first', 'time', 'see', 'second', 'renaissance', 'may', 
'look', 'bore', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'start', 'war', 'ai', 'bad', 'thing']\n" ] } ], "source": [ "# Lemmatize verbs by specifying pos\n", "lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]\n", "print(lemmed)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }