<!DOCTYPE html>

<html>

<head>

<title>Dictionaries</title>

</head>

<body>

<hr />

<h1 class="chapter" id="sec117">Chapter&#XA0;9&#XA0;&#XA0;Dictionaries</h1>

<a id="hevea_default585"></a>A dictionary is like a list, but more general. In a list,

the positions (a.k.a. indices) have to be integers; in a dictionary

the indices can be (almost) any type. You can think of a dictionary as a mapping between a set of indices
(which are called keys) and a set of values. Each key maps to a
value. The association of a key and a value is called a key-value pair or sometimes an item. As an example, we&#X2019;ll build a dictionary that maps from English
to Spanish words, so the keys and the values are all strings. The function dict creates a new dictionary with no items.

Because dict is the name of a built-in function, you

should avoid using it as a variable name.<a id="hevea_default586"></a>

<a id="hevea_default587"></a><pre class="verbatim">>>> eng2sp = dict()

>>> print eng2sp

{}

</pre>The squiggly-brackets, <code>{}</code>, represent an empty dictionary.

To add items to the dictionary, you can use square brackets:<a id="hevea_default588"></a>

<a id="hevea_default589"></a><pre class="verbatim">>>> eng2sp['one'] = 'uno'

</pre>This line creates an item that maps from the key

<code>'one'</code> to the value <code>'uno'</code>. If we print the

dictionary again, we see a key-value pair with a colon

between the key and value:<pre class="verbatim">>>> print eng2sp

{'one': 'uno'}

</pre>This output format is also an input format. For example,

you can create a new dictionary with three items:<pre class="verbatim">>>> eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}

</pre>But if you print eng2sp, you might be surprised:<pre class="verbatim">>>> print eng2sp

{'one': 'uno', 'three': 'tres', 'two': 'dos'}

</pre>The order of the key-value pairs is not the same. In fact, if

you type the same example on your computer, you might get a

different result. In general, the order of items in

a dictionary is unpredictable. But that&#X2019;s not a problem because

the elements of a dictionary are never indexed with integer indices.

Instead, you use the keys to look up the corresponding values:<pre class="verbatim">>>> print eng2sp['two']

dos

</pre>The key <code>'two'</code> always maps to the value <code>'dos'</code> so the order
of the items doesn&#X2019;t matter. If the key isn&#X2019;t in the dictionary, you get an exception:<a id="hevea_default590"></a>

<a id="hevea_default591"></a><pre class="verbatim">>>> print eng2sp['four']

KeyError: 'four'

</pre>The len function works on dictionaries; it returns the

number of key-value pairs:<a id="hevea_default592"></a>

<a id="hevea_default593"></a><pre class="verbatim">>>> len(eng2sp)
3

</pre>The in operator works on dictionaries; it tells you whether

something appears as a key in the dictionary (appearing

as a value is not good enough).<a id="hevea_default594"></a>

<a id="hevea_default596"></a><pre class="verbatim">>>> 'one' in eng2sp

True

>>> 'uno' in eng2sp

False

</pre>To see whether something appears as a value in a dictionary, you

can use the method values, which returns the values as

a list, and then use the in operator:<a id="hevea_default597"></a>

<a id="hevea_default598"></a><pre class="verbatim">>>> vals = eng2sp.values()

>>> 'uno' in vals

True

</pre>The in operator uses different algorithms for lists and

dictionaries. For lists, it uses a linear search algorithm.

As the list gets longer, the search time gets

longer in direct proportion to the length of the list.

For dictionaries, Python uses a
data structure called a hash table that has a remarkable property; the

in operator takes about the same amount of time no matter how

many items there are in a dictionary. I won&#X2019;t explain

why hash functions are so magical,

but you can read more about it at

wikipedia.org/wiki/Hash_table.<a id="hevea_default599"></a><div class="theorem">Exercise&#XA0;1&#XA0;&#XA0;

<a id="hevea_default601"></a>Write a program that reads the words in words.txt and

stores them as keys in a dictionary. It doesn&#X2019;t matter what the

values are. Then you can use the in operator

as a fast way to check whether a string is in

the dictionary.</div>

<h2 class="section" id="sec118">9.1&#XA0;&#XA0;Dictionary as a set of counters</h2>

<a id="histogram"></a><a id="hevea_default602"></a>Suppose you are given a string and you want to count how many

times each letter appears. There are several ways you could do it:<ol class="enumerate" type="1"><li class="li-enumerate">You could create 26 variables, one for each letter of the

alphabet. Then you could traverse the string and, for each

character, increment the corresponding counter, probably using

a chained conditional.</li><li class="li-enumerate">You could create a list with 26 elements. Then you could

convert each character to a number (using the built-in function

ord), use the number as an index into the list, and increment

the appropriate counter.</li><li class="li-enumerate">You could create a dictionary with characters as keys

and counters as the corresponding values. The first time you

see a character, you would add an item to the dictionary. After

that you would increment the value of an existing item.</li></ol>Each of these options performs the same computation, but each

of them implements that computation in a different way.<a id="hevea_default603"></a>An implementation is a way of performing a computation;

some implementations are better than others. For example,

an advantage of the dictionary implementation is that we don&#X2019;t

have to know ahead of time which letters appear in the string

and we only have to make room for the letters that do appear. Here is what the code might look like:<pre class="verbatim">word = 'brontosaurus'
d = dict()
for c in word:
    if c not in d:
        d[c] = 1
    else:
        d[c] = d[c] + 1
print d

</pre>We are effectively computing a histogram, which is a statistical

term for a set of counters (or frequencies). <a id="hevea_default604"></a>

<a id="hevea_default606"></a>The for loop traverses

the string. Each time through the loop, if the character c is

not in the dictionary, we create a new item with key c and the

initial value 1 (since we have seen this letter once). If c is

already in the dictionary we increment d[c].<a id="hevea_default607"></a>Here&#X2019;s the output of the program:<pre class="verbatim">{'a': 1, 'b': 1, 'o': 2, 'n': 1, 's': 2, 'r': 2, 'u': 2, 't': 1}

</pre>The histogram indicates that the letters <code>'a'</code> and <code>'b'</code>

appear once; <code>'o'</code> appears twice, and so on.<a id="hevea_default608"></a>
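For comparison, here is a rough sketch of the second option from the list above: a list of 26 counters indexed with the help of ord. It assumes the word contains only lowercase letters from a to z, and the names counters and index are chosen just for this illustration. Each print call is written with parentheses around a single value so that the sketch runs under both Python 2 and Python 3.

```python
# Sketch of option 2: a list of 26 counters, one per lowercase letter.
# Assumption: the word contains only the letters 'a' through 'z'.
word = 'brontosaurus'
counters = [0] * 26

for c in word:
    index = ord(c) - ord('a')   # 'a' maps to 0, 'b' to 1, ..., 'z' to 25
    counters[index] = counters[index] + 1

# Print only the letters that actually occurred.
for index in range(26):
    if counters[index] != 0:
        print(chr(ord('a') + index) + ' ' + str(counters[index]))
```

Unlike the dictionary version, this list reserves room for all 26 letters whether or not they appear, and it would fail or miscount for characters outside that range.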

<a id="hevea_default609"></a>Dictionaries have a method called get that takes a key

and a default value. If the key appears in the dictionary,

get returns the corresponding value; otherwise it returns

the default value. For example:<pre class="verbatim">>>> counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}

>>> print counts.get('jan', 0)

100

>>> print counts.get('tim', 0)
0

</pre>We can use get to write our histogram loop more concisely.

Because the get method automatically handles the case where a key

is not in a dictionary, we can reduce four lines down to one

and eliminate the if statement.<pre class="verbatim">word = 'brontosaurus'
d = dict()
for c in word:
    d[c] = d.get(c,0) + 1
print d

</pre>The use of the get method to simplify this counting loop

ends up being a very commonly used &#X201C;idiom&#X201D; in Python and

we will use it many times in the rest of the book. So you should

take a moment and compare the loop using the if statement

and in operator with the loop using the get method.

They do exactly the same thing, but one is more succinct.

<h2 class="section" id="sec119">9.2&#XA0;&#XA0;Dictionaries and files</h2>

One of the common uses of a dictionary is to count the occurrence

of words in a file with some written text.

Let&#X2019;s start with a very simple file of

words taken from the text of Romeo and Juliet

thanks to

http://shakespeare.mit.edu/Tragedy/romeoandjuliet/romeo_juliet.2.2.html. For the first set of examples, we will use a shortened and simplified version

of the text with no punctuation. Later we will work with the text of the

scene with punctuation included.<pre class="verbatim">But soft what light through yonder window breaks

It is the east and Juliet is the sun

Arise fair sun and kill the envious moon

Who is already sick and pale with grief

</pre>We will write a Python program to read through the lines of the file,

break each line into a list of words, and then loop through each

of the words in the line, and count each word using a dictionary.<a id="hevea_default611"></a>

You will see that we have two for loops. The outer loop is reading the

lines of the file and the inner loop is iterating through each

of the words on that particular line. This is an example

of a pattern called nested loops because one of the loops

is the outer loop and the other loop is the inner

loop. Because the inner loop executes all of its iterations each time

the outer loop makes a single iteration, we think of the inner

loop as iterating &#X201C;more quickly&#X201D; and the outer loop as iterating

more slowly.<a id="hevea_default613"></a>

The combination of the two nested loops ensures that we will count

every word on every line of the input file.<pre class="verbatim">fname = raw_input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print 'File cannot be opened:', fname
    exit()

counts = dict()
for line in fhand:
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

print counts

</pre>When we run the program, we see a raw dump of all of the counts in unsorted

hash order.
(The romeo.txt file is available at
www.py4inf.com/code/romeo.txt.)<pre class="verbatim">python count1.py

Enter the file name: romeo.txt

{'and': 3, 'envious': 1, 'already': 1, 'fair': 1,

'is': 3, 'through': 1, 'pale': 1, 'yonder': 1,

'what': 1, 'sun': 2, 'Who': 1, 'But': 1, 'moon': 1,

'window': 1, 'sick': 1, 'east': 1, 'breaks': 1,

'grief': 1, 'with': 1, 'light': 1, 'It': 1, 'Arise': 1,

'kill': 1, 'the': 3, 'soft': 1, 'Juliet': 1}

</pre>It is a bit inconvenient to look through the dictionary to find the

most common words and their counts, so we need to add some more

Python code to get us the output that will be more helpful.
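As an aside, the nested-loop counting pattern can be tried without a file on disk. The sketch below is only an illustration: it assumes the four simplified lines are held in a list named lines (a stand-in for romeo.txt), and it uses the get idiom from the previous section in place of the if statement.

```python
# Hypothetical stand-in for romeo.txt: the four simplified lines as a list.
lines = [
    'But soft what light through yonder window breaks',
    'It is the east and Juliet is the sun',
    'Arise fair sun and kill the envious moon',
    'Who is already sick and pale with grief',
]

counts = dict()
for line in lines:               # outer loop: runs once per line
    for word in line.split():    # inner loop: runs once per word on that line
        counts[word] = counts.get(word, 0) + 1

print(counts['the'])
print(counts['sun'])
```

The two values printed should agree with the counts for 'the' and 'sun' in the dump shown above.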

<h2 class="section" id="sec120">9.3&#XA0;&#XA0;Looping and dictionaries</h2>

<a id="hevea_default616"></a>If you use a dictionary as the sequence

in a for statement, it traverses

the keys of the dictionary. This loop

prints each key and the corresponding value:<pre class="verbatim">counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}
for key in counts:
    print key, counts[key]

</pre>Here&#X2019;s what the output looks like:<pre class="verbatim">jan 100

chuck 1

annie 42

</pre>Again, the keys are in no particular order.<a id="hevea_default617"></a>

We can use this pattern to implement the various loop idioms

that we have described earlier. For example if we wanted

to find all the entries in a dictionary with a value

above ten, we could write the following code:<pre class="verbatim">counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}
for key in counts:
    if counts[key] > 10 :
        print key, counts[key]

</pre>The for loop iterates through the

keys of the dictionary, so we must

use the index operator to retrieve the

corresponding value

for each key.

Here&#X2019;s what the output looks like:<pre class="verbatim">jan 100

annie 42

</pre>We see only the entries with a value above 10.<a id="hevea_default618"></a>

If you want to print the keys in alphabetical order, you first
make a list of the keys in the dictionary using the
keys method available in dictionary objects,
then sort that list
and loop through the sorted list, looking up each
key and printing out key/value pairs in sorted order
as follows:<pre class="verbatim">counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}
lst = counts.keys()
print lst
lst.sort()
for key in lst:
    print key, counts[key]

</pre>Here&#X2019;s what the output looks like:<pre class="verbatim">['jan', 'chuck', 'annie']

annie 42

chuck 1

jan 100

</pre>First you see the list of keys in unsorted order that

we get from the keys method. Then we see the key/value

pairs in order from the for loop.
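A possible variation, offered only as a sketch, is to let the built-in function sorted build the sorted list of keys in one step. The variable name ordered is made up for this example; the parenthesized print calls are used so the code runs under both Python 2 and Python 3.

```python
counts = {'chuck': 1, 'annie': 42, 'jan': 100}

# sorted() accepts any iterable; iterating a dictionary yields its keys,
# so this produces a new, alphabetically sorted list of the keys.
ordered = sorted(counts)

for key in ordered:
    print(key + ' ' + str(counts[key]))
```

The dictionary itself is not changed; only the new list is in sorted order.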

<h2 class="section" id="sec121">9.4&#XA0;&#XA0;Advanced text parsing</h2>

In the above example using the file romeo.txt,

we made the file as simple as possible by removing

any and all punctuation by hand. The real text

has lots of punctuation as shown below:<pre class="verbatim">But, soft! what light through yonder window breaks?

It is the east, and Juliet is the sun.

Arise, fair sun, and kill the envious moon,

Who is already sick and pale with grief,

</pre>Since the Python split function looks for spaces and

treats words as tokens separated by spaces, we would treat the

words &#X201C;soft!&#X201D; and &#X201C;soft&#X201D; as different words and create

a separate dictionary entry for each word. Also since the file has capitalization, we would treat
&#X201C;who&#X201D; and &#X201C;Who&#X201D; as different words with different
counts. We can solve both these problems by using the string
methods lower and translate together with the constant string.punctuation. The
translate method is the most subtle of these.
Here is the documentation for translate: <code>string.translate(s, table[, deletechars])</code> Delete all characters from s that are in deletechars (if present),
and then translate the characters using table, which must
be a 256-character string giving the translation for each
character value, indexed by its ordinal. If table is None,
then only the character deletion step is performed. We will not specify the table but we will use

the deletechars parameter to delete all of the punctuation.

We will even let Python tell us the list of characters

that it considers &#X201C;punctuation&#X201D;:<pre class="verbatim">>>> import string

>>> string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

</pre>We make the following modifications to our program:<pre class="verbatim">import string                                          # New Code

fname = raw_input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print 'File cannot be opened:', fname
    exit()

counts = dict()
for line in fhand:
    line = line.translate(None, string.punctuation)    # New Code
    line = line.lower()                                # New Code
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

print counts

</pre>We use translate to remove all punctuation and lower to

force the line to lowercase. Otherwise the program is unchanged.

Note for Python 2.5 and earlier, translate does not

accept None as the first parameter so use this code for the translate

call:<pre class="verbatim">print a.translate(string.maketrans(' ',' '), string.punctuation)

</pre>Part of learning the &#X201C;Art of Python&#X201D; or &#X201C;Thinking Pythonically&#X201D;

is realizing that Python

often has built-in capabilities for many common data-analysis

problems. Over time, you will see enough example code and read

enough of the documentation to know where to look to see if someone

has already written something that makes your job much easier. The following is an abbreviated version of the output:<pre class="verbatim">Enter the file name: romeo-full.txt

{'swearst': 1, 'all': 6, 'afeard': 1, 'leave': 2, 'these': 2,

'kinsmen': 2, 'what': 11, 'thinkst': 1, 'love': 24, 'cloak': 1,

'a': 24, 'orchard': 2, 'light': 5, 'lovers': 2, 'romeo': 40,

'maiden': 1, 'whiteupturned': 1, 'juliet': 32, 'gentleman': 1,

'it': 22, 'leans': 1, 'canst': 1, 'having': 1, ...}

</pre>Looking through this output is still unwieldy and we can use

Python to give us exactly what we are looking for, but to do

so, we need to learn about Python tuples. We

will pick up this example once we learn about tuples.
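The two-argument translate call used in this section is specific to Python 2 strings. As a hedged alternative, the sketch below removes punctuation by filtering characters one at a time. This is a different technique from translate, and the helper name strip_punctuation is invented for this illustration. It should behave the same under Python 2 and Python 3.

```python
import string

def strip_punctuation(line):
    # Keep every character that is not listed in string.punctuation.
    return ''.join(c for c in line if c not in string.punctuation)

line = 'But, soft! what light through yonder window breaks?'
cleaned = strip_punctuation(line).lower()
print(cleaned)
print(cleaned.split())
```

After this step, &#X201C;soft!&#X201D; and &#X201C;soft&#X201D; end up as the same word, which is the effect the translate call is meant to achieve.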

<h2 class="section" id="sec122">9.5&#XA0;&#XA0;Debugging</h2>

<a id="hevea_default621"></a>As you work with bigger datasets it can become unwieldy to

debug by printing and checking data by hand. Here are some

suggestions for debugging large datasets:<dl class="description"><dt class="dt-description">Scale down the input:</dt><dd class="dd-description"> If possible, reduce the size of the

dataset. For example if the program reads a text file, start with

just the first 10 lines, or with the smallest example you can find.

You can either edit the files themselves, or (better) modify the

program so it reads only the first n lines. If there is an error, you can reduce n to the smallest

value that manifests the error, and then increase it gradually

as you find and correct errors.</dd><dt class="dt-description">Check summaries and types:</dt><dd class="dd-description"> Instead of printing and checking the

entire dataset, consider printing summaries of the data: for example,

the number of items in a dictionary or the total of a list of numbers. A common cause of runtime errors is a value that is not the right

type. For debugging this kind of error, it is often enough to print

the type of a value.</dd><dt class="dt-description">Write self-checks:</dt><dd class="dd-description"> Sometimes you can write code to check

for errors automatically. For example, if you are computing the

average of a list of numbers, you could check that the result is

not greater than the largest element in the list or less than

the smallest. This is called a &#X201C;sanity check&#X201D; because it detects

results that are &#X201C;completely illogical.&#X201D;<a id="hevea_default622"></a>

<a id="hevea_default623"></a>Another kind of check compares the results of two different

computations to see if they are consistent. This is called a

&#X201C;consistency check.&#X201D;</dd><dt class="dt-description">Pretty print the output:</dt><dd class="dd-description"> Formatting debugging output

can make it easier to spot an error. </dd></dl>Again, time you spend building scaffolding can reduce

the time you spend debugging.<a id="hevea_default624"></a>
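To make two of these suggestions concrete, here is a small sketch. The function name average and the sample data are assumptions made for the example. It shows a sanity check on a computed average, followed by printing a summary and a type instead of a whole dataset.

```python
def average(numbers):
    # Assumes numbers is a non-empty list of numbers.
    result = sum(numbers) / float(len(numbers))
    # Sanity check: an average cannot fall outside the range of the data.
    assert min(numbers) <= result <= max(numbers), 'average out of range'
    return result

print(average([1, 2, 3, 4]))

# Summaries and types: print the size and one value's type,
# rather than dumping the entire dictionary.
counts = {'chuck': 1, 'annie': 42, 'jan': 100}
print(len(counts))
print(type(counts['jan']))
```

If the sanity check ever fails, the program stops with an AssertionError at the point where the illogical result was produced, which is usually easier to trace than a wrong number printed much later.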

<h2 class="section" id="sec123">9.6&#XA0;&#XA0;Glossary</h2>

<dl class="description"><dt class="dt-description">dictionary:</dt><dd class="dd-description"> A mapping from a set of keys to their

corresponding values.

<a id="hevea_default625"></a></dd><dt class="dt-description">hashtable:</dt><dd class="dd-description"> The data structure used to implement Python

dictionaries.

<a id="hevea_default626"></a></dd><dt class="dt-description">hash function:</dt><dd class="dd-description"> A function used by a hashtable to compute the

location for a key.

<a id="hevea_default627"></a></dd><dt class="dt-description">histogram:</dt><dd class="dd-description"> A set of counters.

<a id="hevea_default628"></a></dd><dt class="dt-description">implementation:</dt><dd class="dd-description"> A way of performing a computation.

<a id="hevea_default629"></a></dd><dt class="dt-description">item:</dt><dd class="dd-description"> Another name for a key-value pair.

<a id="hevea_default630"></a></dd><dt class="dt-description">key:</dt><dd class="dd-description"> An object that appears in a dictionary as the

first part of a key-value pair.

<a id="hevea_default631"></a></dd><dt class="dt-description">key-value pair:</dt><dd class="dd-description"> The representation of the mapping from

a key to a value.

<a id="hevea_default632"></a></dd><dt class="dt-description">lookup:</dt><dd class="dd-description"> A dictionary operation that takes a key and finds

the corresponding value.

<a id="hevea_default633"></a></dd><dt class="dt-description">nested loops:</dt><dd class="dd-description"> When there are one or more loops &#X201C;inside&#X201D; of

another loop. The inner loop runs to completion each time the outer

loop runs once.

<a id="hevea_default635"></a></dd><dt class="dt-description">value:</dt><dd class="dd-description"> An object that appears in a dictionary as the

second part of a key-value pair. This is more specific than

our previous use of the word &#X201C;value.&#X201D;</dd></dl>

<h2 class="section" id="sec124">9.7&#XA0;&#XA0;Exercises</h2>

<div class="theorem">Exercise&#XA0;2&#XA0;&#XA0;

Write a program that categorizes each mail message by which

day of the week the commit was done. To do this look for

lines which start with &#X201C;From&#X201D;, then look for the

third word and then keep a running count of each of the

days of the week. At the end of the program print out the

contents of your dictionary (order does not matter).<pre class="verbatim">Sample Line:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Sample Execution:

python dow.py

Enter a file name: mbox-short.txt

{'Fri': 20, 'Thu': 6, 'Sat': 1}

</pre></div><div class="theorem">Exercise&#XA0;3&#XA0;&#XA0;

Write a program to read through a mail log, and

build a histogram using a dictionary to count how many

messages have come from each email address

and print the dictionary.<pre class="verbatim">Enter file name: mbox-short.txt

{'gopal.ramasammycook@gmail.com': 1, 'louis@media.berkeley.edu': 3,

'cwen@iupui.edu': 5, 'antranig@caret.cam.ac.uk': 1,

'rjlowe@iupui.edu': 2, 'gsilver@umich.edu': 3,

'david.horwitz@uct.ac.za': 4, 'wagnermr@iupui.edu': 1,

'zqian@umich.edu': 4, 'stephen.marquard@uct.ac.za': 2,

'ray@media.berkeley.edu': 1}

</pre></div><div class="theorem">Exercise&#XA0;4&#XA0;&#XA0;

Add code to the above program to figure out who has the

most messages in the file.After all the data has been read and the dictionary has been

created, look through the dictionary using a

maximum loop

(see Section&#XA0;<a href="book006.html#maximumloop">??</a>)

to find who has the most

messages and print how many messages the person has.<pre class="verbatim">Enter a file name: mbox-short.txt

cwen@iupui.edu 5

Enter a file name: mbox.txt

zqian@umich.edu 195

</pre></div><div class="theorem">Exercise&#XA0;5&#XA0;&#XA0;

Write a program that records the domain name
from which each message was sent, instead of the whole
e-mail address of the sender. At the end
of the program print out the contents of your dictionary. <pre class="verbatim">python schoolcount.py

Enter a file name: mbox-short.txt

{'media.berkeley.edu': 4, 'uct.ac.za': 6, 'umich.edu': 7,

'gmail.com': 1, 'caret.cam.ac.uk': 1, 'iupui.edu': 8}

</pre></div>

<hr />

</body>

</html>
