pythonlearn/html-008/cfbook010.html at master · StudyCourse/pythonlearn

History

622 lines (615 loc) · 26.6 KB

Raw

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

293

294

295

296

297

298

299

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

345

346

347

348

349

350

351

352

353

354

355

356

357

358

359

360

361

362

363

364

365

366

367

368

369

370

371

372

373

374

375

376

377

378

379

380

381

382

383

384

385

386

387

388

389

390

391

392

393

394

395

396

397

398

399

400

401

402

403

404

405

406

407

408

409

410

411

412

413

414

415

416

417

418

419

420

421

422

423

424

425

426

427

428

429

430

431

432

433

434

435

436

437

438

439

440

441

442

443

444

445

446

447

448

449

450

451

452

453

454

455

456

457

458

459

460

461

462

463

464

465

466

467

468

469

470

471

472

473

474

475

476

477

478

479

480

481

482

483

484

485

486

487

488

489

490

491

492

493

494

495

496

497

498

499

500

501

502

503

504

505

506

507

508

509

510

511

512

513

514

515

516

517

518

519

520

521

522

523

524

525

526

527

528

529

530

531

532

533

534

535

536

537

538

539

540

541

542

543

544

545

546

547

548

549

550

551

552

553

554

555

556

557

558

559

560

561

562

563

564

565

566

567

568

569

570

571

572

573

574

575

576

577

578

579

580

581

582

583

584

585

586

587

588

589

590

591

592

593

594

595

596

597

598

599

600

601

602

603

604

605

606

607

608

609

610

611

612

613

614

615

616

617

618

619

620

621

622

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"

"http://www.w3.org/TR/REC-html40/loose.dtd">

<html>

<head>

<title>

Dictionaries

</title>

</head>

<body>

<hr />

<h1><a name="htoc116">Chapter 9</a>  Dictionaries</h1>

A dictionary is like a list, but more general. In a list,

the positions (a.k.a. indices) have to be integers; in a dictionary

the indices can be (almost) any type.

You can think of a dictionary as a mapping between a set of indices

(which are called keys) and a set of values. Each key maps to a

value. The association of a key and a value is called a key-value pair or sometimes an item.

As an example, we'll build a dictionary that maps from English

to Spanish words, so the keys and the values are all strings.

The function <tt>dict</tt> creates a new dictionary with no items.

Because <tt>dict</tt> is the name of a built-in function, you

should avoid using it as a variable name.

<pre>

>>> eng2sp = dict()

>>> print eng2sp

{}

</pre>

The squiggly-brackets, <code>{}</code>, represent an empty dictionary.

To add items to the dictionary, you can use square brackets:

<pre>

>>> eng2sp['one'] = 'uno'

</pre>This line creates an item that maps from the key

<tt>'one'</tt> to the value <code>'uno'</code>. If we print the

dictionary again, we see a key-value pair with a colon

between the key and value:

<pre>

>>> print eng2sp

{'one': 'uno'}

</pre>This output format is also an input format. For example,

you can create a new dictionary with three items:

<pre>

>>> eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}

</pre>But if you print <tt>eng2sp</tt>, you might be surprised:

<pre>

>>> print eng2sp

{'one': 'uno', 'three': 'tres', 'two': 'dos'}

</pre>The order of the key-value pairs is not the same. In fact, if

you type the same example on your computer, you might get a

different result. In general, the order of items in

a dictionary is unpredictable.

But that's not a problem because

the elements of a dictionary are never indexed with integer indices.

Instead, you use the keys to look up the corresponding values:

<pre>

>>> print eng2sp['two']

'dos'

</pre>The key <tt>'two'</tt> always maps to the value <code>'dos'</code> so the order

of the items doesn't matter.

If the key isn't in the dictionary, you get an exception:

<pre>

>>> print eng2sp['four']

KeyError: 'four'

</pre>The <tt>len</tt> function works on dictionaries; it returns the

number of key-value pairs:

<pre>

>>> len(eng2sp)

</pre>The <tt>in</tt> operator works on dictionaries; it tells you whether

something appears as a key in the dictionary (appearing

as a value is not good enough).

<pre>

>>> 'one' in eng2sp

True

>>> 'uno' in eng2sp

False

</pre>To see whether something appears as a value in a dictionary, you

can use the method <tt>values</tt>, which returns the values as

a list, and then use the <tt>in</tt> operator:

<pre>

>>> vals = eng2sp.values()

>>> 'uno' in vals

True

</pre>The <tt>in</tt> operator uses different algorithms for lists and

dictionaries. For lists, it uses a linear search algorithm.

As the list gets longer, the search time gets

longer in direct proportion to the length of the list.

For dictionaries, Python uses an

algorithm called a hash table that has a remarkable property; the

<tt>in</tt> operator takes about the same amount of time no matter how

many items there are in a dictionary. I won't explain

why hash functions are so magical,

but you can read more about it at

<tt>wikipedia.org/wiki/Hash_table</tt>.

<div align="left">Exercise 1

Write a program that reads the words in <tt>words.txt</tt> and

stores them as keys in a dictionary. It doesn't matter what the

values are. Then you can use the <tt>in</tt> operator

as a fast way to check whether a string is in

the dictionary.</div>

<h2><a name="htoc117">9.1</a>  Dictionary as a set of counters</h2>

Suppose you are given a string and you want to count how many

times each letter appears. There are several ways you could do it:

<ol type="1"><li>You could create 26 variables, one for each letter of the

alphabet. Then you could traverse the string and, for each

character, increment the corresponding counter, probably using

a chained conditional.

</li><li>You could create a list with 26 elements. Then you could

convert each character to a number (using the built-in function

<tt>ord</tt>), use the number as an index into the list, and increment

the appropriate counter.

</li><li>You could create a dictionary with characters as keys

and counters as the corresponding values. The first time you

see a character, you would add an item to the dictionary. After

that you would increment the value of an existing item.</li></ol>

Each of these options performs the same computation, but each

of them implements that computation in a different way.

An implementation is a way of performing a computation;

some implementations are better than others. For example,

an advantage of the dictionary implementation is that we don't

have to know ahead of time which letters appear in the string

and we only have to make room for the letters that do appear.

Here is what the code might look like:

<pre>

word = 'brontosaurus'

d = dict()

for c in word:

if c not in d:

d[c] = 1

else:

d[c] = d[c] + 1

print d

</pre>We are effectively computing a histogram, which is a statistical

term for a set of counters (or frequencies).

The <tt>for</tt> loop traverses

the string. Each time through the loop, if the character <tt>c</tt> is

not in the dictionary, we create a new item with key <tt>c</tt> and the

initial value 1 (since we have seen this letter once). If <tt>c</tt> is

already in the dictionary we increment <tt>d[c]</tt>.

Here's the output of the program:

<pre>

{'a': 1, 'b': 1, 'o': 2, 'n': 1, 's': 2, 'r': 2, 'u': 2, 't': 1}

</pre>The histogram indicates that the letters <tt>'a'</tt> and <code>'b'</code>

appear once; <code>'o'</code> appears twice, and so on.

Dictionaries have a method called <tt>get</tt> that takes a key

and a default value. If the key appears in the dictionary,

<tt>get</tt> returns the corresponding value; otherwise it returns

the default value. For example:

<pre>

>>> counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}

>>> print counts.get('jan', 0)

100

>>> print counts.get('tim', 0)

</pre>We can use <tt>get</tt> to write our histogram loop more concisely.

Because the <tt>get</tt> method automatically handles the case where a key

is not in a dictionary, we can reduce four lines down to one

and eliminate the <tt>if</tt> statement.

<pre>

word = 'brontosaurus'

d = dict()

for c in word:

d[c] = d.get(c,0) + 1

print d

</pre>The use of the <tt>get</tt> method to simplify this counting loop

ends up being a very commonly used "idiom" in Python and

we will use it many times the rest of the book. So you should

take a moment and compare the loop using the <tt>if</tt> statement

and <tt>in</tt> operator with the loop using the <tt>get</tt> method.

They do exactly the same thing, but one is more succinct.

<h2><a name="htoc118">9.2</a>  Dictionaries and files</h2>

One of the common uses of a dictionary is to count the occurrence

of words in a file with some written text.

Let's start with a very simple file of

words taken from the text of Romeo and Juliet

thanks to

<tt>http://shakespeare.mit.edu/Tragedy/romeoandjuliet/romeo_juliet.2.2.html</tt>.

For the first set of examples, we will use a shortened and simplified version

of the text with no punctuation. Later we will work with the text of the

scene with punctuation included.

<pre>

But soft what light through yonder window breaks

It is the east and Juliet is the sun

Arise fair sun and kill the envious moon

Who is already sick and pale with grief

</pre>We will write a Python program to read through the lines of the file,

break each line into a list of words, and then loop through each

of the words in the line, and count each word using a dictionary.

You will see that we have two <tt>for</tt> loops. The outer loop is reading the

lines of the file and the inner loop is iterating through each

of the words on that particular line. This is an example

of a pattern called nested loops because one of the loops

is the outer loop and the other loop is the inner

loop.

Because the inner loop executes all of its iterations each time

the outer loop makes a single iteration, we think of the inner

loop as iterating "more quickly" and the outer loop as iterating

more slowly.

The combination of the two nested loops ensures that we will count

every word on every line of the input file.

<pre>

fname = raw_input('Enter the file name: ')

try:

fhand = open(fname)

except:

print 'File cannot be opened:', fname

exit()

counts = dict()

for line in fhand:

words = line.split()

for word in words:

if word not in counts:

counts[word] = 1

else:

counts[word] += 1

print counts

</pre>When we run the program, we see a raw dump of all of the counts in unsorted

hash order.

(the <tt>romeo.txt</tt> file is available at

<tt>www.py4inf.com/code/romeo.txt</tt>)

<pre>

python count1.py

Enter the file name: romeo.txt

{'and': 3, 'envious': 1, 'already': 1, 'fair': 1,

'is': 3, 'through': 1, 'pale': 1, 'yonder': 1,

'what': 1, 'sun': 2, 'Who': 1, 'But': 1, 'moon': 1,

'window': 1, 'sick': 1, 'east': 1, 'breaks': 1,

'grief': 1, 'with': 1, 'light': 1, 'It': 1, 'Arise': 1,

'kill': 1, 'the': 3, 'soft': 1, 'Juliet': 1}

</pre>It is a bit inconvenient to look through the dictionary to find the

most common words and their counts, so we need to add some more

Python code to get us the output that will be more helpful.

<h2><a name="htoc119">9.3</a>  Looping and dictionaries</h2>

If you use a dictionary as the sequence

in a <tt>for</tt> statement, it traverses

the keys of the dictionary. This loop

prints each key and the corresponding value:

<pre>

counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}

for key in counts:

print key, counts[key]

</pre>Here's what the output looks like:

<pre>

jan 100

chuck 1

annie 42

</pre>Again, the keys are in no particular order.

We can use this pattern to implement the various loop idioms

that we have described earlier. For example if we wanted

to find all the entries in a dictionary with a value

above ten, we could write the following code:

<pre>

counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}

for key in counts:

if counts[key] > 10 :

print key, counts[key]

</pre>The <tt>for</tt> loop iterates through the

keys of the dictionary, so we must

use the index operator to retrieve the

corresponding value

for each key.

Here's what the output looks like:

<pre>

jan 100

annie 42

</pre>We see only the entries with a value above 10.

If you want to print the keys in alphabetical order, you first

make a list of the keys in the dictionary using the

<tt>keys</tt> method available in dictionary objects,

and then sort that list

and loop through the sorted list, looking up each

key printing out key/value pairs in sorted order as follows

as follows:

<pre>

counts = { 'chuck' : 1 , 'annie' : 42, 'jan': 100}

lst = counts.keys()

print lst

lst.sort()

for key in lst:

print key, counts[key]

</pre>Here's what the output looks like:

<pre>

['jan', 'chuck', 'annie']

annie 42

chuck 1

jan 100

</pre>First you see the list of keys in unsorted order that

we get from the <tt>keys</tt> method. Then we see the key/value

pairs in order from the <tt>for</tt> loop.

<h2><a name="htoc120">9.4</a>  Advanced text parsing</h2>

In the above example using the file <tt>romeo.txt</tt>,

we made the file as simple as possible by removing

any and all punctuation by hand. The real text

has lots of punctuation as shown below:

<pre>

But, soft! what light through yonder window breaks?

It is the east, and Juliet is the sun.

Arise, fair sun, and kill the envious moon,

Who is already sick and pale with grief,

</pre>Since the Python <tt>split</tt> function looks for spaces and

treats words as tokens separated by spaces, we would treat the

words "soft!" and "soft" as different words and create

a separate dictionary entry for each word.

Also since the file has capitalization, we would treat

"who" and "Who" as different words with different

counts.

We can solve both these problems by using the string

methods <tt>lower</tt>, <tt>punctuation</tt>, and <tt>translate</tt>. The

<tt>translate</tt> is the most subtle of the methods.

Here is the documentation for <tt>translate</tt>:

<code>string.translate(s, table[, deletechars])</code>

Delete all characters from s that are in deletechars (if present),

and then translate the characters using table, which must

be a 256-character string giving the translation for each

character value, indexed by its ordinal. If table is None,

then only the character deletion step is performed.

We will not specify the <tt>table</tt> but we will use

the <tt>deletechars</tt> parameter to delete all of the punctuation.

We will even let Python tell us the list of characters

that it considers "punctuation":

<pre>

>>> import string

>>> string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

</pre>We make the following modifications to our program:

<pre>

import string # New Code

fname = raw_input('Enter the file name: ')

try:

fhand = open(fname)

except:

print 'File cannot be opened:', fname

exit()

counts = dict()

for line in fhand:

line = line.translate(None, string.punctuation) # New Code

line = line.lower() # New Code

words = line.split()

for word in words:

if word not in counts:

counts[word] = 1

else:

counts[word] += 1

print counts

</pre>We use <tt>translate</tt> to remove all punctuation and <tt>lower</tt> to

force the line to lowercase. Otherwise the program is unchanged.

Note for Python 2.5 and earlier, <tt>translate</tt> does not

accept <tt>None</tt> as the first parameter so use this code for the translate

call:

<pre>

print a.translate(string.maketrans(' ',' '), string.punctuation

</pre>Part of learning the "Art of Python" or "Thinking Pythonically"

is realizing that Python

often has built-in capabilities for many common data-analysis

problems. Over time, you will see enough example code and read

enough of the documentation to know where to look to see if someone

has already written something that makes your job much easier.

The following is an abbreviated version of the output:

<pre>

Enter the file name: romeo-full.txt

{'swearst': 1, 'all': 6, 'afeard': 1, 'leave': 2, 'these': 2,

'kinsmen': 2, 'what': 11, 'thinkst': 1, 'love': 24, 'cloak': 1,

a': 24, 'orchard': 2, 'light': 5, 'lovers': 2, 'romeo': 40,

'maiden': 1, 'whiteupturned': 1, 'juliet': 32, 'gentleman': 1,

'it': 22, 'leans': 1, 'canst': 1, 'having': 1, ...}

</pre>Looking through this output is still unwieldy and we can use

Python to gives us exactly what we are looking for, but to do

so, we need to learn about Python tuples. We

will pick up this example once we learn about tuples.

<h2><a name="htoc121">9.5</a>  Debugging</h2>

As you work with bigger datasets it can become unwieldy to

debug by printing and checking data by hand. Here are some

suggestions for debugging large datasets:

<dl compact="compact"><dt>Scale down the input:</dt><dd> If possible, reduce the size of the

dataset. For example if the program reads a text file, start with

just the first 10 lines, or with the smallest example you can find.

You can either edit the files themselves, or (better) modify the

program so it reads only the first <tt>n</tt> lines.

If there is an error, you can reduce <tt>n</tt> to the smallest

value that manifests the error, and then increase it gradually

as you find and correct errors.

</dd><dt>Check summaries and types:</dt><dd> Instead of printing and checking the

entire dataset, consider printing summaries of the data: for example,

the number of items in a dictionary or the total of a list of numbers.

A common cause of runtime errors is a value that is not the right

type. For debugging this kind of error, it is often enough to print

the type of a value.

</dd><dt>Write self-checks:</dt><dd> Sometimes you can write code to check

for errors automatically. For example, if you are computing the

average of a list of numbers, you could check that the result is

not greater than the largest element in the list or less than

the smallest. This is called a "sanity check" because it detects

results that are "completely illogical."

Another kind of check compares the results of two different

computations to see if they are consistent. This is called a

"consistency check."

</dd><dt>Pretty print the output:</dt><dd> Formatting debugging output

can make it easier to spot an error. </dd></dl>

Again, time you spend building scaffolding can reduce

the time you spend debugging.

<h2><a name="htoc122">9.6</a>  Glossary</h2>

<dl compact="compact"><dt>dictionary:</dt><dd> A mapping from a set of keys to their

corresponding values.

</dd><dt>hashtable:</dt><dd> The algorithm used to implement Python

dictionaries.

</dd><dt>hash function:</dt><dd> A function used by a hashtable to compute the

location for a key.

</dd><dt>histogram:</dt><dd> A set of counters.

</dd><dt>implementation:</dt><dd> A way of performing a computation.

</dd><dt>item:</dt><dd> Another name for a key-value pair.

</dd><dt>key:</dt><dd> An object that appears in a dictionary as the

first part of a key-value pair.

</dd><dt>key-value pair:</dt><dd> The representation of the mapping from

a key to a value.

</dd><dt>lookup:</dt><dd> A dictionary operation that takes a key and finds

the corresponding value.

</dd><dt>nested loops:</dt><dd> When there is one or more loops "inside" of

another loop. The inner loop runs to completion each time the outer

loop runs once.

</dd><dt>value:</dt><dd> An object that appears in a dictionary as the

second part of a key-value pair. This is more specific than

our previous use of the word "value."

<h2><a name="htoc123">9.7</a>  Exercises</h2>

<div align="left">Exercise 2

Write a program that categorizes each mail message by which

day of the week the commit was done. To do this look for

lines which start with "From", then look for the

third word and then keep a running count of each of the

days of the week. At the end of the program print out the

contents of your dictionary (order does not matter).

<pre>

Sample Line:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Sample Execution:

python dow.py

Enter a file name: mbox-short.txt

{'Fri': 20, 'Thu': 6, 'Sat': 1}

</pre></div>

<div align="left">Exercise 3

Write a program to read through a mail log, and

build a histogram using a dictionary to count how many

messages have come from each email address

and print the dictionary.

<pre>

Enter file name: mbox-short.txt

{'gopal.ramasammycook@gmail.com': 1, 'louis@media.berkeley.edu': 3,

'cwen@iupui.edu': 5, 'antranig@caret.cam.ac.uk': 1,

'rjlowe@iupui.edu': 2, 'gsilver@umich.edu': 3,

'david.horwitz@uct.ac.za': 4, 'wagnermr@iupui.edu': 1,

'zqian@umich.edu': 4, 'stephen.marquard@uct.ac.za': 2,

'ray@media.berkeley.edu': 1}

</pre></div>

<div align="left">Exercise 4

Add code to the above program to figure out who has the

most messages in the file.

After all the data has been read and the dictionary has been

created, look through the dictionary using a

maximum loop

(see Section <a href="cfbook006.html#maximumloop">??</a>)

to find who has the most

messages and print how many messages the person has.

<pre>

Enter a file name: mbox-short.txt

cwen@iupui.edu 5

Enter a file name: mbox.txt

zqian@umich.edu 195

</pre></div>

<div align="left">Exercise 5

This program records the domain name (instead of the address)

where the message was sent from instead of who the mail

came from (i.e. the whole e-mail address). At the end

of the program print out the contents of your dictionary.

<pre>

python schoolcount.py

Enter a file name: mbox-short.txt

{'media.berkeley.edu': 4, 'uct.ac.za': 6, 'umich.edu': 7,

'gmail.com': 1, 'caret.cam.ac.uk': 1, 'iupui.edu': 8}

</pre></div>

<hr />

</body>

</html>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

cfbook010.html

Latest commit

History

cfbook010.html

File metadata and controls