<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="hevea 2.09" />
<link rel="stylesheet" type="text/css" href="book.css" />
<title>Automating common tasks on your computer</title>
</head>
<body>
<a href="book016.html"><img src="previous_motif.gif" alt="Previous" /></a>
<a href="index.html"><img src="contents_motif.gif" alt="Up" /></a>
<a href="book018.html"><img src="next_motif.gif" alt="Next" /></a>
<hr />
<h1 class="chapter" id="sec190"><span class="c006">Chapter 16  Automating common tasks on your computer</span></h1>
<p><span class="c006">We have been reading data from files, networks, services,
and databases. Python can also go through all of the
directories and folders on your computer and read those files
as well.</span></p><p><span class="c006">In this chapter, we will write programs that scan
through your computer and
perform some operation on each file.
Files are organized into directories (also called “folders”).
Simple Python scripts
can make short work of simple tasks that must be done to
hundreds or thousands of files
spread across a directory tree or your entire computer.</span></p><p><span class="c006">To walk through all the directories and files in a tree we use
<span class="c001">os.walk</span> and a <span class="c001">for</span> loop. This is similar to how
<span class="c001">open</span> allows us to write a loop to read the contents of a file,
<span class="c001">socket</span> allows us to write a loop to read the contents of a network connection, and
<span class="c001">urllib</span> allows us to open a web document and loop through its contents.</span></p><span class="c005">
</span><h2 class="section" id="sec191"><span class="c006">16.1  File names and paths</span></h2>
<p><span class="c005">
</span><a id="paths"></a></p><p><a id="hevea_default812"></a><span class="c005">
</span><a id="hevea_default813"></a><span class="c005">
</span><a id="hevea_default814"></a><span class="c005">
</span><a id="hevea_default815"></a></p><p><span class="c006">Every running program has a “current directory,” which is the
default directory for most operations. For example, when you open a file
for reading, Python looks for it in the current directory.</span></p><p><a id="hevea_default816"></a><span class="c005">
</span><a id="hevea_default817"></a></p><p><span class="c006">The <span class="c001">os</span> module provides functions for working with files and
directories (<span class="c001">os</span> stands for “operating system”). <span class="c001">os.getcwd</span>
returns the name of the current directory:</span></p><p><a id="hevea_default818"></a><span class="c005">
</span><a id="hevea_default819"></a></p><pre class="verbatim"><span class="c004">>>> import os
>>> cwd = os.getcwd()
>>> print cwd
/Users/csev
</span></pre><p><span class="c006"><span class="c001">cwd</span> stands for <span class="c009">current working directory</span>. The result in
this example is <span class="c001">/Users/csev</span>, which is the home directory of a
user named <span class="c001">csev</span>.</span></p><p><a id="hevea_default820"></a><span class="c005">
</span><a id="hevea_default821"></a></p><p><span class="c006">A string like <span class="c001">cwd</span> that identifies a file is called a path.
A <span class="c009">relative path</span> starts from the current directory;
an <span class="c009">absolute path</span> starts from the topmost directory in the
file system.</span></p><p><a id="hevea_default822"></a><span class="c005">
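</span></p><p><span class="c006">As a small illustration of the difference between the two kinds of path, the following sketch asks Python whether a path is absolute. It uses <span class="c001">os.path.isabs</span>, a standard library function that is not otherwise covered in this chapter. The helper name <span class="c001">describe</span> and the file name <span class="c001">memo.txt</span> are made up for the example, and the <span class="c001">print</span> calls are written so that they should behave the same in Python 2 and Python 3.</span></p>

```python
import os

def describe(path):
    """Return 'absolute' or 'relative' for the given path string."""
    # os.path.isabs only inspects the string; the file need not exist.
    if os.path.isabs(path):
        return 'absolute'
    return 'relative'

# The current working directory is always reported as an absolute path.
print('%s is %s' % (os.getcwd(), describe(os.getcwd())))
# A bare file name (hypothetical here) is relative to that directory.
print('%s is %s' % ('memo.txt', describe('memo.txt')))
```

<p><span class="c005">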
</span><a id="hevea_default823"></a><span class="c005">
</span><a id="hevea_default824"></a><span class="c005">
</span><a id="hevea_default825"></a></p><p><span class="c006">The paths we have seen so far are simple file names, so they are
relative to the current directory. To find the absolute path to
a file, you can use <span class="c001">os.path.abspath</span>:</span></p><pre class="verbatim"><span class="c004">>>> os.path.abspath('memo.txt')
'/Users/csev/memo.txt'
</span></pre><p><span class="c006"><span class="c001">os.path.exists</span> checks
whether a file or directory exists:</span></p><p><a id="hevea_default826"></a><span class="c005">
</span><a id="hevea_default827"></a></p><pre class="verbatim"><span class="c004">>>> os.path.exists('memo.txt')
True
</span></pre><p><span class="c006">If it exists, <span class="c001">os.path.isdir</span> checks whether it’s a directory:</span></p><pre class="verbatim"><span class="c004">>>> os.path.isdir('memo.txt')
False
>>> os.path.isdir('music')
True
</span></pre><p><span class="c006">Similarly, <span class="c001">os.path.isfile</span> checks whether it’s a file.</span></p><p><span class="c006"><span class="c001">os.listdir</span> returns a list of the files (and other directories)
in the given directory:</span></p><pre class="verbatim"><span class="c004">>>> os.listdir(cwd)
['music', 'photos', 'memo.txt']
</span></pre><span class="c005">
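</span><p><span class="c006">The functions in this section can be combined. The sketch below, offered only as one possible arrangement, sorts the entries of a directory into subdirectories and plain files. The function name <span class="c001">classify</span> is hypothetical, and the sketch borrows <span class="c001">os.path.join</span> (explained in the next section) because <span class="c001">os.listdir</span> returns bare names rather than full paths. The <span class="c001">print</span> calls are written so that they should work in both Python 2 and Python 3.</span></p>

```python
import os

def classify(directory):
    """Return (subdirectories, files) found directly inside directory."""
    subdirs = []
    files = []
    for name in sorted(os.listdir(directory)):
        # listdir returns bare names, so rebuild a path before testing it.
        path = os.path.join(directory, name)
        if os.path.isdir(path):
            subdirs.append(name)
        elif os.path.isfile(path):
            files.append(name)
    return subdirs, files

subdirs, files = classify(os.getcwd())
print('Directories: %s' % subdirs)
print('Files: %s' % files)
```

<span class="c005">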
</span><h2 class="section" id="sec192"><span class="c006">16.2  Example: Cleaning up a photo directory</span></h2>
<p><span class="c006">Some time ago, I built a bit of Flickr-like software that
received photos from my cell phone and stored those photos
on my server. I wrote this before Flickr existed and kept
using it after Flickr existed because I wanted to keep original
copies of my images forever.</span></p><p><span class="c006">I would also send a simple one-line text description in the MMS message
or the subject line of the email message. I stored these messages
in a text file in the same directory as the image file. I came up
with a directory structure based on the year, month, day, and time the
photo was taken. The following would be an example of the naming for
one photo and its existing description:</span></p><pre class="verbatim"><span class="c004">./2006/03/24-03-06_2018002.jpg
./2006/03/24-03-06_2018002.txt
</span></pre><p><span class="c006">After seven years, I had a lot of photos and captions. Over the years
as I switched cell phones, sometimes my code to extract the caption from the message
would break and add a bunch of useless data on my server instead of a caption. </span></p><p><span class="c006">I wanted to go through these files and figure out which of the
text files were really captions and which were junk and then delete the bad
files. The first thing to do was to get a simple inventory of
how many text files I had in one of the subfolders
using the following program:</span></p><pre class="verbatim"><span class="c004">import os
count = 0
for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.txt') :
            count = count + 1
print 'Files:', count

python txtcount.py
Files: 1917
</span></pre><p><span class="c006">The key bit of code that makes this possible is the <span class="c001">os.walk</span>
function in Python. When we call <span class="c001">os.walk</span> and give it a starting
directory, it will “walk” through all of the directories
and subdirectories recursively. The string “.” indicates
to start in the current directory and walk downward.
As it encounters each directory, we get three values in a tuple in the
body of the <span class="c001">for</span> loop. The first value is the current
directory name, the second value is the list of subdirectories
in the current directory, and the third value is a list of files
in the current directory.</span></p><p><span class="c006">We do not have to explicitly look into each of the subdirectories
because we can count on <span class="c001">os.walk</span> to visit every
folder eventually. But we do want to look at each file, so
we write a simple <span class="c001">for</span> loop to examine each of the files
in the current directory. We check each file to see if
it ends with “.txt” and then count the number of
files through the whole directory tree that end with the
suffix “.txt”.</span></p><p><span class="c006">Once we have a sense of how many files end with “.txt”, the next
thing to do is try to automatically
determine in Python which files are bad and which files
are good. So we write a simple program to print out the
files and the size of each file:</span></p><pre class="verbatim"><span class="c004">import os
from os.path import join
for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.txt') :
            thefile = os.path.join(dirname,filename)
            print os.path.getsize(thefile), thefile
</span></pre><p><span class="c006">Now instead of just counting the files, we create
a file name by concatenating the directory name with
the name of the file within the directory using
<span class="c001">os.path.join</span>. It is important to use
<span class="c001">os.path.join</span> instead of string concatenation
because on Windows we use a backslash
(<code>\</code>) to construct file paths and on Linux
or macOS we use a forward slash (<code>/</code>)
to construct file paths. The <span class="c001">os.path.join</span>
function knows these differences and knows what system
we are running on, so it does the proper concatenation
for that system. The same Python code therefore
runs on either Windows or Unix-style systems.</span></p><p><span class="c006">Once we have the full file name with directory
path, we use the <span class="c001">os.path.getsize</span> utility
to get the size and print it out, producing the
following output:</span></p><pre class="verbatim"><span class="c004">python txtsize.py
...
18 ./2006/03/24-03-06_2303002.txt
22 ./2006/03/25-03-06_1340001.txt
22 ./2006/03/25-03-06_2034001.txt
...
2565 ./2005/09/28-09-05_1043004.txt
2565 ./2005/09/28-09-05_1141002.txt
...
2578 ./2006/03/27-03-06_1618001.txt
2578 ./2006/03/28-03-06_2109001.txt
2578 ./2006/03/29-03-06_1355001.txt
...
</span></pre><p><span class="c006">Scanning the output, we notice that some files are pretty short and
a lot of the files are pretty large and the same size (2578 and 2565).
When we take a look at a few of these larger files by hand,
it looks like the large
files are nothing but a generic bit of identical HTML that came
in from mail sent to my system from my T-Mobile phone:</span></p><pre class="verbatim"><span class="c004"><html>
<head>
<title>T-Mobile</title>
...
</span></pre><p><span class="c006">Skimming through the file, it looks like there is no good information
in these files so we can probably delete them.</span></p><p><span class="c006">But before we delete the files, we will write a program to look for files
that are more than one line long and show the contents of the file.
We will not bother showing ourselves those files that are exactly
2578 or 2565 characters long since we know that these files have no useful
information.</span></p><p><span class="c006">So we write the following program:</span></p><pre class="verbatim"><span class="c004">import os
from os.path import join
for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.txt') :
            thefile = os.path.join(dirname,filename)
            size = os.path.getsize(thefile)
            if size == 2578 or size == 2565:
                continue
            fhand = open(thefile,'r')
            lines = list()
            for line in fhand:
                lines.append(line)
            fhand.close()
            if len(lines) > 1:
                print len(lines), thefile
                print lines[:4]
</span></pre><p><span class="c006">We use a <span class="c001">continue</span> to skip files with the two
“bad sizes”, then open the rest of the files
and read the lines of each file into a Python list.
If the file has more than one line, we print
out how many lines are in the file and print out
up to the first four lines.</span></p><p><span class="c006">After filtering out those two bad file sizes, and assuming
that all one-line files are correct, we are down to some pretty clean
data:</span></p><pre class="verbatim"><span class="c004">python txtcheck.py
3 ./2004/03/22-03-04_2015.txt
['Little horse rider\r\n', '\r\n', '\r']
2 ./2004/11/30-11-04_1834001.txt
['Testing 123.\n', '\n']
3 ./2007/09/15-09-07_074202_03.txt
['\r\n', '\r\n', 'Sent from my iPhone\r\n']
3 ./2007/09/19-09-07_124857_01.txt
['\r\n', '\r\n', 'Sent from my iPhone\r\n']
3 ./2007/09/20-09-07_115617_01.txt
...
</span></pre><p><span class="c006">But there is one more annoying pattern of files:
there are some three-line files that consist of
two blank lines followed by a line that says
“Sent from my iPhone” that have slipped
into my data. So we make the following change
to the program to deal with these files as well.</span></p><pre class="verbatim"><span class="c004">            lines = list()
            for line in fhand:
                lines.append(line)
            if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
                continue
            if len(lines) > 1:
                print len(lines), thefile
                print lines[:4]
</span></pre><p><span class="c006">We simply check if we have a three-line file, and if the third
line starts with the specified text, we skip it.</span></p><p><span class="c006">Now when we run the program, we only see four remaining
multi-line files and all of those files look pretty reasonable:</span></p><pre class="verbatim"><span class="c004">python txtcheck2.py
3 ./2004/03/22-03-04_2015.txt
['Little horse rider\r\n', '\r\n', '\r']
2 ./2004/11/30-11-04_1834001.txt
['Testing 123.\n', '\n']
2 ./2006/03/17-03-06_1806001.txt
['On the road again...\r\n', '\r\n']
2 ./2006/03/24-03-06_1740001.txt
['On the road again...\r\n', '\r\n']
</span></pre><p><span class="c006">If you look at the overall pattern of this program,
we have successively refined how we accept or reject
files and once we found a pattern that was “bad” we used
<span class="c001">continue</span> to skip the bad files so we could refine
our code to find more file patterns that were bad.</span></p><p><span class="c006">Now we are getting ready to delete the files, so
we are going to flip the logic and instead of printing out
the remaining good files, we will print out the “bad”
files that we are about to delete.</span></p><pre class="verbatim"><span class="c004">import os
from os.path import join
for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.txt') :
            thefile = os.path.join(dirname,filename)
            size = os.path.getsize(thefile)
            if size == 2578 or size == 2565:
                print 'T-Mobile:',thefile
                continue
            fhand = open(thefile,'r')
            lines = list()
            for line in fhand:
                lines.append(line)
            fhand.close()
            if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
                print 'iPhone:', thefile
                continue
</span></pre><p><span class="c006">We can now see a list of candidate files that we are about
to delete and why these files are up for deletion.
The program produces the following output:</span></p><pre class="verbatim"><span class="c004">python txtcheck3.py
...
T-Mobile: ./2006/05/31-05-06_1540001.txt
T-Mobile: ./2006/05/31-05-06_1648001.txt
iPhone: ./2007/09/15-09-07_074202_03.txt
iPhone: ./2007/09/15-09-07_144641_01.txt
iPhone: ./2007/09/19-09-07_124857_01.txt
...
</span></pre><p><span class="c006">We can spot-check these files to make sure that we did not inadvertently
end up introducing a bug in our program or perhaps our logic
caught some files we did not want to catch.</span></p><p><span class="c006">Once we are satisfied that this is the list of files we want to delete,
we make the following change to the program:</span></p><pre class="verbatim"><span class="c004">            if size == 2578 or size == 2565:
                print 'T-Mobile:',thefile
                os.remove(thefile)
                continue
            ...
            if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
                print 'iPhone:', thefile
                os.remove(thefile)
                continue
</span></pre><p><span class="c006">In this version of the program, we both print the name of each bad file
and remove it
using <span class="c001">os.remove</span>.</span></p><pre class="verbatim"><span class="c004">python txtdelete.py
T-Mobile: ./2005/01/02-01-05_1356001.txt
T-Mobile: ./2005/01/02-01-05_1858001.txt
...
</span></pre><p><span class="c006">Just for fun, run the program a second time and it will produce no output
since the bad files are already gone.</span></p><p><span class="c006">If we rerun <span class="c001">txtcount.py</span> we can see that we have removed
899 bad files:
</span></p><pre class="verbatim"><span class="c004">python txtcount.py
Files: 1018
</span></pre><p><span class="c006">In this section, we have followed a sequence where we use Python
to first look through directories and files seeking
patterns. We slowly use Python to help determine what we
want to do to clean up our directories. Once we
figure out which files are good and which files are
not useful, we use Python to delete the files and
perform the cleanup.</span></p><p><span class="c006">The problem you may need to solve can either be quite simple
and might only depend on looking at the names of files,
or perhaps you need to read every single file and
look for patterns within the files. Sometimes
you will need to read all the files and make a change
to some of the files. All of these are pretty
straightforward once you understand how <span class="c001">os.walk</span>
and the other <span class="c001">os</span> utilities can be used.</span></p><span class="c005">
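</span><p><span class="c006">The sequence followed in this section (find candidates, review them, then delete) can be gathered into a single sketch. This is only one way to arrange it, not the program used above: the names <span class="c001">BAD_SIZES</span>, <span class="c001">find_bad_files</span>, and <span class="c001">cleanup</span>, and the <span class="c001">delete</span> parameter, are invented for the illustration, while the two file sizes and the iPhone test come from the text. The <span class="c001">print</span> call is written so that it should work in both Python 2 and Python 3.</span></p>

```python
import os

# Sizes of the identical junk files identified in this section.
BAD_SIZES = (2578, 2565)

def find_bad_files(top):
    """Walk the tree under top and return (reason, path) pairs for junk .txt files."""
    bad = []
    for (dirname, dirs, files) in os.walk(top):
        for filename in files:
            if not filename.endswith('.txt'):
                continue
            thefile = os.path.join(dirname, filename)
            if os.path.getsize(thefile) in BAD_SIZES:
                bad.append(('T-Mobile', thefile))
                continue
            fhand = open(thefile, 'r')
            lines = fhand.readlines()
            fhand.close()
            if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
                bad.append(('iPhone', thefile))
    return bad

def cleanup(top, delete=False):
    """Report junk files under top; remove them only when delete is True."""
    bad = find_bad_files(top)
    for (reason, thefile) in bad:
        print('%s: %s' % (reason, thefile))
        if delete:
            os.remove(thefile)
    return len(bad)
```

<p><span class="c006">With these definitions, <span class="c001">cleanup('.')</span> only lists the candidate files, which leaves room for a spot-check, and <span class="c001">cleanup('.', delete=True)</span> lists and removes them.</span></p><span class="c005">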
</span><h2 class="section" id="sec193"><span class="c006">16.3  Command-line arguments</span></h2>
<p><a id="hevea_default828"></a></p><p><span class="c006">In earlier chapters, we had a number of programs that prompted
for a file name using <code>raw_input</code> and then read data
from the file and processed the data as follows:</span></p><pre class="verbatim"><span class="c004">name = raw_input('Enter file:')
handle = open(name, 'r')
text = handle.read()
...
</span></pre><p><span class="c006">We can simplify this program a bit by taking the file name
from the command line when we start Python. Up to now,
we simply run our Python programs and respond to the
prompts as follows:</span></p><pre class="verbatim"><span class="c004">python words.py
Enter file: mbox-short.txt
...
</span></pre><p><span class="c006">We can place additional strings after the Python file and access
those <span class="c009">command-line arguments</span> in our Python program. Here is a simple program
that demonstrates reading arguments from the command line:</span></p><pre class="verbatim"><span class="c004">import sys
print 'Count:', len(sys.argv)
print 'Type:', type(sys.argv)
for arg in sys.argv:
    print 'Argument:', arg
</span></pre><p><span class="c006">The contents of <span class="c001">sys.argv</span> are a list of strings where the first string
is the name of the Python program and the remaining strings are the arguments
on the command line after the Python file.</span></p><p><span class="c006">The following shows our program reading several command-line arguments from the command
line:</span></p><pre class="verbatim"><span class="c004">python argtest.py hello there
Count: 3
Type: <type 'list'>
Argument: argtest.py
Argument: hello
Argument: there
</span></pre><p><span class="c006">Three arguments are passed into our program as a three-element list.
The first element of the list is the file name (argtest.py) and the others are
the two command-line arguments after the file name.</span></p><p><span class="c006">We can rewrite our program to read the file, taking the file name
from the command-line argument as follows:</span></p><pre class="verbatim"><span class="c004">import sys
name = sys.argv[1]
handle = open(name, 'r')
text = handle.read()
print name, 'is', len(text), 'bytes'
</span></pre><p><span class="c006">We take the second command-line argument as the name of the file (skipping past
the program name in the <span class="c001">[0]</span> entry). We open the file and read
the contents as follows:</span></p><pre class="verbatim"><span class="c004">python argfile.py mbox-short.txt
mbox-short.txt is 94626 bytes
</span></pre><p><span class="c006">Using command-line arguments as input can make it easier to reuse your Python programs,
especially when you only need to input one or two strings.</span></p><span class="c005">
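</span><p><span class="c006">A program that reads <span class="c001">sys.argv[1]</span> directly fails with an <span class="c001">IndexError</span> when it is started without an argument. The sketch below shows one possible way to guard against that by checking the length of the list first. The function name <span class="c001">file_size_report</span> and the wording of its messages are made up for the example, and the <span class="c001">print</span> call is written so that it should work in both Python 2 and Python 3.</span></p>

```python
import sys

def file_size_report(argv):
    """Return a one-line report for the file named in argv[1], or a hint."""
    # argv[0] is the program name, so exactly one argument means two entries.
    if len(argv) != 2:
        return 'Usage: python %s filename' % argv[0]
    name = argv[1]
    try:
        handle = open(name, 'r')
    except IOError:
        return 'Cannot open %s' % name
    text = handle.read()
    handle.close()
    return '%s is %d characters' % (name, len(text))

print(file_size_report(sys.argv))
```

<span class="c005">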
</span><h2 class="section" id="sec194"><span class="c006">16.4  Pipes</span></h2>
<p><a id="hevea_default829"></a><span class="c005">
</span><a id="hevea_default830"></a></p><p><span class="c006">Most operating systems provide a command-line interface,
also known as a <span class="c009">shell</span>. Shells usually provide commands
to navigate the file system and launch applications. For
example, in Unix, you can change directories with <span class="c001">cd</span>,
display the contents of a directory with <span class="c001">ls</span>, and launch
a web browser by typing (for example) <span class="c001">firefox</span>.</span></p><p><a id="hevea_default831"></a><span class="c005">
</span><a id="hevea_default832"></a></p><p><span class="c006">Any program that you can launch from the shell can also be
launched from Python using a <span class="c009">pipe</span>. A pipe is an object
that represents a running process.</span></p><p><span class="c006">For example, the Unix command</span><sup><a id="text17" href="#note17"><span class="c006">1</span></a></sup><span class="c006">
<span class="c001">ls -l</span> normally displays the
contents of the current directory (in long format). You can
launch <span class="c001">ls</span> with <span class="c001">os.popen</span>:</span></p><p><a id="hevea_default833"></a><span class="c005">
</span><a id="hevea_default834"></a></p><pre class="verbatim"><span class="c004">>>> cmd = 'ls -l'
>>> fp = os.popen(cmd)
</span></pre><p><span class="c006">The argument is a string that contains a shell command. The
return value is a file pointer that behaves just like an open
file. You can read the output from the <span class="c001">ls</span> process one
line at a time with <span class="c001">readline</span> or get the whole thing at
once with <span class="c001">read</span>:</span></p><p><a id="hevea_default835"></a><span class="c005">
</span><a id="hevea_default836"></a><span class="c005">
</span><a id="hevea_default837"></a><span class="c005">
</span><a id="hevea_default838"></a></p><pre class="verbatim"><span class="c004">>>> res = fp.read()
</span></pre><p><span class="c006">When you are done, you close the pipe like a file:</span></p><p><a id="hevea_default839"></a><span class="c005">
</span><a id="hevea_default840"></a></p><pre class="verbatim"><span class="c004">>>> stat = fp.close()
>>> print stat
None
</span></pre><p><span class="c006">The return value is the final status of the <span class="c001">ls</span> process;
<span class="c001">None</span> means that it ended normally (with no errors).</span></p><span class="c005">
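</span><p><span class="c006">Because the pipe behaves like an open file, it can also be read with a <span class="c001">for</span> loop. The following sketch wraps the open, read, and close steps in one function. The name <span class="c001">run_command</span> is hypothetical, and the example assumes a shell that provides an <span class="c001">echo</span> command. The <span class="c001">print</span> calls are written so that they should work in both Python 2 and Python 3.</span></p>

```python
import os

def run_command(cmd):
    """Run a shell command through a pipe; return (output lines, close status)."""
    fp = os.popen(cmd)
    lines = []
    for line in fp:
        # Drop the trailing newline from each line of output.
        lines.append(line.rstrip())
    status = fp.close()
    return lines, status

# 'echo hello' is an assumed command; substitute one your system supports.
lines, status = run_command('echo hello')
print('Output: %s' % lines)
print('Status: %s' % status)
```

<span class="c005">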
</span><h2 class="section" id="sec195"><span class="c006">16.5  Glossary</span></h2>
<dl class="description"><dt class="dt-description"><span class="c010">absolute path:</span></dt><dd class="dd-description"><span class="c006"> A string that describes where a file or
directory is stored that starts at the “top of the tree of directories”
so that it can be used to access the file or directory, regardless
of the current working directory.
</span><a id="hevea_default841"></a></dd><dt class="dt-description"><span class="c010">checksum:</span></dt><dd class="dd-description"><span class="c006"> See also <span class="c009">hashing</span>. The term “checksum”
comes from the need to verify if data was garbled as it was
sent across a network or written to a backup medium and then
read back in. When the data is written or sent, the sending system
computes a checksum and also sends the checksum. When the
data is read or received, the receiving system re-computes
the checksum from the received data and compares it to the
received checksum. If the checksums do not match, we must
assume that the data was garbled as it was transferred.
</span><a id="hevea_default842"></a></dd><dt class="dt-description"><span class="c010">command-line argument:</span></dt><dd class="dd-description"><span class="c006"> Parameters on the command line after the Python file name.</span></dd><dt class="dt-description"><span class="c010">current working directory:</span></dt><dd class="dd-description"><span class="c006"> The current directory that you
are “in”. You can change your working directory using the
<span class="c001">cd</span> command on most systems in their command-line interfaces.
When you open a file in Python using just the file name with no path
information, the file must be in the current working directory
where you are running the program.
</span><a id="hevea_default843"></a><span class="c005">
</span><a id="hevea_default844"></a><span class="c005">
</span><a id="hevea_default845"></a></dd><dt class="dt-description"><span class="c010">hashing:</span></dt><dd class="dd-description"><span class="c006"> Reading through a potentially large amount of data
and producing a short checksum that is very likely to be unique to that data. The best hash functions
produce very few “collisions” where you can give two different
streams of data to the hash function and get back the same hash.
MD5, SHA1, and SHA256 are examples of commonly used hash functions.
</span><a id="hevea_default846"></a></dd><dt class="dt-description"><span class="c010">pipe:</span></dt><dd class="dd-description"><span class="c006"> A pipe is a connection to a running program. Using
a pipe, you can write a program to send data to another program
or receive data from that program. A pipe is similar to a
<span class="c009">socket</span> except that a pipe can only be used to
connect programs running on the same computer (i.e., not
across a network).
</span><a id="hevea_default847"></a></dd><dt class="dt-description"><span class="c010">relative path:</span></dt><dd class="dd-description"><span class="c006"> A string that describes where a file or
directory is stored relative to the current working
directory.
</span><a id="hevea_default848"></a></dd><dt class="dt-description"><span class="c010">shell:</span></dt><dd class="dd-description"><span class="c006"> A command-line interface to an operating system.
Also called a “terminal program” in some systems. In this interface
you type a command and parameters on a line and press “enter”
to execute the command.
</span><a id="hevea_default849"></a></dd><dt class="dt-description"><span class="c010">walk:</span></dt><dd class="dd-description"><span class="c006"> A term we use to describe the notion of visiting
the entire tree of directories, sub-directories, sub-sub-directories,
until we have visited all of the directories. We call this
“walking the directory tree”.
</span><a id="hevea_default850"></a></dd></dl><span class="c005">
</span><h2 class="section" id="sec196"><span class="c006">16.6  Exercises</span></h2>
<div class="theorem"><span class="c006"><span class="c009">Exercise 1</span>  
</span><a id="checksum"></a><p><a id="hevea_default851"></a></p><p><span class="c006"><em>In a large collection of MP3 files there may be more than one
copy of the same song, stored in different directories or with
different file names. The goal of this exercise is to search for
these duplicates.</em></span></p><ol class="enumerate" type="1"><li class="li-enumerate"><span class="c006"><em>Write a program that walks a directory and all of its
subdirectories for all files with a given suffix (like <span class="c001">.mp3</span>)
and lists pairs of files that are the same size.
Hint: Use a dictionary where the key of the dictionary is the size
of the file from <span class="c001">os.path.getsize</span> and the value in the
dictionary is the path name concatenated with the file name.
As you encounter each file, check to see if you already have a
file that has the same size as the current file. If so, you have a
duplicate size file, so print out the file size and the two file names
(one from the dictionary and the other the file you are looking at).</em></span><p><a id="hevea_default852"></a><span class="c005">
</span><a id="hevea_default853"></a><span class="c005">
</span><a id="hevea_default854"></a><span class="c005">
</span><a id="hevea_default855"></a></p></li><li class="li-enumerate"><span class="c006"><em>Adapt the previous program to look for files that
have duplicate content using a hashing or <span class="c009">checksum</span>
algorithm. For example,
MD5 (Message-Digest algorithm 5) takes an arbitrarily-long
“message” and returns a 128-bit “checksum”. The probability
is very small that two files with different contents will
return the same checksum.</em></span><p><span class="c006"><em>You can read about MD5 at <span class="c001">wikipedia.org/wiki/Md5</span>. The
following code snippet opens a file, reads it, and computes
its checksum.</em></span></p><pre class="verbatim"><span class="c004"><em>import hashlib
...
fhand = open(thefile,'rb')
data = fhand.read()
fhand.close()
checksum = hashlib.md5(data).hexdigest()
</em></span></pre><p><span class="c006"><em>You should create a dictionary where the checksum is the key
and the file name is the value. When you compute a checksum
and it is already in the dictionary as a key, you have two files with
duplicate content, so print out the file in the dictionary
and the file you just read. Here is some sample output
from a run in a folder of image files:</em></span></p><pre class="verbatim"><span class="c004"><em>./2004/11/15-11-04_0923001.jpg ./2004/11/15-11-04_1016001.jpg
./2005/06/28-06-05_1500001.jpg ./2005/06/28-06-05_1502001.jpg
./2006/08/11-08-06_205948_01.jpg ./2006/08/12-08-06_155318_02.jpg
</em></span></pre><p><span class="c006"><em>Apparently I sometimes sent the same photo more than once
or made a copy of a photo from time to time without deleting
the original.</em></span></p></li></ol></div><span class="c005">
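</span><p><span class="c006">The checksum snippet in the exercise reads a whole file into memory at once. For large files, one alternative is to feed the file to the hash function a piece at a time. The sketch below shows that variation, not a solution to the exercise: the function name <span class="c001">file_checksum</span> and the <span class="c001">chunk_size</span> default are assumptions chosen for illustration, while <span class="c001">hashlib.md5</span>, <span class="c001">update</span>, and <span class="c001">hexdigest</span> are part of the standard <span class="c001">hashlib</span> module.</span></p>

```python
import hashlib

def file_checksum(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it one chunk at a time."""
    md5 = hashlib.md5()
    # Binary mode, so the raw bytes are hashed on every operating system.
    fhand = open(path, 'rb')
    while True:
        chunk = fhand.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
    fhand.close()
    return md5.hexdigest()
```

<p><span class="c006">The value returned by <span class="c001">file_checksum</span> can be used as the dictionary key in the same way as the checksum computed in the exercise.</span></p><span class="c005">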
</span><hr class="footnoterule" /><dl class="thefootnotes"><dt class="dt-thefootnotes"><span class="c005">
</span><a id="note17" href="#text17"><span class="c006">1</span></a></dt><dd class="dd-thefootnotes"><span class="c006"><div class="footnotetext">When using pipes to talk
to operating system commands like <span class="c001">ls</span>, it is important
for you to know which operating system you are using and only
open pipes to commands that are supported on your operating system.</div>
</span></dd></dl>
</body>
</html>