<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"

"http://www.w3.org/TR/REC-html40/loose.dtd">

<html>

<head>

<title>

Automating common tasks on your computer

</title>

</head>

<body>

<hr />

<h1><a name="htoc189">Chapter 16</a>  Automating common tasks on your computer</h1>

We have been reading data from files, networks, services,

and databases. Python can also go through all of the

directories and folders on your computer and read those files
as well.
In this chapter, we will write programs that
scan through your computer and

perform some operation on each file.

Files are organized into directories (also called "folders").

Simple Python scripts

can make short work of simple tasks that must be done to

hundreds or thousands of files

spread across a directory tree or your entire computer.

To walk through all the directories and files in a tree we use

<tt>os.walk</tt> and a <tt>for</tt> loop. This is similar to how

<tt>open</tt> allows us to write a loop to read the contents of a file,

<tt>socket</tt> allows us to write a loop to read the contents of a network connection, and

<tt>urllib</tt> allows us to open a web document and loop through its contents.

<h2><a name="htoc190">16.1</a>  File names and paths</h2>

Every running program has a "current directory," which is the

default directory for most operations.

For example, when you open a file for reading, Python looks for it in the

current directory.

The <tt>os</tt> module provides functions for working with files and

directories (<tt>os</tt> stands for "operating system"). <tt>os.getcwd</tt>

returns the name of the current directory:

<pre>

>>> import os

>>> cwd = os.getcwd()

>>> print cwd

/Users/csev

</pre><tt>cwd</tt> stands for current working directory. The result in

this example is <tt>/Users/csev</tt>, which is the home directory of a

user named <tt>csev</tt>.

A string like <tt>cwd</tt> that identifies a file is called a path.

A relative path starts from the current directory;

an absolute path starts from the topmost directory in the

file system.

The paths we have seen so far are simple file names, so they are

relative to the current directory. To find the absolute path to

a file, you can use <tt>os.path.abspath</tt>:

<pre>

>>> os.path.abspath('memo.txt')

'/Users/csev/memo.txt'

</pre><tt>os.path.exists</tt> checks

whether a file or directory exists:

<pre>

>>> os.path.exists('memo.txt')

True

</pre>If it exists, <tt>os.path.isdir</tt> checks whether it's a directory:

<pre>

>>> os.path.isdir('memo.txt')

False

>>> os.path.isdir('music')

True

</pre>Similarly, <tt>os.path.isfile</tt> checks whether it's a file.

<tt>os.listdir</tt> returns a list of the files (and other directories)

in the given directory:

<pre>

>>> os.listdir(cwd)

['music', 'photos', 'memo.txt']

</pre>
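The functions in this section can be combined into a small helper that reports what each entry in a directory is. The following is a minimal sketch rather than one of the chapter's own programs: the function name <tt>describe_entries</tt> is hypothetical, it relies on <tt>os.path.join</tt> (explained in the next section) to build each path, and it calls <tt>print</tt> with a single string so that it also runs under Python 3.

```python
import os

def describe_entries(directory):
    """Return (name, kind, absolute_path) for each entry in `directory`."""
    results = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isdir(path):
            kind = 'directory'
        elif os.path.isfile(path):
            kind = 'file'
        else:
            kind = 'other'   # for example, a broken symbolic link
        results.append((name, kind, os.path.abspath(path)))
    return results

if __name__ == '__main__':
    # List the current working directory, as os.listdir(cwd) does above
    for name, kind, full in describe_entries(os.getcwd()):
        print('%-10s %s' % (kind, full))
```

Returning a list instead of printing inside the function makes it easy to reuse the result, for example to keep only the entries whose kind is <tt>'file'</tt>.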

<h2><a name="htoc191">16.2</a>  Example: Cleaning up a photo directory</h2>

Some time ago, I built a bit of Flickr-like software that

received photos from my cell phone and stored those photos

on my server. I wrote this before Flickr existed and kept

using it after Flickr existed because

I wanted to keep original copies of my images forever.

I would also send a simple one-line text description in the MMS message

or the subject line of the e-mail message. I stored these messages

in a text file in the same directory as the image file. I came up

with a directory structure based on the month, year, day and time the

photo was taken. The following would be an example of the naming for

one photo and its existing description:

<pre>

./2006/03/24-03-06_2018002.jpg

./2006/03/24-03-06_2018002.txt

</pre>After seven years, I had a lot of photos and captions. Over the years

as I switched cell phones, sometimes my code to extract the caption from the message

would break and add a bunch of useless data on my server instead of a caption.

I wanted to go through these files and figure out which of the

text files were really captions and which were junk and then delete the bad

files. The first thing to do was to get a simple inventory of

how many text files I had in all of the sub-folders

using the following program:

<pre>

import os

count = 0
# os.walk visits the starting directory and every directory below it
for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.txt') :
            count = count + 1
print 'Files:', count

python txtcount.py
Files: 1917

</pre>The key bit of code that makes this possible is the <tt>os.walk</tt>

function in Python. When we call <tt>os.walk</tt> and give it a starting

directory, it will "walk" through all of the directories

and sub-directories recursively. The string "." indicates

to start in the current directory and walk downward.

As it encounters each directory,

we get three values in a tuple in the body of the <tt>for</tt> loop.

The first value is the current

directory name, the second value is the list of sub-directories

in the current directory, and the third value is a list of files

in the current directory.

We do not have to explicitly look into each of the sub-directories

because we can count on <tt>os.walk</tt> to visit every

folder eventually. But we do want to look at each file, so

we write a simple <tt>for</tt> loop to examine each of the files

in the current directory. We check each file to see if

it ends with ".txt" and then count the number of

files through the whole directory tree that end with the

suffix ".txt".
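To make the three values concrete, the sketch below records what <tt>os.walk</tt> reports for each directory and wraps the counting loop in a function. The names <tt>summarize_tree</tt> and <tt>count_files</tt> are hypothetical, and the code is written so that it also runs under Python 3.

```python
import os

def summarize_tree(top):
    """Map each directory under `top` to the (sub-directories, files) that os.walk reports."""
    summary = {}
    for dirname, dirs, files in os.walk(top):
        # dirname is a path string; dirs and files are lists of bare names
        summary[os.path.relpath(dirname, top)] = (sorted(dirs), sorted(files))
    return summary

def count_files(top, suffix):
    """Count the files below `top` whose names end with `suffix`."""
    count = 0
    for dirname, dirs, files in os.walk(top):
        for filename in files:
            if filename.endswith(suffix):
                count = count + 1
    return count
```

Calling <tt>count_files('.', '.txt')</tt> performs the same count as the program above, and <tt>summarize_tree('.')</tt> shows that the starting directory itself appears under the key <tt>'.'</tt>.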

Once we have a sense of how many files end with ".txt", the next

thing to do is try to automatically

determine in Python which files are bad and which files

are good. So we write a simple program to print out the

files and the size of each file:

<pre>

import os
from os.path import join
for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.txt') :
            thefile = os.path.join(dirname,filename)
            print os.path.getsize(thefile), thefile

</pre>Now instead of just counting the files, we create

a file name by concatenating the directory name with

the name of the file within the directory using

<tt>os.path.join</tt>. It is important to use

<tt>os.path.join</tt> instead of string concatenation

because on Windows we use a backslash

(<code>\</code>) to construct file paths and on Linux

or Apple we use a forward slash (<code>/</code>)

to construct file paths. The <tt>os.path.join</tt> function

knows these differences and knows what system

we are running on and it does the proper concatenation

depending on the system. So the same Python code

runs on either Windows or Unix-style systems.
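One way to see both separators without switching computers is to call the Windows and Unix versions of <tt>join</tt> directly. This sketch assumes the standard library modules <tt>ntpath</tt> and <tt>posixpath</tt>, which hold the two implementations that <tt>os.path</tt> chooses between; the variable names are only illustrative.

```python
import ntpath      # the Windows flavour of os.path
import os
import posixpath   # the Unix-style flavour of os.path

parts = ('2006', '03', '24-03-06_2018002.txt')

windows_style = ntpath.join(*parts)     # 2006\03\24-03-06_2018002.txt
unix_style = posixpath.join(*parts)     # 2006/03/24-03-06_2018002.txt
native_style = os.path.join(*parts)     # whichever of the two matches this machine

print(windows_style)
print(unix_style)
print(native_style)
```

In ordinary programs you would call only <tt>os.path.join</tt>; the other two modules appear here just to show what it produces on each kind of system.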

Once we have the full file name with directory

path, we use the <tt>os.path.getsize</tt> utility

to get the size and print it out, producing the

following output:

<pre>

python txtsize.py

...

18 ./2006/03/24-03-06_2303002.txt

22 ./2006/03/25-03-06_1340001.txt

22 ./2006/03/25-03-06_2034001.txt

...

2565 ./2005/09/28-09-05_1043004.txt

2565 ./2005/09/28-09-05_1141002.txt

...

2578 ./2006/03/27-03-06_1618001.txt

2578 ./2006/03/28-03-06_2109001.txt

2578 ./2006/03/29-03-06_1355001.txt

...

</pre>Scanning the output, we notice that some files are pretty short and

a lot of the files are pretty large and the same size (2578 and 2565).

When we take a look at a few of these larger files by hand,

it looks like the large

files are nothing but a generic bit of identical HTML that came

in from mail sent to my system from my T-Mobile phone:

<pre>

&lt;html&gt;
&lt;head&gt;
&lt;title&gt;T-Mobile&lt;/title&gt;

...

</pre>Skimming through the file, it looks like there is no good information

in these files so we can probably delete them.
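Instead of spotting repeated sizes by eye, the sizes could also be tallied by a program. The sketch below is one possible approach, not part of the original cleanup: the function name <tt>common_sizes</tt> and its parameters are hypothetical, and it uses <tt>collections.Counter</tt> from the standard library.

```python
import os
from collections import Counter

def common_sizes(top, suffix='.txt', min_count=2):
    """Return [(size, how_many)] for sizes shared by at least `min_count` files."""
    tally = Counter()
    for dirname, dirs, files in os.walk(top):
        for filename in files:
            if filename.endswith(suffix):
                size = os.path.getsize(os.path.join(dirname, filename))
                tally[size] += 1
    # most_common() lists the sizes that occur most often first
    return [(size, n) for size, n in tally.most_common() if n >= min_count]
```

A repeated size is only a hint: short genuine captions can share a size too (two of the files listed above are 22 bytes), so the files still need to be inspected by hand.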

But before we delete the files, we will write a program to look for files

that are more than one line long and show the contents of the file.

We will not bother showing ourselves those files that are exactly

2578 or 2565 characters long since we know that these files have no useful

information.

So we write the following program:

<pre>

import os
from os.path import join
for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.txt') :
            thefile = os.path.join(dirname,filename)
            size = os.path.getsize(thefile)
            # skip the two sizes we already know are junk
            if size == 2578 or size == 2565:
                continue
            fhand = open(thefile,'r')
            lines = list()
            for line in fhand:
                lines.append(line)
            fhand.close()
            # only show files that are more than one line long
            if len(lines) > 1:
                print len(lines), thefile
                print lines[:4]

</pre>We use a <tt>continue</tt> to skip files with the two
"bad sizes", then open each remaining file
and read its lines into a Python list.
If the file has more than one line, we print
how many lines it has, followed by
up to the first four lines (<tt>lines[:4]</tt>).

After filtering out those two bad file sizes, and assuming

that all one-line files are correct, we are down to some pretty clean

data:

<pre>

python txtcheck.py

3 ./2004/03/22-03-04_2015.txt

['Little horse rider\r\n', '\r\n', '\r']

2 ./2004/11/30-11-04_1834001.txt

['Testing 123.\n', '\n']

3 ./2007/09/15-09-07_074202_03.txt

['\r\n', '\r\n', 'Sent from my iPhone\r\n']

3 ./2007/09/19-09-07_124857_01.txt

['\r\n', '\r\n', 'Sent from my iPhone\r\n']

3 ./2007/09/20-09-07_115617_01.txt

...

</pre>But there is one more annoying pattern of files:

there are some three-line files that consist of

two blank lines followed by a line that says

"Sent from my iPhone" that have slipped

into my data. So we make the following change

to the program to deal with these files as well.

<pre>

            lines = list()
            for line in fhand:
                lines.append(line)
            fhand.close()
            if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
                continue
            if len(lines) > 1:
                print len(lines), thefile
                print lines[:4]

</pre>We simply check if we have a three-line file, and if the third

line starts with the specified text, we skip it.

Now when we run the program, we only see four remaining

multi-line files and all of those files look pretty reasonable:

<pre>

python txtcheck2.py

3 ./2004/03/22-03-04_2015.txt

['Little horse rider\r\n', '\r\n', '\r']

2 ./2004/11/30-11-04_1834001.txt

['Testing 123.\n', '\n']

2 ./2006/03/17-03-06_1806001.txt

['On the road again...\r\n', '\r\n']

2 ./2006/03/24-03-06_1740001.txt

['On the road again...\r\n', '\r\n']

</pre>If you look at the overall pattern of this program,

we have successively refined how we accept or reject

files and once we found a pattern that was "bad" we used

<tt>continue</tt> to skip the bad files so we could refine

our code to find more file patterns that were bad.
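The checks accumulated so far could also be gathered into a single function that explains why a file is rejected. This is only a sketch of that idea: the function name <tt>reason_to_reject</tt>, the constant names, and the labels it returns are hypothetical, while the sizes and the iPhone text are the ones found above. It is written to run under Python 3 as well.

```python
import os

BAD_SIZES = (2578, 2565)              # sizes of the generic T-Mobile HTML files
IPHONE_TEXT = 'Sent from my iPhone'   # third line of the empty iPhone messages

def reason_to_reject(path):
    """Return a short label if `path` looks like junk, or None if it looks like a caption."""
    if os.path.getsize(path) in BAD_SIZES:
        return 'T-Mobile'
    fhand = open(path, 'r')
    lines = fhand.readlines()
    fhand.close()
    if len(lines) == 3 and lines[2].startswith(IPHONE_TEXT):
        return 'iPhone'
    return None
```

Keeping the rules in one place means a newly discovered "bad" pattern needs only one more <tt>if</tt> statement, and the same function can serve both the checking and the deleting stage.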

Now we are getting ready to delete the files, so

we are going to flip the logic and instead of printing out

the remaining good files, we will print out the "bad"

files that we are about to delete.

<pre>

import os
from os.path import join
for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.txt') :
            thefile = os.path.join(dirname,filename)
            size = os.path.getsize(thefile)
            if size == 2578 or size == 2565:
                print 'T-Mobile:',thefile
                continue
            fhand = open(thefile,'r')
            lines = list()
            for line in fhand:
                lines.append(line)
            fhand.close()
            if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
                print 'iPhone:', thefile
                continue

</pre>We can now see a list of candidate files that we are about

to delete and why these files are up for deletion.

The program produces the following output:

<pre>

python txtcheck3.py

...

T-Mobile: ./2006/05/31-05-06_1540001.txt

T-Mobile: ./2006/05/31-05-06_1648001.txt

iPhone: ./2007/09/15-09-07_074202_03.txt

iPhone: ./2007/09/15-09-07_144641_01.txt

iPhone: ./2007/09/19-09-07_124857_01.txt

...

</pre>We can spot-check these files to make sure that we did not inadvertently

end up introducing a bug in our program or perhaps our logic

caught some files we did not want to catch.

Once we are satisfied that this is the list of files we want to delete,

we make the following change to the program:

<pre>

            if size == 2578 or size == 2565:
                print 'T-Mobile:',thefile
                os.remove(thefile)
                continue
...
            if len(lines) == 3 and lines[2].startswith('Sent from my iPhone'):
                print 'iPhone:', thefile
                os.remove(thefile)
                continue

</pre>In this version of the program, we both print the name of each bad file
and remove it

using <tt>os.remove</tt>.

<pre>

python txtdelete.py

T-Mobile: ./2005/01/02-01-05_1356001.txt

T-Mobile: ./2005/01/02-01-05_1858001.txt

...

</pre>Just for fun, run the program a second time and it will produce no output

since the bad files are already gone.
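Because <tt>os.remove</tt> cannot be undone, one cautious variation is to keep the printing and the deleting in a single function and only delete when explicitly asked. The sketch below assumes a hypothetical function name <tt>remove_files</tt> and a hypothetical <tt>dry_run</tt> parameter; neither comes from the original program.

```python
import os

def remove_files(paths, dry_run=True):
    """Print each path and delete it only when dry_run is False.

    Returns the number of files actually removed.
    """
    removed = 0
    for path in paths:
        if dry_run:
            print('Would remove: ' + path)
        else:
            print('Removing: ' + path)
            os.remove(path)
            removed = removed + 1
    return removed
```

Calling <tt>remove_files(candidates)</tt> reproduces the "print only" stage, and <tt>remove_files(candidates, dry_run=False)</tt> performs the deletion once the list has been spot-checked.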

If we re-run <tt>txtcount.py</tt> we can see that we have removed

899 bad files:

<pre>

python txtcount.py

Files: 1018

</pre>In this section, we have followed a sequence where we use Python

to first look through directories and files seeking

patterns. We slowly use Python to help determine what we

want to do to clean up our directories. Once we

figure out which files are good and which files are

not useful, we use Python to delete the files and

perform the cleanup.

The problem you may need to solve can either be quite simple

and might only depend on looking at the names of files,

or perhaps you need to read every single file and

look for patterns within the files. Sometimes

you will need to read all the files and make a change

to some of the files. All of these are pretty

straightforward once you understand how <tt>os.walk</tt>

and the other <tt>os</tt> utilities can be used.

<h2><a name="htoc192">16.3</a>  Command line arguments</h2>

In earlier chapters, we had a number of programs that prompted

for a file name using <code>raw_input</code> and then read data

from the file and processed the data as follows:

<pre>

name = raw_input('Enter file:')

handle = open(name, 'r')

text = handle.read()

...

</pre>We can simplify this program a bit by taking the file name

from the command line when we start Python. Up to now,

we simply run our Python programs and respond to the

prompts as follows:

<pre>

python words.py

Enter file: mbox-short.txt

...

</pre>We can place additional strings after the Python file and access

those command line arguments in our Python program. Here is a simple program

that demonstrates reading arguments from the command line:

<pre>

import sys
print 'Count:', len(sys.argv)
print 'Type:', type(sys.argv)
for arg in sys.argv:
    print 'Argument:', arg

</pre>The contents of <tt>sys.argv</tt> are a list of strings where the first string

is the name of the Python program and the remaining strings are the arguments

on the command line after the Python file.

The following shows our program reading several arguments from the command
line:

<pre>

python argtest.py hello there
Count: 3
Type: &lt;type 'list'&gt;
Argument: argtest.py
Argument: hello
Argument: there

</pre>Three arguments are passed into our program as a three-element list.

The first element of the list is the file name (argtest.py) and the others are

the two command line arguments after the file name.

We can rewrite our program to read the file, taking the file name

from the command line argument as follows:

<pre>

import sys

name = sys.argv[1]

handle = open(name, 'r')

text = handle.read()

print name, 'is', len(text), 'bytes'

</pre>We take the second entry of <tt>sys.argv</tt> as the name of the file (skipping past

the program name in the <tt>[0]</tt> entry). We open the file and read

the contents as follows:

<pre>

python argfile.py mbox-short.txt

mbox-short.txt is 94626 bytes

</pre>Using command line arguments as input can make it easier to reuse your Python programs

especially when you only need to input one or two strings.
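If the program is started without a file name, <tt>sys.argv[1]</tt> does not exist and Python stops with an <tt>IndexError</tt>. The sketch below shows one way to guard against that. The function name <tt>report_size</tt> and the wording of its messages are hypothetical, and the file is opened in binary mode (an assumption of this sketch) so that the length really is a count of bytes under Python 3 as well.

```python
import sys

def report_size(argv):
    """Build the message for a file named on the command line.

    `argv` is expected to look like sys.argv: the program name first,
    followed by any arguments.
    """
    if len(argv) < 2:
        return 'Usage: python ' + argv[0] + ' FILE'
    name = argv[1]
    try:
        handle = open(name, 'rb')
    except IOError:
        return 'Could not open ' + name
    data = handle.read()
    handle.close()
    return name + ' is ' + str(len(data)) + ' bytes'

if __name__ == '__main__':
    print(report_size(sys.argv))
```

Passing the argument list into the function, rather than reading <tt>sys.argv</tt> inside it, makes the function easy to try from the interactive prompt with a list of your own.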

<h2><a name="htoc193">16.4</a>  Pipes</h2>

Most operating systems provide a command-line interface,

also known as a shell. Shells usually provide commands

to navigate the file system and launch applications. For

example, in Unix, you can change directories with <tt>cd</tt>,

display the contents of a directory with <tt>ls</tt>, and launch

a web browser by typing (for example) <tt>firefox</tt>.

Any program that you can launch from the shell can also be

launched from Python using a pipe. A pipe is an object

that represents a running process.

For example, the Unix command<a name="text14" href="#note14">1</a>

<tt>ls -l</tt> normally displays the

contents of the current directory (in long format). You can

launch <tt>ls</tt> with <tt>os.popen</tt>:

<pre>

>>> cmd = 'ls -l'

>>> fp = os.popen(cmd)

</pre>The argument is a string that contains a shell command. The

return value is a file pointer that behaves just like an open

file. You can read the output from the <tt>ls</tt> process one

line at a time with <tt>readline</tt> or get the whole thing at

once with <tt>read</tt>:

<pre>

>>> res = fp.read()

</pre>When you are done, you close the pipe like a file:

<pre>

>>> stat = fp.close()

>>> print stat

None

</pre>The return value is the final status of the <tt>ls</tt> process;

<tt>None</tt> means that it ended normally (with no errors).
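The same steps can be collected into a short script that reads the output one line at a time. This sketch makes a few assumptions of its own: it launches the Python interpreter itself (through <tt>sys.executable</tt>) instead of <tt>ls</tt> so that it does not depend on a Unix-only command, the quoting of the command strings suits Unix-style shells and may need adjusting elsewhere, and the variable names are illustrative.

```python
import os
import sys

# A command that should succeed: the child process prints one line
cmd = '"' + sys.executable + '" -c "print(6 * 7)"'
fp = os.popen(cmd)
lines = []
for line in fp:                  # a pipe can be read like an open file
    lines.append(line.rstrip())
status = fp.close()              # None when the process ended normally

# A command that exits with an error code
bad_cmd = '"' + sys.executable + '" -c "import sys; sys.exit(3)"'
failing = os.popen(bad_cmd)
failing.read()
failed_status = failing.close()  # not None, because the process reported a failure

print(lines)
print(status)
print(failed_status is not None)
```

Checking the value returned by <tt>close</tt> is how a program can tell a command that produced no output from one that failed.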

<h2><a name="htoc194">16.5</a>  Glossary</h2>

<dl compact="compact"><dt>absolute path:</dt><dd> A string that describes where a file or

directory is stored that starts at the "top of the tree of directories"

so that it can be used to access the file or directory, regardless

of the current working directory.

</dd><dt>checksum:</dt><dd> See also hashing. The term "checksum"

comes from the need to verify if data was garbled as it was

sent across a network or written to a backup medium and then

read back in. When the data is written or sent, the sending system

computes a checksum and also sends the checksum. When the

data is read or received, the receiving system re-computes

the checksum from the received data and compares it to the

received checksum. If the checksums do not match, we must

assume that the data was garbled as it was transferred.

</dd><dt>command line argument:</dt><dd> Parameters on the command line after the Python file name.

</dd><dt>current working directory:</dt><dd> The current directory that you

are "in". You can change your working directory using the

<tt>cd</tt> command on most systems in their command-line interfaces.

When you open a file in Python using just the file name with no path

information the file must be in the current working directory

where you are running the program.

</dd><dt>hashing:</dt><dd> Reading through a potentially large amount of data

and producing a short, fixed-size checksum for the data. The best hash functions

produce very few "collisions" where you can give two different

streams of data to the hash function and get back the same hash.

MD5, SHA1, and SHA256 are examples of commonly used hash functions.

</dd><dt>pipe:</dt><dd> A pipe is a connection to a running program. Using

a pipe, you can write a program to send data to another program

or receive data from that program. A pipe is similar to a

socket except that a pipe can only be used to

connect programs running on the same computer (i.e. not

across a network).

</dd><dt>relative path:</dt><dd> A string that describes where a file or

directory is stored relative to the current working

directory.

</dd><dt>shell:</dt><dd> A command-line interface to an operating system.

Also called a "terminal program" in some systems. In this interface

you type a command and parameters on a line and press "enter"

to execute the command.

</dd><dt>walk:</dt><dd> A term we use to describe the notion of visiting

the entire tree of directories, sub-directories, sub-sub-directories,

until we have visited all of the directories. We call this
"walking the directory tree".
</dd></dl>

<h2><a name="htoc195">16.6</a>  Exercises</h2>

<div align="left">Exercise 1

In a large collection of MP3 files there may be more than one

copy of the same song, stored in different directories or with

different file names. The goal of this exercise is to search for

these duplicates.

<ol type="1"><li>Write a program that walks a directory and all of its

sub-directories for all files with a given suffix (like <tt>.mp3</tt>)

and lists pairs of files that are the same size.

Hint: Use a dictionary where the key of the dictionary is the size

of the file from <tt>os.path.getsize</tt> and the value in the

dictionary is the path name concatenated with the file name.

As you encounter each file check to see if you already have a

file that has the same size as the current file.

If so, you have found two files of the same
size, so print out the file size and the two file names
(one from the dictionary and the other the file you are looking at).

</li><li>Adapt the previous program to look for files that

have duplicate content using a hashing or checksum

algorithm. For example,

MD5 (Message-Digest algorithm 5) takes an arbitrarily-long

"message" and returns a 128-bit "checksum." The probability

is very small that two files with different contents will

return the same checksum.

You can read about MD5 at <tt>wikipedia.org/wiki/Md5</tt>. The

following code snippet opens a file, reads it and computes

its checksum.

<pre>

import hashlib
...
fhand = open(thefile,'rb')    # binary mode, so the bytes are read unchanged
data = fhand.read()
fhand.close()
checksum = hashlib.md5(data).hexdigest()

</pre>You should create a dictionary where the checksum is the key

and the file name is the value. When you compute a checksum

and it is already in the dictionary as a key, you have two files with

duplicate content so print out the file in the dictionary

and the file you just read. Here is some sample output

from a run in a folder of image files:

<pre>

./2004/11/15-11-04_0923001.jpg ./2004/11/15-11-04_1016001.jpg

./2005/06/28-06-05_1500001.jpg ./2005/06/28-06-05_1502001.jpg

./2006/08/11-08-06_205948_01.jpg ./2006/08/12-08-06_155318_02.jpg

</pre>Apparently I sometimes sent the same photo more than once

or made a copy of a photo from time to time without deleting

the original.</li></ol></div>
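For large files such as MP3s, reading the whole file into memory just to compute a checksum can be wasteful. A possible refinement, sketched below, feeds the file to <tt>hashlib.md5</tt> a piece at a time using its <tt>update</tt> method. The function name <tt>file_checksum</tt> and the chunk size are assumptions of this sketch, not part of the exercise.

```python
import hashlib

def file_checksum(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    fhand = open(path, 'rb')      # binary mode so the bytes are hashed unchanged
    while True:
        chunk = fhand.read(chunk_size)
        if not chunk:             # an empty result means end of file
            break
        digest.update(chunk)
    fhand.close()
    return digest.hexdigest()
```

The result is the same string that <tt>hashlib.md5(data).hexdigest()</tt> gives for the complete contents, so it can be used as the dictionary key in the same way.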

<hr width="50%" size="1" /><dl><dt><a name="note14" href="#text14">1</a></dt><dd>When using pipes to talk

to operating system commands like <tt>ls</tt>, it is important

for you to know which operating system you are using and only

open pipes to commands that are supported on your operating system.

</dd></dl>

<hr />

</body>

</html>
