<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"

"http://www.w3.org/TR/REC-html40/loose.dtd">

<html>

<head>

<title>

Networked programs

</title>

</head>

<body>

<hr />

<h1><a name="htoc146">Chapter 12</a>  Networked programs</h1>

While many of the examples in this book have focused on reading

files and looking for data in those files, there are many different

sources of information when one considers the Internet.

In this chapter we will pretend to be a web browser and retrieve web

pages using the HyperText Transfer Protocol (HTTP). Then we will read

through the web page data and parse it.

<h2><a name="htoc147">12.1</a>  HyperText Transfer Protocol - HTTP</h2>

The network protocol that powers the web is actually quite simple, and
Python has a built-in module called <tt>socket</tt> which makes it very
easy to make network connections and retrieve data over those
sockets in a Python program.

A socket is much like a file, except that a
single socket provides a two-way connection between two
programs.

You can both read from and write to the same socket. If you write something to

a socket it is sent to the application at the other end of the socket. If you

read from the socket, you are given the data which the other application has sent.

But if you try to read a socket when the program on the other end of the socket

has not sent any data, you just sit and wait. If the programs on both ends

of the socket simply wait for some data without sending anything, they will wait for

a very long time.
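As a minimal sketch of this two-way behavior, the code below uses <tt>socket.socketpair()</tt>, which returns two sockets that are already connected to each other, so no network is needed. The names <tt>left</tt> and <tt>right</tt> are made up for this illustration and stand in for the two programs. The sketch uses Python 3 syntax (byte strings and <tt>print()</tt> calls).

```python
import socket

# socketpair() returns two sockets that are already connected to each other;
# "left" and "right" are illustrative stand-ins for two separate programs.
left, right = socket.socketpair()

left.sendall(b'hello')        # data written on one end ...
msg = right.recv(512)         # ... is read on the other end

right.sendall(b'hi back')     # the same socket also carries data back
reply = left.recv(512)

left.close()
right.close()

print(msg)
print(reply)
```

If neither side ever called <tt>sendall()</tt>, both calls to <tt>recv()</tt> would simply wait, which is the situation described above.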

So an important part of programs that communicate over the Internet is to have some

sort of protocol. A protocol is a set of precise rules that determine who
is to go first, what they are to do, what the responses to that message are,
who sends next, and so on. In a sense the two applications at either end

of the socket are doing a dance and making sure not to step on each other's toes.

There are many documents which describe these network protocols. The HyperText Transfer

Protocol is described in the following document:

<tt>http://www.w3.org/Protocols/rfc2616/rfc2616.txt</tt>

This is a long and complex 176-page document with a lot of detail. If you
find it interesting, feel free to read it all. But if you take a look around page 36 of

RFC2616 you will find the syntax for the GET request. If you read in detail, you will

find that to request a document from a web server, we make a connection to

the <tt>www.py4inf.com</tt> server on port 80, and then send a line of the form:

<tt>GET http://www.py4inf.com/code/romeo.txt HTTP/1.0 </tt>

The second parameter is the web page we are requesting. After the request line
we also send a blank line. The web server will respond with some
header information about the document and a blank line
followed by the document content.

<h2><a name="htoc148">12.2</a>  The World's Simplest Web Browser</h2>

Perhaps the easiest way to show how the HTTP protocol works is to write a very

simple Python program that makes a connection to a web server and, following
the rules of the HTTP protocol, requests a document

and displays what the server sends back.

<pre>
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')

while True:
    data = mysock.recv(512)
    if ( len(data) < 1 ) :
        break
    print data

mysock.close()
</pre>First the program makes a connection to port 80 on

the server <tt>www.py4inf.com</tt>.

Since our program is playing the role of the "web browser" the HTTP

protocol says we must send the GET command followed by a blank line.

Once we send that blank line, we write a loop that receives data
in 512-character chunks from the socket and prints the data out
until there is no more data to read (i.e., <tt>recv()</tt> returns
an empty string).

The program produces the following output:

<pre>
HTTP/1.1 200 OK
Date: Sun, 14 Mar 2010 23:52:41 GMT
Server: Apache
Last-Modified: Tue, 29 Dec 2009 01:31:22 GMT
ETag: "143c1b33-a7-4b395bea"
Accept-Ranges: bytes
Content-Length: 167
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
</pre>The output starts with headers which the web server sends

to describe the document.

For example, the <tt>Content-Type</tt> header indicates that

the document is a plain text document (<tt>text/plain</tt>).

After the server sends us the headers, it adds a blank line

to indicate the end of the headers and then sends the actual

data of the file <tt>romeo.txt</tt>.
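The layout of such a response can be sketched with a few string operations. The <tt>response</tt> text below is a shortened, hand-written stand-in for what a server might send (not captured network data), and the variable names are illustrative.

```python
# A shortened, hand-written response: status line, headers, blank line, body.
response = ('HTTP/1.1 200 OK\r\n'
            'Server: Apache\r\n'
            'Content-Type: text/plain\r\n'
            '\r\n'
            'But soft what light through yonder window breaks\n')

# The first blank line (two CRLF pairs in a row) separates headers from data.
head, body = response.split('\r\n\r\n', 1)

lines = head.split('\r\n')
status_line = lines[0]

# Each remaining header line has the form "Name: value".
headers = dict()
for line in lines[1:]:
    name, value = line.split(':', 1)
    headers[name.strip()] = value.strip()

print(status_line)
print(headers['Content-Type'])
print(body)
```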

This example shows how to make a low-level network connection

with sockets. Sockets can be used to communicate with a web

server or with a mail server or many other kinds of servers.

All that is needed is to find the document which describes

the protocol and write the code to send and receive the data

according to the protocol.

However, since the protocol that we use most commonly is

the HTTP (i.e. the web) protocol, Python has a special

library specifically designed to support the HTTP protocol

for the retrieval of

documents and data over the web.

<h2><a name="htoc149">12.3</a>  Retrieving an image over HTTP</h2>

In the above example, we retrieved a plain text file
which had newlines in the file and we simply copied the
data to the screen as the program ran. We can use a similar
program to retrieve an image using HTTP. Instead

of copying the data to the screen as the program runs,

we accumulate the data in a string, trim off the headers

and then save the image data to a file as follows:

<pre>
import socket
import time

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('www.py4inf.com', 80))
mysock.send('GET http://www.py4inf.com/cover.jpg HTTP/1.0\n\n')

count = 0
picture = ""
while True:
    data = mysock.recv(5120)
    if ( len(data) < 1 ) : break
    # time.sleep(0.25)
    count = count + len(data)
    print len(data),count
    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find("\r\n\r\n")
print 'Header length',pos
print picture[:pos]

# Skip past the header and save the picture data
picture = picture[pos+4:]
# Open in binary mode ("wb") so the image bytes are written unchanged
fhand = open("stuff.jpg","wb")
fhand.write(picture)
fhand.close()
</pre>When the program runs it produces the following output:

<pre>
$ python urljpeg.py
2920 2920
1460 4380
1460 5840
1460 7300
...
1460 62780
1460 64240
2920 67160
1460 68620
1681 70301
Header length 240
HTTP/1.1 200 OK
Date: Sat, 02 Nov 2013 02:15:07 GMT
Server: Apache
Last-Modified: Sat, 02 Nov 2013 02:01:26 GMT
ETag: "19c141-111a9-4ea280f8354b8"
Accept-Ranges: bytes
Content-Length: 70057
Connection: close
Content-Type: image/jpeg
</pre>You can see that for this URL, the
<tt>Content-Type</tt> header indicates that the
body of the document is an image (<tt>image/jpeg</tt>).

Once the program completes, you can view the image data by opening

the file <tt>stuff.jpg</tt> in an image viewer.
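The numbers in this output can be cross-checked with a little arithmetic, assuming the headers are followed by the four-character <tt>\r\n\r\n</tt> sequence that the program searches for: the header length, the separator, and the <tt>Content-Length</tt> value should add up to the final running count.

```python
header_length = 240       # 'Header length' printed by the program
separator_length = 4      # the '\r\n\r\n' that ends the headers
content_length = 70057    # value of the Content-Length header

total = header_length + separator_length + content_length
print(total)              # matches the final running count, 70301
```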

As the program runs, you
can see that we don't get 5120 characters each time we
call the <tt>recv()</tt> method.
We get as many characters as have been transferred across the network
to us by the web server at the moment we call <tt>recv()</tt>.

In this example, we either get 1460 or

2920 characters each time we request up to 5120 characters of data.

Your results may be different depending on your network speed. Also

note that on the last call to <tt>recv()</tt> we get 1681 bytes which is the end

of the stream and in the next call to <tt>recv()</tt> we get a zero length

string that tells us that the server has called <tt>close()</tt> on its end

of the socket and there is no more data forthcoming.
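The end-of-stream behavior can be tried out locally. In the sketch below, <tt>socket.socketpair()</tt> stands in for the connection to a web server, and <tt>read_all</tt> is a hypothetical helper written for this illustration (Python 3 syntax). Once the sending side calls <tt>close()</tt>, the receiving loop eventually gets a zero-length result and stops.

```python
import socket

def read_all(sock, chunk_size=512):
    """Call recv() until it returns a zero-length result (end of stream)."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if len(data) < 1:
            break                  # the other end has closed the socket
        chunks.append(data)
    return b''.join(chunks)

# A local socket pair stands in for the browser/server connection.
server, client = socket.socketpair()
server.sendall(b'x' * 1300)        # more than one 512-byte chunk
server.close()                     # signals "no more data forthcoming"

received = read_all(client)
client.close()
print(len(received))
```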

We can slow down our successive calls to <tt>recv()</tt> by uncommenting the call
to <tt>time.sleep()</tt>. This way, we wait a quarter of a second after each call

so that the server can "get ahead" of us and send more data to us

before we call <tt>recv()</tt>. With the delay in place the program

executes as follows:

<pre>
$ python urljpeg.py
1460 1460
5120 6580
5120 11700
...
5120 62900
5120 68020
2281 70301
Header length 240
HTTP/1.1 200 OK
Date: Sat, 02 Nov 2013 02:22:04 GMT
Server: Apache
Last-Modified: Sat, 02 Nov 2013 02:01:26 GMT
ETag: "19c141-111a9-4ea280f8354b8"
Accept-Ranges: bytes
Content-Length: 70057
Connection: close
Content-Type: image/jpeg
</pre>Other than the first and last calls to <tt>recv()</tt>, we now get

5120 characters each time we ask for new data.

There is a buffer between the server making <tt>send()</tt> requests

and our application making <tt>recv()</tt> requests. When we run the

program with the delay in place, at some point the server might

fill up the buffer in the socket and be forced to pause until our

program starts to empty the buffer. The pausing of either the

sending application or the receiving application is called

"flow control".

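Flow control can be observed on a local socket pair. This is a rough sketch in Python 3, not a measurement of any real server: the sender is switched to non-blocking mode so that a full buffer shows up as a <tt>BlockingIOError</tt> instead of a pause, and the number of bytes that fit in the buffer depends on your operating system.

```python
import socket

sender, receiver = socket.socketpair()
sender.setblocking(False)          # report a full buffer instead of pausing

chunk = b'x' * 1024
sent = 0
while True:
    try:
        sent += sender.send(chunk)
    except BlockingIOError:
        break                      # the buffer is full: a blocking sender
                                   # would be forced to pause here

# The receiver now empties the buffer ...
received = 0
while received < sent:
    received += len(receiver.recv(65536))

# ... which makes room for the sender to continue.
sent_after_drain = sender.send(b'y')

sender.close()
receiver.close()
print(sent)
print(sent_after_drain)
```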
<h2><a name="htoc150">12.4</a>  Retrieving web pages with <tt>urllib</tt></h2>

While we can manually send and receive data over HTTP

using the socket library, there is a much simpler way
to perform this common task in Python by

using the <tt>urllib</tt> library.

Using <tt>urllib</tt>,

you can treat a web page much like a file. You simply

indicate which web page you would like to retrieve and

<tt>urllib</tt> handles all of the HTTP protocol and header

details.

The equivalent code to read the <tt>romeo.txt</tt> file

from the web using <tt>urllib</tt> is as follows:

<pre>
import urllib

fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
for line in fhand:
    print line.strip()
</pre>Once the web page has been opened with

<tt>urllib.urlopen</tt> we can treat it like

a file and read through it using a

<tt>for</tt> loop.

When the program runs, we only see the output

of the contents of the file. The headers

are still sent, but the <tt>urllib</tt> code

consumes the headers and only returns the

data to us.

<pre>
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

</pre>
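The file-like behavior can be tried without a network connection. The sketch below assumes Python 3, where <tt>urlopen</tt> lives in the <tt>urllib.request</tt> module, and it uses a <tt>data:</tt> URL (an assumption made purely so the example is self-contained) in place of a real web address. Python 3 returns bytes, so each line is decoded before it is stripped.

```python
import urllib.parse
import urllib.request

# Two lines of romeo.txt packed into a data: URL, standing in for a web page.
text = ('But soft what light through yonder window breaks\n'
        'It is the east and Juliet is the sun\n')
url = 'data:text/plain,' + urllib.parse.quote(text)

fhand = urllib.request.urlopen(url)

# The returned object can be read line by line, much like an open file.
lines = []
for line in fhand:
    lines.append(line.decode().strip())

for line in lines:
    print(line)
```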

As an example, we can write

a program to retrieve the data for

<tt>romeo.txt</tt> and compute the frequency

of each word in the file as follows:

<pre>
import urllib

counts = dict()
fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
for line in fhand:
    words = line.split()
    for word in words:
        counts[word] = counts.get(word,0) + 1
print counts
</pre>Again, once we have opened the web page,

we can read it like a local file.

<h2><a name="htoc151">12.5</a>  Parsing HTML and scraping the web</h2>

One of the common uses of the <tt>urllib</tt> capability in Python is

to scrape the web. Web scraping is when we write a program

that pretends to be a web browser and retrieves pages and then

examines the data in those pages looking for patterns.

As an example, a search engine such as Google will look at the source

of one web page and extract the links to other pages and retrieve

those pages, extracting links, and so on. Using this technique,

Google spiders its way through nearly all of the pages on

the web.

Google also uses the frequency of links from pages it finds

to a particular page as one measure of how "important"

a page is and how highly the page should appear in its search results.
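The spidering idea can be sketched on a tiny in-memory "web". The page names and links in <tt>web</tt> are invented for this illustration, and counting inbound links is only the simplest possible stand-in for how a real search engine ranks pages.

```python
from collections import deque

# A made-up miniature web: each page maps to the pages it links to.
web = {
    'a.htm': ['b.htm', 'c.htm'],
    'b.htm': ['c.htm'],
    'c.htm': ['a.htm'],
}

def spider(start):
    """Visit every page reachable from start and count inbound links."""
    seen = [start]
    queue = deque([start])
    inbound = dict()
    while queue:
        page = queue.popleft()
        for target in web.get(page, []):
            inbound[target] = inbound.get(target, 0) + 1
            if target not in seen:     # only retrieve each page once
                seen.append(target)
                queue.append(target)
    return seen, inbound

visited, inbound = spider('a.htm')
print(visited)
print(inbound)
```

In this toy example <tt>c.htm</tt> collects the most inbound links, so by this crude measure it would count as the most "important" page.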

<h2><a name="htoc152">12.6</a>  Parsing HTML using Regular Expressions</h2>

One simple way to parse HTML is to use regular expressions to repeatedly
search for and extract substrings that match a particular pattern.

Here is a simple web page:

<pre>
<h1>The First Page</h1>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</pre>We can construct a well-formed regular expression to match

and extract the link values from the above text as follows:

<pre>

href="http://.+?"

</pre>Our regular expression looks for strings that start with

"href="http://" followed by one or more characters

".+?" followed by another double quote. The question mark

added to the ".+?" indicates that the match is to be done

in a "non-greedy" fashion instead of a "greedy" fashion.

A non-greedy match tries to find the smallest possible matching

string and a greedy match tries to find the largest possible

matching string.
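The difference between the two matching styles is easy to see on a string that contains two links. The HTML text and the <tt>example.com</tt> addresses below are made up for this illustration.

```python
import re

html = ('<a href="http://www.example.com/page1.htm">One</a> '
        '<a href="http://www.example.com/page2.htm">Two</a>')

# Non-greedy: ".+?" stops at the first double quote it can.
non_greedy = re.findall('href="(http://.+?)"', html)

# Greedy: ".+" runs on to the last double quote in the string.
greedy = re.findall('href="(http://.+)"', html)

print(non_greedy)
print(greedy)
```

The non-greedy pattern yields the two addresses separately, while the greedy pattern returns a single long match that swallows everything between the first link and the final double quote.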

We need to add parentheses to our regular expression to indicate

which part of our matched string we would like to extract and

produce the following program:

<pre>
import urllib
import re

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
links = re.findall('href="(http://.+?)"', html)
for link in links:
    print link
</pre>The <tt>findall</tt> regular expression method will give us a list of all

of the strings that match our regular expression, returning only

the link text between the double quotes.

When we run the program, we get the following output:

<pre>
python urlregex.py
Enter - http://www.dr-chuck.com/page1.htm
http://www.dr-chuck.com/page2.htm

python urlregex.py
Enter - http://www.py4inf.com/book.htm
http://www.greenteapress.com/thinkpython/thinkpython.html
http://allendowney.com/
http://www.py4inf.com/code
http://www.lib.umich.edu/espresso-book-machine
http://www.py4inf.com/py4inf-slides.zip
</pre>Regular expressions work very nicely when your HTML is well-formatted
and predictable. But since there are a lot of "broken" HTML pages

out there, you might find that a solution only using

regular expressions might either miss some valid links or end up

with bad data.

This can be solved by using a robust HTML parsing library.

<h2><a name="htoc153">12.7</a>  Parsing HTML using BeautifulSoup</h2>

There are a number of Python libraries which can help you parse

HTML and extract data from the pages. Each of the libraries

has its strengths and weaknesses and you can pick one based on

your needs.

As an example, we will simply parse some HTML input

and extract links using the BeautifulSoup library.

You can download the BeautifulSoup code
from:
<tt>www.crummy.com</tt>
You can either "install" BeautifulSoup or
simply place the <tt>BeautifulSoup.py</tt> file in the
same folder as your application.

Even though HTML looks like XML and some pages are carefully

constructed to be XML, most HTML is generally broken in ways

that cause an XML parser to reject the entire page of HTML as

improperly formed. BeautifulSoup tolerates highly flawed

HTML and still lets you easily extract the data you need.
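As a rough illustration of why a tolerant parser helps, the sketch below feeds sloppy HTML (upper-case tag, single-quoted attribute, unclosed paragraph) to both a regular expression and a parser. BeautifulSoup is not part of the standard library, so this sketch swaps in Python 3's built-in <tt>html.parser</tt> module instead; <tt>LinkCollector</tt> is a hypothetical class written for this example, and the page text is a made-up variation of the sample page above.

```python
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag that is fed in."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already converted to lower case.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# Sloppy but common HTML: upper-case tag, single quotes, unclosed <p>.
page = ("<h1>The First Page</h1>\n"
        "<p>If you like, you can switch to the\n"
        "<A HREF='http://www.dr-chuck.com/page2.htm'>\n"
        "Second Page</a>.\n")

regex_links = re.findall('href="(http://.+?)"', page)

collector = LinkCollector()
collector.feed(page)

print(regex_links)
print(collector.links)
```

The regular expression finds nothing here because it expects lower-case <tt>href</tt> and double quotes, while the parser still recovers the link.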

We will use <tt>urllib</tt> to read the page and then use

<tt>BeautifulSoup</tt> to extract the <tt>href</tt> attributes from the

anchor (<tt>a</tt>) tags.

<pre>
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
</pre>The program prompts for a web address, then opens the web

page, reads the data and passes the data to the BeautifulSoup

parser, and then retrieves all of the anchor tags and prints

out the <tt>href</tt> attribute for each tag.

When the program runs it looks as follows:

<pre>
python urllinks.py
Enter - http://www.dr-chuck.com/page1.htm
http://www.dr-chuck.com/page2.htm

python urllinks.py
Enter - http://www.py4inf.com/book.htm
http://www.greenteapress.com/thinkpython/thinkpython.html
http://allendowney.com/
http://www.si502.com/
http://www.lib.umich.edu/espresso-book-machine
http://www.py4inf.com/code
http://www.pythonlearn.com/
</pre>You can use BeautifulSoup to pull out various parts of each

tag as follows:

<pre>
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print 'TAG:',tag
    print 'URL:',tag.get('href', None)
    print 'Content:',tag.contents[0]
    print 'Attrs:',tag.attrs
</pre>This produces the following output:

<pre>
python urllink2.py
Enter - http://www.dr-chuck.com/page1.htm
TAG: <a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>
URL: http://www.dr-chuck.com/page2.htm
Content: [u'\nSecond Page']
Attrs: [(u'href', u'http://www.dr-chuck.com/page2.htm')]
</pre>These examples only begin to show the power of BeautifulSoup

when it comes to parsing HTML. See the documentation

and samples at

<tt>www.crummy.com</tt>

for more detail.

<h2><a name="htoc154">12.8</a>  Reading binary files using urllib</h2>

Sometimes you want to retrieve a non-text (or binary) file such as

an image or video file. The data in these files is generally not

useful to print out but you can easily make a copy of a URL to a local

file on your hard disk using <tt>urllib</tt>.

The pattern is to open the URL and use <tt>read</tt> to download the entire

contents of the document into a string variable (<tt>img</tt>) and then write that

information to a local file as follows:

<pre>
import urllib

img = urllib.urlopen('http://www.py4inf.com/cover.jpg').read()
# Open in binary mode ('wb') so the image bytes are written unchanged
fhand = open('cover.jpg', 'wb')
fhand.write(img)
fhand.close()
</pre>This program reads all of the data in at once across the network and

stores it in the variable <tt>img</tt> in the main memory of your computer

and then opens the file <tt>cover.jpg</tt> and writes the data out to your

disk. This will work if the size of the file is less than the size

of the memory of your computer.

However if this is a large audio or video file, this program may crash

or at least run extremely slowly when your computer runs out of memory.

In order to avoid running out of memory, we retrieve the data in blocks

(or buffers) and then write each block to your disk before retrieving

the next block. This way the program can read any sized file without

using up all of the memory you have in your computer.

<pre>
import urllib

img = urllib.urlopen('http://www.py4inf.com/cover.jpg')
# Open in binary mode ('wb') so the image bytes are written unchanged
fhand = open('cover.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1 : break
    size = size + len(info)
    fhand.write(info)

print size,'characters copied.'
fhand.close()
</pre>In this example, we read only 100,000 characters at a time and then

write those characters to the <tt>cover.jpg</tt> file

before retrieving the next 100,000 characters of data from the

web.

This program runs as follows:

<pre>

python curl2.py

568248 characters copied.

</pre>
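The block-by-block pattern can be tried offline. In this Python 3 sketch, an in-memory <tt>io.BytesIO</tt> object stands in for the opened URL, <tt>copy_in_blocks</tt> is a hypothetical helper, and the data is a made-up run of bytes, not a real image.

```python
import io
import os
import tempfile

def copy_in_blocks(source, dest, block_size=100000):
    """Copy source to dest one block at a time; return the bytes copied."""
    size = 0
    while True:
        info = source.read(block_size)
        if len(info) < 1:
            break                  # nothing left to read
        size = size + len(info)
        dest.write(info)           # write this block before reading the next
    return size

# Made-up binary data standing in for a downloaded file.
data = b'\xff\xd8' + b'\x00' * 250000
source = io.BytesIO(data)

path = os.path.join(tempfile.mkdtemp(), 'copy.bin')
with open(path, 'wb') as fhand:    # binary mode keeps the bytes unchanged
    copied = copy_in_blocks(source, fhand)

print(copied)
```

Only one block of at most 100,000 bytes is held in memory at a time, regardless of how large the source is.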

If you have a Unix or Macintosh computer, you probably have a command

built into your operating system that performs this operation

as follows:

<pre>

curl -O http://www.py4inf.com/cover.jpg

</pre>The name <tt>curl</tt> is usually explained as "client for URLs"; since the command copies a URL to a local file, these two
examples are named <tt>curl1.py</tt> and <tt>curl2.py</tt> on

<tt>www.py4inf.com/code</tt> as they implement similar functionality

to the <tt>curl</tt> command. There is also a <tt>curl3.py</tt> sample

program that does this task a little more effectively in case you

actually want to use this pattern in a program you are writing.

<h2><a name="htoc155">12.9</a>  Glossary</h2>

<dl compact="compact"><dt>BeautifulSoup:</dt><dd> A Python library for parsing HTML documents

and extracting data from HTML documents

that compensates for most of the imperfections in the HTML that browsers

generally ignore.

You can download the BeautifulSoup code

from

<tt>www.crummy.com</tt>.

</dd><dt>port:</dt><dd> A number that generally indicates which application

you are contacting when you make a socket connection to a server.

As an example, web traffic usually uses port 80 while e-mail

traffic uses port 25.

</dd><dt>scrape:</dt><dd> When a program pretends to be a web browser and

retrieves a web page and then looks at the web page content.

Often programs follow the links in one page to find the next

page so they can traverse a network of pages or a social network.

</dd><dt>socket:</dt><dd> A network connection between two applications

where the applications can send and receive data in either direction.

</dd><dt>spider:</dt><dd> The act of a web search engine retrieving a page and
then all the pages linked from that page and so on until it has
nearly all of the pages on the Internet, which it
uses to build its search index.
</dd></dl>

<h2><a name="htoc156">12.10</a>  Exercises</h2>

<div align="left">Exercise 1

Change the socket program <tt>socket1.py</tt> to prompt the user for

the URL so it can read any web page.

You can use <tt>split('/')</tt> to break the URL into its component parts

so you can extract the host name for the socket <tt>connect</tt> call.

Add error checking using <tt>try</tt> and <tt>except</tt> to handle the condition where the

user enters an improperly formatted or non-existent URL.

</div>

<div align="left">Exercise 2

Change your socket program so that it counts the number of characters it has received

and stops displaying any text after it has shown 3000 characters. The program

should retrieve the entire document and count the total number of characters

and display the count of the number of characters at the end of the document.

</div>

<div align="left">Exercise 3

Use <tt>urllib</tt> to replicate the previous exercise of (1) retrieving the document

from a URL, (2) displaying up to 3000 characters, and (3) counting the overall number

of characters in the document. Don't worry about the headers for this exercise, simply

show the first 3000 characters of the document contents.

</div>

<div align="left">Exercise 4

Change the <tt>urllinks.py</tt> program to extract and count

paragraph (p) tags from the retrieved HTML document and

display the count of the paragraphs as the

output of your program.

Do not display the paragraph text - only count them.

Test your program on several small web pages

as well as some larger web pages.

</div>

<div align="left">Exercise 5

(Advanced) Change the socket program so that it only shows data after the

headers and a blank line have been received. Remember that <tt>recv</tt> is

receiving characters (newlines and all) - not lines.

</div>

<hr />

</body>

</html>
