
HTML processing

#############################

sgmllib

Mark Pilgrim's book Dive Into Python has a chapter on HTML processing with sgmllib. Processing HTML pages this way is quite complicated, so I concentrate on a different solution here (see below).
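To give a feel for the callback style that makes this approach laborious, here is a minimal sketch. It does not use sgmllib itself (that module was removed in Python 3). Instead it uses the standard library's html.parser, which follows a similar event-driven pattern. The LinkCollector class and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; tag names arrive lowercased and
        # attrs is a list of (name, value) pairs.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


# Made-up sample markup, used instead of fetching a live page.
sample = '<p><a href="http://www.zope.org/">Zope</a> <a name="top">no link</a></p>'

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Even for this small task you have to subclass the parser, track state yourself, and pick attributes out of a list of pairs by hand.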

Beautiful Soup

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.

Example 1

Extract all the links from a page:

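# Python 2 with BeautifulSoup 3
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

# Download the page and parse it
text = urlopen('http://python.org').read()
soup = BeautifulSoup(text)

# Print every <a> tag that has an href attribute
for a in soup.findAll('a'):
    if a.has_key('href'):
        print a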

Output:

...
<a href="http://www.zope.org/">Zope</a>
<a href="http://www.djangoproject.com/">Django</a>
...
Page last modified on 2009 November 15, 04:39