HTML processing
sgmllib

Mark Pilgrim has a chapter about HTML processing with sgmllib in his book Dive Into Python. Processing HTML pages this way is quite complicated, so I'd concentrate on a different solution (see below).

Beautiful Soup

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.

Example 1

Extract all the links from a page:

    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup

    # download the page and parse it
    text = urlopen('http://python.org').read()
    soup = BeautifulSoup(text)

    # print every <a> tag that has an href attribute
    for a in soup.findAll('a'):
        if a.has_key('href'):
            print a

Output:

    ...
    <a href="http://www.zope.org/">Zope</a>
    <a href="http://www.djangoproject.com/">Django</a>
    ...
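The example above is written for Python 2 and the old BeautifulSoup import. As a rough sketch only, the same link extraction can be done on Python 3 with nothing but the standard library's html.parser module. This swaps in a different technique (an event-driven parser, closer in spirit to sgmllib than to Beautiful Soup). The names LinkCollector, extract_links and SAMPLE are made up for this illustration, and the sample HTML is an inline string rather than a downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


def extract_links(html):
    """Return the href values of all links found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Inline sample (an assumption for the sketch; not fetched from python.org)
SAMPLE = '''<p><a href="http://www.zope.org/">Zope</a>
<a name="top">anchor without href</a>
<a href="http://www.djangoproject.com/">Django</a></p>'''

if __name__ == "__main__":
    for link in extract_links(SAMPLE):
        print(link)
```

Unlike the Beautiful Soup loop, this collects only the href values, not the whole tags. For a real page, the text passed to extract_links would come from a download step instead of SAMPLE.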