20121126a

/r/EarthPorn

On reddit there is a subreddit called /r/EarthPorn, where you can find photos of beautiful landscapes. Let's download an image from there to our local machine. It could be the basis of a background changer application in the future :)

Naive approach

The URLs of the images are in the HTML source. We could download the HTML source and extract the links from it with regular expressions.

Advanced approach

However, regular expressions are not recommended for extracting data from HTML. Most HTML sources are not valid (attributes are not in quotes, closing tags are missing, etc.). When a browser downloads a page, it corrects these errors and builds a DOM structure. It then traverses this DOM structure and renders the page. If you want to extract data from HTML, you can use a library (e.g. BeautifulSoup) that builds a DOM hierarchy, and then you just need to navigate in this hierarchy.

Less painful approach

Before attacking an HTML source, it's a good idea to check whether the content is available in a machine-readable format. Good news! The reddit developers thought of this, and every page is available in XML. All you need to do is add "/.xml" to the end of your URL: http://www.reddit.com/r/earthporn/.xml. The next question to ask: is the same content available in JSON?

Painless approach

We are very lucky, reddit supports JSON too: http://www.reddit.com/r/earthporn/.json. Save this JSON source in a file (e.g.

Steps:
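As a rough illustration of the JSON approach, the sketch below pulls the link of every post out of a listing. It assumes the usual shape of a reddit listing, where the posts sit under "data" and "children" and each post carries its own "data" dictionary with a "url" field. SAMPLE_LISTING and extract_urls are hypothetical names, and the sample data is made up so that the code runs without a network connection.

```python
import json

# Hypothetical, heavily trimmed sample that mimics the shape of a reddit
# listing: the posts sit under data -> children, each with its own "data".
SAMPLE_LISTING = """
{
  "kind": "Listing",
  "data": {
    "children": [
      {"kind": "t3", "data": {"title": "Example lake", "url": "http://i.imgur.com/abc123.jpg"}},
      {"kind": "t3", "data": {"title": "Example album", "url": "http://imgur.com/a/xyz"}},
      {"kind": "t3", "data": {"title": "No link here"}}
    ]
  }
}
"""


def extract_urls(listing):
    """Return the "url" field of every post in a parsed listing dict."""
    urls = []
    for child in listing.get("data", {}).get("children", []):
        url = child.get("data", {}).get("url")
        if url:  # skip posts that carry no url field
            urls.append(url)
    return urls


if __name__ == "__main__":
    listing = json.loads(SAMPLE_LISTING)
    for url in extract_urls(listing):
        print(url)
```

With a real download, the same function can be fed the result of json.load() on the saved file. Note that not every URL in the listing points directly to an image; some lead to albums or ordinary web pages.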
Exercises

(a) Print the URLs of all the images on the /r/EarthPorn page.
(b) If a URL points to an image, print the dimensions of the image too (width, height). Tip: use the Pillow module (the successor of the PIL module; prefer Pillow).
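As a possible starting point for exercise (b), here is a minimal sketch of the two building blocks. It guesses from the file extension whether a URL points to an image, and it reads the dimensions of an already downloaded image with Pillow. The helper names looks_like_image and image_size, and the list of extensions, are assumptions made for illustration; downloading the bytes is left to you.

```python
import os
from urllib.parse import urlparse

# Assumption: these extensions are enough to recognise a direct image link.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def looks_like_image(url):
    """Guess from the path extension whether a URL points to an image."""
    path = urlparse(url).path  # drops any query string such as "?1"
    ext = os.path.splitext(path)[1].lower()
    return ext in IMAGE_EXTENSIONS


def image_size(raw_bytes):
    """Return (width, height) of an image given as bytes, using Pillow."""
    # Imported here so the rest of the sketch runs without Pillow installed.
    from io import BytesIO
    from PIL import Image

    with Image.open(BytesIO(raw_bytes)) as img:
        return img.size  # Pillow reports size as (width, height)


if __name__ == "__main__":
    for url in ["http://i.imgur.com/abc123.jpg", "http://imgur.com/a/xyz"]:
        print(url, looks_like_image(url))
```

Checking the extension is only a heuristic: an image can be served from a URL without an extension, so a more careful solution might also look at the Content-Type header of the response.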