Szathmáry László's homepage @ DEIK

/r/EarthPorn

On reddit there is a community (a subreddit) called /r/EarthPorn, where you can find photos of beautiful landscapes.

Let's download an image from here to our local machine. It could be the basis of a background changer application in the future :)

Naive approach

The URLs of the images are in the HTML source. We could download the HTML source, and from there we could extract the links with regular expressions.
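A minimal sketch of the regular-expression idea, run against a small hypothetical HTML fragment rather than a real downloaded page. The example.com URLs, the pattern, and the name `extract_image_urls` are assumptions made for illustration.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page.
html = '''
<a href="http://example.com/photos/lake.jpg">Lake</a>
<a href="http://example.com/about.html">About</a>
<img src="http://example.com/photos/peak.png">
'''

# Match quoted href/src values that end in a common image extension.
IMAGE_URL = re.compile(r'(?:href|src)="([^"]+\.(?:jpg|jpeg|png|gif))"', re.IGNORECASE)


def extract_image_urls(source):
    """Return every quoted href/src value that looks like an image URL."""
    return IMAGE_URL.findall(source)


print(extract_image_urls(html))
# ['http://example.com/photos/lake.jpg', 'http://example.com/photos/peak.png']
```

Note that the pattern only works because the attribute values here are neatly quoted; an unquoted attribute would be missed.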

Advanced approach

However, regular expressions are not recommended for extracting data from HTML. Many real-world HTML pages are not valid (attribute values are not quoted, closing tags are missing, etc.). When a browser downloads a page, it corrects these errors and builds a DOM structure. It then traverses this DOM structure to render the page.

If you want to extract data from HTML, you can use a library (e.g. BeautifulSoup) that builds a DOM hierarchy, and then you just need to navigate in this hierarchy.
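To illustrate why a tolerant parser copes better than a regular expression, here is a small sketch that uses the standard library's html.parser instead of BeautifulSoup, so that it runs without installing anything. Unlike BeautifulSoup, html.parser does not build a DOM tree; it only reports tags as it meets them. The class name `ImageCollector` and the sloppy HTML fragment are made up for the example.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag seen by the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Hypothetical sloppy HTML: one unquoted attribute, unclosed <p> tags.
sloppy_html = (
    '<div><p>Lake<img src=http://example.com/lake.jpg>'
    '<p>Peak<img src="http://example.com/peak.png"></div>'
)

collector = ImageCollector()
collector.feed(sloppy_html)
print(collector.sources)
```

With BeautifulSoup the navigation step would look roughly like soup.find_all("img"), applied to the tree that the library builds from the page.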

Less painful approach

Before attacking an HTML source, it's a good idea to check whether the content is available in a machine-readable format. Good news! The people at reddit thought of this, and every page is available in XML. All you need to do is add "/.xml" to the end of your URL: http://www.reddit.com/r/earthporn/.xml.
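As a rough sketch of what working with an XML feed looks like, the snippet below parses a small hand-written Atom-style document with the standard library. The sample feed is hypothetical and heavily simplified; the structure of the real reddit feed is an assumption here and should be checked by looking at the downloaded XML first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified Atom-style feed (not copied from reddit).
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Lake at dawn</title>
    <link href="http://example.com/lake"/>
  </entry>
  <entry>
    <title>Peak at dusk</title>
    <link href="http://example.com/peak"/>
  </entry>
</feed>"""

# Prefix-to-namespace mapping used in the search paths below.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def entry_links(xml_text):
    """Return a (title, href) pair for every entry in the feed."""
    root = ET.fromstring(xml_text)
    return [
        (
            entry.findtext("atom:title", namespaces=ATOM),
            entry.find("atom:link", ATOM).get("href"),
        )
        for entry in root.findall("atom:entry", ATOM)
    ]


print(entry_links(sample_feed))
```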

The next question to ask: is the same content available in JSON?

Painless approach

We are very lucky, reddit supports JSON too: http://www.reddit.com/r/earthporn/.json. Save this JSON source in a file (e.g. earthporn.json). At this point the source is not indented; everything is on a single line. Use this command to make it readable: "cat earthporn.json | python -m json.tool > nice.json".
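The same kind of pretty-printing can also be done from inside Python with the json module. The sketch below works on a short one-line JSON string invented for the example; the helper name `pretty` is an assumption too.

```python
import json

# Hypothetical one-line JSON text, shaped loosely like the reddit listing.
compact = '{"data": {"children": [{"data": {"url": "http://example.com/lake.jpg"}}]}}'


def pretty(json_text):
    """Parse JSON text and return it re-serialized with 4-space indentation."""
    return json.dumps(json.loads(json_text), indent=4)


print(pretty(compact))
```

To process the saved file instead, read earthporn.json into a string and pass that string to pretty().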

Steps:

1. Paste the JSON source into this visualization application: http://chris.photobooks.com/json/default.htm. Find the URL of an image and click on it. At the bottom of the left column the selected element's path in the JSON hierarchy will appear. Example: root.data.children[1].data.url. That is, root.data.children is a list. The elements of this list are records, and from each record we need to extract the data.url part.
2. Write a Python script that fetches the JSON source, converts it to a Python dictionary, then prints the list root.data.children.
3. When ready, complete the script: extract data.url from each element of the list and print just this part.
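The steps above could be sketched as follows, using only the standard library. This is one possible outline, not the only way to do it. The function names, the User-Agent string, and the sample data are assumptions; the custom User-Agent is there because reddit may throttle requests sent with a default client identifier. The extraction is demonstrated on a small hypothetical dictionary so that the snippet runs without network access.

```python
import json
import urllib.request

URL = "http://www.reddit.com/r/earthporn/.json"


def fetch_listing(url=URL):
    """Download the JSON source and convert it to a Python dictionary."""
    # The User-Agent value is a made-up example.
    request = urllib.request.Request(url, headers={"User-Agent": "earthporn-demo/0.1"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_urls(listing):
    """Follow the path root.data.children[i].data.url for every record."""
    return [child["data"]["url"] for child in listing["data"]["children"]]


# Hypothetical data with the same shape as the path found in step 1.
sample_listing = {
    "data": {
        "children": [
            {"data": {"url": "http://example.com/lake.jpg"}},
            {"data": {"url": "http://example.com/peak.png"}},
        ]
    }
}

print(extract_urls(sample_listing))
# ['http://example.com/lake.jpg', 'http://example.com/peak.png']

# With network access, the live data would be used like this:
# for url in extract_urls(fetch_listing()):
#     print(url)
```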

Exercises

(a) Print the URLs of all the images on the EarthPorn page.

(b) If a URL points to an image, print the dimensions of the image too (width, height). Tip: use the Pillow module (the successor of the PIL module; Pillow is the better choice).