Sunday, December 13, 2009

Downloading photos, coordinates, titles and tags from Panoramio

I really like Panoramio. It's the main place where I publicly share my photos. Whenever I really like a photo and feel that it fits the criteria for inclusion in Google Earth (via the Panoramio layer), I upload it to Panoramio.

My main complaint about Panoramio is that I cannot export any information about my photos. I've spent a significant amount of time and effort geotagging and tagging photos in Panoramio, and I cannot obtain any of that information. Yesterday I decided to do something about this.

I chose to do this from the command prompt using common GNU tools: bash, wget, grep, sed and more. There was an obvious simple first step: downloading the photo index pages. I noted that I had 27 pages, and then I used this bash command to get them all:

for (( i = 1 ; i < 28 ; i++ )) ; do wget "--user-agent=Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5" "http://www.panoramio.com/user/5539?comment_page=1&photo_page=$i" ; sleep 10 ; done

Looking at the HTML source, I noticed lines containing photo ID numbers and titles. Such information can be extracted by a simple regular expression:

for (( i = 1 ; i < 28 ; i++ )) ; do sed -n "sX^.*h2><a href=\"/photo/\([0-9]*\)\">\([^<]*\)<.*\$X\1:\2Xp" 5539\?comment_page\=1\&photo_page\=$i ; done > photo2title.txt
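
To see what the expression does, it can be tried on a small sample. The sketch below wraps the same pattern in a function (extract_titles is just an illustrative name) and feeds it a made-up fragment. The sample markup is an assumption reconstructed from the pattern, not a copy of a real Panoramio page.

```shell
# Print "photo_id:title" for each photo link found in the given files (or stdin).
# Same pattern as the command above, single-quoted so nothing needs escaping.
extract_titles() {
    sed -n 's|^.*h2><a href="/photo/\([0-9]*\)">\([^<]*\)<.*$|\1:\2|p' "$@"
}

# Hypothetical index-page fragment; the real markup may have differed.
sample='<div class="photo">
  <h2><a href="/photo/12345678">Sunset over the bay</a></h2>
  <p>not a title line</p>
</div>'

# Prints: 12345678:Sunset over the bay
printf '%s\n' "$sample" | extract_titles
```

Titles may themselves contain colons, so anything that later splits these lines should split on the first colon only. HTML entities such as &amp; are left as they appear in the page.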

The tags were trickier, because each photo's tags were spread over multiple lines. This required a short sed program that joins the lines:

for (( i = 1 ; i < 28 ; i++ )) ; do sed -n 's/^ *"\([0-9]*\)":\[/\1:/; t next; b ignore; :next; N; s/\]$//; t end; N; s/,$/,/; t next; :end; s/\
//g; s/   *//g; s/"//g; p; :ignore;' 5539\?comment_page\=1\&photo_page\=$i ; done > photo2tags.txt
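
For readers who find the sed program hard to follow, the same line-joining idea can be sketched in awk. This is an alternative, not the program used above. It assumes a hypothetical layout in which a line like "12345678":[ opens a tag list, the tags follow as quoted strings, and a line containing ] closes it. The function name join_tags and the sample data are made up for illustration.

```shell
# Print "photo_id:tag1,tag2,..." for each tag list found in the input.
join_tags() {
    awk '
        # A line such as   "12345678":[   opens a tag list.
        /^ *"[0-9]+":\[/ { buf = ""; inlist = 1 }
        inlist {
            line = $0
            sub(/^[ \t]+/, "", line)      # drop indentation
            sub(/[ \t\r]+$/, "", line)    # drop trailing whitespace
            buf = buf line                # glue the pieces into one line
            if (line ~ /\]/) {            # a closing bracket ends the list
                sub(/\].*$/, "", buf)     # cut the bracket and what follows
                sub(/:\[/, ":", buf)      # "id":[  becomes  "id":
                gsub(/"/, "", buf)        # remove all double quotes
                print buf
                inlist = 0
            }
        }
    ' "$@"
}

# Hypothetical input; the real page source may have been laid out differently.
sample='    "111":[
        "beach",
        "sunset sky"
    ],
    "222":["harbour"],'

# Prints: 111:beach,sunset sky
#         222:harbour
printf '%s\n' "$sample" | join_tags
```

Spaces inside a tag survive because only leading and trailing whitespace is removed. A tag that itself contains ] would confuse this sketch.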


However, the coordinates were not in the photo index pages, so I had to look for another source of information. Obviously, I could just download the web page of every single photo, but I didn't really like that solution. I went to view all of my photos on the map and then opened the AdBlock Plus blockable items list. The URL from which the map gets its data was obvious:

http://www.panoramio.com/map/get_panoramas?order=upload_date&set=5539&size=thumbnail&from=0&to=24&minx=167.34375&miny=-58.26328705248601&maxx=11.953125&maxy=84.86578186731522

After downloading that, I saw that I was getting the type of information I wanted, but only 24 entries. I assumed this was due to the "to=24" part of the URL, so I changed that to the number of photos I had. The number of entries I got was three fewer than the number of photos. Sure enough, three of my photos failed to appear on the map on Panoramio. I reported this bug in the Panoramio forum and went on with my project. Extracting the coordinates wasn't hard. I based the regular expression on the data for one photo, which I pasted from get_panoramas:

sed 's/"photo_id": \([0-9]*\), "longitude": \(-\?[0-9.]*\), "height": [0-9]*, "width": [0-9]*, "photo_title": "[^"]*", "latitude": \(-\?[0-9.]*\),/~\1:\3,\2~/g; y/~/\
/;' get_panoramas | grep "^[0-9]*:" > photo2coords.txt
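
The trick here is to wrap each match in ~ markers, turn the markers into newlines, and keep only the lines that start with a photo ID. The sketch below shows it on a made-up response. The sample data (including the owner_id field) is hypothetical and shaped to fit the pattern, extract_coords is an illustrative name, and \? requires GNU sed. The literal newline in the y command is written as \n here.

```shell
# Print "photo_id:latitude,longitude" for each photo in get_panoramas-style input.
# The field order in the pattern is the one used in the command above.
extract_coords() {
    sed 's/"photo_id": \([0-9]*\), "longitude": \(-\?[0-9.]*\), "height": [0-9]*, "width": [0-9]*, "photo_title": "[^"]*", "latitude": \(-\?[0-9.]*\),/~\1:\3,\2~/g; y/~/\n/' "$@" |
        grep '^[0-9][0-9]*:'
}

# Hypothetical two-photo response, all on one line.
sample='{"photos": [{"photo_id": 111, "longitude": -122.5, "height": 375, "width": 500, "photo_title": "Beach", "latitude": 37.75, "owner_id": 5539}, {"photo_id": 222, "longitude": 34.8, "height": 500, "width": 375, "photo_title": "Old city", "latitude": -31.25, "owner_id": 5539}]}'

# Prints: 111:37.75,-122.5
#         222:-31.25,34.8
printf '%s\n' "$sample" | extract_coords
```

Note that the output order is latitude first, then longitude, even though the input lists longitude first. A ~ anywhere else in the data would also become a line break, which the grep step then has to filter out.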


If I hadn't cared about tags, I could have skipped the photo index pages and started with the get_panoramas output. It would have been simpler, but I would have missed three photos and failed to notice a bug in Panoramio. Since I knew about the three missing photos, I went to their photo pages, looked at the HTML source, and manually added the coordinates to photo2coords.txt.

The only remaining data was the correspondence between photos on my hard drive and photos on Panoramio. The data collected so far did not help with that. I decided to get the original photos and match them up with local photos based on MD5 hashes. I already had several files that listed all of my photo ID numbers, so it was trivial to download all the photos with wget. I placed delays between the downloads to reduce the load on Panoramio. At the end, I noticed one error in the wget output, and I manually downloaded that photo.

Once I had the photos, MD5 hashes located the local copies of most of them. The EXIF date and time at which each photo was taken located most of the rest. Out of 646 photos, I only had to match up 12 manually, which wasn't bad. I'm not documenting the detailed matching procedure here. One hint: the join command is helpful.
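
For readers who want a starting point, the MD5 step could look roughly like the sketch below. It is not the procedure used for this post, only one way to combine md5sum, sort and join. The function names and directory layout are made up, and it assumes GNU md5sum and file names without whitespace.

```shell
# List "md5 path" for every regular file under directory $1, sorted by hash.
md5_list() {
    find "$1" -type f -exec md5sum {} + | LC_ALL=C sort
}

# Print "downloaded_file local_file" for every pair with identical content.
# $1 = directory of downloaded originals, $2 = directory of local photos.
match_by_md5() {
    local tmp
    tmp=$(mktemp -d) || return 1
    md5_list "$1" > "$tmp/panoramio.md5"
    md5_list "$2" > "$tmp/local.md5"
    # Join on the first field (the hash) and print only the two paths.
    LC_ALL=C join -o 1.2,2.2 "$tmp/panoramio.md5" "$tmp/local.md5"
    rm -rf "$tmp"
}

# Tiny demonstration with stand-in files instead of real photos.
demo=$(mktemp -d)
mkdir "$demo/panoramio" "$demo/local"
printf 'aaa' > "$demo/panoramio/111.jpg"
printf 'bbb' > "$demo/panoramio/222.jpg"
printf 'ccc' > "$demo/panoramio/333.jpg"   # has no local copy
printf 'bbb' > "$demo/local/IMG_0001.jpg"
printf 'aaa' > "$demo/local/IMG_0002.jpg"
match_by_md5 "$demo/panoramio" "$demo/local"   # two pairs; 333.jpg is absent
rm -rf "$demo"
```

Replacing -o 1.2,2.2 with -v 1 in the join call lists the downloaded files that found no partner, which are the candidates for matching by EXIF date or by hand. Both inputs must be sorted the same way join expects, which is why sort and join run under the same LC_ALL=C setting.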

2 comments:

Yarnosh said...

First!

Kevin Childress said...

I came across this topic in your forum post at Panoramio. I've really wanted to have this same information. Most of this code talk is over my head ... any chance you could help me understand all of this in much simpler terms? For example, how exactly would I go about doing this?

Best regards,

Kevin
Pano user 2367993