Scraping Gumtree
September 8, 2014
Introduction
Gumtree.com.au is a trading post website, largely used by private sellers interacting with each other off-site to sell used goods. One particularly common use for Gumtree is concert tickets: this is one of the main uses I have for it.
The site is fairly lightweight, however, and doesn’t have any sort of notification system for a particular search. So when looking for tickets to a particular concert, say, you have to keep repeating the search by hand to see whether there are any new results.
I decided to learn some basic web-scraping techniques to write a script that did all this work for me. I wanted the following features:
- To specify a keyword (e.g. “arctic monkeys”).
- To have the script crawl Gumtree every \(x\) minutes.
- To email me with any new results (within particular geographic areas or price brackets).
Web scraping
I’d not done much web scraping before, but I’d heard good things about Python’s BeautifulSoup for parsing scraped HTML, together with the requests package for actually fetching the pages. Since I was looking for tickets to a particular show, I used the following parameters, which could of course be altered depending on the user’s requirements:
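artist = "arctic+monkeys"
city = "sydney"
URL_START = 'http://www.gumtree.com.au'
OFFER_CHOICE = ['k0?ad=offering', 'k0?ad=wanted']  # ads offering vs. ads wanting
# Already URL-encoded here; for a keyword containing spaces,
# re.sub(r"\s+", "+", artist) (with "import re") would build it.
ARTIST_URL = "arctic+monkeys"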
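As an aside, the search URL can be assembled from a free-text keyword rather than a hand-encoded one. The sketch below assumes the URL pattern used in this post; `build_search_url` and its `wanted` flag are hypothetical names, not part of the original script.

```python
import re

URL_START = 'http://www.gumtree.com.au'
OFFER_CHOICE = ['k0?ad=offering', 'k0?ad=wanted']


def build_search_url(keyword, city, page, wanted=False):
    """Build a results-page URL (hypothetical helper; pattern taken from this post)."""
    # Collapse each run of whitespace into a single '+', matching the
    # hand-written "arctic+monkeys" form used above.
    keyword_url = re.sub(r"\s+", "+", keyword.strip().lower())
    offer = OFFER_CHOICE[1] if wanted else OFFER_CHOICE[0]
    return '{0}/s-{1}+{2}/page-{3}/{4}'.format(URL_START, keyword_url, city, page, offer)


print(build_search_url("Arctic Monkeys", "sydney", 1))
```

Keeping the encoding in one place means a different artist or city only needs a change to the arguments, not to the URL string itself.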
I then used the requests package to fetch a Gumtree results page and save the raw HTML to disk; the function below fetches the \(n\)th results page (Gumtree shows 10 results per page).
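import io
import os
import time

import requests

def scrape_gumtree_page_n(n):
    """Fetch results page n, save the raw HTML, and return the saved file's path."""
    current_time = int(time.time())
    pageURL = '{0}/s-{1}+{2}/page-{3}/{4}'.format(URL_START, ARTIST_URL, city, n, OFFER_CHOICE[0])
    # scrapings_dir (the directory for saved pages) is assumed to be defined elsewhere.
    savePath = os.path.join(scrapings_dir, "{0}.html".format(current_time))
    results = requests.get(pageURL)
    # Write the page as UTF-8 text; io.open behaves the same on Python 2 and 3.
    with io.open(savePath, 'w', encoding='utf-8') as results_file:
        results_file.write(results.text)
    return savePath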
There was then a simple while loop to run this over all results pages.
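The loop itself is not shown, so the following is only a sketch of one way it might look. It assumes that a page with no parsed results marks the end of the listing, and it takes the fetch-and-parse step as a parameter (`fetch_page`, a hypothetical name) so the loop can be tried without touching the network; `max_pages` is likewise an assumed safety cap.

```python
def scrape_all_pages(fetch_page, max_pages=50):
    """Collect results from page 1 onwards until a page comes back empty."""
    all_results = []
    n = 1
    while n <= max_pages:
        page_results = fetch_page(n)
        if not page_results:
            break  # an empty page is taken to mean there are no more results
        all_results.extend(page_results)
        n += 1
    return all_results


# Stand-in for the real fetch-and-parse step: three pages of fake ads, then nothing.
fake_pages = {1: [{'ad_id': '1'}, {'ad_id': '2'}], 2: [{'ad_id': '3'}], 3: [{'ad_id': '4'}]}
print(scrape_all_pages(lambda n: fake_pages.get(n, [])))
```

With the functions from this post, `fetch_page` could be something like `lambda n: html_parser(scrape_gumtree_page_n(n))`.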
Parsing HTML using BeautifulSoup
Having obtained a whole bunch of messy output from the scraper, I used the following code to parse it using BeautifulSoup:
import io

from bs4 import BeautifulSoup  # assumes the bs4 package; the original import is not shown

def html_parser(filename):
    """Parse a saved results page into a list of dicts, one per ad."""
    with io.open(filename, encoding='utf-8') as gumtree_file:
        gumtree_contents = gumtree_file.read()
    gtsoup = BeautifulSoup(gumtree_contents, 'html.parser')
    master_list = []
    # Each search result sits in an <li class="js-click-block"> element.
    gt_li = gtsoup.findAll('li', attrs={'class': 'js-click-block'})
    for node in gt_li:
        if len(node.contents) > 0:
            post_dict = {}
            anchors = node.findAll('a')
            # findAll returns a list (never None), so test for emptiness instead.
            if anchors:
                post_dict['title'] = anchors[0].string
            price = node.find('div', attrs={"class": "h-elips"})
            if price is not None:
                post_dict['price'] = price.string
            spans = node.findAll('span')
            if spans and spans[0].contents:
                post_dict['description'] = spans[0].contents[0]
            # Guard on the same element that is read on the next line.
            area = node.find('h3', attrs={"class": "rs-ad-location-area"})
            if area is not None:
                post_dict['location1'] = area.contents[0]
            suburb = node.find('span', attrs={"class": "rs-ad-location-suburb"})
            if suburb is not None:
                post_dict['location2'] = suburb.contents[0]
            date = node.find('div', attrs={"class": "rs-ad-date"})
            if date is not None:
                post_dict['date'] = date.contents[0]
            # A separate loop variable avoids shadowing the outer node.
            for anchor in anchors:
                if anchor.get("data-adid") is not None:
                    post_dict['ad_id'] = anchor.get('data-adid')
            master_list.append(post_dict)
    return master_list
This returned things like
{
    "ad_id": "1048303295",
    "description": "Selling: 2 Coldplay Tickets - Sydney 19 June ",
    "title": null,
    "price": "\n500",
    "location2": "St Ives Chase",
    "date": "\n04/06/2014"
}
which contains all the relevant information. This took me a while to write as I’d not parsed HTML much before, so I had to figure everything out from scratch. It was very satisfying to run the parser successfully and have big page-sized chunks of horrible HTML condensed into little search result nuggets like the above.
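The price and date values above still carry leading newlines, and the feature list called for filtering by area or price. A minimal sketch of that clean-up and filter step follows; the function names, the day/month/year date assumption and the example limits are mine rather than the original script’s.

```python
from datetime import datetime


def clean_record(record):
    """Return a copy with whitespace stripped and price/date converted where possible."""
    cleaned = {}
    for key, value in record.items():
        cleaned[key] = value.strip() if isinstance(value, str) else value
    try:
        cleaned['price'] = float(cleaned['price'].replace('$', '').replace(',', ''))
    except (KeyError, AttributeError, ValueError):
        cleaned['price'] = None  # missing or non-numeric price
    try:
        # Assumes Australian day/month/year ordering.
        cleaned['date'] = datetime.strptime(cleaned['date'], '%d/%m/%Y').date()
    except (KeyError, TypeError, ValueError):
        cleaned['date'] = None
    return cleaned


def matches(record, max_price=None, suburbs=None):
    """True if a cleaned record passes the optional price and suburb filters."""
    if max_price is not None and (record['price'] is None or record['price'] > max_price):
        return False
    if suburbs is not None and record.get('location2') not in suburbs:
        return False
    return True


raw = {"ad_id": "1048303295",
       "description": "Selling: 2 Coldplay Tickets - Sydney 19 June ",
       "title": None, "price": "\n500",
       "location2": "St Ives Chase", "date": "\n04/06/2014"}
ad = clean_record(raw)
print(ad['price'], ad['date'], matches(ad, max_price=600, suburbs={'St Ives Chase'}))
```

Filtering on cleaned values before emailing would keep the notifications limited to ads worth looking at.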
Mandrill email automation
I wanted to run the scraper constantly in the background, preferably on an Amazon Web Services (AWS) instance, and have it send an automated email every time it found a new result. I had to register a Mandrill account to use their API; once an API key was generated I wrote the following very basic email function:
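import json

# mandrill_client is assumed to be created elsewhere from the API key,
# e.g. with mandrill.Mandrill(api_key) from the mandrill package.

def new_result_email(result, artist, city):
    """Email one new search result, serialised as JSON, via Mandrill."""
    subj = artist + ' ' + city
    body = json.dumps(result)
    message = {
        'from_email': '...',
        'from_name': 'Clinton Boys',
        'headers': {'Reply-To': '...'},
        'important': True,
        'preserve_recipients': None,
        'return_path_domain': None,
        'signing_domain': None,
        'subject': subj,
        'text': body,
        'to': [{'email': '...',
                'name': 'Clinton Boys',
                'type': 'to'}],
    }
    # Use a separate name so the result argument is not overwritten.
    response = mandrill_client.messages.send(message=message)
    print(response)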
which I called in my function “regular_scraping”:
def regular_scraping():
    """Scrape every results page, email any unseen ads, and update the master file."""
    # Load previously seen results; an empty file means nothing has been seen yet.
    with open('master_file.text', "r") as json_file:
        json_file_contents = json_file.read()
    if json_file_contents:
        old_data = json.loads(json_file_contents)
    else:
        old_data = []
    data_ids = [entry.get(u'ad_id') for entry in old_data]
    new_entries = []
    # pages_to_scrape() (not shown) gives the number of results pages.
    for n in range(1, pages_to_scrape() + 1):
        # Compare each page's results against every ad id seen so far.
        for entry in html_parser(scrape_gumtree_page_n(n)):
            if entry.get(u'ad_id') not in data_ids:
                old_data.append(entry)
                data_ids.append(entry.get(u'ad_id'))
                new_entries.append(entry)
                new_result_email(entry, artist, city)
    with open('master_file.text', "w") as json_file:
        json_file.write(json.dumps(old_data))
    return new_entries
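The comparison at the heart of regular_scraping can be separated from the file handling and the emailing, which makes it easier to check on its own. This is a sketch under that assumption; `find_new_entries` is a hypothetical name, and it tracks seen ids in a set so that an ad appearing on two results pages is only reported once.

```python
def find_new_entries(old_data, scraped):
    """Return the scraped entries whose 'ad_id' is not already in old_data."""
    seen_ids = {entry.get('ad_id') for entry in old_data}
    new_entries = []
    for entry in scraped:
        ad_id = entry.get('ad_id')
        if ad_id is None or ad_id in seen_ids:
            continue  # skip entries with no id, and ones seen before
        seen_ids.add(ad_id)
        new_entries.append(entry)
    return new_entries


old = [{'ad_id': '1'}]
scraped = [{'ad_id': '1'}, {'ad_id': '2'}, {'ad_id': '2'}, {'title': 'no id'}]
print(find_new_entries(old, scraped))
```

Skipping entries without an id is a design choice made here for the sketch; the alternative is to treat them as always new.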
The last step was to set this up on an Amazon Web Services (AWS) instance. I made a master file which put all these ingredients together: every \(x\) seconds (by default I chose 120, i.e. every two minutes), it ran the scraper across all results pages, parsed the saved HTML into neat JSON records, compared these to a master results file on disk, and, if there were any new results, added them to the file and sent an email through Mandrill. This was easy enough to set up, although there are still some memory leak issues because my code isn’t great; after a while it eventually crashes the AWS instance. It did solve my initial problem though, which was the whole point of the exercise, and I learnt a whole lot of new techniques along the way.
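The master file itself is not shown. One way to sketch its polling loop is below; this is an assumption about the structure, not the original code. `poll`, `job` and `max_runs` are hypothetical names, and catching exceptions per run is a design choice so that one failed request does not stop the loop.

```python
import time


def poll(job, interval_seconds=120, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, pausing interval_seconds between runs."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as exc:  # keep polling even if one scrape fails
            print('scrape failed: {0}'.format(exc))
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


# Demonstration with a stand-in job and a no-op sleep, so it finishes immediately.
calls = []
poll(lambda: calls.append('scrape'), interval_seconds=120, max_runs=3, sleep=lambda s: None)
print(len(calls))
```

In the real script the call would be something like `poll(regular_scraping)`, leaving `max_runs` unset so that it runs until stopped.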