Wellington, New Zealand, 2014

trendyscraper

November 2, 2017

My Google trends scraper is the most popular post on this site, and I’ve been getting questions about it for the last year or so, ever since Google changed the way the data is pulled and displayed on the trends site.

I finally found some free time, worked out how Google had changed the API behind the scenes, and updated my trendy scraper to work with the new version. I dramatically improved the code at the same time. Thanks to everyone who was interested!

New version

The way the code works is as follows:

  • when you visit https://trends.google.com/trends/explore and request a graph, you make a call to the Google trends explore API, which generates a usage token for you (valid only for that page view)
  • when you click export, you make a call to a different API using this token
import json
from datetime import datetime

import requests

# geo and category are assumed to be defined at module level (see the example below)

def get_data(bucket_start_date, bucket_end_date, keyword):
    bucket_start_date_printed = datetime.strftime(bucket_start_date, '%Y-%m-%d')
    bucket_end_date_printed = datetime.strftime(bucket_end_date, '%Y-%m-%d')
    time_formatted = bucket_start_date_printed + '+' + bucket_end_date_printed

    req = {"comparisonItem":[{"keyword":keyword, "geo":geo, "time": time_formatted}], "category":category,"property":""}
    hl = "en-GB"
    tz = "-120"

    explore_URL = 'https://trends.google.com/trends/api/explore?hl={0}&tz={1}&req={2}'.format(hl,tz,json.dumps(req).replace(' ','').replace('+',' '))
    print explore_URL
    # Make the request once and reuse the response text
    response_text = requests.get(explore_URL).text
    print response_text.encode('utf8')
    return response_text

This function simulates the first API call and returns the response as a string.
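
For example, here is a hypothetical call, assuming geo and category have been set at module level (say geo = '' for worldwide and category = 0 for all categories):

geo = ''        # worldwide
category = 0    # all categories

# Fetch the explore response for a two-month bucket; the '+' in the keyword
# is turned into a space by the replace() calls inside get_data
response_text = get_data(datetime(2017, 1, 1), datetime(2017, 2, 28), 'taylor+swift')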

def get_csv(response_text):
    # Pull the widget request and token out of the explore response,
    # then call the CSV export endpoint with them
    request = get_csv_request(response_text)
    token = get_token(response_text)

    csv_response = requests.get('https://www.google.com/trends/api/widgetdata/multiline/csv?req={0}&token={1}&tz=-120'.format(request, token))
    return csv_response.text.encode('utf8')

def parse_csv(csv_contents):
    lines = csv_contents.split('\n')
    df = pd.DataFrame(columns=['date', 'value'])
    dates = []
    values = []
    # Skip the three header lines at the top of the CSV
    for line in lines[3:]:
        try:
            dates.append(line.split(',')[0].replace(' ', ''))
            values.append(line.split(',')[1].replace(' ', ''))
        except IndexError:
            pass
    df['date'] = dates
    df['value'] = values
    print df.head()
    return df

These two functions parse the response, pull out the token and the request for the CSV, simulate the second API call, and save the results as a pandas DataFrame (much better than the previous code, which had to save all the files to a local folder).
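
The helpers get_csv_request and get_token aren’t shown in the post (they live in the repo). For reference, here’s a minimal sketch of how they could work, assuming the explore response is the usual JSON payload prefixed with a few junk characters and containing a widgets list whose TIMESERIES entry holds the request and the token:

def _explore_json(response_text):
    # The explore API prepends ")]}'" and a newline before the JSON body
    return json.loads(response_text[response_text.find('{'):])

def get_token(response_text):
    # Token for the TIMESERIES widget, valid only for this page view
    widgets = _explore_json(response_text)['widgets']
    return [w for w in widgets if w.get('id') == 'TIMESERIES'][0]['token']

def get_csv_request(response_text):
    # The widget's request object, re-serialised for the csv endpoint's req= parameter
    widgets = _explore_json(response_text)['widgets']
    request = [w for w in widgets if w.get('id') == 'TIMESERIES'][0]['request']
    return json.dumps(request).replace(' ', '')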

The rest of the code is basically the same: you can read about the stitching process in the original post below.

Original version

If you search Google trends for a term like “taylor+swift”, and set your timescale to “2004 to present”, you’ll get an interesting graph demonstrating the pop star’s meteoric rise to stardom.

[Figure: Google Trends interest over time for “taylor+swift”, 2004 to present]

Google also gives you an option to download the data as a .CSV file; this is a great option if you want to compare trends in a more rigorous way than looking at their graphs side by side.

However, there’s an issue with the timescales: if you download data from 2004 to the present, Google will give you monthly or weekly data, but if you download data over a short enough range (a couple of months, say), Google will give you daily data (unless the search volume is negligible).

I wrote a small script that searches Google trends for a given term over a specified time frame, but does so over many small time intervals to obtain the fine-grained daily data, and then stitches it all together, using the full-timescale data to keep everything consistent.
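
The stitching works by rescaling each daily window so that its first day agrees with the full-timescale (weekly) value for that date, which puts all the windows on a common scale. Here is a toy illustration of the adjustment with made-up numbers (the real thing happens inside StitchFrames further down):

# Suppose the full-timescale series gives the week starting 2009-06-01 a volume of 40,
# and the daily download for the window containing that week reads:
daily = {'2009-06-01': 80, '2009-06-02': 100, '2009-06-03': 60}

# Rescale each day so that the first day of the week matches the weekly anchor value
anchor = float(daily['2009-06-01'])
adjusted = {day: 40 * value / anchor for day, value in daily.items()}
# adjusted == {'2009-06-01': 40.0, '2009-06-02': 50.0, '2009-06-03': 30.0}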

Because Google offers the option of downloading the results of a search as a .CSV file, we don’t need any nasty scraping tools and can achieve everything in Python using just the webbrowser package (we will be using pandas later to join everything together and do some computations). We also define the directories we’ll be using.

import webbrowser
import time
import os
import shutil
import copy
import re
import csv
import pandas as pd
import numpy as np

# ScrapeTwoMonths below expects a list of query terms, so the keyword is a one-element list
keyword = ["taylor+swift"]

path = '/Users/clintonboys/Downloads'

scrapings_dir = 'gt_{0}'.format(keyword[0])
if not os.path.exists(path+"/"+scrapings_dir):
    os.makedirs(path+"/"+scrapings_dir)

The largest timeframe over which Google will reliably return daily data is two months (the cutoff is probably 90 days, but most three-month periods are longer than that). So we start by defining a function that gets the data for a particular two-month period simply by opening the relevant CSV export URL with webbrowser.open; the browser stores the file in the /Downloads folder by default.

def ScrapeTwoMonths(keyword, year, startmonth):
    # keyword is a list of query terms; multiple terms are joined with %20
    print 'Scraping month '+str(startmonth)+' in '+str(year)
    URL_start = "http://www.google.com/trends/trendsReport?&q="
    URL_end = "&cmpt=q&content=1&export=1"

    queries = '%20'.join(keyword)

    # e.g. &date=5%2F2009%202m requests the two months starting May 2009
    date = '&date='+str(startmonth)+'%2F'+str(year)+'%202m'

    URL = URL_start+queries+date+URL_end

    webbrowser.open(URL)
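
For example, this call (using the keyword list defined earlier) opens the CSV download for the two months starting May 2009:

# Opens the browser on the export URL; the report lands in /Downloads as a report*.csv file
ScrapeTwoMonths(["taylor+swift"], 2009, 5)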

Now we define a function that gets all the two-month data for a particular timespan, together with the weekly data over the whole timespan, and stores all the results files in a single folder (we sleep for a few seconds between each call to ScrapeTwoMonths or the browser gets very confused).

def ScrapeRange(keyword, startmonth, startyear, endmonth, endyear):

    # Walk through the range in two-month steps, pausing between downloads
    for i in range(startmonth,11,2):
        ScrapeTwoMonths(keyword,startyear,i)
        time.sleep(7)
    for y in range(startyear + 1, endyear):
        for i in range(1,11,2):
            ScrapeTwoMonths(keyword,y,i)
            time.sleep(7)
    for i in range(1,endmonth,2):
        ScrapeTwoMonths(keyword,endyear,i)
        time.sleep(7)

    files = copy.deepcopy(os.listdir(path))

    # The browser saves duplicates as "report (1).csv", "report (2).csv", etc.;
    # normalise these to report1.csv, report2.csv, ...
    for i in range(0,len(files)):
        if files[i].lower().endswith('.csv'):
            try:
                if files[i][-5] == ")":
                    oldname = path+'/'+files[i]
                    newname = path+'/report'+files[i][-6]+'.csv'
                    os.rename(oldname,newname)
            except OSError:
                pass

    bimonthly_files = [fn for fn in os.listdir(path) if fn.lower().startswith('report')]

    # Move all the downloaded reports into the scrapings directory
    for file in bimonthly_files:
        shutil.move(path+"/"+file,path+'/'+scrapings_dir)

    full_path = path+'/'+scrapings_dir
    newfiles = copy.deepcopy(os.listdir(full_path))

    # Rename each file after its modification time so the files sort chronologically,
    # and flag anything small enough to suggest an empty download
    for i in range(0,len(newfiles)):
        oldname = full_path+'/'+newfiles[i]
        if os.path.getsize(oldname) < 800:
            print 'File '+oldname+' is unusually small...'
        newname = full_path+'/'+str(os.path.getmtime(full_path+'/'+newfiles[i]))[:-2]+".csv"
        os.rename(oldname, newname)

Now we use pandas to join all the bimonthly files together into a single big data frame. This is a bit fiddly because we have to chop all the irrelevant stuff off the CSV files before pandas will read them properly. We use a bit of regular expressions along the way, because they’re cool.

def CreateDailyFrame():

    files = copy.deepcopy(os.listdir(path+'/'+scrapings_dir))[:-1]
    print files
    date_pattern = re.compile(r'\d\d\d\d-\d\d-\d\d')
    # Strip each raw report down to just the date,value rows and save it as bimonthlyN.csv
    for i in range(0,len(files)):
        if files[i].lower().endswith('.csv'):
            oldname = path+'/'+scrapings_dir+'/'+files[i]
            newname = path+'/'+scrapings_dir+'/'+'bimonthly'+str(i)+'.csv'
            temp_file = csv.reader(open(oldname,'rU'))
            with open(newname,'wb') as write_to:
                write_data = csv.writer(write_to, delimiter=',')
                for row in temp_file:
                    if len(row)==2:
                        if re.search(date_pattern,row[0]) is not None:
                            write_data.writerows([row])
            os.remove(oldname)

    files = [fn for fn in copy.deepcopy(os.listdir(path+'/'+scrapings_dir))[:-1] if fn.lower().startswith('bimonthly')]

    frames_list = []

    # Read each cleaned bimonthly file and concatenate them all into one frame
    for file in files:
        df = pd.read_csv(path+'/'+scrapings_dir+'/'+file,index_col=None,header=None)
        frames_list.append(df)

    frame = pd.concat(frames_list,ignore_index=True)

    return frame

A similar function cleans up the weekly data and makes a pandas frame so everything is ready to be stitched together.

def CreateWeeklyFrame():

    # weekly_data.csv is the full-timescale (weekly) download for the same term,
    # saved into the scrapings directory
    date_pattern = re.compile(r'\d\d\d\d-\d\d-\d\d\s-\s\d\d\d\d-\d\d-\d\d')

    oldname = path+'/'+scrapings_dir+'/'+'weekly_data.csv'
    newname = path+'/'+scrapings_dir+'/'+'weekly.csv'
    temp_file = csv.reader(open(oldname,'rU'))
    with open(newname,'wb') as write_to:
        write_data = csv.writer(write_to, delimiter=',')
        for row in temp_file:
            if len(row) == 2:
                if re.search(date_pattern,row[0]) is not None:
                    write_data.writerows([row])
    os.remove(oldname)

    frame = pd.read_csv(newname,index_col=None,header=None)

    return frame

To stitch everything together, we first make the frames we need, using some regular expressions to break up the date ranges.

def StitchFrames():

    daily_frame = CreateDailyFrame()
    interim_weekly_frame = CreateWeeklyFrame()

    daily_frame.columns = ['Date', 'Daily_Volume']
    daily_frame['Date'] = pd.to_datetime(daily_frame['Date'])

    interim_weekly_frame.columns = ['Date_Range', 'Weekly_Volume']
    date_pattern = re.compile(r'\d\d\d\d-\d\d-\d\d')

    # Split each "YYYY-MM-DD - YYYY-MM-DD" range into a start date and an end date
    startdates = []
    enddates = []

    for i in range(0,len(interim_weekly_frame['Date_Range'])):
        startdates.append(re.findall(date_pattern,interim_weekly_frame['Date_Range'][i])[0])
        enddates.append(re.findall(date_pattern,interim_weekly_frame['Date_Range'][i])[1])

    weekly_frame = pd.DataFrame(data=[startdates,enddates,interim_weekly_frame['Weekly_Volume'].tolist()]).transpose()
    weekly_frame.columns = ['Start_Date', 'End_Date', 'Weekly_Volume']
    weekly_frame['Start_Date'] = pd.to_datetime(weekly_frame['Start_Date'])
    weekly_frame['End_Date'] = pd.to_datetime(weekly_frame['End_Date'])

Now, still inside StitchFrames, we define weekly date bins using pandas.

    bins = []

    for i in range(0,len(weekly_frame)):
        bins.append(pd.date_range(weekly_frame['Start_Date'][i],periods=7,freq='d'))

    weekly_frame = weekly_frame.set_index('Start_Date')

    daily_frame = daily_frame.set_index('Date')

    final_data = {}

We now stitch everything together, adjusting each individual week’s data so it is consistent with the start date’s measure from the long-timescale data.

    for i in range(0,len(bins)):
        for j in range(0,len(bins[i])):
            # float() avoids integer division under Python 2
            final_data[bins[i][j]] = weekly_frame['Weekly_Volume'][str(bins[i][0].date())]*float(daily_frame['Daily_Volume'][str(bins[i][j].date())])/daily_frame['Daily_Volume'][str(bins[i][0].date())]

Finally we rescale everything again so it is consistent with Google’s 0-100 scale.

    # Build the final frame, sort it by date, then rescale to Google's 0-100 range
    final_data_frame = pd.DataFrame.from_dict(final_data,orient='index').sort_index()
    final_data_frame[0] = np.round(final_data_frame[0]/final_data_frame[0].max()*100)

    final_data_frame.columns = ['Volume']
    final_data_frame.index.names = ['Date']

    final_name = path+'/'+scrapings_dir+'/'+'final_output.csv'

    final_data_frame.to_csv(final_name, sep=',')

This code is available in a GitHub repo. There’s definitely room for a bunch of improvements, but it does what I need it to do at the moment.

UPDATE (24/10/2016): Google has made significant changes to their Trends product which means this code no longer works. I’m working on updating it and will write a new post when this is done.

UPDATE 2 (18/2/2017): I have fixed the code so the trendy scraper is now working again. The updated code is on GitHub. I made a few improvements along the way.