James P Houghton



Scraping view data from YouTube

29 Oct 2012

YouTube shows a chart of cumulative views for a video, built with Google's (deprecated) Image Charts API. Each chart is constructed as a URL call to the API, which means that you can back out the data embedded in it with some careful parsing of the page source. Here's an example:



The above is a dynamically created chart (check out the source for this page if you like) based upon the following call to the chart API (to which I've added comments):

http://chart.apis.google.com/chart?
cht=lc:nda&            //Chart Type
chs=460x100&           //The chart size (width x height)
chf=bg,s,F4F4F4&       //Background Fill (solid)
chco=5F8FC9&           //Series Colors
chls=1.5&              //Line Style
chg=0,-1,1,1&          //Grid Lines
chxt=y,x&              //Visible Axes
chxtc=0,0&             //Axis tick mark style
chxs=0N*s* ,333333,10|1,333333,10&     //Axis Label Style
chxl=1:|07/15/12|09/05/12|10/28/12&    //Axis Labels
chxp=1,5,50,95&                        //Axis Label Positions
chxr=0,0,709265784|1,0,100&            //Axis Ranges
chd=t:0.0,0.1,0.1,0.2,0.3,0.3,0.4,0.5,0.5,0.6,0.7,0.7,0.8,1.0,
      1.1,1.2,1.4,1.5,1.7,1.9,2.1,2.3,2.5,2.8,3.0,3.5,3.7,3.9,        
      4.1,4.3,4.6,4.9,5.2,5.5,5.8,6.2,6.5,7.0,7.9,8.4,8.9,9.4,
      10.0,10.5,11.2,11.9,12.7,13.4,14.1,14.9,16.5,17.5,18.5,
      19.5,20.4,21.4,22.4,23.5,24.9,26.2,27.4,28.7,31.1,32.5,
      34.0,35.7,37.1,38.5,39.8,41.2,42.4,44.0,45.8,47.4,48.7,
      51.2,52.5,54.0,55.7,57.0,58.2,59.4,60.5,61.7,63.0,64.5,
      65.9,68.0,69.1,70.2,71.5,73.0,74.3,75.4,76.5,77.7,78.9,
      80.2,81.8,83.3&                 //The actual chart data
chm=B,dce7eed4,0,0,0|AA,333333,0,0,10|AB,333333,0,0,10 //Line Fill and Markers

To interpret this image I consulted the charts API reference.
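
The same parameters can be pulled out of a chart URL programmatically. Here is a minimal Python sketch, assuming the URL has already been copied out of the page source and that the data uses the text encoding ("chd=t:"). The function name parse_chart_url and the shortened example URL are my own illustrations, not part of any API:

```python
from urllib.parse import urlsplit, parse_qs

# Shortened version of the chart call above (most parameters and data points dropped).
chart_url = (
    "http://chart.apis.google.com/chart?"
    "cht=lc:nda&chs=460x100"
    "&chxl=1:|07/15/12|09/05/12|10/28/12"
    "&chxr=0,0,709265784|1,0,100"
    "&chd=t:0.0,0.1,0.1,0.2,81.8,83.3"
)


def parse_chart_url(url):
    """Return (data values, x-axis labels, y-axis maximum) from a chart URL."""
    params = parse_qs(urlsplit(url).query)

    # chd=t:v1,v2,...  -> list of floats (only the text encoding is handled)
    encoding, _, data = params["chd"][0].partition(":")
    if encoding != "t":
        raise ValueError("only text-encoded chd is handled in this sketch")
    values = [float(v) for v in data.split(",")]

    # chxl=1:|label|label|label  -> labels of the x axis (axis index 1)
    labels = params["chxl"][0].split("|")[1:]

    # chxr=0,0,709265784|1,0,100  -> upper bound of the y axis (axis index 0)
    y_max = None
    for axis_range in params["chxr"][0].split("|"):
        index, low, high = axis_range.split(",")[:3]
        if index == "0":
            y_max = float(high)

    return values, labels, y_max


values, labels, y_max = parse_chart_url(chart_url)
```

This assumes a single labelled axis in "chxl", as in the call above; a chart with labels on several axes would need a more careful split.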

The data is embedded in the "chd" parameter as 100 values between 0 and 100. We can save this as a .csv file and import it into Excel. The bounds of the x axis, corresponding to the first and last values in "chd", are the first and last labels listed in "chxl", and the y bounds are found in "chxr". We just have to interpolate x values between "07/15/12" and "10/28/12" for each of the data points, and (as the y axis starts at zero and the "chd" values are percentages of the axis range) scale the y values by the y limit:
   x(i) = x_0 + (x_limit-x_0)/(100-1)*i,   i = 0, ..., 99
   y(i) = y'(i)/100 * y_limit
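
If you would rather skip the spreadsheet, the rescaling can be sketched in Python. This assumes the first and last data points sit exactly on the first and last axis labels (so the step is the date range divided by the number of points minus one), and that each "chd" value is a percentage of the y limit. The function name rescale is hypothetical:

```python
from datetime import datetime


def rescale(values, first_label, last_label, y_limit, fmt="%m/%d/%y"):
    """Map chart values (0-100) to (date, view count) pairs."""
    x_0 = datetime.strptime(first_label, fmt)
    x_limit = datetime.strptime(last_label, fmt)
    n = len(values)
    # Evenly spaced dates: point 0 lands on x_0, point n-1 on x_limit.
    step = (x_limit - x_0) / (n - 1)
    dates = [x_0 + step * i for i in range(n)]
    # Values are percentages of the y-axis maximum, and the axis starts at zero.
    views = [v / 100.0 * y_limit for v in values]
    return dates, views


# Three illustrative points rather than the full 100-value series.
dates, views = rescale([0.0, 50.0, 100.0], "07/15/12", "10/28/12", 709265784)
```
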
Then we are able to replicate the chart on our own:
From here it's relatively easy to calculate views per day or other statistics.
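
As a rough illustration of the views-per-day calculation, here is a small Python sketch that differences a cumulative series. It assumes you already have a list of dates and a matching list of cumulative view counts; the function name views_per_day and the sample numbers are made up for the example:

```python
from datetime import datetime


def views_per_day(dates, cumulative_views):
    """Average daily views over each interval between consecutive points."""
    rates = []
    for i in range(1, len(dates)):
        days = (dates[i] - dates[i - 1]).total_seconds() / 86400.0
        rates.append((cumulative_views[i] - cumulative_views[i - 1]) / days)
    return rates


# Made-up sample data, not taken from the chart above.
sample_dates = [datetime(2012, 7, 15), datetime(2012, 7, 17), datetime(2012, 7, 18)]
sample_views = [0, 1000, 4000]
daily = views_per_day(sample_dates, sample_views)
```

Keep in mind that the chart values are rounded to 0.1, which for a y limit of 709265784 is a step of roughly 709,000 views, so the daily rates for the early part of the curve will be coarse.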

You can also get information for a YouTube video with a call to the YouTube REST API, although this doesn't give historical data:
   https://gdata.youtube.com/feeds/api/videos/{VideoID}?v=2&alt=json
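
Filling in the {VideoID} placeholder can be sketched like this in Python. The function name video_feed_url and the video ID used below are hypothetical, and the sketch only builds the URL without making the request:

```python
from urllib.parse import quote


def video_feed_url(video_id):
    """Build the feed URL shown above for a given video ID."""
    # quote() guards against characters that are not valid in a URL path segment.
    return "https://gdata.youtube.com/feeds/api/videos/{}?v=2&alt=json".format(
        quote(video_id, safe="")
    )


# "abc123XYZ_-" is a placeholder, not a real video ID.
url = video_feed_url("abc123XYZ_-")
```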




© 2016 James P. Houghton