James P Houghton


Using Bibliographic Coupling to Identify Seminal Works

12 Feb 2014

Sometimes it's helpful to identify the seminal papers in a certain field: the ones that introduced revolutionary new concepts or applied them in interesting ways. One way to do this is with Bibliographic Coupling: if you know one of the archetypal papers in a field, looking at who cites that paper and what else they cite should tell you what other works are important.

For instance, assume I'm interested in reading the seminal papers that relate to Markov Chain Monte Carlo, and from the first link in the Wikipedia references, I identify this paper by Andrieu et al. as being representative of the field. According to Reuters Web of Knowledge, the Andrieu paper has been cited 324 times. What other papers do those 324 citing papers think are important?

I can download the records and reference lists of these 324 citing papers from the Web of Knowledge by following this procedure:

1. Go to Web of Science (probably through a university proxy)
2. Search for and find the seed paper
3. Click through to the list of citing papers (papers that cite the seed)
4. Drop down the save-to menu and select 'other file formats'
5. Select records 1 through n (up to 500 at once)
6. In 'Record Content' choose 'Full Record and Cited References' - note, this may only be available through a university subscription
7. In 'File Format' select 'Tab-Delimited (Win, UTF-8)'
8. Save to a working directory

I can load this data into Python, specifically into a pandas DataFrame:

   import pandas as pd
   refs = pd.read_csv('Wiki_MCMC.txt', sep='\t', index_col=False, encoding='utf-8')
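
Before going further, it can help to confirm that the export really contains the columns used later in this post (`AU`, `PY`, `SO`, `VL`, `BP`, `DI`, `CR`). The following is a minimal sketch using only the standard library; the helper name `missing_fields` and the sample row are hypothetical, not part of the original notebook.

```python
import csv
import io

# Field tags used in this post: authors, year, source, volume,
# first page, DOI, and cited references
REQUIRED_FIELDS = ['AU', 'PY', 'SO', 'VL', 'BP', 'DI', 'CR']

def missing_fields(handle, required=REQUIRED_FIELDS):
    """Return the required column names absent from a tab-delimited export."""
    reader = csv.reader(handle, delimiter='\t')
    header = next(reader)
    # Drop a byte-order mark in case the file was opened without 'utf-8-sig'
    header = [name.lstrip('\ufeff') for name in header]
    return [name for name in required if name not in header]

# Hypothetical stand-in for a real export file (header plus one made-up row)
sample = io.StringIO(
    "AU\tPY\tSO\tVL\tBP\tDI\tCR\n"
    "Doe, J\t2010\tJ EXAMPLE\t1\t2\t10.0000/a\t"
    "Roe R, 2001, J EXAMPLE, V1, P1, DOI 10.0000/b\n"
)
print(missing_fields(sample))
```

With a real file, the same function could be called on `open('Wiki_MCMC.txt', encoding='utf-8-sig', newline='')`. An empty list means every expected column is present.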

To get a sense for the data, I'll plot a histogram of the number of citations made per year:
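
The plotting code is not shown here, so this is only a rough sketch of such a tally, not the original plot. It counts papers per publication year and prints text bars using the standard library; `year_histogram` and the sample years are hypothetical. With the DataFrame above, the same counts would come from `refs['PY'].value_counts().sort_index()`.

```python
from collections import Counter

def year_histogram(years, width=40):
    """Return one text bar per publication year, scaled so the peak is `width` wide."""
    # y == y is False only for NaN, so missing years are dropped
    counts = Counter(int(y) for y in years if y == y)
    if not counts:
        return []
    peak = max(counts.values())
    lines = []
    # Years with no papers are simply absent from the output
    for year in sorted(counts):
        bar = '#' * max(1, round(width * counts[year] / peak))
        lines.append(f'{year} {bar} {counts[year]}')
    return lines

# Hypothetical publication years standing in for the PY column
sample_years = [2004, 2005, 2005, 2007, 2007, 2007, float('nan')]
for line in year_histogram(sample_years, width=6):
    print(line)
```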

Now I want to create a graph using NetworkX, a Python library for representing complex networks. Each paper will be a node in this graph, and each citation will form a directed link from the citing paper to the cited paper.

   import networkx as nx
   CG = nx.DiGraph()

I'll iterate through the dataframe and in each row, add a node representing the citing paper based on data in the row, and then look at each of its references. To extract those references, I'll split the cell which contains the list of references by the semicolon delimiter ';' and use a regular expression to organize the data:

   import re
   pattern = r'\s+(?P<Author>.*?),(\s+(?P<Year>\d{4}),)?(\s+(?P<Journal>.*?),)?(\s+V(?P<Volume>\d*?),)?(\s+P(?P<Page>\d*?),)?(\s+DOI\s+(?P<DOI>.*?))?;'
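
To see what the pattern extracts, here is a small sketch that applies it to a single entry. The entry is a made-up placeholder in the export's style, not a real paper. Because the pattern starts with `\s+` and ends with `;`, the sketch pads the entry on both sides before searching.

```python
import re

# The pattern from the post, written as a raw string
pattern = r'\s+(?P<Author>.*?),(\s+(?P<Year>\d{4}),)?(\s+(?P<Journal>.*?),)?(\s+V(?P<Volume>\d*?),)?(\s+P(?P<Page>\d*?),)?(\s+DOI\s+(?P<DOI>.*?))?;'

# Hypothetical cited-reference entry (placeholder author, journal and DOI)
entry = 'Doe J, 2001, J EXAMPLE, V12, P34, DOI 10.0000/example.1'

# Pad with a leading space and a trailing ';' so the pattern can anchor
match = re.search(pattern, ' ' + entry + ';')
fields = match.groupdict()
print(fields)

# Every optional field except the DOI must be followed by a comma, so an
# entry that ends without a DOI does not match at all
no_doi = re.search(pattern, ' Doe J, 2001, J EXAMPLE, V12, P34;')
print(no_doi)
```

One consequence is that cited references without a DOI are dropped at this stage rather than added to the graph with a missing key.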

I'll use the Digital Object Identifier (DOI) of each document as the key for each node. (An explanation of the Web of Science export file is available here.)

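   for index, ref in refs.iterrows():

       # Citing paper: attributes come from the Web of Science field tags
       node1_attr = {'Author': ref['AU'], 'Year': ref['PY'], 'Journal': ref['SO'],
                     'Volume': ref['VL'], 'Page': ref['BP'], 'DOI': ref['DI']}

       # Pass attributes as keywords (attr_dict= was removed in networkx 2.0)
       CG.add_node(node1_attr['DOI'], **node1_attr)

       # An empty cited-reference cell is read as NaN rather than a string
       if not isinstance(ref['CR'], str):
           continue

       for cite in ref['CR'].split(';'):
           # Pad each entry so it meets the pattern's leading \s+ and trailing ';'
           for m in re.finditer(pattern, ' ' + cite.strip() + ';'):
               node2_attr = m.groupdict()
               if node2_attr['DOI'] is None:
                   continue  # the DOI is the node key, so skip matches without one
               CG.add_node(node2_attr['DOI'], **node2_attr)
               CG.add_edge(node1_attr['DOI'], node2_attr['DOI'])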
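
The per-row step can also be looked at in isolation. Below is a minimal sketch of a helper, hypothetically named `cited_dois`, that turns one cited-references cell into a list of DOIs. The cell contents are made-up placeholders, and the helper is an illustration rather than the notebook's code.

```python
import re

# Same pattern as in the post, as a raw string
pattern = r'\s+(?P<Author>.*?),(\s+(?P<Year>\d{4}),)?(\s+(?P<Journal>.*?),)?(\s+V(?P<Volume>\d*?),)?(\s+P(?P<Page>\d*?),)?(\s+DOI\s+(?P<DOI>.*?))?;'

def cited_dois(cr_cell):
    """Return the DOIs found in one cited-references cell, in order of appearance."""
    # Empty cells come back from pandas as NaN (a float), so only strings are parsed
    if not isinstance(cr_cell, str):
        return []
    dois = []
    for cite in cr_cell.split(';'):
        # Pad each entry so it satisfies the leading \s+ and the trailing ';'
        match = re.search(pattern, ' ' + cite.strip() + ';')
        if match and match.group('DOI'):
            dois.append(match.group('DOI'))
    return dois

# Hypothetical cell: two entries with DOIs and one without
cell = ('Doe J, 2001, J EXAMPLE, V12, P34, DOI 10.0000/example.1; '
        'Roe R, 1999, EXAMPLE BOOK; '
        'Poe P, 2005, ANN EXAMPLE, V3, P7, DOI 10.0000/example.2')
print(cited_dois(cell))
```

Each DOI returned for a row corresponds to one edge from the citing paper to a cited paper.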

There may be some nodes that didn't have DOI attributes, which we should remove:

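   # NaN never equals itself, so drop every node whose key is not a DOI string
   CG.remove_nodes_from([n for n in list(CG.nodes()) if not isinstance(n, str)])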
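
Removing a NaN key by value is fragile, because NaN never compares equal to anything, including another NaN. NetworkX keeps its nodes in dictionaries, so the effect can be shown with a plain dict. This is only an illustration with made-up keys.

```python
nan_a = float('nan')
nan_b = float('nan')

# NaN never compares equal, not even to another NaN
print(nan_a == nan_b)

# A dict finds a NaN key only when handed the very same object
nodes = {nan_a: 'paper without DOI', '10.0000/example.1': 'paper with DOI'}
print(nan_a in nodes)
print(nan_b in nodes)

# Filtering on type sidesteps the problem: keep only string keys
with_doi = {key: value for key, value in nodes.items() if isinstance(key, str)}
print(sorted(with_doi))
```

If each missing DOI arrives as a separate NaN object, a single removal by value would not catch them all, whereas a type filter does.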

Here's a quick image showing what the core of that network diagram looks like. It's a little messy, but nodes closer to the center are more connected within the diagram. There are also some clusters, which probably represent subfields of Markov Chain Monte Carlo analysis.

Now that I have the network diagram built, I can query it to find the nodes with the most in-links, which represent citations. To do this, I ask the graph to give me the in-degree of each node, sort descending, and print the top 10.

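   # in_degree() yields (node, degree) pairs; dict() turns them into a mapping
   in_degree_list = pd.Series(dict(CG.in_degree()))
   in_degree_list = in_degree_list.sort_values(ascending=False)

   for citation, degree in in_degree_list.head(10).items():
       print(degree, CG.nodes[citation])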
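
The ranking itself is just a count of incoming edges. The same idea can be sketched without any graph library, assuming the edges are available as (citing, cited) pairs. The DOIs here are made-up placeholders.

```python
from collections import Counter

# Hypothetical (citing, cited) DOI pairs standing in for the graph's edges
edges = [
    ('10.0000/p1', '10.0000/seed'), ('10.0000/p1', '10.0000/a'),
    ('10.0000/p2', '10.0000/seed'), ('10.0000/p2', '10.0000/a'),
    ('10.0000/p3', '10.0000/seed'), ('10.0000/p3', '10.0000/b'),
]

# A directed graph stores each edge once, so de-duplicate before counting.
# In-degree of a node = number of distinct edges pointing at it.
in_degree = Counter(cited for _, cited in set(edges))

# Rank by in-degree, highest first; ties break alphabetically for a stable order
ranking = sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))
for doi, degree in ranking[:10]:
    print(degree, doi)
```

In this toy data every citing paper cites the seed, so it ranks first, and the entries after it are the co-cited works of interest.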

Clearly the node with the highest in-degree will be our seed paper. The second paper and onward represent the papers that are most likely to be worth reading. In our example case, the results are:

227 citations: Andrieu C, 2003
32 citations: HASTINGS WK, 1970
29 citations: METROPOLIS N, 1953
22 citations: Blei DM, 2003
21 citations: Green PJ, 1995
13 citations: Griffiths TL, 2004
13 citations: CHIB S, 1995
11 citations: Arulampalam MS, 2002
10 citations: GELFAND AE, 1990
10 citations: TIERNEY L, 1994

After the seed paper, the top two results are the papers that outline the Metropolis-Hastings algorithm, one of the most popular methods for Markov Chain Monte Carlo. These are clearly seminal works in that field.

For code and example data for this post, see this IPython notebook and my GitHub.




© 2016 James P. Houghton