Research Paper Recommender Systems: A Random-Walk Based Approach
Marco Gori Augusto Pucci
Dipartimento di Ingegneria Dipartimento di Ingegneria
dell’Informazione dell’Informazione
University of Siena University of Siena
Via Roma, 56 Siena, Italy Via Roma, 56 Siena, Italy
marco@[Link] augusto@[Link]
Abstract fact it requires a user to select an initial small subset of doc-
uments, relevant for the topic he is writing about. Then the
Every day researchers from all over the world have to fil- algorithm can spread its boosting effect, due to selected pa-
ter the huge mass of existing research papers with the cru- pers, through the citation graph in order to discover other
cial aim of finding out useful publications related to their interesting and useful resources. The paper is organized as
current work. In this paper we propose a research pa- follows. Subsection 1.1 contains a review of paper recom-
per recommending algorithm based on the Citation Graph mender systems. In section 2 paper recommender system
and random-walker properties. The PaperRank algorithm problem is formalized. Section 3 describe PaperRank algo-
is able to assign a preference score to a set of documents rithm in details. We describe the data set we build to test
contained in a digital library and linked one each other by our algorithm in section 4 and we show performance mea-
bibliographic references. A data set of papers extracted by sures in section 5. Finally in section 6 we summarize the
ACM Portal has been used for testing and very promising main contribution of the present paper and we show some
performances have been measured. possible future extensions of the given algorithm.
1.1 Related Work
1. Introduction In literature some algorithms have been proposed at-
tempting to overcome valuable research papers seeking.
Writing papers, technical reports, presentations and so This class of tasks is usually defined as ”Paper Recom-
on is a crucial activity for lots of people jobs all over the mender Systems”. A simplified example of recommender
world. Technology already provided good means to write, algorithm based on text and citations can be found in [9],
but we need better means to collect the information we need but this system just creates recommendations inside a sin-
for writing something interesting and meaningful. If we gle digital book. A more interesting example is [6], here
focus our attention on the production of research papers the authors propose a collaborative filtering based approach
we have a very interesting applicative scenario where the to paper recommendation and they use the Citation Graph
knowledge refinement and the bibliographical search is a between papers to create the ratings, they also tried six dif-
key part of the work. A researcher have to be well aware ferent approaches on a subset of ResearchIndex 186,000
of most recent improvement in the topic he is studying and research papers contained in Research Index. Another in-
writing about, but the number of research paper published teresting approach has been proposed in [1], the idea is
is exponentially growing, so he needs to choose and select that researchers from the same area tend to be interested
papers from the huge quantity of potential candidates. This to the same articles, so it is possible to improve search re-
filtering process is generally tedious and time consuming. sults by using recommendations based on previous searches
Typical tools adopted are keywords based search and paper performed by other people with similar interests. Paper rec-
citations following, unluckily both techniques are quite in- ommender systems are just a special case of recommender
accurate and with a low coverage level. In this paper we systems research area. In general a recommender system
propose a research paper recommending algorithm based makes personalized item suggestions by extracting knowl-
on the citation graphs and random-walker properties. Pa- edge from the previous user interactions with the system.
perRank is a support for the resource filtering process, in Such services are particularly useful in the modern elec-
tronic marketplace which offers an unprecedented range of find out valuable papers to suggest to an user, on the ba-
products. Several recommender systems have been devel- sis of papers already chosen to be part of the current pa-
oped that cope with different kind of items/products, see [8] per bibliography (Rp̄ ). If we consider the set of papers in
for a review. Rp̄ , we can exploit the information present in graph GC in
order to ”spread” user preferences through the Correlation
2. Paper Recommendation Problem Graph. Obviously we have to properly control the prefer-
ence flow in order to transfer high score values to papers
In this section we better formalize the paper recom- that are strongly related to papers already present in Rp̄ .
mender system problem scenario. We consider an active The spreading algorithm we apply has to possess two key
user u who is writing a paper containing references, we properties: propagation and attenuation. These properties
denote this ”work in progress” document as p̄. We sup- reflect two key assumptions. First of all if a paper pk is re-
pose user u can access a digital library to consult materi- lated to one or more good papers in p̄ reference list, then
als related to the topic he is currently working on. The set paper pk will also be a good suggestion for the user. If we
of documents in a given digital library will be referred as analyse the Correlation Graph we can easily discover re-
D = {pi : i = 1, · · · , n}, moreover, for every paper pi , we lationships between papers and also the strength of these
consider set Rpi which contains every paper cited by pi so connections, that is the weight associated to every link con-
D ⊇ Rpi = {pj : pj is cited by pi }. We also suppose user necting two papers. The second important factor we have
u has already cited in its current paper reference list some to take into account is attenuation. Good papers in Rp̄ have
papers belonging to D, so Rp̄ 6= ∅. The aim of a recom- to transfer their positive influence through the Correlation
mender system in this context is to compute a meaningful Graph, but this effect decreases its power if we move fur-
score s(pi ) for every pi ∈ D, such that the higher the score ther and further away from good papers, moreover if a good
of paper pi , the higher has to be its relevance with respect paper pi is connected to two or more nodes, these have to
to the topic of paper p̄. It is also possible to have a graph- share the boosting effect from pi according to the weights of
ical interpretation of the cross-citation structure of D. We their connections as computed in matrix C. PageRank algo-
can introduce the Citation Graph GC̃ . This graph contains rithm (see [7]) has both propagation and attenuation prop-
n nodes, one for each paper in D. From the edges point of erties we need, furthermore thanks to significant research
view, we consider graph GC̃ to be undirected and it contains efforts we can compute PageRank in a very efficient way
the edge pairs (pi , pj ) and (pj , pi ) if and only if paper pi (see [4]). Consider a generic graph G = (V, E), where V is
cites paper pj or vice versa. So more formally: the set of nodes connected by directed links in E, the classic
PageRank algorithm computes an importance score P R(n)
((pi , pj ), (pj , pi ) ∈ Edge{GC̃ }) ⇐⇒ pi ∈ Rpj ∨ pj ∈ Rpi for every node n ∈ V according to graph connectivity: a
node will be important if it is connected to important nodes
where Edge{GC̃ } is the set of edges in GC̃ . Matrix C˜ is the with a low out-degree. So the PageRank score for node n is
Connectivity Matrix corresponding to the Citation Graph, defined as:
such that:
1 if (pi , pj ) ∈ Edge{GC̃ }
C˜i,j = P R(n) = α ·
X P R(q)
+ (1 − α) ·
1
(1)
0 otherwise
ωq |V|
q:(q,n)∈E
where ∀i, C˜i,i = 0 and C˜ is a symmetric matrix. We
normalize matrix C˜ in order to obtain a stochastic matrix where ωq is the out-degree of node q, α is a decay fac-
C̃
Ci,j = ωi,j where ωj is the sum of entries in j − th column tor1 . Classic PageRank can be extended by generalizing
j
˜ C is the Correlation Matrix, every entry contains the equation 1:
of C.
correlation index between paper pairs. The Correlation Ma- PR = α · M · PR + (1 − α) · d (2)
trix can be also considered as a weighted connectivity ma-
trix for the Correlation Graph GC . The weight associated to where M is a stochastic matrix, its non-negative entries
link (pi , pj ) will be Ci,j , note that while C˜ is symmetrical, has to sum up to 1 for every column, and vector d has non-
C is not, so the weight associated to (pi , pj ) can differ from negative entries summing up to 1. Vector d can be tuned in
(pj , pi ) weight. order to bias the PageRank by boosting nodes correspond-
ing to high value entries and matrix M controls the prop-
3. PaperRank Algorithm agation and attenuation mode. Biased PageRank has been
analysed in [5] and custom static score distribution vectors
d have been applied to compute topic-sensitive PageRank
The idea underlaying the PaperRank algorithm is that we
can use the model expressed by the Correlation Graph to 1A common choice for α is 0.85
[3] and for combating web spam [2]. We present the Pa- Topic Number of Papers
perRank algorithm, that is a biased version of PageRank Distributed Databases 4,961
designed to be applied to a recommender system. Paper- Feature Extraction 3,463
Rank equation can be easily derived from equation 2. We Hidden Markov Model 3,978
use graph GC to compute a PaperRank vector IR, where its Multiprocessors 4,292
i − th component IRi correspond to the PaperRank value Page Rank 5,225
of paper pi , obviously these values also depend on the set of Recommendation System 4,367
papers in Rp̄ . In order to compute PaperRank, the stochas- Relevance Feedback 6,278
tic matrix M in equation 2 will be the Correlation Matrix Semantic Web 6,423
C and the dependencies between IR and Rp̄ can be mod- Support Vector Machine 4,206
eled by simply building a d static score distribution vector
depending on Rp̄ . The resulting equation is: Table 1. Graph dimension for every topic in
the data set.
IR = α · C · IR + (1 − α) · d (3)
where d has been build by defining the unnormalized d̃,
with respect to the i − th component, as:
and Qτ . Note that even if the maximum number of papers
˜ 1 if pi ∈ Rp̄ crawled is limited to 2000, that is the limit of papers down-
di =
0 if pi 6∈ Rp̄ loadable by the crawler for each topic, but the number of
identified cited documents is usually higher.
˜
So the normalized d vector will simply be d = d˜ . Paper- Table 1 shows the 9 selected topics and the number of
|d|
nodes of corresponding Citation Graphs.
Rank, as defined in equation 3, can be computed also iter-
atively. The corresponding dynamic system has to be run
for different Rp̄ in order to compute effective paper sugges- 5. Experimental Results
tion, luckily it only needs on average about 20 iterations to
converge. The interpretation of IR score vector is straight- Many different performance measure techniques have
forward, PaperRank scores induce a sorting of papers ac- been adopted in research paper recommender system litera-
cording to their expected liking for a given user. The higher ture in order to proof the effectiveness of a proposed algo-
is the PaperRank for a paper, the higher is the probability rithm. There are two different class of approaches: online
that it will be a valuable suggestion for a user. and offline experiments. In the online experiment human
subjects are involved, they try an algorithm and it has to be
4. Dataset collected user opinion on the quality of predictions made by
a given method. In the offline experiments you can remove
some citations from the test dataset and then attempting to
In order to test the effectiveness of algorithm proposed
predict those missing citations, otherwise you can focus on
in section 3 we build a data set of papers to measure per-
a cluster of documents you already know to be related one
formances. The data set has been obtained by extensively
each other and you can test if an algorithm is able to ”com-
crawling the ACM Portal web site2 that is a wide collec-
plete” this cluster if provided just a small part of it, that is
tion of high quality papers in Computer Science. First of all
the testing approach we decided to use in the present paper.
we chose 9 different topics, identified by one or more key-
In this paper the data sets described in section 4 has been
words, then ACM Portal has been queried for each topic, the
used to test PaperRank algorithm. Each data set τ is formed
result is a ranked list of papers within the digital library that
by the paper set Qτ and graph GC τ . It is possible to ran-
are good candidates to satisfy the query obtained by com-
domly choose a part of Qτ , that can be used as the reference
bining keywords related to a given topic. Now for every
list Rp̄ of an hypothetical current paper p̄ about topic τ . So
topic τ it is possible to select the first 20 papers as shown
the remaining part of Qτ will be the ”target set”, and it can
in the result list for that topic keywords by ACM Portal, we
be denoted as Tτ . Due to the way we build Qτ , the target
refer each of these sets as Qτ . The crawler has to look for
set, obtained by splitting Qτ , is composed by papers which
papers cited by papers contained in Qτ and then the process
are related to topic τ but not yet present in the current refer-
continues iteratively until a maximum number of paper ci-
ence list Rp̄ . A good scoring algorithm have to assign high
tations has been found (in this case the fixed maximum is
scores to papers contained in Tτ . Following this philosophy
2000). The crawler output will be a citation graph GC τ (ob-
we tested PaperRank algorithm on 9 data set we defined.
viously we have a different citation graph for every topic)
We also compared performances with respect to different
2 [Link] Tτ sizes, in particular, for every considered τ we randomly
Figure 1. Percentage of interesting papers ranked in the first 10 positions for every data set and with
increasing test set dimension.
chose 20%, 30%, 40% and 50% of papers contained in each on other digital libraries. Moreover we wish to add to the
Qτ to be part of the corresponding Tτ . Moreover experi- new version of PaperRank the capability of using also neg-
ments have been done in a 10-fold cross-validation mode, ative examples, in this case a set of papers selected by the
in fact for every τ and for every target set/reference list pro- user that are not relevant for the topic, in fact it could be re-
portion we randomly shuffled Qτ 10 times and each time ally valuable to introduce user feedback, after the first round
we obtained different Tτ (and Rp̄ ). Now we ran PaperRank of suggestions, into the paper recommendation task.
for every experimental setup, then we measured the abso-
lute ranking position of papers in every Tτ and counted the References
number of papers in Tτ which are present in the first 10,
20 and more positions. Results have been very promising [1] N. Agarwal, E. Haque, H. Liu, and L. Parsons. Research pa-
and they are shown in figure 1 with respect to the first 10 per recommender system: A subspace clustering approach. In
positions, we do not report plots for the percentage of inter- International Conference on Web-Age Information Manage-
esting papers ranked in the first 20 and more positions be- ment (WAIM) 2005, 2005.
cause PaperRank has been able to rank 100% of the papers [2] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating
web spam with trustrank. Technical report, Stanford Univer-
in Tτ within first 20 positions. On the y-axes there is the
sity, 2004.
percentage (1 correspond to 100%) of papers in Tτ placed [3] T. Haveliwala. Topic-sensitive pagerank. In Eleventh Inter-
by PaperRank within first 10 positions, these values are ob- national Conference on World Wide Web, 2002.
tained by averaging 10 fold-cross validation corresponding [4] S. Kamvar, T. Haveliwala, C. Manning, and G. Golub. Extrap-
percentages, one for each trial. Note that in most of the olation methods for accelerating pagerank computations. In
cases we obtain values higher than 50%, that is true for the Twelfth International Conference on World Wide Web, 2003.
[5] A. Langville and C. Meyer. Deeper inside pagerank. Internet
biggest Tτ too. It shows that we need just a small set of ini-
Mathematics, 1(3):335–380, 2003.
tial reference papers Rp̄ in order to rank properly the other [6] S. McNee, I. Albert, D. Cosley, P. Gopalkrishnan, S. Lam,
papers in the Citation Graph. A. Rashid, J. Konstan, and J. Riedl. On the recommending
of citations for research papers. In ACM Conference on Com-
6. Conclusions and Future Work puter Supported Cooperative Work, 2002.
[7] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank
citation ranking: Bringing order to the web. Technical report,
In this paper, we present a random–walk based scoring Stanford University, 1998.
algorithm, which can be used to recommend papers accord- [8] J. Schafer, J. Konstan, and J. Riedl. Electronic commerce rec-
ing to a small set of user selected relevant articles. We tested ommender applications. Journal of Data Mining and Knowl-
the algorithm on a data set extracted by ACM Portal Digital edge Discovery, January 2001.
Library and it performed very well, in fact target papers ob- [9] A. Woodruff, R. Gossweiler, J. Pitkow, E. Chi, and S. Card.
Enhancing a digital book with a reading recommender. In
tained a ranking within the first 20 result positions. Future
CHI ’00, 2000.
research topics include the experimentation of the algorithm