Method For Node Ranking in A Linked Database
Method For Node Ranking in A Linked Database
(54) METHOD FOR NODE RANKING INA Craig Boyle “To link or not to link: An empirical comparison
LINKED DATABASE of Hypertext linking strategies”. ACM 1992, pp. 221–231.*
L. Katz, “A new status index derived from sociometric
(75) Inventor; Lawrence Page, Stanford, CA (US) analysis,” 1953, Psychometricka, vol. 18, pp. 39–43.
(73) Assignee: The Board of Trustees of the Leland C.H. Hubbell, “An input-output approach to clique identi
Stanford Junior University, Stanford, fication sociometry,” 1965, pp. 377–399.
CA (US) Mizruchi et al., “Techniques for disaggregating centrality
scores in social networks,” 1996, Sociological Methodology,
(*) Notice: Subject to any disclaimer, the term of this pp. 26–48.
patent is extended or adjusted under 35 E. Garfield, “Citation analysis as a tool in journal evalua
U.S.C. 154(b) by 0 days. tion,” 1972, Science, vol. 178, pp. 471–479.
Pinski et al., “Citation influence for journal aggregates of
(21) Appl. No.: 09/004,827 scientific publications: Theory, with application to the lit
(22) Filed: Jan. 9, 1998 erature of physics,” 1976, Inf. Proc. And Management, vol.
12, pp. 297–312.
Related U.S. Application Data N. Geller, “On the citation influence methodology of Pinski
(60) Provisional
1997.
application No. 60/035,205, filed on Jan. 10, and Narin,” 1978, Inf. Proc. And Management, vol. 14, pp.
93–95.
(51) Int. Cl." … G06F 17/30 P. Doreian, “Measuring the relative standing of disciplinary
(52) U.S. Cl. .................................... 707/5; 707/7, 707/501 journals,” 1988, Inf. Proc. And Management, vol. 24, pp.
(58) Field of Search .................................... 707/100, 5, 7, 45–56.
707/513, 1–3, 10, 104, 501; 345/440; 38.2/226,
229, 230, 231 (List continued on next page.)
(56) References Cited Primary Examiner—Thomas Black
U.S. PATENT DOCUMENTS Assistant Examiner—Uyen Le
(74) Attorney, Agent, or Firm—Harrity & Snyder L.L.P.
4,953,106 + 8/1990 Gansner et al. ..................... 345/440 (57) ABSTRACT
5,450,535 + 9/1995 North ................ ... 395/140
5,748,954 5/1998 Mauldin .... -- ... 395/610
5,752,241 * 5/1998 Cohen ........... ... 707/3 A method assigns importance ranks to nodes in a linked
5,832,494 * 11/1998 Egger et al. ...... 707/102 database, such as any database of documents containing
5,848,407 * 12/1998 Ishikawa et al. ........................ 707/2 citations, the world wide web or any other hypermedia
6,014,678 * 1/2000 Inoue et al. .......................... 707/501 database. The rank assigned to a document is calculated
OTHER PUBLICATIONS
from the ranks of documents citing it. In addition, the rank
of a document is calculated from a constant representing the
S. Jeromy Carriere et al., “Web Query: Searching and Visu probability that a browser through the database will ran
alizing the Web through Connectivity”, Computer Networks domly jump to the document. The method is particularly
and ISDN Systems 29 (1997), pp. 1257–1267.* useful in enhancing the performance of search engine results
Wang et al “Prefetching in Worl Wide Web”, IEEE 1996, pp. for hypermedia databases, such as the world wide web,
28–32.* whose documents have a large variation in quality.
Ramer et al. “Similarity, Probability and Database Organi
sation: Extended Abstract”, 1996, pp. 272.276.” 29 Claims, 3 Drawing Sheets
US 6,285,999 B1
Page 2
tº . . .) A DS
FIG. 1
U.S. Patent Sep. 4, 2001 Sheet 2 of 3 US 6,285,999 B1
FIG. 2
U.S. Patent Sep. 4, 2001 Sheet 3 of 3 US 6,285,999 B1
START
101
103
COMPUTE AN APPROXIMATION p, TO A
STEADY-STATE PROBABILITY p. IN
ACCORDANCE WITH THE EQUATION p, - Ap,
105
DONE
FIG. 3
US 6,285,999 B1
1 2
METHOD FOR NODE RANKING IN A Hyperlink Search Engine, developed by IDD Information
LINKED DATABASE Services, ([Link] uses backlink informa
tion (i.e., information from pages that contain links to the
CROSS-REFERENCES TO RELATED current page) to assist in identifying relevant web docu
APPLICATIONS ments. Rather than using the content of a document to
This application claims priority from U.S. provisional determine relevance, the technique uses the anchor text of
links to the document to characterize the relevance of a
patent application Ser. No. 60/035,205 filed Jan. 10, 1997, document. The idea of associating anchor text with the page
which is incorporated herein by reference. the text points to was first implemented in the World Wide
STATEMENT REGARDING GOVERNMENT 10 Web Worm (Oliver A. McBryan, GENVL and WWWW.
SUPPORT Tools for Taming the Web, First International Conference on
the World Wide Web, CERN, Geneva, May 25–27, 1994).
This invention was supported in part by the National The Hyperlink Search Engine has applied this idea to assist
Science Foundation grant number IRI-9411306-4. The Gov in determining document relevance in a search. In particular,
ernment has certain rights in the invention. 15 search query terms are compared to a collection of anchor
text descriptions that point to the page, rather than to a
FIELD OF THE INVENTION keyword index of the page content. A rank is then assigned
This invention relates generally to techniques for analyz to a document based on the degree to which the search terms
ing linked databases. More particularly, it relates to methods match the anchor descriptions in its backlink documents.
for assigning ranks to nodes in a linked database, such as any 20 The well known idea of citation counting is a simple
database of documents containing citations, the world wide method for determining the importance of a document by
web or any other hypermedia database. counting its number of citations, or backlinks. The citation
rank r(A) of a document which has n backlink pages is
BACKGROUND OF THE INVENTION simply
25
Due to the developments in computer technology and its
increase in popularity, large numbers of people have recently In the case of databases whose content is of relatively
started to frequently search huge databases. For example, uniform quality and importance it is valid to assume that a
internet search engines are frequently used to search the highly cited document should be of greater interest than a
entire world wide web. Currently, a popular search engine 30
document with only one or two citations. Many databases,
might execute over 30 million searches per day of the however, have extreme variations in the quality and impor
indexable part of the web, which has a size in excess of 500 tance of documents. In these cases, citation ranking is overly
Gigabytes. Information retrieval systems are traditionally simplistic. For example, citation ranking will give the same
judged by their precision and recall. What is often neglected, rank to a document that is cited once on an obscure page as
to a similar document that is cited once on a well-known and
however, is the quality of the results produced by these 35
highly respected page.
search engines. Large databases of documents such as the
web contain many low quality documents. As a result, SUMMARY
searches typically return hundreds of irrelevant or unwanted
documents which camouflage the few relevant ones. In order Various aspects of the present invention provide systems
to improve the selectivity of the results, common techniques 40 and methods for ranking documents in a linked database.
allow the user to constrain the scope of the search to a One aspect provides an objective ranking based on the
specified subset of the database, or to provide additional relationship between documents. Another aspect of the
search terms. These techniques are most effective in cases invention is directed to a technique for ranking documents
where the database is homogeneous and already classified within a database whose content has a large variation in
into subsets, or in cases where the user is searching for well 45 quality and importance. Another aspect of the present inven
known and specific information. In other cases, however, tion is to provide a document ranking method that is scalable
these techniques are often not effective because each con and can be applied to extremely large databases such as the
straint introduced by the user increases the chances that the world wide web. Additional aspects of the invention will
desired information will be inadvertently eliminated from become apparent in view of the following description and
the search results. 50 associated figures.
Search engines presently use various techniques that One aspect of the present invention is directed to taking
attempt to present more relevant documents. Typically, advantage of the linked structure of a database to assign a
documents are ranked according to variations of a standard rank to each document in the database, where the document
vector space model. These variations could include (a) how rank is a measure of the importance of a document. Rather
recently the document was updated, and/or (b) how close the 55 than determining relevance only from the intrinsic content of
search terms are to the beginning of the document. Although a document, or from the anchor text of backlinks to the
this strategy provides search results that are better than with document, a method consistent with the invention deter
no ranking at all, the results still have relatively low quality. mines importance from the extrinsic relationships between
Moreover, when searching the highly competitive web, this documents. Intuitively, a document should be important
measure of relevancy is vulnerable to “spamming” tech 60 (regardless of its content) if it is highly cited by other
niques that authors can use to artificially inflate their docu documents. Not all citations, however, are necessarily of
ment’s relevance in order to draw attention to it or its equal significance. A citation from an important document is
advertisements. For this reason search results often contain more important than a citation from a relatively unimportant
commercial appeals that should not be considered a match to document. Thus, the importance of a page, and hence the
the query. Although search engines are designed to avoid 65 rank assigned to it, should depend not just on the number of
such ruses, poorly conceived mechanisms can result in citations it has, but on the importance of the citing docu
disappointing failures to retrieve desired information. ments as well. This implies a recursive definition of rank: the
US 6,285,999 B1
3 4
rank of a document is a function of the ranks of the ments B and C are pointers to document A. In this case we
documents which cite it. The ranks of documents may be say that B and C are backlinks of A, and that A is a forward
calculated by an iterative procedure on a linked database. link of B and of C. Documents B and C also have other
Because citations, or links, are ways of directing forward links to documents that are not shown.
attention, the important documents correspond to those Although the ranking method of the present invention is
documents to which the most attention is directed. Thus, a superficially similar to the well known idea of citation
high rank indicates that a document is considered valuable counting, the present method is more subtle and complex
by many people or by important people. Most likely, these than citation counting and gives far superior results. In a
are the pages to which someone performing a search would simple citation ranking, the rank of a document A which has
like to direct his or her attention. Looked at another way, the 10 n backlink pages is simply
importance of a page is directly related to the steady-state
probability that a random web surfer ends up at the page According to one embodiment of the present method of
after following a large number of links. Because there is a ranking, the backlinks from different pages are weighted
larger probability that a surfer will end up at an important differently and the number of links on each page is normal
page than at an unimportant page, this method of ranking 15 ized. More precisely, the rank of a page A is defined
pages assigns higher ranks to the more important pages. according to the present invention as
In one aspect of the invention, a computer implemented
method is provided for scoring linked database documents.
The method comprises the steps of: 20
obtaining a plurality of documents, at least some of the
documents being linked documents, at least some of the where B1, . . . , B, are the backlink pages of A, r(B1), . . . ,
documents being linking documents, and at least some r(B) are their ranks, [B], . . . , |B, are their numbers of
of the documents being both linked documents and forward links, and O. is a constant in the interval [0,1], and
linking documents, each of the linked documents being 25 N is the total number of pages in the web. This definition is
pointed to by a link in one or more of the linking clearly more complicated and subtle than the simple citation
documents; assigning a score to each of the linked rank. Like the citation rank, this definition yields a page rank
documents based on scores of the one or more linking that increases as the number of backlinks increases. But the
documents; and processing the linked documents present method considers a citation from a highly ranked
according to their scores. 30 backlink as more important than a citation from a lowly
Additional aspects, applications and advantages will ranked backlink (provided both citations come from back
become apparent in view of the following description and link documents that have an equal number of forward links).
associated figures. In the present invention, it is possible, therefore, for a
BRIEF DESCRIPTION OF THE DRAWINGS
document with only one backlink (from a very highly ranked
35 page) to have a higher rank than another document with
FIG. 1 is a diagram of the relationship between three many backlinks (from very low ranked pages). This is not
linked hypertext documents according to the invention. the case with simple citation ranking.
FIG. 2 is a diagram of a three-document web illustrating The ranks form a probability distribution over web pages,
the rank associated with each document in accordance with so that the sum of ranks over all web pages is unity. The rank
the present invention. 40 of a page can be interpreted as the probability that a surfer
will be at the page after following a large number of forward
FIG. 3 is a flowchart of one embodiment of the invention. links. The constant Cº. in the formula is interpreted as the
DETAILED DESCRIPTION
probability that the web surfer will jump randomly to any
web page instead of following a forward link. The page
Although the following detailed description contains 45 ranks for all the pages can be calculated using a simple
many specifics for the purposes of illustration, anyone of iterative algorithm, and corresponds to the principal eigen
ordinary skill in the art will appreciate that many variations vector of the normalized link matrix of the web, as will be
and alterations to the following details are within the scope discussed in more detail below.
of the invention. Accordingly, the following embodiments of In order to illustrate the present method of ranking,
the invention are set forth without any loss of generality to, 50 consider the simple web of three documents shown in FIG.
and without imposing limitations upon, the claimed inven 2. For simplicity of illustration, we assume in this example
tion. For support in reducing the present invention to that r=0. Document A has a single backlink to document C,
practice, the inventor acknowledges Sergey Brin, Scott and this is the only forward link of document C, so
Hassan, Rajeev Motwani, Alan Steremberg, and Terry Wino r(A)=r(C).
grad. 55 Document B has a single backlink to document A, but this
A linked database (i.e. any database of documents con is one of two forward links of document A, so
taining mutual citations, such as the world wide web or other r(B)=r(A)/2.
hypermedia archive, a dictionary or thesaurus, and a data Document C has two backlinks. One backlink is to
base of academic articles, patents, or court cases) can be document B, and this is the only forward link of document
represented as a directed graph of N nodes, where each node 60 B. The other backlink is to document A via the other of the
corresponds to a web page document and where the directed two forward links from A. Thus
connections between nodes correspond to links from one r(C)=r(B)+r(A)/2.
document to another. A given node has a set of forward links In this simple illustrative case we can see by inspection
that connect it to children nodes, and a set of backward links that r(A)=0.4, r(B)=0.2, and r(C)=0.4. Although a typical
that connect it to parent nodes. FIG. 1 shows a typical 65 value for O is ~0.1, if for simplicity we set os-0.5 (which
relationship between three hypertext documents A, B, and C. corresponds to a 50% chance that a surfer will randomly
As shown in this particular figure, the first links in docu jump to one of the three pages rather than following a
US 6,285,999 B1
5 6
forward link), then the mathematical relationships between present invention. Because the rank of various different
the ranks become more complicated. In particular, we then documents can vary by orders of magnitude, it is convenient
have to define a logarithmic rank
ps = lim A^po,
In the Specification
Column 1, lines 13-15 should be replaced with the following paragraph:
This invention was made with Government support under contract 9411306 awarded by the National
Science Foundation. The Government has certain rights in this invention.