PRESIDENCY UNIVERSITY, BENGALURU
SCHOOL OF ENGINEERING
SEO: The
PAGE RANK algorithm
Tapas Guha, Assistant Professor, CSE
Presidency University, Bangalore
Contents
• Introduction
• Search Engine Optimization( SEO)
• Citation Analysis: The beginning of Ranking
• The Page Rank Algorithm
• The Page Rank Calculation
• Markov Chains
• Dangling Links
• Some PageRank Algorithm Variations
• Parallel Page Rank Algorithms
• HITS
• Limitations of HITS
• Conclusion
References
Introduction
• PageRank is a numeric value that represents how important a page is on the web. Google
figures that when one page links to another page, it is effectively casting a vote for the
other page. The more votes that are cast for a page, the more important the page must be.
Also, the importance of the page that is casting the vote determines how important the vote
itself is. Google calculates a page’s importance from the votes cast for it. How important
each vote is taken into account when a page’s PageRank is calculated.
• PageRank is Google’s way of deciding a page’s importance. It matters because it is one of
the factors that determines a page’s ranking in the search results. It isn’t the only factor that
Google uses to rank pages, but it is an important one.
The web as a directed graph
4
The web as a directed graph
5
The web as a directed graph
Assumption 1: A hyperlink is a quality signal.
6
The web as a directed graph
Assumption 1: A hyperlink is a quality signal.
The hyperlink d1 → d2 indicates that d1‘s author deems d2
high-quality and relevant.
7
The web as a directed graph
Assumption 1: A hyperlink is a quality signal.
The hyperlink d1 → d2 indicates that d1‘s author deems d2
high-quality and relevant.
Assumption 2: The anchor text describes the content of d2.
8
The web as a directed graph
Assumption 1: A hyperlink is a quality signal.
The hyperlink d1 → d2 indicates that d1‘s author deems d2
high-quality and relevant.
Assumption 2: The anchor text describes the content of d2.
We use anchor text somewhat loosely here for: the text
surrounding the hyperlink .
9
The web as a directed graph
Assumption 1: A hyperlink is a quality signal.
The hyperlink d1 → d2 indicates that d1‘s author deems d2
high-quality and relevant.
Assumption 2: The anchor text describes the content of d2.
We use anchor text somewhat loosely here for: the text
surrounding the hyperlink .
Example: “You can find cheap cars ˂a href =[Link] ˂/a ˃. ”
10
The web as a directed graph
Assumption 1: A hyperlink is a quality signal.
The hyperlink d1 → d2 indicates that d1‘s author deems d2
high-quality and relevant.
Assumption 2: The anchor text describes the content of d2.
We use anchor text somewhat loosely here for: the text
surrounding the hyperlink .
Example: “You can find cheap cars ˂a href =[Link] ˂/a ˃. ”
Anchor text: “You can find cheap here”
11
[text of d2] only vs. [text of d2] + [anchor text → d2]
12
[text of d2] only vs. [text of d2] + [anchor text → d2]
Searching on [text of d2] + [anchor text → d2] is often
more effective than searching on [text of d2] only.
13
[text of d2] only vs. [text of d2] + [anchor text → d2]
Searching on [text of d2] + [anchor text → d2] is often
more effective than searching on [text of d2] only.
Example: Query IBM
14
[text of d2] only vs. [text of d2] + [anchor text → d2]
Searching on [text of d2] + [anchor text → d2] is often
more effective than searching on [text of d2] only.
Example: Query IBM
Matches IBM’s copyright page
15
[text of d2] only vs. [text of d2] + [anchor text → d2]
Searching on [text of d2] + [anchor text → d2] is often
more effective than searching on [text of d2] only.
Example: Query IBM
Matches IBM’s copyright page
Matches many spam pages
16
[text of d2] only vs. [text of d2] + [anchor text → d2]
Searching on [text of d2] + [anchor text → d2] is often
more effective than searching on [text of d2] only.
Example: Query IBM
Matches IBM’s copyright page
Matches many spam pages
Matches IBM wikipedia article
17
[text of d2] only vs. [text of d2] + [anchor text → d2]
Searching on [text of d2] + [anchor text → d2] is often
more effective than searching on [text of d2] only.
Example: Query IBM
Matches IBM’s copyright page
Matches many spam pages
Matches IBM wikipedia article
May not match IBM home page!
18
[text of d2] only vs. [text of d2] + [anchor text → d2]
Searching on [text of d2] + [anchor text → d2] is often
more effective than searching on [text of d2] only.
Example: Query IBM
Matches IBM’s copyright page
Matches many spam pages
Matches IBM wikipedia article
May not match IBM home page!
… if IBM home page is mostly graphics
19
[text of d2] only vs. [text of d2] + [anchor text → d2]
Searching on [text of d2] + [anchor text → d2] is often
more effective than searching on [text of d2] only.
Example: Query IBM
Matches IBM’s copyright page
Matches many spam pages
Matches IBM wikipedia article
May not match IBM home page!
… if IBM home page is mostly graphics
Searching on [anchor text → d2] is better for the query IBM.
20
[text of d2] only vs. [text of d2] + [anchor text → d2]
Searching on [text of d2] + [anchor text → d2] is often
more effective than searching on [text of d2] only.
Example: Query IBM
Matches IBM’s copyright page
Matches many spam pages
Matches IBM wikipedia article
May not match IBM home page!
… if IBM home page is mostly graphics
Searching on [anchor text → d2] is better for the query IBM.
In this representation, the page with most occurences of IBM is
[Link]
21
Anchor text containing IBM pointing to [Link]
22
Anchor text containing IBM pointing to [Link]
23
Indexing anchor text
24
Indexing anchor text
Thus: Anchor text is often a better description of a page’s
content than the page itself.
25
Indexing anchor text
Thus: Anchor text is often a better description of a page’s
content than the page itself.
Anchor text can be weighted more highly than document
text.
(based on Assumption 1&2)
26
Exercise: Assumptions underlying PageRank
27
Exercise: Assumptions underlying PageRank
Assumption 1: A link on the web is a quality signal – the
author of the link thinks that the linked-to page is high-
quality.
28
Exercise: Assumptions underlying PageRank
Assumption 1: A link on the web is a quality signal – the
author of the link thinks that the linked-to page is high-
quality.
Assumption 2: The anchor text describes the content of
the linked-to page.
29
Exercise: Assumptions underlying PageRank
Assumption 1: A link on the web is a quality signal – the
author of the link thinks that the linked-to page is high-
quality.
Assumption 2: The anchor text describes the content of
the linked-to page.
Is assumption 1 true in general?
30
Exercise: Assumptions underlying PageRank
Assumption 1: A link on the web is a quality signal – the
author of the link thinks that the linked-to page is high-
quality.
Assumption 2: The anchor text describes the content of
the linked-to page.
Is assumption 1 true in general?
Is assumption 2 true in general?
31
Google bombs
32
Google bombs
A Google bomb is a search with “bad” results due to maliciously manipulated anchor
text.
33
Google bombs
A Google bomb is a search with “bad” results due to maliciously manipulated anchor
text.
Google introduced a new weighting function in January 2007
that fixed many Google bombs.
34
Google bombs
A Google bomb is a search with “bad” results due to maliciously manipulated anchor
text.
Google introduced a new weighting function in January 2007
that fixed many Google bombs.
Still some remnants: [dangerous cult] on Google, Bing, Yahoo
35
Google bombs
A Google bomb is a search with “bad” results due to maliciously manipulated anchor
text.
Google introduced a new weighting function in January 2007
that fixed many Google bombs.
Still some remnants: [dangerous cult] on Google, Bing, Yahoo
Coordinated link creation by those who dislike the Church of
Scientology
36
Google bombs
A Google bomb is a search with “bad” results due to maliciously manipulated anchor
text.
Google introduced a new weighting function in January 2007
that fixed many Google bombs.
Still some remnants: [dangerous cult] on Google, Bing, Yahoo
Coordinated link creation by those who dislike the Church of
Scientology
Defused Google bombs: [dumb motherf…], [who is a failure?], [evil empire]
37
How far do people look for results?
(Source: [Link] WhitePaper_2006_SearchEngineUserBehavior.pdf) 38
Search Engine Optimization(SEO)
• With the rapid growth of WWW most of the users use information retrieval tools like search engines to find
information from the web.
• There are tens and hundreds of search engines available but some are popular like Google, Yahoo, Bing etc.,
because of their crawling and ranking methodologies.
• The search engines download, index and store hundreds of millions of web pages. They answer tens of
millions of queries every day. So Web mining and ranking mechanism becomes very important for effective
information retrieval.
• Before presenting the pages to the user, a ranking mechanism is done by the search engines to present the most
relevant pages at the top and less relevant ones at the bottom.
Citation Analysis: The beginning of Ranking
• Citation frequency
• Bibliographic coupling frequency-
• Articles that co-cite the same articles are related .
• Citation indexing.
• But, the web is not a scholarly citation. Millions of participants, each with self interests.
• Spamming is widespread.
• Once search engines began to use links for ranking (roughly 1998), link spam grew.
• One can join a link farm – a group of websites that heavily link to one another.
The Page Rank Algorithm
• "PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a
web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on
another random page. The probability that the random surfer visits a page is its PageRank.“
• Quoting from the original Google paper, PageRank is defined like this [1]:
We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor
which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also
C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
PR(Tn) – Page Rank of page Tn, C(Tn) –The count, or number, of outgoing links for page n, PR(Tn)/C(Tn) –The
share of the vote page A will get from Tn,
d – All these fractions of votes are added together but, to stop the other pages having too much influence, this total
vote is damped down by multiplying it by 0.85 (the factor "d”)
(1 – d) – The (1 – d) bit at the beginning is a bit of probability math magic so the "sum of all web pages' PageRanks
will be one”.
• The PageRanks form a probability distribution over web pages, so the sum of PageRanks of all web pages will be
1.
Page Rank Calculation
• The PR of each page depends on the PR of the pages pointing to it. But we won't know what PR those pages have
until the pages pointing to them have their PR calculated and so on… And when you consider that page links can
form circles it seems impossible to do this calculation! But actually it's not that bad.
• we can just go ahead and calculate a page's PR without knowing the final value of the PR of the other pages. That
seems strange but, basically, each time we run the calculation we're getting a closer estimate of the final value. So
all we need to do is remember the each value we calculate and repeat the calculations lots of times until the
numbers stop changing much. This stage is said to be convergence state. Let us take the following example.
• Each page has one outgoing link (the outgoing count is 1, i.e. C(A) = 1 and C(B) = 1).
• we don't know what their PR should be to begin with, so let's take a guess at 1.0 and do some calculations:
• d = 0.85
PR(A) = (1 – d) + d(PR(B)/1) PR(B) = (1 – d) + d(PR(A)/1)
• PR(A) = 0.15 + 0.85 * 1 = 1 PR(B) = 0.15 + 0.85 * 1 = 1
Sec. 21.2.1
Markov chains
• A Markov chain consists of n states, plus an nn transition probability matrix P.
• At each step, we are in one of the states.
• For 1 i,j n, the matrix entry Pij tells us the probability of j being the next state, given we
are currently in state i.
Pii>0
is OK.
i j
Pij
Dangling links
• PageRank “Dangling links are simply links that point to any page with no outgoing links. They affect the
model because it is not clear where their weight should be distributed, and there are a large number of
them. Often these dangling links are simply pages that we have not downloaded yet……….Because
dangling links do not affect the ranking of any other page directly, we simply remove them from the system
until all the PageRanks are calculated. After all the PageRanks are calculated they can be added back in
without affecting things significantly.” [1]
• A dangling link is a link to a page that has no links going from it, or a link to a page that Google hasn’t
indexed. In both cases Google removes the links shortly after the start of the calculations and reinstates
them shortly before the calculations are finished. In this way, their effect on the PageRank of other pages is
minimal.
Weighted Page Rank Algorithm
• Authors in [2] proposed a Weighted Page Rank (WPR) algorithm which is an addition of the Page Rank
algorithm which assigns larger rank values to more important (popular) pages instead of dividing the rank
value of a page evenly among its out link pages. Each out link page gets a value proportional to its
popularity (its number of in links and out links). The popularity from the number of in links and out links is
recorded as Win (v, u) and Wout (v, u), respectively.
• Win (v, u) is the weight of link (v, u) calculated based on the number of in links of page u and the number
of in links of all reference pages of page v. Wout (v, u) is the weight of link (v, u) calculated based on the
number of out links of page u and the number of out links of all reference pages of page v.
• Where Iu and Ip represent the number of in links of page u and page p, respectively. R (v) denotes the
reference page list of page v. Where Ou and Op represent the number of out links of page u and page p,
respectively. R (v) denotes the reference page list of page v.
• Considering the importance of pages, the original Page Rank formula is modified as:
Page ranking based on visit of link
• Another algorithm [3] considers user’s browsing behavior. This concept is very useful to display most
valuable pages on the top of the result list on the basis of user browsing behavior, which reduce the search
space to a large scale .
• In this algorithm we assign more rank value to the outgoing links which is most visited by users. In this
manner a page rank value is calculated based on visits of inbound links.
• The modified version based on VOL is given in equation:
• Here d is a dampening factor, u represents a web page, B (u) is the set of pages that point to u, PR (u) and PR
(v) are rank scores of page u and v respectively, Lu is the number of visits of link which is pointing page u
from v. TL (v) denotes total number of visits of all links present on v.
Weighted PageRank based on visit of links
• In [4], authors proposed an algorithm that gives more rank value is assigned to the outgoing links which is
most visited by users and received higher popularity from number of in link.
• Here the popularity of out links is not considered which is considered in the original algorithm. The
advanced approach in the new algorithm is to determine the user’s usage trends.
• The user’s browsing behavior can be calculated by number of hits (visits) of links.
• The modified version based on WPR (VOL) is given as:
• Here d is a dampening factor, u represents a web page, B (u) is the set of pages that point to u, WPRVOL
(u) and WPRVOL (v) are rank scores of page u and v respectively, Lu is the number of visits of link which
is pointing page u from v. TL (v) denotes total number of visits of all links present on v.
DistanceRank Algorithm
• DistanceRank algorithm [5] employs a novel recursive method based on reinforcement learning [24] which
considers distance between pages as punishment, called “DistanceRank” to compute ranks of web pages.
• The number of ‘average clicks’ between two pages is defined as distance.
• The main objective of this algorithm is to minimize distance or punishment so that a page with smaller
distance to have a higher rank.
• Most of the current ranking algorithms have the “rich-get-richer” problem i.e. the popular high rank web
pages become more and more popular and the young high quality pages are not picked by the ranking
algorithms.
• The DistanceRank algorithm algorithms is less sensitive to the “rich-get-richer” problem and finds
important pages faster than others.
• It uses the logic that normally related pages are linked to each other so the distance based solution can find
pages with high qualities more quickly.
Parallel PageRank Algorithms
PR on PC Cluster [8]
• Efficient computation of PageRank is required due to huge size of web graph. Recent researchers discuss
some algorithmic and architectural approaches to speed up the computation of PageRank.
• To compute PageRank vector on Parallel platform first this algorithm store hyperlink matrix into binary
link
• structure file 𝐵 that is partitioned into 𝛽 chunks. Here 𝛽 is the number of systems in a cluster setup.
• Let there are T records in binary structure file and file is described as follows:
• In this figure, file B have two attribute i.e. dest_id and src_id that contains incoming and outgoing pages’ id
e.g. page 1 is pointed by page 1029 and page 2 is pointed by page 101 and page 23.
• File 0, and file 1 contains number of outgoing link and incoming link of page of file B.
Parallel PageRank Algorithms(cont.)
• Parallel Adaptive PageRank algorithm: [9] proposed a parallel PageRank computation on PC Cluster.
Kamvar et al. [10] found that PageRank vector for about 2/3 of pages it converges very early than for other
pages. So based on this concept they proposed adaptive PageRank method that avoid the overhead of
recomputation of already converged web pages.
• Parallel PageRank computation by using Gauss-Seidel/Jacobi method: [11] accelerate the PageRank
computation by applying Gauss-Seidel or Jacobi distance measures to improve convergence of PageRank
algorithm.
• PPR – MT algorithm: [12] based on Multi-threaded model of MPI. In this algorithm MPI is used as a
communication model between processes on different systems, and inter-thread method is used for
communication between threads. They have partitioned the binary link structure file to distribute it equally
among cluster nodes. The file B and I are partitioned and distributed among the cluster nodes, while the file
O is not partitioned.
Parallel PR Algorithms: A Comparative View
The following table is a comparative view of the parallel PageRank algorithms discussed in this presentation.
Load Parallelism Type Memory Used
Balancing
ALGORITH Data Task Distribute Shared
MS: d
Parallel PR on No Yes No Yes No
PC Cluster [8]
Adaptive [9,10] No Yes No Yes No
Parallel PR MPI Yes Yes Yes(Thread Yes Yes
MT level)
[12]
Parallel PR on Yes Yes No Yes No
Gauss/Jacobi [11]
HITS
• Hyperlink Induced Topic Search(HITS) Algorithm [13] ranks the web page by processing in links and
out links of the web pages. In this algorithm a web page is named as authority if the web page is
pointed by many hyperlinks and a web page is names as HUB if the page points to various hyperlinks.
Authorities and hubs are illustrated in Figure
• Hubs and authorities are assigned respective scores. Scores are computed in a mutually reinforcing
way; an authority pointed to by several highly scored hubs should be a strong authority while a hub that
points to several highly scored authorities should be a popular hub.
• Let ap and hp represent the authority and hub scores of page p, respectively. B (p) and I (p) denote the
set of referrer and reference pages of page p, respectively. The scores of hubs and authorities are
calculated as follows;
HITS
The page’s authority weight is proportional to the sum of the hub weights of pages that it links
to it, Kleinberg. Similarly, a page’s hub weight is proportional to the sum of the authority
weights of pages that it links to. The following figure shows an example of the calculation of
authority and hub scores.
HITS is a purely link-based algorithm. It is used to rank pages that are retrieved from the web,
based on their textual contents to a given query. Once these pages have been assembled, the
HITS algorithm ignores textual content and focuses itself on the structure of the web only.
Constraints of HITS
The following are the constraints of HITS algorithm :
• Hubs and authorities: It is not easy to distinguish between hubs and authorities because many sites
are hubs as well as authorities.
• Topic drift: Sometime HITS may not produce the most relevant documents to the user queries
because of equivalent weights.
• Automatically generated links: HITS gives equal importance for automatically generated links which
may not produce relevant topics for the user query.
• Efficiency: HITS algorithm is not efficient in real time.
HITS was used in a prototype search engine called Clever for an IBM research project. Because of the
above constraints HITS could not be implemented in a real time search engine.
References
[1] L. Page, S. Brin, R. Motwani, and T. Winograd, “The Pagerank Citation Ranking: Bringing order to the Web”. Technical Report, Stanford Digital
Libraries SIDL-WP-1999-0120, 1999.
[2] Wenpu Xing and Ali Ghorbani, “Weighted Page Rank Algorithm”, Proceedings of the Second Annual Conference on Communication Networks and
Services Research (CNSR ’04), IEEE, 2004.
[3] Gyanendra Kumar, Neelam Duhan, and Sharma A. K., “Page Ranking Based on Number of Visits of Web Pages”, International Conference on
Computer & Communication Technology (ICCCT)-2011, 978-1-4577-1385-9.
[4] Neelam Tyagi, Simple Sharma, “Weighted Page Rank Algorithm based on number of visits of links of web page”, International Journal of Soft
Computing and Engineering (IJCSE)-ISSN: 2231-2307, Volume-2, Issue-3, July 2012
[5] A. M. Zareh Bidoki and N. Yazdani, “DistanceRank: An intelligent ranking algorithm for web pages” Information Processing and Management, Vol 44,
No. 2, pp. 877-892, 2008.
[6] Taruna Kumari, Ashlesha Gupta, Ashutosh Dixit,”Comparative Study of Page Rank and Weighted Page Rank Algorithm”, International Journal of
Innovative Research in Computer and Communication Engineering-ISSN: 2320- 9798, Volume-2, Issue 2, February 2015.
[7] Chen, Y. Y., Gan, Q., & Suel, T, “I/O efficient techniques for computing PageRank”. In Proceedings of the eleventh international conference on
Information and knowledge management (pp. 549-557). ACM, 2002
[8] Rungsawang, A., & Manaskasemsak, B, “PageRank computation using PC cluster”. In European Parallel Virtual Machine/Message Passing Interface
Users’ Group Meeting (pp. 152-159). Springer Berlin Heidelberg, 2006
[9] Rungsawang, A., & Manaskasemsak, B,”Parallel adaptive technique for computing PageRank”. In Parallel, Distributed, and Network-Based
Processing, 2006. PDP 2006. 14th Euromicro International Conference on (pp. 6-pp). IEEE, 2009
[10] Kamvar, S., Haveliwala, T., & Golub, G, “Adaptive methods for the computation of PageRank. Linear Algebra and its Applications”, 386, 51-65,
2004
[11] Kohlschütter, C., Chirita, P. A., & Nejdl, W, “Efficient parallel computation of PageRank”. In European Conference on Information Retrieval (pp.
241-252). Springer Berlin Heidelberg, 2006
[12] Manaskasemsak, Bundit, Putchong Uthayopas, and Arnon Rungsawang. "A mixed MPI-thread approach for parallel page ranking computation." On
the
Move to Meaningful Internet Systems 2006: CoopIS, DOA, GADA, and ODBASE (2006): 1223-1233.
[13] J. Kleinberg, “Authoritative Sources in a Hyper-Linked Environment”, Journal of the ACM 46(5), pp. 604-632, 1999.
Thank You