Citation
Algorithms to rank web pages for search queries

Material Information

Title:
Algorithms to rank web pages for search queries
Creator:
Ahmed, Zainab Ahmed
Place of Publication:
Denver, Colo.
Publisher:
University of Colorado Denver
Publication Date:
2011
Language:
English
Physical Description:
iv, 55 leaves ; 28 cm.

Thesis/Dissertation Information

Degree:
Master's ( Master of Science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer Science
Committee Chair:
Chlebus, Bogdan
Committee Members:
Alaghband, Gita
Ra, Ilkyeun

Subjects

Subjects / Keywords:
Web sites -- Ratings and rankings ( lcsh )
Ranking and selection (Statistics) ( lcsh )
Computer algorithms ( lcsh )
Web search engines ( lcsh )
Computer algorithms ( fast )
Ranking and selection (Statistics) ( fast )
Web search engines ( fast )
Web sites -- Ratings and rankings ( fast )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (M.S.)--University of Colorado Denver, 2011. Computer science
Bibliography:
Includes bibliographical references (leaves 54-55).
General Note:
Department of Computer Science and Engineering
Statement of Responsibility:
by Zainab Ahmed Ahmed.

Record Information

Source Institution:
University of Colorado Denver
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
786168540 ( OCLC )
ocn786168540

Full Text
ALGORITHMS TO RANK WEB PAGES FOR SEARCH QUERIES
By
Zainab Ahmed Ahmed
B.S., Alfateh University, Tripoli, 2000
A thesis submitted to the
University of Colorado Denver
In partial fulfillment
of the requirements for the degree of
Master of Science
Computer Science
2011


This thesis for the Master of Science
degree by
Zainab Ahmed Ahmed
has been approved
by
Bogdan Chlebus
Gita Alaghband


Ahmed, Zainab (M.S., Computer Science)
Algorithms to Rank Web Pages for Search Queries
Thesis directed by Associate Professor Bogdan S. Chlebus
ABSTRACT
In this thesis we discuss algorithmic underpinnings of search engines. We
survey methods to produce rankings of web pages based on web mining. We discuss the PageRank algorithm, which is at the core of the Google search engine, and ranking algorithms based on hubs and authorities. We propose a hybrid
approach to produce page ranking, which combines various methods to rank web
pages.


This abstract accurately represents the content of the candidate's thesis. I recommend
its publication.
Signed
Bogdan S. Chlebus


TABLE OF CONTENTS
Figures............................................................ viii
Tables............................................................ ix
Chapter
1. Introduction...................................................1
1.1 The World Wide Web (WWW)...................................... 4
1.1.1 The Hyperlinks.............................................. 5
1.2 Web Search Engine............................................. 6
1.2.1 Search Engine Mechanism..................................... 6
1.3 The Data Structure for the Web................................ 8
1.3.1 The Graph Notation.......................................... 8
1.3.2 The Web Graph..............................................10
1.4 Measure the Quality of Web Searching Results.................12
1.4.1 Page Importance............................................. 12
1.4.2 Page Relevance.............................................. 12
2. Web Mining.................................................... 13
2.1 Overview.....................................................13
2.2 Web Mining Categories........................................ 14
2.2.1 Web Content Mining..........................................14
2.2.2 Web Structure Mining........................................15
2.2.3 Web Usage Mining............................................ 15
3. Page Ranking.................................................. 17
3.1 The Problem of Ranking.......................................17


3.2 Page Ranking Algorithms........................................ 18
3.3 The PageRank Algorithm......................................... 20
3.3.1 Simple PageRank Algorithm..................................... 20
3.3.2 Random Surfer Model............................................22
3.3.3 Expanding PageRank Algorithm.................................. 22
3.3.4 PageRank Example.............................................. 24
3.3.5 Modified PageRank Algorithms.................................. 24
3.4 Hyperlink Induced Topic Search (HITS) Algorithm.............. 25
3.4.1 General HITS Algorithm........................................ 26
3.4.2 The HITS Example.............................................. 27
3.4.3 The HITS Algorithm Detail......................................28
3.4.3.1 Constructing the Base Set sub-graph....................... 28
3.4.3.2 Computing Hubs and Authorities Algorithm.................. 30
3.4.3.3 The Iterative Algorithm..................................... 30
3.4.4 HITS Algorithm Normalization.................................. 31
3.4.5 Modified HITS Algorithms...................................... 31
3.5 Stochastic Approach for Link Structure Analysis Algorithm... 32
3.5.1 The Algorithm's Detail.........................................33
3.6 Page Content Ranking Algorithm (PCR)......................... 34
3.6.1 PCR Description............................................. 35
3.6.2 PCR Specification............................................. 36
3.6.3 The Parameters Calculation................................... 37
3.6.4 Page Classification and Importance Calculation................ 39
3.7 An Improved Usage-Based Ranking Algorithm...................... 40
4. A Hybrid Mining to Improve Web Page Ranking................... 44
4.1 Overview....................................................... 44


4.2 The General Hybrid Method..................................... 45
4.3 The Algorithm................................................. 46
5. Conclusion..................................................... 50
References......................................................... 54


LIST OF FIGURES
Figure
1.1 Search Engine Concept........................................... 6
1.2 Search Engine Architecture..................................... 8
1.3 The Directed, Undirected and Weighted Graph.................... 9
1.4 A Strongly Connected Graph....................................... 9
1.5 The Bipartite Graph............................................ 10
1.6 The Bow-tie Graph of the Web....................................11
2.1 Web Mining Classification........................................ 16
3.1 Page A and Page B are Backlinks of the Page C.................. 20
3.2 Simplified PageRank Calculation.................................. 21
3.3 Node C is a Dangling Node...................................... 22
3.4 Example of Hyperlinked Structure............................... 24
3.5 A Linked Set of Hubs and Authorities........................... 26
3.6 The Basic Hub and Authority Operation.......................... 27
3.7 The Root Set and the Base Set.................................. 29
3.8 Constructing a Bipartite Graph................................. 33
4.1 User behavior Data and the Weighted Graph...................... 47


LIST OF TABLES
Table
3.1 PageRank Example........................................ 24
4.1 Comparison of Various Web Pages Ranking Algorithms...... 52


1. Introduction
The World Wide Web has become the largest information repository in the world. It is a dynamic and unstructured data source. These features of the Web create challenges such as finding relevant pages on a specific topic. Users are looking for tools and techniques that better locate, extract, filter, and find the information they seek. Web search engines alleviate this and help users find the needle they are looking for in the Web's haystack. Day by day, web search engines have attracted more attention throughout the world. There are hundreds of web search engines, such as Google, Yahoo and Bing, that have a crawling and
ranking mechanism in common. Web search engines face many challenges
because of the increasing number of the Web pages and users on the Web, which
in turn has led to the rapid increase in the number of queries issued by the users.
A web page is relevant to the query if it contains the query topic in its title or
somewhere in body text, while the web page is important to the query topic if it
has many other important web pages linking to it. Search engines need tools that help in distinguishing web pages and ranking them according to their relevancy and
importance to the query topic.
The Web mining technique is a tool used in the ranking process of the
Web pages. Web mining techniques can be classified into three categories: Web link or structure mining, Web content mining and Web usage mining. Web structure mining provides the necessary information through the hyperlink structure of different interconnected Web pages. Web content mining provides the necessary information through the content of the Web pages retrieved from the index. Web usage mining relies on the users' behavior on the Web to provide
the necessary information.
The ranking mechanism, which is the main focus of this thesis, is an
important component in the search engine architecture. The PageRank algorithm provides the ranking mechanism that makes Google a very successful and popular search engine. Page ranking algorithms use Web mining techniques to order the search results according to the user's interest. Some page ranking algorithms depend only on the link structure of the Web (Web link mining) to find the important web pages, while others look at the content of the pages (Web content mining) to find the relevant web pages, or at the users' behaviors (Web usage mining), whereas some other page ranking algorithms use a combination of two Web mining techniques, such as Web link mining and Web content mining. We
present Web page ranking algorithms that use various mining techniques or a
combination of them to produce the searching results.
The thesis surveys different page ranking algorithms and how they work.
PageRank, HITS and SALSA are page ranking algorithms that depend on the analysis of the link structure of the Web to establish the relationship between web pages and to find important web pages according to the query topic. They build the graph
of the web where its nodes are the Web pages and its edges are the hyperlinks
between web pages.
PCR is a page ranking algorithm that relies on the content of the web
pages to produce web page ranks in terms of web pages' relevancy to the query
topic. PCR uses a parser to get the body text from the web page and to compute
many parameters such as query frequency to see how the page is related to the
query topic.
An improved usage-based algorithm is a page ranking algorithm that uses a combination of link structure mining with usage-based mining to rank web pages. The algorithm uses users' behavior on the Internet, which can be recorded by the Internet server, as a source of information about the importance of the page. Users' behavior is represented by the time spent reading web pages.
The contribution of this thesis is a proposal for a new page ranking algorithm. The algorithm combines the three main mining techniques, web content mining, web link mining and usage-based mining, in one algorithm that works in three steps. In the first step, the algorithm produces a rank score for each web page according to its content analysis (the relevancy score). In the second step, the algorithm produces a rank score for each web page according to link analysis of the Web pages as well as analysis of the users' behavior on the Internet, represented by the time spent reading web pages and the frequency of visiting them. This can be achieved by building a weighted undirected web graph. The rank score produced by the second step is called the importance score. In the third step, the algorithm combines the two rank scores produced in the previous steps into one rank score for each web page that reflects both the importance and the relevance of the web page's content to the query topic, and lists these web pages at the top of the search results.
1.1 The World Wide Web (WWW)
The World Wide Web is also known as W3 or the Web. The World Wide Web is a system of hypertext documents connected to each other. The system uses the Internet infrastructure to access the interlinked documents and to connect client terminals and servers all around the world using the Hypertext Transfer Protocol (HTTP). Documents on the World Wide Web, which are known as home pages, Web pages, or simply pages, are written in Hypertext Markup Language (HTML) and are identified by Uniform Resource Locators (URLs). The Web also includes
hypermedia, which is hyperlinked multimedia, including audio, video, graphics, and animations besides text. The Web serves as a huge information repository and has facilitated the spread of this information across the world.
1.1.1 The Hyperlinks
A hyperlink is the property of a word, phrase, or image that you can click on
in a document to link either to a new document or a new portion within the
document. Hyperlinks are found in Web pages and allow users to click their way from page to page. Text hyperlinks are often colored blue and are sometimes underlined to distinguish them from other elements of the page that are not hyperlinked. When you move the cursor over a hyperlink, whether it is a word, a phrase or an image, the mouse cursor should change to a small hand pointing at the link. Hyperlinks are created through the use of markup languages such as HTML (Hypertext Markup Language) or XML (Extensible Markup Language).
1.2 Web Search Engines
A search engine is a web site that collects and organizes contents from all over
the Internet. A search engine can be considered a software application that
searches a database of documents and files on the Internet. It is developed to
extract information from that database in order to answer search queries in the
form of a list of web pages, images, or files. The search looks for keywords or
exact phrases found in titles or texts. A search engine can be restricted to a single
website, or can search across many sites on the public Internet. A search engine keeps records of what is available on the Web. It can be understood basically as a Web content collector, as in figure 1.1.
Figure 1.1: Search Engine Concept
1.2.1 Search Engine Mechanism
A Web search engine works by storing information about web pages. It
downloads, indexes and stores thousands of millions of web pages in order to answer thousands of millions of users' queries every day. There are three main components in search engines (see figure 1.2): the crawler, the indexer and query processing. The crawler, which is also called the spider, is the program that browses the World Wide Web, follows every link and downloads Web pages. The downloaded Web pages are then routed to the indexer. The indexer analyzes each Web page's content, such as words extracted from titles or headings, and builds the index depending on the keywords found in the pages. The index is maintained alphabetically in terms of the keywords found.
When a user sends a query through the interface of the search engine, it is delivered to the query processor component of the search engine, which does the matching process within the index and then returns the Uniform Resource Locators (URLs) of the pages to the user. Before presenting the URLs of the pages to the user, the search engine applies some ranking mechanism (Web mining) to ease users' navigation among the search results. The most important and relevant
pages are presented on top of the searching result list. In this thesis we will
discuss different kinds of ranking algorithms for search engines.
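To make the indexing and matching steps described above concrete, the short Python sketch below builds a keyword-to-URL inverted index and answers a query by intersecting posting lists. It is a minimal illustration only; the function names and sample pages are assumptions of this sketch and are not taken from the thesis or from any particular search engine.

from collections import defaultdict

def build_index(pages):
    # pages maps a URL to the text the crawler downloaded for it
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)          # posting list per keyword
    return index

def search(index, query):
    # return the URLs that contain every keyword of the query
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))

pages = {"u1": "web search engines rank pages",
         "u2": "ranking algorithms for web pages"}
print(search(build_index(pages), "web pages"))   # {'u1', 'u2'}

A ranking mechanism such as those discussed in chapter 3 would then order the returned URLs before they are presented to the user.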


Figure 1.2: Search Engine Architecture [6]
1.3 The Data Structure for the Web
Traditional Web information retrieval systems focus on the information
provided by the text of the Web page. The hyperlinks by which different Web pages are connected provide additional information for searching the Web. The Web can be seen as a directed graph whose nodes are the Web pages and whose edges are the hyperlinks between Web pages. The directed graph structure is called the
Web Graph.
1.3.1 The Graph Notation
The graph G consists of two sets V and E. V is the set of nodes (Web pages), while E is the set of edges (hyperlinks). It can also be expressed as G(V, E).


Undirected graph: the pair of nodes (u, v), (v, u) represent the same edge.
Directed graph: the pair of nodes (u, v), (v, u) represent two different edges.
Weighted graph: the edges are assigned a weight which might represent, for
example, costs, lengths or capacities, etc. depending on the problem at hand.
Figure 1.3 shows the directed graph, undirected graph and weighted graph.
Figure 1.3: The Directed, Undirected and Weighted Graph
A strongly connected graph is a directed graph where for every pair of nodes u and v there is a directed path from u to v and a directed path from v to u, as in figure 1.4.
Figure 1.4: A Strongly Connected Graph


A bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every link in E (the set of links) connects a node in U to a node in V. It can also be expressed as G(U, V, E), as in figure 1.5.
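The graph notions above can be made concrete with a small sketch. The following Python fragment (an illustration added here, not part of the thesis) stores a directed, weighted graph G(V, E) as an adjacency list and reads off the out-links and in-links of a node, which is how the Web graph is handled by the ranking algorithms discussed later.

# adjacency list for a small directed, weighted graph G(V, E)
# each node maps to the nodes it links to, together with an edge weight
graph = {
    "A": {"B": 1.0, "C": 2.5},   # A -> B, A -> C
    "B": {"C": 1.0},             # B -> C
    "C": {"A": 0.5},             # C -> A
}

def out_links(g, u):
    # pages that u points to
    return list(g.get(u, {}))

def in_links(g, u):
    # pages that point to u
    return [v for v, neighbors in g.items() if u in neighbors]

print(out_links(graph, "A"))   # ['B', 'C']
print(in_links(graph, "C"))    # ['A', 'B']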
1.3.2 The Web Graph
Much research has been done on the Web graph or map. Broder et al. in [4] built the Web graph in a bow-tie shape using strongly connected components as the base block. The bow-tie graph is shown in figure 1.6. They divided the Web into four large pieces and showed how these pieces fit together.
Figure 1.5: The Bipartite Graph
Figure 1.6: The Bow-tie Graph of the Web [4]


1. Giant Strongly Connected Component (SCC) is the central core of the
Web map that has all the pages which can reach one another along
directed links.
2. The IN component consists of Web pages that can access the SCC but
they cannot be accessed from SCC.
3. The OUT component consists of Web pages that are accessible from the SCC but do not link back to the SCC.
4. The TENDRILS component contains all the pages that cannot reach the SCC and are not reachable from the core component, the SCC.
According to Broder et al. in [4], the size of the SCC is relatively small compared to the size of the other components.
1.4 Quality Measurement of Web Searching Results
The quality of Web search engine results forms a real concern for Internet users.
Current Web search engines respond to user queries within a portion of a second.
Users will expect good results from the web, and they do not mind if search engines
respond within a few more seconds. Next, we explain a few factors used by search
engines to measure the importance and relevance of web pages.
1.4.1 Page Importance
Web page importance is the measure of how important the page is to the user's query. Page importance can be inferred from the graph structure of the web, which shows clearly how many inlinks a Web page has. This importance can be measured by the popularity of the web page, which can be computed from the number of inlinks to the page, or by its quality, which can be computed recursively as the sum of the qualities of the pages that link to it. The more inlinks to the page from important pages, the more likely the page is important. Another measure of page importance is informativeness, which is how many outlinks the page has to nodes that contain useful information.
1.4.2 Page Relevance
Web page relevance is the measure of how much the page is related to the
user's query. There are many factors that are used for this measurement. The location of a word or a phrase is one factor search engines consider. Query terms that appear in the title of the page or at the top of the page are considered most relevant. The frequency of the query in the page is another factor in how search engines determine relevancy. The anchor text within a page is also an important factor for page relevancy. Because all of these factors are found in the page content, we can say that page relevance is inferred from a page's content.


2. Web Mining
2.1 Overview
The extraction of implicit, unknown, non-trivial and potentially useful
information or patterns from a large database is called Data Mining. Web mining
is the application of the Data mining techniques to discover and extract
information from World Wide Web documents and services in an automatic way
[9]. The emergence of the Web as a huge source of information and the enormous growth of the Web today make web mining a wide area of research [9].
The authors of [9] suggested decomposing Web mining into four main subtasks, namely:
1. Resource finding: the task of retrieving intended Web documents.
2. Information selection and pre-processing: automatically selecting and pre-
processing specific information from retrieved documents.
3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites.
4. Analysis: validation and /or interpretation of the mined patterns.
Resource finding is the process of retrieving the online or offline data from
the text documents that are available on the World Wide Web. These text documents are represented in electronic newsgroups, newsletters, newswires, libraries and HTML documents. The second subtask is about selecting the HTML documents and transforming them by removing HTML tags, stop words, stemming, etc. Generalization is the task of extracting general patterns at an individual Web site and across multiple Web sites using data mining techniques. Due to the interactivity of the Web, users' behavior plays an important role in discovering information. This is important for the fourth subtask, validation and interpretation.
2.2 Web Mining Categories
Kosala et al. [9] categorized Web mining into three areas of interest based on which part of the Web needs to be mined: Web content mining, Web structure or link mining, and Web usage mining. Web content mining extracts information from the content of the web pages. Web structure mining extracts information from the hyperlinks between web pages, while Web usage mining extracts information from user behavior and logs.
2.2.1 Web Content Mining
Web content mining is the process of extracting useful information from
the content of the Web pages. The Web page content consists of many kinds of
data such as text, image, audio, video and hyperlinks. The Web page content
consists of unstructured data such as the text, semi-structured data such as HTML
or XML documents, and structured data such as data found in tables. There are two approaches to Web content mining: the agent-based approach and the database approach. The agent-based approach concentrates on finding relevant information
based on the inferred user profile. The database approach models the data and
retrieves the semi-structured data from the Web.
2.2.2 Web Structure Mining
Web structure mining, or what is sometimes referred to as Web link mining, tries to generate model summaries of Web sites and Web pages.
This kind of Web mining builds the structure of the linked web pages as a Web
graph where the nodes represent the Web pages and the edges of the graph
represent the hyperlinks between the web pages. Web structure mining makes the
classification of the Web pages easy in terms of their relevance and importance.
Web structure mining also detects the similarity and the relationship between
different Web sites and Web pages.
2.2.3 Web Usage Mining
Web usage mining applies data analysis techniques to extract the usage patterns generated by surfers on the Web and uses them to enhance web services such as a user's search experience. Web usage mining mines the data derived from users' interactivity with the Web. The data on patterns of usage can
be found for example in the IP addresses, time and the date of the user access,
user profile, user queries, mouse clicks and scrolls, cookies, transactions and any
other data that comes from user interactions.


The classification of Web mining is shown in figure 2.1.
Figure 2.1: Web Mining Classification [6]
Web mining has many important applications on the Web. The Web mining categories are used alone or in combination with each other by most researchers to give good research results. They are used and applied in many research areas. Their goal is to enhance search engines through their use in ranking algorithms, in order to make the search results more relevant and important to the users' queries. Many different page ranking algorithms have emerged with the use of these kinds of mining processes. Most of these algorithms regard structure mining as the most important technique for enhancing the search results, while others try to combine Web structure mining with Web content mining and Web usage mining to get better results. In the next chapter we discuss in detail the page ranking algorithms that emerged to enhance search engines.


3. Page Ranking
3.1 The Problem of Ranking
Web search engines use Web mining to give scores to the Web pages for a given query, and these scores are called page ranks.
Ranking any kind of collection of documents is not an easy task. The
ranking of Web documents or Web pages has more difficulties than other kinds of document collections for many reasons. The diversity of documents on the Web, their constant change, and the ease with which documents are added are some of the difficulties faced in the process of ranking web documents. Considering the enormous size of the Web, other difficulties include repeated documents, broken links and the many links created for commercial purposes.
Moreover, the keyword queries that people issue may suffer from the problem of synonymy (multiple ways to say the same thing) and polysemy (multiple meanings for the same word), because there is diversity in the people issuing the queries.
Kleinberg defines the difficulty of dealing with queries sufficiently broad in nature as the Abundance Problem: "The number of pages that could reasonably be returned as relevant is far too large for a human user to digest" [8].


A search engine has no problem finding and indexing millions of pages,
but the problem is that people will want to look at just a few of them which would
likely be the first 10 or 20 results.
Page ranking algorithms benefit from the graph structure of the Web pages and from the inherent property of the hyperlink structure of the Web, namely that a link from page A to page B indicates an endorsement of the quality of B [8], to produce link-based ranking algorithms. The content of the web pages and the patterns surfers follow in visiting Web pages are also used to address ranking difficulties and to generate efficient and diverse algorithms for ranking the search results.
3.2 Page Ranking Algorithms
Web mining techniques are powerful tools used within the ranking algorithms for searching the web [9]. The number of queries that users issue has increased as the size of the Web has grown substantially. With the number of users increasing every day, the number of queries that they send to the search engines vastly increases. So, the most important task for search engines is to process these queries in an efficient way. Using web mining techniques within page ranking algorithms makes the extraction of only the relevant and important pages from the database easier. There are many different page ranking algorithms produced and available in the literature [8], [3], [11], [14], [5]. Some


algorithms rely on the hyperlink structure of the Web, while others rely on the content of the Web pages or the usage patterns in visiting Web pages. On the other hand, other algorithms use a combination of two Web mining techniques, such as Web link mining and Web content mining.
The PageRank algorithm is based on the principle that the importance of a
page is determined by the importance of the pages linking to it [3]. The hyperlink structure of the Web is also utilized by two other important link-based ranking algorithms: Kleinberg's HITS algorithm [8] attempts to utilize this structure through the use of hubs and authorities over a root set, while Lempel and Moran [11] utilize this structure through their SALSA algorithm.
The Page Content Ranking algorithm (PCR) ranks Web pages based on their content, while an improved usage-based algorithm uses the time spent by the user reading Web pages, along with the link structure, to rank the Web pages.
In the following subsections we discuss several page ranking algorithms: PageRank, HITS, SALSA, PCR, and an improved usage-based algorithm.


3.3 The PageRank Algorithm
PageRank is a page ranking algorithm invented by Page and Brin [10] to rank web pages using the Web link mining technique. The PageRank algorithm is the main influence behind the Google search engine [3].
PageRank algorithm depends on the principle that a page with a large
number of backlinks or a link from an important page should be considered
important [10]. If a page A links to page B, the importance assigned to B by A is proportional to the importance of A and, on the other hand, inversely proportional to the number of pages pointed to by A. The backlinks are shown in figure 3.1.
Figure 3.1: Page A and Page B are Backlinks of the Page C
3.3.1 Simple PageRank Algorithm
PageRank uses the inlinks to a page as an input parameter to compute a rank score for that page. In a graph with n nodes (pages), each node is assigned an initial PageRank value equal to 1/n. After choosing a number of steps k, we do a sequence of update steps to the PageRank value of each page as follows:
1. Each node divides its PageRank value by the number of out-links of the
node.
2. The divided value is sent to all the nodes it points to.
When the node has zero out-links, it sends its current PageRank value back to
itself.
The simplified version of the PageRank algorithm can be computed by the following formula, illustrated in figure 3.2:

R(u) = \sum_{v \in B_u} \frac{R(v)}{N_v}

where u and v are Web pages, B_u is the set of pages that point to page u, and N_v is the number of out-links of page v. The PageRank R of the web page u is the sum of the PageRanks of its backlinks divided by the number of their corresponding out-links.
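A minimal Python sketch of the update procedure just described is given below. It is an illustration under stated assumptions (a small hand-made adjacency list and a fixed number of steps k), not code from the thesis; a dangling node keeps its own value, as described in the text.

def simple_pagerank(graph, k):
    # graph: node -> list of nodes it links to; k: number of update steps
    nodes = list(graph)
    rank = {u: 1.0 / len(nodes) for u in nodes}           # initial value 1/n
    for _ in range(k):
        new_rank = {u: 0.0 for u in nodes}
        for u in nodes:
            out = graph[u]
            if not out:                                   # dangling node keeps its value
                new_rank[u] += rank[u]
            else:
                share = rank[u] / len(out)                # divide by the number of out-links
                for v in out:
                    new_rank[v] += share                  # send the share along each link
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(simple_pagerank(web, 10))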


3.3.2 Random Surfer Model
Page and Brin simulated the computation of PageRank as a model of user
behavior [3]. The surfer of the Web can be seen as randomly walking on the web graph and choosing between two options: either click on one of the outlinks, or jump to another web page when he gets bored of following links.
The surfer at any web page clicks on each of the out-links with an equal probability of 1/n_p, where n_p is the number of outlinks of that web page p. The limiting probability that the surfer is at a given web page is the PageRank of that web page.
3.3.3 Expanding PageRank Algorithm
The basic PageRank algorithm encounters a difficulty [10]. The browser or the surfer of the web can get to a page without any out-links, a "dangling node" as in figure 3.3, and a surfer who is making random clicks will stop there.
Figure 3.3: Node C is a Dangling Node


This can happen when the page is a kind of file that has no embedded links, such as a PostScript or PDF file, or when the page requires special access that the surfer is not eligible for.
One possible action to resolve this issue is to remove the dangling links from the system and update the out-degree of each node that is not a dangling node [10]. An alternative way would be to delete these dangling links from the system until the calculation of the PageRank is finished and then put them back in the system [10].
The PageRank algorithm incorporates the random surfer model, with the surfer either clicking on an out-link or jumping to another page, as formulated in the following formula.
Let D represent the probability that the surfer is choosing to click on the
forward link while he is at page P. Then (1-D) will be the other option which is
the probability of jumping to another unrelated page while he is at page P. So, the
PageRank formula will be broadened to the following formula:
R(u) = (1 - D) + D \sum_{v \in B_u} \frac{R(v)}{N_v}
where u and v are some web pages, B_u is the set of pages that point to page u, and N_v is the number of out-links of page v. The PageRank R of the web page u is the sum of the PageRanks of its backlinks divided by the number of their corresponding out-links, damped by D. D is the damping factor (typically 0.85) and it ensures that every page receives a minimum ranking of (1 - D).
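A sketch of the extended formula with the damping factor is shown below; it modifies the simple version above by mixing in the (1 - D) jump term. The function name and the example graph are illustrative assumptions; the value D = 0.85 follows the text.

def pagerank(graph, k, d=0.85):
    # graph: node -> list of out-links; d: damping factor D; k: number of iterations
    nodes = list(graph)
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(k):
        new_rank = {u: 1.0 - d for u in nodes}            # every page receives at least (1 - D)
        for u in nodes:
            out = graph[u] or [u]                         # treat a dangling node as linking to itself
            for v in out:
                new_rank[v] += d * rank[u] / len(out)     # damped share from each backlink
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(web, 20))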


3.3.4 PageRank Example
Consider a little Web graph with three nodes (pages) as in figure 3.4 as an
example to illustrate the computation of PageRank.
Figure 3.4: Example of Hyperlinked Structure
All nodes start with a PageRank of 1/3. The PageRanks for pages 1, 2 and 3 in the first three update steps are shown in table 3.1.
Table 3.1: PageRank Example
Steps Page 1 Page 2 Page 3
1 1/3 1/2 1/6
2 1/3 5/12 1/4
3 1/3 11/24 5/24
3.3.5 Modified PageRank Algorithms
There are many variations that were put forward by different developers to
improve PageRank after it was originally proposed in 1998. Some of these proposals are concerned with qualitative improvement of the algorithm, and others with personalization, giving different PageRank values to some pages to reflect the characteristics of the group of users of the
algorithm.
Baeza-Yates and Davis [1] presented a qualitative improvement of PageRank that gives more precise answers to queries by considering web page attributes in order to assign weight values to the links between web pages.
McSherry at Microsoft Research proposed [12] a uniform approach to speed up the computation of the PageRank algorithm by reformulating its power method. PageRank personalization variants are discussed by Jeh and Widom [7].
3.4 Hyperlink Induced Topic Search (HITS) Algorithm
Hyperlink-Induced Topic Search (HITS) is an algorithm developed by Jon
Kleinberg in [8] to rank web pages using the Web link mining technique. This algorithm is designed to rank web pages for broad-topic queries whose initial results are obtained from text-based search engines [8].
There is a frequent observation on the web that, for any specific query topic, there is a set of authoritative pages focused on the query topic and also a set of hub pages that have links to multiple relevant authoritative pages. Figure 3.5 shows the concept of hub and authority pages.


Hubs Authorities
Figure 3.5: A Linked Set of Hubs and Authorities
These two kinds of pages obey a basic principle, a "mutually reinforcing relationship" [8], which says that a good hub points to many good authorities and a good authority is pointed to by many good hubs [10].
Kleinberg in his technique (HITS) used this phenomenon and the principle of the relationship between these two kinds of pages to classify the web pages into two sets, hub pages and authority pages, and differentiates between them to rank all the pages in the sub-graph of the relevant pages.
3.4.1 General HITS Algorithm
In order to rank pages using the HITS algorithm, each relevant page is assigned a numerical hub score and authority score. Each score starts out with an initial value equal to 1. An authority score is computed as the sum of the scaled hub values of the pages that point to that page, while the hub value is the sum of the scaled authority values of the pages it points to.


The algorithm uses the quality of the hubs to refine the estimation of the
quality of the authority and uses the quality of the authority to refine the
estimation of the quality of the hubs by performing a series of iterations, each
iteration consisting of two basic steps:
Authority Update Step: For each node in the sub-graph, the Authority
score value is updated to be equal to the sum of the Hub Score values of
each node that points to it.
Hub Update Step: For each node in the sub-graph, the Hub Score value is
updated to be equal to the sum of the Authority Score values of each node
that it points to.
3.4.2 HITS Example
The basic operation of the HITS algorithm is shown in figure 3.6.
a(1) = h(2) + h(3) + h(4)        h(1) = a(5) + a(6) + a(7)
Figure 3.6: The Basic Hub and Authority Operation


3.4.3 The HITS Algorithm Detail
3.4.3.1 Constructing the Base Set Sub-graph
In this phase a sub-graph of the web is constructed to be used as an input
to the algorithm. Given any set of web pages and a specific query topic a, the algorithm could simply analyze the link structure of the set Q_a of all pages containing the query topic.
This restriction on the pages has two defects:
1. The set of pages may have a large number of nodes, which leads to a high computational cost.
2. The set may not contain many of the best authorities, since the query string may not be contained in those pages.
Ideally, the collection S_a of pages derived from Q_a should have the following properties:
1. S_a is relatively small.
2. S_a is rich in relevant pages.
3. S_a contains most (or many) of the strongest authorities.
The algorithm starts by constructing the root set R_a (which is a subset of Q_a) that contains the t highest-ranked pages from the results of text-based search engines for the query topic a.


For each page in the root set R_a, the algorithm allows the page to bring no more than d pages pointing to it into the subset S_a, in order to keep the algorithm from degenerating. The subset S_a is built by growing the root set R_a and is called the base set for the query topic a. Figure 3.7 illustrates the root set and the base set.
Working on the base set S_a, the technique constructs the subgraph G_a that keeps the transverse links (the links between pages with different domain names), since these carry information about the authority of the pages they point to that is strong and more focused on the query topic.
Figure 3.7: The Root Set and the Base Set


The algorithm for finding a focused sub-graph is as follows [8]:
Subgraph(a, E, t, d)
  a: a query string
  E: a text-based search engine
  t, d: natural numbers
  Set Sa := Ra
  For each page p in Ra
    Let L+(p) denote the set of all pages p points to.
    Let L-(p) denote the set of all pages pointing to p.
    Add all pages in L+(p) to Sa.
    If |L-(p)| <= d then
      Add all pages in L-(p) to Sa
    Else
      Add an arbitrary set of d pages from L-(p) to Sa
  Return Sa
3.4.3.2 Computing Hubs and Authorities Algorithm
Kleinberg produced the Hubs and Authorities algorithm by using the mutually reinforcing relationship between hub and authority pages in an iterative procedure. For each page p, the algorithm computes two score values: a_p denotes the authority score value and h_p denotes the hub score value.
a_p = \sum_{q:(q,p) \in E} h_q, which is called the I operation by Kleinberg;
h_p = \sum_{q:(p,q) \in E} a_q, which is called the O operation by Kleinberg.
3.4.3.3 The Iterative Algorithm
The iterative algorithm developed by Kleinberg [8] can be stated as:


Iterate(G, k)
  G: a collection of n linked pages
  k: a natural number
  Let z denote the vector (1, 1, ..., 1) in R^n
  Set a_0 := z
  Set h_0 := z
  For i = 1, 2, ..., k
    Apply the I operation to (a_{i-1}, h_{i-1}), obtaining new a-weights a'_i
    Apply the O operation to (a'_i, h_{i-1}), obtaining new h-weights h'_i
    Normalize a'_i, obtaining a_i
    Normalize h'_i, obtaining h_i
  End
  Return (a_k, h_k)
3.4.4 HITS Algorithm Normalization
The final scores of the pages are determined only in the limit of the iterations, and with repeated application of the Authority Update Step and the Hub Update Step alone the values will diverge. After each iteration, the algorithm therefore has to normalize the score values by dividing each authority score value by the sum of all authority scores, and dividing each hub score value by the sum of all hub scores. The values obtained from the normalized algorithm are guaranteed to converge, typically within 10 steps [8].
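The I and O operations together with this normalization can be sketched in Python as follows. It is a minimal illustration on an assumed edge list, not the thesis's own code; normalization divides by the sum of the scores, as stated above.

def hits(edges, nodes, k=10):
    # edges: list of (p, q) pairs meaning page p links to page q
    auth = {n: 1.0 for n in nodes}                        # initial scores equal to 1
    hub = {n: 1.0 for n in nodes}
    for _ in range(k):
        # I operation: authority of p is the sum of the hub scores of pages pointing to p
        new_auth = {p: sum(hub[q] for q, r in edges if r == p) for p in nodes}
        # O operation: hub of p is the sum of the new authority scores of pages p points to
        new_hub = {p: sum(new_auth[r] for q, r in edges if q == p) for p in nodes}
        # normalize so that the values do not diverge
        a_sum = sum(new_auth.values()) or 1.0
        h_sum = sum(new_hub.values()) or 1.0
        auth = {p: v / a_sum for p, v in new_auth.items()}
        hub = {p: v / h_sum for p, v in new_hub.items()}
    return auth, hub

nodes = ["1", "2", "3", "4"]
edges = [("2", "1"), ("3", "1"), ("4", "1"), ("1", "4")]
print(hits(edges, nodes))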


3.4.5 Modified HITS Algorithms
Many variants of the HITS algorithm have been proposed by developers who have studied HITS extensively.
Bharat and Henzinger [2] improved the HITS algorithm in their work on topic distillation by weighting pages and their links using text information.
Randomized HITS [13] used a two-level system of ranking the web pages.
It combines the random surfer model from PageRank with the general idea of
hubs and authorities from HITS to achieve a rank algorithm for calculating hub
and authority score values in a given web graph.
Tsaparas [15] has proposed another variant of HITS algorithm and his
work is primarily concerned with the theoretical analysis, while disregarding the
qualitative enhancement to the algorithm.
3.5 Stochastic Approach for Link Structure Analysis (SALSA) Algorithm
The Stochastic Approach for Link Structure Analysis was proposed in [11] by Lempel and Moran in their paper "SALSA: The Stochastic Approach for Link-Structure Analysis". The SALSA algorithm combines two link-structure ideas. One idea comes from the PageRank algorithm, using a random walk on the Web graph, and the other idea comes from the HITS algorithm, using the hub and authority observation on the Web. The SALSA algorithm mainly breaks the tight coupling between hubs and authorities [11] that Kleinberg used in his algorithm through the mutually reinforcing relationship. The algorithm performs two random walks on the bipartite hubs-and-authorities graph.
3.5.1 The Algorithm's Detail
The bipartite undirected graph H is constructed by building the subset V_a of all the nodes that represent the potential authorities and the subset V_h of all the nodes that represent the potential hubs from the collection of sites. An example of a constructed bipartite graph is shown in figure 3.8.
Figure 3.8: Constructing an Undirected Bipartite Graph
The SALSA algorithm performs two random walks on the constructed bipartite graph, one random walk from each side: the authority side and the hub side.
The random walk starts from the authority side and does the following:
At a node on the authority side of the graph H, the algorithm selects
randomly an in-link and moves to the hub node on the hub side.


Each hub node divides its value equally among the authorities to which it
points.
The authority value of the node is computed by summing up the hub values of the hubs that point to it.
At a node on the hub side of the graph H, the algorithm selects randomly an out-link and moves to the authority node on the authority side.
Each authority node divides its value equally among the hub nodes that point to it.
The hub value of the node is computed by summing up the authority values of the authorities it points to.
The random walk starts from an authority node selected at random and then proceeds by alternating backward and forward steps. The probability of moving from authority i to authority j is then:

Pr(i \to j) = \sum_{k: k \in B(i) \cap B(j)} \frac{1}{|B(i)|} \cdot \frac{1}{|F(k)|}

where B(i) denotes the set of hubs pointing to authority i and F(k) denotes the set of authorities that hub k points to. Instead of simply broadcasting its weight, each node divides its hub/authority weight equally among the authorities/hubs connected to it:

a_i = \sum_{j \in B(i)} \frac{h_j}{|F(j)|}            h_i = \sum_{j \in F(i)} \frac{a_j}{|B(j)|}
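A compact Python sketch of this weight propagation on an assumed edge list is given below; the names and the fixed number of iterations are illustrative choices, not taken from [11].

def salsa(edges, k=20):
    # edges: (hub_page, authority_page) pairs of the bipartite graph
    hubs = {h for h, a in edges}
    auths = {a for h, a in edges}
    F = {h: [a for hh, a in edges if hh == h] for h in hubs}   # authorities each hub points to
    B = {a: [h for h, aa in edges if aa == a] for a in auths}  # hubs pointing to each authority
    hub = {h: 1.0 / len(hubs) for h in hubs}
    auth = {a: 1.0 / len(auths) for a in auths}
    for _ in range(k):
        # each hub divides its weight equally among the authorities it points to
        auth = {a: sum(hub[h] / len(F[h]) for h in B[a]) for a in auths}
        # each authority divides its weight equally among the hubs pointing to it
        hub = {h: sum(auth[a] / len(B[a]) for a in F[h]) for h in hubs}
    return auth, hub

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a2")]
print(salsa(edges))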


3.6 Page Content Rank Algorithm (PCR)
The Page Content Rank algorithm was proposed by Pokorny and Smizansky [14].
The relevance measurement of the page is determined from the importance of the
terms that the page contains.
The PCR algorithm depends on two heuristics for analyzing the page
content. The first heuristic is to determine the page importance according to the
terms that the page contains, while the other heuristic is to determine the term
importance according to specific features regarding a given query q. The PCR algorithm produces R_q, a set of web pages ranked according to their relevance to the query q, taking the output of any search engine as its input. The PCR algorithm makes an assumption, as stated in [14], that says "the importance of the page P is proportional to the importance of all terms in P".
According to that assumption, the PCR algorithm gathers some statistical and linguistic features of the terms in a set of ranked pages obtained from a usual search engine. These features are represented by, for example, the number of occurrences of the term t in the ranked pages, the distances of term occurrences from occurrences of the set of terms in the query q, the frequency in the natural language, synonyms, and the term window. These features are specified formally in the original paper [14].


3.6.1 PCR Description
There are four steps that describe PCR's method:
1. Term extraction: using an HTML parser to extract the terms that appear as text in the web page and building the inverted list to be used in step 4.
2. Parameter calculation: in this step, some statistical and linguistic
parameters are calculated. The statistical parameters are represented by term frequency and occurrence positions, while the linguistic parameters are represented, for example, by word frequency in the natural language and synonym classes. All of these calculations depend on the query q.
3. Term classification: this step is to determine the importance of each term
based on the calculations that were made in the previous step.
4. Calculation of the page relevance: after the term importance calculation has been done in the previous step, the relevance of the page containing those terms is determined. The new relevance of a page P is the average importance of the terms contained in page P.
3.6.2 PCR Specification [14]
PCR uses the second moment function sec_moment(S), where S = {x_i | i = 1 ... n}, n = |S|, and the x_i are real numbers.
The PCR algorithm uses the following symbols:
D: the set of all pages considered by a search engine;
q: a conjunctive Boolean query;
R_q \subseteq D: the subset of all pages from D marked as relevant by a search engine;
R_{q,n} \subseteq R_q: the subset of the n top-ranked pages from R_q; if n > |R_q|, then R_{q,n} := R_q;
TF(P, t): the number of occurrences of term t in P;
DF(t): the number of pages which contain the term t;
Pos(P, t): the set of positions of t in P;
Term(P, i): a function returning the term at the i-th position in page P; thus, Term(P, i) = t if and only if i \in Pos(P, t).
3.6.3 The Parameter Calculations in PCR
Two calculations are done in PCR for the importance of a term t. importance(t) and classify() are two functions that return the importance of the term t based on 5 + (2 * NEIB) parameters, where NEIB is the number of neighboring terms.
The following parameters influence the importance of the term:
Occurrence frequency determines the number of occurrences of term t in R_q:
freq(t) = \sum_{P \in R_q} TF(P, t) .......... (3.1)
Distance from key terms determines the distance of t from the key terms. Let QW be the set of all positions of occurrences of the terms of Q in all pages in R_{q,n}:
QW = \bigcup_{t' \in Q, P \in R_{q,n}} Pos(P, t') .......... (3.2)
Then the minimum distance from key terms is:
dist(t) = \min(\{|j - i| : j \in Pos(P, t), i \in QW\}) .......... (3.3)
Incidence of pages is determined by the following equation:
occur(t) = DF(t) / |R_{q,n}| .......... (3.4)
Frequency in the natural language can be defined as:
common(t) = FL(t) .......... (3.5)
where FL is a mapping function from all the words of a database of frequent words to an integer number denoting the word's frequency.
Term importance determines the importance of all terms from R_{q,n} by the following equation:
importance(t) = classify(freq(t), dist(t), occur(t), common(t), 0, 0, ..., 0) .......... (3.6)
Synonym class: for each synonym class S we calculate an aggregate importance SC(S) on the basis of the importance of the terms in the class S:
SC(S) = sec_moment(\{importance(t') : t' \in S\}) .......... (3.7)
A term becomes important if it is a synonym of an important term. SC(S) is propagated to the term t by the following function:
synclass(t) = sec_moment(\{SC(S_{t'}) : t' \in SENSE(t)\}) .......... (3.8)
where SENSE(t) contains all the meanings t' of t.
Importance of neighboring terms: the importance of a term can also be determined by the importance of its neighbors. The set of terms which are the i-th neighbors of term t in all pages P \in R_{q,n} is determined by the following function:
RelPosNeib(t, i) = \bigcup_{P \in R_{q,n}} \{Term(P, j + i) : j \in Pos(P, t) \wedge inside(P, j + i)\}
The parameter neib(t, i) is defined as follows:
neib(t, i) = sec_moment(RelPosNeib(t, i)) .......... (3.9)
Based on these parameters, the importance of the term t is defined as:
importance(t) = classify(freq(t), dist(t), occur(t), common(t), synclass(t), neib(t, -NEIB), ..., neib(t, NEIB)) .......... (3.10)
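To make the statistical parameters concrete, the fragment below computes the basic counts TF(P, t), DF(t) and Pos(P, t) and from them freq(t) and occur(t) for a tiny set of already-parsed pages. It is a sketch under stated assumptions (plain whitespace tokenization, hypothetical page texts); the linguistic parameters, the synonym classes and the neural-network classifier of [14] are not reproduced here.

def positions(page_terms, t):
    # Pos(P, t): positions of term t in page P
    return [i for i, w in enumerate(page_terms) if w == t]

def pcr_stats(pages, t):
    # pages: list of token lists for the n top-ranked pages R_{q,n}
    tf = [len(positions(p, t)) for p in pages]        # TF(P, t) for each page
    freq = sum(tf)                                    # freq(t), equation (3.1)
    df = sum(1 for c in tf if c > 0)                  # DF(t)
    occur = df / len(pages)                           # occur(t), equation (3.4)
    return freq, df, occur

pages = ["ranking web pages for search queries".split(),
         "web mining and page ranking".split()]
print(pcr_stats(pages, "web"))     # (2, 2, 1.0)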
3.6.4 Page Classification and Importance Calculation
The classify() function uses the neural network NET and can be defined as:
classify(p_1, ..., p_{5+(2*NEIB)}) = NET(p_1, ..., p_{5+(2*NEIB)}) .......... (3.11)
The importance of a page P is considered to be an aggregate value of the importance of all terms in P:
importance(P) = sec_moment(\{importance(t) : t \in P\}) .......... (3.12)
The importance value of every page in the set R_{q,n} gives a new rank to the top n pages in the list obtained from the search engine.


3.7 An Improved Usage-Based Ranking Algorithm
Ding, Chi, and Luo proposed in [5] an algorithm that uses the link and
usage analysis of the web for web page ranking. The algorithm depends on the principle that if a page is more frequently selected, its chance of being judged as relevant is higher; if a page is less frequently selected, its chance of being relevant is lower [5].
In general, the algorithm observes the actions made on specific web pages obtained from a search engine, where the pages have been chosen by users and the actions performed by them. Some relevance information is then inferred from the user behavior on those web pages. The algorithm gives scores to the web pages when the same query is submitted by other users.
However, the selection frequency by itself cannot stand alone to score the web pages. Accidental user mistakes, misleading web pages, or misleading result summaries are various reasons for an inaccurate selection, which means that users just click on the links and go back to the result list. The algorithm therefore adds the time spent reading the selected web pages as a factor that lets web pages become more important. The algorithm then normalizes the time duration by the length of the web page, since the time spent reading a page is smaller when the page is smaller in size.


Ding et al. [5] found that adding the hyperlink information to the algorithm improves the measurement of the importance of the pages. They observe in [5] that "a higher percentage of links accessed from a page could be a strong indication of the importance of the page".
So in this algorithm the time duration and the access via hyperlinks
are two major factors to measure the page importance. These factors are used to
compute a basic usage-based rank score. The basic ranking formula is as follows,
URank(Q, D) = \sum_{i=1}^{n_{Q,D}} \frac{lt_{D,i}}{lt_Q} \times Dur(Q, D, i) .......... (3.13)
Dur(Q, D, i) = dur(D, i) + F_u \times \sum_{D' linked from D} dur(D', i) .......... (3.14)
where lt_Q is the latest access time of query Q;
lt_{D,i} is the latest access time of document D in the i-th access for Q;
n_{Q,D} is the number of accesses for D from Q;
dur(D, i) is the time spent on D in the i-th access, normalized by the length of D;
F_u is the fading factor for durations propagated from linked pages.
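The basic usage-based score can be sketched as follows. The access records and field names are hypothetical and only illustrate formulas (3.13) and (3.14) as reconstructed above, assuming each access record carries the normalized reading duration of the selected document and of the pages reached from it through links.

def basic_urank(accesses, lt_q, f_u=0.5):
    # accesses: one record per access of document D for query Q, each with
    #   'lt'          latest access time of D in that access,
    #   'dur'         reading time on D, normalized by page length,
    #   'linked_dur'  normalized reading times on pages reached from D via links
    score = 0.0
    for rec in accesses:
        dur = rec["dur"] + f_u * sum(rec["linked_dur"])   # Dur(Q, D, i), eq. (3.14)
        score += (rec["lt"] / lt_q) * dur                 # recency-weighted sum, eq. (3.13)
    return score

accesses = [{"lt": 90, "dur": 1.2, "linked_dur": [0.4]},
            {"lt": 100, "dur": 0.8, "linked_dur": []}]
print(basic_urank(accesses, lt_q=100))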
In the next step, the algorithm adjusts the basic score by applying four heuristics [5].
Heuristic 1: if a web page has a high rank, and its selection frequency is less than
the average selection frequency for the query, then it should have a negative
adjustment value computed as follows,


URank'(Q, D) = (clickrate(Q, D) / \overline{clickrate}(Q) - 1) \times (HR_THRES - rank'(Q, D)) .......... (3.15)
clickrate(Q, D) = freq(Q, D) / freq(Q) .......... (3.16)
where freq(Q, D) is the selection frequency of D for Q, which is equal to n_{Q,D};
freq(Q) is the frequency of Q;
rank'(Q, D) is the average rank position of D in previous searches for Q; the average clickrate \overline{clickrate}(Q) is averaged over all result documents for Q.
When the rank of a document is less than HR_THRES, it is considered to have a high rank.
Heuristic 2: if a web page has a high rank, and its average duration is less than the lower bound for the duration value LB_DUR, then it has a negative adjustment value.
URank'(Q, D) = (\overline{Dur}(Q, D) / LB_DUR - 1) \times (HR_THRES - rank'(Q, D)) .......... (3.17)
Heuristic 3: if a web page has a high rank, but it has never been accessed, then it has a negative adjustment value.
URank'(Q, D) = (hrfreq(Q, D) - HRFREQ_THRES) \times (HR_THRES - rank'(Q, D)) .......... (3.18)
where hrfreq(Q, D) measures how many times a document D occurs in a high position of the ranked list for Q and is accessed; HRFREQ_THRES is a threshold value for hrfreq(Q, D).
Heuristic 4: if a document has a low rank, and its selection frequency is larger than a threshold value LB_CLICKRATE, it has a positive adjustment value.


URank'(Q, D) = (1 - LB_CLICKRATE / clickrate(Q, D)) \times (rank'(Q, D) - LR_THRES) .......... (3.19)
When the rank of a document is larger than LR_THRES, it is considered to have a low rank.
The final usage-based rank score is the basic rank score adjusted with this value and then multiplied by a reliability factor, as follows:
URank(Q, D) = rf(Q) \times (URank(Q, D) + URank'(Q, D)) .......... (3.20)
rf(Q) = lt_Q \times freq(Q) \times (lt_Q - ft_Q) .......... (3.21)
where rf(Q) is the reliability factor for the rank value and ft_Q is the first time of query submission for Q.
The reliability factor is determined by the usage data collected for the query. If the latest submission time for the query is more recent, the usage-based rank for this query is more reliable. If the query is more frequently submitted, the rank is more reliable. If the query has existed for a longer time in the query database, the rank is more reliable.


4. A Hybrid Mining to Improve Web Page Ranking
4.1 Overview
The complete use of the sources that are available on the web, as an alternative to concentrating on just one or two sources, is key to accomplishing good and suitable page ranking algorithms which take into account the importance of the web pages in addition to their relevance. These ranking algorithms place the relevant and important pages at the top of the list of search engine results.
Using the link graph alone as an information source gives incomplete and inaccurate information, since the hyperlinks of the pages are easily added and deleted by their creators, whereas using just the content of the pages for ranking could cause the importance of the page to be lost.
User behavior, or Web logs, is an important factor that contributes to increasing users' degree of satisfaction through its indications of the quality and importance of the Web pages. However, many problems come from considering only the usage data to rank web pages. Incorrect information is one of them: a surfer who gets lost, and whose clicks are recorded on the server, will mislead the estimation of the importance of the web pages. Another problem arises when a new page is added and has not been visited yet; this page will have a poor rank even if it is important.
My general idea is to make hybrid use of all of the Web mining mechanisms in one page ranking algorithm and to benefit from the characteristics of each mechanism to produce a successful algorithm for use by search engines. This algorithm gives each page a rank which is a combination of weights or scores defined on different kinds of parameters.
4.2 The General Hybrid Method
The hybrid method for ranking web pages proceeds in three steps, generating a different ranking score for the web pages in each step. It uses content mining in the first step and builds a weighted graph structure of the web in the second step. The weights of the links of the graph represent users' behavior while surfing the Internet. In each step, the model utilizes the characteristics of the mining techniques that help to give a good rank to the pages.
The first step will produce ranks for web pages which will be considered relevancy ranks, since it works with the contents of the web pages.
The second step will produce ranks for web pages which will be considered importance ranks, since it works with the link structure of the web pages on the web. A weighted directed graph is built for the entire set of pages on the web.


As two rankings are produced for each page, the final step is to combine the two ranks, the relevancy ranking and the importance ranking, into one rank score for each web page.
4.3 The Algorithm
Step 1:
Considering all the pages on the web, the algorithm uses a number of heuristics together to analyze every page's content and determine its relevancy to a given query. The content of a page is found in several places within the web page: the title, the body text, and the anchor text (or link text), which is the highlighted piece of clickable text in a hyperlink leading to another web page.
This relevance measure step computes the importance of the text terms contained in a page with respect to the query in the following sub-steps:
Term extraction: for every page, extract the text terms using an HTML parser, then store these terms in a list.
Parameter calculation: statistical parameters are calculated for each web page, such as the query frequency, the query position on the page, and the anchor text that matches the query.
Page relevance calculation: combine the parameters calculated in the previous sub-step to produce a single relevance score for each page, rank_relevant.
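The following Python sketch illustrates this step on a single page, assuming the page has already been downloaded. The particular way the parameters are combined into rank_relevant (a simple weighted sum with arbitrary weights) is an assumption made for illustration; the algorithm only requires that the parameters be combined into one relevance score per page.

from html.parser import HTMLParser

class TermExtractor(HTMLParser):
    # Collects body terms and anchor-text terms from an HTML page.
    def __init__(self):
        super().__init__()
        self.terms, self.anchor_terms = [], []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        words = data.lower().split()
        self.terms.extend(words)
        if self._in_anchor:
            self.anchor_terms.extend(words)

def rank_relevant(html, query):
    parser = TermExtractor()
    parser.feed(html)
    q = query.lower()
    freq = parser.terms.count(q)                                 # query frequency
    pos = parser.terms.index(q) if freq else len(parser.terms)   # first position
    anchor = parser.anchor_terms.count(q)                        # anchor-text matches
    # Illustrative combination: frequent, early, anchor-linked terms score higher.
    return freq + 2.0 * anchor + 1.0 / (1 + pos)

page = '<html><title>graph</title><body>web graph <a href="x">graph mining</a></body></html>'
print(rank_relevant(page, "graph"))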


Step 2:
This step is concerned with creating the weighted directed graph for the web pages. The nodes of the graph represent the web pages, while the edges represent the link information together with the time spent by users visiting those pages and the frequency of link usage, which constitutes the web usage mining part.
After building the graph, a random surfer model like the one used in the PageRank algorithm is run on the weighted graph, with emphasis on the time-spent and link-frequency information, so that the out-links of a node receive different scores according to the weights of the links of the graph, instead of the equal weights that the original PageRank algorithm assigns to out-links. At the end of the algorithm, an importance score is assigned to each page, rank_important.
Figure 4.1: User Behavior Data and the Weighted Graph
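A minimal sketch of such a usage-weighted random surfer is given below. It assumes the usage-derived edge weights (for example, time spent plus click frequency on each link) have already been computed and attached to the graph; the damping factor of 0.85 and the handling of dangling nodes are assumptions carried over from the standard PageRank formulation.

def weighted_pagerank(graph, damping=0.85, iterations=50):
    # graph: {page: {target_page: edge_weight, ...}}, where each weight reflects
    # usage data (e.g. time spent and click frequency) on that hyperlink.
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, out in graph.items():
            total = sum(out.values())
            if total == 0:                      # dangling node: spread evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
                continue
            for q, w in out.items():
                # Rank flows in proportion to the usage weight of the link,
                # instead of equally as in the original PageRank.
                new_rank[q] += damping * rank[p] * (w / total)
        rank = new_rank
    return rank   # rank_important for each page

web = {"A": {"B": 3.0, "C": 1.0}, "B": {"C": 2.0}, "C": {"A": 1.0}}
print(weighted_pagerank(web))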


Step 3:
By the end of the second step, each web page has two different rank scores: rank_relevant and rank_important. This step computes a single rank score for each web page by combining the two.
Input: Internet web pages
Output: final score a for each web page
Algorithm:
  Analyze each page's content
  Produce rank_relevant for each page
  Analyze the users' behavior for the web pages
  Construct a directed weighted web graph
  Run a random surfer model
  Produce rank_important for each page
  Combine rank_relevant with rank_important for each page to produce a (the final score)
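Since the best way to combine the two scores is left open (see the discussion of the combination step below), the following sketch simply uses a normalized linear combination with a tunable weight; both the max-normalization and the weight value are assumptions made for illustration only.

def combine_ranks(rank_relevant, rank_important, beta=0.5):
    # Combine the two per-page scores into the final score a.
    # rank_relevant, rank_important: {page: score}; beta weighs relevance
    # against importance.  The linear form is an illustrative choice only.
    def normalize(scores):
        m = max(scores.values()) or 1.0
        return {p: s / m for p, s in scores.items()}
    rel, imp = normalize(rank_relevant), normalize(rank_important)
    return {p: beta * rel[p] + (1 - beta) * imp[p] for p in rel}

rel = {"A": 6.0, "B": 2.0, "C": 1.0}
imp = {"A": 0.2, "B": 0.5, "C": 0.3}
print(combine_ranks(rel, imp, beta=0.6))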
The hybrid method refines the ranking of web pages further by considering both the importance and the relevance of the web pages. It uses all the sources of information on the web and combines all kinds of web mining techniques: web content mining, web link mining, and web usage mining. Because it considers the relevancy as well as the importance of the web pages, the quality of the search engine results will be enhanced.
Even though the new approach looks promising because it uses all the information available on the web to rank web pages, it faces many storage issues. It requires huge storage facilities for content information, server logs, ranking scores, and the hyperlink graph. For example, it needs to build a list of the occurrences of each word in each page, including the positions and other information, and that may require encoding, sorting, and merging. This makes the development of the new approach much more difficult.
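The sketch below shows the kind of positional occurrence list (an inverted index with positions) that would have to be stored for every word; the data structure and the toy pages are assumptions made purely for illustration of the storage cost.

from collections import defaultdict

def build_position_index(pages):
    # pages: {url: text}.  Returns {term: {url: [positions]}}, the kind of
    # occurrence list the hybrid approach would have to store for every word.
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][url].append(pos)
    return index

docs = {"p1": "web graph mining", "p2": "graph ranking on the web graph"}
idx = build_position_index(docs)
print(dict(idx["graph"]))   # {'p1': [1], 'p2': [0, 5]} -- positions per page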
This approach also requires storage for its mathematical components, the matrices and vectors. The issues of stability, which concerns the accuracy of the algorithm, and convergence, which concerns the termination of the algorithm within a reasonable number of iterations, arise throughout the computation of this algorithm.
The combination step is also a challenge. There are various methods for combining the two scores into the final score, but the best combination is not clear in this algorithm, and considerable further study is needed to find it. The complete implementation and evaluation of this algorithm are beyond the scope of this thesis.


5. Conclusion
A search engine is expected to return the most relevant web pages that users are looking for among the top web pages of the obtained list. Web page ranking mechanisms have been adopted by search engines as one of their main components. The mining techniques used in this component to rank the web pages are called Web mining or Web analysis techniques.
Analyzing or mining the web means extracting the necessary information from the huge amount of data located on the web. By using this information, search engines can respond to the user's query and produce the most relevant pages at the top, making the search faster and easier for the user.
PageRank, HITS, SALSA, PCR, and the Improved Usage-Based Ranking algorithm rank web pages using different methodologies; all of them are considered and explained in this thesis.
These web page ranking algorithms rely on different ways of analyzing or mining the web: web structure mining, web content mining, and web usage mining. Some of them use a combination of these techniques. The PageRank algorithm mines the web structurally, extracting information from the hyperlinks between web pages, and the Page Content Ranking (PCR) algorithm emphasizes the content of the web pages, whereas the HITS and SALSA algorithms also consider the structure of the web pages. The Improved Usage-Based Ranking algorithm is an attempt to use web link mining while taking into account the advantages of information gained from users' behavior in navigating the web, which is the web usage analysis.
Web page ranking algorithms use different inputs. PageRank utilizes the backlinks, while HITS and SALSA utilize both backlinks and forward links. The Improved Usage-Based Ranking algorithm uses the behavior of the user, such as the time spent on a page, besides some other features, whereas PCR uses the terms that the page contains as its input.
According to the technique used, the page ranking algorithms output different page orders as a result. Some algorithms, such as PageRank, HITS, SALSA, and the Improved Usage-Based Ranking algorithm, return the most important pages, while other algorithms, such as PCR, ignore the importance of the pages but place the most relevant pages at the top of the obtained list.
The new algorithm proposed in this thesis uses all the web mining techniques. Its input consists of the pages' content, inlinks, and server logs. As the algorithm uses all the information sources on the web to produce the pages' ranks, it considers both the importance and the relevancy of a page. So, it produces the most important and relevant web pages according to the query topic.


The following table 4.1 is a summary comparison of the web page ranking algorithms considered and proposed in this thesis. The comparison is done on the basis of parameters such as the algorithm's input, the type of result it produces, the technique used to mine the web, and the principle of each algorithm.
Table 4.1: Comparison of Various Web Page Ranking Algorithms

PageRank
  Mining technique: Web link mining
  Input: inlinks
  Principle: if a page has important inlinks, its forward links to other pages become important
  Output: importance score for each page
  Complexity: O(log N)

HITS
  Mining technique: Web link mining
  Input: inlinks and forward links
  Principle: a good hub points to many good authorities; a good authority is pointed to by many good hubs
  Output: importance score for each page
  Complexity: < O(log N)

SALSA
  Mining technique: Web link mining
  Input: inlinks and forward links
  Principle: hubs and authorities
  Output: importance score for each page
  Complexity: < O(log N)


Table 4.1: (cont.)

PCR
  Mining technique: Web content mining
  Input: pages' content
  Principle: a page is important if the terms it contains are important with respect to the query topic
  Output: relevance score for each page
  Complexity: O(M)

Improved Usage-Based Ranking
  Mining technique: Web link mining and Web usage mining
  Input: original PageRank and server log (visiting time)
  Principle: server logs can contribute to the discovery of page importance
  Output: importance score for each page
  Complexity: O(log N)

A Hybrid Mining
  Mining technique: Web content mining, Web usage mining, and Web link mining
  Input: inlinks, pages' content, and server logs
  Principle: using all information sources on the web contributes to the page relevancy and importance
  Output: importance and relevance score for each page
  Complexity: not available

N: number of web pages; M: total number of occurrences of the query terms in the n pages


REFERENCES
1. Baeza-Yates, Ricardo, and Emilio Davis. "Web Page Ranking Using Link Attributes." In Proceedings of the 13th International World Wide Web Conference. (2004): 328-329.
2. Bharat, Krishna, and Monika Henzinger. "Improved algorithms for topic distillation in a hyperlinked environment." In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. (1998): 104-111.
3. Brin, Sergey, and Lawrence Page. "The anatomy of a large-scale hypertextual Web search engine." Computer Networks and ISDN Systems. 30. (1998): 107-117.
4. Broder, Andrei, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, and Sridhar Rajagopalan. "Graph structure in the Web." Computer Networks. 33. (2000): 309-320.
5. Ding, Chen, Chi-Hung Chi, and Tiejian Luo. "An Improved Usage-Based Ranking." In Proceedings of the Third International Conference on Advances in Web-Age Information Management. Edited by Xiaofeng Meng, Jianwen Su, and Yujun Wang. London: Springer-Verlag, 2002. 346-353.
6. Duhan, Neelam, A. Sharma, and Komal Bhatia. "Page Ranking Algorithms: A Survey." In Proceedings of the IEEE International Advance Computing Conference. (2009): 1530-1537.
7. Jeh, Glen, and Jennifer Widom. "Scaling personalized web search." In Proceedings of the 12th International Conference on World Wide Web. (2003): 271-279.
8. Kleinberg, Jon. "Authoritative Sources in a Hyperlinked Environment." Journal of the ACM (JACM). 46. (1999): 604-632.


9. Kosala, Raymond, and Hendrik Blockeel. "Web Mining Research: A Survey." ACM SIGKDD Explorations. 2. (2000): 1-15.
10. Langville, Amy, and Carl Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. 1st ed. New Jersey: Princeton University Press, 2006.
11. Lempel, Ronny, and Shlomo Moran. "The Stochastic Approach for Link-Structure Analysis." Computer Networks. 33. (2000): 387-401.
12. McSherry, Frank. "A uniform approach to accelerated PageRank computation." In Proceedings of the 14th International Conference on World Wide Web. (2005): 575-582.
13. Ng, Andrew, Alice Zheng, and Michael Jordan. "Stable algorithms for link analysis." In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. (2001).
14. Pokorny, Jaroslav, and Jozef Smizansky. "Page Content Rank: An Approach to the Web Content Mining." IADIS International Conference on Applied Computing. (2005): 289-296.
15. Tsaparas, Panayiotis. "Using non-linear dynamical systems for web searching and ranking." In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. (2004): 59-70.


Full Text

PAGE 1

ALGORITHMS TO RANK WEB PAGES FOR SEARCH QUERIES By Zainab Ahmed Ahmed B.S., Alfateh University/ Tripoli, 2000 A thesis submitted to the University of Colorado Denver In partial fulfillment of the requirements for the degree of Master of Science Computer Science 2011 II

PAGE 2

This thesis for the Master of Science degree by Zainab Ahmed Ahmed has been approved by Bogdan Chlebus .... Gita Alaghb'ifhd I I I o}/ 2 I Date

PAGE 3

Ahmed, Zainab (M.S., Computer Science) Algorithms to Rank Web Pages for Search Queries Thesis directed by Associate Professor Bogdan S. Chlebus ABSTRACT In this thesis we discuss algorithmic underpinnings of search engines. We survey methods to produce ranking of web pages based on web mining. We discuss PageRank algorithm, which is at the core of the Google search engine, and ranking algorithms based on hubs and authorities. We propose a hybrid approach to produce page ranking, which combines various methods to rank web pages.

PAGE 4

This abstract accurately represents the content of candidate's thesis. I recommend its publication. Signed Bogdan S. Chlebus

PAGE 5

TABLE OF CONTENTS Figures................................................................................ Viii Tables................................................................................. 1x Chapter 1. Introduction ...................................................................... 1.1 The World Wide Web (WWW).............................. .. . . . . . . . 4 1.1.1 The Hyperlinks . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . 5 1.2 Web Search Engine............................................................. 6 1.2.1 Search Engine Mechanism................................................... 6 1.3 The Data Structure for the Web............................................... 8 1.3.1 The Graph Notation.......................................................... 8 1.3.2 The Web Graph............................................................. 10 1.4 Measure the Quality of Web Searching Results........................... 12 1.4.1 Page Importance............................................................... 12 1.5.2 Page Relevance............................................................... 12 2. Web Mining....................................................................... 13 2.1 Overview......................................................................... 13 2.2 Web Mining Categories....................................................... 14 2.2.1 Web Content Mining......................................................... 14 2.2.2 Web Structure Mining....................................................... 15 2.2.3 Web Usage Mining........................................................... 15 3. Page Ranking . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 The Problem ofRanking....................................................... 17 v

PAGE 6

3.2 Page Ranking Algoritluns... ... ........... ............................... .... 18 3.3 The Page Rank Algorithm..................................................... 20 3.3.1 Simple PageRank Algoritlun... .. ... . .. . ...... ... ... .. . . . .. .. ...... .. 20 3.3.2 Random Surfer Model........................................................ 22 3.3.3 Expanding PageRank Algorithm........................................... 22 3.3.4 Page Rank Example........................................................... 24 3.3.5 Modified PageRank Algorithms............................................ 24 3.4 Hyperlink Induced Topic Search (HITS) Algoritlun................... .... 25 3.4.1 General HITS Algoritlun.................... ...... .......................... 26 3.4.2 The HITS Example.......................................................... 27 3.4.3The HITS Algorithm Detail................................................. 28 3.4.3.1 Constructing the Base Set sub-graph.................................... 28 3.4.3.2 Computing Hubs and Authorities Algorithm........................... 30 3.4.3.3 The Iterative Algorithm................................................... 30 3.4.4 HITS Algoritlun Normalization............................................ 31 3.4.5 Modified HITS Algorithms.................................................. 31 3.5 Stochastic Approach for Link Structure Analysis Algoritlun............ 32 3.5.1 The Algoritlun's Detail...................................................... 33 3.6 Page Content Ranking Algorithm (PCR)................................. ... 34 3.6.1 PCR Description............................................................. 35 3.6.2 PCR Specification............................................................ 36 3.6.3 The Parameters' Calculation................................................ 37 3.6.4 Page Classification and Importance Calculation......................... 39 3.7 An Improved Usage-Based Ranking Algoritlun... .. ... . . . ........ ... .. 40 4. A Hybrid Mining to Improve Web Page Ranking . . . . . . . . . . . . . 44 4.1 Overview......................................................................... 44 VI

PAGE 7

4.2 The General Hybrid Method................................................. 45 4.3 The Algorithm................................................................... 46 5. Conclusion........................................................................ 50 References............................................................................ 54 Vll

PAGE 8

LIST OF FIGURES Figure 1.1 Search Engine Concept.. . . . . . . . . . . . . . . . .. .. .. .. . .. ... ... . . . .. . .. 6 1.2 Search Engine Architecture................................................... 8 1.3 The Directed, Undirected and Weighted Graph............................ 9 1.4 A Strongly Connected Graph.................................................. 9 1.5 The Bipartite Graph............................................................ 10 1.6 The Bow-tie Graph of the Web............................................... 11 2.1 Web Mining Classification.................................................... 16 3.1 Page A and Page B are Backlinks of the Page C. . . . . . . . . . . . . . 20 3.2 Simplified PageRank Calculation........................................... 21 3.3 Node Cis a Dangling Node................................................... 22 3.4 Example ofHyperlinked Structure........................................... 24 3.5 A Linked Set of Hubs and Authorities...................................... 26 3.6 The Basic Hub and Authority Operation.................................... 27 3.7 The Root Set and the Base Set................................................ 29 3.8 Constructing a Bipartite Graph............................................... 33 4.1 User behavior Data and the Weighted Graph............................... 47 Vlll

PAGE 9

LIST OF TABLES Table 3.1 PageRank Example........................................................... 24 4.1 Comparison of Various Web Pages Ranking Algorithms............... 52 IX

PAGE 10

I. Introduction The Word Wide Web has become the largest information repository in the world. It is a dynamic and unstructured data source. These features of the Web create challenges as finding relevant pages on a specific topic. Users are looking for tools and techniques that better locate, extract, filter, and find the information they seek. Web search engines alleviate that and help users in finding the needle that they are looking for in the Web's haystack. Day by day, the web search engine has attracted more attention throughout the world. There are hundreds of web search engines, such as Google, Yahoo and Bing that have a crawling and ranking mechanism in common. Web search engines face many challenges because of the increasing number of the Web pages and users on the Web, which in tum has led to the rapid increase in the number of queries issued by the users. A web page is relevant to the query if it contains the query topic in its title or somewhere in body text, while the web page is important to the query topic if it has many other important web pages linking to it. Search engines need tools that help in distinguishing web pages and rank them according to their relevancy and importance to the query topic. The Web mining technique is a tool used in the ranking process of the Web pages. Web mining techniques can be classified into three categories: Web 1

PAGE 11

link or structure mmmg, Web content mining and Web usage mining. Web structure mining provides the necessary information through the hyperlink structure of different interconnected Web pages. Web content mining provides the necessary information through the content of the Web pages retrieved from the index. The Web usage mining relies on the users' behavior on the Web to provide the necessary information. The ranking mechanism, which is the main focus of this thesis, is an important component m the search engme architecture. PageRank algorithm provides a ranking mechanism that makes Google a very successful and popular search engine. Page ranking algorithms use Web mining techniques to order the search results according to the user interest. Some page ranking algorithms depend only on the link structure of the Web (Web link mining) to get the important web pages, while others look for the content of the page (Web content mining)to get the relevant web pages or the users' behaviors (Web content mining), whereas some other page ranking algorithms use a combination of two Web mining techniques such as Web Link mining and Web content mining. We present Web page ranking algorithms that use various mining techniques or a combination of them to produce the searching results. The thesis surveys different page ranking algorithms and how they work. PageRank, HITS and SALSA are page ranking algorithms that are depend on the 2

PAGE 12

analysis of the link structure of the Web to set the relationship between web pages and to get important web pages according to the query topic. They build the graph of the web where its nodes are the Web pages and its edges are the hyperlinks between web pages. PCR is a page ranking algorithm that relies on the content of the web pages to produce web page ranks in terms of web pages' relevancy to the query topic. PCR uses a parser to get the body text from the web page and to compute many parameters such as query frequency to see how the page is related to the query topic. An improved usage-based algorithm is a page raking algorithm that uses a combination of link structure mining with usage-based mining to rank web pages. The algorithm uses users' behavior on the Internet that can be recorded by the Internet server as a source of information that gives more information on the importance of the page. Users' behavior is represented by the time spent in reading web pages. A contribution part of my thesis is to propose a new page ranking algorithm. The algorithm uses the three main mining techniques; web content mining, link web mining and usage-based mining combined together in one algorithm. It builds the algorithm in three steps. In the first step, the algorithm produces rank score to the web pages according to their content 3

PAGE 13

analysis(relevancy score). In the second step, the algorithm produces rank score to the web pages according to link analysis of the Web pages as well as the analysis of the user behavior on the internet represented in time spent in reading web pages by the user and the frequency of visiting the web page. This can be achieved by building a weighted undirected web graph. The rank score produced by the second step is called the importance score. In the third step, the algorithm combines the two rank scores produced in pervious steps to produce one rank score for each web page that reflect the important and the relevant content of the web page to the query topic and list these web pages on the top list of the searching result. 1.1 The World Wide Web (WWW) The World Wide Web is also known as W3 or a Web. The World Wide Web is a system of hypertext documents connected to each other. The system uses the Internet infrastructure to access the interlinked documents and to connect client terminals and servers all around the world using the Hypertext Transport Protocol (HTTP). Documents on the World Wide Web which are known as home pages, Web pages, or pages are written in Hypertext Markup Language (HTML), and are recognized by Uniform Resource Locators (URLs). The Web also includes hypermedia, which is hyperlinked multimedia, including audio, video, graphics, 4

PAGE 14

and animations beside the text. The Web is serving as a huge information repository and has facilitated the spread of this information across the world. 1.1.1 The Hyperlinks A hyperlink is the property of a word, phrase, or image that you can click on in a document to link either to a new document or a new portion within the document. Hyperlinks are found in the Web pages which allow users to click their way from page to page. Text hyperlinks are often colored blue and are sometimes underlined to be distinguished from other elements of the page that are not hyperlinked. When you move the cursor over a hyperlink, whether it is word, phrase or an image, the mouse cursor should change to a small hand pointing at the link. Hyperlinks are created through the use of programming languages HTML (Hypertext Markup Language) or XML (Extensible Markup Languages). 1.2 Web Search Engines A search engine is a web site that collects and organizes contents from all over the Internet. A search engine can be considered a software application that searches a database of documents and files on the Internet. It is developed to extract information from that database in order to answer search queries in the form of a list of web pages, images, or files. The search looks for keywords or exact phrases found in titles or texts. A search engine can be restricted to a single website, or can search across many sites on the public Internet. A search engine 5

PAGE 15

keeps records of information of what is available on the Web. It can be understood basically as a Web content collector as in figure 1.1. Search Engine =(;;) ....____ _____ __, Web Content Collector Users Figure 1.1: Search Engine Concept 1.2.1 Search Engine Mechanism A Web search engine works by storing information about web pages. It downloads, indexes and stores thousands of millions of web pages in order to answer thousands of millions of users' queries every day. There are three main components in search engines (see figure 1.2): crawler, indexer and query processing. The crawler, which is also called the spider, is the program that browses the World Wide Web and follows every link and downloads Web pages. The downloaded Web pages are then routed to the indexer. The indexer analyzes Web page's content such as words extracted from titles or heading sand builds the index depending on the keywords found in the pages. The index is maintained alphabetically in terms of the keywords found. When a user sends a query into the interface of the search engine, it is delivered by the query processer component of the search engine to do the 6

PAGE 16

matching process within the index and then return the Uniform Resource Locators (URLs) of the pages to the user. Before presenting the URLs of the pages to the user, the search engine applies some ranking mechanism (Web mining) to ease users' navigation between the search's results. The most important and relevant pages are presented on top of the searching result list. In this thesis we will discuss different kinds of ranking algorithms for search engines. 7

PAGE 17

@ ............................ Query Interface Figure 1.2: Search Engine Architecture 161 1.3 The Data Structure for the Web Sea ch Engine Traditional Web information retrieval systems focus on the information provided by the text of the Web page. The hyperlinks where different Web pages are connected provide additional information in searching the Web. The Web can be seen as a directed graph whose nodes are the Web pages and the links between nodes are the hyperlinks Web pages. The directed graph structure is called the Web Graph. 1.3.1 The Graph Notation The graph G consist oftwo sets VandE. Vis the set of nodes (Web pages), while E is the set of edges (Hyperlinks). It also can be expressed as G(V, E). 8

PAGE 18

Undirected graph: the pair of nodes (u, v), (v, u) represent the same edge. Directed graph: the pair of nodes (u, v), (v, u) represent two different edges. Weighted graph: the edges are assigned a weight which might represent, for example, costs, lengths or capacities, etc. depending on the problem at hand. Figure 1.3 shows the directed graph, undirected graph and weighted graph. Figure 1.3: The Directed, Undirected and Weighted Undirected Graph Strongly Connected graph is a directed graph where for every pair of nodes u and v there is a direct bath from u to v and there is a direct bath from v to u as in figurel.4. Figure 1.4: A Strongly Connected Graph 9

PAGE 19

Bipartite graph is a graph whose nodes can be divided into two disjoint setsU and V such that every link from set E (set of links) connects a node in U to a node in V. It also can be expressed as G( U, V, E) as in figure 1.5. 1.3.2The Web Graph u r-, I . / \ ...... ....... / v Figure I.S: The Bipartite Graph Many researches have been done on the Web graph or map. Broader et al. in [4] build the Web graph in a bow tie shape using strongly connected components as the base block. The bow tie graph is shown in figure 1.3. They divided the Web into four large pieces, and they showed how these pieces fit together. Dlc.onnec'ted cornponn1 Figure 1.6: The Bow-tie Graph ofthe Web(41 10

PAGE 20

I. Giant Strongly Connected Component (SCC) is the central core of the Web map that has all the pages which can reach one another along directed links. 2. The IN component consists of Web pages that can access the SCC but they cannot be accessed from sec. 3. The Out component consists of Web pages that are accessible from the sec but they don't have in-links to the sec. 4. The TENDRILS contains all the pages that cannot reach the SCC and they are not reachable by the Core component, SCC. According to Broader et al in [4], the size of the SCC is relatively small compared to the size of the other components. 1.4 Quality Measurement of Web Searching Results Quality of Web search engine results form a real concern to the Internet users. Current Web search engines respond to user queries within a portion of a second. Users will expect good results from the web, and they do not mind if search engines respond within a few more seconds. Next, we explain a few factors used by search engines to measure the importance and relevance of web Pages. 1.4.1 Page Importance Web page importance is the measure of how much the page is important to the user's query. The page importance can be inferred from the graph structure of the web that shows clearly how many inlinks to a Webpage. This importance can be 11

PAGE 21

measured by popularity of the web page which can be computed by number of inlinks to the page or can be measured by the quality which can be computed recursively by the sum of the quality in links of pages. The more inlinks to the page from important pages, the more likely the page is important. Another measurement to page importance is the informative which is how many outlinks to nodes that have useful information. 1.5.2 Page Relevance Web page relevance is the measure of how much the page is related to the user's query. There are many factors that are used for this measurement. A word or a phrase location is a factor search engines consider relevant. The queries that appear in the title of the page or at the top of the page are considered most relevant. The frequency of the query in the page is another factor on how search engines determine relevancy. The anchor text within a page is an important factor for the page relevancy. Because all of these factors are found in pages' content, we can say that page relevance is inferred from a pages' content. 12

PAGE 22

2. Web Mining 2.1 Overview The extraction of implicit, unknown, non-trivial and potentially useful information or patterns from a large database is called Data Mining. Web mining is the application of the Data mining techniques to discover and extract information from World Wide Web documents and services in an automatic way [9]. Emergence of the Web as a huge source of information and the enormous growth of the Web today makes web mining a wide area of research [9]. Suggested decomposing of Web mining into four main subtasks, namely: 1. Resource finding: the task of retrieving intended Web documents. 2. Information selection and pre-processing: automatically selecting and pre processing specific information from retrieved documents. 3. Generalization: automatically discovers general patterns at individual Web sites as well as across multiple sites. 4. Analysis: validation and /or interpretation of the mined patterns. Resource finding is the process of retrieving the online or offline data from the text documents that are available on the World Wide Web. These text documents represented in electronic newsgroups, newsletters, newswire, libraries and HTML documents. The second subtask is about selecting the HTML 13

PAGE 23

documents and transforming them by removmg HTML tags, stop words, stemming, etc. Generalization is the task of extracting general patterns at an individual Web site and across multiple Web sites using data mining techniques. Due to the interactivity of the Web, users' behavior plays an important role in discovering information. This is important for doing the forth subtask, validation and interpretation. 2.2 Web Mining Categories Kosala et al [9] categorized the Web mining into three area of interest based on which part of the Web needs to be mined: Web content mining, Web structure or link mining, and Web usage mining. Web content mining extracts information from the content of the web page. Web structure mining extracts the information from the hyperlink between web pages. While Web usage mining extracts information from the user behavior and logs. 2.2.1 Web Content Mining Web content mining is the process of extracting useful information from the content of the Web pages. The Web page content consists of many kinds of data such as text, image, audio, video and hyperlinks. The Web page content consists of unstructured data such as the text, semi-structured data such as HTML or XML documents and structured data such as data found in the tables. There are two approaches to the Web content mining; agent based approach and database 14

PAGE 24

approach. The agent based approach concentrates on finding relevant information based on the inferred user profile. The database approach models the data and retrieves the semi-structured data from the Web. 2.2.2 Web Structure Mining Web structure mining or what is sometimes referred to by Web link mining tries to generate the model summaries on the Web site and Web page. This kind of Web mining builds the structure of the linked web pages as a Web graph where the nodes represent the Web pages and the edges of the graph represent the hyperlinks between the web pages. Web structure mining makes the classification of the Web pages easy in terms of their relevance and importance. Web structure mining also detects the similarity and the relationship between different Web sites and Web pages. 2.2.3 Web Usage Mining Web usage mining applies the data analysis technique to extract the usage pattern generated by the surfers on the Web to use it in enhancing the web services such as a users' search experience. Web usage mining mines the data derived from users' interactivity with the Web. The data of patterns of usage can be found for example in the IP addresses, time and the date of the user access, user profile, user queries, mouse clicks and scrolls, cookies, transactions and any other data that comes from user interactions. 15

PAGE 25

The classification of the Web mining is shown in the figure 2.1. Database Approach Agent based Approach Web Mining Web Link Mining General Access Pattern Figure 2.1: Web Mining Classification [6[ Customiz ed Usage Tracking Web mining has many important applications on the Web. Web mining categories are used alone or in combinations with each other by most researchers to give a good research results. They are used and applied in many research areas. Their goal is to enhance search engines through their use in the ranking algorithms in order to make the searching results more relevant and important to the users' queries. There are many different page ranking algorithms that have emerged with the use of these kinds of mining processes. Most of these algorithms found that the structure mining is the most important one that helps in enhancing the search results, while another one tries to combine the Web structure mining with others such as Web content mining and Web usage mining to get better results. In the next chapter we will discuss in detail, the page ranking algorithms that emerged to enhance search engines. 16

PAGE 26

3. Page Ranking 3.1 The Problem of Ranking Web search engines consider Web mining to give scores to the Web pages on a given query and these scores are called the page rank. Ranking any kind of collection of documents is not an easy task. The ranking of the Web documents or Web pages has more difficulties than the other kind of document collections due to many reasons. The diversity, constantly changing and the easy addition ofthe documents on the Web are some difficulties that are faced in the process of ranking those web documents. Considering the enormous size of the Web, other difficulties include repeated documents, broken links and the many links created for commercial purposes. Moreover, the keyword queries that people issue may suffer from the problem of synonymy (multiple ways to say the same thing) and polysemy (multiple meaning for the same word) because there is a diversity in the people issuing the queries. Dealing with quenes sufficiently broad in nature is defined as the "Abundance Problem" by Kleinberg: "The number of pages that could be reasonably returned as relevant is far too large for a human user to digest" [8]. 17

PAGE 27

A search engine has no problem finding and indexing millions of pages, but the problem is that people will want to look at just a few of them which would likely be the first 10 or 20 results. Page ranking algorithms benefit from the nature of the structure of the Web pages as a graph and the inherited property of the hyperlink structure of the Web which is a link from page A to page B and indicates an endorsement of the quality of B [8] to produce link-based ranking algorithms. The content of the web pages and the pattern used in visiting the Web pages by the surfers are also used to address ranking difficulties and to generate efficient and different algorithms for ranking the searching results. 3.2 Page Ranking Algorithms Web mining techniques are powerful tools used within the ranking algorithm of searching the web [9].The number of queries that users issue had increased as the size of the Web grows substantially. With the number of users increasing every day, the number of queries that they send to the search engines vastly increases. So, the most important task for search engines is to process these queries in an efficient way. Using web mining techniques within page ranking algorithms make the extraction of only the relevant pages, important pages from the database easier. There are many different page ranking algorithms produced and available in many literatures [8], [3], [11], [14], [5]. Some 18

PAGE 28

algorithms rely on the hyperlink structure of the Web, while others rely on the content of the Web pages or the usage patterns on visiting Web pages. On the other hand, other algorithms could use the combination between two Web mining techniques such as the Web link mining and the Web content mining. The PageRank algorithm is based on the principle that the importance of a page is determined by the importance of the pages linking to it [3]. The hyper link structure of the Web is utilized by three other important link-based ranking algorithms. Kleinberg's HITS algorithm [8] attempts to utilize this structure through the use of hubs and authorities over a root set, Lempel and Moran [11] utilize this structure through their SALSA algorithm. Page Content Ranking algorithm (PCR) is about ranking Web pages based on their content, while an improved usage-based algorithm used the time spent on reading the Web pages by the user along with the link structure to rank the Web pages. In the following subsections we will discuss some of page ranking algorithms which are PageRank, HITS, SALSA, PCR, and an improved Usage based algorithm. 19

PAGE 29

3.3 The PageRank Algorithm Page Rank is a page rank algorithm invented by Page and Brin [ 1 0] to rank web pages using the Web link mining technique. Google search engine's main influence is by PageRank algorithm [3]. PageRank algorithm depends on the principle that a page with a large number of backlinks or a link from an important page should be considered important [I 0]. If a page A links to page B, the importance assigned to B by A, is proportional to the importance of A and on the other side is proportional to the number of pages pointed to by A. The backlinks are shown in figure 3.1. c Figure 3.1: Page A and Page 8 are Back! inks of the Page C 3.3.1 Simple PageRank Algorithm PageRank uses the inlinks to the page as an input parameter to compute a rank score to that page. In a graph with n nodes (pages), for each node assign an initial PageRank value which is equal to 1/n. By choosing a number of steps k, we 20

PAGE 30

do a sequence of update steps to the value of the PageRank of each page as follow: 1. Each node divides its PageRank value by the number of out-links of the node. 2. The divided value is sent to all the nodes it points to. When the node has zero out-links, it sends its current PageRank value back to itself. The simplified version of Page Rank algorithm can be computed by the following formula and is shown in figure 3.2 R(u) = L R(v) ueBu Nv u and v are some Web pages, and Bu is the set of pages that point to page u, the PageRank R of the web page u is the sum of the PageRank R of its backlinks divided by the number of their corresponding out-links. 53 Figure 3.2: Simplified Page Rank Calculation 21

PAGE 31

3.3.2 Random Surfer Model Page and Brin simulated the computation of PageRank as a model of user behavior [3]. The surfer of the Web can be seen as randomly walking on the web graph and choosing one of two options: one is to click on one of the outlinks or to jump to another web page when he gets bored clicking the following links. The surfer is at any web page clicks on the out links with an equal probability which is equal to 1 /np, where np is the number of outlinks of that web page p. The probability limit for the surfer to be at a given web page is the PageRank of that web page. 3.3.3 Expanding PageRank Algorithm The basic PageRank algorithm encountered a difficulty [ 1 0]. The browser or the surfer of the web can get to the page without any out-links, "Dangling node" as in figure 3.3 and the browser who is making random clicks will stop at end. Figure 3.3: Node Cis a Dangling Node 22

PAGE 32

This can happen when the page is a kind of file that has no embedded links such as, post script files or PDF files or the page may have a special access that the surfer is not eligible for. One possible action to resolve this issue is to remove the dangling links from the system and update the number of out degree of each node which is not a dangling node [10]. An alternative way would be to delete these dangling links from the system until the calculation of the PageRank is finished and then put them back in the system [10]. PageRank algorithm considers the random surfer model with the options of the surfer either clicking on the out link or jumping to another page as formulated in the following formula. Let D represent the probability that the surfer is choosing to click on the forward link while he is at page P. Then (1-D) will be the other option which is the probability of jumping to another unrelated page while he is at page P. So, the PageRank formula will be broadened to the following formula: R(u) = (1-D)+ D L R(v)_ UE8u Nv u and v are some web pages, and Bu is the set of pages that point to page u, the PageRank R of the web page u is the sum of the PageRank R of its back-links divided by the number of their corresponding out-links. D is the damping factor (typically 0.85) and it ensures every page receives a minimum ranking of ( 1 D). 23

PAGE 33

3.3.4PageRank Example Consider a little Web graph with three nodes (pages) as in figure 3.4 as an example to illustrate the computation of Page Rank. Figure 3.4: Example of Hyperlinked Structure All nodes start with PageRank is 113 The PageRanks for pages 1, 2 and 3 in the first three update steps is shown in table 3.1. Table 3.1: PageRank Example Steps Page 1 Page 2 Page 3 1 1/3 112 1/6 2 113 5112 1/4 3 113 11/24 5/24 3.3.5 Modified PageRank Algorithms There are many variations that were put forward by different developers to improve PageRank after PageRank was originally proposed in 1998. Some of these proposals were concerned about qualitative improvement of the algorithm, 24

PAGE 34

and others are concerned about "personalization" by giving different PageRank values to some pages to reflect the characteristics of the group of users of the algorithm. BaezaYates and Davis [ 1] presented a qualitative improved of PageRank that give more precise answers to the queries by considering web page attributes to give weight values to the links between web pages. McSherry at Microsoft Research proposed [12] a uniform approach to fast the computation of PageRank algorithm by reformulating the power method of the algorithm. The PageRank Personalization variants are discussed by Jen and Widom [7]. 3.4 Hyperlink Induced Topic Search (HITS) Algorithm Hyperlink-Induced Topic Search (HITS) is an algorithm developed by Jon Kleinberg in [8] to rank web pages using Web link mining technique of the web pages. This algorithm designed to rank web pages on the broad-topic queries that are obtained from text-based search engines [8]. There is a frequent observation on the web that says; for any specific query topic; there is a set of "authoritative" pages focused on the query topic and there is also a set of "hub" pages that have links to multiple relevant authoritative pages. Figure 3.4 shows the concept of hubs and authorities pages. 25

PAGE 35

Hubs Authorities Figure 3.5: A Linked Set of Hubs and Authorities These two kind of pages have a basic principle that has "a mutually reinforcing relationship" [8] which says "a good hub points to many good authorities; a good authority is pointed by many good hubs" [10]. Kleinberg in his technique (HITS )used this phenomenon and the principle of the relationship between these two kinds of pages to classify the web pages into two sets of pages, hub pages and authority pages and differentiates between them to rank all the pages on the sub-graph of the relevant pages. 3.4.1 General HITS Algorithm In order to rank pages usmg HITS algorithm, each relevant page is assigned a numerical hub and authority score. Each score starts out with the initial value equal to 1. An authority score is computed as the sum of the scaled hub values that point to that page, while the hub value is the sum of the scaled authority values of the pages it points to. 26

PAGE 36

The algorithm uses the quality of the hubs to refine the estimation of the quality of the authority and uses the quality of the authority to refine the estimation of the quality of the hubs by performing a series of iterations, each iteration consisting of two basic steps: Authority Update Step: For each node in the sub-graph, the Authority score value is updated to be equal to the sum of the Hub Score values of each node that points to it. Hub Update Step: For each node in the sub-graph, the Hub Score value is updated to be equal to the sum of the Authority Score values of each node that it points to. 3.4.2 HITS Example The basic operation of HITS algorithm is shown in figure 3.5. 2 5 3 6 4 7 a(1) = h(2) + h(3) + h(4) h(1) = a(S) + a(6) + a(7) Figure 3.6: the Basic Hub and Authority Operation 27

PAGE 37

3.4.3The HITS Algorithm Detail 3.4.3.1 Constructing the Base Set Sub-graph In this phase a sub-graph of the web is constructed to be used as an input to the algorithm. Given any set of web pages, and a specific query topic a, the algorithm could just analysis the link structure of the set Qa of all pages containing the query topic. This restriction on the pages has two defects: 1. The set of pages may have a large number of nodes that lead to the cost in the computation. 2. The set may not contain the highest amount of the best authorities since the query string is not contained in this pages. Ideally, the collection Sa of pages from the Qashould have the following properties: 1. Sa is relatively small 2. Sa is rich in relevant pages 3. Sa contains most (or many) of the strongest authorities The algorithm starts to construct the root set Ra (which is a subset of Qa) that contains from t highest ranked pages from different text-based search engines results for the query topic a. 28

PAGE 38

For each page in the root set Ra, the algorithm allow each page to bring in no more than d pages pointing to it into the subset Sa in order to keep the algorithm from degeneration. The subset Sa is built by growing up the root set Ra and is called the base set for the query topic a. Figure 3.5 illustrate the root set and the base set. Working on the base set Sa, the technique will construct the subgraph Ga that has the transverse links (the links between pages with different domain names) about the authority of the pages they point to in order to be strong and more focused on the query topic. Figure 3.7: The Root Set and the Base Set 29

PAGE 39

The algorithm for finding a focused sub-graph is as follow [8]: Sub graph( a,E ,t,d) a: a query string E: a text-based search engine t,d: natural numbers. Set Sa= Sa For each page p ERa Let L + (p) denote the set of all pages p point to. Let L-(p) denote the set of all pages pointing to p. Add all pages in L + (p) to Sa. If IL-(p then Add all pages in L-(p) to Sa else Add an arbitrary set of d pages from L(p) to Sa Return Sa 3.4.3.2Computing Hubs and Authorities Algorithm Using the "mutually reinforcing relationship" between hub and authority pages through an iterative produced the Hubs and Authority algorithm by Kleinberg. For each page, the algorithm compute two score values; ap denote to the authority score value and hp denote to the hub score value. ap Lq:(q,p)EE hq, which is called I operation by Kleinberg hp Lq:(p,q)EE qq), which is called 0 operation by Kleinberg 3.4.3.3 The Iterative Algorithm The iterative algorithm developed by Kleinberg [8] can be stated as: 30

PAGE 40

Iterate(G,k) G: a collection of n linked pages k: a natural number Let z denote the vector (1, 1, 1, ... 1) 2 Rn Set ao= z Set ho= z For i = I, 2, ... k Apply the I operation to (a;-1. h;-1), obtaining new a-weights a'; Apply the 0 operation to (a';, h;-J), obtaining new h-weights h'; Normalize a';, obtaining a; Normalize h';, obtaining h; End Return ( ak,hk) 3.4.4 HITS Algorithm Normalization After infinite number of steps of the iterations of the algorithm, the final scores of the pages will be determined and because iteratively applying the Authority Update Step and Hub Update Step, the values will diverge. After single iteration in the algorithm, the algorithm had to normalize the score values by dividing each authority score value by sum of all authority scores, and dividing each hub score value by the sum of all hub scores. The values obtained from the normalized algorithm are guaranteed to converge and typically happened within I 0 steps [8]. 31

PAGE 41

3.4.5 Modified HITS Algorithms There are many variants of algorithms obtained from HITS algorithm and proposed by many developers after they have studied the HITS extensively. Bharat and Henzinger [2] improved the HITS algorithm in their work which is called topic distillation by weighting pages and their links using text information. Randomized HITS [13] used a two-level system of ranking the web pages. It combines the random surfer model from PageRank with the general idea of hubs and authorities from HITS to achieve a rank algorithm for calculating hub and authority score values in a given web graph. Tsaparas [15] has proposed another variant of HITS algorithm and his work is primarily concerned with the theoretical analysis, while disregarding the qualitative enhancement to the algorithm. 3.5 Stochastic Approach for Link Structure Analysis (SALSA) Algorithm Stochastic Approach for Link Structure Analysis was proposed in [11] by Lempel and Moran in their paper "SALSA: The Stochastic Approach for Link Structure Analysis". The SALSA algorithm combines the two link-structure ideas. One idea comes from the PageRank algorithm by using the random walk on the Web graph and the other idea is from HITS algorithm by using hub and authority observation on the Web. The SALSA algorithm mainly breaks the tightness 32

PAGE 42

between the hubs and authority [ 11] that Kleinberg used in his algorithm through the "mutually reinforcing relationship". The Algorithm performs two random walks on the bipartite hubs-and-authorities graphs. 3.5.1 The Algorithm's Detail The bipartite undirected graph H is constructed by building the subset Va of all the nodes that represent the potential authorities, and the subset Vh of all the nodes that represent the potential hubs from collection of sites. An example of a constructed bipartite graph is shown in figure 3.6. Figure 3.8: Constructing an Undirected Bipartite Graph SALSA algorithm performs two random walks on the constructed bipartite graph; one random walk from each side; authority side and the hub side. The random walk starts from the authority side and does the following: At a node on the authority side of the graph H, the algorithm selects randomly an in-link and moves to the hub node on the hub side. 33

PAGE 43

Each hub node divides its value equally among the authorities to which it points. The authorities' value of the node is computed by summing up the hub values that point to it. At a node on the hub side of the graph H, the algorithm selects randomly an out link and move to the authority node on the authority side. Each authority node divides its value equally among the hub nodes that point to it. The hub value of the node is computed by summing up the authorities values that it point to. The random walk starts from an authority node selected randomly and then proceeds alternating backwards and forwards steps. Thus the probability to move from authority i to authority j is then: Instead of simply broadcasting its weight, each node divides its hub/authority weight equally among the authorities/hubs connected to it. I _l_h jEB(i) IF U) I J hi= L _l_a jEF(i) IBU)I J 34

PAGE 44

3.6 Page Content Rank Algorithm (PCR) Page Content algorithm was proposed by Pokorny and Smizansky [ 14]. The relevance measurement of the page is determined from the importance of the terms that the page contains. The PCR algorithm depends on two heuristics for analyzing the page content. The first heuristic is to determine the page importance according to the terms that the page contains, while the other heuristic is to determine the term importance according to specific features regarding a given query q. The PCR algorithm results Rq, a set of ranked web pages according to their relevance to the query q and any search engine as its input. The PCR algorithm has an assumption as stated in [14] that says "the importance of the page P is proportional to the importance of all terms in P." According to that assumption, PCR algorithm gathers some statistical and linguistic features of the terms in a set of ranked pages that obtained from a usual search engine. These features represented in for example, the number of occurrences of the term tin the ranked pages, distances of term occurrences from occurrences of a set of terms in the query q, the frequency in the natura11anguage, synonyms, and term window. These features specified formally in the original paper [14]. 35

PAGE 45

3.6.1 PCR Description There are four steps that describe PCR's method: I. Term extraction: using an HTML parser for extracting terms that appear as a text in the web page and build the inverted list to be used in step 4. 2. Parameter calculation: in this step, some statistical and linguistic parameters are calculated. The statistical parameters represented in term frequency and occurrences position while linguistic parameters represented for example in word frequency in the natural languages and synonym classes. All of these calculations are depending on the query q. 3. Term classification: this step is to determine the importance of each term based on the calculations that were made in the previous step. 4. Calculation of the page relevance: after the term importance calculation has been done in the previous section, a relevance of the page containing that term is determined. The new relevance of a page q is the average importance of terms contained in page q. 3.6.2 PCR Specification (14) PCR uses Sec_ moment function which is determined by: 2 Sec_moment(S) =LP=1 whereS={xd i=l.. .n},n =lSI and xis a set of real numbers. PCR algorithm uses the following symbols: 36


The PCR algorithm uses the following symbols:

D — the set of all pages considered by a search engine;
q — a conjunctive Boolean query;
R_q \subseteq D — the subset of all pages from D marked as relevant by a search engine;
R_{q,n} \subseteq R_q — the subset of the n top-ranked pages from R_q; if n > |R_q|, then R_{q,n} := R_q;
TF(P, t) — the number of occurrences of term t in P;
DF(t) — the number of pages which contain the term t;
Pos(P, t) — the set of positions of t in P;
Term(P, i) — a function returning the term at the ith position in page P; thus Term(P, i) = t \Leftrightarrow i \in Pos(P, t).

3.6.3 The Parameter Calculations in PCR

Two functions, importance(t) and classify(), return the importance of the term t based on 5 + (2 \cdot NEIB) parameters, where NEIB is the number of neighboring terms considered. The following parameters influence the importance of the term.

Occurrence frequency determines the number of occurrences of term t in R_q:

freq(t) = \sum_{P \in R_q} TF(P, t)                                    (3.1)

Distance of key terms determines the distance of t from the key terms. Let QW be the set of all occurrences of the terms of Q in all pages in R_{q,n}:

QW = \bigcup_{t \in Q} \bigcup_{P \in R_{q,n}} Pos(P, t)               (3.2)


Then the minimum distance from the key terms is:

dist(t) = \min(\{ |i_1 - i| : i_1 \in Pos(P, t) \wedge i \in QW \})                       (3.3)

Incidence of pages is determined by the following equation:

occur(t) = DF(t) / |R_{q,n}|                                                              (3.4)

Frequency in the natural language can be defined as:

common(t) = FL(t)                                                                         (3.5)

where FL is a mapping function from all the words of a database of frequent words to an integer number denoting the word's frequency.

Term importance determines the importance of all terms from R_{q,n} by the following equation:

importance(t) = classify(freq(t), dist(t), occur(t), common(t), 0, 0, \ldots, 0)           (3.6)

Synonym class. For each synonym class S we calculate an aggregate importance SC(S) on the basis of the importance of the terms in the class S:

SC(S) = sec\_moment(\{ importance(t') : t' \in S \})                                       (3.7)

A term becomes important if it is a synonym of an important term. SC(S) is propagated to the term t by the following function:

synclass(t) = sec\_moment(\{ SC(S_{t'}) : t' \in SENSE(t) \})                              (3.8)

where SENSE(t) contains all the meanings t' of t.
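The sketch below shows how the per-term statistics behind equations (3.1)-(3.5) might be computed over a small in-memory collection. The data layout (pages as term lists), the per-page restriction in the distance computation, and the freq_words dictionary standing in for the FL database are all assumptions made for illustration.

def term_stats(pages, query_terms, freq_words):
    # pages: dict mapping a page id to its list of terms (in document order).
    freq, df, pos = {}, {}, {p: {} for p in pages}
    for p, terms in pages.items():
        for i, t in enumerate(terms):
            freq[t] = freq.get(t, 0) + 1                     # freq(t), eq. (3.1)
            pos[p].setdefault(t, []).append(i)               # Pos(P, t)
        for t in set(terms):
            df[t] = df.get(t, 0) + 1                         # DF(t)

    # positions of query-term occurrences, kept per page (QW, eq. (3.2))
    qw = {p: [i for t in query_terms for i in pos[p].get(t, [])] for p in pages}

    def dist(t):
        # minimum distance between an occurrence of t and a query-term occurrence (3.3)
        ds = [abs(i1 - i) for p in pages
              for i1 in pos[p].get(t, []) for i in qw[p]]
        return min(ds) if ds else float("inf")

    def occur(t):
        return df.get(t, 0) / len(pages)                     # occur(t), eq. (3.4)

    def common(t):
        return freq_words.get(t, 0)                          # common(t), eq. (3.5)

    return freq, dist, occur, common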


Importance of neighboring terms: the importance of a term can also be determined by the importance of its neighbors. The set of terms which are the ith neighbor of term t in all pages P \in R_{q,n} is determined by the following function:

RelPosNeib(t, i) = \bigcup_{P \in R_{q,n}} \{ Term(P, j + i) : j \in Pos(P, t) \wedge inside(P, j + i) \}

The parameter neib(t, i) is defined as follows:

neib(t, i) = sec\_moment(RelPosNeib(t, i))                                                 (3.9)

Based on these parameters, the importance of the term t is defined as:

importance(t) = classify(freq(t), dist(t), occur(t), common(t), synclass(t), neib(t, -NEIB), \ldots, neib(t, NEIB))   (3.10)

3.6.4 Page Classification and Importance Calculation

The classify() function uses the neural network NET and can be defined as:

classify(p_1, \ldots, p_{5+(2 \cdot NEIB)}) = NET(p_1, \ldots, p_{5+(2 \cdot NEIB)})        (3.11)

The importance of a page P is considered to be an aggregate value of the importance of all terms in P:

Page\_importance(P) = sec\_moment(\{ importance(t) : t \in P \})                            (3.12)

The importance value of every page in the set R_{q,n} gives a new rank to the top n pages in the list obtained from the search engine.
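Equation (3.12) then reduces to a small aggregation step. In the sketch below, importance_of stands in for the classify()/NET pipeline of (3.10)-(3.11) and is passed in as a function; sec_moment is the helper sketched in Section 3.6.2. Both names are assumptions made for the example.

def page_importance(page_terms, importance_of):
    # Page importance as the second moment of its term importances, eq. (3.12).
    return sec_moment(importance_of(t) for t in set(page_terms))

def rerank(pages, importance_of):
    # Give the top-n pages a new order based on their PCR importance values.
    return sorted(pages, key=lambda p: page_importance(pages[p], importance_of),
                  reverse=True)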


3.7 An Improved Usage-Based Ranking Algorithm

Ding, Chi, and Luo proposed in [5] an algorithm that uses the link and usage analysis of the web for web page ranking. The algorithm depends on the principle: "If a page is more frequently selected, its chance to be judged as relevant is higher; if a page is less frequently selected, its chance to be relevant is lower" [5].

In general, the algorithm observes the actions that users perform on the web pages obtained from a search engine. These pages are the ones users chose to visit, and the actions are the ones they performed on them. Some relevance information is then inferred from user behavior on those pages, and the algorithm uses it to score the web pages when the same query is submitted by other users. However, selection frequency by itself is not enough to score the web pages. Accidental clicks, misleading web pages, or misleading result summaries can all produce inaccurate selections, in which users just click on a link and immediately return to the result list. The algorithm therefore adds the time spent reading the selected web pages as a factor that makes a page more important. The time duration is then normalized by the length of the web page, since less time is needed to read a shorter page.


Ding et al. [5] found that adding the hyperlink information to the algorithm improves the estimate of page importance. They observe in [5] that "a higher percentage of links accessed from a page could be a strong indication of the importance of the page." So in this algorithm the time duration and the access via hyperlinks are the two major factors used to measure page importance. These factors are used to compute a basic usage-based rank score. The basic ranking formula is as follows:

URank^0(Q, D) = \sum_{i=1}^{n_{Q,D}} \frac{1}{|lt_Q - lt_{Q,D,i}|} \times Dur(Q, D, i)        (3.13)

Dur(Q, D, i) = dur(D, i) + F_u \times \sum_{LD \in \text{linked pages from } D} dur(LD, i)     (3.14)

where lt_Q is the latest access time of query Q; lt_{Q,D,i} is the access time of document D in the ith access for Q; n_{Q,D} is the number of accesses for D from Q; dur(D, i) is the time spent on D in the ith access, normalized by the length of D; and F_u is the fading factor for durations propagated from linked pages.
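A sketch of the basic score in (3.13)-(3.14) is shown below. It assumes the access log for a (query, document) pair is available as a list of records containing an access time, a length-normalized reading duration, and the durations of pages reached from the document via links; the record layout and the fading-factor value are assumptions for the example.

def basic_usage_rank(accesses, latest_query_time, fading_factor=0.5):
    # accesses: list of dicts with keys "time", "dur", "linked_durs".
    score = 0.0
    for acc in accesses:
        # Dur(Q, D, i): own duration plus faded durations of linked pages, eq. (3.14)
        dur = acc["dur"] + fading_factor * sum(acc["linked_durs"])
        # weight recent accesses more heavily, eq. (3.13)
        gap = abs(latest_query_time - acc["time"]) or 1e-9   # avoid division by zero
        score += dur / gap
    return score

# Example: two recorded accesses of the same document for one query.
log = [{"time": 90.0, "dur": 1.2, "linked_durs": [0.4]},
       {"time": 99.0, "dur": 0.8, "linked_durs": []}]
print(basic_usage_rank(log, latest_query_time=100.0))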


In the next step, the algorithm adjusts the basic score by applying four heuristics [5].

Heuristic 1: if a web page has a high rank, and its selection frequency is less than the average selection frequency for the query, then it should receive a negative adjustment value, computed as follows:

URank^1(Q, D) = \left( \frac{clickrate(Q, D)}{avg(clickrate(Q, D))} - 1 \right) \times (HR\_THRES - rank'(Q, D))        (3.15)

clickrate(Q, D) = \frac{freq(Q, D)}{freq(Q)}                                                                            (3.16)

where freq(Q, D) is the selection frequency of D for Q, which is equal to n_{Q,D}; freq(Q) is the frequency of Q; rank'(Q, D) is the average rank position of D in previous searches for Q; and the average click rate is averaged over all result documents for Q. When the rank of a document is less than HR_THRES, it is considered to have a high rank.

Heuristic 2: if a web page has a high rank, and its average duration is less than the lower bound for the duration value, LB_DUR, then it receives a negative adjustment value:

URank^1(Q, D) = \left( \frac{\overline{Dur}(Q, D)}{LB\_DUR} - 1 \right) \times (HR\_THRES - rank'(Q, D))                  (3.17)

where \overline{Dur}(Q, D) denotes the average of Dur(Q, D, i) over the accesses of D for Q.

Heuristic 3: if a web page has a high rank, but it has never been accessed, then it receives a negative adjustment value:

URank^1(Q, D) = \left( \frac{hrfreq(Q, D)}{freq(Q)} - HRFREQ\_THRES \right) \times (HR\_THRES - rank'(Q, D))              (3.18)

where hrfreq(Q, D) measures how many times document D occurs in a high position of the ranked list for Q and is accessed, and HRFREQ_THRES is a threshold value for this heuristic.

Heuristic 4: if a document has a low rank, and its selection frequency is larger than a threshold value LB_CLICKRATE, it receives a positive adjustment value.


URank^1(Q, D) = \left( 1 - \frac{LB\_CLICKRATE}{clickrate(Q, D)} \right) \times (rank'(Q, D) - LR\_THRES)                 (3.19)

When the rank of a document is larger than LR_THRES, it is considered to have a low rank.

The final usage-based rank score is the basic rank score adjusted with the heuristic value and then multiplied by a reliability factor. It is computed as follows:

URank(Q, D) = rf(Q) \times (URank^0(Q, D) + URank^1(Q, D))                                                               (3.20)

rf(Q) = lt_Q \times freq(Q) \times (lt_Q - ft_Q)                                                                          (3.21)

where rf(Q) is the reliability factor for the rank value and ft_Q is the time of the first submission of query Q. The reliability factor is determined by the usage data collected for the query: if the latest submission time for the query is more recent, the usage-based rank for this query is more reliable; if the query is submitted more frequently, the rank is more reliable; and if the query has existed for a longer time in the query database, the rank is more reliable.
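The heuristic adjustments and the final combination can be sketched as follows. Heuristic 4 is shown as a representative adjustment; the threshold values and all function names are illustrative assumptions, not values taken from [5].

def heuristic4_adjustment(clickrate, avg_rank, lb_clickrate=0.05, lr_thres=20):
    # Eq. (3.19): boost a lowly ranked but frequently selected document.
    if avg_rank > lr_thres and clickrate > lb_clickrate:
        return (1 - lb_clickrate / clickrate) * (avg_rank - lr_thres)
    return 0.0

def reliability(latest_time, first_time, query_freq):
    # Eq. (3.21): recent, frequent, long-lived queries give more reliable ranks.
    return latest_time * query_freq * (latest_time - first_time)

def usage_rank(basic_score, adjustment, latest_time, first_time, query_freq):
    # Eq. (3.20): adjusted basic score scaled by the reliability factor.
    return reliability(latest_time, first_time, query_freq) * (basic_score + adjustment)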


4. A Hybrid Mining to Improve Web Page Ranking

4.1 Overview

Making full use of the information sources available on the web, instead of concentrating on just one or two of them, is the key to good page ranking algorithms that take into account the importance of web pages in addition to their relevance. Such ranking algorithms place the relevant and important pages at the top of the search engine's result list.

Using the link graph alone as an information source gives incomplete and inaccurate information, since the hyperlinks of a page are easily added and deleted by its creators. Using just the content of the pages for ranking, on the other hand, can lose the importance of a page. User behavior, recorded in web logs, is an important factor that contributes to increasing users' satisfaction, because it indicates the quality and importance of web pages. Yet relying only on usage data to rank web pages also has problems. One of them is incorrect information: a lost surfer whose clicks are recorded on the server will mislead the estimate of page importance. Another problem arises when a new page is added and has not been visited yet; such a page will receive a poor rank even if it is important.


My general idea is to make combined use of all of the web mining mechanisms in one page ranking algorithm, and to benefit from the characteristics of each mechanism, in order to produce a successful algorithm for use by search engines. This algorithm assigns to a page a rank that is a combination of weights or scores defined on different kinds of parameters.

4.2 The General Hybrid Method

The hybrid method for ranking web pages consists of three steps, each generating a different ranking score for the web pages. It uses content mining in the first step and builds a weighted graph structure of the web in the second step, where the weights of the links of the graph represent users' behavior while surfing the Internet. In each step, the method utilizes the characteristics of the mining technique that help to give a good rank to the pages.

The first step produces ranks for the web pages that are considered relevancy ranks, since it works with the contents of the web pages. The second step produces ranks for the web pages that are considered importance ranks, since it works with the link structure of the web pages; a weighted directed graph is built for all the pages on the web.


As two rankings are produced for each page, the final step is to combine the two ranks, the relevancy ranking and the importance ranking, into one rank score for each web page.

4.3 The Algorithm

Step 1: Considering all the pages on the web, the algorithm uses a number of heuristics together to analyze every page's content and determine its relevancy to a given query. The content of a page is found in several places within the page: the title, the body text, and the anchor text (link text), which is the highlighted clickable text in a hyperlink that leads to another web page. This relevance measurement step computes the importance of the text terms contained in a page with respect to the query in the following sub-steps.

Term extraction: for every page, extract the text terms using an HTML parser, then store these terms in a list.

Parameter calculation: statistical parameters such as query term frequency, query term position on the page, and the anchor text matching the query are calculated for each web page.

Page relevance calculation: the parameters calculated in the previous sub-step are combined to produce a new, unique relevance score for each page, rank_relevant.
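As a small illustration of Step 1, a content-based relevance score can be computed from weighted counts of query terms in the different parts of a page, as in the sketch below. The field weights and the scoring formula are assumptions; the thesis leaves the exact combination of parameters open.

import re

def extract_terms(text):
    # crude term extraction; an HTML parser would be used on real pages
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_relevant(page, query, weights=None):
    # page: dict with "title", "body" and "anchor" text fields.
    weights = weights or {"title": 3.0, "anchor": 2.0, "body": 1.0}
    q_terms = set(extract_terms(query))
    score = 0.0
    for field, w in weights.items():
        score += w * sum(1 for t in extract_terms(page.get(field, "")) if t in q_terms)
    return score

page = {"title": "page ranking algorithms", "body": "ranking web pages", "anchor": "ranking"}
print(rank_relevant(page, "web page ranking"))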


Step 2: This step is concerned with creating the weighted directed graph for the web pages. The nodes of the graph represent the web pages, while the edges of the graph carry the link information as well as the time users spent visiting those pages and the frequency of link usage, which constitutes the web usage mining part.

Figure 4.1: User Behavior Data and the Weighted Graph

After building the graph, a random surfer model like the one used in the PageRank algorithm is run on the weighted graph, with emphasis on the time-spent and link-frequency information: the out-links of a node receive different shares according to the weights of the links of the graph, instead of the equal shares assigned by the original PageRank algorithm. At the end of this step an importance score is assigned to each page, rank_important.
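The sketch below runs a PageRank-style random surfer on such a weighted graph: each node distributes its rank to its out-neighbors in proportion to the edge weights (standing here for observed visit time and click frequency) rather than equally. The weight definition, the damping factor, and the handling of dangling pages are assumptions made for this example.

def weighted_pagerank(edges, damping=0.85, iterations=50):
    # edges: dict mapping page -> {target page: edge weight}.
    nodes = set(edges) | {v for targets in edges.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in nodes}
        for u, targets in edges.items():
            total_w = sum(targets.values())
            if total_w == 0:
                continue   # dangling page: its rank is not propagated in this sketch
            for v, w in targets.items():
                # proportional to edge weight, not 1/outdegree as in plain PageRank
                new_rank[v] += damping * rank[u] * (w / total_w)
        rank = new_rank
    return rank

# Edge weights might combine normalized visit time and click frequency.
web = {"A": {"B": 3.0, "C": 1.0}, "B": {"C": 2.0}, "C": {"A": 1.0}}
print(weighted_pagerank(web))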


Step 3: By the end of the second step, each web page has two different rank scores, rank_relevant and rank_important. This step computes one rank score for each web page by combining the two rank scores; one possible combination is sketched at the end of this section.

Input: Internet web pages
Output: a final_score for each web page
Algorithm:
    Analyze each page's content
    Produce rank_relevant for each page
    Analyze the users' behavior for the web pages
    Construct a directed weighted web graph
    Run a random surfer model
    Produce rank_important for each page
    Combine rank_relevant with rank_important for each page to produce the final_score

The hybrid method refines the ranking of web pages by considering both the importance and the relevance of the web pages. It uses all the sources of information on the web, and it combines all the kinds of web mining techniques: web content mining, web link mining, and web usage mining. It considers the relevancy as well as the importance of the web pages, and as a result the quality of the search engine's results will be enhanced.
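One simple way to realize the combination step is a convex combination of the two normalized scores, as sketched below. The thesis leaves the combination method open, so the normalization and the mixing parameter alpha are assumptions for illustration only.

def final_score(rank_relevant, rank_important, alpha=0.5):
    # rank_relevant, rank_important: dicts mapping page -> score; alpha in [0, 1].
    def normalize(scores):
        top = max(scores.values(), default=1.0) or 1.0
        return {p: s / top for p, s in scores.items()}
    rel, imp = normalize(rank_relevant), normalize(rank_important)
    return {p: alpha * rel.get(p, 0.0) + (1 - alpha) * imp.get(p, 0.0)
            for p in set(rel) | set(imp)}

# Example: pages ordered by the combined score.
combined = final_score({"A": 10.0, "B": 4.0}, {"A": 0.2, "B": 0.6})
print(sorted(combined, key=combined.get, reverse=True))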


Even though the new approach looks good because it uses all the information available on the web to rank web pages, it faces many storage issues. It requires huge storage facilities for content information, server logs, ranking scores, and the hyperlink graph. For example, it needs to build lists of the occurrences of particular words in particular pages, including their positions and other information, and these may require encoding, sorting, and merging. This makes the development of the new approach much more difficult. The approach also requires storage for its mathematical components, the matrices and vectors.

The issues of stability, which concerns the accuracy of the algorithm, and convergence, which concerns the termination of the algorithm within a reasonable number of iterations and within bounded values, arise during the computation of this algorithm. The combination step is also a challenge: there are various methods for combining the final score, but the best combination is not clear in this algorithm, and considerable study is needed to determine the best one. The complete implementation and evaluation of this algorithm is beyond the scope of this thesis.


5. Conclusion

A search engine is expected to present the most relevant web pages that users are looking for at the top of the result list. Web page ranking mechanisms have been adopted by search engines as one of their main components. This component uses mining, or analysis, techniques, called Web mining, to rank the web pages. Analyzing or mining the Web means extracting the necessary information from the huge amount of data located on the Web. Using it, search engines can respond to a user's query and place the most relevant pages at the top, making the search faster and easier for the user.

PageRank, HITS, SALSA, PCR, and the Improved Usage-Based Ranking are algorithms for ranking web pages using different methodologies that are considered and explained in this thesis. These web page ranking algorithms rely on different ways to analyze or mine the Web: web structure mining, web content mining, and web usage mining; some of them use a combination of these techniques. The PageRank algorithm mines the web structurally, i.e., it extracts information from the hyperlinks of the web structure, and the Page Content Rank (PCR) algorithm emphasizes the content of the web pages, whereas the HITS and SALSA algorithms also consider the structure of the web pages.


The Improved Usage-Based Ranking Algorithm is an attempt to use web link mining while taking into account the advantages of the information gained from users' behavior in navigating the Web, which is web usage analysis.

Web page ranking algorithms use different inputs. PageRank utilizes the backlinks, while HITS and SALSA utilize both backlinks and forward links. The Improved Usage-Based Ranking algorithm uses the behavior of the user, such as the time spent on a page, beside some other features, whereas PCR uses the terms that the page contains as its input. According to the technique used, the page ranking algorithms output different page orders as a result. Some algorithms, such as PageRank, HITS, SALSA, and the Improved Usage-Based Ranking, return the most important pages, while other algorithms, such as PCR, ignore the importance of the pages but place the most relevant pages at the top of the obtained list.

The new algorithm proposed in this thesis uses all the web mining techniques. Its input consists of the pages' content, inlinks, and server logs. As the algorithm uses all the information sources on the web to produce page ranks, it considers both the importance and the relevancy of a page. It therefore produces the most important and relevant web pages for the query topic.


The following Table 4.1 is a summary comparison of the web page ranking algorithms I considered and proposed in this thesis. The comparison is done on the basis of parameters such as the algorithm's input, the algorithm's type of result, the technique used to mine the web, and the principle of each algorithm.

Table 4.1: Comparison of Various Web Page Ranking Algorithms

Algorithm: PageRank
    Mining technique: Web link mining
    Input: Inlinks
    Principle: If a page has some important inlinks to it, its forward links to other pages become important
    Output: Importance score for each page
    Complexity: O(log N)

Algorithm: HITS
    Mining technique: Web link mining
    Input: Inlinks and forward links
    Principle: A good hub points to many good authorities; a good authority is pointed to by many good hubs
    Output: Importance score for each page
    Complexity: < O(log N)

Algorithm: SALSA
    Mining technique: Web link mining
    Input: Inlinks and forward links
    Principle: Hubs and authorities
    Output: Importance score for each page
    Complexity: < O(log N)


Table 4.1 (continued)

Algorithm: PCR
    Mining technique: Web content mining
    Input: Page's content
    Principle: A page is important if the terms it contains are important according to the query topic
    Output: Relevance score for each page
    Complexity: O(M)

Algorithm: Improved Usage-Based Ranking
    Mining technique: Web link mining and Web usage mining
    Input: Original page rank and server log (visiting time)
    Principle: Server logs can contribute to the discovery of page importance
    Output: Importance score for each page
    Complexity: O(log N)

Algorithm: A Hybrid Mining
    Mining technique: Web content mining, Web usage mining, and Web link mining
    Input: Page inlinks, page's content, and server logs
    Principle: Using all information sources on the web contributes to the page relevancy and importance
    Output: Importance and relevance score for each page
    Complexity: Not available

N: number of web pages; M: total number of occurrences of query terms in n pages


REFERENCES

1. Baeza-Yates, Ricardo, and Emilio Davis. "Web Page Ranking Using Link Attributes." In Proceedings of the 13th International Conference on World Wide Web. (2004): 328-329.

2. Bharat, Krishna, and Monika Henzinger. "Improved Algorithms for Topic Distillation in a Hyperlinked Environment." In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. (1998): 104-111.

3. Brin, Sergey, and Lawrence Page. "The Anatomy of a Large-Scale Hypertextual Web Search Engine." Computer Networks and ISDN Systems. 30. (1998): 107-117.

4. Broder, Andrei, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, and Sridhar Rajagopalan. "Graph Structure in the Web." Computer Networks. 33. (2000): 309-320.

5. Ding, Chen, Chi-Hung Chi, and Tiejian Luo. "An Improved Usage-Based Ranking." In Proceedings of the Third International Conference on Advances in Web-Age Information Management. Eds. Xiaofeng Meng, Jianwen Su, and Yujun Wang. London: Springer-Verlag, 2002. 346-353.

6. Duhan, Neelam, A. Sharma, and Komal Bhatia. "Page Ranking Algorithms: A Survey." In Proceedings of the IEEE International Conference on Advance Computing. (2009): 1530-1537.

7. Jeh, Glen, and Jennifer Widom. "Scaling Personalized Web Search." In Proceedings of the 12th International Conference on World Wide Web. (2003): 271-279.

8. Kleinberg, Jon. "Authoritative Sources in a Hyperlinked Environment." Journal of the ACM (JACM). 46. (1999): 604-632.


9. Kosala, Raymond, and Hendrik Blockeel. "Web Mining Research: A Survey." ACM SIGKDD Explorations. 2. (2000): 1-15.

10. Langville, Amy, and Carl Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. 1st ed. New Jersey: Princeton University Press, 2006. 224 pp.

11. Lempel, Ronny, and Shlomo Moran. "The Stochastic Approach for Link-Structure Analysis." Computer Networks. 33. (2000): 387-401.

12. McSherry, Frank. "A Uniform Approach to Accelerated PageRank Computation." In Proceedings of the 14th International Conference on World Wide Web. (2005): 575-582.

13. Ng, Andrew, Alice Zheng, and Michael Jordan. "Stable Algorithms for Link Analysis." In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. (2001): 258-266.

14. Pokorny, Jaroslav, and Jozef Smizansky. "Page Content Rank: An Approach to the Web Content Mining." IADIS International Conference on Applied Computing. 46. (2005): 289-296.

15. Tsaparas, Panayiotis. "Using Non-Linear Dynamical Systems for Web Searching and Ranking." In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. (2004): 59-70.