Luận văn Www and the pagerank-Related problems

Tài liệu Luận văn Www and the pagerank-Related problems: ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIÊN Nguyễn Hoài Nam WWW AND THE PAGERANK-RELATED PROBLEMS LUẬN VĂN THẠC SĨ KHOA HỌC Hà Nội - 2006 ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIÊN Nguyễn Hoài Nam WWW AND THE PAGERANK-RELATED PROBLEMS Chuyên ngành: Đảm bảo toán học cho máy tính và hệ thống tính toán Mã số: 1.01.07 LUẬN VĂN THẠC SĨ KHOA HỌC NGƯỜI HƯỚNG DẪN KHOA HỌC: PGS. TS. HÀ QUANG THỤY Hà Nội - 2006 HANOI NATIONAL UNIVERSITY UNIVERSITY OF SCIENCE Nguyen Hoai Nam WWW AND THE PAGERANK-RELATED PROBLEMS Major: Mathematical assurances for computers and computing systems Code: 1.01.07 MASTER THESIS THESIS SUPERVISOR: ASSOC. PROF. HA QUANG THUY Hanoi - 2006 Intentionally left blank for your note Contents List of Figures ii List of Tables iii Introduction iv Acknowledgement vi Abstract ix List of Glossaries xi 1 Objects’ ranks and applications to WWW 1 1.1 Rank of objects . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1...

81 trang | Chia sẻ: hunglv | Lượt xem: 1826 | Lượt tải: 0

Bạn đang xem trước 20 trang mẫu tài liệu Luận văn Www and the pagerank-Related problems, để tải tài liệu gốc về máy bạn click vào nút DOWNLOAD ở trên

ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIấN Nguyễn Hoài Nam WWW AND THE PAGERANK-RELATED PROBLEMS LUẬN VĂN THẠC SĨ KHOA HỌC Hà Nội - 2006 ĐẠI HỌC QUỐC GIA HÀ NỘI TRƯỜNG ĐẠI HỌC KHOA HỌC TỰ NHIấN Nguyễn Hoài Nam WWW AND THE PAGERANK-RELATED PROBLEMS Chuyờn ngành: Đảm bảo toỏn học cho mỏy tớnh và hệ thống tớnh toỏn Mó số: 1.01.07 LUẬN VĂN THẠC SĨ KHOA HỌC NGƯỜI HƯỚNG DẪN KHOA HỌC: PGS. TS. HÀ QUANG THỤY Hà Nội - 2006 HANOI NATIONAL UNIVERSITY UNIVERSITY OF SCIENCE Nguyen Hoai Nam WWW AND THE PAGERANK-RELATED PROBLEMS Major: Mathematical assurances for computers and computing systems Code: 1.01.07 MASTER THESIS THESIS SUPERVISOR: ASSOC. PROF. HA QUANG THUY Hanoi - 2006 Intentionally left blank for your note Contents List of Figures ii List of Tables iii Introduction iv Acknowledgement vi Abstract ix List of Glossaries xi 1 Objects’ ranks and applications to WWW 1 1.1 Rank of objects . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Rank for objects different from web page. . . . . . . . . . 2 1.1.2 Linkage usage in search engine . . . . . . . . . . . . . 4 1.2 Basis of PageRank . . . . . . . . . . . . . . . . . . . . . . 10 1.2.1 Mathematical basis . . . . . . . . . . . . . . . . . . . 11 1.2.2 Practical issues. . . . . . . . . . . . . . . . . . . . . 13 Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 19 2 Some PageRank-related problems 20 2.1 Accelerating problems . . . . . . . . . . . . . . . . . . . . . 20 2.1.1 Related works . . . . . . . . . . . . . . . . . . . . . 21 2.1.2 Exploiting block structure of the Web . . . . . . . . . . . 22 2.2 Connected-component PageRank approach . . . . . . . . . . . 30 2.2.1 Initial ideas. . . . . . . . . . . . . . . . . . . . . . . 30 2.2.2 Mathematical basis of CCP . . . . . . . . . . . . . . . 32 2.2.3 On practical side of CCP. . . . . . . . . . . . . . . . . 35 2.3 Spam and spam detection . . . . . . . . . . . . . . . . . . . 37 2.3.1 Introduction to Web spam . . . . . . . . . . . . . . . . 37 2.3.2 Spam techniques . . . . . . . . . . . . . . . . . . . . 38 Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 42 3 Implementations and experimental results 43 3.1 Search engine Nutch . . . . . . . . . . . . . . . . . . . . . 43 3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.1 Infrastructure and data sets . . . . . . . . . . . . . . . 48 3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . 48 Conclusion of chapter . . . . . . . . . . . . . . . . . . . . . . . 58 Conclusions and future works 59 References 61 List of Figures 1.1 A small directed Web graph with 7 nodes . . . . . . . . . . . . . . . 13 1.2 Ranks of page in SmallWeb graph with α = .9 . . . . . . . . . . . . . 16 1.3 Figures exemplifying results with different α of SmallSet . . . . . . . 17 1.4 Graph of iterations needed with α ∈ [.5; .9] of SmallSet . . . . . . . . 18 2.1 Convergence rates for standard PageRank vs. BlockRank . . . . . . 29 2.2 Unarranged matrix and arranged matrix . . . . . . . . . . . . . . . . 31 2.3 Boosting techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.1 Nutch packages dependencies . . . . . . . . . . . . . . . . . . . . . 46 3.2 Matrices from set 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.3 Ranks from PageRank vs. CCP for set 1 . . . . . . . . . . . . . . . . 50 3.4 Ranks and differences corresponding to each block for set 1 . . . . 51 3.5 Time of two methods with different decay values for set 1 . . . . . . 51 3.6 No. of iterations of two methods with different decay values for set 1 52 3.7 Matrices from set 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.8 Ranks from PageRank vs. CCP for set 2 . . . . . . . . . . . . . . . . 53 3.9 Ranks and differences corresponding to each block for set 2 . . . . 53 3.10 Time of two methods with different decay values for set 2 . . . . . . 54 3.11 No. of iterations of two methods with different decay values for set 2 54 3.12 Matrices from set 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.13 Ranks from PageRank vs. CCP for set 3 . . . . . . . . . . . . . . . . 55 3.14 Ranks and differences corresponding to each block for set 3 . . . . 56 3.15 Time of two methods with different decay values for set 3 . . . . . . 56 3.16 No. of iterations of two methods with different decay values for set 3 57 i 3.17 Mean case for 3 sets with different decay values . . . . . . . . . . . 57 List of Tables 1.1 Iterations needed for each value of decay-factor α of SmallSet . . . 17 2.1 The closeness of local PageRank vector to global PageRank vector 25 3.1 Each set with corresponding sources . . . . . . . . . . . . . . . . . . 49 3.2 Each set with corresponding number of pages and links . . . . . . . 49 iii Introduction Human beings’ history has witnessed many big improvements but nothing is comparable to networks’, in general, and the WWW’s improvement, in partic- ular. During the last decades, databases have grown exponentially in large stores and companies. In the old days, seekers faced many difficulties in finding enough data to feed their needs. The picture has changed and now the reverse picture is a daily problem – how to find relevant data in the information which is regularly accumulated. The need therefore arises for better approaches that are able to handle complex models in a reasonable amount of time because the old models in statistics, machine learning and traditional data analysis fail to cope with this level of complexity. That is named data-mining with an important branch, Web-mining. The key inspiration of web-mining based on WWW’s growth. There is no doubt that WWW is the most valuable source of information but it is not very easy for you to get used to if you do not know the way. Several studies try to estimate the size of the Web andmost of them agree that there are over billions of pages. Additionally, the changing rate of the Web is even more dramatic [4]. According to [4], the size of the Web has doubled in less than two years, and this growth rate is projected to continue for the next two years. Although a enormous amount of information is available on the Web, you can hardly find appropriate information if you nothing supporting. For that reason, you need a search engine to make your work easier. We can consider search engines roughly as places that store information and return what we need after some operations. Everything returned by a iv search engine is sorted descendingly, that means entries which appear first seem to be more relevant. Even if that seems normal, many efforts were made to make this step straightforward. For various ways of approaching, there are many methods that deal with this problem. In this problem, we have to deter- mine a page’s importance score in order to sort it in the list of pages [16, 6]. The most famous approach discovered is PageRankTM from Google [1]. This thesis focuses on PageRank and its modifications to tweak up the computing process. With the aim of understanding network phenomena and, especially, PageR- ank computation, this thesis is organized as follow: • Chapter 1: Demontrates on the rank concept and PageRank. This chap- ter is the background on which the following chapters flourish. • Chapter 2: Takes care of some PageRank-related problems such as ac- celerating problem, spam. And, a new method is also proposed in this chapter (section 3.2) which we has published [19]. • Chapter 3: Concentrates on the practical side of PageRank with im- plementations and experiments. Conclusions and future works are also mentioned in this chapter. Acknowledgement My first words are to acknowledge all those people who had helped, contributed to the works in this thesis. It is obvious that I can not finish this thesis alone. It will be an impossible task if I, myself alone, work on all things such as de- ciding orientation, doing research, implementing... and many more things. I tried my best, and if your name is not listed, please trust implicitly in me that my gratitude is not less than those who are listed below. In the foremost position of my gratitude would stand Assoc. Prof. Ha Quang Thuy for his supervision, advice and guidance to me from the very first day when I newly accustom science. His endless enthusiasm for science has in- spired me in doing research and finally this thesis can be realized. Above all and most needed, he always tries his utmost to encourage and support me in various ways. Once again, I would like to send my most nicely thankful words to him. Many thanks would go to Assoc. Prof. Hoang Chi Thanh, Head of Depart- ment of Informatics where I am working, for his help and advice to me in both work and life. Without his help, my thesis can not be here as it is being now. It is a pleasure for me to express my grateful thanks to Prof. Pham Ky Anh for his generous disposition. I was allowed to use many facilities in the Center for High Performance Computing, where he is the Director, to do research and these facilities helped forming a big important part of my thesis. To Doan Duy Hai, Department of Computational and Applied Mathemat- ics, it is a great joy for me to have such a wonderful friend, colleague like him. May be he does not know how much I am indebted to him for his kind help. I thank him for his valuable advice in science discussion and for his precious vi time spending on helping me. His precise ideas were quite an unforgettable assistance to me with my scientific works. Furthermore, with his superb skill in LATEX, I can have help from an expert to drive the troubles away. Working environment is a key point in forming the way and attitude of working to me. I am very lucky to work in a good environment with wise and scholarly colleagues. I also benefit by academic courses and professional subjects from these wise and scholarly colleagues including my teachers and friends. I thank Nguyen Trung Kien who, with his unconditional help of im- plementing, running and storing data in my experiments, makes things easier for me. I have a nice collaboration to the seminar KDD group from Faculty of In- formation Technology, College of Technology. I was so providential to have sev- eral opportunities to work, discuss about science with those intellectuals. I especially thank Nguyen Thu Trang for her help and time working on pro- gramming with me. This work was supported in part by the National Project "Developing con- tent filter systems to support management and implementation public secu- rity - ensure policy" and the MoST-203906 Project "Information Extraction Models for discovering entities and semantic relations from Vietnamese Web pages". I can not wait anymore to wholeheartedly acknowledge my sweetheart. She is really an angel whose dedication, love, confidence in me and intelligence really took me over hard works. I am extremely lucky to be born in my family. My father, the first person who set the basis of my learning character, taught me the joy of intellectual recreation even when I was a child. My mother, the best mother in the world, with her unselfish sacrifice, bring me up with gently love. And, my younger brother, without his spiritual support, may I fail many times. My last words are to thank to everybody because you deserve that for your importance to me. Hanoi, Octorber 25th, 2006 Nguyen Hoai Nam Abstract In this thesis, we will study about the whole WWW on the point of view of social networks. There are many directions to come up to WWW but we chose the way of social networks because it will give us many advantages in doing research. Everytime we heard of WWW, that refer us to the Internet - the biggest network all over the world. Networks are all around us, all the time. From the biochemistry of our cells to the web of friendships across the planet. From the circuitry of modern electronics to chains of historical events. A network is the result of the forces that shaped it. Thus the principles of network formation can be, to some extent, deciphered from the network itself. All such information comprises the structure of the network. Moreover, the Internet has been improving rapidly, from a small network for experimental purpose to a world-wide network. The plentiful content of the World-Wide Web is useful to millions. But infomation is only available if you know where to find it. This thesis centres around some aspects of network- related problems: • How can we find needed infomation? The answer is the appearance of search engines. • How is returned information sorted? The key idea is to use a ranking ap- proach and the most famous approach, PageRankTM, is carefully studied in this thesis. • What are raised problems associated to the answers of two previous ques- tions? These problems cover storage capacity, computational complexity, ix accelerating methods, programming skills... and many more. To check whether the theory to tweak up the existing method is true, we implemented with data sets downloaded from the Internet. Experimental fig- ures show that our method is relatively good and it is really promissing. List of Glossaries Glossary name Description CCP Connected-component PageRank IR Information Retrieval PR PageRank URL Unified Resource Locator WWW World Wide Web xi CHAPTER Objects’ ranks and applications to WWW WWW is a very famous entity that you can meet everyday at any time and in any circumstance. Gradually it prevails on every aspect of our lives and it is definitely useful. Because of endlessly enormous information, which can be provided by the WWW, we must have a scheme to sort things in order to determine which is relatively suit- able. This chapter will guide you through one of the most topical problems addressed. 1.1 Rank of objects Many datasets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a single-object type and link type, or richer, heterogeneous networks, in which there may be multiple object and link types (and possibly other se- mantic information). Examples of homogeneous networks include single mode social networks, such as people connected by friendship links, or the WWW, 1 1.1 Rank of objects a collection of linked web pages. Examples of heterogeneous networks include those in medical domains describing patients, diseases, treatments and con- tacts, or in bibliographic domains describing publications, authors, and venues. These examples may include set of scientific papers. As all of us know, it is necessary that we credit others’ works if we use their works in our paper. That can be considered a link between our papers and credit-owner’s works. Link mining refers to data mining techniques that explicitly consider these links when building predictive or descriptive models of the linked data. Commonly addressed link mining tasks include object ranking, group detection, collective classification, link prediction and subgraph discovery. While network analysis has been studied in depth in particular areas such as social network analy- sis, hypertext mining, and web analysis, only recently has there been a cross- fertilization of ideas among these different communities. This is an exciting, rapidly expanding area. In this section, examples on ranking objects in many kinds of datasets are examined to insist that an approach to sort relevant results is required. Then, the section will go on with the need of a ranking algorithm for WWW–one of the most important communities in our contemporary world. 1.1.1 Rank for objects different from web page Progress in digital data acquisition and storage technology has resulted in the growth of huge databases. This has occurred in all areas of human en- deavor, from the mundane (such as supermarket transaction data, credit card usage records, telephone call details, and government statistics) to the more exotic (such as images of astronomical bodies, molecular databases, and med- ical records). Little wonder, then, that interest has grown in the possibility of tapping these data, of extracting from them information that might be of value to the owner of the database. The discipline concerned with this task has become known as data mining. Data mining, with its aim to find knowledge from a "big mountain" of infor- mation, needs a way to sort things extracted. We will study some examples to decide that a ranking algorithm is "must-have" part of every mining system. 2 1.1 Rank of objects Example 1.1. Consider some online computers store that accept queries on the Configuration and Price attributes of a goods. Agents could treat query conditions as if they were regular Boolean conditions then returns relevant things. This way, agent or any client can determine an acceptable radius of pattern and price around the preferred products. However, suppose we are browsing a very big store, there could be too many matching items making users’ tasks difficult. Additionally, many items that are slightly weaker in specification but price are very good in the searching range are missing. Thus, we should think of a new ranking system which ranks results if they are closest to the Configuration given and have best Price or relatively inex- pensive. In addition, we can propose, briefly, a method to rank items as follow. Example 1.2. Suppose, with two criteria Configuration and Price used, each criterion is assigned weight as in 0.8c+0.2p where c ∈ [0, 1] shows how close the result is to the target configuration and p ∈ [0, 1] indicates how close the price is to the target price. Looking at the formula, if a computer has higher relevant configuration, it seems to be ranked higher in results. If another store would want to weigh configuration and price equally, the formula should be 0.5c+0.5p. Suppose that a customer wants to look for a computer with specific con- figuration and target price of 500US$. Besides, in the store, there is only one computer with the specific configuration (c = 1, quite good) and high price (i.e. p = 0.2). All the remaining computers are in the near configuration with c = 0.8 and the price for all is moderate, p = 0.5. Note that, with the same or near configuration, computers can differ in price due to the make of manufac- turers. Applying the definition above, we compute the relevance score of each com- puter type. The computer, which suites best in configuration have score of 0.8 ì 1 + 0.2 ì 0.2 = 0.84 whereas other computers should have scores of 0.8ì0.8+0.2ì0.5 = 0.74. Therefore, in this store, the computer with best rele- vant configuration should be ranked first in results and others will have lower ranks. What will happen if we consider a store which assigns the same weight to both configuration and price? Let us have a computation of these computers. The computer, which suites best in configuration has score of 0.5ì1+0.5ì0.2 = 3 1.1 Rank of objects 0.6 whereas other computers should have scores of 0.5 ì 0.8 + 0.5 ì 0.5 = 0.65. Therefore, in this store, the answer is different. Computers with unlike config- uration but good price should be the first choice because they have high scores and the computer with the same configuration but high price would receive a worse vote from the store’s algorithm. Problem arises in Example 1.2 is not the only problem. We will face another problem which is described Example 1.3 Example 1.3. Suppose that there is a service, which combines results from many online stores to provide information to customers. Therefore, this ser- vice has to query many stores at once then combines the results into a single ranked result. Nevertheless, what will happen if this service uses a different algorithm to stores as the case described in Example 1.2? This is an open ques- tion. It is hard to determine which system is right, which is better, to which we should follow for previous examples. Moreover, not only computers but also many kinds of entities around us need to be ranked. That leads us to a big job: Ranking objects. This is an open question that the thesis’ author intends to solve in the future. In previous examples, computers with different configurations and prices have no interrelated information. How about objects those have connections between? Is it easier to mine and extract useful information? In the next sec- tion, we will consider web pages–special objects which have links and know how links between web pages benefit us. 1.1.2 Linkage usage in search engine Links or more generically relationships, among data instances are ubiquitous. These links often exhibit patterns that can indicate properties of the data in- stances such as the importance, rank, or category of the object. In some cases, not all links will be observed; therefore, we may be interested in predicting the existence of links between instances. In other domains, where the links are evolving over time, our goal may be to predict whether a link will exist 4 1.1 Rank of objects in the future, given the previously observed links. By considering links, more patterns that are complex arise as well. This leads to other challenges focused on discovering substructures, such as communities, groups, or common sub- graphs. Traditional data mining algorithms such as association rule mining, mar- ket basket analysis, and cluster analysis commonly attempt to find patterns in a dataset characterized by a collection of independent instances of a single relation. One can think of this process as learning a model for the node at- tributes of a homogeneous graph while ignoring the links between the nodes. A key emerging challenge for data mining is tackling the problem of mining richly structured, heterogeneous datasets. These kinds of datasets are best de- scribed as networks or graphs. The domains often consist of a variety of object types; the objects can be linked in a variety of ways. Thus, the graph may have different node and edge types. Naively applying traditional statistical infer- ence procedures, which assume that instances are independent, can lead to inappropriate conclusions about the data. Care must be taken that potential correlations due to links are handled appropriately. In fact, object linkage is knowledge that should be exploited. This information can be used to improve the predictive accuracy of the learned models: attributes of linked objects are often correlated, and links are more likely to exist between objects that have some commonality. In addition, the graph structure itself may be an impor- tant element to include in the model. Structural properties such as degree and connectivity can be important indicators. Link mining is a newly emerging re- search area that is at the intersection of the work in link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining. We use the term link mining to put a special emphasis on the links, moving them up to first-class citizens in the data analysis endeavor. Now we will take care of link mining on the point of view of the main inspiration–WWW. There is no doubt that the Web is really gigantic and deal- ing with it is really a challenging task. Paper [4] has point out that the size and the content of the Web is changing second by second with the dramatic growth rate. The size of WWW is estimated to have doubled in less than two years and this growth rate is projected to continue for the next two years. As Google [1] has announced, they have a repository of over 8 billion pages indexed and this 5 1.1 Rank of objects number does not reflect well of all pages exist on the WWW. Aside from newly created pages, the existing pages are continuously up- dated and, consequently, contents of Web change incessantly. For example, with the study of over a half of million pages, [4] also shows that about 23% of pages changed daily. In the .com domain, 40% of the pages changed daily, and in about 10 days half of the pages are gone (i.e., their URLs are no longer valid). In addition to size and rapid change, the interlinked nature of the Web sets it apart from many other collections. Several studies aim to understand how the Web’s linkage is structured and how that structure can be modeled. One recent study, for example, suggests that the link structure of the Web is somewhat like a "bow-tie" [7]. That is, about 28% of the pages constitutes a strongly connected core (the center of the bow tie). About 22% form one of the tie’s loops: these are pages that can be reached from the core but not vice versa. The other loop consists of 22% of the pages that can reach the core, but cannot be reached from it. (The remaining nodes can neither reach the core nor can be reached from the core.) With its own complex structure, WWW has created many difficulties in people’s aim to understand it, and, furthermore, to utilize and find useful in- formation from its enormous warehouse of information resources. As many insisting sentences in thesis, this is the main reason why search engines ap- pear. Example 1.4. For many of us, if we want to search for scientific papers, be- sides non-free sites, one of the first-thought-of websites should be This website has a huge collection of scientific papers, especially in Infor- mation Technology. It collects papers from many sources then stores in its database. If you want to search for any paper, just get to the URL above, type what you want and look up at the results returned. This is quite an easy task for us. How about people who are newbies to using Internet? It may be im- possible for them to know the URL needed to find what they want. Moreover, Citeseer is limited in a small amount of available papers, and scientific papers are not the only thing that people want to look up from the WWW. What shown in Example 1.4 has addressed one of the smallest things that 6 1.1 Rank of objects Internet users face in finding what they want on the Internet although infor- mation can be available everywhere. Search engines may be the best solution for us. When a user type a keyword with some terms, a search engine will look up its database to find out relevant web pages and return results at user’s terminal. In general, a search engine accumulates web pages in its database. Then, bases on some criteria, it computes each web page’s importance score to give a place that web page should stand. Importance score is a combination two following parameters: • relevant score between the query and the content of web page, this score can be combined from many coefficients. These criteria may include: – term frequencies: number of terms’ (in keyword) appearances in web pages’ contents. For example, if your query contains "science" then a page with more words "science" would be more relevant to your query. – term proximities: this criterion defines how close your query is to a web page’s content. If your query has "health" word, then a hos- pital’s homepage would be likely more appropriate compares to a supermarket’s homepage. – term position: if terms in your keyword have one of the following positions in a web page’s content, the page is expected to be suit- able. These positions are: in title, top of page... It is quite easy to understand, a title or words on top of pages are highly extracted in- formation from many sentences below them. Thus, a web page with those characteristics would have good relation to the query. – term characteristics: in a web page, words can be formatted in many kinds: bold, italics, capitalized... and each of them has its own mean- ing. If a word is in bold or italics format, it may be emphasized so that people will pay more attentions. And, if your query has a word that is coincident to an emphasized word in web page, your query would receive higher relevance. We will not mention carefully how this score is computed. – category information: we divide things into some categories and it is not that all categories have the same importance. This criterion 7 1.1 Rank of objects takes out the above advantage. If a page in a higher prior category has a (some) word(s) in your keyword, then it will get more score from category information. – popularity information: there are world-wide-known pages and there are no-one-know pages, too. So if a more popular page has words that are the same to a query’s words, it will be rank higher according to popularity information gained. This factor is determined by the number of users reaching a web page, how active is that web page... • rank score which can be referred to as page’s own value and can be com- puted offline [16]. What is a page’s own value and why can it be computed offline? That means every page will have a value adhering and this value is independent from queries issued. Above scores can only be computed when queries are issued but rank score is an exception. Let us consider the following example to understand clearer about the previous criteria. Example 1.5. We assume that a user issues one query of a word "TOEFL" and there are two web pages, namely α and β, which can satisfy this query. For α, term frequencies of the word "TOEFL" is 9 and the number is 7 in web page β. Therefore, we can determine that α has a better score for term frequencies. Re- garding term proximities, we have α is about education-biased and β is in the field of engineering. As all of us know, TOEFL is a test from ETS to check En- glish proficiency so α, with the aim to education, is a better candidate. A better candidate in this example receives 6.8 and another one receives 5.4. The third criterion is designed as follow: α has no TOEFL word in its title and β has one TOEFL in its title. It is clearly that, in this case, β will receive more score from this criterion. And β gets 6.9, α gets 4.2 for that reason. The term characteris- tics is a little bit harder to compute. We have to check if our keywords lie in bold or italics or H11 formats, etc, to determine scores for this factor. Suppose that the final value of this score for two web pages are 6.0 for α and 7.0 for β. Cate- gory information is a relative score. We have to define which category is more important in hundreds of categories. We do believe that there are thousands ideas people can think of to defend that their choices of categories is better. 1In terms of web pages, which mainly derived from HTML, H1 corresponds to Heading 1 which defines the level of a portion of text (H1 is the highest level) 8 1.1 Rank of objects That means, people will sort categories’ importance quite differently. For that reason, in this example, we choose the same "category information" score of 5.0 for both α and β. The last but not least, popularity information score is the one that we should be careful. An enterprise’s homepage, such as Phillips, is definitely more important than an individual’s blog2 but Phillips’ homepage may be not as wide-known as the most famous blog or Phillips’ homepage is unlikely to be as famous as – a web page allowing you to post your self-edited video. So this is an important part in a combina- tion that settles up importance of a page. Come back to the problem, we give α popularity score of 2.2 and β 2.6. After having all information needed about two web pages, we bring them into a formula from which we get the result to evaluate web pages’ importance. As in Example 1.2, we can weigh different scores according to different coeffi- cients and howwe choose these coefficients can significantly change our result. In this example, weight of term frequencies is given to 0.1, weight of term prox- imities is given to 0.13, term position factor is set to 0.17, term characteristics is weighed by 0.1, weights for category information and popularity information are 0.1 and 0.15, corresponding. The final weight which is assigned for rank score is 0.25. The formula is given as IS = wfrefre + wpropro + wpopo + wchacha + wcatcat + wpoppop in which IS is importance score, w means weight and fre, pro, po, cha, cat, pop are term frequencies, term proximities, term positions, term characteristics, category information and popularity, corresponding. With those parameters, we compute IS for each α, β. Consider two cases of rank for α, β to demonstrate how importance scores of pages change. Case 1: rankα = 4 and rankβ = 6 ISα=0.1ì 9 + 0.13ì 6.8 + 0.17ì 4.2 + 0.1ì 7.0 + 0.1ì 5.0 + 0.15ì 2.6 + 0.25ì 4 =5.387 ISβ=0.1ì 9 + 0.13ì 6.8 + 0.17ì 4.2 + 0.1ì 7.0 + 0.1ì 5.0 + 0.15ì 2.6 + 0.25ì 6 =5.206 2a.k.a. web log – a kind of online diaries 9 1.2 Basis of PageRank Obviously, page α will be on top of page β. Case 2: rankα = 4 and rankβ = 7 ISα=5.387 ISβ=0.1ì 9 + 0.13ì 6.8 + 0.17ì 4.2 + 0.1ì 7.0 + 0.1ì 5.0 + 0.15ì 2.6 + 0.25ì 7 =5.456 In this case, final IS for β changes so that β will take a better place than α in results returned to user. As can see in Example 1.5, importance score of a page can change to get higher place if it has a good rank. But rank is not the only factor that can change importance score of a page, there are some other factors. Although other factors which can affect importance score of a page can be easily tweaked to get higher IS, there are some restrictions that make modifications harder. For example, you can bias your web page by using much word that relate to a specific subject but it may be consider spam3. We have detailed about how criteria are computed and used in determining a page’s importance score except rank. As a result, questions rise: how is rank of a page is calculated and how many ways are there to compute ranks of pages? The answer is as what had been stated: linkage structure can be greatly useful. And, how linkage structure is used will be detailed in the next section with the most famous algorithm PageRankTM from search engine Google [1]. 1.2 Basis of PageRank There are two main reasons why traditional Information Retrieval (IR) tech- niques may not be effective enough in ranking query results. First, the Web is very large, with great variation in the amount, quality and types of infor- mation presented in Web pages. Thus, many pages that contain the search terms may be of poor quality or not relevant. Second, many Web pages are not sufficiently self-descriptive, so IR techniques that examine the contents of a page alone may not work well. An often-cited example to illustrate this issue is 3This is subject to mention in the next chapter 10 1.2 Basis of PageRank the search for "search engines". The homepages of most of the principal search engines do not contain the text "search engine". Moreover, Web pages are fre- quently manipulated by adding misleading terms so they are ranked higher by a search engine (spamming). Thus, techniques that base their decisions on the content of pages alone are easy to manipulate. The link structure of the Web contains important implied information, and can help in altering or rank- ing Web pages. In particular, a link from page A to page B can be considered a recommendation of page B by the author of A. Some new algorithms have been proposed that exploit this link structure - not only for keyword search- ing, but also other tasks like automatically building a Yahoo-like hierarchy, or identifying communities on the Web. The qualitative performance of these algorithms is generally better than the IR algorithms since they make use of more information than just the contents of the pages. While it is indeed possi- ble to influence the link structure of the Web locally, it is quite hard to do so at a global level. Therefore, link analysis algorithms that work at a global level are relatively robust against spamming. 1.2.1 Mathematical basis In [16], Page et. al. introduced an approach, named PageRank, which tries to capture notion importance of a page. For instance, the Google homepage is intuitively more important than the homepage of a normal university. This difference is reflected in the number of other pages that point to these two pages, that is, more pages point to the Google homepage than to the univer- sity’s homepage. PageRank bases its ranking on an idea that if there is a link from web page A to web page B, the importance of page A will influence the importance of page B. That means a page receives more importance if Yahoo points to it, than if some unknown page points to it. Citation ranking, in con- trast, does not distinguish between these two cases. Note that the definition of PageRank is recursive - the importance of a page both depends on and influ- ences the importance of other pages. Page and Brin [16] uses the linkage structure of the Web to build a stochas- tic irreducible Markov chain with a transition probability matrix. We will first present the simple definition of PageRank and discuss some of the computa- 11 1.2 Basis of PageRank tional aspect, then more details on computational aspect are mentioned in the following chapter. Suppose that we have a collection of web pages in database and these pages are numbered from 1 to n. For any arbitrary page u, let BI(u) be the collection of pages in which every page points (links) to u, Nu be the number of links from u, let piu be the rank of page u. Rank piu of page u in PageRank method is computed as follow: piu = ∑ i∈BI(u) pii Ni (1.1) Expressing in graph terms, we define graph G = (V,E) in which V is a col- lection of web pages whose ranks are to be computed (pages number ranging from 1 to n), E is a collection of edges: E={(i, j) | if there is a link from i to j}. In its most trivial form, PageRank assumes that web graph G is strongly connected. Thus, we can construct the adjacent matrix representing G as fol- low: P = (pij)nìn (1.2) in which pij =  1/Ni if there is a link from i to j0 otherwise (1.3) As view in probability terms, P is transition matrix whose element pij is the probability of moving from state i (page i) to state j (page j) in one time step. For example, assume that, starting from any node (webpage), it is equally likely to follow any of the outgoing links to arrive at another node. Matrix P is stochastic and irreducible. Rewrite (1.1) in matrix form for every page, we have: pi = piP (1.4) in which pi = (pi1, pi2, . . . , pin) is a 1ì n vector with elements are ranks of pages. Thus, from (1.4), we come back to the problem of finding eigen-vector of matrix P corresponding to eigen-value λ = 1 or we can say that pi is the stationary vector of the chain. 12 1.2 Basis of PageRank 1.2.2 Practical issues The first practical issue we meet come from the Web linkage structure. Sim- ple PageRank is well defined only if the link graph is strongly connected that means there is, at least, a way from a page to other ones. However, the Web 1 2 0 3 6 4 5 Figure 1.1: A small directed Web graph with 7 nodes is far from strongly connected. In particular, there are some related problems that arise on the real Web: rank sinks and rank leaks. Any strongly (internally) connected cluster of pages within the Web graph from which no links point out- wards forms a rank sink. An individual page that does not have any outlinks constitutes a rank leak. Although, technically, a rank leak is a special case of rank sink, a rank leak causes a different kind of problem. In the case of a rank sink, nodes not in the sink receive a zero rank, which means we cannot distinguish the importance of such nodes. Example 1.6. In Figure 1.1, node 6 sinks and if we remove the link from node 5 to node 3, we make nodes 4, 5 and 6 sink. In that circumstance, group of 4, 5, and 6 sinks because there is no outlink from them to the outer world. Node 6 is even worse, it becomes a representative for rank leak. Because, from it, we can count no outlink. In this case, any rank reaching a rank leak is lost forever due to no way out. However, node 0 is the worst because there is no inlink to it and outlink from it. Page. et. al. suggested overcoming this problem in two steps. First, they remove all the leak nodes with out-degree 0 (out-degree of a page is the number of its outlinks). It is equivalent to the job that we add a value of 1 n to all entries 13 1.2 Basis of PageRank of every row corresponding to a rank sink node because that row will have all 0s. Matrix P turns into matrix P ′. Second, to insure a stationary vector exists, they define a parameter called decay-factor. The final modified matrix is given as follow: P˜ = αP ′ + (1− α) n J (1.5) in which α is "decay-factor" (often chosen as 0.85) and J is matrix of all 1s. Then, instead of finding eigen-vector of matrix P , we find eigen-vector of ma- trix P˜ given by equation: pi = piP˜ (1.6) We also have sum of all elements of pi equals to 1: n∑ i=1 pii = 1 (1.7) The second practical issue we face is an efficient way to compute PR vector. As indicated, the problem of computing ranks of pages became the problem of finding the principal eigenvector pi of transition matrix. One of the simplest methods used to determine eigenvectors of matrix is called power iteration. Power method can yield result of PageRank after about 50 to 100 iterations which is ranged by our choice of decay-factor α [16, 13, 12, 19, 14]. PageRank iterative method is given below: Algorithm 1.1 Iterative method for original PageRank computation 1: function PAGERANK(G) . G is the Web graph 2: tmp← any random vector . vector tmp is often assigned by ( 1 n ) 1ìn 3: ← any threshold value . value to determine if pi converges 4: while ||pi − tmp||L > do . norm L can be L1 or L2 5: pi ← tmp 6: pi ← tmpì P 7: end while 8: return pi . the final PR vector 9: end function Example 1.7. Suppose we have 7 web pages in our repository with the graph as shown in Figure 1.1. Notice that, we add 2 rank leak nodes: 0 and 6 to illustrate all cases in the real Web. 14 1.2 Basis of PageRank We, then, form the matrix of 6 nodes in this small web graph (except for node 0) for an easier demonstration. We name this 6-node graph as SmallSet to easily reference. P =   p1 p2 p3 p4 p5 p6 p1 0 1 2 1 2 0 0 0 p2 1 3 0 1 3 0 0 1 3 p3 0 1 2 0 1 2 0 0 p4 0 0 0 0 1 0 p5 0 0 1 3 1 3 0 1 3 p6 0 0 0 0 0 0   Row 6 contains all 0s because there is no outlink from the 6th page and as a result, P is not an adjacent matrix for a strongly connected graph. So we modify matrix P by adding 1 6 to all entries on the 6th row. We got: P ′ =   p1 p2 p3 p4 p5 p6 p1 0 1 2 1 2 0 0 0 p2 1 3 0 1 3 0 0 1 3 p3 0 1 2 0 1 2 0 0 p4 0 0 0 0 1 0 p5 0 0 1 3 1 3 0 1 3 p6 1 6 1 6 1 6 1 6 1 6 1 6   After the addition, P ′ is stochastic but the addition alone cannot definitely make sure that the stationary vector exists. The final step to make sure the matrix irreducible is to alter P ′ by P˜ as in 1.5. We choose α = 9 10 P˜ = 9 10 P ′ + 1 60 J =   p1 p2 p3 p4 p5 p6 p1 1 60 7 15 7 15 1 60 1 60 1 60 p2 19 60 1 60 19 60 1 60 1 60 19 60 p3 1 60 7 15 1 60 7 15 1 60 1 60 p4 1 60 1 60 1 60 1 60 11 12 1 60 p5 1 60 1 60 19 60 19 60 1 60 19 60 p6 1 6 1 6 1 6 1 6 1 6 1 6   Our matrix now becomes stochastic and irreducible. Therefore, every node can directly move to other ones and the stationary probability distribution over 15 1.2 Basis of PageRank the Markov chain can be found. With matrix P modified, we can obtain the principal eigenvector of P˜ : piT = (0.1973 0.2637 0.2066 0.1584 0.1574 0.0167) A visual illustration of this example can be found in Figure 1.2. 1 2 3 4 5 6 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 Figure 1.2: Ranks of page in SmallWeb graph with α = .9 In Figure 1.3, we also test with different value of α to make sure that each page’s rank can be significantly changed by modifying α. As can see in the figure, α with the value of 0.9 really help us to draw a distinction between pages because it yields quite different results of ranks. Results become more neutral when value of α decreases. Medium high-ranked pages do not change much in ranks but the highest and the lowest change quite clearly. For more specifically information, in Figure 2.3(a) pi = (0.1973 0.2637 0.2066 0.1584 0.1574 0.0167) and in Figure 2.3(i) pi = (0.1681 0.2183 0.1809 0.1720 0.1773 0.0833) so pi2 goes down to 0.2183 from 0.2637 and pi6 goes up to 0.0833 from 0.0167. One’s changing rate is about 0.5 and another one is about 0.6. This can lead us to a different result in searching if α is not carefully chosen. For a better performance of system, we can choose a small value for α but the result may be useless. For a usable result, we should choose α near to 1 but the cost may increase sharply as in Table 1.1. 16 1.2 Basis of PageRank 1 2 3 4 5 6 0 0.2 0.4 (a) 1 2 3 4 5 6 0 0.2 0.4 (b) 1 2 3 4 5 6 0 0.2 0.4 (c) 1 2 3 4 5 6 0 0.2 0.4 (d) 1 2 3 4 5 6 0 0.2 0.4 (e) 1 2 3 4 5 6 0 0.2 0.4 (f) 1 2 3 4 5 6 0 0.2 0.4 (g) 1 2 3 4 5 6 0 0.2 0.4 (h) 1 2 3 4 5 6 0 0.2 0.4 (i) Figure 1.3: Figures exemplifying PR vector with (a) α = .9; (b) α = .85; (c) α = .8; (d) α = .75; (e) α = .7; (f) α = .65; (g) α = .6; (h) α = .55; (i) α = .5 Table 1.1 shows the number of iterations needed to compute PR vector cor- responding to each value of α. Value of α .9 .85 .8 .75 .7 .65 .6 .55 .5 Iterations 18 16 14 12 12 10 10 8 8 Table 1.1: Iterations needed for each value of decay-factor α of SmallSet Iterations needed for α = .9 is as double as iterations needed for α = .5. As Figure 1.4 shows, for the biggest value of α chosen, the cost is remarkable but if reference back Figure 1.3 the result is relatively better because we can discriminate pages’ ranks. For the smallest value of α, the number of spending iterations is smaller and therefore time taken is much lower but the result may be not really good. 17 1.2 Basis of PageRank 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 8 10 12 14 16 18 Figure 1.4: Graph of iterations needed with α ∈ [.5; .9] of SmallSet Algorithm 1.2 will discuss about the PR computation in practical use. 18 1.2 Basis of PageRank Algorithm 1.2 Iterative method for practical PageRank computation 1: function PRACTICALPAGERANK(G) . G is the Web graph 2: α← any decay value . α is often chosen .85 3: tmp← any random vector . vector tmp is often assigned by ( 1 n ) 1ìn 4: ← any threshold value . value to determine if pi converges 5: for indexRow ← 1ữ n do 6: remove rank leak and rank sink nodes 7: end for 8: construct P˜ 9: while ||pi − tmp||L > do . makes pi converge to the principal eigenvector 10: pi ← tmp 11: pi ← tmpì P˜ 12: end while 13: return pi . the final PR vector 14: end function In fact, all rank sink and rank leak nodes are removed by making them connected by other nodes before constructing matrix P˜ . After that, we apply the same process as in Algorithm 1.1 to get the final result. And, Algorithm 1.2 also finals this chapter. Conclusion of chapter In this chapter, we have gone through a key idea for making an order to objects in our world. Objects in a database should be ranked, scientific papers should be ranked and web pages must be ranked. Moreover, the most important thing it brings us that we can put everything in order to easily mine for knowledge. We also navigated about PageRank, a method used for ranking web pages, completely about its base and arisen problems in practical usage. In the next chapter, we will examine some matters rounding accelerating and spamming problems. 19 CHAPTER Some PageRank-related problems PageRank is a famous occurrence, a trademark and a tool from which Google benefits much. Because it is famous, many scientists pay attention and as a result, there are many ways to tweak PageR- ank up. Some accelerating methods for PageRank will be shown in this chapter as a proof that PageRank is one of the biggest current affairs in Internet-related fields. Furthermore, because a page can easily exploit PageRank to get higher rank so we must deal with a problem, which is called spam4. 2.1 Accelerating problems PageRank computation is a big problem which involves many fields of study such as storage capacity, computational complexity... In this section, we will discuss one of the major problems rising with the computation of PageRank which is so-called accelerating problems. 4This kind of spam is different from mail spam 20 2.1 Accelerating problems 2.1.1 Related works The Web is becoming a massive repository for delivering information. Recent studies [8] show that most of the users access the Web by means of a search engine. Nielsen/NetRatings, one of the leading Internet and digital media au- dience information and analysis services, reports that there were an estimated 151 million active Internet users in the US, for January 2004. Of these, 76% used a search engine at least once during the month and the average time spent searching was about 40 minutes. Therefore, to access something on In- ternet the easiest way is to insert some keywords on a favorite search engine and to jump to the selected Web site returned as answer. For these reasons, we witnessed to a growing interest and in large investment efforts to improve the quality of Web ranking algorithms. Besides, the research community has devoted an increased attention to reduce the computation time needed by Web ranking algorithms. Indeed, a particular attention was devoted to improve PageRank [6, 16], the well known ranking algorithm used by Google. The core of PageRank exploits an iterative weight assignment of ranks to the Web pages, until a fixed point is reached. This fixed point turns out to be the com- putation of the (dominant) eigenpairs of a matrix derived by the Web Graph itself. Brin and Page originally suggested computing this pair using the well- known Power method and they also gave a nice interpretation of PageRank in term of Markov Chains. The most recent researches about PageRank are aimed to address at least two different needs. First, the desire to reduce the time spent to weight the nodes of a Web Graphs containing many billions of pages which requires sev- eral days of computation. Note that Fetterly et al. in [9] showed that 35% of the Web changed during the course of their study (eleven downloads over a period of 11 weeks). More recently Cho et al. [14] showed that in a week more than 25% of links are changed and 5% of "new content" is created. They conclude that "This result indicates that search engines need to update link based rank- ing metrics (such as PageRank) very often... a week-old ranking may not reflect the current ranking of the pages very well. The second need, is to assign many PageRank values to each Web page, as results of PageRank’s personalization that was recently presented by Google as beta-service [1]. 21 2.1 Accelerating problems Previous approaches to accelerate PageRank followed three different direc- tions. The first approach tries to fit the graph structure into main memory by using compression techniques for the Web Graph [5]. The second approach efficiently computes in external memory the PageRank, by using a sequence of scan operations on the Web graph. The last direction exploits efficient nu- merical methods to reduce the computation time. These kinds of numerical techniques are the most promising and we have seen many intriguing results in the last few years. Arasu et al. [3] observe that computing PageRank is equivalent to solve a linear system and proposed to exploit iterative methods such as the Gauss-Seidel algorithm. Kamvar et al. [12] suggest accelerating computation by using extrapolation methods which periodically subtracts off estimates of non-principal eigenvectors from the current iterate of the power method. The reported speedup is about 50-300% and it is influenced by the particular parameter settings used for computing the PageRank. Kamvar et al. [14] note that the Web Graph’s matrix, permuted by sorting the lexico- graphically the URLs, assumes an approximate block structure since most of the links are intra-domains. They suggest to compute separately several "lo- cal" PageRank vectors, one for each block and to use the concatenation of these vectors as starting point for the computation of a global web rank. In [13], the authors suggest to avoid the re-computation of the (sub)-components of PageR- ank vector as soon as they arrive to a fixed point. This results in a speedup of about 17%. We will investigate a method motivated by cost-reducing direction as men- tioned above in the following part. It exploits the structure of the Web graph and arranges web pages into blocks if they are from the same host to reduce computational cost. By introducing BlockRank [14], we can have a nice bridge to a method proposed in the section 3.2. 2.1.2 Exploiting block structure of the Web Overview of BlockRank Algorithm The block structure of the web suggests a fast algorithm for computing PageRank, wherein a “local PageRank vector” is computed for each host, giving the relative importance of pages within a host. These local PageRank vectors 22 2.1 Accelerating problems can then be used to form an approximation to the global PageRank vector that is used as a starting vector for the standard PageRank computation. This is the basic idea behind the Block-Rank algorithm, which we summarize here and describe in this section. The algorithm proceeds as follows: Algorithm 2.1 General steps of BlockRank algorithm 1: function BLOCKRANK(G) . G is the Web graph 2: Split the web into blocks by domain 3: Compute the Local PageRanks for each block 4: Estimate the relative importance, BlockRank, for each block 5: Form an approximate Global PageRank vector z for web pages 6: Use z as a starting vector for standard PageRank 7: end function Before describing the steps in detail below, we should get familiar to some notations used in this section. We will use lower-case indices (i.e. i, j) to repre- sent indices of individual web sites, and upper case indices (i.e. I, J) to repre- sent indices of blocks. We use the shorthand notation i ∈ I to denote that page i ∈ block I. The number of elements in block J is denoted nJ . The graph of a given block J is given by the nJ ì nJ submatrix GJJ of the matrix G. Computing Local PageRanks In this section, we describe computing a “local PageRank vector” for each block in the web. One may wonder how a block is created and that question would be revealed in this part. Authors of [14] took all the hyperlinks in their FULL-LARGEWEB data set, and count how many of these links are “intra-host” links (links from a page to another page in the same host) and howmany are “inter-host” links (links from a page to a page in a different host). They had shown that 79.1% of the links in dataset are intra-host links, and 20.9% are inter-host links. These intra-host connectivity statistics are not far from the earlier results of Bharat et al. [2]. They also investigated the number of links that are intra-domain links, and the number of links that are interdomain links. And larger majority of links are intra-domain links (83.9%). These results make sense intuitively. 23 2.1 Accelerating problems Take as an example the domain cs.stanford.edu. Most of the links in cs.stanford.edu are links around the cs.stanford.edu site (such as cs.stanford.edu/admissions or cs.stanford.edu/research). Furthermore, almost all non-navigational links are links to other Stanford hosts, such as www.stanford.edu,library.stanford.edu or www-cs-students.stanford.edu. One might expect that there exists this structure even in lower levels of the web hierarchy. For example, one might expect that pages under cs.stanford.edu/admissions/ are highly inter- connected, and very loosely connected with pages in other sublevels, leading to a nested block structure. This type of nested block structure can be natu- rally exposed by sorting the link graph to construct a link matrix in the fol- lowing way. To sort URLs lexicographically, except that as the sort key, they reversed the components of the domain. For instance, the sort key for the URL www.stanford.edu/home/students/would be edu.stanford.www/home/students. The URLs are then assigned sequential identifiers when constructing the link matrix. A link matrix contains as its (i, j)th entry a 1 if there is a link from i to j, and 0 otherwise. This has the desired property that URLs are grouped in turn by top level domain, domain, hostname, and finally path. The subgraph for pages in stanford.edu appear as a sub-block of the full link matrix. In turn, the subgraph for pages in www-db.stanford.edu appear as a nested sub-block. Since most blocks have very few links in and out of the block as compared to the number of links within the block, it seems plausible that the relative rankings of most of the pages within a block are determined by the inter-block links. We define the local PageRank vector lJ of a block J (GJJ ) to be the result of the PageRank algorithm applied only on block J , as if block J represented the entire web, and as if the links to pages in other blocks did not exist. That is: lJ = PAGERANK(GJJ ) in which PAGERANK is exactly as what written in Algorithm 1.1. Local PageRank accuracies To investigate how well these local PageRank vectors approximate the rel- ative magnitudes of the true PageRank vectors within a given host, Kamvar et. al. ran these experiments. They computed the local PageRank vectors lJ of 24 2.1 Accelerating problems each host in STANFORD/BERKELEY. They also computed the global PageRank vector pi for STANFORD/BERKELEY using the standard PageRank algorithm. After the global PageRank vector computation is completed, they then com- pared the local PageRank scores of the pages within a given host to the global PageRank scores of the pages in the same host. Specifically, they took the el- ements corresponding to the pages in host J of the global PageRank vector pi, and form the vector gJ from these elements. This newly-created vector gJ is then normalized so that its elements sum to 1 in order to compare it to the local PageRank vector lJ , which also has an L1 norm of 1. Specifically, gJ = pij∈J ||pij∈J || In Table 2.1, the absolute error, ||lJ−gJ ||1, is on average 0.2383 for the hosts in STANFORD/BERKELEY. Error measure Average ||lJ − gJ ||1 0.2383 KDist(lJ , gJ) 0.0571 Table 2.1: The closeness of local PageRank vector to global PageRank vector Furthermore, they conclude that rank scores induced by local PageRank is close to the scores obtained from the global PageRank scores. To compare the orderings, we measure the average “distance” between the local PageRank vec- tors lJ and global PageRank segments gJ . The KDist distance measure, based on Kendall’s-τ rank correlation and used for comparing induced rank orders, is defined as follow. Consider two partially ordered lists of URLs, τ1 and τ2, each of length m. Let U be the union of the URLs in τ1 and τ2. If δ1 is U − τ1, then let τ ′1 be the extension of τ1, where τ ′1 contains δ1 appearing after all the URLs in τ1. We extend τ2 analogously to yield τ ′2. KDist is then defined as: KDist(τ1, τ2) = |{(u, v) : τ ′1, τ ′ 2 disagree on order of (u, v), u 6= v}| |U | ì (|U | − 1) In other words, KDist(τ1, τ2) is the probability that τ ′1 and τ ′2 disagree on the relative ordering of a randomly selected pair of distinct nodes (u, v) ∈ U ì U . 25 2.1 Accelerating problems In the current work, we only compare lists containing the same sets of ele- ments, so that KDist is identical to Kendall’s τ distance. The average distance KDist(lJ , gJ ) is 0.0571 for the hosts in STANFORD/BERKELEY. Notice that this distance is low. This observation means that the ordering induced by the lo- cal PageRank is close to being correct, and thus suggests that the majority of the L1 error in the comparison of local and global PageRanks comes from the miscalibration of a few pages on each host. Indeed the miscalibration may be among important pages; discussing next, this miscalibration is corrected by the final step of our algorithm. Furthermore, the relative rankings of pages on different hosts is unknown at this point. For these reasons, authors of [14] do not suggest using local PageRank for ranking pages; just using as a tool for computing the global PageRank more efficiently. And the local PageRank results compare favorable to the global PageRank which can be obtained from the standard PageRank algorithm. They suggest using the local PageRank vector as a start vector for computing global PageRank vector. It should be noted that errors of this pattern (where the majority of L1 er- ror comes from the miscalibration of a few pages) are fixed easily, since once these miscalibrated pages are fixed (by, for example, a few iterations of global PageRank), the rest of the pages fall into place. Errors that are more random take longer to fix. Local PageRank convergence rates Another interesting question to investigate is how quickly the local PageR- ank scores converge. In Figure 4, they showed a histogram of the number of it- erations it takes for the local PageRank scores for each host in their data set, in which dangling nodes are removed and contains roughly about 70M pages with over 600M edges, to converge to an L1 residual < 10−1. Notice that most hosts converge to this residual in less than 12 iterations. Interestingly, there is no correlation between the convergence rate of a host and the host’s size. Rather, the convergence rate is primarily dependent on the extent of the nested block structure within the host. That is, hosts with strong nested blocks are likely to converge slowly, since they represent slow-mixing Markov chains. Hosts with a more random connection pattern converge faster, since they represent a fast- mixing Markov chain. This observation suggests that one could make the local 26 2.1 Accelerating problems PageRank computations even faster by wisely choosing the blocks. That is, if a host has a strong nested block structure, use the directories within that host as your blocks. However, this is not a crucial issue, because we show in Section 5 that the local PageRank computations can be performed in a dis- tributed fashion in parallel with the crawl. Therefore, reducing the cost of the local PageRank computations are not a bottleneck for computing PageRank with our scheme, as the local computations can be pipelined with the crawl. Estimating the relative importance of each block After calculating all pages’ ranks in each host, the algorithm turns out to compute each block’s rank, namely BlockRanks. Assume there are k blocks in the web. The idea used to compute BlockRank is similar to the computation of page’s rank. In [14] firstly the block graph B is created, where each vertex corresponds to a block in the web graph. An edge between two pages in the web is represented as an edge between the corresponding blocks (or a self-edge, if both pages are in the same block). The edge weights are determined as follows: the weight of an edge between blocks I and J is defined to be the sum of the edge-weights from pages in I to pages in J in the web graph, weighted by the local PageRanks of the linking pages in block I. That is, if P is the web graph and li is the local Page-Rank of page i in block I, then weight of edge BIJ is given by: BIJ = ∑ i∈I,j∈J Pij.li They wrote it in matrix notations. Define the local PageRank matrix L to be the nì k matrix whose columns are the local PageRank vectors lJ . L =   l1 0 ã ã ã 0 0 l2 ã ã ã 0 ... ... . . . ... 0 0 ã ã ã lk   Define the matrix S to be the n ì k matrix that has the same structure as L, but whose nonzero entries are all replaced by 1. Then the block matrix B is the k ì k matrix: B = LTPS 27 2.1 Accelerating problems Notice that B is a transition matrix where the element BIJ represents the transition probability of block I to block J . That is: BIJ = p(J |I) Once we have the k ì k transition matrix B, we may use the standard PageR- ank algorithm on this reduced matrix to compute the BlockRank vector b. In this notation, each block is considered as a web page and the PageRank algo- rithm is applied same as for web pages. Using relative importance of blocks for computing PageRank At this point, there are two sets of rankings, one set contains relative ranks of web pages in each block and another set contains relative ranks of blocks in the web graph. Within each block J , we have the local PageRanks of the pages in the block bJ . We also have the BlockRank vector b whose elements bJ are the BlockRank for each block J , measuring the relative importance of the blocks. We may now approximate the global PageRank of a page j ∈ J as its local PageRank lj , weighted by the BlockRank bJ of the block in which it resides. That is, pi (0) j = lj ì bJ In matrix notation: pi(0) = Lì b And pi(0) is used as the start vector for the regular PR algorithm. Conclusion of approach More specific, this approach is presented in Algorithm 2.2. 28 2.1 Accelerating problems Algorithm 2.2 Specific version of BlockRank 1: procedure BLOCKRANK(G) . G is the whole Web graph 2: Sort the web graph lexicographically . to induce intra-host block 3: for all block J do . compute the local PageRank vector of block J 4: lJ = PageRank(J) 5: end for 6: B = LTGS . B is a matrix 7: b = PageRank(B) . B is understood as a graph of blocks 8: pi(0) = Lì b 9: pi = PageRank(G) . pi(0) is used as the start vector 10: return pi 11: end procedure This approach has shown that the hyperlink graph of the web has a nested block structure, something that has not yet been thoroughly investigated in studies of the web. And, this structure is exploited to compute PageRank in a fast manner using an algorithm which is so-called BlockRank. As shown in Figure 2.1: Convergence rates for standard PageRank (solid line) vs. Block- Rank (dotted line). The x-axis is the number of iterations, and the y-axis is the log of the L1-residual. Figure 2.1, BlockRank speeds up PageRank computations by factors of 2 and higher, depending on the particular scenario. There are a number of areas for future work: finding the “best” blocks for BlockRank by splitting up what would be slow-mixing blocks with internal nested block structure; using the 29 2.2 Connected-component PageRank approach block structure for hyperlink-based algorithms other than web search, such as in clustering or classification; and exploring more fully the topics of updates and personalized PageRank. 2.2 Connected-component PageRank approach In this section, we will discuss about advantages of using connected compo- nents, how connected-component concept is applied in computation of PageR- ank and we will prove some work generated on the theory side. We also propose an approach which utilizes connected component idea from graph theory. The proposal is so-called CCP, a short for "Connected Component PageRank". 2.2.1 Initial ideas While taking care of Markov model [17], we pay special attention to a defini- tion that states in Markov chains can be divided into classes; states which are transposable are grouped into one class. This concept is similar to concept of connected components in graph theory. Moreover, as mentioned, the web graph is presented using adjacent matrix. This gives us a chance to apply connected component concept into PageRank computation. The proposed algorithm is given as follow. Algorithm 2.3 General steps of CCP algorithm 1: function CCP(G) . G is the Web graph 2: Split the web into blocks by connected component 3: Compute the local PageRanks for each block 4: Form an approximate global PageRank vector pi for web pages by com- bining eigen-vectors of blocks 5: end function Figure 2.2 illustrates the idea of CCP. 30 2.2 Connected-component PageRank approach (a) Unarranged adjacent matrix (b) Arranged adjacent matrix Figure 2.2: Unarranged matrix and arranged matrix Applying connected components into PageRank computation, we gain the following advantages. ơ Regarding computational complexity: our method follow the trend of re- ducing cost needed for computing PageRank. We already know that time expenditure for multiplying a vector with a matrix is O(n2) in which n is size of common dimension. For that reason, if we compute PageRank by constructing its adjacent matrix, we face a hard problem when n reaches billions. By using connected components in computing PageRank, we can utilize it to reduce time needed. Here are some fundamental details about how we use connected components in PageRank problem. Assume that our web graph has about k connected components, therefore we have k small adjacent matrices (we call blocks) obtained from matrix P . Let max be the biggest number of pages in an arbitrary block. For every block, time needed to compute PageRank vector is smaller or equals to O(n2max) and if compare to whole web graph, it is much smaller. For all blocks, complexity can not exceed the amount of k ì O(n2max) which is smaller than O(n2) where n is the size of the graph. Generally, on the theory side, our method is more efficient than trivial PageRank. ư Regarding convenience in finding connected components: In [18], we can get many effective algorithms for finding connected components of graphs. 31 2.2 Connected-component PageRank approach Time needed for determining connected components is O(m+n) in which m is the number of edges and n is the number of vertices. This additional time is not remarkable, compares to the previous. đ Regarding job of combining with other accelerating methods:We can have even a better performance if we run our method in hybrid mode with some other accelerating methods. In each block, we can use MAP or Ex- trapolation method to reduce computational time for that block, so the total reduced time is remarkable. ¯ In terms of programming techniques, [14] refers to a programming way which splits web graph matrix into small parts for computing in parallel. Using our method, we can have PageRank computation parallelized eas- ily. We can assign each block to one processor in mediocre case; for this, things are very natural. There certainly exists many ways that we can parallelize PageRank work and CCP is a very promising one. Notice that these advantages was not ensured by a theoretical proof or practi- cal experiments. If we split web graph matrix into smaller blocks corresponding to connected components, we will have to answer these questions: ơ Is CCP method correct? ư How do we use method for effectiveness? We will concentrate on more descriptive information of CCP method in the following sections: 3.2.2 and 3.2.3. 2.2.2 Mathematical basis of CCP How did we use connected components in computing PageRank vector? Firstly, we repeat the PageRank computation as what has been stated in chapter 1 but with a slight modification. Suppose that we have web graph G with corre- sponding adjacent matrix P . We do not remove all leak nodes, instead of that, we define P˜ = αP + (1− α) n J (2.1) 32 2.2 Connected-component PageRank approach And the principal eigen vector of P˜ is computed as follow: pi = piP˜ (2.2) Secondly, let k be the number of connected components in G, let Pi be the adjacent matrix size ni ì ni corresponding to the ith component, k∑ i=1 ni = n. Thus, we can re-arrange P as follow: P =   P1 0 ã ã ã 0 0 P2 ã ã ã 0 ... ... . . . ... 0 0 ã ã ã Pk   (2.3) We define matrices: P˜i = αPi + (1− α) ni Ji i = 1, k, Ji is ni ì ni size (2.4) For each block Pi, i = 1, k, formula for the principal eigen-vector is: pii = piiP˜i (2.5) Proposition 2.1. From (2.1), (2.2), (2.3), (2.4) and (2.5) ˜ = (n1 n pi1, n2 n pi2, . . . , nk n pik ) (2.6) is eigen-vector of matrix P˜ . Proof. To prove that ˜ is eigen-vector of matrix P˜ , we have to prove that˜ = (n1 n pi1, n2 n pi2, . . . , nk n pik ) satisfies equation (2.2). Replace ˜ = (n1 n pi1, n2 n pi2, . . . , nk n pik ) into equation (2.2), we have: ˜ = ˜P˜ = ˜ [αP + (1− α) n J ] = ˜  α   P1 0 ã ã ã 0 0 P2 ã ã ã 0 ... ... . . . ... 0 0 ã ã ã Pk  + (1− α) n J   (2.7) 33 2.2 Connected-component PageRank approach ⇔ (n1 n pi1, n2 n pi2, . . . , nk n pik ) = (n1 n pi1, n2 n pi2, . . . , nk n pik ) ì  αP1 + 1− α n J11 1− α n J12 ã ã ã 1− α n J1k 1− α n J21 αP2 + 1− α n J21 ã ã ã 1− α n J2k ... ... . . . ... 1− α n Jk1 1− α n J2k ã ã ã αPk + 1− α n Jkk   (2.8) in which Jij is ni ì nj size matrix of all 1s. Multiply the right side of equation (2.8) and consider the first component, we have: n1 n pi1 = n1 n pi1 ( αP1 + 1− α n J11 ) + n2 n pi2 1− α n J21 + ã ã ã+ nk n pik 1− α n Jk1 (2.9) For every block Pi, PageRank pii of that block satisfies equation (2.5), that means: pii = piiP˜i = pi i ( αPi + 1− α ni Ji ) (2.10) ⇔ ni n pii = ni n pii ( αPi + 1− α ni Ji ) (2.11) Let i be 1, we get: n1 n pi1 = n1 n pi1 ( αP1 + 1− α n1 J1 ) (2.12) From (2.9), (2.12) we have: n1 n pi1 ( αP1 + 1− α n1 J1 ) = n1 n pi1 ( αP1 + 1− α n J11 ) + n2 n pi2 1− α n J21+ã ã ã+ nk n pik 1− α n Jk1 (2.13) Since J1 = J11, so (2.13)⇔ pi1J11 = n1 n pi1J11 + n2 n pi2J21 + ã ã ã+ nk n pikJk1 (2.14) ⇔ 1nì1 = 1nì1 ì k∑ i=1 ni n in which 1nì1 is a column vector of all 1s (2.15) 34 2.2 Connected-component PageRank approach Note that, in our definition, k∑ i=1 ni = n so (2.15) is always true. Do the same process to component i, i = 2, k, we get the expected result. Consequently, the proposition is true. We have the proposition proved so the first question is answered, CCP is true. In the next section, we briefly suggest a way of using CCP effectively. 2.2.3 On practical side of CCP CCP method can be divided into three main steps as follow: ơ First step: Get connected components from whole web graph and gener- ate corresponding adjacent graph, ư Second step: Compute each block’s eigen-vector to get PageRank vector, đ Third step: Combine the last PageRank vector as shown in Proposition 2.1. A more detailed version of CCP can be found in Algorithm 2.4. Algorithm 2.4 Connected-component PageRank algorithm 1: function CCP(G) . G is the whole Web graph 2: Construct matrix P . P is the adjacent matrix of G 3: Determine connected components Pi in G . i = 1, k 4: for all Pi do 5: compute Pi’s eigen-vector pii . Pi is the ith connected component 6: compute ni n pii . ni ì ni is the size of block Pi 7: end for 8: ˜i ← ni n pii . ˜ = (n1 n pi1, n2 n pi2, . . . , nk n pik ) 9: return ˜ 10: end function Note that we have introduced some advantages in the previous section but all of them are obvious or induced from other approaches. To be listed here are major advantages of proposal over standard PageRank algorithm. 35 2.2 Connected-component PageRank approach Advantage 1 The major speedup that this algorithm can obtain comes from one of the main directions in researching of PageRank – how to fit PageR- ank problem into memory. All of the blocks in our crawl are small enough so that each block graph fits in main memory, and the vector of ranks for the active block largely fits in the CPU cache. As the full graph does not fit entirely in main memory, the local PageRank (of each block) itera- tions thus require less disk I/O than the global computations. Indeed, if we manipulate the crawling process, specially designed for CCP, as the input instead to the standard PageRank algorithm, the result would be much better. Advantage 2 In CCP algorithm, the local PageRank vectors for many blocks will converge quickly; thus the computations of those blocks may be ter- minated after only a few iterations. This increases the effectiveness of the local PageRank computation by allowing it to expend more computation on slowly converging blocks, and less computation on faster converging blocks. We will discuss more on this advantage in Chapter 3. In the stan- dard PageRank algorithm, iterations operate on the whole graph; thus the convergence bottleneck is largely due to the slowest blocks. Much computation is wasted recomputing the PageRank of blocks whose local computation has already converged. Advantage 3 The local PageRank computations in Second Step of the CCP can be computed in a completely parallel or distributed fashion. That is, the local PageRanks for each block Pi can be computed on a separate pro- cessor, or computer. The only communication required is that, at the end of Second step, each computer should send their local PageRank vector to a central computer that will compute the global PageRank vector. If our graph consists of n total pages, the net communication cost consists of 8n bytes (if using 8-byte double precision floating point values). Naive paral- lelization of the computation that does not exploit block structure would require a transfer of 8nbytes after each iteration, a significant penalty. 36 2.3 Spam and spam detection 2.3 Spam and spam detection Web spamming refers to actions intended to mislead search engines into rank- ing some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. We will get through a comprehensive taxonomy of current spamming techniques and some fighting-spam techniques in this section. We use the term spamming (also, spamdexing) to refer to any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page’s true value. In com- mon, the word spam is used to mark all those web objects (page content items or links) that are the result of some form of spamming. People who perform spamming are called spammers [11, 10]. 2.3.1 Introduction to Web spam As more and more people rely on the wealth of information available online, increased exposure on the WWW may yield significant financial gains for indi- viduals or organizations. Most frequently, search engines are the entryways to the Web; that is why some people try to mislead search engines, so that their pages would rank high in search results, and thus, capture user attention. Just as with emails, we can talk about the phenomenon of spamming the Web. The primary consequence of web spamming is that the quality of search results decreases. For instance, a site may contain only a few lines of useful in- formation, but consists of thousands of pages, each repeating the same content and pointing to dozens of other pages. All the pages in that site were probably created to boost the rankings of some others, and none of them seems to be particularly useful for anyone looking for useful information. The secondary consequence of spamming is that search engine indexes are in inflated with useless pages, increasing the cost of each processed query. To provide low-cost, quality services, it is critical for search engines to ad- dress web spam. Search engines currently fight spam with a variety of often manual techniques, but as far as we know, they still lack a fully effective set of tools for combating it. The very first step in combating spam is understand- 37 2.3 Spam and spam detection ing it that is analyzing the techniques the spammers use to mislead search engines. A proper understanding of spamming can then guide the develop- ment of appropriate countermeasures. To that end, in [10], web spamming techniques are organized into a taxonomy that can provide a framework for combating spam, [10] gives us very useful information about spam such as ways that spams are created, in-depth discussion about spam... Furthermore, Zoltỏn G. et. al. [10] proposes a method, namely TrustRank, to combat web spam. A newly created spam technique, blog spam, is treated well with Lan- guage Model Disagreement in [15]. One can also find details for several specific techniques on the Web itself. 2.3.2 Spam techniques To build a taxonomy, authors of [10] worked closely with experts at one of the major search engine companies, relying on their experience, while at the same time investigating numerous spam instances on their own. As a sep- arated work, [11] agrees with [10] on almost ideas. We will investigate the common parts of these papers to have a basic understanding of web spam. There are two categories of techniques associated with web spam. The first category includes the boosting techniques, i.e., methods through which one seeks to achieve high relevance and/or importance for some pages. The second category includes hiding techniques, methods that by themselves do not in influence the search engine’s ranking algorithms, but that are used to hide the adopted boosting techniques from the eyes of human web users. Term spamming Term spamming [10], or text spamming [11], is the way that web pages are manipulated with redundancy of some specific terms to get higher ranks or tailored with text fields in order to make spam pages relevant for some queries. As we had known, when a query is issued, its content and contents of web pages will be evaluated to get the relevance score. Relevance score is calculated based on term frequencies, term positions... 5 In which, term positions and term frequencies are the relatively easy criteria to be modified so that pages would get more ranks. In evaluating relevance, term positions receive more 5Refer to chapter 1 for more details 38 2.3 Spam and spam detection Figure 2.3: Boosting techniques attention, [10] calls each type of location that a term appears in a field. The common text fields for a page p are the document body, the title, the meta tags in the HTML header, and page p’s URL. In addition, the anchor texts associated with URLs that point to p are also considered belonging to page p (anchor text field), since they often describe very well the content of p. The terms in p’s text fields are used to determine the relevance of p with respect to a specific query (a group of query terms), often with different weights given to different fields. Authors of [10, 11] mention the following techniques which are grouped based on the text field in which the spamming occurs. They are: • Body spam: In this case, the spam terms are included in the document body. This spamming technique is among the simplest and most popular ones, and it is almost as old as search engines themselves. • Title spam: Today’s search engines usually give a higher weight to terms 39 2.3 Spam and spam detection that appear in the title of a document. Hence, it makes sense to include the spam terms in the document title. • Meta tag spam: The HTMLmeta tags that appear in the document header have always been the target of spamming. Because of the heavy spam- ming, search engines currently give low priority to these tags, or even ignore them completely. • Anchor text spam: Just as with the document title, search engines as- sign higher weight to anchor text terms, as they are supposed to offer a summary of the pointed document. Therefore, spam terms are sometimes included in the anchor text of the HTML hyperlinks to a page. Please note that this spamming technique is different from the previous ones, in the sense that the spam terms are added not to a target page itself, but the other pages that point to the target. As anchor text gets indexed for both pages, spamming it has impact on the ranking of both the source and target pages. • URL spam: URLs of pages may be broken down by search engines in order to compute relevance scores. Spammers can exploit this by creating long URLs which contains spam terms. We have introduced some positions that spammers use to achieve better ranks for their web pages. Now, considering the number of occurrences of terms, we distinguish some ways: • Repetition of one or a few specific words in content of page. If an issued query is relevant, this page will receive higher relevance score. • Dumping a large amount of words so that the page would be relevant to many different queries. • Weaving spam terms into a copied content. A copied content can be news from a reliable source in which spam terms are randomly inserted. • Phrase stitching is the technique in which sentences or phrases from many different sources are weaved. To this method, spammers could get relevance to many kinds of queries because sentences are collected from different sources. 40 2.3 Spam and spam detection Link spamming Web spam involves in not only term spamming but also link spamming. Many major search engines use link-based algorithm to determine ranks of web pages and the most famous is PageRank. Many spammers create link structure which will exploit link-based algorithm to obtain better seats in search result. Spammers can use outgoing and incoming links as tools to make their goal come true. They can add a number of outgoing links to a popular page or gather many incoming links to a specific page or a group of pages. Outgoing links Numerous outgoing links might be added to popular pages in hope of spam- mers to get better hub scores for their pages. A spammer can use directory cloning technique to reach this goal. On the WWW, there are some directory sites which contain links of many topics. For example, DMOZ is one of the most famous in which we can find content of many topics and sub topics by navi- gating link lists. Spammers copy this idea and create their own directory web which resembles a popular directory web, it is called directory cloning. By this action, big amount of outgoing links are easily generated. Incoming links Spammers might deploy one of the following strategies: • Using a honey pot as a tool to attract users. A honey pot may be a page which provides free, pirated softwares... or useful resources, in general. This honey pot also contains links to spam page(s) and links are hidden from users. The more people uses this honey pot, the more it is popular. So many users may desire to link to this honey pot. When users point their links to the honey pot, they indirectly increase the ranking of spam page(s). Note that directory sites can be used as a honey pot, too. • Infiltrate web directory: Many web directories allow people to post links on their directories. But it might happen that editors of these directories do not strictly manage the given contents. So spammers would post links to target pages to get boosted in both PageRank and authorities scores. 41 2.3 Spam and spam detection • Post links to blogs, wikis... This is a hard problem that search engine could resolve. Links can be given directly to blog comments, wikis... or can be hidden by using anchor text. For this technique, spammers would get their target pages more popular. • Exchange links: A group of spammers would create a link structure in which pages are circle-pointed. Source and target pages in this strategy will fly higher in both PageRank and popularity. • Using expired domains: When a domain is newly out-of-date, spammers will buy it. Since that domain has just expired, incoming links to it still exist. Moreover, spammers can populate this old domain with some tricks. Then, spammers will create links from this expired domain to their target pages. Thus, target pages takes advantages from old links to an expired domain. • Create spam farm: Spammers, nowadays, have control of many sites and they can create link structure that boost the ranking of target pages. Link spamming is used more and more because PageRank algorithm is still deployed by Google. For that reason, many approaches appear to treat with this arisen problem. Conclusion of chapter This chapter has discussed about some problems which arose with PageRank computation. The underlined problem is accelerating problem which involves in many fields. We can benefit from storage techniques to reduce time taken to compute PageRank computation. Besides, we can take advantages from the structure of the web to speed up the computational time. Furthermore, nu- merical methods give us many promising approaches by which we just have to modify a little the way of programming to get a nice result. In the next chapter, experimental results are presented so that we will have a clearer picture of how PageRank and other methods are handled. 42 CHAPTER Implementations and experimental results Nowadays, hardly can a newly born result in any informatics field achieve agreement if it does not have experimental figures. This the- sis is not out of that stream. In this chapter, introduction of an or- dinary search engine, Nutch, is shown. We used Nutch, with some modifications and additions in code, to crawl data then the obtained data was processed to get experimental results. 3.1 Search engine Nutch As all of us know, building a search engine is very, very hard because of enor- mous amount of works need to be done. Thus, such effective search engine as Google is so successful in the market. But every effort made is tending towards making more and more profit, not for academic purpose. How could researchers, who want to do research about search engine, make their experi- ments? The answer may be Nutch [2]. This section concentrates on some gen- eral information about Nutch and the work needed to utilize Nutch for our 43 3.1 Search engine Nutch purpose. Nutch is an open source Java implementation of a search engine. It provides all of the tools you need to run your own search engine. But why would anyone want to run their own search engine? After all, there is always Google. There are at least three reasons. ơ Transparency:Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rank- ings to be based on payments, rather than on the relevance of the site’s contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important. ư Understanding: We do not have the source code to Google, so Nutch is probably the best we have. It is interesting to see how a large search en- gine works. Nutch has been built using ideas from academia and indus- try: for instance, core parts of Nutch are currently being reimplementa- tion to use the Map Reduce distributed processing model, which emerged from Google Labs last year. And Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend. đ Extensibility: Do not like the way other search engines display their re- sults? Write your own search engine–using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For develop- ers, Nutch is a great platform for adding search to heterogeneous collec- tions of information, and being able to customize the search interface, or extend the out-of-the-box functionality through the plug-in mechanism. For example, you can integrate it into your site to add a search capability. Nutch installations typically operate at one of three scales: local file system, intranet, or whole web. All three have different characteristics. For instance, crawling a local file system is reliable compared to the other two, since network errors do not occur and caching copies of the page content is unnecessary (and actually a waste of disk space). Whole web crawling lies at the other extreme. Crawling billions of pages creates a whole host of engineering problems to be 44 3.1 Search engine Nutch solved: which pages do we start with? How do we partition the work between a set of crawlers? How often do we recrawl? How do we cope with broken links, unresponsive sites, and unintelligible or duplicate content? There is another set of challenges to solve to deliver scalable search–how do we cope with hun- dreds of concurrent queries on such a large dataset? This portion tells us about the architecture of Nutch. Nutch has a highly modular architecture that uses plug-in APIs for media-type parsing, HTML analysis, data retrieval protocols, and queries. The core has four major compo- nents: • Searcher:Given a query, it must quickly find a small relevant subset of a corpus of documents, then present them. Finding a large relevant subset is normally done with an inverted index of the corpus; ranking within that set to produce the most relevant documents, which then must be summarized for display. • Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene storing indexes. • Database: Stores the document contents for indexing and later summa- rization by the searcher, along with information such as the link struc- ture of the document space and the time each document was last fetched. • Fetcher:Requests web pages, parses them, and extracts links from them. Nutch’s robot has been written entirely from scratch. Figure 3.1 outlines the relationships between elements that refer on each other, placing them in the same box, and those they depend on in a lower layer. For example, protocol does not depend on net, because protocol is only an interface point to plug-ins that actually provide much of Nutch’s functionality. Crawling An intranet or niche search engine might only take a single machine a few hours to crawl, while a whole-web crawl might take many machines several weeks or longer. A single crawling cycle consists of generating a fetch list from the webdb, fetching those pages, parsing those for links, then updating the 45 3.1 Search engine Nutch Figure 3.1: Nutch packages dependencies webdb. Nutch’s crawler supports both a crawl-and-stop and crawl-and-stop- with-threshold (which requires feedback from scoring and specifying a floor). It also uses a uniform refresh policy; all pages are refetched at the same inter- val (30 days, by default) regardless of how frequently they change. The fetching process must also respect bandwidth and other limitations of the target web- site. However, any polite solution requires coordination before fetching; Nutch uses the most straightforward localization of references possible: namely, mak- ing all fetches from a particular host run on one machine. Indexing text Lucenemeets the scalability requirements for text indexing in Nutch. Nutch also takes advantage of Lucene’s multi-field case-folding keyword and phrase search in URLs, anchor text, and document text. The typical definition of Web search does not require some index types Lucene does not address, such as regular-expression matching (using, say, suffix trees) or document-similarity clustering (using, say, term-vectors). Indexing hypertext Lucene provides an inverted-file full-text index, which suffices for indexing text but not the additional tasks required by a web search engine. In addition to this, Nutch implements a link database to provide efficient access to the Web’s link graph, and a page database that stores crawled pages for indexing, 46 3.1 Search engine Nutch summarizing, and serving to users, as well as supporting other functions such as crawling and link analysis. Nutch also has to add support for HTML extrac- tion to Lucene. When a page is fetched, any embedded links are considered for addition to the fetch list and that link’s anchor text is also stored. What is eventually indexed by Lucene is only the text of a Web page, though: while a high-fidelity copy is stored in the page database, font, heading, and other structural information do not propagate to the Lucene indexing layer. Removing duplicates The Nutch dedup command eliminates duplicate documents from a set of Lucene indices for Nutch segments, so it inherently requires access to all the segments at once. It’s a batch-mode process that has to be run before running searches to prevent the search from returning duplicate documents. It uses temporary files containing the 5-tuple MD5 hash, float score, int indexID, int docID, int urlLen for each page. Presumably the documents with the same URLs were fetched at different times, so Nutch tries to sort the records so that the ones for the newest fetches come first. A second pass, using hash=MD5(content) and with slightly different sorting rules, eliminates multiple URLs for the same docu- ment from the index. Link analysis Nutch includes a link analysis algorithm similar to PageRank. It is per- formed by the DistributedAnalysisTool; even the single-machine LinkAnaly- sisTool merely calls into it. Nutch has already demonstrated the ability to harness multiple servers to compute link ranking for 100Mpage subsets of the World Wide Web. Distributed link analysis is a bulk synchronous parallel process. At the beginning of each phase, the list of URLs whose scores must be updated is divided up into many chunks; in the middle, many processes produce score-edit files by finding all the links into pages in their particular chunk. At the end, an updating phase reads the score-edit files one at a time, merging their results into new scores for the pages in the web database. Because of its promising features, Nutch can be used for academic purposes as previously indicated. In the next section, we will discuss in detail about 47 3.2 Experiments experiments done with trivial PageRank and CCP. 3.2 Experiments Nowadays, almost all papers in informatics field need experimental figures to show the correctness of a new idea or advantages that it takes over other methods. This section covers experiments done by the author of the thesis and through figures, CCP insists that it is a promising approach which reduces the computational time and number of iterations needed in PageRank compu- tation. 3.2.1 Infrastructure and data sets All experiments were done with the IBM cluster 1350 from Center for High Performance Computing, Hanoi University of Science. The IBM cluster con- sists of 8 computing nodes, each with 2 Xeon processors at 3.2GHz, 2GB of RAM. Additionally, the total storage capacity that the cluster can use is about 1TB. Web pages are downloaded from many sources. These sources are web sites of many universities from all over the world: from US, from UK or fr

Các file đính kèm theo tài liệu này:

MSc07_Nguyen_Hoai_Nam_Thesis_English.pdf