Folksonomies - cooperative classification and communication through shared metadata
, 2004
Cited by 160 (0 self)
This paper examines user-generated metadata as implemented and applied in two web services designed to share and organize digital media to better understand grassroots classification. Metadata (data about data) allows systems to collocate related information, and helps users find relevant information. The creation of metadata has generally been approached in two ways: professional creation and author creation. In libraries and other organizations, creating metadata, primarily in the form of catalog records, has traditionally been the domain of dedicated professionals working with complex, detailed rule sets and vocabularies. The primary problem with this approach is scalability and its impracticality for the vast amounts of content being produced and used, especially on the World Wide Web. The apparatus and tools built around professional cataloging systems are generally too complicated for anyone without specialized training and knowledge. A second approach is for metadata to be created by authors. The movement towards creator-described documents was heralded by SGML, the WWW, and the Dublin Core Metadata Initiative. There are problems with this approach as well, often due to inadequate or inaccurate description, or outright deception. This paper examines a third approach: user-created metadata, where users of the documents and media create metadata for their own individual use that is also shared throughout a community.
SimRank: A Measure of Structural-Context Similarity
- In KDD
, 2002
Cited by 157 (4 self)
The problem of measuring "similarity" of objects arises in many applications, and many domain-specific measures have been developed, e.g., matching text across documents or computing overlap among item-sets. We propose a complementary approach, applicable in any domain with object-to-object relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects. Effectively, we compute a measure that says "two objects are similar if they are related to similar objects." This general similarity measure, called SimRank, is based on a simple and intuitive graph-theoretic model. For a given domain, SimRank can be combined with other domain-specific similarity measures. We suggest techniques for efficient computation of SimRank scores, and provide experimental results on two application domains showing the computational feasibility and effectiveness of our approach.
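The idea that "two objects are similar if they are related to similar objects" can be sketched as a fixed-point iteration over a directed graph. The code below is a minimal illustration, not the paper's implementation: it uses the recurrence commonly given for SimRank, which the abstract does not spell out. The function name `simrank`, the `decay` value of 0.8, and the iteration count are assumptions chosen for the example.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Estimate pairwise SimRank-style scores by naive iteration.

    in_neighbors maps each node to the list of nodes that point to it.
    decay and iterations are illustrative defaults (assumptions).
    """
    nodes = list(in_neighbors)
    # Start from the identity: a node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node with no in-neighbors has no structural context.
                new_sim[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, then decay.
            total = sum(sim[(x, y)] for x in ia for y in ib)
            new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Hypothetical three-node graph: "hub" links to both "a" and "b".
graph = {"hub": [], "a": ["hub"], "b": ["hub"]}
scores = simrank(graph)
print(scores[("a", "b")])
```

In this toy graph "a" and "b" share their only in-neighbor, so their score equals the decay factor. The naive loop touches every pair of nodes on every pass, which is why efficient computation techniques matter for larger graphs.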
Implicit Feedback for Inferring User Preference: A Bibliography
, 2003
Cited by 152 (11 self)
... In this paper we consider the use of implicit feedback techniques for query expansion and user profiling in information retrieval tasks. These techniques unobtrusively obtain information about users by watching their natural interactions with the system. Some of the user behaviors that have been most extensively investigated as sources of implicit feedback include reading time, saving, printing and selecting. The primary advantage to using implicit techniques is that such techniques remove the cost to the user of providing feedback. Implicit measures are generally thought to be less accurate than explicit measures [Nic97], but as large quantities of implicit data can be gathered at no extra cost to the user, they are attractive alternatives. Moreover, implicit measures can be combined with explicit ratings to obtain a more accurate representation of user interests. Implicit
Trust Management for the Semantic Web
- IN PROCEEDINGS OF THE SECOND INTERNATIONAL SEMANTIC WEB CONFERENCE
, 2003
Cited by 152 (3 self)
Though research on the Semantic Web has progressed at a steady pace, its promise has yet to be realized. One major difficulty is that, by its very nature, the Semantic Web is a large, uncensored system to which anyone may contribute. This raises
A large-scale study of the evolution of web pages
- In Proceedings of the 12th International World Wide Web Conference
, 2003
Cited by 144 (5 self)
How fast does the web change? Does most of the content remain unchanged once it has been authored, or are the documents continuously updated? Do pages change a little or a lot? Is the extent of change correlated to any other property of the page? All of these questions are of interest to those who mine the web, including all the popular search engines, but few studies have been performed to date to answer them. One notable exception is a study by Cho and Garcia-Molina, who crawled a set of 720,000 pages on a daily basis over four months, and counted pages as having changed if their MD5 checksum changed. They found that 40% of all web pages in their set changed within a week, and 23% of those pages that fell into the .com domain changed daily. This paper expands on Cho and Garcia-Molina’s study, both in terms of coverage and in terms of sensitivity to change. We crawled a set of 150,836,209 HTML pages once every week, over a span of 11 weeks. For each page, we recorded a checksum of the page, and a feature vector of the words on the page, plus various other data such as the page length, the HTTP status code, etc. Moreover, we pseudo-randomly selected 0.1% of all of our URLs, and saved the full text of each download of the corresponding pages. After completion of the crawl, we analyzed the degree of change of each page, and investigated which factors are correlated with change intensity. We found that the average degree of change varies widely across top-level domains, and that larger pages change more often and more severely than smaller ones. This paper describes the crawl and the data transformations we performed on the logs, and presents some statistical observations on the degree of change of different classes of pages.
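The checksum-based notion of change mentioned above (a page counts as changed if its MD5 checksum differs between downloads) can be illustrated with a short sketch. This is not the crawler or analysis code from either study; the function names and the in-memory dictionaries standing in for crawl snapshots are assumptions made for the example.

```python
import hashlib


def checksum(page_bytes):
    """Return the MD5 hex digest of a page's raw bytes."""
    return hashlib.md5(page_bytes).hexdigest()


def changed_fraction(before, after):
    """Fraction of URLs present in both snapshots whose checksum differs.

    before and after map URL -> page bytes (hypothetical snapshot format).
    """
    common = before.keys() & after.keys()
    if not common:
        return 0.0
    changed = sum(
        1 for url in common if checksum(before[url]) != checksum(after[url])
    )
    return changed / len(common)


# Two hypothetical weekly snapshots of the same two URLs.
week1 = {"http://example.com/a": b"hello", "http://example.com/b": b"old"}
week2 = {"http://example.com/a": b"hello", "http://example.com/b": b"new"}
print(changed_fraction(week1, week2))
```

A checksum only says whether anything changed, not how much, which is why the study also records a feature vector of the words on each page to measure the degree of change.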
An Introduction to MCMC for Machine Learning
, 2003
Cited by 141 (2 self)
The purpose of this introductory paper is threefold. First, it introduces the Monte Carlo method with emphasis on probabilistic machine learning. Second, it reviews the main building blocks of modern Markov chain Monte Carlo simulation, thereby providing an introduction to the remaining papers of this special issue. Lastly, it discusses new interesting research horizons.
Web Spam Taxonomy
, 2005
Cited by 140 (2 self)
Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures.
Mining Knowledge-Sharing Sites for Viral Marketing
, 2002
Cited by 138 (7 self)
Viral marketing takes advantage of networks of influence among customers to inexpensively achieve large changes in behavior. Our research seeks to put it on a firmer footing by mining these networks from data, building probabilistic models of them, and using these models to choose the best viral marketing plan. Knowledge-sharing sites, where customers review products and advise each other, are a fertile source for this type of data mining. In this paper we extend our previous techniques, achieving a large reduction in computational cost, and apply them to data from a knowledge-sharing site. We optimize the amount of marketing funds spent on each customer, rather than just making a binary decision on whether to market to him. We take into account the fact that knowledge of the network is partial, and that gathering that knowledge can itself have a cost. Our results show the robustness and utility of our approach.
Dirichlet Reputation Systems
- INTERNATIONAL CONFERENCE ON AVAILABILITY, RELIABILITY AND SECURITY
, 2007
Cited by 124 (10 self)
Reputation systems can be used in online markets and communities in order to stimulate quality and good behaviour as well as to sanction poor quality and bad behaviour. The basic idea is to have a mechanism for rating services on various aspects, and a way of computing reputation scores based on the ratings from many different parties. By making the reputation scores public, such systems can assist parties in deciding whether or not to use a particular service. Reputation systems represent soft security mechanisms for social control. This article presents a type of reputation system based on the Dirichlet probability distribution which is a multinomial Bayesian probability distribution. Dirichlet reputation systems represent a generalisation of the binomial Beta reputation system. The multinomial aspect of Dirichlet reputation systems means that any set of discrete rating levels can be defined. This provides great flexibility and usability, as well as a sound basis for designing reputation systems.
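As a rough illustration of how ratings over several discrete levels might be turned into a score, the sketch below computes the posterior mean of a Dirichlet distribution given rating counts. It is a minimal example under stated assumptions, not the paper's system: the uniform base rate, the `prior_weight` value of 2.0, and the function name `dirichlet_scores` are choices made here for illustration.

```python
def dirichlet_scores(rating_counts, prior_weight=2.0):
    """Posterior mean probability of each rating level.

    rating_counts[i] is the number of ratings received at level i.
    prior_weight controls how strongly the uniform prior counts against
    observed ratings (an assumed value, not taken from the abstract).
    """
    levels = len(rating_counts)
    if levels == 0:
        raise ValueError("at least one rating level is required")
    base_rate = 1.0 / levels  # uniform prior over rating levels (assumption)
    total = sum(rating_counts)
    return [
        (count + prior_weight * base_rate) / (total + prior_weight)
        for count in rating_counts
    ]


# Five hypothetical rating levels, eight ratings all at the top level.
print(dirichlet_scores([0, 0, 0, 0, 8]))
```

With no ratings at all, every level gets the same score, and the scores shift toward the observed levels as ratings accumulate. Restricting the input to two levels reduces this to the binomial (Beta) case that the Dirichlet approach generalises.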
On the Feasibility of Peer-to-Peer Web Indexing and Search
- IN IPTPS’03
, 2003
Cited by 121 (11 self)
This paper discusses the feasibility of peer-to-peer full-text keyword search of the Web. Two classes of keyword search techniques are in use or have been proposed: flooding of queries over an overlay network (as in Gnutella), and intersection of index lists stored in a distributed hash table. We present a simple feasibility analysis based on the resource constraints and search workload. Our study suggests that the peer-to-peer network does not have enough capacity to make naive use of either of these search techniques attractive for Web search. The paper presents a number of existing and novel optimizations for P2P search based on distributed hash tables, estimates their effects on performance, and concludes that in combination these optimizations would bring the problem to within an order of magnitude of feasibility. The paper suggests a number of compromises that might achieve the last order of magnitude.

