Trawling the Web for Emerging Cyber-Communities
- Computer Networks
, 1999
Cited by 257 (7 self)
The web harbors a large number of communities -- groups of content-creators sharing a common interest -- each of which manifests itself as a set of interlinked web pages. Newsgroups and commercial web directories together contain on the order of 20,000 such communities; our particular interest here is in emerging communities -- those that have little or no representation in such fora. The subject of this paper is the systematic enumeration of over 100,000 such emerging communities from a web crawl: we call our process trawling. We motivate a graph-theoretic approach to locating such communities, and describe the algorithms and the algorithmic engineering necessary to find structures that subscribe to this notion, the challenges in handling such a huge data set, and the results of our experiment.
Keywords: web mining, communities, trawling, link analysis
1. Overview The web has several thousand well-known, explicitly-defined communities -- groups of individuals who share a common int...
Shaping the Web: why the politics of search engines matters
- to appear, The Information Society 16
, 2000
Cited by 27 (0 self)
This article argues that search engines raise not merely technical issues but also political ones. Our study of search engines suggests that they systematically exclude (in some cases by design and in some, accidentally) certain sites and certain types of sites in favor of others, systematically giving prominence to some at the expense of others. We argue that such biases, which would lead to a narrowing of the Web’s functioning in society, run counter to the basic architecture of the Web as well as to the values and ideals that have fueled widespread support for its growth and development. We consider ways of addressing the politics of search engines, raising doubts whether, in particular, the market mechanism could serve as an acceptable corrective.
Keywords: search engines, bias, values in design, World Wide Web, digital divide, information access
The Internet, no longer merely an e-mail and file-sharing system, has emerged as a dominant interactive medium.
Received 17 July 1997; accepted 24 November 1998. We are indebted to many colleagues for commenting on and questioning earlier versions of this article: audiences at the conference “Computer Ethics: A Philosophical Enquiry,” London; members of the seminars at the Kennedy School of Government, Harvard University,
Adding Geographic Scopes to Web Resources
- CEUS - Computers, Environment and Urban Systems
, 2006
Cited by 16 (6 self)
Many Web pages are rich in geographic information and primarily relevant to geographically limited communities. However, existing IR systems only recently began to offer local services and largely ignore geo-spatial information. This paper presents our work on automatically identifying the geographical scope of Web documents, which provides the means to develop retrieval tools that take the geographical context into consideration. Our approach makes extensive use of an ontology of geographical concepts, and includes a system architecture for extracting geographic information from large collections of Web documents. The proposed method involves recognising geographical references over the documents and assigning geographical scopes through a graph ranking algorithm. Initial evaluation results are encouraging, indicating the viability of this approach.
Surfing the Web Backwards
- In: Proc. of WWW 8 Conference
, 1999
Cited by 14 (4 self)
From a user’s perspective, hypertext links on the web form a directed graph between distinct information sources. We investigate the effects of discovering “backlinks” from web resources, namely links pointing to the resource. We describe tools for backlink navigation on both the client and server side, using an applet for the client and a module for the Apache web server. We also discuss possible extensions to the HTTP protocol to facilitate the collection and navigation of backlink information in the world wide web.
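As a rough illustration of what a backlink is, the sketch below inverts a small forward-link map so that each page lists the pages pointing to it. The page names and the `build_backlinks` helper are hypothetical and are not taken from the paper, which describes a client applet and an Apache module rather than this code.

```python
from collections import defaultdict


def build_backlinks(links):
    """Invert a forward-link map {source: [targets]} into {target: [sources]}.

    Hypothetical helper for illustration only; not the paper's tooling.
    """
    backlinks = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            backlinks[target].add(source)
    # Sort the sources so the result is deterministic.
    return {page: sorted(sources) for page, sources in backlinks.items()}


# Made-up link graph: a.html links to b.html and c.html, b.html links to c.html.
links = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"], "c.html": []}
backlink_index = build_backlinks(links)
```

Pages that nothing links to, such as a.html here, simply do not appear in the inverted index.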
Recurrent neural network learning for text routing
- In Proceedings of the International Conference on Artificial Neural Networks
, 1999
Cited by 11 (4 self)
This paper describes new recurrent plausibility networks with internal recurrent hysteresis connections. These recurrent connections in multiple layers encode the sequential context of word sequences. We show how these networks can support text routing of noisy newswire titles according to different given categories. We demonstrate the potential of these networks using an 82,339-word corpus from the Reuters newswire, reaching recall and precision rates above 92%. In addition, we carefully analyze the internal representation using cluster analysis and output representations using a new surface error technique. In general, based on the current recall and precision performance, as well as the detailed analysis, we show that recurrent plausibility networks hold a lot of potential for developing learning and robust newswire agents for the internet.
MSEEC - A multi search engine with multiple clustering
- In: Proceedings of the 99 Information Resources Management Association International Conference
, 1999
Cited by 8 (0 self)
This paper presents a scalable architecture for a multi search engine for web documents with multiple cluster algorithms (MSEEC[12]). Querying search engines on the web may return an overwhelming number of matching documents. Clustering techniques are used to find a set of similar documents, which are presented using a suitable cluster title. The scalable and modular architecture of MSEEC allows information to be processed with multiple cluster algorithms, presenting alternative clusters and the related cluster titles to the user. This paper also presents a novel clustering technique that is based on the LZW compression method.
An efficient algorithm to rank Web resources
, 2000
Cited by 6 (0 self)
Ranking Web resources is critical to Web resource discovery (search engines). This paper not only points out the weakness of current approaches, but also presents an in-depth analysis of the multidimensionality and subjectivity of rank algorithms. From a dynamics viewpoint, this paper abstracts a user's Web surfing action as a Markov model. Based on this model, we propose a new rank algorithm. The result of our rank algorithm, which synthesizes the relevance, authority, integrativity and novelty of each Web resource, can be computed efficiently not by iteration but by solving a group of linear equations.
© 2000 Published by Elsevier Science B.V. All rights reserved.
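To make "a Markov model solved through linear equations rather than iteration" concrete, here is a minimal sketch under stated assumptions. It uses a generic damped random-surfer model, which is an assumption and not the paper's algorithm (the paper's combination of relevance, authority, integrativity and novelty is not reproduced). The score vector r satisfies (I - d P^T) r = ((1 - d) / n) 1, where P holds the link-following probabilities. The function names, the damping value 0.85 and the tiny graph are all hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rank_by_linear_system(links, damping=0.85):
    """Score pages by solving (I - d * P^T) r = (1 - d) / n directly.

    Assumed model: a surfer follows a random out-link with probability
    `damping`, otherwise jumps to a uniformly random page. Every link
    target must also appear as a key of `links`.
    """
    pages = sorted(links)
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for source, targets in links.items():
        # A page without out-links is treated as linking to every page.
        outs = targets if targets else pages
        for target in outs:
            a[index[target]][index[source]] -= damping / len(outs)
    b = [(1.0 - damping) / n] * n
    return dict(zip(pages, solve_linear(a, b)))


# Made-up three-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = rank_by_linear_system(graph)
```

A direct solve is practical for small graphs like this one; how the paper keeps the computation efficient at Web scale is not described in the abstract.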
Transposition of the cocitation method with a view to classifying web pages
- Journal of the American Society for Information Science and Technology
, 2004
Cited by 3 (1 self)
The Web is a huge source of information, and one of the main problems facing users is finding documents which correspond to their requirements. Apart from the problem of thematic relevance, the documents retrieved by search engines do not always meet the users’ expectations. The document may be too general, or conversely too specialized, or of a different type from what the user is looking for, and so forth. We think that adding metadata to pages can considerably improve the process of searching for information on the Web. This article presents a possible typology for Web sites and pages, as well as a method for propagating metadata values, based on the study of the Web graph and more specifically the method of cocitation in this graph.
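For readers unfamiliar with cocitation, the following sketch counts how often two pages are linked from the same source page, which is the basic cocitation measure on a link graph. The `cocitation_counts` helper and the example graph are hypothetical; the article's typology and its metadata propagation step are not shown.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(links):
    """Count, for each pair of pages, how many source pages link to both.

    `links` maps a source page to the pages it links to. Illustrative only.
    """
    counts = Counter()
    for targets in links.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(targets)), 2):
            counts[pair] += 1
    return counts


# Made-up graph: three source pages citing overlapping sets of targets.
citing = {"p1": ["x", "y"], "p2": ["x", "y", "z"], "p3": ["y", "z"]}
cocited = cocitation_counts(citing)
```

Pairs with a high count are cited together often, which is the kind of signal a cocitation-based method could use to decide that two pages are related.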
Virtual www documents: A concept to explicit the structure of www sites
- In Proceedings of the 21st Colloquium on Information Retrieval. BCS-IRSG
, 1999
Cited by 1 (1 self)
This paper presents a new concept, the virtual WWW document (VWD): a set of WWW pages representing a logical information space, generally dealing with one particular domain. The VWD is described using metadata in XML syntax and is accessed through a metadata.class file stored at the root level of WWW sites. We suggest how the VWD can improve information retrieval on the WWW and reduce the network load generated by robots. We describe a prototype implemented in Java, within an application in the environmental domain. The exchange of such metadata takes place within a flexible architecture based on two kinds of robots, generalists and specialists, that collect and organize this metadata in order to locate resources on the WWW. They contribute to the overall self-organizing information process by exchanging their indices, thereby forwarding their knowledge to each other.
The SGF Metadata Framework and its Support for Social Awareness on the World Wide Web
- World Wide Web (Baltzer
, 1999
Cited by 1 (1 self)
In this article, we first briefly introduce the idea of metadata and explain how it is transforming the Web into an information space that can be accessed not only by humans, but also by software agents. We then consider one particular application of metadata, the description of Web site structures in a machine-understandable way. We introduce the Structured Graph Format (SGF), an XML-based format used to describe Web spaces as structured graphs. We then describe the SGF framework, built around the format specification. The framework integrates components that support the generation, distribution and processing of SGF metadata. We first describe SGF consumers, components that process the metadata for some purpose. As an example, we present SGViewer, a consumer that uses the metadata to generate interactive site maps. We then review three approaches to the problem of generating SGF metadata. These approaches highlight a tradeoff between the quality and the cost of metadata. A ...

