Results 1 - 9 of 9
Eliminating Noisy Information in Web Pages for Data Mining
- In ACM Conf. on Knowledge Discovery and Data Mining (SIGKDD), 2003
Abstract - Cited by 62 (2 self)
A commercial Web page typically contains many information blocks. Apart from the main content blocks, it usually has such blocks as navigation panels, copyright and privacy notices, and advertisements (for business purposes and for easy user access). We call these blocks that are not the main content blocks of the page the noisy blocks. We show that the information contained in these noisy blocks can seriously harm Web data mining. Eliminating these noises is thus of great importance. In this paper, we propose a noise elimination technique based on the following observation: In a given Web site, noisy blocks usually share some common contents and presentation styles, while the main content blocks of the pages are often diverse in their actual contents and/or presentation styles. Based on this observation, we propose a tree structure, called Style Tree, to capture the common presentation styles and the actual contents of the pages in a given Web site. By sampling the pages of the site, a Style Tree can be built for the site, which we call the Site Style Tree (SST). We then introduce an information based measure to determine which parts of the SST represent noises and which parts represent the main contents of the site. The SST is employed to detect and eliminate noises in any Web page of the site by mapping this page to the SST. The proposed technique is evaluated with two data mining tasks, Web page clustering and classification. Experimental results show that our noise elimination technique is able to improve the mining results significantly.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval (clustering, information filtering, selection process). General Terms: Algorith...
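The observation in this abstract (noisy blocks recur across the pages of a site, main content does not) can be illustrated with a much simpler stand-in than the paper's method. The sketch below is not the Style Tree or its information-based measure; it is a plain frequency heuristic, assuming each page has already been split into `(style, text)` block tuples. The function names, the block representation, and the `threshold` value are all hypothetical.

```python
from collections import Counter


def find_noisy_blocks(pages, threshold=0.8):
    """Return blocks that occur in at least `threshold` of the sampled pages.

    `pages` is a list of pages; each page is a list of (style, text) tuples.
    This frequency rule is an assumption standing in for the paper's measure.
    """
    counts = Counter()
    for page in pages:
        counts.update(set(page))  # count each distinct block once per page
    total = len(pages)
    return {block for block, seen in counts.items() if seen / total >= threshold}


def strip_noise(page, noisy):
    """Keep only the blocks of a page that were not flagged as noisy."""
    return [block for block in page if block not in noisy]


# Three sampled pages from a hypothetical site.
pages = [
    [("div.nav", "Home | Products"), ("div.main", "Review of camera A"), ("div.footer", "Copyright")],
    [("div.nav", "Home | Products"), ("div.main", "Review of laptop B"), ("div.footer", "Copyright")],
    [("div.nav", "Home | Products"), ("div.main", "Review of phone C"), ("div.footer", "Copyright")],
]
noisy = find_noisy_blocks(pages)
cleaned = [strip_noise(page, noisy) for page in pages]
```

In this toy input the navigation and footer blocks appear on every page and are flagged, while each page keeps its single distinct main block.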
Mining Data Records in Web Pages
- 2003
Abstract - Cited by 47 (0 self)
A large amount of information on the Web is contained in regularly structured objects, which we call data records. Such data records are important because they often present the essential information of their host pages, e.g., lists of products or services. It is useful to mine such data records in order to extract information from them to provide value-added services. Existing automatic techniques are not satisfactory because of their poor accuracies. In this paper, we propose a more effective technique to perform the task. The technique is based on two observations about data records on the Web and a string matching algorithm. The proposed technique is able to mine both contiguous and noncontiguous data records. Our experimental results show that the proposed technique outperforms existing techniques substantially.
Categories and Subject Descriptors: I.5 [Pattern Recognition]: statistical and structural; H.2.8 [Database Applications]: data mining.
Keywords: Web data records, Web mining, Web information integration
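The abstract mentions a string matching algorithm without naming it. As an illustration only, the sketch below uses normalized Levenshtein edit distance over tag sequences to group adjacent sibling nodes that look alike, which is one plausible way to spot candidate data records. The choice of edit distance, the `threshold` value, and all function names are assumptions, not details taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of tags)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer sequence length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def similar_runs(siblings, threshold=0.3):
    """Group consecutive siblings whose tag sequences differ by at most `threshold`."""
    runs, current = [], [0]
    for i in range(1, len(siblings)):
        if normalized_distance(siblings[i - 1], siblings[i]) <= threshold:
            current.append(i)
        else:
            if len(current) > 1:
                runs.append(current)
            current = [i]
    if len(current) > 1:
        runs.append(current)
    return runs


# Tag sequences of five sibling nodes on a hypothetical product listing page.
siblings = [
    ["h1"],
    ["tr", "td", "img", "td", "a", "td"],
    ["tr", "td", "img", "td", "a", "td"],
    ["tr", "td", "td", "a", "td"],
    ["p", "small"],
]
runs = similar_runs(siblings)
```

Here the three table rows form one run of similar siblings even though the third row lacks an image, while the heading and the trailing paragraph are left out.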
Web Mining: Machine Learning for Web Applications
- Annual Review of Information Science and Technology, 2004
Abstract - Cited by 9 (7 self)
With more than two billion pages created by millions of Web page authors and organizations, the World Wide Web is a tremendously rich
Geo-Tagging for Imprecise Regions of Different Sizes
- In: Proceedings of GIR07. ACM, 2007
Abstract - Cited by 2 (0 self)
Extracting geographical information from various web sources is likely to be important for a variety of applications. One such use for this information is to enable the study of vernacular regions: informal places referred to on a day-to-day basis, but with no official entry in geographical resources, such as gazetteers. Past work in automatically extracting geographical information from the web to support the creation of vernacular regions has tended to focus on larger regions (e.g. “The British Midlands” and “The South of France”). In this paper we report the results of preliminary work to investigate the success of using a simple geotagging approach and resources of varying granularity from the Ordnance Survey to extract geographical information from web pages. We find that the data gathered for smaller regions (compared with larger ones) is more “fine-grained”, which has an effect on the type of resource most useful for geo-tagging and its success.
International Editorial Staff
"... IJ ITA is official publisher of the scientific papers of the members of ..."
A Thesis by John King, B.I.T.
- 2003
Abstract
The deep web contains a massive number of collections that are mostly invisible to search engines. These collections often contain high-quality, structured information that cannot be crawled using traditional methods.
A Cooperative Paradigm for Fighting Information Overload
Abstract
The Web is mainly processed by humans. The role of the machines is just to transmit and display the contents of the documents, and they are barely able to do anything else. Nowadays there are many initiatives trying to change this situation; many of them are related to fields like the Semantic Web [1] or Web Intelligence. In this paper we describe the Cooperative Web [2], which can be seen as a new proposal towards Web Intelligence. The Cooperative Web would allow us to extract semantics from the Web in an automatic way, without the need for ontological artifacts and with language independence, while also allowing the browsing experience of individual users to serve the whole community of users.
AntWeb – The Adaptive Web Server Based on the Ants’ Behavior
Abstract
to AntWeb application is inspired by the foraging behavior of ant colonies, to adaptively mark the most significant links by means of the shortest route to the target pages. In our approach, we consider the web users as artificial ants, and use the ant theory as a metaphor to guide the user’s activity in the Web site. In this paper, we describe the ant theory on which AntWeb is based. We also present the AntWeb system, its implementation and a case study with some experiments. The database in AntWeb stores a vast amount of information related to the users’ visits to Web sites, which can be useful for further Web mining. Index Terms--AntWeb system, adaptive Web server, ant’s
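The ant metaphor in this abstract can be sketched as a pheromone map over links: each user session deposits pheromone on the links it followed, shorter routes deposit more per link, and older trails evaporate. This is a generic ant-colony illustration, not the AntWeb implementation; the class name, the deposit rule (`1 / number of links`), and the evaporation rate are all assumptions.

```python
from collections import defaultdict


class PheromoneMap:
    """Toy pheromone bookkeeping for links between pages (hypothetical design)."""

    def __init__(self, evaporation=0.1):
        self.evaporation = evaporation
        self.levels = defaultdict(float)  # (source page, target page) -> pheromone

    def record_session(self, path):
        """Evaporate existing trails, then reinforce the links of one visited path."""
        links = list(zip(path, path[1:]))
        if not links:
            return
        for link in self.levels:
            self.levels[link] *= 1 - self.evaporation
        deposit = 1.0 / len(links)  # shorter routes leave a stronger trail per link
        for link in links:
            self.levels[link] += deposit

    def best_link(self, page):
        """Return the target of the strongest outgoing link from `page`, if any."""
        outgoing = {link: level for link, level in self.levels.items() if link[0] == page}
        if not outgoing:
            return None
        return max(outgoing, key=outgoing.get)[1]


trails = PheromoneMap()
trails.record_session(["home", "products", "item42"])         # short route
trails.record_session(["home", "news", "archive", "item42"])  # longer route
suggested = trails.best_link("home")
```

With these two sessions the link from `home` to `products` carries more pheromone than the link to `news`, so it would be the one marked for later visitors.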
Improving the Ranking Capability of the Hyperlink Based Search Engines Using Heuristic Approach
Abstract
Abstract: To evaluate the informative content of a Web page, the Web structure has to be carefully analyzed. Hyperlink analysis, which is capable of measuring the potential information contained in a Web page with respect to the Web space, is gaining more attention. The links to and from Web pages are an important resource that has largely gone unused in existing search engines. Web pages differ from general text in that they possess external and internal structure. The Web links between documents can provide useful information in finding pages for a given set of topics. Making use of the Web link information would allow the construction of more powerful tools for answering user queries. Google has been among the first search engines to utilize hyperlinks in page ranking. Still, two main flaws in Google need to be tackled. First, all the backlinks to a page are assigned equal weights. Second, less content-rich pages, such as intermediate and transient pages, are not differentiated from more content-rich pages. To overcome these pitfalls, this paper proposes a heuristic-based solution to differentiate the significance of various backlinks by assigning a different weight factor to them depending on their location in the directory tree of the Web space. Key words: Ranking capability, web link information, search engine, heuristic based solution
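The abstract says backlinks receive different weights depending on their location in the directory tree, but it does not give the weight formula. The sketch below is therefore only an illustration of the idea: it assumes, hypothetically, that a backlink from a page nearer the site root counts more, using weight `1 / (1 + depth)`. The function names and the `example.org` URLs are made up for the example.

```python
from urllib.parse import urlparse


def directory_depth(url):
    """Number of directory levels in the URL path, ignoring a trailing file name."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    if segments and "." in segments[-1]:
        segments = segments[:-1]  # treat a last segment like "page.html" as a file
    return len(segments)


def backlink_weight(url):
    """Assumed weighting: backlinks from shallower pages weigh more."""
    return 1.0 / (1 + directory_depth(url))


def weighted_score(backlinks):
    """Sum of backlink weights, replacing a plain count of backlinks."""
    return sum(backlink_weight(url) for url in backlinks)


# Two hypothetical pages with the same number of backlinks.
page_a_backlinks = ["http://example.org/", "http://example.org/index.html"]
page_b_backlinks = ["http://example.org/a/b/c/old.html", "http://example.org/tmp/x/y/z.html"]
score_a = weighted_score(page_a_backlinks)
score_b = weighted_score(page_b_backlinks)
```

Both pages have two backlinks, so an equal-weight count would tie them; under the assumed depth weighting the page linked from the site root scores higher.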

