Introduction to Data Mining, 1999
Cited by 2 (0 self)
We are in an age often referred to as the information age. In this information age, because we believe that information leads to power and success, and thanks to sophisticated technologies such as computers, satellites, etc., we have been collecting tremendous amounts of information. Initially, with the advent of computers and means for mass digital storage, we started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos has led to the creation of structured databases and database management systems (DBMS). The efficient database management systems have been very important assets for management of a large corpus of data and especially for effective and efficient retrieval of particular information from a large collection whenever needed. The prol
On Learning Strategies for Topic Specific Web Crawling. Next Generation Data Mining Applications, 2004
Cited by 1 (0 self)
Crawling has been a topic of considerable interest in recent years because of the rapid growth of the world wide web. In many cases, it is possible to design more effective crawlers which can find web pages belonging to specific topics. In this paper, we will discuss some recent techniques for crawling web pages belonging to specific topics. We discuss the following classes of techniques: (1) Intelligent Crawling Methods: These methods learn the relationship between the hyper-link structure/web page content and the topic of the web page. This learned information is utilized in order to guide the direction of the crawl. (2) Collaborative Crawling Methods: These methods utilize the pattern of world wide web accesses by individual users in order to build the learning information. In many cases, user access patterns contain valuable statistical patterns which cannot be inferred from purely linkage information. We will also discuss some creative ways of combining different kinds of linkage- and user-centered methods in order to improve the effectiveness of the crawl. We will discuss some of the recent algorithms proposed in each topic along with some discussions on the directions of future research.
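As a rough illustration of the "intelligent crawling" idea in (1), the sketch below keeps running statistics on how often a link feature has co-occurred with on-topic pages and uses them to score unvisited links. It is a minimal sketch under stated assumptions, not an algorithm taken from the paper: the class name `LinkFeatureModel`, the use of URL or anchor-text tokens as features, and the add-one smoothing are all hypothetical choices made for illustration.

```python
class LinkFeatureModel:
    """Hypothetical online model: per-token rate of leading to on-topic pages."""

    def __init__(self):
        self.seen = {}      # token -> number of crawled pages whose link had it
        self.relevant = {}  # token -> how many of those pages were on topic

    def update(self, tokens, is_relevant):
        """Record the outcome of one crawled page reached via these tokens."""
        for t in set(tokens):
            self.seen[t] = self.seen.get(t, 0) + 1
            if is_relevant:
                self.relevant[t] = self.relevant.get(t, 0) + 1

    def score(self, tokens):
        """Mean smoothed relevance rate over the tokens (0.5 if there are none)."""
        tokens = set(tokens)
        if not tokens:
            return 0.5
        rates = [
            (self.relevant.get(t, 0) + 1) / (self.seen.get(t, 0) + 2)
            for t in tokens
        ]
        return sum(rates) / len(rates)


model = LinkFeatureModel()
model.update(["jazz", "music"], True)
model.update(["jazz", "festival"], True)
model.update(["tax", "forms"], False)
# Links are then ranked by score; higher means "crawl sooner".
ranked = sorted([["tax"], ["jazz"], ["unseen"]], key=model.score, reverse=True)
```

A crawler built on such a model would update it after every fetched page, so the crawl direction adapts as evidence accumulates.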
ALII: An information integration environment based on the active logic framework
- In Third International Conference on Management Information Systems, 2002
Cited by 1 (0 self)
There is growing interest in accessing, relating, and combining data from multiple sources on the Web. Enormous amounts of heterogeneous information have been accumulated within corporations, government organizations and universities. Such information continues to grow at an ever-increasing rate. This information comes from different subject areas, and comes in different formats: bitmap, plain text, binary, etc.
Focused Crawling Using Context Graphs
- In 26th International Conference on Very Large Databases, VLDB 2000, 2000
Cited by 1 (0 self)
Maintaining currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size and dynamic content of the web. Focused crawlers aim to search only the subset of the web related to a specific category, and offer a potential solution to the currency problem. The major problem in focused crawling is performing appropriate credit assignment to different documents along a crawl path, such that short-term gains are not pursued at the expense of less-obvious crawl paths that ultimately yield larger sets of valuable pages. To address this problem we present a focused crawling algorithm that builds a model for the context within which topically relevant pages occur on the web.
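The credit-assignment idea can be pictured as a best-first crawl in which each candidate page carries an estimate of how many links separate it from a relevant page, and the crawler always expands the candidate with the smallest estimate. The sketch below is a simplified illustration, not the paper's implementation: `focused_crawl`, `get_links`, and `predict_layer` are hypothetical names, and the layer estimate is supplied by a lookup table standing in for a learned context model.

```python
import heapq


def focused_crawl(seeds, get_links, predict_layer, max_pages=10):
    """Visit pages in order of predicted link distance to a target page."""
    frontier = []
    counter = 0  # insertion counter used as a deterministic tie-breaker
    seen = set(seeds)
    for url in seeds:
        heapq.heappush(frontier, (predict_layer(url), counter, url))
        counter += 1
    order = []
    while frontier and len(order) < max_pages:
        _, _, url = heapq.heappop(frontier)
        order.append(url)
        for nxt in get_links(url):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (predict_layer(nxt), counter, nxt))
                counter += 1
    return order


# Toy web graph and toy layer estimates (0 = predicted to be a target page).
web = {"home": ["sports", "dept"], "sports": ["scores"], "dept": ["papers"]}
layers = {"home": 2, "sports": 3, "dept": 1, "papers": 0, "scores": 3}

visited = focused_crawl(
    ["home"],
    get_links=lambda u: web.get(u, []),
    predict_layer=lambda u: layers[u],
    max_pages=3,
)
```

In this toy run the crawler goes through "dept" to reach "papers" before touching "sports", even though "dept" itself is not a target page; that is the kind of less-obvious path the abstract says a purely short-term strategy would miss.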
Querying Large Text Databases for Efficient Information Extraction
A wealth of data is hidden within unstructured text. This data is often best exploited in structured or relational form, which is suited for sophisticated query processing, for integration with relational databases, and for data mining. Current information extraction techniques extract relations from a text database by examining every document in the database. This exhaustive approach is not practical, or sometimes even feasible, for large databases. In this paper, we develop an efficient query-based technique to identify documents that are potentially useful for the extraction of a target relation. We start by sampling the database to characterize the documents from which an information extraction system manages to extract relevant tuples. Then, we apply machine learning and information retrieval techniques to derive queries likely to match additional useful documents in the database. Finally, we issue these queries to the database to retrieve documents from which the information extraction system can extract the final relation. Our technique requires that databases support only a minimal boolean query interface, and is independent of the choice of the underlying information extraction system. We report a thorough experimental evaluation over more than one million documents that shows that we significantly improve the efficiency of the extraction process by focusing only on promising documents. Our proposed technique could be used to query a standard web search engine, hence providing a building block for efficient information extraction over the web at large.
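The three steps in this abstract (sample, derive queries, issue queries) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the function names are hypothetical, `is_useful` stands in for "the extraction system produced tuples from this document", and the term score is a simple difference of document frequencies rather than any learned model the authors may have used.

```python
from collections import Counter


def derive_queries(sample, is_useful, top_k=2):
    """Pick terms that occur in more useful than useless sampled documents."""
    useful, useless = Counter(), Counter()
    for doc in sample:
        terms = set(doc.lower().split())
        (useful if is_useful(doc) else useless).update(terms)
    # Score = document frequency among useful docs minus among useless docs.
    scores = {t: useful[t] - useless[t] for t in useful}
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]


def boolean_search(database, term):
    """Stand-in for a minimal boolean interface: single-term containment."""
    return [d for d in database if term in set(d.lower().split())]


def retrieve_promising(database, sample, is_useful, top_k=2):
    """Issue the derived queries and collect the matching documents once each."""
    queries = derive_queries(sample, is_useful, top_k)
    seen, results = set(), []
    for q in queries:
        for doc in boolean_search(database, q):
            if doc not in seen:
                seen.add(doc)
                results.append(doc)
    return queries, results


database = [
    "ebola outbreak reported in zaire",
    "stock markets fell sharply today",
    "cholera outbreak reported in peru",
    "new stadium opened in the city today",
    "measles outbreak hits school",
]
sample = database[:4]  # pretend only these four were sampled and processed
queries, docs = retrieve_promising(database, sample, lambda d: "outbreak" in d)
```

Only the retrieved documents would then be passed to the extraction system, which is where the efficiency gain described in the abstract would come from.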

