Results 1 - 10
of
71
Web Spam Taxonomy
, 2005
"... Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, wh ..."
Abstract
-
Cited by 140 (2 self)
- Add to MetaCart
Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures.
Database Schema Matching Using Machine Learning with Feature Selection
- In Proceedings of the Conf. on Advanced Information Systems Engineering (CAiSE
, 2002
"... S hema mat hing, the problem offind ng mappings between the attributes of two semanti ally related d atabase s hemas, is an important aspe t of manyd atabase appli ations su h as s hema integration, d ata warehousing, and ele troni ommer e. Unfortunately, s hema mat hing remains largely a manual, la ..."
Abstract
-
Cited by 45 (1 self)
- Add to MetaCart
S hema mat hing, the problem offind ng mappings between the attributes of two semanti ally related d atabase s hemas, is an important aspe t of manyd atabase appli ations su h as s hema integration, d ata warehousing, and ele troni ommer e. Unfortunately, s hema mat hing remains largely a manual, labor-intensive pro ess. Furthermore, the e#ort required is typi ally linear in the number of s hemas to be mat hed ; the next pair of s hemas to mat h is not any easier than the previous pair. In this paper wed es ribe a system,alled Automat h, that uses ma hine learning te hniques to automate s hema mat hing. Based primarily on Bayesian learning, the system a quires probabilisti knowled e from examples that have been provid ed by d main experts. This knowled ge is stored in a knowled ge basealled the attribute dictionary. When presented with a pair of new s hemas that need to be mat hed (and their orrespond ng d tabase instan es), Automat h uses the attributed i tionary to find an optimal mat hing. We also report initial results from the Automat h proje t. 1
n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure
- In VLDB
, 2005
"... The n-gram inverted index has two major advantages: language-neutral and error-tolerant. Due to these advantages, it has been widely used in information retrieval or in similar sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: the size ..."
Abstract
-
Cited by 19 (1 self)
- Add to MetaCart
The n-gram inverted index has two major advantages: language-neutral and error-tolerant. Due to these advantages, it has been widely used in information retrieval or in similar sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: the size tends to be very large, and the performance of queries tends to be bad. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index) that significantly reduces the size and improves the query performance while preserving the advantages of the n-gram inverted index. The proposed index eliminates the redundancy of the position information that exists in the n-gram inverted index. The proposed index is constructed in two steps: 1) extracting subsequences of length m from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index with these improvements becoming more marked as the database size gets larger; 2) the query processing time increases only very slightly as the query length gets longer.
Web object retrieval
- In Proc. WWW
, 2007
"... The primary function of current Web search engines is essentially relevance ranking at the document level. However, myriad structured information about real-world objects embedded in static Web pages and online Web databases. In this paper, we propose a paradigm shift to enable searching at the obje ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
The primary function of current Web search engines is essentially relevance ranking at the document level. However, myriad structured information about real-world objects embedded in static Web pages and online Web databases. In this paper, we propose a paradigm shift to enable searching at the object level. In traditional information retrieval models, documents are taken as the retrieval units and the content of a document is considered reliable. However, this reliability assumption is no longer valid in the object retrieval context when multiple copies of information about the same object typically exist. These copies may be inconsistent because of diversity of Web site qualities and the limited performance of current information extraction techniques. In this paper, we propose several language models for Web object retrieval. We test these models on our academic search engine called Libra and compare their performances. 1.
Maximal margin labeling for multi-topic text categorization
- Advances in Neural Information Processing Systems 17
, 2005
"... In this paper, we address the problem of statistical learning for multitopic text categorization (MTC), whose goal is to choose all relevant topics (a label) from a given set of topics. The proposed algorithm, Maximal Margin Labeling (MML), treats all possible labels as independent classes and learn ..."
Abstract
-
Cited by 17 (1 self)
- Add to MetaCart
In this paper, we address the problem of statistical learning for multitopic text categorization (MTC), whose goal is to choose all relevant topics (a label) from a given set of topics. The proposed algorithm, Maximal Margin Labeling (MML), treats all possible labels as independent classes and learns a multi-class classifier on the induced multi-class categorization problem. To cope with the data sparseness caused by the huge number of possible labels, MML combines some prior knowledge about label prototypes and a maximal margin criterion in a novel way. Experiments with multi-topic Web pages show that MML outperforms existing learning algorithms including Support Vector Machines. 1 Multi-topic Text Categorization (MTC) This paper addresses the problem of learning for multi-topic text categorization (MTC), whose goal is to select all topics relevant to a text from a given set of topics. In MTC, multiple topics may be relevant to a single text. We thus call a set of topics label, and say
Web page cleaning for web mining through feature weighting
- In Intl. Joint Conf. on Artificial Intelligence (IJCAI
, 2003
"... Unlike conventional data or text, Web pages typically contain a large amount of information that is not part of the main contents of the pages, e.g., banner ads, navigation bars, and copyright notices. Such irrelevant information (which we call Web page noise) in Web pages can seriously harm Web min ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
Unlike conventional data or text, Web pages typically contain a large amount of information that is not part of the main contents of the pages, e.g., banner ads, navigation bars, and copyright notices. Such irrelevant information (which we call Web page noise) in Web pages can seriously harm Web mining, e.g., clustering and classification. In this paper, we propose a novel feature weighting technique to deal with Web page noise to enhance Web mining. This method first builds a compressed structure tree to capture the common structure and comparable blocks in a set of Web pages. It then uses an information based measure to evaluate the importance of each node in the compressed structure tree. Based on the tree and its node importance values, our method assigns a weight to each word feature in its content block. The resulting weights are used in Web mining. We evaluated the proposed technique with two Web mining tasks, Web page clustering and Web page classification. Experimental results show that our weighting method is able to dramatically improve the mining results. 1
Streams, Structures, Spaces, Scenarios, Societies (5S): A Formal Model for Digital Libraries
- ACM Trans. Inf. Syst
, 2004
"... Digital libraries (DLs) are complex information systems and therefore demand formal foundations lest development e#orts diverge and interoperability su#ers. In this paper, we propose the fundamental abstractions of Streams, Structures, Spaces, Scenarios, and Societies (5S), which contribute to defin ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
Digital libraries (DLs) are complex information systems and therefore demand formal foundations lest development e#orts diverge and interoperability su#ers. In this paper, we propose the fundamental abstractions of Streams, Structures, Spaces, Scenarios, and Societies (5S), which contribute to define digital libraries rigorously and usefully. Streams are sequences of abstract items used to describe static and dynamic content. Structures can be defined as labeled directed graphs, which impose organization. Spaces are sets of abstract items and operations on those sets that obey certain rules. Scenarios consist of sequences of events or actions that modify states of a computation in order to accomplish a functional requirement. Societies comprehend entities and the relationships between and among them. Together these abstractions relate and unify concepts, among others, of digital objects, metadata, collections, and services required to formalize and elucidate "digital libraries". The applicability, versatility and unifying power of the theory is demonstrated through its use in three distinct applications: building and interpretation of a DL taxonomy, analysis of case studies of digital libraries, and utilization as a formal basis for a DL description language. Keywords: digital libraries, theory, foundations, definitions, applications 1 1 Motivation Digital libraries are extremely complex information systems. The proper concept of a digital library seems hard to completely understand and evades definitional consensus. Di#erent views (e.g., historical, technological) and perspectives (e.g., from the library and information science, information retrieval, or human-computer interaction communities) have led to a myriad of di#ering definitions. Licklider, in his seminal ...
Keyword spices: A new method for building domain-specific web search engines
- In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01
, 2001
"... This paper presents a new method for building domain-specific web search engines. Previous methods eliminate irrelevant documents from the pages accessed using heuristics based on human knowledge about the domain in question. Accordingly, they are hard to build and can not be applied to other domain ..."
Abstract
-
Cited by 10 (3 self)
- Add to MetaCart
This paper presents a new method for building domain-specific web search engines. Previous methods eliminate irrelevant documents from the pages accessed using heuristics based on human knowledge about the domain in question. Accordingly, they are hard to build and can not be applied to other domains. The keyword spice method, in contrast, improves search performance by adding
Query-driven document partitioning and collection selection
- in INFOSCALE 2006: Proceedings of the first International Conference on Scalable Information Systems
, 2006
"... Abstract — We present a novel strategy to partition a document collection onto several servers and to perform effective collection selection. The method is based on the analysis of query logs. We proposed a novel document representation called query-vectors model. Each document is represented as a l ..."
Abstract
-
Cited by 10 (3 self)
- Add to MetaCart
Abstract — We present a novel strategy to partition a document collection onto several servers and to perform effective collection selection. The method is based on the analysis of query logs. We proposed a novel document representation called query-vectors model. Each document is represented as a list recording the queries for which the document itself is a match, along with their ranks. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results, and are used for collection selection. We show that this document partition strategy greatly boosts the performance of standard collection selection algorithms, including CORI, w.r.t. a round-robin assignment. Secondly, we show that performing collection selection by matching the query to the existing query clusters and successively choosing only one server, we reach an average precision-at-5 up to 1.74 and we constantly improve CORI precision of a factor between 11 % and 15%. As a side result we show a way to select rarely asked-for documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52 % of the documents (3,128,366) are not returned among the first 100 top-ranked results of any query. I.
Expansion-based technologies in finding relevant and new information: THU TREC2002 novelty track experiments
- the Proceedings of the Eleventh Text Retrieval Conference (TREC
, 2002
"... This is the first time that Tsinghua University took part in TREC. In this year's novelty track, our basic idea is to find the key factor that help people find relevant and new information on a set of documents with noise. We paid attention to three points: 1. how to get full information from a shor ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
This is the first time that Tsinghua University took part in TREC. In this year's novelty track, our basic idea is to find the key factor that help people find relevant and new information on a set of documents with noise. We paid attention to three points: 1. how to get full information from a short sentence; 2. how to complement hidden well-known knowledge to the sentences; 3. how to make the determination of

