Results 1 - 10 of 129
Two supervised learning approaches for name disambiguation in author citations
- In JCDL ’04: Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries, 2004
"... Due to name abbreviations, identical names, name misspellings, and pseudonyms in publications or bibliographies (citations), an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, web search, database integra ..."
Abstract - Cited by 80 (5 self)
Due to name abbreviations, identical names, name misspellings, and pseudonyms in publications or bibliographies (citations), an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, web search, and database integration, and may cause improper attribution to authors. This paper investigates two supervised learning approaches to disambiguating authors in citations. One approach uses the naive Bayes probability model, a generative model; the other uses Support Vector Machines (SVMs) [39] and the vector space representation of citations, a discriminative model. Both approaches utilize three types of citation attributes: co-author names, the title of the paper, and the title of the journal or proceedings. We illustrate these two approaches on two types of data: one collected from the web, mainly publication lists from homepages; the other collected from the DBLP citation databases.
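As a rough illustration of the two models compared above, the following sketch trains a generative naive Bayes classifier and a discriminative linear SVM on toy citation strings that concatenate co-author, title, and venue tokens; all data and names here are invented, not taken from the paper.

    # Hedged sketch: naive Bayes vs. SVM over a bag-of-words view of citations.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # Each citation concatenates the three attribute types the paper uses:
    # co-author names, title words, and venue words (toy data).
    citations = [
        "chen_w name disambiguation citations jcdl",
        "chen_w citation matching digital library jcdl",
        "kumar_s protein structure prediction ismb",
        "kumar_s gene expression clustering ismb",
    ]
    authors = ["author_A", "author_A", "author_B", "author_B"]

    vec = CountVectorizer()
    X = vec.fit_transform(citations)

    nb = MultinomialNB().fit(X, authors)     # generative model
    svm = LinearSVC().fit(X, authors)        # discriminative model

    query = vec.transform(["chen_w name matching jcdl"])
    print(nb.predict(query), svm.predict(query))   # both should say author_A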
Name disambiguation in author citations using a K-way spectral clustering method
- In International Conference on Digital Libraries, 2005
"... An author may have multiple names and multiple authors may share the same name simply due to name abbreviations, identical names, or name misspellings in publications or bibliographies 1. This can produce name ambiguity which can affect the performance of document retrieval, web search, and database ..."
Abstract - Cited by 72 (7 self)
An author may have multiple names, and multiple authors may share the same name, simply due to name abbreviations, identical names, or name misspellings in publications or bibliographies. This can produce name ambiguity, which can affect the performance of document retrieval, web search, and database integration, and may cause improper attribution of credit. Proposed here is an unsupervised learning approach using K-way spectral clustering that disambiguates authors in citations. The approach utilizes three types of citation attributes: co-author names, paper titles, and publication venue titles. The approach is illustrated with 16 name datasets with citations collected from the DBLP database bibliography and author home pages, and shows that name disambiguation can be achieved using these citation attributes.
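A minimal sketch of the K-way spectral clustering idea, assuming a cosine-similarity affinity over TF-IDF vectors of the citation attributes (the paper's exact affinity construction is not reproduced, and the data below is invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.cluster import SpectralClustering

    # Citations for one ambiguous name; tokens mix co-authors, title, venue.
    citations = [
        "wang_l name disambiguation jcdl",
        "wang_l citation clustering jcdl",
        "zhao_m wireless sensor networks infocom",
        "zhao_m routing sensor networks infocom",
    ]
    affinity = cosine_similarity(TfidfVectorizer().fit_transform(citations))

    # K = 2: we assume the ambiguous name covers two real authors.
    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(affinity)
    print(labels)   # same cluster id -> attributed to the same author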
Efficient name disambiguation for large-scale databases
- In PKDD, 2006
"... Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retr ..."
Abstract - Cited by 37 (3 self)
The name disambiguation problem arises when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names, and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test errors, and faster prediction time than a standard SVM. We prove that by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers, yielding 490 authors, and achieved 90.6% pairwise F1. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
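A self-contained sketch of the blocking-then-clustering pipeline, with the paper's learned LASVM distance replaced by a simple Jaccard token distance; everything below, including the blocking key, is an illustrative assumption:

    from collections import defaultdict
    import numpy as np
    from sklearn.cluster import DBSCAN

    papers = [
        ("j smith", "name disambiguation in citations"),
        ("j smith", "name matching in citations"),
        ("j smith", "quantum error correction"),
        ("a jones", "topic models for text"),
    ]

    # Blocking: a coarse name key retrieves candidate papers that might
    # belong to the same author.
    blocks = defaultdict(list)
    for name, title in papers:
        first, last = name.split()
        blocks[last + "_" + first[0]].append(title)

    def jaccard_distance(a, b):
        sa, sb = set(a.split()), set(b.split())
        return 1.0 - len(sa & sb) / len(sa | sb)

    # Within each block, DBSCAN clusters papers by author; density
    # reachability is what gives the transitivity property noted above.
    for key, titles in blocks.items():
        n = len(titles)
        dist = np.array([[jaccard_distance(titles[i], titles[j])
                          for j in range(n)] for i in range(n)])
        labels = DBSCAN(eps=0.7, min_samples=1,
                        metric="precomputed").fit_predict(dist)
        print(key, labels)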
Automatic extraction of titles from general documents using machine learning
- In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2005
"... In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, ..."
Abstract - Cited by 31 (3 self)
In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods have been proposed mainly for title extraction from research papers, and it has not been clear whether automatic title extraction from general documents is feasible. As a case study, we consider extraction from Microsoft Office documents, including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint, respectively) to serve as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that it mainly utilizes formatting information, such as font size, as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from general documents. In an experiment on intranet data, precision and recall for title extraction from Word are 0.810 and 0.837, respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895, respectively. Other important new findings in this work are that we can train models in one domain and apply them to another domain and, more surprisingly, that we can even train models in one language and apply them to another language. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.
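The formatting-feature idea lends itself to a small sketch: classify each line of a document as title or body from features such as font size, boldness, and position. The feature set and training rows below are invented, not the paper's actual features:

    from sklearn.linear_model import LogisticRegression

    # One row per document line: (font_size_pt, is_bold, line_index_from_top).
    X = [
        [24, 1, 0], [12, 0, 1], [12, 0, 2],   # doc 1: large bold first line
        [18, 1, 0], [11, 0, 1], [11, 0, 2],   # doc 2
        [12, 0, 0], [12, 0, 1], [12, 0, 2],   # doc 3: no distinctive title
    ]
    y = [1, 0, 0, 1, 0, 0, 0, 0, 0]           # 1 = title line

    clf = LogisticRegression().fit(X, y)
    # Expected: the large bold first line is tagged as the title.
    print(clf.predict([[22, 1, 0], [12, 0, 3]]))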
Learning Extractors from Unlabeled Text using Relevant Databases, 2007
"... Supervised machine learning algorithms for information extraction generally require large amounts of training data. In many cases where labeling training data is burdensome, there may, however, already exist an incomplete database relevant to the task at hand. Records from this database can be used ..."
Abstract - Cited by 22 (1 self)
Supervised machine learning algorithms for information extraction generally require large amounts of training data. In many cases where labeling training data is burdensome, there may, however, already exist an incomplete database relevant to the task at hand. Records from this database can be used to label text strings that express the same information. For tasks where text strings do not follow the same format or layout, and may additionally contain extra information, labeling the strings completely can be problematic. This paper presents a method for training extractors that fill in the missing labels of a text sequence partially labeled using simple high-precision heuristics. Furthermore, we improve the algorithm by utilizing labeled fields from the database. In experiments with BibTeX records and research paper citation strings, we show a significant improvement in extraction accuracy over a baseline that relies only on the database for training data.
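A minimal sketch of the high-precision labeling heuristic described above: exact occurrences of known database field values are projected onto a raw citation string as token labels, leaving the rest for the learned extractor to fill in. The record, tokenization, and field names are invented:

    record = {"author": "Jane Smith", "year": "2004", "venue": "JCDL"}
    citation = "Jane Smith. Title extraction from documents. JCDL, 2004."

    # Crude tokenization that separates punctuation.
    tokens = citation.replace(",", " ,").replace(".", " .").split()
    labels = ["?"] * len(tokens)   # unknown: left for the extractor to fill

    # High-precision heuristic: only exact multi-token matches get labels.
    for field, value in record.items():
        vtoks = value.split()
        for i in range(len(tokens) - len(vtoks) + 1):
            if tokens[i:i + len(vtoks)] == vtoks:
                labels[i:i + len(vtoks)] = [field] * len(vtoks)

    print(list(zip(tokens, labels)))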
Collaboration Over Time: Characterizing and Modeling Network Evolution
- In Proceedings of the 1st ACM International Conference on Web Search and Data Mining (WSDM), 2008
"... A formal type of scientific and academic collaboration is coauthorship which can be represented by a coauthorship network. Coauthorship networks are among some of the largest social networks and offer us the opportunity to study the mechanisms underlying large-scale real world networks. We construct ..."
Abstract - Cited by 20 (2 self)
A formal type of scientific and academic collaboration is coauthorship, which can be represented by a coauthorship network. Coauthorship networks are among the largest social networks and offer us the opportunity to study the mechanisms underlying large-scale real-world networks. We construct such a network for the Computer Science field, covering research collaborations from 1980 to 2005, based on a large dataset of 451,305 papers authored by 283,174 distinct researchers. By mining this network, we first present a comprehensive study of the statistical properties of a longitudinal network, at the overall network level as well as at the intermediate community level. Major observations are that the database community is the best connected while the AI community is the most assortative, and that the Computer Science field as a whole shows a …
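The kind of network-level statistics this study reports can be sketched with networkx on a toy coauthorship edge list; the edges and years below are invented:

    import networkx as nx

    # (author, author, year of their joint paper)
    edges = [("a", "b", 1998), ("b", "c", 2001), ("c", "d", 2003),
             ("a", "c", 2004), ("e", "f", 2005)]

    G = nx.Graph()
    for u, v, year in edges:
        G.add_edge(u, v, year=year)

    print("largest component size:",
          len(max(nx.connected_components(G), key=len)))
    print("degree assortativity:",
          nx.degree_assortativity_coefficient(G))
    print("average clustering:", nx.average_clustering(G))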
Generating Fuzzy Semantic Metadata Describing Spatial Relations from Images Using the R-Histogram
- In Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, 2004
"... Automatic generation of semantic metadata describing spatial relations is highly desirable for image digital libraries. Relative spatial relations between objects in an image convey important information about the image. Because the perception of spatial relations is subjective, we propose a novel f ..."
Abstract - Cited by 19 (0 self)
Automatic generation of semantic metadata describing spatial relations is highly desirable for image digital libraries. Relative spatial relations between objects in an image convey important information about the image. Because the perception of spatial relations is subjective, we propose a novel framework for automatic metadata generation, based on fuzzy k-NN classification, that generates fuzzy semantic metadata describing spatial relations between objects in an image. For each pair of objects of interest, the corresponding R-histogram is computed and used as input to a set of fuzzy k-NN classifiers. The R-histogram is a quantitative representation of the spatial relations between two objects. The outputs of the classifiers are soft class labels for each of the following eight spatial relations: 1) LEFT OF, 2) RIGHT OF, 3) ABOVE, 4) BELOW, 5) NEAR, 6) FAR, 7) INSIDE, 8) OUTSIDE. Because the classifier-training stage involves annotating the training images manually, it is desirable to use as few training images as possible. To address this issue, we applied existing prototype selection techniques and also devised two new extensions. We evaluated the performance of different fuzzy k-NN algorithms and prototype selection algorithms empirically on both synthetic and real images. Preliminary experimental results show that our system is able to obtain good annotation accuracy (92%–98% on synthetic images and 82%–93% on real images) using only a small training set (4–5 images).
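Fuzzy k-NN assigns soft class memberships rather than a single hard label. A common formulation (inverse-distance weighting in the style of Keller et al., used here as a stand-in for the paper's exact variant) looks roughly like this, with R-histogram features stubbed as short vectors:

    import numpy as np

    # Stub "R-histogram" feature vectors for labeled prototype object pairs.
    prototypes = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    proto_labels = ["LEFT_OF", "LEFT_OF", "RIGHT_OF", "RIGHT_OF"]

    def fuzzy_knn(x, k=3, m=2.0):
        d = np.linalg.norm(prototypes - x, axis=1)
        nearest = np.argsort(d)[:k]
        # Inverse-distance weights; the fuzzifier m controls softness.
        w = 1.0 / np.maximum(d[nearest], 1e-9) ** (2.0 / (m - 1.0))
        scores = {}
        for i, wi in zip(nearest, w):
            scores[proto_labels[i]] = scores.get(proto_labels[i], 0.0) + wi
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}

    # Soft memberships for an unseen pair: mostly LEFT_OF, a little RIGHT_OF.
    print(fuzzy_knn(np.array([0.8, 0.2])))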
Panorama: Extending digital libraries with topical crawlers
- In Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, 2004
"... A large amount of research, technical and professional documents are available today in digital formats. Digital libraries are created to facilitate search and retrieval of information supplied by the documents. These libraries may span an entire area of interest (e.g., computer science) or be limit ..."
Abstract - Cited by 17 (1 self)
A large number of research, technical, and professional documents are available today in digital formats. Digital libraries are created to facilitate search and retrieval of the information supplied by these documents. Such libraries may span an entire area of interest (e.g., computer science) or be limited to documents within a small organization. While tools that index, classify, rank, and retrieve documents from such libraries are important, it is worthwhile to complement them with information available on the Web. We propose one such technique, which uses a topical crawler driven by information extracted from a research document. The goal of the crawler is to harvest a collection of Web pages focused on the Web communities associated with the given document. The collection created through Web crawling is further processed through lexical and linkage analysis. The entire process is automated and uses machine learning techniques both to guide the crawler and to analyze the collection it fetches. A report is generated at the end that provides visual cues and information to the researcher.
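The crawler's best-first behavior can be sketched with a priority queue over a simulated in-memory "web"; in the real system the score would come from learned models over anchor text and fetched pages, and everything below (pages, links, topic terms) is invented:

    import heapq

    web = {  # url -> (page text, outgoing links); a stand-in for the Web
        "seed": ("digital library crawler research", ["p1", "p2"]),
        "p1":   ("topical crawler machine learning", ["p3"]),
        "p2":   ("cooking recipes", []),
        "p3":   ("web communities digital library", []),
    }
    # Terms extracted from the seed research document drive the crawl.
    topic = {"digital", "library", "crawler", "web"}

    def score(url):
        return len(topic & set(web[url][0].split()))

    frontier = [(-score("seed"), "seed")]   # max-priority via negated score
    seen = set()
    while frontier:
        neg, url = heapq.heappop(frontier)
        if url in seen:
            continue
        seen.add(url)
        print(f"fetched {url} (topic score {-neg})")
        for link in web[url][1]:
            if link not in seen:
                heapq.heappush(frontier, (-score(link), link))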
MetaExtract: an NLP system to automatically assign metadata
- In Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries
"... We have developed MetaExtract, a system to automatically assign Dublin Core + GEM metadata using extraction techniques from our natural language processing research. MetaExtract is comprised of three distinct processes: eQuery and HTML-based Extraction modules and a Keyword Generator module. We cond ..."
Abstract - Cited by 16 (2 self)
We have developed MetaExtract, a system to automatically assign Dublin Core + GEM metadata using extraction techniques from our natural language processing research. MetaExtract comprises three distinct processes: eQuery and HTML-based Extraction modules and a Keyword Generator module. We conducted a Web-based survey in which users evaluated the quality of each metadata element. Only two of the elements, Title and …
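Since the abstract is cut off above, only the general flavor can be sketched: heuristic assignment of a couple of Dublin Core elements from a document. This is an assumption about what such extraction modules do, not a reproduction of MetaExtract's actual logic:

    import re

    html = ("<html><head><title>Teaching Fractions</title></head>"
            "<body>A grade 5 lesson covering fractions and decimals.</body></html>")

    def assign_dc(doc):
        # dc:title from the <title> tag; dc:subject from salient body words.
        title = re.search(r"<title>(.*?)</title>", doc, re.S)
        body = re.sub(r"<[^>]+>", " ", doc)
        words = {w.lower().strip(".,") for w in body.split() if len(w) > 7}
        return {"dc:title": title.group(1) if title else "",
                "dc:subject": sorted(words)}

    print(assign_dc(html))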
eBizSearch: An OAI-Compliant Digital Library for eBusiness
- In JCDL, 2003
"... Niche Search Engines offer an efficient alternative to traditional search engines when the results returned by general-purpose search engines do not provide a sufficient degree of relevance and when nontraditional search features are required. Niche search engines can take advantage of their domain ..."
Abstract - Cited by 15 (9 self)
Niche search engines offer an efficient alternative to traditional search engines when the results returned by general-purpose search engines do not provide a sufficient degree of relevance and when nontraditional search features are required. Niche search engines can take advantage of their domain of concentration to achieve higher relevance and offer enhanced features. We discuss a new digital library niche search engine, eBizSearch, dedicated to e-business and e-business documents. The underlying technology for eBizSearch is CiteSeer, a special-purpose digital library and search engine with automatic document indexing, developed at NEC Research Institute. We present here the integration of CiteSeer into the framework of eBizSearch and the process necessary to tune the whole system toward the specific area of e-business. We show how, using machine learning algorithms, we generate metadata to make eBizSearch Open Archives compliant. eBizSearch is a publicly available service and can be reached at [13].
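On the Open Archives side, the generated metadata ultimately has to be exposed as oai_dc records. A bare sketch of that serialization step (field values invented; OAI-PMH protocol verbs such as Identify and ListRecords not shown) might look like:

    from xml.sax.saxutils import escape

    meta = {"title": "Pricing models for e-business",
            "creator": "A. Author",
            "date": "2003"}

    # Wrap machine-generated fields in an oai_dc container element.
    record = ('<oai_dc:dc xmlns:oai_dc='
              '"http://www.openarchives.org/OAI/2.0/oai_dc/" '
              'xmlns:dc="http://purl.org/dc/elements/1.1/">'
              + "".join(f"<dc:{k}>{escape(v)}</dc:{k}>"
                        for k, v in meta.items())
              + "</oai_dc:dc>")
    print(record)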