Results 1 - 10
of
48
LODStats – An Extensible Framework for High-performance Dataset Analytics
"... Abstract. One of the major obstacles for a wider usage of web data is the difficulty to obtain a clear picture of the available datasets. In order to reuse, link, revise or query a dataset published on the Web it is important to know the structure, coverage and coherence of the data. In order to obt ..."
Abstract
-
Cited by 43 (2 self)
- Add to MetaCart
(Show Context)
Abstract. One of the major obstacles for a wider usage of web data is the difficulty to obtain a clear picture of the available datasets. In order to reuse, link, revise or query a dataset published on the Web it is important to know the structure, coverage and coherence of the data. In order to obtain such information we developed LODStats – a statement-stream-based approach for gathering comprehensive statistics about datasets adhering to the Resource Description Framework (RDF). LODStats is based on the declarative description of statistical dataset characteristics. Its main advantages over other approaches are a smaller memory footprint and significantly better performance and scalability. We integrated LODStats with the CKAN dataset metadata registry and obtained a comprehensive picture of the current state of a significant part of the Data Web. 1
Efficient Multidimensional Blocking for Link Discovery without losing Recall
"... Over the last three years, an increasing number of data providers have started to publish structured data according to the Linked Data principles on the Web. The resulting Web of Data currently consists of over 28 billion RDF triples. As the Web of Data grows, there is an increasing need for link di ..."
Abstract
-
Cited by 22 (0 self)
- Add to MetaCart
(Show Context)
Over the last three years, an increasing number of data providers have started to publish structured data according to the Linked Data principles on the Web. The resulting Web of Data currently consists of over 28 billion RDF triples. As the Web of Data grows, there is an increasing need for link discovery tools which scale to very large datasets. In record linkage, many partitioning methods have been proposed which substantially reduce the number of required entity comparisons. Unfortunately, most of these methods either lead to a decrease in recall or only work on metric spaces. We propose a novel blocking method called Multi-Block which uses a multidimensional index in which similar objects are located near each other. In each dimension the entities are indexed by a different property increasing the efficiency of the index significantly. In addition, it guarantees that no false dismissals can occur. Our approach works on complex link specifications which aggregate several different similarity measures. MultiBlock has been implemented as part of the Silk Link Discovery Framework. The evaluation shows a speedup factor of several 100 for large datasets compared to the full evaluation without losing recall.
RAVEN – Active Learning of Link Specifications
"... Abstract. With the growth of the Linked Data Web, time-efficient approaches for computing links between data sources have become indispensable. Yet, in many cases, determining the right specification for a link discovery problem is a tedious task that must still be carried out manually. We present R ..."
Abstract
-
Cited by 16 (5 self)
- Add to MetaCart
(Show Context)
Abstract. With the growth of the Linked Data Web, time-efficient approaches for computing links between data sources have become indispensable. Yet, in many cases, determining the right specification for a link discovery problem is a tedious task that must still be carried out manually. We present RAVEN, an approach for the semi-automatic determination of link specifications. Our approach is based on the combination of stable solutions of matching problems and active learning with the time-efficient link discovery framework LIMES. RAVEN aims at requiring a small number of interactions with the user to generate classifiers of high accuracy. We focus on using RAVEN to compute and configure boolean and weighted classifiers, which we evaluate in three experiments against link specifications created manually. Our evaluation shows that we can compute linking configurations that achieve more than 90 % F-score by asking the user to verify at most twelve potential links.
Assessing linked data mappings using network measures
- In ESWC
, 2012
"... Abstract. Linked Data is at its core about the setting of links between resources. Links provide enriched semantics, pointers to extra informa-tion and enable the merging of data sets. However, as the amount of Linked Data has grown, there has been the need to automate the cre-ation of links and suc ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
(Show Context)
Abstract. Linked Data is at its core about the setting of links between resources. Links provide enriched semantics, pointers to extra informa-tion and enable the merging of data sets. However, as the amount of Linked Data has grown, there has been the need to automate the cre-ation of links and such automated approaches can create low-quality links or unsuitable network structures. In particular, it is difficult to know whether the links introduced improve or diminish the quality of Linked Data. In this paper, we present LINK-QA, an extensible framework that allows for the assessment of Linked Data mappings using network met-rics. We test five metrics using this framework on a set of known good and bad links generated by a common mapping system, and show the behaviour of those metrics.
Publishing Statistical Data on the Web
"... Abstract—Statistical data is one of the most important sources of information, relevant for large numbers of stakeholders in the governmental, scientific and business domains alike. In this article, we overview how statistical data can be managed on the Web. With OLAP2DataCube and CSV2DataCube we pr ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
(Show Context)
Abstract—Statistical data is one of the most important sources of information, relevant for large numbers of stakeholders in the governmental, scientific and business domains alike. In this article, we overview how statistical data can be managed on the Web. With OLAP2DataCube and CSV2DataCube we present two complementary approaches on how to extract and publish statistical data. We also discuss the linking, repair and the visualization of statistical data. As a comprehensive use case, we report on the extraction and publishing on the Web of statistical data describing 10 years of life in Brazil. I.
When to Reach for the Cloud: Using Parallel Hardware for Link Discovery
"... Abstract. With the ever-growing amount of RDF data available across the Web, the discovery of links between datasets and deduplication of resources within knowledge bases have become tasks of crucial importance. Over the last years, several link discovery approaches have been developed to tackle the ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
Abstract. With the ever-growing amount of RDF data available across the Web, the discovery of links between datasets and deduplication of resources within knowledge bases have become tasks of crucial importance. Over the last years, several link discovery approaches have been developed to tackle the runtime and complexity problems that are intrinsic to link discovery. Yet, so far, little attention has been paid to the management of hardware resources for the execution of link discovery tasks. This paper addresses this research gap by investigating the efficient use of hardware resources for link discovery. We implement the HR 3 approach for three different parallel processing paradigms including the use of GPUs and MapReduce platforms. We also perform a thorough performance comparison for these implementations. Our results show that certain tasks that appear to require cloud computing techniques can actually be accomplished using standard parallel hardware. Moreover, our evaluation provides break-even points that can serve as guidelines for deciding on when to use which hardware for link discovery.
Composition Methods for Link Discovery
"... Abstract: The Linked Open Data community publishes an increasing number of data sources on the so-called Data Web and interlinks them to support data integration applications. We investigate how the composition of existing links and mappings can help discovering new links and mappings between LOD so ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
Abstract: The Linked Open Data community publishes an increasing number of data sources on the so-called Data Web and interlinks them to support data integration applications. We investigate how the composition of existing links and mappings can help discovering new links and mappings between LOD sources. Often there will be many alternatives for composition so that the problem arises which paths can provide the best linking results with the least computation effort. We therefore investigate different methods to select and combine the most suitable mapping paths. We also propose an approach for selecting and composing individual links instead of entire mappings. We comparatively evaluate the methods on several real-world linking problems from the LOD cloud. The results show the high value of reusing and composing existing links as well as the high effectiveness of our methods. 1
A Machine Learning Approach for Instance Matching Based on Similarity Metrics
- Proceedings of 11th International Semantic Web Conference (ISWC
, 2012
"... Abstract. The Linking Open Data (LOD) project is an ongoing effort to construct a global data space, i.e. the Web of Data. One important part of this project is to establish owl:sameAs links among structured data sources. Such links indicate equivalent instances that refer to the same real-world ob ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
Abstract. The Linking Open Data (LOD) project is an ongoing effort to construct a global data space, i.e. the Web of Data. One important part of this project is to establish owl:sameAs links among structured data sources. Such links indicate equivalent instances that refer to the same real-world object. The problem of discovering owl:sameAs links between pairwise data sources is called instance matching. Most of the existing approaches addressing this problem rely on the quality of prior schema matching, which is not always good enough in the LOD scenario. In this paper, we propose a schema-independent instance-pair similarity metric based on several general descriptive features. We transform the instance matching problem to the binary classification problem and solve it by machine learning algorithms. Furthermore, we employ some transfer learning methods to utilize the existing owl:sameAs links in LOD to reduce the demand for labeled data. We carry out experiments on some datasets of OAEI2010. The results show that our method performs well on real-world LOD data and outperforms the participants of OAEI2010.
Utilizing Domain-Specific Keywords for Discovering Public SPARQL Endpoints: A Life-Sciences Use-Case
- In ACM SAC (SWA track
, 2014
"... The LOD cloud comprises of billions of facts covering hun-dreds of datasets. In accordance with the Linked Data prin-ciples, these datasets are connected by a variety of typed links, forming an interlinked “Web of Data”. The growing diversity of the Web of Data makes it more and more chal-lenging fo ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
The LOD cloud comprises of billions of facts covering hun-dreds of datasets. In accordance with the Linked Data prin-ciples, these datasets are connected by a variety of typed links, forming an interlinked “Web of Data”. The growing diversity of the Web of Data makes it more and more chal-lenging for publishers to find relevant datasets that could be linked to, particularly in specialist domain-specific settings. This paper thus proposes a baseline method to automati-cally identify a list of public SPARQL endpoints whose con-tent is deemed relevant to a local dataset based on queries generated from a local set of domain-specific keywords.
Link discovery with guaranteed reduction ratio in affine spaces with minkowski measures.
- In Proceedings of ISWC,
, 2012
"... ..."