Power Iteration Clustering
Cited by 4 (3 self)
We show that the power iteration, typically used to approximate the dominant eigenvector of a matrix, can be applied to a normalized affinity matrix to create a one-dimensional embedding of the underlying data. This embedding is then used, as in spectral clustering, to cluster the data via k-means. We demonstrate this method’s effectiveness and scalability on several synthetic and real datasets, and conclude that to find a meaningful low-dimensional embedding for clustering, it is not necessary to find any eigenvectors; a linear combination of the top eigenvectors is enough.
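As a rough illustration of the idea in this abstract, the following Python sketch runs a truncated power iteration on a row-normalized affinity matrix and clusters the resulting one-dimensional embedding with a small 1-D k-means. The function names, the Gaussian affinity, the fixed iteration count, the degree-based starting vector, and the toy data are all assumptions made for this example; the abstract does not specify them.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Pairwise Gaussian affinities for 1-D points, with a zero diagonal (assumed kernel)."""
    n = len(points)
    return [
        [0.0 if i == j else math.exp(-((points[i] - points[j]) ** 2) / (2 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]


def power_iteration_embedding(affinity, num_iters=10):
    """Run a few power-iteration steps on the row-normalized affinity matrix.

    Stopping early (a fixed, small num_iters is an assumption here) keeps the
    vector from collapsing to the constant dominant eigenvector.
    """
    n = len(affinity)
    degrees = [sum(row) for row in affinity]
    # Row-normalize so that every row sums to one.
    walk = [[a / degrees[i] for a in affinity[i]] for i in range(n)]
    # Assumed starting vector: degrees scaled to sum to one.
    total = sum(degrees)
    v = [d / total for d in degrees]
    for _ in range(num_iters):
        v_next = [sum(walk[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v_next)
        v = [x / norm for x in v_next]
    return v


def kmeans_1d(values, k=2, num_iters=20):
    """Plain Lloyd iterations on scalars, with centers started evenly between min and max."""
    lo, hi = min(values), max(values)
    if k == 1:
        centers = [sum(values) / len(values)]
    else:
        centers = [lo + (hi - lo) * c / (k - 1) for c in range(k)]
    labels = [0] * len(values)
    for _ in range(num_iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in values]
        for c in range(k):
            members = [x for x, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


# Toy data: two well-separated groups of different sizes on a line.
points = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2]
embedding = power_iteration_embedding(gaussian_affinity(points))
labels = kmeans_1d(embedding, k=2)
print(labels)
```

The two groups are deliberately given different sizes: with a degree-based start, two identical groups would begin with identical values and the embedding could not tell them apart, so a random starting vector may be a better choice in that case.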
Web (2.0) Mining: Analyzing Social Media
Cited by 3 (0 self)
Social media systems such as blogs, photo and link sharing sites, wikis and on-line forums are estimated to produce up to one third of new Web content. One thing that sets these "Web 2.0" sites apart from traditional Web pages and resources is that they are intertwined with other forms of networked data. Their standard hyperlinks are enriched by social networks, comments, trackbacks, advertisements, tags, RDF data and metadata. We describe recent work on building systems that analyse these emerging social media systems to recognize spam blogs, find opinions on topics, identify communities of interest, derive trust relationships, and detect influential bloggers.
Semi-Supervised Classification of Network Data Using Very Few Labels
, 2009
Cited by 3 (2 self)
The goal of semi-supervised learning methods is to reduce the amount of labeled training data required by learning from both labeled and unlabeled instances. We make contributions towards this goal along several dimensions. Macskassy and Provost [13] proposed the weighted-vote relational neighbor classifier (wvRN) as a simple yet solid baseline for semi-supervised learning on network data. It is shown to be essentially the same as the Gaussian-field classifier proposed by Zhu et al. [22] and proves to be very effective on many benchmark network datasets. First, we describe another simple and intuitive semi-supervised learning method based on random graph walks that outperforms wvRN by a large margin on several benchmark datasets when very few labels are available. Second, we show that using authoritative instances as training seeds (instances that arguably cost much less to label) dramatically reduces the amount of labeled data required to achieve the same classification accuracy. For some existing state-of-the-art semi-supervised learning methods the labeled data needed is reduced by a factor of 50. Third, we offer insights as to why learning methods based on random graph walks are able to exploit the unlabeled data more fully than previous methods. Based on these observations, we recommend the proposed method as a strong baseline for future research on semi-supervised classification of network data.
User Evaluation of a System for Classifying and Displaying Political Viewpoints of
Cited by 2 (0 self)
This paper presents a Web-based user evaluation of a system for classifying and presenting political viewpoints of blog posts. The system is based on a classification model trained using a supervised learning algorithm, and the data set consists of recent posts from blogs that self-identify as having a liberal or a conservative viewpoint. We first discuss the classification process. Then, with a prototype system for retrieving and classifying political blogs, we look at how the classification results can be presented to users in order to improve the blog search experience. We describe an online user study with 15 users, which shows that users preferred the search results page that clearly shows the political viewpoint classification.
Generative Model To Construct Blog and Post Networks
- In Blogosphere. Master’s thesis
, 2007
Cited by 1 (0 self)
Web graphs have been very useful in the structural and statistical analysis of the web. Various models have been proposed to simulate web graphs that generate degree distributions similar to the web. Real-world blog networks share many properties of web graphs. But the dynamic nature of the blogosphere, and the link structure that evolves from blog readership and social interactions, is not well expressed by the existing models. In this research we propose a model for a blogger to construct blog graphs. We combine the existing preferential attachment and random attachment models to generate blog graphs, which are a type of scale-free network. The blogger is modeled using read, write, and idle states and a finite read memory. The combination of these techniques helps in the evolution of time-stamped blog-blog and post-post networks through citations within the blog-blog network. Other parameters, like the growth function and the randomness in reading and writing posts, help in the formation of graphs with different structural properties. We empirically show that these simulated blog graphs exhibit properties similar to real-world blog networks in their degree distributions, degree correlations and clustering coefficients. We believe that this model will help researchers to evaluate and analyze the properties of the blogosphere and facilitate the testing of new algorithms.
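The combination of preferential and random attachment mentioned in this abstract can be sketched in Python as follows. This is only a minimal illustration of mixing the two attachment rules: the function name, the mixing probability `p_preferential`, the number of links per node, and the two-node starting graph are assumptions, and the blogger states (read, write, idle), the read memory, and the time stamps described in the abstract are not modeled.

```python
import random


def mixed_attachment_graph(num_nodes, links_per_node=2, p_preferential=0.7, seed=0):
    """Grow a directed graph in which each new node links to earlier nodes.

    Each target is picked preferentially (proportional to current degree) with
    probability p_preferential, and uniformly at random otherwise.
    All parameter names and defaults are hypothetical.
    """
    if num_nodes < 2:
        raise ValueError("num_nodes must be at least 2")
    rng = random.Random(seed)
    edges = [(1, 0)]       # start from two connected nodes
    endpoints = [1, 0]     # every node appears once per incident edge
    for new_node in range(2, num_nodes):
        wanted = min(links_per_node, new_node)
        targets = set()
        while len(targets) < wanted:
            if rng.random() < p_preferential:
                # Sampling from the endpoint list is proportional to degree.
                targets.add(rng.choice(endpoints))
            else:
                # Uniform choice among all earlier nodes.
                targets.add(rng.randrange(new_node))
        for target in sorted(targets):
            edges.append((new_node, target))
            endpoints.extend((new_node, target))
    return edges


edges = mixed_attachment_graph(50, links_per_node=2, p_preferential=0.7, seed=1)

# In-degree of each node, which could be inspected to study the degree distribution.
in_degree = [0] * 50
for _, target in edges:
    in_degree[target] += 1
print(sorted(in_degree, reverse=True)[:5])
```

Setting `p_preferential` closer to 1 makes early, well-linked nodes attract more links, while values closer to 0 spread links more evenly across earlier nodes.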
The MultiRank Bootstrap Algorithm: Semi-Supervised Political Blog Classification and Ranking Using Semi-Supervised Link Classification
Cited by 1 (0 self)
We present a new semi-supervised learning algorithm for classifying political blogs in a blog network and ranking them within predicted classes. We test our algorithm on two datasets and achieve classification accuracy of 81.9% and 84.6% using only 2 seed blogs.
Accurate Semi-supervised Classification for Graph Data
Most machine learning algorithms require labeled instances as training data; however, not all instances are equally easy to obtain labels for. For example, the best-known papers and/or websites would be easier for a domain expert to label. We propose a new PageRank-style method for performing semi-supervised learning by propagating labels from labeled seed instances to unlabeled instances in a graph using only the link structure. We show that on four real-world datasets, the proposed method, using only a small number of seed instances, gives highly accurate classification results, outperforms simple content-only baselines, and is competitive with state-of-the-art fully supervised algorithms that use both the content and the link structure. In addition, the method is efficient for large datasets and works well given high-PageRank seeds.
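One way to picture PageRank-style label propagation from seeds is the Python sketch below. It uses a random walk with restart (personalized PageRank) per class as a stand-in, since the abstract does not give the exact update rule; the function names, the damping value of 0.85, the fixed iteration count, and the toy graph are assumptions for illustration only.

```python
def random_walk_with_restart(adjacency, seeds, damping=0.85, num_iters=50):
    """Score every node by a random walk that restarts at the seed nodes.

    adjacency is a list of neighbor lists. Mass arriving at a node without
    out-links is simply dropped in this sketch (an assumed simplification).
    """
    n = len(adjacency)
    restart = [1.0 / len(seeds) if node in seeds else 0.0 for node in range(n)]
    score = restart[:]
    for _ in range(num_iters):
        next_score = [(1.0 - damping) * r for r in restart]
        for node in range(n):
            neighbors = adjacency[node]
            if not neighbors:
                continue
            share = damping * score[node] / len(neighbors)
            for neighbor in neighbors:
                next_score[neighbor] += share
        score = next_score
    return score


def classify_by_walk(adjacency, seeds_by_class):
    """Assign each node to the class whose seed-restarted walk scores it highest."""
    scores = {
        label: random_walk_with_restart(adjacency, seeds)
        for label, seeds in seeds_by_class.items()
    }
    return [
        max(scores, key=lambda label: scores[label][node])
        for node in range(len(adjacency))
    ]


# Toy undirected graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
adjacency = [
    [1, 2],
    [0, 2],
    [0, 1, 3],
    [2, 4, 5],
    [3, 5],
    [3, 4],
]
predicted = classify_by_walk(adjacency, {"a": {0}, "b": {5}})
print(predicted)
```

Only the link structure and one seed per class are used here, which mirrors the setting described in the abstract, though the scoring details may differ from the authors' method.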
www.lti.cs.cmu.edu © 2008, Frank Lin and William W. Cohen
The MultiRank Bootstrap Algorithm: Semi-Supervised Political Blog Classification and Ranking Using
We present a new, intuitive semi-supervised learning algorithm for classifying political blogs in a blog network and ranking them within classes. In the algorithm, each link is assigned a label, as is each blog. Using only the link structure as input and by exploiting the linking properties found in political blog communities, we bootstrap the classification of links and blogs and blog rankings from a set of known seed blogs. We test our algorithm on two datasets and achieve blog classification accuracy of 81.9% in a network of 404 blogs and 84.6% in a network of 1222 blogs using only 2 seed blogs in each case. We analyze the results of our algorithm and show that the misclassifications tend to be less important or less authoritative blogs.
Temporal Issue Trend Identifications in Blogs
Many blog posts deal with current issues, so much attention has been paid to identifying topic trends in blogs. This paper suggests a new metric for selecting topic words. We empirically tested the accuracy and the performance of the metric with a massive blog corpus. First, we grouped blog sites according to their indegree influence. Second, we ran the metric on the blog posts of each group. The test was encouraging because the metric identified key issues matching the headlines of the New York Times when applied to the top indegree blog group. We expect that this metric and the source grouping methods will be developed into a new topic analysis framework for large blog corpora. Keywords: Blog, Social Media, Issue Identification

