I Tube, You Tube, Everybody Tubes: Analyzing the World’s Largest User Generated Content Video System
- In Proceedings of the 5th ACM/USENIX Internet Measurement Conference (IMC’07)
, 2007
Cited by 109 (5 self)
User Generated Content (UGC) is re-shaping the way people watch video and TV, with millions of video producers and consumers. In particular, UGC sites are creating new viewing patterns and social interactions, empowering users to be more creative, and developing new business opportunities. To better understand the impact of UGC systems, we have analyzed YouTube, the world’s largest UGC VoD system. Based on a large amount of data collected, we provide an in-depth study of YouTube and other similar UGC systems. In particular, we study the popularity life-cycle of videos, the intrinsic statistical properties of requests and their relationship with video age, and the level of content aliasing or of illegal content in the system. We also provide insights on the potential for more efficient UGC VoD systems (e.g. utilizing P2P techniques or making better use of caching). Finally, we discuss the opportunities to leverage the latent demand for niche videos that are not reached today due to information filtering effects or other system scarcity distortions. Overall, we believe that the results presented in this paper are crucial in understanding UGC systems and can provide valuable information to ISPs, site administrators, and content owners with major commercial and technical implications.
Robust De-anonymization of Large Sparse Datasets
, 2008
Cited by 81 (5 self)
We present a new class of statistical deanonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary’s background knowledge. We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
Finding high-quality content in social media with an application to community-based question answering
- In Proceedings of WSDM
, 2008
Cited by 54 (10 self)
The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions (social media sites) becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high quality content. As a test case, we focus on Yahoo! Answers, a large community question/answering portal that is particularly rich in the amount and types of content and social interactions available in it. We introduce a general classification framework for combining the evidence from different sources of information, that can be tuned automatically for a given social media type and quality definition. In particular, for the community question/answering domain, we show that our system is able to separate high-quality items from the rest with an accuracy close to that of humans.
CoScripter: Automating & Sharing How-To Knowledge in the Enterprise
Cited by 36 (6 self)
Modern enterprises are replete with numerous online processes. Many must be performed frequently and are tedious, while others are done less frequently yet are complex or hard to remember. We present interviews with knowledge workers that reveal a need for mechanisms to automate the execution of and to share knowledge about these processes. In response, we have developed the CoScripter system (formerly Koala [11]), a collaborative scripting environment for recording, automating, and sharing web-based processes. We have deployed CoScripter within a large corporation for more than 10 months. Through usage log analysis and interviews with users, we show that CoScripter has addressed many user automation and sharing needs, to the extent that more than 50 employees have voluntarily incorporated it into their work practice. We also present ways people have used CoScripter and general issues for tools that support automation and sharing of how-to knowledge.
Query suggestion using hitting time
- in Proc. of conf. on Inf. and Knowledge Manage. (CIKM’08)
Cited by 29 (2 self)
Generating alternative queries, also known as query suggestion, has long been proved useful to help a user explore and express his information need. In many scenarios, such suggestions can be generated from a large scale graph of queries and other accessory information, such as the clickthrough. However, how to generate suggestions while ensuring their semantic consistency with the original query remains a challenging problem. In this work, we propose a novel query suggestion algorithm based on ranking queries with the hitting time on a large scale bipartite graph. Without involvement of twisted heuristics or heavy tuning of parameters, this method clearly captures the semantic consistency between the suggested query and the original query. Empirical experiments on a large scale query log of a commercial search engine and a scientific literature collection show that hitting time is effective to generate semantically consistent query suggestions. The proposed algorithm and its variations can successfully boost long tail queries, accommodating personalized query suggestion, as well as finding related authors in research.
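The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea: rank candidate queries by their expected hitting time to the original query on a query-URL click graph. The function name `truncated_hitting_times`, the click-count input format, and the fixed iteration count are assumptions made for illustration, not the authors' algorithm.

```python
from collections import defaultdict


def truncated_hitting_times(clicks, target, steps=20):
    """Approximate expected hitting time from each query to `target`.

    clicks: {query: {url: click_count}} (hypothetical input format).
    Returns {query: hitting_time} for every query except the target.
    """
    # Build a weighted bipartite graph; nodes are tagged so that a query
    # and a URL with the same text cannot collide.
    weights = defaultdict(dict)
    for query, urls in clicks.items():
        for url, count in urls.items():
            weights[("q", query)][("u", url)] = count
            weights[("u", url)][("q", query)] = count

    goal = ("q", target)
    h = {node: 0.0 for node in weights}
    for _ in range(steps):
        nxt = {}
        for node, nbrs in weights.items():
            if node == goal:
                nxt[node] = 0.0  # the walk stops once it reaches the target
                continue
            total = sum(nbrs.values())
            # One step, plus the expected remaining time from each neighbour,
            # weighted by the click-proportional transition probability.
            nxt[node] = 1.0 + sum(w / total * h[nb] for nb, w in nbrs.items())
        h = nxt
    return {name: t for (kind, name), t in h.items()
            if kind == "q" and name != target}


clicks = {
    "jaguar car": {"u1": 5, "u2": 1},
    "jaguar price": {"u1": 4},
    "jaguar cat": {"u2": 3, "u3": 4},
}
times = truncated_hitting_times(clicks, "jaguar car")
suggestions = sorted(times, key=times.get)  # smallest hitting time first
```

In this toy graph "jaguar price" shares its only clicked URL with the target query, so a random walk reaches the target quickly and it ranks ahead of "jaguar cat".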
Identifying the influential bloggers in a community
- In WSDM ’08: Proceedings of the international conference on Web search and web data mining
, 2008
Cited by 27 (8 self)
Blogging has become a popular way for Web users to publish information on the Web. Bloggers write blog posts, share their likes and dislikes, voice their opinions, provide suggestions, report news, and form groups in the Blogosphere. Bloggers form virtual communities of similar interests. Activities in the Blogosphere affect the external world. One way to understand developments in the Blogosphere is to find influential blog sites. There are many non-influential blog sites, which form “the long tail”. Regardless of whether a blog site is influential, there are influential bloggers. Inspired by the high impact of the influentials in a physical community, we study a novel problem of identifying influential bloggers at a blog site. Active bloggers are not necessarily influential. Influential bloggers can impact fellow bloggers in various ways. In this paper, we discuss the challenges of identifying influential bloggers, investigate what constitutes influential bloggers, present a preliminary model attempting to quantify an influential blogger, and pave the way for building a robust model that allows for finding various types of the influentials. To illustrate these issues, we conduct experiments with data from a real-world blog site, evaluate multiple facets of the problem of identifying influential bloggers, and discuss unique challenges. We conclude with interesting findings and future work.
A Theory of Expressiveness in Mechanisms
, 2007
Cited by 15 (9 self)
A key trend in the world—especially in electronic commerce—is a demand for higher levels of expressiveness in the mechanisms that mediate interactions, such as the allocation of resources, matching of peers, and elicitation of opinions from large and diverse communities. Intuitively, one would think that this increase in expressiveness would lead to more efficient mechanisms (e.g., due to better matching of supply and demand). However, until now we have lacked a general way of characterizing the expressiveness of these mechanisms, analyzing how it impacts the actions taken by rational agents—and ultimately the outcome of the mechanism. In this technical report we introduce a general model of expressiveness for mechanisms. Our model is based on a new measure which we refer to as the maximum impact dimension. The measure captures the number of different ways that an agent can impact the outcome of a mechanism. We proceed to uncover a fundamental connection between this measure and the concept of shattering from computational learning theory. We also provide a way to determine an upper bound on the expected efficiency of any mechanism under its most efficient Nash equilibrium which, remarkably, depends only on the mechanism’s expressiveness. We show that for any setting and any prior over agent preferences, the
Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff?
, 2008
Cited by 13 (2 self)
How many pages are there on the Web? 5B? 20B? More? Less? Big bets on clusters in the clouds could be wiped out if a small cache of a few million urls could capture much of the value. Language modeling techniques are applied to MSN’s search logs to estimate entropy. The perplexity is surprisingly small: millions, not billions. Entropy is a powerful tool for sizing challenges and opportunities. How hard is search? How hard are query suggestion mechanisms like auto-complete? How much does personalization help? All these difficult questions can be answered by estimation of entropy from search logs. What is the potential opportunity for personalization? In this paper, we propose a new way to personalize search, personalization with backoff. If we have relevant data for a particular user, we should use it. But if we don’t, back off to larger and larger classes of similar users. As a proof of concept, we use the first few bytes of the IP address to define classes. The coefficients of each backoff class are estimated with an EM algorithm. Ideally, classes would be defined by market segments, demographics and surrogate variables such as time and geography.
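As a rough illustration of the quantities this abstract refers to, the sketch below computes a plug-in entropy estimate, the corresponding perplexity, and the conditional entropy of the clicked URL given a user class. The function names and the toy log are assumptions; the paper's language-modeling estimator and EM-fitted backoff weights are not reproduced here, and a plug-in estimate on a small sample tends to understate the true entropy.

```python
import math
from collections import Counter


def entropy_bits(items):
    """Plug-in (maximum-likelihood) entropy estimate, in bits."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def conditional_entropy_bits(pairs):
    """H(Y | X) for a list of (x, y) pairs, computed as H(X, Y) - H(X)."""
    return entropy_bits(pairs) - entropy_bits([x for x, _ in pairs])


# Toy log of (user_class, clicked_url); the class could be an IP prefix.
log = [("a", "u1"), ("a", "u1"), ("b", "u2"), ("b", "u3")]

h_url = entropy_bits([url for _, url in log])   # entropy of the URL alone
perplexity = 2 ** h_url                         # effective number of URLs
h_url_given_class = conditional_entropy_bits(log)
```

The gap between `h_url` and `h_url_given_class` is the number of bits that knowing the user class saves, which is one way to read the abstract's question of how much personalization helps.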
Information scraps: How and why information eludes our personal information management tools
- ACM Transactions on Information Systems
, 2008
Cited by 9 (4 self)
In this paper we investigate information scraps – personal information where content has been scribbled on Post-it notes, scrawled on the corners of sheets of paper, stuck in our pockets, sent in e-mail messages to ourselves, and stashed in miscellaneous digital text files. Information scraps encode information ranging from ideas and sketches to notes, reminders, shipment tracking numbers, driving directions, and even poetry. Although information scraps are ubiquitous, we have much still to learn about these loose forms of information practice. Why do we keep information scraps outside of our traditional PIM applications? What role do information scraps play in our overall information practice? How might PIM applications be better designed to accommodate and support information scraps’ creation, manipulation and retrieval? We pursued these questions by studying the information scrap practices of 27 knowledge workers at five organizations. Our observations shed light on information scraps’ content, form, media and location. From this data, we elaborate on the typical information scrap lifecycle, and identify common roles that information scraps play: temporary storage, archiving, work-in-progress, reminding, and management of unusual data. These roles suggest a set of unmet design needs in current PIM tools: lightweight entry, unconstrained content, flexible use and adaptability, visibility, and mobility.
Anatomy of the Long Tail: Ordinary People with Extraordinary Tastes
- in WSDM
, 2010
Cited by 9 (1 self)
The success of “infinite-inventory” retailers such as Amazon.com and Netflix has been ascribed to a “long tail” phenomenon. To wit, while the majority of their inventory is not in high demand, in aggregate these “worst sellers,” unavailable at limited-inventory competitors, generate a significant fraction of total revenue. The long tail phenomenon, however, is in principle consistent with two fundamentally different theories. The first, and more popular hypothesis, is that a majority of consumers consistently follow the crowds and only a minority have any interest in niche content; the second hypothesis is that everyone is a bit eccentric, consuming both popular and specialty products. Based on examining extensive data on user preferences for movies, music, Web search, and Web browsing, we find overwhelming support for the latter theory. However, the observed eccentricity is much less than what is predicted by a fully random model whereby every consumer makes his product choices independently and proportional to product popularity; so consumers do indeed exhibit at least some a priori propensity toward either the popular or the exotic. Our findings thus suggest an additional factor in the success of infinite-inventory retailers, namely, that tail availability may boost head sales by offering consumers the convenience of “one-stop shopping” for both their mainstream and niche interests. This hypothesis is further supported by our theoretical analysis that presents a simple model in which shared inventory stores, such as Amazon Marketplace, gain a clear advantage by satisfying tail demand, helping to explain the emergence and increasing popularity of such retail arrangements. Hence, we believe that the return-on-investment (ROI) of niche products goes beyond direct revenue, extending to second-order gains associated with increased consumer satisfaction and repeat patronage.
More generally, our findings call into question the conventional wisdom that specialty products only appeal to a minority of consumers.
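One simple way to tell the two hypotheses apart, sketched here under assumptions of our own rather than as the paper's measure, is to compute for each user the share of consumption that falls outside the most popular items. The function name `tail_fraction_by_user`, the fixed-size definition of the "head", and the toy histories are all hypothetical.

```python
from collections import Counter


def tail_fraction_by_user(histories, head_size):
    """Share of each user's consumption that falls outside the head.

    histories: {user: [item, ...]} (hypothetical input format).
    The head is taken to be the `head_size` most consumed items overall.
    """
    popularity = Counter(item for items in histories.values() for item in items)
    head = {item for item, _ in popularity.most_common(head_size)}
    return {
        user: sum(1 for item in items if item not in head) / len(items)
        for user, items in histories.items()
        if items  # skip users with no recorded consumption
    }


histories = {
    "ann": ["hit", "hit2", "obscure1"],
    "bob": ["hit", "hit2"],
    "cy": ["hit", "obscure2"],
}
fractions = tail_fraction_by_user(histories, head_size=2)
```

If only a minority of users had non-zero tail fractions, the data would favour the first hypothesis; tail fractions spread across most users would favour the second.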

