A Framework for the Evaluation of Session Reconstruction Heuristics in Web Usage Analysis
- INFORMS Journal on Computing
, 2003
Cited by 62 (7 self)
Web usage mining has become the subject of intensive research, as its ... The first experiment concerned a specific KDD application and has shown the sensitivity of the heuristics to particularities of the site's structure and traffic. The second experiment is not bound to a specific application but rather compares the performance of the heuristics for different measures and thus for different application types. Our results show that there is no single best heuristic, but our measures help the analyst in the selection of the heuristic best suited for the application at hand.
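The abstract above does not name the heuristics it evaluates. As a rough illustration of what a session reconstruction heuristic can look like, the sketch below splits one visitor's requests into sessions whenever the gap between consecutive requests exceeds a threshold. The function name `split_sessions`, the 30-minute default, and the sample log are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple


def split_sessions(requests: List[Tuple[float, str]],
                   max_gap: float = 1800.0) -> List[List[str]]:
    """Split one visitor's (timestamp_seconds, url) requests into sessions.

    A new session starts whenever the time since the previous request
    exceeds max_gap seconds (hypothetical default: 30 minutes).
    """
    sessions: List[List[str]] = []
    last_time = None
    for timestamp, url in sorted(requests):
        if last_time is None or timestamp - last_time > max_gap:
            sessions.append([])  # gap too large: open a new session
        sessions[-1].append(url)
        last_time = timestamp
    return sessions


# Hypothetical log for a single visitor.
log = [(0, "/home"), (60, "/products"), (4000, "/home"), (4100, "/cart")]
sessions = split_sessions(log)
```

Changing `max_gap` changes how many sessions are reconstructed, which is one way such a heuristic can be sensitive to a site's traffic.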
Active Feature-Value Acquisition for Classifier Induction
, 2004
Cited by 18 (10 self)
Many induction problems, such as on-line customer profiling, include missing data that can be acquired at a cost, such as incomplete customer information that can be filled in by an intermediary. For building accurate predictive models, acquiring complete information for all instances is often prohibitively expensive or unnecessary. Randomly selecting instances for feature acquisition allows a representative sampling, but does not incorporate estimations of the value of acquisition. Active feature acquisition aims at reducing the cost of achieving a desired model accuracy by identifying instances for which complete information is most informative to obtain. We present approaches in which instances are selected for feature acquisition based on the current model's ability to predict accurately and the model's confidence in its prediction. Experimental results on several real-world data sets demonstrate that these approaches can induce accurate models using substantially fewer feature acquisitions, and suggest promising directions for improvements.
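A minimal sketch of the selection idea described above, assuming a binary classifier that outputs a probability for class 1. The ranking rule used here (misclassified instances first, then the least confident ones) is one plausible reading of "ability to predict accurately" and "confidence", not the authors' exact scoring function. The name `rank_for_acquisition` and the sample values are hypothetical.

```python
from typing import List, Sequence


def rank_for_acquisition(probs: Sequence[float],
                         labels: Sequence[int],
                         budget: int) -> List[int]:
    """Return indices of the `budget` instances to acquire features for.

    probs[i] is the current model's estimated P(class 1) for instance i,
    labels[i] is the known class (0 or 1).
    """
    def key(i: int):
        predicted = 1 if probs[i] >= 0.5 else 0
        correct = predicted == labels[i]
        confidence = abs(probs[i] - 0.5)  # distance from the decision boundary
        # False sorts before True, so misclassified instances come first,
        # followed by correctly classified ones in order of rising confidence.
        return (correct, confidence)

    return sorted(range(len(probs)), key=key)[:budget]


# Hypothetical model outputs and true labels for four instances.
probs = [0.95, 0.55, 0.20, 0.48]
labels = [1, 1, 1, 0]
chosen = rank_for_acquisition(probs, labels, budget=2)
```

Here instance 2 is selected first because the model misclassifies it, and instance 3 second because it is the least confident correct prediction.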
On Active Learning for Data Acquisition
, 2002
Cited by 14 (0 self)
Many applications are characterized by having naturally incomplete data on customers -- where data on only some fixed set of local variables is gathered. However, having a more complete picture can help build better models. The naive solution to this problem -- acquiring complete data for all customers -- is often impractical due to the costs of doing so. A possible alternative is to acquire complete data for "some" customers and to use this to improve the models built. The data acquisition problem is determining how many, and which, customers to acquire additional data from. In this paper we suggest using active learning based approaches for the data acquisition problem. In particular, we present initial methods for data acquisition and evaluate these methods experimentally on web usage data and UCI datasets. Results show that the methods perform well and indicate that active learning based methods for data acquisition can be a promising area for data mining research.
Context-Sensitive Modeling of Web-Surfing Behaviour using Concept Trees
- in Proceedings of the 5th WEBKDD Workshop
, 2003
Cited by 14 (0 self)
Early approaches to mathematically abstracting web-surfing behavior were largely based on first-order Markov models. Most humans, however, do not surf in a "memoryless" fashion; rather, they are guided by their time-dependent situational context and associated information needs. This belief is corroborated by the non-exponential revisit times observed in many site-centric weblogs. In this paper, we propose a general framework for modeling users whose surfing behavior is dynamically governed by their current topic of interest. This allows a modeled surfer to behave differently on the same page, depending on his situational context. The proposed methodology involves mapping each visited page to a topic or concept, (conceptually) imposing a tree hierarchy on these topics, and then estimating the parameters of a semi-Markov process defined on this tree based on the observed transitions among the underlying visited pages. The semi-Markovian assumption imparts additional flexibility by allowing for non-exponential state re-visit times, and the concept hierarchy provides a nice way of capturing context and user intent. Our approach is computationally much less demanding than the alternative approach of using higher-order Markov models for capturing history-sensitive surfing behavior. Several practical applications are described. The application of better predicting which outlink a surfer may take is illustrated using web-log data from a rich community portal, www.sulekha.com, as an example, though the focus of the paper is on forming a plausible generative model rather than solving any specific task.
M.: Mining Client-Side Activity for Personalization
- In WECWIS
, 2002
Cited by 12 (0 self)
“Garbage in, garbage out” is a well-known phrase in computer analysis, and one that comes to mind when mining Web data to draw conclusions about Web users. The challenge is that data analysts wish to infer patterns of client-side behavior from server-side data. However, because only a fraction of the user’s actions ever reach the Web server, analysts must rely on incomplete data. In this paper, we propose a client-side monitoring system that is unobtrusive and supports flexible data collection. Moreover, the proposed framework encompasses client-side applications beyond the Web browser. Expanding monitoring beyond the browser to incorporate standard office productivity tools enables analysts to derive a much richer and more accurate picture of user behavior on the Web.
Handling missing values when applying classification models
- Journal of Machine Learning Research. Forthcoming
Cited by 11 (3 self)
Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods, namely predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models, for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments.
Data Acquisition and Cost-effective Predictive Modeling: Targeting Offers for Electronic Commerce
Cited by 6 (2 self)
Electronic commerce is revolutionizing the way we think about data modeling, by making it possible to integrate the processes of (costly) data acquisition and model induction. The opportunity for improving modeling through costly data acquisition presents itself for a diverse set of electronic commerce modeling tasks, from personalization to customer lifetime value modeling; we illustrate with the running example of choosing offers to display to web-site visitors, which captures important aspects in a familiar setting. Considering data acquisition costs explicitly can allow the building of predictive models at significantly lower costs, and a modeler may be able to improve performance via new sources of information that previously were too expensive to consider. However, existing techniques for integrating modeling and data acquisition cannot deal with the rich environment that electronic commerce presents. We discuss several possible data acquisition settings, the challenges involved in the integration with modeling, and various research areas that may supply parts of an ultimate solution. We also present and demonstrate briefly a unified framework within which one can integrate acquisitions of different types, with any cost structure and any predictive modeling objective.
On Using Page Cooccurrences for Computing Clickstream Similarity
- In Cimino JJ, editor, Proc. 1996 AMIA Annual Fall Symposium 1996;269--273
, 2003
On the Existence and Significance of Data Preprocessing Biases in Web-Usage Mining
- INFORMS Journal on Computing
, 2003
Cited by 5 (1 self)
The literature on web-usage mining is replete with data preprocessing techniques, which correspond to many closely related problem formulations. We survey data preprocessing techniques for session-level pattern discovery and compare three of these techniques in the context of understanding session-level purchase behavior on the web. Using real data collected from 20,000 users’ browsing behavior over a period of six months, four different models (linear regressions, logistic regressions, neural networks, and classification trees) are built based on data preprocessed using three different techniques. The results demonstrate that the three approaches result in radically different conclusions and provide initial evidence that a data preprocessing bias exists, the effect of which can be significant.
M.: Addressing users’ privacy concerns for improving personalization quality: Towards an integration of user studies and algorithm evaluation
- In: Intelligent Techniques in Web Personalisation. LNCS (LNAI
, 2005
Cited by 5 (1 self)
Numerous studies have demonstrated the effectiveness of personalization using quality criteria both from machine learning / data mining and from user studies. However, a site requires more than a high-performance personalization algorithm: it needs to convince its users to input the data needed by the algorithm. Today’s Web users are becoming increasingly privacy-conscious and less willing to disclose personal data. How can the advantages of personalization (and hence, of disclosure) be communicated effectively, and how can the success of such strategies be measured in terms of improved personalization quality? In this paper, we argue for a tighter integration of the HCI and computational issues involved in these questions. We first outline the problems for personalization that arise from the combination of users’ privacy concerns and sites’ current policies of dealing with privacy issues. We then describe the results of an experiment that investigated the effects of changes to a site’s interface on users’ willingness to disclose data for personalization. This is followed by an overview of studies of the sensitivity of mining algorithms to

