@MISC{Wei_datacorpus:, author = {Lu Wei and Advisor Dr. Min-yen Kan and Lu Wei}, title = {Data Corpus: 1 CD By}, year = {} }
Abstract
Web pages often embed scripts for a variety of purposes, including advertising and dynamic interaction. Understanding embedded scripts and their purposes can help to interpret a web page or provide crucial information about it. I have developed a functionality-based categorization of JavaScript, the most widely used web page scripting language. I then treat understanding embedded scripts as a text categorization problem. I show how traditional information retrieval methods can be augmented with features distilled from domain knowledge of JavaScript and program analysis to improve classification performance. I perform experiments on the standard WT10G web page corpus and show that my techniques eliminate over 50% of the errors made by a standard text classification baseline.