Saturday, February 9, 2008

Web Mining

Abstract:
Data mining is a set of automated procedures used to find previously unknown patterns and relationships in data. These patterns and relationships, once extracted, can be used to make valid predictions about customer behavior. Data mining is generally used for four main tasks: (1) to improve the process of acquiring new customers and retaining existing ones; (2) to reduce fraud; (3) to identify internal wastefulness in operations and deal with it; and (4) to chart unexplored areas of the Internet.
An important and active area of current research combines data mining with the World Wide Web. A natural combination of the two areas, sometimes referred to as Web Mining, has been the focus of several recent research projects and papers. The term Web Mining has been used in two distinct ways. The first describes the process of information or resource discovery from millions of sources across the World Wide Web, known as Web Content Mining. The second, called Web Usage Mining, is the process of mining Web access logs or other user information to discover user browsing and access patterns on one or more Web sites. In this paper I define Web Mining and, in particular, present an overview of the various research issues, techniques, and development efforts in Web Content Mining and Web Usage Mining.

Introduction:

The present age has been dubbed the Information Age. There is an ever-expanding amount of information “out there”. Moreover, the evolution of the Internet into the Global Information Infrastructure, coupled with the immense popularity of the Web, has enabled the ordinary citizen to become not just a consumer of information, but also its disseminator. Given this vast and ever-growing amount of information, how does the average user quickly find what he or she is looking for? This is a task with which present-day search engines do not seem to help much.

One possible approach is to personalize the Web space: create a system that responds to user queries by aggregating information from several sources in a manner that depends on who the user is. As a trivial example, an Indian user querying “Software Companies in AP” is probably better served by URLs pointing to Hyderabad, whereas someone in America should get URLs pointing to Silicon Valley.

Existing commercial systems perform some minimal personalization based on declarative information directly provided by the user, such as a zip code, keywords describing their interests, specific URLs, or even a particular piece of information they are interested in (e.g., the price of a particular stock). Engineers are creating systems that (semi-)automatically tailor the content delivered to the user from a Web site. They do so by mining the Web: both its contents and the user's interactions with it.

Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. When viewed in data mining terms, Web mining can be said to involve three operations of interest (the association operation is illustrated in the sketch after this list):
(1) Clustering (finding natural groupings of users, pages, etc.)
(2) Associations (which URLs tend to be requested together)
(3) Sequential analysis (the order in which URLs tend to be accessed)
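
To make the association operation concrete, here is a minimal sketch in Python of counting which URLs tend to be requested together across user sessions. The session data and the support threshold are made-up illustrations, not the output of any particular system; a full miner such as Apriori would also extend frequent pairs to larger itemsets.

from itertools import combinations
from collections import Counter

# Hypothetical sessions: the set of URLs each user requested in one visit.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/home", "/products", "/cart"},
]

MIN_SUPPORT = 0.5  # a pair must appear in at least half of all sessions

pair_counts = Counter()
for s in sessions:
    for pair in combinations(sorted(s), 2):
        pair_counts[pair] += 1

for pair, count in sorted(pair_counts.items()):
    support = count / len(sessions)
    if support >= MIN_SUPPORT:
        print(pair, "support =", support)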

As in most real-world problems, the clusters and associations in Web mining do not have crisp boundaries and often overlap considerably. In addition, bad exemplars (outliers) and incomplete data can easily occur in the data set, due to a wide variety of reasons inherent to Web browsing and logging. Thus, Web mining and personalization require modeling an unknown number of overlapping sets in the presence of significant noise and outliers (i.e., bad exemplars). Moreover, the data sets in Web mining are extremely large. Research is ongoing to develop scalable, robust fuzzy techniques to model noisy data sets containing an unknown number of overlapping categories; the sketch below shows the basic fuzzy clustering idea such techniques build on.
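
As a sketch of what fuzzy clustering means here, the following is a bare-bones fuzzy c-means pass over one-dimensional toy data (say, pages viewed per session). The data values, cluster count, and fuzzifier are illustrative assumptions; note that this plain version is still pulled around by the outlier, which is precisely what the robust variants mentioned above try to fix.

import random

# Toy 1-D data with two overlapping groups and one outlier (40).
data = [2, 3, 4, 5, 10, 11, 12, 13, 40]
C, M, STEPS = 2, 2.0, 20  # cluster count, fuzzifier, iterations

random.seed(0)
centers = random.sample(data, C)
for _ in range(STEPS):
    # Soft membership of each point in each cluster (rows sum to 1).
    u = []
    for x in data:
        d = [abs(x - c) + 1e-9 for c in centers]
        u.append([1.0 / sum((d[i] / d[j]) ** (2 / (M - 1)) for j in range(C))
                  for i in range(C)])
    # Recompute each center as the membership-weighted mean of the data.
    centers = [sum((u[n][i] ** M) * data[n] for n in range(len(data))) /
               sum(u[n][i] ** M for n in range(len(data)))
               for i in range(C)]

print("cluster centers:", centers)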

Web Mining

Basically, two types of Web mining techniques are available:
(a) Web Content Mining
(b) Web Usage Mining

(a) Web Content Mining:

The heterogeneity and lack of structure that permeates much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, makes automated discovery, organization, and management of Web-based information difficult. Traditional search and indexing tools for the Internet and the World Wide Web, such as Lycos and AltaVista, provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter, or interpret documents.

In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent Web agents, and to extend database and data mining techniques to provide a higher level of organization for the semi-structured data available on the Web. I summarize these efforts below:
(I) Database-based Web content mining
(II) Agent-based Web content mining
(I) Database Approach:
Database approaches to Web mining have generally focused on techniques for integrating and organizing the heterogeneous and semi-structured data on the Web into more structured, high-level collections of resources, such as relational databases, and on using standard database querying mechanisms and data mining techniques to access and analyze this information.

(i). Multilevel Databases
Several researchers have proposed a multilevel database approach to organizing Web-based information. Most of these proposals have the lowest level of the database contain primitive semi-structured information stored in various Web repositories, such as hypertext documents. At the higher level(s), metadata or generalizations are extracted from the lower levels and organized in structured collections such as relational or object-oriented databases. For example, a multi-layered database is sometimes used, in which each layer is obtained via generalization and transformation operations performed on the lower layers. Another proposal is the creation and maintenance of meta-databases at each information-providing domain, together with a global schema over those meta-databases.
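
One step of such generalization might look like the following sketch, which distills a raw hypertext document (the lowest layer) into a structured record of its title and outgoing links, the kind of metadata a higher layer would store. The class and the sample page are illustrative, not any published system's code.

from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    # Pulls the title and outgoing links out of one hypertext document.
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

page = ("<html><head><title>Example</title></head>"
        "<body><a href='/a.html'>A</a> <a href='/b.html'>B</a></body></html>")
extractor = MetadataExtractor()
extractor.feed(page)
print({"title": extractor.title, "links": extractor.links})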

(ii). Web Query Systems
There have been many Web-based query systems and languages developed recently that attempt to utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing to accommodate the types of queries used in World Wide Web searches. A few examples of such Web-based query systems are:
W3QL: Combines structured queries, based on the organization of hypertext documents, and content queries, based on information-retrieval techniques.

WEBLOG: A logic-based query language for restructuring information extracted from Web information sources.

(II) Agent-Based Approach:
The agent-based approach to Web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and organize Web-based information. Generally, agent-based Web mining systems can be placed in one of the following three categories:

(i). Intelligent Search Agents:
Several intelligent Web agents have been developed that search for relevant information using characteristics of a particular domain (and possibly a user profile) to organize and interpret the discovered information. For example, agents such as Harvest, FAQ-Finder, Information Manifold, OCCAM, and ParaSite rely either on pre-specified, domain-specific information about particular types of documents, or on hard-coded models of the information sources, to retrieve and interpret documents. Other agents, such as ShopBot and ILA (Internet Learning Agent), attempt to interact with and learn the structure of unfamiliar information sources. ShopBot retrieves product information from a variety of vendor sites using only general information about the product domain. ILA, on the other hand, learns models of various information sources and translates these into its own internal concept hierarchy.

(ii). Information Filtering/Categorization:
A number of Web agents use various information retrieval techniques and the characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize those documents. For example, HyPursuit uses semantic information embedded in link structures, as well as document content, to create cluster hierarchies of hypertext documents and structure an information space. BO (Bookmark Organizer) combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information.
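
As a rough sketch of the clustering idea behind such tools (not HyPursuit's or BO's actual algorithms, and ignoring the link structure they also exploit), the following agglomerates made-up documents by the overlap of their term sets until two clusters remain.

docs = {
    "d1": {"web", "mining", "data"},
    "d2": {"web", "usage", "log"},
    "d3": {"data", "mining", "rules"},
    "d4": {"usage", "log", "server"},
}

def jaccard(a, b):
    # Term-set overlap: 1.0 for identical sets, 0.0 for disjoint ones.
    return len(a & b) / len(a | b)

clusters = [({name}, terms) for name, terms in docs.items()]
while len(clusters) > 2:
    # Find and merge the most similar pair of clusters.
    i, j = max(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: jaccard(clusters[ij[0]][1], clusters[ij[1]][1]))
    names = clusters[i][0] | clusters[j][0]
    terms = clusters[i][1] | clusters[j][1]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
    clusters.append((names, terms))

for names, terms in clusters:
    print(sorted(names), "grouped around terms", sorted(terms))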

(iii). Personalized Web Agents:
Another category of Web agents includes those that obtain or learn user preferences and discover Web information sources corresponding to these preferences, and possibly to those of other individuals with similar interests (using collaborative filtering). A few recent examples of such agents include WebWatcher, PAINT, Syskill & Webert, and others. For example, Syskill & Webert is a system that utilizes a user profile and learns to rate Web pages of interest using a Bayesian classifier.
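
As a sketch of the idea behind such Bayesian page rating (not Syskill & Webert's actual code), here is a tiny naive Bayes classifier that labels a new page "hot" or "cold" from its words; the training pages and class names are made up.

import math
from collections import Counter

# Hypothetical training data: words of pages the user rated hot or cold.
rated = {
    "hot":  ["python data mining tutorial", "web usage mining paper"],
    "cold": ["celebrity gossip news", "sports scores today"],
}

word_counts = {c: Counter(w for page in pages for w in page.split())
               for c, pages in rated.items()}
vocab = set(w for counts in word_counts.values() for w in counts)

def score(cls, words):
    # log P(class) + sum of log P(word | class), with add-one smoothing.
    total = sum(word_counts[cls].values())
    prior = math.log(len(rated[cls]) / sum(len(p) for p in rated.values()))
    return prior + sum(
        math.log((word_counts[cls][w] + 1) / (total + len(vocab)))
        for w in words)

new_page = "data mining with web logs".split()
print("predicted rating:", max(rated, key=lambda c: score(c, new_page)))
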
(b) Web Usage Mining:
Web usage mining is the type of Web mining activity that involves the automatic discovery of user access patterns from one or more Web servers. As more organizations rely on the Internet and the World Wide Web to conduct business, the traditional strategies and techniques for market analysis need to be revisited in this context. Organizations often generate and collect large volumes of data in their daily operations. Most of this information is generated automatically by Web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring page for each page reference, and user registration or survey data gathered via tools such as CGI scripts.
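
A server access log entry typically follows the NCSA Common Log Format; the sketch below parses one such (made-up) line into its fields, which is the usual first step before any usage mining.

import re

LOG_LINE = ('192.168.1.10 - - [09/Feb/2008:10:34:12 -0500] '
            '"GET /products/index.html HTTP/1.0" 200 2326')

# host, identity, user, [timestamp], "method url protocol", status, bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)')

m = LOG_PATTERN.match(LOG_LINE)
if m:
    print(m.group("host"), m.group("time"), m.group("url"), m.group("status"))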

Analyzing such data can help these organizations determine the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns, among other things. Analysis of server access logs and user registration data can also provide valuable information on how to better structure a Web site in order to create a more effective presence for the organization. In organizations using intranet technologies, such analysis can shed light on more effective management of workgroup communication and organizational infrastructure. Finally, for organizations that sell advertising on the World Wide Web, analyzing user access patterns helps in targeting ads to specific groups of users.

Most of the existing Web analysis tools provide mechanisms for reporting user activity on the servers and various forms of data filtering. Using such tools, for example, it is possible to determine the number of accesses to the server and to individual files within the organization's Web space, the times or time intervals of visits, and the domain names and URLs of users of the Web server. However, these tools are generally designed to handle servers with low to moderate traffic, and they usually provide little or no analysis of data relationships among the accessed files and directories within the Web space. More sophisticated systems and techniques for the discovery and analysis of patterns are now emerging. I discuss them below.

WEBMINER:
This is a general architecture for Web usage mining. The architecture divides the Web usage mining process into two main parts. The first part includes the domain-dependent processes of transforming the Web data into a suitable transaction form. This includes the preprocessing, transaction identification, and data integration components. The second part includes the largely domain-independent application of generic data mining and pattern matching techniques (such as the discovery of association rules and sequential patterns) as part of the system's data mining engine. The overall architecture for the Web mining process is depicted in the figure below.

[Figure: the overall architecture of the WEBMINER Web usage mining process]

Data cleaning is the first step performed in the Web usage mining process. After the data cleaning, the log entries must be partitioned into logical clusters using one or a series of transaction identification modules. Once the domain-dependent data transformation phase is completed, the resulting transaction data must be formatted to conform to the data model of the appropriate data-mining task. Finally, a query mechanism will allow the user (analyst) to provide more control over the discovery process by specifying various constraints.
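
As a sketch of one common transaction identification module (an assumption about the general technique, not WEBMINER's actual code), the following groups cleaned log entries into per-client sessions using a 30-minute inactivity timeout; the entries and the cutoff are illustrative.

from datetime import datetime, timedelta

# Cleaned, time-ordered log entries: (client, timestamp, url).
entries = [
    ("10.0.0.1", datetime(2008, 2, 9, 10, 0), "/home"),
    ("10.0.0.1", datetime(2008, 2, 9, 10, 5), "/products"),
    ("10.0.0.1", datetime(2008, 2, 9, 11, 30), "/home"),   # later visit
    ("10.0.0.2", datetime(2008, 2, 9, 10, 2), "/about"),
]
TIMEOUT = timedelta(minutes=30)  # a common, if arbitrary, session cutoff

sessions = {}   # (client, session index) -> list of urls
last_seen = {}  # client -> (timestamp of last request, session index)
for client, ts, url in entries:
    prev = last_seen.get(client)
    if prev is None or ts - prev[0] > TIMEOUT:
        idx = 0 if prev is None else prev[1] + 1  # start a new session
    else:
        idx = prev[1]
    last_seen[client] = (ts, idx)
    sessions.setdefault((client, idx), []).append(url)

for key, urls in sessions.items():
    print(key, urls)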

The emerging data mining tools and systems lead naturally to a demand for a powerful data mining query language, on top of which many interactive and flexible graphical user interfaces can be developed. Such a query mechanism can provide user control over the data mining process and allow the user to extract only relevant and useful rules. In WEBMINER, a simple query mechanism has been implemented by adding some primitives to an SQL-like language. This allows the user to guide the mining engine by specifying the patterns of interest.

As an example, consider a situation where the analyst is interested in patterns that start with URL L and contain M and N, in that order. This pattern can be expressed as the regular expression L*M*N*. To see how this expression is used within an SQL-like query, suppose further that the analyst is interested in finding all such rules with a minimum support of 1% and a minimum confidence of 90%. Moreover, assume that the analyst is interested only in clients from the domain .edu, and only wants to consider data later than Jan 1, 1996. The query based on these parameters can be expressed as follows:

SELECT association-rules (L*M*N*)
FROM log.data
WHERE date >= 960101
AND domain = “edu”
AND support = 1.0
AND confidence = 90.0

This information from the query is used to reduce the scope, and thus the cost, of the mining process.
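
A sketch of how those WHERE clauses might translate into a pre-filter that shrinks the log before the expensive rule-mining pass; the entry layout and field names are assumptions for illustration, not WEBMINER's internals.

log = [
    {"date": 960212, "domain": "edu", "urls": ["L", "M", "N"]},
    {"date": 951120, "domain": "edu", "urls": ["L", "N"]},   # too old
    {"date": 960301, "domain": "com", "urls": ["L", "M"]},   # wrong domain
]

# Apply the query constraints first, so the miner only scans what matters.
candidates = [e for e in log
              if e["date"] >= 960101 and e["domain"] == "edu"]
print(len(candidates), "of", len(log), "entries survive; mine only these")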

PATTERN DISCOVERY TOOLS:

The emerging tools for user pattern discovery use sophisticated techniques from AI, data mining, psychology, and information theory to mine for knowledge from collected data. For example, some systems introduce a general architecture for Web usage mining and automatically discover association rules and sequential patterns from server access logs.

PATTERN ANALYSIS TOOLS:

Once access patterns have been discovered, analysts need appropriate tools and techniques to understand, visualize, and interpret these patterns. Examples of such tools include the WebViz system for visualizing path traversal patterns. Others have proposed using OLAP techniques, such as data cubes, to simplify the analysis of usage statistics from server access logs.

OLAP Techniques:
On-Line Analytical Processing (OLAP) is emerging as a very powerful paradigm for the strategic analysis of databases in business settings. Key characteristics of strategic analysis include very large data volumes, explicit support for the temporal dimension, support for various kinds of information aggregation, and long-range analysis in which overall trends are more important than the details of individual data items. While OLAP can be performed directly on top of relational databases, industry has developed specialized tools to make it more efficient and effective.

Recent work has shown that the analysis needs of Web usage data have much in common with those of a data warehouse, and hence OLAP techniques are quite applicable. The access information in server logs is modeled as an append-only history that grows over time. A single access log is not likely to contain the entire request history for pages on a server, especially since many clients use a proxy server. Information on access requests will therefore be distributed, and there is a need to integrate it. Since the size of server logs grows quite rapidly, it may not be possible to provide on-line analysis of all of the data; there is thus a need to summarize the log, perhaps in various ways, to make its on-line analysis feasible. Making portions of the log selectively (in)visible to various analysts may also be required for security reasons. These requirements show that OLAP techniques are well suited to Web usage data analysis.
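
A toy sketch of the kind of summarization involved, using made-up requests: raw log lines are rolled up into a small (hour, URL) cube that can be queried on-line even when the full log cannot. Real OLAP tools maintain many such roll-ups over time, page, and client dimensions.

from collections import Counter

# Hypothetical parsed requests: (hour of day, url).
requests = [(10, "/home"), (10, "/products"), (10, "/home"),
            (11, "/home"), (11, "/cart"), (23, "/home")]

cube = Counter(requests)                         # hits by (hour, url)
by_hour = Counter(hour for hour, _ in requests)  # roll-up by hour alone

print(cube[(10, "/home")])  # 2 hits on /home during hour 10
print(by_hour[10])          # 3 hits overall during hour 10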

THE MINING PROCESS:
The key component of Web mining is the mining process itself. Web mining has adapted techniques from the fields of data mining, databases, and information retrieval, as well as developing some techniques of its own, e.g., path analysis. A lot of work still remains to be done in adapting known mining techniques as well as developing new ones. Specifically, the following issues must be addressed.

1. New Types of Knowledge:
Web usage mining studies reported to date have mined for association rules, temporal sequences, clusters, and path expressions. As the manner in which the Web is used continues to expand, there is a continual need to identify new kinds of knowledge about user behavior to mine for.
2. Improved Mining Algorithms:
The quality of a mining algorithm can be measured both in terms of how effective it is in mining for knowledge and how efficient it is in computational terms. There will always be a need to improve the performance of mining algorithms along both these dimensions.
3. Incremental Web Mining:
Usage data collection on the Web is incremental in nature. Hence, there is a need for mining algorithms that take the existing data and mined knowledge, together with the new data, and update the model in an efficient manner.
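
For the simplest kind of mined knowledge, hit counts and their supports, incremental updating is just merging new counts into old ones, as the sketch below shows with made-up numbers. Real incremental association mining must carry more state (e.g., candidate itemsets), but the principle of touching only the new data is the same.

from collections import Counter

# Knowledge mined so far from the log processed to date.
model = Counter({"/home": 120, "/products": 80})
total_sessions = 200

# New data arrives: fold it in without rescanning the old log.
new_hits = Counter({"/home": 15, "/cart": 9})
model.update(new_hits)
total_sessions += 25

print("refreshed support of /home:", model["/home"] / total_sessions)
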
4. Distributed Web Mining:
Usage data collection on the Web is distributed by its very nature. If all the data could be integrated before mining, a lot of valuable information could be extracted. However, collecting data from all possible server logs is both non-scalable and impractical. Hence, there needs to be an approach by which knowledge mined from various logs can be integrated into a more comprehensive model.
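
One plausible shape for such integration, sketched below with invented numbers: each server ships a compact local summary (itemset counts and session totals) rather than its raw log, and the summaries are merged centrally. Exact global supports would in general need extra bookkeeping for itemsets a site failed to report, which this sketch glosses over.

from collections import Counter

# Hypothetical per-server summaries mined locally at each site.
server_a = {"counts": Counter({("/home", "/products"): 40}), "sessions": 500}
server_b = {"counts": Counter({("/home", "/products"): 25,
                               ("/home", "/cart"): 30}), "sessions": 300}

merged, sessions = Counter(), 0
for summary in (server_a, server_b):
    merged.update(summary["counts"])  # integrate knowledge, not raw data
    sessions += summary["sessions"]

for itemset, count in merged.items():
    print(itemset, "global support =", count / sessions)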

DATA MINING AND BUSINESS GOALS:
Data mining works best when the business has clear, measurable goals. The following are some such goals (a sketch computing two of these metrics from session data follows the list):

· Increase average page views per session
· Increase average profit per checkout
· Decrease products returned
· Increase number of referred customers
· Increase brand awareness
· Increase retention rate (such as the number of visitors who return within 30 days)
· Reduce clicks-to-close (average page views to accomplish a purchase or obtain desired information)
· Increase conversion rate (checkouts per visit).
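
Two of these metrics, computed from made-up per-visit session records, might look like the following sketch; the field layout is an assumption for illustration.

# Hypothetical sessions: (page views in the visit, whether it ended in checkout).
sessions = [(12, True), (3, False), (7, True), (5, False), (9, False)]

visits = len(sessions)
checkouts = sum(1 for _, bought in sessions if bought)
purchase_views = [views for views, bought in sessions if bought]

conversion_rate = checkouts / visits                         # checkouts per visit
clicks_to_close = sum(purchase_views) / len(purchase_views)  # avg views per purchase

print(f"conversion rate: {conversion_rate:.0%}")   # 40%
print(f"clicks-to-close: {clicks_to_close:.1f}")   # 9.5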

CONCLUSION: As the popularity of the World Wide Web continues to increase, there is a growing need to develop tools and techniques that will help improve its overall usefulness. Since one of the principal goals of the World Wide Web is to act as a world-wide distributed information resource, a number of efforts are underway to develop techniques that will make it more useful in this regard. The term Web mining has been used to refer to different kinds of techniques encompassing a broad range of issues. However, while meaningful and attractive, this very broadness has caused Web mining to mean different things to different people, and there is a need to develop a common vocabulary for all these efforts. Toward this goal, in this paper I have followed a common, popular definition of Web mining and tried to discuss the various ongoing efforts related to it.
