Special thanks to Eldon Norton, who reviewed the source code and suggested ways to improve its efficiency.
Today there are quite a few software tools for analyzing a user log. In my experience these packages have drawbacks. Some of them, such as AccessWatch, are CGI-based and therefore difficult to customize. Others, such as Andromedia, Whirl, Insight, and Internet Manager, are more flexible but extremely expensive. Using SAS to analyze user logs has two advantages. First, a SAS programmer can manipulate the data in whatever way he or she wants. Second, SAS is already available in many institutions, so the only additional cost of the project is the cost of SAS/IntrNet. This write-up illustrates how you can clean up the access log and present useful results on the Web.
Linking user log with SAS
If SAS and SAS/IntrNet are not installed on the same machine as the Web server, you have to create a link between SAS and the targeted user log. There are several ways to accomplish this task. You can mount a Network File System (NFS) volume that both computers can recognize, or you can simply issue an "FTP" command in SAS (see below). After the link is established, SAS can read the log in a data step.
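As a rough sketch of the FTP route, the FILENAME statement with the FTP access method can point a fileref at the remote log. The host name, directory, account, and file name below are placeholders chosen for illustration, not values from this write-up.

```sas
/* Hypothetical host, directory, account, and file name: use your own. */
filename weblog ftp 'access_log'
         cd='/usr/local/apache/logs'
         host='www.example.edu'
         user='loguser'
         prompt;               /* ask for the password at run time */

/* Echo the first five records to the SAS log to confirm the link works. */
data _null_;
   infile weblog truncover;
   input line $char200.;
   put line;
   if _n_ >= 5 then stop;
run;
```

With an NFS mount, the same data step should work once the fileref points at the mounted path instead, for example filename weblog '/mnt/webserver/logs/access_log'; (again a made-up path).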
Extracting relevant data
You don't need everything in the user log. Typically what you need are the IP address (From where did the users look at your website?), the date and time (When did they read it?), and the hit (What did they look at?). Therefore you can use dummy variables as shown below to skip all the other, unnecessary data. To save memory space, you should drop all dummy and temporary variables when the data step is finished.
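A minimal sketch of such a data step follows. It assumes the log uses the NCSA Common Log Format with blank-separated fields. The sample records, the data set name weblog, and the names dummy1-dummy3 are made up for illustration, and in practice the infile statement would name your log fileref rather than datalines.

```sas
data weblog;
   infile datalines truncover;
   length ip $15 datetime $21 page $60;
   /* dummy1 and dummy2 absorb the two "-" fields, dummy3 the time-zone */
   /* offset; action is treated as a temporary field in this sketch.    */
   input ip $ dummy1 $ dummy2 $ datetime $ dummy3 $ action $ page $;
   drop dummy1-dummy3 action;
   datalines;
192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
192.0.2.11 - - [10/Oct/2000:14:02:10 -0700] "GET /logo.gif HTTP/1.0" 200 512
;
run;

proc print data=weblog;
run;
```

Fields after the page (protocol, status code, byte count) are simply never read, so they need no dummy variables.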
The above selection works very well if the primary function of your website is to display information. However, you may want to read more data if your server performs more functions. For example, in the previous user log example, right after the date and time data there are "action" data. The "action" is either "get" or "post." If a user only reads the displayed information, the action is "get." If your homesite allows a user to submit a query to search a database or upload a form to your server, then the action will be "post." A web-based instructor may want to find out the ratio between "get" and "post." (How many students use the search engine? How many students post questions or submit homework?)
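One way to get that ratio, sketched under the same Common Log Format assumption, is to keep the action field and tabulate it with proc freq. The sample records and variable names are illustrative only.

```sas
data actions;
   infile datalines truncover;
   length action $8;
   /* dummy1-dummy5 skip the IP, the two "-" fields, and the date/time */
   input dummy1 $ dummy2 $ dummy3 $ dummy4 $ dummy5 $ action $;
   /* list input keeps the leading quotation mark, so strip it */
   action = upcase(compress(action, '"'));
   drop dummy1-dummy5;
   datalines;
192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
192.0.2.12 - - [10/Oct/2000:14:05:00 -0700] "POST /search.cgi HTTP/1.0" 200 734
192.0.2.13 - - [10/Oct/2000:14:06:41 -0700] "GET /tips.html HTTP/1.0" 200 1810
;
run;

/* Counts and percentages of GET versus POST requests */
proc freq data=actions;
   tables action;
run;
```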
Cleaning up IP numbers
Many people use a counter to keep track of the number of visits to the homesite. However, a counter cannot differentiate between users and the people who work on the website. When I develop a website, I may check the site more than ten times. These counts are misleading when you analyze the website traffic. Therefore, I recommend deleting all the entries that originated from you and from colleagues who are involved in developing the pages. To do this, simply discard the records if the IP numbers belong to you or your coworkers (see below).
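A minimal sketch of the filter, using made-up addresses from the ranges reserved for documentation; substitute the addresses of your own machines.

```sas
data visits;
   length ip $15 page $40;
   input ip $ page $;
   /* hypothetical developer addresses: replace with your own */
   if ip in ('192.0.2.10', '192.0.2.11') then delete;
   datalines;
192.0.2.10 /index.html
198.51.100.7 /index.html
192.0.2.11 /tips.html
203.0.113.25 /tips.html
;
run;

proc print data=visits;
run;
```

If the whole development team sits on one subnet, a prefix comparison such as if ip =: '192.0.2.' then delete; may be more convenient than listing every address.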
Cleaning up page access
In addition, it is important to distinguish a "hit" from a "page access." A hit includes every object accessed by the user, such as individual JPEG images, GIF images, WAV sound clips, HTML files and so on. If a user opens a webpage that has two JPEG images and three GIF images, the total number of hits would be six. This statistic artificially inflates the website traffic. On the other hand, a page access counts only the HTML page. I recommend using the latter because it reflects the usage of the website more accurately. The following SAS code could perform the filtration task:
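What follows is an illustrative sketch rather than a definitive listing. The data set and variable names (hits, page, ext) and the sample records are assumptions.

```sas
data hits;
   length page $40 ext $8;
   input page $;
   /* reverse the trimmed string so the extension comes first, take the */
   /* part before the first period, then turn it back around            */
   ext = upcase(reverse(strip(scan(reverse(trim(page)), 1, '.'))));
   if ext in ('JPG', 'JPEG', 'GIF') then delete;
   drop ext;
   datalines;
/index.html
/images/logo.GIF
/photo.jpg
/tips.HTML
;
run;

proc print data=hits;
run;
```

In current SAS releases a negative count in scan, as in scan(page, -1, '.'), reads the last word directly and makes the two reverse calls unnecessary.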
To pull out non-HTML hits, first use the reverse function to reverse the "hit" string. Then use the scan function to locate the extension (.html, .jpg, .gif, etc.). Because the extension names may be in either upper case or lower case, use the upcase function to convert them into capital letters. Next, use an if-then statement to delete all non-HTML hits. In this example, only JPEG and GIF images are taken out. In your own implementation, you can take out other types of hits such as Shockwave movies, Java applets, QuickTime movies, Wave sound clips, and so on.
Hotlinking page access
In SAS/IntrNet, the variable "page" will be used to display the frequency of page access of each webpage. You can hotlink the display of the pages by using the following method:
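A minimal sketch of the idea, assuming a site root of http://www.example.edu (a placeholder). The third constant is named endtag here, rather than "end," only to avoid any confusion with the END statement keyword.

```sas
data links;
   length page $40 link $200;
   input page $;
   start  = '<a href="http://www.example.edu';   /* hypothetical site root */
   middle = '">';
   endtag = '</a>';
   /* trim removes the trailing blanks that pad each shorter string */
   link = trim(start) || trim(page) || trim(middle) || trim(page) || trim(endtag);
   put link=;
   drop start middle endtag;
   datalines;
/index.html
/tips.html
;
run;
```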
To make hotlinked text, first create several constants. The constant "start" carries the address of your website and the HTML tag that starts a link. The constants "middle" and "end" contain the tags that bracket the text to be hotlinked. To show hotlinked text on the Web, concatenate the preceding constants and the variable "page" in the proper order. To avoid empty tails, use the trim function to remove the trailing blanks from the shorter strings.
Cleaning up date and time
Date and time in the user log are together in a continuous string. The following SAS code divides the variable "datetime" into several different variables. First, the substr function extracts the string carrying date and time information. Second, the compress function removes the slashes so that the date/time string conforms to the standard SAS date/time format. Next, use different date functions to extract the month, year, hour, minute, and date information from the string.
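A sketch of these steps, assuming the raw field looks like [10/Oct/2000:13:55:36 (a leading bracket, then day/month/year and the time), as it does in the Common Log Format. The variable names other than "datetime" are illustrative.

```sas
data stamps;
   length datetime $21;
   input datetime $;
   /* skip the leading "[" and remove the slashes: 10Oct2000:13:55:36 */
   dtstring = compress(substr(datetime, 2, 20), '/');
   dt     = input(dtstring, datetime18.);   /* SAS datetime value */
   date   = datepart(dt);
   month  = month(date);
   year   = year(date);
   hour   = hour(dt);
   minute = minute(dt);
   format dt datetime20. date date9.;
   drop dtstring;
   datalines;
[10/Oct/2000:13:55:36
[03/Nov/2000:08:05:59
;
run;

proc print data=stamps;
run;
```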
In some situations, such as a slow download, the user may click the refresh/reload button several times within a minute. This action improperly inflates the number of page accesses. To avoid duplication, you can sort the data with the nodupkey option (no duplicate keys) on the variables "IP," "dt" and "page" (see below). If the same person looked at the same page at the same time (within one minute), only the first page access will be kept.
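A minimal sketch, assuming "dt" is a SAS datetime value. So that accesses within the same minute share a key, this sketch first truncates dt to the start of its minute with intnx, which is an assumption about how the one-minute window is implemented.

```sas
data hits;
   length ip $15 page $40;
   input ip $ dt :datetime18. page $;
   dt = intnx('minute', dt, 0);   /* drop the seconds */
   format dt datetime20.;
   datalines;
198.51.100.7 10Oct2000:13:55:36 /index.html
198.51.100.7 10Oct2000:13:55:50 /index.html
198.51.100.7 10Oct2000:13:57:02 /index.html
203.0.113.25 10Oct2000:13:55:40 /index.html
;
run;

/* Keep one record per IP, minute, and page */
proc sort data=hits out=unique nodupkey;
   by ip dt page;
run;

proc print data=unique;
run;
```

In this sample the first two records fall in the same minute, so the second one is discarded and three records remain.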
Counting page access
Now the access log is clean and ready for analysis. You can process the user log data with regular SAS procedures and later convert the output to HTML pages by using SAS/IntrNet. The following proc summary is used to display the frequency count of each page access. Also, the displayed pages are hotlinked.
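As an illustration, the sketch below counts accesses per page with proc summary and then builds the hotlinked text for each page. The sample data, the data set names, and the site root are assumptions.

```sas
data pages;
   length page $40;
   input page $;
   datalines;
/index.html
/tips.html
/index.html
/index.html
/tips.html
/about.html
;
run;

/* One output record per page; _freq_ holds the number of accesses */
proc summary data=pages nway;
   class page;
   output out=pagecount (drop=_type_ rename=(_freq_=count));
run;

data pagecount;
   set pagecount;
   length link $200;
   /* hypothetical site root */
   link = '<a href="http://www.example.edu' || trim(page) || '">'
          || trim(page) || '</a>';
run;

proc sort data=pagecount;
   by descending count;
run;

proc print data=pagecount noobs;
   var link count;
run;
```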
Counting page access by month
In the following, proc summary and proc chart are used to show the page access by month. A graphical output is shown after the following source code.
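A sketch of such a module, assuming each record already carries a SAS date value from which the month can be derived. The sample data and names are illustrative.

```sas
data hits;
   length page $40;
   input date :date9. page $;
   month = month(date);
   format date date9.;
   datalines;
10Oct2000 /index.html
12Oct2000 /tips.html
03Nov2000 /index.html
15Nov2000 /index.html
20Nov2000 /tips.html
02Dec2000 /index.html
;
run;

/* Number of page accesses in each month */
proc summary data=hits nway;
   class month;
   output out=bymonth (drop=_type_ rename=(_freq_=count));
run;

/* One bar per month; bar length is the access count */
proc chart data=bymonth;
   hbar month / discrete sumvar=count;
run;
```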
Counting page access by hour
The following module returns the page access ranked by hour. A graphical output is attached.
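A sketch along the same lines, assuming "dt" is a SAS datetime value. The hour is extracted, the accesses are counted per hour, and the hours are ranked from busiest to quietest. Sample data and names are illustrative.

```sas
data hits;
   length page $40;
   input dt :datetime18. page $;
   hour = hour(dt);
   format dt datetime20.;
   datalines;
10Oct2000:09:15:00 /index.html
10Oct2000:09:40:12 /tips.html
10Oct2000:13:55:36 /index.html
11Oct2000:13:05:10 /index.html
11Oct2000:13:30:45 /tips.html
11Oct2000:21:10:03 /index.html
;
run;

/* Number of page accesses in each hour of the day */
proc summary data=hits nway;
   class hour;
   output out=byhour (drop=_type_ rename=(_freq_=count));
run;

/* Rank the hours from busiest to quietest */
proc sort data=byhour;
   by descending count;
run;

proc print data=byhour noobs;
run;

proc chart data=byhour;
   vbar hour / discrete sumvar=count;
run;
```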
Now you can pack the SAS output and show it on the Web by using SAS/IntrNet. However, this step may not be necessary. If you are the only person who would analyze the user log or you will share the user log info with your coworkers but they don't have to read the result on the Web, you and your coworkers can simply read the SAS output on your desktop or share it through a local area network.