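System design and JavaScript notes: researching user behavior analysis, data collection
1.1 The importance of user behavior analysis
Anyone who runs a website has a clear sense of why user behavior analysis matters. I had intended to share my own views, but I am a technical person and find it hard to argue its importance clearly from a business-value angle, so I dropped that idea. The first time I saw Baidu Tongji (百度统计) and Google Analytics I was thrilled at first glance and wanted to build such a system myself; that is the original motivation behind my ongoing research into user behavior analysis.
I suspect many readers are still unfamiliar with the concept of "user behavior analysis", so I point to the Baidu Baike entry as a starting point, in the hope that more like-minded people will join me in studying this topic. The Baidu Baike address is: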
http://baike.baidu.com/view/2330219.htm
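Enough preamble; on to the main topic.
1.2 Designing a good data collection system
For a large website, response speed is an important measure of whether the site is any good.
The precondition for user behavior analysis is collecting user data accurately, which means adding collection code to the site's pages. If that code is poorly written, it will hurt site performance and, in the worst case, the site's stability. It is therefore very important to design a log collection program that performs well, is secure, and is loosely coupled to the site.
Here is the collection scheme I propose: collection requests are served by a lightweight web server (apache or nginx), and the data is read from that server's access log.
I am a Java programmer, and the web application servers I use most are Tomcat, JBoss, WebLogic and the like. So why not use one of these familiar servers here, rather than the comparatively single-purpose apache or nginx? The reason is simple: apache and nginx are faster and more lightweight. This comes from my experience building websites. Server-side design for a large site is complex, but there is a common principle: when a user request reaches the server side, it is classified first, and if it asks for a static resource (an image, text that never changes, and so on) it is passed straight to the corresponding static resource server cluster, which is faster. Those static resource servers are almost always clusters of lightweight web servers such as apache or nginx.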
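1.3 The collection system: server side
For local development I will not set up a cluster; interested readers can look up the relevant material online. Locally I simply run a single apache server.
The server side is very simple: only the configuration file under apache's conf directory needs to change (note: my development platform is Windows 7). The relevant excerpt is: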
<IfModule log_config_module>
    LogFormat "%h %l %u %t [%{%Y-%m-%d %T}t] \"%r\" [%q] [%U] %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
    LogFormat "%h %l %u %t [%{%Y-%m-%d %T}t] \"%r\" [%q] [%U] %>s %b" common
    <IfModule logio_module>
        # You need to enable mod_logio.c to use %I and %O
        LogFormat "%h %l %u %t [%{%Y-%m-%d %T}t] \"%r\" [%q] [%U] %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %I %O" combinedio
    </IfModule>
    # ... (the rest of this block is not shown)
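Add the following files to the htdocs folder:
1) a.gif (a transparent 1×1-pixel image)
2) click.html (used to record click logs)
3) error.html (used to record error logs)
Start the apache server and enter the following address in the browser: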
http://127.0.0.1/a.gif?name=sharpxiajun&msg=test
In the logs folder, find the file 2012_06_26.access.log and open it; it contains the following entry:
127.0.0.1 - - [26/Jun/2012:11:37:07 +0800] [2012-06-26 11:37:07] "GET /a.gif?name=sharpxiajun&msg=test HTTP/1.1" [?name=sharpxiajun&msg=test] [/a.gif] 200 43
The request has been recorded in full.
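To show how such an entry could be consumed later, here is a minimal sketch of a parser for the customised "common" format above. The names parseAccessLogLine and queryToObject are my own illustrations and not part of the original scheme; the regular expression assumes the request line contains no unescaped double quotes.

```javascript
// Hypothetical helper: parses one line written with the customised "common" LogFormat:
// %h %l %u %t [%{%Y-%m-%d %T}t] "%r" [%q] [%U] %>s %b
var LOG_LINE = /^(\S+) (\S+) (\S+) \[([^\]]+)\] \[([^\]]+)\] "([^"]*)" \[(.*?)\] \[(\/[^\]]*)\] (\d{3}) (\S+)/;

function parseAccessLogLine(line) {
  var m = LOG_LINE.exec(line);
  if (!m) {
    return null; // the line does not follow the expected format
  }
  return {
    host: m[1],
    time: m[5],                              // the "yyyy-mm-dd hh:mm:ss" timestamp
    request: m[6],
    query: m[7].replace(/^\?/, ''),          // %q without its leading "?"
    path: m[8],                              // %U
    status: Number(m[9]),
    bytes: m[10] === '-' ? 0 : Number(m[10]) // %b is "-" when no body was sent
  };
}

// Decode a value, falling back to the raw text if it is not valid percent-encoding.
function safeDecode(text) {
  try {
    return decodeURIComponent(text);
  } catch (e) {
    return text;
  }
}

// Turn "a=1&b=2" into { a: '1', b: '2' }.
function queryToObject(query) {
  var result = {};
  var pairs = query.split('&');
  for (var i = 0; i < pairs.length; i++) {
    if (pairs[i] === '') {
      continue;
    }
    var eq = pairs[i].indexOf('=');
    var key = eq < 0 ? pairs[i] : pairs[i].substring(0, eq);
    var value = eq < 0 ? '' : pairs[i].substring(eq + 1);
    result[safeDecode(key)] = safeDecode(value);
  }
  return result;
}
```

Calling parseAccessLogLine on the entry above and passing its query field to queryToObject yields the name and msg parameters as an object.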
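1.4 The collection system: client side
The core of the collection system is the client-side collection script. Here I post the complete script and the test pages; a detailed walkthrough of the code will follow in later posts.
My collection script records page-visit logs and can also record click logs. Click logs usually carry business meaning, so users need to define them according to their own requirements. The code is as follows:
up_beacon.js: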
(function (window, document, undefined) {
    // upLogger is the object the collection script exposes to the page
    if (window.upLogger) { // already installed: return to avoid installing twice
        return;
    }
    var upBeaconUtil = { // logging utility object
        jsName: 'up_beacon.js', // script name
        defaultVer: 20120607, // default version (a date)
        getVersion: function () { // read the version from the query string of the script URL
            var e = this.jsName;
            var a = new RegExp(e + "(\\?(.*))?$");
            var d = document.getElementsByTagName("script");
            for (var i = 0; i < d.length; i++) {
                var b = d[i];
                if (b.src && b.src.match(a)) {
                    var z = b.src.match(a)[2];
                    if (z && (/^[a-zA-Z0-9]+$/).test(z)) {
                        return z;
                    }
                }
            }
            return this.defaultVer;
        },
        setCookie: function (sName, sValue, oExpires, sPath, sDomain, bSecure) { // set a cookie; oExpires is in days
            var currDate = new Date(),
                sExpires = typeof oExpires == 'undefined' ? '' : ';expires=' + new Date(currDate.getTime() + (oExpires * 24 * 60 * 60 * 1000)).toUTCString();
            document.cookie = sName + '=' + sValue + sExpires + ((sPath == null) ? '' : (' ;path=' + sPath)) + ((sDomain == null) ? '' : (' ;domain=' + sDomain)) + ((bSecure == true) ? ' ; secure' : '');
        },
        getCookie: function (sName) { // read a cookie; returns '-' when it is not set
            var regRes = document.cookie.match(new RegExp("(^| )" + sName + "=([^;]*)(;|$)"));
            return (regRes != null) ? unescape(regRes[2]) : '-';
        },
        getRand: function () { // generate the unique id of a page view
            var currDate = new Date();
            var randId = currDate.getTime() + '-';
            for (var i = 0; i < 32; i++) {
                randId += Math.floor(Math.random() * 10);
            }
            return randId;
        },
        parseError: function (obj) { // serialize an error object as key=value; pairs
            var retVal = '';
            for (var key in obj) {
                retVal += key + '=' + obj[key] + ';';
            }
            return retVal;
        },
        getParam: function (obj, flag) { // convert a parameter to a string
            var retVal = null;
            if (obj) {
                if (upBeaconUtil.isString(obj) || upBeaconUtil.isNumber(obj)) {
                    retVal = obj;
                } else if (upBeaconUtil.isObject(obj)) {
                    var tmpStr = '';
                    for (var key in obj) {
                        if (obj[key] != null && obj[key] != undefined) {
                            var tmpObj = obj[key];
                            if (upBeaconUtil.isArray(tmpObj)) {
                                tmpObj = tmpObj.join(',');
                            } else if (upBeaconUtil.isDate(tmpObj)) {
                                tmpObj = tmpObj.getTime();
                            }
                            tmpStr += key + '=' + tmpObj + '&';
                        }
                    }
                    tmpStr = tmpStr.substring(0, tmpStr.length - 1);
                    retVal = tmpStr;
                } else if (upBeaconUtil.isArray(obj)) {
                    // fixed: the original tested upBeaconUtil.length, which is always undefined
                    if (obj.length > 0) {
                        retVal = obj.join(',');
                    }
                } else {
                    retVal = obj.toString();
                }
            }
            if (!retVal) {
                retVal = '-';
            }
            if (flag) {
                retVal = encodeURIComponent(retVal);
                retVal = this.base64encode(retVal);
            }
            return retVal;
        },
        base64encode: function (G) { // Base64 encoding
            var A = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
            var C, E, z;
            var F, D, B;
            z = G.length;
            E = 0;
            C = "";
            while (E < z) {
                F = G.charCodeAt(E++) & 255;
                if (E == z) {
                    C += A.charAt(F >> 2);
                    C += A.charAt((F & 3) << 4);
                    C += "==";
                    break;
                }
                D = G.charCodeAt(E++);
                if (E == z) {
                    C += A.charAt(F >> 2);
                    C += A.charAt(((F & 3) << 4) | ((D & 240) >> 4));
                    C += A.charAt((D & 15) << 2);
                    C += "=";
                    break;
                }
                B = G.charCodeAt(E++);
                C += A.charAt(F >> 2);
                C += A.charAt(((F & 3) << 4) | ((D & 240) >> 4));
                C += A.charAt(((D & 15) << 2) | ((B & 192) >> 6));
                C += A.charAt(B & 63);
            }
            return C;
        },
        getDomain: function () { // host (plus directory path) of the current page
            return document.URL.substring(document.URL.indexOf("://") + 3, document.URL.lastIndexOf("/"));
        },
        isString: function (obj) { // is it a string?
            return (obj != null) && (obj != undefined) && (typeof obj == 'string') && (obj.constructor == String);
        },
        isNumber: function (obj) { // is it a number?
            return (typeof obj == 'number') && (obj.constructor == Number);
        },
        isDate: function (obj) { // is it a Date?
            return obj && (typeof obj == 'object') && (obj.constructor == Date);
        },
        isArray: function (obj) { // is it an array?
            return obj && (typeof obj == 'object') && (obj.constructor == Array);
        },
        isObject: function (obj) { // is it a plain object?
            return obj && (typeof obj == 'object') && (obj.constructor == Object);
        },
        trim: function (str) { // strip leading and trailing whitespace (fixed: added the g flag)
            return str.replace(/(^\s*)|(\s*$)/g, "");
        }
    },
    // read the visit count from the cookie
    beacon_vist_num = isNaN(beacon_vist_num = +upBeaconUtil.getCookie('up_beacon_vist_count')) ? 1 : beacon_vist_num + 1;
    upBeaconUtil.setCookie('up_beacon_vist_count', beacon_vist_num); // store the new visit count
    var setUpBeaconId = function () { // create the visitor id cookie if it does not exist yet
        var sUpBeaconId = upBeaconUtil.trim(upBeaconUtil.getCookie('up_beacon_id'));
        if (sUpBeaconId == undefined || sUpBeaconId == null || sUpBeaconId == '' || sUpBeaconId == '-') {
            upBeaconUtil.setCookie('up_beacon_id', (upBeaconUtil.getDomain() + '.' + (new Date()).getTime()));
        }
    }(),
    beaconMethod = {
        uvId: 'up_beacon_id', // cookie name read into the visitor id
        memId: 'up_dw_track', // cookie name read into the memId log field
        beaconUrl: '127.0.0.1/a.gif', // URL that records visit logs
        errorUrl: '127.0.0.1/error.html', // URL that records error logs
        clickUrl: '127.0.0.1/click.html', // URL that records click logs
        // pageId: unique id of this page view, reused if the page already defined one
        pageId: typeof _beacon_pageid != 'undefined' ? _beacon_pageid : (window._beacon_pageid = upBeaconUtil.getRand()),
        protocol: function () { // protocol prefix of the request, e.g. http://
            var reqHeader = location.protocol;
            if ('file:' === reqHeader) {
                reqHeader = 'http:';
            }
            return reqHeader + '//';
        },
        tracking: function () { // record a visit log (public entry point)
            this.beaconLog();
        },
        getRefer: function () { // get the referring page
            var reqRefer = document.referrer;
            reqRefer == location.href && (reqRefer = '');
            try {
                reqRefer = '' == reqRefer ? opener.location : reqRefer;
                reqRefer = '' == reqRefer ? '-' : reqRefer;
            } catch (e) {
                reqRefer = '-';
            }
            return reqRefer;
        },
        beaconLog: function () { // record a visit log
            try {
                var httpHeadInd = document.URL.indexOf('://'),
                    httpUrlContent = '{' + upBeaconUtil.getParam(document.URL.substring(httpHeadInd + 2)) + '}',
                    hisPageUrl = '{' + upBeaconUtil.getParam(this.getRefer()) + '}',
                    ptId = upBeaconUtil.getCookie(this.memId),
                    cId = upBeaconUtil.getCookie(this.uvId),
                    btsVal = upBeaconUtil.getCookie('b_t_s'),
                    beanconMObj = {};
                var btsFlag = btsVal == '-' || btsVal.indexOf('s') == -1;
                if (ptId != '-') {
                    beanconMObj.memId = ptId;
                }
                if (btsFlag) { // no 's' flag yet: mark this visitor as new
                    beanconMObj.subIsNew = 1;
                    upBeaconUtil.setCookie('b_t_s', btsVal == '-' ? 's' : (btsVal + 's'), 10000, '/');
                } else {
                    beanconMObj.subIsNew = 0;
                }
                var logParams = '{' + upBeaconUtil.getParam(beanconMObj) + '}',
                    logPageId = this.pageId,
                    logTitle = document.title;
                if (logTitle.length > 25) {
                    logTitle = logTitle.substring(0, 25);
                }
                logTitle = encodeURIComponent(logTitle);
                var logCharset = (navigator.userAgent.indexOf('MSIE') != -1) ? document.charset : document.characterSet,
                    logQuery = '{' + upBeaconUtil.getParam({
                        pageId: logPageId,
                        title: logTitle,
                        charset: logCharset,
                        sr: (window.screen.width + '*' + window.screen.height)
                    }) + '}';
                var sparam = {
                    logUrl: httpUrlContent,
                    logHisRefer: hisPageUrl,
                    logParams: logParams,
                    logQuery: logQuery
                };
                this.sendRequest(this.beaconUrl, sparam);
            } catch (ex) {
                this.sendError(ex);
            }
        },
        clickLog: function (sparam) { // record a click log
            try {
                var clickPageId = this.pageId;
                if (!clickPageId) { // pageId is empty: generate a new one
                    this.pageId = upBeaconUtil.getRand();
                    clickPageId = this.pageId;
                }
                var clickAuthId = this.authId; // authId uniquely identifies a site
                if (!clickAuthId) {
                    clickAuthId = '-';
                }
                if (upBeaconUtil.isObject(sparam)) { // parameter is a JavaScript object
                    sparam.pageId = clickPageId;
                    sparam.authId = clickAuthId;
                } else if (upBeaconUtil.isString(sparam) && sparam.indexOf('=') > 0) { // parameter is a string
                    sparam += '&pageId=' + clickPageId + "&authId=" + clickAuthId;
                } else if (upBeaconUtil.isArray(sparam)) { // parameter is an array
                    sparam.push("pageId=" + clickPageId);
                    sparam.push("authId=" + clickAuthId);
                    sparam = sparam.join('&'); // convert the array to a string
                } else { // any other type
                    sparam = {pageId: clickPageId, authId: clickAuthId};
                }
                this.sendRequest(this.clickUrl, sparam); // send the click log
            } catch (ex) {
                this.sendError(ex);
            }
        },
        sendRequest: function (url, params) { // send a log request
            var urlParam = '', currDate = new Date();
            try {
                if (params) {
                    urlParam = upBeaconUtil.getParam(params, false);
                    urlParam = (urlParam == '') ? urlParam : (urlParam + '&');
                }
                var tmpUrlParam = 'ver=' + upBeaconUtil.getVersion() + '&time=' + currDate.getTime();
                url = this.protocol() + url + '?' + urlParam + tmpUrlParam;
                var logImage = new Image();
                logImage.onload = function () {
                    logImage = null;
                };
                logImage.src = url;
            } catch (e) {
                this.sendError(e);
            }
        },
        sendError: function (ex) { // send an error log
            var errURIParams = upBeaconUtil.parseError(ex),
                errURL = this.errorUrl + '?type=send&exception=' + encodeURIComponent(errURIParams.toString()),
                errImage = new Image();
            errImage.onload = function () {
                errImage = null;
            };
            errImage.src = this.protocol() + errURL;
        }
    };
    beaconMethod.tracking();
    window.upLogger = beaconMethod; // expose upLogger on window
})(window, document);
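The heart of sendRequest is the image-beacon technique: the data is packed into the query string of an image URL, and assigning that URL to an Image object makes the browser issue a GET request that the server writes to its access log. The sketch below isolates that idea. The names toQueryString, buildBeaconUrl and fireImageBeacon are illustrative and not part of up_beacon.js, and unlike the script above this version percent-encodes every value, which is my own assumption about what a production version would want.

```javascript
// Serialize a flat object as key=value pairs, percent-encoding keys and values.
function toQueryString(params) {
  var parts = [];
  for (var key in params) {
    if (Object.prototype.hasOwnProperty.call(params, key) && params[key] != null) {
      parts.push(encodeURIComponent(key) + '=' + encodeURIComponent(params[key]));
    }
  }
  return parts.join('&');
}

// Build the full beacon URL; "ver" and "time" mirror the fields appended by sendRequest.
function buildBeaconUrl(protocol, endpoint, params, version, now) {
  var query = toQueryString(params || {});
  var tail = 'ver=' + encodeURIComponent(version) + '&time=' + now;
  return protocol + '//' + endpoint + '?' + (query ? query + '&' : '') + tail;
}

// Browser-only part: requesting the image is what actually sends the log.
// Returns false when no Image constructor exists (for example outside a browser).
function fireImageBeacon(url) {
  if (typeof Image === 'undefined') {
    return false;
  }
  var img = new Image();
  img.onload = img.onerror = function () {
    img = null; // release the reference once the request has completed
  };
  img.src = url;
  return true;
}

// Usage: build the URL for the test request used earlier in this post.
var exampleUrl = buildBeaconUrl('http:', '127.0.0.1/a.gif',
  { name: 'sharpxiajun', msg: 'test' }, 20120607, new Date().getTime());
fireImageBeacon(exampleUrl);
```

Because the time parameter changes on every call, the URL is unique each time, so the browser cache cannot swallow the request.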
install_up_beacon.js, the file that pages include to install the collection script:
(function (window, document, undefined) {
    /* Installer for the collection script */
    // upLogger is the object the collection script exposes to the page
    if (window.upLogger) { // already installed: return to avoid installing twice
        return;
    }
    var cookieUtil = { // cookie helper
        setCookie: function (sName, sValue, oExpires, sPath, sDomain, bSecure) {
            var currDate = new Date(),
                sExpires = typeof oExpires == 'undefined' ? '' : ';expires=' + new Date(currDate.getTime() + (oExpires * 24 * 60 * 60 * 1000)).toUTCString();
            document.cookie = sName + '=' + sValue + sExpires + ((sPath == null) ? '' : (' ;path=' + sPath)) + ((sDomain == null) ? '' : (' ;domain=' + sDomain)) + ((bSecure == true) ? ' ; secure' : '');
        },
        getCookie: function (sName) {
            var regRes = document.cookie.match(new RegExp("(^| )" + sName + "=([^;]*)(;|$)"));
            return (regRes != null) ? unescape(regRes[2]) : '-';
        }
    };
    // The b_t_s cookie has two purposes:
    // 1. it marks whether the collection script has already been installed for this visitor;
    // 2. it records the validity period of the collection script.
    var btsVal = cookieUtil.getCookie('b_t_s'),
        startTime = 0,
        intervalTime = 3 * 24 * 60 * 60 * 1000,
        currIntervalTime = new Date().getTime() - 1200000000000,
        domainHead = (document.URL.substring(0, document.URL.indexOf('://'))) + '://';
    if (btsVal != '-' && btsVal.indexOf('t') != -1) {
        var getBtsTime = btsVal.substring(btsVal.indexOf('t') + 1, btsVal.indexOf('x')),
            getCurrInterVal = currIntervalTime - getBtsTime; // fixed: was an implicit global
        if (getCurrInterVal > intervalTime) { // older than three days: refresh the timestamp
            startTime = currIntervalTime;
            cookieUtil.setCookie('b_t_s', btsVal.replace('t' + getBtsTime + 'x', 't' + currIntervalTime + 'x'), 10000, '/');
        } else {
            startTime = getBtsTime;
        }
    } else {
        if (btsVal == '-') {
            cookieUtil.setCookie('b_t_s', 't' + currIntervalTime + 'x', 10000, '/');
        } else {
            cookieUtil.setCookie('b_t_s', btsVal + 't' + currIntervalTime + 'x', 10000, '/');
        }
        startTime = currIntervalTime;
    }
    // install the collection script
    document.write('<script src="' + domainHead + '127.0.0.1/up_beacon.js?' + startTime + '"><\/script>');
})(window, document);
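The installer keeps a timestamp of the form t<number>x inside the b_t_s cookie and appends that number to the up_beacon.js URL, so the browser can cache the script but fetches a fresh copy at most once every three days. As a rough sketch, the same decision can be written as a pure function that is easy to test outside a browser. The name nextBtsState is hypothetical, and it uses a regular expression where the installer uses indexOf.

```javascript
var EPOCH_OFFSET = 1200000000000;          // same constant the installer subtracts from the clock
var INTERVAL_MS = 3 * 24 * 60 * 60 * 1000; // the timestamp is refreshed after three days

// Given the current b_t_s value ('-' when the cookie is missing) and the current
// time in milliseconds, return the cookie value to store and the number to append
// to the up_beacon.js URL.
function nextBtsState(btsVal, nowMs) {
  var current = nowMs - EPOCH_OFFSET;
  var match = /t(\d+)x/.exec(btsVal || '');
  if (!match) {
    // No timestamp yet: keep any existing flags (such as 's') and append one.
    var base = (!btsVal || btsVal === '-') ? '' : btsVal;
    return { cookie: base + 't' + current + 'x', startTime: current };
  }
  var stored = Number(match[1]);
  if (current - stored > INTERVAL_MS) {
    // Stale timestamp: replace it so the script URL (and the cached copy) changes.
    return { cookie: btsVal.replace(match[0], 't' + current + 'x'), startTime: current };
  }
  // Still fresh: reuse the stored number so the cached script is served.
  return { cookie: btsVal, startTime: stored };
}

// Usage: a first-time visitor with no cookie.
var firstVisit = nextBtsState('-', new Date().getTime());
```

Keeping the cookie handling and the date arithmetic separate like this is a design choice of the sketch, not something the original installer does.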
The test pages follow.
The first test page, testbeacon.html:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>beacon test page</title>
    <script type="text/javascript" src="install_up_beacon.js"></script>
</head>
<body>
    <h1>Log test</h1>
    <input type="button" value="Click Button" id="clickBtn" name="clickBtn" onclick="clickLog('testClickBtn','MyTest')"/>
    <script type="text/javascript">
        // user behavior statistics code
        function recordStaticLogerr(authId, type, msg) {
            if (window.upLogger) {
                upLogger.authId = authId;
                upLogger.clickLog('type=' + type + '&clickTarget=' + msg);
            }
        }
        // record a click log
        function clickLog(clog_msg, clog_type) {
            var clog_authId = 'sharpxiajun';
            recordStaticLogerr(clog_authId, clog_type, clog_msg);
        }
    </script>
</body>
</html>
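The test page passes clickLog a string, but clickLog also accepts a plain object or an array of "key=value" strings, and in each case it appends pageId and authId before sending. The sketch below shows that normalization as a standalone function that always returns a query string. withPageInfo is a name I made up for illustration; the real logic lives inside clickLog.

```javascript
// Append pageId and authId to click parameters given as a string, array or plain object.
function withPageInfo(sparam, pageId, authId) {
  var extra = ['pageId=' + pageId, 'authId=' + authId];
  if (Array.isArray(sparam)) {
    // array of "key=value" strings
    return sparam.concat(extra).join('&');
  }
  if (typeof sparam === 'string' && sparam.indexOf('=') > 0) {
    // ready-made query string
    return sparam + '&' + extra.join('&');
  }
  if (sparam && typeof sparam === 'object') {
    // plain object: serialize its own properties first
    var pairs = [];
    for (var key in sparam) {
      if (Object.prototype.hasOwnProperty.call(sparam, key)) {
        pairs.push(key + '=' + sparam[key]);
      }
    }
    return pairs.concat(extra).join('&');
  }
  // anything else: send only the page information
  return extra.join('&');
}

// Usage: the three input shapes produce the same query string.
var fromString = withPageInfo('type=MyTest&clickTarget=testClickBtn', 'p1', 'sharpxiajun');
var fromArray = withPageInfo(['type=MyTest', 'clickTarget=testClickBtn'], 'p1', 'sharpxiajun');
var fromObject = withPageInfo({ type: 'MyTest', clickTarget: 'testClickBtn' }, 'p1', 'sharpxiajun');
```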
The second test page, parent.html:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>parent html</title>
</head>
<body>
    <a href="testbeacon.html" target="_self">child.html</a>
</body>
</html>
1.5 Test results
Test address:
http://localhost/testbeacon.html
We can then inspect the cookies in the browser:
[Figure: cookie list for the test page]
The log entries are as follows:
127.0.0.1 - - [26/Jun/2012:10:01:52 +0800] [2012-06-26 10:01:52] "GET /parent.html HTTP/1.1" [] [/parent.html] 304 -
127.0.0.1 - - [26/Jun/2012:10:01:54 +0800] [2012-06-26 10:01:54] "GET /testbeacon.html HTTP/1.1" [] [/testbeacon.html] 304 -
127.0.0.1 - - [26/Jun/2012:10:01:54 +0800] [2012-06-26 10:01:54] "GET /install_up_beacon.js HTTP/1.1" [] [/install_up_beacon.js] 304 -
127.0.0.1 - - [26/Jun/2012:10:01:54 +0800] [2012-06-26 10:01:54] "GET /up_beacon.js?140675524644 HTTP/1.1" [?140675524644] [/up_beacon.js] 304 -
127.0.0.1 - - [26/Jun/2012:10:01:54 +0800] [2012-06-26 10:01:54] "GET /a.gif?logUrl={/localhost/testbeacon.html}&logHisRefer={http://localhost/parent.html}&logParams={subIsNew=0}&logQuery={pageId=1340676114790-42900296489937289847295051780050&title=beacon%20test%20page&charset=UTF-8&sr=1280*1024}&ver=140675524644&time=1340676114791 HTTP/1.1" [?logUrl={/localhost/testbeacon.html}&logHisRefer={http://localhost/parent.html}&logParams={subIsNew=0}&logQuery={pageId=1340676114790-42900296489937289847295051780050&title=beacon%20test%20page&charset=UTF-8&sr=1280*1024}&ver=140675524644&time=1340676114791] [/a.gif] 200 43
127.0.0.1 - - [26/Jun/2012:10:02:01 +0800] [2012-06-26 10:02:01] "GET /click.html?type=MyTest&clickTarget=testClickBtn&pageId=1340676114790-42900296489937289847295051780050&authId=sharpxiajun&ver=140675524644&time=1340676121252 HTTP/1.1" [?type=MyTest&clickTarget=testClickBtn&pageId=1340676114790-42900296489937289847295051780050&authId=sharpxiajun&ver=140675524644&time=1340676121252] [/click.html] 200 310
As you can see, every request has been recorded. What remains is to analyze the information in these log files carefully.
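As a first step toward that analysis, the visit-log query string has to be taken apart. Its logUrl, logHisRefer, logParams and logQuery values are wrapped in braces and may themselves contain "&", so a plain split on "&" is not enough. The following is a minimal sketch under the assumption that braces inside the values are balanced; splitTopLevel, parseBeaconQuery and parsePairs are illustrative names, not part of the collection script.

```javascript
// Split on "&" only when we are not inside a {...} group.
function splitTopLevel(query) {
  var parts = [], depth = 0, start = 0;
  for (var i = 0; i < query.length; i++) {
    var ch = query.charAt(i);
    if (ch === '{') {
      depth++;
    } else if (ch === '}') {
      depth = Math.max(0, depth - 1);
    } else if (ch === '&' && depth === 0) {
      parts.push(query.substring(start, i));
      start = i + 1;
    }
  }
  parts.push(query.substring(start));
  return parts;
}

// Parse the top-level fields, stripping the surrounding braces from each value.
function parseBeaconQuery(query) {
  var result = {};
  var parts = splitTopLevel(query.replace(/^\?/, ''));
  for (var i = 0; i < parts.length; i++) {
    var eq = parts[i].indexOf('=');
    if (eq < 0) {
      continue;
    }
    var key = parts[i].substring(0, eq);
    var value = parts[i].substring(eq + 1);
    if (value.charAt(0) === '{' && value.charAt(value.length - 1) === '}') {
      value = value.substring(1, value.length - 1);
    }
    result[key] = value;
  }
  return result;
}

// Parse an inner "a=1&b=2" group such as logQuery, decoding each value.
function parsePairs(text) {
  var result = {};
  var pairs = text.split('&');
  for (var i = 0; i < pairs.length; i++) {
    var eq = pairs[i].indexOf('=');
    if (eq < 0) {
      continue;
    }
    var raw = pairs[i].substring(eq + 1);
    var value;
    try {
      value = decodeURIComponent(raw);
    } catch (e) {
      value = raw; // keep the raw text if it is not valid percent-encoding
    }
    result[pairs[i].substring(0, eq)] = value;
  }
  return result;
}
```

Applied to the a.gif entry above, parseBeaconQuery separates the page URL, the referrer and the two parameter groups, and parsePairs on logQuery then gives the pageId, title, charset and screen resolution.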