网站日志收集与分析

/ 技术文章 / 1 条评论 / 710浏览

网站日志收集与分析

意义

网站日志一般分为系统日志和程序日志

通过对网站日志的分析,我们可以了解

网站日志分析,可帮助管理增提供决策支持,如是否需要增加服务器,提升网站安全策略,修改营销策略等等

如何收集

一般流程

  1. 数据采集:从网站容器如Tomcat, Nginx,或者程序本身手机日志
  2. 数据处理:一般包含以下3个步骤,若量较小可直接使用JAVA或者Python直接分析,若量较大需要借助Hadoop等分布式计算框架
    • 数据清理
    • 数据转换
    • 数据抽取
  3. 数据展现:数据图表化
  4. 结果处理:针对分析结果,优化网站

tomcat

修改conf/server.xml,开放注释即可

<Valve className=“org.apache.catalina.valves.AccessLogValve” 
             directory=“logs”  prefix=“localhost_access_log.” suffix=“.txt” 
             pattern=“common” resolveHosts=“false”/>
%a – 远程主机的IP (Remote IP address) 
%A – 本机IP (Local IP address) 
%b – 发送字节数,不包含HTTP头,0字节则显示 ‘-’ (Bytes sent, excluding HTTP headers, or ‘-’ if no bytes 
were sent) 
%B – 发送字节数,不包含HTTP头 (Bytes sent, excluding HTTP headers) 
%h – 远程主机名 (Remote host name) 
%H – 请求的具体协议,HTTP/1.0 或 HTTP/1.1 (Request protocol) 
%l – 远程用户名,始终为 ‘-’ (Remote logical username from identd (always returns ‘-’)) 
%m – 请求方式,GET, POST, PUT (Request method) 
%p – 本机端口 (Local port) 
%q – 查询串 (Query string (prepended with a ‘?’ if it exists, otherwise 
an empty string) 
%r – HTTP请求中的第一行 (First line of the request) 
%s – HTTP状态码 (HTTP status code of the response) 
%S – 用户会话ID (User session ID) 
%t – 访问日期和时间 (Date and time, in Common Log Format format) 
%u – 已经验证的远程用户 (Remote user that was authenticated 
%U – 请求的URL路径 (Requested URL path) 
%v – 本地服务器名 (Local server name) 
%D – 处理请求所耗费的毫秒数 (Time taken to process the request, in millis) 
%T – 处理请求所耗费的秒数 (Time taken to process the request, in seconds) 
你可以用以上的任意组合来定制你的访问日志格式,也可以用下面两个别名common和combined来指定常用的日志格式:

nginx

修改nginx.conf,增加access_log,使用log_format可配置日志的格式

# 默认日志
access_log logs/access.log

#自定义日志格式, 需要制定日志格式名称
access_log logs/access.log combined;

log_format combined '$remote_addr-$remote_user [$time_local]' '"$request"$status $body_bytes_sent' '"$http_referer" "$http_user_agent"’

日志预处理

从访问日志的信息中,通过一些简单的处理转换,可以得到一些有用的信息:

由于日志文件中没有唯一标识,所以使用java处理日志文件,为每一行创建一个独立的ID标识

mport java.io.*;

public class LogParser {
    /**
     * 日志文件预处理
     *
     * @param inPath  日志文件输入地址
     * @param outPath 输出地址
     * @throws IOException
     */
    public void accessLogId(String inPath, String outPath) throws IOException {
        String line = "";
        long index = 0;
        BufferedReader bufferReader = null;
        FileWriter fileWriter = null;
        try {
            bufferReader = new BufferedReader(new FileReader(inPath));
            File outFile = new File(outPath);
            if (outFile.exists()) {
                outFile.delete();
            }
            outFile.createNewFile();

            fileWriter = new FileWriter(outFile);
            while ((line = bufferReader.readLine()) != null) {
                index++;
                String newLine = String.valueOf(index) + " " + line + "\r\n";
                fileWriter.write(newLine);
            }
        } finally {
            if (bufferReader != null) {
                bufferReader.close();
            }
            if (fileWriter != null) {
                fileWriter.close();
            }
        }

    }

    public static void main(String[] args) {
        try {
            if (args.length != 2) {
                System.err.printf("Usage: %s needs two arguments, input and output files\n", getClass().getSimpleName());
                return -1;
            }
            LogParser logParser = new LogParser();
            //日志文件预处理 传入输入和输出地址
            logParser.accessLogId(args[0], args[1]);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

网站分析主要指标

数据处理

梳理日志文件基本步骤如下

  1. 提取属性 1.1 日志文件属性提取:提取用户IP,访问时间,请求连接,来源连接,客户端(代理)信息等属性 1.2 ip信息提取:通过第三方服务或数据库,获得IP地址,省市,运营商信息
  2. 地域及运营商信息提取,join上步骤中的两张表获得日志中用户的地域信息及运营商信息
  3. 用户所在省和访问链接:使用正则表达式,提取省份及有效访问链接
  4. 访问时间及用户来源:解析用户访问的时间(精确到小时即可),次数和来源网站
  5. 用户信息处理:解析访问用户的操作系统,浏览器
  6. 计算业务数据 6.1 topN数据采集 6.2 获取每个访客的第一次访问时间和访问链接 6.3 ip黑名单处理
  7. 数据可视化

日志文件属性提取

--初始化临时表
DROP TABLE IF EXISTS t_web_access_log_tmp1;

CREATE TABLE t_web_access_log_tmp1 (
	id BIGINT,
	ip STRING,
	ip_num BIGINT,
	ip_1 BIGINT,
	access_time STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	c_info STRING
);

--使用正则表达式提取并向临时表插入数据
INSERT OVERWRITE TABLE t_web_access_log_tmp1
SELECT CAST(regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 1), ' $', '') AS BIGINT) AS id
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 2), ' $', '') AS ip
	, CAST(regexp_substr(content, '\\d+', 1, 2) AS BIGINT) * 255 * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 3) AS BIGINT) * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 4) AS BIGINT) * 255 + CAST(regexp_substr(content, '\\d+', 1, 5) AS BIGINT) AS ip_num
	, regexp_substr(content, '\\d+', 1, 2) AS ip_1
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 5), '^[[]|[]] $', '') AS access_time
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 6), '^"|" $', '') AS url
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 7), ' $', '') AS status
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 8), ' $', '') AS traffic
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 9), '^"|" $', '') AS referer
	, regexp_replace(regexp_substr(content, '".*?"', 1, 3), '^"|"$', '') AS c_info
FROM t_web_access_log_content;

--初始化IP表
DROP TABLE IF EXISTS t_cz_ip;

CREATE TABLE t_cz_ip (
	ip_start BIGINT,
	ip_start_1 STRING,
	ip_end BIGINT,
	city STRING,
	isp STRING
);

--使用正则表达式提取并向IP表插入数据
INSERT OVERWRITE TABLE t_cz_ip
SELECT CAST(regexp_substr(content, '\\d+', 1, 1) AS BIGINT) * 255 * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 2) AS BIGINT) * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 3) AS BIGINT) * 255 + CAST(regexp_substr(content, '\\d+', 1, 4) AS BIGINT) AS ip_start
	, regexp_substr(content, '\\d+', 1, 1) AS ip_start_1
	, CAST(regexp_substr(content, '\\d+', 1, 5) AS BIGINT) * 255 * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 6) AS BIGINT) * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 7) AS BIGINT) * 255 + CAST(regexp_substr(content, '\\d+', 1, 8) AS BIGINT) AS ip_end
	, regexp_replace(regexp_substr(content, ' +[^ ]+', 1, 2), '^ +', '') AS city
	, regexp_replace(regexp_substr(content, ' +[^ ]+', 1, 3), '^ +', '') AS isp
FROM t_cz_ip_content;

地域及运营商信息提取

--初始化临时表
DROP TABLE IF EXISTS t_web_access_log_tmp2;

CREATE TABLE t_web_access_log_tmp2 (
	id BIGINT,
	ip STRING,
	city STRING,
	isp STRING,
	access_time STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	c_info STRING
);

--从ip表中查询城市和运营商信息
INSERT OVERWRITE TABLE t_web_access_log_tmp2
SELECT /*+ mapjoin(b) */ a.id, a.ip, b.city, b.isp, a.access_time
	, a.url, a.status, a.traffic, a.referer, a.c_info
FROM t_web_access_log_tmp1 a
JOIN t_cz_ip b
ON a.ip_1 = b.ip_start_1
	AND a.ip_num >= b.ip_start
	AND a.ip_num <= b.ip_end;

用户所在省和访问链接

--初始化临时表
DROP TABLE IF EXISTS t_web_access_log_tmp3;

CREATE TABLE t_web_access_log_tmp3 (
	id BIGINT,
	ip STRING,
	province STRING,
	city STRING,
	isp STRING,
	access_time STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	c_info STRING
);

--提取省信息使用正则表达式提取访问链接的实际地址并向临时表插入数据
INSERT OVERWRITE TABLE t_web_access_log_tmp3
SELECT id, ip
	, CASE 
		WHEN INSTR(city, '省') > 0 THEN SUBSTR(city, 1, INSTR(city, '省') - 1)
		WHEN INSTR(city, '内蒙古') > 0 THEN '内蒙古'
		WHEN INSTR(city, '西藏') > 0 THEN '西藏'
		WHEN INSTR(city, '广西') > 0 THEN '广西'
		WHEN INSTR(city, '宁夏') > 0 THEN '宁夏'
		WHEN INSTR(city, '新疆') > 0 THEN '新疆'
		WHEN INSTR(city, '北京') > 0 THEN '北京'
		WHEN INSTR(city, '上海') > 0 THEN '上海'
		WHEN INSTR(city, '天津') > 0 THEN '天津'
		WHEN INSTR(city, '重庆') > 0 THEN '重庆'
		WHEN INSTR(city, '香港') > 0 THEN '香港'
		WHEN INSTR(city, '澳门') > 0 THEN '澳门'
		ELSE city
	END AS province, city, isp, access_time
	, regexp_replace(regexp_substr(url, ' .*?(\\.mooc| )', 1, 1), '^ | $', '') AS url
	, status, traffic, referer, c_info
FROM t_web_access_log_tmp2;

访问时间及用户来源

--初始化临时表
DROP TABLE IF EXISTS t_web_access_log_tmp4;

CREATE TABLE t_web_access_log_tmp4 (
	id BIGINT,
	ip STRING,
	province STRING,
	city STRING,
	isp STRING,
	access_time STRING,
	access_hour STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	ref_type STRING,
	c_info STRING
);

--使用正则表达式提取访问时间和来源分类并向临时表插入数据
INSERT OVERWRITE TABLE t_web_access_log_tmp4
SELECT id, ip, province, city, isp
	, regexp_replace(regexp_substr(access_time, ':\\d.*? ', 1, 1), ':|$', '') AS access_time
	, regexp_replace(regexp_substr(access_time, ':\\d\\d:', 1, 1), '^:|:$', '') AS access_hour
	, url, status, traffic, referer
	, CASE 
		WHEN INSTR(referer, 'www.chinamoocs.com') > 0
		OR LENGTH(referer) < 5 THEN 'self'
		WHEN INSTR(referer, 'www.google.com') > 0 THEN 'google'
		WHEN INSTR(referer, 'www.baidu.com') > 0 THEN 'baidu'
		WHEN INSTR(referer, 'www.bing.com') > 0 THEN 'bing'
		WHEN INSTR(referer, 'www.so.com') > 0 THEN '360'
		ELSE 'other'
	END AS ref_type, c_info
FROM t_web_access_log_tmp3;

用户信息处理

--初始化访问日志表
DROP TABLE IF EXISTS t_web_access_log;

CREATE TABLE t_web_access_log (
	id BIGINT,
	ip STRING,
	province STRING,
	city STRING,
	isp STRING,
	access_time STRING,
	access_hour STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	ref_type STRING,
	c_info STRING,
	client_type STRING,
	client_browser STRING
);

--使用正则表达式提取客户端信息中的操作系统和浏览器信息并向表插入数据
INSERT OVERWRITE TABLE t_web_access_log
SELECT id, ip, province, city, isp
	, access_time, access_hour, url, status, traffic
	, referer, ref_type, c_info
	, CASE 
		WHEN INSTR(c_info, 'iPhone;') > 0 THEN 'IOS'
		WHEN INSTR(c_info, 'iPad;') > 0 THEN 'IOS'
		WHEN INSTR(c_info, 'Mac OS X ') > 0 THEN 'OS X'
		WHEN INSTR(c_info, 'X11;') > 0 THEN 'Linux'
		WHEN INSTR(c_info, 'Android ') > 0 THEN 'Android'
		WHEN INSTR(c_info, 'Windows NT ') > 0 THEN 'Windows'
		ELSE 'other'
	END AS client_type
	, CASE 
		WHEN INSTR(c_info, ' QQBrowser') > 0 THEN 'QQBrowser'
		WHEN INSTR(c_info, ' UCBrowser') > 0 THEN 'UCBrowser'
		WHEN INSTR(c_info, ' Edge') > 0 THEN 'Edge'
		WHEN INSTR(c_info, ' LBBROWSER') > 0 THEN 'LBBROWSER'
		WHEN INSTR(c_info, ' Maxthon') > 0 THEN 'Maxthon'
		WHEN INSTR(c_info, ' Firefox') > 0 THEN 'Firefox'
		WHEN INSTR(c_info, ' Chrome') > 0 THEN 'Chrome'
		WHEN INSTR(c_info, ' Mac OS X') > 0
		AND INSTR(c_info, ' Safari') > 0 THEN 'Safari'
		WHEN INSTR(c_info, ' MSIE') > 0 THEN 'IE'
		ELSE 'other'
	END AS client_browser
FROM t_web_access_log_tmp4;

top 10数据采集

--初始化访问链接TopN表
DROP TABLE IF EXISTS t_web_access_log_url_top;

CREATE TABLE t_web_access_log_url_top (
	url STRING,
	times INT
);

INSERT OVERWRITE TABLE t_web_access_log_url_top
SELECT top.url, top.times
FROM (
	SELECT url, COUNT(1) AS times
	FROM t_web_access_log
	WHERE INSTR(url, '.mooc') > 0
	GROUP BY url
) top
ORDER BY top.times DESC
LIMIT 10;

每个访客的第一条访问日志

--初始化每个访客的第一个访问日志
DROP TABLE IF EXISTS t_web_access_log_first;

CREATE TABLE t_web_access_log_first (
	id BIGINT,
	ip STRING,
	province STRING,
	city STRING,
	isp STRING,
	access_time STRING,
	access_hour STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	ref_type STRING,
	c_info STRING,
	client_type STRING,
	client_browser STRING
);

INSERT OVERWRITE TABLE t_web_access_log_first
SELECT a.id, a.ip, a.province, a.city, a.isp
	, a.access_time, a.access_hour, a.url, a.status, a.traffic
	, a.referer, a.ref_type, a.c_info, a.client_type, a.client_browser
FROM t_web_access_log a
JOIN (
	SELECT c.ip, MIN(c.id) AS id
	FROM t_web_access_log c
	GROUP BY c.ip, 
		c.c_info
) b
ON a.ip = b.ip
	AND a.id = b.id;

IP黑名单处理

--初始化访问IP黑名单表
DROP TABLE IF EXISTS t_web_access_log_ip_black;

CREATE TABLE t_web_access_log_ip_black (
	ip STRING,
	times INT
);

INSERT OVERWRITE TABLE t_web_access_log_ip_black
SELECT ip, COUNT(1) AS times
FROM t_web_access_log
WHERE status = '404'
GROUP BY ip
HAVING COUNT(1) > 10;

参考 MOOC网站访问日志分析

  1. 21556778