网站日志收集与分析

意义

网站日志一般分为系统日志和程序日志

系统日志：访问日志，系统错误日志
程序日志：自定义

通过对网站日志的分析，我们可以了解

网站运行状况：PV UV
网站安全状况：是否存在恶意攻击及密码暴力破解
网站运营状况：搜索引擎来源等
网站用户信息：操作系统，浏览器

网站日志分析，可帮助管理增提供决策支持，如是否需要增加服务器，提升网站安全策略，修改营销策略等等

如何收集

一般流程

数据采集：从网站容器如Tomcat， Nginx，或者程序本身手机日志
数据处理：一般包含以下3个步骤，若量较小可直接使用JAVA或者Python直接分析，若量较大需要借助Hadoop等分布式计算框架
- 数据清理
- 数据转换
- 数据抽取
数据展现：数据图表化
结果处理：针对分析结果，优化网站

tomcat

修改conf/server.xml，开放注释即可

<Valve className=“org.apache.catalina.valves.AccessLogValve” 
             directory=“logs”  prefix=“localhost_access_log.” suffix=“.txt” 
             pattern=“common” resolveHosts=“false”/>

%a – 远程主机的IP (Remote IP address) 
%A – 本机IP (Local IP address) 
%b – 发送字节数，不包含HTTP头，0字节则显示 ‘-’ (Bytes sent, excluding HTTP headers, or ‘-’ if no bytes 
were sent) 
%B – 发送字节数，不包含HTTP头 (Bytes sent, excluding HTTP headers) 
%h – 远程主机名 (Remote host name) 
%H – 请求的具体协议，HTTP/1.0 或 HTTP/1.1 (Request protocol) 
%l – 远程用户名，始终为 ‘-’ (Remote logical username from identd (always returns ‘-’)) 
%m – 请求方式，GET, POST, PUT (Request method) 
%p – 本机端口 (Local port) 
%q – 查询串 (Query string (prepended with a ‘?’ if it exists, otherwise 
an empty string) 
%r – HTTP请求中的第一行 (First line of the request) 
%s – HTTP状态码 (HTTP status code of the response) 
%S – 用户会话ID (User session ID) 
%t – 访问日期和时间 (Date and time, in Common Log Format format) 
%u – 已经验证的远程用户 (Remote user that was authenticated 
%U – 请求的URL路径 (Requested URL path) 
%v – 本地服务器名 (Local server name) 
%D – 处理请求所耗费的毫秒数 (Time taken to process the request, in millis) 
%T – 处理请求所耗费的秒数 (Time taken to process the request, in seconds) 
你可以用以上的任意组合来定制你的访问日志格式，也可以用下面两个别名common和combined来指定常用的日志格式：

nginx

修改nginx.conf，增加access_log,使用log_format可配置日志的格式

# 默认日志
access_log logs/access.log

#自定义日志格式， 需要制定日志格式名称
access_log logs/access.log combined;

log_format combined '$remote_addr-$remote_user [$time_local]' '"$request"$status $body_bytes_sent' '"$http_referer" "$http_user_agent"’

$remote_addr和$http_x_forwarded_for用于记录IP地址；
$remote_user用于记录远程客户端用户名称；
$time_local用于记录访问时间与时区；
$request用于记录请求URL与HTTP协议；
$status用于记录请求状态，例如成功时状态为200，页面找不到时状态为404；
$body_bytes_sent用于记录发送客户端的文件主体内容大小；
$http_referer用于记录是从哪个页面链接访问过来的；
$http_user_agent用于记录客户浏览器的相关信息。 $

日志预处理

从访问日志的信息中，通过一些简单的处理转换，可以得到一些有用的信息：

用户IP：城市，网络运营商，一般分析日志时，需要将IP转化成10进制数字
- 免费API接口：淘宝 http://ip.taobao.com/instructions (5qps)
- 收费接口
- 本地数据库：GeoIP，纯真IP数据库
用户访问时间：用户访问趋势图，一般用到小时就行了
用户请求连接：一般只需要请求URL即可，协议，参数等可以清洗掉
来源连接：一般需要进行一定的分类，需要区分出是本站来源还是外站来源，外站来源还要区分是否是搜索引擎来源
客户端信息：一般提取操作系统和浏览器

由于日志文件中没有唯一标识，所以使用java处理日志文件，为每一行创建一个独立的ID标识

mport java.io.*;

public class LogParser {
    /**
     * 日志文件预处理
     *
     * @param inPath  日志文件输入地址
     * @param outPath 输出地址
     * @throws IOException
     */
    public void accessLogId(String inPath, String outPath) throws IOException {
        String line = "";
        long index = 0;
        BufferedReader bufferReader = null;
        FileWriter fileWriter = null;
        try {
            bufferReader = new BufferedReader(new FileReader(inPath));
            File outFile = new File(outPath);
            if (outFile.exists()) {
                outFile.delete();
            }
            outFile.createNewFile();

            fileWriter = new FileWriter(outFile);
            while ((line = bufferReader.readLine()) != null) {
                index++;
                String newLine = String.valueOf(index) + " " + line + "\r\n";
                fileWriter.write(newLine);
            }
        } finally {
            if (bufferReader != null) {
                bufferReader.close();
            }
            if (fileWriter != null) {
                fileWriter.close();
            }
        }

    }

    public static void main(String[] args) {
        try {
            if (args.length != 2) {
                System.err.printf("Usage: %s needs two arguments, input and output files\n", getClass().getSimpleName());
                return -1;
            }
            LogParser logParser = new LogParser();
            //日志文件预处理 传入输入和输出地址
            logParser.accessLogId(args[0], args[1]);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

网站分析主要指标

IP：一般之独立IP数，24小时内，访问网站的独立IP个数，重复IP只会被统计一次
PV：Page View，页面浏览次数或称为访问人次，刷新页面即+1
UV：Unique Visitor，访问人数，一般在Cookie中分配唯一编号，但是网站访问日志中获取不到Cookie信息，此时可以改用IP+客户端信息代替
如果PV，UV很接近，说明网站的点击次数并不多，说明网站的内容并不吸引用户；
如果IP>>UV，说明一般用户访（家庭宽带用户）问网站较多，家庭宽带IP经常变动；
如果IP<<UV，说明可能企业用户比较多，一般企业IP是固定的，且内部众多机器共享外网IP
流量分析，趋势图
来源分析：用户来源占比
搜索引擎来源：各个搜索引擎的占比，针对搜索引擎推广
访问链接TopN，可以考虑在热门连接中，添加广告和营销
恶意攻击IP：搜集疑似恶意攻击IP（大量404http请求），用于IP黑名单
访客分析：
- 访客地域分析（针对性优化，如服务器部署），
- 访客浏览器（页面适配优化），
- 操作系统
- 运营商：不同运营商存在很大延迟，优化服务器配额

数据处理

梳理日志文件基本步骤如下

提取属性 1.1 日志文件属性提取：提取用户IP，访问时间，请求连接，来源连接，客户端（代理）信息等属性 1.2 ip信息提取：通过第三方服务或数据库，获得IP地址，省市，运营商信息
地域及运营商信息提取，join上步骤中的两张表获得日志中用户的地域信息及运营商信息
用户所在省和访问链接：使用正则表达式，提取省份及有效访问链接
访问时间及用户来源：解析用户访问的时间（精确到小时即可），次数和来源网站
用户信息处理：解析访问用户的操作系统，浏览器
计算业务数据 6.1 topN数据采集 6.2 获取每个访客的第一次访问时间和访问链接 6.3 ip黑名单处理
数据可视化

日志文件属性提取

--初始化临时表
DROP TABLE IF EXISTS t_web_access_log_tmp1;

CREATE TABLE t_web_access_log_tmp1 (
	id BIGINT,
	ip STRING,
	ip_num BIGINT,
	ip_1 BIGINT,
	access_time STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	c_info STRING
);

--使用正则表达式提取并向临时表插入数据
INSERT OVERWRITE TABLE t_web_access_log_tmp1
SELECT CAST(regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 1), ' $', '') AS BIGINT) AS id
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 2), ' $', '') AS ip
	, CAST(regexp_substr(content, '\\d+', 1, 2) AS BIGINT) * 255 * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 3) AS BIGINT) * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 4) AS BIGINT) * 255 + CAST(regexp_substr(content, '\\d+', 1, 5) AS BIGINT) AS ip_num
	, regexp_substr(content, '\\d+', 1, 2) AS ip_1
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 5), '^[[]|[]] $', '') AS access_time
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 6), '^"|" $', '') AS url
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 7), ' $', '') AS status
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 8), ' $', '') AS traffic
	, regexp_replace(regexp_substr(content, '[[].*?[]] |".*?" |.*? ', 1, 9), '^"|" $', '') AS referer
	, regexp_replace(regexp_substr(content, '".*?"', 1, 3), '^"|"$', '') AS c_info
FROM t_web_access_log_content;

--初始化IP表
DROP TABLE IF EXISTS t_cz_ip;

CREATE TABLE t_cz_ip (
	ip_start BIGINT,
	ip_start_1 STRING,
	ip_end BIGINT,
	city STRING,
	isp STRING
);

--使用正则表达式提取并向IP表插入数据
INSERT OVERWRITE TABLE t_cz_ip
SELECT CAST(regexp_substr(content, '\\d+', 1, 1) AS BIGINT) * 255 * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 2) AS BIGINT) * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 3) AS BIGINT) * 255 + CAST(regexp_substr(content, '\\d+', 1, 4) AS BIGINT) AS ip_start
	, regexp_substr(content, '\\d+', 1, 1) AS ip_start_1
	, CAST(regexp_substr(content, '\\d+', 1, 5) AS BIGINT) * 255 * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 6) AS BIGINT) * 255 * 255 + CAST(regexp_substr(content, '\\d+', 1, 7) AS BIGINT) * 255 + CAST(regexp_substr(content, '\\d+', 1, 8) AS BIGINT) AS ip_end
	, regexp_replace(regexp_substr(content, ' +[^ ]+', 1, 2), '^ +', '') AS city
	, regexp_replace(regexp_substr(content, ' +[^ ]+', 1, 3), '^ +', '') AS isp
FROM t_cz_ip_content;

地域及运营商信息提取

--初始化临时表
DROP TABLE IF EXISTS t_web_access_log_tmp2;

CREATE TABLE t_web_access_log_tmp2 (
	id BIGINT,
	ip STRING,
	city STRING,
	isp STRING,
	access_time STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	c_info STRING
);

--从ip表中查询城市和运营商信息
INSERT OVERWRITE TABLE t_web_access_log_tmp2
SELECT /*+ mapjoin(b) */ a.id, a.ip, b.city, b.isp, a.access_time
	, a.url, a.status, a.traffic, a.referer, a.c_info
FROM t_web_access_log_tmp1 a
JOIN t_cz_ip b
ON a.ip_1 = b.ip_start_1
	AND a.ip_num >= b.ip_start
	AND a.ip_num <= b.ip_end;

用户所在省和访问链接

--初始化临时表
DROP TABLE IF EXISTS t_web_access_log_tmp3;

CREATE TABLE t_web_access_log_tmp3 (
	id BIGINT,
	ip STRING,
	province STRING,
	city STRING,
	isp STRING,
	access_time STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	c_info STRING
);

--提取省信息使用正则表达式提取访问链接的实际地址并向临时表插入数据
INSERT OVERWRITE TABLE t_web_access_log_tmp3
SELECT id, ip
	, CASE 
		WHEN INSTR(city, '省') > 0 THEN SUBSTR(city, 1, INSTR(city, '省') - 1)
		WHEN INSTR(city, '内蒙古') > 0 THEN '内蒙古'
		WHEN INSTR(city, '西藏') > 0 THEN '西藏'
		WHEN INSTR(city, '广西') > 0 THEN '广西'
		WHEN INSTR(city, '宁夏') > 0 THEN '宁夏'
		WHEN INSTR(city, '新疆') > 0 THEN '新疆'
		WHEN INSTR(city, '北京') > 0 THEN '北京'
		WHEN INSTR(city, '上海') > 0 THEN '上海'
		WHEN INSTR(city, '天津') > 0 THEN '天津'
		WHEN INSTR(city, '重庆') > 0 THEN '重庆'
		WHEN INSTR(city, '香港') > 0 THEN '香港'
		WHEN INSTR(city, '澳门') > 0 THEN '澳门'
		ELSE city
	END AS province, city, isp, access_time
	, regexp_replace(regexp_substr(url, ' .*?(\\.mooc| )', 1, 1), '^ | $', '') AS url
	, status, traffic, referer, c_info
FROM t_web_access_log_tmp2;

访问时间及用户来源

--初始化临时表
DROP TABLE IF EXISTS t_web_access_log_tmp4;

CREATE TABLE t_web_access_log_tmp4 (
	id BIGINT,
	ip STRING,
	province STRING,
	city STRING,
	isp STRING,
	access_time STRING,
	access_hour STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	ref_type STRING,
	c_info STRING
);

--使用正则表达式提取访问时间和来源分类并向临时表插入数据
INSERT OVERWRITE TABLE t_web_access_log_tmp4
SELECT id, ip, province, city, isp
	, regexp_replace(regexp_substr(access_time, ':\\d.*? ', 1, 1), ':|$', '') AS access_time
	, regexp_replace(regexp_substr(access_time, ':\\d\\d:', 1, 1), '^:|:$', '') AS access_hour
	, url, status, traffic, referer
	, CASE 
		WHEN INSTR(referer, 'www.chinamoocs.com') > 0
		OR LENGTH(referer) < 5 THEN 'self'
		WHEN INSTR(referer, 'www.google.com') > 0 THEN 'google'
		WHEN INSTR(referer, 'www.baidu.com') > 0 THEN 'baidu'
		WHEN INSTR(referer, 'www.bing.com') > 0 THEN 'bing'
		WHEN INSTR(referer, 'www.so.com') > 0 THEN '360'
		ELSE 'other'
	END AS ref_type, c_info
FROM t_web_access_log_tmp3;

用户信息处理

--初始化访问日志表
DROP TABLE IF EXISTS t_web_access_log;

CREATE TABLE t_web_access_log (
	id BIGINT,
	ip STRING,
	province STRING,
	city STRING,
	isp STRING,
	access_time STRING,
	access_hour STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	ref_type STRING,
	c_info STRING,
	client_type STRING,
	client_browser STRING
);

--使用正则表达式提取客户端信息中的操作系统和浏览器信息并向表插入数据
INSERT OVERWRITE TABLE t_web_access_log
SELECT id, ip, province, city, isp
	, access_time, access_hour, url, status, traffic
	, referer, ref_type, c_info
	, CASE 
		WHEN INSTR(c_info, 'iPhone;') > 0 THEN 'IOS'
		WHEN INSTR(c_info, 'iPad;') > 0 THEN 'IOS'
		WHEN INSTR(c_info, 'Mac OS X ') > 0 THEN 'OS X'
		WHEN INSTR(c_info, 'X11;') > 0 THEN 'Linux'
		WHEN INSTR(c_info, 'Android ') > 0 THEN 'Android'
		WHEN INSTR(c_info, 'Windows NT ') > 0 THEN 'Windows'
		ELSE 'other'
	END AS client_type
	, CASE 
		WHEN INSTR(c_info, ' QQBrowser') > 0 THEN 'QQBrowser'
		WHEN INSTR(c_info, ' UCBrowser') > 0 THEN 'UCBrowser'
		WHEN INSTR(c_info, ' Edge') > 0 THEN 'Edge'
		WHEN INSTR(c_info, ' LBBROWSER') > 0 THEN 'LBBROWSER'
		WHEN INSTR(c_info, ' Maxthon') > 0 THEN 'Maxthon'
		WHEN INSTR(c_info, ' Firefox') > 0 THEN 'Firefox'
		WHEN INSTR(c_info, ' Chrome') > 0 THEN 'Chrome'
		WHEN INSTR(c_info, ' Mac OS X') > 0
		AND INSTR(c_info, ' Safari') > 0 THEN 'Safari'
		WHEN INSTR(c_info, ' MSIE') > 0 THEN 'IE'
		ELSE 'other'
	END AS client_browser
FROM t_web_access_log_tmp4;

top 10数据采集

--初始化访问链接TopN表
DROP TABLE IF EXISTS t_web_access_log_url_top;

CREATE TABLE t_web_access_log_url_top (
	url STRING,
	times INT
);

INSERT OVERWRITE TABLE t_web_access_log_url_top
SELECT top.url, top.times
FROM (
	SELECT url, COUNT(1) AS times
	FROM t_web_access_log
	WHERE INSTR(url, '.mooc') > 0
	GROUP BY url
) top
ORDER BY top.times DESC
LIMIT 10;

每个访客的第一条访问日志

--初始化每个访客的第一个访问日志
DROP TABLE IF EXISTS t_web_access_log_first;

CREATE TABLE t_web_access_log_first (
	id BIGINT,
	ip STRING,
	province STRING,
	city STRING,
	isp STRING,
	access_time STRING,
	access_hour STRING,
	url STRING,
	status STRING,
	traffic STRING,
	referer STRING,
	ref_type STRING,
	c_info STRING,
	client_type STRING,
	client_browser STRING
);

INSERT OVERWRITE TABLE t_web_access_log_first
SELECT a.id, a.ip, a.province, a.city, a.isp
	, a.access_time, a.access_hour, a.url, a.status, a.traffic
	, a.referer, a.ref_type, a.c_info, a.client_type, a.client_browser
FROM t_web_access_log a
JOIN (
	SELECT c.ip, MIN(c.id) AS id
	FROM t_web_access_log c
	GROUP BY c.ip, 
		c.c_info
) b
ON a.ip = b.ip
	AND a.id = b.id;

IP黑名单处理

--初始化访问IP黑名单表
DROP TABLE IF EXISTS t_web_access_log_ip_black;

CREATE TABLE t_web_access_log_ip_black (
	ip STRING,
	times INT
);

INSERT OVERWRITE TABLE t_web_access_log_ip_black
SELECT ip, COUNT(1) AS times
FROM t_web_access_log
WHERE status = '404'
GROUP BY ip
HAVING COUNT(1) > 10;

参考 MOOC网站访问日志分析

本站文章除注明转载/出处外，均为本站原创或翻译，转载前请务必署名,转载请标明出处
最后编辑时间为: 2019/01/29 22:16