1 Abstract

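Current attacks on enterprise information systems fall into two main types: insider attacks and APT attacks.

Existing detection methods consider only the sequential relationship among log entries when modeling a user's sequential behavior, and ignore every other relationship.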

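Log2vec, proposed in this paper, is a modular method based on heterogeneous graph embedding:

Use a heuristic approach to convert log entries into a heterogeneous graph according to the multiple relationships among them
Graph embedding: convert each log entry into a low-dimensional vector
Use a detection algorithm to separate malicious and benign log entries into different clusters and identify the malicious ones.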

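The analysis reveals three problems:

- How to detect both types of attack while covering three relationships:
  - sequential relationship among log entries
  - logical relationships among days
  - interactive relationship among hosts
- Fine-grained detection in the APT scenario
- Training the detection model without attack samples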

The design of the paper:

graph construction
graph embedding
detection algorithm

Log2vec introduces specialized designs to tackle the aforementioned problems:

construct a heterogeneous graph
divide a log entry into 5 attributes: subject (e.g. user id), operation type (e.g. visit or send), object (e.g. website or email), time, and host (e.g. server id)
graph embedding & detection algorithm without attack sample
automatically learn representation instead of manually extracting features

2 Overview

2.1 Motivating Example

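(c) depicts the sequence approach: an LSTM can be used to predict the next behavior, so it mainly captures causal and sequential relationships among log entries.
On day 3 there is a large number of device connect and file copy operations. User-behavior anomaly detection can catch this (logical relationships among days), but an LSTM cannot.
Nor can an LSTM capture the interactive relationships among hosts shown in (d).
The approach in Figure (d) can locate the compromised host (coarse-grained), but removing the many benign operations that host contains still takes a lot of manual work.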

2.2 Architecture

graph construction

A rule-based heuristic approach maps relationships among log entries into a graph. It mainly considers three relationships: (1) causal and sequential relationships within a day; (2) logical relationships among days; (3) logical relationships among objects.

Through different combinations of a log's attributes, we devise various behavioral sequences involving fewer log entries and map multiple finer relationships among log entries, within a day and a host, into the graph (log correlation).

graph embedding

Each node is a log entry (a 5-tuple); random walk is used to represent each node as a vector.

detection algorithm

Log2vec adopts a clustering algorithm to analyze the above vectors and group benign operations (log entries) and malicious ones into different clusters. After the clustering, we set a threshold to identify malicious clusters.

The number of malicious operations is small [2, 4], so smaller clusters tend to be malicious.
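
The thresholding step described above can be sketched as follows. This is only a minimal illustration: it assumes cluster labels have already been produced by some clustering algorithm, and the function names and the size-based threshold are hypothetical, not taken from the paper.

```python
from collections import Counter

def flag_small_clusters(labels, threshold):
    """Return the ids of clusters whose size is at most `threshold`."""
    sizes = Counter(labels)
    return {cid for cid, n in sizes.items() if n <= threshold}

def suspicious_entries(labels, threshold):
    """Return the indices of log entries that fall into small clusters."""
    small = flag_small_clusters(labels, threshold)
    return [i for i, cid in enumerate(labels) if cid in small]

# labels[i] is the (assumed) cluster id of log entry i
labels = [0, 0, 0, 0, 1, 0, 2, 2, 0]
suspicious = suspicious_entries(labels, threshold=2)
print(suspicious)
```

Entries in the large cluster 0 are treated as benign, while the entries in the two small clusters are reported for inspection.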

3 Graph Construction

3.1 Definition of log entry

log entry: $\langle sub, obj, A, T, H \rangle$.

$sub, obj, A$ and $H$ have their own sets of attributes.

In addition, log2vec treats a logon operation as follows: a user (sub) logs in to (A) a destination host (obj) from a source host (H), just as a user writes a file on a server.
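
As a rough illustration of the 5-tuple, a log entry could be held in a small record type. The field names follow the notation above; the concrete types and sample values are assumptions made for this sketch.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class LogEntry:
    sub: str     # subject, e.g. a user id
    obj: str     # object, e.g. a file, website, or the destination host of a logon
    A: str       # operation type, e.g. "write" or "logon"
    T: datetime  # time of the operation
    H: str       # host; for a logon this is the source host

# A file write on a server (sample values are made up)
file_write = LogEntry("user42", "report.docx", "write",
                      datetime(2020, 1, 6, 9, 30), "server1")

# A logon from hostA to hostB, modelled the same way
logon = LogEntry("user42", "hostB", "logon",
                 datetime(2020, 1, 6, 9, 0), "hostA")
```

Modelling a logon with the destination host as `obj` keeps every log type in the same shape, so the graph construction rules can treat them uniformly.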

3.2 Rules of Graph Construction

rule1∼rule3 concatenate log entries within a day into sequences, corresponding to relationship (1). The three rules capture user behavior from three different angles: day, host, and operation type. As a result, unusual operations end up isolated in the graph.
Linking log entries:
- Rule 1: same day (weight 1)
- Rule 2: same host & same day (weight 1)
- Rule 3: same operation type & same host & same day (weight 1)
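
A minimal sketch of rules 1 to 3 is given below. It assumes that "concatenating into sequences" means linking chronologically consecutive entries inside each group; the tuple layout and the helper names are hypothetical.

```python
from collections import defaultdict

def chain_edges(entries, key):
    """Group entries by `key`, sort each group by time, and link
    consecutive entries with weight 1."""
    groups = defaultdict(list)
    for e in entries:
        groups[key(e)].append(e)
    edges = []
    for group in groups.values():
        group.sort(key=lambda e: e[4])          # e[4] is the time
        for a, b in zip(group, group[1:]):
            edges.append((a[0], b[0], 1))       # (from_id, to_id, weight)
    return edges

def rules_1_to_3(entries):
    return {
        "rule1": chain_edges(entries, key=lambda e: e[1]),               # same day
        "rule2": chain_edges(entries, key=lambda e: (e[1], e[2])),       # + same host
        "rule3": chain_edges(entries, key=lambda e: (e[1], e[2], e[3])), # + same operation type
    }

# Assumed layout: (entry_id, day, host, operation_type, time)
entries = [
    ("e1", "day1", "pc1", "logon", 1),
    ("e2", "day1", "pc1", "copy", 2),
    ("e3", "day1", "pc2", "logon", 3),
    ("e4", "day1", "pc1", "copy", 4),
]
edges = rules_1_to_3(entries)
```

Each rule narrows the grouping key, so the sequences get shorter and more specific from rule 1 to rule 3.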

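rule4∼rule6 separately bridge these daily sequences based on relationship (2), so that anomalous behavioral sequences are separated from benign ones.

Linking log sequences:
- Rule 4: daily log sequences are linked (weight proportional to how similar the two sequences' numbers of log entries are)
- Rule 5: daily log sequences of the same host are linked (weight proportional to how similar the two sequences' numbers of log entries are)
- Rule 6: daily log sequences of the same operation type & same host are linked (weight proportional to how similar the two sequences' numbers of log entries are)

Together, these six rules link a user's different operation types across hosts and across time periods.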
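
The weighting used when bridging daily sequences can be sketched as follows. The notes only say that the weight grows with the similarity of the two sequences' log-entry counts; the min/max ratio used here is an assumed stand-in for that similarity, and the function names are hypothetical.

```python
from itertools import combinations

def count_similarity(n_a, n_b):
    """Assumed similarity of two entry counts: 1.0 when equal, near 0 when very different."""
    if n_a == 0 or n_b == 0:
        return 0.0
    return min(n_a, n_b) / max(n_a, n_b)

def bridge_sequences(seq_lengths):
    """seq_lengths maps a daily sequence id to its number of log entries.
    Returns one weighted edge per pair of sequences."""
    edges = []
    for a, b in combinations(sorted(seq_lengths), 2):
        edges.append((a, b, count_similarity(seq_lengths[a], seq_lengths[b])))
    return edges

# Day 3 contains far more operations than the other two days
lengths = {"day1": 4, "day2": 5, "day3": 40}
weights = {(a, b): w for a, b, w in bridge_sequences(lengths)}
```

With these sample counts the day1/day2 link is much stronger than any link to day3, which is the effect the rules rely on to separate an anomalous day from the normal ones.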

rule7∼rule10, corresponding to relationship (3), consider how a user logs into or compromises a host and sends confidential data to the outside. Specifically, they construct the user's patterns of logon and web browsing operations.

rule7 and rule8, consider how these hosts are interactively accessed within intranet
rule9 and rule10, focus on user’s browser usage via the Internet

Linking log entries:
- Rule 7: same destination host & same source host & same authentication type (weight 1)

Linking log sequences:
- Rule 8a: same destination host & same source host & different authentication type (weight proportional to how similar the two sequences' numbers of log entries are)
- Rule 8b: different destination/source & same authentication type (weight proportional to how similar the two sequences' numbers of log entries are)

Linking log entries:
- Rule 9: same host & accessing the same domain name (weight 1)

Linking log sequences:
- Rule 10: same host & different domain name (weight proportional to the similarity of the accessing modes and to how similar the two sequences' numbers of log entries are; the access mode takes precedence over the number)

Because log2vec aims to analyze the behavior of each user, it builds one heterogeneous graph per user.

4 Graph Embedding

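Random walk extracts the context of each node, and this context is fed into word2vec to compute the node's vector.

Random walk: a graph traversal algorithm in which a walker starts from a node and chooses the next node to visit according to edge type and weight. The generated path is a sequence of nodes and is treated as the context of those nodes.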

For instance, when a walker resides at a node belonging to the sequence of device connections in day1 or day2 (Figure 2a), generated by graph construction, he would seldom choose the node (device connection) in the sequence of day3 because of low link weight. Likewise, when residing at a node in sequence of day3, he would rarely reach the sequence of day1 or day2. Therefore, log2vec extracts paths involving nodes of day1 and day2 or individually day3.
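
The behaviour in this example can be sketched with a plain weight-biased random walk. This is a simplified illustration that ignores edge types; the adjacency-list layout, node names, and weights are assumptions.

```python
import random

def random_walk(graph, start, length, rng):
    """graph maps a node to a list of (neighbor, weight) pairs.
    The walker picks the next node with probability proportional to edge weight."""
    path = [start]
    while len(path) < length:
        neighbors = graph.get(path[-1], [])
        if not neighbors:
            break                                # dead end: stop early
        nodes, weights = zip(*neighbors)
        path.append(rng.choices(nodes, weights=weights, k=1)[0])
    return path

# Device-connection nodes of three days; links to day 3 carry a low weight
graph = {
    "d1_conn": [("d2_conn", 0.8), ("d3_conn", 0.1)],
    "d2_conn": [("d1_conn", 0.8), ("d3_conn", 0.125)],
    "d3_conn": [("d1_conn", 0.1), ("d2_conn", 0.125)],
}
rng = random.Random(0)
path = random_walk(graph, "d1_conn", 10, rng)
```

Because the edges into `d3_conn` are light, most steps taken from `d1_conn` or `d2_conn` stay between those two nodes, so their contexts rarely include day 3.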

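The vector of each node is computed from its paths (context). These paths are treated like sentences in natural language processing, from which the vector of each word (node) is learned. This ensures that a node (log entry) and its neighbors (log entries having close relationships with it) share similar embeddings (vectors).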

4.1 Random Walk with Different Sets of Edge Type

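The improved random walk brings two enhancements:

Controlling the number of neighbor nodes
Extracting context with different proportions of the edge-type sets (to cope with data imbalance)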

4.1.1 Transition probability

4.1.2 Number of neighbor nodes

The first and last nodes of each sequence have many neighbor nodes. For such nodes, the number of neighbors (neigh) is treated as a hyperparameter to be tuned, for two reasons:

An insider threat typically involves only a few malicious log entries. They cannot be isolated in the vector space if the random walk considers all neighbor nodes.
The optimal neigh therefore varies from user to user.

We in principle tend to set a lower value, such as 1, to ensure that the most similar sequence connects to the given one.
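
A minimal sketch of capping the neighbor count: keep only the `neigh` highest-weight neighbors of a node before walking. The function name and the sample weights are hypothetical.

```python
def top_neighbors(neighbors, neigh):
    """Keep the `neigh` neighbors with the largest edge weight.
    `neighbors` is a list of (node, weight) pairs."""
    ranked = sorted(neighbors, key=lambda nw: nw[1], reverse=True)
    return ranked[:neigh]

# Candidate sequences linked to the end node of one daily sequence
candidates = [("day1", 0.8), ("day3", 0.1), ("day4", 0.6)]
kept = top_neighbors(candidates, neigh=1)
```

With `neigh=1` only the most similar sequence stays connected, which matches the preference for a low value stated above.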

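4.1.3 Proportion of edge-type sets

Log2vec determines the importance of each edge set according to the scenario, because only one or two meta-attributes of a log entry become anomalous in each kind of insider attack.

First, the proportions of the five edge sets are initialized to 1:1:1:1:1. Log2vec then adjusts the shares of {edge3, edge6}, {edge7, edge8} and {edge9, edge10}, because these three edge sets cover the meta-attributes involved in anomalous behavior. In practice, the final proportion (ps) is 1:1:1:8:1.

In addition, a number of random walks r is set; the final number of random walks = $r\times ps$.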
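
As a worked example of $r\times ps$, the number of walks per edge set can be computed directly. The value of r below is made up for illustration, and the list is positional: the notes do not say which edge set receives the share of 8.

```python
def walks_per_edge_set(r, ps):
    """Multiply the base walk count r by each edge set's proportion."""
    return [r * p for p in ps]

ps = [1, 1, 1, 8, 1]   # final proportion from the notes
r = 10                 # hypothetical base number of walks
counts = walks_per_edge_set(r, ps)
total = sum(counts)
```

The up-weighted edge set contributes eight times as many walks as each of the others, so its relationships dominate the extracted context.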

4.2 Word2vec

Log2vec employs a word2vec model, skip-gram, to map log entries into low-dimensional vectors through their context (paths). Its objective function maximizes the probability of a node's neighbors conditioned on that node.
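
To make "neighbors conditioned on a node" concrete, the sketch below turns one random-walk path into the (center, context) pairs that a skip-gram model would train on. It only shows the pair generation, not the training itself; the window size and node ids are assumptions.

```python
def skipgram_pairs(path, window):
    """For every node in the path, pair it with each node at most
    `window` steps away in the same path."""
    pairs = []
    for i, center in enumerate(path):
        lo = max(0, i - window)
        hi = min(len(path), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, path[j]))
    return pairs

# One path produced by the random walk (node ids are hypothetical)
pairs = skipgram_pairs(["e1", "e2", "e3"], window=1)
```

Nodes that co-occur in many paths appear together in many pairs, which is what pushes their vectors close to each other during training.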