+ Motivated by recent advancements in node representation learning for single-graph tasks, REGAL is a framework that leverages the power of automatically learned node representations to match nodes across different graphs.
- xNetMF, an elegant and principled node embedding formulation that uniquely generalizes to multi-network problems.
- network alignment or matching, which is the problem of finding corresponding nodes in different networks.
+ Network alignment is crucial for identifying similar users in different social networks or for analysing chemical compounds
- Many existing methods try to relax the computationally hard optimization problem, as designing features that can be directly compared for nodes in different networks is not an easy task.
- we propose network alignment via matching latent, learned node representations.
- *Problem:* Given two graphs G_1 and G_2 with nodesets V_1 and V_2 and possibly node attributes A_1 and A_2 resp., devise an efficient network alignment method that aligns nodes by learning directly comparable node representations Y_1 and Y_2, from which a node mapping $\phi: V_1 \rightarrow V_2$ between the networks can be inferred.
- REGAL is a framework that efficiently identifies node matchings by greedily aligning their latent feature representations.
- They use Cross-Network Matrix Factorization (xNetMF) to learn the representations
+ xNetMF preserves structural similarities rather than proximity-based similarities, allowing for generalization beyond a single network.
+ xNetMF is formulated as matrix factorization over a similarity matrix which incorporates structural similarity and attribute agreement between nodes in disjoint graphs.
+ Constructing the similarity matrix is expensive, as it requires computing similarities between all pairs of nodes across the multiple networks. To avoid this, they extend the Nyström low-rank approximation, which is commonly used for large-scale kernel machines.
+ This makes xNetMF a principled and efficient implicit matrix factorization-based approach.
- our approach can be applied to attributed and unattributed graphs with virtually no change in formulation, and is unsupervised: it does not require prior alignment information to find high-quality matchings.
- Many well-known node embedding methods based on shallow architectures such as the popular skip-gram with negative sampling (SGNS) have been cast in matrix factorization frameworks. However, ours is the first to cast node embedding using SGNS to capture structural identity in such a framework
- we consider the significantly harder problem of learning embeddings that may be individually matched to infer node-level alignments.
*** REGAL Description
- Let G_1(V_1, E_1) and G_2(V_2, E_2) be two unweighted and undirected graphs (described in the setting of two graphs, but can be extended to more), with node sets V_1 and V_2 and edge sets E_1 and E_2; and possible node attribute sets A_1 and A_2.
+ The graphs do not have to be the same size
- Let n = |V_1| + |V_2|, the total number of nodes across the two graphs.
- The steps are then:
1)*Node Identity Extraction:* Extract structure and attribute-related info from all n nodes
2)*Efficient Similarity-based Representation:* Obtains node embeddings, conceptually by factorising a similarity matrix of the node identities from step 1. However, the computation of this similarity matrix and the factorisation of it is expensive, so they extend the Nystrom Method for low-rank matrix approximation to perform an implicit similarity matrix factorisation by *(a)* comparing similarity of each node only to a sample of p << n so-called "landmark nodes" and *(b)* using these node-to-landmark similarities to construct the representations from a decomposition of its low-rank approximation.
3)*Fast Node Representation Alignment:* Align nodes between the two graphs by greedily matching the embeddings with an efficient data structure (KD-tree) that allows for fast identification of the top-a most similar embeddings from the other graph.
- The first two steps are the xNetMF method
**** Step 1
- The goal of REGAL’s representation learning module, xNetMF, is to define node “identity” in a way that generalizes to multi-network problems.
- As nodes in multi-network problems have no direct connections to each other, their proximity can't be sampled by random walks on separate graphs. This is overcome by instead focusing on more broadly comparable, generalisable quantities: Structural Identity which relates to structural roles and Attribute-Based Identity.
- *Structural Identity*: In network alignment, the well-established assumption is that aligned nodes have similar structural connectivity or degrees. Thus, we can use the degrees of the neighbours of a node as structural identity. They also consider neighbors up to k hops from the original node.
+ For some node $u \in V$, $R_u^k$ is then the set of nodes exactly k hops from $u$. We could capture the degrees of these nodes in a vector $d_u^k$ whose length is the highest degree in the graph $(D)$, where entry $d_u^k(i)$ denotes the number of nodes in $R_u^k$ of degree $i$. This vector would however potentially be very long and very sparse, since a single high-degree node forces up the length of $d_u^k$. Instead, nodes are binned into $b = \lceil \log_2 D \rceil$ logarithmically scaled buckets, so that entry $i$ of $d_u^k$ contains the number of nodes $v \in R_u^k$ such that $\lfloor \log_2(\deg(v)) \rfloor = i$. This is both much shorter (length $\lceil \log_2 D \rceil$) and more robust to noise.
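+ A minimal sketch (my own code, not REGAL's) of the log-binned k-hop degree histogram described above, assuming an undirected networkx graph; function and variable names are placeholders:
  #+BEGIN_SRC python
  import math
  import networkx as nx

  def khop_degree_histogram(G: nx.Graph, u, k: int, num_bins: int):
      """Log-binned degree histogram of the nodes exactly k hops away from u."""
      dists = nx.single_source_shortest_path_length(G, u, cutoff=k)
      ring = [v for v, d in dists.items() if d == k]          # R_u^k
      hist = [0] * num_bins
      for v in ring:
          deg = G.degree(v)
          if deg > 0:
              hist[min(int(math.log2(deg)), num_bins - 1)] += 1
      return hist

  G = nx.karate_club_graph()
  D = max(deg for _, deg in G.degree())
  b = math.ceil(math.log2(D))                                  # number of log-scaled buckets
  print(khop_degree_histogram(G, 0, 2, b))
  #+END_SRC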
- *Attribute-Based Identity*: Given $F$ node attributes, they create for each node $u$ an $F$-dimensional vector $f_u$ representing the values of $u$. So $f_u(i)$ = the i'th attribute of $u$.
- *Cross-Network Node Similarity*: Relies on the structural and attribute information rather than direct proximity: $sim(u,v) = \exp[-\gamma_s \cdot \left\lVert d_u - d_v \right\rVert_2^2 - \gamma_a \cdot dist(f_u, f_v)]$ where $\gamma_s, \gamma_a$ are scalar parameters controlling the effect of the structural and attribute-based identities, $dist(f_u, f_v)$ is the attribute-based distance of nodes $u$ and $v$, and $d_u = \sum_{k=1}^{K} \delta^{k-1} d_u^k$ is the neighbor degree vector of $u$ aggregated over $K$ different hops, where $\delta$ is a discount factor for greater hop distances and $K$ is the maximum hop distance to consider. So they compare structural identities at several levels by combining the neighborhood degree distributions at several hop distances. The distance between attribute-based identities depends on the type of node attributes (real-valued, categorical, and so on). For categorical attributes, the number of disagreeing features can be used as an attribute-based distance measure.
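- A small sketch of this combined similarity, assuming the per-hop histograms $d_u^k$ are already available (e.g. from the helper above) and that attributes are categorical; parameter names follow the text, the code itself is only illustrative:
  #+BEGIN_SRC python
  import numpy as np

  def aggregate_degree_vector(hists, delta=0.1):
      """d_u = sum_{k=1}^{K} delta^(k-1) * d_u^k, with hists = [d_u^1, ..., d_u^K]."""
      return sum(delta ** k * np.asarray(h, dtype=float) for k, h in enumerate(hists))

  def xnetmf_similarity(d_u, d_v, f_u=None, f_v=None, gamma_s=1.0, gamma_a=1.0):
      struct = np.sum((d_u - d_v) ** 2)                     # ||d_u - d_v||_2^2
      attr = 0.0 if f_u is None else np.sum(np.asarray(f_u) != np.asarray(f_v))
      return float(np.exp(-gamma_s * struct - gamma_a * attr))
  #+END_SRC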
**** Step 2
- Avoids random walks due to two reasons:
1) The variance they introduce in the representation learning often makes embeddings across different networks non-comparable
2) they can add to the computational expense. For example, node2vec’s total runtime is dominated by its sampling time.
- Use an implicit matrix factorisation-based approach that leverages a combined structural and attribute-based similarity matrix S, which is a result of the sim function from step 1, and considers similarities at different neighborhoods.
- We need to find $n \times p$ matrices $Y$ and $Z$ such that $S \approx YZ^T$ where $Y$ is the node embedding matrix and $Z$ is irrelevant. Thus, we need to find these node embeddings $Y$ WITHOUT actually computing $S$.
+ Finding $Y$ could naturally be done by computing $S$ via sim() and then factorising it (e.g. by minimising the Frobenius-norm reconstruction error). This is very expensive, though.
+ It could also be done by creating a sparse matrix that contains only the "most important" similarities for each node, choosing only a small number of comparisons, for instance by looking at similarity of node degree. This is fragile to noise, though.
- Instead, we approximate $S$ with a low-rank matrix $\tilde{S}$ which is never explicitly computed. We randomly select p << n "landmark" nodes across both graphs G_1 and G_2 and compute their similarities to all $n$ nodes in these graphs using the sim() function. This yields an $n \times p$ similarity matrix $C$ (note that we only compute similarities against the $p$ landmark nodes, hence the $n \times p$ shape). From $C$ we can extract the $p \times p$ "landmark-to-landmark" submatrix $W$. $C$ and $W$ can be used to approximate the full similarity matrix, which then allows us to obtain the node embeddings without ever computing or factorising the approximate similarity matrix $\tilde{S}$. To accomplish this, they extend the Nyström method such that the low-rank matrix is given as $\tilde{S} = CW^{\dag}C^T$, where $C$ is the landmark-to-all similarity matrix and $W^\dag$ is the (Moore-Penrose) pseudoinverse of $W$, the landmark-to-landmark similarity matrix. The landmark nodes are chosen randomly, as more elaborate selection schemes such as node centrality are much less efficient and offer little to no improvement. Since $\tilde{S}$ contains an estimate of all similarities within the graphs, it would still take $n^2$ space, but luckily we never have to materialise it.
- We can actually get the node embeddings $Y$ from a decomposition of the equation for \tilde{S}.
- Given graphs G_1(V_1, E_1) and G_2(V_2, E_2) with $n \times n$ joint combined structural and attribute-based similarity matrix $S \approx YZ^T$, its node embeddings $Y$ can then be approximated as: $\tilde{Y} \approx CU\Sigma^{1/2}$, where $C$ is the $n \times p$ landmark-to-all matrix and $W^\dag = U\Sigma V^T$ is the full rank singular value decomposition of the pseudoinverse of the small $p \times p$ landmark-to-landmark sim matrix W.
+ Given the full-rank SVD of the $p \times p$ matrix $W^\dag$ as $U\Sigma V^T$, we can then write $S \approx \tilde{S} = C(U\Sigma V^T) C^T = (CU\Sigma^{1/2})(\Sigma^{1/2}V^T C^T) = \tilde{Y} \tilde{Z}^T$.
+ So we can compute $\tilde{Y}$ based on the SVD (expensive in general, but cheap here since $W^\dag$ is only $p \times p$) and the matrix $C$. The p-dimensional node embeddings of the two graphs are then the corresponding row subsets of $\tilde{Y}$.
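+ A minimal sketch of this Nyström-style construction (my own code); `similarity(i, j)` stands in for the sim() function of step 1, with the nodes of both graphs indexed 0..n-1:
  #+BEGIN_SRC python
  import numpy as np

  def xnetmf_embeddings(n, p, similarity, seed=0):
      rng = np.random.default_rng(seed)
      landmarks = rng.choice(n, size=p, replace=False)        # p << n random landmarks
      C = np.array([[similarity(i, j) for j in landmarks] for i in range(n)])
      W = C[landmarks, :]                                     # p x p landmark-to-landmark block
      U, sigma, Vt = np.linalg.svd(np.linalg.pinv(W))         # SVD of the pseudoinverse of W
      Y = C @ U @ np.diag(np.sqrt(sigma))                     # Y ~= C U Sigma^(1/2), n x p
      return Y, landmarks
  #+END_SRC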
**** Step 3
- We have to efficiently align nodes, assuming $u \in V_1$, $v \in V_2$ may match if their xNetMF embeddings are similar. Let \tilde{Y}_1 and \tilde{Y}_2 denote the matrices of the p-dimensional embeddings of G_1 and G_2.
- We take the likeliness of (soft) alignment to be proportional to the similarity between the nodes’ embeddings. Thus, we greedily align nodes to their closest match in the other graph based on embedding similarity.
- A naive way of finding alignments for each node would be to compute similarities of all pairs between node embeddings (The rows of \tilde{Y}_1 and \tilde{Y}_2) and then choose the top-1 for each node. This is inefficient though.
- Instead, we store the embeddings of \tilde{Y}_2 in a k-d tree, which accelerates exact similarity search for nearest-neighbor algorithms. For each node in G_1 we then query this tree with its embedding to find the $a << n$ closest embeddings from nodes in G_2. This allows us to compute "soft" alignments where we return one or more nodes with the most similar embeddings. The similarity between the p-dimensional embeddings of $u$ and $v$ is defined as: $sim_{emb}(\tilde{Y}_1[u], \tilde{Y}_2[v]) = e^{-\left\lVert \tilde{Y}_1[u] - \tilde{Y}_2[v] \right\rVert_2^2}$, converting the euclidean distance to a similarity.
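- A sketch of this alignment step, assuming Y1 and Y2 are the embedding matrices of G_1 and G_2 and using scipy's cKDTree for the k-d tree:
  #+BEGIN_SRC python
  import numpy as np
  from scipy.spatial import cKDTree

  def soft_align(Y1, Y2, a=5):
      """For each node of G_1, indices and similarities of its a closest nodes in G_2."""
      tree = cKDTree(Y2)
      dists, idx = tree.query(Y1, k=a)       # a nearest neighbours per query node
      return idx, np.exp(-dists ** 2)        # sim_emb = exp(-||y_u - y_v||_2^2)
  #+END_SRC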
*** Complexity Analysis
- We assume both graphs have $n_1 = n_2 = n$ nodes.
1)*Extracting Node Identity:* Takes approximately $O(nKd_{avg}^2)$ time finding neighborhoods up to distance $K$, by joining the neighborhoods of neighbors at the previous hop. We can construct $R_u^k = \bigcup_{v \in R_u^{k-1}} R_v^1 - \bigcup_{i=1}^{k-1} R_u^i$ for node $u$. Could also be solved using breadth-first search in time $O(n^3)$.
2)*Computing Similarities*: Similarities are computed of the length-b features (weighted counts of node degrees in the k-hop neighborhoods split into b buckets) between each node and the p landmark nodes in time: $O(npb)$
3)*Obtaining Representations*: Constructing the pseudoinverse $W^\dag$ and computing the SVD of this $p \times p$ matrix takes time $O(p^3)$ and then multiplying it with $C$ in time $O(np^2)$. Since $p << n$, total time is $O(np^2)$.
4)*Aligning Embeddings*: Constructing k-d tree and using it to find the top alignments in G_2 for each of the n nodes in G_1 is average-case time complexity $O(nlog(n))$.
- Total time complexity is then: $O(n \cdot \max(pb, p^2, Kd_{avg}^2, \log n))$
- It suffices to pick small constant values for $K$ and $p$, and to pick $b$ logarithmically in $n$; $d_{avg}$, the average node degree, is oftentimes small in practice.
*** Experiments
- They test on networks by taking a real network dataset with adjacency matrix $A$ and generating a new network with adjacency matrix $A' = PAP^T$, where $P$ is a randomly generated permutation matrix. Structural noise is added to $A'$ by removing edges with probability $p_s$ without disconnecting any nodes.
- For experiments with attributes, they generate synthetic attributes for nodes, if the graph does not have any. Noise is added by flipping binary values or choosing values randomly with probability $p_a$.
- In accuracy under added noise, REGAL using xNetMF and REGAL using struc2vec (an alternative way of computing the embeddings) far outperform the other algorithms. struc2vec samples node contexts, which adds variance to its embeddings, which is likely why xNetMF wins at low noise. As noise grows, however, struc2vec wins in accuracy, but not in speed.
- When looking at attribute-based noise, REGAL mostly outperforms FINAL (which uses a proximity embedding but handles attributes) in both accuracy and runtime. FINAL achieves slightly higher accuracy with small noise due to its reliance on attributes, but incurs significant runtime increases as it uses the extra attribute information.
- The sensitivity to changes in parameters is shown to be quite significant: they conclude that the discount factor \delta should be between 0.01 and 0.1 and that the hop distance K should be less than 3. Setting the structural and attribute similarity weights to 1 does fairly well, and top-a accuracy (using a > 1, such as 5 or 10) is significantly better than top-1. A higher number of landmarks means higher accuracy; it should be $p = t \log_2 n$ for $t \approx 10$.
- So it's highly scalable, it's suitable for cross-network analysis, it leverages the power of structural identity, does not require any prior alignment information, it's robust to different settings and datasets and it's very fast and quite accurate.
- The low-rank approach requires memory linear in the size of the graphs, whereas most other methods (including the original EigenAlign) require quadratic memory.
- The key step to this insight is identifying low-rank structure in the node-similarity matrix used by EigenAlign for determining matches.
- With an exact, closed-form low-rank structure, we then solve a maximum weight bipartite matching problem on that low-rank matrix to produce the matching between the graphs.
- For this task, we show a new, a-posteriori, approximation bound for a simple algorithm to approximate a maximum weight bipartite matching problem on a low-rank matrix.
- There are two major approaches to network alignment problems: local network alignment, where the goal is to find local regions of the graph that are similar to any given node, and global network alignment, where the goal is to understand how two large graphs would align to each other.
- The EigenAlign method uses the dominant eigenvector of a matrix related to the product-graph between the two networks in order to estimate the similarity. The eigenvector information is rounded into a matching between the vertices of the graphs by solving a maximum-weight bipartite matching problem on a dense bipartite graph
- a key innovation of EigenAlign is that it explicitly models nodes that may not have a match in the network. In this way, it is able to provably align many simple graph models such as Erdős-Rényi when the graphs do not have too much noise.
+ Even though it still suffers from the quadratic memory requirement.
*** Network Alignment formulations
**** The Canonical Network Alignment problem
- In some cases we additionally receive information about which nodes in one network can be paired with nodes in the other. This additional information is presented in the form of a bipartite graph whose edge weights are stored in a matrix L; if L_uv > 0, this indicates outside evidence that node u in G_A should be matched to node v in G_B.
**** Objective Functions for Network Alignment
- Describes the problem as seeking a matrix P which has a 1 in entry (u,v) if u is matched with (only) v in the other graph.
- We then seek a matrix P which maximises the number of overlapping edges between G_A and G_B, so the number of adjacent node pairs should be mapped to adjacent node pairs in the other graph. We get an integer quadratic program.
- There is no downside to matches that do not produce an overlap, i.e. edges in G_A which are mapped to non-edges (node pairs with no edge between them) in G_B, or vice versa.
- They define $AlignmentScore(P) = s_O \cdot (\#\text{overlaps}) + s_N \cdot (\#\text{non-informative}) + s_C \cdot (\#\text{conflicts})$, where the different $s$ are weights such that $s_O > s_N > s_C$. This score defines a massive matrix $M$, which is used to define a quadratic assignment problem equivalent to maximising the AlignmentScore.
- One can however solve an eigenvector equation instead of the quadratic program, which is what EigenAlign does.
1) Find the eigenvector $x$ of M that corresponds to the eigenvalue of largest magnitude. M is of dimension $n_A n_B \times n_A n_B$, where $n_A$ and $n_B$ are the number of nodes in G_A and G_B, so the eigenvector is of dimension $n_A n_B$ and can thus be reshaped into a matrix X of size $n_A \times n_B$ where each entry represents a score for every pair of nodes between the two graphs. This is the similarity matrix, as it reflects the topological similarity between vertices of G_A and G_B.
2) Run bipartite matching on the similarity matrix X, that maximises the total weight of the final alignment.
- The authors show that the similarity matrix X can be represented through an exact low-rank factorisation. This allows them to avoid quadratic storage requirement of EigenAlign. They also present new fast techniques for bipartite matching problems on low-rank matrices. Together, this yields a far more scalable algorithm.
*** Low Rank Factors of EigenAlign
- Use power iteration (an iterative algorithm that computes the dominant eigenvector and eigenvalue of a diagonalisable matrix) on M to find the dominant eigenvector, which can then be reshaped into the sim matrix X. This can also be posed as an optimisation problem.
- If matrix X is estimated with the power-method starting from a rank 1 matrix, then the kth iteration of the power method results in a rank k+1 matrix that can be explicitly and exactly computed.
- We wish to show that the matrix X can be factorised via a two-factor decomposition: $X_k = UV^T$ for an $X_k$ of rank k.
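- An illustrative sketch (not the paper's exact update) of why the iterates stay low-rank: for a pure Kronecker matrix $M = B \otimes A$ the power-iteration step $M\,vec(X)$ equals $vec(AXB^T)$, so one can iterate on the reshaped $n_A \times n_B$ matrix directly, and a rank-1 start stays rank 1; the extra additive terms in the real M raise the rank by at most one per iteration:
  #+BEGIN_SRC python
  import numpy as np

  def kron_power_iteration(A, B, iters=20, seed=0):
      rng = np.random.default_rng(seed)
      X = np.outer(rng.random(A.shape[0]), rng.random(B.shape[0]))   # rank-1 start
      for _ in range(iters):
          X = A @ X @ B.T                  # reshape of (B kron A) @ vec(X)
          X /= np.linalg.norm(X)           # normalisation step of power iteration
      return X                             # approximates the reshaped dominant eigenvector
  #+END_SRC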
** Cross-Network Embedding for Multi-Network Alignment
*** Intro
- Recently, data mining through analyzing the complex structure and diverse relationships on multi-network has attracted much attention in both academia and industry. One crucial prerequisite for this kind of multi-network mining is to map the nodes across different networks, i.e., so-called network alignment.
- CrossMNA is for multi-network alignment through investigating structural information only.
- Uses two types of node embedding vectors:
1) Inter-vector for network alignment
2) Intra-vector for other downstream network analysis tasks
- A crucial prerequisite for mining multi-networks is to map the nodes/participants among these related networks, network alignment.
- The shared participants among the networks are defined as anchor nodes, so they act like anchors aligning the networks they participate in, and the relationship among anchor nodes across networks are called anchor links.
+ In many cases, a few anchor links can be known beforehand (e.g. when people link their Twitter account on their Facebook profile)
+ Network alignment seeks to infer these unknown or potential anchor links.
- Most previous work assumes topology consistency such that a node tends to have a consistent connectivity structure across networks
- Attribute based methods are not applicable in many realistic scenarios, as the attribute information may be unreliable or incomplete. (such as usernames, gender or other profile information)
+ REGAL supports using attribute based information
- CrossMNA uses an additional vector named the "network vector", which is proposed to extract the semantic meaning of the network, which can reflect the difference of global structure among the networks.
- Also uses two kinds of embedding:
1) The inter-vector, which reflects the common features of the anchor nodes in different networks and is shared among the known anchor nodes
2) The intra-vector, which preserves the specific structural feature for a node in its selected network and is generated through the combination of network vector and inter-vector.
- REGAL proposes an embedding-based method based on the assumption that nodes with similar structural connectivity or degrees have a high probability to be aligned (The whole point of that degree vector)
+ Although one node may share some similar features in related networks, its local structural connections can be entirely different in each network due to the distinctiveness in network semantic meanings (You are likely to connect to very different people on facebook vs linkedin)
- There are the following challenges for network alignment algos:
1) Semantic diversity: Diversities in network semantics lead to different interactional behaviours of the same node in each network, which adds inaccuracies to the alignment. This problem is further worsened when considering more than two networks.
2) Data imbalance: Has two aspects; first, the size of each network may vary (which is OK within REGAL), second, the number of anchor links between each pair of networks can be unequal. Any pair-wise learning method will suffer a lot from this, as they only consider pairs of networks. So how to make full use of the information across ALL the networks to deal with the data imbalance problem is also a challenge.
3) Model storage: Network embedding is a practical approach to extract structural information of nodes and has been applied in some network alignment methods (such as REGAL). However, in large-scale multi-networks it is essential to take into account the space overhead of the methods (which is why REGAL never computes the similarity matrix, but arrives at the embedding in a clever way). REGAL still has to compute the embedding vector for most nodes though, which still takes up a lot of space; the landmark nodes alleviate this slightly.
- There is a network vector for each network. This reflects the difference of global structure among the networks. Thus, if the global structure of two networks is similar, their network vectors will be close in vector space.
- Each node has an inter-vector and an intra-vector. The inter-vector depicts the commonness of anchor nodes in different networks and is shared among the anchor nodes. The intra-vector reflects the specific structural feature of this node in its selected network, but is generated through a combination of the network vector and inter-vector
*** CrossMNA
- We suppose networks are unweighted and all the edges are directed, as an undirected edge can be divided into two directed edges.
- A set of networks is defined: $G = ((G^1, G^2, \dots, G^n), (A^{(1,2)}, A^{(1,3)}, \dots, A^{(n-1,n)}))$, where each $G^i$ represents a network and $A^{(i,j)}$ represents the anchor nodes between $G^i$ and $G^j$. Each network is defined from its nodes and edges.
- An anchor link between G^i and G^j is defined as (v_k^i, v_k^j) \in A^{(i,j)}. Anchor links follow the transitivity law
**** The Cross-Network Embedding
- The inter-vector u preserves the common features among the anchor nodes. Through training, the inter-vector of an unknown anchor node should get close to its counterparts in vector space. This inter-vector is difficult to learn directly, as there is no direct correlation between the unknown anchor nodes. They expect some anchor node in one network to both show similar structural features with its counterparts, but also distinctive connection relationships, due to the semantic meaning of a network.
- The intra-vector is straightforward, as it is easy to extract structural features of nodes into network embedding, named the intra-vector v. This contains both the commonness among counterparts and the specific local connections in its selected network due to the semantics, so it can NOT be applied to node matching, unless the impact of the network semantics can be removed.
- Thus, the authors present an equation to build a correlation among intra-vector, inter-vector and network semantics: $v_i^k = u_i + r^k$, where $r^k$ is the network vector which extracts the unique characteristics of $G^k$. Thus, we can learn the inter-vectors of the anchor nodes indirectly by training the combination-based intra-vectors, and the network vector can be used to remove the network-specific "noise" from the intra-vector (e.g. $u_4 = v_4^3 - r^3$).
- The inter-vector and intra-vector can live in vector spaces of different dimensions; this is handled by a transformation matrix W used to align them. So, $v_i^k = W u_i + r^k$.
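- A toy sketch of this combination equation (dimensions and initialisation are placeholders): the intra-vector is $W u_i + r^k$, and an estimate of the inter-vector used for matching can be recovered from a trained intra-vector and network vector:
  #+BEGIN_SRC python
  import numpy as np

  rng = np.random.default_rng(0)
  d1, d2 = 200, 30                           # inter- and intra-vector dimensions
  W = rng.standard_normal((d2, d1)) * 0.01   # transformation matrix
  u_i = rng.standard_normal(d1)              # inter-vector, shared across networks
  r_k = rng.standard_normal(d2)              # network vector of network G^k
  v_ik = W @ u_i + r_k                       # intra-vector of node i in network G^k

  # recover a (least-squares) estimate of the inter-vector for matching:
  u_est, *_ = np.linalg.lstsq(W, v_ik - r_k, rcond=None)
  #+END_SRC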
- Like REGAL, they use a k-d tree to find the top-a most likely nodes in other networks. We need to compare the inter-vectors of nodes, to find the alignments.
*** Experiments
- CrossMNA outperforms REGAL in their tests, as REGAL assumes topology consistency (at least somewhat), so it performs poorly when they use datasets not having this.
- The dimension of the inter-vector and intra-vector (d1 and d2) should be set d1 = 200 or 300 and d2 = 30 or 50 ish to save memory in practice. When d1 grows, so does the performance
- With no prior known anchor links, CrossMNA uses as much space as REGAL
- CrossMNA is dramatically space efficient for large-scale multi-network applications though.
- The time complexity of learning embeddings is approximately $O(tN(d_1 d_2 |V| + d_2 |E|))$ where $t$ is the number of iterations, $N$ denotes the number of networks, and $|V|$ and $|E|$ the number of nodes and edges in each network.
- Time complexity of finding soft alignments between each two networks is $O(|V| \log |V|)$.
* Graph Similarities
** Deep Graph Kernels
- In domains such as social networks, bioinformatics and robotics we are often interested in computing similarities between structured objects. Graphs offer a natural way to represent structured data.
- Consider the problem of identifying a subreddit on reddit. To tackle this problem, one can represent an online discussion thread as a graph where nodes represent users and edges represent whether two users interact. The task is then to predict which sub-community a certain discussion belongs to, based on its communication graph.
- One of the increasingly popular approaches to measure similarity between structured objects is that of kernel methods.
- A kernel method measures the similarity between two objects with a kernel function, an inner product in a reproducing kernel Hilbert space (RKHS). The challenge for using kernel functions is to pick a suitable kernel that captures the semantics of the structure while being computationally tractable.
+ Roughly speaking, this means that if two functions f and g in the RKHS are close in norm, i.e., ||f − g|| is small, then f and g are also pointwise close, i.e., |f(x) − g(x)| is small for all x.
- R-convolution is a general framework for handling discrete objects where the key idea is to recursively decompose structured objects into "atomic" (likely non-decomposable) sub-structures and define valid local kernels between them. Given a graph G, let \phi(G) denote a vector which contains counts of atomic sub-structures and (\cdot, \cdot)_H denote a dot product in RKHS H, then the kernel between two graphs G and G' is given by: K(G,G') = (\phi(G), \phi(G'))_H.
- This representation does however not take a number of important observations into account.
1) Sub-structures that are used to compute the kernel matrix are not independent. A popular class of substructures used for decomposing graphs, graphlets, is defined as induced, non-isomorphic sub-graphs of size k. Graphlets exhibit strong dependence relationships: size k+1 graphlets can be derived from size k graphlets by addition of nodes or edges.
2) By increasing the size k, the number of unique graphlets increases exponentially, so when the number of features grows there is a sparsity problem: only a few substructures will be common across graphs (each graph contains only a tiny fraction of the exponentially many possible substructures, so overlap between graphs becomes rare). This leads to the diagonal dominance problem: a given graph is similar to itself but not to any other graph.
- We would like a kernel matrix where all entries belonging to a class are similar to each other and dissimilar to everything else. Thus, consider an alternative kernel between two graphs G and G': K(G,G') = \phi(G)^T * M * \phi(G') where M represents a |V| \times |V| positive semi-definite matrix that encodes the relationships between sub-structures and V represents the "vocabulary" of sub-structures obtained from the training data.
+ This allows one to design M in a clever way that respects the similarity within the given sub-structure space. This could be the edit distance in spaces where there is a strong mathematical relationship between sub-structures, such that one could design an M that respects the geometry of the space.
+ This geometry assumption can also be fulfilled by "learning" the geometry of the space through data.
- This paper proposes recipes for designing such M matrices for graph kernels.
- They propose two recipes:
1) They exploit an edit-distance relationship between sub-structures and directly compute M
2) They propose a framework that computes an M matrix by learning latent representations of substructures
- Their contributions are:
1) They propose a general framework that learns hidden representations of sub-structures used in graph kernels
2) They demonstrate their framework on three popular graph kernels: Graphlet kernel, Weisfeiler-Lehman subtree kernels, Shortest-Path kernels
3) They apply their framework to derive deep variants of string kernels which are a result of the R-convolution kernels
*** Graph Kernels
- The three families of graph kernels are: those based on limited-size subgraphs, those based on subtree patterns, and those based on walks and paths.
- Let GG be a set of n graphs G_1, .., G_n. Let Y represent a set of labels associated with each graph in GG, where Y = y_{G_1}, ..., y_{G_n}
- Given some G = (V, E) and H = (V_H, E_H), H is a subgraph of G iff there exists an injective mapping a : V_H -> V such that (v,w) \in E_H iff (a(v), a(w)) \in E.
- A graph G is labeled if there is a function l : V -> \Sigma that assigns labels from some alphabet \Sigma to vertices in G. Likewise, a graph is unlabeled if there is nothing to distinguish between nodes, apart from their interconnections.
- K(G,G') is a kernel function which measures similarity between G and G'.
- The graph classification problem is to map graphs into two or more categories. Given a set of graphs GG and labels Y, we should learn to map graphs to labels within Y.
**** Graph Kernels based on subgraphs
- A graphlet Gr is an induced and non-isomorphic sub-graph of size k. Let V_k = (Gr_1, Gr_2, ..., Gr_{n_k}) be the set of size k graphlets where n_k denotes the number of unique graphlets of size k. Given two unlabeled graphs G and G', the graphlet kernel is defined: K_{GK}(G, G') = (f^G, f^{G'}) where f^G and f^{G'} are vectors of normalised counts, that is, the i'th component of f^G denotes the frequency of graphlet Gr_i occurring as a subgraph of G and (\cdot, \cdot) is the euclidean dot product.
**** Graph kernels based on subtree patterns
- These decompose the graph into its subtree patterns, the Weisfeiler-Lehman subtree kernels is in this family.
- Requires a labeled graph in which we can iterate over each vertex and its neighbors in order to create a multiset label.
- The multiset at every iteration consists of the label of the vertex and the sorted labels of its neighbors, this multiset is then given a new label, which can be used for the next iteration.
- To compare graphs, we then simply count the co-occurrences of each label.
- Given G and G', the Weisfeiler-Lehman subtree kernel is then: K_{WL}(G,G') = (1^G, 1^{G'}), where (.,.) denotes the euclidean dot product. If we assume h iterations of relabeling, then 1^G consists of h blocks s.t. the i'th component in the j'th block of 1^G contains the frequency of which the i'th label was assigned to a node in the j'th iteration.
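- A small sketch of the relabelling iterations described above, assuming a labelled networkx graph; Python's hash stands in for the compressed-label dictionary that the actual kernel shares across graphs:
  #+BEGIN_SRC python
  from collections import Counter
  import networkx as nx

  def wl_step(G: nx.Graph, labels: dict) -> dict:
      """New label = hash of (own label, sorted multiset of neighbour labels)."""
      return {v: hash((labels[v], tuple(sorted(labels[w] for w in G[v])))) for v in G}

  def wl_histograms(G: nx.Graph, init_labels: dict, h: int = 2):
      labels, hists = dict(init_labels), []
      for _ in range(h):
          labels = wl_step(G, labels)
          hists.append(Counter(labels.values()))
      return hists   # K_WL(G, G') sums, over the h blocks, the dot products of these histograms
  #+END_SRC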
**** Graphs Kernels based on random-walks
- Decomposes graphs into random walks or paths and then counts the co-occurence of random walks or paths in two graphs.
- Let P_G represent the set of all shortest paths in G and p_i \in P_G denote a triplet (l_s^i, l_e^i, n_k) where n_k is the length of the path and l_s^i and l_e^i are the labels of the starting and ending vertices. The shortest-path kernel for LABELED graphs G and G' is then: K_{SP}(G,G') = (P^G, P^{G'}) where the i'th component of P^G contains the frequency of the i'th triplet occurring in graph G.
- Does this still use the euclidean dot product?
**** General
- All graphs kernels mentioned are instances of the R-convolution framework.
- The recipe for definining graph kernels is as follows:
1) Recursively decompose a graph into its subgraphs (Graphlet kernel decomposes into sub-graphs, Weisfeiler-Lehman decomposes into subtrees and shortest-path decomposes into shortest-paths (lol))
2) The decomposed sub-structures are then represented as a vector of frequencies where each item in the vector represents how many times a given sub-structure occurs in the graph
3) The euclidean space or some other domain specific RKHS is used to define the dot product between the vectors of frequencies
*** Methodology
**** Sub-structure similarity via edit distance
- How to compute an M matrix by using the edit-distance relationship between sub-structures
- When substructures exhibit a clear mathematical relationship, one can exploit the underlying similarities between substructures to compute a matrix M.
- For graphlet kernels, one can use the edit-distance relationship to encode how similar graphlets are.
- Given a graphlet Gr_i of size k and a graphlet Gr_j of size k+1, we can build an undirected edit-distance graph (UED-Graph) by adding an undirected edge from G_i to G_j iff G_i can be obtained from G_j by deleting a node from G_j, or vice versa. Given such a UED-Graph, one can compute the shortest path between G_i and G_j in order to compute their edit distance. Now, we can simply compute the matrix M directly. However, the cost of computing the shortest-path distances on the UED-Graph becomes very expensive as a function of k. Thus, instead of creating the matrix M of size |V| x |V|, one can create a much smaller one of size |V'| x |V'| for V' << V, by only taking the observed sub-structures into account.
**** Sub-Structure Similarity via Learning
- The second approach is to LEARN the latent representation of sub-structures by using language modeling and deep learning techniques. These learned representations are then utilised to compute an M that respects the similarities between sub-structures.
- *Neural Language Models*: Traditional language models estimate the likelihood of a sequence of words appearing in a corpus. Given some sequence of training words (w_1, w_2, ..., w_T), n-gram based language models aim to maximise the probability $Pr(w_t\ |\ w_1, ..., w_{t-1})$, i.e. they estimate the likelihood of seeing some word given all the preceding words.
- Recent work in language modeling focus on distributed vector representations of words, word embeddings. Neural language models improve classic n-gram language models, by using continuous vector representations for words.
- Note: Word embeddings are words mapped into a d-dimensional embedding space such that similar words are mapped to similar positions in that space.
- Unlike traditional n-gram models, these neural language models take advantage of the notion of context
+ A context is defined as a fixed number of preceding words
- The objective of word embedding models is to maximise $\sum_{t=1}^T \log Pr(w_t\ |\ w_{t-n+1}, \dots, w_{t-1})$ where $w_{t-n+1}, \dots, w_{t-1}$ is the context of $w_t$.
- *Continuous Bag-of-words*: Used to approximate the above objective. Predicts the current word given the surrounding words within a given window. Similar to feed-forward neural network language models where the non-linear hidden layer is removed and the projection layer is shared for all words.
- Tries to maximise the objective: $\sum_{t=1}^T \log Pr(w_t\ |\ w_{t-c}, \dots, w_{t+c})$ where c is the length of the mentioned context. This objective is computed using softmax.
- *Skip-gram model*: Maximises the co-occurrence probability among the words that appear within a given window. So instead of predicting the current word based on surrounding words, we predict the surrounding words given the current word. The objective of skip-gram is: $\sum_{t=1}^T \log Pr(w_{t-c}, \dots, w_{t+c}\ |\ w_t)$ where the probability is computed as $\prod_{-c \leq j \leq c, j \neq 0} Pr(w_{t+j}|w_t)$. This probability is again computed with a softmax-like function.
- Hierarchical softmax and negative sampling are used in training the skip-gram and CBOW models.
- Hierarchical softmax uses a binary Huffman tree
- Negative sampling selects the contexts at random instead of considering all words in the vocabulary. If a word w appears in the context of another word w', then the vector representation of the word w is closer to the vector representation of w'.
- Once training converges, similar words are mapped to similar positions in the vector space. The learned word vectors are empirically shown to preserve semantics. Word vectors can be used to answer analogy questions using simple vector algebra, where the result of a vector calculation v("Madrid") - v("Spain") + v("France") is closer to v("Paris") than to any other word vector.
+ So we view sub-structures in graph kernels as words that are generated from a special language. So different sub-structures compose graphs in a similar way that words compose a sentence when used together.
**** Deep Graph Kernels
- The framework takes list of graphs GG and decomposes each into substructures.
- List of substructures for each graph is treated as a sentence which is generated from some vocabulary V, where V is the unique set of observed sub-structures in the training data (that whole V' << V thing)
- We need to generate corpus where the co-occurence relationship is meaningful
- *Corpus generation for graphlet kernels*: Exhausting all graphlets is very expensive. Instead one can perform random sampling: random sampling of graphlets of size k for a graph G involves placing a randomly generated window of size k x k on the adjacency matrix of G and collecting the observed graphlet within this window. This is done n times if we want n graphlets. As random sampling preserves no notion of co-occurrence, the scheme is slightly altered by using the notion of neighborhoods: whenever we sample a graphlet, we also sample its immediate neighbors. The graphlet and its neighbors are then interpreted as co-occurred. Thus, graphlets with similar neighborhoods will acquire similar representations.
- *Corpus Gen for shortest path*: For every shortest path, take every sub-path as a co-occurred shortest path
- *Corpus gen for weisfeiler-lehman*: Not clear; presumably all multiset labels within any given iteration h are treated as co-occurring.
**** Algorithm
1) Choose a graph kernel
2) Construct similarity matrix M:
- Build substructure vocabulary V
- Construct the co-occurences
- Apply CBOW or Skip-gram to get the embeddings
- Calculate sim matrix M
3) Decompose graph into substructures
4) Build histogram vector (the frequencies of the substructures) \phi(G)
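- A rough sketch of steps 2)–4) above for a generic decomposition, assuming each graph has already been decomposed into a list of sub-structure "words", that `corpus` is a list of co-occurrence "sentences", and that gensim is available; taking $M_{ij}$ as the inner product of the learned embeddings is one simple choice consistent with the framework, not necessarily the paper's exact definition:
  #+BEGIN_SRC python
  import numpy as np
  from collections import Counter
  from gensim.models import Word2Vec

  def deep_kernel_matrix(corpus, graphs_as_words, dim=32):
      model = Word2Vec(sentences=corpus, vector_size=dim, sg=1, min_count=1)
      vocab = list(model.wv.index_to_key)                 # observed vocabulary V'
      E = np.array([model.wv[w] for w in vocab])          # |V'| x dim embeddings
      M = E @ E.T                                         # sub-structure similarity matrix
      def phi(words):                                     # histogram vector of one graph
          counts = Counter(words)
          return np.array([counts[w] for w in vocab], dtype=float)
      Phi = np.array([phi(ws) for ws in graphs_as_words]) # one row per graph
      return Phi @ M @ Phi.T                              # K(G, G') = phi(G)^T M phi(G')
  #+END_SRC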
- Under noise, the edit-distance graphlet kernel (EGK) beats the base kernel on all the datasets except for one, likely because EGK only uses a fixed mathematical relationship between sub-structures rather than learning a more sophisticated one. The deep graphlet kernel (DGK), which learns the sub-structure relationships, outperformed all base kernels significantly, except on one dataset (a different one from the dataset on which the base kernel beat EGK).
- In regards to accuracy, DGK slightly outperforms EGK on all datasets.
- Running time is measured in seconds. What.
** Matching Node Embeddings for Graph Similarity
- Most graph kernels focus on local properties of graphs and ignore global structure (Not really the case with DGK)
- At the heart of graph kernels lies a positive semidefinite kernel function k. Once such a function k : X x X -> R is defined for a set X, it is known that there exists a map $\phi : X -> H$ into a Hilbert space H s.t. $k(x,x') = (\phi(x), \phi(x'))$ for all $x, x' \in X$ where (.,.) is the inner product in H.
- Most existing graph kernels compare specific substructures of graphs (this is what DGK does! Or at least the kernels they use)
+ So these algos focus on local properties of graphs and ignore global structure
- The goal of this paper is to fix the problems related to kernels focusing on local substructures. This is accomplished by using algos that utilise features describing global properties of graphs.
- They present two algos designed to compare pairs (pairs!) of graphs based on their global properties. They are applicable to both labeled and unlabeled graphs.
1) Each graph is represented as a collection of the embeddings of its vertices.
2) The vertices of the graphs are embedded in the euclidean space using the eigenvectors of the corresponding adjacency matrices
3) The similarity between pairs of graphs is measured by computing a matching between their sets of embeddings
- Two algos are employed.
1) One casts the problem as an instance of the Earth Mover's Distance and uses it, for a set of graphs, to build a similarity matrix. This sim matrix is however not always positive semidefinite, so they use an SVM classification algorithm for indefinite kernels which treats the indefinite similarity matrix as a noisy observation of the true positive semidefinite kernel.
2) Corresponds to a technique adapted from the pyramid match kernel and yields a positive semidefinite matrix. This method is called the Pyramid Match graph Kernel.
*** Prelims
- Graphs are defined as usual: $G = (V,E)$.
- Given a set of labels L, a labeling function $l : V -> L$ assigns a label to each vertex.
- Given a graph G, its vertices can be represented as points in a vector space using a node embedding algorithm. In this paper, embeddings are generated for vertices of a graph using the eigenvectors of its adjacency matrix A.
+ Given the eigenvalue decomposition of A, $A = U \Lambda U^T$, the i'th row $u_i$ of $U$ corresponds to the embedding of vertex $v_i \in V$.
+ These capture global properties of graphs and offer a powerful and flexible mechanism for performing machine learning tasks on them.
+ A is real and symmetric -> its eigenvalues $\lambda_1, ..., \lambda_n$ are real.
+ The graphs contain no self-loops, so the diagonal of A is all zeros and Tr(A) = 0; since the trace equals the sum of the eigenvalues, the eigenvalues sum to zero.
- The eigenvectors with the largest eigenvalues share some interesting properties.
- The eigenvector with the largest eigenvalue is special, as its i'th component gives the eigenvector centrality score of vertex v_i in the graph.
+ Eigenvector centrality is a measure of global connectivity of a graph which is captured by the spectrum of the adjacency matrix.
- We can also use the eigenvectors corresponding to the next-largest eigenvalues. Note that we work with the magnitudes, so the signs of the eigenvalues are irrelevant.
- A graph G can be represented as a bag-of-vectors. E.g. graph G can be represented as the set: $(u_1, u_2, ..., u_n)$, where each vector of the set corresponds to the representation of each vertex $u_i \in V$.
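- A sketch of this embedding step: each vertex is represented by its row in the matrix of adjacency eigenvectors, keeping the d eigenvectors of largest-magnitude eigenvalue (the code and the choice of d are illustrative):
  #+BEGIN_SRC python
  import numpy as np

  def bag_of_vectors(A, d=2):
      """Rows of the returned matrix are the vertex embeddings of the graph with adjacency A."""
      eigvals, eigvecs = np.linalg.eigh(A)            # A is symmetric, so the spectrum is real
      keep = np.argsort(-np.abs(eigvals))[:d]         # largest |eigenvalue| first
      return eigvecs[:, keep]
  #+END_SRC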
*** Earth Mover's Distance
- The similarity of two graphs G_1 and G_2 is formulated as the minimum travel cost between them, obtained from a linear program that minimises the total distance between matched vertices, measured as the euclidean distance in the embedding space.
- Note these guys work on labeled data and we want the distance between pairs of vertices with different labels to be large, thus it is just set to the largest possible value (which is apparently $\sqrt{d}$.)
- Let D be the matrix of all pairwise distances between the graphs in the dataset. If D is a Euclidean Distance Matrix (EDM), then we could use D to define a positive semi-definite kernel matrix K: $K = -\frac{1}{2} JDJ$, where $J = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the centering matrix.
- However, in our setting K is not positive semidefinite, as D is not Euclidean; the SVM technique for indefinite kernels mentioned above is used to handle this.
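- A small numerical sketch of the centering construction above, with a toy matrix D of squared pairwise distances (not one produced by the EMD computation):
  #+BEGIN_SRC python
  import numpy as np

  def centered_kernel(D):
      n = D.shape[0]
      J = np.eye(n) - np.ones((n, n)) / n     # centering matrix J
      return -0.5 * J @ D @ J                 # K = -1/2 J D J

  D = np.array([[0.0, 1.0, 4.0],              # squared distances of points 0, 1, 2 on a line
                [1.0, 0.0, 1.0],
                [4.0, 1.0, 0.0]])
  # eigenvalues are non-negative (up to numerical error), since this D is Euclidean
  print(np.linalg.eigvalsh(centered_kernel(D)))
  #+END_SRC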
- The basic idea is to map the bag-of-vector representations of graphs to multi-resolution histograms and then compare these histograms with a weighted histogram intersection measure in order to find an approximate correspondence between the two sets of vectors.
- The algorithm works by partitioning the feature space into regions of increasingly larger size and taking a weighted sum of the matches that occur at each level.
+ Two points are said to match if they fall into the same region.
+ The number of matches at level l is computed with the histogram intersection $I(H_1^l, H_2^l) = \sum_i \min(H_1^l(i), H_2^l(i))$; matches already found at finer levels are not counted again, and matches found only at coarser levels receive smaller weights.
- Apparently the pyramid match kernel is a Mercer kernel, so by computing it for all pairs of graphs, a PSD kernel matrix can be built.
- Complexity is $O(dnL)$ for n nodes.
- Works for unlabeled graphs
- Can be modified to work for labeled graphs.
+ Only vertices that share label should be able to be matched
+ Instead of representing each graph as a set of vectors, they can be represented as a set of sets of vectors, where each internal set corresponds to a specific label and contains embeddings of the vertices with that label.
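- A rough sketch of the unlabeled pyramid match procedure described above, assuming the vertex embeddings of each graph have been scaled into $[0,1]^d$; matches found only at coarser levels are down-weighted by $1/2^{L-l}$:
  #+BEGIN_SRC python
  import numpy as np

  def level_histogram(X, level):
      cells = 2 ** level                                    # cells per dimension at this level
      idx = np.clip((X * cells).astype(int), 0, cells - 1)  # cell index of every point
      keys, counts = np.unique(idx, axis=0, return_counts=True)
      return {tuple(k): c for k, c in zip(keys, counts)}

  def pyramid_match(X1, X2, L=3):
      inters = []
      for level in range(L + 1):                            # level 0 = coarsest, L = finest
          h1, h2 = level_histogram(X1, level), level_histogram(X2, level)
          inters.append(sum(min(c, h2.get(k, 0)) for k, c in h1.items()))
      k = inters[L]                                         # matches at the finest level
      for level in range(L - 1, -1, -1):
          k += (inters[level] - inters[level + 1]) / 2 ** (L - level)
      return k
  #+END_SRC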
*** Experiments
- They test on all the graph kernels used in DGK (Weisfeiler-Lehman, Graphlet, Shortest path and random walk)
- The EMD and Pyramid Match (PM) methods work well for unlabeled graphs
- The PM is in general very good
- They also did very well on labeled graphs.
- The EMD and PM are slow though
** NetLSD: Hearing the Shape of a Graph
- Ideally graph comparisons should be invariant to the order of nodes and the sizes of compared graphs, adaptive to the scale of graph patterns and scalable
- They present the Network Laplacian Spectral Descriptor (NetLSD)
- It is a permutation-invariant, size-invariant, scale-adaptive and efficiently computable graph representation method that allows for straightforward comparisons of large graphs.
+ Permutation-invariance refers to the node order and formally states that if two graphs are isomorphic, they should have distance 0
+ Scale-adaptive means it can handle comparisons both at a local level and at a global level, and that the representation should contain both local and global features.
+ Size-invariance means that if two graphs essentially show the same thing but at different sizes, their distance from each other should be 0.
- NetLSD extracts a compact signature that inherits the formal properties of the Laplacian spectrum, specifically its heat or wave kernel.
- Grounded in spectral graph theory, NetLSD allows for constant time similarity computations at several scales
- So essentially, NetLSD turns a graph of nodes and edges into a curve that can be plotted with the time scale on the x-axis and the trace value on the y-axis
*** Related work
**** Direct methods
- Direct approaches like graph edit distance are computationally heavy
**** Kernel methods
- No graph kernel achieves both scale-adaptive and size-invariant graph comparisons.
- Kernels are expensive to compute
**** Statistical representations
- Quadratic time complexity
**** Spectral representations
- Spectral graph theory is effective in the comparison of 3D objects
- Apparently it's clever
*** Problem statement
- G = (V,E)
- A representation is a function $\sigma : G -> R^N$ from any graph G within a collection of graphs to an infinite-dimensional real vector. Element j of the representation is denoted $\sigma_j(G)$.
- A representation-based distance is a function $d^\sigma : R^N \times R^N -> R_0^+$ on the representations of two graphs $G_1, G_2 \in G$ that returns a non-negative real number.
- The distance should be pseudometric, so it should be symmetric and support the triangle inequality.
*** NetLSD
- A useful metaphor is that of heating the graph's nodes and observing the heat diffusion as time passes. Another is that of a system of masses corresponding to the graph's nodes and springs corresponding to its edges. The propagation of mechanical waves through the graph is another way to capture its structural invariants. In both cases, the overall process describes the graph in a permutation-invariant manner, and embodies more global information as time elapses. Their representation employs a trace signature encoding such a heat diffusion or wave propagation process over time.
- Two graphs are compared via the L_2 distance among trace signatures sampled at selected time scales.
**** Spectra as representations
- The spectrum of a graph is defined as the eigenvalues of its Laplacian matrix (which is derived from the adjacency matrix).
- The laplacian spectrum encodes important graph properties such as the normalised cut size used in spectral clustering. Likewise, the normalised laplacian spectrum can determine whether a graph is bipartite, but not the number of its edges.
- Thus, rather than consider the laplacian spectrum per se, they consider an associated heat diffusion process on the graph to obtain a more expressive representation in a manner reminiscent of random walk models.
- So the main idea is that we consider the heat equation based on the graph Laplacian: $\frac{\partial u_t}{\partial t} = - L u_t$, where $u_t$ is a vector of scalar values on the vertices representing the heat of each vertex at time t. The solution provides the heat at each vertex at time t when the initial heat $u_0$ is initialised with a fixed value on one of the vertices.
+ Its closed-form solution is given by the n x n heat kernel matrix $H_t = e^{-tL}$, which can be computed directly by exponentiating the Laplacian eigenspectrum
- However, as the heat kernel involves pairs of nodes, it is not directly usable to compare graphs, so they instead consider the heat trace at time t: $h_t = tr(H_t) = \sum_j e^{-t\lambda_j}$.
- The NetLSD representation consists then of a heat trace signature of graph G, i.e. a collection of heat traces at different time scales, $h(G) = \{h_t\}_{t>0}$.
- An alternative is the wave equation, which is treated analogously: $\frac{\partial^2 u_t}{\partial t^2} = - L u_t$, and the wave trace signature is then $w_t = tr(W_t) = \sum_j e^{-it \lambda_j}$.
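- A compact sketch of the heat trace signature above and the resulting graph distance, using the normalised Laplacian from networkx and a logarithmic grid of time scales (the grid is an assumption):
  #+BEGIN_SRC python
  import numpy as np
  import networkx as nx

  def heat_trace_signature(G, times=np.logspace(-2, 2, 250)):
      L = nx.normalized_laplacian_matrix(G).toarray()
      eigvals = np.linalg.eigvalsh(L)                                 # Laplacian spectrum
      return np.array([np.exp(-t * eigvals).sum() for t in times])   # h_t = sum_j e^(-t lambda_j)

  # L2 distance between the signatures of two graphs:
  d = np.linalg.norm(heat_trace_signature(nx.cycle_graph(20))
                     - heat_trace_signature(nx.path_graph(20)))
  #+END_SRC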
**** Scaling to large graphs
- Full eigendecomposition of the Laplacian, $L = \Phi \Lambda \Phi^T$, takes $O(n^3)$ time and $\Theta(n^2)$ memory.
- The full eigendecomposition would give the signatures exactly, but for large graphs its direct computation is infeasible, so they need to approximate the heat trace signatures.
- The first proposal is to use a Taylor expansion, which allows them to compare two graphs locally in $O(m)$ time, where $m$ is the number of edges.
+ Is useful on very large graphs on which eigendecomposition is prohibitive, however, for manageable graph sizes we adopt a more accurate strategy based on approximating the eigenvalue growth rate.
- They compute the k smallest and k largest eigenvalues of the spectrum exactly and approximate the remaining interior eigenvalues by interpolating a linear growth between the two ends.
**** Properties of heat trace
- *Permutation invariance:* Isomorphic graphs are isospectral, hence their respective heat trace signatures are equal
- *Scale-adaptivity:* The value of t (the time parameter) can be tuned to emphasise either local connectivity (at low values) or global connectivity (at large values).
- *Size-invariance:* We can normalise the heat trace signatures, thus making it size-invariant.