- Uses two types of node embedding vectors:
1) Inter-vector for network alignment
2) Intra-vector for other downstream network analysis tasks
- A crucial prerequisite for mining multi-networks is to map the nodes/participants among these related networks; this task is called network alignment.
- The shared participants among the networks are defined as anchor nodes, so they act like anchors aligning the networks they participate in, and the relationships between anchor nodes across networks are called anchor links.
+ In many cases, a few anchor links are known beforehand (e.g. when people link their Twitter account on their Facebook profile)
+ Network alignment seeks to infer the unknown or potential anchor links.
- Most previous work assumes topology consistency, such that a node tends to have a consistent connectivity structure across networks
- Attribute-based methods are not applicable in many realistic scenarios, as the attribute information (such as usernames, gender or other profile data) may be unreliable or incomplete.
+ REGAL supports using attribute-based information
- CrossMNA uses an additional vector named the "network vector", which is proposed to extract the semantic meaning of a network and can reflect the differences in global structure among the networks.
- It also uses two kinds of embedding:
1) The inter-vector, which reflects the common features of the anchor nodes in different networks and is shared among the known anchor nodes
2) The intra-vector, which preserves the specific structural features of a node in its own network and is generated through a combination of the network vector and the inter-vector.
- REGAL proposes an embedding-based method built on the assumption that nodes with similar structural connectivity or degrees have a high probability of being aligned (the whole point of its degree-based feature vector)
+ Although one node may share some similar features across related networks, its local structural connections can be entirely different in each network due to the distinct semantic meanings of the networks (you are likely to connect to very different people on Facebook vs LinkedIn)
- Network alignment algorithms face the following challenges:
1) Semantic diversity: differences in network semantics lead to different interaction behaviour of the same node in each network, which adds inaccuracies to the alignment. The problem worsens further when more than two networks are considered.
2) Data imbalance: this has two aspects. First, the size of each network may vary (which REGAL handles); second, the number of anchor links between each pair of networks can be unequal. Any pair-wise learning method suffers from this, as it only considers pairs of networks, so making full use of the information across ALL the networks to deal with data imbalance is also a challenge.
3) Model storage: network embedding is a practical approach for extracting the structural information of a node and has been applied in some network alignment methods (such as REGAL). However, in large-scale multi-networks it is essential to take the space overhead of the methods into account (which is why REGAL never computes the full similarity matrix, but derives the embedding in a clever way). REGAL still has to compute an embedding vector for most nodes, which takes up a lot of space; its landmark nodes only alleviate this slightly.
- There is a network vector for each network. It reflects the differences in global structure among the networks: if the global structure of two networks is similar, their network vectors will be close in vector space.
- Each node has an inter-vector and an intra-vector. The inter-vector depicts the commonness of anchor nodes in different networks and is shared among the anchor nodes. The intra-vector reflects the specific structural features of the node in its own network and is generated through a combination of the network vector and the inter-vector.
*** CrossMNA
- Networks are assumed to be unweighted and all edges directed, as an undirected edge can be split into two directed edges.
- A set of networks is defined as G = ((G^1, G^2, ..., G^n), (A^{(1,2)}, A^{(1,3)}, ..., A^{(n-1,n)})), where each G^i represents a network and A^{(i,j)} represents the set of anchor links between G^i and G^j. Each network is defined by its nodes and edges.
- An anchor link between G^i and G^j is defined as (v_k^i, v_k^j) \in A^{(i,j)}. Anchor links follow the transitivity law.
**** The Cross-Network Embedding
- The inter-vector u preserves the common features among the anchor nodes. Through training, the inter-vector of an unknown anchor node should get close to its counterparts in vector space. The inter-vector is difficult to learn directly, as there is no direct correlation between the unknown anchor nodes: an anchor node in one network is expected to show structural features similar to its counterparts, but also distinctive connection relationships, due to the semantic meaning of each network.
- The intra-vector v is straightforward, as it is easy to extract the structural features of a node into a network embedding. It contains both the commonness among counterparts and the specific local connections in the node's own network due to the semantics, so it can NOT be applied to node matching unless the impact of the network semantics is removed.
- Thus the authors present an equation that builds a correlation among intra-vector, inter-vector and network semantics: $v_i^k = u_i + r^k$, where $r^k$ is the network vector, which captures the unique characteristics of G^k. We can therefore learn the inter-vectors of the anchor nodes indirectly by training the combination-based intra-vectors, and the network vector can be used to remove the semantic "noise" added to the intra-vector (e.g. $u_4 = v_4^3 - r^3$).
- The inter-vector and intra-vector can live in vector spaces of different dimensions, which is handled by a transformation matrix W that aligns them: $v_i^k = W u_i + r^k$.
- Like REGAL, they use a k-d tree to find the top-a most likely counterpart nodes in other networks; the alignment compares the inter-vectors of nodes (a minimal sketch follows below).
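- A minimal sketch (assumed, not the authors' code) of how the intra-vector combines the inter-vector and network vector, and how candidate alignments could be retrieved with a k-d tree over inter-vectors; the node/network names, dimensions and random "learned" parameters are illustrative stand-ins.
#+BEGIN_SRC python
# Hedged sketch of the CrossMNA vector combination and a k-d tree lookup.
# The "learned" parameters below are random stand-ins, not a trained model.
import numpy as np
from scipy.spatial import cKDTree

d1, d2 = 200, 30                      # inter- and intra-vector dimensions
rng = np.random.default_rng(0)

# Stand-ins: inter-vectors u_i (shared across networks), network vectors r^k,
# and the transformation matrix W aligning the two spaces.
u = {"alice": rng.normal(size=d1), "bob": rng.normal(size=d1)}
r = {"facebook": rng.normal(size=d2), "linkedin": rng.normal(size=d2)}
W = rng.normal(size=(d2, d1))

def intra_vector(node, net):
    """v_i^k = W u_i + r^k : node i's structure-specific vector in network k."""
    return W @ u[node] + r[net]

# Alignment compares inter-vectors: index one network's nodes in a k-d tree
# and query with a node from another network to get the top-a candidates.
names = list(u)
tree = cKDTree(np.stack([u[n] for n in names]))
_, idx = tree.query(u["alice"], k=2)          # top-2 nearest inter-vectors
print("candidates for alice:", [names[i] for i in idx])
#+END_SRC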
*** Experiments
- CrossMNA outperforms REGAL in their tests: REGAL assumes (at least some) topology consistency, so it performs poorly on datasets that lack it.
- The dimensions of the inter-vector and intra-vector (d1 and d2) should be set to roughly d1 = 200-300 and d2 = 30-50 to save memory in practice. Performance grows with d1.
- With no prior known anchor links, CrossMNA uses as much space as REGAL.
- Otherwise, CrossMNA is dramatically space-efficient for large-scale multi-network applications.
- The time complexity of learning the embeddings is approximately $O(t N (d_1 d_2 |V| + d_2 |E|))$, where $t$ is the number of iterations, $N$ denotes the number of networks, and $|V|$ and $|E|$ are the number of nodes and edges in each network.
- The time complexity of finding soft alignments between each pair of networks is $O(|V| \log |V|)$.
* Graph Similarities
** Deep Graph Kernels
- In domains such as social networks, bioinformatics and robotics we are often interested in computing similarities between structured objects. Graphs offer a natural way to represent structured data.
- Consider the problem of identifying a subreddit on Reddit. One can represent an online discussion thread as a graph where nodes represent users and edges represent whether two users interact. The task is then to predict which sub-community a discussion belongs to, based on its communication graph.
- One increasingly popular approach to measuring similarity between structured objects is kernel methods.
- A kernel method measures the similarity between two objects with a kernel function, an inner product in a reproducing kernel Hilbert space (RKHS). The challenge in using kernel functions is to pick a suitable kernel that captures the semantics of the structure while being computationally tractable.
+ Roughly speaking, this means that if two functions f and g in the RKHS are close in norm, i.e. ||f − g|| is small, then f and g are also pointwise close, i.e. |f(x) − g(x)| is small for all x.
- R-convolution is a general framework for handling discrete objects where the key idea is to recursively decompose structured objects into "atomic" (likely non-decomposable) sub-structures and define valid local kernels between them. Given a graph G, let \phi(G) denote a vector which contains counts of atomic sub-structures and $(\cdot, \cdot)_H$ denote a dot product in the RKHS H; the kernel between two graphs G and G' is then given by $K(G,G') = (\phi(G), \phi(G'))_H$.
- This representation, however, does not take a number of important observations into account:
1) The sub-structures used to compute the kernel matrix are not independent. For example graphlets, a popular sub-structure for decomposing graphs, are defined as induced, non-isomorphic sub-graphs of size k; they exhibit strong dependence relationships, since size k+1 graphlets can be derived from size k graphlets by adding nodes or edges.
2) As the size k increases, the number of unique graphlets grows exponentially, so the feature space grows and there is a sparsity problem: only a few sub-structures will be common across graphs (not entirely sure why this applies though). This leads to the diagonal dominance problem, where a given graph is similar to itself but to hardly any other graph.
- We would like a kernel matrix where all entries belonging to a class are similar to each other and dissimilar to everything else. Thus, consider an alternative kernel between two graphs G and G': $K(G,G') = \phi(G)^T M \phi(G')$, where M is a $|V| \times |V|$ positive semi-definite matrix that encodes the relationships between sub-structures and V is the "vocabulary" of sub-structures obtained from the training data.
+ This allows one to design M in a clever way that respects the similarity within the given sub-structure space. This could be the edit distance in spaces where there is a strong mathematical relationship between sub-structures, such that one can design an M that respects the geometry of the space.
+ This geometry assumption can also be fulfilled by "learning" the geometry of the space from data.
- This paper proposes recipes for designing such M matrices for graph kernels.
- They propose two recipes:
1) They exploit an edit-distance relationship between sub-structures and directly compute M
2) They propose a framework that computes an M matrix by learning latent representations of sub-structures
- Their contributions are:
1) They propose a general framework that learns hidden representations of sub-structures used in graph kernels
2) They demonstrate their framework on three popular graph kernels: the graphlet kernel, Weisfeiler-Lehman subtree kernels and shortest-path kernels
3) They apply their framework to derive deep variants of string kernels, which also arise from the R-convolution framework
*** Graph Kernels
- The three families of graph kernels are: those based on limited-size subgraphs, those based on subtree patterns, and those based on walks and paths.
- Let GG be a set of n graphs $G_1, \dots, G_n$, and let Y represent the set of labels associated with the graphs in GG, $Y = \{y_{G_1}, \dots, y_{G_n}\}$.
- Given some G = (V, E) and H = (V_H, E_H), H is a subgraph of G iff there exists an injective mapping a : V_H -> V such that (v,w) \in E_H iff (a(v), a(w)) \in E.
- A graph G is labeled if there is a function l : V -> \Sigma that assigns labels from some alphabet \Sigma to vertices in G. Likewise, a graph is unlabeled if there is nothing to distinguish nodes apart from their interconnectivity.
- K(G,G') is a kernel function which measures the similarity between G and G'.
- The graph classification problem is to map graphs into two or more categories: given a set of graphs GG and labels Y, we should learn to map graphs to labels within Y.
**** Graph Kernels based on subgraphs
- A graphlet Gr is an induced, non-isomorphic sub-graph of size k. Let $V_k = \{Gr_1, Gr_2, \dots, Gr_{n_k}\}$ be the set of size-k graphlets, where $n_k$ denotes the number of unique graphlets of size k. Given two unlabeled graphs G and G', the graphlet kernel is defined as $K_{GK}(G, G') = (f^G, f^{G'})$, where $f^G$ and $f^{G'}$ are vectors of normalised counts, i.e. the i'th component of $f^G$ denotes the frequency of graphlet $Gr_i$ occurring as a subgraph of G, and $(\cdot, \cdot)$ is the Euclidean dot product (a small counting sketch follows below).
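- A minimal counting sketch of the k = 3 case, assuming the four 3-node graphlets (0, 1, 2 or 3 induced edges) as the vocabulary; toy edge-list graphs, not the paper's sampling scheme.
#+BEGIN_SRC python
# Hedged sketch of K_GK for k = 3: the four 3-node graphlets are counted by
# brute force over all node triples, normalised, and compared with a
# Euclidean dot product.
from itertools import combinations
import numpy as np

def graphlet3_vector(nodes, edges):
    edge_set = {frozenset(e) for e in edges}
    counts = np.zeros(4)                       # index = number of edges in the triple
    for triple in combinations(nodes, 3):
        k = sum(frozenset(p) in edge_set for p in combinations(triple, 2))
        counts[k] += 1
    return counts / counts.sum()               # f^G: normalised frequencies

def graphlet_kernel(g1, g2):
    return float(graphlet3_vector(*g1) @ graphlet3_vector(*g2))

triangle_plus = ([1, 2, 3, 4], [(1, 2), (2, 3), (1, 3), (3, 4)])
path = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
print(graphlet_kernel(triangle_plus, path))
#+END_SRC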
**** Graph kernels based on subtree patterns
- These decompose the graph into its subtree patterns; the Weisfeiler-Lehman subtree kernel is in this family.
- It requires a labeled graph, in which we can iterate over each vertex and its neighbors in order to create a multiset label.
- The multiset at every iteration consists of the label of the vertex and the sorted labels of its neighbors; this multiset is then given a new label, which is used in the next iteration.
- To compare graphs, we then simply count the co-occurrences of each label.
- Given G and G', the Weisfeiler-Lehman subtree kernel is then $K_{WL}(G,G') = (l^G, l^{G'})$, where $(\cdot,\cdot)$ denotes the Euclidean dot product. Assuming h iterations of relabeling, $l^G$ consists of h blocks such that the i'th component of the j'th block of $l^G$ contains the frequency with which the i'th label was assigned to a node in the j'th iteration (sketch below).
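- A minimal sketch of the WL relabeling and histogram comparison on toy adjacency-dict graphs; the "compressed" label here is simply the multiset tuple itself, which is a simplification of the paper's relabeling.
#+BEGIN_SRC python
# Hedged sketch of the WL subtree kernel: h rounds of relabeling with
# (label, sorted neighbour labels), then a dot product of label histograms.
from collections import Counter

def wl_histogram(adj, labels, h=2):
    hist, labels = Counter(labels.values()), dict(labels)
    for _ in range(h):
        # Multiset label per vertex: own label + sorted neighbour labels.
        labels = {v: (labels[v], tuple(sorted(labels[n] for n in adj[v])))
                  for v in adj}
        hist.update(labels.values())           # the tuple acts as the new label
    return hist

def wl_kernel(g1, g2, h=2):
    h1, h2 = wl_histogram(*g1, h=h), wl_histogram(*g2, h=h)
    return sum(h1[k] * h2[k] for k in h1)      # Euclidean dot product

g1 = ({1: [2, 3], 2: [1], 3: [1]}, {1: "a", 2: "b", 3: "b"})
g2 = ({1: [2], 2: [1, 3], 3: [2]}, {1: "b", 2: "a", 3: "b"})
print(wl_kernel(g1, g2))
#+END_SRC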
**** Graph Kernels based on random-walks
- Decompose graphs into random walks or paths and then count the co-occurrences of random walks or paths in the two graphs.
- Let $P_G$ represent the set of all shortest paths in G, where each $p_i \in P_G$ is a triplet $(l_s^i, l_e^i, n_k)$ in which $n_k$ is the length of the path and $l_s^i$ and $l_e^i$ are the labels of the starting and ending vertices. The shortest-path kernel for LABELED graphs G and G' is then $K_{SP}(G,G') = (P^G, P^{G'})$, where the i'th component of $P^G$ contains the frequency of the i'th triplet occurring in graph G (sketch below).
- Does this still use the Euclidean dot product?
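- A minimal sketch of the shortest-path kernel on small unweighted toy graphs, using BFS for the shortest paths and a plain Euclidean dot product over triplet counts (my assumption for the open question above).
#+BEGIN_SRC python
# Hedged sketch of K_SP: count (start label, end label, path length) triplets
# from BFS shortest paths, then take a dot product of the count histograms.
from collections import Counter, deque

def sp_histogram(adj, labels):
    hist = Counter()
    for s in adj:                                # BFS from every source vertex
        dist, q = {s: 0}, deque([s])
        while q:
            v = q.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
        for t, d in dist.items():
            if t != s:
                hist[(labels[s], labels[t], d)] += 1
    return hist

def sp_kernel(g1, g2):
    h1, h2 = sp_histogram(*g1), sp_histogram(*g2)
    return sum(h1[k] * h2[k] for k in h1)        # Euclidean dot product

g1 = ({1: [2], 2: [1, 3], 3: [2]}, {1: "a", 2: "b", 3: "a"})
g2 = ({1: [2, 3], 2: [1], 3: [1]}, {1: "a", 2: "a", 3: "b"})
print(sp_kernel(g1, g2))
#+END_SRC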
**** General
- All the graph kernels mentioned are instances of the R-convolution framework.
- The recipe for defining graph kernels is as follows:
1) Recursively decompose a graph into its sub-structures (the graphlet kernel decomposes into sub-graphs, Weisfeiler-Lehman into subtrees and shortest-path into shortest paths (lol))
2) The decomposed sub-structures are then represented as a vector of frequencies, where each item in the vector represents how many times a given sub-structure occurs in the graph
3) The Euclidean space or some other domain-specific RKHS is used to define the dot product between the vectors of frequencies
*** Methodology
**** Sub-structure similarity via edit distance
- How to compute an M matrix by using the edit-distance relationship between sub-structures.
- When sub-structures exhibit a clear mathematical relationship, one can exploit the underlying similarities between them to compute a matrix M.
- For graphlet kernels, one can use the edit-distance relationship to encode how similar graphlets are.
- Given a graphlet Gr_i of size k and a graphlet Gr_j of size k+1, we can build an undirected edit-distance graph (UED-graph) by adding an undirected edge between Gr_i and Gr_j iff Gr_i can be obtained from Gr_j by deleting a node, or vice versa. Given such a UED-graph, one can compute the shortest path between Gr_i and Gr_j to obtain their edit distance and then compute the matrix M directly. However, computing shortest-path distances on the UED-graph becomes very expensive as a function of k. Thus, instead of creating the matrix M of size |V| x |V|, one can create a much smaller one of size |V'| x |V'| with V' << V, taking only the observed sub-structures into account (a hedged sketch follows below).
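- A hedged sketch of turning pairwise edit distances between observed sub-structures into an M matrix; the notes do not give the exact formula, so $M_{ij} = e^{-d_{ij}}$ and the distances and names below are purely illustrative.
#+BEGIN_SRC python
# Hedged sketch: map edit distances between the *observed* sub-structures
# (V' << V) to a similarity matrix M. M_ij = exp(-dist_ij) is an illustrative
# choice, not necessarily the paper's exact construction; the data is made up.
import numpy as np

observed = ["triangle", "path3", "wedge", "star4"]       # hypothetical V'
# Pretend these came from shortest paths on the UED-graph (e.g. via BFS).
edit_dist = np.array([[0, 2, 1, 3],
                      [2, 0, 1, 2],
                      [1, 1, 0, 2],
                      [3, 2, 2, 0]], dtype=float)

M = np.exp(-edit_dist)                                   # |V'| x |V'| similarity

phi_g1 = np.array([3, 1, 4, 0], dtype=float)             # counts of observed sub-structures in G
phi_g2 = np.array([1, 2, 2, 1], dtype=float)             # ... and in G'

print(phi_g1 @ M @ phi_g2)                               # K(G, G') = phi(G)^T M phi(G')
#+END_SRC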
**** Sub-Structure Similarity via Learning
- The second approach is to LEARN the latent representations of sub-structures, using language modeling and deep learning techniques. These learned representations are then utilised to compute an M that respects the similarities between sub-structures.
- *Neural Language Models*: traditional language models estimate the likelihood of a sequence of words appearing in a corpus. Given a sequence of training words $(w_1, w_2, \dots, w_T)$, n-gram based language models aim to maximise the probability $Pr(w_t \mid w_1, \dots, w_{t-1})$, i.e. they estimate the likelihood of seeing a word given all the preceding ones.
- Recent work in language modeling focuses on distributed vector representations of words, word embeddings. Neural language models improve on classic n-gram language models by using continuous vector representations for words.
- Note: word embeddings are words mapped into a d-dimensional embedding space such that similar words are mapped to similar positions in that space.
- Unlike traditional n-gram models, these neural language models take advantage of the notion of context.
+ A context is defined as a fixed number of preceding words
- The objective of word embedding models is to maximise $\sum_{t=1}^T \log Pr(w_t \mid w_{t-n+1}, \dots, w_{t-1})$, where $w_{t-n+1}, \dots, w_{t-1}$ is the context of $w_t$.
- *Continuous Bag-of-Words*: used to approximate the above objective. It predicts the current word given the surrounding words within a given window. It is similar to feed-forward neural network language models where the non-linear hidden layer is removed and the projection layer is shared for all words.
- It tries to maximise the objective $\sum_{t=1}^T \log Pr(w_t \mid w_{t-c}, \dots, w_{t+c})$, where c is the length of the context window. The probability is computed using a softmax.
- *Skip-gram model*: maximises the co-occurrence probability among the words that appear within a given window. So instead of predicting the current word based on the surrounding words, we predict the surrounding words given the current word. The objective of skip-gram is $\sum_{t=1}^T \log Pr(w_{t-c}, \dots, w_{t+c} \mid w_t)$, where the probability factorises as $\prod_{-c \leq j \leq c, j \neq 0} Pr(w_{t+j} \mid w_t)$ and each factor is again computed with a softmax-like function.
- Hierarchical softmax and negative sampling are used to train the skip-gram and CBOW models.
- Hierarchical softmax uses a binary Huffman tree.
- Negative sampling selects a few negative contexts at random instead of considering all words in the vocabulary. If a word w appears in the context of another word w', then the vector representation of w is pushed closer to the vector representation of w'.
- Once training converges, similar words are mapped to similar positions in the vector space. The learned word vectors are empirically shown to preserve semantics. Word vectors can be used to answer analogy questions using simple vector algebra, where the result of the vector calculation v("Madrid") - v("Spain") + v("France") is closer to v("Paris") than to any other word vector.
+ So we view sub-structures in graph kernels as words that are generated from a special language: different sub-structures compose graphs in a way similar to how words compose sentences (see the sketch below).
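- A hedged sketch of learning sub-structure embeddings with skip-gram or CBOW, assuming gensim 4.x's Word2Vec API; the sub-structure "sentences" are toy stand-ins for a real corpus.
#+BEGIN_SRC python
# Hedged sketch: treat co-occurring sub-structures as "sentences" and learn
# embeddings with skip-gram (sg=1) or CBOW (sg=0). Assumes gensim 4.x
# (Word2Vec with vector_size=...); the corpus below is a toy stand-in.
from gensim.models import Word2Vec

corpus = [
    ["triangle", "wedge", "triangle", "path3"],   # sub-structures of graph 1
    ["wedge", "path3", "path3", "star4"],         # sub-structures of graph 2
    ["triangle", "wedge", "star4"],               # sub-structures of graph 3
]

model = Word2Vec(sentences=corpus, vector_size=16, window=5,
                 min_count=1, sg=1, negative=5, epochs=50, seed=0)

# Embedding of a sub-structure and its nearest neighbours in the learned space.
print(model.wv["triangle"][:4])
print(model.wv.most_similar("triangle", topn=2))
#+END_SRC
- Setting sg=0 would train CBOW instead of skip-gram; negative=5 corresponds to the negative sampling described above.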
**** Deep Graph Kernels
- The framework takes a list of graphs GG and decomposes each into its sub-structures.
- The list of sub-structures for each graph is treated as a sentence generated from some vocabulary V, where V is the set of unique sub-structures observed in the training data (that whole V' << V thing)
- We need to generate a corpus in which the co-occurrence relationship is meaningful.
- *Corpus generation for graphlet kernels*: exhausting all graphlets is very expensive. Instead one can perform random sampling: randomly sampling graphlets of size k for a graph G involves placing a randomly generated window of size k x k on the adjacency matrix of G and collecting the observed graphlet within this window. This is done n times if we want n graphlets. As random sampling preserves no notion of co-occurrence, the scheme is slightly altered using the notion of neighborhoods: whenever we sample a graphlet, we also sample its immediate neighbors. The graphlet and its neighbors are then interpreted as co-occurring, so graphlets with similar neighborhoods acquire similar representations.
- *Corpus generation for shortest-path kernels*: for every shortest path, take every sub-path of it as a co-occurring shortest path.
- *Corpus generation for Weisfeiler-Lehman*: not clear; I suppose all multiset labels within a given iteration h are treated as co-occurring.
**** Algorithm
1) Choose a graph kernel
2) Construct the similarity matrix M:
- Build the sub-structure vocabulary V
- Construct the co-occurrences
- Apply CBOW or skip-gram to get the embeddings
- Calculate the similarity matrix M
3) Decompose each graph into its sub-structures
4) Build the histogram vector (the frequencies of the sub-structures) \phi(G)
5) Compute the graph kernel as $K(G, G') = \phi(G)^T M \phi(G')$ (an end-to-end sketch follows below)
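- A hedged end-to-end sketch of steps 2-5 above, with random stand-in embeddings in place of the learned CBOW/skip-gram vectors; the vocabulary and decompositions are toy examples.
#+BEGIN_SRC python
# Hedged sketch: given sub-structure embeddings (random stand-ins here),
# build M = E E^T and compute K(G, G') = phi(G)^T M phi(G').
import numpy as np

vocab = ["triangle", "wedge", "path3", "star4"]          # observed sub-structures V'
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 16))                    # row i = embedding of vocab[i]

M = E @ E.T                                              # similarity matrix, PSD by construction

def histogram(substructures):
    """phi(G): frequency of each vocabulary item in the decomposition of G."""
    return np.array([substructures.count(v) for v in vocab], dtype=float)

phi_g1 = histogram(["triangle", "wedge", "triangle", "path3"])
phi_g2 = histogram(["wedge", "path3", "path3", "star4"])

print(phi_g1 @ M @ phi_g2)                               # deep graph kernel value
#+END_SRC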
*** Experiments
- Under noise, the edit-distance graphlet kernel (EGK) beats the base kernel on all datasets except one, likely because EGK only uses a mathematical relationship between sub-structures rather than learning a more sophisticated one. The deep graphlet kernel (DGK), which learns the relationship, significantly outperformed all base kernels except on one dataset (a different one from the one on which EGK lost).
- With regard to accuracy, DGK slightly outperforms EGK on all datasets.
- Running time is measured in seconds. What.
** Matching Node Embeddings for Graph Similarity
- Most graph kernels focus on local properties of graphs and ignore global structure (not really the case with DGK)
- At the heart of graph kernels lies a positive semidefinite kernel function k. Once such a function $k : X \times X \to R$ is defined for a set X, it is known that there exists a map $\phi : X \to H$ into a Hilbert space H s.t. $k(x,x') = (\phi(x), \phi(x'))$ for all $x, x' \in X$, where $(\cdot,\cdot)$ is the inner product in H.
- Most existing graph kernels compare specific sub-structures of graphs (this is what DGK does! Or at least the kernels they use)
+ So these algos focus on local properties of graphs and ignore global structure
- The goal of this paper is to fix the problems arising from kernels focusing on local sub-structures. This is accomplished by using algos that utilise features describing global properties of graphs.
- They present two algos designed to compare pairs (pairs!) of graphs based on their global properties. They are applicable to both labeled and unlabeled graphs.
1) Each graph is represented as a collection of the embeddings of its vertices.
2) The vertices of the graphs are embedded in Euclidean space using the eigenvectors of the corresponding adjacency matrices
3) The similarity between a pair of graphs is measured by computing a matching between their sets of embeddings
- Two algos are employed:
1) One casts the problem as an instance of the Earth Mover's Distance and uses it, for a set of graphs, to build a similarity matrix. This similarity matrix is however not always positive semidefinite, so they use an SVM classification algorithm for indefinite kernels, which treats the indefinite similarity matrix as a noisy observation of the true positive semidefinite kernel.
2) The other is adapted from the pyramid match kernel and yields a positive semidefinite matrix. This method is called the Pyramid Match Graph Kernel.
*** Prelims
- Graphs are defined as usual: $G = (V,E)$.
- For labeled graphs, a labeling function $l : V -> L$ assigns labels from a set L to the vertices.
- Given a graph G, its vertices can be represented as points in a vector space using a node embedding algorithm. In this paper, embeddings are generated for the vertices of a graph using the eigenvectors of its adjacency matrix A.
+ Given the eigenvalue decomposition of A, $A = U \Lambda U^T$, the i'th row $u_i$ of $U$ corresponds to the embedding of vertex $v_i \in V$.
+ These capture global properties of graphs and offer a powerful and flexible mechanism for performing machine learning tasks on them.
+ A is real and symmetric, so its eigenvalues $\lambda_1, \dots, \lambda_n$ are real.
+ The graphs contain no self-loops, so the diagonal of A is all zeros and Tr(A) = 0; since the trace equals the sum of the eigenvalues, the eigenvalues sum to zero.
- The eigenvectors with the largest eigenvalues share some interesting properties.
- The eigenvector with the largest eigenvalue is special, as the i'th component of this vector gives the eigenvector centrality score of vertex v_i in the graph.
+ Eigenvector centrality is a measure of global connectivity of a graph which is captured by the spectrum of the adjacency matrix.
- We can also use the eigenvectors of the next-largest eigenvalues. Note that we work with magnitudes, so the sign of the eigenvalue is irrelevant.
- A graph G can thus be represented as a bag of vectors, e.g. the set $\{u_1, u_2, \dots, u_n\}$, where each vector in the set corresponds to the representation of a vertex $v_i \in V$ (see the sketch below).
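- A minimal sketch of the adjacency-eigenvector embeddings on a toy graph, keeping the eigenvectors whose eigenvalues have the largest magnitude.
#+BEGIN_SRC python
# Hedged sketch: embed the vertices of a small undirected graph using the
# eigenvectors of its adjacency matrix (the first one corresponds to
# eigenvector centrality, up to sign).
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)       # symmetric, no self-loops

eigvals, U = np.linalg.eigh(A)                  # A = U diag(eigvals) U^T
order = np.argsort(-np.abs(eigvals))            # sort by |lambda|, descending
d = 2
embeddings = U[:, order[:d]]                    # row i = embedding of vertex i

print("eigenvalues:", eigvals[order])           # note: they sum to ~0 since Tr(A) = 0
print("bag of vectors:\n", embeddings)
#+END_SRC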
*** Earth Mover's Distance
- The similarity of two graphs G_1 and G_2 is formulated as the minimum travel cost between the two graphs, provided by a linear program that minimises the total cost of transporting one graph's vertex embeddings onto the other's, where the cost between two vertices is the Euclidean distance in the embedding space.
- Note that these guys work on labeled data, and we want the distance between pairs of vertices with different labels to be large, so it is simply set to the largest possible value (which is apparently $\sqrt{d}$).
- Let D be the matrix obtained from computing all the pair-wise distances between graphs (such as G_1 and G_2). If D were a Euclidean Distance Matrix (EDM), we could use D to define a positive semi-definite kernel matrix K: $K = -\frac{1}{2} JDJ$, where $J$ is the centering matrix (wat).
- However, in this setting K is not positive semi-definite, as D is not Euclidean, so they use the indefinite-kernel SVM trick mentioned above (a hedged EMD sketch follows below).
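- A hedged sketch of the EMD between two bags of node embeddings, written as a transportation LP with uniform node weights using scipy's linprog; the labeled case (the $\sqrt{d}$ cost for label mismatches) is omitted, and this is not the authors' implementation.
#+BEGIN_SRC python
# Hedged sketch: EMD between two bags of embeddings as a balanced
# transportation linear program (uniform weights 1/n1 and 1/n2).
import numpy as np
from scipy.optimize import linprog

def emd(X, Y):
    n1, n2 = len(X), len(Y)
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)   # pairwise Euclidean costs
    c = D.ravel()                                               # flow variables f_ij
    A_eq = np.zeros((n1 + n2, n1 * n2))
    for i in range(n1):                                         # sum_j f_ij = 1/n1
        A_eq[i, i * n2:(i + 1) * n2] = 1.0
    for j in range(n2):                                         # sum_i f_ij = 1/n2
        A_eq[n1 + j, j::n2] = 1.0
    b_eq = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, 1.0 / n2)])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun                                              # minimum travel cost

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))         # two bags of embeddings
print(emd(X, Y))
#+END_SRC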
*** Pyramid Match Graph Kernel
- Based on the Pyramid Match Kernel.
- Generates PSD (positive semi-definite) kernel matrices.
- The basic idea is to map the bag-of-vectors representations of graphs to multi-resolution histograms and then compare these histograms with a weighted histogram intersection measure, in order to find an approximate correspondence between the two sets of vectors.
- The algorithm works by partitioning the feature space into regions of increasingly larger size and taking a weighted sum of the matches that occur at each level.
+ Two points are said to match if they fall into the same region.
+ The number of matches at a level is given by the histogram intersection, i.e. the sum over regions of the minimum of the two histograms; matches already found at finer levels are not counted again, and matches found only at coarser levels get smaller weights.
- Apparently the pyramid match kernel is a Mercer kernel, so by computing it for all pairs of graphs a PSD kernel matrix can be built.
- Complexity is $O(dnL)$ for n nodes, embedding dimension d and L levels.
- Works for unlabeled graphs.
- Can be modified to work for labeled graphs.
+ Only vertices that share a label should be allowed to match
+ Instead of representing each graph as a set of vectors, it can be represented as a set of sets of vectors, where each inner set corresponds to a specific label and contains the embeddings of the vertices with that label (a sketch of the unlabeled version follows below).
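- A hedged sketch of the unlabeled pyramid match idea, assuming node embeddings scaled to $[0,1]^d$ and the classic pyramid-match weighting of new matches; the paper's exact construction may differ.
#+BEGIN_SRC python
# Hedged sketch: build histograms with 2^l bins per dimension at level l
# (level L is finest), count matches via histogram intersection, and weight
# the *new* matches found at each coarser level by 1 / 2^(L - l).
import numpy as np
from collections import Counter

def histogram(points, level):
    bins = 2 ** level
    cells = np.clip((points * bins).astype(int), 0, bins - 1)   # cell index per dimension
    return Counter(map(tuple, cells))

def intersection(h1, h2):
    return sum(min(h1[c], h2[c]) for c in h1)

def pyramid_match(X, Y, L=3):
    I = [intersection(histogram(X, l), histogram(Y, l)) for l in range(L + 1)]
    k = I[L]                                                    # finest-level matches
    for l in range(L):                                          # new matches at coarser levels
        k += (1.0 / 2 ** (L - l)) * (I[l] - I[l + 1])
    return k

rng = np.random.default_rng(0)
X, Y = rng.random((6, 2)), rng.random((5, 2))                   # embeddings scaled to [0, 1]^2
print(pyramid_match(X, Y))
#+END_SRC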
*** Experiments
- They compare against all the graph kernels used in DGK (Weisfeiler-Lehman, graphlet, shortest-path and random walk).
- The EMD and Pyramid Match (PM) methods work well for unlabeled graphs.
- The PM is in general very good.
- They also did very well on labeled graphs.
- The EMD and PM are slow though.
** NetLSD: Hearing the Shape of a Graph
- Ideally, graph comparison should be invariant to the order of nodes and to the sizes of the compared graphs, adaptive to the scale of graph patterns, and scalable.
- They present the Network Laplacian Spectral Descriptor (NetLSD).
- It is a permutation-invariant, size-invariant, scale-adaptive and efficiently computable graph representation method that allows straightforward comparisons of large graphs.
+ Permutation-invariance concerns node order and formally states that if two graphs are isomorphic, their distance should be 0
+ Scale-adaptivity means it can handle comparisons at both a local and a global level, and that the representation should contain both local and global features.
+ Size-invariance means that if two graphs essentially show the same phenomenon at different sizes, their distance from each other should be 0.
- NetLSD extracts a compact signature that inherits the formal properties of the Laplacian spectrum, specifically of its heat or wave kernel.
- Grounded in spectral graph theory, NetLSD allows constant-time similarity computations at several scales.
- So essentially, a graph of nodes and edges is summarised as a curve on an x/y plot, where the x axis is the scale (time) and the y axis is the trace signature.
*** Related work
**** Direct methods
- Approaches like graph edit distance are computationally heavy.
**** Kernel methods
- No graph kernel achieves both scale-adaptive and size-invariant graph comparisons.
- Kernels are expensive to compute.
**** Statistical representations
- Quadratic time complexity.
**** Spectral representations
- Spectral graph theory is effective in the comparison of 3D objects.
- Apparently it's clever.
*** Problem statement
- G = (V,E)
- A representation is a function $\sigma : G -> R^N$ from any graph G within a collection of graphs to an infinite-dimensional real vector. Element j of the representation is denoted $\sigma_j(G)$.
- A representation-based distance is a function $d^\sigma : R^N \times R^N -> R_0^+$ on the representations of two graphs $G_1, G_2 \in G$ that returns a non-negative real number.
- The distance should be a pseudometric: it should be symmetric and satisfy the triangle inequality.
*** NetLSD
- A useful metaphor is heating the graph's nodes and observing the heat diffusion as time passes. Another is a system of masses corresponding to the graph's nodes and springs corresponding to its edges; the propagation of mechanical waves through the graph is another way to capture its structural invariants. In both cases, the overall process describes the graph in a permutation-invariant manner and embodies more global information as time elapses. Their representation employs a trace signature encoding such a heat diffusion or wave propagation process over time.
- Two graphs are compared via the L_2 distance between trace signatures sampled at selected time scales.
**** Spectra as representations
- The spectrum of a graph is defined as the eigenvalues of its Laplacian matrix (derived from the adjacency matrix).
- The Laplacian spectrum encodes important graph properties such as the normalised cut size used in spectral clustering. Likewise, the normalised Laplacian spectrum can determine whether a graph is bipartite, but not the number of its edges.
- Thus, rather than considering the Laplacian spectrum per se, they consider an associated heat diffusion process on the graph to obtain a more expressive representation, in a manner reminiscent of random walk models.
- The main idea is to consider the heat equation based on the Laplacian: $\frac{\partial u_t}{\partial t} = -L u_t$, where $u_t$ is a vector of scalar values on the vertices representing the heat of each vertex at time t. The solution provides the heat at each vertex at time t when the initial heat $u_0$ is initialised with a fixed value on one of the vertices.
+ Its closed-form solution is given by the n x n heat kernel matrix $H_t = e^{-tL}$, which can be computed directly by exponentiating the Laplacian eigenspectrum.
- However, as the heat kernel involves pairs of nodes, it is not directly usable to compare graphs, so they consider instead the heat trace at time t: $h_t = \mathrm{tr}(H_t) = \sum_j e^{-t\lambda_j}$.
- The NetLSD representation is the heat trace signature of graph G, i.e. a collection of heat traces at different time scales, $h(G) = \{h_t\}_{t>0}$.
- An alternative is the wave equation, which is treated much the same way: $\frac{\partial^2 u_t}{\partial t^2} = -L u_t$, with the wave trace signature $w_t = \mathrm{tr}(W_t) = \sum_j e^{-it\lambda_j}$ (a heat-trace sketch follows below).
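- A minimal sketch of a NetLSD-style heat trace signature on toy graphs, using a dense eigendecomposition of the normalised Laplacian and log-spaced time samples; suitable for small graphs only.
#+BEGIN_SRC python
# Hedged sketch: h_t = sum_j exp(-t * lambda_j) over the normalised Laplacian
# spectrum, sampled at several time scales; graphs are compared via the L2
# distance between their signatures.
import numpy as np

def heat_trace_signature(A, times):
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt      # normalised Laplacian
    lam = np.linalg.eigvalsh(L)
    return np.array([np.exp(-t * lam).sum() for t in times])

times = np.logspace(-2, 2, 64)                             # sampled time scales
A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)    # triangle
A2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)    # path

h1, h2 = heat_trace_signature(A1, times), heat_trace_signature(A2, times)
print(np.linalg.norm(h1 - h2))                             # NetLSD-style distance
#+END_SRC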
**** Scaling to large graphs
- A full eigendecomposition of the Laplacian, $L = \Phi \Lambda \Phi^T$, takes $O(n^3)$ time and $\Theta(n^2)$ memory.
- This would allow computing the signatures of graphs exactly, but for large graphs the direct computation is infeasible, so they need to approximate the heat trace signatures.
- The first proposal is to use a Taylor expansion, which allows them to compare two graphs locally in $O(m)$ (note: m is presumably the number of edges).
+ This is useful on very large graphs where eigendecomposition is prohibitive; for manageable graph sizes they adopt a more accurate strategy based on approximating the eigenvalue growth rate.
- They compute k eigenvalues at both ends of the spectrum and interpolate the intervening eigenvalues assuming linear growth.
**** Properties of heat trace
- *Permutation invariance:* isomorphic graphs are isospectral, hence their respective heat trace signatures are equal.
- *Scale-adaptivity:* the value of t (the time parameter) can be tuned to capture either local connectivity (at low values) or global connectivity (at large values).
- *Size-invariance:* we can normalise the heat trace signatures, thus making them size-invariant.
*** Experiments
- NetLSD is very scalable.