Graph Analytics
Analyzing graphs
Graph analytics is the process of examining graph data structures and drawing conclusions from them.
A graph is a mathematical abstraction used to represent the connections between elements. In a graph, nodes (also called vertices) stand for entities, while edges (sometimes called links) represent the connections between those entities. Graph analytics can be applied to problems in many areas, including social network analysis, recommendation systems, fraud detection, cybersecurity, and logistics optimization.
Problem Statement:
In this project, we are given a set of nodes in a social network graph. We identify the important users and estimate the probability that an edge (association) will form in the future between two nodes that are not connected in the graph's current state.
We also identify the nodes that are most central to the graph.
Motivation:
The edges here represent relationships such as following, shared objectives, friendships, and partnerships.
We study the Google Plus social network in particular, with the following applications in mind:
- Recommending friends to a specific user.
- Predicting hidden links in a terrorist group's social network and locating its leaders and major influencers.
- Marketing products to specific audiences, which involves both discovering potential clients and using highly influential people as spokespersons.
- Suggesting potential interactions or partnerships inside an organization that have not yet been found.
The model we build could also be applied to other social networks such as Twitter and Facebook.
Process of Knowledge Engineering:
There are four main steps in the project:
About the dataset:
We obtained the dataset from the Stanford SNAP project: http://snap.stanford.edu/data/egonets-Gplus.html.
This dataset includes information about user profiles, circles, and ego networks. The node features likely describe individual users, for example their name, age, location, interests, and other demographic or behavioral data.
Circles, in this context, are groups of contacts that users created manually and shared through the "share circle" feature. These circles could represent different interests, hobbies, professions, or other characteristics that users have in common.
Nodes: 107614
Edges: 13673453
Building Link Prediction Model:
- First, we assign a connection weight score(x, y) to each pair of nodes <x, y>, based on the observed graph.
- The approaches we used are:
(a) Node neighborhood-based approaches: These strategies rest on the idea that two nodes x and y are more likely to connect in the future if their sets of neighbors Γ(x) and Γ(y) overlap substantially.
(b) Path-based approaches: These techniques refine the notion of shortest-path distance by taking into account all paths between two nodes, using the collective information from the entire set of paths.
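The two families of scores can be sketched in plain Python as follows. The source does not say which specific scores were used, so common neighbors, Jaccard, Adamic/Adar (neighborhood-based), and a truncated Katz score (path-based) are assumptions chosen as typical examples, and the toy graph is hypothetical.

```python
import math
from itertools import combinations

# Hypothetical toy graph as an adjacency dict: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

def common_neighbors(g, x, y):
    # Size of the overlap between the two neighbor sets.
    return len(g[x] & g[y])

def jaccard(g, x, y):
    # Overlap divided by the size of the combined neighborhood.
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    # Shared neighbors with few connections count more.
    # A shared neighbor has degree >= 2, so log(degree) > 0.
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])

def katz(g, x, y, beta=0.1, max_len=4):
    # Truncated Katz score: sum over l of beta^l * (number of walks of length l from x to y).
    walks = {x: 1}  # number of walks of the current length ending at each node
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for node, count in walks.items():
            for nb in g[node]:
                nxt[nb] = nxt.get(nb, 0) + count
        walks = nxt
        score += (beta ** length) * walks.get(y, 0)
    return score

def rank_candidates(g, score):
    # Score every currently unconnected pair and sort from most to least likely.
    pairs = [(x, y) for x, y in combinations(sorted(g), 2) if y not in g[x]]
    return sorted(pairs, key=lambda p: score(g, *p), reverse=True)

print(rank_candidates(graph, common_neighbors))
```

Any of these functions can be passed to `rank_candidates`, so the scores are easy to compare on the same set of unconnected pairs.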
Representation:
We used Python with the igraph library, which provides a variety of tools for creating, manipulating, and analyzing graphs, including functions for generating random graphs, measuring network properties, and visualizing graphs.
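To illustrate the idea of a graph representation without any dependencies, here is a minimal sketch that parses edge lines into an adjacency dictionary. It assumes each line holds two whitespace-separated node ids and treats edges as undirected for simplicity; both are assumptions, and `build_adjacency` is a hypothetical helper, not part of igraph. With igraph itself, a call such as `Graph.TupleList(edges)` would build the corresponding graph object.

```python
from collections import defaultdict

def build_adjacency(lines):
    """Parse 'u v' edge lines into an undirected adjacency dict (node -> set of neighbors)."""
    adj = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        u, v = parts
        if u == v:
            continue  # ignore self-loops
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

# Hypothetical sample input; a real run would iterate over the lines of an edge file.
sample = ["1 2", "1 3", "2 3", "3 4", ""]
graph = build_adjacency(sample)
print(len(graph), "nodes")
```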
Centrality:
To measure centrality, we used four methods:
1. Degree centrality:
The main objective of this method is to find the nodes with the largest number of immediate neighbors; this count is called the "degree" of a node in graph theory.
The input required for this procedure is a graph and a specific node within it.
The output obtained from this method is the degree values of the nodes in the graph.
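As a small illustration, degree can be computed directly from an adjacency dictionary. The toy graph below is hypothetical; in igraph the equivalent call would be `Graph.degree()`.

```python
# Hypothetical toy graph: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

def degree(g, node):
    # Number of immediate neighbors of the node.
    return len(g[node])

degrees = {node: degree(graph, node) for node in graph}

# Rank nodes from highest to lowest degree (ties broken by name).
ranking = sorted(degrees.items(), key=lambda item: (-item[1], item[0]))
print(ranking)
```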
2. Closeness centrality:
The fundamental concept of this approach is to identify nodes with a small average distance to all other nodes; such a node is considered "central" in graph theory.
To execute this technique, a graph and a particular node are required as inputs.
The output of this method is a standardized value between 0 and 1, where a score of 1 indicates that the node is highly central within the graph.
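A minimal sketch of normalized closeness on an unweighted graph, assuming the graph is connected: distances come from a breadth-first search, and the score is the number of other reachable nodes divided by the sum of their distances. The toy graph is hypothetical; igraph offers `Graph.closeness()` for the same purpose.

```python
from collections import deque

# Hypothetical toy graph: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

def bfs_distances(g, source):
    # Shortest-path length (in hops) from source to every reachable node.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(g, node):
    # Inverse of the average distance to the other reachable nodes.
    dist = bfs_distances(g, node)
    total = sum(dist.values())
    reachable = len(dist) - 1
    return reachable / total if total else 0.0

for node in sorted(graph):
    print(node, round(closeness(graph, node), 3))
```

A node adjacent to every other node has an average distance of 1 and therefore a score of 1.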
3. Betweenness centrality:
The central concept of this approach is to identify nodes that act as bridges, brokers, or gatekeepers within a network. In other words, nodes that lie on many of the shortest paths between other nodes are considered to be central actors.
To implement this method, a graph and a specific node within it are required as inputs.
The output obtained from this technique is a normalized value between 0 and 1, where a score of 1 indicates that the node is highly central within the network.
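The sketch below computes normalized betweenness for an unweighted, undirected graph using Brandes' accumulation scheme. The choice of this particular algorithm is an assumption (the source does not say how the values were computed), and the toy graph is hypothetical; igraph exposes `Graph.betweenness()` for the unnormalized values.

```python
from collections import deque

# Hypothetical toy graph: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

def betweenness(g):
    bc = {v: 0.0 for v in g}
    for s in g:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in g}
        sigma = {v: 0 for v in g}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in g[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = {v: 0.0 for v in g}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints, so dividing by
    # (n - 1)(n - 2) yields a value between 0 and 1.
    n = len(g)
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: bc[v] * scale for v in g}

scores = betweenness(graph)
print({v: round(s, 3) for v, s in sorted(scores.items())})
```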
4. Eigenvector centrality:
The main concept of this approach is to identify nodes that are connected to other highly central nodes within a graph. In other words, a node is considered central when its neighbors are themselves highly connected.
This method requires only the graph itself as input.
The output is a value between 0 and 1, representing the Eigenvector Centrality Score of each node in the network.
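One common way to obtain these scores is power iteration, sketched below under the assumption of a connected, undirected graph. Scores are rescaled so that the largest equals 1, which keeps every value between 0 and 1. The toy graph is hypothetical; igraph provides `Graph.eigenvector_centrality()`.

```python
# Hypothetical toy graph: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

def eigenvector_centrality(g, max_iter=1000, tol=1e-10):
    x = {v: 1.0 for v in g}
    for _ in range(max_iter):
        # Each node collects its neighbors' scores. Adding the node's own
        # score leaves the leading eigenvector unchanged but prevents
        # oscillation on bipartite graphs.
        new = {v: x[v] + sum(x[u] for u in g[v]) for v in g}
        top = max(new.values())
        new = {v: s / top for v, s in new.items()}
        if max(abs(new[v] - x[v]) for v in g) < tol:
            return new
        x = new
    return x

scores = eigenvector_centrality(graph)
print({v: round(s, 3) for v, s in sorted(scores.items())})
```

On this toy graph, nodes b and c score highest even though d has the same degree, because d's third neighbor (e) is poorly connected.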
Link prediction:
Based on usability criteria and the experiments we conducted, we chose the Support Vector Machine (SVM) classification algorithm as our machine learning approach.
The main concept behind SVM is to find a hyperplane that separates the two classes, linked and unlinked, based on their feature vectors.
The input required for this approach is a graph dataset that is labeled with either linked or unlinked connections, along with a feature dictionary that contains approximately 230 features for each ego network.
The output obtained from the SVM model is a predicted association between two nodes [x, y], which is either 0 (indicating no association) or 1 (indicating a link).
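To make the hyperplane idea concrete, here is a dependency-free sketch of a linear classifier trained by subgradient descent on the hinge loss, which is the loss a linear SVM minimizes. It is a simplification: the regularization term of a full SVM is omitted, and the two features per node pair (common-neighbor count and Jaccard score) and all data values are hypothetical stand-ins for the roughly 230 features mentioned above. In practice a library implementation such as scikit-learn's `SVC` would normally be used instead.

```python
def train_linear_classifier(X, y, lr=0.1, max_epochs=1000):
    """Subgradient descent on hinge loss; labels are 0 (unlinked) or 1 (linked)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for features, label in zip(X, y):
            sign = 1.0 if label == 1 else -1.0
            margin = sign * (sum(wi * xi for wi, xi in zip(w, features)) + b)
            if margin < 1.0:  # point is misclassified or inside the margin
                w = [wi + lr * sign * xi for wi, xi in zip(w, features)]
                b += lr * sign
                updated = True
        if not updated:
            break  # every training point now has margin >= 1
    return w, b

def predict(w, b, features):
    # Side of the hyperplane w . x + b = 0 decides the class.
    return 1 if sum(wi * xi for wi, xi in zip(w, features)) + b > 0 else 0

# Hypothetical feature vectors: (common-neighbor count, Jaccard score).
X = [(3, 0.6), (4, 0.5), (5, 0.8), (0, 0.0), (1, 0.2), (0, 0.1)]
y = [1, 1, 1, 0, 0, 0]

w, b = train_linear_classifier(X, y)
print(predict(w, b, (4, 0.7)), predict(w, b, (0, 0.05)))
```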
We split our dataset into a 2:1:1 ratio for training, validation, and testing purposes.
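A 2:1:1 split corresponds to 50% training, 25% validation, and 25% test data. A minimal sketch, assuming the samples are shuffled once with a fixed seed before being cut (the function name and seed are hypothetical):

```python
import random

def split_2_1_1(samples, seed=0):
    # Shuffle a copy so the caller's data is left untouched.
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) // 2
    n_val = len(items) // 4
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

train, val, test = split_2_1_1(range(100))
print(len(train), len(val), len(test))
```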
Validation:
Performance Evaluation:
Accuracy: 70%
Precision Score: 68%
F1 Score: 65% (the harmonic mean of precision and recall on the test set)
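For reference, these metrics can be computed from true and predicted labels as sketched below. The label lists are hypothetical toy data and do not reproduce the results reported above.

```python
def evaluate(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # links correctly predicted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted links that do not exist
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # links that were missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-links correctly predicted
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 1 = linked, 0 = unlinked.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

metrics = evaluate(y_true, y_pred)
print(metrics)
```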