How to Manage Very Large Graphs in NodeXL

Dec 20, 2008 at 1:24 PM
I have a graph with 25,000+ vertices and 55,000+ edges that I want to graph and explore.  NodeXL was able to graph it, but it rendered as a large black "blob", which is pretty much unusable.

In particular I am looking for sub-graph clusters (please excuse me if my terms are wrong, I am still learning the "right" graph terminology) where groups of vertices have a higher number of edges between them.

Are there any tips or tricks to getting that kind of information out of such a large graph?  Or should I be using another tool?

Dec 20, 2008 at 5:05 PM

When graphs are very large, several techniques may be applied to filter or reduce the data set to a more manageable size.

One suggestion is to avoid visualizing the whole graph with the "Show/Hide Graph Pane" and to use the "Create Subgraph Images" feature instead.  This will create and insert thumbnail images of each node's "ego-network" out to a selectable number of degrees.  These images can also be written to a collection of files in a sub-directory, and they may reveal a range of different patterns of connection within the data set.
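To make the idea of an "ego-network out to a selectable number of degrees" concrete, here is a minimal sketch in plain Python.  It is not NodeXL code and does not reflect how NodeXL builds its subgraph images; the function name `ego_network` and the sample edge list are made up for illustration.

```python
from collections import deque

def ego_network(edges, center, radius=1):
    """Return the vertices and edges within `radius` hops of `center`.

    Hypothetical helper: edges are treated as undirected (a, b) pairs.
    """
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Breadth-first search that stops expanding once `radius` hops are reached.
    distance = {center: 0}
    queue = deque([center])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == radius:
            continue
        for other in neighbors.get(vertex, ()):
            if other not in distance:
                distance[other] = distance[vertex] + 1
                queue.append(other)

    kept = set(distance)
    # Keep only the edges whose two endpoints are both inside the ego network.
    return kept, [(a, b) for a, b in edges if a in kept and b in kept]

# Made-up sample data.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
vertices, sub_edges = ego_network(edges, "A", radius=1)
```

Raising `radius` to 2 would also pull in "C" and the edge between "B" and "C", which is the effect of choosing more degrees for the subgraph.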

Another suggestion is to use filtering: the "Dynamic Filters" feature exposes sliders that allow you to define which nodes and edges should remain in the whole graph displayed in the pane opened with the "Show/Hide Graph Pane" feature.  Once you select "Read Workbook" in this window you will get a display of the whole "blob" network.  Using the dynamic filter to exclude all but the strongest ties, for example, can reveal something about the core members of the network.
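As a rough illustration of "exclude all but the strongest ties", the sketch below filters a weighted edge list in plain Python.  It only mimics the effect of moving a slider; the function name `filter_by_edge_weight`, the weights, and the threshold are assumptions made for the example, not part of NodeXL.

```python
def filter_by_edge_weight(weighted_edges, min_weight):
    """Keep edges whose weight is at least `min_weight` (hypothetical helper)."""
    kept_edges = [(a, b, w) for a, b, w in weighted_edges if w >= min_weight]
    # Vertices that no longer touch any kept edge drop out of the picture.
    kept_vertices = {v for a, b, _ in kept_edges for v in (a, b)}
    return kept_vertices, kept_edges

# Made-up sample data: (vertex 1, vertex 2, tie strength).
weighted_edges = [("A", "B", 5), ("B", "C", 1), ("C", "D", 4), ("D", "E", 1)]
core_vertices, core_edges = filter_by_edge_weight(weighted_edges, min_weight=4)
```

With this sample data only the two strongest ties survive, and "E" disappears because its only edge was weak.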

Filtering can also be accomplished with a formula in the "Visibility" column of the Vertex worksheet.  If this cell is "0" then the node will not be shown.  Thus a formula that evaluates to 0 whenever a chosen column is not above a threshold can filter out all nodes that fail to have a value above that threshold.  Complex visibility decision rules can be built from multiple columns and conditional formulas.
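The threshold rule can be sketched outside the workbook as well.  The Python below is not the worksheet formula from the post; it is an assumed equivalent that computes a 0/1 visibility value per vertex from its degree, with the function name `visibility_by_degree` and the sample edges invented for illustration.

```python
from collections import Counter

def visibility_by_degree(edges, threshold):
    """Map each vertex to 1 (show) or 0 (hide) based on its degree.

    Hypothetical helper: a vertex is shown only if its degree is above
    `threshold`, mirroring the "value above a threshold" rule.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {vertex: 1 if d > threshold else 0 for vertex, d in degree.items()}

# Made-up sample data.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
visibility = visibility_by_degree(edges, threshold=1)
```

Any other per-vertex measure could stand in for degree here; the point is only that the rule produces a 0 for vertices that should be hidden.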

Be sure that you have run "Calculate Graph Metrics" to populate your workbook with network measurements about each node and edge.  These can be useful features to sort or filter your data set.  When these measures are present in the workbook, the "Dynamic Filters" pane will expose them to allow filtering.  These measures can also be used in formulas written in the "Visibility" column.

Filtering the network to reveal extremes of different metrics can be useful: have a look at the people with the highest in-degree or out-degree.  Sort the data by clustering coefficient or betweenness centrality scores. 
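For readers new to these metrics, here is a small plain-Python sketch of in-degree, out-degree, and the local clustering coefficient.  It is meant only to show what the numbers measure; NodeXL's own calculations may differ in detail (for example in how directed edges are handled), and the function names and sample edges are hypothetical.

```python
from collections import Counter
from itertools import combinations

def in_out_degree(directed_edges):
    """Count incoming and outgoing edges per vertex."""
    in_deg, out_deg = Counter(), Counter()
    for source, target in directed_edges:
        out_deg[source] += 1
        in_deg[target] += 1
    return in_deg, out_deg

def clustering_coefficient(edges, vertex):
    """Fraction of a vertex's neighbor pairs that are themselves connected.

    Edges are treated as undirected here, which is a simplifying assumption.
    """
    neighbors = {}
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    near = neighbors.get(vertex, set())
    if len(near) < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(near, 2) if v in neighbors[u])
    possible = len(near) * (len(near) - 1) / 2
    return linked / possible

# Made-up sample data: (source, target).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
in_deg, out_deg = in_out_degree(edges)
top_in = in_deg.most_common(1)       # vertex with the highest in-degree
top_out = out_deg.most_common(1)     # vertex with the highest out-degree
cc_a = clustering_coefficient(edges, "A")
```

In this sample, "A" has three neighbors but only one of the three possible neighbor pairs ("B" and "C") is connected, so its clustering coefficient is one third.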

Layouts other than Fruchterman-Reingold can be useful!  Try sorting the edges worksheet by an attribute of your data and try the "grid" layout, which aligns the nodes in rows (sorted by their appearance in the edge list) while still displaying the edges between them.

These methods should reveal structures within the data set that are now obscured by the "blob"!


Dec 20, 2008 at 5:37 PM

Time is another useful dimension to slice very large networks into more manageable (and meaningful) forms.

If your edge list has time stamps, the Dynamic Filters feature will expose a time filter slider with which you can define what time ranges to include in the network.

Please note that if you have a time-stamped edge list you may want to avoid the "Merge duplicate edges" feature, which compresses all instances of an edge into a single edge with a tie strength equal to the count of merged edges.
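The interaction between time filtering and merging can be sketched in plain Python.  This is an illustration under assumed data, not NodeXL's implementation: the function names, the integer time stamps, and the sample edges are all made up.

```python
from collections import Counter

def filter_by_time(timed_edges, start, end):
    """Keep only edges whose time stamp falls within [start, end]."""
    return [(a, b, t) for a, b, t in timed_edges if start <= t <= end]

def merge_duplicate_edges(timed_edges):
    """Collapse repeated (a, b) pairs into one edge weighted by their count.

    Note that the individual time stamps do not survive the merge.
    """
    counts = Counter((a, b) for a, b, _ in timed_edges)
    return [(a, b, n) for (a, b), n in counts.items()]

# Made-up sample data: (vertex 1, vertex 2, time stamp).
timed_edges = [("A", "B", 1), ("A", "B", 5), ("B", "C", 3)]
early = filter_by_time(timed_edges, start=1, end=3)
merged = merge_duplicate_edges(timed_edges)
```

After merging, the two "A" to "B" edges become one edge with strength 2 and no time stamp, so there is nothing left for a time slider to act on.  Filtering by time first and merging afterwards keeps both options open.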

Dec 20, 2008 at 5:42 PM
Thanks Marc,

I am trying several of your suggestions.  I do not have any timestamp data in my graph.  The nodes in my graph are effectively cells in a spreadsheet, and I am analyzing the relationships between the cells based on their formula references.  And yes, it is a pretty large spreadsheet with 25,000+ active cells, but truthfully it's one of my smaller ones.  I have several that consist of 500,000+ active calculation cells with about 40+ million edges, but before I tried those I wanted to "start small" as it were. :)