Extracting a subgraph when there is insufficient memory to lay out the graph

Apr 15, 2012 at 9:51 AM

Hi,

Please forgive me for this simple question!

I have a rather large network (40,000 nodes, approx. 400,000 edges). I want to export the subgraph around one specified node, to some depth level, into a new NodeXL workbook. (I want the edges of the subgraph, not the subgraph image.)

So first I must show the graph in the graph pane using some layout algorithm, then right-click the vertex and choose Select Subgraphs from the right-click menu. The problem is that the network is so large that the computer cannot show the graph in the graph pane; it reports an exception saying there is not enough memory to lay out the graph. So I cannot right-click the specified node and choose Select Subgraphs. (My computer has 2 GB of memory and runs 32-bit Windows 7.)

Does NodeXL have another way to export the subgraph into a new NodeXL workbook, or must I upgrade my computer to 4 GB of memory and a 64-bit OS?

Thanks in advance!

jojo0214.

Apr 16, 2012 at 3:28 PM

One thing you might try is to change the layout algorithm to something that uses less memory.  The default Fruchterman-Reingold algorithm uses a lot of memory (as does the Harel-Koren Fast Multiscale algorithm), and that's where your computer may be running out.  Try changing NodeXL, Graph, Layout to Random and see if NodeXL succeeds in showing the graph.  It won't be pretty, but perhaps you don't care about that for your purposes.

I should mention that even if you don't run out of memory, NodeXL is going to be painfully slow to use with 400,000 edges.  It was intended for small- to medium-sized graphs of perhaps several thousand edges, and it slows down dramatically when you get into the hundreds of thousands.

-- Tony

Coordinator
Apr 16, 2012 at 4:10 PM
Edited Apr 16, 2012 at 4:40 PM

To build on Tony's reply, I also should call out some best practices for dealing with larger scale networks in NodeXL.

NodeXL and all network analysis tools have multiple levels, each with its own performance and scale constraints.  Often the most restrictive level's constraints are imposed on all the others.  For NodeXL, one of the more performance-intensive tasks is the visual display of the network graph.  If you avoid this step, NodeXL can still perform some useful work for you, like calculating metrics and summary data (however slowly) and extracting sub-graphs.  You will still face constraints at other levels, like the worksheet row limit in Excel (1,048,576 rows in Excel 2007 and later).

If you close the network display pane the overhead of drawing the network is removed.

Then, from the Vertex worksheet you can select just the vertices you wish to include in your (more manageably sized) sub-graph.

Right-click one of the selected Vertices and pick "Select Subgraphs".

You can then use the NodeXL>Data>Export>Selection To New NodeXL Workbook feature to export your sub-graph.

The resulting sub-graph might be modest enough in size to enjoy the full set of NodeXL features.

I should note that your hardware could be significantly upgraded to improve scale and performance.  You mention you are currently running 2 GB of memory and 32-bit Windows 7, and you propose 4 GB and 64-bit Windows.  May I suggest that 8 or even 16 GB of RAM with 64-bit Windows 7 is an even more appropriate environment for large-scale graph analysis on the desktop/laptop?  NodeXL loves RAM.

That said, even with lots of RAM, you will still face constraints at other levels, eventually hitting the limits of the CPU.

Visualization constraints are even lower, depending on screen size.  Even with very large displays, the number of discrete nodes that can be displayed meaningfully remains relatively low: in the thousands, not millions.  (Just consider the number of pixels on a typical screen: most have about 1 million pixels, and nodes need more than a pixel to be meaningfully displayed.)  Therefore, the judicious use of the NodeXL "Groups" and "Collapse Groups" features can be very helpful.

See the following posts for more information about creating and managing groups of vertices in NodeXL:

http://www.connectedaction.net/2011/05/04/nodexl-v-167-new-features-for-handling-groups-of-nodes-in-a-network-and-a-few-other-things/

http://www.connectedaction.net/2011/04/26/nodexl-conditionally-autofill-collapse-group-in-v-166/

http://www.connectedaction.net/2011/04/25/nodexl-clusters-components-and-groups-creating-and-managing-collections-of-vertices/
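
To put rough numbers on the pixel argument above, here is a small back-of-the-envelope sketch. The 1280 x 800 screen and the 10 x 10 pixel node size are illustrative assumptions, not NodeXL settings, and the function name is hypothetical.

```python
def max_drawable_nodes(width_px, height_px, node_side_px):
    """Upper bound on non-overlapping square nodes that fit on a screen."""
    # Count how many node-sized cells fit across and down, then multiply.
    return (width_px // node_side_px) * (height_px // node_side_px)


# A 1280 x 800 display has 1,024,000 pixels (about 1 million).
# With 10 x 10 pixel nodes and no room left for edges or labels:
print(max_drawable_nodes(1280, 800, 10))  # prints 10240
```

Even this generous bound, which ignores edges and labels entirely, is well below the 40,000 vertices in the original question.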

Regards,

Marc

Apr 17, 2012 at 1:54 AM
Edited Apr 17, 2012 at 3:34 AM

Hi, marcsmith

I understand what you mean. I closed the network display pane, opened the Vertex worksheet, and right-clicked one specified node. However, the right-click menu did not display "Select Subgraphs". (I know that NodeXL's main overhead is graph visualization; that is why I asked earlier how to get my subgraph workbook without showing the graph.)

Could the version of NodeXL or Excel (2007) be causing this?

Thanks in advance.

jojo0214

Apr 17, 2012 at 3:38 AM

Hi, tcap479. I tried your advice. With the "Random" layout algorithm the computer became much slower, and in the end it could not show the graph and reported the exception below. (Is there a way to extract subgraphs directly if I close the network display pane?)

---------------------------
NodeXL
---------------------------
An unexpected problem occurred.  If it occurs again, please copy the details to the clipboard by typing Ctrl-C, then post the details to http://www.codeplex.com/NodeXL/Thread/List.aspx.

Details:

[COMException]: Exception from HRESULT: 0x800AC472

Server stack trace: 

Exception rethrown at [0]: 
   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
   at Microsoft.Office.Interop.Excel._Application.Intersect(Range Arg1, Range Arg2, Object Arg3, Object Arg4, Object Arg5, Object Arg6, Object Arg7, Object Arg8, Object Arg9, Object Arg10, Object Arg11, Object Arg12, Object Arg13, Object Arg14, Object Arg15, Object Arg16, Object Arg17, Object Arg18, Object Arg19, Object Arg20, Object Arg21, Object Arg22, Object Arg23, Object Arg24, Object Arg25, Object Arg26, Object Arg27, Object Arg28, Object Arg29, Object Arg30)
   at Smrf.AppLib.ExcelUtil.TryIntersectRanges(Range range1, Range range2, Range& intersection)
   at Smrf.AppLib.ExcelTableUtil.TryGetTableColumnData(ListColumn column, Range& tableColumnData)
   at Smrf.AppLib.ExcelTableUtil.TryGetTableColumnData(ListObject table, String columnName, Range& tableColumnData)
   at Smrf.NodeXL.ExcelTemplate.Sheet2.OnGraphLaidOut(GraphLaidOutEventArgs e)
   at Smrf.NodeXL.ExcelTemplate.Sheet2.ThisWorkbook_GraphLaidOut(Object sender, GraphLaidOutEventArgs e)
   at Smrf.NodeXL.ExcelTemplate.ThisWorkbook.TaskPane_GraphLaidOut(Object sender, GraphLaidOutEventArgs e)
---------------------------
OK
---------------------------

Apr 17, 2012 at 4:03 AM

I don't think NodeXL is appropriate for your application, because 40,000 vertices and 400,000 edges is simply too much for it to handle.  I have never tested it with more than 30,000 edges, and I stopped there because the performance rapidly deteriorated when I tried more.  (I'm the NodeXL programmer.)  I'm sorry I don't have a solution for you.

-- Tony

Apr 17, 2012 at 8:09 AM

Hi, Tony

I think there may be a solution for my application. This morning I tried another way to check NodeXL's performance. First, I closed the graph display pane, then opened the Vertex worksheet and left-clicked one specified node. Then I chose "NodeXL>Analysis>Subgraph Images" to create a subgraph BMP image. The computer created the image very quickly at depth levels of 1.5, 2, and 2.5. So I believe the computer could also extract all the edges of the subgraph (at levels 1.5, 2, and 2.5) into a new NodeXL workbook.

Now the biggest problem is: how can I right-click one of the selected vertices and pick "Select Subgraphs" without using the graph display pane? (The overhead of graph visualization is extremely high, and we don't need visualization.)

(PS: I tried the way Marc suggested, but my right-click menu didn't display "Select Subgraphs"; it is just an ordinary Excel 2007 right-click menu. So could a new function be added, through programming, that creates a subgraph and exports it into a new NodeXL workbook without using the graph display pane?)

Apr 17, 2012 at 4:14 PM
Edited Apr 17, 2012 at 6:44 PM

The lack of NodeXL menu items when you right-click a row in the Vertices worksheet is a known bug.  It affects only some computers, and I have not been able to figure out the difference between computers that have the bug and those that don't.  (It's not an Excel 2007 vs. 2010 issue; I know that.)  The bug does not affect any of my computers, which is why I haven't been able to track it down and fix it yet.

However, even if you had a Select Subgraphs item on your right-click menu, it would not solve your problem.  Select Subgraphs actually uses the graph in the graph pane to do its work behind the scenes, even if the graph pane is closed, and if the graph hasn't been shown yet, "Select Subgraphs" is grayed out.  So we're back to the problem of being able to show the graph before you can select a subgraph, and you are having problems showing the graph because it has so many edges.

There is a chance that if you run NodeXL on a 64-bit version of Windows with a 64-bit version of Office and a lot of memory, NodeXL will be able to show your large graph, from which you could then select and export a subgraph.  That's a big "if," though, and you may invest a lot of time, money and effort only to find that it still doesn't work.  If it were me, I would find some other tool to extract the subgraph I was interested in, and then use NodeXL to show and interact with the small- to medium-sized subgraph, a task for which it is very well suited. 

-- Tony

Apr 17, 2012 at 10:16 PM
Hi Marc,

Thank you for the links to the Connected Action articles announcing
enhancements to new versions of NodeXL. Have these been posted right
along? In following the discussions, enhancements have been mentioned that
I have never heard of. Are you considering putting out an updated manual?
I, for one, would find it VERY helpful if all of the additions were
collected in one place.

Thank you and Tony for continuing to maintain and improve this most
valuable tool.

Peter

Coordinator
Apr 17, 2012 at 11:46 PM

Hello!

Yes, http://connectedaction.net is a good place to look for NodeXL updates and short tutorials.

Please note that the NodeXL help file is kept up to date with all features.

A new volume about NodeXL is in the works, stay tuned!

-

Marc

Apr 18, 2012 at 12:46 AM

Hi, Peter:

If you would like to be notified of new NodeXL releases, which occur with some regularity, you can sign up for notifications on our CodePlex page at http://nodexl.codeplex.com/releases.  Look for the "Release Notification" link.

Also, there is a painfully detailed list of NodeXL changes at http://nodexl.codeplex.com/wikipage?title=CompleteReleaseHistory.

-- Tony

Apr 18, 2012 at 8:28 AM

Hi, Tony

I really appreciate your advice. I have one simple request.

Could you recommend some tools that use less memory for extracting subgraphs? Your words, "If it were me, I would find some other tool to extract the subgraph I was interested in, and then use NodeXL to show and interact with the small- to medium-sized subgraph, a task for which it is very well suited," have enlightened me. I will find a tool to extract the subgraph and then use NodeXL to measure some metrics of the subgraphs. (I think NodeXL is powerful software for analyzing complex networks, and I love it.)

Thank you.

jojo0214

Apr 18, 2012 at 5:34 PM

I suspected that would be the next question!  No, I don't have a tool to recommend, but perhaps others reading this can help out.

To summarize, JoJo has a large network (400,000 edges) from which he wants to extract a subgraph for a particular vertex.  What program can he use to do this?

-- Tony
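
For readers with the same question, the extraction step described above can be sketched in a few lines of Python, outside NodeXL. This is a minimal sketch under stated assumptions, not a NodeXL feature: it assumes the edges are available as pairs of vertex names (for example, copied out of the Edges worksheet), treats the graph as undirected, and uses a hypothetical function name, `extract_subgraph`. It keeps every edge between vertices within `depth` hops of the start vertex, so an integer depth here corresponds roughly to NodeXL's next half level (depth 1 to level 1.5); that mapping is an assumption and is not confirmed in this thread.

```python
from collections import defaultdict, deque


def extract_subgraph(edges, start, depth):
    """Return the edges among all vertices within `depth` hops of `start`."""
    # Build an undirected adjacency map from the edge list.
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Breadth-first search, recording the hop count of each reached vertex.
    hops = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if hops[vertex] == depth:
            continue  # do not expand beyond the requested depth
        for nxt in neighbors[vertex]:
            if nxt not in hops:
                hops[nxt] = hops[vertex] + 1
                queue.append(nxt)

    # Keep every edge whose two endpoints were both reached.
    return [(u, v) for u, v in edges if u in hops and v in hops]


if __name__ == "__main__":
    sample = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
    print(extract_subgraph(sample, "a", 1))
```

Memory use grows with the number of edges rather than with any layout, so 400,000 edges should be far less demanding than drawing the graph. The returned pairs could then be pasted into the Edges worksheet of a new NodeXL workbook.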

Apr 23, 2012 at 8:49 PM

I work with large networks a lot and have successfully imported / exported to and from Pajek (http://pajek.imfm.si/doku.php?id=download), which is a good open source program for working with large networks.  It has some good methods for reducing large networks to more manageable sizes.  I'll concur that 30,000 edges is pretty much a practical upper limit.  In the past week I've run networks of 21,000 edges, 32,000 edges and about 40,000 edges, and they took 15 seconds, 1 minute and several minutes to run, respectively.  Your best bet (or at least mine) is to use Pajek or database tools to reduce or extract sub-networks of about 20,000 edges or fewer, which is manageable in NodeXL.  Having said all that, I also have to say that NodeXL has rapidly become my favorite network program.  I'll use others when I have to, but this is now my go-to program.  Thanks to Marc and the development team!
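
As an illustration of moving an edge list into Pajek, the sketch below writes a plain list of vertex-name pairs as a Pajek .net file (a `*Vertices` section with numbered, quoted labels, then an `*Edges` section of number pairs). The function name `write_pajek` is hypothetical, the graph is assumed to be undirected and unweighted, and labels are assumed not to contain double quotes. It is not the export path the poster used.

```python
def write_pajek(edges, path):
    """Write an undirected edge list as a Pajek .net file; return the id map."""
    # Pajek identifies vertices by consecutive integers starting at 1.
    ids = {}
    for u, v in edges:
        for name in (u, v):
            if name not in ids:
                ids[name] = len(ids) + 1

    with open(path, "w", encoding="utf-8") as out:
        out.write(f"*Vertices {len(ids)}\n")
        for name, number in ids.items():
            # Assumes labels contain no double quotes.
            out.write(f'{number} "{name}"\n')
        out.write("*Edges\n")  # undirected lines; directed graphs use *Arcs
        for u, v in edges:
            out.write(f"{ids[u]} {ids[v]}\n")
    return ids
```

Calling `write_pajek([("x", "y"), ("y", "z")], "graph.net")` produces a three-vertex file with two edge lines. Some Pajek versions may expect a different text encoding than UTF-8, so the `encoding` argument is worth checking if labels are not plain ASCII.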

Apr 24, 2012 at 1:13 AM

Thank you! I will try Pajek to reduce the size of this graph, and thanks again for your sincere advice!

Apr 24, 2012 at 5:34 PM

You're welcome.  Actually, I just ran across this paper this morning.  It looks interesting and potentially useful.  

Willer, D., van Assen, M. A. L. M. and Emanuelson, P. "Analyzing Large Scale Exchange Networks." Social Networks 34(2): 171-180.

Exchange theories or their implementations in algorithms have limited utility because they can be applied only to quite small networks. They cannot be applied to larger networks until that size limit is removed. Domain Analysis cuts networks into smaller pieces at the boundaries of strong power domains. Domain Analysis identifies strong power and breaks, and distinguishes domains that function exactly as they would were they free-standing, and components that do not. Support for the finding of breaks and the distinction between domains and components are obtained using both experimental data and simulations based on X-Net. To illustrate the use of Domain Analysis, it is applied to find the incidence of strong power in large exchange networks. The application shows that the incidence of strong power decreases as network density increases, and that strong power occurs only infrequently in dense networks. We conclude by calling for ever more general analytic procedures.