Way to Stop a Crawl without Losing all the Data?

Jan 17, 2013 at 10:13 PM
Edited Jan 17, 2013 at 10:18 PM


Okay, so I'm doing a data crawl on Twitter, and it's been going on for days.  It's at some 2,200 pages of downloads right now, and it's still going.  Is there any way to stop the crawl without losing the collected data so far?  Usually when it's been going for a few days, it'll either stop on its own and offer a partial data extraction, or the whole system crashes, and I have nothing.

Is there an in-between way of salvaging what's been captured without waiting to see if it stops on its own or crashes? 


Is there a natural limit on a crawl?  Is the top end about 80,000 nodes?  And if so, where does the limit come from?  Is my computer the limiting factor? The social media platform? The NodeXL software?


Is there a way to pause a crawl to save the data, so there is an intermediate point of data collection beyond which one would not lose the data? (Excel's automatic backup does not seem to work when using NodeXL to do a data crawl.)  


On a side note, is it ever possible to know at the beginning of a data crawl how long it might take, or does the computer only know to grab the next thing and not have a sense of the totality of its task?  (And if the computer can know, is there a way to throw that information up for the user?)

Jan 18, 2013 at 6:32 AM


To answer your specific questions:

1. There is no in-between way of salvaging what's been captured so far.

2. When asking for follows relationships with "Limit to X people" unchecked, there is no "natural" limit on a crawl.  NodeXL asks Twitter for all the followers and doesn't stop until Twitter says, "that's all."

3. There is no way to pause a crawl.

4. It's theoretically possible for us to look at the number of followers each person has and estimate how long it might take to get all of them, but we're not doing that right now.  I can see how that information might be useful.
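For what it's worth, a back-of-the-envelope estimate like the one described in point 4 could look something like the sketch below. The followers-per-request page size and the hourly request cap are illustrative assumptions for the sake of the arithmetic, not NodeXL's or Twitter's actual numbers.

```python
def estimate_crawl_hours(follower_counts, followers_per_request=100,
                         requests_per_hour=150):
    """Rough estimate of how long fetching every follower list might take.

    follower_counts: follower count for each account in the network.
    followers_per_request / requests_per_hour: assumed page size and
    hourly cap -- placeholders, not real Twitter limits.
    """
    total_requests = sum(
        -(-count // followers_per_request)  # ceiling division per account
        for count in follower_counts
    )
    return total_requests / requests_per_hour

# Example: five accounts with very different follower counts.
# 3 + 12 + 1 + 400 + 1 = 417 requests -> 417 / 150 = 2.78 hours
hours = estimate_crawl_hours([250, 1200, 80, 40000, 3])
```

The point of the exercise: one high-follower account (40,000 here) dominates the total, which is why a crawl that seems nearly done can suddenly stall on a single hub node.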

On a more general note, I don't think NodeXL is a good tool for getting the large volume of data you are interested in.  Twitter just makes that too difficult by 1) limiting how many requests programs like NodeXL can make each hour; and 2) reserving the right to stop providing data any time its servers get too busy.  We've had many reports of people waiting hours or days for a network, and then being unceremoniously kicked out for unknown reasons.  There is also the problem of network glitches, which NodeXL tries to recover from but gives up on after a certain number of tries.

NodeXL's Twitter importers work fine for small and medium networks for most people, and for larger networks for those NodeXL users who have been "whitelisted" by Twitter.  Whitelisting means that the per-hour limits are lifted.  Unfortunately, Twitter will no longer whitelist people.

-- Tony

Jan 18, 2013 at 11:30 AM

Hello, Tony:  Thanks for your detailed response. This helps immensely. 

Whitelisting definitely helps crawl speed. I was able to get whitelisted on Twitter about a month ago, so that was working for me then.

I have one other tool I'm trying that enables some of this data extraction. I'm not sure what else is out there that turns out such beautiful graphs and is as fun to use, though.