Size Limits on Twitter Import

Mar 14, 2012 at 2:58 PM

I'm including NodeXL in a workshop I'm giving on 25 low-cost/no-cost tools for evaluators. I've now used it successfully in a number of ways, but I want to gain a better understanding of its limitations, particularly around Twitter, and I couldn't find the documentation (sorry! Point me in the right direction and I can be more self-sufficient and read up).

The limit for a Twitter import around a hashtag seems to be about 1,500 entries, since that is the limit of the Twitter search API, and those entries are selected as the "more important" ones according to Twitter's somewhat nebulous definition of importance.

What I can't find is the limit on Twitter followers. When I import my own followers it is a multi-hour process, and I have fewer than 2,000. What is the limit on the number of followers that can realistically be imported (whether a function of Twitter limits or NodeXL/Excel limits)? Some of my workshop attendees represent large organizations, ranging up to the CDC with its 80,000 followers.

I realize that once a large dataset is imported, the visualization will need to be filtered to make any meaning from it, but am trying to understand the limits on import.

Thank you for guidance,

Susan

Mar 15, 2012 at 6:37 AM
Edited Mar 15, 2012 at 6:38 AM

Susan:

The bottleneck is due to limits that Twitter places on the number of information requests that programs like NodeXL can make per hour.  If you check the "I don't have a Twitter account" option in a NodeXL dialog box, then Twitter will answer only 150 requests before it forces NodeXL to pause for an hour.  Twitter will provide up to 100 followers per request, so it will take NodeXL at least 6 hours to get all the CDC's followers, for example.  But Twitter can and does throttle the maximum rate based on its server load (see https://dev.twitter.com/docs/rate-limiting/faq), so it can take even longer than that.  And we've seen reports of Twitter outright refusing additional requests, instead of just forcing NodeXL to pause, so in the worst case you can wait five hours and then get a "sorry, Twitter quit" message from NodeXL.  That has never happened to me, but it has to others.
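
For rough planning, here is a minimal Python sketch of that arithmetic (the 100-followers-per-request page size and the 150 and 350 hourly quotas are the figures above; Twitter can throttle them further at any time, so treat the results as lower bounds):

    import math

    FOLLOWERS_PER_REQUEST = 100  # followers returned per information request

    def import_hours(num_followers, requests_per_hour):
        """Hours of hourly quota needed to fetch one account's followers."""
        requests_needed = math.ceil(num_followers / FOLLOWERS_PER_REQUEST)
        # Each one-hour window allows requests_per_hour requests, then a pause.
        return math.ceil(requests_needed / requests_per_hour)

    print(import_hours(80_000, 150))  # unauthenticated: 6 hours for the CDC
    print(import_hours(80_000, 350))  # authenticated: about 3 hours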

You can improve the situation by checking the "I have a Twitter account" option in the NodeXL dialog box, in which case Twitter increases the rate to 350 information requests per hour.  You're still looking at a long wait, however.

Twitter used to offer something called "whitelisting" to qualified individuals, which caused the limits to go away.  They stopped whitelisting new people some time back.

So it's not good news for NodeXL users who want big networks, but I hope you at least know what's going on now.

-- Tony

Mar 15, 2012 at 12:08 PM

tcap479,

I'm fairly new to NodeXL, but I have certainly experienced the problems related to a Twitter search failing after the hour-long pause. It seems to happen with or without authentication when you hit the rate limit, apparently at random. Often I'll see NodeXL successfully pause and resume two or three times before failing in the fourth hour for some inexplicable reason.

I'm wondering if there is any way NodeXL could at least salvage the data that has been successfully received, rather than just returning nothing after Twitter fails 4 hours into a search?

Is there not some way that when the rate limit is reached, NodeXL pauses but also loads the data already gathered into the spreadsheet?  I don't see why a Twitter failure 4 hours later should result in the complete loss of the data already downloaded.  Perhaps I'm missing something.

I'm loving the program, and trying to convince my colleagues to love it, but this is causing me some embarrassment when 4 hours later we end up with zero data.  

Alternatively, is there a way to do a smaller search (thus avoiding the four-hour rate-limited marathon) that builds on a previous search? That is, a search that starts where a previous one left off?

Your thoughts would be welcome!

John

Mar 15, 2012 at 3:31 PM

Thanks Tony. I think I have a better understanding of it now. I've had the best luck running major requests in the early hours of the day, when traffic seems lightest.

It sounds like the process is likely best for relatively small networks.

Mar 15, 2012 at 6:23 PM

Can someone post the message that pops up when Twitter randomly quits after several hours?  It's never happened to me (I'm the NodeXL programmer), so I don't know what Twitter's message is in such cases.  If I know the exact message, I might be able to salvage a partial network, which would clearly make more sense than discarding the network and wasting hours of waiting.

You can usually press Ctrl-C to copy an error message to the Windows clipboard when a message box pops up.

Thanks,
    Tony

Mar 16, 2012 at 2:23 AM

Here are two different error messages I received.  

I ran a basic "import from Twitter search network" on one keyword, asking for follows, replies-to, and mentions relationships, limited to 500 people.

The first time, this error popped up just as NodeXL was querying Twitter after its second hour-long wait (that is, at the start of fetching the third set of data):

---------------------------

NodeXL

---------------------------

The network couldn't be obtained.  Details:

[IOException]: Unable to read data from the transport connection: An established connection was aborted by the software in your host machine.

 [SocketException]: An established connection was aborted by the software in your host machine

---------------------------

OK   

---------------------------

So I re-ran the exact same import, and again, after two hours, just as it was starting to query for the third time, it gave me this error:

---------------------------

NodeXL

---------------------------

The network couldn't be obtained.  Details:

The Twitter Web service refused to provide the requested information.  A likely cause is that you have made too many requests in the last hour.  (Twitter limits information requests to prevent its service from being attacked.  Click the 'Why this might take a long time: Twitter rate limiting' link for details.)

Wait 60 minutes and try again.

---------------------------

OK   

---------------------------

The second of the two errors is the one I get most of the time. It doesn't make any sense, though, because, as I said, it usually pops up just after NodeXL has finished its hour-long wait following a rate limit. It happens on both unauthenticated and authenticated searches. And it's not always two hours - I've had this error after 3 or 4 hours too.

Hope that helps.

John

Mar 16, 2012 at 5:02 PM

Thank you, John.

-- Tony

Apr 18, 2012 at 6:42 PM

Hi Tony - any update on a workaround for the above? I'm having the same problems, same error messages. Frustrating to go for 9 hours and then get cut off and left with no data. Some is certainly better than none!

Many thanks!

-Charlton

Apr 18, 2012 at 6:49 PM

Charlton:

Yes.  The most recent NodeXL release (version 1.0.1.209, on 2012-04-17) includes this change:

* If Twitter refuses to provide more information even after NodeXL pauses for "rate limiting," you will now be given the option to import the partial network that was obtained at that point. Previously, the partial network was discarded and all your time was wasted.

You can get the most recent release at http://nodexl.codeplex.com/releases.

-- Tony

Apr 18, 2012 at 6:56 PM

Never mind, Tony. I just downloaded the latest version of the software and see that the fix is included. Thanks!

May 10, 2012 at 9:17 AM
Edited May 10, 2012 at 9:18 AM

Can you estimate how long it will take to "import from Twitter user's network" if I choose the following options:

User has 15,000 followers, is following 7,000

Add a vertex for both followers and following

Add an edge for each followed/following relationship

1.5 levels

I have a Twitter account and have authorized it

(Note:

I tried this with much more demanding options (2 levels, replies and mentions relationships, include latest tweet column). It kept having to retry each hour, had only obtained a partial network after about 15 hours, and was taking a long time to show the graph for a sheet with 196K edges, so I gave up. The file so far is 82 MB.

I want to try this again with the proposed lesser options but don't want to start the task if I'm not going to be able to finish it tomorrow.)

May 10, 2012 at 5:58 PM

Programmer's rough calculations:

For the 1.0-level starting point, the IDs of the user's followers and followings are obtained 5,000 at a time, so the user's 22,000 followers and followings take 5 requests.

Follower and following names are obtained 100 at a time, so the 22,000 names take 220 additional requests.

For the 1.5-level connections, NodeXL must get the follower and following IDs for each of those 22,000 followers and followings. Assuming each of them has no more than 5,000 followers and followings, this requires 22,000 additional requests.

Total requests: At least 22,225.

Maximum requests allowed before a one-hour pause: 350.

Minimum number of one-hour pauses: 64.

Answer: It will take at least 3 days.  And that assumes that Twitter won't arbitrarily kick you out in the middle of getting the network, which it has been known to do at times of high traffic.
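
For anyone who wants to redo that estimate for a different account, here is a minimal Python sketch of the same arithmetic, under the same optimistic assumption that each neighbor's lists fit in a single request:

    import math

    IDS_PER_REQUEST = 5_000    # follower/following IDs returned per request
    NAMES_PER_REQUEST = 100    # screen names resolved per request
    REQUESTS_PER_HOUR = 350    # authenticated hourly quota at the time

    def hours_for_15_level(followers, followings):
        neighbors = followers + followings                    # 22,000 here
        id_reqs = math.ceil(neighbors / IDS_PER_REQUEST)      # 5
        name_reqs = math.ceil(neighbors / NAMES_PER_REQUEST)  # 220
        # 1.5 levels: one ID request per neighbor, assuming each
        # neighbor's lists fit in a single 5,000-ID page.
        neighbor_reqs = neighbors                             # 22,000
        total = id_reqs + name_reqs + neighbor_reqs           # 22,225
        return total / REQUESTS_PER_HOUR

    print(hours_for_15_level(15_000, 7_000))  # ~63.5 one-hour windows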

-- Tony

May 10, 2012 at 6:44 PM

Thanks Tony. I'll start it on my desktop PC and see if it will run successfully over the weekend.

Angela

May 24, 2012 at 9:11 AM

I have a more sophisticated version of this problem, which doesn't relate to NodeXL directly, but maybe you can give me some clues for solving it.

Suppose I've downloaded n tweets with hashtag #somehashtag within t hours, hitting the 350-requests-per-hour limit during each hour. How do I estimate the overall number N of tweets with hashtag #somehashtag that passed through Twitter during those hours?

Do you know any measures or methods to do this?

May 24, 2012 at 3:53 PM

I don't, but I wonder if there is some Web site out there that aggregates and makes such information available. Perhaps someone else has suggestions.

-- Tony

Oct 17, 2013 at 11:28 AM

Hi Tony,

Could you estimate how long it will take to "import from Twitter user's network" if I choose the following options:

User has 510,000 followers, is following 151

Add a vertex for each person following the user

Add an edge for each followed/following relationship

1.5 levels

Limit to 100 people

I have a Twitter account and have authorized it

(I have been waiting for an hour or so. NodeXL pauses every 15 minutes and still nothing. I have a project I have to finish before tomorrow. I thought limiting it to a maximum of 100 people would get the data quicker, but it isn't working.)

Thanks!

Oct 17, 2013 at 1:21 PM

Tatiana,

I suspect the problem is that this is just computationally intensive, and depending on how NodeXL makes its API calls, it could also be hitting the rate limit a lot.

If the starting node has 500,000 friends/followers (that's a lot!), then some of those followers are also probably really popular (the "birds of a feather" phenomenon in SNA).

Even though you're limiting your results to 100 out of 500,000, because you're asking for a 1.5-level graph, NodeXL still has to compare each of those 100 people against potentially tens or hundreds of millions of possible connections.

Think of it like this: if you have 100 people and you want to find the connections between them and the "ego" node, you just have to look at the ego node's friend/follower list and, if a person is there, draw an edge. That's only 100 "checks", and only 1 API call to do it.

But if you want the connections among all 100 people, then NodeXL has to fetch and compare all of their friend/follower lists. That's 100 API calls (depending on how NodeXL actually codes the API accesses). Then, once you have the 100 people's follower lists, depending on who these people are, they may have thousands or even hundreds of thousands of friends/followers (millions if you're looking at celebrities), and NodeXL has to look at each one to determine whether it is one of the 100 original nodes and, if so, draw an edge.

So, say each person has on average 10,000 followers. That's 1,000,000 list entries, each of which has to be checked against the 100 original nodes - so 100 million comparisons. And it has to do this TWICE (once for follower relationships and once for friend relationships).

Add to that the Twitter API rate limit, which only lets NodeXL grab a limited number of follower IDs per 15 minutes, and you could be waiting a long time.

Tony may be able to give you more insight into the way NodeXL makes its API calls in cases like this, but just computationally, it could take a while, depending on how powerful your computer is.

I would suggest choosing a seed node with far fewer followers - someone not famous and unlikely to have a lot of popular friends/followers. Since you only want a graph with 100 nodes, choosing someone with 500,000 followers is probably excessive.

John

Oct 17, 2013 at 7:22 PM
Edited Oct 17, 2013 at 7:23 PM

Tatiana and John:

This is an older conversation and the Twitter rate limits have changed dramatically since I replied to wangela. My reply to her is no longer correct.

John, you've pointed out two bottlenecks in getting the network: computational limits and Twitter rate limits. NodeXL uses a quick-lookup dictionary to do those millions of comparisons you mentioned, so they're actually pretty fast. That dictionary (and the network itself) requires memory, though, so there is a chance your computer could run out of memory with a large enough network.
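
To illustrate the idea with a toy Python sketch (the names are made up for illustration; this is not NodeXL's actual code):

    # With the chosen users' IDs in a hash set, each ID in a fetched
    # follower list becomes a constant-time membership test.
    original_ids = {1001, 1002, 1003}  # stand-in for the 100 chosen users

    def edges_from_follower_list(user_id, follower_ids):
        """Return (follower, user) edges whose follower is a chosen user."""
        return [(f, user_id) for f in follower_ids if f in original_ids]

    # Even a celebrity-sized list of 5 million IDs is scanned in seconds,
    # but the set and the accumulated edges all have to fit in memory.
    edges = edges_from_follower_list(9999, range(5_000_000))
    print(len(edges))  # 3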

With a "limit to 100" setting, you shouldn't hit computational limits, but the new Twitter rate limits will try your patience.

Twitter now allows NodeXL to ask for a user's follow relationships 15 times per 15 minutes, or one per minute. In this case it has to ask for relationships for 101 users. That will take about an hour and 41 minutes. If you specified "Add a vertex for both", it will take twice that long.
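
In code form, a quick sketch of that arithmetic:

    requests_per_window = 15  # follow-relationship requests per window
    window_minutes = 15       # so effectively one request per minute
    users_to_query = 101      # the seed user plus the 100-person limit

    minutes = users_to_query * window_minutes / requests_per_window
    print(minutes)            # 101.0 minutes, about 1 hour 41 minutes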

We have had reports of Twitter randomly refusing further requests smack in the middle of a large network, so there is a chance that NodeXL won't be able to give you what you want. You'll get an error message explaining what happened in that case.

-- Tony