Twitter collection creates duplicate edges (thus bad data)?

Sep 2, 2011 at 3:56 AM
Edited Sep 2, 2011 at 4:01 AM

Love the tool, most of the time. I have a question about what's going on with the Tweet-collecting. I've been collecting Tweets from a search using a hashtag and a date range. At first I was concerned I wasn't getting all the tweets, but it seems that you can now get all the tweets with the addition of NonRepliesToNonMentionsEdges. But now I'm worried about duplicates and their (negative) impact on the accuracy of graphs, data, and metrics generated by NodeXL.

Here are some excerpts from three separate searches, plus a fourth with all options enabled. I reformatted the rows for easier reading. The only difference between the searches is which edges NodeXL is told to create:

Create only RepliesToEdges:
missnett1	Replies to	fringeobservers	9/1/2011 14:02	@fringeobservers is giving away #Fringe season 3 on bluray or dvd. Please let me #winfringe!!! PLEASE!!!

Create only MentionsEdges:
missnett1	Mentions	fringeobservers	8/31/2011 16:06	I can't wait for season 4 of #Fringe. I want to #winfringe season 3 on bluray or DVD from @fringeobservers!
missnett1	Mentions	fringeobservers	9/1/2011 14:02	@fringeobservers is giving away #Fringe season 3 on bluray or dvd. Please let me #winfringe!!! PLEASE!!!

Create only NonRepliesToNonMentionsEdges:
missnett1	Tweet	missnett1	8/31/2011 16:06	I can't wait for season 4 of #Fringe. I want to #winfringe season 3 on bluray or DVD from @fringeobservers!
missnett1	Tweet	missnett1	9/1/2011 14:02	@fringeobservers is giving away #Fringe season 3 on bluray or dvd. Please let me #winfringe!!! PLEASE!!!

Create all three edge-types:
missnett1	Mentions	fringeobservers	8/31/2011 16:06	I can't wait for season 4 of #Fringe. I want to #winfringe season 3 on bluray or DVD from @fringeobservers!
missnett1	Mentions	fringeobservers	9/1/2011 14:02	@fringeobservers is giving away #Fringe season 3 on bluray or dvd. Please let me #winfringe!!! PLEASE!!!
missnett1	Replies to	fringeobservers	9/1/2011 14:02	@fringeobservers is giving away #Fringe season 3 on bluray or dvd. Please let me #winfringe!!! PLEASE!!!

Also keep in mind that RTs all show up as Mentions, but "missnett1" never actually RTs @fringeobservers, so that case isn't captured here.

So, what's going on here? Why are there two loop-edges created as "Tweets" (i.e., Tweet that is not a "replies-to or mentions" as it reads in the UI) that are also created as replies-to and mentions? What am I missing / doing wrong?

And then there's some weird overlap between these categories, right? The status missnett1 posts at 14:02 is created as both a mentions-edge and a replies-to edge. What's the difference then? Is it just that "replies-to" detects that the target vertex is the first token in the Tweet content? That's fine if that's the case, but (a) it's an interpretation that doesn't always hold, and (b) you still end up with a duplicate edge because of the mentions-edge. I guess you can't reply to someone without mentioning them, but that's not really why both edges are created, right?

Wouldn't it be better for edge types - Tweets, mentions, replies - to be mutually exclusive and completely constitutive of an ostensible Status category?

Please feel free to point out any boneheaded mistakes I may have made to get to this point. And also, my research has nothing to do with @fringeobservers, although that show is really bad-ass.

Thanks.

Sep 2, 2011 at 5:52 PM
Edited Sep 2, 2011 at 5:53 PM

Hello, Warren:

You're not making any boneheaded mistakes, but NodeXL is.  From the new bug report created as a result of your post:

'In the Twitter Search network (NodeXL, Data, Import, From Twitter Search Network), if you checked only the "Tweet that is not a replies-to or mentions" edge option, you would get edges that were actually replies-to or mentions. That option did work properly if you also checked the "Replies-to" and "Mentions" options, but it gave inaccurate results if it was the only checked option.'

This bug will be fixed in the next NodeXL release (version 1.0.1.176), due out in about two weeks.  In the meantime, you can work around the bug by checking all three checkboxes and then filtering the results for whichever edges you actually want.  Thank you for reporting this.
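
For anyone who exports the edge list and prefers to filter outside Excel, the workaround might look roughly like the Python sketch below. The `Edge` field names and the `keep_relationships` helper are made up for illustration and are not part of NodeXL; the rows are the three from the "all three edge-types" search in the first post.

```python
from collections import namedtuple

# Hypothetical row layout; the field names are assumptions, not NodeXL column names.
Edge = namedtuple("Edge", "vertex1 relationship vertex2 date tweet")

rows = [
    Edge("missnett1", "Mentions", "fringeobservers", "8/31/2011 16:06",
         "I can't wait for season 4 of #Fringe. I want to #winfringe "
         "season 3 on bluray or DVD from @fringeobservers!"),
    Edge("missnett1", "Mentions", "fringeobservers", "9/1/2011 14:02",
         "@fringeobservers is giving away #Fringe season 3 on bluray or dvd. "
         "Please let me #winfringe!!! PLEASE!!!"),
    Edge("missnett1", "Replies to", "fringeobservers", "9/1/2011 14:02",
         "@fringeobservers is giving away #Fringe season 3 on bluray or dvd. "
         "Please let me #winfringe!!! PLEASE!!!"),
]

def keep_relationships(edges, wanted):
    """Keep only the edges whose relationship label is in `wanted`."""
    wanted = set(wanted)
    return [e for e in edges if e.relationship in wanted]

# With all three options checked, neither status produced a "Tweet" edge,
# so filtering for "Tweet" leaves nothing for this user.
only_tweets = keep_relationships(rows, ["Tweet"])
only_replies = keep_relationships(rows, ["Replies to"])
```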

On your other points, a "replies-to" is indeed a tweet that mentions another included tweeter at the start of the tweet.  By "included tweeter" I mean a person who is included in the Vertices worksheet.

Mentions and replies-to cannot be mutually exclusive, because a reply-to is by nature also a mentions.
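
To make those definitions concrete, here is a minimal Python sketch of the classification as described above. It is not NodeXL's actual code: the function name, the regular expressions, the case-insensitive matching, and the choice to emit one mentions edge per distinct user are all assumptions.

```python
import re

MENTION_RE = re.compile(r"@(\w+)")       # any @user in the text
LEADING_RE = re.compile(r"\s*@(\w+)")    # @user at the very start

def classify_edges(author, text, included):
    """Return (source, relationship, target) tuples for one tweet.

    `included` stands in for the set of users on the Vertices worksheet.
    """
    included = {name.lower() for name in included}
    edges = []

    # Replies-to: the tweet starts with an included @user.
    start = LEADING_RE.match(text)
    if start and start.group(1).lower() in included:
        edges.append((author, "Replies to", start.group(1).lower()))

    # Mentions: every distinct included @user anywhere in the text,
    # which by this definition also covers the replied-to user.
    seen = set()
    for name in MENTION_RE.findall(text):
        name = name.lower()
        if name in included and name not in seen:
            seen.add(name)
            edges.append((author, "Mentions", name))

    # Anything else becomes a self-loop "Tweet" edge.
    if not edges:
        edges.append((author, "Tweet", author))
    return edges
```

Run against the 14:02 status from the first post, this yields both a "Replies to" and a "Mentions" edge to fringeobservers, which is the overlap Warren noticed.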

-- Tony

 

Sep 2, 2011 at 10:18 PM

On second thought, my last sentence was a circular argument.  A reply-to is also a mentions only because we defined it that way.  I'll bring up the idea of making them all mutually exclusive at our next meeting.

-- Tony

Sep 2, 2011 at 10:38 PM

That's great news about the update. Thanks, Tony. Good to see the tool has active development behind it.

I agree with your last comment AND the one before about mentions and replies. That's what I was getting at originally when I said that you can't reply to someone without mentioning them. I think that defining a reply-to by interpreting a status with an @user as the first token is too fuzzy of a category. Me, I'd like to see these edges created:

  • tweets as self-loops (as they are now) but EXCLUDING statuses that have an @user in the content. Also, I'd like to see the option to remove loops like in Pajek (if this is already a feature I haven't seen it) or at least to not show them in visualizations (as Gephi does it).
  • retweet edges between A and B for each instance where @A has a status that contains "RT @B" but not necessarily at the beginning of a status.
  • mention edges between A and B for each instance where @A has a status in which @B appears and not in the form "RT @B".
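
Here is a rough Python sketch of the three categories above, just to show they can be made mutually exclusive. The function name and regular expressions are my own assumptions (for instance, "RT" is matched in uppercase only and must be followed by whitespace), and nothing here reflects how NodeXL is implemented.

```python
import re

RETWEET_RE = re.compile(r"\bRT\s+@(\w+)")  # "RT @B" anywhere in the status
MENTION_RE = re.compile(r"@(\w+)")         # any @B in the status

def proposed_edges(author, text):
    """Return (source, relationship, target) tuples under the proposed scheme."""
    edges = []
    retweeted_at = set()  # start offsets of user names already counted as retweets

    # Retweet edge for each "RT @B", not necessarily at the beginning.
    for m in RETWEET_RE.finditer(text):
        edges.append((author, "Retweet", m.group(1)))
        retweeted_at.add(m.start(1))

    # Mention edge for each @B that is not part of an "RT @B".
    for m in MENTION_RE.finditer(text):
        if m.start(1) not in retweeted_at:
            edges.append((author, "Mention", m.group(1)))

    # Self-loop only when the status contains no @user at all.
    if not edges:
        edges.append((author, "Tweet", author))
    return edges
```

Under this sketch each @user occurrence produces exactly one edge, and a status with no @user produces exactly one self-loop, so no status is counted under two categories for the same target occurrence.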

Tweet-data is social data, so it's inherently messy. I think the categories thus defined reasonably capture different basic relations without "over-interpreting" what a string of text categorically means. Again, thanks for the ongoing development efforts, and thanks for listening to my laundry list of selfish demands. <g>