Import JSON data from the Twitter Streaming API to build a network

Mar 31, 2014 at 5:20 PM
Edited Mar 31, 2014 at 5:20 PM
If my understanding is correct, the Twitter Search feature is based on the Search API, which goes back ~7 days.

Recently I've been collecting data from the Twitter Streaming API, then extracting the relationships from the data using a Python script called MentionMapper. This works well for getting the bare network, but it leaves behind all the valuable associated metadata. Since I'm not a programmer by any stretch of the imagination, I was wondering if it would be possible to somehow import the raw JSON data into NodeXL to process, like it does with the Search API data?

My hope is to be able to use all of the awesome features of NodeXL (like Top URLs, Top Domains, Top Hashtags, Top Words, Top Replied-To, etc.) in my analysis of Twitter Streaming API data.
Mar 31, 2014 at 7:10 PM
Hello, Curtis:

Yes, that's correct: NodeXL uses the Twitter Search API, which only goes back a week or so.

I haven't used MentionMapper, and I can't seem to find it online. Can you tell me what it outputs? (A sample line would be good, if that's possible.) I'm wondering if its output could be rearranged and copied to appropriate columns in a NodeXL workbook, which would then let you use NodeXL's "Twitter Search Network Top Items" feature.

-- Tony
Mar 31, 2014 at 7:47 PM
Edited Mar 31, 2014 at 7:49 PM
Here is a link to mentionMapper.py: https://github.com/alexhanna/hse-twitter/blob/master/bin/mentionMapper.py
#!/usr/bin/env python

import json, sys

def main():

    for line in sys.stdin:
        line = line.strip()

        data = ''
        try:
            data = json.loads(line)
        except ValueError as detail:
            continue

        if not (isinstance(data, dict)):
            ## not a dictionary, skip
            pass
        elif 'delete' in data:
            ## a delete element, skip for now.
            pass
        elif 'user' not in data:
            ## bizarre userless edge case
            pass
        else:
            if 'entities' in data and len(data['entities']['user_mentions']) > 0:
                user          = data['user']
                user_mentions = data['entities']['user_mentions']

                for u2 in user_mentions:                
                    print "\t".join([
                        user['id_str'],
                        u2['id_str'],
                        "1"
                        ])

if __name__ == '__main__':
    main()
The output, as it's currently configured, looks like this:
user_id1    user_id2    1
user_id1    user_id2    1
user_id2    user_id3    1
user_id3    user_id4    1
user_id3    user_id5    1
...
I've adapted the code to use screen_name instead of user_id, but that's about as far as my programming knowledge goes! :)
Apr 1, 2014 at 4:46 AM
Curtis:

First, in reply to your original question about importing raw Twitter JSON, we don't have anything that does that right now. We've talked about adding such a feature to NodeXL's Import menu, but that's still in the conceptual stage and I don't know when we'll actually implement it.

However, it should be possible to write a program in Python (or any other language, for that matter) that parses the Twitter JSON and outputs the parsed information as tab-delimited text with the column names that NodeXL uses (Vertex 1, Vertex 2, Relationship, Tweet, and so on). You could then import that text into a NodeXL workbook, and NodeXL could run its graph metrics on your data without knowing that it came from you instead of from its own Twitter Search Network importer. That kind of parsing is exactly what the Twitter Search Network importer does internally as it receives data over the network from the Twitter Search API.

There is a fair amount of programming involved, though. Not difficult programming, but a bunch of "pull out this data and put it here" kind of stuff. I don't know how ambitious you're feeling about this idea. If you want to pursue it, let me know and I can tell you about the columns that NodeXL would need.

-- Tony
Apr 1, 2014 at 3:01 PM
Thanks for the response, Tony. Like I said, although I'm not a programmer, this is something I would like to try to make work. Please do provide me with the columns needed to replicate the Twitter Search API input.
Apr 1, 2014 at 4:54 PM
Okay. I recommend doing this in two stages. Get the first stage to work, then I'll tell you about the second stage. (The second stage just adds more columns to the first stage.)

In the first stage, you'll need three columns named "Vertex 1", "Vertex 2" and "Relationship".

For each tweet, you need to output one or more rows with these three fields. Separate the fields with tabs to make it easier to import them into Excel, and write everything to a file instead of to the screen.

If the tweet has a non-empty "in_reply_to_screen_name" field, output one row with "Vertex 1" = data['user']['screen_name'], "Vertex 2" = data['in_reply_to_screen_name'], and "Relationship" = "Replies to".

Note that the "Vertex 1" and "Vertex 2" values I'm using here are Twitter screen names, not Twitter user IDs. Your sample code used user IDs, and you can continue to use user IDs if you want (NodeXL doesn't care if Vertex 1 and Vertex 2 are names or numbers), but the user IDs won't mean anything to anyone because they're just numbers. Twitter screen names are more interesting.

Then you need to output one row for each "user_mentions" value, unless the mentioned user was the replied-to user you just output. (In NodeXL, a replied-to user is not also a mentioned user. It's one or the other, not both.) You already have code to do that, except that you'll have to add an "if" clause to exclude the mentioned user if she was also the replied-to user. And because we're using screen names instead of user IDs, you would replace the "id_str" indexers in your "for u2 in user_mentions" loop with "screen_name". The "Relationship" value for these rows should be "Mentions".

Did the tweet neither mention nor reply to anyone? In NodeXL, that case is indicated by a single row that connects the tweeter to himself. (It's called a self-loop edge.) For this case, you output a row with "Vertex 1" = data['user']['screen_name'], "Vertex 2" = the same thing, and "Relationship" = "Tweet".

To summarize, for each tweet you will output zero or one "Replies-to" edges, zero or more "Mentions" edges, and zero or one "Tweet" edges.
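
Putting all of that together, here's a rough, untested sketch of what the edge-writing part of your script might look like. It keeps the same JSON handling as your mentionMapper script, uses screen names, and writes to a file I arbitrarily named edges.txt:

import json, sys

def write_edges(out):
    for line in sys.stdin:
        try:
            data = json.loads(line.strip())
        except ValueError:
            continue

        ## skip deletions and anything that isn't a normal tweet
        if not isinstance(data, dict) or 'delete' in data or 'user' not in data:
            continue

        tweeter = data['user']['screen_name']
        replied_to = data.get('in_reply_to_screen_name')
        mentions = [u['screen_name']
                    for u in data.get('entities', {}).get('user_mentions', [])]

        wrote_edge = False

        ## zero or one "Replies to" edge
        if replied_to:
            out.write("\t".join([tweeter, replied_to, "Replies to"]) + "\n")
            wrote_edge = True

        ## zero or more "Mentions" edges, excluding the replied-to user
        for mentioned in mentions:
            if mentioned != replied_to:
                out.write("\t".join([tweeter, mentioned, "Mentions"]) + "\n")
                wrote_edge = True

        ## neither a reply nor a mention: one self-loop "Tweet" edge
        if not wrote_edge:
            out.write("\t".join([tweeter, tweeter, "Tweet"]) + "\n")

if __name__ == '__main__':
    with open("edges.txt", "w") as out:
        write_edges(out)

You would run it the same way as mentionMapper, piping your collected JSON into it, and then import edges.txt as described below.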

Now you have a tab-delimited text file that contains three columns, and you can import that text file into a plain Excel workbook. (Not a NodeXL workbook, just a plain Excel workbook.) See http://www.ehow.com/how_4449036_import-delimited-file-excel.html for instructions on how to do that. Leave the plain Excel workbook open.

For the final step, open a NodeXL workbook from the Windows Start menu or Start screen. You'll now have two open workbooks: the plain Excel workbook containing your Twitter data, and an empty NodeXL workbook. In the ribbon of the NodeXL workbook, go to NodeXL, Data, Import, From Open Workbook. In the Import from Open Workbook dialog box, you should see your plain Excel workbook in the "Open workbooks to import from" list. Select your plain Excel workbook, check "Columns have headers", check "Is Edge Column" for the three columns, and specify "Vertex 1" and "Vertex 2" for the "Which edge column" drop-down lists. Click "Import" and your three columns will be imported into the NodeXL workbook. Click "Show Graph" to see the edges you created.

-- Tony
Apr 1, 2014 at 4:57 PM
I forgot to mention that you should begin your text file with a header row that looks like "Vertex 1[tab]Vertex 2[tab]Relationship".
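
(With a sketch like the one in my previous post, that's just one extra line before the loop: out.write("Vertex 1\tVertex 2\tRelationship\n").)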

-- Tony
Apr 1, 2014 at 8:00 PM
Awesome, thanks for the detailed response Tony! Doesn't sound terribly difficult, so we'll see how it goes!
May 8, 2014 at 6:56 PM
Edited May 8, 2014 at 6:57 PM
Hi Tony!

I know it's been a while, but I've successfully executed step one of your directions, and am now ready to proceed to step two!

Thanks again for your help. I'm looking forward to seeing what I can do.
May 11, 2014 at 7:47 PM
Edited May 11, 2014 at 7:47 PM
Congratulations, I'm glad to hear it's going well.

Your initial post mentioned that you were interested in having NodeXL generate the "Top" items for you. To make that work, you need to add the following columns to your tab-delimited text file:

Tweet
URLs in Tweet
Domains in Tweet
Hashtags in Tweet
Imported ID

Here are the JSON paths for those columns. The JSON is documented at https://dev.twitter.com/docs/api/1.1/get/search/tweets

Tweet: text
URLs in Tweet: entities, urls
Domains in Tweet: Not available in the JSON; you have to parse the domains from the URLs yourself
Hashtags in Tweet: entities, hashtags
Imported ID: id_str
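
Just as a rough, untested illustration of pulling those fields out of each tweet (I'm assuming here that multiple URLs or hashtags can go into one cell as space-separated values, and tweet_columns is just a name I made up):

from urlparse import urlparse    ## Python 2 module; in Python 3 use urllib.parse

def tweet_columns(data):
    ## clean the tweet text so embedded tabs or newlines don't break the tab-delimited file
    text = data.get('text', '').replace('\t', ' ').replace('\n', ' ')

    ## the entities are lists; they may be empty
    urls = [u.get('expanded_url') or u.get('url')
            for u in data.get('entities', {}).get('urls', [])]
    hashtags = [h['text'] for h in data.get('entities', {}).get('hashtags', [])]

    ## "Domains in Tweet" isn't in the JSON, so parse it out of each URL
    domains = [urlparse(u).netloc for u in urls if u]

    return [
        text,                    ## Tweet
        " ".join(urls),          ## URLs in Tweet
        " ".join(domains),       ## Domains in Tweet
        " ".join(hashtags),      ## Hashtags in Tweet
        data.get('id_str', '')   ## Imported ID
    ]

You would append those five values (tab-separated) to each edge row the tweet produces, and add the matching column names to your header row.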

Once you get those columns into the NodeXL workbook--and make sure you use the exact column names I've listed here--you should be able to go to NodeXL, Analysis, Graph Metrics and use the "Words and word pairs" and "Twitter search network top items" items. Note that in the Graph Metrics dialog box, you have to specify the "Tweet" column name for the "Words and word pairs" option via the Options button.

-- Tony
May 12, 2014 at 12:23 PM
Thanks Tony, I appreciate your help!

How does NodeXL handle multiple hashtags or multiple links in a single tweet?

Also, I assume that if I wanted to include user information, I would just add it as well and set it as a Vertex 1 property when importing?

Thanks again!
May 12, 2014 at 3:40 PM
Edited May 12, 2014 at 3:42 PM
On the multiple hashtag/URL question: See "Entities in Twitter Objects" at https://dev.twitter.com/docs/entities . If you search for "user_mentions" on that page, you'll see an example of how Twitter formats multiple entities. (user_mentions, hashtags, and urls are all types of what Twitter calls "entities".)
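
So, for example, a tweet with two hashtags has a two-element list under its entities, and with the space-separated cells I assumed in the earlier sketch, both values simply end up in the same cell (the example values here are made up):

hashtags = [h['text'] for h in data['entities']['hashtags']]  ## e.g. ['nodexl', 'python']
cell_value = " ".join(hashtags)                               ## "nodexl python" goes in "Hashtags in Tweet"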

-- Tony
May 12, 2014 at 3:41 PM
On the user information question: Yes, your assumption is correct.

-- Tony