Visualizing Twitter clusters with gephi: walkthrough

I like step-by-step walkthroughs. They’re task-oriented and context-specific, not universal like conventional documentation. They’re good because no matter what I’m doing, someone else has almost certainly done it before. They’re also good because when I take the time to figure something out, it’s nice to have a record of it in case I need to do it again. I often make them for myself; usually they’re just a record of the Terminal commands I used to get something done (they get more complicated when GUIs are involved).

So, here’s a walkthrough of how I made my Twitter maps. I’ve tried to keep it simple, but you’ll need to know a bit of MySQL to make this particularly useful.

(UPDATE May 26 2011: I left out some of the MySQL details from the post originally; I’ve updated the scripts and instructions [changes in bold], so hopefully now it’ll work. Thanks to digitip.net for pointing it out!)

First:

  • I’m assuming you’re using a fast Mac with lots of RAM. For my work, I was using an 8-core Mac Pro with 10GB RAM. You also need to be running Java 6, which on a Mac (most easily) means running OS X 10.6. For Java to have access to enough RAM, you have to make one change to gephi’s configuration (see ‘Setting up the gephi environment’ below).

Grabbing the data:

  1. Install MySQL.
  2. Download the php scripts (twittermap-0.02.zip) and unzip them into a working folder (say, ~/Desktop/twittermap).
  3. Download twitter-async and unzip it into the same folder.
  4. Download php-social-graph and put it into the same folder.
  5. Register a new application at https://dev.twitter.com/apps/new
  6. Open keys.php in a text editor, and insert the consumer key and consumer secret as found on your app’s info page on Twitter, and the token and secret as found on the ‘My Token’ page linked from your app’s info page.
  7. Go to the Apple Menu -> System Preferences -> MySQL and start MySQL if it’s not already running.
  8. In a Terminal window, from the working folder (so that schema.sql and the php scripts can be found):
  • mysql -u root
  • create database twitmap;
  • quit
  • mysql -D twitmap -u root < schema.sql
  • sudo pear install HTTP_Client
  • sudo mkdir /var/mysql
  • sudo ln -s /tmp/mysql.sock /var/mysql/mysql.sock
  • php robot.php test gggg
    • that’s php robot.php <batchname> <starting_twittername>
  • php userinfo.php
    • this’ll need to be run a bunch of times depending on how many users we need to download info for; it only does so many in a batch so that it doesn’t take forever

Setting up the gephi environment:

  1. Download gephi.
  2. Find the gephi app icon, right click it and select ‘Show Package Contents’
  3. Navigate to Contents -> Resources -> gephi -> etc
  4. Open gephi.conf in a text editor. On the default_options line, change these two values:
    • -J-XmsXXm to -J-Xms256m
    • -J-XmxXXm to -J-Xmx9000m
  5. Save and close the file.
  6. Launch gephi!
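
If you’d rather script the gephi.conf edit than do it by hand, here’s a minimal sketch of the same substitution in Python. The sample line below is a made-up stand-in (the real default_options line in your gephi.conf may look different), and set_heap_flags is just a name I picked for illustration.

```python
import re

def set_heap_flags(conf_text, xms="256m", xmx="9000m"):
    """Return conf_text with the -J-Xms / -J-Xmx values replaced."""
    # Match the flag followed by a number and an optional size suffix (k, m or g).
    conf_text = re.sub(r"-J-Xms\d+[kmgKMG]?", "-J-Xms" + xms, conf_text)
    conf_text = re.sub(r"-J-Xmx\d+[kmgKMG]?", "-J-Xmx" + xmx, conf_text)
    return conf_text

# Hypothetical example line (an assumption, not copied from a real gephi.conf).
sample = 'default_options="--branding gephi -J-Xms64m -J-Xmx512m"'
updated = set_heap_flags(sample)
print(updated)
```

To use it on the real file, you’d read gephi.conf into a string, pass it through set_heap_flags, and write the result back (keep a backup copy first). The 9000m value only makes sense on a machine with about as much RAM as mine; pick something lower on a smaller one.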

Making maps:

  1. In gephi: File->Import Database…->Edge List
    • Host: 127.0.0.1
    • Port: 3306
    • Database: twitmap
    • Username: root
    • Password: <blank>
    • Node Query: SELECT screen_name as id, followers_count FROM users
      • Eventually you’ll make this more complex, but this should be fine for now.
    • Edge Query: SELECT source,target FROM gsgrel WHERE batchname='test'
      • Eventually you’ll make this more complex, but this should be fine for now.
  2. Click OK, and wait for it to import (a few seconds). In my test case, I got 140446 nodes and 200437 edges. This is an impractically large number of nodes to map, so we filter…
  3. Open Filters->Topology, and drag ‘In Degree Range’ to the Queries window.
  4. Click on ‘In Degree Range’ in the Queries window.
  5. In the ‘In Degree Range Settings’ window, double-click the number at the left side of the slider (“1”), so it’s highlighted.
  6. Type “3” then hit enter. (This means that it will eliminate the users that don’t have at least 3 followers within the downloaded network, i.e. 3 incoming ‘edges.’)
  7. Click ‘Filter’. In my test case, it reduced the number of nodes to 10599 and edges to 54425. This is still high, but a better number to work with. The nodes appear in the Graph window, randomly organized into a square. Let’s change the layout.
  8. In the Layout window, select Yifan Hu.
  9. Click ‘run’.
    • It’s handy to open the OS X Activity Monitor, and keep an eye on the CPU usage and memory usage of gephi. Sometimes if the CPU usage is high but nothing’s happening on screen, it means gephi has frozen, and will need to be force quit.
  10. Watch it go (cool!). Once things slow down, it’ll stop by itself. Click ‘run’ again (one or more times) if you want to make it continue.
  11. Next, we’ll set the node size. (Make sure the layout has stopped running. Click stop if it hasn’t.) In the Ranking window, click on the ruby.
  12. Select ‘InDegree’
  13. Set Min size to 10, and Max size to 150. Click Apply. The nodes have changed in size, to reflect how many incoming edges they have in the network.
  14. Now we’ll turn on labels. At the bottom of the Graph window, click the black “A” (as opposed to the blue one).
  15. Select Node Size.
  16. Slide the label size slider almost all the way to the far left (leave it just a little to the right of the end).
  17. Click the grey “T” to the far left. Wait for the labels to appear.
    • If it’s annoying when the network “focuses” on a single node’s connections when you pass the mouse over a node, click the select box tool at the top left of the Graph window.
  18. Now, for fun, we’ll colourize the nodes, also by InDegree. In the Ranking window, click on the colour wheel.
  19. Select InDegree.
  20. Change the colours if you want. (This is more confusing than it should be.)
  21. Click Apply.
  22. Now we’ll colour by ‘Closeness Centrality’. In the Statistics window, click ‘Run’ next to ‘Avg. Path Length’. (gephi calculates closeness centrality as part of this statistic.)
  23. Keep the defaults in the next window (Directed, don’t Normalize Centralities)
  24. Wait for it to finish running; you’ll get a report window. Click close.
  25. In the Ranking window, click the colour wheel.
  26. Select ‘Closeness Centrality’ (if it doesn’t show up, click around the menus and come back to it)
  27. Click apply.
  28. Now we’ll colour by cluster. In the Statistics window, click ‘Run’ next to Modularity.
  29. Keep the defaults in the next window (Randomize). Click ok.
  30. The report will pop up. Note the number of communities, and click close. (In my test case, it was 144, which is kind of high to be useful. I just don’t think the clustering algorithm here is suited to large networks like these. It’s possible to write another algorithm as a plugin for gephi, but that’s outside of the scope of my project. For now.)
  31. In the Partition window, click the ‘reload’ button.
  32. Select Modularity Class.
  33. Click apply.
  34. Now we’ll try a Force Atlas layout. This is probably not going to work well (or at all) if you have more than a couple thousand nodes. (Use more aggressive filtering to get the number down.)
  35. In the Layout window, select Force Atlas, and click run.
    • This one takes much longer. If nothing happens in the first 10 seconds, you’re probably out of luck. Click stop and hope it does. If it doesn’t (after, say, 30 seconds), you’re probably best off force-quitting gephi and starting over.
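
To get a feel for what the ‘In Degree Range’ filter from the steps above is doing, here’s a rough sketch in Python. It runs the same Edge Query against a tiny toy table and then keeps only the nodes with at least 3 incoming edges. Some assumptions: an in-memory SQLite database stands in for the MySQL twitmap database, only the three gsgrel columns used in the query are modelled, the toy edges are invented, and filter_by_in_degree is my own approximation of the filter, not gephi’s code.

```python
import sqlite3
from collections import Counter

def filter_by_in_degree(edges, min_in_degree=3):
    """Keep nodes with at least min_in_degree incoming edges,
    plus the edges whose two endpoints both survive."""
    in_degree = Counter(target for _, target in edges)
    nodes = {node for edge in edges for node in edge}
    kept_nodes = {n for n in nodes if in_degree[n] >= min_in_degree}
    kept_edges = [(s, t) for s, t in edges if s in kept_nodes and t in kept_nodes]
    return kept_nodes, kept_edges

# In-memory SQLite stands in for MySQL here (assumption); toy data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gsgrel (source TEXT, target TEXT, batchname TEXT)")
toy_edges = [("a", "d"), ("b", "d"), ("c", "d"), ("e", "d"),
             ("a", "e"), ("b", "e"), ("d", "e"), ("a", "b")]
conn.executemany("INSERT INTO gsgrel VALUES (?, ?, 'test')", toy_edges)

# Same Edge Query as in the import dialog.
edges = conn.execute(
    "SELECT source, target FROM gsgrel WHERE batchname = 'test'"
).fetchall()

kept_nodes, kept_edges = filter_by_in_degree(edges, min_in_degree=3)
print(len(kept_nodes), "nodes and", len(kept_edges), "edges left after filtering")
```

Raising min_in_degree is the ‘more aggressive filtering’ mentioned above: each step up drops the less-followed accounts along with every edge that touched them, which is why the edge count falls so much faster than you might expect.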

This is as far as this walkthrough goes — from here forward, it’s just about experimenting. If you’re especially into clustering, focus on trying out different ways of filtering. It takes a while to figure out, but if you do it with datasets that aren’t too big, the fastest way to learn it is to just try stuff and see what happens…
