While exploring my little segment of the tweeter-verse, I discovered a tweet from Francine Hardaway about a challenge from Brian Roy.
hardaway RT @briantroy: Developer challenge: Take the just signal data from Demo 09 and tell us something interesting. http://bit.ly/16XIzj
…
Here is the challenge - Parse the data, tell us who gets mentioned most, most good mentions, most bad mentions, most whatever. Make it enlightening. All I ask is that when you publish your site/data you give attribution to justSignal by saying we collected and provided the data for you.
After reading Brian's blog, I decided to attempt the developer's challenge.
Hopefully the following information will help you understand how I used IBM's DB2 to help me manipulate the XML data. I wanted to see easy it was to extract meaningful results from the XML data file collected by the justSignal Tracker for Demo 09. I downloaded the XML data from this link, http://justsignal.com/demo09-03-02-2009.xml
I use DB2 version 9.5 at work and I have a copy of IBM’s DB2 Express C (http://www-01.ibm.com/software/data/db2/express/) installed on one of my home systems. I created a very simple table with one column for the XML data.
DROP TABLE JUSTSIGNAL.DEMO_09 @
CREATE TABLE JUSTSIGNAL.DEMO_09 (TWEETS XML)@
I then imported the single XML file into the table using the following:
C:\Users\Public\demo09>db2 -td@ -vf foo.sql
import from foo.del of del xml from .
insert into justsignal.demo_09
SQL3109N The utility is beginning to load data from file "foo.del".
SQL3110N The utility has completed processing. "1" rows were read from the input file.
SQL3221W ...Begin COMMIT WORK. Input Record Count = "1".
SQL3222W ...COMMIT of any database changes was successful.
SQL3149N "1" rows were processed from the input file. "1" rows were
successfully inserted into the table. "0" rows were rejected.
Number of rows read = 1
Number of rows skipped = 0
Number of rows inserted = 1
Number of rows updated = 0
Number of rows rejected = 0
Number of rows committed = 1
I spent a few minutes creating a view using the XML data that I inserted into the database. This view allowed me to treat the XML data as if it were columns of data within a normal relational database table. I was able to create several simple SQL queries to retrieve data like Brian suggested.
--
DROP VIEW JUSTSIGNAL.VIEWTWEETS
@
CREATE VIEW JUSTSIGNAL.VIEWTWEETS AS
SELECT TMP.* FROM
JUSTSIGNAL.DEMO_09,
XMLTABLE(
'$tweets/xml/tweets/tweet' PASSING TWEETS AS "tweets"
COLUMNS
TW_ID INTEGER PATH './tw_id',
TO_USER_ID INTEGER PATH './to_user_id',
TO_USER VARCHAR(32) PATH './to_user',
FROM_USER_ID INTEGER PATH './from_user_id',
FROM_USER VARCHAR(32) PATH './from_user',
ISO_LANGUAGE_CODE CHAR(2) PATH './iso_language_code',
PROFILE_IMAGE_URL VARCHAR(128) PATH './profile_image_url',
CREATED_AT VARCHAR(32) PATH './created_at',
IS_RETWEET CHAR(5) PATH './is_retweet',
TEXT VARCHAR(160) PATH './text'
) AS TMP
Here are the top 5 tweeters.
C:\Users\Public\demo09>db2 -td@ -vf Top5.sql
select from_user as "Tweeter", count(from_user) as "Count"
from justsignal.viewtweets
group by from_user
order by 2 desc
fetch first 5 rows only
Tweeter Count
-------------------------------- -----------
nanpalmero 197
demochatter 156
ejosowitz 42
larrymagid 42
Rafe 36
These 5 tweeters have the most replies.
C:\Users\Public\demo09>db2 -td@ -vf MostRepliedTo.sql
DROP VIEW JUSTSIGNAL.VIEWREPLIES
DB20000I The SQL command completed successfully.
CREATE VIEW JUSTSIGNAL.VIEWREPLIES AS
SELECT SUBSTR(TEXT,1,LOCATE(' ', TEXT))
AS TWEETER FROM JUSTSIGNAL.VIEWTWEETS
WHERE TEXT LIKE '@%'
GROUP BY TEXT
DB20000I The SQL command completed successfully.
SELECT substr(TWEETER,1,32), COUNT(TWEETER) as "Replies"
FROM JUSTSIGNAL.VIEWREPLIES
GROUP BY TWEETER
ORDER BY 2 DESC
FETCH FIRST 5 ROWS ONLY
1 Replies
-------------------------------- -----------
@Rafe 10
@sarahintampa 10
@nanpalmero 8
@podboy 7
@demochatter 6
5 record(s) selected.
This simple query allowed me to count the tweets and retweets.
C:\Users\Public\demo09>db2 -td@ -vf TweetRetweet.sql
select is_retweet, count(is_retweet) as “Count”
from justsignal.viewtweets
group by is_retweet
order by 2 desc
IS_RETWEET Count
---------- -----------
false 3000
true 167
2 record(s) selected.