Tuesday, March 03, 2009

justSignal Developer's Challenge

While exploring my little segment of the tweeter-verse, I discovered a tweet from Francine Hardaway about a challenge from Brian Roy.

hardaway RT @briantroy: Developer challenge: Take the just signal data from Demo 09 and tell us something interesting. http://bit.ly/16XIzj

Here is the challenge - Parse the data, tell us who gets mentioned most, most good mentions, most bad mentions, most whatever. Make it enlightening. All I ask is that when you publish your site/data you give attribution to justSignal by saying we collected and provided the data for you.

After reading Brian's blog, I decided to attempt the developer's challenge.

Hopefully the following information will help you understand how I used IBM's DB2 to help me manipulate the XML data. I wanted to see easy it was to extract meaningful results from the XML data file collected by the justSignal Tracker for Demo 09. I downloaded the XML data from this link, http://justsignal.com/demo09-03-02-2009.xml

I use DB2 version 9.5 at work and I have a copy of IBM’s DB2 Express C (http://www-01.ibm.com/software/data/db2/express/) installed on one of my home systems. I created a very simple table with one column for the XML data.

DROP TABLE JUSTSIGNAL.DEMO_09 @

CREATE TABLE JUSTSIGNAL.DEMO_09 (TWEETS XML)@


I then imported the single XML file into the table using the following:


C:\Users\Public\demo09>db2 -td@ -vf foo.sql

import from foo.del of del xml from .

insert into justsignal.demo_09

SQL3109N The utility is beginning to load data from file "foo.del".

SQL3110N The utility has completed processing. "1" rows were read from the input file.

SQL3221W ...Begin COMMIT WORK. Input Record Count = "1".

SQL3222W ...COMMIT of any database changes was successful.

SQL3149N "1" rows were processed from the input file. "1" rows were

successfully inserted into the table. "0" rows were rejected.

Number of rows read = 1

Number of rows skipped = 0

Number of rows inserted = 1

Number of rows updated = 0

Number of rows rejected = 0

Number of rows committed = 1


I spent a few minutes creating a view using the XML data that I inserted into the database. This view allowed me to treat the XML data as if it were columns of data within a normal relational database table. I was able to create several simple SQL queries to retrieve data like Brian suggested.

--

DROP VIEW JUSTSIGNAL.VIEWTWEETS

@


CREATE VIEW JUSTSIGNAL.VIEWTWEETS AS

SELECT TMP.* FROM

JUSTSIGNAL.DEMO_09,

XMLTABLE(

'$tweets/xml/tweets/tweet' PASSING TWEETS AS "tweets"

COLUMNS

TW_ID INTEGER PATH './tw_id',

TO_USER_ID INTEGER PATH './to_user_id',

TO_USER VARCHAR(32) PATH './to_user',

FROM_USER_ID INTEGER PATH './from_user_id',

FROM_USER VARCHAR(32) PATH './from_user',

ISO_LANGUAGE_CODE CHAR(2) PATH './iso_language_code',

PROFILE_IMAGE_URL VARCHAR(128) PATH './profile_image_url',

CREATED_AT VARCHAR(32) PATH './created_at',

IS_RETWEET CHAR(5) PATH './is_retweet',

TEXT VARCHAR(160) PATH './text'

) AS TMP


Here are the top 5 tweeters.


C:\Users\Public\demo09>db2 -td@ -vf Top5.sql

select from_user as "Tweeter", count(from_user) as "Count"

from justsignal.viewtweets

group by from_user

order by 2 desc

fetch first 5 rows only


Tweeter Count

-------------------------------- -----------

nanpalmero 197

demochatter 156

ejosowitz 42

larrymagid 42

Rafe 36


These 5 tweeters have the most replies.


C:\Users\Public\demo09>db2 -td@ -vf MostRepliedTo.sql

DROP VIEW JUSTSIGNAL.VIEWREPLIES

DB20000I The SQL command completed successfully.

CREATE VIEW JUSTSIGNAL.VIEWREPLIES AS

SELECT SUBSTR(TEXT,1,LOCATE(' ', TEXT))

AS TWEETER FROM JUSTSIGNAL.VIEWTWEETS

WHERE TEXT LIKE '@%'

GROUP BY TEXT

DB20000I The SQL command completed successfully.

SELECT substr(TWEETER,1,32), COUNT(TWEETER) as "Replies"

FROM JUSTSIGNAL.VIEWREPLIES

GROUP BY TWEETER

ORDER BY 2 DESC

FETCH FIRST 5 ROWS ONLY

1 Replies

-------------------------------- -----------

@Rafe 10

@sarahintampa 10

@nanpalmero 8

@podboy 7

@demochatter 6

5 record(s) selected.


This simple query allowed me to count the tweets and retweets.


C:\Users\Public\demo09>db2 -td@ -vf TweetRetweet.sql

select is_retweet, count(is_retweet) as “Count”

from justsignal.viewtweets

group by is_retweet

order by 2 desc

IS_RETWEET Count

---------- -----------

false 3000

true 167

2 record(s) selected.