## Firefox visualization competition submission description

December 15, 2010

Firefox has a competition to visualize some data they’ve collected (found through this FlowingData.com post).  After a week and a half of intense work, I have finished and submitted a visualization.  Links are below, but first: Special thanks to Rob Ploutz-Snyder and Sheila Moore for some very helpful suggestions, and Phillip Fiedler for helping me get the visualization online.  And thanks to the maker of the JavaScript library raphael.js.

If you use Internet Explorer, you can see a static version of the visualization here. I don’t think the dynamic version will work for you.  Also, the dynamic version won’t work if you have JavaScript disabled.

Everyone else (hopefully) can see the visualization, here. As you look over the visualization, keep in mind this rule: If your cursor changes from the default arrow, this indicates there is some kind of interactivity available.  The histograms can be dragged left and right to move them in their viewing windows. The proportion graphs on the right hand side have even more interactivity, but they come with their own explanation, which can be accessed by clicking on the “?” at the top.

## Background info and details

Below, I’ll lay out exactly which data set I’m using, explain how I made my visualization, and give some code I used.

### Dataset

I am using data from the study titled “A Week in the Life of a Browser – Version 2”. Specifically, I am using data from “witl_large.tar.gz”, which contains three files. One file, users.csv, has info for all (?) 27,000-some users in the study. From this file I only use the number of extensions (programs that users can install to add functionality to Firefox) and the operating system.

A second file, events.csv, contains event data for some 26,000 of the 27,000 users.  From this file I only use events pertaining to the number of open tabs and the number of bookmarks.  The last file, survey.csv, contains survey info for 4,000 users who completed a survey.

### Issues

There were no issues with the users.csv file.  In the events.csv file, there were a few anomalies that forced me to exclude a few users from consideration:

• users 752, 15757, and 19054 do not have any tab or bookmark events,
• users 4378 and 5022 sometimes have “NaN” as the number of tabs,
• another 104 survey participants do not have any event data, and
• another 16 users have tab data, but no bookmark data.

One of the dropped survey participants actually did not answer any of the survey questions.   I did not screen for this so there might still be non-participants in the “survey participants”.

### Code

#### Part 1. Getting the data

I used Python to pull out the data I wanted and to compute the stats. At first (not knowing how long this would take) I thought I was going to make a series of three visualizations, so I wanted to put the relevant tab and bookmark events into separate files to make them more convenient to work with. The code below does this for the bookmark events, writing them to bmkEvents.csv; the tab events go into tabEvents.csv (that half of the code is not shown).

import csv

infile = open('events.csv', 'r')
outfile = open('bmkEvents.csv', 'wt', encoding='utf-8')  # same name the next step reads
csvread = csv.reader(infile)

for row in csvread:
    if row[1] == "event_code":
        continue  # skip the header row
    code = int(row[1])
    if code in [8, 9, 11]:
        if code == 8:
            # Code-8 rows should start with a bookmark count; print any that don't.
            try:
                int(row[2].split(' ')[0])
            except ValueError:
                print(row)
        # Keep every code-8 row, plus bookmark additions (9) and removals (11).
        if code == 8 or (code == 9 and row[2] == "New Bookmark Added") or (code == 11 and row[2] == "Bookmark Removed"):
            outfile.write("%s\n" % ','.join(row[0:6]))

infile.close()
outfile.close()


Next, I am going to make mini data tables for three of the parameters of interest: number of extensions, number of tabs, and number of bookmarks. I’m going to treat the operating system data a little differently. This isn’t necessarily a good idea; it’s just how I did it. The mini data tables will be lists. Each entry of this list will be a 2-element list with ID and data value.

For number of tabs and number of bookmarks I’m going to use a kind of per-user average. Ideally, I would have a time-weighted average, so if a user had 1 tab open for 1 hour and 2 tabs open for 9 hours that person’s average number of tabs would be $(1 \cdot 1 + 2 \cdot 9)/10 = 1.9$. The unweighted average would be $(1 + 2) / 2 = 1.5$. Now, Firefox has provided some timestamp data, so it might be possible to do a time-weighted average, but I don’t trust the timestamps provided. (I would say more about this, but WordPress is already draining all desire to write anything more than necessary. I have a window to type in that is literally less than 1.5 inches tall. I could type this blog post out somewhere else and import it but WordPress removes whitespace when you copy and paste. Formatting the code below was painful. I can’t guarantee it’s correct.) Anyway, I am going to compute the unweighted average number of tabs and bookmarks. (Actually, one could argue that there is a little bit of weighting due to how I make the averages, sort of a startup-weighted average.) Here is the code:

infile = open('tabEvents.csv', 'r')
csvread = csv.reader(infile)

tabData = []   # one [id, average number of tabs] entry per user
ev = []        # tab counts for the user currently being read
id = None
for row in csvread:
    if id is None:
        id = int(row[0])
    elif int(row[0]) != id:
        # A new user starts here: store the previous user's average.
        tabData.append([id, sum(ev)/len(ev)])
        ev = []
        id = int(row[0])
    ev.append(int(row[4]))

tabData.append([id, sum(ev)/len(ev)])   # the last user

infile.close()

infile = open('bmkEvents.csv', 'r')
csvread = csv.reader(infile)

ev = []
bmkData = []
id = None
numBkmks = None
for row in csvread:
    if id is None:
        id = int(row[0])
    elif int(row[0]) != id:
        bmkData.append([id, sum(ev)/len(ev)])
        ev = []
        numBkmks = None
        id = int(row[0])
    # This assumes each user's first bookmark event is a code-8 row.
    if int(row[1]) == 8:
        numBkmks = int(row[2].split(' ')[0])   # the count is the first word of the event text
    elif int(row[1]) == 9:
        numBkmks += 1
    elif int(row[1]) == 11:
        numBkmks -= 1
    ev.append(numBkmks)

bmkData.append([id, sum(ev)/len(ev)])

infile.close()

evntID = [x[0] for x in bmkData]
tabData = [x for x in tabData if x[0] in evntID]  # remove the users that have no bookmark data


The second-to-last line made a list of all event users. I will use this later to exclude the survey participants that don’t have event data.
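For comparison, here is a minimal sketch of the time-weighted average described earlier, which is not what the code above computes. It assumes trustworthy timestamps and that each count holds until the next event; the function name and the (timestamp, count) layout are made up for illustration.

```python
def time_weighted_average(events, end_time):
    """events: (timestamp, count) pairs sorted by timestamp (hypothetical layout).

    Each count is assumed to hold until the next event, and the last
    count until end_time.
    """
    total = 0.0
    following = events[1:] + [(end_time, None)]
    for (t, count), (t_next, _) in zip(events, following):
        total += count * (t_next - t)
    return total / (end_time - events[0][0])

# 1 tab open for 1 hour, then 2 tabs open for 9 hours
events = [(0, 1), (1, 2)]
weighted = time_weighted_average(events, end_time=10)
unweighted = sum(count for _, count in events) / len(events)
print(weighted, unweighted)
```

With the numbers from the example in the text, the weighted figure comes out near 1.9 and the unweighted one is 1.5.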

Next I need to make the data table for the number of extensions, and a list that groups users’ IDs together by operating system. All flavors of Windows XP will be considered one group. Likewise for Vista, Windows 7, OS X, and Linux. The lone user of Sun OS is the last group.

extData = []
OSsData = [[],[],[],[],[],[]]   # XP, Vista, Windows 7, OS X, Linux, Sun OS
OSNames = ["WINNT Windows NT 5.1", "WINNT Windows NT 6.0", "WINNT Windows NT 6.1", "Darwin Intel Mac OS X 10.5", "Darwin PPC Mac OS X 10.5"]

infile = open('users.csv', 'r')
csvread = csv.reader(infile)

for row in csvread:
    # Skip the header row and any user without event data.
    if row[0] != "id" and int(row[0]) in evntID:
        extData.append([int(row[0]), int(row[6])])

        # The 19-character prefix matches any "WINNT Windows NT 5.x" string.
        if row[3][0:19] == OSNames[0][0:19]:
            OSsData[0].append(int(row[0]))
        elif row[3][0:20] == OSNames[1][0:20]:
            OSsData[1].append(int(row[0]))
        elif row[3][0:20] == OSNames[2][0:20]:
            OSsData[2].append(int(row[0]))
        elif row[3][0:6] == "Darwin":
            OSsData[3].append(int(row[0]))
        elif row[3][0:5] == "Linux":
            OSsData[4].append(int(row[0]))
        elif row[3][0:3] == "Sun":
            OSsData[5].append(int(row[0]))

infile.close()
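The if/elif chain above amounts to matching each operating system string against a list of prefixes, where the first match wins. Here is a small sketch of that idea using str.startswith. The names OS_PREFIXES and os_group are made up, and "Linux i686" and "BeOS" are invented test strings rather than values taken from users.csv.

```python
# Prefix table mirroring the if/elif chain: the first matching prefix wins.
OS_PREFIXES = [
    "WINNT Windows NT 5.",   # Windows XP group
    "WINNT Windows NT 6.0",  # Vista
    "WINNT Windows NT 6.1",  # Windows 7
    "Darwin",                # OS X (Intel and PPC)
    "Linux",
    "Sun",
]

def os_group(os_string):
    """Return the index of the first matching prefix, or None if nothing matches."""
    for index, prefix in enumerate(OS_PREFIXES):
        if os_string.startswith(prefix):
            return index
    return None

samples = ["WINNT Windows NT 5.1", "Darwin PPC Mac OS X 10.5", "Linux i686", "BeOS"]
groups = [os_group(s) for s in samples]
print(groups)
```

A string that matches no prefix returns None, which corresponds to falling through every branch of the original chain.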


The last bit of data I need is a list of survey users, remembering to exclude the users without event data.

infile = open('survey.csv', 'r')
csvread = csv.reader(infile)

survID = []
for row in csvread:
    if row[0] != 'user_id' and int(row[0]) in evntID:
        id = int(row[0])
        survID.append(id)

infile.close()


The above mini data tables were the versions with “all users”. I need versions with just survey participants.

sTabData = [x for x in tabData if x[0] in survID]
sBmkData = [x for x in bmkData if x[0] in survID]

sExtData = [x for x in extData if x[0] in survID]
sOSsData = []
for l in OSsData:
    sOSsData.append([x for x in l if x in survID])


#### Part 2. Getting medians, maxima, and histograms

The median is easy; it’s just the middle data value.  So here’s a function to get the median.


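def getMedian(data):
    data2 = sorted(data)
    mid = len(data2) // 2
    if len(data2) % 2 == 0:
        # Even count: average the two middle values.
        median = (data2[mid - 1] + data2[mid]) / 2
    else:
        # Odd count: take the single middle value. (Integer division avoids
        # round(), which rounds halves to even in Python 3, e.g. round(2.5) == 2.)
        median = data2[mid]
    return median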
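A quick way to sanity-check a hand-written median is to compare it against statistics.median from the standard library (Python 3.4 and later). A small sketch with made-up numbers:

```python
from statistics import median

odd = [7, 1, 3, 9, 5]   # sorted: 1 3 5 7 9, so the middle value is 5
even = [4, 1, 3, 2]     # sorted: 1 2 3 4, so the median is the mean of 2 and 3

median_odd = median(odd)
median_even = median(even)
print(median_odd, median_even)
```

Lists of length 5 or 9 are worth including in such a check, because that is where rounding a half-integer index is most likely to go wrong.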



The histograms can be made with these two functions.

from math import ceil

def makeBins(data, binWidth=None, binShift=None):
    data2 = sorted(data)

    # Approximate quartiles by position in the sorted data.
    index75 = int(round(3*len(data2)/4, 0)) - 1
    index25 = int(round(len(data2)/4, 0)) - 1
    iqr = data2[index75] - data2[index25]

    minD = data2[0]
    maxD = data2[-1]

    if binWidth is None:
        preh = 2*iqr*len(data)**(-1/3)  # Freedman-Diaconis rule.  Wikipedia.
        nbins = ceil((maxD-minD)/preh)
        h = ceil((maxD-minD)/nbins*1000) / 1000
    else:
        h = binWidth
        if binShift is not None:
            minD = minD + binShift
        nbins = ceil((maxD-minD)/h)
    cutoffs = [minD + r*h for r in range(nbins+1) if minD + (r-1)*h <= maxD]

    # Count the values falling in each half-open bin [cutoffs[i], cutoffs[i+1]).
    binSizes = []
    for i in range(len(cutoffs)-1):
        binSizes.append(len([x for x in data2 if cutoffs[i] <= x < cutoffs[i+1]]))

    return binSizes, cutoffs

def getHist(data, binWidth=None, binShift=None, latex=False):
    binSizes, cutoffs = makeBins(data, binWidth, binShift)
    binTotal = sum(binSizes)
    binHeights = [round(x/binTotal*100, 4) for x in binSizes]   # per cent of the data in each bin

    # Print the non-empty bins as array literals ending in a semicolon.
    print("heights = [", end='')
    for i in range(len(cutoffs)-1):
        if binSizes[i] != 0:
            if i != len(cutoffs) - 2:
                print(binHeights[i], end=',')
            else:
                print(binHeights[i], end='')
    print("];")

    print("cutoffs = [", end='')
    for i in range(len(cutoffs)-1):
        if binSizes[i] != 0:
            if i != len(cutoffs) - 2:
                print(cutoffs[i], end=',')
            else:
                print(cutoffs[i], end='')
    print("];")


Then, for instance, to get the median, maximum, and histogram for number of extensions for all users I use

ext = [x[1] for x in extData]
getMedian(ext)
max(ext)
getHist(ext, binWidth = 1, binShift = -0.5)
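When no bin width is passed, makeBins falls back on the Freedman-Diaconis rule, which sets the bin width to 2 · IQR · n^(-1/3). Here is a stand-alone sketch of just that formula on a made-up sample. It uses statistics.quantiles (Python 3.8 and later) for the quartiles, which can differ slightly from the index-based quartiles in makeBins, and the name fd_bin_width is an illustration only.

```python
from statistics import quantiles

def fd_bin_width(data):
    """Freedman-Diaconis bin width: 2 * IQR * n^(-1/3)."""
    q1, _, q3 = quantiles(data, n=4)   # default 'exclusive' method
    return 2 * (q3 - q1) * len(data) ** (-1 / 3)

# Made-up sample: the integers 1..27, so n = 27 and n^(1/3) = 3.
data = list(range(1, 28))
h = fd_bin_width(data)
print(round(h, 3))
```

For this sample the quartiles are 7 and 21, so the IQR is 14 and the width is about 2 · 14 / 3 ≈ 9.333.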


#### Proportion graphs (the graphs on the right)

First, I want to divide up the data into the categories. I’m going to create a list for each variable, where each position of the list corresponds to one of the categories. Each position will be filled with a list of all of the IDs of users whose data falls in that category. (I could just count the number of users, rather than store the IDs, but the stored IDs will be useful in a minute.) Then, I’ll calculate what percent of the total is in each category, and convert that into a bar width.

cutpoints = [[0,5,10,20,200],[0,2,5,10,400],[-1,25,50,150,80000]]

extBoxIDs = []
tabBoxIDs = []
bmkBoxIDs = []
OSsBoxIDs = OSsData[0:-1]

sExtBoxIDs = []
sTabBoxIDs = []
sBmkBoxIDs = []
sOSsBoxIDs = sOSsData[0:-1]

for i in range(4):
    extBoxIDs.append([x[0] for x in extData if cutpoints[0][i] < round(x[1],1) <= cutpoints[0][i+1]])
    tabBoxIDs.append([x[0] for x in tabData if cutpoints[1][i] < round(x[1],1) <= cutpoints[1][i+1]])
    bmkBoxIDs.append([x[0] for x in bmkData if cutpoints[2][i] < round(x[1],1) <= cutpoints[2][i+1]])

    sExtBoxIDs.append([x[0] for x in sExtData if cutpoints[0][i] < round(x[1],1) <= cutpoints[0][i+1]])
    sTabBoxIDs.append([x[0] for x in sTabData if cutpoints[1][i] < round(x[1],1) <= cutpoints[1][i+1]])
    sBmkBoxIDs.append([x[0] for x in sBmkData if cutpoints[2][i] < round(x[1],1) <= cutpoints[2][i+1]])

widths = []
for thisbox in [extBoxIDs,sExtBoxIDs,tabBoxIDs,sTabBoxIDs,bmkBoxIDs,sBmkBoxIDs,OSsBoxIDs,sOSsBoxIDs]:
    widths.append([round((350 - 3.5*(len(thisbox)-1)) * len(x)/sum([len(x) for x in thisbox]),1) for x in thisbox])


In the last line, the “-3.5*(len(thisbox)-1)” leaves room for the gaps between the bars. (This introduces a tiny amount (1%) of distortion when comparing any of the first three graphs with the last one.)
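The 1% figure can be checked with a few lines of arithmetic: the first three graphs have four bars (three gaps), while the operating system graph has five bars (four gaps). This sketch reuses the 350 and 3.5 from the width calculation; the constant and function names are made up.

```python
TOTAL_WIDTH = 350   # total width used in the width calculation
GAP = 3.5           # gap between neighbouring bars

def usable_width(num_bars):
    """Width left for the bars themselves once the gaps are subtracted."""
    return TOTAL_WIDTH - GAP * (num_bars - 1)

four_bars = usable_width(4)   # extensions, tabs, bookmarks
five_bars = usable_width(5)   # operating systems
distortion = four_bars / five_bars - 1
print(four_bars, five_bars, round(distortion, 4))
```

The usable widths are 339.5 and 336.0, so the same proportion is drawn about 1% wider in a four-bar graph than in the five-bar one.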

The very last thing to do is to calculate the intersections of all of these sets of user IDs. All of the interactivity in the right section of the graph is based on intersections of categories. The following messy bit of code calculates the sizes of these intersections.

It would take a lot of work to explain the code in detail. I’ll just point out that each g*Effect contains multiple images of the entire graph. Also, “g0” refers to the “Number of extensions” section, “g1” refers to the tab section, and so on. The “all user” graphs are the first half of g*Effect, and the “survey” graphs are the second half. Here is the code for the first version of graphs on the right-hand side.

g0Effect = [[[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]]],
[[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]]]]

g1Effect = [[[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]]],
[[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]]]]

g2Effect = [[[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]]],
[[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]]]]

g3Effect = [[[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]]],
[[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]],
[[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0,0]]]]

for h in range(4):
    g = [g0Effect, g1Effect, g2Effect, g3Effect][h]
    idBox0 = [extBoxIDs, tabBoxIDs, bmkBoxIDs, OSsBoxIDs][h]
    idBox1 = [sExtBoxIDs, sTabBoxIDs, sBmkBoxIDs, sOSsBoxIDs][h]
    for i in range(len(idBox0)):
        print("i =", i)
        for id in idBox0[i]:
            for j in range(4):
                if id in extBoxIDs[j]:
                    g[0][i][0][j] += 1 / len(extBoxIDs[j])
                if id in tabBoxIDs[j]:
                    g[0][i][1][j] += 1 / len(tabBoxIDs[j])
                if id in bmkBoxIDs[j]:
                    g[0][i][2][j] += 1 / len(bmkBoxIDs[j])
            for k in range(5):
                if id in OSsBoxIDs[k]:
                    g[0][i][3][k] += 1 / len(OSsBoxIDs[k])

        for id in idBox1[i]:
            for j in range(4):
                if id in sExtBoxIDs[j]:
                    g[1][i][0][j] += 1 / len(sExtBoxIDs[j])
                if id in sTabBoxIDs[j]:
                    g[1][i][1][j] += 1 / len(sTabBoxIDs[j])
                if id in sBmkBoxIDs[j]:
                    g[1][i][2][j] += 1 / len(sBmkBoxIDs[j])
            for k in range(5):
                if id in sOSsBoxIDs[k]:
                    g[1][i][3][k] += 1 / len(sOSsBoxIDs[k])

        for j in range(4):
            g[0][i][0][j] = round(g[0][i][0][j],2)
            g[0][i][1][j] = round(g[0][i][1][j],2)
            g[0][i][2][j] = round(g[0][i][2][j],2)

            g[1][i][0][j] = round(g[1][i][0][j],2)
            g[1][i][1][j] = round(g[1][i][1][j],2)
            g[1][i][2][j] = round(g[1][i][2][j],2)

        for k in range(5):
            g[0][i][3][k] = round(g[0][i][3][k],2)
            g[1][i][3][k] = round(g[1][i][3][k],2)


The code for the second version of the graphs is similar. In the above, just replace each len(*BoxIDs[...]) by len(idBox0[i]) or len(idBox1[i]), whichever is appropriate. I would just put the code below, but WordPress sucks for trying to write out code.
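Stripped of the bookkeeping, each entry of g*Effect is the size of an intersection of two sets of user IDs divided by the size of one of the two sets. Here is a minimal sketch of the two normalizations using Python sets; the user IDs and variable names are made up for illustration.

```python
# Made-up user IDs, for illustration only.
selected = {1, 2, 3, 4}        # users in the category chosen in one graph
target = {3, 4, 5, 6, 7, 8}    # users in some category of another graph

overlap = len(selected & target)

# First version: share of the target category that also lies in the selection.
share_of_target = round(overlap / len(target), 2)

# Second version: share of the selection that also lies in the target category.
share_of_selected = round(overlap / len(selected), 2)

print(share_of_target, share_of_selected)
```

Set intersection also avoids the repeated "id in list" scans of the nested loops, which matters with tens of thousands of users.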

Categories: Code, Graphics