CLUSTERfoo!

Possibly the best blog on the topic of me.

Latest Tweets

Latest Blog Posts and Articles

Reusing Javascripts Across Browser and Server

The following method will allow you to reuse the same javascript file across multiple environments:

  1. As an npm/bower package.
  2. As a drop-in browser script.
  3. As a node.js module, using require.
  4. As a Browserify module, using require.

We’ll start with the follwing script. It’s made up of two utility functions, foo and bar:

my_module.js

1function foo() { console.log("Foo!"); }
2function bar() { console.log("Bar!"); }

Step 1: Preparing our script for the browser.

First, we must prepare our script for the browser. I like to wrap and namespace all my browser scripts using my unique github handle. For an in-depth explanation of what’s going on, check out my guide on namespacing javascripts.

my_module.js

 1;(function() {
 2    // Create an object to contain all that we wish to export.
 3    var C = {};
 4
 5    // Alias "C" to our namespace obejct.
 6    appendToNamespace("Clusterfoo", "myModule", C)
 7    function appendToNamespace(namespace, elementName, elementValue) {
 8        // ... see namespacing guide
 9    }
10
11    // This is how we export our public functions.
12    C.foo = foo;
13    function foo() { console.log("Foo!"); }
14
15    C.bar = bar;
16    function bar() { console.log("Bar!"); }
17
18    // All other functions remain private:
19    function baz() { console.log("No way to call me!"); }
20})();

That’s all we need for the browser. The script is ready to use:

my_page.html

 1<html>
 2    <head>
 3        <script src="my_module.js"></script>
 4    </head>
 5    <body>
 6        <script>
 7            // initialize my namespace
 8            Clusterfoo();
 9            // use functions
10            Clusterfoo.myModule.foo(); // -> "Foo!"
11            Clusterfoo.myModule.bar(); // -> "Bar!"
12        </script>
13    </body>
14</html>

Step 2: Preparing Script for Node / Browserify

Next, we prepare our script so that it is compatible with the require API used in node.js and browserify. This requires a bit more trickery, but nothing too complicated. All we care about is dealing with either a window object or an exports object:

my_module.js

 1// Our wrapper function now accepts two arguments.
 2;(function(window, exports) {
 3    var C = {};
 4
 5    appendToNamespace("Clusterfoo", "myModule", C)
 6    function appendToNamespace(namespace, elementName, elementValue) {
 7        // ... see namespacing guide
 8    }
 9
10    // Attach our functions to either the namespace object
11    // or the exports object.
12    exports.foo = C.foo = foo;
13    function foo() { console.log("Foo!"); }
14
15    exports.bar = C.bar = bar;
16    function bar() { console.log("Bar!"); }
17
18// If window/exports are undefined, simply pass in empty objects.
19})(typeof window !== "undefined" ? window : {},
20   typeof exports !== "undefined" ? exports : {});

Step 3: Using our Script

Our script is now ready! All we need to do is package it using npm and Bower:

$ cd my_module
$ npm init
# npm will walk you through creating your package.

$ bower init
# same...

$ git add --all && git commit
$ git push origin master

Now all we need to do to use our script is install…

$ npm install git://github.com/clusterfoo/my_module
$ bower install clusterfoo/my_module

… and require our module (will also work in the browser if you use browserify):

some_script.js

1var myModule = require('my_module');
2
3myModule.foo() // "Foo!"

Here’s a sample module I’ve put together for demonstration.

Comments? Click me!

A Namespacing Function for Client-Side Javascripts

Here’s a simple method you can use to namespace your client-side scripts. There’s many ways to do this, probably better ones. The point is to do it at all.

Suppose we’ve two scripts, foo.js and bar.js.

foo.js:

1function foo() { console.log("Foo!") };

bar.js:

1function bar() { console.log("Bar!") };

This is what most javascripts look like. This is why puppies die, the bad guys always win, and we can never have anything nice.

So, before we even worry about namespacing, let’s start with the essentials: always wrap it. We don’t want to muddy up our global namespace. Ever. Except when we do.

First rule of Javascript: Always wrap everything.

Just like they taught you in school. (Unless you went to one of those schools down south where they teach that the only sure-fire way to prevent name collisions is to never write code in the first place – yet here you are, so how’d that work out?)

foo.js

1// semicolon to avoid concatenation errors
2;(function() {
3    function foo() { console.log("Foo!") };
4})(); // <-- function calls itself.
5
6// Nice and clean global context.

Ok, now we’re ready for our namespacing function. This function creates a function that initiates a namespace:

appendToNamespace()

 1 function appendToNamespace(namespace, elementName, elementValue) {
 2    // If namespace has already been initialized, simply append new
 3    // element.
 4    if (typeof window[namespace] === "object") {
 5      window[namespace][elementName] = elementValue;
 6    }
 7
 8    // If namespace has not been initialized, but an initialization function
 9    // exists, copy its old value so that new initialization function may
10    // call it.
11    var oldNamespaceFunction = function(){};
12    if (typeof window[namespace] == "function") {
13      oldNamespaceFunction = window[namespace];
14    }
15
16    // Create function that intializes namespace.
17    if (typeof window[namespace] !== "object") {
18      window[namespace] = function() {
19        window[namespace] = {};
20        // Call old initialization function too.
21        oldNamespaceFunction();
22        window[namespace][elementName] = elementValue;
23      }
24    }
25  }

Now foo.js becomes:

foo.js

1;(function() {
2    appendToNamespace("YourNameSpace", "foo", foo)
3    function foo() { console.log("Foo!") };
4})();

Ditto for bar.js.

index.html

 1<html>
 2    <head>
 3        <script src="foo.js"></script>
 4        <script src="bar.js"></script>
 5        <script>
 6            // initialize YourNameSpace
 7            YourNameSpace();
 8        </script>
 9    </head>
10    <body>
11        <script>
12            // Now your scripts are available under the YourNameSpace
13            // object.
14            YourNamespace.foo(); // -> "Foo!"
15            YourNamespace.bar(); // -> "Bar!"
16        </script>
17    </body>
18</html>

Note that it doesn’t matter in what order initialization happens:

index.html

 1<html>
 2    <head>
 3        <script src="foo.js"></script>
 4        <script>
 5            // initialize YourNameSpace beofre second script is loaded.
 6            YourNameSpace();
 7        </script>
 8        <script src="bar.js"></script>
 9    </head>
10    <body>
11        <script>
12            YourNamespace.foo(); // -> "Foo!"
13            YourNamespace.bar(); // -> "Bar!"
14        </script>
15    </body>
16</html>

Of course, it’s possible to do this in such a way that initialization is not necessary at all. The reason I choose to it this way is because explicitly requiring the script to initialize the namespace makes it easier to write scripts that play nice across platforms (say, when using libraries like browserify – see my guide on sharing scripts across environments).

Comments? Click me!

Eigengoogle. How the Google PageRank Algorithm Works

– IF THIS MESSAGE IS SHOWING, EQUATIONS HAVE NOT RENDERED. IF YOU ARE ON RSS, VISIT THE ORIGINAL POST. JAVASCRIPT REQUIRED. –

While we’re on the subject of sorting things online, we might as well talk about Google: the 93-billion dollar company whose main export is taking all the things ever and putting them in the right order. If there’s one thing Google knows best, it’s sorting stuff.

I was curious how it all works, and it turned out really interesting, plus I got to learn a bit about Markov chains. It all starts with an algorithm called PageRank1. According to Wikipedia,

Pagerank uses a model of a random surfer who gets bored after several clicks and switches to a random page. It can be understood as a Markov chain in which the states are pages, and the transitions are the links between pages. When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection (the random surfer chooses another page at random).

The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix.

In this post I’ll try to break that down and provide some of the background necessary to understand Google PageRank.

Graphs as Matrices

A graph is a collection of nodes joined by edges. If the edges are arrows that flow in one direction, we call that a directed graph. A graph whose edges have each been assigned a “weight” (usually some real number) is a weighted graph.

A graph of n nodes can be represented in the form of an n x n adjacency matrix, M = [m_ij] such that m_ij is equal to the weight of the edge going from node j to node i:

[0, 1, 0, 0]
[1, 0, 2, 0]
[2, 1, 0, 1]
[0, 0, 4, 0]

Stochastic Matrices

The term “stochastic” is used to describe systems whose state can only be described in probabilistic terms (i.e: the likelihood of some event happening at any given time).

Scenario: Consider two competing websites. Every month, the first website loses 30% of its audience to the second website, while the second website loses 60% of its audience to the first.

If the two websites start out with 50% of the global audience each, how many users will each website have after a month? After a year?

This scenario can be represented as the following system:

P = [0.7, 0.6],    x_0 = [0.5, 0.5]
    [0.3, 0.4]

This is a Markov chain with transition matrix P and a state vector x_0.

The transition matrix is called a stochastic matrix; it represents the likelihood that some individual in a system will transition from one state to another. The columns on a stochastic matrix are always non-negative numbers that add up to 1 (i.e: the probability of at least one of the events occurring is always 1 – the likelihood of a user either staying on the same website, or leaving, is always 100%. He must choose one of the two).

The state after the first month is

-- EQUATION NOT RENDERED IN RSS OR WITH JAVASCRIPT DISABLED --

So, after the first month, the second website will have only 35% of the global audience.

To get the state of the system after two months, we simply apply the transition matrix again, and so on. That is, the current state of a Markov chain depends only on its previous state. Thus, the state vector at month can be defined recursively:

-- EQUATION NOT RENDERED IN RSS OR WITH JAVASCRIPT DISABLED --

From which, through substitution, we can derive the following equation:

-- EQUATION NOT RENDERED IN RSS OR WITH JAVASCRIPT DISABLED --

Using this information, we can figure out the state of the system after a year, and then again after two years (using the Sage mathematical library for python):

1P = Matrix([[0.70, 0.60],
2            [0.30, 0.40]])
3x = vector([0.5,0.5])
4P^12*x
5# -> (0.666666666666500, 0.333333333333500)
6P^24*x
7# -> (0.666666666666666, 0.333333333333333)

So it seems like the state vector is “settling” around those values. It would appear that, as EQ, EQ is converging to some such that EQ. As we’ll see below, this is indeed the case.

We’ll call this the steady state vector.

Eigenvectors!

Recall from linear algebra that an eigenvector of a matrix is a vector such that:

-- EQUATION NOT RENDERED IN RSS OR WITH JAVASCRIPT DISABLED --

for some scalar EQ (the eigenvalue). A leading eigenvalue is an eigenvalue EQ such that its absolute value is greater than any other eigenvalue for the given matrix.

One method of finding the leading eigenvector of a matrix is through a power iteration sequence, defined recursively like so:

-- EQUATION NOT RENDERED IN RSS OR WITH JAVASCRIPT DISABLED --

Again, by noting that we can substitute EQ, and so on, it follows that:

-- EQUATION NOT RENDERED IN RSS OR WITH JAVASCRIPT DISABLED --

This sequence converges to the leading eigenvector of EQ.

Thus we see that the steady state vector is just an eigenvector with the special case EQ.

Stochastic Matrices that Don’t Play Nice

Before we can finally get to Google PageRank, we need to make a few more observations.

First, it should be noted that power iteration has its limitations: not all stochastic matrices converge. Take as an example:

 1P = Matrix([ [0, 1, 0],
 2             [1, 0, 0],
 3             [0, 0, 1]])
 4
 5x = vector([0.2, 0.3, 0.5])
 6
 7P * x
 8# -> (0.3, 0.2, 0.5)
 9P^2 * x
10# -> (0.2, 0.3, 0.5)
11P^3 * x
12# -> (0.3, 0.2, 0.5)

The state vectors of this matrix will oscillate in such a way forever. This matrix can be thought of as the transformation matrix for reflection about a line in the x,y axis… this system will never converge (indeed, it has no leading eigenvalue: EQ).

Another way of looking at is by drawing its graph:

Using our example of competing websites, this matrix describes a system such that, every month, all of the first website’s users leave and join the seconds website, only to abandon the second website again a month later and return to the first, and so on, forever.

It would be absurd to hope for this system to converge to a steady state.

States 1 and 2 are examples of recurrent states. These are states that, once reached, there is a probability of 1 (absolute certainty) that the Markov chain will return to them infinitely many times.

A transient state is such that the probability is EQ that they will never be reached again. (If the probability is 0, we call such a state ephemeral – in terms of Google PageRank, this would be a page that no other page links to):

There are two conditions a transition matrix must meet if we want to ensure that it converges to a steady state:

It must be irreducible: an irreducible transition matrix is a matrix whose graph has no closed subsets. (A closed subset is such that no state within it can reach a state outside of it. 1, 2 and 3 above are closed from 4 and 5.)

It must be primitive: A primitive matrix is such that, for some positive integer EQ, EQ is such that EQ for all EQ (that is: all of its entries are positive numbers).

More generally, it must be positive recurrent and aperiodic.

Positive recurrence means that it takes, on average, a finite number of steps to return to any given state. Periodicity means the number of steps it takes to return to a particular state is always divisible by some natural number (its period).

Since we’re dealing with finite Markov chains, irreducibility implies positive recurrence, and primitiveness ensures aperiodicity.

Google PageRank

We are now finally ready to understand how the PageRank algorithm works. Recall from Wikipedia:

The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain in which the states are pages, and the transitions, which are all equally probable, are the links between pages.

So, for example, if we wanted to represent our graph above, we would start with the following adjacency matrix:

[0, 0, 0.5, 0,   0],
[0, 0, 0.5, 0.5, 0],
[1, 1, 0,   0,   0],
[0, 0, 0,   0,   0],
[0, 0, 0,   0.5, 0]

For the algorithm to work, we must transform this original matrix in such a way that we end up with an irreducible, primitive matrix. First,

If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again.

When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection.

        [0, 0, 0.5, 0,   0.2],
        [0, 0, 0.5, 0.5, 0.2],
S =     [1, 1, 0,   0,   0.2],
        [0, 0, 0,   0,   0.2],
        [0, 0, 0,   0.5, 0.2]

We are now ready to produce , the Google Matrix, which is both irreducible and primitive. Its steady state vector gives us the final PageRank score for each page.

The Google Matrix

The Google Matrix for an matrix is derived from the equation

-- EQUATION NOT RENDERED IN RSS OR WITH JAVASCRIPT DISABLED --

Where is an matrix whose entries are all 1, and is referred to as the damping factor.

If , then . Meanwhile, if all of the entries in are the same (hence, the original structure of the network is “dampened” by , until we lose it altogether).

So the matrix is a matrix that represents a “flat” network in which all pages link to all pages, and the user is equally likely to click any given link (with likelihood ), while is dampened by a factor of .

Google uses a damping factor of 0.85. For more on this, I found this paper.

tl;dr: the second eigenvalue of a Google matrix is , and the rate of convergence of the power iteration is given by . So higher values of imply better accuracy but worse performance.

With some moving stuff around, we can see that

-- EQUATION NOT RENDERED IN RSS OR WITH JAVASCRIPT DISABLED --

For all up to , which means that is indeed stochastic, irreducible, and primitive. Cool.

In conclusion,

Imgur


1. Actually, it all started with the HITS algorithm, which PageRank is based off of. More details here.

Comments? Click me!

All Articles