Friday, December 31, 2010

On Wikipedia

I have certainly found Wikipedia to be the most useful resource for just about any topic, but I was wondering if there was any useful way of quantifying this.  Here's my attempt.

The utility of a resource is related to what you get from reading it, taking into account both correct and incorrect information.  Wikipedia skeptics would point out that Wikipedia has much more incorrect information than a more "scholarly" source, even measured per unit of information in the resource, but I'd argue that this isn't the correct measure.  Instead, I'd argue that the best--or at least a good--way to measure the usefulness of a resource is to think about what happens when you attempt to find a specific piece of information in it.

Let's say that you're looking for some fact A.  If you let the total information of a resource R (weighted by usefulness*) be I, and the total information in the world (again, weighted by usefulness) be T, then the probability that A is in R is I/T.  If it's not there, the usefulness of the resource is 0; if it is, then let's say that the usefulness is X if the information is correct, and -Y if it's incorrect (with Y positive, so that being misinformed counts against the resource).  Let the probability that a random fact, weighted by usefulness, is correct in a given resource be P.  Then, the expected value of looking up A in R is (I/T)*(P*X-(1-P)*Y).  This simplifies to I*[P(X+Y)-Y]/T.  T, however, is the same for every resource, so without loss of generality I'll absorb it into the unit: I'll call the unit of usefulness the UN and scale it so that the factor of 1/T drops out, meaning that the usefulness of a resource is I*[P(X+Y)-Y] (measured in UN).
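For anyone who'd rather check the algebra numerically, here is a minimal Python sketch of the formula above.  The function names and the sample inputs are my own illustrative choices, not part of the argument, and the sketch assumes Y is a positive penalty that gets subtracted, as the formula does.

```python
def usefulness_expanded(I, P, X, Y):
    """Unsimplified form with the constant 1/T dropped: I * (P*X - (1-P)*Y)."""
    return I * (P * X - (1 - P) * Y)


def usefulness_simplified(I, P, X, Y):
    """Simplified form from the text: I * [P*(X+Y) - Y]."""
    return I * (P * (X + Y) - Y)


# Arbitrary sample inputs, chosen only to exercise both forms.
sample = dict(I=10, P=0.9, X=1.0, Y=2.0)
expanded = usefulness_expanded(**sample)
simplified = usefulness_simplified(**sample)

# The two forms should agree up to floating-point rounding.
print(expanded, simplified)
```

Note that both forms reduce to I*X when P=1, and to -I*Y when P=0, which is a quick sanity check on the signs.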

Now, let's try to compare two resources with the above equation.  I'll attempt to compare Wikipedia with the Encyclopedia Britannica.  Let's say, for the sake of argument, that Britannica has no errors (i.e. P=1), and that Wikipedia has 1 error in every 100 pieces of information (a figure that I think is way too high--articles contain thousands of pieces of information and most don't contain any errors), i.e. P=.99.  Wikipedia is about 25 times as long as Britannica (yes, this is according to Wikipedia; I'm willing to take the chance that it's wrong); this number will likely double every few years for a little while, but let's even keep it constant at 25.  Then, taking Britannica's I to be 1, the usefulness of Wikipedia is 25[.99*X-.01*Y], and the usefulness of Britannica is X.

I would normally assume that if we let X be normalized to 1, then Y would be about 3 (which is to say that if you were given 3 correct pieces of information and 1 incorrect one, you'd be breaking even); this would mean that Wikipedia comes out to 25[.99-.03]=24 UN, with Britannica at 1 UN--not even close.  But let's see, for the sake of argument, what Y would have to be for them to be equal.  Again normalizing X to 1, we get that 25[.99-.01*Y]=1, or Y=95.  So the break-even point would be if a single incorrect piece of information cancelled out 95 equally useful correct ones, meaning that a person who received 1 incorrect piece of information and 94 equally useful correct pieces were getting a bad deal.

Remember, this is all using assumptions that I would guess are not fair to Wikipedia; in addition to those stated above, I would expect that while Wikipedia is currently 25 times as big as Britannica, this number is not weighted by usefulness, and that if one were to weight it by usefulness (as one should, though it's hard to do quantitatively without knowing things like aggregated browsing history) it'd be much larger--possibly into the hundreds.
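The arithmetic in this comparison is easy to reproduce.  The following is a rough Python sketch using the numbers assumed above (Britannica's size normalized to 1, Wikipedia at 25 times that, X=1, Y=3); the helper names are hypothetical and exist only for this check.

```python
def usefulness(I, P, X, Y):
    """Usefulness in UN: I * (P*X - (1-P)*Y), with Y a positive penalty."""
    return I * (P * X - (1 - P) * Y)


def break_even_penalty(I, P, X, target):
    """Solve I * (P*X - (1-P)*Y) = target for Y (requires P < 1)."""
    return (P * X - target / I) / (1 - P)


# Assumptions taken from the comparison above.
britannica = usefulness(I=1, P=1.0, X=1, Y=3)
wikipedia = usefulness(I=25, P=0.99, X=1, Y=3)

# Penalty at which Wikipedia would merely tie Britannica.
y_break_even = break_even_penalty(I=25, P=0.99, X=1, target=britannica)

print(f"Britannica: {britannica:.2f} UN")
print(f"Wikipedia:  {wikipedia:.2f} UN")
print(f"Break-even Y: {y_break_even:.2f}")
```

Up to floating-point rounding, this should give 1 UN, 24 UN, and a break-even Y of 95, matching the figures in the text.  Changing the size ratio or P shows how sensitive the conclusion is to each assumption.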

But this is all something that should be intuitively obvious for someone not biased by the prudishness of tradition--if you actually want to know something, nothing compares to Wikipedia.  I recently wanted to get a sense of Colorado senator Michael Bennet; campaign websites for both him and his opponent would obviously be biased (and short on facts), and Britannica didn't even have an article on him.  Sure, it's possible that the Wikipedia article misspelled something, but if I had opted for other resources I wouldn't have learned what his stances on major issues were.  Wikipedia is the best, most useful resource there ever has been.


*: What I mean by this is just that we care more that it correctly states a US Senator's political party--an often desired fact--than that it correctly states the year that the Canon PowerShot A470 was first made.  If you want to define this mathematically, just weight each piece of information by the product of how often it is desired and how important it is that it be known correctly, and have the total usefulness of all information normalized to 1.  I'll follow this convention throughout the post.
