Statistics on reddit’s top 10,000 titles with NLTK

Drawing inspiration from this blog post on title virality, I wanted to investigate what makes these top 10,000 titles the best of their breed. Which are the best superlatives? Who or what is the most popular subject? Let’s start with some statistics:

  • On Feb. 03, 14:10:45 (UTC) the all-time top 10,000 submissions on reddit (/r/all) had a total of 82,751,429 upvotes and 62,655,532 downvotes (56.9% liked it).
  • 5.2 years between the oldest and newest submission
  • 8,331,382 comments. That’s about 833 comments per submission.
  • The #1 post has 26,758 – 4,882 = 21,876 points
  • The #10,000 post has 15,166 – 13,679 = 1,487 points
  • And now some graphs…
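
Before the graphs, here’s roughly the kind of NLTK pass that produced these counts. This is a sketch, not the exact script: it assumes a list named titles holding the 10,000 title strings, and it needs the NLTK tokenizer and POS tagger data downloaded.

from collections import Counter
import nltk

# Assumes `titles` is a list of the 10,000 title strings.
adjectives = Counter()
superlatives = Counter()
for title in titles:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(title)):
        word = word.lower()
        if tag == 'JJ':        # adjective, e.g. "new", "old", "good"
            adjectives[word] += 1
        elif tag == 'JJS':     # superlative adjective, e.g. "best"
            superlatives[word] += 1

print(adjectives.most_common(10))
print(superlatives.most_common(10))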

Adjectives – reddit loves “new”, “old”, “good” and “right”

Adjectives

Top Adjective, Superlative – “Best” is the best

Questions – reddit loves how?

Questions

What’s reddit talking about? People.

Or news, the president, man…

Reddit appreciates personal content about you, this, it and I.

Even NLTK doesn’t understand these…

I’m pretty sure you don’t need example links for these…

The top 10,000 seem to come mostly from 17:00 UTC and rarely from around 12:00 UTC

This isn’t exactly the probability of hitting the front page, as it’s not clear at what time the submission count is highest. But it’s something.

An apology

This is my first time using NLTK, and though I’m OK at coding, I most certainly have no idea how to parse natural language. Here’s hoping this was somewhat insightful.

I have no idea what I'm doing

Appendix

The Google App Engine Glass Ceiling

With the new billing arrangements, each and every paid GAE app costs at least $2.10 per week, which is supposedly $9 per month ($9.125 by my calculation), regardless of quota usage. This cost does cover whatever quotas your app consumes, and the regular free quotas are still “free”.

Now I have a GAE app that sends 70-80 emails per day, where the free limit is 100. I’d gladly switch over to the paid side of GAE just to be sure that if it ever passes the 100 mark I don’t have any failed email requests, but the price of that is $9 per month. GAE is extremely expensive for apps that just barely brush the end of their free quotas. In order to actually use the $9 per month minimum I’d have to send out 3,000 emails per day (at $0.0001 per email).
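
A quick back-of-the-envelope check of that 3,000 figure, using just the numbers above:

MONTHLY_MINIMUM = 9.0       # dollars per month
PRICE_PER_EMAIL = 0.0001    # dollars per email recipient
DAYS_PER_MONTH = 30

daily_budget = MONTHLY_MINIMUM / DAYS_PER_MONTH      # $0.30 per day
emails_per_day = daily_budget / PRICE_PER_EMAIL      # 3000 emails per day
print(emails_per_day)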

I don’t know if the free quota on email recipients is really low or if sending out an email is extremely cheap. Either way, GAE expects me to scale from 100 to 3000 while paying the price of 3000. Who knows if I’ll ever even reach that mark?

If Google keeps with this plan, I’m probably never going to start another GAE app that has a chance to grow. Every time I have a chance of hitting the quota limits I have two choices:

  • Pay Google and be screwed over for an indefinite amount of time until I reach the next milestone.
  • Migrate to a cheaper shared hosting option until I reach the next milestone.

Thanks, but no thanks. That’s the GAE glass ceiling.

Appendix

  • Other than this problem I do like GAE. It’s a shame I have to leave it.
  • I’ve made about 11 small Python GAE apps, only 2 of which ever reached the aforementioned glass ceiling.
  • This issue shouldn’t bother you if your app is already big enough to cost more than $9.
  • Maybe Google can’t bill less than $9 per month? I doubt it; Android apps can cost $0.99.
  • A proposed solution: Google takes $9 of credit at a time from your Google Wallet and eats quotas out of that. When the $9 runs out, it bills another $9. Sounds reasonable and “don’t be evil” to me. Another thing that would be nice is allowing multiple paid apps to feed from the same budget.

Precision, recall, sensitivity and specificity

Nowadays I work for a medical device company, where the big indicators of a medical test’s success are specificity and sensitivity. Every medical test strives to reach 100% in both criteria. Imagine my surprise today when I found out that other fields use different metrics for the exact same problem. To analyze this I present to you the confusion matrix:

Confusion Matrix

For example, say we have a pregnancy test that classifies people as pregnant (positive) or not pregnant (negative).

  • True positive – a person we told is pregnant that really was.
  • True negative – a person we told is not pregnant, and really wasn’t.
  • False negative – a person we told is not pregnant, though they really were. Oops.
  • False positive – a person we told is pregnant, though they weren’t. Oh snap.

And now some equations…

Sensitivity and specificity are statistical measures of the performance of a binary classification test:

sensitivity = true positives / (true positives + false negatives)

specificity = true negatives / (true negatives + false positives)

Sensitivity in yellow, specificity in red

In pattern recognition and information retrieval:

precision = (relevant and retrieved documents) / (retrieved documents)

recall = (relevant and retrieved documents) / (relevant documents)

Let’s translate:

  • Relevant documents are the positives
  • Retrieved documents are the ones classified as positive
  • Relevant and retrieved are the true positives.

Precision in red, recall in yellow

Standardized equations

  • sensitivity = recall = tp / (tp + fn), i.e. true positives out of all actual positives
  • specificity = tn / (tn + fp), i.e. true negatives out of all actual negatives
  • precision = tp / (tp + fp), i.e. true positives out of everything classified as positive
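
Or, as a quick Python sketch, with tp, fp, tn and fn being the raw confusion matrix counts:

def metrics(tp, fp, tn, fn):
    sensitivity = tp / float(tp + fn)   # a.k.a. recall
    specificity = tn / float(tn + fp)
    precision = tp / float(tp + fp)
    return sensitivity, specificity, precision

# Using the pregnancy test numbers from the cheating section below:
print(metrics(tp=8, fp=10, tn=80, fn=2))
# (0.8, 0.888..., 0.444...)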

Equations explained

  • Sensitivity/recall – how good a test is at detecting the positives. A test can cheat and maximize this by always returning “positive”.
  • Specificity – how good a test is at avoiding false alarms. A test can cheat and maximize this by always returning “negative”.
  • Precision – how many of the positively classified were relevant. A test can cheat and maximize this by only returning positive on one result it’s most confident in.
  • The cheating is resolved by looking at both relevant metrics instead of just one. E.g. the cheating test with 100% sensitivity that always says “positive” has 0% specificity.

More ways to cheat

A specificity buff – let’s continue with our pregnancy test, where our experiments resulted in the following confusion matrix:

               Test positive   Test negative
Pregnant             8               2
Not pregnant        10              80

Our specificity is only 88% and we need 97% for our FDA approval. We can tell our patients to run the test twice and only count double positives (e.g. two red lines), so we suddenly have 98.7% specificity. Magic. This would only be kosher if the test results are proven to be independent. Most tests probably aren’t (e.g. blood parasite tests that are triggered by antibodies may repeatedly give false positives from the same patient).

A less ethical (though IANAL) approach would be to add 300 men to our pregnancy test experiment. Of course, part of our test is to ask “are you male?” and mark these patients as “not pregnant”. Thus we get a lot of easy true negatives, and this is the resulting confusion matrix:

               Test positive   Test negative
Pregnant             8               2
Not pregnant        10             380

Voila! 97.4% specificity with a single test. Have fun trying to get that FDA approval, though; I doubt they’ll overlook the 300 red herrings.
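
A quick sanity check of those numbers, just repeating the arithmetic from the two tables above:

tn, fp = 80, 10
print(tn / float(tn + fp))          # 0.888... -> the original ~88% specificity

# Cheat 1: demand two (assumed independent) positives, so a false alarm
# needs two false positives in a row.
fp_rate = fp / float(tn + fp)
print(1 - fp_rate ** 2)             # 0.9876... -> ~98.7% specificity

# Cheat 2: add 300 men who are trivially true negatives.
tn2, fp2 = tn + 300, fp
print(tn2 / float(tn2 + fp2))       # 0.9743... -> ~97.4% specificity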

What does it mean, who won?

Finally the punchline:

  • A search engine only cares about the results it shows you. Are they relevant (tp) or are they spam (fp)? Did it miss any relevant results (fn)? The ocean of ignored (tn) results shouldn’t affect how good or bad a search algorithm is. That’s why true negatives can be ignored.
  • A doctor has to tell a patient whether they’re pregnant or not, or whether they have cancer. Each decision may have grave consequences, and thus true negatives are crucial. That’s why all the cells in the confusion matrix must be taken into account.

References

http://en.wikipedia.org/wiki/Confusion_matrix

http://en.wikipedia.org/wiki/Sensitivity_and_specificity

http://en.wikipedia.org/wiki/Precision_and_recall

http://en.wikipedia.org/wiki/Accuracy_and_precision

Python 3 Wall of Shame Updates

Earlier in December I was approached by Chris McDonough with a reddit PM asking if I could or would implement some kind of behavior regarding a “Python 2 only” classifier on the wall of shame. After some aggressive googling I found the original discussion in catalog-sig. The idea was to add a classifier that signified “the authors have no current intention to port this code to Python 3”. By declaring such an intent, Chris explained, a Python package should be erased from the wall of shame. I didn’t completely understand that intuition, but I still tried to apply myself to the effort of improving the WOS. So here’s what’s new:

  • Packages with the “Programming Language :: Python :: 2 :: Only” trove classifier will have a lock next to their package with a mouse-over explaining their intent (see the setup.py sketch after this list for how a package declares that classifier).
  • Packages that have an equivalent py3k package are now not erased from the wall but rather show a link to the equivalent package. This rightfully boosts the compatibles count by 4. Note that packages that would doubly boost the count are still erased (e.g. Jinja is erased because Jinja2 is in the top 200).
  • Packages that are Python 3 compatible but lack the trove classifier won’t stay red if brought to my attention. I’ve always stated the WOS can only be as good as PyPI, not better, and I keep hoping that in time PyPI will become more accurate, so this move saddens me a bit. To keep a bit of the spirit, the artificially green packages have a red triangle signifying the maintainer’s lack of trove classifiers (again with a relevant mouse-over).
  • The WOS is now written for Python 2.7 and migrated to the HRD, woohoo!
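
For reference, declaring that intent is just a trove classifier in a package’s setup.py. A minimal sketch with a made-up package name:

from distutils.core import setup

setup(
    name='examplepkg',          # made-up name for illustration
    version='0.1',
    py_modules=['examplepkg'],
    classifiers=[
        'Programming Language :: Python :: 2 :: Only',
    ],
)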

Please do contact me if there are any more inaccuracies or mistakes. I’m reachable at ubershmekel at gmail and by comments on this blog.

PS: we’re at 57/200, so maybe by this time next year we can have that Python 3 Wall of Superpowers party! Amen to that…

Google App Engine costs the same as any shared hosting

There are many good and bad things about GAE but this issue with the new pricing is just strange in my eyes:

Every paid app must pay Google at least $9 each month regardless of usage.

The main awesome thing about GAE has always been the pay-as-you-need pricing model. That concept is now completely shattered for a certain scale of apps. The Python 3 Wall of Shame needed a few extra DB writes to finish the day nicely, which would have cost roughly 10 cents a day. But now Google will be rounding that up to 30 cents a day.

Apps that grow to use $1 worth of quotas a month are much better off heading to some form of shared hosting for $5.50 a month until they hit that $9 a month necessity. That’s the case with the Python3WOS and another one of my apps. I can’t move the wall of shame as I don’t have the time to do the porting (locked in ಠ_ಠ), but that other app is simple enough. So Google will never find out whether it could have been an app that’s actually worth $9 a month.

YouTube going to fullscreen skips bug

Whenever I switch to fullscreen and back in this video, http://www.youtube.com/watch?v=lOTPCaItx3w for example, YouTube skips 2 seconds forward in time. Even while paused.

I found two relevant issues:

http://www.google.com/support/forum/p/youtube/thread?tid=2960543d59021a08&hl=en

http://www.google.com/support/forum/p/youtube/thread?tid=134f7450b6d08d3c&hl=en

The first one suggests disabling Flash’s hardware acceleration, which does work (right-click the video and then click Settings).

But why should we do that?

Also, did I mention I dislike losing all the video I buffered in 360p when YouTube auto-switches to 480p in fullscreen? I had to manually disable that too.

¯\_(ಠ_ಠ)_/¯

Android API Bugs – AudioRecord

When asking Android to record from the mic into a buffer, you have to do something like:

// RATE, CHANNEL_MODE, ENCODING, BUFFER_SIZE and audioBuffer are defined elsewhere.
mRecorder = new AudioRecord(AudioSource.MIC, RATE, CHANNEL_MODE,
        ENCODING, BUFFER_SIZE);
// Construction can "succeed" yet leave the recorder unusable, so check the state.
if (mRecorder.getState() != AudioRecord.STATE_INITIALIZED) {
    // fail and bail
}
mRecorder.startRecording();
// read() should return the number of bytes read - the bug below is that it returns zero.
bytesRead = mRecorder.read(audioBuffer, offsetInBuffer, readSize);
mRecorder.stop();
mRecorder.release();

Now the bug that drove me mad was that read() returned zero when I left the application (using the home button) and immediately returned to it. Luckily I was toying around with BUFFER_SIZE when I noticed

mRecorder.getState()

failed when BUFFER_SIZE was set to 0x100. I guessed it was too small, and indeed one must call getMinBufferSize() to find out the limits of AudioRecord instantiation, as documented.

By the way, don’t use getMinBufferSize() + 1, as that throws an IllegalArgumentException at AudioRecord construction:

12-10 21:49:33.359: E/AndroidRuntime(28950): java.lang.IllegalArgumentException: Invalid audio buffer size.

Using something like getMinBufferSize() + 100 does work at first, but when you use the home button to leave and return, it triggers the read()-returns-zero bug from time to time. Not always, but from time to time. Also make sure you release() the recorder, because if you forget, the next recorder will still construct successfully but its reads will fail.

I can’t begin to describe how frustrating finding that bug was.

Now, audio programmers should know they’re supposed to read a size that’s a divisor of the buffer size, yes. But that doesn’t excuse the API from misbehaving instead of throwing relevant exceptions and errors that are debuggable and documented. This was SDK level 8, by the way. Things like this make you feel as though these are the guys that wrote the API:

Eventually the working code will be pushed to the Android Tuner Google Code project for your enjoyment.

Duplicating Streams of Audio with Python

This morning I made a Python script that uses PyAudio to read from one audio device and pipe it to another; I call it replicate.py.
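
The core of it looks roughly like the sketch below. This is not the original replicate.py; the device indices and the audio format here are assumptions, so list your devices with p.get_device_info_by_index() and adjust.

import pyaudio

CHUNK = 1024                 # frames per read
RATE = 44100                 # sample rate in Hz
FORMAT = pyaudio.paInt16
CHANNELS = 2

p = pyaudio.PyAudio()
# Hypothetical device indices - enumerate yours with p.get_device_info_by_index().
source = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                input=True, input_device_index=1, frames_per_buffer=CHUNK)
sink = p.open(format=FORMAT, channels=CHANNELS, rate=RATE,
              output=True, output_device_index=3, frames_per_buffer=CHUNK)

try:
    while True:
        sink.write(source.read(CHUNK))   # blocking copy, hence the latency
except KeyboardInterrupt:
    pass
finally:
    for stream in (source, sink):
        stream.stop_stream()
        stream.close()
    p.terminate()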

This is a really old problem for me, ever since I first had 4.1 speakers and Winamp only played on the front two. Nowadays I just want VLC to play on both the TV and the computer speakers without switching between audio output modules in the preferences or fiddling with the default audio output in Windows 7.

PyAudio was really nice and easy to use; I just wish async I/O were supported so I could lower the latency a bit, as I’m getting 240 ms right now, which is very far from perfect.

Pendulums, WebGL and three.js

Here’s the waves pendulum three.js simulation I made.

So I wanted to simulate a magical waves pendulum to prove my point that the shapes are the result of a dead simple arithmetic progression. I was almost correct.

After testing, I saw that when the frequencies form an arithmetic progression we get the awesome patterns. The problem is that achieving such a feat by modifying the length of the strings alone is a bit harder. Here’s omega, the angular frequency, from HyperPhysics:

w = sqrt(g/L)

So all I had to do was choose omegas, increment them, and from that calculate the string lengths, as sketched below. I got mixed up and solved the problem in a much more complicated way.
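
A sketch of that simpler route: pick the omegas as an arithmetic progression and get each length from L = g / w². The base frequency, step and pendulum count here are made-up values, not the ones in the actual simulation.

g = 9.81          # gravitational acceleration, m/s^2
w0 = 3.0          # angular frequency of the longest string, rad/s (made up)
step = 0.1        # arithmetic progression step, rad/s (made up)
count = 15        # number of pendulums (made up)

for i in range(count):
    w = w0 + i * step
    length = g / w ** 2      # from w = sqrt(g/L)
    print("pendulum %2d: w = %.2f rad/s, L = %.3f m" % (i, w, length))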

Anyhow, by faking it (choosing my omegas with bogus L’s) I get a prettier result. Headache averted. I’m not sure these swing angles qualify as simple pendulums anyway.

WebGL and three.js are indeed awesome. They do have their gotchas, but I was just so impressed with http://lights.elliegoulding.com/ and other things in the three.js gallery. It’s amazing how simple and accessible OpenGL is now that it’s in the browser. The “hello world” of about 20 lines for a rotating cube was good, though I think it should include the WebGL detection in it.

Android API quirk number 5

Wow this is silly.

public void drawRect(float left, float top, float right, float bottom, Paint paint)

Usually it doesn’t matter if top is higher than bottom. Either way a rectangle is drawn.

UNLESS top is outside of the screen. Then no rectangle is drawn at all.

So you can have rectangles that are partially visible, but only if their top is greater than their bottom.

ZOMG that was a hard bug to figure out…