The fluid language of the Web

We prepared, as we had for the earlier dataset, the top-100K words list for the body stream for Apr10.  You can download it here.

We decided to take a closer look at how the top 100K lists changed between Jun09 and Apr10.  Our findings are interesting:

  • The union of the two word sets is just shy of 110K.  This means that roughly 10K words entered the top 100K and about as many fell out of it, a turnover of about 10% of the list.  That is a higher turnover rate than I expected.
  • Some words that are newly in the top list are what you'd expect (unigram log10 probability difference shown parenthetically):
    • espnlosangeles (21.88993), an ESPN satellite established during 2009
    • debate2010 (21.53613)
  • Some words took a predictable jump:
    • ipad (2.560667), a product introduced mid-year
  • Quite a few words newly in the mix are not conversational words:
    • childreplyhtml (22.09848)
    • focaladvid (21.76564)

Curious indeed.
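
As a rough illustration of the comparison above, here is a minimal Python sketch. It assumes each top list is available as a mapping from word to unigram log10 probability. The function name compare_top_lists, the toy words, and their values are made up for the example and are not taken from the actual data. It also assumes the difference is computed as new minus old, which is not stated above, and it only computes differences for words present in both lists, since how a word missing from one list gets scored is not described here.

```python
def compare_top_lists(old, new):
    """Compare two top-N word lists.

    old, new: dict mapping word -> unigram log10 probability
    (hypothetical input format, assumed for this sketch).
    """
    old_words, new_words = set(old), set(new)

    union = old_words | new_words      # every word seen in either list
    entered = new_words - old_words    # newly in the top list
    dropped = old_words - new_words    # fell out of the top list
    shared = old_words & new_words     # present in both snapshots

    # Assumed sign convention: positive means the word became more probable.
    deltas = {w: new[w] - old[w] for w in shared}

    return {
        "union_size": len(union),
        "entered": entered,
        "dropped": dropped,
        "turnover": len(entered) / len(new_words),
        "deltas": deltas,
    }


# Toy data only; these numbers are invented for illustration.
jun09 = {"the": -1.5, "web": -3.0, "ipad": -7.0, "zune": -6.0}
apr10 = {"the": -1.5, "web": -3.1, "ipad": -4.5, "debate2010": -6.5}

result = compare_top_lists(jun09, apr10)
print(result["union_size"], sorted(result["entered"]), sorted(result["dropped"]))
print(result["turnover"], result["deltas"]["ipad"])
```

With two lists of 100K words each and a union near 110K, the same set arithmetic gives about 10K words entered and about 10K dropped. A log10 difference of 2.56, as for ipad, corresponds to the probability growing by a factor of 10^2.56, or roughly 360 times.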

Comments (1)
  1. gwern says:

    > childreplyhtml

    This suggests the data is dirty in some respect, doesn't it? Humans don't use that sort of word; that's obviously some sort of HTML source fragment creeping into the n-grams.

