Google Research is publishing its massive database of word and phrase counts:
===
All Our N-gram are Belong to You
We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times.
Watch for an announcement at the LDC, who will be distributing it soon, and then order your set of 6 DVDs.
===
The team can be contacted at ngrams@google.com.
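To make the numbers in the announcement concrete, here is a minimal Python sketch of that kind of counting: tally every five-word sequence in a token stream and keep only those at or above a frequency cutoff, with a similar cutoff for single words. It is an illustration only, not Google's pipeline. The function names (`count_ngrams`, `frequent_ngrams`, `vocabulary`), the whitespace tokenization, and the toy thresholds are all assumptions made for the example; the real thresholds quoted above are 40 for five-word sequences and 200 for words.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def frequent_ngrams(tokens, n, min_count):
    """Keep only the n-grams that occur at least min_count times."""
    counts = count_ngrams(tokens, n)
    return {gram: c for gram, c in counts.items() if c >= min_count}


def vocabulary(tokens, min_count):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(tokens)
    return {word for word, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    # Toy corpus; plain whitespace splitting is an assumption of this sketch.
    tokens = "a b c d e a b c d e a b c d e f".split()

    # Toy cutoffs standing in for the announcement's 40 and 200.
    print(frequent_ngrams(tokens, n=5, min_count=3))
    print(sorted(vocabulary(tokens, min_count=2)))
```

An in-memory `Counter` like this only works for small inputs. At a trillion words the counts would not fit on one machine, so a real run would need some form of distributed or on-disk aggregation, which the announcement does not describe.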