Perwad English Corpus

PEC currently contains thirteen million words drawn from two sets of text: top books from Project Gutenberg and Forbes' billionaire profiles. Here are the statistics.

Vocabulary size    Share in written English
                   by Oxford    by Perwad
10                 25%          26.1%
100                50%          54.8%
1,000              74.1%        79.2%
2,000              81.3%        85.4%
3,000              85.2%        88.7%
4,000              87.6%        90.7%
5,000              89.4%        92.2%
12,500             95%          96.6%
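
The page does not say how these shares were computed. A minimal sketch of one common approach, assuming the share for a vocabulary size N is the combined frequency of the N most frequent words divided by the total number of tokens, might look like the following. The function name `coverage` and the toy sentence are illustrative only and are not taken from PEC.

```python
from collections import Counter

def coverage(counts, sizes):
    """Return {n: share of all tokens covered by the n most frequent words}.

    Assumption: "share" means cumulative token frequency of the top-n words
    divided by the total token count. The source does not confirm this.
    """
    total = sum(counts.values())
    # Frequencies sorted from most to least common.
    ranked = sorted(counts.values(), reverse=True)
    return {n: (sum(ranked[:n]) / total if total else 0.0) for n in sizes}

# Toy text for illustration, not PEC data.
tokens = "the cat sat on the mat and the dog sat".split()
counts = Counter(tokens)
result = coverage(counts, [1, 2, 3])

for n, share in result.items():
    print(f"{n:>6} {share:.1%}")
```

On a real corpus the same function would be called with sizes such as 10, 100, 1000 and so on, after tokenizing the text (and lemmatizing it, if lemmas rather than word forms are counted).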

PEC on GitHub
1k_lemmas

10k_lemmas