Once there lived a poor woodcutter. He used to cut trees in the woods. One day he was cutting wood on the bank of a river. His axe fell down into the river. The river was deep. He could not take his axe out. He sat on the bank and began to weep.
Mercury, the god of water, appeared. He asked the reason of his weeping. The woodcutter told the whole story. Mercury dived into the water and brought a golden axe. The woodcutter refused to take it. Mercury again dived and brought a silver axe. The woodcutter did not take it either. Then he brought an iron axe. The woodcutter took it gladly. Mercury was much pleased. He rewarded the woodcutter with the other two axes.
Let us find some keywords for the story. One simple approach is to choose the words that occur most frequently in the document as keywords.
The frequency list of the words in the document is given below. The list was prepared without considering the meaning of the words, so axe and axes were counted as different words.
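The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not necessarily how the original list was produced: the tokenizer (lowercase runs of letters) and the names `STORY` and `word_frequencies` are assumptions made for the example.

```python
import re
from collections import Counter

# The story from the top of the article.
STORY = """
Once there lived a poor woodcutter. He used to cut trees in the woods.
One day he was cutting wood on the bank of a river. His axe fell down
into the river. The river was deep. He could not take his axe out. He
sat on the bank and began to weep.
Mercury, the god of water, appeared. He asked the reason of his weeping.
The woodcutter told the whole story. Mercury dived into the water and
brought a golden axe. The woodcutter refused to take it. Mercury again
dived and brought a silver axe. The woodcutter did not take it either.
Then he brought an iron axe. The woodcutter took it gladly. Mercury was
much pleased. He rewarded the woodcutter with the other two axes.
"""


def word_frequencies(text):
    """Count surface forms only (assumed tokenizer: lowercase a-z runs).

    No stemming or lemmatization is applied, so 'axe' and 'axes'
    are counted as two different words.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


freq = word_frequencies(STORY)

# Show the ten most frequent words with their counts.
for word, count in freq.most_common(10):
    print(f"{word}: {count}")
```

Because the words are compared as plain strings, `freq["axe"]` and `freq["axes"]` are separate entries, which is exactly the limitation noted above.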
Words such as the, of, and to carry no information specific to the document. They are closer to grammar than to meaning: they are connecting words used to form sentences. Because they occur frequently in any English text, they should be excluded from the frequency list.
How do we come up with the common words in the English language?
Oxford has done a good job of identifying the common words in the English language. There are some remarkable findings in their research. The ten most common lemmas make up 25% of the corpus they built, and the 100 most common lemmas make up 50% of it. (A lemma is the base form of a word; for example, climbs, climbing, and climbed are all forms of the one lemma climb.) The list of 100 lemmas is given below.
Here are the frequent words after excluding those hundred words.
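The exclusion step can be sketched as a simple set filter. This is a rough illustration under stated assumptions: `COMMON_WORDS` is a small hand-picked subset of common English words (including the inflected form "was" for the lemma "be"), not the full 100-lemma list, and `story_counts` hard-codes the counts of the words that occur at least twice in the story so that the snippet stands on its own. The function name `keywords` and the `min_count` threshold are also hypothetical.

```python
# Assumed, illustrative subset of common English words; NOT the full
# 100-lemma list. "was" is included as an inflected form of "be".
COMMON_WORDS = {
    "the", "be", "is", "was", "to", "of", "and", "a", "an", "in",
    "he", "his", "it", "not", "on", "into", "with", "take", "could",
}

# Counts of the words that occur at least twice in the story,
# tallied from the story text (hard-coded here as an assumption).
story_counts = {
    "the": 15, "he": 7, "woodcutter": 6, "axe": 5, "a": 4,
    "mercury": 4, "to": 3, "of": 3, "was": 3, "his": 3, "and": 3,
    "take": 3, "it": 3, "river": 3, "brought": 3, "into": 2,
    "not": 2, "on": 2, "dived": 2, "water": 2, "bank": 2,
}


def keywords(counts, stop_words, min_count=2):
    """Drop common words, keep words seen at least min_count times,
    and return the survivors ordered from most to least frequent."""
    kept = {
        word: n for word, n in counts.items()
        if word not in stop_words and n >= min_count
    }
    return sorted(kept, key=lambda word: kept[word], reverse=True)


result = keywords(story_counts, COMMON_WORDS)
print(result)
```

With these inputs, the filter leaves only the content words: woodcutter, axe, mercury, river, brought, dived, water, and bank. A fuller version would map each word to its lemma before the lookup, so that forms such as "took" or "axes" are matched as well.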
Say these eight words (woodcutter, axe, mercury, river, brought, dived, water, bank) to any one of your friends. If the friend has ever read the story, the words will remind her of it. Otherwise, just ask any search engine!
This approach produced a good set of keywords, even though it missed significant words such as gold and silver, which are each mentioned only once in the story.