What is meant by Corpus and Vocabulary in Natural Language Processing?

A corpus is the entire set of documents under consideration. What counts as a document in Natural Language Processing depends on the task: the texts being analyzed could be entire journal articles or short movie reviews. Even a single sentence, such as one entry in a DataFrame, can be considered a document. The vocabulary is the set of all distinct words that appear anywhere in the corpus. For example, consider the following corpus of three documents:

  1. It is cold outside today.
  2. I love the beach.
  3. Pizza is for lunch today.

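The vocabulary would be {It, is, cold, outside, today, I, love, the, beach, Pizza, for, lunch}. Each word is listed only once, even though "is" and "today" each appear in two documents, so the corpus contains 14 word occurrences but the vocabulary has 12 entries.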
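The example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `build_vocabulary` is made up for illustration, words are taken to be runs of ASCII letters, and case is preserved so that the result matches the set listed above. Real pipelines often lowercase the text and use a proper tokenizer instead.

```python
import re

# The three example documents that make up the corpus.
corpus = [
    "It is cold outside today.",
    "I love the beach.",
    "Pizza is for lunch today.",
]


def build_vocabulary(documents):
    """Return the set of distinct words found across all documents.

    Hypothetical helper for illustration; assumes a word is any run of
    ASCII letters and that case matters ("It" and "it" would differ).
    """
    vocabulary = set()
    for document in documents:
        # Extract letter-only runs, which drops the trailing period.
        words = re.findall(r"[A-Za-z]+", document)
        # Adding to a set keeps each word once, however often it occurs.
        vocabulary.update(words)
    return vocabulary


vocabulary = build_vocabulary(corpus)
print(len(vocabulary))     # 12
print(sorted(vocabulary))
```

Using a set makes the "union of all words" idea explicit: repeated words such as "is" and "today" are stored only once.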
