Counter, from Python's collections module, is a simple way to count how often each word occurs in a given text. You can also use it to build a tag cloud. Let's look at an example to see how this works.
Let's count the word occurrences on a web page at a given URL. The same code can process any plain text as well: just pass the text to Counter as a list of words.
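For plain text, the same idea needs no network access. Here is a minimal sketch; the sample sentence and the variable names are made up for illustration:

```python
from collections import Counter

# Sample text, invented for this illustration
plain_text = "the quick brown fox jumps over the lazy dog the end"

# split() with no arguments breaks the text on any whitespace
plain_counter = Counter(plain_text.split())

# The two most frequent words, each paired with its count
top_two = plain_counter.most_common(2)
print(top_two)
```

Here "the" comes first with a count of 3, since every other word appears only once.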
Let's look at the code for the web page case now:
# Imports (Python 3)
>>> from urllib.request import urlopen
>>> from collections import Counter
# Point to the website you want to fetch
>>> loc = urlopen("http://www.lalitbhatt.net")
# Read the response and decode the bytes into text
>>> text = loc.read().decode("utf-8", errors="replace")
# Build the counter from the whitespace-separated words
>>> words_counter = Counter(text.split())
# Show the 10 most common words. You can pass any number as the parameter.
# Passing no number returns the counts for all the words
>>> words_counter.most_common(10)
Run on a raw web page, this will not show a meaningful result, because the counts are dominated by HTML tags and markup. You can use a library such as BeautifulSoup, or a regular expression, to strip the HTML tags. You might also want to build a list of common words (stop words) and strip those out before drawing any meaningful inference.
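One possible way to clean the page, sketched with only the standard library: a regular expression removes the tags, and a small hand-picked set filters out common words. The sample HTML, the `STOP_WORDS` set, and the `count_words` helper are all assumptions made for illustration. A regular expression is a rough tool for HTML, so a parser such as BeautifulSoup is likely to handle real pages more reliably.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def count_words(html):
    """Strip tags from html and count the remaining non-stop words."""
    # Drop script and style blocks together with their contents
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    # Replace every remaining tag with a space
    text = re.sub(r"<[^>]+>", " ", html)
    # Lowercase, then keep runs of letters and apostrophes as words
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

# Sample page, invented for this illustration
sample_html = """
<html><head><style>p { color: red; }</style></head>
<body><h1>Python Counter</h1>
<p>The Counter is a part of the Python collections module.</p>
</body></html>
"""

words_counter = count_words(sample_html)
print(words_counter.most_common(2))
```

To apply this to the fetched page, pass the decoded page text to `count_words` in place of `sample_html`.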