Calculating Relativity

I have done some interesting work at Yahoo! over the past couple of months. One of the tasks was to calculate the relation, or relativity, between pieces of data!

Yahoo! has an excellent Content Analysis API that analyses content and returns it along with its keywords. So I applied the following logic to determine which data is related:

1. Count the occurrences of every keyword in the data and sort them in descending order. Take the top 3 keywords and search the database for data with those keywords.

2. An exclusion list: maintain a list of keywords for exclusion. These are keywords that are too generic to determine relativity, e.g. Australia, New Zealand etc. If the data contains only one keyword, “Australia”, repeated 100 times, you don't really know what it's about! It could be about anything Australian or in Australia. So it's too generic to determine relativity, and in such a case the relativity module is dropped completely.

If the data contains fewer than three keywords, and at least one of them is NOT in the exclusion list, then use those keywords to query the database for data with those keywords.

3. The data returned is still not necessarily all related. So how do you determine which data is more related to the current data than the rest?

We take the resulting data, count the occurrences of each of the keywords in it, and depending on that decide whether it's relevant or not.

4. Last but not least, the category of the data also plays a key role. So check that the data is in the same category as the current data.
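
To make the four steps concrete, here is a rough Python sketch of how they could fit together. It is not the code that ran at Yahoo!, just an illustration. It assumes the keywords have already been extracted (the Content Analysis API call is left out), and every name in it is made up for the example: the EXCLUDED set, pick_query_keywords, rank_related, the dict layout of a candidate, and the min_hits threshold. It also assumes that a generic keyword may still be used in the query as long as at least one specific keyword is present.

```python
from collections import Counter

# Hypothetical exclusion list: keywords too generic to signal relatedness.
EXCLUDED = {"australia", "new zealand"}


def pick_query_keywords(keywords, top_n=3, excluded=EXCLUDED):
    """Steps 1 and 2: choose the keywords to search with, or [] to skip."""
    counts = Counter(k.lower() for k in keywords)
    # most_common() sorts by descending count.
    ranked = [k for k, _ in counts.most_common()]
    # If every keyword is generic (or there are none), drop the lookup.
    if all(k in excluded for k in ranked):
        return []
    # With fewer than top_n keywords this simply returns all of them.
    return ranked[:top_n]


def rank_related(current_category, query_keywords, candidates, min_hits=1):
    """Steps 3 and 4: keep same-category candidates, order by keyword hits."""
    scored = []
    for item in candidates:
        # Step 4: only data in the same category as the current data.
        if item["category"] != current_category:
            continue
        # Step 3: count how often the query keywords occur in the candidate.
        counts = Counter(k.lower() for k in item["keywords"])
        hits = sum(counts[k] for k in query_keywords)
        if hits >= min_hits:  # assumed relevance threshold
            scored.append((hits, item["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored]


current = ["Sydney", "cricket", "Australia", "cricket", "Sydney", "cricket", "Ashes"]
query = pick_query_keywords(current)

# Stand-in for the rows the database search would return.
candidates = [
    {"id": 3, "category": "sport", "keywords": ["cricket"]},
    {"id": 2, "category": "travel", "keywords": ["Sydney", "Australia"]},
    {"id": 1, "category": "sport", "keywords": ["cricket", "cricket", "Sydney"]},
    {"id": 4, "category": "sport", "keywords": ["tennis"]},
]
related = rank_related("sport", query, candidates)
```

Here the data that mentions only “Australia” gets an empty query, so no related data is looked up for it, while the cricket data is matched against other sport data and the results are ordered by how often the query keywords appear in them.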

So, applying the above principles, the relativity seemed to work pretty well at least 99% of the time, but research still goes on to make it better!

If you have any other ideas, suggestions or comments, feel free to drop them in :)
