Saturday, December 4, 2010

Automatic document classification with Alfresco, Part 2

In the first part of this article, I explained how you can use Lucene to query a document (Word, PDF, etc.) and find matches for specific keywords. We needed that in order to identify a document's category automatically from its content.

We've chosen a simple approach to demonstrate the automatic classification extension: if a document contains the name of a category, then it belongs to that category. Other approaches are possible, such as assigning several keywords to a category. For example, if a document contains one of the words "java", ".Net", "c#" and so on, assign it to the category "Software development". That variant is easy to implement once you have read and understood this article. How you implement it depends on your specific needs, and you might need a more advanced classification algorithm.
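To make the two rules concrete, here is a minimal sketch in plain Java. It is only an illustration: the class `KeywordClassifier`, its methods and the sample rule are hypothetical names that are not part of Alfresco or Lucene, and it uses naive case-insensitive substring matching on already extracted text instead of the Lucene query described in part 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Alfresco API.
class KeywordClassifier {

    // Category name -> keywords that point to it (sample values only).
    static Map<String, List<String>> sampleRules() {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        rules.put("Software development", Arrays.asList("java", ".Net", "c#"));
        return rules;
    }

    // Returns every category whose own name, or one of its keywords,
    // occurs somewhere in the document text (case-insensitive).
    static List<String> classify(String text, Map<String, List<String>> rules) {
        String haystack = text.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, List<String>> rule : rules.entrySet()) {
            String category = rule.getKey();
            // Rule 1: the document contains the category name itself.
            boolean hit = haystack.contains(category.toLowerCase(Locale.ROOT));
            // Rule 2: the document contains one of the category's keywords.
            for (String keyword : rule.getValue()) {
                if (hit) {
                    break;
                }
                hit = haystack.contains(keyword.toLowerCase(Locale.ROOT));
            }
            if (hit) {
                matches.add(category);
            }
        }
        return matches;
    }
}
```

Calling `KeywordClassifier.classify(text, KeywordClassifier.sampleRules())` returns the list of matching category names, which is empty when nothing matches. Note that substring matching is crude: the keyword "java" would also match "javascript", so a real implementation would probably want token-based matching, which is what the Lucene query gives you.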