Tuesday, November 30, 2010

Alfresco automatic document classification : Part 1

Alfresco is capable of handling multiple classifications, or hierarchies of classification. It's a very useful feature that can make your life a lot easier when looking for documents, especially those with no indexed content such as pictures, scanned documents, etc.
Classifying a document in Alfresco can be as easy as a few clicks in the browser; however, it can be a very time-consuming process if you are uploading many documents every day, or if you are migrating to Alfresco: imagine having to manually classify a few thousand documents!
If you are still classifying documents manually, analyzing their content, and sorting them into categories, you might be interested in finding out how you can extend Alfresco to automatically classify your documents for you.

Once a user uploads a document to the repository, our extension will fetch the list of all defined categories, then crawl the document looking for matches. For example, if a document contains the word 'Software', and there is a category named 'Software' defined in Alfresco, the document will be added to that category. This is the simplest approach; a more advanced one will be explained in later posts.
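
To give you an idea of the category-fetching step, here is a minimal sketch of how the category names could be collected. It assumes the CategoryService and NodeService are injected via Spring, uses the standard SpacesStore, and only walks the root categories; the full extension wiring will be covered in the next posts.

   import java.util.HashSet;
   import java.util.Set;

   import org.alfresco.model.ContentModel;
   import org.alfresco.service.cmr.repository.ChildAssociationRef;
   import org.alfresco.service.cmr.repository.NodeService;
   import org.alfresco.service.cmr.repository.StoreRef;
   import org.alfresco.service.cmr.search.CategoryService;

   public class CategoryNameCollector {

     // Assumed to be injected via Spring, as usual in Alfresco
     private CategoryService categoryService;
     private NodeService nodeService;

     public Set<String> getRootCategoryNames() {
       StoreRef spacesStore = new StoreRef(StoreRef.PROTOCOL_WORKSPACE, "SpacesStore");
       Set<String> names = new HashSet<String>();

       // Walk the root categories of the general classification hierarchy
       for (ChildAssociationRef assoc : categoryService.getRootCategories(
           spacesStore, ContentModel.ASPECT_GEN_CLASSIFIABLE)) {
         names.add((String) nodeService.getProperty(
             assoc.getChildRef(), ContentModel.PROP_NAME));
       }
       return names;
     }

     public void setCategoryService(CategoryService categoryService) {
       this.categoryService = categoryService;
     }

     public void setNodeService(NodeService nodeService) {
       this.nodeService = nodeService;
     }
   }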

That said, we will create a SearchUtils helper class that handles searching the document for keywords. To search all types of documents, we will use Apache Lucene, a powerful search engine already integrated in Alfresco, to look up the categories in the document.

The main method is:

   import java.io.File;
   import java.io.FileReader;
   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Set;

   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.DateTools;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.queryParser.ParseException;
   import org.apache.lucene.queryParser.QueryParser;
   import org.apache.lucene.search.Hits;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.Searcher;
   import org.apache.lucene.store.RAMDirectory;

   public static List<String> search(File file, Set<String> keyWords) {
     List<String> result = new ArrayList<String>();  
     RAMDirectory idx = new RAMDirectory();  
   
     try {  
       // Make a writer to create the index  
       IndexWriter writer =  
           new IndexWriter(idx, new StandardAnalyzer(), true);  
   
       // Add the Document Object  
       writer.addDocument(fileToDocument(file));  
   
       // Optimize and close the writer to finish building the index  
       writer.optimize();  
       writer.close();  
   
       // Build an IndexSearcher using the in-memory index  
       Searcher searcher = new IndexSearcher(idx);  
   
       // Find which categories this document belongs to
       for (String keyWord : keyWords) {
         if (search(searcher, keyWord)) result.add(keyWord);
       }  
       searcher.close();  
     } catch (IOException ioe) {  
   
       ioe.printStackTrace();  
     } catch (ParseException pe) {  
       pe.printStackTrace();  
     }  
       
     return result;  
   } 
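
As a quick illustration, calling it looks like this (the file path and keywords here are made-up examples):

   Set<String> keyWords = new HashSet<String>();
   keyWords.add("Software");
   keyWords.add("Finance");

   // Returns the subset of keywords found in the file, e.g. [Software]
   List<String> categories = SearchUtils.search(new File("/tmp/contract.txt"), keyWords);
   System.out.println(categories);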

It takes as arguments the document and a set of keywords to look up, and returns a list of the keywords (categories) found in the document.
In order for Lucene to process the document, we need to convert the file into a Document object recognized by Lucene, which is why we use another method, "fileToDocument", to handle the conversion. This method comes from Lucene's demos, which can be found in the source distribution on the official website:

   public static Document fileToDocument(File f)
       throws java.io.FileNotFoundException {

     // make a new, empty document
     Document doc = new Document();

     // Add the path of the file as a field named "path". Use a field that is
     // indexed (i.e. searchable), but don't tokenize the field into words.
     doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));

     // Add the last modified date of the file as a field named "modified". Use
     // a field that is indexed (i.e. searchable), but don't tokenize the field
     // into words.
     doc.add(new Field("modified",
         DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
         Field.Store.YES, Field.Index.NOT_ANALYZED));

     // Add the contents of the file to a field named "contents". Specify a Reader,
     // so that the text of the file is tokenized and indexed, but not stored.
     // Note that FileReader expects the file to be in the system's default encoding.
     // If that's not the case, searching for special characters will fail.
     doc.add(new Field("contents", new FileReader(f)));

     // return the document
     return doc;
   }
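
One thing worth noting from the comments above: FileReader always uses the platform's default encoding. If your documents may be stored in another encoding, a safer variant (assuming UTF-8 content here, plus the java.io.FileInputStream, java.io.InputStreamReader and java.nio.charset.Charset imports) is to build the Reader explicitly:

   // Variant of the "contents" line above, assuming the file is UTF-8;
   // an explicit Charset avoids relying on the platform's default encoding.
   doc.add(new Field("contents",
       new InputStreamReader(new FileInputStream(f), Charset.forName("UTF-8"))));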

And finally, the method "search", which builds a Lucene Query object, executes it, and returns true if any occurrences of the keyword are found:

   private static boolean search(Searcher searcher, String queryString)
       throws ParseException, IOException {

     // Build a Query object against the "contents" field we indexed above
     Query query = new QueryParser("contents", new StandardAnalyzer())
         .parse(queryString);

     // Search for the query
     Hits hits = searcher.search(query);

     // Examine the Hits object to see if there were any matches
     return hits.length() > 0;
   }
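
One caveat: if a category name contains Lucene query syntax characters (such as '+', '-' or ':'), QueryParser will misinterpret it or throw a ParseException. Since our query strings are category names rather than hand-written queries, escaping them first is a simple safeguard:

   // Escape Lucene query syntax characters in the category name before
   // parsing it, so that names like "C++" don't break the QueryParser.
   Query query = new QueryParser("contents", new StandardAnalyzer())
       .parse(QueryParser.escape(queryString));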

With this, our class "SearchUtils" is complete. In my next post, I will explain how to extend and customize Alfresco, and use what we learned today to automatically classify documents in the repository based on their content.
