Saturday, December 4, 2010

Automatic document classification with Alfresco, Part 2

In the first part of this article, I explained how you can use Lucene to search a document (Word, PDF, etc.) for specific keywords. We need this in order to identify a document's category automatically from its content.

We've chosen a simple approach to demonstrate the automatic classification extension: if a document contains the name of a category, then it belongs to that category. Other approaches are possible, such as assigning multiple keywords to a category. For example, if a document contains one of the words "java, .Net, c#...", assign it to the category "Software development". This is easy to implement once you have read and understood this article. How you implement it depends on your specific needs, and you might need a more advanced classification algorithm.
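
As a rough illustration of the multi-keyword variant, here is a minimal standalone sketch in plain Java. The class name KeywordClassifier, its methods and the keyword table are all hypothetical and not part of Alfresco. It also assumes the document text has already been extracted into a String, whereas the real extension searches the content through Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Alfresco: maps keywords to category names.
class KeywordClassifier {
    // Category name -> keywords that trigger it (insertion order is preserved).
    private final Map<String, List<String>> keywordsByCategory =
            new LinkedHashMap<String, List<String>>();

    void register(String category, List<String> keywords) {
        keywordsByCategory.put(category, keywords);
    }

    // Returns every category for which at least one keyword occurs in the text.
    // Matching is a case-insensitive substring test, which is deliberately naive.
    List<String> classify(String text) {
        String haystack = text.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<String>();
        for (Map.Entry<String, List<String>> entry : keywordsByCategory.entrySet()) {
            for (String keyword : entry.getValue()) {
                if (haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                    matches.add(entry.getKey());
                    break; // one hit is enough for this category
                }
            }
        }
        return matches;
    }
}
```

Because the test is a plain substring match, a keyword such as "java" would also match "javascript"; a real implementation would probably rely on Lucene's tokenization instead.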

The first question we might ask is: where does the code that customizes Alfresco go? In other words, which class should we extend or modify in order to add the automatic classification logic? If you want it applied across the whole repository, you should extend the implementation of "NodeService", especially the method "createContent". If you want it applied only for users of the JSF web client, you need to extend the managed bean of the "Add content" dialog:

 import org.alfresco.web.bean.content.AddContentDialog;  

Now we will create our own custom version of "AddContentDialog":

 /**  
  *  
  * @author Haltout Sohaib  
  */  
 public class CustomAddContentDialog extends AddContentDialog{ 
 }  

As with all other Alfresco dialogs, the business logic is implemented in the method:

 protected String finishImpl(FacesContext context, String outcome) throws Exception;  

That is the method we should override in order to add our custom classification logic, which will be executed right after the document node is created in the repository:

   @Override  
   protected String finishImpl(FacesContext context, String outcome) throws Exception {  
     //We fetch any change in the navigation outcome  
     String outc = super.finishImpl(context, outcome);  
     addCategories();  
     return outc;  
   }  

The method "addCategories()" ties everything together. It first fetches all the categories defined in Alfresco, then searches the content of the uploaded document for matching category names using the "SearchUtils" class we defined in the previous article, and finally adds each matching category to the created node. It relies on two helper methods, which we implement next.

We start with the method "getAvailableCategories()":

   private Map<String,NodeRef> getAvailableCategories(){  
     Map<String,NodeRef> categories = new HashMap<String,NodeRef>();  
     //Get the nodeService  
     NodeService nodeService = this.getNodeService();  
     //We fetch a collection of all available categories  
     Collection<ChildAssociationRef> assoc =   
          categoryService.getCategories(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE,   
           ContentModel.ASPECT_GEN_CLASSIFIABLE, CategoryService.Depth.ANY);  
     for(ChildAssociationRef child : assoc){  
       NodeRef node = child.getChildRef();  
       //Get the name of the category from nodeService  
       String name = (String) nodeService.getProperty(node, ContentModel.PROP_NAME);  
       //Add the category to the map  
       categories.put(name, node);  
     }  
     return categories;  
   }  

In the method "addCategoriesToNode()" we add the provided list of categories to the created node:

   private void addCategoriesToNode(ArrayList<NodeRef> foundCategoriesRef) {  
     //Get the nodeService  
     NodeService nodeService = this.getNodeService();  
     if(!nodeService.hasAspect(this.createdNode, ContentModel.ASPECT_GEN_CLASSIFIABLE))  
     {  
       HashMap<QName, Serializable> props = new HashMap<QName, Serializable>();  
       props.put(ContentModel.PROP_CATEGORIES, foundCategoriesRef);  
       nodeService.addAspect(this.createdNode, ContentModel.ASPECT_GEN_CLASSIFIABLE, props);  
     }  
     else  
     {  
       nodeService.setProperty(this.createdNode, ContentModel.PROP_CATEGORIES, foundCategoriesRef);  
     }  
   }  
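
The matching step that "addCategories()" performs between these two helpers can be pictured with the following standalone sketch. It is only an approximation under stated assumptions: ContentSearcher is a hypothetical stand-in for the "SearchUtils" class from the previous article, whose exact API is not repeated here, and category references are modeled as a generic type instead of NodeRef so that the example runs without Alfresco on the classpath.

```java
import java.util.ArrayList;
import java.util.Map;

// Hypothetical illustration of the matching step; not Alfresco code.
class CategoryMatchSketch {
    // Stand-in for SearchUtils (assumption): answers whether the uploaded
    // document contains the given keyword.
    interface ContentSearcher {
        boolean contains(String keyword);
    }

    // Keeps the reference of every category whose name is found in the document.
    // Returns an ArrayList because addCategoriesToNode() expects that type.
    static <T> ArrayList<T> findMatchingCategories(Map<String, T> available,
                                                   ContentSearcher searcher) {
        ArrayList<T> found = new ArrayList<T>();
        for (Map.Entry<String, T> category : available.entrySet()) {
            if (searcher.contains(category.getKey())) {
                found.add(category.getValue());
            }
        }
        return found;
    }
}
```

In the dialog itself, the map would come from "getAvailableCategories()" and the resulting list would be handed to "addCategoriesToNode()".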

We used the "categoryService" bean, so we need to add its getter and setter for dependency injection:

   private CategoryService categoryService;  
   /**  
    * Get the value of categoryService  
    *  
    * @return the value of categoryService  
    */  
   public CategoryService getCategoryService() {  
     return categoryService;  
   }  
   /**  
    * Set the value of categoryService  
    *  
    * @param categoryService new value of categoryService  
    */  
   public void setCategoryService(CategoryService categoryService) {  
     this.categoryService = categoryService;  
   }  

Finally, now that our bean is ready, JSF needs to be made aware of it. Open the file "faces-config-custom.xml" located in the "WEB-INF" directory and add our bean definition:

 <managed-bean>  
    <description>  
      Custom bean that backs the Add Content Dialog  
    </description>  
    <managed-bean-name>AddContentDialog</managed-bean-name>  
    <managed-bean-class>com.hsohaib.alfresco.CustomAddContentDialog</managed-bean-class>  
    <managed-bean-scope>session</managed-bean-scope>  
    <managed-property>  
      <property-name>categoryService</property-name>  
      <value>#{CategoryService}</value>  
    </managed-property>  
    <managed-property>  
      <property-name>nodeService</property-name>  
      <value>#{NodeService}</value>  
    </managed-property>  
    <managed-property>  
      <property-name>fileFolderService</property-name>  
      <value>#{FileFolderService}</value>  
    </managed-property>  
    <managed-property>  
      <property-name>searchService</property-name>  
      <value>#{SearchService}</value>  
    </managed-property>  
    <managed-property>  
      <property-name>navigator</property-name>  
      <value>#{NavigationBean}</value>  
    </managed-property>  
    <managed-property>  
      <property-name>browseBean</property-name>  
      <value>#{BrowseBean}</value>  
    </managed-property>  
    <managed-property>  
      <property-name>contentService</property-name>  
      <value>#{ContentService}</value>  
    </managed-property>  
    <managed-property>  
      <property-name>dictionaryService</property-name>  
      <value>#{DictionaryService}</value>  
    </managed-property>  
 </managed-bean>  

Note that the bean name must be "AddContentDialog"; otherwise Alfresco will keep using the default bean to manage the "Add Content" dialog.

That's it, folks. The extension should now be working. You can test it by adding documents that contain the names of multiple categories you have defined; they should all be classified automatically, without your intervention.
