Saturday, July 12, 2014

A quick and dirty algorithm for Maximum Common Substructure

While there is vast amount of literature dedicated to algorithms for Maximum Common Substructure (MCS) search, particularly with respect to molecular graph (see for instance: http://scholar.google.co.in/scholar?hl=en&q=maximum+common+substructure&btnG=), I wanted to quickly write an algorithm without reading any literature on the subject. And thus this algorithm was written, typically over the weekend.

First consider finding an MCS between two molecular graphs:

1. Convert the molecule into a bit representation. This is akin to generating fingerprint of the molecule, but in this case a fingerprint is generated for each Atom in the following way:

  • Each Atom is represented as a bit-set of 64 bits, that will be able to capture a maximum of seven bonds that are connected to the atom under consideration.
  • Each set of 9-bits represents a connection: 7 bits for storing atomic number of connected atom type, 2 bits for bond type (0-single, 1-double, 2-triple, 3-aromatic).  
  • The order in which the bits are stored in the bit-set is always in sorted order: first sorted on atomic number and then on the bond type.
2. Match the corresponding bit atom representations (generated above)  among the two molecular graphs, and build a pair list. This matching is done in two stage: the first stage involves exact bit pattern match, where as the second stage involves partial match. When doing the partial match, the atom pair match that has the maximum bit matches is taken. The pair list thus generated becomes the seed for 'growing' the graph that would hopefully be the maximum common substructure.

3. For each of the pair list generated in step 2, start growing the graph using a breadth first traversal method, at each stage checking if the corresponding expansion in the paired molecule is also possible. The pair stops growing, when no such expansion is possible. And the algorithm stops when there is no growth for any of the pairs in the list. The pair with the maximum number of elements then forms the maximum common substructure. Use this pair to next generate the template molecular sub-graph representing MCS.  

For finding MCS among a set of molecules: 

4. Find the molecule with least number of atoms from the set, and mark it as reference molecule. Follow steps 1 to 3 for the pair of reference molecule and every other molecule in the set. Instead of just picking the pair with maximum elements in step 3, save the complete list (L). Next, pick the pair list that has least number of elements as reference (Lp). For a member in this list (iLp), search the complete list (L), to see if every list has the member iLp, if yes add it to a new list (Lnew). Repeat this procedure for every member of list Lp. Finally, the list with maximum number of elements in Lnew is the MCS for the set of molecules. 

That is it. I have tested the algorithm on different sets of congeneric series of molecules, and it pretty much works for all. For non-congeneric set however, one needs to build a cluster algorithm to find the MCS in sub group of molecules, that is work for some other weekend. 

For now, an implementation of this is available as a scripting function within VLifeMDS (www.vlifesciences.com, visit https://portal.vlifesciences.com/ to buy a license). I will also hopefully make an implementation available with MeTA Studio soon.

  

No comments: