SciFed Drug Delivery Research Journal

An Improved Similarity Searching using Bit-indexing and Enhanced Merge Search Strategy

Research Article

Received on: August 21, 2017

Accepted on: August 21, 2017

Published on: September 20, 2017

1Salih Abdullah Ahmad, *Nurul Hashimah Ahamed Hassain Malim

*Corresponding author: Ahmad S A & Malim N H A H, School of Computer Sciences, UniversitiSains Malaysia, Penang, Malaysia. E-mail:; Tel: (+604)6534645


        In studies related to large data handling such as those chemical databases, several problems and deficiencies in the system were highlighted. One of them is the low-performance efficiency of Chemoinformatics applications, such as Similarity Searching to traverse through the whole database range. Previous studies have shown that several essential subjects related to Similarity Searching algorithm have been designed to overcome the low performance. However, it comes with a set of limitation. This proposal utilizes the Bit-indexing method to overcome this problem and maintain the reliability of the system. The Bit-indexing idea was represented by reducing the number of elements for each compound array of 2D fingerprint representation, which led to an accelerated Similarity Searching algorithm of huge chemical databases. To evaluate the new technique, two benchmark databases MDDR and ChEMBL were used with four different new targets at 25%, 50%, 75%, and 100%. In both databases, the Bit-indexing Similarity Searching was shown to be more efficient than the normal Similarity Searching module which was implemented in the Chemoinformatics applications. Finally, it can be claim that this method has been proven to accelerate the Similarity Searching algorithm by transforming the fingerprint representation to bit-indexing representation.


              Chemoinformatics; Similarity Searching; Bit Indexing ; Merge Search Strategy



        Chemical informatics (or Chemoinformatics) is a branch of chemistry sciences that specializes in computational approaches to handle chemical problems specifically in the treatment of chemical structural information [1]. It consists of the design, analysis, organization, management, creation, retrieval, dissemination, visualization, as well as the use of chemical information and molecular descriptors to perform an extremely important role in all these processes. It provides the fundamental tool to transform chemical information into a numerical code that is compatible with the use of informatics procedures [23]. It also includes several arithmetical ways which have been formulated via continuous researches to simplify the means of making drug at the best possible cost and time [4]. 
         One of the widely used methods for detecting a class of chemical compounds which affect a quantifiable biological response for any available standard [5] in Chemoinformatics is called High-Throughput Screening (HTS). This method is efficient but with the increasing size of the database, the cost of this process makes it less viable. so, Virtual screening (VS) is later introduced as an alternative to HTS and it has become one of the significant applications in Chemoinformatics. It assists in offering the starting point of pharmaceuticals development instead of the high-priced HTS. 
        VS are known as a type of arithmetical techniques that is adopted as a cost-effective complement for high throughput  screening. It can be achieved by executing appropriate chemical similarity methods. VS functions by classifying huge databases of chemical compounds into various drug types based on compounds possibilities of binding with a particular drug target [6]. The classification is done by discovering the similarity between chemical compounds. It includes a variety of arithmetical techniques that allow the reduction of enormous chemical databases through synthesizing processes by using computer application [7]. 
Similarity Searching (SS) is one of the virtual screening techniques that have been widely investigated for several years. This technique has been developed into a useful tool in most applications in various areas and identified as the simplest technique to determine a similarity between any two chemical compounds [8]. In addition, SS is the perfect solution when there is an enormous size of databases, as well as if there is an increase in data complexity [9]. 
        Despite SS confirmed worthiness, it suffers from increasing time consumption with an enormous increase for databases elements. The great demand for chemical databases has become a major problem in the current settings, where there are many colossal libraries, which have billions of molecules, and may be reaching trillions of compounds in some virtual combinatorial libraries used in the drug industry. Accordingly, similarity assessments are often unable to keep pace with the increase in database size. This issue has led to clear time consumption, and this will be an impact on similarity assessment [4] . Hence, with the growth in size of available chemical databases, any advancement of SS will surely take a considerable execution time. 
There are numerous studies looking at the potential of accelerating the SS methods [10]. Often, researchers suggest parallel implementation to accelerate SS by enforcing similarity calculation on the Graphic Processing Unit (GPU) [11, 12]. However, we anticipate a way of achieving this without venturing into the parallel implementation of the algorithm. Hence, this study aims to improve on the similarity searching algorithm by minimizing time factor consumption without the need of parallelism. 
        After taking a deep look at the chemical representation, we found that it could be transformed into another representation that could traverse easily for similarity searching. This representation is called BitIndexing and the traversing method used to perform similarity searching on the representation is called the merge subroutine. The advantage of the Bit-indexing idea that is represented by reducing the number of elements for each compound array shall lead to an accelerated similarity searching algorithm of huge chemical databases.
        This work focuses on transforming the fingerprints of chemical databases (MDDR and ChEMBL) into bitindexing representation that would reduce the number of elements to be handled by the data structure and later adapted by merging subroutine procedure into the similarity searching algorithm for a pair wise bit-indexing representation comparison. The effectiveness is then evaluated in terms of execution time and elements size by comparing the normal similarity searching algorithm performance to the modified (by adaptation) similarity searching algorithm with the increasing size of the chemical databases. 
Background and Related Work 
         This section will look into two distinct topics. The first is related to the conventional similarity searching that is widely used in Chemoinformatics and its components. Whilst, the second look at indexing method applicable to the chemical representation available in this domain to be used as an alternative to SS. 
Similarity Searching 
       SS as described earlier, works by screening database for compounds that is similar to the query being sought. It involves a comparison of the whole structure of the query (hereafter query compound) to the structures of the compounds in the database (or hereafter database compounds) according to similarity quantification [13, 14]. But, the query compound and database compounds must be represented by an exactly similar system [15]. The representation system of chemical compounds is referred to as molecular descriptors. The degree of structural commonality between molecular descriptors being compared is ascertained by a similarity score produced after applying similarity coefficients to pair wise comparison between the query compounds to each database compound. This will allow the ranking of database compounds in certain orders (increasing or decreasing) to the query compound according to the score where compounds that reside on the top of the list are its closest neighbor [14 , 16]. This list is often referred to as ranking list to the user [17] where the effectiveness of SS is being measured using recall and precision at a certain cut-off, usually 1% of the database size [18]. 
Molecular Descriptor and Coefficient 
         The concept underlying SS is the Similar Property Principle [19] which states that compounds with similar structures tends to exhibit similar bioactivity. The similar properties can be quantified using similarity measures, which is a combination of a molecular descriptor and a similarity coefficient. Hence, the performance of SS would be tightly influenced by the ability of the descriptor to extract as much meaningful information to characterize the compound and also the ability of the coefficient to manipulate the information of the descriptor. At the moment the best molecular descriptor i.e. fingerprints is ECFP4 which is a binary fingerprint from the Scitegic's Pipeline Pilot software and the best coefficient is always the industrial standard Tanimoto. These similarity measures has reached the edge of stability where there are hardly any arguments being raised questioning their accuracy. However, the highlight is now shifted to accelerations of SS as a whole. A single SS on a 100K compound database took 8.4 seconds and 22.4 seconds for 650K compound database. This value will keep increasing with the size of database and especially with the addition of the number of SS that needs to be executed. Note that, this is only taking the conventional SS in consideration. There are other variations of enhanced SS which add more processing steps such as Similarity Fusion [20] Group Fusion [21]. Turbo Similarity Searching [6, 22] etc that would further extend the execution time. These have been the reasons for Chemoinformaticians to venture into the adaptation of more powerful infrastructures such as the GPU.
          Molecular descriptors are representation systems that capture the information of atoms and bonds that made up molecules as a machine readable format. These could be in the form of 1-,2- or 3-dimensional as shown in Figure 1. The simplest representation is the 1-dimensional (1D) representation that expresses molecules based on number of atoms and their bonding arrangement [23] into a single molecular formula form from which many physiochemical properties can be derived such as the molecular weights, logP value and molar refractivity. However, an overall value derivation would not possibly allow a specific extraction of the number of occurrence of atoms types, what more their atomic configurations that are the ensembles of structural characteristics of a molecule [24]. The 2-dimensional (2D) descriptors tackle this by utilizing details about connectivity of particular atoms in molecules expressed by a structural formula whilst the 3-dimensional (3D) of molecule is achieved by putting atoms into 3D Cartesian space and by characterizing their common connectivity in 3D space [24].
Figure 1: Show the Types of Molecular Descriptors 

         Within the 2D representation, atoms are denoted by their chemical symbols and their link is represented by chemical bonds; one hyphen (-) for single bond, two hyphens (=) for double bond and three hyphens (≡) for triple bond. In comparison to the previous 1D descriptors, 2D descriptor offers a great deal of information. It is easy to extract e.g. connectivity of atoms, number of rotatable bonds (single bonds) or molecular link index or Kier molecular form indicator [25, 26]. The 2D representation system has given birth to a new descriptor known as 2D-fragment based fingerprints (often referred to as fingerprints). It is a fixed size representation of structural attributes for compounds based on bits representation [27]. In the most common method, the structure of a molecule is split into sub structural fragments and collected into a string (binary fingerprints) or vector (counts fingerprints or holograms). For the case of binary fingerprints, if the fragment is available in the molecule, the corresponding string element gets 1, or else, it gets 0. Holograms or counts on the other hand, take note on the number of occurrence of a fragment rather than just their presence in the molecule. The emergence of 2D fragment-based descriptors has allowed similarity search systems and classifications to be implemented effectively [ 27] Binary fingerprints are particularly beneficial, as there are maximum effectiveness in computer algorithms which use binary strings [28]. Figure 2 shows a simple illustration of the binary fingerprint derivation [29]
Figure 2: Fingerprint Representation of Chemical Compound [29].
      Similarity coefficients are mathematical formulae that enable the process of mapping pairs of well-matched molecular descriptors to real values as a measurement of the degree of equality between them [ 30]. Similarity coefficients are often classified into four main categories i.e. Correlation, Probabilistic, Association and Distance. Correlation coefficient calculates the degree of inter-correlation  between collections of values which distinguish two of the descriptors [31]. Distance coefficients measure the degree of differences between a pair of descriptors [32]. Probabilistic coefficients, while not commonly used in evaluating molecular similarity, put emphasis on the distribution of the frequencies of descriptors to the members of a data set, providing additional significance to a match on an irregularly occurring variable [33]. Association coefficients are widely used coefficients for VS that highlight to which degree pair of descriptors is identical. These types of coefficients are also known as coefficient of identity [34] and they are widely used with binary representations. They calculate how strong the relation between two compounds is; where value of 0 implies no relationship, and value of 1 or more indicates the existence of relationship (value of some coefficients may exceed 1). 
      Association's coefficients have been widely used for calculating resemblance between compounds represented by binary fingerprints. Coefficients that belong to this type are Simple Matching, Tanimoto, Cosine, Dice, Euclidean, Soergel, Rusell-Rao and Forbes coefficients [ 35]. Tanimoto coefficients (Tc) in particular have been the industrial standard for VS, especially SS. It has become the coefficient of choice to calculate similarities for binary fingerprints. Tc uses the ratio of the intersecting group (or overlapping group) to determine the resemblance in the groups [36]. As mentioned earlier, binary fingerprint is simply a bit string [37] where "0" indicates the absence of a fragment while "1" indicates a presence of a fragment. Hence, in the case of binary fingerprints, Tc uses a common presence (i.e. 1) of fragments in both fingerprints in comparison to quantifying the degree of resemblance between them. The similarity calculations of binary fingerprints are based on four terms [38]. Let A and B be the compounds under comparison represented by binary fingerprints (bit strings) in which the similarity between them (SA, B) is described in Equation 1. 
       Where a and b represent the numbers of bit locations set to "1" in compound A and B, respectively. Whilst, c represents the number of common bits set to "1" in both compounds A and B [39]. The Tc value produced is also referred to as a similarity score and it ranges from 0 to 1 where scores close to 0 indicate dissimilarity between the compounds and scores close to 1 indicate similarity between compounds [40]. 
Drugs Database 
      There are many commonly used chemical databases in Chemo informatics for experimental work on either drug candidate searching or reactions predictions. Amongst them are Drug Bank, PubChem, ChEMBL, MDDR etc. These databases are either categorized under commercialized or public databases. We reviewed two databases that were used in our work. 
       MDDR or MDL Drug Data Report is a commercialized database that contains 185,844 compounds tested for potential therapeutic activity that were compiled from the patent literature, published documents, congress proceedings and meeting reports [41]. It contains both the structures of compounds and their derivatives. Compound linked information includes bioactivity, therapeutic classification, detailed patent information, literature references, company names, compound codes, generic names, trade names and development phase [42]. 
        ChEMBL is a common chemical database that is maintained by the European Bioinformatics Institute (EBI) involving functional, binding and ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) information for a huge number of drug-like bioactive compounds. The data are manually extracted from the main published literature on a standard basis, in addition standardized and curated to maximize their quality across a wide drug-discovery research problems and range of chemical biology. The database includes 5.4 million bioactivity measurements for approximately 1 million compounds and 5200 protein targets [43]. 
Indexing and Searching 
        Indexing is the work of classifying and supplying an index to ease the retrieval, classification, assortment, compartmentalization, and categorization of objects. The indexes include several kinds such as unique indexes, non-unique indexes, clustered indexes and non-clustered indexes. The Unique indexes are indexes, which assist to keep data reliability by making sure there are no two rows of data in a table with the same key values. Hence, this signifies that the index key contains no replicate values. The non-unique indexes are not utilized to enforce limitations on the tables that are connected with it. Rather, this is only employed to enhance search efficiency by keeping a sorted order of data values that are utilized repeatedly [44, 45]. 
      The Clustered indexes, which arrange and stores the rows in the data pages that are compatible with the arrangement of the rows in the index. It is able to optimize the efficiency of almost all query procedures since they offer an extra linear access path to the data, which has been saved in pages. In addition, rows with identical index key values are also saved with each other [45]. The non clustered index could be described in a table or a way to show a clustered index or even in a group. Every index row in the non-clustered index includes the key value as well a row locator. The work of this locator is to indicate to the data row in the clustered index or group owning the key value [46, 47]. The recent index strategy that has become formally available in PostgreSQL is KNN (K Nearest Neighbor) a smart index, which is able to retrieve K rows that are well arranged by their distance from the required lookup aim. An example of KNN usage is in fuzzy text lookup, where this can be utilized for arrangement of fulltext search results by how well they correspond with the search-terms [48]. 
           Bitmap indexing is often an appropriate method for low-cardinality column databases [49] that contain a small selection of special values that may be absolute or pertaining to the number of records containing the data. The Boolean data represent a significant situation for simple bitmap indexing, which has two values, true (1) and false (0). The primary idea behind a prevailing bitmap indexing is by using a chain of bits or in order to indicate whether a characteristic in an interconnected group of fields (tuple) is equal to a particular value or not [50]. The most popular manner for a bitmap index on an indexed characteristic includes just one vector of bits for each attribute value, whereas the measurement of every bitmap is such as the cardinality of the indexed relation. Figure 3 shows a simple example for bitmap of a value list index for a (12) record relation, whereas every column in part (B) refers to a bitmap BV affiliated to an attribute value (v) [ 51]. 
Figure 3: Simple Example of 12 Record Bitmap Indexing [51
       The group-bitmap index is able to establish a new index design and retrieve problems to the subset lookup problem by evaluations of recovery or association rules and item group from the relational database. The purpose of the group bitmap indexing is to enable speed up lookup for subsets and content-based association rules retrieval in relational databases. The primary idea of the group bitmap index is usually to create a binary key, known as group bitmap key, related to every item group. The group bitmap keys are considered components of the item set. While retrieving information, these keys are employed to prune those collections of elements that do not possess the subset for lookup. In the next section, Figure 4 illustrates an example of a simple group bitmap index, in which the item represents "on" bits locations in binary set and a group represents every binary set (bitmap keys) [52]. 
Figure 4: Example of Simple Group Bitmap Index [52
Searching Dystem for Index Structure 
        Index search only applies a minor portion of the huge database information which is the indexed part [53]. This type of search employs an index to find the location of item. The index is searched by binary search technique or another algorithm to get to the position of the data [ 54]. The binary search is an effective directory search of checking an element into sorted array that in a way is like the bracketing approach uses information obtained with every stimulus display to define the subsequent phase of the search [55]. 
Binary search divides the sorted array into two portions, one of them is the rescindable part and it operates on the other part. Whereas, binary searching algorithm repeats this process, halving the size of the process able part of the series every time until it finds the result [56]. The lead to each-probe ignores half of the keys and then reduces search complexity [57]. In Figure 5 below an example of binary search for 76 in an arranged array is illustrated. Each time the number (76) is in comparison with the target, the potential location is found within the unshaded half of the array in every cycle [58].
Figure 5: Example to Find the Number 76 in an Arranged Array by using Binary Search [58]
         Linear search is the default way to sequence searches which look up a key value inside a set of elements. Linear searches work in a similar manner as the searches in the non-arranged array. The significant difference is always used with arranged array, the lookup stops when an item with a bigger key is found [59]. The aim of the lookup can be basically to check if the key exists in the ordered array, to search for a value associated with the key, or even to improve on the key value within the array [60]. 
        In general, pair wise comparison is a process of comparing two arrays or vectors that have the same or different sizes, and there are many possibilities for comparison operation which are represented through searching for common values, searching for a maximum value, and searching for a minimum value between them. There are two common methods used for pair wise comparison; the Brute Force and Merge Subroutine comparison. Brute Force works based on the computation of the inner product of every document array with the other document array. This means, for two sorted arrays (m, n), an O (m * n) comparison is at least needed. The main drawback of this method is a high complexity of execution time [61, 62]. On the other hand, Merge Subroutine comparison works based on two sorted arrays in order to do effective comparison between them. The main advantage of this technique is it reduces the comparison between two arrays by factor O (m+n-1), which minimizes the complexity of time [63].
Data Transformation using Bit Indexing 
        The major idea of Bit-indexing is when it saves an address value for "on" bits in long binary series, which refers to the locations with digit 1 and eliminate the locations of digit 0, then it is put into new integer number sequences (one dimension array). Bit-indexing produces the lower elements of the array for the long binary string, thus helping to reduce the search time complexity. It offers the best lookup result in a huge binary database. Figure 6 below shows the simple idea of Bit-indexing and how it has reduced the elements of the binary string array. 
      The new idea of Bit-indexing is used in this study to improve the similarity searching by reducing the number of array elements of a binary structure for each compound in a database, which accelerates the SS. In contrast, the compound structures found in a normal binary fingerprint database has an array size of 1024 elements (or bits), which include 0 and 1. Bit-indexing is based on the search for the present bits in the fingerprint representation and it saves the location value of these bits in a new array to create a new structure type of the database. Thus, this new method eliminates all bits (value 0) from the binary fingerprint structure. Table 1 illustrates the time complexity for each compound structure in terms of reading a compound or calculating the present bits for compounds before and after indexing. 
        Table 1 shows that after removing all zeros from each compound's binary form, the maximum numbers of array elements for MDDR and ChEMBL compounds become 177 from 1024 and 207 from 1024, respectively. Figure 7 shows the Bit-indexing Procedure. Figure 8 illustrates the usage of Bit-indexing and its benefits. 
Table 1: Shows Time Complexity for Compounds Before and After using Bit-Indexing 
Figure 6: The Idea of Bit-Indexing 
Figure 7: Flowchart to Show the Bit-Indexing Procedure for Compound 
Figure 8: Simple Example for Bit-Indexing Idea 
Merge Subroutine with TC 
       The previous sections of this study illustrated the significance of Tc in measuring similarities among compounds. In this study, one request was to implement Tc calculation of the common present bits between a target and the database compounds. In the binary fingerprint structure, this method was easy because all compounds have the same number of elements (or bits), which was 1024, and were calculated by the horizontal AND operation [64]. However, after using Bit-indexing, the elements of the compound structure changed. Thus, the process utilized a merged subroutine comparison technique to find the common values between two arrays. To use this method in finding common values between blocks T and D, the maximum T + D - 1 comparison was required [64]. As the main objective of this study was to reduce the consumed time, Bit-indexing had an important role in reducing the number of elements of the compounds in the database, which would significantly reduce the time consumed to read the massive data used. Tables 2 and 3 show the reduction caused by the complexities of reading large databases (MDDR and ChEMBL) used in this study, before and after Bit-indexing. 
Table 2: The Time Complexity of Databases Reading Before Bit-Indexing
Table 3: The Time Complexity of Databases Reading After Bit-Indexing 
          Nevertheless, the influence of Bit-indexing is not limited to reducing the large read-only files, but it also has an important role in other aspects of the improved SS algorithm. Evidently, Bit-indexing influences the calculation of the similarities between chemical compounds that have been measured using Tc. Bit-indexing is very useful in processing the requirements of the Tanimoto equation, which is the total number of units (or present bits) in the binary structure of a compound and the total common units of two compounds that have been used to measure the degree of similarities. With a binary structure consisting of 1024 elements (or bits), the possibility of calculating the number of present bits for each compound was large and represented by 1024 prospects. 
          After harnessing the new idea, present bits became the array elements number of the compounds in the new format, which was the total number of present bits. This change was brought about by the basic role of Bit-indexing in which the 0 of the binary form of the compound was ignored because it was abundant. This matter has been clarified in Figure 8. It became the largest probability in calculating the number of present bits in the MDDR database with 177 prospects, and in the ChEMBL database with 207 prospects after being detected in both cases with 1024 possibilities. 
          The possibility has been reduced to less than a quarter. Tables 4 and 5 show the reduction of calculation of present bits. Table 6 shows the reduction of calculation of common values for two sorted arrays (m, n). Bit-indexing reduced the time complexity of many aspects used in the study, which involved merge subroutine comparison to find common values between compounds. Hence, the compounds became representative of the different sizes of the database, which were generated after Bit-indexing. 
Merge subroutine comparison has been explained in the previous sections and the concept is illustrated in Figure 9. 
Table 4: The Time Complexity for Calculating (a,b,c) Before Bit-Indexing 
Table 5: The Time Complexity for Calculating (T, D) after Bit-Indexing 
Table 6: The Time Complexity for Calculating the Common Values After Bit-Indexing 
Figure 9: The Idea of a Merge Subroutine Comparison between Two Arrays 
       This provided the percentage in reducing or calculating the number of units for the compounds, and in calculating the common values between the compounds using equations before and after utilizing the new method. Equations 2, 3, 4, 5, 6, and 7 show the percentage of the time complexity for the requirements of this study that were significantly affected, to accelerate the similarity searching. The first two equations show the percentage of the operations before the proposed method was used depending on table 4. As well as other equations that show the percentage of operations after applying Bit-indexing depending on the tables 5 and 6.
        All these explanations led to several points: the reduction in time consumed, and similarity searching algorithm was hastened and improved using the new concept. The points are as follows: 
1. Reduce the array elements of the compound structure and then reduce the time consumed for reading huge databases. 
2. Reduce the array elements of the compound structure and then reduce time consumed for calculating the present bits for each compound. 
3. Reduce the array elements of compound structure and then reduce time consumed for calculating the common present bits between compounds. 
        After all these contributions to Bit-indexing, a new representation of the Tanimoto equation (KSTc) was developed. The common values between two compounds were compensated instead of the common present bits. In the new structure, the number of array elements for a target compensates instead of the calculation of the present bits for the target compound. Moreover, the number of array elements for the database compound in the new structure compensates instead of the calculation of the present bits for the database compound. The target array elements (T), database compound array elements (D), common value arrays between two compounds (Cv) or Cv equal to array elements for intersection between (T, D), and KSTc form the new Tanimoto equation, where KS is author's name (Kaka Salih), as shown in Equations 3.7 and 3.8. 
Experimental Details 
       The results of the tests performed on two databases are provided below. First, MDDR consisting of 102540 compounds/molecules, and second ChEMBL consisting of 657733 compounds/molecules. Both MDDR and ChEMBL were represented by ECFP4 fingerprint before Bit-indexing. In addition, we chose the query (target compounds) from one of four targets, which is explained in the table 8. However, the main condition for the SS execution was that all target and database compounds must have the same representative structure. Therefore, the selection of the structures used in the proposed system is explained through the following steps:
  • Before Bit-indexing, use ECFP4 fingerprint (binary string), in which each compound is represented in 1024 bits. 
  • After Bit-indexing the structure of the compounds, turn to the sorted integer's string that only has 1 bit in the binary fingerprint. The compounds represented in arrays for different numbers of elements depend on the number of present (or "on") bits in each representation of the compound fingerprint.
          Tanimoto coefficient was modified to accommodate the new Bit-indexing representation. Table 7 shows the formula for Tc in its original form for ECFP4 representation, as well as the modified form for Bitindexing representation. Lastly, the results were arranged in a descending order inside the text file using quick sort method. Thus, the results shown at the top of the file showed highly similar compounds, and the proportion of similarity decreased as we go towards the bottom of the text file. 
Table 7: Tanimoto Coefficient Before and After Bit-Indexing 
Results and Discussion 
         The similarity search method consists of numerous steps and the present bit has substantial influence on several of SS steps, if not on all of them. However, the 2D fingerprint representation includes 0 and 1, whereas Bitindexing incorporates all present bits (or 1) and eliminates all absent bits (or 0). Therefore, we created four targets based on the number of present bits to quantify the benefits of the proposed method used in this study. These four targets have the 25%, 50%, 75%, and 100% of the present bits, respectively. Table 8 presents the characteristics of these four targets.
Table 8: The Attributes of Test Targets 
         We evaluated the effects of these test targets on Tc and the results of the similarity search. We then estimated the result of SS before and after using Bit-indexing for the general type target, and measured the accuracy of the proposed method through recall process. All these performance calculations are discussed in the following subsections. The execution time of a sequential application includes the measurement of all processing times from the start to the end of its execution in a computer [65]. The SS execution time started from reading a compound in the database and ended upon writing the result into a text file. In this study, we did not mention the pre-processing time because only a single time was needed to convert a binary string to an integer string array before applying the SS method.
        The two sequential execution times for a similarity searching algorithm before and after Bit-indexing method were represented by the normal SS time and Bit-indexing SS. Since Tc and SS included several operations, different results were being obtained. Consequently, the average of ten tests was taken to show the results before and after the Bit-indexing. 
      The speed-up (S) is an important factor in parallel programming as it measures the optimization performance ratio of the execution times of parallelism and sequential programming. The parallel and sequential times are denoted as Tpar and Tser, respectively. The speed-up value was be calculated by dividing the elapsed time of a serial program by the elapsed time of a parallel program [66]. Equation 10 expresses the speed-up for a parallel system.
        Using Equation 10, the speed-up for this study was calculated to obtain the execution speed of the Bitindexing method. Moreover, we used the normal SS time and the Bit-indexing SS time instead of the Tser and Tpar, respectively. Equation 11 was used to determine the speedup of the proposed system. 
         The performance percentage can be used to determine the applicable gain time for a parallel program that estimates from the variance between the serial and parallel execution times divided by the serial execution time [66,67]. The general form of Equation 12 is shown below, where PG is the performance gain, and Tser and Tpar are the serial and parallel execution times, respectively. In this study, Equation 12 was used to determine the performance gain between normal and Bitindexing SS. Equation 13 demonstrates this measure and the performance gain of Bit-Indexing represented by BPG.
          Recall (Re) is a common rule used to assess the accuracy of a similarity search performance. Numerous activity classes were present, including diverse numbers of compounds, and used to measure the accuracy of similarity search. Ten targets were randomly chosen to measure the recall performance within the 1% of the top results of SS [ 6]. Table 9 shows the activity classes in MDDR that was used in this study [42].
Table 9: The Activity Classes Created by Hert in MDDR 
         The recall ratio was calculated by dividing the number of activities retrieved (or number of matched id) over the number of class compounds (or number of class id). Equation 14 expresses Re.
The Tc is an important way to determine the degree of similarity between chemical compounds and is represented by a 2D fingerprint. In addition, the present bits had a major role in the development of this process. Therefore, Bit-indexing had a significant influence on the implementation time of this process. We compared the Tc execution time for the structure of the binary string with the new Tc execution time for the structure of the integer string. 
         In this study, we used the two above-mentioned datasets (MDDR, ChEMBL) in the evaluation of the experimental data. Moreover, we also compared the four new targets with each of the compound found in these databases. The four targets were used in the MDDR and ChEMBL because both have a binary string structure prior to the Bit-Indexing and an integer string structure was obtained after the Bit-Indexing. The following steps demonstrate the development of the operation in determining the time consumed by Tc before and after the Bit-indexing:
  • At the first stage, the implementation time (in seconds) of Tc was determined using ECFP4 fingerprint representation for the four target compounds and the compounds of the two databases. The runtime of Tc for the four targets was calculated using the equation [Tc = c/ (a + b - c)]. In addition, the results were arranged in descending order from the time of 100% to 25%. 
  • At the second stage, after converting the structure into an integer string, the new equation 8 was used to determine the execution time of Tc. This procedure had the same ordering of targets. All these results are shown in two Figures 10 and 11.
Figure 10: Comparison of the Tc Execution Time Before and After Bit-Indexing of the Four Test Targets in MDDR
Figure 11: Comparison of the Tc Execution Time Before and After Bit-Indexing of the Four Test Targets in ChEMBL 
        The above Figures show that the execution time of the Tc for the four new compounds in the case of normal similarity searching did not significantly change in each of the databases. The Tc execution time for MDDR ranged between 0.77 s to 0.60 s and the Tc execution time for ChEMBL ranged between 4.73 s to 3.58 s. However, we observed an evident reduction on the execution time of Tc only in the second case using Bit-Indexing, through which the time of target 25% became 0.44 s compared with the 0.60 s of the MDDR database and 2.90 s compared with the 3.58 s of the ChEMBL database. We focused on the 25% target because the number of present bits in all the compounds in both databases was less than the present bits for this target. 
      The results showed that reducing the number of present bits of the target compound reduced the execution time using the Bit-indexing method because the number of array elements for the compounds decreased. However, the number of array elements remained fixed for the compounds when the proposed method was not used, in which each compound array had 1024 elements (or bits). For this reason, Bit-Indexing re-duces the calculation of the present bits and the comparisons to calculate the common values between two arrays, as mentioned in Chapter three. The next two Figures show the results of the execution time (in seconds) for the similarity search method. All these results were obtained after running the two sequential models of SS before and after Bit-indexing on the same computer and using the same operating system. These Figures 12 and 13 show the variations in time between using normal SS and Bit-indexing SS.
Figure 12: Comparison of the SS Execution Time Before and After Bit Indexing of the Four Test Targets in MDDR
Figure 13: Comparison of the SS Execution Time Before and After Bit-Indexing of the Four Test Targets in ChEMBL
       Figures 12 and 13 show an evident improvement in implementation time between the two similarity search algorithm before and after Bit-Indexing. Generally, the impact on the time factor was evident when using the proposed method. By contrast, the execution time for SS of each target decreased at least by six times in the MDDR database and at least twice in the ChEMBL database. The differences in execution time using the four new targets were substantial. By comparing the two methods, the Bitindexing method was evidently faster than the normal method of similarity search. This difference indicates the efficacy of the new method because of the high number of present bits (207 bits) for all the compounds in the employed databases which was approximately equal to the fourth proposed target of 25%. Thus, we will demonstrate the speed of the new method in detail in the succeeding section.
Figure 14: Speedup in MDDR by using the Proposed Method
Figure 15: Speedup in ChEMBL by using the Proposed Method 
Speed up 
         This study used different targets through their characteristics in terms of content. We then tested the two databases by conducting similarity search algorithm before and after Bit-indexing to observe and evaluate the execution time. It was observed that there was an evident difference in reducing the times when the proposed method was used. Figures 14 and 15 illustrate the speed-up of the algorithm based on the execution time of SS before and after Bit-indexing. These values were calculated using Equation 11 and presented by Tables 10 and 11.
Table 10: The Bit-Indexing Speed-up Result in MDDR
Table 11: The Bit-Indexing Speed-up Result in ChEMBL
         An increase in speed-up of similarity search was observed when the number of the present bits of the target compound decreased. Contrastingly, the algorithm speed-up was higher by 8.53 and 3.21 times when using the MDDR and ChEMBL databases, respectively, in the fourth goal (25%) testing. The above results confirmed the efficacy of the new theory because all the compounds used in this study contain a number of "on" bits that were less than the fourth goal (25%). As an acceleration using the proposed method was observed, a proportional gain for the performance was expected, as shown in Figures 16 and 17. In addition, the BPG was measured using Equation 13 and presented in Tables 12 and 13. 
Figure 16: The Percentage of Performance Gain for MDDR 
Figure 17: The Percentage of Performance Gain for ChEMBL
Table 12: The Bit-Indexing Performance Gain result in MDDR 
Table 13: The Bit-indexing Performance Gain result in ChEMBL
       First, the speed-up of the algorithm was evidently reflected by the increase in performance gain, as shown in Figures 16 and 17. The first highest rate was recorded with a performance gain of 88.28% in the MDDR database that corresponded to 8.53 times of speed-up. Second, the largest rate was recorded with the performance gain of 68.88% in the ChEMBL database that corresponded to 3.21 times of speedup. All these results showed the performance efficiency of the new method. The accuracy of the proposed method will be discussed in the following section. 
        The previous results proved several aspects of the proposed approach, such as the improvement of the execution speed and enhancement of the performance gain. In this part of the study, we verify the accuracy of the results from the new method results using recall (Re). The results of this new method were compared with those from the study of Malim. In this study, seven activity classes were selected. These results showed that the recall was taken from the top 1% of the list per each activity class. By contrast, the recall values before and after Bit-indexing were very similar. The score after using Bit-indexing would sometimes be large, which indicated an increase in accuracy. For example, in the Renin activity class (shown in Table 14), the recall increased for several compounds, such as compound 226459 whose recall value increased from 12.6 to 12.7, thus indicating an increase in accuracy. Hence, the benchmarks of accuracy are shown in table 15 referring to average (Av) of recall for activity classes before and after proposed method. Whereas the averages before applying the proposed method relied on the results of Malim in 2011 [6].
Table 14: Rennin Activity Class
Table 15: Benchmarks of Accuracy Before and Bfter Bit-Indexing 
         The main objective of this study is to present an efficient solution to reduce time complexity of Similarity Searching when performed on large databases. By changing the 2D fingerprints structure using bit indexing to a new representation of "on" bits representation and the use of merge subroutine comparison, the time complexity of the algorithm reduces significantly. The proposed method worked by transforming the binary string to an integer string by eliminating the "off" bits (or 0) and retaining the location values of "on" bits (or 1). This is due to the huge number of "off" bits which is not used in the calculation of the similarity resemblance rather the main importance falls on the "on" bits that have the main influence on the similarity calculation. Hence, the time complexity of reading the two benchmark databases is reduced. The transformation of representation affected the similarity calculation in which the normal Tanimoto coefficient formula is longer be applicable. Hence the second objective of this research was to modify the Tanimoto coefficient to be used by the merge subroutine comparison. Once again, the time complexity to calculate common values between two sorted arrays is reduced (as shown in Tables 4 and 6). 
       The evaluation of the performance efficiency is done by using four metrics that focused on speed and accuracy. Initially, the recall value was compared between our works to the work of  [6]  as a benchmark on the accuracy of the proposed method. It was shown that the recall obtained in this work is similar to the work of  [6 which denotes the accuracy of our proposed method. The next three metrics were used to demonstrate the speed difference between the SS methods before and after Bitindexing. The metrics used included the execution time, speed-up, and percentage of performance gain. From the results, it was observed that the Bit-indexing similarity searching is proven to be 8.53 times faster than traditional similarity searching when applied on MDDR database, and 3.21 times faster than normal similarity searching when applied on ChEMBL database. As an overall conclusion, we may conclude that the Bit-indexing similarity search has been proven to show efficient performance as compared to traditional similarity search by achieving less execution time.


  1. Maggioni M, Santambrogio MD, Liang J (2011) GPUAccelerated Chemical Similarity Assessment for Large Scale Databases.Procedia Computer Science 4: 2007-2016.
  2. Serpone N, Dondi D, Albini A (2007) Inorganic and Organic UV Filters: Their Role and Efficacy in Sunscreens and Suncare Products. Inorganica Chimica Acta 360: 794-802.
  3. Puzyn T, Cronin MT, Leszczynski J (2010) Recent Advances in QSAR Studies: Methods and Applications. Challenges and Advances in Computational Chemistry and Physics 8.
  4. Bell F, Sacan A (2012) Content Based Searching of Gene Expression Databases Using Binary Fingerprints of Differential Expression Profiles. Health Informatics and Bioinformatics (HIBIT) 107-113.
  5. Patel DA, Patel AC, Nolan WC, et al. (2014) High-Throughput Screening Normalized to Biological Response Application to Antiviral Drug Discovery. Journal of bimolecular screening 19: 119-130.
  6. Malim NHH (2011) Enhancing similarity searching, PhD thesis. The University of Sheffield.
  7. Walters WP, Stahl MT, Murcko MA (1998) Virtual Screening- an Overview. Drug Discovery Today 3: 160-178.
  8. Holliday JD, Willett P, Xiang H (2012) Interactions between Weighting Scheme and Similarity Coefficient in SimilarityBased Virtual Screening. International Journal of Chemo informatics and Chemical Engineering (IJCCE) 2: 28- 41.
  9. Simon C, Daniel R (2011) Met genomic Analyses: Past and Future Trends. Applied and Environmental Microbiology 77: 1153-1161.
  10. Zhang D, Wang J, Cai D, et al. (2010) Self-Taught Hashing for Fast Similarity Search. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval 18-25.
  11. Liao Q, Wang J, Webster Y, et al. (2009) GPU Accelerated Support Vector Machines for Mining High-Throughput Screening Data. Journal of chemical information and modelling 49: 2718-2725.
  12. Haque IS, Pande VS, Walters WP (2010) SIML: a Fast SIMD Algorithm for Calculating LINGO Chemical Similarities on GPUs and CPUs. Journal of chemical information and modelling 50: 560-564.
  13. Todeschini R, Consonni V, Xiang H, et al. (2012) Similarity Coefficients for Binary Chemo informatics Data: Overview and Extended Comparison Using Simulated and Real Data Sets. Journal of chemical information and modelling 52: 2884-2901.
  14. Wassermann AM, LounkineE,Glick M (2013) Bio turbo Similarity Searching: Combining Chemical and Biological Similarity To Discover Structurally Diverse Bioactive Molecules. Journal of chemical information and modelling 53: 692-703.
  15. Salim N, Holliday J, Willett P (2003) Combination of Fingerprint-Based Similarity Coefficients Using Data Fusion. Journal of chemical information and computer sciences 43: 435- 442.
  16. Willett P (2013) Fusing Similarity Rankings in Ligand-Based Virtual Screening. Computational and Structural Biotechnology Journal 5: 1-6.
  17. Brown N (2009) Chemo informatics-an Introduction for Computer Scientists. ACM Computing Surveys (CSUR) 41: 8.
  18. Whittle M, Gillet VJ, Willett P, et al. (2006) Analysis of Data Fusion Methods in Virtual Screening: Similarity and Group Fusion. Journal of chemical information and modelling 46: 2206-2219.
  19. Johnson MA, Maggiora GM (1990) Concepts and applications of molecular similarity. Wiley.
  20. Whittle M, Gillet VJ, Willett P, et al. (2006) Analysis of data fusion methods in virtual screening: theoretical model. Journal of chemical information and modelling 46: 2193-2205.
  21. Hilmi MN, Al-laila MH, Malim NHAH (2016) Accelerating group fusion on multicores and graphics processing units. Journal of Information Processing System.
  22. Al-laila MH, Hilmi MN, Malim NHAH (2016) Accelerating Turbo Similarity Searching on Multi-cores and Many-cores Platforms. In Advanced Computer and Communication Engineering Technology 81-92.
  23. Yukihira D, Miura D, Fujimura Y, et al. (2014) MALDI Efficiency of Metabolites Quantitatively Associated With Their Structural Properties: A Quantitative Structure-Property Relationship (QSPR) Approach. Journal of the American Society for Mass Spectrometry 25: 1-5.
  24. Bunin BA, Siesel B, Morales GA, et al. (2008) Chemo informatics: Theory, Practice and Products.
  25. Kier LB (1997) Kappa Shape Indices for Similarity Analysis. Medicinal Chemistry Research 7: 394-406.
  26. Kudera BM (2013) Implementation and Comparison of Methods for pKa Calculation, PhD thesis.
  27. Dimitrov I, Naneva L, Doytchinova I, et al. (2013) AllergenFP: Allergen city Prediction by Descriptor Fingerprints. Bioinformatics 619.
  28. Backman TWH, Girke T (2014) Cheminformatic Analysis of High- Throughput Compound Screens. Plant Chemical Genomics 145-157.
  29. Abhik (2013) Characterizing 2D structures with descriptors and fingerprints.
  30. Bajorath J, Vogt M, Duffy BC, et al. (2014) Chemo informatics for Drug Discovery.
  31. Bajorath J (2004) Chemo informatics: Concepts, Methods, and Tools for Drug Discovery. Methods in molecular biology.
  32. Esposito G, Sarnacki BG, Chelliah HK (2012) Uncertainty Propagation of Chemical Kinetics Parameters and Binary Diffusion Coefficients in Predicting Extinction Limits of Hydrogen/Oxygen/Nitrogen non-Premixed Flames. Combustion Theory and Modelling 16: 1029-1052.
  33. Gottschalk F, Scholz RW, Nowack B (2010) Probabilistic Material Flow Modeling for Assessing the Environmental Exposure to Compounds: Methodology and an Application to Engineered Nano-TiO< sub> 2 Particles. Environmental Modelling & Software 25: 320-332.
  34. Zegers FE, ten Berge JMF (1985) A Family of Association Coefficients for Metric Scales. Psychometrika 50: 17-24.
  35. Salim N, Holliday J, Willett P (2003) Combination of Fingerprint-Based Similarity Coefficients Using Data Fusion. Journal of chemical information and computer sciences 43: 435- 442.
  36. Willett P (2006) Enhancing the Effectiveness of LigandBased Virtual Screening Using Data Fusion. QSAR and Combinatorial Science 25: 1143-1152. 
  37. Mehta A, Sonam S, Gouri I, et al. (2014) SMMRNA: a Database of Small Molecule Modulators of RNA. Nucleic acids research 42: 132-141.
  38. Willett P, Barnard JM, Downs GM (1998) Chemical similarity searching. Journal of chemical information and computer sciences 38: 983-996.
  39. Castro EA, Haghi AK (2011) Advanced Methods and Applications in Chemo informatics: Research Progress and New Applications.
  40. Hu Y, Stumpfe D, Bajorath J (2013) Advancing the Activity Cliff Concept. F1000Research 2.
  41. Southan C, Varkonyi P, Muresan S (2009) Quantitative Assessment of the Expanding Complementarily Between Public and Commercial Databases of Bioactive Compounds. Journal of Cheminformatics 1: 1-17.
  42. Hert J, Keiser MJ, Irwin JJ, et al. (2008) Quantifying the Relationships among Drug Classe. Journal of chemical information and modeling 48: 755-765.
  43. Gaulton A, Bellis LJ, Bento AP, et al. (2012) ChEMBL: a Large-Scale Bioactivity Database for Drug Discovery. Nucleic acids research 40: 1100-1107.
  44. Machines IBMIB (2014) Database Fundamentals.
  45. 45. Navathe SB, Elmasri R (2010) Fundamentals of database systems. Upper Saddle River NJ: Pearson 652-660.
  46. Server SQL (2012) the following table lists the types of indexes available in SQL Server and provides links to additional information.
  47. Strate J, Krueger T (2012) Expert Performance Indexing for SQL Server.
  48. Akhilandeswari P, George JG (2014) secure text steganography. Proceedings of International Conference on Internet Computing and Information Communications 1-7.
  49. Sharma V (2005) Bitmap Index vs. B-tree index: Which and when.
  50. Wu MC, Buchmann AP (1998) Encoded Bitmap Indexing for Data Warehouses. 2013 IEEE 29th International Conference on Data Engineering (ICDE), IEEE Computer Society 220.
  51. Chan CY, Ioannidis YE (1998) Bitmap Index Design and Evaluation. ACM SIGMOD Record 355-366.
  52. Morzy T, Zakrzewicz M (1998) Group Bitmap Index: A Structure for Association Rules Retrieval. KDD 284-288
  53. Dutta R, French S, Janakiraman J (2003) Method and System for Augmenting Web-Indexed Search Engine Results with Peerto-Peer Search Results. 
  54. Pcmag (2014) Definition of: Indexed Search.
  55. Tyrrell RA, Owens DA (1988) A Rapid Technique to Assess the Resting States of the Eyes and other Threshold Phenomena: the Modified Binary Search (MOBS). Behaviour Research Methods, Instruments and Computers 20: 137-141.
  56. Cormen TH, Leiserson CE, Rivest RL, et al. (2011) Introduction to Algorithms. 2nd Edition.
  57. Comer D (1899) English Dictionary Searching with Little Extra Space. IEEE Computer Society 209.
  58. Shaffer CA (2009) A Practical Introduction to Data Structures and Algorithm Analysis Third Edition (Java).
  59. Goodrich MT, Tamassia R (2008) Data Structures and Algorithms in Java, John Wiley and Sons.
  60. Hester JH, Hirschberg DS (1985) Self-Organizing Linear Searc. ACM Computing Surveys (CSUR) 17: 295-311.
  61. Lin J (2009) Brute Force and Indexed Approaches to Pair wise Document Similarity Comparisons with Map Reduce. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval 155-162.
  62. Sedgewick R, Flajolet P (2013) an Introduction to the Analysis of Algorithms, Addison-Wesley.
  63. Dasgupta S, Papadimitriou CH, Vazirani U (2006) Algorithms. McGraw-Hill, Inc.
  64. Kobayashi M, Murase T, Kuriyama A (2000) A Longest Prefix Match Search Engine for Multi-Gigabit Ip Processing. Communications 1360-1364.
  65. Grama A, Gupta A, Karypis G, et al. (2003) Introduction to Parallel Computing. Introduction to Parallel Computing 1.
  66. Thomas MP, Bhattacharjee S, Cheng C, et al. (2014) Computational Science and Engineering.
  67. Naser MAS, Rashid NA, Aboalmaaly MF (2012) QuickSkip Search Hybrid Algorithm for the Exact String Matching Problem. International Journal of Computer Theory and Engineering 4.