MOGAMOD Overview: Multi-objective Genetic Algorithm for Pattern Discovery
Table of Contents
Introduction
Methods
Similarity
Support
Genetic Operators
Results
Conclusion

Multi-objective evolutionary algorithms are a popular approach that has been widely used in optimization problems. The research "Using a Multi-Objective Genetic Algorithm for Pattern Discovery" (MOGAMOD) was the first study to apply a multi-objective genetic algorithm to the pattern finding problem. By maximizing three conflicting objectives (pattern length, similarity, and support), the pattern can be obtained with high accuracy and low execution time. The MOGAMOD algorithm adapted a popular high-performance multi-objective genetic algorithm, the Non-dominated Sorting Genetic Algorithm II (NSGA-II), to the pattern search problem in order to find the optimal pattern. Like other genetic algorithms, NSGA-II relies on two operators, mutation and crossover, which constantly produce different sets of solutions; these are then compared with one another to reach an optimal end result. The algorithm was tested and analyzed on several samples with different properties: single sample, corrupted sample, invaded sample, and multiple patterns. The results were compared with three conventional methods based on statistical approaches to show the algorithm's effectiveness and superiority.

Introduction

Sequence motifs are defined as repeated patterns in DNA that can be found in DNA regulatory sites. These regulatory sites and motif instances are responsible for the protein-binding role of the gene sequence that starts the transcription process. Instances of motifs found in DNA sequences usually have slight variations in their components.
The discovery of motif instances on DNA and their regulatory regions is crucial for understanding the relationship between DNA and proteins such as nucleases and transcription factors; it is also a key factor in controlling gene expression and identifying drug targets for personalized medicine. In real-world problems, DNA can contain up to 220 million nucleotide base pairs, while motif instances are typically short (30 nucleotide pairs). Biological experimental approaches have been developed to extract instances of motifs from given DNA samples. The most popular methods are DNase footprinting, gel shift analysis, and linker analysis. These biological approaches require considerable labor and laboratory time as the sequence length or the number of sequences increases. Therefore, computational methods based on statistical approaches, such as Gibbs Sampler and Consensus, have been developed to find patterns in given DNA samples. However, these algorithms also exhibit high time complexity as the size of the DNA sample increases. They also do not account for cases where some sequences in the sample contain no pattern instance, or where a sequence contains multiple instances.

In this report, a new approach using a multi-objective genetic algorithm is presented as an alternative to classical statistical approaches. Instead of optimizing a single objective and performing extremely poorly on the others, such as similarity or length of the final pattern, this approach produces results that trade the objectives off against one another, which resolves problems encountered in other methods. The multi-objective genetic algorithm is designed to maximize three properties of the final pattern: similarity, length, and support. The algorithm proposed in this paper is tested with three datasets and compared with other well-known methods to demonstrate its effectiveness and superiority in terms of accuracy and time complexity.
It is also compared with a single-objective genetic algorithm to provide a better understanding of the trade-off between the objectives of the problem.

Methods

The Multi-Objective Genetic Algorithm for Pattern Discovery (MOGAMOD) was built on a popular high-performance multi-objective genetic algorithm, the Non-dominated Sorting Genetic Algorithm II (NSGA-II). NSGA-II is a population-based tool, often used in optimization problems to find global optima quickly and efficiently. It is based on Darwin's principle of natural selection and evolves a population toward the best solutions for the given problem. The first step in a genetic algorithm is to create a randomly generated initial population whose individuals represent possible solutions to the problem. In this case, an individual was an array of n genes corresponding to the n DNA sequences in the problem. Each gene was divided into two parts: the weight (wi) and the possible starting location of the motif instance (si). The weight values in the array indicated the probability that the potential pattern existed in the corresponding sequence; these values ranged from 0 to 1. MOGAMOD was designed to let users set a threshold on wi so that sequences with a low wi can be excluded from the pattern discovery process. The starting location (si) indicated the potential starting position of the pattern instance in the corresponding sequence. In this research, si was limited to the range 7 to 64. Each individual in the population was then evaluated by a fitness function built on three objectives: similarity, pattern length, and support.

Similarity

In the pattern discovery problem, similarity is defined as a measure of resemblance across all pattern instances of an individual. The similarity value of an individual was calculated from the position weight matrix of its pattern instances by averaging, over all positions, the probability of the most frequent nucleotide.
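The encoding and the similarity measure described above can be sketched in Python as follows. This is a minimal illustration and not the study's implementation: the function name `similarity`, the 0-based start positions, the separate `motif_length` argument, and the default `weight_threshold` of 0.5 are all assumptions made for the example.

```python
from collections import Counter


def similarity(sequences, individual, motif_length, weight_threshold=0.5):
    """Average, over motif positions, of the frequency of the most common nucleotide.

    `individual` is a list of (weight, start) genes, one per sequence.
    Start positions are treated as 0-based (an assumption of this sketch).
    """
    # Keep only sequences whose weight passes the threshold and whose
    # candidate window fits inside the sequence.
    instances = [
        seq[start:start + motif_length]
        for seq, (weight, start) in zip(sequences, individual)
        if weight >= weight_threshold and start + motif_length <= len(seq)
    ]
    if not instances:
        return 0.0

    total = 0.0
    # Each column holds the nucleotides found at one motif position.
    for column in zip(*instances):
        most_common_count = Counter(column).most_common(1)[0][1]
        total += most_common_count / len(instances)
    return total / motif_length


# Hypothetical toy data: four short sequences, the last one given a low weight.
sequences = ["ACGTACGT", "TTACGAAC", "GGACGTTT", "CCCCCCCC"]
individual = [(0.9, 0), (0.8, 2), (0.7, 2), (0.1, 0)]
score = similarity(sequences, individual, motif_length=4)
```

In this toy case the three retained instances are ACGT, ACGA and ACGT, so three positions agree fully and the last agrees in two of three sequences. The fourth sequence is skipped because its weight falls below the assumed threshold.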
The similarity value also ranged from 0 to 1 and indicated the probability that the current candidate would be chosen as the final pattern.

In the pattern discovery problem, the length of the pattern is always a goal that each algorithm tries to maximize, because a longer pattern reduces the probability of false pattern instances and thus increases the chance of obtaining a strong pattern.

Support

An individual's support value was determined by the number of sequences used to compose the candidate motif. This value was introduced to exclude "corrupt" sequences that had no pattern instances, so that a strong final pattern could be obtained without taking these sequences into account.

In summary, to solve the pattern discovery problem, MOGAMOD was created to optimize three objectives of a final pattern: similarity, pattern length, and support. From the initial population, the strongest individuals were selected to move on to the next generation. A fitness function was created to determine whether an individual's objective values were sufficiently strong relative to other individuals in the current population. Individuals were first ranked based on fitness using a non-dominated sorting algorithm. This algorithm has a.