Abstract
In the current era of ubiquitous smart devices, malware detection has become an endless battle between ever-evolving malware and antivirus programs, and the volume of security-related data that must be processed grows every day. Various approaches to detecting malware have been developed over time. One of them borrows from deoxyribonucleic acid (DNA) sequence analysis, which includes comparing sequences to search for similarity, identifying the intrinsic features of a sequence, identifying differences and variations, revealing the evolution and genetic diversity of sequences, and identifying molecular structure from a given sequence. Massive improvements in DNA sequencing have led to a proliferation of bioinformatics tools, and even as their use has grown, these tools have so far encountered little adverse impact. This paper explains the basic concept of DNA sequence analysis, the relationship between computer systems and DNA sequencing, and the malware detection techniques, based on sequence analysis, that can be used to avoid possible attacks.
DNA sequence analysis
DNA is essentially a way of storing information. It usually encodes the instructions for making living things, but it can be used for other purposes as well. In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution [1]. With the growing number of high-throughput methods for producing gene and protein sequences, the rate at which new sequences are added to databases is increasing rapidly. Scientists now compare these new sequences with sequences of known function in order to understand the biology of the organism from which a new sequence comes. Sequence analysis can thus be used to assign function to genes and proteins by studying the similarities between the compared sequences.
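As a rough illustration of comparing a new sequence against a known one, the sketch below scores similarity by the overlap of short substrings (k-mers). The sequences, the function names and the choice of k = 3 are assumptions made for this example; production tools rely on far more elaborate alignment methods.

```python
def kmers(sequence: str, k: int = 3) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard index of the two k-mer sets: shared k-mers / all k-mers."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 1.0  # both sequences are too short to contain any k-mer
    return len(ka & kb) / len(ka | kb)


# Hypothetical sequences, invented for illustration only.
known_gene = "ATGGCGTACGTTAGC"
new_read = "ATGGCGTACGATAGC"   # one substitution relative to known_gene
unrelated = "TTTTCCCCAAAAGGG"

print(jaccard_similarity(known_gene, new_read))
print(jaccard_similarity(known_gene, unrelated))
```

A sequence that differs by a single base still shares most of its k-mers with the known gene, while the unrelated sequence shares none, which is the intuition behind assigning function by similarity.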
Relationship between computer security analysis and DNA sequencing
The cost and time required to sequence and analyze DNA have improved rapidly. In the past decade, the cost of sequencing a human genome has decreased 100,000-fold or more, which was made possible by parallel processing. Today, hundreds of millions of DNA strands can be sequenced simultaneously, which has opened up applications in domains ranging from human behavior and personalized medicine to the study of microorganisms in our gut.
Computers are used to process, analyze and store these millions of DNA sequences, and as the technology has improved rapidly, new and unexpected interactions between electronic and biological systems have emerged. Once DNA is sequenced, it is usually processed and analyzed by a number of computer programs, collectively called the DNA data processing pipeline. Ney et al. analyzed the computer security practices of commonly used open-source programs in this pipeline [2]. Scientists have used DNA to store books, recordings, an Amazon gift card and even GIFs. Researchers from the University of Washington managed to take over a computer by encoding a malicious program in DNA.
Malware detection techniques
Early detection techniques were based on static analysis, which examines binary code and identifies malicious code without executing it. Inspecting binary code has become difficult, however, as obfuscation techniques such as polymorphism, encryption, and packing have grown more sophisticated. In addition, static analysis depends on a pre-built signature database, which makes it hard to detect new, unknown malware until the signatures are updated. To reduce these limitations and complement static analysis, dynamic analysis was introduced and is now widely used to achieve effective malware detection. Dynamic analysis executes the malware and observes its behavior. Two approaches are mainly used: control flow analysis and API call analysis. Both identify malware by analyzing the similarity between known samples and new ones. Many existing API call techniques quickly reveal the characteristics of malware in the same class, but they do not capture the order of malware behavior and are easy to evade, because malware authors can insert and execute dummy or redundant API calls. Other researchers extract an API call sequence for each class and develop static signatures based on it. However, creating signatures from the call sequences most frequently found in each class does not allow them to detect malware in unknown forms.
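The dependence of static detection on a signature database can be sketched as follows. This minimal example assumes that a signature is simply the SHA-256 digest of the whole file, which is a simplification of what antivirus products do; the sample bytes are invented for illustration.

```python
import hashlib


def signature(binary: bytes) -> str:
    """Static signature: SHA-256 digest of the raw file bytes (an assumption)."""
    return hashlib.sha256(binary).hexdigest()


def is_known_malware(binary: bytes, signature_db: set[str]) -> bool:
    """Flag a file only if its exact digest is already in the database."""
    return signature(binary) in signature_db


# Hypothetical bytes standing in for a malicious binary.
original_sample = b"\x4d\x5a payload-v1"
# A trivially altered variant: one padding byte appended, behavior assumed unchanged.
repacked_variant = original_sample + b"\x00"

signature_db = {signature(original_sample)}

print(is_known_malware(original_sample, signature_db))
print(is_known_malware(repacked_variant, signature_db))
```

The known sample is flagged, but the altered variant is missed until its own digest is added to the database, which is the limitation that motivates dynamic analysis.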
These limitations created a need for new approaches to API call sequence analysis. The information gathered through the dynamic approach can also be processed using simple statistics such as frequency counting, or using data mining or machine learning [3]. Recent studies build on the observation that the critical low-level system call sequence stays the same as long as the main purpose of the malware does not change, so they focus on the API call sequence for a particular malware function rather than on the call sequence for the malware in each class. Sequence alignment algorithms are used to extract similar subsequences from different sequences. These algorithms have been applied in natural language processing and biometrics, with excellent results.
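To make the idea of extracting a shared subsequence concrete, the sketch below computes a longest common subsequence (LCS) of two API call traces with dynamic programming. LCS is used here as a simple stand-in for sequence alignment and is not the algorithm of [3]; the traces are invented, and the second one is padded with dummy calls.

```python
def longest_common_subsequence(a: list[str], b: list[str]) -> list[str]:
    """Return one longest common subsequence of two call traces."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the subsequence.
    result = []
    i = j = 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical traces: trace_b performs the same core behavior as trace_a
# but inserts dummy calls (GetTickCount, Sleep) between the real ones.
trace_a = ["CreateFileW", "WriteFile", "RegSetValueExW", "CreateProcessW"]
trace_b = ["CreateFileW", "GetTickCount", "WriteFile", "Sleep",
           "RegSetValueExW", "GetTickCount", "CreateProcessW"]

print(longest_common_subsequence(trace_a, trace_b))
```

Because the comparison follows the order of the calls rather than their frequency, the inserted dummy calls do not hide the shared behavior.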
How can DNA be used to compromise a computer?
The researchers at the University of Washington set out to mimic an adversary: they (1) synthesized a real, biological DNA sequence with a malicious exploit embedded in it, then experimentally evaluated the impact of that exploit DNA on a victim by having the victim (2) sequence the DNA using standard sequencing methods and (3) post-process the DNA sequence with a realistic program, one that a scientist might use to analyze the resulting DNA sequence [2]. Their results show that, while the program they exploited was vulnerable to a basic buffer overflow exploit, the security of the overall DNA sequencing pipeline is not much better.
In their experiment they used fqzcomp, a FASTQ compression utility designed to compress DNA sequences. They obtained fqzcomp from https://sourceforge.net/projects/fqzcomp/ and inserted a vulnerability into version 4.6 of the source code: a function that processes and compresses DNA reads individually, using a fixed-size buffer to store the compressed data. With this modification, a DNA read that is longer than expected overflows the buffer, which allows control flow to be hijacked. A fixed-size buffer is a plausible vulnerability, since fqzcomp already contains more than two dozen static buffers. They modified 54 lines of C++ code and removed 127 lines from fqzcomp. The modified version used a simple 2-bit DNA encoding scheme in which the four nucleotides were encoded as two bits each (A as 00, C as 01, G as 10, and T as 11), packing bits into bytes starting with the most significant bits. They ran the target program in a simplified computing environment, with common security features such as stack canaries and ASLR disabled and the stack marked as executable.
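The 2-bit encoding described above can be sketched in a few lines. The function names are chosen for this example and are not taken from fqzcomp, and the sketch only handles reads whose length is a multiple of four bases.

```python
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(read: str) -> bytes:
    """Pack four bases per byte, with the first base in the most significant bits."""
    if len(read) % 4 != 0:
        raise ValueError("this sketch only handles reads of length 4n")
    packed = bytearray()
    for start in range(0, len(read), 4):
        byte = 0
        for base in read[start:start + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        packed.append(byte)
    return bytes(packed)


def unpack_bases(data: bytes) -> str:
    """Inverse of pack_bases: turn each byte back into four bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)


print(pack_bases("ACGT").hex())   # 00 01 10 11 -> 0x1b
print(unpack_bases(b"\x90"))      # 10 01 00 00 -> GCAA
```

The second call shows the direction that matters to an attacker: any byte value, including machine code, corresponds to some four-base string that could be synthesized.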
Today, any fixed-size buffer would likely be vulnerable, as newer long-read sequencing technologies can produce reads upwards of 60,000 bases long [4]. Their exploit triggered a buffer overflow when the program tried to read the 176 base pairs on their strand; part of the encoded code then gave the team remote control of the sequencing machine's computer and later crashed the system. Their demonstration serves as a warning about a new kind of attack that could occur someday.
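The arithmetic behind this risk is simple: at 2 bits per base, a read of n bases needs n/4 bytes, rounded up. The sketch below uses a hypothetical buffer capacity of 32 bytes, chosen only for illustration and not taken from [2], to show the length check that a fixed-size buffer requires before any data is copied.

```python
READ_BUFFER_BYTES = 32  # hypothetical fixed capacity, for illustration only


def packed_size(num_bases: int) -> int:
    """Bytes needed to hold num_bases at 2 bits per base, rounded up."""
    return (num_bases * 2 + 7) // 8


def fits_in_buffer(num_bases: int, capacity: int = READ_BUFFER_BYTES) -> bool:
    """The check that must pass before a read is copied into the buffer."""
    return packed_size(num_bases) <= capacity


print(packed_size(176))      # the 176-base strand packs into 44 bytes
print(packed_size(60_000))   # a 60,000-base read packs into 15,000 bytes
print(fits_in_buffer(100))
print(fits_in_buffer(176))
```

In C or C++, omitting a check like `fits_in_buffer` means that the longer read is written past the end of the buffer; rejecting or reallocating on failure is one way to close that gap.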
Another research team set up a virtual environment in which to run malicious programs and trace their API call sequences at runtime. They used Detours, a hooking library supported by Microsoft, to trace API calls. Before the target function starts, the detour function logs the target function's name, which allows the API call sequence to be traced. They used VirtualBox, running 32-bit Windows XP Service Pack 3, to execute malware and observe its activity. By default, they set the maximum monitoring period for tracing an API call sequence to two minutes.
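The logging step can be pictured with a small analogy. The sketch below is not the Detours API; it uses a Python decorator to wrap hypothetical functions so that each call records the target's name before the target runs, which is the same idea at a much higher level.

```python
import functools

call_log: list[str] = []


def traced(target):
    """Wrap target so that its name is logged before it runs."""
    @functools.wraps(target)
    def detour(*args, **kwargs):
        call_log.append(target.__name__)   # log first
        return target(*args, **kwargs)     # then forward to the real function
    return detour


# Hypothetical stand-ins for monitored API functions.
@traced
def open_file(path: str) -> str:
    return f"handle:{path}"


@traced
def write_file(handle: str, data: bytes) -> int:
    return len(data)


handle = open_file("C:/temp/a.txt")
write_file(handle, b"abc")
print(call_log)
```

The resulting list of names, in call order, is the kind of API call sequence that the later alignment step takes as input.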
For DNA sequence alignment they used ClustalX, a widely used freeware tool for analyzing genomic sequences such as DNA, RNA or protein sequences. Their results showed that malware in the same family shares many common call subsequences. On the other hand, malware in different classes can also have common call sequences [3].
Conclusions
In this paper, we have seen how API call sequence analysis and control flow analysis are used to detect malware, and how malware can be delivered through a DNA sequence. Signature-based detection systems depend on a malware sample's static information, such as file size, processes and artifacts, and therefore fail to detect new, unknown malware until the signature has been updated. We also found that antivirus vendors' labeling of malware may not be accurate enough to be applied in the dynamic analysis of API call sequences.
References
- Sequence analysis. https://en.wikipedia.org/wiki/Sequence_analysis
- Peter Ney, Karl Koscher, Lee Organick, Luis Ceze, Tadayoshi Kohno (August 2017). Computer Security, Privacy, and DNA Sequencing: Compromising Computers with Synthesized DNA, Privacy Leaks, and More. https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-ney.pdf
- Youngjoon Ki, Eunjin Kim, Huy Kang Kim (June 2015). A Novel Approach to Detect Malware Based on API Call Sequence Analysis. https://journals.sagepub.com/doi/full/10.1155/2015/659101
- Pacific Biosciences of California (February 2016). SMRT Sequencing: Read Lengths. http://www.pacb.com/smrt-science/smrtsequencing/read-lengths/
- Researchers Embed Malware Into DNA to Hack DNA-Sequencing Software (August 2017). https://spectrum.ieee.org/the-human-os/computing/software/researchers-embed-malicious-code-into-dna-to-hack-dna-sequencing-software