Friday, April 5, 2019

Machine Learning in Malware Detection

1.0 Background Research

The idea behind malware dates back to 1949, when John von Neumann described the theory of self-replicating programs. Since then, more and more malware has been created. Antivirus companies are constantly looking for the most effective method of detecting malware. One of the best-known methods used by antivirus companies is signature-based detection. Over the years, however, the volume of malware has grown uncontrollably, and in recent years signature-based detection has proven ineffective against that growth. In this research, I have chosen another approach: applying machine learning to malware detection. Using the dataset from the Microsoft Malware Classification Challenge (BIG 2015), I will develop an algorithm that can detect malware effectively with a low false positive rate.

1.1 Problem Statement

With the growth of technology, the number of malware samples is also increasing day by day. Malware is now designed with mutation characteristics, which causes enormous growth in the number of malware variants (Ahmadi, M. et al., 2016). Moreover, with the help of automated malware generation tools, novice malware authors can now easily generate new variants (Lanzi, A. et al., 2010). Given this growth, traditional signature-based detection has proven ineffective against the vast variety of malware (Feng, Z. et al., 2015). Machine learning methods, on the other hand, have proven effective against new malware, but they also have a high false positive rate (Feng, Z. et al., 2015).

1.2 Objective

To investigate how to apply machine learning to malware detection in order to detect unknown malware.
To develop malware detection software that uses machine learning to detect unknown malware.
To validate that machine-learning-based malware detection can achieve a high accuracy rate with a low false positive rate.

1.3 Theoretical / Conceptual Framework

1.4 Significance

Machine learning in malware detection with high accuracy and a low false positive rate will help end users to be free from the fear of malware damaging their computers. Organizations, in turn, will have more secure systems and files.

2.0 Literature Review

2.1 Overview

Traditional security products use virus scanners to detect malicious code; these scanners use signatures that are created by reverse engineering malware. But as malware has become polymorphic or metamorphic, the traditional signature-based detection used by antivirus products is no longer effective against current malware (Willems, G., Holz, T. & Freiling, F., 2007). In current anti-malware products, the malware analysis process carries out two main tasks: malware detection and malware classification. In this paper, I am focusing on malware detection, whose main objective is to detect malware in the system. There are two types of analysis for malware detection: dynamic analysis and static analysis. For effective and efficient detection, the use of feature extraction is recommended (Ahmadi, M. et al., 2016). There are various detection methods; the method used here detects malware through its hex and assembly files. Features will be extracted from both the hex view and the assembly view of malware files.
After the features are extracted by category, all categories are combined into one feature vector for the classifier to test on (Ahmadi, M. et al., 2016). For feature selection, the binary file is separated into blocks so that the similarity of malware binaries can be compared. This reduces the analysis overhead and makes the process faster (Kim, T.G., Kang, B. & Im, E.G., 2013). To build a learning algorithm, the extracted features and their labels undergo classification using any classification method, for example Random Forest, Neural Network, KNN and many others, but Support Vector Machine (SVM) is recommended when noise is present in the extracted features and labels (Stewin, P. & Bystrov, I., 2016). To generate results, the learning model is tested on a labeled dataset to produce a graph that shows the detection rate and the false positive rate. To get the best result, the process is repeated with other classification methods, creating a learning model for each and testing it on the same dataset. The best result is the graph with the highest detection rate and the lowest false positive rate (Lanzi, A. et al., 2010).

2.2 Dynamic and Static Analysis

Dynamic analysis runs the malware in a simulated environment, usually a sandbox, where the malware is executed and its behavior is observed. There are two approaches to dynamic analysis: comparing images of the system before and after the malware execution, and monitoring the malware's actions during execution with the help of a debugger. The first approach usually yields a report similar to one obtained through binary observation, while the second is more difficult to implement but gives a more detailed report about the behavior of the malware (Willems, G., Holz, T. & Freiling, F., 2007).
Static analysis studies the malware without executing it, which makes this method safer than dynamic analysis. With this method, the malware executable is disassembled into a binary file and a hex file. The opcodes within both files are then compared with a pre-generated opcode profile in order to search for malicious code within the malware executable (Santos, I. et al., 2013).

All malware detection requires either static analysis or dynamic analysis. In this paper, we focus on static analysis (Ahmadi, M. et al., 2016). This is because dynamic analysis has a drawback: it can only analyze one malware sample at a time, which makes the whole analysis process take a long time when many samples need to be analyzed (Willems, G., Holz, T. & Freiling, F., 2007). Static analysis is mainly used to analyze hex code files and assembly code files. Compared to dynamic analysis, it takes much less time and is more convenient, as it can be scheduled to scan all the files at once, even offline (Tabish, S.M., Shafiq, M.Z. & Farooq, M., 2009).

2.3 Feature Extraction

For effective and efficient classification, it is wise to extract features from both the hex view file and the assembly view file in order to retrieve complementary data from both (Ahmadi, M. et al., 2016).

Several types of features are extracted from the hex view file and the assembly view file: N-gram, Entropy, Image Representation, String Length, Symbol, Operation Code, Register, Application Programming Interface, Section, Data Define and Miscellaneous (Ahmadi, M. et al., 2016).

The N-gram feature is usually used to classify a sequence of actions in different areas. The sequence of malware execution can be captured by N-grams during feature extraction (Ahmadi, M. et al., 2016).
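As a rough illustration of the N-gram feature, the sketch below counts byte N-grams from a hex view. It is only a minimal example under stated assumptions: the line layout (an address column followed by two-digit hex bytes, with "??" marking unknown bytes) and the helper names parse_hex_view and byte_ngram_counts are hypothetical and are not taken from the cited papers.

```python
import string
from collections import Counter

HEX_DIGITS = set(string.hexdigits)

def parse_hex_view(text):
    """Return the byte values listed in a hex dump, skipping the address column.

    Assumed layout per line: "<address> <byte> <byte> ...", for example
    "00401000 56 8D 44 24". Tokens that are not two hex digits
    (such as "??" placeholders) are skipped.
    """
    values = []
    for line in text.splitlines():
        for token in line.split()[1:]:  # drop the leading address
            if len(token) == 2 and set(token) <= HEX_DIGITS:
                values.append(int(token, 16))
    return values

def byte_ngram_counts(values, n=2):
    """Count every run of n consecutive byte values."""
    return Counter(tuple(values[i:i + n]) for i in range(len(values) - n + 1))

# Made-up two-line sample, not real malware content.
sample = "00401000 56 8D 44 24\n00401004 56 8D ?? 24"
counts = byte_ngram_counts(parse_hex_view(sample), n=2)
print(counts.most_common(1))
```

Skipped placeholders are simply dropped here, so an N-gram can span the gap they leave; a stricter version might reset the window at each unknown byte. The resulting counts could serve as one block of the combined feature vector.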
The Entropy feature measures the uncertainty in a series of bytes in the malware executable file; this uncertainty depends on the amount of information in the executable file (Lyda, R. & Hamrock, J., 2007).

For the Image Representation feature, the malware binary file is read as a vector of 8-bit values and then organized into a 2D array. The 2D array can be visualized as a grayscale image in which each pixel's gray level is a byte of the file; this feature looks for common patterns in the byte arrangement of the malware binary file (Nataraj, L. et al., 2011).

For the String Length feature, we open each malware executable file in hex view and extract all ASCII strings from the malware executable. Because it is difficult to extract only the actual strings without also extracting other non-useful elements, it is necessary to choose the important strings among those extracted (Ahmadi, M. et al., 2016).

For the Operation Code feature: an operation code, also known as an opcode, is the part of a machine language instruction that specifies the operation to perform. In malware detection, the different opcodes and their frequencies are extracted and compared with those of non-malicious software; different sets of opcodes are characteristic of either malware or non-malware (Bilar, D., n.d.).

For the Register feature, the number of register uses can assist in malware classification, as register renaming is used to make malware harder to analyze and detect (Christodorescu, M., Song, D. & Bryant, R.E., 2005).

For the Application Programming Interface feature, API calls are code that calls the functions of other software; in our case this will be the Windows API. There is a large number of types of API calls in malicious and non-malicious software, and it is hard to differentiate them. Because of this, we focus on the most frequently used API calls in malware binaries in order to narrow the results (Top maliciously used apis, 2017).
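To make the Entropy feature described above more concrete, here is a minimal sketch that computes the Shannon entropy of a byte sequence, both for the whole sequence and per fixed-size window. The function names and the 256-byte window size are assumptions chosen for illustration, not values taken from the cited work.

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum p * log2(1/p) over the observed byte values.
    return sum((c / total) * math.log2(total / c) for c in Counter(data).values())

def windowed_entropy(data, window=256):
    """Entropy of each consecutive window, giving a profile along the file."""
    return [shannon_entropy(data[i:i + window]) for i in range(0, len(data), window)]

uniform = bytes(range(256))   # every byte value once: maximal uncertainty
constant = b"A" * 256         # a single repeated byte: no uncertainty
print(shannon_entropy(uniform), shannon_entropy(constant))
```

Values near 8 bits per byte suggest compressed or encrypted content, while low values suggest repetitive data. Summary statistics of the per-window profile are one plausible way to turn this into fixed-length features.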
For the Data Define feature: not all malware contains API calls, and malware without any API calls mainly consists of operation codes, usually db, dw and dd. There are sets of features (DP) that are able to characterize such malware (Ahmadi, M. et al., 2016).

For the Miscellaneous feature, we choose a few words that most malware has in common from the disassembled malware file (Ahmadi, M. et al., 2016).

Among so many features, the most appropriate for our research are N-gram and Opcode. This is because these two features have been shown to have the highest accuracy with low logloss. These two features appear frequently in malware files, and well-known feature sets for malware already exist for them. The drawback of N-gram and Opcode is that they require a lot of resources and time to process (Ahmadi, M. et al., 2016). We will also evaluate other features to compare with N-gram and Opcode and verify the result.

2.4 Classification

In this section, we do not review the algorithms or mathematical formulas of the classifiers, but rather the characteristics that give them an advantage under certain conditions when classifying malware features. The classifiers reviewed are Nearest Neighbor, Naïve Bayes, Decision Tree, Support Vector Machine and XGBoost (Kotsiantis, S.B., 2007) (Ahmadi, M. et al., 2016).

As we need a classifier to train on our data with the malware features, we review the classifiers in order to choose the one that gives the best result. The Nearest Neighbor classifier is one of the simplest classification methods and is normally implemented in case-based reasoning. Naïve Bayes usually generates simple and constrained models and is not suitable for irregular input data, which makes it unsuitable for malware classification because the data in malware classification is not regular (Kotsiantis, S.B., 2007).
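To show how simple the Nearest Neighbor idea mentioned above is, the following sketch labels a sample by majority vote among its k closest training vectors. It is an illustration only: the two-dimensional vectors, the labels and the function name knn_predict are invented for the example and do not come from the dataset or the cited papers.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k closest training vectors."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors (for instance, two normalized opcode frequencies).
train = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels = ["malware", "malware", "benign", "benign"]
print(knn_predict(train, labels, (0.85, 0.15), k=3))
```

Every prediction compares the query against all stored samples, which is part of why the method is simple to implement but can become slow on large feature sets.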
A Decision Tree classifies features by sorting them into tree nodes based on their feature values, with each branch representing a node value. A Decision Tree decides either true or false based on the node value, which makes it difficult to deal with unknown features that are not stored in a tree node (Kotsiantis, S.B., 2007).

A Support Vector Machine has a complex model that enables it to deal with a large number of features and still obtain good results, which makes it suitable for malware classification, as malware contains a large number of features (Kotsiantis, S.B., 2007).

XGBoost is a scalable tree boosting system that has won many machine learning competitions by achieving state-of-the-art results. Its advantages are that it suits most scenarios and runs faster than most other classification techniques (Chen, T., n.d.).

For our malware analysis we choose XGBoost, as it is suitable for malware classification and is also recommended by the winner of the Microsoft Malware Classification Challenge (Ahmadi, M. et al., 2016). We will also use Support Vector Machine, as it too is suitable for malware classification, and compare its results with those of XGBoost to get a more accurate result.

References

Ahmadi, M. et al., 2016. Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification. ACM Conference on Data and Application Security and Privacy, pp.183-194. Available at: http://doi.acm.org/10.1145/2857705.2857713.

Amin, M. & Maitri, 2016. A Survey of Financial Losses Due to Malware. Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies - ICTCS '16, pp.1-4. Available at: http://dl.acm.org/citation.cfm?doid=2905055.2905362.

Berlin, K., Slater, D. & Saxe, J., 2015. Malicious Behavior Detection Using Windows Audit Logs. Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, pp.35-44. Available at: http://doi.acm.org/10.1145/2808769.2808773.

Feng, Z. et al., 2015. HRS: A Hybrid Framework for Malware Detection. (10), pp.19-26.

Han, K., Lim, J.H. & Im, E.G., 2013. Malware analysis method using visualization of binary files. Proceedings of the 2013 Research in Adaptive and Convergent Systems, pp.317-321.

Kim, T.G., Kang, B. & Im, E.G., 2013. Malware classification method via binary content comparison. Information (Japan), 16(8 A), pp.5773-5788.

Küçüksille, E.U., Yalçınkaya, M.A. & Uçar, O., 2014. Physical Dangers in the Cyber Security and Precautions to be Taken. Proceedings of the 7th International Conference on Security of Information and Networks - SIN '14, pp.310-317. Available at: http://dl.acm.org.proxy1.athensams.net/citation.cfm?id=2659651.2659731.

Lanzi, A. et al., 2010. AccessMiner: Using System-Centric Models for Malware Protection. Proceedings of the 17th ACM Conference on Computer and Communications Security - CCS '10, pp.399-412. Available at: http://dl.acm.org/citation.cfm?id=1866353 and http://portal.acm.org/citation.cfm?doid=1866307.1866353.

Nicholas, C. & Brandon, R., 2015. Document Engineering Issues in Document Analysis. Proceedings of the 2015 ACM Symposium on Document Engineering, pp.229-230. Available at: http://doi.acm.org/10.1145/2682571.2801033.

Patanaik, C.K., Barbhuiya, F.A. & Nandi, S., 2012. Obfuscated malware detection using API call dependency. Proceedings of the First International Conference on Security of Internet of Things - SecurIT '12, pp.185-193. Available at: http://www.scopus.com/inward/record.url?eid=2-s2.0-84879830981&partnerID=tZOtx3y1.

Pluskal, O., 2015. Behavioural Malware Detection Using Efficient SVM Implementation. RACS: Proceedings of the 2015 Conference on Research in Adaptive and Convergent Systems, pp.296-301.

Santos, I. et al., 2013. Opcode sequences as representation of executables for data-mining-based unknown malware detection. Information Sciences, 231, pp.64-82.

Stewin, P. & Bystrov, I., 2016. Detection of Intrusions and Malware, and Vulnerability Assessment. Available at: http://dblp.uni-trier.de/db/conf/dimva/dimva2012.html#StewinB12.

Willems, G., Holz, T. & Freiling, F., 2007. Toward automated dynamic malware analysis using CWSandbox. IEEE Security and Privacy, 5(2), pp.32-39.

Tabish, S.M., Shafiq, M.Z. & Farooq, M., 2009. Malware detection using statistical analysis of byte-level file content. Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics - CSI-KDD '09, pp.23-31. Available at: http://portal.acm.org/citation.cfm?doid=1599272.1599278.

Lyda, R. & Hamrock, J., 2007. Using Entropy Analysis to Find Encrypted and Packed Malware.

Nataraj, L. et al., 2011. Malware Images: Visualization and Automatic Classification.

Bilar, D., n.d. Statistical Structures: Fingerprinting Malware for Classification and Analysis. Why Structural Fingerprinting?

Christodorescu, M., Song, D. & Bryant, R.E., 2005. Semantics-Aware Malware Detection.

Top maliciously used apis, 2017. https://www.bnxnet.com/top-maliciously-used-apis/.

Weiss, S.M. & Kapouleas, I., 1989. An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods. pp.781-787.

Kotsiantis, S.B., 2007. Supervised Machine Learning: A Review of Classification Techniques. 31, pp.249-268.

Chen, T., n.d. XGBoost: A Scalable Tree Boosting System.
