Creating Your Record
The model takes 30 numeric variables, each representing a feature or attribute of a potential phishing website. The dataset this model is based on was created by Rami M. Mohammad of the School of Computing and Engineering, University of Huddersfield.
For full details on the different attributes, see the attached Word document. In the CSV used for predictions, each attribute is encoded as -1, 0, or 1.
- A -1 means the site does not contain an attribute/feature that would classify it as a phishing site.
- A 0 means the feature/attribute is labeled as suspicious.
- A 1 means the feature/attribute is labeled as a potential phishing feature.
Example Data Set:
having_IP_Address { -1|1 } | URL_Length { 1|0|-1 } | Shortining_Service { 1|-1 } | having_At_Symbol { 1|-1 } | double_slash_redirecting { -1|1 } | Prefix_Suffix { -1|1 } | having_Sub_Domain { -1|0|1 } | SSLfinal_State { -1|1|0 } | Domain_registeration_length { -1|1 } | Favicon { 1|-1 } | port { 1|-1 } | HTTPS_token { -1|1 } | Request_URL { 1|-1 } | URL_of_Anchor { -1|0|1 } | Links_in_tags { 1|-1|0 } | SFH { -1|1|0 } | Submitting_to_email { -1|1 } | Abnormal_URL { -1|1 } | Redirect { 0|1 } | on_mouseover { 1|-1 } | RightClick { 1|-1 } | popUpWidnow { 1|-1 } | Iframe { 1|-1 } | age_of_domain { -1|1 } | DNSRecord { -1|1 } | web_traffic { -1|0|1 } | Page_Rank { -1|1 } | Google_Index { 1|-1 } | Links_pointing_to_page { 1|0|-1 } | Statistical_report { -1|1 } |
-1 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | -1 | 1 | 1 | -1 | 1 | -1 | 1 | -1 | -1 | -1 | 0 | 1 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | -1 |
1 | 1 | 1 | 1 | 1 | -1 | 0 | 1 | -1 | 1 | 1 | -1 | 1 | 0 | -1 | -1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | -1 | -1 | 0 | -1 | 1 | 1 | 1 |
1 | 0 | 1 | 1 | 1 | -1 | -1 | -1 | -1 | 1 | 1 | -1 | 1 | 0 | -1 | -1 | -1 | -1 | 0 | 1 | 1 | 1 | 1 | 1 | -1 | 1 | -1 | 1 | 0 | -1 |
When creating your CSV, list the values in the same order as the columns of the data set shown above.
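As a minimal sketch of building one record, the Python below checks each value against the allowed set listed in the example header and writes the values in column order. The names `FEATURES` and `to_csv_line` are hypothetical helpers, not part of the model package, and the sketch assumes the CSV carries no header row, which this document does not state.

```python
import csv
import io

# Column order and allowed values, copied from the example data set header.
FEATURES = [
    ("having_IP_Address", {-1, 1}),
    ("URL_Length", {-1, 0, 1}),
    ("Shortining_Service", {-1, 1}),
    ("having_At_Symbol", {-1, 1}),
    ("double_slash_redirecting", {-1, 1}),
    ("Prefix_Suffix", {-1, 1}),
    ("having_Sub_Domain", {-1, 0, 1}),
    ("SSLfinal_State", {-1, 0, 1}),
    ("Domain_registeration_length", {-1, 1}),
    ("Favicon", {-1, 1}),
    ("port", {-1, 1}),
    ("HTTPS_token", {-1, 1}),
    ("Request_URL", {-1, 1}),
    ("URL_of_Anchor", {-1, 0, 1}),
    ("Links_in_tags", {-1, 0, 1}),
    ("SFH", {-1, 0, 1}),
    ("Submitting_to_email", {-1, 1}),
    ("Abnormal_URL", {-1, 1}),
    ("Redirect", {0, 1}),
    ("on_mouseover", {-1, 1}),
    ("RightClick", {-1, 1}),
    ("popUpWidnow", {-1, 1}),
    ("Iframe", {-1, 1}),
    ("age_of_domain", {-1, 1}),
    ("DNSRecord", {-1, 1}),
    ("web_traffic", {-1, 0, 1}),
    ("Page_Rank", {-1, 1}),
    ("Google_Index", {-1, 1}),
    ("Links_pointing_to_page", {-1, 0, 1}),
    ("Statistical_report", {-1, 1}),
]


def to_csv_line(record):
    """Validate a dict of feature values and return one CSV line in column order."""
    row = []
    for name, allowed in FEATURES:
        value = record[name]  # raises KeyError if a feature is missing
        if value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
        row.append(value)
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue().strip()


# The first example row from the data set above.
first_row = [-1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, -1, 1, -1, 1,
             -1, -1, -1, 0, 1, 1, 1, 1, -1, -1, -1, -1, 1, 1, -1]
record = dict(zip((name for name, _ in FEATURES), first_row))
line = to_csv_line(record)
print(line)
```

Keying the record by feature name and letting the helper impose the order avoids the most likely mistake, which is a value landing in the wrong column.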
Making your Prediction
When you make predictions with SageMaker, the model outputs a score between 0 and 1. The closer the score is to 0, the more likely the site is a phishing site.
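One way to turn the score into a label is sketched below. The function `interpret_score` and the 0.5 cutoff are illustrative assumptions; this document does not specify a threshold, so choose one that suits your tolerance for false positives.

```python
def interpret_score(score, threshold=0.5):
    """Map a model score in [0, 1] to a label.

    Scores closer to 0 indicate a more likely phishing site.
    The default threshold of 0.5 is an assumption, not a documented value.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score!r}")
    return "likely phishing" if score < threshold else "likely legitimate"


print(interpret_score(0.08))
print(interpret_score(0.93))
```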
Works Cited
Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Mohammad, Rami, McCluskey, T.L. and Thabtah, Fadi (2012) An Assessment of Features Related to Phishing Websites using an Automated Technique. In: International Conference For Internet Technology And Secured Transactions. ICITST 2012. IEEE, London, UK, pp. 492-497. ISBN 978-1-4673-5325-0