HBKU - QCRI
Internet Cyber Threat Intelligence

Malicious domains, URLs and IPs are among the major platforms attackers use to launch cyber attacks. The cyber security group has established a long-running effort to detect and predict malicious Internet entities accurately and at scale. It develops graph inference and graph learning techniques to analyze the footprints that attackers leave when they deploy malicious campaigns, including phishing, malware distribution and botnets, and has built Hemaya Domain, a scalable, real-time system for malicious domain detection, prediction and analysis. It further designs solutions to tackle some of the key challenges in applying AI to cyber security problems, in particular the scarcity and unreliability of labeled datasets, the ad hoc practices used to form malicious ground truth, and the data bias that arises when integrating data from multiple sources.

Description, Goals, and Focus

The increasing number of cyber-attacks today is often facilitated by the use of malicious domains and IPs, which attackers frequently change to avoid detection and prolong their attacks. This makes traditional blacklisting methods ineffective and often only results in action after the damage has already been done. Our project aims to tackle this issue by proactively identifying and blocking these malicious domains before they can be used in an attack, thus allowing for early mitigation of the threat.

To meet the needs of our stakeholders, we have developed multiple sub-projects that fall under the broader category of generating intelligence on internet cyber threats.

Our most advanced project aims to identify potentially malicious domains in their early stages by combining global and local features. We use passive DNS traces to identify associations among domains and apply inference methods to identify those closely linked to known malicious domains. This global view alone is not enough, so we also consider local features such as lexical details, hosting information and registration information. We use graph neural networks such as GraphSAGE to integrate both global and local features for improved detection of malicious domains and URLs.
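The global, association-based part of this idea can be illustrated with a minimal sketch. It is not the inference method used in the project: it is a simple iterative score propagation over a domain-IP graph, and the function name `propagate_scores`, the `decay` parameter and the sample resolutions are all assumptions made for illustration.

```python
from collections import defaultdict


def propagate_scores(resolutions, seeds, rounds=10, decay=0.5):
    """Spread suspicion from known-malicious seed domains over shared IPs.

    resolutions: iterable of (domain, ip) pairs, e.g. from passive DNS.
    seeds: set of domains already known to be malicious.
    Hypothetical sketch; not the project's actual inference algorithm.
    """
    domain_ips = defaultdict(set)
    ip_domains = defaultdict(set)
    for domain, ip in resolutions:
        domain_ips[domain].add(ip)
        ip_domains[ip].add(domain)

    # Seeds start (and stay) at 1.0, every other domain starts at 0.0.
    scores = {d: (1.0 if d in seeds else 0.0) for d in domain_ips}

    for _ in range(rounds):
        # An IP's score is the mean score of the domains it hosts, so an IP
        # shared by many unrelated domains dilutes the association.
        ip_scores = {
            ip: sum(scores[d] for d in hosted) / len(hosted)
            for ip, hosted in ip_domains.items()
        }
        updated = {}
        for domain, ips in domain_ips.items():
            if domain in seeds:
                updated[domain] = 1.0
            else:
                neighbor_mean = sum(ip_scores[ip] for ip in ips) / len(ips)
                updated[domain] = decay * neighbor_mean
        scores = updated
    return scores


if __name__ == "__main__":
    observed = [
        ("bad.example", "198.51.100.1"),
        ("suspect.example", "198.51.100.1"),
        ("clean.example", "203.0.113.7"),
        ("other.example", "203.0.113.7"),
    ]
    for name, score in sorted(propagate_scores(observed, {"bad.example"}).items()):
        print(f"{name}: {score:.3f}")
```

A domain that shares hosting with a seed receives a non-zero score, while domains with no path to a seed stay at zero. A graph neural network such as GraphSAGE would go further by learning how to combine these neighborhood signals with per-domain local features.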

Another crucial aspect of identifying malicious domains is determining how they are hosted, as the approach to mitigation can vary greatly depending on the answer. To tackle this issue, we have developed several machine learning models using content-agnostic features of malicious websites. Accurately identifying where these domains are hosted also remains a significant open problem. Attackers often change hosting IPs to evade detection, so the IP addresses they use must be detected in a timely manner in order to take mitigative action without disrupting legitimate activity.

All of our research efforts rely on ground truth data of domains and IPs from whitelists and blacklists. However, these lists have several limitations: (1) they may contain false positives and false negatives, (2) some blacklists can be slow to update, (3) some lists specialize in certain types of malicious domains or IPs, and (4) some blacklists rely on other blacklists. To address the problem of noisy malicious ground truth, we developed SIRAJ, a framework for aggregating intelligence from potentially conflicting sources, such as VirusTotal engines, to produce high-quality results. SIRAJ is flexible and can be used for diverse entity types such as URLs, malware, and IPs, and it works well even when little or no labeled data is available.
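To make the aggregation problem concrete, the sketch below combines conflicting scanner verdicts without any labels. It is not SIRAJ's algorithm: it substitutes a much simpler iterative reliability-weighted vote, and the function name `aggregate_verdicts`, the scanner names and the sample reports are hypothetical.

```python
def aggregate_verdicts(reports, rounds=5):
    """Combine conflicting binary verdicts using estimated scanner reliability.

    reports: {entity: {scanner: 0 or 1}} where 1 means "flagged as malicious".
    Returns (entity_scores, scanner_weights).
    Simplified stand-in for illustration; not the SIRAJ method.
    """
    scanners = {s for votes in reports.values() for s in votes}
    weights = {s: 1.0 for s in scanners}  # start by trusting every scanner equally
    scores = {}

    for _ in range(rounds):
        # Step 1: weighted vote per entity using the current scanner weights.
        for entity, votes in reports.items():
            total = sum(weights[s] for s in votes)
            flagged = sum(weights[s] for s, v in votes.items() if v == 1)
            scores[entity] = flagged / total if total else 0.0

        # Step 2: re-estimate each scanner's weight as its smoothed agreement
        # rate with the current consensus.
        for scanner in scanners:
            agree = seen = 0
            for entity, votes in reports.items():
                if scanner in votes:
                    seen += 1
                    consensus = 1 if scores[entity] >= 0.5 else 0
                    agree += int(votes[scanner] == consensus)
            weights[scanner] = (agree + 1) / (seen + 2)  # Laplace smoothing

    return scores, weights


if __name__ == "__main__":
    sample = {
        "u1": {"engine_a": 1, "engine_b": 1, "engine_c": 0},
        "u2": {"engine_a": 0, "engine_b": 0, "engine_c": 1},
        "u3": {"engine_a": 1, "engine_b": 1, "engine_c": 1},
        "u4": {"engine_a": 0, "engine_b": 0, "engine_c": 0},
    }
    entity_scores, scanner_weights = aggregate_verdicts(sample)
    print(entity_scores)
    print(scanner_weights)
```

A scanner that frequently disagrees with the consensus ends up with a lower weight, so its votes count for less in later rounds. This simple scheme assumes scanners err independently, which limitation (4) above shows is often not the case.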

We are currently addressing the problem of bias in benign ground truth. While there are many ways to obtain malicious ground truth, from sources such as blacklists, thresholding of VirusTotal results, and SIRAJ, there is no analogous resource for benign URLs. The benign samples used in practice are therefore frequency-biased, which produces biased classifiers. Our goal is to develop a sampling methodology that selects a subset of entities from a large and diverse universe of entities such as URLs, resulting in models that perform better than those trained with existing common approaches.
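The effect of frequency bias can be illustrated with a small sketch. The sampling methodology itself is still under development and is not described here, so the code below shows only one generic mitigation, capping the number of sampled URLs per host; the function name `sample_benign_urls` and its parameters are assumptions.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def sample_benign_urls(urls, per_host_cap=2, seed=0):
    """Draw a benign sample in which no single host dominates.

    Generic illustration of capping per-host frequency; hypothetical, and
    not the project's sampling methodology.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_host = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        by_host[host].append(url)

    sample = []
    for host in sorted(by_host):
        group = by_host[host]
        # Take at most per_host_cap URLs from each host.
        sample.extend(rng.sample(group, min(per_host_cap, len(group))))
    return sample


if __name__ == "__main__":
    candidates = [f"https://popular.example/page{i}" for i in range(5)]
    candidates.append("https://rare.example/index")
    for url in sample_benign_urls(candidates):
        print(url)
```

Without the cap, a uniform draw from this candidate list would be dominated by the popular host, and a classifier trained on it would see very little of the long tail.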

We also focus on phishing attacks since they are the most common attacks in Qatar. We observe that phishing sites have been increasingly acquiring certificates to look more legitimate and reach more victims. With the introduction of Google’s Certificate Transparency (CT) Logs, phishing domains are forced to be CT-compliant in order to present victims with their pages and increase the effectiveness of their attacks. This gives us an opportunity to predict both new and long-lived phishing domains early by monitoring, in real time, the domains appended to the logs.

Existing phishing detection techniques continue to rely on blacklists or content-based analysis, which suffer from various challenges and exhibit considerable detection delays because they are reactive in nature. We seek to understand whether artifacts of phishing are manifested in other sources of intelligence related to a domain. We construct thoroughly verified, realistic benign and phishing datasets. We study various novel aspects and characteristics computed from viable sources of data, including Certificate Transparency Logs and passive DNS records. We show clear differences between benign and phishing domains that can pave the way for content-agnostic approaches to predict phishing domains before the contents of these webpages reach users. We create models and show that it is possible to (1) perform content-agnostic predictions with very high precision and recall, and (2) predict phishing domains early compared to state-of-the-art content-based tools such as VirusTotal.
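As a rough illustration of what content-agnostic features can look like, the sketch below derives a few signals from a domain name, its certificate validity period and its passive DNS resolutions, without fetching any page content. The feature set, the `DomainRecord` type, the function name and the brand keyword list are assumptions for illustration, not the features used in the published models.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DomainRecord:
    """Hypothetical record joining a CT log entry with passive DNS data."""
    name: str
    cert_not_before: date
    cert_not_after: date
    resolved_ips: list = field(default_factory=list)


def content_agnostic_features(record, brand_keywords=("paypal", "bank")):
    """Compute simple features that never require visiting the website."""
    labels = record.name.lower().rstrip(".").split(".")
    name_without_tld = "".join(labels[:-1])
    return {
        "num_labels": len(labels),
        "num_hyphens": record.name.count("-"),
        "num_digits": sum(ch.isdigit() for ch in record.name),
        "has_brand_keyword": any(k in name_without_tld for k in brand_keywords),
        "cert_validity_days": (record.cert_not_after - record.cert_not_before).days,
        "num_distinct_ips": len(set(record.resolved_ips)),
    }


if __name__ == "__main__":
    record = DomainRecord(
        name="paypal-login.secure-check.example",
        cert_not_before=date(2022, 1, 1),
        cert_not_after=date(2022, 4, 1),
        resolved_ips=["198.51.100.1", "198.51.100.1", "198.51.100.2"],
    )
    print(content_agnostic_features(record))
```

Feature dictionaries of this kind could then be fed to any standard classifier; which features actually separate benign from phishing domains is an empirical question that the datasets described above are meant to answer.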

A key objective in our research is to make our findings readily available to stakeholders in Qatar and the broader global security community. To achieve this, we have created Hemaya, a domain name search engine that utilizes both our own research outcomes and external blacklists (such as VirusTotal) and whitelists (such as Alexa Top 1M). It offers proactive intelligence on any domain, whether new or established, on the internet.

Current Status

The project has successfully achieved most of its goals, including the early detection and effective mitigation of malicious domains using models that classify IPs and domains into public and dedicated entities for improved accuracy. Additionally, models have been developed to classify domains as attacker-owned or compromised for better response decisions. The SIRAJ framework was also created to generate high-quality lists of malicious domains by combining multiple sources of intelligence. The team is currently working on addressing bias in benign ground truth data and expects to complete this task by the end of the year. The Hemaya platform has been developed and is currently in operation, and the team is working on integrating SIRAJ into the platform, which is expected to be ready by the end of the year.

Publications

  1. E. Choo, M. Nabeel, M. AlSabah, I. Khalil, T. Yu, W. Wang, “DeviceWatch: A Data-Driven Network Analysis Approach to Identifying Compromised Mobile Devices with Graph-Inference,” ACM Transactions on Privacy and Security 26(1): 9:1-9:32 (2023). PDF
  2. S. Thirumuruganathan, M. Nabeel, E. Choo, I. Khalil, T. Yu, “SIRAJ: A Unified Framework for Aggregation of Malicious Entity Detectors,” IEEE Symposium on Security and Privacy, 2022, pp. 507-521. PDF
  3. M. AlSabah, M. Nabeel, Y. Boshmaf, E. Choo, “Content-Agnostic Detection of Phishing Domains using Certificate Transparency and Passive DNS,” RAID 2022, pp. 446-459. PDF
  4. S. Vidyakeerthi, M. Nabeel, C. Elvitigala, C. Keppitiyagama, “PhishChain: A Decentralized and Transparent System to Blacklist Phishing URLs,” The Web Conference 2022, pp. 286-289. PDF
  5. R. Silva, M. Nabeel, C. Elvitigala, I. Khalil, T. Yu, and C. Keppitiyagama, “Compromised or Attacker-Owned: A Large-Scale Classification and Study of Hosting Domains of Malicious URLs,” USENIX Security Symposium 2021, pp. 3721-3738.
  6. M. Nabeel, I. Khalil, B. Guan, and T. Yu, “Following Passive DNS Traces to Detect Stealthy Malicious Domains Via Graph Inference,” ACM Trans. Priv. Secur. (TOPS) 23(4): 17:1-17:36 (2020). PDF
  7. P. Xia, M. Nabeel, I. Khalil, H. Wang, and T. Yu, “Identifying and Characterizing COVID-19 Themed Malicious Domain Campaigns,” CODASPY, 2021, pp. 209-220. PDF
  8. I. Khalil, B. Guan, M. Nabeel, T. Yu, “A Domain is only as Good as its Buddies: Detecting Stealthy Malicious Domains via Graph Inference,” CODASPY 2018: 330-341 (Best Paper Award). PDF
  9. Y. Zhauniarovich, I. Khalil, T. Yu, M. Dacier, “A Survey on Malicious Domains Detection through DNS Data Analysis,” ACM Computing Surveys 51(4): 67:1-67:36 (2018). PDF
  10. I. Khalil, T. Yu, and B. Guan, “Discovering Malicious Domains through Passive DNS Data Graph Analysis,” ACM Asia Conference on Computer and Communications Security (ASIACCS 2016), May 30 - June 3, 2016, in Xi'an, China. PDF

Patents

  1. F. Deniz, I. Khalil, M. Nabeel, and T. Yu. “Methods and Techniques to Proactively Detect Malicious Domains Using Graph Representation Learning.” Provisional Patent# D2022-0113. Disclosure date: Dec 22, 2022.
  2. M. Nabeel, I. Khalil, E. Choo, and T. Yu. “Methods and Techniques to Generate High-Quality Threat Intelligence from Aggregated Threat Reports.” Provisional Patent# D2021-0081. Disclosure date: Oct 28, 2021.
  3. Mashael Alsabah, Mohamed Nabeel, Yazan Boshmaf. “Phishing domain detection systems and methods.” US Patent App. 17/229,386, 2021.
  4. M. Nabeel, I. Khalil and T. Yu. “Methods and Techniques to Classify Malicious Domain Hosting Types.” Provisional Patent# D2020-0020. Disclosure date: Feb 17, 2020.
  5. M. Nabeel, I. Khalil and T. Yu. “Brand squatting domain detection systems and methods.” US Patent, Application number 17558986. 2022.
  6. E. Choo, M. Nabeel, I. Khalil, M. Alsabah, T. Yu, and W. Wang. “Compromised mobile device detection system and method.” US Patent, Application number 17495391. 2022.
  7. M. Nabeel, I. Khalil, E. Choo, and T. Yu. “Method and system for domain maliciousness assessment via real-time graph inference.” US Patent Number 11206275. 2021.
  8. I. Khalil, T. Yu and M. Dacier. “Method to identify malicious web domain names thanks to their dynamics.” US Patent US 10,681,070 B2. 2020.

Impact

This project was highly successful and had a substantial impact thanks to our partnerships with local stakeholders such as MOI, QFIT, and Qatar Airways. This allowed us to focus our research and development efforts on addressing the specific challenges faced by these stakeholders. For instance, using our research-based techniques, we were able to detect and inform three main ministries in Qatar of malicious domains that had gone undetected by their security controls. 

The success of our project is demonstrated by the numerous outcomes we achieved, including: (i) the publication of ten papers, one of which received a best paper award, (ii) the filing of eight patents, four of which were licensed to a startup company, (iii) the development of a scanner used by VirusTotal, (iv) the receipt of a prestigious “Tech Team Award” from Hamad Bin Khalifa University, (v) a third place award for a poster at the HBKU Research Day, (vi) funding from QNRF-TUBITAK ($1.5M) and TDF ($90K) grants, and (vii) the development and implementation of a prototype platform called Hemaya, which provides daily domain intelligence lists, real-time domain intelligence, brand monitoring, and IP vulnerability scanning.

Direction and Future Plans

Our future plans for this project involve bringing it to a successful conclusion by the end of this year. In the upcoming year, we will focus on addressing the remaining research challenge of handling bias in the benign ground truth, as well as finalizing the integration of SIRAJ and Hemaya. A significant effort will be directed towards commercializing Hemaya, either through spinning it off as a startup or licensing the remaining patents for integration with other market solutions. We aim to take the innovative technology developed through this project and bring it to the market to benefit users and stakeholders. Additionally, we will actively explore potential partnerships and collaborations that can help us achieve our commercialization goals. Overall, our goal is to ensure that the results of our research have a positive impact on industry and the wider community.