Generating High-Quality Labeled Datasets for AI-enabled Threat Detection through Expert-Led Security Annotation
Executive Summary
A global cybersecurity company providing security platforms struggled to build its next-generation AI-based threat detection system because of the sheer volume of raw security data it had to process. With petabyte-scale logs and subtle signs of compromise hidden in normal system activity, the client grappled with data quality problems and needed accurately labeled datasets to train its machine learning models. Existing data labeling services lacked the security domain expertise to identify threats, and the client’s security team did not have the capacity to build these training sets at scale.
Crest Data tackled these issues with a holistic data labeling strategy for high-consequence security use cases, built around a data labeling team with extensive cybersecurity expertise. By developing a scalable data labeling infrastructure that processes millions of events per day, Crest Data produced more than 500,000 high-quality labeled datasets used to train models for a range of threat detection use cases. This expert-driven process, which included training on specific attack patterns and multiple quality control steps, led to improved threat detection rates, fewer false positives and less alert fatigue, and the ability for the client to discover previously unknown complex attack patterns.
About the Customer
A leading global cybersecurity platform provider serving Fortune 500 companies and government agencies needed to enhance its threat detection capabilities through machine learning. Its security products protect millions of endpoints worldwide and rely on increasingly sophisticated detection algorithms to identify emerging threats.
Customer Challenge
The client faced several operational and technical challenges in developing its next-generation AI-based threat detection system.
The main challenges were:
- Large-Scale and Complex Data: The client had to ingest petabytes of raw security logs and threat data. Subtle signs of compromise were hard to detect because they are rare and blend into legitimate activity.
- Lack of Specialized Expertise: Conventional data labeling services did not have the expert security knowledge to identify threats. As a result, it was difficult to build machine learning datasets.
- Lack of Time and Resources: The client’s security teams lacked the time to manually label training data at the scale needed for their global operations.
- Data Quality Issues: Security datasets were often inconsistently structured and incomplete, making it difficult to train suitable models.
- Critical Training Needs: The client needed highly accurate datasets to train models that detect complex attacks while avoiding false alarms, but had no efficient method of generating them.
Customer Solution
Crest Data delivered an end-to-end data labeling solution for critical security use cases. The client required expert analysis and unprecedented scale for training their next-generation AI threat detection systems.
The solution comprised:
- Expert Security Annotation: Crest Data brought together a team of experts in cybersecurity and refined the data annotation process to suit security-specific data. This team assigned precise labels to intricate patterns of threats and anomalies in extensive security logs using robust quality control procedures.
- Industry-Leading Scale: The approach included creating a scalable platform to process and label millions of security events per day. This led to over 500,000 quality-labeled datasets, allowing for more targeted training of machine learning models for different detection use cases.
- Domain-Specific Training & QA: Crest Data tailored training curricula for attack pattern recognition, alert prioritization, and false positive detection to maintain accuracy. The team built a multi-level verification process and dedicated QA teams with deep security capabilities for consistency and quality.
- Advanced Technical Focus: The solution focused on key areas of security:
  - Threat Classification: Accurate classification of events by threat type and severity, including the stages of attack campaigns.
  - Anomaly Identification: Contextual classification of statistical anomalies to distinguish normal system behavior from security threats.
  - Attack Pattern Recognition: Detection of multi-stage attack campaigns and correlation of disparate security events.
  - Alert Prioritization: Risk classification and business impact analysis for potential threats.
- Structured Implementation Approach: The implementation followed a phased approach that started with a security assessment and workflow design, continued with a pilot project involving rigorous testing and validation, and then moved to large-scale production with continuous, feedback-driven improvement.
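The multi-level verification step described above can be illustrated with a minimal sketch. This is not Crest Data's actual pipeline: the label names, the `consolidate` function, the `Review` type, and the two-thirds agreement threshold are all assumptions chosen for illustration. The idea is that several annotators label the same event independently, a majority vote produces a provisional label, and events with weak agreement or ties are escalated to a senior QA reviewer.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label set; the case study does not publish the real taxonomy.
LABELS = {"benign", "suspicious", "malicious"}


@dataclass
class Review:
    label: str              # provisional label chosen by majority vote
    agreement: float        # share of annotators who chose that label
    needs_escalation: bool  # True if a senior QA reviewer should decide


def consolidate(annotations: list[str], min_agreement: float = 2 / 3) -> Review:
    """Combine independent annotator labels for a single security event."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    unknown = set(annotations) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    counts = Counter(annotations)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(annotations)

    # A tie means no label has a real majority, so it always escalates.
    tied = sum(1 for c in counts.values() if c == top_count) > 1
    return Review(top_label, agreement, tied or agreement < min_agreement)


if __name__ == "__main__":
    print(consolidate(["malicious", "malicious", "benign"]))
    print(consolidate(["malicious", "benign", "suspicious"]))
```

In this sketch the threshold is the main design choice: a higher value sends more events to the QA tier and trades throughput for label consistency.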
Outcomes
The deployment of the custom data labeling solution by Crest Data had a profound impact on the client’s threat detection system, translating into the following key business outcomes:
- Higher Model Accuracy: The client achieved a higher level of accuracy in detecting threats across all their security solutions.
- Lowered Alert Fatigue: The solution greatly reduced the number of false positives, helping to alleviate security team fatigue.
- Enhanced Detection Scope: The client’s detection capabilities expanded to cover complex, previously unknown attack patterns, improving security coverage.
- Sustainable Competitive Advantage: The client’s superior detection rates for zero-day and advanced malware threats enabled them to maintain a competitive edge and better serve their customers.
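Outcomes such as higher accuracy and fewer false positives are typically quantified against expert-labeled ground truth. The sketch below shows one common way to do that, assuming a binary "threat / not threat" framing; the function name `detection_metrics` and the sample values are hypothetical and are not results from this engagement.

```python
def detection_metrics(y_true: list[bool], y_pred: list[bool]) -> dict[str, float]:
    """Compare model alerts (y_pred) with expert labels (y_true)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # real threats that were flagged
    fp = sum(1 for t, p in pairs if not t and p)      # benign events that were flagged
    fn = sum(1 for t, p in pairs if t and not p)      # real threats that were missed
    tn = sum(1 for t, p in pairs if not t and not p)  # benign events left alone

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),            # share of alerts that were real
        "recall": ratio(tp, tp + fn),               # share of threats that were caught
        "false_positive_rate": ratio(fp, fp + tn),  # share of benign events flagged
    }


if __name__ == "__main__":
    # Illustrative toy data only.
    expert_labels = [True, True, False, False, False]
    model_alerts = [True, False, True, False, False]
    print(detection_metrics(expert_labels, model_alerts))
```

A falling false positive rate at stable or rising recall is the pattern that corresponds to reduced alert fatigue without lost detections.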
About Crest Data
Crest Data is a data and AI-driven technology solutions provider for enterprises and technology innovators across cybersecurity and cloud security, helping them move faster and more securely. We offer specialized services in AI & ML and MLOps to enhance next-generation threat detection capabilities. Our specialization includes Data Labeling & Annotation for Security, efficiently bridging the gap between massive raw data volumes and robust machine learning by providing expert-led annotation for complex security logs.




