Anomaly Detection in HTTP Requests: A Machine Learning Approach
A while back, I was given an interesting assignment: build a model to detect anomalous HTTP requests. The goal was to identify malicious web traffic by analyzing patterns in normal and anomalous requests. This led me to explore the CSIC 2010 dataset, a well-known benchmark for HTTP anomaly detection.
Although the assignment called for unsupervised learning, I started with a supervised approach to establish a baseline. Since the dataset was originally designed for unsupervised learning, I then explored that path as well.
In this blog, I’ll walk through both approaches—supervised (Random Forest) and unsupervised (Isolation Forest & Autoencoders)—and compare their performance.
The Dataset: CSIC 2010 HTTP dataset
The CSIC 2010 HTTP dataset contains over 36,000 normal requests and 25,000+ anomalous requests (web attacks) targeting an e-commerce application. The data is structured into:
- Normal Traffic (Training)
- Normal Traffic (Test)
- Anomalous Traffic (Test)
While the dataset is designed for unsupervised learning, I started with a supervised approach by combining normal and anomalous data into a labeled dataset.
Key Challenges
- Unsupervised nature: No labels for training, only normal traffic.
- Raw text format: HTTP logs need parsing into structured features.
Step 1: Preprocessing the Raw Data
The most critical (and tedious) step was parsing the raw `.txt` files into a structured format. Here’s how I tackled it:
Key Challenges:
- The dataset came as raw HTTP requests in `.txt` files.
- Requests were split across files labeled by type (`normalTrafficTraining.txt`, `anomalousTrafficTest.txt`, etc.).
- Each request needed parsing into features like HTTP method, URL, headers, and body.
Parsing Workflow:
1. Extract Request Components:
   - Split raw text into individual requests using regex (e.g., `GET` or `POST` as delimiters).
   - Parse each request into structured fields:

```python
def parse_http_request(request_text):
    # Separate the header block from the (optional) body.
    head, _, body = request_text.partition('\n\n')
    lines = head.strip().splitlines()
    # The request line holds the method, URL, and HTTP version.
    method, url, http_version = lines[0].split(maxsplit=2)
    headers = {}
    for line in lines[1:]:
        if ':' in line:
            key, value = line.split(':', 1)
            headers[key.strip()] = value.strip()
    return {
        'Method': method,
        'URL': url,
        'HTTP_Version': http_version,
        'User-Agent': headers.get('User-Agent', ''),
        'Body': body.strip()  # empty for GET requests
    }
```
2. Label the Data:
   - Filenames indicated whether a request was `Normal` or `Anomalous` and part of the `Training` or `Test` set.
3. Combine and Save:
   - Merge parsed requests into a DataFrame and save as CSV for training/testing (sketched below).
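To make the merge step concrete, here’s a minimal sketch using the `parse_http_request` function above and the dataset’s file names; the exact request-splitting regex is an assumption:

```python
import re
import pandas as pd

def load_requests(path, label):
    """Parse one raw .txt file into labeled records (sketch)."""
    raw = open(path, encoding='utf-8').read()
    # Split wherever a new request line starts; this pattern is an assumption.
    chunks = re.split(r'\n(?=(?:GET|POST|PUT)\s)', raw)
    rows = [parse_http_request(c) for c in chunks if c.strip()]
    for row in rows:
        row['Label'] = label
    return rows

df = pd.DataFrame(
    load_requests('normalTrafficTraining.txt', 'Normal')
    + load_requests('normalTrafficTest.txt', 'Normal')
    + load_requests('anomalousTrafficTest.txt', 'Anomalous')
)
df.to_csv('http_requests_all.csv', index=False)
```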
Output Files:
- `http_requests_all.csv` (all parsed requests)
- `http_requests_train.csv` (normal training data)
- `http_requests_test.csv` (normal + anomalous test data)
Step 2: Feature Engineering
Once the data was parsed, I engineered features to help the model distinguish normal from anomalous traffic:
- Combine Text Features:
  - Concatenated `Method`, `URL`, `HTTP_Version`, `User-Agent`, and `Body` into a single text feature. Example:

```python
df['text'] = (df['Method'] + ' ' + df['URL'] + ' ' + df['HTTP_Version']
              + ' ' + df['User-Agent'] + ' ' + df['Body'].fillna(''))
```
- Text Cleaning:
- Lowercased text and removed special characters.
- TF-IDF Vectorization:
- Converted text into numerical features using TF-IDF with 1-2 word n-grams.
- Limited to the top 1,000 features to manage dimensionality.
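Putting the cleaning and vectorization together, a minimal sketch (the character whitelist below is my assumption, not the exact original code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Crude cleaning: lowercase, then collapse characters outside a whitelist.
df['text'] = (df['text'].str.lower()
                        .str.replace(r'[^a-z0-9/?=&.\- ]', ' ', regex=True))

# 1-2 word n-grams, capped at 1,000 features to manage dimensionality.
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df['text'])
```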
Step 3: Model Training
Approach 1: Supervised Learning (Random Forest)
Steps
- Parsed HTTP logs into structured fields (`Method`, `URL`, `Headers`, `Body`).
- Engineered features using TF-IDF (1-2 word n-grams).
- Trained a Random Forest on labeled data (normal vs. anomalous).
Results
📊 79% accuracy
✅ High precision on normal traffic (1.00)
⚠️ Lower precision on anomalies (0.55)
Takeaway:
✔️ Works well when labeled data is available.
❌ Struggles with unseen attack patterns.
I chose a Random Forest classifier for its robustness with text data and ability to handle imbalanced datasets:
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, ngram_range=(1, 2))),
    ('clf', RandomForestClassifier(n_estimators=100, class_weight='balanced'))
])
```
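Training and scoring could then look like this. A sketch: the split parameters are illustrative, since in practice the dataset’s own training/test files define the split, and the column names come from the parsing step above.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['Label'], test_size=0.2, stratify=df['Label'], random_state=42
)
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```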
Why Random Forest?
- Handles non-linear relationships well.
- Provides feature importance scores (useful for interpreting HTTP attack patterns).
Step 4: Evaluating the Supervised Model
The model achieved 79% accuracy on the test set, with:
- High precision on normal traffic (1.00) → Few false positives.
- Lower precision on anomalies (0.55) → Some attacks were missed.
Visualizations:
- ROC Curve (AUC = 0.86):
- Showed good trade-off between true positives and false positives.
- Confusion Matrix:
- Highlighted misclassified anomalies.
- Feature Importance:
- Revealed key n-grams (e.g., suspicious URL patterns).
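For reference, a sketch of how the ROC curve and AUC can be computed from the fitted pipeline, continuing the variable names from the sketch above:

```python
from sklearn.metrics import RocCurveDisplay, roc_auc_score

# Score the probability of the 'Anomalous' class.
proba = pipeline.predict_proba(X_test)[:, list(pipeline.classes_).index('Anomalous')]
y_true = (y_test == 'Anomalous').astype(int)
print('AUC:', roc_auc_score(y_true, proba))
RocCurveDisplay.from_predictions(y_true, proba)
```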
Approach 2: Unsupervised Learning
Since real-world traffic often lacks anomaly labels, I tried two unsupervised methods:
1. Isolation Forest
- Idea: Detects anomalies as “easy-to-isolate” outliers.
- Implementation:
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import IsolationForest

model = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000)),
    ('clf', IsolationForest(contamination=0.1))
])
model.fit(normal_training_data)  # trained on normal traffic only
```
- Results:
```
              precision    recall  f1-score   support

      Normal       0.61      0.90      0.73     36000
   Anomalous       0.53      0.17      0.26     24668
```
✅ Catches 90% of normal traffic (low false alarms).
❌ Misses 83% of attacks (low recall).
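The report above can be produced by mapping Isolation Forest’s +1/-1 output back to labels; a hedged sketch (`test_texts` and `test_labels` are assumed variables holding the combined test set):

```python
import numpy as np
from sklearn.metrics import classification_report

# IsolationForest returns +1 for inliers and -1 for outliers.
raw_pred = model.predict(test_texts)
pred = np.where(raw_pred == -1, 'Anomalous', 'Normal')
print(classification_report(test_labels, pred))
```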
2. Autoencoder (Deep Learning)
- Idea: Learns to reconstruct normal traffic; anomalies have high reconstruction error.
- Implementation:
```python
autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(normal_data, normal_data, epochs=10)
```
- Results:
```
              precision    recall  f1-score   support

0 (Normal)        0.65      0.96      0.77     36000
1 (Anomalous)     0.80      0.23      0.36     24668
```
✅ Better balanced performance (66% accuracy vs. 60% for Isolation Forest).
❌ Still misses many attacks (only 23% recall).
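For completeness, here’s a fuller sketch of the autoencoder flow. The layer sizes, batch size, and 95th-percentile threshold are illustrative assumptions; `normal_data` and `test_data` are dense TF-IDF matrices (Keras needs dense input, so sparse matrices must be converted with `.toarray()` first).

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# 1000-dim TF-IDF in, small bottleneck, reconstruct the input.
input_layer = Input(shape=(1000,))
encoded = Dense(32, activation='relu')(input_layer)
decoder = Dense(1000, activation='sigmoid')(encoded)

autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(normal_data, normal_data, epochs=10, batch_size=256, verbose=0)

# Flag requests whose reconstruction error exceeds a percentile of the
# errors seen on normal traffic (the 95th percentile is an assumption).
def recon_error(X):
    return np.mean((autoencoder.predict(X) - X) ** 2, axis=1)

threshold = np.percentile(recon_error(normal_data), 95)
pred = (recon_error(test_data) > threshold).astype(int)  # 1 = Anomalous
```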
Key Takeaways
| Metric | Supervised (RF) | Unsupervised (IF) | Unsupervised (AE) |
|---|---|---|---|
| Accuracy | 79% | 60% | 66% |
| Anomaly Recall | 71% (F1) | 17% | 23% |
| Normal Recall | 72% | 90% | 96% |
Insights
- Parsing is Half the Battle:
  - Raw HTTP data requires careful preprocessing. The `parse_http_request` function was the backbone of this project.
- Text-Based Features Work:
- Even simple TF-IDF features captured meaningful patterns in HTTP requests.
- Room for Improvement:
- Adding length-based features (e.g., URL length) or rule-based flags (e.g., detecting SQL keywords) could boost performance.
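For example, two such features can be added in a couple of lines (an illustrative sketch; the SQL pattern is deliberately crude):

```python
# Hypothetical extra features, not part of the current model.
df['url_len'] = df['URL'].str.len()
sql_pattern = r'(?:\bselect\b|\bunion\b|\bdrop\b|--|%27)'
df['has_sql'] = df['text'].str.contains(sql_pattern, case=False, regex=True).astype(int)
```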
✔️ Supervised learning wins when labels exist—better at catching attacks.
✔️ Unsupervised is realistic for production (no need for labeled attacks).
⚠️ Hybrid approach may be best:
- Use unsupervised for baseline filtering.
- Add rules/SVM for known attack patterns.
Next Steps
- Feature engineering: Add URL length, special characters, SQL keywords.
- Ensemble models: Combine Isolation Forest + Autoencoder predictions.
- Adaptive thresholds: Adjust anomaly sensitivity dynamically.
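As a taste of the ensemble idea: assuming the Isolation Forest and autoencoder predictions from the sketches above are stored as `if_pred` (string labels) and `ae_pred` (0/1), a simple OR-combination flags a request if either detector does:

```python
import numpy as np

# Hypothetical OR-ensemble of the two unsupervised detectors.
ensemble = np.where((if_pred == 'Anomalous') | (ae_pred == 1), 'Anomalous', 'Normal')
```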
Final Thoughts
This project was a great dive into applied anomaly detection. While the results were promising, the real value came from understanding the nuances of HTTP traffic and the importance of clean data.
For more details, check out the full notebook: GitHub Link
#MachineLearning #CyberSecurity #AnomalyDetection #DataScience