Automated ETL pipeline for massive data collection and analysis with intelligent multi-pass scraping, ML gender prediction, quality tracking, and real-time dashboards using Airflow, Spark, and Elastic Stack.
1. Context and Objectives
This project implements a production-ready ETL pipeline for monitoring Instagram account followings with automated change detection, machine learning predictions, and comprehensive data quality tracking. The system runs 24/7 with complete autonomy once Docker Desktop is launched.
Key Objectives
Automated Surveillance: Scrape Instagram followings every ~4 hours (6 times/day) with anti-detection strategies
Change Detection: Identify new followings and unfollows through intelligent daily comparisons (see the sketch after this list)
ML Predictions: Automatic gender prediction with confidence scores using machine learning
Quality Tracking: Comprehensive quality score system tracking scraping completeness and accuracy
Real-Time Dashboards: Modern web dashboard and Kibana visualizations with advanced filters
Structured Data Lake: Three-layer architecture (RAW → FORMATTED → USAGE)
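As an illustration of the change-detection idea, here is a minimal sketch of a day-over-day comparison expressed as set differences. The function and field names are hypothetical; the pipeline's actual comparison logic is not shown in this document.

```python
# Hypothetical sketch: new followings and unfollows as set differences
# between two daily snapshots of usernames.
def compare_days(today: set[str], yesterday: set[str]) -> dict[str, set[str]]:
    return {
        "new_followings": today - yesterday,  # followed today, not yesterday
        "unfollows": yesterday - today,       # followed yesterday, gone today
    }

changes = compare_days({"alice", "bob", "carol"}, {"alice", "dave"})
print(changes)  # e.g. {'new_followings': {'bob', 'carol'}, 'unfollows': {'dave'}}
```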
The main challenge was designing a robust system that balances data collection frequency, anti-detection mechanisms, and quality assurance while maintaining 24/7 autonomous operation.
Figure 1 - High-level overview of the Instagram surveillance pipeline
Demo of a DAG run (visual mode enabled via an X11 server)
⚠️ WARNING: THE COLLECTED DATA DOES NOT COMPLY WITH GDPR. THIS PROJECT WAS CARRIED OUT IN AN ACADEMIC SETTING ONLY.
2. System Architecture
The ETL pipeline follows a modular microservices architecture orchestrated by Apache Airflow with full Docker containerization.
Automated Execution Flow
The system executes on a precisely timed schedule (Europe/Paris timezone; see the scheduling sketch after this list):
6 Daily Scrapings: 02:00, 06:00, 10:00, 14:00, 18:00, 23:00 (+ a random 0-45 min delay for anti-detection)
3 Passes Per Scraping: passes separated by random 60-120 s delays to simulate human behavior
Daily Aggregation: 23:00 - merge of all six daily scrapings, with deduplication
Daily Comparison: 23:00 - detection of new followings and unfollows (Day vs Day-1)
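A minimal sketch of how such a schedule could be declared in Airflow 2.x. The DAG id, task id, and callable below are illustrative, not the project's actual code:

```python
# Hypothetical sketch of the scheduling idea in Airflow 2.x.
import random
import time

import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

def scrape_with_jitter() -> None:
    time.sleep(random.uniform(0, 45 * 60))  # 0-45 min anti-detection delay
    # ... launch the three Selenium scraping passes here ...

with DAG(
    dag_id="instagram_following_scraper",
    schedule="0 2,6,10,14,18,23 * * *",  # the six daily runs
    start_date=pendulum.datetime(2024, 1, 1, tz="Europe/Paris"),
    catchup=False,
):
    PythonOperator(task_id="scrape", python_callable=scrape_with_jitter)
```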
Microservices Stack
Orchestration: Apache Airflow 2.10.3, LocalExecutor, 24/7 DAG scheduler
Data Collection: Selenium 4.36, headless Chrome, cookie-based authentication
Processing: PySpark 4.0.1, Pandas, Scikit-learn 1.6.0
Storage & Visualization: PostgreSQL 14, Elasticsearch 8.11, Kibana 8.11, Flask dashboard
Data Lake Architecture
data/
├── raw/ # Raw JSON from scraping
│ └── YYYY-MM-DD_HH-MM-SS/
├── formatted/ # Cleaned data with ML predictions
│ └── YYYY-MM-DD_aggregated/
└── usage/ # Daily aggregations & comparisons
└── YYYY-MM-DD_comparatif/
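A small helper like the following could produce the layer paths shown above. This is a hypothetical sketch; the function names are assumptions:

```python
# Hypothetical helper mirroring the directory naming shown above.
from datetime import datetime
from pathlib import Path

DATA_ROOT = Path("data")

def raw_dir(now: datetime) -> Path:
    return DATA_ROOT / "raw" / now.strftime("%Y-%m-%d_%H-%M-%S")

def formatted_dir(now: datetime) -> Path:
    return DATA_ROOT / "formatted" / f"{now:%Y-%m-%d}_aggregated"

def usage_dir(now: datetime) -> Path:
    return DATA_ROOT / "usage" / f"{now:%Y-%m-%d}_comparatif"

print(raw_dir(datetime(2025, 1, 1, 14, 3, 7)))  # data/raw/2025-01-01_14-03-07
```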
Prerequisites and Installation Guide
Prerequisites
Docker Desktop - for containerizing all services
VS Code - recommended code editor
"Cookies.txt" Chrome extension - to extract Instagram authentication cookies
Personal Instagram account - required for authentication and scraping
3. Add your own Instagram cookies using the free Chrome extension (Get cookies.txt)
4. Press Enter to continue
5. Check in Docker Desktop that the containers are running and healthy
6. Wait for the final checks while the Dashboard, Elasticsearch, and Kibana become accessible
7. Optional: make help → see all available commands
8. Optional: make status
3. ETL Pipeline Stages
Three-stage pipeline combining Selenium scraping, PySpark processing, and multi-target storage.
Extraction: Selenium-based scraping with cookie auth, 3 passes, and rate limiting (60-120 s delays); sketched below
Transformation: PySpark processing with ML gender prediction, quality scoring, and daily aggregation; sketched below
Loading: PostgreSQL, Elasticsearch, and JSON data lake (RAW/FORMATTED/USAGE layers)
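For the extraction stage, here is a minimal Selenium sketch of cookie-based authentication and randomized inter-pass delays. The file name, domain filter, and pass structure are assumptions, not the project's actual scraper:

```python
# Hypothetical sketch: authenticate with a Netscape-format cookies.txt export
# (as produced by "Get cookies.txt") and pace multiple passes per target.
import random
import time
from http.cookiejar import MozillaCookieJar
from selenium import webdriver

def authenticated_driver(cookie_file: str = "cookies.txt") -> webdriver.Chrome:
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    driver.get("https://www.instagram.com/")  # must be on the domain before adding cookies
    jar = MozillaCookieJar(cookie_file)
    jar.load(ignore_discard=True, ignore_expires=True)
    for c in jar:
        if "instagram.com" in c.domain:       # skip cookies for other domains
            driver.add_cookie({"name": c.name, "value": c.value, "domain": c.domain})
    driver.refresh()                          # reload as an authenticated session
    return driver

def run_passes(driver: webdriver.Chrome, target: str, passes: int = 3) -> list[str]:
    collected: list[str] = []
    for i in range(passes):
        # ... open the target's profile, scroll the followings dialog,
        # and append the usernames found in this pass ...
        if i < passes - 1:
            time.sleep(random.uniform(60, 120))  # human-like pause between passes
    return collected
```

For the transformation stage, gender prediction with confidence scores could look like this character n-gram sketch. The training data here is a toy example; the project's real model and features are not shown in this document:

```python
# Hypothetical sketch: gender prediction from first names using character
# n-grams and a Naive Bayes classifier, with per-class confidence scores.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

names = ["marie", "julien", "sophie", "thomas", "claire", "hugo"]  # toy data
labels = ["F", "M", "F", "M", "F", "M"]

model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)),  # character n-grams
    MultinomialNB(),
)
model.fit(names, labels)

# predict_proba yields a confidence score per class
proba = model.predict_proba(["camille"])[0]
print(dict(zip(model.classes_, proba)))  # e.g. {'F': 0.64, 'M': 0.36}
```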
Demo Walkthrough
1. make open
2. Dashboard, Kibana & Airflow launched
3. Change the Instagram accounts to scrape (targets) via the dashboard or the .txt file
4. Delete 8 followings from my personal account for the demo
5. Add 7 new accounts as new followings to my personal account for the demo
Modify the list of monitored Instagram accounts
1. Launch the DAG (visual mode enabled)
2. DAG Task 1 "generate_scripts": read the target accounts
3. DAG Task 2 "run_single_scripts" [3 for demo]: 3 passes of one scraping executed for every target, 6x a day with random delays to avoid detection
4. DAG Tasks 3 & 4 "aggregate_results" & "index_to_elasticsearch": automatically aggregate all of the day's scrapings at 23:00, enabling day-over-day comparison and passive, full surveillance without getting caught! (a minimal indexing sketch follows)
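A minimal sketch of what the indexing step could do with the official Python client. The index naming, document shape, and field names are assumptions:

```python
# Hypothetical sketch of the loading step with the elasticsearch-py client.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def index_daily_results(docs: list[dict], day: str) -> None:
    actions = (
        {"_index": f"followings-{day}", "_source": doc}  # one action per document
        for doc in docs
    )
    helpers.bulk(es, actions)  # bulk-index the day's aggregated documents

# Example call after the 23:00 aggregation (field names are illustrative):
index_daily_results(
    [{"target": "some_account", "following": "new_profile", "status": "new"}],
    day="2025-01-01",
)
```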
This project is provided for educational and research purposes only. Use responsibly and in compliance with Instagram's Terms of Service and data protection laws (GDPR).