# Visual Intelligence Platform (VIP)

A free, local-first, Apple Silicon-only photo intelligence tool for large RAW photo libraries (50K–100K+ files). No cloud. No subscriptions. Everything runs on-device.
## Features

- Scans your entire photo library recursively — Canon CR3, Sony ARW, Nikon NEF, DNG and more
- Detects and embeds faces using InsightFace Buffalo_L (RetinaFace + ArcFace, 512-D embeddings)
- Clusters faces into people groups automatically via HDBSCAN
- Tags photos with objects, animals, scenes, geography, and places using YOLOv11, Places365, OpenCLIP, and BioCLIP
- Resolves GPS coordinates to human-readable place names via Nominatim/OSM
- Lets you name people through a browser UI — no ML jargon, just face tiles and name inputs
- Writes metadata back into the original RAW files using ExifTool (XMP, IPTC, MWG face regions)
- Makes files searchable in macOS Spotlight immediately after writeback
## Pipeline Phases

| Phase | What Happens |
|---|---|
| Phase 1 — Scan | Walk folders, SHA-256 hash, EXIF extraction, iCloud stub detection |
| Phase 2 — Embed | Extract embedded JPEG preview → RetinaFace detect → ArcFace embed |
| Phase 3 — Cluster | HDBSCAN groups faces into person clusters |
| Phase 4 — Tag | YOLOv11 objects/animals + Places365 scenes + CLIP landmarks + BioCLIP species + Nominatim GPS |
| Phase 5 — Writeback | ExifTool writes PersonInImage, Subject, Keywords, RegionInfo, Location into files |
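The phases run strictly in order, each reporting progress as it goes. A minimal sketch of that orchestration pattern, using hypothetical names (the real logic lives in `backend/pipeline/ingest.py` and broadcasts over the WebSocket):

```python
import asyncio
from typing import Awaitable, Callable

# Progress callback: (phase name, fraction complete). In VIP this would feed
# the /ws/progress broadcast; here it just collects events.
ProgressCb = Callable[[str, float], None]

async def run_pipeline(phases: dict[str, Callable[[], Awaitable[None]]],
                       progress: ProgressCb) -> list[str]:
    """Run each phase sequentially, reporting start (0.0) and completion (1.0)."""
    completed: list[str] = []
    for name, phase in phases.items():
        progress(name, 0.0)
        await phase()
        progress(name, 1.0)
        completed.append(name)
    return completed

async def _noop() -> None:
    """Stand-in for a real phase (scan, embed, ...)."""
    await asyncio.sleep(0)

events: list[tuple[str, float]] = []
order = asyncio.run(run_pipeline(
    {"scan": _noop, "embed": _noop, "cluster": _noop, "tag": _noop},
    lambda name, frac: events.append((name, frac))))
print(order)  # ['scan', 'embed', 'cluster', 'tag']
```

This is only a sketch of the sequencing; the actual phase functions, names, and progress payloads in `ingest.py` may differ.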
## Architecture

```mermaid
flowchart TD
    USER([👤 User])

    subgraph FE["Frontend (React + Vite)"]
        UI_ADMIN["Admin Page\n(set folder path, mode, settings)"]
        UI_GALLERY["Gallery / People / Map views"]
        WS["WebSocket listener\n(/ws/progress)"]
    end

    subgraph API["FastAPI Backend"]
        RT_SCAN["POST /api/admin/scan"]
        RT_MEDIA["GET /api/media/*"]
        RT_PEOPLE["GET /api/people/*"]
        WS_SRV["WebSocket /ws/progress\n(broadcast phase progress)"]
    end

    subgraph PIPELINE["Ingest Pipeline (ingest.py)"]
        P1["Phase 1 — Scan\nWalk folder, hash, extract EXIF\nWrite media rows to DB"]
        P2["Phase 2 — Face Detection + Embedding\nExtract JPEG preview\nDetect faces (Intelligent mode)\nEmbed via ArcFace Buffalo_L\nStore face crops + 512-D vectors"]
        P3["Phase 3 — Cluster\nHDBSCAN on all embeddings\nCoherence filter (intra_sim ≥ 0.85)\nBuild person clusters in DB"]
        P3B["Phase 3b — Auto-name\nHigh-conf merge (≥ 0.98 sim)\nBorderline suggestions (≥ 0.63)"]
        P3C["Phase 3c — VIP History restore\nRe-attach named persons across scans"]
        P4["Phase 4 — Tag\nYOLO (objects/animals) · MPS\nPlaces365 ResNet (scene) · MPS\nOpenCLIP (landmarks) · MPS\nBioCLIP (species) · MPS\nGPS → Nominatim geocode"]
        P5["Phase 5 — Analysis documents\nBuild Rekognition-format JSON\nper photo (faces + tags + geo)"]
    end

    subgraph ML["ML Layer"]
        FD["FaceDetector\nAccuracy: CPU 1280×1280\nPerformance: CoreML 640×640\nIntelligent: focal-length signal\n+ oracle escalation"]
        CL["Clusterer\nHDBSCAN euclidean + coherence filter"]
        FAISS["FAISS index\n(512-D ArcFace vectors)"]
        TAG["Tagger\nYOLO · Places365 · OpenCLIP · BioCLIP · GeoResolver"]
    end

    subgraph DB["Storage"]
        SQLITE[("SQLite · vip.db")]
        THUMBS[("Thumbnails & face crops")]
    end

    USER -->|"configure & trigger scan"| UI_ADMIN
    UI_ADMIN -->|"POST"| RT_SCAN
    RT_SCAN -->|"spawn background task"| P1
    P1 -->|"media rows"| SQLITE
    P1 --> P2
    P2 --> FD
    FD -->|"embeddings"| SQLITE
    FD --> THUMBS
    P2 --> P3
    P3 --> CL
    CL --> FAISS
    CL -->|"person clusters"| SQLITE
    P3 --> P3B --> P3C --> P4
    P4 --> TAG
    TAG -->|"tags, labels, GPS"| SQLITE
    P4 --> P5
    P5 -->|"analysis JSON"| SQLITE
    PIPELINE -->|"phase events"| WS_SRV
    WS_SRV -->|"live progress"| WS
    WS --> UI_GALLERY
    USER -->|"browse photos / people / map"| UI_GALLERY
    UI_GALLERY -->|"REST queries"| RT_MEDIA
    UI_GALLERY -->|"REST queries"| RT_PEOPLE
    RT_MEDIA --> SQLITE
    RT_PEOPLE --> SQLITE
    RT_PEOPLE --> FAISS
```
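The thresholds in Phases 3 and 3b (coherence `intra_sim ≥ 0.85`, auto-merge `≥ 0.98`, suggestions `≥ 0.63`) are cosine similarities between ArcFace embeddings. A pure-Python sketch of that scoring, with toy 2-D vectors standing in for the 512-D embeddings (illustrative only; the real code uses HDBSCAN and FAISS):

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mean_intra_similarity(cluster: list[list[float]]) -> float:
    """Mean pairwise similarity inside one face cluster (the coherence signal)."""
    pairs = [(i, j) for i in range(len(cluster)) for j in range(i + 1, len(cluster))]
    return sum(cosine_sim(cluster[i], cluster[j]) for i, j in pairs) / len(pairs)

def classify_match(sim: float) -> str:
    """Map a similarity score onto the Phase 3b decisions."""
    if sim >= 0.98:
        return "auto-merge"
    if sim >= 0.63:
        return "suggest"
    return "ignore"

tight = [[1.0, 0.0], [0.99, 0.14]]           # nearly identical directions
assert mean_intra_similarity(tight) >= 0.85  # would pass the coherence filter
print(classify_match(0.99), classify_match(0.7), classify_match(0.5))
# → auto-merge suggest ignore
```

The function names here are hypothetical; only the threshold values come from the diagram above.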
## Models

| Model | Purpose | Backend |
|---|---|---|
| InsightFace Buffalo_L | Face detection + 512-D embeddings | ONNX (CPU EP) |
| YOLOv11s | Object + animal detection (COCO 80 classes) | ultralytics / MPS |
| Places365 ResNet-50 | Scene/geography classification | PyTorch / MPS |
| OpenCLIP ViT-B/32 | Zero-shot landmark recognition (56 landmarks) | open-clip-torch |
| BioCLIP | Species-level animal classification (150+ species) | open-clip-torch |
| Nominatim / OSM | GPS → human-readable place name | geopy |
All models run **100% locally**. No data leaves your machine.
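Zero-shot recognition (OpenCLIP landmarks, BioCLIP species) works by embedding the image and a set of text prompts into a shared space, then picking the prompt with the highest cosine similarity. A schematic sketch with toy 3-D vectors (real CLIP embeddings are 512-D and come from open-clip-torch; everything named here is illustrative):

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot(image_emb: list[float], prompt_embs: dict[str, list[float]]) -> tuple[str, float]:
    """Return (label, similarity) of the best-matching text prompt."""
    scores = {label: cosine_sim(image_emb, emb) for label, emb in prompt_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy vectors standing in for CLIP text embeddings of
# prompts like "a photo of the Eiffel Tower".
prompts = {
    "Eiffel Tower": [0.9, 0.1, 0.0],
    "Golden Gate Bridge": [0.0, 0.9, 0.4],
}
label, score = zero_shot([0.88, 0.15, 0.02], prompts)
print(label)  # Eiffel Tower
```

The same pattern scales to the 56 landmark prompts and 150+ species prompts mentioned above.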
## Metadata Written Back

| Field | Content |
|---|---|
| XMP:PersonInImage | Named persons (e.g. Alice, Bob) |
| XMP-mwg-rs:Regions | Face bounding boxes with names (Lightroom / Capture One compatible) |
| XMP:Subject / IPTC:Keywords | All tags with prefixes: `obj:`, `animal:`, `geo:`, `place:` |
| XMP:Location | GPS-resolved place name |
| GPS fields | Preserved as-is from original EXIF |
## Requirements

- macOS on Apple Silicon (M1/M2/M3/M4 series) — not Intel
- Python 3.11+
- Node.js 18+
- ExifTool (`brew install exiftool`)
- ~800 MB disk for ML model cache (auto-downloaded on first Phase 4 run)
## Quick Start

```bash
# 1. Clone
git clone https://github.com/sifaralways/VisualIntelligencePlatform.git
cd VisualIntelligencePlatform

# 2. Bootstrap (installs deps, creates venv, inits DB)
./setup.sh

# 3. Start
./start.sh

# Backend  → http://localhost:7474
# Frontend → http://localhost:5173
```
## Tech Stack

| Layer | Tool |
|---|---|
| Backend | Python 3.11 + FastAPI + uvicorn |
| Database | SQLite (aiosqlite), WAL mode, FK constraints |
| Face ML | InsightFace Buffalo_L, ONNX CPU EP |
| Object ML | YOLOv11s (ultralytics), PyTorch MPS |
| Scene ML | Places365 ResNet-50, PyTorch MPS |
| CLIP ML | OpenCLIP ViT-B/32, BioCLIP |
| Geo | geopy / Nominatim (OpenStreetMap) |
| Clustering | HDBSCAN (`hdbscan`) |
| Vector Index | FAISS (flat, 512-D) |
| Metadata write | ExifTool CLI (subprocess, 30 s timeout) |
| Frontend | React 18 + Vite 5 + Tailwind CSS v4 |
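A flat FAISS index performs exhaustive (brute-force) nearest-neighbour search over the stored vectors, with no compression or approximation. A pure-Python stand-in showing the same behaviour, with toy 2-D vectors in place of the 512-D ArcFace embeddings:

```python
import math

def l2(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class FlatIndex:
    """Minimal stand-in for a flat FAISS index: exhaustive L2 search."""
    def __init__(self) -> None:
        self.vectors: list[list[float]] = []

    def add(self, vec: list[float]) -> None:
        self.vectors.append(vec)

    def search(self, query: list[float], k: int = 1) -> list[tuple[int, float]]:
        """Return (position, distance) for the k nearest stored vectors."""
        scored = sorted((l2(query, v), i) for i, v in enumerate(self.vectors))
        return [(i, d) for d, i in scored[:k]]

index = FlatIndex()
for v in ([0.0, 0.0], [1.0, 0.0], [5.0, 5.0]):  # toy stand-ins for face embeddings
    index.add(v)
print(index.search([0.9, 0.1], k=1))  # nearest is position 1
```

FAISS does the same comparison vectorised in C++, which is why a flat index stays fast enough even at 100K+ faces.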
## Project Structure

```
VisualIntelligencePlatform/
├── backend/
│   ├── main.py                  # FastAPI app entry point
│   ├── config.py                # All settings (Pydantic)
│   ├── admin.py                 # Admin reset / stats
│   ├── database/
│   │   ├── db.py                # aiosqlite pool
│   │   └── migrations/
│   │       ├── 001_initial.sql  # Core schema (7 tables)
│   │       └── 002_tags.sql     # media_tags table
│   ├── scanner/
│   │   └── walker.py, hasher.py, exif_reader.py, preview_extractor.py
│   ├── ml/
│   │   ├── face_detector.py     # InsightFace RetinaFace (CPU EP)
│   │   ├── embedder.py          # InsightFace ArcFace, 200×200 crops
│   │   ├── clusterer.py         # HDBSCAN
│   │   ├── object_detector.py   # YOLOv11s
│   │   ├── scene_classifier.py  # Places365
│   │   ├── landmark_recogniser.py  # OpenCLIP zero-shot
│   │   ├── species_classifier.py   # BioCLIP
│   │   ├── geo_resolver.py      # Nominatim
│   │   └── tagger.py            # Orchestrator for all taggers
│   ├── pipeline/
│   │   └── ingest.py            # Multi-phase ingest pipeline orchestrator
│   ├── writeback/
│   │   └── exiftool.py, fields.py, engine.py
│   └── api/routes/
│       ├── pipeline.py, persons.py, faces.py, media.py
│       └── search.py, writeback.py, tags.py, admin.py
├── frontend/src/
│   ├── pages/
│   │   ├── PeoplePage.tsx       # Face tiles, naming, face review, eject
│   │   ├── PipelinePage.tsx     # Scan controls + live progress
│   │   ├── SearchPage.tsx
│   │   ├── WritebackPage.tsx
│   │   └── AdminPage.tsx        # Stats + scoped resets
│   └── api/client.ts            # Typed API client
├── requirements.txt
├── setup.sh
├── start.sh
├── SOLUTION_DESIGN.md           # Full architecture & decisions doc
└── High level BRD.md            # Original business requirements
```
## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/pipeline/scan | Start the ingest pipeline (all phases) |
| WS | /ws/progress | Real-time progress events |
| GET | /api/persons | All persons with photo counts + thumbnails |
| GET | /api/persons/{id}/faces | All face crops for person review |
| PATCH | /api/persons/{id} | Set name / merge |
| DELETE | /api/faces/{id}/from-person | Eject a misassigned face |
| GET | /api/tags/{media_file_id} | All ML tags for a photo |
| GET | /api/tags/summary/top | Most frequent tags across the library |
| GET | /api/writeback/preview | Dry-run: files + fields to be written |
| POST | /api/writeback/confirm | Execute ExifTool writes |
| GET | /api/admin/stats | DB counts + pipeline state |
| POST | /api/admin/reset/{scope} | Scoped reset (faces, clusters, tags, all) |
## Logs

```bash
tail -f ~/Library/Logs/VIP/vip.log
```

Rotating log, 10 MB × 5 files.
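That rotation policy (10 MB per file, 5 backups) matches what Python's standard `logging.handlers.RotatingFileHandler` provides. A sketch of an equivalent setup, writing to a temp directory for the example rather than `~/Library/Logs/VIP/`:

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

# Example path only; VIP's real log lives at ~/Library/Logs/VIP/vip.log.
log_path = os.path.join(tempfile.mkdtemp(), "vip.log")

handler = RotatingFileHandler(log_path,
                              maxBytes=10 * 1024 * 1024,  # rotate at 10 MB
                              backupCount=5)              # keep 5 old files
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("vip-example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Phase 1 scan started")
handler.flush()
print(os.path.exists(log_path))  # True
```

Whether VIP configures logging exactly this way is an assumption; only the path and rotation numbers come from this README.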
## Licence

Non-commercial use only. The InsightFace Buffalo_L weights are distributed under a non-commercial licence.