Investigation Assistant - Anti-Corruption Tool for Romanian Journalists

Overview
Investigation Assistant is a tool for journalists and activists working on anti-corruption investigations in Romania. It ingests public data from SEAP/SICAP public procurement records, integritate.eu wealth declarations, the ONRC company registry, and press archives, then surfaces the hidden connections between politicians, companies, and institutions that are normally buried across dozens of scattered databases.
The goal: reduce a three-week manual investigation into a three-hour conversation with a knowledge graph.
🎯 How It Works
- Aggregation — Automated ingestion from public sources (integritate.eu, SEAP/SICAP, ONRC, press archives)
- Knowledge graph — Neo4j graph of people, companies, institutions, contracts, and wealth declarations
- Anomaly detection — Automatic red flags: conflicts of interest, single-bidder contracts, companies at the same registered address, suspicious diploma origins
- Agentic chat — Natural-language questions answered via Gemini with function calling + SSE streaming over the graph and document store
🛠️ Tech Stack
| Component | Technology |
|---|---|
| App framework | Next.js 16, React 19, TypeScript |
| Knowledge graph | Neo4j (Docker) with Cypher queries |
| Contract store | SQLite via better-sqlite3 (SEAP contracts 2007–2024) |
| Vector search | ChromaDB for semantic retrieval over declarations and press |
| LLM layer | Google Gemini — Vision for PDF extraction, agentic chat with function-calling tools |
| Graph visualization | react-force-graph-2d |
| Web scraping | Puppeteer (requires Romanian IP for cdep.ro / senat.ro) |
| UI | Tailwind CSS v4, Lucide icons |
🏗️ Data Pipelines
SEAP public procurement (2007–2024)
Downloads contracts from data.gov.ro and OpenTender and imports them into SQLite. The dataset covers 17 years of Romanian public tenders: every state contract over a certain threshold, with contracting authority, winner, price, subject, and date.
Parliamentary education
Scrapes biographies from cdep.ro, senat.ro, and Wikipedia, then uses Gemini to extract structured education records. Cross-references them against a curated list of 22 "dubious" universities (legally dissolved, rated "lack of trust" by ARACIS, known as doctoral plagiarism mills, or flagged by journalistic investigations). Current coverage: 314 of 464 MPs (68%) have education records; 76 MPs are flagged, with 98 problematic diplomas between them.
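The cross-referencing step can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the `EducationRecord` shape, the function names, and the two placeholder institution names are all assumptions, and the real curated list of 22 institutions is not reproduced here.

```typescript
// Hypothetical shape of one education entry extracted by the LLM.
interface EducationRecord {
  mpId: string;
  institution: string;
  degree: string;
}

// Strip diacritics, lowercase, and collapse punctuation/whitespace so that
// spelling variants of the same institution name compare equal.
function normalizeName(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

// Placeholder entries only (assumption); not the real curated list.
const DUBIOUS_INSTITUTIONS: string[] = [
  "Universitatea Exemplu A",
  "Academia Exemplu B",
];

// Return the records whose normalized institution name contains a
// normalized entry from the dubious list.
function flagDubiousDiplomas(
  records: EducationRecord[],
  dubious: string[],
): EducationRecord[] {
  const normalizedList = dubious.map(normalizeName);
  return records.filter((record) => {
    const institution = normalizeName(record.institution);
    return normalizedList.some((entry) => institution.indexOf(entry) !== -1);
  });
}

// Share of MPs with at least one education record, as a rounded percentage.
function coveragePercent(withRecords: number, total: number): number {
  return Math.round((withRecords / total) * 100);
}
```

Substring matching on normalized names is deliberately loose and can produce false positives, so any match is a candidate for human review rather than a verdict.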
Wealth declarations (integritate.eu)
Downloads PDF declarations across multiple legislatures (2024, 2020, 2016, 2012, 2008, 2004) from the Senate and the Chamber of Deputies. Uses Gemini Vision to extract structured asset and income data from scanned PDFs, a task that OCR alone fails at because the form layouts are inconsistent across years.
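Because the extraction relies on model output, it is worth validating each result before it reaches the graph. The sketch below assumes the model is asked to return JSON; the `WealthDeclaration` schema and its field names are hypothetical and chosen only for illustration.

```typescript
// Hypothetical target schema for one extracted declaration (assumption).
interface IncomeEntry {
  source: string;
  annualAmount: number;
  currency: string;
}

interface WealthDeclaration {
  declarantName: string;
  year: number;
  incomes: IncomeEntry[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Parse raw model output and return a typed declaration, or null if the
// output is not valid JSON or does not match the expected shape.
function parseDeclaration(raw: string): WealthDeclaration | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (!isRecord(data)) return null;

  const { declarantName, year, incomes } = data;
  if (typeof declarantName !== "string" || declarantName.trim() === "") return null;
  if (typeof year !== "number" || Math.floor(year) !== year) return null;
  if (!Array.isArray(incomes)) return null;

  const parsed: IncomeEntry[] = [];
  for (const item of incomes) {
    if (!isRecord(item)) return null;
    const { source, annualAmount, currency } = item;
    if (typeof source !== "string" || typeof currency !== "string") return null;
    if (typeof annualAmount !== "number" || !isFinite(annualAmount) || annualAmount < 0) {
      return null;
    }
    parsed.push({ source, annualAmount, currency });
  }
  return { declarantName, year, incomes: parsed };
}
```

Rejecting malformed output outright, instead of repairing it, keeps guessed values out of the dataset; a rejected PDF can be queued for retry or manual entry.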
ONRC company registry
Enriches the graph with company ownership, board composition, and registered-address clusters (critical for detecting shell-company networks at the same address).
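One way to build address clusters is to normalize each registered address and group companies by the result. The sketch below is an assumption about how this could work: the `Company` shape and the two abbreviation rules are illustrative, and real ONRC addresses would need many more normalization rules.

```typescript
// Hypothetical company record (assumption); `cui` is the company's
// unique registration code.
interface Company {
  cui: string;
  name: string;
  registeredAddress: string;
}

// Reduce an address to a comparable key: strip diacritics, lowercase,
// drop punctuation, and unify a couple of common abbreviations.
function normalizeAddress(address: string): string {
  return address
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim()
    .replace(/\bstrada\b/g, "str")
    .replace(/\bnumarul\b/g, "nr");
}

// Group companies by normalized address and keep groups with at least
// `minSize` members.
function findAddressClusters(companies: Company[], minSize = 2): Company[][] {
  const groups = new Map<string, Company[]>();
  for (const company of companies) {
    const key = normalizeAddress(company.registeredAddress);
    const bucket = groups.get(key) ?? [];
    bucket.push(company);
    groups.set(key, bucket);
  }
  return Array.from(groups.values()).filter((group) => group.length >= minSize);
}
```

A shared address is common for legitimate reasons (business centers, law firms hosting registered offices), so cluster size alone is a weak signal until combined with ownership or contract data.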
🚩 Automated Red Flags
The system continuously surfaces anomalies:
- Conflict of interest — contracts awarded to companies owned by relatives of decision-makers
- Single-bidder procurement — tenders "won" with zero competition
- Address clusters — multiple companies registered at the same residential address
- Education anomalies — diplomas from dissolved or discredited institutions
- Unexplained wealth deltas — declaration-to-declaration jumps without corresponding income
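Two of the flags above reduce to simple rules over structured records. The following is a minimal sketch under stated assumptions: the `Tender` and `DeclarationSnapshot` shapes are hypothetical, `bidCount` is assumed to be available in the contract data, and `incomeSincePrevious` is assumed to hold the total declared income for the period since the previous declaration.

```typescript
// Hypothetical tender record; `bidCount` is an assumed field.
interface Tender {
  id: string;
  authority: string;
  winnerCui: string;
  bidCount: number;
}

// A tender with exactly one bid had no competition.
function singleBidderTenders(tenders: Tender[]): Tender[] {
  return tenders.filter((tender) => tender.bidCount === 1);
}

// Hypothetical per-declaration summary (assumption).
interface DeclarationSnapshot {
  year: number;
  totalAssets: number;
  incomeSincePrevious: number;
}

interface WealthFlag {
  fromYear: number;
  toYear: number;
  unexplained: number;
}

// Flag consecutive declarations where assets grew by more than the income
// declared for the same period (scaled by `tolerance`).
function unexplainedWealthDeltas(
  snapshots: DeclarationSnapshot[],
  tolerance = 1.0,
): WealthFlag[] {
  const sorted = snapshots.slice().sort((a, b) => a.year - b.year);
  const flags: WealthFlag[] = [];
  let previous: DeclarationSnapshot | undefined;
  for (const current of sorted) {
    if (previous !== undefined) {
      const increase = current.totalAssets - previous.totalAssets;
      const allowed = current.incomeSincePrevious * tolerance;
      if (increase > allowed) {
        flags.push({
          fromYear: previous.year,
          toYear: current.year,
          unexplained: increase - allowed,
        });
      }
    }
    previous = current;
  }
  return flags;
}
```

The rule ignores loans, inheritances, and asset revaluation, so a flagged delta only marks where a journalist should look at the underlying declarations.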
🏗️ Agentic Architecture
The chat isn't a RAG pipeline in the naive sense — it's an agent with function-calling tools:
- `query_graph(cypher)`: direct Neo4j queries
- `search_contracts(filter)`: structured SQLite queries over SEAP data
- `semantic_search(text)`: ChromaDB vector search over declarations and press
- `get_person(id)`: canonical profile with all linked entities
- `get_company(cui)`: full company dossier with owners, contracts, flags
Gemini decides which tools to call, chains them, and streams the answer via SSE. This lets journalists ask compound questions like "Which companies owned by members of the current government have won SEAP contracts in the last 3 years, and what are the anomalies?" — and get a real, sourced answer.
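The server side of such a loop can be sketched as a tool registry, a dispatcher, and an SSE frame formatter. This is an illustration under assumptions, not the project's code: the handlers are stubs that echo their arguments, the `ToolCall` and `ToolResult` shapes are hypothetical, and the model call itself is omitted. Only the five tool names come from the list above.

```typescript
type ToolArgs = Record<string, unknown>;
type ToolHandler = (args: ToolArgs) => Promise<unknown>;

// Hypothetical shape of a tool call requested by the model (assumption).
interface ToolCall {
  name: string;
  args: ToolArgs;
}

interface ToolResult {
  name: string;
  ok: boolean;
  result?: unknown;
  error?: string;
}

// Stub handler that echoes its input; a real handler would query Neo4j,
// SQLite, or ChromaDB.
function stubTool(tool: string): ToolHandler {
  return async (args) => ({ tool, received: args });
}

const toolRegistry: Record<string, ToolHandler> = {
  query_graph: stubTool("query_graph"),
  search_contracts: stubTool("search_contracts"),
  semantic_search: stubTool("semantic_search"),
  get_person: stubTool("get_person"),
  get_company: stubTool("get_company"),
};

// Run one requested tool call. Unknown tools and handler errors are
// returned as data so the model can see the failure and recover.
async function dispatchToolCall(
  call: ToolCall,
  registry: Record<string, ToolHandler>,
): Promise<ToolResult> {
  const handler = Object.prototype.hasOwnProperty.call(registry, call.name)
    ? registry[call.name]
    : undefined;
  if (handler === undefined) {
    return { name: call.name, ok: false, error: `Unknown tool: ${call.name}` };
  }
  try {
    return { name: call.name, ok: true, result: await handler(call.args) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { name: call.name, ok: false, error: message };
  }
}

// Format one Server-Sent Events frame: an event name, a single JSON data
// line, and the blank line that terminates the frame.
function toSseFrame(event: string, payload: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}
```

In a full loop, each `ToolResult` would be fed back to the model until it stops requesting tools, with answer tokens and tool activity written to the response stream as frames from `toSseFrame`.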
💡 Why This Matters
Romanian anti-corruption journalism is strong but resource-constrained. A single investigation at Recorder, PressOne, or Rise Project can take weeks of cross-referencing scattered public databases. The information is already public — it's just fragmented. Investigation Assistant is infrastructure: it doesn't replace journalists, it gives them leverage.
🔒 Ethics & Safeguards
- Uses only public-source data (procurement, wealth declarations, company registry, parliamentary biographies, Wikipedia, press archives)
- Flags are not accusations — they are starting points for human journalism
- Source citations required on every AI-generated claim
- No private data ingestion — no social media scraping, no personal communications