Duplicate Detection in CivicReport: Municipal Deduplication Architecture

A pothole on Main Street gets reported on Monday by a driver. On Tuesday, a pedestrian reports the same pothole from the sidewalk. On Wednesday, a cyclist files a third report. By Friday, there are six reports for the same hole, three of them assigned to different staff members, two of them already marked "In Progress" independently.

This is not a hypothetical edge case. In our data from municipalities running CivicReport, duplicate reports account for 15-30% of all submissions. Left unchecked, they waste staff time, inflate issue counts, and make resolution metrics unreliable. CivicReport's duplicate detection engine is designed to catch these before they reach the admin dashboard. Here is the architecture.

Why Municipal Deduplication Is Hard

Duplicate detection in a municipal context is different from deduplicating a customer database. Customer records have structured fields: email, phone, company name. Municipal reports have unstructured text descriptions, user-submitted locations with varying accuracy, and photos taken from different angles at different times of day.

The challenges are specific:

Location imprecision: Two citizens reporting the same issue might place their map pins 50 meters apart. GPS accuracy on mobile phones varies. Indoor reports are even less precise.
Language variation: One person writes "pothole near the bus stop." Another writes "road damage on Oak Street." A third writes "hole in the asphalt." All three describe the same issue.
Temporal spread: The same issue can be reported days or weeks apart. A pothole reported in January and reported again in March might be the same unrepaired hole or a new one nearby.
Category ambiguity: A broken streetlight could be categorized as "Streetlight," "Infrastructure," "Roads," or "Public Safety" depending on the citizen and the municipality's category setup.

Any system that relies on exact matches will miss most duplicates. We needed something that handles fuzzy data across multiple dimensions.

The Three-Layer Detection Pipeline

CivicReport's duplicate detection runs as a three-layer pipeline. Each layer applies a different matching strategy, and a report must pass through all three layers before it enters the admin queue.

Layer 1: Spatial Proximity

The first filter is geographic. When a new report comes in, the system queries all existing open issues within a configurable radius. The default is 100 meters for point issues like potholes and graffiti, and 500 meters for linear issues like road damage or water leaks.

The radius is configurable per category because issue footprints vary. A single pothole is a point. A burst water main affects a stretch of road. Using the same radius for both would either miss linear duplicates or flag too many false positives for point issues.

The spatial query uses a PostgreSQL GiST index on the report location column, which keeps lookups fast even with tens of thousands of active issues.
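To make the radius logic concrete, here is a minimal in-memory sketch in Python. It is not the production query, which runs in PostgreSQL against the GiST index; the `ReportPoint` class, the category keys, and the `nearby_issues` helper are illustrative assumptions. Only the 100 m and 500 m defaults come from the description above.

```python
import math
from dataclasses import dataclass

# Defaults from the text: 100 m for point issues, 500 m for linear issues.
# The category keys themselves are hypothetical.
RADIUS_M = {
    "pothole": 100.0,
    "graffiti": 100.0,
    "road_damage": 500.0,
    "water_leak": 500.0,
}
DEFAULT_RADIUS_M = 100.0
EARTH_RADIUS_M = 6_371_000.0


@dataclass(frozen=True)
class ReportPoint:
    issue_id: int
    category: str
    lat: float
    lon: float


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby_issues(new: ReportPoint, open_issues: list) -> list:
    """Return open issues within the radius configured for the new report's category."""
    radius = RADIUS_M.get(new.category, DEFAULT_RADIUS_M)
    return [
        issue
        for issue in open_issues
        if haversine_m(new.lat, new.lon, issue.lat, issue.lon) <= radius
    ]
```

A linear scan like this is fine for illustration but would not scale; the point of the GiST index is to avoid computing the distance to every open issue.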

Layer 2: Temporal Relevance

Once we have a set of spatially nearby issues, we filter by time. A report from six months ago is almost certainly not a duplicate of a report today, even at the exact same location. Most municipalities resolve issues within 30 days, so the default time window for duplicate matching is 60 days, twice the typical resolution time, which leaves margin for issues that take longer to fix.

The temporal filter also considers the issue status. A report cannot be a duplicate of an issue that is already "Completed" or "Archived" unless the citizen explicitly references it. If someone reports a pothole that was repaired last month, that is a new issue, not a duplicate.

CivicReport's 8-status lifecycle (Application, Review, Approved, Active, Report, Verify, Completed, Archived) makes this straightforward. We only compare against issues in the first six statuses. Completed and Archived issues are excluded from the duplicate check.
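The time and status rules can be sketched as a single filter. This is an illustration under assumptions: the `Candidate` class and `temporally_relevant` function are hypothetical names, the window is measured from the candidate's creation time (the text does not say which timestamp is used), and the exception for citizens who explicitly reference a completed issue is not modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The first six statuses of the lifecycle; Completed and Archived are excluded.
OPEN_STATUSES = {"Application", "Review", "Approved", "Active", "Report", "Verify"}
DEFAULT_WINDOW = timedelta(days=60)  # default window from the text


@dataclass(frozen=True)
class Candidate:
    issue_id: int
    status: str
    created_at: datetime


def temporally_relevant(candidates, now, window=DEFAULT_WINDOW):
    """Keep candidates that are still open and were created within the window."""
    cutoff = now - window
    return [
        c
        for c in candidates
        if c.status in OPEN_STATUSES and c.created_at >= cutoff
    ]
```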

Layer 3: Semantic Similarity

This is where the system gets interesting. The spatial and temporal filters give us a small set of candidate issues. The semantic layer determines whether the new report actually describes the same problem.

We use a two-stage comparison. First, we compare the category assignments. If the new report is categorized as "Roads" and all nearby candidates are categorized as "Parks," we can immediately rule them out as duplicates. Category mismatch is a strong negative signal.

Second, we compare the text descriptions. Rather than simple keyword matching, we use a combination of techniques:

Keyword extraction: Pull significant nouns and adjectives from both the new report and candidate descriptions. "Pothole," "hole," "asphalt," and "damage" are semantically related for road issues.
Phrase matching: Look for shared location references. "near the bus stop on Oak Street" in two reports is a strong signal, even if the rest of the description differs.
Description length ratio: If one report is three sentences and another is two words, they are less likely to be duplicates than two reports of similar length and detail level.

The similarity score is a weighted combination of these factors. If it exceeds a configurable threshold (default: 0.7 on a 0-1 scale), the system flags the new report as a potential duplicate.
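A weighted score of this kind might look like the sketch below. The weights, the stopword list, and the use of Jaccard overlap are assumptions made for illustration; the actual scoring weights live in the tenant's configuration. Keyword extraction is approximated here by stopword filtering rather than real part-of-speech tagging, and phrase matching by shared word pairs.

```python
import re

# Hypothetical stopword list and weights, chosen only for this example.
STOPWORDS = {"a", "an", "the", "on", "in", "at", "near", "of", "to", "is", "and"}
WEIGHTS = {"keywords": 0.5, "phrases": 0.3, "length": 0.2}
THRESHOLD = 0.7  # default from the text, on a 0-1 scale


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def keyword_score(a, b):
    """Overlap of non-stopword tokens (a stand-in for noun/adjective extraction)."""
    return jaccard({t for t in a if t not in STOPWORDS},
                   {t for t in b if t not in STOPWORDS})


def phrase_score(a, b):
    """Overlap of adjacent word pairs, which rewards shared location phrases."""
    return jaccard(set(zip(a, a[1:])), set(zip(b, b[1:])))


def length_score(a, b):
    """Ratio of the shorter description's word count to the longer one's."""
    if not a or not b:
        return 0.0
    return min(len(a), len(b)) / max(len(a), len(b))


def similarity(text_a, text_b):
    """Weighted combination of the three signals, in the range 0-1."""
    a, b = tokenize(text_a), tokenize(text_b)
    return (WEIGHTS["keywords"] * keyword_score(a, b)
            + WEIGHTS["phrases"] * phrase_score(a, b)
            + WEIGHTS["length"] * length_score(a, b))


def is_potential_duplicate(text_a, text_b, threshold=THRESHOLD):
    """Flag the pair when the score exceeds the threshold."""
    return similarity(text_a, text_b) > threshold
```

Plain token overlap will not connect "pothole" with "hole in the asphalt"; handling that kind of variation needs a synonym table or similar, which this sketch leaves out.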

The Merge Workflow

When the system detects a potential duplicate, it does not silently merge the reports. Instead, it routes the new report to the approval queue with a "Potential Duplicate" flag and a link to the existing issue.

The admin reviewer sees both reports side by side: the original with its full timeline and the new submission with its description, photos, and location. The reviewer has three options:

Confirm duplicate: The new report is merged into the existing issue. The citizen who submitted the duplicate gets an automatic notification that their report has been linked to an existing issue, with a link to track its progress on the public map.
Not a duplicate: The new report enters the normal workflow as a standalone issue. The system logs this as a false positive, which is used to tune the similarity thresholds over time.
Related but separate: The reports are linked as related issues but processed independently. Useful when two reports describe nearby but distinct problems.
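The three reviewer options map naturally onto a small decision type. The sketch below is one possible shape, not CivicReport's actual data model; `Decision`, `ReviewOutcome`, and `apply_decision` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    CONFIRM_DUPLICATE = "confirm_duplicate"
    NOT_A_DUPLICATE = "not_a_duplicate"
    RELATED_BUT_SEPARATE = "related_but_separate"


@dataclass
class ReviewOutcome:
    merged_into: Optional[int] = None   # existing issue that absorbs the new report
    related_to: Optional[int] = None    # existing issue linked as related
    standalone: bool = False            # new report continues as its own issue
    notify_citizen: bool = False        # send the "linked to existing issue" notice
    log_false_positive: bool = False    # feed threshold tuning


def apply_decision(decision: Decision, existing_issue_id: int) -> ReviewOutcome:
    """Translate the reviewer's choice into the follow-up actions described above."""
    if decision is Decision.CONFIRM_DUPLICATE:
        return ReviewOutcome(merged_into=existing_issue_id, notify_citizen=True)
    if decision is Decision.NOT_A_DUPLICATE:
        return ReviewOutcome(standalone=True, log_false_positive=True)
    if decision is Decision.RELATED_BUT_SEPARATE:
        return ReviewOutcome(related_to=existing_issue_id, standalone=True)
    raise ValueError(f"unknown decision: {decision!r}")
```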

This human-in-the-loop approach is deliberate. Automatic merging without review risks suppressing legitimate reports. A citizen who reports a pothole and gets no acknowledgment because the system silently merged their report will report it again, or worse, lose trust in the platform entirely.

Image Comparison

CivicReport accepts photo uploads with reports. When a potential duplicate is detected, the system also compares the submitted photos against photos attached to the existing issue.

Full image similarity analysis would be computationally expensive and unnecessary for this use case. Instead, we compare image metadata (GPS coordinates embedded in EXIF data, if available) and basic visual features like dominant colors and aspect ratio. A pothole photo taken from a car and a pothole photo taken on foot will have different angles, but they will share similar dominant colors (gray asphalt, dark hole) and similar aspect ratios.

Image comparison is used as a supplementary signal, not a primary one. It increases confidence in a duplicate match but does not override the text and location analysis.
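One way to treat the image signal as supplementary is to let it raise the confidence of a match that the text analysis has already flagged, and never create a match by itself. The sketch below assumes the dominant color and aspect ratio have already been extracted from each photo; the function names, the equal weighting of the two features, and the 0.1 maximum boost are all illustrative assumptions.

```python
import math

# Largest possible Euclidean distance between two RGB colors.
MAX_RGB_DISTANCE = math.sqrt(3 * 255 ** 2)


def color_similarity(rgb_a, rgb_b):
    """1.0 for identical dominant colors, 0.0 for black versus white."""
    return 1.0 - math.dist(rgb_a, rgb_b) / MAX_RGB_DISTANCE


def aspect_similarity(ratio_a, ratio_b):
    """Ratio of the smaller aspect ratio to the larger one."""
    return min(ratio_a, ratio_b) / max(ratio_a, ratio_b)


def image_signal(rgb_a, ratio_a, rgb_b, ratio_b):
    """Average of the two visual features, in the range 0-1."""
    return 0.5 * color_similarity(rgb_a, rgb_b) + 0.5 * aspect_similarity(ratio_a, ratio_b)


def confidence(text_score, image_score, threshold=0.7, max_boost=0.1):
    """Boost an existing match; leave sub-threshold scores untouched."""
    if text_score <= threshold:
        return text_score
    return min(1.0, text_score + max_boost * image_score)
```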

Performance and Scaling

The duplicate detection pipeline runs synchronously during report submission. This means the citizen sees the result immediately: either their report enters the queue, or they get a message saying "This may be a duplicate of an existing report" with a link to track the original.

The three-layer approach keeps the computation fast. The spatial filter narrows candidates from thousands to a handful. The temporal filter narrows further. The semantic comparison runs on at most 5-10 candidate issues, which takes milliseconds.

For larger municipalities with high submission volumes, the spatial query benefits from PostgreSQL's GiST indexing, which scales well with large datasets. The bottleneck is not the database; it is the semantic comparison, which is bounded by the small candidate set.

Self-Hosting and Data Sovereignty

CivicReport runs on the Fimula Platform with two deployment options. Fimula Lite uses shared infrastructure with row-level security for tenant isolation. Fimula Core provides a dedicated PostgreSQL instance per tenant.

The duplicate detection engine runs entirely within the tenant's database. No report text, photos, or location data is sent to an external service. This is important for municipalities because civic reports often contain location data that reveals citizens' daily patterns. Under GDPR, this can be personal data, and sending it to a third-party API for analysis would require a lawful basis for the processing (such as consent) and a data processing agreement with the provider.

By keeping the detection pipeline local, CivicReport avoids this problem entirely. The municipality's data stays in their database, on their infrastructure (or the shared EU-hosted infrastructure for Lite tenants).

Open-Source Considerations

CivicReport is available as a self-hosted solution. Municipalities that want full control over their data and the ability to inspect and modify the duplicate detection logic can deploy on their own infrastructure.

The detection thresholds (spatial radius, temporal window, similarity score) are all configurable per tenant through the admin dashboard. For municipalities that want to go further, the scoring weights and category-specific rules are stored in database tables that can be modified directly.
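As an illustration of per-tenant configuration, the sketch below layers tenant overrides on top of the defaults mentioned in this article. The `DetectionConfig` class, its field names, and `config_for_tenant` are hypothetical; in CivicReport the values are stored in database tables and edited through the admin dashboard.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DetectionConfig:
    # Defaults taken from the article; field names are assumptions.
    point_radius_m: float = 100.0
    linear_radius_m: float = 500.0
    window_days: int = 60
    similarity_threshold: float = 0.7


def config_for_tenant(overrides: dict) -> DetectionConfig:
    """Apply a tenant's overrides to the defaults; unknown keys raise TypeError."""
    return replace(DetectionConfig(), **overrides)


# A dense urban tenant might tighten the point radius and raise the threshold.
urban = config_for_tenant({"point_radius_m": 50.0, "similarity_threshold": 0.8})
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled setting fails loudly instead of being silently ignored.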

This configurability matters because duplicate patterns vary by municipality. A dense urban center gets more spatially close reports than a rural municipality. A tourist-heavy city gets more reports in multiple languages. The system needs to adapt to these differences without requiring code changes.

Duplicate detection is not the most visible feature in a civic reporting platform. Citizens don't see it. It doesn't appear on the public map. But for municipal staff who process hundreds of reports per week, it is the difference between a manageable workload and an overwhelming one. If you are evaluating civic reporting platforms, ask how they handle duplicates. The answer tells you a lot about how well the system understands municipal operations.
