How I Ended Up Writing My Own Photo Deduplication Tool

(I may rewrite this... but this is what the Robot Overlords suggested.)

Managing our family photos was much easier when I was the primary photographer. I’d take a day’s worth of photos, pull them into Google Picasa, tag the bad shots (why delete them when disk space felt infinite compared to the days of floppies and tiny SD cards?), and sort the rest into meaningful folders.

To make browsing easy, I used lightweight PHP gallery scripts. For a while, I relied on one from phpix.org (now defunct). Later I switched to sfpg (Single File Photo Gallery), which would take a folder, generate thumbnails, and dynamically create a photo gallery. Jen got used to this setup — if she wanted to revisit memories, she knew exactly where to go. It worked beautifully: organized, simple, predictable.

Then everything changed. Jen and the boys got their own phones. I started leaning more on mine too, syncing photos with Google Photos. With iCloud, Google, and cheap portable hard drives all in the mix, pictures started scattering everywhere — external drives, random exports, phone backups, cloud albums.

The problem wasn’t really that I couldn’t find photos. I usually could. The problem was that Jen couldn’t. She’d ask, “Where are the photos of William’s 15th birthday?”

In the old world, the answer was easy: open the 2012/Birthdays/William folder or check the sfpg gallery. In the new world, those photos might be on Google Photos, or iCloud, or maybe buried on some portable drive. For her, the system had collapsed. That’s when I realized I needed to fix this — not just for me, but for the whole family.


Trying the Existing Tools

I didn’t jump straight into coding. I tried the usual suspects:

  • Lightroom / Darktable: great editing environments, but too focused on image processing — not consolidation.
  • TagThatPhoto: promising for face recognition, but not built for large-scale cleanup.
  • Phockup: I liked its simplicity, but it ignored tags and special folder context.
  • PhotoPrism / PhotoStructure: slick interfaces, but they didn’t anticipate my messy, multi-drive situation.
  • Immich: the most promising of all — I’ll probably still use it as the final home — but it wasn’t built for the hard part: untangling duplicates across dozens of scattered folders first.
  • ExifTool: indispensable for metadata, but it’s a scalpel. I needed a system around it.

Each solved part of the problem. None solved my problem:

  • Consolidating scattered photos into one place.
  • Deduplicating by both exact file hash and pixel content.
  • Preserving meaningful folder context (bad, Originals, _private).
  • Working in idempotent, non-destructive steps I could safely re-run.
  • Giving me CSV exports for auditing so nothing happened in a black box.

That’s when it clicked: I didn’t just need a photo manager. I needed a consolidator. A staging ground that would clean, deduplicate, and structure everything before I handed it off to Immich or whatever tool came next.

So I built one.


The Journey of Building PhotoCat

Step 1: Hashes Everywhere

The obvious start was hashing files with SHA-256. It worked for exact duplicates — until I realized many “duplicates” slipped through because their metadata differed. Same pixels, different hashes. Oops.
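
For reference, the file-level pass is nothing exotic: read the file in chunks and feed SHA-256. A minimal sketch, not PhotoCat's exact code:

    import hashlib

    def file_hash(path: str) -> str:
        """SHA-256 of the raw file bytes, read in chunks so memory stays flat."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()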

Step 2: Content Hashing

ExifTool to the rescue. It can return raw image bytes stripped of metadata, so I could hash the actual pixels. But hashing 100,000 images this way would take forever. Solution: only content-hash the chosen files from each duplicate group.
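
One way to get a metadata-independent hash with ExifTool is to write a stripped copy (-all= removes the metadata) and hash that copy instead. A rough sketch, assuming exiftool is on the PATH:

    import hashlib, os, subprocess, tempfile

    def content_hash(path: str) -> str:
        """Hash the image after stripping metadata, so EXIF edits don't change it."""
        with tempfile.TemporaryDirectory() as tmp:
            stripped = os.path.join(tmp, "stripped" + os.path.splitext(path)[1])
            # -all= removes all metadata; -o writes the stripped copy to a new file
            subprocess.run(["exiftool", "-all=", "-o", stripped, path],
                           check=True, capture_output=True)
            h = hashlib.sha256()
            with open(stripped, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            return h.hexdigest()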

Step 3: Sorted vs. Unsorted Roots

Not all folders should become tags. Carefully named ones like Family/Christmas/2008 deserved tags. Phone dumps from DCIM? Noise. So I split scans into two modes:

  • Sorted roots → tags derived from folder names.
  • Unsorted roots → just catalogued.
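
For sorted roots, deriving the tags is basically just splitting the path between the root and the file. A simplified sketch:

    from pathlib import Path

    def tags_for(photo: Path, sorted_root: Path) -> list[str]:
        """Folder names between the sorted root and the file become tags."""
        # e.g. <root>/Family/Christmas/2008/IMG_0042.jpg -> ["Family", "Christmas", "2008"]
        return list(photo.relative_to(sorted_root).parts[:-1])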

Step 4: Special Folders

I almost flattened away bad, Originals, _private. But those labels meant something. Now they’re preserved under the destination date folder:

E:\PhotoLibrary\2009\2009-08-15\bad\photo.jpg
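
Roughly, the destination path gets built like this (a simplified sketch, not the exact code):

    from datetime import date
    from pathlib import Path

    SPECIAL = {"bad", "Originals", "_private"}

    def dest_path(library: Path, taken: date, src: Path) -> Path:
        """Build <library>/YYYY/YYYY-MM-DD/[special]/<filename>."""
        out = library / f"{taken:%Y}" / f"{taken:%Y-%m-%d}"
        if src.parent.name in SPECIAL:
            out = out / src.parent.name   # keep the bad/Originals/_private label
        return out / src.name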

Step 5: Learning “Idempotent”

Somewhere along the way, I learned a new SAT word: idempotent. It means you can run an operation over and over, and the result doesn’t change after the first time.

And I realized: that’s how I wanted PhotoCat to behave.

  • Re-run scan → it only picks up new files.
  • Re-run hash → it fills in the blanks.
  • Re-run propose → it just updates the plan.

That gave me confidence to experiment without fear of breaking anything.
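
Most of that idempotence falls out of the database itself: the path is a unique key, and INSERT OR IGNORE means a re-scan simply skips what it has already seen. A stripped-down sketch of the idea:

    import sqlite3

    con = sqlite3.connect("photocat.db")
    con.execute("""CREATE TABLE IF NOT EXISTS files (
                       path   TEXT PRIMARY KEY,  -- uniqueness makes re-scans harmless
                       size   INTEGER,
                       sha256 TEXT)""")

    def record_file(path: str, size: int) -> None:
        # Already-catalogued paths are silently skipped on a re-run.
        con.execute("INSERT OR IGNORE INTO files (path, size) VALUES (?, ?)", (path, size))
        con.commit()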

Step 6: Progress Bars

Early runs were silent. Waiting hours with no output was unnerving. So I made another rule: every heavy step needs a progress bar.
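
That rule is cheap to follow; with Python's tqdm library, for instance, the heavy loop just gets wrapped and you get a live count, rate, and ETA:

    from tqdm import tqdm

    def hash_all(paths, hash_one):
        """Wrap any heavy per-file loop so it reports progress as it goes."""
        for path in tqdm(paths, desc="Hashing", unit="file"):
            hash_one(path)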

Step 7: Copy, Not Move

When it came time to consolidate, I played it safe. By default, PhotoCat copies files. Hardlinks are optional, but originals stay untouched.
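
The placement step boils down to something like this (simplified; collision handling and error checks omitted):

    import os
    import shutil
    from pathlib import Path

    def place(src: Path, dst: Path, use_hardlink: bool = False) -> None:
        """Copy by default so originals stay untouched; hardlink only if asked."""
        dst.parent.mkdir(parents=True, exist_ok=True)
        if use_hardlink:
            os.link(src, dst)       # same bytes, no extra space, same volume only
        else:
            shutil.copy2(src, dst)  # copy2 keeps timestamps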

Step 8: Quarantine, Not Delete

Deleting is final. Instead, redundant files get moved into a quarantine tree that mirrors their original paths. Only when I’m 100% confident do I delete.
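
Mirroring the original path makes it trivial to put anything back. A simplified sketch:

    import shutil
    from pathlib import Path

    def quarantine(src: Path, source_root: Path, quarantine_root: Path) -> None:
        """Move a redundant file into a tree that mirrors its original location."""
        dst = quarantine_root / src.relative_to(source_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))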

Step 9: Local Content Duplicates

Finally, I noticed duplicates sneaking into the same date folder. Same pixels, different metadata. So I built refine-content-local, which suppresses all but the best version based on:

  • More tags = better.
  • Preferred format (RAW/HEIC > JPEG > PNG).
  • Larger size wins ties.
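
That ranking is really just a sort key. A simplified sketch, with a stand-in Candidate class for however the catalog represents a file:

    from dataclasses import dataclass

    FORMAT_RANK = {".dng": 3, ".cr2": 3, ".heic": 3, ".jpg": 2, ".jpeg": 2, ".png": 1}

    @dataclass
    class Candidate:          # stand-in for a catalogued file in a duplicate group
        ext: str
        size_bytes: int
        tags: list

    def keep_score(c: Candidate) -> tuple:
        """Higher wins: more tags, then preferred format, then larger file."""
        return (len(c.tags), FORMAT_RANK.get(c.ext.lower(), 0), c.size_bytes)

    # best = max(group, key=keep_score); everything else gets suppressed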

Looking Ahead: The “Back of the Photo” Vision

Consolidation is step one. But I’m thinking about the long-term too.

Back in the print era, people would write notes on the back of photos: “Grandpa John, 1947 picnic, Lake George.” Sometimes they’d even draw arrows or circles around heads in group shots. I want to bring that idea forward into the digital world.

Imagine:

  • An annotated copy of a group photo, with numbered dots over each face.
  • A sidebar listing names and biographical notes:
    • #1 William Payne – 15th birthday, b. 2009
    • #2 Jen Payne – Mother
  • The original photo stays pristine, but the “back-of-photo” version carries the story.

And here’s the cool part: metadata standards already support this.

  • XMP RegionInfo / IPTC ImageRegions let you store who is in the photo and where they are, as rectangles or polygons with names attached.
  • Example: “Person=William Payne at Rectangle(0.35,0.22,0.15,0.20).”
  • Tools like ExifTool, digiKam, PhotoPrism, and even Immich are starting to support this.
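
As a concrete taste, writing a PersonInImage tag plus an MWG face region with ExifTool could look like the sketch below (the struct syntax and the 4000x3000 dimensions are illustrative; check the ExifTool docs and run it on a scratch copy first):

    import subprocess

    # MWG-Regions struct for one face, using the rectangle from the example above.
    # AppliedToDimensions should be the photo's real pixel size; 4000x3000 is a placeholder.
    region = ("{AppliedToDimensions={W=4000,H=3000,Unit=pixel},"
              "RegionList=[{Name=William Payne,Type=Face,"
              "Area={X=0.35,Y=0.22,W=0.15,H=0.20,Unit=normalized}}]}")

    subprocess.run(
        ["exiftool",
         "-XMP-iptcExt:PersonInImage=William Payne",
         f"-XMP-mwg-rs:RegionInfo={region}",
         "photo.jpg"],
        check=True)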

So the future I want looks like this:

  • Originals → clean, deduplicated, metadata-rich, living in Immich or a similar system.
  • Annotated derivatives → “back of the photo” images with overlays and notes for humans.
  • Machine-readable metadata → EXIF/XMP regions storing who’s where, ready for genealogy systems.

Genealogy as a Long-Term Consideration

Genealogy isn’t my primary goal, but it’s always in the back of my mind. If I’m doing all this work, why not make sure it supports the long game?

  • If a photo carries structured tags (PersonInImage, Event, Location), it can be connected to family trees (GEDCOM, Baserow, etc.).
  • If faces are outlined with regions, future tools — or even AI — can link them to biographical databases.
  • If I ever build the “annotate” feature into PhotoCat, it could export:
    • photo.jpg (original)
    • photo-annotated.jpg (with numbered dots + notes)
    • photo.json (mapping regions to people/IDs)

That’s the modern “back of the photo.” Both human-friendly and machine-friendly.
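
Nothing produces that photo.json today, but a hypothetical shape for it might be:

    import json

    # Hypothetical mapping of numbered regions to people; purely illustrative.
    annotations = {
        "photo": "photo.jpg",
        "regions": [
            {"n": 1, "name": "William Payne", "note": "15th birthday, b. 2009",
             "rect": [0.35, 0.22, 0.15, 0.20]},   # x, y, w, h as fractions of the image
            {"n": 2, "name": "Jen Payne", "note": "Mother"},
        ],
    }

    with open("photo.json", "w") as f:
        json.dump(annotations, f, indent=2)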


Looking Back

This journey wasn’t just about writing a script. It was a chain of realizations and course corrections:

  • Hashing isn’t enough.
  • Content hashing is necessary, but only in smart places.
  • Not all folders should contribute tags.
  • Special folders carry meaning.
  • Every operation should be idempotent.
  • Silence is terrifying — progress bars are mandatory.
  • Copy, don’t move. Quarantine, don’t delete.
  • Even within a folder, duplicates lurk.
  • And long term, the “back of the photo” still matters — digitally.

Where I Am Now

Today, PhotoCat can:

  • Catalog 100k+ files into SQLite.
  • Hash them (file + content).
  • Propose a clean date-based structure.
  • Preserve tags and special subfolders.
  • Suppress redundant duplicates.
  • Export everything to CSV for auditing.
  • Copy files safely into a new library.
  • Quarantine originals before deletion.

It’s not flashy. It’s a command-line tool with progress bars. But it gives me something more important: a path back to clarity — for Jen, for the boys, and maybe even for future generations.

You can find the code at https://github.com/tompayne36/photocat
