{"id":1019,"date":"2025-08-26T21:06:48","date_gmt":"2025-08-27T01:06:48","guid":{"rendered":"https:\/\/paynecentral.com\/tompayne\/?p=1019"},"modified":"2025-08-26T21:06:48","modified_gmt":"2025-08-27T01:06:48","slug":"how-i-ended-up-writing-my-own-photo-deduplication-tool","status":"publish","type":"post","link":"https:\/\/paynecentral.com\/tompayne\/2025\/08\/26\/how-i-ended-up-writing-my-own-photo-deduplication-tool\/","title":{"rendered":"How I Ended Up Writing My Own Photo Deduplication Tool"},"content":{"rendered":"\n<p>(I may rewrite this, but this is what the Robot Overlords suggested.)<\/p>\n\n\n\n<p>It was much easier managing our family photos when I was the <strong>primary photographer<\/strong>. I\u2019d take a day\u2019s worth of photos, pull them into Google Picasa, tag the bad shots (why delete them when disk space felt infinite compared to the days of floppies and tiny SD cards?), and sort the rest into meaningful folders.<\/p>\n\n\n\n<p>To make browsing easy, I used lightweight PHP gallery scripts. For a while, I relied on one from <em>phpix.org<\/em> (now defunct). Later I switched to <strong>sfpg (Single File PHP Gallery)<\/strong>, which would take a folder, generate thumbnails, and dynamically create a photo gallery. Jen got used to this setup \u2014 if she wanted to revisit memories, she knew exactly where to go. It worked beautifully: organized, simple, predictable.<\/p>\n\n\n\n<p>Then everything changed. Jen and the boys got their own phones. I started leaning more on mine too, syncing photos with Google Photos. With iCloud, Google, and cheap portable hard drives all in the mix, pictures started scattering everywhere \u2014 external drives, random exports, phone backups, cloud albums.<\/p>\n\n\n\n<p>The problem wasn\u2019t really that <em>I<\/em> couldn\u2019t find photos. I usually could. 
The problem was that <strong>Jen couldn\u2019t.<\/strong> She\u2019d ask, <em>\u201cWhere are the photos of William\u2019s 15th birthday?\u201d<\/em><\/p>\n\n\n\n<p>In the old world, the answer was easy: open the <code>2012\/Birthdays\/William<\/code> folder or check the sfpg gallery. In the new world, those photos might be on Google Photos, or iCloud, or maybe buried on some portable drive. For her, the system had collapsed. That\u2019s when I realized I needed to fix this \u2014 not just for me, but for the whole family.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Trying the Existing Tools<\/h2>\n\n\n\n<p>I didn\u2019t jump straight into coding. I tried the usual suspects:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Lightroom \/ Darktable<\/strong>: great editing environments, but too focused on image processing \u2014 not consolidation.<\/li>\n\n\n\n<li><strong>TagThatPhoto<\/strong>: promising for face recognition, but not built for large-scale cleanup.<\/li>\n\n\n\n<li><strong>Phockup<\/strong>: I liked its simplicity, but it ignored tags and special folder context.<\/li>\n\n\n\n<li><strong>PhotoPrism \/ PhotoStructure<\/strong>: slick interfaces, but they didn\u2019t anticipate my messy, multi-drive situation.<\/li>\n\n\n\n<li><strong>Immich<\/strong>: the most promising of all \u2014 I\u2019ll probably still use it as the <em>final home<\/em> \u2014 but it wasn\u2019t built for the hard part: untangling duplicates across dozens of scattered folders first.<\/li>\n\n\n\n<li><strong>ExifTool<\/strong>: indispensable for metadata, but it\u2019s a scalpel. I needed a system around it.<\/li>\n<\/ul>\n\n\n\n<p>Each solved <em>part<\/em> of the problem. 
None solved <em>my<\/em> problem:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Consolidating scattered photos into one place.<\/li>\n\n\n\n<li>Deduplicating by both exact file hash and pixel content.<\/li>\n\n\n\n<li>Preserving meaningful folder context (<code>bad<\/code>, <code>Originals<\/code>, <code>_private<\/code>).<\/li>\n\n\n\n<li>Working in <strong>idempotent, non-destructive steps<\/strong> I could safely re-run.<\/li>\n\n\n\n<li>Giving me CSV exports for auditing so nothing happened in a black box.<\/li>\n<\/ul>\n\n\n\n<p>That\u2019s when it clicked: I didn\u2019t just need a photo manager. I needed a <strong>consolidator<\/strong>. A staging ground that would clean, deduplicate, and structure everything before I handed it off to Immich or whatever tool came next.<\/p>\n\n\n\n<p>So I built one.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">The Journey of Building PhotoCat<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Hashes Everywhere<\/h3>\n\n\n\n<p>The obvious start was hashing files with SHA-256. It worked for exact duplicates \u2014 until I realized many \u201cduplicates\u201d slipped through because their metadata differed. Same pixels, different hashes. Oops.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Content Hashing<\/h3>\n\n\n\n<p>ExifTool to the rescue. It can return raw image bytes stripped of metadata, so I could hash the actual pixels. But hashing 100,000 images this way would take forever. Solution: only content-hash the <em>chosen files<\/em> from each duplicate group.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Sorted vs. Unsorted Roots<\/h3>\n\n\n\n<p>Not all folders should become tags. Carefully named ones like <code>Family\/Christmas\/2008<\/code> deserved tags. Phone dumps from <code>DCIM<\/code>? Noise. 
So I split scans into two modes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Sorted roots<\/strong> \u2192 tags derived from folder names.<\/li>\n\n\n\n<li><strong>Unsorted roots<\/strong> \u2192 just catalogued.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Special Folders<\/h3>\n\n\n\n<p>I almost flattened away <code>bad<\/code>, <code>Originals<\/code>, <code>_private<\/code>. But those labels meant something. Now they\u2019re preserved under the destination date folder:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>E:\\PhotoLibrary\\2009\\2009-08-15\\bad\\photo.jpg\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Learning \u201cIdempotent\u201d<\/h3>\n\n\n\n<p>Somewhere along the way, I learned a new SAT word: <strong>idempotent<\/strong>. It means you can run an operation over and over, and the result doesn\u2019t change after the first time.<\/p>\n\n\n\n<p>And I realized: that\u2019s how I wanted PhotoCat to behave.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Re-run <strong>scan<\/strong> \u2192 it only picks up new files.<\/li>\n\n\n\n<li>Re-run <strong>hash<\/strong> \u2192 it fills in the blanks.<\/li>\n\n\n\n<li>Re-run <strong>propose<\/strong> \u2192 it just updates the plan.<\/li>\n<\/ul>\n\n\n\n<p>That gave me confidence to experiment without fear of breaking anything.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Progress Bars<\/h3>\n\n\n\n<p>Early runs were silent. Waiting hours with no output was unnerving. So I made another rule: every heavy step needs a progress bar.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Copy, Not Move<\/h3>\n\n\n\n<p>When it came time to consolidate, I played it safe. By default, PhotoCat <strong>copies<\/strong> files. Hardlinks are optional, but originals stay untouched.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 8: Quarantine, Not Delete<\/h3>\n\n\n\n<p>Deleting is final. 
Instead, redundant files get moved into a <strong>quarantine tree<\/strong> that mirrors their original paths. Only when I\u2019m 100% confident do I delete.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 9: Local Content Duplicates<\/h3>\n\n\n\n<p>Finally, I noticed duplicates sneaking into the same date folder. Same pixels, different metadata. So I built <strong>refine-content-local<\/strong>, which suppresses all but the best version based on:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>More tags = better.<\/li>\n\n\n\n<li>Preferred format (RAW\/HEIC > JPEG > PNG).<\/li>\n\n\n\n<li>Larger size wins ties.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Looking Ahead: The \u201cBack of the Photo\u201d Vision<\/h2>\n\n\n\n<p>Consolidation is step one. But I\u2019m thinking about the long-term too.<\/p>\n\n\n\n<p>Back in the print era, people would write notes on the back of photos: <em>\u201cGrandpa John, 1947 picnic, Lake George.\u201d<\/em> Sometimes even drawing arrows or circles around heads in group shots. I want to bring that idea forward into the digital world.<\/p>\n\n\n\n<p>Imagine:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An <strong>annotated copy<\/strong> of a group photo, with numbered dots over each face.<\/li>\n\n\n\n<li>A sidebar listing names and biographical notes:\n<ul class=\"wp-block-list\">\n<li><em>#1 William Payne \u2013 15th birthday, b. 
2009<\/em><\/li>\n\n\n\n<li><em>#2 Jen Payne \u2013 Mother<\/em><\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>The original photo stays pristine, but the \u201cback-of-photo\u201d version carries the story.<\/li>\n<\/ul>\n\n\n\n<p>And here\u2019s the cool part: metadata standards already support this.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>XMP RegionInfo \/ IPTC ImageRegions<\/strong> let you store <em>who<\/em> is in the photo and <em>where<\/em> they are, as rectangles or polygons with names attached.<\/li>\n\n\n\n<li>Example: \u201cPerson=William Payne at Rectangle(0.35,0.22,0.15,0.20).\u201d<\/li>\n\n\n\n<li>Tools like ExifTool, digiKam, PhotoPrism, even Immich are starting to use this.<\/li>\n<\/ul>\n\n\n\n<p>So the future I want looks like this:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Originals \u2192 clean, deduplicated, metadata-rich, living in Immich or a similar system.<\/li>\n\n\n\n<li>Annotated derivatives \u2192 \u201cback of the photo\u201d images with overlays and notes for humans.<\/li>\n\n\n\n<li>Machine-readable metadata \u2192 XMP\/IPTC regions storing who\u2019s where, ready for genealogy systems.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Genealogy as a Long-Term Consideration<\/h2>\n\n\n\n<p>Genealogy isn\u2019t my primary goal, but it\u2019s always in the back of my mind. 
If I\u2019m doing all this work, why not make sure it supports the long game?<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If a photo carries structured tags (<code>PersonInImage<\/code>, <code>Event<\/code>, <code>Location<\/code>), it can be connected to family trees (GEDCOM, Baserow, etc.).<\/li>\n\n\n\n<li>If faces are outlined with regions, future tools \u2014 or even AI \u2014 can link them to biographical databases.<\/li>\n\n\n\n<li>If I ever build the \u201cannotate\u201d feature into PhotoCat, it could export:\n<ul class=\"wp-block-list\">\n<li><code>photo.jpg<\/code> (original)<\/li>\n\n\n\n<li><code>photo-annotated.jpg<\/code> (with numbered dots + notes)<\/li>\n\n\n\n<li><code>photo.json<\/code> (mapping regions to people\/IDs)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p>That\u2019s the modern \u201cback of the photo.\u201d Both human-friendly and machine-friendly.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Looking Back<\/h2>\n\n\n\n<p>This journey wasn\u2019t just about writing a script. It was a chain of realizations and course corrections:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Hashing isn\u2019t enough.<\/li>\n\n\n\n<li>Content hashing is necessary, but only in smart places.<\/li>\n\n\n\n<li>Not all folders should contribute tags.<\/li>\n\n\n\n<li>Special folders carry meaning.<\/li>\n\n\n\n<li>Every operation should be idempotent.<\/li>\n\n\n\n<li>Silence is terrifying \u2014 progress bars are mandatory.<\/li>\n\n\n\n<li>Copy, don\u2019t move. 
Quarantine, don\u2019t delete.<\/li>\n\n\n\n<li>Even within a folder, duplicates lurk.<\/li>\n\n\n\n<li>And long term, the \u201cback of the photo\u201d still matters \u2014 digitally.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Where I Am Now<\/h2>\n\n\n\n<p>Today, PhotoCat can:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Catalog 100k+ files into SQLite.<\/li>\n\n\n\n<li>Hash them (file + content).<\/li>\n\n\n\n<li>Propose a clean date-based structure.<\/li>\n\n\n\n<li>Preserve tags and special subfolders.<\/li>\n\n\n\n<li>Suppress redundant duplicates.<\/li>\n\n\n\n<li>Export everything to CSV for auditing.<\/li>\n\n\n\n<li>Copy files safely into a new library.<\/li>\n\n\n\n<li>Quarantine redundant originals before deletion.<\/li>\n<\/ul>\n\n\n\n<p>It\u2019s not flashy. It\u2019s a command-line tool with progress bars. But it gives me something more important: a path back to clarity \u2014 for Jen, for the boys, and maybe even for future generations.<\/p>\n\n\n\n<p>You can find the code at <a href=\"https:\/\/github.com\/tompayne36\/photocat\">https:\/\/github.com\/tompayne36\/photocat<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>(I may rewrite this, but this is what the Robot Overlords suggested.) It was much easier managing our family photos when I was the primary photographer. 
I\u2019d take a day\u2019s worth of photos, pull them into Google Picasa, tag the bad shots (why delete them when disk space felt infinite compared to the days of floppies [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1019","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_sharing_enabled":true,"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/paynecentral.com\/tompayne\/wp-json\/wp\/v2\/posts\/1019"}],"collection":[{"href":"https:\/\/paynecentral.com\/tompayne\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/paynecentral.com\/tompayne\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/paynecentral.com\/tompayne\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/paynecentral.com\/tompayne\/wp-json\/wp\/v2\/comments?post=1019"}],"version-history":[{"count":1,"href":"https:\/\/paynecentral.com\/tompayne\/wp-json\/wp\/v2\/posts\/1019\/revisions"}],"predecessor-version":[{"id":1020,"href":"https:\/\/paynecentral.com\/tompayne\/wp-json\/wp\/v2\/posts\/1019\/revisions\/1020"}],"wp:attachment":[{"href":"https:\/\/paynecentral.com\/tompayne\/wp-json\/wp\/v2\/media?parent=1019"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/paynecentral.com\/tompayne\/wp-json\/wp\/v2\/categories?post=1019"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/paynecentral.com\/tompayne\/wp-json\/wp\/v2\/tags?post=1019"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}