xemle / home-gallery

Self-hosted open-source web gallery to view your photos and videos featuring mobile-friendly, tagging and AI powered image discovery
https://home-gallery.org
MIT License

Incorrect date added to database when GPS data is set incorrectly #143

Open jgyprime opened 1 week ago

jgyprime commented 1 week ago

Thank you for this great and amazing software. I've been using it with almost 4 TB of personal photos (only photos). I think that I have more than 500k photos there...

But I think I found something that can be improved.

After the initial indexation of photos finished (it took several days on my low powered Celeron NAS), I observed that a lot of my photos were added to the database incorrectly, with 1970 as year... And I started investigating the reason.

For example, in the photo I uploaded, the GPS data is set incorrectly in the photo exif: 20240223_160931_029_gpsdata_1970

# exiftool 20240223_160931_029_gpsdata_1970.jpg | grep -i date
File Modification Date/Time     : 2024:02:23 16:09:36+02:00
File Access Date/Time           : 2024:06:21 16:00:10+03:00
File Inode Change Date/Time     : 2024:06:21 15:55:47+03:00
Modify Date                     : 2024:02:23 16:09:35
Date/Time Original              : 2024:02:23 16:09:35
Create Date                     : 2024:02:23 16:09:35
GPS Date Stamp                  : 1970:01:01
GPS Date/Time                   : 1970:01:01 00:00:00Z
Create Date                     : 2024:02:23 16:09:35.449423
Date/Time Original              : 2024:02:23 16:09:35.449423
Modify Date                     : 2024:02:23 16:09:35.449423

When added to the gallery, the date is set to 1970 (the date is taken from the GPS info in the EXIF)... I do not know how that GPS date got there, but I assume the phone tried to get the GPS date and time and, because GPS was disabled on the phone, fell back to a default value from 1970.

I also found the source of the problem in the source code here: https://github.com/xemle/home-gallery/blob/master/packages/database/src/media/date.js#L44

const dateKeys = ['GPSDateTime', 'SubSecDateTimeOriginal', 'DateTimeOriginal', 'CreateDate']

If I remove the 'GPSDateTime' item from line 44, then everything works correctly after rebuilding and re-indexing the database.

What do you think? Is an improvement possible in this case? For example:

Unfortunately, my knowledge of the js language is very close to 0, so I would prefer for someone with enough knowledge to find a potential implementation here.

Thank you for reading my very long post. Thank you for creating such nice software.

xemle commented 1 week ago

Hi @jgyprime

thank you for using HomeGallery and I am glad that you like it.

Further, thank you for reporting your issue with the date. You did a great job nailing the problem and provided a test picture. Awesome.

Yes. My assumption was: if there is a date provided by GPS, it should be quite accurate. However, your picture has 1) no further GPS data such as coordinates and 2) the date 1970:01:01 00:00:00Z, which is the typical UNIX birth date (the Unix epoch).

Do you think it would be sufficient to allow the GPS date only if GPS coordinates are available? This would keep the basic assumption but will check it in detail...
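A minimal sketch of that idea, not HomeGallery's actual code: it reuses the dateKeys list quoted from date.js and skips GPSDateTime unless a position is present. The helper names hasGpsPosition and pickDate are hypothetical, and the sketch assumes the EXIF object uses exiftool-style keys GPSLatitude and GPSLongitude.

```javascript
// Sketch only: helper names and the GPSLatitude/GPSLongitude keys are
// assumptions for illustration, not taken from the HomeGallery source.
const dateKeys = ['GPSDateTime', 'SubSecDateTimeOriginal', 'DateTimeOriginal', 'CreateDate']

// True if the EXIF data carries both parts of a GPS position.
const hasGpsPosition = exif =>
  exif.GPSLatitude != null && exif.GPSLongitude != null

// Return the first available date entry in dateKeys order,
// ignoring GPSDateTime when no GPS position is present.
const pickDate = exif => {
  for (const key of dateKeys) {
    if (key === 'GPSDateTime' && !hasGpsPosition(exif)) {
      continue
    }
    if (exif[key]) {
      return { key, value: exif[key] }
    }
  }
  return null
}

// Values taken from the exiftool output in this issue.
const reported = {
  GPSDateTime: '1970:01:01 00:00:00Z',
  SubSecDateTimeOriginal: '2024:02:23 16:09:35.449423',
  DateTimeOriginal: '2024:02:23 16:09:35',
  CreateDate: '2024:02:23 16:09:35'
}
console.log(pickDate(reported).key) // SubSecDateTimeOriginal
```

With a position present, GPSDateTime would still win, which keeps the original assumption that a real GPS fix gives an accurate date.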

xemle commented 1 week ago

@jgyprime Since you report having about 500k images: please be aware of #134, which discusses some limits of HomeGallery with large image counts in the database.

jgyprime commented 1 week ago

> @jgyprime Since you report having about 500k images: please be aware of #134, which discusses some limits of HomeGallery with large image counts in the database.

After removing the gps date info (as I said above) the indexation has restarted. Right now, it is indexing, it managed to index approximately 45k pictures... I do not know how long it will take, but I will let it finish. I've already seen that discussion, if I reach any limitation, then I will try to figure out what limitation it has reached.

My NAS is a TerraMaster F4-421:

CPU: Intel Celeron J3455
RAM: 12 GB DDR3 (it came with 4 GB; I added another 8 GB from an old laptop)

I ditched the proprietary OS and installed Debian plus the utilities I need. The main drive (OS and utilities) is a 250 GB SSD. The "storage" drive for the photos is an 8 TB WD Red Pro HDD.

jgyprime commented 1 week ago

> Do you think it would be sufficient to allow the GPS date only if GPS coordinates are available? This would keep the basic assumption but will check it in detail...

Sure. For me it is good enough. Right now I am using the version I compiled by myself from source with my change. For what I need, it is good enough.

xemle commented 1 week ago

> I've already seen that discussion, if I reach any limitation, then I will try to figure out what limitation it has reached.

Alright. Please ping me if you run into problems. It bugs me that there is a problem which should not be there in theory. Since I do not face the problem myself, I need an external push and someone who really wants to have it solved.

Thank you for the details of your system. It helps to know the target systems.

> For me it is good enough. Right now I am using the version I compiled by myself from source with my change.

Awesome. Currently I am implementing a plugin system. When I come across this part, I will ensure that the GPS date is only taken if there is also a GPS position.

In the meantime, if you find a better strategy to identify the date, please let me know.

jgyprime commented 3 days ago

In the meantime, the indexing finished. I observed that only ~100k photos were indexed. When I searched for jpg files, I found ~400k photos. There are other formats there too (png, gif and others).

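I have a few questions:

* is there any limitation to file / folder naming?
* how is the software handling duplicate named files?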

xemle commented 3 days ago

> In the meantime, the indexing finished. I observed that only ~100k photos were indexed. When I searched for jpg files, I found ~400k photos. There are other formats there too (png, gif and others).

Do you have lots of binary duplicates? Do you have files which lead to the same SHA1 checksum?

> * is there any limitation to file / folder naming?

No, there are no limits. Neither in file count nor in folder depth. All files should be considered.

Do you use any file filter which excludes some of the files?

> * how is the software handling duplicate named files?

The file needs to be unique by OS filename for the file indexer and unique by SHA1 for the database. Same SHA1 is handled as duplicate and file data are merged.

There are corner cases with sidecars of duplicate files; I can go into depth on that if requested.

But basically, if you just copy an image/folder byte-by-byte from one place to another OS path, these files are duplicates. That holds even if they are renamed later, since their file content is unchanged and contains the same data. This is a design decision with the goal of showing only unique media, on the assumption that most people have no clue how many duplicates they are storing, and IMHO it does not add any value to show pictures twice.
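To illustrate why a renamed copy still counts as a duplicate, here is a small sketch using Node's built-in crypto module. It is not HomeGallery's implementation; the groupByChecksum helper and the sample paths are made up for the example.

```javascript
const crypto = require('node:crypto')

// SHA1 hex digest of a buffer's content. The file name plays no role.
const sha1 = buffer => crypto.createHash('sha1').update(buffer).digest('hex')

// Group paths by content checksum. `files` is a hypothetical list of
// { path, content } objects standing in for files read from disk.
const groupByChecksum = files => {
  const groups = new Map()
  for (const { path, content } of files) {
    const id = sha1(content)
    if (!groups.has(id)) {
      groups.set(id, [])
    }
    groups.get(id).push(path)
  }
  return groups
}

const files = [
  { path: '2024/IMG_0001.jpg', content: Buffer.from('same bytes') },
  { path: 'backup/renamed.jpg', content: Buffer.from('same bytes') },
  { path: '2024/IMG_0002.jpg', content: Buffer.from('other bytes') }
]
const groups = groupByChecksum(files)
console.log(files.length, groups.size) // 3 2
```

Three indexed files collapse into two unique entries here, which is the same kind of gap as a file count of ~400k next to ~100k database entries if many copies exist.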

To identify the files which are indexed you can dump information from index files *.idx like

zcat Pictures.idx | jq '.data[].filename' | wc -l

This should print the count of your files which should be about 400k according to your provided information.

To identify the entries from the database you can run

zcat database.db | jq '.data[].id' | wc -l

To identify unique database entries you can run

zcat database.db | jq '.data[].id' | sort -u | wc -l

The latter should then print about 100k according to your provided information.
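The same counts can be sketched in Node, assuming only what the zcat | jq commands above imply: the file is gzip-compressed JSON with a top-level data array whose entries have an id. The countIds helper is hypothetical and the sample data is an in-memory stand-in, not a real database.db.

```javascript
const zlib = require('node:zlib')

// Count all ids and unique ids in a gzipped JSON buffer of the
// assumed shape { data: [{ id: ... }, ...] }.
const countIds = gzipped => {
  const { data } = JSON.parse(zlib.gunzipSync(gzipped).toString('utf8'))
  const ids = data.map(entry => entry.id)
  return { total: ids.length, unique: new Set(ids).size }
}

// In-memory stand-in for a database file with one repeated id.
const sample = zlib.gzipSync(JSON.stringify({
  data: [{ id: 'a' }, { id: 'b' }, { id: 'a' }]
}))
console.log(countIds(sample)) // { total: 3, unique: 2 }
```

For a real file, the buffer could come from fs.readFileSync. Note that this loads everything into memory at once, so for a very large database the streaming zcat | jq pipeline is likely the more practical choice.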

Maybe it is worth reading the internals of the gallery to gain further insights and to clarify further questions.

Thank you for reporting your experience and questions.

xemle commented 3 days ago

> is there any limitation to file / folder naming?

One more thing: HomeGallery imports the files in chunks to deal with internal limitations and to provide early feedback (show images in the browser). So the media import might also be in an intermediate state, with not all of your files imported yet?

This import process can be restarted and does not need to be run in one single run.