IQDB dupe checking on upload

Posts: 24 · Views: 1592
  • 14059

    After a few weeks of testing, I'm happy to say we're ready to finally get IQDB dupe checking up and running on upload! This should help cut down on dupes by a ton.

    If you have any issues, let us know in this thread. If you think you've been hit by a false positive, include the wallpaper you were uploading (upload it to another image hosting site so we can see it) and the ID the dupe detection thinks it matches.

    We've also upped the filesize limit from 10MB to 20MB. However, I do suggest being 'smart' about what you upload. For example, try to avoid 20MB PNGs when a 2MB JPG would look 99% as good. Compression sucks, but most of the time, if done right, the differences are negligible.

    Any and all other feedback on uploading is welcome here as well!

  • 14061

    Sweeeeeeeeeeet. I get the feeling that Alpha is nearing a close.

  • 14062

    It only seems to work on recently uploaded wallpapers? Anything older than 2 months seems to go through even though they're dupes. I guess it doesn't matter. Alpha and all.

    Recent uploads work great though.

  • 14063

    Yeah, it seems like the database got purged because it's no longer showing any similar results.

  • 14064

    Similar search is really not working at the moment :-(

  • 14065

    Ugg.

    I think I know why....

    Reindexing IQDB. Everything should be back in an hour or two.

  • 14066

    Thank you.

  • 14112

    It works OK most of the time, but it seems that when a wallpaper is old enough, the system won't detect the duplicate.

  • 14156

    How about preexisting dupes? I've seen a bunch of those...

  • 14278

    @vjeko said:

    How about preexisting dupes? I've seen a bunch of those...

    I too want to know. Is it possible to do this ON the server hosting the images and then flag any that are poorly tagged or have a lower filesize at the same resolution? Flagging would keep a human element in there to decide which ones should be removed. But that also adds man-hours to sift through them, so it all comes down to the "is it worth it" aspect, especially if alpha is coming to a close soon.

  • 14356

    sudos said:

    especially if alpha is coming to a close soon.

    Wait, this site's still in alpha? It's been like 3 years already

  • 14362

    amahran said:

    sudos said: Wait, this site's still in alpha? It's been like 3 years already

    You do realise you're still going to

    https://alpha.wallhaven.cc/

    right?

    And the little ribbon at the lower right corner...

  • 14366

    jpdokter said:

    amahran said: You do realise you're still going to

    https://alpha.wallhaven.cc/

    right? And the little ribbon at the lower right corner...

    Never even paid attention to the link. The site's one of my homepages, and who even looks at lower corners anymore, it's 2017!

  • 14408

    jpdokter said:

    amahran said: You do realise you're still going to

    https://alpha.wallhaven.cc/

    right? And the little ribbon at the lower right corner...

    Besides,

    http://wallhaven.cc/

    redirects here.

  • 14691

    I really like it there. :)

  • 14767

    Over the last couple of days I've noticed that while detection works better than it did before the new dupe detection, it still has its problems. Besides the uploads I deleted myself after finding out they had already been uploaded, others have been merged later with the already existing one.

    It looks like 15 to 25% of all uploads are not recognized by the dupe detection. I'm not sure what you can do about it, but the error rate still seems somewhat high, even after the changes.

    Just wanted to give you some feedback.

  • 14771

    Not sure if it has anything to do with IQDB but I just went to check my subscriptions and there are some duplicates showing there. There are literally two of the same entries in my subscription list when I click on a tag. I went to both entries and they have the same resolution and ID.

  • 14775

    Okay, today again 7 or 8 (maybe more) of my uploads have been merged, and I personally deleted another 5 or 6 after upload.

    Did you guys disable the new dupe detection again, or are you working on it somehow? Right now I don't see any difference from how it was before. Still a lot of duplicates.

    Don't get me wrong, but I really did hope the new dupe detection would change things at least to the point where only a few duplicates would pass the check. It's still annoying that you upload something, take the time to tag it and add sources and such, and after a few hours someone comes along and explains to you that all the work was for nothing. Especially as the server is running at its limits, at least a few hours a day, and tagging takes ages.

    If I can help with anything, just let me know. For the moment, I can only give you feedback. I don't mean it badly in any way. Please don't take it the wrong way. I do appreciate your work, and I know that you do it alongside work and real life. But in the end it doesn't help. Without feedback you can't know what is going on, right?

  • 14784

    I wrote some PHP code a few years back that I used to find duplicates among the images on wallhaven and the old site (the one this site is based on). It worked really well, and it is not taxing at all. It is really good at finding images that are the same regardless of their uploaded resolution, even if they have been reversed/flipped or modified (for example, one image carrying a copyright notice or tag line and another not). It doesn't consider any EXIF data, file size, or resolution in the search.

    I never did much with it because I had a lot going on in my life then. Honestly, it worked better than any other type of image duplicate finder that I have ever used. You just add a few more fields in your database for each image, and when images get uploaded, they get the generated search numbers, so it's live and in real time. You only have to do a full-index once. Re-indexing would not be needed unless at some point, the code was improved even more than it already is.
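
    To make the idea concrete, here is a minimal sketch of one well-known way to get resolution-independent, flip-tolerant matching: an "average hash". This is an assumption for illustration only. It is not the PHP code mentioned above and not how IQDB works. The input format (rows of 0-255 grayscale values) and the names `fingerprint`, `looks_like_dupe`, and `max_distance` are hypothetical.

```python
from typing import List, Tuple

Gray = List[List[int]]  # rows of 0-255 grayscale values (hypothetical input format)


def shrink(img: Gray, size: int = 8) -> List[List[float]]:
    """Block-average the image down to size x size so resolution stops mattering."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(size):
        r0 = r * h // size
        r1 = max((r + 1) * h // size, r0 + 1)  # always cover at least one row
        row = []
        for c in range(size):
            c0 = c * w // size
            c1 = max((c + 1) * w // size, c0 + 1)  # always cover at least one column
            cells = [img[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def average_hash(img: Gray, size: int = 8) -> int:
    """One bit per cell: 1 if the cell is brighter than the overall mean."""
    flat = [v for row in shrink(img, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def mirror(img: Gray) -> Gray:
    """Horizontally flipped copy of the image."""
    return [row[::-1] for row in img]


def fingerprint(img: Gray) -> Tuple[int, int]:
    """Values that could go into the extra database fields at upload time."""
    return average_hash(img), average_hash(mirror(img))


def looks_like_dupe(new_img: Gray, stored_hash: int, max_distance: int = 5) -> bool:
    """Compare the new upload, as-is and mirrored, against a stored hash."""
    normal, flipped = fingerprint(new_img)
    return min(hamming(normal, stored_hash), hamming(flipped, stored_hash)) <= max_distance
```

    Because the hash is computed once per image and stored, each upload only needs its own hash plus comparisons against the stored values, which fits the "full index once, then live" approach described above. The cutoff of 5 bits is an arbitrary placeholder.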

    You may already be doing this, but it would be nice if, when someone uploads a higher resolution image of a duplicate (or an image with copyright or website tagging overlaid), the site would (and this is just me brainstorming, and subject to feedback)...

    a) accept the new image and copy over the tags of the previous lower version
    b) mark it in such a way that it is a higher resolution version of a previously uploaded image
    c) in some fashion or another that is intuitive and respective of the original uploader, provide a symbolic link to the new image
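
    Steps a) to c) could look roughly like the sketch below. Everything in it is hypothetical: the `Wallpaper` record, the `superseded_by` field, and the pixel-count comparison are assumptions for illustration, not wallhaven's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Wallpaper:
    # Hypothetical record; the real schema is not described in this thread.
    id: int
    width: int
    height: int
    tags: Set[str] = field(default_factory=set)
    superseded_by: Optional[int] = None  # c) link from the old upload to the new one


def accept_higher_res(old: Wallpaper, new: Wallpaper) -> bool:
    """Accept `new` as the better version of `old` only if it has more pixels."""
    if new.width * new.height <= old.width * old.height:
        return False
    new.tags |= old.tags         # a) copy over the tags of the lower version
    old.superseded_by = new.id   # b) + c) mark the relation and link to the new image
    return True
```

    Keeping the old record and pointing it at the new one, rather than deleting it, is one way to stay respectful of the original uploader.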

    If you're interested in the code, I will dig it up and provide some examples to show it really works. You can even compare it with your IQDB system results. In fact, I would be super curious to see how it performs against it.

  • 15105

    So, I went through about 180,000 images, some I had downloaded 2 years ago, others just recently. Basically, I found that around 11,000 of them were duplicates. I also checked some of the older images I had downloaded, and they had since been removed, which I assume was the result of your own duplicate checker search. However, even with many of the duplicates removed, many still remained. Here are some snapshots of the output for some of the duplicates I found before testing to see if those images still remained.

    There are a few false positives, but not many. If I increase the ratio of a match, I will find even more duplicates, but also more false positives. That is okay if you are doing it all by hand, but if you want it to be automated, then it's best to err on the side of not having false positives.
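
    The trade-off can be shown with a small sketch. It assumes the match score is a distance where lower means more similar (the post only says "ratio of a match", so this is an assumption), and the sample pairs are made up purely for illustration.

```python
from typing import List, Tuple


def sweep(pairs: List[Tuple[int, bool]], max_distance: int) -> Tuple[int, int]:
    """Return (true dupes caught, false positives) for a given cutoff."""
    caught = sum(1 for d, is_dupe in pairs if d <= max_distance and is_dupe)
    false_pos = sum(1 for d, is_dupe in pairs if d <= max_distance and not is_dupe)
    return caught, false_pos


# Made-up (distance, really-a-dupe) pairs; not measurements from this thread.
SAMPLE = [(0, True), (2, True), (3, True), (6, True), (9, True),
          (5, False), (8, False), (10, False), (20, False)]

for cutoff in (3, 6, 10):
    caught, false_pos = sweep(SAMPLE, cutoff)
    print(f"cutoff={cutoff}: caught={caught}, false positives={false_pos}")
```

    On this toy data, loosening the cutoff catches more real duplicates but lets more false positives through, which is why a fully automated check would want the stricter setting.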

  • 15106

    After I checked to see which images were still available, it became clear that you had started to remove the duplicates. I didn't go down the list to see whether you were keeping duplicates simply because they were different resolutions, since I'm not sure of your approach/policy on that. But you can see in the image below that the images that are dim with a red border are no longer on wallhaven. While many duplicates have been removed, some still remain.

    The weakness of the duplicate checker I wrote is that for images that are B&W, or that contain an 80% black or 80% white background, the false positives skyrocket and it is unusable. However, with a little more time and persistence, that would be easy to resolve.
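
    One possible stopgap, sketched under assumptions: detect images dominated by near-black or near-white pixels and send them to manual review instead of trusting the automatic match. The thresholds `dark`, `bright`, and `limit` are hypothetical, the input is assumed to be rows of 0-255 grayscale values, and this only addresses the background case, not B&W images in general.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values (hypothetical input format)


def dominant_extreme_fraction(img: Gray, dark: int = 16, bright: int = 239) -> float:
    """Fraction of pixels taken up by the larger of the near-black / near-white groups."""
    pixels = [v for row in img for v in row]
    n_dark = sum(1 for v in pixels if v <= dark)
    n_bright = sum(1 for v in pixels if v >= bright)
    return max(n_dark, n_bright) / len(pixels)


def needs_manual_review(img: Gray, limit: float = 0.8) -> bool:
    """True when the image is mostly black or mostly white background."""
    return dominant_extreme_fraction(img) >= limit
```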

  • 15107

    Good. I am currently on page 9078. I check every picture. It took me a while :-) and it will take a while until I'm finished... It is a bit tiresome, but I can't stop until I am done. Whoa, I have produced many duplicates. I apologize for every mistake I make, and I will do my best to avoid them.

    Almost 200 reported Walls have been merged today. Great Work!
