Wallhaven is easy to use for external programs/bots

Posts: 3 · Views: 139
  • 13234

    What do you guys think of external applications using Wallhaven infrastructure?

    Using my program, I have already gone through the first 20k wallpapers by hand, glancing at each one and saving it if it catches my eye. (I can go through about 1k in 10 minutes.)

    an example: https://www.youtube.com/watch?v=JfMAkDZqs9I

    In case you guys are against it: the code only exists on my computer, nowhere online, and the video is unlisted. Also pardon my bad video skillz; this is literally the first video I have ever "produced".

    Off topic, but some interesting things: 1) From a (non-random) sample of about 10k, the average file size seems to be about 500 KB. 2) Of all the pictures, about 60% seem to be softcore porn. The website filters it, but my program doesn't. :| 3) The total size of all images has to be somewhere around 275 GB, excluding metadata and whatnot.

  • 13237

    If I understand your video correctly, that's a scraper, though I'm not quite sure whether it's interactive. If you use it responsibly that's fine, but if you run it on a larger scale it may impact our server performance, in which case we'd have to do something about it.

    As you have already noticed, it's not particularly difficult to write a scraper for wallhaven (I have one myself to synchronize my collections). We may do a bit of work in the future to make this a little more difficult: not because we want people to "only use wallhaven via browser", which would be silly, but to prevent large-scale scraping. For example, it's considered a matter of politeness not to parallelize a scraper (for any website, not just here).
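
    To illustrate, here is a minimal Python sketch of what a polite, strictly sequential scraper loop can look like. The URL pattern and the delay are placeholders, not our actual scheme or an official limit:

        import time
        import requests

        BASE_URL = "https://example.com/wallpaper/{}"  # placeholder URL scheme
        DELAY_SECONDS = 1.0  # pause between requests; tune to taste

        def fetch_page(wallpaper_id):
            """Fetch one wallpaper page; return None if the ID doesn't exist."""
            resp = requests.get(BASE_URL.format(wallpaper_id), timeout=10)
            if resp.status_code == 404:
                return None
            resp.raise_for_status()
            return resp.text

        def scrape_range(start_id, end_id):
            """Walk IDs one at a time (never in parallel), sleeping between requests."""
            for wallpaper_id in range(start_id, end_id + 1):
                page = fetch_page(wallpaper_id)
                if page is not None:
                    yield wallpaper_id, page
                time.sleep(DELAY_SECONDS)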

    Meanwhile we're also hoping to one day (no ETA, sorry) offer an API. That should make it easier to create tools like this while allowing us to exert some control over the load individual tools may cause.

    As for your off-topic mentions: 2) That stat is a lot lower for the website overall. 3) Slightly over 300 GB at the moment.

  • 13239

    You know, I never thought to call it a scraper. This is the first one I've written, and while I know the term, I just didn't think of it.

    It's slightly interactive: I input the starting parameters (besides the max, which is usually determined by finding the highest number that doesn't give a 404; I gave it a constant value for the demonstration). I had never heard that it was polite not to parallelize scrapers, but that makes sense, and it doesn't even improve the timing by that much, so I'll remove it. As for extensive use, I usually do it in 10k bursts, which I then spend the next few days going over. Is that alright?
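
    For the curious, finding the max can be done with an exponential probe followed by a binary search on the first 404. Here's a rough Python sketch; the URL is a placeholder, and it assumes IDs are sequential with no gaps near the top, which isn't quite true if wallpapers get deleted:

        import requests

        BASE_URL = "https://example.com/wallpaper/{}"  # placeholder URL scheme

        def exists(wallpaper_id):
            # HEAD keeps each probe cheap; anything but a 404 counts as a hit.
            resp = requests.head(BASE_URL.format(wallpaper_id), timeout=10)
            return resp.status_code != 404

        def find_max_id():
            """Highest ID that exists, found with O(log n) probes."""
            hi = 1
            while exists(hi):      # double until the first miss
                hi *= 2
            lo = hi // 2           # last known hit (0 if even ID 1 missed)
            while lo + 1 < hi:     # binary search between last hit and first miss
                mid = (lo + hi) // 2
                if exists(mid):
                    lo = mid
                else:
                    hi = mid
            return lo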

    As easy as it is to do this, I suspect the site could become a popular target for scrapers. I would suggest using base-64 IDs like YouTube does, instead of sequential numbers. (If you are interested, I know of a nice video talking about it, though I'm sure you know a lot more than I do.)
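
    A sketch of what I mean, assuming YouTube-style 11-character IDs over the URL-safe base-64 alphabet. With 64**11 (roughly 7.4e19) possibilities, walking IDs by counting up stops being feasible:

        import secrets

        # URL-safe base-64 alphabet, the same 64 characters YouTube draws from
        ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    "abcdefghijklmnopqrstuvwxyz"
                    "0123456789-_")

        def random_id(length=11):
            """Unguessable ID: 64**11 is about 7.4e19 possibilities."""
            return "".join(secrets.choice(ALPHABET) for _ in range(length))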

    An API would be nice.

    I have another idea: after the website reset, everything will be removed. I could set up a program that every once in a while would check for new wallpapers and scrape them. Since it would be staying up to date, I don't think it would be considered to have an extensive impact; more like a user who downloads every new wallpaper, say, hourly or daily.
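
    Something like this, reusing find_max_id from my earlier sketch; the interval and the download callback are just placeholders:

        import time

        CHECK_INTERVAL = 3600  # seconds, i.e. hourly; daily would be 86400

        def sync_forever(last_seen_id, download):
            """Poll for IDs newer than last_seen_id and fetch only those,
            so the steady-state cost is a handful of probes per interval."""
            while True:
                newest = find_max_id()  # from the probe sketch above
                for wallpaper_id in range(last_seen_id + 1, newest + 1):
                    download(wallpaper_id)   # placeholder callback
                    time.sleep(1)            # stay polite between downloads
                last_seen_id = newest
                time.sleep(CHECK_INTERVAL)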

    To prevent scraping abuse, I will not be releasing the code anywhere, and if you would like, I can edit the post and remove the link.

    On second thought, I did give an old copy of the exe to a friend. It's not parallelized, and I doubt he will abuse it, but I will ask him to delete it.
