Drupal, file systems and protection against scraping.

  • 22-12-2013 4:38am
    #1
    Posts: 17,378 ✭✭✭✭


    Hey all,

    So my website is almost ready for content and I'm a little worried about how to set up the file system.

    The site will have upwards of 50k images.
    A faceted taxonomy search will display thumbnails automatically.
    Logged in users can access the full image.

    Apparently a private file system is super slow, so it's really not an option if faceted search is being used like that, while a public file system would allow someone to download everything.
    But the images themselves are completely worthless without the taxonomy information. Could someone scrape the images and the taxonomy info somehow? There are about 200 terms spread across the database.

    Or is it possible to auto-ban IPs at webhost level if they access a large amount of information?
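
    For reference, a minimal sketch of how the public/private split is normally configured in a Drupal 7 settings.php (assuming Drupal 7; the private path below is hypothetical). The slowness mentioned above comes from the private scheme routing every download through PHP and its access checks instead of letting the web server hand out the file directly:

        <?php
        // settings.php overrides (Drupal 7). Public files are served straight by the
        // web server; private files live outside the docroot and every download goes
        // through Drupal's access checks (hook_file_download()).
        $conf['file_public_path']    = 'sites/default/files';
        $conf['file_private_path']   = '/var/www/private/files';  // hypothetical path outside the web root
        $conf['file_default_scheme'] = 'private';                  // new uploads default to private://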


Comments

  • Registered Users, Registered Users 2 Posts: 7,521 ✭✭✭jmcc


    The most extreme solution would be to ban all traffic from data centres and non-ISPs. There will be attempts at downloading it, so you may need to include some kind of session limiting and page limiting (anything over two or three pages a second may be a scraper or bot).
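
    As a rough illustration of the data-centre ban (not from any specific setup; the ranges below are placeholder documentation ranges, and a real list would come from the hosting providers' published address blocks), a PHP front controller could drop requests whose client IP falls inside a listed block:

        <?php
        // Reject requests coming from listed data-centre ranges (IPv4 only).
        $datacentre_ranges = array(
          '203.0.113.0/24',
          '198.51.100.0/24',
        );

        function ip_in_cidr($ip, $cidr) {
          list($subnet, $bits) = explode('/', $cidr);
          $mask = -1 << (32 - (int) $bits);
          return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
        }

        foreach ($datacentre_ranges as $range) {
          if (ip_in_cidr($_SERVER['REMOTE_ADDR'], $range)) {
            header('HTTP/1.1 403 Forbidden');
            exit('Access from this network is not permitted.');
          }
        }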

    It may be worth keeping an eye on this forum:
    http://www.webmasterworld.com/search_engine_spiders/

    There are other forums that deal with scrapers and maggots. However, Cloudflare might be worth checking out, as it can ban bots and scrapers before they get to the web server. Some people have had good results with it, but I have never used it.

    Regards...jmcc


  • Posts: 17,378 ✭✭✭✭ [Deleted User]


    I'm more worried about a person setting out to write a specific script to steal the images and the related taxonomy. The terms won't be on the full content page anywhere, and there are eight different vocabularies, so I'm hoping it would be technically difficult to scrape all the vocabularies and then match them up afterwards.

    What about these htaccess files? All my data will be in a single folder, so could they be used along with maybe a bandwidth cap of, say, 20 MB per IP per day? I'm going to be paying quite a lot for people to categorise content and don't want someone to steal it straight away.
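
    As a sketch of how that cap could work (assuming the full-size images are served through a small PHP gateway rather than as plain public files; the file names, paths and the 20 MB figure are only illustrative):

        <?php
        // Enforce a per-IP daily download cap before streaming an image.
        define('DAILY_BYTE_CAP', 20 * 1024 * 1024);   // 20 MB per IP per day

        $ip      = $_SERVER['REMOTE_ADDR'];
        $counter = sys_get_temp_dir() . '/bwcap_' . md5($ip . date('Y-m-d'));
        $used    = is_file($counter) ? (int) file_get_contents($counter) : 0;

        // Hypothetical image location; in Drupal this would normally happen in
        // hook_file_download() after the user's access has already been checked.
        $image = '/var/www/private/files/' . basename($_GET['file']);
        if (!is_file($image)) {
          header('HTTP/1.1 404 Not Found');
          exit;
        }

        $size = filesize($image);
        if ($used + $size > DAILY_BYTE_CAP) {
          header('HTTP/1.1 429 Too Many Requests');
          exit('Daily download limit reached.');
        }

        file_put_contents($counter, $used + $size);
        header('Content-Type: image/jpeg');
        header('Content-Length: ' . $size);
        readfile($image);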

    It will also be a private site open only to clients, many of whom may have employees who could be potential scrapers. No SEO at all.


  • Registered Users, Registered Users 2 Posts: 14,378 ✭✭✭✭jimmycrackcorm


    There is nothing to stop someone with a little knowledge from scraping your site once they have access. The only preventative measure is to install something to monitor and block such activity, e.g. detect how many images are being accessed by a particular IP within a time range.
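
    A bare-bones version of that idea, purely as a sketch (the window, threshold and flat-file storage are illustrative; a real site would keep the counters in the database or a cache):

        <?php
        // Count image requests per IP in a sliding window and refuse once a
        // threshold is crossed.
        define('WINDOW_SECONDS', 600);   // 10-minute window
        define('MAX_IMAGES', 100);       // at most 100 image requests per window

        $ip    = $_SERVER['REMOTE_ADDR'];
        $store = sys_get_temp_dir() . '/imgcount_' . md5($ip);
        $now   = time();

        $hits = is_file($store) ? array_map('intval', file($store, FILE_IGNORE_NEW_LINES)) : array();
        $hits = array_filter($hits, function ($t) use ($now) {
          return $t > $now - WINDOW_SECONDS;   // drop timestamps outside the window
        });

        if (count($hits) >= MAX_IMAGES) {
          header('HTTP/1.1 429 Too Many Requests');
          exit('Too many image requests; slow down.');
        }

        $hits[] = $now;
        file_put_contents($store, implode("\n", $hits));
        // ...then hand over to whatever actually streams the image.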


  • Registered Users, Registered Users 2 Posts: 2,426 ✭✭✭ressem


    I guess that it depends on how users provide and receive the taxonomy information.

    When an image is searched for and found under one category, are the other valid possible search terms for that image also listed or available?

    Are the search terms available in a list, checkbox or site-wide auto-suggest, or would a script have to run a dictionary through the website to discover all possible categories?

    How does the public view of the site provide information about the search terms used? Does the URL use a REST-type layout that exposes that information?
    www.mysite.ie/search/movies/2013/
    "a public file system would allow someone to download everything."
    Your htaccess and website should prevent anyone from getting a full directory listing of all the filenames. If the filenames are not predictable and script-friendly, then it should take a fair bit of effort to grab everything.
    (just be careful that your file naming structure does not become a bottleneck that fails when multiple machines are adding images. Usual comp sci pattern.)
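
    A sketch of both points, unguessable names plus directory sharding (the base path is hypothetical):

        <?php
        // Store an uploaded image under a random, non-sequential name, sharded into
        // two levels of sub-directories so no single folder holds tens of thousands
        // of files.
        function private_image_path($original_name, $base = '/var/www/private/files') {
          $ext  = strtolower(pathinfo($original_name, PATHINFO_EXTENSION));
          $name = bin2hex(openssl_random_pseudo_bytes(16)) . '.' . $ext;
          $dir  = $base . '/' . substr($name, 0, 2) . '/' . substr($name, 2, 2);
          if (!is_dir($dir)) {
            mkdir($dir, 0750, true);
          }
          return $dir . '/' . $name;   // e.g. /var/www/private/files/a3/f0/a3f0...jpg
        }

        // Usage:
        // move_uploaded_file($_FILES['image']['tmp_name'],
        //                    private_image_path($_FILES['image']['name']));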

    You'd request that search engine robots not crawl the majority of your site.
    Will you be watermarking the public images? It might be sensible to separate them from the paid-for ones in the directory structure.
    If the site is image-heavy, you'll likely want to have the images addressed through different subdomains so that client browsers can download more of them simultaneously and get around the per-host connection limit.
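
    For the subdomain point, a small helper along these lines (the host names are hypothetical) picks the same host for the same file every time, so caches stay effective while browsers open connections to several hosts in parallel:

        <?php
        // Map an image file name onto one of a few static hosts deterministically.
        function image_url($filename) {
          $shards = array('img1.example.com', 'img2.example.com', 'img3.example.com');
          $index  = abs(crc32($filename)) % count($shards);
          return 'http://' . $shards[$index] . '/images/' . rawurlencode($filename);
        }

        // Usage: echo '<img src="' . image_url('a3f0b1.jpg') . '">';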


  • Posts: 17,378 ✭✭✭✭ [Deleted User]


    Wow, big message. It will take a while to work my way through it and I'll reply then. Thanks.

