Wednesday, September 01, 2010

Disallow access to server stat files

So after 10 years on the web the spiders have found my server stat files, in this case AWStats. This is the program output I use to see data regarding site visitors: number of visits, page views, visit duration, keywords, and so on. The program stores the data in large text [txt] files, but outputs the data to me in the form of charts and graphs. In fact the txt files are very large, multi-megabyte files; the graphs are somewhat smaller and more readable.

I would recommend that every webmaster add a 'Disallow' line to the robots.txt file to stop the web spiders from reading your stat files. In my case the line looks like this: Disallow: /awstats/.
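A minimal robots.txt sketch is below; the /awstats/ directory matches my setup, so adjust the path to wherever your stats program writes its output:

User-agent: *
Disallow: /awstats/

The 'User-agent: *' line applies the rule to all crawlers. Keep in mind robots.txt is advisory ~ well-behaved spiders like Googlebot honor it, but badly behaved ones may ignore it.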

The bottom curve is server bandwidth ~ the increase occurs as my stat files started showing up in search results. Because the graph is set up to show the number of visitors, the bandwidth is normalized: the horizontal bar at 100,000, which indicates 100,000 visits or page views, indicates 10GB of server bandwidth for the bottom curve, and the 200,000 line indicates 20GB.

I only noticed the server text files showing up in search results about a month ago, because normally I don't need to search my own engineering site ~ right, I wrote it. So I only added a 'block' to the robots file a few weeks ago; however, I recommend that you block access now even if you don't have an issue. It only takes a few minutes to add, and if you pay for bandwidth, or are blocked if you exceed your bandwidth limit, it may well be worth the time.

If you look back to 2006/2007 you can see that bandwidth tracked unique visits, but by 2008 the gap started to widen. Two years before Google started to rank pages on download speed I had already started to make the web site more efficient.

Unfortunately the bandwidth data is now meaningless, because it only shows these large txt statistic files being downloaded. For example, one 2.52MB txt file was downloaded 230 times last month, and a 4.11MB file was downloaded 98 times. That's 328 visitors who used the search bar and received bogus results ~ are they going to come back for a second visit? Really it's much worse: before I stopped counting, there were 1,206 people last month who thought that one of those text files was a valid search return.
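To put rough numbers on the waste from just those two files:

230 downloads x 2.52MB ≈ 580MB
98 downloads x 4.11MB ≈ 403MB

That's close to 1GB of monthly bandwidth spent serving two text files that nobody actually wanted.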

5 comments:

Leroy said...

These stat text files are still an issue. However, at this point most appear as crawl errors in 'my' Webmaster Tools, because I've had them blocked via a line in my robots.txt file.

I've also made a request to have the entire directory removed from Google's index, so the files can no longer be crawled and should not appear in Google search results. They still show up, but I only made the removal request the other day.

The strange thing is that a number of sites link to them now. The good thing is that I've visited some of these sites and they are all "trash" sites ~ the type of site that just has dozens of snippets of text from other web sites, with no content of its own. So I don't expect many visits from those types of sites.

In fact none of those sites even has a Google PageRank, and at least one page no longer has a link to me. My bandwidth usage should drop back to normal this month ~ what a long process.

Leroy said...

10-23-10: The stat files are still showing up in my search results, maybe because they have not yet been re-spidered. The current server bandwidth is only 12GB, which is only a bit higher than the normal [12.66GB per month].

Google indicates that there are 33 stat files restricted by the robots.txt file. All of the files were detected in October, with most being detected on October 6.

Leroy said...

10-26-10: What? Now Google indicates only 27 files restricted by robots.txt, with the last one checked on the 21st of this month.

Leroy said...

11-19-10: Now there are 85 URLs restricted by the robots.txt file; however, many are '.pl' files, which may be some type of report rather than a text file. The 50K .pl file has been downloaded 86 times so far this month.

Leroy said...

11-27-10: Looks like the number of restricted files continues to grow; now at 214 files blocked.
