You may have noticed “Nutch/2.2.1 (page scorer; http://stagingias.wpengine.com/site-indexing-policy/)” and be wondering why it is visiting your site, or you may want to invite the robot to crawl your site. Integral uses machine learning and other analysis to provide content rating and certification services for brands, agencies, ad networks, and publishers. Allowing our robot to crawl your site is essential for us to rate your site’s content accurately. If our robot is blocked, we will be unable to rate your site accurately, and your site will therefore be unavailable to our partner advertisers.
To invite our robot to crawl your site, please contact email@example.com, and we will add your site’s URL to our crawling queue.
If there are areas of your site that you would like to prohibit the robot from crawling, specify your crawling preferences using the Standard for Robot Exclusion (SRE). The SRE governs the practices of most of the major web-crawling groups, and Integral strictly adheres to it.
When crawling your site, the Integral crawler looks for a file called “robots.txt”, which website administrators can place at the top level of a site to direct the behavior of web-crawling robots.
The Integral crawler always retrieves a copy of the robots.txt file before it begins crawling. If you change your robots.txt file while we are crawling your site, please let us know so that we can instruct the crawler to retrieve the updated instructions.
To exclude all robots, the robots.txt file should look like this:
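A minimal sketch of such a file, using the standard User-agent and Disallow directives (the wildcard * matches every robot, and / covers the whole site):

```
User-agent: *
Disallow: /
```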
To exclude just one directory (and its subdirectories), say, the /images/ directory, the file should look like this:
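A minimal sketch for this case, assuming the directory sits at the top level of the site as /images/ (the trailing slash limits the rule to that directory and everything beneath it):

```
User-agent: *
Disallow: /images/
```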
Website administrators can allow or disallow specific robots from visiting part or all of their site. The Integral crawler identifies itself as ia_archiver, so to allow ia_archiver to visit (while preventing all others), your robots.txt file should look like this:
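A minimal sketch, assuming the crawler matches the ia_archiver token as stated above. An empty Disallow line means nothing is off limits for that robot, while the wildcard record blocks every other robot:

```
User-agent: ia_archiver
Disallow:

User-agent: *
Disallow: /
```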
To prevent ia_archiver from visiting (while allowing all others), your robots.txt file should look like this:
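A minimal sketch, again assuming the ia_archiver token. Robots that are not named in the file remain free to crawl the whole site:

```
User-agent: ia_archiver
Disallow: /
```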
For more information regarding robots, crawling, and robots.txt, visit the Web Robots Pages at www.robotstxt.org, an excellent source for the latest information on the Standard for Robot Exclusion.
There are a few reasons why Integral may not have visited your site. Your site may be new, or we may not have been directed to your site by our brand, agency, or ad network partners. It is also possible that your website administrator has disallowed crawlers from visiting your site. Please read the information about robots.txt provided above to ensure your preferences are being honored.