What Are Web Robots?

The Web Robots Pages

Web Robots (also known as Web Wanderers, Crawlers, or Spiders) are programs that traverse the Web automatically. Search engines such as Google use them to index web content, spammers use them to scan for email addresses, and they have many other uses.

On this site you can learn more about web robots.


About /robots.txt

In a nutshell

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The “User-agent: *” means this section applies to all robots. The “Disallow: /” tells the robot that it should not visit any pages on the site.
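
The check described above can be sketched in Python with the standard library's urllib.robotparser module. This is only an illustration of how a well-behaved robot might apply the rules; the robot name "ExampleBot" and the helper may_visit are hypothetical, and the rules are supplied as text so that no network request is made.

```python
from urllib.robotparser import RobotFileParser

# The two-line /robots.txt from the example above, supplied as text
# so that the sketch does not need network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_visit(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url.

    may_visit is a hypothetical helper name used for illustration.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up robot name.
    print(may_visit(ROBOTS_TXT, "ExampleBot",
                    "http://www.example.com/welcome.html"))
```

A robot that skips this check is not stopped by anything in the file itself, which is why the considerations below matter.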

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don’t want robots to use.

So don’t try to use /robots.txt to hide information.

The details

The /robots.txt standard is not actively developed. The rest of this page gives an overview of how to use /robots.txt on your server, with some simple recipes. To learn more, see also the FAQ.

How to create a /robots.txt file

Where to put it

The short answer: in the top-level directory of your web server.

The longer answer:

When a robot looks for the “/robots.txt” file for a URL, it strips the path component from the URL (everything from the first single slash), and puts “/robots.txt” in its place.

For example, for “http://www.example.com/shop/index.html”, it will remove the “/shop/index.html”, replace it with “/robots.txt”, and end up with “http://www.example.com/robots.txt”.
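
As a rough sketch of that substitution, the following Python uses urllib.parse from the standard library to keep the scheme and host and swap the path. The function name robots_url is a hypothetical helper, not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(url: str) -> str:
    """Build the /robots.txt URL for the site that serves `url`.

    Keeps the scheme and host, replaces the path with /robots.txt,
    and drops any query string or fragment.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://www.example.com/shop/index.html"))
```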

So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site’s main “index.html” welcome page. Where exactly that is, and how to put the file there, depends on your web server software.

Remember to use all lower case for the filename: “robots.txt”, not “Robots.TXT”.

What to put in it

The “/robots.txt” file is a text file, with one or more records. It usually contains a single record looking like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded.

Note that you need a separate “Disallow” line for every URL prefix you want to exclude — you cannot say “Disallow: /cgi-bin/ /tmp/” on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.
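
To make the record structure concrete, here is a minimal Python sketch of a parser that splits a file into records at blank lines and collects one value per “Disallow” line. It is an illustration under simplifying assumptions, not a complete implementation of the protocol; parse_records and the dictionary layout are hypothetical choices.

```python
def parse_records(text: str) -> list[dict[str, list[str]]]:
    """Split robots.txt text into records delimited by blank lines.

    Each record is a dict with "user-agent" and "disallow" lists.
    Illustrative only: unknown fields are skipped.
    """
    records = []
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            current = None  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # comment-only line, does not end the record
        field, sep, value = line.partition(":")
        field = field.strip().lower()
        if not sep or field not in ("user-agent", "disallow"):
            continue
        if current is None:
            current = {"user-agent": [], "disallow": []}
            records.append(current)
        current[field].append(value.strip())
    return records


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /tmp/\n"
    print(parse_records(sample))
```

Note that a line such as “Disallow: /cgi-bin/ /tmp/” would come out of this sketch as the single value “/cgi-bin/ /tmp/”, which matches neither directory.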

Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
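
The matching rule described here amounts to a literal prefix comparison. The following Python sketch shows that behaviour under the rules as stated on this page; is_disallowed is a hypothetical helper, and the paths are made up for illustration.

```python
def is_disallowed(path: str, disallow_prefixes: list[str]) -> bool:
    """Return True if `path` starts with any non-empty Disallow prefix.

    Prefixes are compared literally: '*' has no special meaning here.
    An empty prefix (an empty "Disallow:" line) excludes nothing.
    """
    return any(
        bool(prefix) and path.startswith(prefix)
        for prefix in disallow_prefixes
    )


if __name__ == "__main__":
    print(is_disallowed("/tmp/cache.html", ["/tmp/"]))   # literal prefix
    print(is_disallowed("/tmp/cache.html", ["/tmp/*"]))  # '*' is not a wildcard
```

Under this rule, “Disallow: /tmp/*” only matches paths that literally begin with “/tmp/*”, which is almost never what was intended.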

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

To exclude all robots from the entire server
User-agent: *
Disallow: /

To allow all robots complete access
User-agent: *
Disallow:

(or just create an empty “/robots.txt” file, or don’t use one at all)

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

To exclude a single robot
User-agent: BadBot
Disallow: /

To allow a single robot
User-agent: Google
Disallow:

User-agent: *
Disallow: /

To exclude all files except one

This is currently a bit awkward, as there is no “Allow” field. The easy way is to put all files to be disallowed into a separate directory, say “stuff”, and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/stuff/

Alternatively you can explicitly disallow all disallowed pages:

User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
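
One way to sanity-check a recipe like the “stuff” directory approach is to feed it to Python's urllib.robotparser and ask about a few URLs. This is a sketch only: the robot name "ExampleBot", the helper check_urls, and the file names index.html and notes.html are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The "separate directory" recipe from above.
RECIPE = """\
User-agent: *
Disallow: /~joe/stuff/
"""


def check_urls(robots_txt: str, user_agent: str,
               urls: list[str]) -> dict[str, bool]:
    """Map each URL to True (may fetch) or False (excluded)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the rules.
    results = check_urls(RECIPE, "ExampleBot", [
        "http://www.example.com/~joe/index.html",
        "http://www.example.com/~joe/stuff/notes.html",
    ])
    for url, allowed in results.items():
        print(url, "allowed" if allowed else "excluded")
```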

How do I prevent robots scanning my site?

The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server:

User-agent: *
Disallow: /

but this only helps with well-behaved robots.


You Can Even Get a School Diploma Online

A growing number of students are opting to get online diplomas. Online school diploma programs offer clear benefits, flexibility among them. However, many students struggle with them, and one common difficulty is having the discipline to manage their own time and schedule. How do these online programs compare with conventional schools? And how do companies and colleges feel about students getting a diploma online?

The majority of online diploma programs are recognized. In fact, numerous online programs have the same accreditation as most brick-and-mortar schools, and their diplomas can be used for most college admissions. As long as the online school is correctly certified, online diplomas are no different from those provided by conventional schools.

Online school diplomas can be used to gain better employment or to advance in your current employment. In most cases, graduates do not have to disclose that they gained their diploma through an online course. Many who study online still attend a conventional school and take additional courses online, either to get a more rounded education or to get into the college courses they want. They take the online courses to make up credits or improve their GPAs.

Adults can also register in online diploma programs. A number of online schools now offer fast-track options for adult students who need a diploma in leadership or business to advance their careers or to start a new one. You can even get a diploma in something as specialized as 3D visualisation, with a view to working as an artist for a leading Hollywood studio.

Student loans are offered to help students pay tuition, and expenses for online independent schools can accumulate very quickly. You would need to check in your state to see what government loans are available for students who wish to study online. One of the main benefits of studying online is that you can set hours that suit you. As with any online education environment, students can log in during school hours and “chat” with trainers online should they need clarification on any aspect of the course.

So how do online schools compare with conventional brick-and-mortar schools? Obviously, doing a course online gives you little opportunity to interact with your fellow students. But if you are OK with that, a diploma gained through an online course is just as recognized as one gained the conventional way and doesn’t hamper your chances of getting into the college or job of your choice, provided you pass, that is.



You Can Learn Anything on the Web

Back in the day when I was thinking about setting up my own business, I didn’t know a lot about how to run one. This is perhaps going back over 10 years now. But nearly everything I needed to learn, I learned from the Web. I signed up for a business course online. For extra research, I found out how to handle and manage finances, how to set up a business entity, how to get good finance deals and how to find the best staff. It is safe to say that I would not have started any form of business if it had not been for the vast knowledge base found on the Internet.

As intriguing as the Web is from an educational viewpoint, some things have to be seen to be learned. This is where YouTube is an invaluable resource. I have learned so much about business through watching online tutorials, and the best part is that it is all free. I have also used YouTube for countless instructional videos on other things, such as new techniques and ways of playing guitar. Recently, I worked my way through an entire series on how to play jazz guitar, all from YouTube.

It truly is fascinating when you think about it. The way the Internet has made the do-it-yourself movement possible will have a profound influence on our future. As a source of education, the Internet has the power to enable generations of people in ways never seen before. Interestingly, it will be mobile phones, tablets and the Internet that will empower these future generations. On these devices, you can watch YouTube videos on just about any subject, from how to train your dog to how to get a business diploma in just about any country and any discipline of business.

Over the last decade, the Web has transformed the developed world in ways never thought of. It has changed how we interact, how we discover, how we play, how we work, and how we are amused. All these things and more will certainly undergo even more extreme change.

When developing nations get their hands on the profound power of having the Web in their pocket, it will not just change how they work, play and find information; it will dramatically transform it.

You really can learn anything on the Web. When you couple that simple fact with the large international uptake of smartphones and tablets, it is easy to see how future generations will be empowered in ways never seen before.

For more information about this article, feel free to contact us.

Understanding What The World Wide Web Is

The Internet is a worldwide, publicly accessible series of interconnected computer networks that transmit data by packet switching using the standard Internet Protocol (IP). But how did it become a phenomenon that is so popular and widely used worldwide? Was it always so huge, full of information about almost anything you could think of, and easily accessible from nearly anywhere at any time? The simple answer is no, and it helps to understand where it all came from in order to use it to its full potential.

The Internet grew out of military research. After the Soviet Union launched Sputnik, the United States created the Advanced Research Projects Agency (ARPA) in 1958 to regain the lead in technology. ARPA went on to further the research of the “Semi Automatic Ground Environment” (SAGE) program, which had networked country-wide radar systems together for the very first time. J.C.R. Licklider was chosen to head this effort, and he envisioned universal networking as a unifying human revolution.

Putting aside the complex physical connections that make up its infrastructure, the Internet is facilitated by bilateral or multilateral commercial contracts as well as technical specifications or protocols that describe how to exchange data over the network.

The Internet Corporation for Assigned Names and Numbers (ICANN) is the authority that coordinates the assignment of unique identifiers on the Internet, including Internet Protocol (IP) addresses, domain names, and protocol port and parameter numbers. Whenever you say you are “on the Web”, you are using the Internet. Likewise, when you are browsing from page to page, you are moving through the World Wide Web.

Once you understand the concept, it’s not that scary after all.