URLs, HTTP, Encoding, Decoding


To understand CGI input, we first need to cover a few key concepts.

URL

URLs are familiar to any web surfer, and while their syntax can seem cryptic, there is a structure behind it. A URL is a Uniform Resource Locator: a way of locating a resource on the Internet and identifying how to retrieve it. Each part of a URL has a specific purpose, but some of the components are optional. If the protocol is not specified, web browsers default to HTTP; if the port is not specified, they default to port 80; and a URL need not have a query string.

<protocol>://<host>:<port>/<URI>?<query string>

For example:

http://www.myserver.com:80/cgi-bin/lookup.cgi?name=Andy
1^^^   2^^^^^^^^^^^^^^^ 3^ 4^^^^^^^^^^^^^^^^^ 5^^^^^^^^
  1. http is the protocol
  2. www.myserver.com is the server name
  3. the server is listening on port 80
  4. /cgi-bin/lookup.cgi is the URI
  5. name=Andy is the query string
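
As a rough illustration, the same breakdown can be reproduced with Python's standard urllib.parse module. This is only a sketch: the variable names url and parts are made up for the example, and Python calls the pieces scheme, hostname, port, path and query rather than using the names in the list above.

```python
from urllib.parse import urlsplit

# The example URL from the text above.
url = "http://www.myserver.com:80/cgi-bin/lookup.cgi?name=Andy"
parts = urlsplit(url)

print(parts.scheme)    # 1. the protocol:     http
print(parts.hostname)  # 2. the server name:  www.myserver.com
print(parts.port)      # 3. the port:         80
print(parts.path)      # 4. the URI:          /cgi-bin/lookup.cgi
print(parts.query)     # 5. the query string: name=Andy
```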

HTTP Protocol

Hypertext Transfer Protocol (HTTP) is the underlying protocol used by web browsers and web servers to communicate. A web browser may understand other protocols for its URLs, such as FTP, NNTP, file, or Telnet, but a web server speaks HTTP. An HTTP command is called a "method", and there are two methods that we care about here: GET and POST.

A GET or POST method is a request for information at a specific location (URI) on the server. This URI can specify a static page, or it can be the output of a program or CGI script. GET and POST both request output from a server, but they differ in how they pass input data (if any) to the web server: GET appends the data to the URL as the query string, while POST sends it in the body of the request.

HTTP is a plain-text protocol, so an HTTP request can be sent to a web server by hand: connect to port 80 on the server with Telnet and type the request in.
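
To make the difference between GET and POST concrete, here is a minimal Python sketch that builds the text of each kind of request, roughly what one might type into a Telnet session. The function name build_request is hypothetical, the requests use HTTP/1.0 for simplicity, and only the headers needed for the illustration are included; a real browser sends many more.

```python
def build_request(method, host, uri, data=""):
    """Return the text of a minimal HTTP/1.0 request (illustrative only)."""
    if method == "GET":
        # GET carries the input in the URL itself, after a "?".
        target = uri + ("?" + data if data else "")
        lines = [f"GET {target} HTTP/1.0", f"Host: {host}", "", ""]
        return "\r\n".join(lines)
    if method == "POST":
        # POST carries the input in the request body, after a blank line.
        lines = [
            f"POST {uri} HTTP/1.0",
            f"Host: {host}",
            "Content-Type: application/x-www-form-urlencoded",
            f"Content-Length: {len(data.encode('ascii'))}",
            "",
            data,
        ]
        return "\r\n".join(lines)
    raise ValueError("only GET and POST are handled in this sketch")


get_text = build_request("GET", "www.myserver.com", "/cgi-bin/lookup.cgi", "name=Andy")
post_text = build_request("POST", "www.myserver.com", "/cgi-bin/lookup.cgi", "name=Andy")
print(get_text)
print(post_text)
```

In the GET request the data is part of the first line; in the POST request the first line names only the URI, and the data follows the headers.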

Encoding

Whether data is sent to a web server via GET or POST, the data is encoded so that it contains no spaces or other characters that are hard to parse. Spaces are replaced, and other non-alphanumeric characters are represented by a percent sign followed by the hex value of the character. This sounds obscure, so an example is probably the best way to demonstrate. If a form were to send a user name and password to a CGI program, the data that it wanted to send might look like this:

name="Andy"
password="Not Really!"

The problem with sending this data in a URL is that the spaces and special characters are confusing to parse. What is sent instead is this:

name=Andy&password=Not+Really%21

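The rules for encoding are these:

  1. Each field is sent as a name=value pair.
  2. Pairs are separated by an ampersand (&).
  3. Spaces are replaced with a plus sign (+).
  4. Most other non-alphanumeric characters are replaced with a percent sign (%) followed by the two-digit hex value of the character, so "!" becomes %21.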
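
As a sketch of this encoding in practice, Python's standard urllib.parse module can reproduce the string shown above. The variable names fields and encoded are made up for the example, and a browser submitting a form does the equivalent work itself.

```python
from urllib.parse import quote_plus, urlencode

# The form data from the example above.
fields = {"name": "Andy", "password": "Not Really!"}

# urlencode joins name=value pairs with "&" and encodes each value.
encoded = urlencode(fields)
print(encoded)  # name=Andy&password=Not+Really%21

# quote_plus encodes a single value: space becomes "+", "!" becomes "%21".
print(quote_plus("Not Really!"))  # Not+Really%21

# The hex value comes straight from the character code of "!".
print("%{:02X}".format(ord("!")))  # %21
```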

Decoding

Decoding user input involves reversing the encoding done by the browser and also cleaning up the user input strings for security. To decode user input, a CGI program will:

  1. Determine which HTTP method was used, GET or POST, and obtain the user input either from the QUERY_STRING environment variable (GET) or from standard input, STDIN (POST).
  2. Split the input string into separate variables.
  3. Split each variable name from its value.
  4. Decode the transformed spaces and non-alphanumeric characters.
  5. Strip any unsafe special characters from the user input (for example, backticks).
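
The steps above can be sketched in Python as follows. This is a minimal illustration, not a hardened implementation: the function names read_raw_input and decode_input are hypothetical, the UNSAFE_CHARS list is an assumed example rather than a complete set, and the environment and input stream are passed in as arguments so the sketch can be run outside a web server. A real CGI program would pass os.environ and sys.stdin.

```python
import io
from urllib.parse import unquote_plus

# Hypothetical list of characters to strip; a real program should choose
# its own list based on how the input will be used.
UNSAFE_CHARS = "`;|<>$"


def read_raw_input(environ, stdin):
    """Step 1: pick the input source based on the HTTP method."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # POST data arrives on standard input; CONTENT_LENGTH says how much.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        return stdin.read(length)
    # GET data arrives in the QUERY_STRING environment variable.
    return environ.get("QUERY_STRING", "")


def decode_input(raw):
    """Steps 2 to 5: split, decode, and strip unsafe characters."""
    fields = {}
    for pair in raw.split("&"):                # step 2: separate variables
        if not pair:
            continue
        name, _, value = pair.partition("=")   # step 3: name from value
        name = unquote_plus(name)              # step 4: undo "+" and %XX
        value = unquote_plus(value)
        # Step 5: drop unsafe characters from the decoded value.
        value = "".join(ch for ch in value if ch not in UNSAFE_CHARS)
        fields[name] = value
    return fields


# A GET request: the data is in QUERY_STRING.
get_env = {"REQUEST_METHOD": "GET",
           "QUERY_STRING": "name=Andy&password=Not+Really%21"}
print(decode_input(read_raw_input(get_env, io.StringIO(""))))

# A POST request: the data is read from the input stream.
body = "cmd=%60ls%60&x=a+b"
post_env = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(body))}
print(decode_input(read_raw_input(post_env, io.StringIO(body))))
```

Note that the unsafe characters are stripped after decoding, so an encoded backtick (%60) is removed as well.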


Copyright 2001 - Andy Welter