Open Sitemap Generator - Documentation
Version 0.6 "Feed-the-Index"
Last updated on 07/05/2007
Generation Process
These are the simple steps required to make a sitemap.
- If you're using the URL substitution (see fig.1), write the local URL and then the real site URL that will replace the local URL in the sitemap.
- If you want to crawl the site directly, uncheck the URL substitution checkbox and enter the site URL (see fig.2).
- If your start URL is a subfolder, the crawler will only descend into that folder. If your starting URL is http://www.mydomain/testme/, the crawler will add http://www.mydomain/testme/first.html and http://www.mydomain/testme/again/second.html but it will not leave the "testme" folder, so http://www.mydomain/doyoucrawlme.html will not be added.
- If you need to start from a single page and not from the base folder, uncheck the "Starting from a folder" checkbox.
- Choose the full path for the sitemap file.
- When a sitemap is created successfully, there will be two files: yoursitemap.xml and yoursitemap.xml.gz, which is the gzipped version of the first one. There will be more files if an index has been used. Use the "List" button to see the files you need to upload to the server.
- If there are URLs with errors, an error log file named yoursitemap.xml.error.log will be written to the chosen output folder.
Fig.1: Using the URL substitution.
Fig.2: Crawling the real site directly.
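The folder rule and the URL substitution described above can be sketched in a few lines of Python. This is only an illustration of the behaviour, not OSG's actual code: the function names in_scope and substitute, and the local address http://localhost/mysite/, are assumptions made for the example.

```python
from urllib.parse import urlsplit

def in_scope(start_url: str, candidate: str) -> bool:
    """True if candidate is on the same host and inside the start URL's folder."""
    start, cand = urlsplit(start_url), urlsplit(candidate)
    if (start.scheme, start.netloc) != (cand.scheme, cand.netloc):
        return False
    # Folder of the start URL: everything up to and including the last "/".
    folder = start.path[: start.path.rfind("/") + 1] or "/"
    return (cand.path or "/").startswith(folder)

def substitute(url: str, local_base: str, real_base: str) -> str:
    """Replace the local prefix with the real site URL (assumed behaviour)."""
    if url.startswith(local_base):
        return real_base + url[len(local_base):]
    return url

start = "http://www.mydomain/testme/"
print(in_scope(start, "http://www.mydomain/testme/again/second.html"))
print(in_scope(start, "http://www.mydomain/doyoucrawlme.html"))
print(substitute("http://localhost/mysite/testme/first.html",
                 "http://localhost/mysite/", "http://www.mydomain/"))
```

With the "testme" start URL, the first call is in scope and the second is not, which matches the example given in the steps above.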
- Set the right options.
- Here you can choose what to show in your sitemap for each URL. You can choose to show nothing more than the URL.
- You can specify one set of values for the base URL (the URL from which crawling starts) and another for all the other URLs.

Fig.3: Sitemap options.
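To show what the per-URL options end up as in the file, here is a minimal Python sketch that writes sitemap entries with optional fields. It assumes the optional fields are the standard sitemap tags lastmod, changefreq and priority, and it uses the sitemaps.org 0.9 namespace; the exact options and namespace that OSG 0.6 writes are not stated in this document. The function name build_sitemap is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: sitemaps.org 0.9 namespace (OSG may write a different one).
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Build a sitemap string; each entry needs "loc", other tags are optional."""
    urlset = ET.Element("urlset", xmlns=NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        # Only write the optional tags that were actually chosen.
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")

print(build_sitemap([
    # Base URL with its own values.
    {"loc": "http://www.mydomain/", "changefreq": "daily", "priority": 1.0},
    # Another URL showing nothing more than the URL itself.
    {"loc": "http://www.mydomain/testme/first.html"},
]))
```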
- Set the index options.
- Set the base URL of the folder to which you'll upload your sitemap files, so OSG can write their absolute URLs in the index file.
- Choose if you want to add the last modified date to the indexed sitemap files.

Fig.4: Index options.
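The role of the base URL in the index can be illustrated with a short Python sketch that builds a sitemap index from a list of file names. This is a sketch under assumptions: the function name build_index and the file names yoursitemap1.xml.gz and yoursitemap2.xml.gz are hypothetical (this document does not say how OSG names the split files), and the namespace is the sitemaps.org 0.9 one.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Assumption: sitemaps.org 0.9 namespace (OSG may write a different one).
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_index(base_url, filenames, lastmod=None):
    """Build a sitemap index whose entries are base_url + file name."""
    root = ET.Element("sitemapindex", xmlns=NS)
    for name in filenames:
        sitemap = ET.SubElement(root, "sitemap")
        # Absolute URL of the uploaded file.
        ET.SubElement(sitemap, "loc").text = base_url.rstrip("/") + "/" + name
        # The last modified date is optional.
        if lastmod is not None:
            ET.SubElement(sitemap, "lastmod").text = lastmod.isoformat()
    return ET.tostring(root, encoding="unicode")

print(build_index("http://www.mydomain/sitemaps/",
                  ["yoursitemap1.xml.gz", "yoursitemap2.xml.gz"],
                  lastmod=date(2007, 5, 7)))
```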
- Set the advanced options.
- You should not have to change these parameters for a basic sitemap generation.
- The extensions to ignore are the ones that OSG will not add to the crawling queue and will not add to the sitemap (in a future version you will choose if you want to only add but not crawl those files).
The "Default" button sets the extensions list to "css js ico png jpg gif bmp tiff mpg wmv mp3 mpeg zip gzip rar exe".
- In the "URLs to ignore" field you can use a regular expression to specify which URLs must be ignored and not added to the sitemap (like the ones you've excluded in your robots.txt file).
- The crawl interval, in milliseconds, is the waiting time between two consecutive file requests from the crawler to the server.
- "Ignore URL in html comments" lets you choose not to parse links that are commented out in the page source.
- Finally, you can also configure a proxy, if needed.

Fig.5: Advanced options.
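As an illustration of how these filters could work together, here is a minimal Python sketch. It is not OSG's implementation: the function names are hypothetical, the regular expression is assumed to match anywhere in the URL, and extension matching is assumed to be case-insensitive. Only the default extension list is taken from the text above.

```python
import re
import time
from urllib.parse import urlsplit

# Default list taken from the "Default" button described above.
DEFAULT_EXTENSIONS = "css js ico png jpg gif bmp tiff mpg wmv mp3 mpeg zip gzip rar exe".split()

def is_ignored(url, extensions=DEFAULT_EXTENSIONS, ignore_pattern=None):
    """True if the URL should be neither crawled nor added to the sitemap."""
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    ext = last_segment.rsplit(".", 1)[-1].lower() if "." in last_segment else ""
    if ext in extensions:
        return True
    # Assumption: the pattern may match anywhere in the URL.
    return bool(ignore_pattern and re.search(ignore_pattern, url))

def strip_html_comments(html):
    """Drop commented-out markup so links inside comments are not parsed."""
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)

def wait_between_requests(interval_ms):
    """Pause for the crawl interval, given in milliseconds."""
    time.sleep(interval_ms / 1000)

print(is_ignored("http://www.mydomain/img/logo.PNG"))
print(is_ignored("http://www.mydomain/private/a.html", ignore_pattern=r"/private/"))
print(strip_html_comments('<a href="a.html">A</a><!-- <a href="b.html">B</a> -->'))
```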
- You can edit your sitemap file (or only the index file, if the sitemap has been split into several files).
- You can gzip your sitemap file (or only the index file, if the sitemap has been split into several files).
- You can view the paths of the files that you need to upload to your server.
- If there are URLs with errors, you can see them along with a description of each error.
- If present, (Using sitemap index) means that your sitemap has been split into more than one file.

Fig.6: After the work.
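If you edit a sitemap file outside OSG and need to produce the .gz version again by hand, a few lines of Python are enough. This is only a sketch with a hypothetical function name gzip_file; it follows the yoursitemap.xml / yoursitemap.xml.gz naming used in this document.

```python
import gzip
import shutil

def gzip_file(path):
    """Write a gzipped copy of path next to it and return the new path."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path

# Example (the file must already exist):
# gzip_file("yoursitemap.xml")  # writes yoursitemap.xml.gz
```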
Session management
The session settings (everything that you can modify while making a sitemap) can be stored in an .osm file (in XML format).
You can then load a session to create that sitemap again, without having to set everything up from scratch.
Requirements
To use this software you need the Microsoft .NET Framework 2.0.