Setting up a mirror site for E-JC

by Brendan McKay

This page describes how to set up a mirror site for E-JC on a UNIX computer (including Linux).

Naturally, you need a computer with an HTTP server running, a reasonable amount of disk space (about 400MB suffices as of March 2002, but allow at least 600MB for expansion), and a base directory for E-JC in a place where the HTTP server can see it.
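
If you want to check the disk space before you start, something like the following sketch may help. The function name has_space is my own invention for illustration, and I am assuming a df that supports the POSIX -P and -k options.

```shell
#!/bin/sh
# Sketch: succeed if the filesystem holding directory $1 has at least $2 MB free.
# has_space is a hypothetical helper name, not part of the E-JC scripts.
has_space() {
    dir=$1
    need_mb=$2
    # df -P prints one header line, then one line per filesystem;
    # with -k the fourth field is the free space in 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_mb * 1024)) ]
}

# Example: check the current directory against the 600MB suggested above.
if has_space . 600; then
    echo "enough room for the mirror"
else
    echo "less than 600MB free"
fi
```

Run it from the directory that will become your E-JC base directory, or pass that directory instead of the dot.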

To explain directories a bit more: When I am logged into the computer where my mirror is, the base directory for E-JC is called /cs/pub/publications/eljc, but when people use their Web browser to look in from the outside it appears to be called /publications/eljc. That is because our http server has been told that /cs/pub is the root directory for Web access.

In my description below, I will use /LOCALDIR to mean the local name of the directory (such as /cs/pub/publications/eljc), and /HTTPDIR to mean the http name (such as /publications/eljc).

You have to substitute the values for your own site wherever I mention those two names.
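
To make the relationship between the two names concrete, here is a small sketch that derives the http name from the local name by removing the server's root directory. The function name http_name is hypothetical, and the example paths are the ones from my own site.

```shell
#!/bin/sh
# Sketch: print the http name of a directory, given the web server's
# root directory ($1) and the local name of the directory ($2).
# http_name is a hypothetical helper name used only for illustration.
http_name() {
    docroot=$1
    localdir=$2
    case "$localdir" in
        "$docroot"/*)
            # Strip the server root from the front of the local name.
            printf '%s\n' "${localdir#"$docroot"}"
            ;;
        *)
            echo "error: $localdir is not below $docroot" >&2
            return 1
            ;;
    esac
}

http_name /cs/pub /cs/pub/publications/eljc
# prints /publications/eljc
```

If the function reports an error, the directory is not below the server root, and the http server probably cannot see it.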

If you want other people to use your mirror, send a note to our Managing Editor and we will advertise it for you.

 

You have two choices for collecting the E-JC files: HTTP and FTP. The first is recommended, but we will describe both.

Mirroring via HTTP

The recommended tool for this is wget. If you don't already have it, you can fetch it from ftp://prep.ai.mit.edu/pub/gnu/wget. Installing it on most UNIX systems is very simple: just unpack the archive, type ./configure, then type make. You might want to ask your system manager to install it in a standard place.

The process of using wget is very simple:

  1. Fetch the file wgetejc and make it executable (chmod +x wgetejc).
  2. Edit wgetejc to replace the string "/LOCALDIR" by the name of your E-JC base directory.
  3. If your access to the internet must be via a local proxy server, create a file .wgetrc (including the dot) in your home directory, containing lines like these:
    
          proxy = on
          http_proxy = my.proxy.com
          proxy_user = my-proxy-username
          proxy_passwd = my-proxy-password
      
    Obviously, you have to set those variables to the correct values for your site. If your proxy server doesn't need a username or password, leave out the last two lines. If you can access the internet directly (without going through a proxy server), don't make .wgetrc at all.
  4. Now you can just execute wgetejc to start collecting files from the E-JC main site. The first run will take a very long time, possibly several hours, because there are many files.
  5. After the first time, executing wgetejc will only collect the files that are new or changed, but since it must ask for the modification time of every file it will still take an hour or so. A log of the downloads will appear in /LOCALDIR/wget.log.
  6. To make fetching of new files automatic, you can arrange for wgetejc to be automatically executed every night. For example, the line
    25 2 * * * (date; /LOCALDIR/wgetejc) >>getem.log 2>&1
    in your crontab (see crontab(1)) will cause wgetejc to be run at 2:25am each night, with the file getem.log in your home directory receiving any error messages.
  7. The alternative script wgetejc8 will only update the contents of Volume 8, in case you want to do that more often.
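
Steps 1 and 2 can be done in one go with a sketch like the one below. It only performs the substitution and the chmod described above. The function name install_script is hypothetical, and it assumes your directory name contains no | or & characters, which would confuse sed.

```shell
#!/bin/sh
# Sketch: replace the /LOCALDIR placeholder in a script ($1) with the
# real base directory ($2), then make the script executable.
# install_script is a hypothetical helper name.
install_script() {
    script=$1
    localdir=$2
    tmp="$script.tmp.$$"
    # Use | as the sed delimiter because directory names contain slashes.
    sed "s|/LOCALDIR|$localdir|g" "$script" > "$tmp" &&
        mv "$tmp" "$script" &&
        chmod +x "$script"
}

# Example (the directory is the one at my site; substitute your own):
# install_script wgetejc /cs/pub/publications/eljc
```

The same helper works for wgetejc8 if you decide to use it as well.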

Mirroring via FTP

An alternative is to use FTP to collect the E-JC files. This is more complicated to set up but has the advantage of being quite a lot faster than wget. [However, if you are running the mirror software overnight, who cares how long it takes?] The method I will describe uses a clever Perl script written by Leo Novik of the Weizmann Institute, Israel.

You need the program perl, but these days there is barely a UNIX system without it.

Here goes...

  1. Go to /LOCALDIR.
  2. Fetch the script update.pl, and rename it as getem.pl. Check that the location for Perl that appears on the first line is correct. The UNIX command "which perl" might tell you where Perl is.
  3. Create a shell script getem like this:
    
      #!/bin/sh 
    
      cd /LOCALDIR
      cp timestamp timestamp_save
      ./getem.pl ftp.combinatorics.org /pub/ejc/Journal -stamp timestamp \
            get /LOCALDIR
      if grep -q 1900 timestamp ; then
        mv timestamp_save timestamp
        echo "replacing timestamp with previous version"
      fi
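      # Make directories searchable and files readable by everyone.
      # The getem scripts are skipped so that they stay executable.
      find . -name 'getem*' -o -type d -exec chmod 755 {} \; -o -type f -exec chmod 644 {} \;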
      
  4. Create a file timestamp containing these six lines:
    
      1990
      Jan
      1
      12:00
      http:/Journal
      http:/HTTPDIR
      
  5. Now you should have three files, getem.pl, getem and timestamp. Make sure the first two are executable (chmod +x getem.pl getem).
  6. Execute getem and wait.
  7. Keep waiting.
  8. Unless something is wrong, this will copy all of the files from the master site in Pennsylvania to your machine. If you are far away from Pennsylvania, it might take you hours. And hours. Fortunately, this only has to be done once.
  9. When it finally finishes, you should probably test it by viewing the mirror in a Web browser.
  10. Arrange for getem to be executed periodically. It will never take as long as the first time, but will just copy over any new stuff. What you need to do is rather system dependent; I did it by putting this entry in my crontab:
    21 8,21 * * * (date; /LOCALDIR/getem) >>getem.log 2>&1
    A log appears in the file getem.log in my home directory.
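
On systems with the usual crontab command, an entry like the one above can be added without disturbing existing entries. The following is only a sketch: merge_cron_entry is a hypothetical helper name, and the final pipeline is left commented out because it changes your real crontab.

```shell
#!/bin/sh
# Sketch: read an existing crontab on standard input and print it with
# the entry $1 appended, unless an identical line is already present.
# merge_cron_entry is a hypothetical helper name.
merge_cron_entry() {
    entry=$1
    existing=$(cat)
    if printf '%s\n' "$existing" | grep -qxF -- "$entry"; then
        # The entry is already there: pass the crontab through unchanged.
        printf '%s\n' "$existing"
    elif [ -n "$existing" ]; then
        printf '%s\n%s\n' "$existing" "$entry"
    else
        printf '%s\n' "$entry"
    fi
}

# Example, with /LOCALDIR replaced by your own base directory:
# crontab -l 2>/dev/null |
#     merge_cron_entry '21 8,21 * * * (date; /LOCALDIR/getem) >>getem.log 2>&1' |
#     crontab -
```

Because the helper ignores an entry that is already present, running it twice does no harm.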

    A tiny bit of explanation.

    The first four lines of timestamp contain a date and time in the timezone of the master site in Pennsylvania. What getem.pl does is to connect to Pennsylvania by FTP and fetch any file whose creation time is later than that. Then getem.pl edits timestamp to contain the creation time of the most recent file it copied. The last two lines in timestamp tell getem.pl how to edit html files so that http addresses valid at Pennsylvania will be valid at your site instead. (We attempt to avoid site-specific addresses anyway.)

    If something goes wrong, for example FTP times out during a file transfer, you can always get back on track by manually setting back the date in timestamp.
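
Setting back the date by hand just means editing the first four lines of timestamp. If you prefer a command, here is a sketch that rewrites those four lines and leaves the two address lines alone. The function name reset_timestamp is hypothetical, and it assumes timestamp has the six-line layout described above.

```shell
#!/bin/sh
# Sketch: set the date in a timestamp file ($1) back to the given
# year ($2), month ($3), day ($4) and time ($5), keeping lines 5 and 6.
# reset_timestamp is a hypothetical helper name.
reset_timestamp() {
    file=$1
    year=$2
    month=$3
    day=$4
    time=$5
    tmp="$file.tmp.$$"
    {
        # The first four lines hold the date and time, one item per line.
        printf '%s\n' "$year" "$month" "$day" "$time"
        # Lines 5 and 6 hold the two http addresses; copy them unchanged.
        sed -n '5,6p' "$file"
    } > "$tmp" && mv "$tmp" "$file"
}

# Example: go back to the starting date so that everything is fetched again.
# reset_timestamp /LOCALDIR/timestamp 1990 Jan 1 12:00
```

Choose a date shortly before the transfer that failed if you do not want to fetch everything again.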

    Comments on this description are welcome. Happy mirroring!

    Brendan McKay. bdm@cs.anu.edu.au