OSIRIS Troubleshooting and Restart Procedures

This document describes the basic restart procedures for OSIRIS.  Begin with the simple restart if the data taking system hangs up. Reset the computers if all else fails.

The basic OSIRIS system consists of three PCs, the Sun data taking computer (ctioa1 or ctioa2), and the instrument itself. The instrument control software is called ICIMACS: Instrument Control and Image Acquisition System. The user interfaces with the instrument entirely from the Sun machine (ctioa1 or ctioa2) through the Prospero instrument control program (unless there is a problem; read on...).

In the console room are two other OSIRIS computers, both PCs. The first is the IC (Instrument Computer), the second the WC (Workstation Computer). Each of these PCs runs DOS and a single executable program. The IC runs a program called "IC" and the WC runs "WC". A third PC, the IE (Instrument Electronics), is onboard OSIRIS and handles low level motor control.

The IC passes commands to the instrument and acquires image data from the instrument via optical fibers. The WC handles communications between the PCs and the Sun and the PCs and the Telescope Control System (TCS).

Simple Restart

If Prospero hangs up or dies, choose "Restart Prospero" from the OpenWindows programs menu.  When the Prospero command window appears, type "startup" at the "PR>" prompt. You will see some messages as Prospero reconnects to the data taking PCs and instrument. Follow the instructions, answering any questions (usually with a carriage return, or enter).

If the simple restart won't work, try restarting the IC and WC.

Sometimes a simple restart won't get you up and running again. In this case, Prospero may give some indication of which part of the system is not up. On the top bar of the Prospero status window you will see "UP" or "DOWN" flags indicating the status of each part of the system. You may also receive an error message in the command window indicating a problem with a particular computer/disk in the data taking system. A common example is that the data disks are not synched: see -ICSYNC/-DNSYNC below.

If the IC is indicated as the problem, you may try to restart the IC program. Type "exit" at the IC keyboard. When you are back at the DOS prompt, type "IC". When the IC has started again, type UARTINIT to initialize the communication ports in the HE. Look for a message indicating that the IC is talking to the instrument ("PONG Received from IE"). You should also see +SEQUENCER on the IC status display.

Similar procedures can be used to restart the WC. First type "exit", then type "WC". The WC will begin a telnet session with the Sun for Prospero to communicate with the PCs.

If either the WC or IC is hung completely and not responding to keyboard input, you may reboot using Ctrl-Alt-Del. If both computers are hung up, reboot the IC first, then the WC.

If a soft reboot won't work, try cycling the power on the IC and WC. Turn off both computers. Then turn on the IC power. When the IC has finished rebooting, turn on the WC power. After the WC has rebooted, execute IC and WC on the respective computers.

Always follow a restart of the IC or WC with a "startup" on Prospero.
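
The escalation order described above can be summarized in a short Python sketch. This is only an illustration for planning purposes: the step names and the helper functions are hypothetical and are not part of Prospero or ICIMACS; the instructions are paraphrased from this document.

```python
# Illustrative only: step names and helpers are hypothetical; the instructions
# are paraphrased from the restart procedures in this document.
ESCALATION = [
    ("simple restart",
     'Choose "Restart Prospero", then type "startup" at the PR> prompt.'),
    ("restart program",
     'Type "exit" at the PC keyboard, then rerun IC or WC from the DOS prompt.'),
    ("soft reboot",
     "Press Ctrl-Alt-Del; if both PCs are hung, reboot the IC first, then the WC."),
    ("power cycle",
     "Turn off both PCs, power on the IC, wait for it to boot, then power on the WC."),
]


def next_step(failed_steps):
    """Return the first (name, instruction) pair not yet tried, or None."""
    for name, instruction in ESCALATION:
        if name not in failed_steps:
            return name, instruction
    return None


def needs_prospero_startup(step_name):
    """Any step that restarts the IC or WC must be followed by "startup"."""
    return step_name != "simple restart"
```

For example, next_step(["simple restart"]) returns the "restart program" step, and needs_prospero_startup("power cycle") is True because a restart of the IC or WC must always be followed by a "startup" on Prospero.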

Restarting the IE

If the instrument itself is not responding, the IE will have to have its power cycled at the instrument. Go up to the telescope and cycle the power on both of the large metal boxes attached to the side of the instrument. There is a switch near the power cord of each box. The IE will reboot on power-up and start its program automatically (ieosiris.exe). It will take up to a minute or so for the IE to reboot. If the IE doesn't appear to come up and start ieosiris properly (i.e., you still can't talk to the instrument and Prospero claims it can't find the IE), a monitor and keyboard can be attached directly to the IE at the telescope so that the boot messages can be seen and ieosiris.exe can be executed manually (the IE is the gray colored box).

If the IC cannot communicate with the IE, check that the flat "phone" cable between the HE and IE is connected. You can also try swapping this cable to another port on the HE (don't use port 1 until further notice). The cable must always be connected to COM1 on the IE. There could also be a fiber transmission problem (see below).

A HOSTS command on the IC will definitively show whether communication from the IC is reaching the IE.

There are four mechanisms which have relative positions in the instrument (xpupil, ypupil, camfocus, grattilt) which must be reset or zeroed after an IE power cycle. Always follow a restart of the IE with a reset of the xpupil, ypupil, and camfocus mechanisms. See "Resetting Mechanisms" below.

-ICSYNC/-DNSYNC

Sometimes when the full system comes back up, the shared SCSI disks between the WC and IC and/or between the WC and Sun will not be synched. When you type "startup" in the Prospero command window, the system will, for example, complain that "the WC and IC disks are not synched. Image storage will not work until this is done."

Prospero will issue a line telling you to type "WC RESTART" at the Prospero prompt. You can try this once: the SYNC flags on the WC should go from - to + and a startup on Prospero should be successful. If not, type >IC req initdisk at the WC keyboard. This should force the disks to synch. Again, finish by typing startup at the Prospero prompt.

An analogous problem can happen with the shared disks between the WC and Sun. These disks are synched by the Caliban daemon. As long as Caliban is up and running, the DNSYNC flag on the WC should be +. If not, it is usually best to cycle the power on (or reboot) the WC and restart Caliban on the Sun. However, the command >CB req initdisk can be issued at the WC to try to force synching of the shared disks. A startup on Prospero should be done if the initdisk was successful.
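
The flag logic in this section can be sketched in Python as follows. Only the flag names (ICSYNC, DNSYNC) and the recovery commands come from this document; the helper functions are hypothetical, and the assumption that the flags can be read as whitespace-separated tokens such as "-ICSYNC +DNSYNC" is made for illustration only.

```python
# Hypothetical helper: the status-line format is an assumption. Only the flag
# names and the recovery commands are taken from this document.
RECOVERY = {
    "ICSYNC": [
        "WC RESTART (at the Prospero prompt)",
        ">IC req initdisk (at the WC keyboard)",
    ],
    "DNSYNC": [
        "reboot the WC and restart Caliban on the Sun",
        ">CB req initdisk (at the WC keyboard)",
    ],
}


def parse_flags(status_line):
    """Map each flag name to True for a '+' prefix or False for a '-' prefix."""
    flags = {}
    for token in status_line.split():
        if len(token) > 1 and token[0] in "+-":
            flags[token[1:]] = token[0] == "+"
    return flags


def recovery_plan(status_line):
    """List recovery actions, in order, for every sync flag that reads '-'."""
    flags = parse_flags(status_line)
    plan = []
    for name, actions in RECOVERY.items():
        if flags.get(name) is False:
            plan.extend(actions)
    if plan:
        # Every successful recovery is finished with a startup on Prospero.
        plan.append("startup (at the Prospero prompt)")
    return plan
```

The actions for each flag are listed in the order they should be tried: move to the second only if the first does not bring the flag back to +. If both flags already read +, recovery_plan returns an empty list.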

Restarting Caliban

Caliban is the Unix program running on the Sun machine which handles data transfer from the PCs via two SCSI disks which act as image buffers. Caliban communicates with the PCs via a serial cable in order to "sync" the transfer of image data into and out of the SCSI buffer disks and onto the Sun disks. When you log in as the osiris user, Caliban should automatically start up in an iconified xterm on the Sun console. If Caliban dies, try restarting it from the OpenWindows "Programs" menu.

When starting up Prospero, you should see a message stating that Caliban is running and the data disks are synched. If Prospero complains that the data disks are not synched, follow the instructions printed in the command window. If all else fails, try to synch the disks by restarting the WC and IC as described above.

If you fill up a sunos disk...

Go to bed, game over... If you are too stubborn to quit, go into the computer room and cycle the power on the IC and WC (then follow the instructions above). This failure mode is insidious: Caliban will continuously try to transfer the image which filled the disk, so although it may appear from Prospero that the system is taking data (and it is), the images will never make it to the Sun machine. You must cycle the power after filling up a sunos disk.

Talking to the TCS

If Prospero can't send commands to or receive information from the TCS (note the TCS status indicator in the Prospero status window), try restarting the TCS communications at the WC. Type TCINIT on the WC keyboard. A message from the WC, "Telescope Controller Not Responding", usually means trouble at the TCS. The electronics technicians should be called if this is the case (the TCS may need to be rebooted).

22 July 2000. As of this date, it may be that communications between the TCS and WC on the 4m can become unstable. The basic problem is a delay in the response from the TCS to requests from the WC and the fact that the WC does not ignore late responses from the TCS. This tends to cut the com link and produce many errors displayed on the WC screen. We believe this problem has been solved by making sure that the TCS router is not used with OSIRIS. This reduces the load on the TCS (Heurikon) and hence it is not late in responding to the WC. If the problem recurs even with the TCS router off, one can increase the TCPPORTDELAY in the wc.ini file on the WC to 3 seconds. In this case, the system will be stable, but note that the delay is long enough that the TCS header info is never received by the IC computer (and so won't make it to the image header). After modifying the wc.ini file, one has to restart the WC and execute a startup on Prospero.

Resetting Mechanisms

Several of the motors inside the dewar need to be reset, or sent to a "home" position after powering up the IE. These are the xpupil, ypupil, camfocus, and grating tilt. The Prospero command ie 'reset' will accomplish the reset for all motors. If the prompt does not come back, type a single ^C after several minutes have passed. Individual motors can be reset, e.g., ie 'reset grattilt'.
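
As a rough illustration of the command forms above, the following Python sketch builds the Prospero reset command strings. The helper is hypothetical. The mechanism names come from this document, but only ie 'reset' and ie 'reset grattilt' are shown explicitly in the text; that the other mechanisms use the same form is an assumption.

```python
# Hypothetical helper: it only builds command strings to type at the PR> prompt.
# The per-mechanism form for names other than grattilt is assumed by analogy.
MECHANISMS = ("xpupil", "ypupil", "camfocus", "grattilt")


def reset_commands(mechanisms=None):
    """Return the Prospero command(s) to reset all or selected mechanisms."""
    if not mechanisms:
        return ["ie 'reset'"]  # resets all motors in one command
    unknown = [m for m in mechanisms if m not in MECHANISMS]
    if unknown:
        raise ValueError("unknown mechanism(s): " + ", ".join(unknown))
    return ["ie 'reset {}'".format(m) for m in mechanisms]
```

Calling reset_commands() gives the single command that resets all motors, while reset_commands(["grattilt"]) gives the individual form shown above.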

Fiber transmission problems

If the fiber transmission is low, communication between the IC and HE (and with the IE) will be interrupted. If you are sure the IE is up and running and connected properly to the HE, but you still cannot communicate with it, then there could be a fiber transmission problem. Have the electronics technicians perform a transmission test of the fibers or verify that one was done on setup. The fiber signal may be too weak even if light can be seen in the fiber.

Corrupted images can also be caused by bad or dirty fiber connections. If corrupted images occur, inspect the fiber connections at the IC and the HE. Small "dust balls" can "grow" at the connections and foul the signal transmission.