INtime SDK Help
Distributed Systems Manager (DSM)

Overview

The Distributed System Manager (DSM) is a cooperative, multi-process application that manages a distributed INtime system. The DSM tracks the state of the system, monitors the health of its components, and cleans up when a component terminates or fails.

The DSM performs these tasks:

You can set up the DSM processes to communicate with one another via these avenues:

DSM Architecture

INtime applications use NTX and the DSM to communicate with RT clients as shown in this illustration:

  1. The Windows portion of the INtime application, located on the Windows host, makes NTX calls that communicate with RT clients.
  2. NTX accesses the DSM as needed to determine client state and process dependencies.
  3. NTX communicates with the RT portion of the INtime application, located on RT clients.

The DSM consists of the Windows DSM, a process that executes on the Windows host, and the RT DSM, a process that executes on each RT client. The bulk of the DSM work is performed on the Windows host. The processes cooperate to ensure the integrity of a distributed INtime system.

The RT DSM

The RT DSM includes:

The DSM process monitoring mechanism monitors processes with declared dependencies. That is, the DSM continuously checks for the existence of those processes that have dependencies logged in the database. If it finds that a process has been deleted, it notifies all of the terminated process's dependents as well as its sponsors. The mechanisms used to monitor processes are different on Windows and RT systems.

Windows side: Monitoring on the Windows host is done by one thread, which monitors all processes in one call of WaitForMultipleObjects. When one of the processes is deleted, the call returns, the DSM responds appropriately, and then returns to waiting on the remaining processes.
RT side: The RT DSM sets up a kernel task deletion handler which is notified whenever a task (thread) is deleted in the system. If the owner job (the thread's process) is scheduled to be deleted, the kernel task deletion handler forwards the job handle to the DSM for processing.
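
The Windows-side loop can be pictured with a small Python analogue. This is only a sketch under stated assumptions: concurrent.futures.wait with FIRST_COMPLETED stands in for WaitForMultipleObjects, worker threads stand in for the monitored processes, and monitor, on_exit, and run_demo are hypothetical names, not DSM functions.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def monitor(watched, on_exit):
    """Wait on every watched future at once and report each one as it ends.

    `watched` maps a future (standing in for a process handle) to a name.
    """
    pending = dict(watched)
    while pending:
        # Analogue of a single WaitForMultipleObjects call over all handles.
        done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
        for fut in done:
            # Respond to the exit, then go back to waiting on the rest.
            on_exit(pending.pop(fut))


def run_demo():
    """Watch three short-lived workers and return the names that exited."""
    exited = []
    with ThreadPoolExecutor(max_workers=3) as pool:
        watched = {
            pool.submit(time.sleep, 0.05): "proc_a",
            pool.submit(time.sleep, 0.20): "proc_b",
            pool.submit(time.sleep, 0.35): "proc_c",
        }
        monitor(watched, exited.append)
    return exited
```

The point of the single wait call is that one thread can cover every monitored process; after each wake-up, the set being waited on shrinks by the processes that have gone away.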

System monitoring is done by NTX; the DSM only provides an API for handling shutdown and failure scenarios.

The Windows DSM

The "active" and "passive" logical entities in the Windows DSM can be best understood when discussed together according to the responsibilities which they perform. The Windows DSM provides the following functionality:

The DSM dependency database

The Dependency Database stores all process-related dependencies, system information, and sponsor registrations. The active agents on both Windows and RT nodes monitor processes with dependencies, along with other responsibilities.

If those processes are deleted, the dependent processes are notified. The passive agent, implemented as an API, lets user applications communicate shutdown information to their dependents. It also lets applications declare or revoke dependencies, and lets services register for use by RT applications. The RT passive agent has the same purposes as the Windows passive agent, and will be implemented in the library rtman.lib.

The Windows passive agent, implemented as part of the Windows Extension Library (NTX), allows user applications to control dependency relationships. The RT passive agent has the same purposes as the Windows passive agent, and is implemented in the RT Application library, RT.LIB. Each of these components will be detailed in the following sections.

The DSM Dependency Database stores all of the dependency information in the distributed system. It also stores information about available sponsors and the current system topology. User applications implicitly access the database via the NTX and RT APIs. The NTX API accesses the database via the IDsmApi Interface. In addition, DSM active agents also access the database during all shutdown scenarios using the IDSMInternal Interface. INtime tools may also access the database using the COM interfaces. The database is implemented using the registry and the Win32 registry API.

User applications access the database, implicitly or explicitly, via NTX, which has full access to it. DSM active agents also access the database during shutdown.

Note that the dependency and sponsor data stored in the database is volatile: these parts of the database are completely rebuilt when Windows shuts down and restarts. Only system information is retained.

Access to database information is done internally via the Win32 registry API. Access rights to the database are handled using Windows security descriptors. Changes to database information are made by removing the information and then re-adding the updated information. The nature of the registry ensures that accesses to the database are serialized and atomic. It is up to the DSM to keep the database updated so that it holds accurate system and process state information.

Tracking systems

Locating remote clients from the Windows portion of an INtime application includes:

System tracking on the Windows side involves adding new systems to the database, monitoring all systems in the database, and removing systems from the database.

Adding systems to the database

You can add systems to the database only with the Configuration utility. Once systems are added, the DSM tracks their state, moving clients from the INACTIVE state to the ACTIVE state when it discovers clients from the database requesting assignment. If the requesting client's state is UNKNOWN or ACTIVE when the request is received, the DSM assumes that the system has shut down and is now restarting. The DSM initiates the termination process for the system and then initiates system startup to bring it back to the ACTIVE state.

The following diagram illustrates the states and state transitions of a client node as seen by the Windows Host.

State Transitions of an RT client node as seen by the Windows DSM:

  1. RT client node added to the database using the Configuration Utility.
  2. Client requests assignment from the Windows Host.
  3. Client undergoes controlled shutdown.
  4. Removed from the database using the Configuration Utility.
  5. Client does not respond due to crash, hang, power-failure, network problem, or other reason.
  6. Cleanup complete.
  7. Restart detected or required.
  8. Client now responding and intact.
  9. Recovery complete.

RT Client Node States

NULL: This state is included only for completeness. It simply shows that the RT Client has not yet been registered within the Distributed System by the Configuration Utility.
INACTIVE: Once a Client has been registered, it is placed in the INACTIVE state. The INACTIVE state signifies that the Client can join the system at any time, but has not yet booted or started communicating. When a Client is shut down, it returns to this state.
CLEANUP: When the DSM recognizes that a Client Node has rebooted (or that some catastrophic error has occurred which will require a reboot), the system enters the CLEANUP state. In this state, all references to processes on the down node are removed from the database and the proper notifications are made.
ACTIVE: When the DSM has responded to the request for assignment, the DSM begins system monitoring for the node to make sure that communication remains open. The system operates in this mode until a serious system event occurs.
UNKNOWN: The UNKNOWN state is entered whenever the DSM cannot be sure about the status of the Client node. This could be caused by a network failure, by the RT system being unable to respond to pings because of workload (or debugging, etc.), or by a system hang, crash, or restart. Once the DSM recognizes what happened, it marks the system in the appropriate state and responds accordingly.
RECOVERY: The RECOVERY state is entered when the Client has been marked as UNKNOWN and the DSM has now detected that the ACTIVE status can be restored without serious integrity problems. RECOVERY is an intermediary state in which the Windows DSM synchronizes with the RT DSM on the Client node. This is a fairly complicated procedure, and whether it can work at all depends on the nature of the communication medium. It includes steps such as message recovery and process dependency checks.
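
The client node states can be read as a small state machine. The sketch below is one possible reading, not the DSM's implementation: the state diagram is not reproduced here, so the (state, event) pairs are inferred from the numbered transitions and the state descriptions, the event names are hypothetical labels, and the target of the "cleanup complete" transition (INACTIVE) is an assumption.

```python
from enum import Enum


class NodeState(Enum):
    NULL = "NULL"
    INACTIVE = "INACTIVE"
    ACTIVE = "ACTIVE"
    UNKNOWN = "UNKNOWN"
    CLEANUP = "CLEANUP"
    RECOVERY = "RECOVERY"


S = NodeState

# (current state, event) -> next state.
# Event names are hypothetical; the pairs are inferred, not authoritative.
TRANSITIONS = {
    (S.NULL, "added"): S.INACTIVE,                    # 1
    (S.INACTIVE, "assignment_requested"): S.ACTIVE,   # 2
    (S.ACTIVE, "controlled_shutdown"): S.INACTIVE,    # 3
    (S.INACTIVE, "removed"): S.NULL,                  # 4
    (S.ACTIVE, "not_responding"): S.UNKNOWN,          # 5
    (S.CLEANUP, "cleanup_complete"): S.INACTIVE,      # 6 (assumed target)
    (S.UNKNOWN, "restart_detected"): S.CLEANUP,       # 7
    (S.ACTIVE, "restart_detected"): S.CLEANUP,        # 7
    (S.UNKNOWN, "responding_intact"): S.RECOVERY,     # 8
    (S.RECOVERY, "recovery_complete"): S.ACTIVE,      # 9
}


def step(state, event):
    """Return the next state, or raise ValueError for an invalid event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state.name}") from None


def replay(events, state=S.NULL):
    """Apply a sequence of events starting from `state`."""
    for event in events:
        state = step(state, event)
    return state
```

For example, replaying added, assignment_requested, not_responding, responding_intact, recovery_complete takes a node from NULL through UNKNOWN and RECOVERY back to ACTIVE.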

Removing systems from the database

The following diagram illustrates the states and state transitions of an RT DSM Component.

State Transitions of the RT DSM:

  1. Initialization complete.
  2. Assignment request returned.
  3. Windows Shutdown notification - Stay Active request received.
  4. Partial cleanup complete.
  5. Windows Active Again notification received.
  6. Synchronization complete.
  7. Windows Shutdown notification - Shutdown request received.
  8. Windows crash, blue screen, power failure, network problem, or other failure detected.
  9. Windows clean restart notification received.
  10. Windows Alive-Again notification received.
  11. Cleanup complete.

RT DSM States

INIT: The RT DSM runs on RT Client nodes as either a first-level job or a loadable job. At system boot, the RT DSM loads and performs the necessary initialization.
REQUEST: In this state, the RT DSM makes a request to the Windows host for any configured assignments. Once the Windows DSM has responded to the request, the RT DSM moves into the NORMAL state.
NORMAL: In the NORMAL state, the RT DSM monitors processes in dependency relationships. It remains in this state until some major system event takes place.
NT_GONE: The NT_GONE state is entered when the RT DSM receives the Stay Active message from the Windows DSM. The Windows DSM sends this message when Windows is shutting down and the Configuration settings for the Client indicate that the Client should remain active after Windows shutdown. The RT DSM then performs a "partial cleanup". A partial cleanup consists of notifying all RT Sponsor processes of the termination of their Windows dependents and notifying all RT Dependent processes that their Windows Sponsors have terminated.
WAIT: The RT DSM enters the WAIT state after it has completed partial cleanup. It remains in the WAIT state until it receives notification from the Windows Host that Windows is back up and running.
RECOVERY: During the RECOVERY state, the RT DSM attempts to synchronize with the Windows DSM. The focus of the effort is on RT sponsors because all other stored information becomes invalid after Windows shutdown. In the case where Windows has not shut down, the RT DSM attempts to verify that process dependencies are still valid.
UNKNOWN: The UNKNOWN state is entered whenever some failure is detected. The RT DSM attempts to reestablish contact with the Windows Host in order to determine the state of the system.
CLEANUP: In the CLEANUP state, the RT DSM makes all necessary notifications to processes in dependency relationships. It returns to the REQUEST state unless the RT sub-system has received a SHUTDOWN notification from the Windows DSM.
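
The partial cleanup performed in the NT_GONE state can be sketched as a pass over dependency records. This is an illustrative model only: Proc, Dependency, partial_cleanup, and the event strings are hypothetical names, and treating RT-to-RT relationships as untouched by the pass is an assumption.

```python
from typing import NamedTuple


class Proc(NamedTuple):
    side: str  # "windows" or "rt"
    name: str


class Dependency(NamedTuple):
    dependent: Proc
    sponsor: Proc


def partial_cleanup(deps):
    """Model the notifications made when Windows goes away but RT stays up.

    Returns (notifications, remaining). Each notification is a tuple
    (recipient, event, terminated_process); `remaining` holds the records
    that do not involve a Windows process (assumed to be left alone).
    """
    notifications, remaining = [], []
    for dep in deps:
        d, s = dep.dependent, dep.sponsor
        if d.side == "windows" and s.side == "rt":
            # RT sponsor learns that its Windows dependent has terminated.
            notifications.append((s.name, "dependent_terminated", d.name))
        elif s.side == "windows" and d.side == "rt":
            # RT dependent learns that its Windows sponsor has terminated.
            notifications.append((d.name, "sponsor_terminated", s.name))
        elif d.side == "rt" and s.side == "rt":
            remaining.append(dep)
    return notifications, remaining
```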

Tracking processes

Assigning dependencies

Dependency tracking on the Windows side consists of recording dependencies in the dependency database, monitoring the appropriate processes, handling controlled and uncontrolled process shutdown, and removing dependency information from the database.

Once a process is added to the database as having a dependency, the DSM begins monitoring the process (either on the Windows side or the RT side). If shutdown or termination is detected, the appropriate processes are notified. If the terminated process was a sponsor, all associated dependents are notified of the termination, via the deletion mailbox for RT processes and via the callback function for Windows processes. If the terminated process was a dependent, the associated sponsors are notified of the termination. Once the notifications have been made, the dependency no longer needs to be in the database and is removed.
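
The lifecycle described above (record, monitor, notify, remove) can be modeled in a few lines. This is a minimal in-memory sketch, not the registry-backed database: DependencyDb and its method names are hypothetical, and the delivery mechanism (deletion mailbox or callback) is reduced to a returned list of notices.

```python
class DependencyDb:
    """Toy dependency database holding (dependent, sponsor) pairs."""

    def __init__(self):
        self._pairs = set()

    def register(self, dependent, sponsor):
        self._pairs.add((dependent, sponsor))

    def unregister(self, dependent, sponsor):
        # Explicit removal of a dependency without any termination.
        self._pairs.discard((dependent, sponsor))

    def monitored(self):
        """Every process that currently takes part in a dependency."""
        return {proc for pair in self._pairs for proc in pair}

    def on_terminated(self, proc):
        """Build notices for a terminated process and drop its records."""
        notices = []
        for dependent, sponsor in sorted(self._pairs):
            if sponsor == proc:
                notices.append((dependent, "sponsor_terminated", proc))
            elif dependent == proc:
                notices.append((sponsor, "dependent_terminated", proc))
        # Once notified, the dependency records are no longer needed.
        self._pairs = {pair for pair in self._pairs if proc not in pair}
        return notices
```

A process leaves the monitored set as soon as its last dependency record is removed, whether by termination handling or by an explicit unregister.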

Processes also have the ability to explicitly remove a dependency without termination.

Managing RT objects

The ability to create RT objects from a Windows process is a new feature in INtime 2.0 and brings several advantages as well as added complexity. Managing RT objects for Windows processes consists of three parts: storing ownership information, which NTX registers when the RT object is created and removes when it is deleted; monitoring the life of the creating process, in the same way as for dependent processes; and deleting the objects, either via the NTX call or after process termination is detected. The RT object case is not significantly different from the process dependency case.

When a Windows process creates an RT object, the dependency relationship is added to the database. Active agent monitoring of the Windows process is the same as in the case of process dependency. Standard deletion using the NTX library API causes the dependency to be removed from the database. If the active agent detects that the Windows process has terminated, the associated RT objects are also deleted.
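
The ownership bookkeeping for RT objects can be sketched the same way. The class below is a hypothetical model, not NTX code: RtObjectOwnership, its method names, and the string handles are all assumptions made for illustration.

```python
class RtObjectOwnership:
    """Toy table mapping a Windows owner process to its RT object handles."""

    def __init__(self):
        self._by_owner = {}

    def created(self, owner, handle):
        # Recorded at RT object creation time.
        self._by_owner.setdefault(owner, set()).add(handle)

    def deleted(self, owner, handle):
        # Standard deletion removes the record; drop empty owners.
        handles = self._by_owner.get(owner, set())
        handles.discard(handle)
        if not handles:
            self._by_owner.pop(owner, None)

    def owner_terminated(self, owner):
        """Return the handles that must now be deleted on the owner's behalf."""
        return sorted(self._by_owner.pop(owner, set()))
```

Objects that the owner deleted itself never appear in the cleanup list; only those still recorded when termination is detected do.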

Handling process termination

When a process terminates, the DSM active agent recognizes process termination and notifies all processes marked as dependent that the process has terminated. In addition, the DSM notifies all sponsor processes of the terminated process.

Windows system shutdown is handled as multiple invocations of the process shutdown model. During controlled Windows system shutdown, every Windows application receives a message telling it to shut down. The DSM also receives the message and communicates the shutdown to the RT DSM Components in the system. The RT Client may choose to continue running based on its configuration. Uncontrolled Windows system shutdown is handled by RT nodes. Likewise, if an RT node is identified as shut down, the Windows node handles the situation. All processes with RT dependents or RT sponsors are notified and the database is purged of the references to objects on the down node.

The RT process and system shutdown model is a mirror image of the Windows side. The only exception is that system shutdown is done as a message to the Process Deletion Mailbox of each process. The ability to shut down an RT node is not given to the user in INtime 2.0. It is, however, a valuable internal feature.

System calls

This table lists common operations related to distributed system management and the system calls that perform them:

To . . .                          Use this system call . . .
Register a dependency             ntxRegisterDependency, RegisterRtDependency
Unregister a dependency           ntxUnregisterDependency, UnregisterRtDependency
Register a sponsor                ntxRegisterSponsor, RegisterRtSponsor
Unregister a sponsor              ntxUnregisterSponsor, UnregisterRtSponsor
Find the location of a sponsor    ntxFindSponsor, FindRtSponsor
Notify of an event                ntxNotifyEvent, RtNotifyEvent
Register a DSM event handler      RegisterRtEventHandler
Unregister a DSM event handler    UnregisterRtEventHandler
Change event handler priority     SetRtEventHandlerPriority
Manage Windows shut down          RtContinueWindowsShutdown, RtShutdownBlockReasonCreate, RtShutdownBlockReasonDestroy

This shows the order to make DSM calls:

  1. Make these calls from a thread in the process that needs to register a dependency or sponsor.
  2. Make this call from a thread in the process that created the dependency.
  3. Make these calls from the thread that created the dependency.