PS Query Architecture Overview
Query and Timeout Analysis
This document provides an overview of the various timeouts that need to be considered when problems related to query occur. Before going into timeouts, an overview of the PeopleSoft Internet Architecture (PIA) follows to describe the information flow from the browser to the Database Server.
Overview of PeopleSoft Architecture
PeopleSoft Internet Architecture (PIA) is comprised of a variety of components ranging from the browser to the Database Server:
- Web Browser
- Web Server
- Application Server
- Batch Server – Process Scheduler
- Database Server
The Web Browser is the primary means by which end users and administrators access PeopleSoft Applications and Administrative tools. PeopleSoft publishes Certification and Minimum Requirements for each PIA component including the end user workstation.
A Java-enabled Web Server is required to support browser transaction requests and the Application Messaging technology. A collection of Java Servlets are required to handle the wide range of PeopleSoft transactions. PeopleSoft publishes the minimum server requirements along with the certified web servers for each version of PeopleTools by Operating System and/or hardware platform.
The Application Server is the core of the PeopleSoft Internet Architecture, which issues SQL to the database server. It is responsible for maintaining the SQL connection to the database in order to service the browser requests. PeopleSoft publishes minimum server requirements for each version of PeopleTools by Operating System and hardware platform.
PeopleSoft uses Tuxedo to manage database transactions. Jolt, which is a Tuxedo communications protocol is used to facilitate requests from the end user workstation to the database via the Application Server. Tuxedo is a BEA product, BEA is now owned by Oracle as is PeopleSoft.
The Database Server houses the PeopleSoft database schema, which includes all of the object definitions, application tables and data. The database server simultaneously handles the application server connections, development environment connections as well as batch processing running against it.
The Batch Server is where the Process Scheduler is installed and configured and is responsible for servicing user submitted jobs as well as handling batch programs.
PeopleSoft Architecture – Information Flow
The Browser uses the HTTP protocol to send requests to the Web Server. A Servlet installed in the Web Server facilitates all browser connections. It relays transaction requests to the Application Server. Servlets are Java programs that run on the Web Server. If you choose to use Oracle Application Server as your web server the OC4J Containers perform this function.
Servlets on the Web Server transmit requests and data by way of communications to Jolt. Jolt runs on the Application Server. It is the communications layer between the web-based environment and the C++ environments. The Jolt Server Listener (JSL) routes the browser request to an available Jolt Server Handler (JSH) which passes the request onto the Application Server. The Application Server submits the appropriate SQL to execute against the database.
Any PIA or 3-tier request, which is service by the Application Server, is subject to certain timeout values, so that the system can control how much time is spent on servicing (processing) each request. One of the timeout settings that affects service execution is the Service Timeout value in psappsrv.cfg. For each type of server (PSAPPSRV, PSQRYSRV, PSQCKSRV, etc.) this value can be specified in the corresponding server’s section in the psappsrv.cfg configuration file for each Tuxedo Domain (Database Environment). If this setting is greater than 0 (in seconds), the service request will be terminated after it has been executing for the specified duration (i.e., the process which is servicing the request will be terminated by Tuxedo and a message will be written to corresponding Tuxedo log file). This setting affects both 3-tier and PIA requests. Another setting which affects the execution of PIA requests is the tuxedo_receive_timeout value in the Web Server’s pstools.properties file. When a PIA request is submitted, the Web Server waits for the duration of tuxedo_receive_timeout, and when this timeout value has been exceeded, it re-issues the request. If the 2nd request times out, an error is returned to the user (i.e., the Web Server can make up to a maximum of 2 attempts to service a PIA request, provided the first submission of the request times out or causes a service exception). If the tuxedo_receive_timeout is set to a value less than the server’s Service Timeout (in psappsrv.cfg), the Web Server will re-issue the request, although the original request may still be executing. This can be prevented by checking these configuration parameters and making sure they are working together.
Now that the flow of information has been briefly summarized, we can continue the discussion by focusing on queries, since end users submit them quite often. Depending on the required information, queries are designed to return the result set of a SQL statement.
- The PSAPPSRV, PSQCKSRV, and PSQRYSRV processes are basically the same program. They are named differently so they can handle specific types of requests (jobs) thereby dividing or distributing the load on the Tuxedo Domain.
- If you choose to not configure and start the PSQRYSRV process, PSAPPSRV will take over PSQRYSRV’s job load, but that can result in slowing down overall performance. Keep in mind that if multiple queries are going to be executed by multiple end users the PSQRYSRV should be configured and used. During the configuration of the Tuxedo Domain you can specify that more than one (1) PSQRYSRV process is started. In fact you specify a minimum and maximum number of PSQRYSRV processes to initiate.
- The PSQCKSRV process is used to service 2-tier and 3-tier PeopleTools processes. For example, if users will be using the PeopleTools Query program they will have the option of connecting to the database via either 2-tier or 3-tier mode depending on the PeopleTools Configuration Manager entries. If there will be users connecting via PeopleTools and they will be submitting queries then you should configure the psappsrv.cfg Application Server to start up a minimum and maximum number of these processes.
Common Client Issues
- You may see that a query is submitted twice for each psqrysrv, for a long running query.
When any user initiates a long-running query, there is a possibility that the query may not complete within the service timeout set in the domain’s configuration file (psappsrv.cfg). As described earlier, this causes Tuxedo to terminate the PeopleSoft process, which is executing the long-running query. Although the PeopleSoft process has been instructed to die, they query may still be executing in the Database Server. As a result of the termination of the PeopleSoft process, the web server servlet, which submits the user request to the application server domain via Jolt, receives a Jolt exception and it will re-submit the request. Since the request may get submitted a second time, the database server will receive the same query again and the second request will execute in the Database server while the first request is still running. This can exasperate the original problem resulting in the failure (timeout) of both requests, creating server performance degradation, etc.
- How do I control the time it takes to process the request as far as size of query is concerned?
Many times you will see that when a query is submitted, it processes for a long time. As a guideline, you can control the amount of data that a query can return by setting the Max Fetch Size parameter in the psappsrv.cfg file. This will cause the result set to be truncated at the limit specified in the configuration file and possibly result in reducing the overall processing time. It is better to not set this value to 0, to avoid any type of performance problems. However, be aware that the user community may not always receive complete result sets of their queries.
Example Settings for PSQRYSRV
; Settings for PSQRYSRV
; UBBGEN settings
; Number of services after which PSQRYSRV will automatically restart.
; If the recycle count is set to zero, PSQRYSRV will never be recycled.
; The default value is zero.
; Number of consecutive service failures after which PSQRYSRV will
; automatically restart.
; If this is set to zero, PSQRYSRV will never be recycled.
; The default value is zero.
Allowed Consec Service Failures=2
; Max Fetch Size — max result set size in KB for a SELECT query
; Default is 100000KB. Use 0 for no limit.
Max Fetch Size=10000
- The query continues to execute in the database, even when it is not running on the application server side.
As explained above in the first point, even though a PeopleSoft process is terminated, they query may still continue to execute in the database (i.e., there is no way for a terminated process to clean up the request that is submitted to the database). Oracle has a workaround to find out if the parent process exists or not. This workaround helps to find out if the parent process is dead; if so, it terminates the child process also. However, it is possible for the Oracle process executing the query submitted via the application server to become orphaned. In those cases it is best to check with the PeopleSoft DBA and have them investigate and terminate the orphaned query process if it can be identified.
Due to the possibility of a PeopleSoft process service timeout and the consequent resubmission of the request by the Web Server, a query may get submitted to the Database Server more than once. The options for dealing with a long-running or runaway query in the Database Server are:
- Kill the database thread/process which is executing the query in question.
- Apply the dead connection detection mechanism.
Refer to the following resolutions in Customer Connection:
#200728668: Managing Long running queries after the Server has been terminated
ISSUE: Queries continue to run in my database after the Server (PSAPPSRV, PSQRYSRV) have been terminated. How can I clean these up?
RESOLUTION: I’m afraid that once the PSAPPSRV or PSQRYSRV is terminated due to a Service Time-out, there is not much PeopleSoft can do to kill the query thread in the database. Oracle though has a mechanism for determing if a thread has an active parent or if the parent is no longer alive.
On the Database side one needs to create a SQLNET.ora file in the directory $ORACLE_HOME/network/admin. Add the following parameter to the SQLNET.ora file:
Please Contact Oracle for information regarding their “Dead Connection Detection Mechanism” or use MetaLink to look this information up. A little bit of the description of this mechanism follows.
DEAD CONNECTION MECHANISM
Dead Connection Detection (DCD) is a feature of SQL*Net 2.1 and later, including Net8. It detects when a partner in a SQL*Net V2 client/server or server/server connection has terminated unexpectedly, and releases the resources associated with it.
DCD is intended primarily for environments in which clients power down their systems without disconnecting from their Oracle sessions, a problem characteristic of networks with PC clients.
DCD is initiated on the server when a connection is established. At this time SQL*Net reads the SQL*Net parameter file and sets a timer to generate an alarm. The timer interval is set by providing a non-zero value in minutes for the SQLNET.EXPIRE_TIME parameter in the sqlnet.ora file.
When the timer expires, SQL*Net on the server sends a “probe” packet to the client. (In the case of a database link, the destination of the link constitutes the server side of the connection.) The probe is essentially an empty SQL*Net packet and does not represent any form of SQL*Net level data, but it creates data traffic on the underlying protocol.
If the client end of the connection is still active, the probe is discarded, and the timer mechanism is reset. If the client has terminated abnormally, the server will receive an error from the send call issued for the probe, and SQL*Net on the server will signal the operating system to release the connection’s resources.
On Unix Servers the sqlnet.ora file must be in either $TNS_ADMIN or $ORACLE_HOME/network/admin. Neither /etc nor /var/opt/oracle alone is valid.
It should be also noted that in SQL*Net 2.1.x, an active orphan process (one processing a query, for example) will not be killed until the query completes. In SQL*Net 2.2, orphaned resources will be released regardless of activity.
This is a server feature only. The client may be running any supported SQL*Net V2 release.
THE FUNCTION OF THE PROTOCOL STACK
While Dead Connection Detection is set at the SQL*Net level, it relies heavily on the underlying protocol stack for its successful execution. For example, you might set SQLNET.EXPIRE_TIME=1 in the sqlnet.ora file, but it is unlikely that an orphaned process will be cleaned up immediately upon expiration of that interval.
#38704- Database keeps working on transaction after Application Server process times out
ISSUE: Customer set timeouts for Tuxedo appropriately, and can see that Tuxedo is indeed killing the process that is servicing the service request. However, the database does not seem to realize that the connection was severed and therefore keeps eating up CPU time.
RESOLUTION: This depends on the type of database server used.
Oracle: A dead connection timeout value needs to be defined in the sqlnet.ora file on the database server. This parameter is SQLNET.EXPIRE_TIME = M where M is a value in minutes. Oracle recommends setting this value to 10. When that time is reached, the database system will ping the client–who is in this case a program servicing a Tuxedo service request, and when this timeout hits, if the connection is down, the transaction will be stopped and rolled back. This means: if the client is busy, nothing will happen. If the client is NOT busy, the transaction will be cleaned up. If you are using a BEQ (BEQ) then this setting will not work. You are using a BEQ connection if in your application server domain’s psappsrv.cfg file you have UseLocalOracleDB=1. You must use either TCP and IPC connections for this setting to take effect.
Informix: There is also a similar timeout, but this is non-configurable and set to 10 or 15 seconds.
DB2: There is a “unit of work timeout” on the database engine. By having the DBA set this value, you can get timeouts to take place shortly after the remote call timeout from Tuxedo.
Note: For PeopleTools 8.44
There is a new featured introduced in 8.44 called Query Kill. Users should be able to kill long running queries by navigating to PeopleTools->Utilities->Administration->Query Administration. On the execution tab they can search for long running queries and then choose the Kill Selected Query button. It should kill the PSAPPSRV or PSQRYSRV process that is associated with that query, delete the row from PSQRYTRANS and stop it at the db level as well.
-For PeopleTools 8.44 and above, also review the document KillQueryWord.doc
-PSMONITORSRV process cannot be turned off because it serves two functions: The first is to report host / domain resource information to Performance Monitor. The second is to support canceling a PeopleSoft Query in PT 8.44.
If the timeout occurs on the Database Server when the Application Server process is still waiting for the query execution to be completed, the DB client will return an appropriate error code to the Application Server process. The appsrv.log can show this error code. Similarly, when a service timeout occurs, the Web Server gets a Jolt Exception.
Note: Incident T-MJOHNS-YB7FA was reported for this problem. It is resolved in Tools 8.21 and 8.44.
Application Server Timeout
Service Timeout specifies the number of seconds Tuxedo waits for a service request, such as MgrGetObj or PprLoad, to complete before terminating it. The psappsrv.cfg file in $PS_HOME/appserv/ contains 1 Service Timeout value per server. If needed, you can increase the Service Timeout value for PSAPPSRV. This requires reconfiguring the domain after changing the Service Timeout value in psappsrv.cfg and then restarting your Application Server processes via psadmin. Here’s an example of psappsrv.cfg:
; Settings for PSAPPSRV
; UBBGEN settings
If the Service Timeout is set to 0, the request can run endlessly (it is a good idea to set this parameter to a high value rather than never letting it time out). However, a PIA request may still be subject to the tuxedo_receive_timeout on the Web Server. Service Timeouts are recorded in the TUXLOG.LOG and APPSRV.LOG. In the event of a timeout for a PSAPPSRV process, it is killed by Tuxedo. If the domain setting “Restartable” is set to Y in psappsrv.cfg, Tuxedo automatically starts another instance of the same server process to replace the killed one. TUXLOG registers the timeout message as show below:
Web Server Timeout
If an Application Server process dies because of any reason other than a service timeout or a database timeout, the servlet behavior should be analyzed. You may see that it resends the same request one more time. As mentioned previously, this is by design.
PeopleSoft recommends keeping the tuxedo_receive_timeout value higher than or equal to the service timeout value for the Application Server processes.
tuxedo_receive_timeout is the servlet timeout value in the pstools.properties file. For example, on Windows, it is located in weblogic\myserver\psftdocs\peoplesoft (WebLogic 5.10) or bea\weblogic6.1\config\peoplesoft\applications\portal\web\psftdocs\ps (WebLogic 6.1). It is set to 600 (10 minutes) by default. Here’s an example of pstools.properties:
# Tuxedo parameter settings
# default is disconnect=0, send=10, receive=10
There are 6 main parts to the query kill and timeout functionality. First, is the new PSQRYTRANS table. Every time a query is run from Query Manager or Query Viewer a row is written to this table.
Second, are changes to the Query Administrator that displays the queries that are
being run from Query Manager or Query Viewer – i.e., rows in PSQRYTRANS – there are a number of options here. This page will also provide the ability to mark a row to be killed by the system.
Third, are changes in the page to manage permission lists. A new field has been added here that represents the number of minutes that a query can run before being timed out. The number 0 represents an infinite timeout value. The query timeout vale that is used is the greatest number, or 0 if it exists, for all permissions lists of all roles that a user belongs to.
Fourth, is the QRYTIMEOUT flag in the PSQRYFLAGS table. This is a global flag that controls the enabling of the query timeout functionality.
Fifth, is the query monitor. This is an appserver function that is run every minute that looks for queries that have reached their time limit, or have been marked to be killed. It will also check for orphaned rows – ie. rows that have no timeout, and the appserver process doesn’t exist or isn’t running a query. Also, as it finds a row to kill or timeout it will ensure that the process is really a PSAPPSRV or PSQRYSRV that is running a query.
Sixth, when a query is killed we will increment a new field, NUMKILLS, in PSQRYSTATS. This is also reflected in the query stats GUI. We will also write a row in PSQRYEXECLOG. There is a new field, in this table – KILLEDREASON – that reflects why the query was killed. “T” for timed out, “K” for killed. The only other field that is updated in the row is EXECDTTM, which is set to the time the query was killed. This is also reflected in the exec log GUI.
Tests to Run:
In order to test various cases:
- There are 4 distinct ways that a query can be started: query viewer, run to HTML from Query Manager, run to file from Query Manager and Preview from Query Manager. We need to make sure that a row is correctly written to PSQRYTRANS for each of these cases.
- Note:All 4 methods do not share a common mechanism – Preview uses the Query API and the rest use the ICQuery Service.The ICQuery Service can run on either PSAPPSRV or PSQRYSRV, but both of these share the same code so we don’t need to test both process types.
- Make a query (row) for kill from the Query Administrator Page, and ensure that it is killed within approximately 1 minute.
- Setup and ensure that a query times out and is killed within approximately 1 minute of its timeout value.
- Check that the correct timeout is used when different timeout values are in multiple permission lists. When various numbers exist it uses the greatest value, when various numbers exist and if a zero exists it uses an infinite timeout.
- Ensure that the global timeout enable/disable functions by setting it through ???. Ensure that a row is still written to PSQRYTRANS when this is disabled.
- Ensure that the number of times a query is killed is reflected in the query stats GUI.
- Ensure that a row is written in the Query Exec Table. Also, need to ensure that the time the Query is killed is recorded in EXECDTTM, and the correct reason “K” or “T” is written to KILLEDREASON.
- Ensure that the Query Administrator (executing page) will display all running queries for the various options – all running queries, running queries by name and queries running for nn minutes.
If there is enough time for testing then you should also look into testing these cases:
- Ensure that the Query Monitor cleans up orphaned rows.
- Ensure that the Query Monitor will not kill a process that is running something else – i.e. the Process ID in the row can’t be the Process ID for another process other than PSAPPSRV or PSQRYSRV. Also, make sure that if the Process ID points to a PSAPPSRV or PSQRYSRV that is running a different operation it doesn’t kill it.
- Ensure that one domain’s Query Monitor does not try to do anything with queries from another Domain.
- Ensure that rows are not written to PSQRYTRANS from the Query API when it is called from something other than Query Manager and Query Viewer.