Corticon Server: Integration & Deployment Guide : Performance and tuning guide : Diagnosing runtime performance of server and Decision Services

Diagnosing runtime performance of server and Decision Services

When performance issues arise, analyzing usage characteristics might reveal the performance bottlenecks. Corticon servers always start a diagnostic thread, and a snapshot of key diagnostic metrics is taken at a regular interval. The diagnostic data is forwarded to the logging system where they are recorded. You can then run a utility that extracts diagnostic lines, and transforms them to a standard comma-separated value format you can use in a data analysis product such as Tableau or Excel to create visualizations.

Properties that control diagnostic logging

The user can control the starting of the diagnostic thread, the intervals at which diagnostic snapshots are taken, and whether diagnostic lines are accepted into logs. Add or edit the following properties in the brms.properties:

To enable (true) or disable (false) automatic startup and configuration of the server monitor thread when an ICcServer is created in the CcServerFactory. Default value is true.

com.corticon.server.startDiagnosticThread=true

To specify the wait time in milliseconds of the Server Diagnostic Monitor. Default is 30000 - 30 seconds.

com.corticon.server.DiagnosticWaitTime=30000

To set the log level and/or filters to accept diagnostic entries. The default loglevel, INFO, and higher loglevels enable filters that are set to accept DIAGNOSTIC entries. Choosing a lower loglevel and/or removing DIAGNOSTIC from the logFiltersAccept list denies generated diagnostic data entry into the log.

Note: See the topic Changing logging configuration for more information.

Example of diagnostic log entries for a server and four Decision Services

2015-05-13 14:02:34.831 INFO DIAGNOSTIC [CcDiagnosticsThread] Cc -
id=1431540154831,sthp=509.6875,shp=356.08567810058594,sex=0,stq=1564,
sec=1564,sfc=0,saex=3.9801790281329925,sawt=0
2015-05-13 14:02:34.831 INFO DIAGNOSTIC [CcDiagnosticsThread] Cc -
id=1431540154831,ds=ProcessOrder.1.10,ec=1564,aex=3,awt=0,fc=0
2015-05-13 14:02:34.831 INFO DIAGNOSTIC [CcDiagnosticsThread] Cc -
id=1431540154831,ds=Candidates.1.14,ec=0,aex=0,awt=0,fc=0
2015-05-13 14:02:34.831 INFO DIAGNOSTIC [CcDiagnosticsThread] Cc -
id=1431540154831,ds=AllocateTrade.1.14,ec=0,aex=0,awt=0,fc=0
2015-05-13 14:02:34.831 INFO DIAGNOSTIC [CcDiagnosticsThread] Cc -
id=1431540154831,ds=Cargo.1.0,ec=0,aex=0,awt=0,fc=0
2015-05-13 14:03:04.842 INFO DIAGNOSTIC [CcDiagnosticsThread] Cc -
id=1431540184842,sthp=509.6875,shp=371.9812469482422,sex=0,stq=1564,
sec=1564,sfc=0,saex=3.9801790281329925,sawt=0

To generate DIAGNOSTIC data into the server log:

The default Corticon Server settings generate diagnostic data and capture the diagnostic entries in the log.

1. In brms.properties, set the monitor interval so that unusual spikes of activity in a key metric (such as heap size) are captured. If the window is large, the heap size might go from normal to the server crashing with OutOfMemory exceptions without leaving any trail in the diagnostics entries in the log. The default value is 30000 milliseconds. You could change it to, say, 10 seconds by locating the line: #com.corticon.server.DiagnosticWaitTime=30000and then changing its value to com.corticon.server.DiagnosticWaitTime=10000

2. Confirm that the logLevel is INFO or higher and that logFiltersAccept includes DIAGNOSTICS.

3. Save the file.

4. Start Corticon Server.

5. Execute some requests or activities through the server.

6. Examine the server log at [CORTICON_WORK_DIR]\logs\ to note lines that contain "DIAGNOSTIC".

You can now run the utility that extracts only the diagnostic lines and transforms each from name=value pairs to comma-separated integer and string values. The Server and Decision Service Diagnostic data and procedures are discussed separately.

SERVER DIAGNOSTICS

The server diagnostic log entries provide general server health metrics. The following metrics are logged:

Table 39. Content of a Server diagnostic entry
Item	Description
Diagnostic set ID (id)	Grouping of log lines from a common time slice
Date Time	Timestamp of log entry
Server Total Heap size (sthp)	Corticon server JVM total memory in MB
Server Heap size (shp)	Corticon server JVM allocated memory in MB
Server Executing threads (sex)	Current count of threads actively executing Decision Services
Server Threads in Queue (stq)	Count of Decision Service invocations waiting in the queue to be executed
Server Execution Count (sec)	Total number of Decision Services executed since Corticon server startup
Server Failure Count (sfc)	Total execution failure count since Corticon server startup
Server Average Executions time (saex)	Average Decision Service execution time measured across all Decision Services,and provides the average since the Corticon server startup
Server Average Wait Time (sawt)	Average Decision Service wait time measured across all Decision Services, and provides the average since the Corticon server startup.

Once you have a log file with diagnostic entries, running a Corticon management utility takes an input file and produces each DIAGNOSTIC line as CSV data in its output file. To run the utility against a log that has been archived, extract the log file to a temporary location.

To extract server diagnostic data from a Corticon Server log:

1. Open a command prompt window at [CORTICON_HOME]\Server\bin.

2. Enter:

corticonManagement.bat -e -s -i {input_file} -o {output_file}

For example:

corticonManagement.bat -e -s -i C:\CcServer.log -o C:\CcServer_20150513.csv

When the processing completes, the input file is unchanged. The output file extracts only Server diagnostic lines, transforming each line into CSV values and a header line as shown for the log example above:

Date Time,Heap Size,Total Heap Size,Thread Pool Size,Tasks Queued,
Executions,Failures,Average Execution Time,Average Wait Time
2015-05-13 14:03:04.842,371.9812469482422,509.6875,0,1564,1564,0,3.9801790281329925,0
2015-05-13 14:03:34.854,328.77896881103516,509.75,0,1564,1564,0,3.9801790281329925,0
2015-05-13 14:04:04.827,342.8452453613281,509.8125,0,1564,1564,0,3.9801790281329925,0
2015-05-13 14:04:34.829,364.4458999633789,501.1875,0,2020,2020,0,3.704950495049505,0
2015-05-13 14:05:04.831,372.962158203125,503.625,0,2879,2879,0,3.4105592219520666,0
2015-05-13 14:05:34.833,345.2066192626953,504.875,0,4335,4335,0,3.200461361014994,0
2015-05-13 14:06:04.835,366.8616409301758,503.4375,0,4335,4335,0,3.200461361014994,0
2015-05-13 14:06:34.837,345.76478576660156,502.9375,0,4335,4335,0,3.200461361014994,0
2015-05-13 14:07:04.839,358.3552551269531,503.5,0,4448,4448,0,3.1913219424460433,0
2015-05-13 14:07:34.841,337.6990051269531,495.75,0,4448,4448,0,3.1913219424460433,0

The Server Diagnostic CSV data is compatible with analytic and visualization products such as Excel and Tableau, as illustrated:

DECISION SERVICE DIAGNOSTICS

A diagnostic entry is created for each Decision Service that is deployed to the Corticon server. If you are creating separate logs for each Decision Service, the utility runs against its corresponding log; otherwise, the Decision Service diagnostic is captured into the server log where the utility will, in turn, extract the data for one specified Decision service at a time against the same server log. The following metrics are logged for each Decision Service.

Table 40. Content of a Decision Service diagnostic entry
Item	Description
Diagnostic set ID (id)	Grouping of log lines from a common time slice
Date Time	Timestamp of log entry
Decision Service name (ds)	The name and Major.minor version of the deployed Decision Service
Execution Count (ec)	The total execution count for this Decision Service since the Corticon server startup
Average Execution time (aex)	Average execution time in the designated Decision Service since the Corticon server startup
Average Wait Thread time (awt)	The Average execution wait time for threads entering the designated Decision Service since the Corticon server startup
Failure Count (fc)	Number of failures recorded in the designated Decision Service since the Corticon server startup

Once you have a log file with diagnostic entries, running a Corticon management utility takes an input file and produces each DIAGNOSTIC line as CSV data in its output file.

To extract Decision Service diagnostic data from a Corticon Server log:

1. Open a command prompt window at [CORTICON_HOME]\Server\bin.

2. Enter:

corticonManagement.bat -e -ds {DSname} -dsv {major.minor} -i {input_file} -o {output_file}

For example:

corticonManagement.bat -e -ds ProcessOrder -dsv 1.10
-i C:\CcServer.log -o C:\ProcessOrder_1.10_20150513.csv

When the processing completes, the input file is unchanged. The output file extracts only diagnostic lines, transforming each line into CSV values and a header line as shown for the log example above:

Date Time,Service Name,Executions,Failures,
Average Execution Time,Average Wait Time
2015-05-13 14:02:34.831,ProcessOrder.1.10,1564,0,3,0
2015-05-13 14:03:04.842,ProcessOrder.1.10,1564,0,3,0
2015-05-13 14:03:34.854,ProcessOrder.1.10,1564,0,3,0
2015-05-13 14:04:04.827,ProcessOrder.1.10,1564,0,3,0
2015-05-13 14:04:34.829,ProcessOrder.1.10,2020,0,3,0
2015-05-13 14:05:04.831,ProcessOrder.1.10,2879,0,3,0
2015-05-13 14:05:34.833,ProcessOrder.1.10,4335,0,3,0
2015-05-13 14:06:04.835,ProcessOrder.1.10,4335,0,3,0
2015-05-13 14:06:34.837,ProcessOrder.1.10,4335,0,3,0
2015-05-13 14:07:04.839,ProcessOrder.1.10,4448,0,3,0

The Decision Service Diagnostic CSV data is compatible with analytic and visualization products, as illustrated in Excel:

Interpreting diagnostic data

Here is a guide to what changes in performance diagnostic values might indicate:

When the number of waiting threads (wt) goes up, it is an indication that the request demand is greater than the server capacity. Some wait time may be necessary in periods of high demand.

The average wait time (awt) can be used to determine whether server capacity should be expanded (or if consistently low, might indicate contracting server resources).

The number of executions (ex) combined with the average wait time help pinpoint whether there is a need to expand server resources or just to accept a slower response in small, high demand windows.

The number of failures (fl) is an indication that expert analysis/maintenance is needed.

The average execution time (aex) can be used to determine if there are configuration/resource issues. If this rate is not stable, it might indicate that the resource configuration is not optimal. However, this value can be dependent upon data size -- if the input data size is not stable the execution size will not be stable.