Co-ordinated DB Access (Thread-safe DB Access)

From OpenPetra Wiki
Jump to navigation Jump to search

THIS PAGE IS WORK IN PROGRESS

This page is currently being worked on. Please come back later to this page when you have been notified that it is finished!

Co-ordinated DB Access: Overview

The Problem it Solves

Before we had Co-ordinated DB Access in place, users ran into various Exceptions when multi-threaded DB access occurred - no matter whether that multi-threaded DB access was done intentionally/deliberately by the programmer, or whether it happened as something that resulted 'accidentally' because of an action the user took (and which we didn't prevent from happening). Co-ordinated DB Access not only prevents that from happening, but also provides new options to safely run program code that attempts to access the DB in parallel (however, multi-threading always brings some challenges, esp. with DB access - for details read the Challenges section).

Schematic diagram showing uncoordinated DB access

The Solution

The primary solution was to make the TDataBase Class (OpenPetra's Database Access and Abstraction Layer) thread-safe (this got addressed by solving Bug #3852), meaning that we are employing pseudoparallel execution to prevent any 'collisions' on DB Access. That in itself prevented the Exceptions mentioned earlier from happening!

Schematic diagram showing coordinated DB access

Building on that, ...

  • provisions have been put in place to allow the OpenPetra software engineers to react programmatically to various new situations where the now co-ordinated DB Access can raise specific Typed Exceptions in case the desired 'parallel-ity' cannot be achieved automatically in a given situation;
  • provisions have been made in the client-side 'Unhandled Exception Handler' to automatically show 'friendly and helpful' messages to the user when the software engineers didn't react programmatically to various new situations (yet).

The automatic 'friendly and helpful' messages may well be enough for situations in which concurrent DB access operations aren't occurring often and where the user wouldn't be too annoyed to perform any retry attempts by themselves.

The ability to react programmatically to the various new situations exists to perform automatic retries 'under the hood', to allow for 'better' provisions for the user (e.g. to provide Retry/Cancel options), and to be able to prevent the user from taking certain actions in the first place that could (later) lead to the inability to take certain actions (e.g. disallowing the opening of a screen under certain circumstances because the particular circumstance would mean that any entered data might not be 'save-able' by the user later on).

Pseudoparallel Execution

What happens in our 'co-ordinated DB Access' is that we allow only one thread at any given time access to DB-related functionality that is exposed through the TDataBase Class. That means that other threads need to wait until the first thread has finished accessing the DB through the TDataBase Class (an automatic time-out time-out is in place to mitigate 'stalling' situations).

We need to do this because we can't offer 'true' parallelism. The reasons for that are:

  • The ADO.NET DB driver model cannot maintain multiple active statements (commands) on a single DB Connection.
    • (Well, for SQLServer and Oracle there is a specific solution {'Multiple Active Result Sets', | MARS}, but it isn't a standard feature of ADO.NET and not available for other RDBMS's).
  • even the PostgreSQL RDBMS allows only one running DB Transaction per DB Connection!

Details of the Implementation

What is Done Automatically And What Needs to be Handled Manually

Automatic (and Fully Transparent): Thread-safe DB Access through the TDataBase Class

Automatic Thread-Safety

The thread-safety is fully transparent to the software engineers, that is, the software engineers don't need to do anything to make sure it works, and don't even need to know how it works.

(For the curious: Thread-safety is achieved by using a SemaphoreSlim Object {FCoordinatedDBAccess} with a capacity of only one, two new Methods {WaitForCoordinatedDBAccess and ReleaseCoordinatedDBAccess}, and by calling those two Methods appropriately in all places where it is required to achieve thread-safety across everything the TDataBase Class can do. For the even-more-curious: A Mutex or lock couldn't be used to achieve that since these aren't 'thread-agnostic', whereas a SemaphoreSlim is thread-agnostic. Also, a SemaphoreSlim is a very performant way to achieve thread-safety. {|Details})

Automatic Time-out (Thread Stalling Prevention)

Suppose a thread that 'locks' the TDataBase Class (due to the thread-safety being in place) would run for a long time, or even 'stall' for some reason or the other, or get into an 'endless loop', and hence wouldn't release the 'lock' on the TDataBase Class. While none of these should happen (of course...), the consequence of any of this happening would be that all other threads of a user that were waiting for DB Access would be waiting to get access to the DB - for however long it takes for the first thread that 'grabbed the lock' to finish. This situation would not only give the user the impression that 'OpenPetra isn't responding/has crashed', but would also mean they couldn't save any work that they haven't saved yet, which they might be able to do if we are able to offer them 'Retry/Cancel' options. An automatic time-out is in place to help avoid those unwanted issues. This time-out applies not to the first thread (that 'locked' the TDataBase Class ), but to any 'next' thread that wants to use the TDataBase Class and which has to wait. The time-out means that this waiting isn't 'indefinite', but ends after a set time-out, and hence the thread that ran into the time-out can perform some action. That action could be as simple as repeating the request for DB access or giving the user the opportunity to either continue waiting or cancel the operation that the user initiated.

Important: The time-out doesn't mean that a thread that wants to execute some action using the TDataBase Class has to wait until the automatic time-out has expired if another thread 'grabbed the lock' earlier. Rather, the second thread will execute the action as soon as the first thread finishes its work and 'releases the lock'! Example: If the first thread 'grabs the lock', executes an action and 'releases the lock' within 4 seconds, a second (waiting) thread will be able to 'grab the lock' after those 4 seconds, and not just after the time-out expired!

When the time-out does expire, the TDataBase Class throws a new Exception, EDBCoordinatedDBAccessWaitingTimeExceededException. Though a calling Method could catch this Exception specifically, it is more helpful to catch its Base Class, EDBAccessLackingCoordinationException. For details about this see section XXXXXX.

Configuration (Optional): Automatic Time-out

The automatic time-out time can be configured. That configuration option has been introduced to prevent users from running into time-outs too often in situations where an OpenPetra Site has got a slower server than average OpenPetra Sites, or a higher concurrent user count than average OpenPetra Sites, or both. The value would be set to a higher number than the default in such situations.

The time-out value defaults to 3000 (=3.000 milliseconds, which equals 3 seconds) but can be changed by including the appSetting Server.DBWaitingTimeForCoordinatedDBAccess in the Server.config file. Testing has so far shown that 3.000 milliseconds is sufficient for situations where a 'not-very-fast' virtual test server is used on which limited numbers of users perform concurrent actions. That default could easily be changed in the future, should that become necessary (it is specified in the TDataBase Class Constructor).

Semi-Automatic: (DB-)Call Retries

The most common reaction to an automatic time-out should be a retry of getting the 'lock' on the TDataBase Class. The reason for that is that often the second or third attempt of getting this 'lock' succeeds (as many DB queries run only for a short time)! It would therefore not be very user-friendly to show a message to the user that the action that (s)he has taken could not be performed when an internal retry (which the user doesn't notice) can often succeed.

A new Class has been introduced to make it easy to program such DB call retries. The new Ict.Common.DB.TServerBusyHelper Class has got only one static Method, CoordinatedAutoRetryCall. Use this Method wherever you expect that the taking-out of a 'lock' on TDataBase could time out as other things that run in parallel might have come first in taking a 'lock' out. For details about this see section XXXXXX.

Configuration (Optional): Number of Retries

The number of retries that the CoordinatedAutoRetryCall Method automatically performs can be configured. That configuration option has been introduced to prevent users from running into time-outs too often in situations where an OpenPetra Site has got a slower server than average OpenPetra Sites, or a higher concurrent user count than average OpenPetra Sites, or both. The value would be set to a higher number than the default in such situations.

The time-out value defaults to 3 (=3 retries) but can be changed by including the appSetting CoordinatedAutoRetryCallMaxRetries in the Client.config file (for controlling the number of retries when the CoordinatedAutoRetryCall Method is used client-side) and/or the Server.config file (for controlling the number of retries when the CoordinatedAutoRetryCall Method is used server-side).

Automatic: Exception Handling in Case of Time-out

The client-side 'Unhandled Exception Handler' presents 'stock messages' to users in case an Exception that derives from the EDBAccessLackingCoordinationException Exception makes it as far as that (that is, if a programmer didn't use the TServerBusyHelper.CoordinatedAutoRetryCall Method and didn't catch it server-side or client-side). This is a much better option than letting the Exception escalate and showing the Unhandled Exception Dialog as a result of that, but there are situations where the manual catching and handling of such Exceptions is the better solution; see TODO.

Example of such an automatically-shown MessageBox:

---------------------------
OpenPetra Server Too Busy
---------------------------
The OpenPetra Server is currently too busy to perform the requested action.  (Reason: Failed to initiate shared data access.).

Please wait a few seconds, then retry the action that you wanted to perform (if you had previously started a long-running action and this is not finished yet, you might need to wait until this is finished).

If you have received this message while a screen was opened and that screen is left open then please don't try to use that screen! Rather, close it and try to open it again.
---------------------------
OK   
---------------------------

The '(Reason: XXX.)' part in those Messages can have different wording that helps a developer to know what has actually gone wrong (since the the Exception information is 'hidden' from the GUI):

  • 'Reason: Waiting time for initiating exclusive data access exceeded.' – gets shown when EDBTransactionBusyException gets thrown. This will be the case when the caller called one of the BegintXXXTransaction Methods, but another DB Transaction (e.g. from running a Report) was already running and the call cannot be satisfied because no second (=concurrent) DB Transaction can be started.
  • 'Reason: Failed to initiate shared data access.' – gets shown when EDBTransactionIsolationLevelWrongException gets thrown. This will be the case when the caller called one of the GetNewOrExistingXXXTransaction Methods, but another DB Transaction (e.g. from running a Report) was already running and the call cannot be satisfied because of a mismatch of the 'IsolationLevel'.
  • 'Reason: Waiting time for data access exceeded.' – gets shown when the new EDBCoordinatedDBAccessWaitingTimeExceededException gets thrown. This will be the case when the caller called one of the GetNewOrExistingXXXTransaction Methods, but another DB Transaction (e.g. from running a Report) was already running, and – while there wasn't a mismatch of the IsolationLevel – the second call timed out because it could not get exclusive access to the DB. This will be the case when the original call to the DB takes longer than the time-out (3 seconds by default).

Additionally, when the appSetting 'Client.DebugLevel' is set to a minimum of 3 in the Client.config file the Unhandled Exception handler also logs the full Exception details to the Client.log. This is useful for detailed checks by the developer. At that DebugLevel the developer also gets information in the Client.log about the various retry attempts, and if they were exceeded, for cases where screens handle those using the in an automatic retry fashion by means of the TServerBusyHelper.CoordinatedAutoRetryCall Method. Examples of screens/screen parts for the latter: the Partner Find/Partner Info Panel, the Partner Edit screen and the Maintain Extracts screen (when opening a Partner from the latter). The Client-side Cacheable DataTable Manager utilises TServerBusyHelper.CoordinatedAutoRetryCall, too.

Examples

Apart from the example that is shown in the bullet list below there aren't any specific example actions that you could take to make the automatic 'stock messages' come up; rather, they will come up when an EDBAccessLackingCoordinationException Exception makes it as far as the Unhandled Exception Handler!

Reproducible example for a BeginXXXTransaction call 'not going through':

  • Start the 'Partner by Subscription' Report from Partner Module -> Reports with a 'popular' Subscription (e.g. 'AFRIKAANS' on the SA DB; that will run for quite some time).
  • While that runs, go to Partner Module -> Partners and click 'Add New Person'. The screen opens fine, but on clicking 'OK' the user is presented with an automatically-shown MessageBox (see example above). On clicking 'OK' on that MessageBox, the user is back in the Main Menu (while the Report happily continues to calculate!). That also works when 'New Partner' is initiated e.g. from the Partner Edit screen, in that case one is back on the Partner Edit screen that you pressed 'New Partner' on.
    • We might be able to improve on this by keeping the user in the 'New Partner' Dialog (rather than closing this and returning him/her to wherever (s)he came from), but that is something that needs extra assessment and specific programming.
    • Note: In this case a BeginXXXTransaction call with IsolationLevel 'Serializable' needs to be used because only that can guarantee that the Partner Key that gets generated will exist exactly once in the DB!

Manual: Co-ordinated DB Access Exception Handling

Manual handling of Exceptions that either stem from co-ordinated DB Access, or from calling one of the BeginXXXTransaction Methods when a DB Transaction is already running is done in several ways:

  • Utilising the TServerBusyHelper.CoordinatedAutoRetryCall Method
    • Client-side
    • Server-side

TODO

  • in case of time-out

TODO


Examples / Implementations

As per March 2015 the following places exist that utilise the TServerBusyHelper.CoordinatedAutoRetryCall Method:

Client Side
  • Partner Edit screen (Method 'GetPartnerEditUIConnector' in file \csharp\ICT\Petra\Client\MPartner\Gui\PartnerEdit.ManualCode.cs);
  • Maintain Extracts screen - when opening a Partner for editing (Method 'EditPartner' in file \csharp\ICT\Petra\Client\MPartner\Gui\Extracts\UC_ExtractMaintain.ManualCode.cs);
  • Partner Find screen/Partner Info Panel (Method 'FetchDataFromServer' in file \csharp\ICT\Petra\Client\MPartner\Gui\UC_PartnerInfo.cs [that Method is launched only from a Timer, i.e. runs in a separate client-side Thread!]);
  • 'TSystemDefaults.GetSystemDefault' Method in file \csharp\ICT\Petra\Client\app\Core\SystemDefaults.cs;
  • 'TFrmMainWindowNew.UpdateSubsystemLinkStatus' Method in file \csharp\ICT\Petra\Client\app\MainWindow\MainWindowNew.ManualCode.cs.

The Client-side CacheableDataTable Manager (TDataCache) utilises TServerBusyHelper.CoordinatedAutoRetryCall, too (two overloads of the Method 'GetCacheableDataTableFromPetraServer' in file \csharp\ICT\Petra\Client\app\Core\Cache.cs).

Server Side
  • 'TReportingDbAdapter.RunQuery' Method in file \csharp\ICT\Petra\Server\lib\MCommon\Main.cs;
  • 'TSystemDefaults.GetSystemDefaults' Method in file \csharp\ICT\Petra\Server\lib\MSysMan\SystemDefaults.cs.

Challenges

Performance Considerations

As we are employing pseudoparallel execution to prevent any 'collisions' on DB Access, you must expect that DB access is a limiting factor in how 'parallel' multiple threads can actually run, even on servers that have multiple processors!

How much the pseudoparallel execution will impede the performance of multiple threads will depend heavily on how much the individual threads utilise DB access vs. how much they 'compute', i.e. perform non-DB-access-related operations. If both threads access the DB heavily at roughly the same times, the execution speed of the threads will be heavily impeded as only one thread a time gets access to the DB. If, however, the threads do quite some computations and only some DB access (which ideally doesn't occur at roughly the same times) then the performance of individual multiple threads will not be impeded and operations involving multiple threads can well yield advantages over operations that consist of a single thread on servers that have multiple processors.

This also affects the user: When a user starts multiple Reports in parallel (e.g. just before going on lunch break), the combined time that both Reports take to calculate might be higher, or lower, than if they were started individually one-after-the-other. This will depend on how many DB access operations will 'compete' over DB access - and of course also whether the server on which the PetraServerConsles.exe process runs has got multiple processors, or only one processor...

TODO

TODO

TODO

Multi-threading: General Advice

General caution/advice: Programming with multiple threads is always much, much harder than programming with a single thread. When multiple threads access the same program data (Fields, static Fields, structs, etc.) one can easily get intermittent problems which can be very hard to pin-point and to resolve. Extreme diligence needs to be employed when multiple threads should access some shared program data in a writing fashion! Debugging a multi-threaded program is also considerably more difficult than debugging a single-threaded program. Whole books are devoted to the topic of multi-threaded programming as multi-threading is a quite difficult discipline that is hard to master...

The Future: A Safe-to-use, Multi-threading Enabled OpenPetra

TODO


Questions?

In case you have questions regarding the 'Co-ordinated DB Access' feature please contact ChristianK, the developer of that feature.