FI failed again this month, but for a different reason this time.
To summarize the problems so far:
- Last month's FI run: Failed due to database space issues
-- The database server ran out of space due to an unrelated issue (FI itself does not use much space).
-- Extra cleanups had already been added to trim certain tables; however, the actual space issue was not due to database size.
-- Even if the database server had not crashed, the run would likely have failed due to memory usage.
- Last month's manual FI rerun by sel: Failed due to memory usage
-- At the time, FI used a significant amount of memory when executed, and crashed when it hit the limit of 2.5 GB.
--- This was mainly due to three things:
--- an ever-increasing number of facilities and characters
--- a leaky way pending income was stored
---- This has been fixed.
--- the large query that finds all facilities to be considered for income
---- This has been fixed by paginating the query so that not all facilities reside in memory at once.
- Last month's asim test run(s): Failed due to an nginx timeout, and also due to a database transaction issue
-- asim tool nginx timeout: An admin tool was added to allow asims to run and test-run FI. Asims being able to run fincome is important because it means less work for our simmaster when things go pear-shaped. However, during the test the tool hit an issue: if your page is inactive for 60 seconds, the nginx proxy cuts you off (and you usually end up with a blank page). The test on main failed because the query that fetches the facilities for consideration runs for longer than 60 seconds, due to the size of the result set that needs sorting. This query was already paginated, but will now also need to be partitioned.
--- Raised related issue #4775: this will increase how long the run takes, but will make it more "live", reducing the chance that the run is killed by nginx.
-- database transaction issue: When an error occurs during a run, no changes should be persisted. During the test run this did not hold, and partial changes were persisted. This is critical to fix because if an asim-executed fincome run fails, it would be very hard for an asim to manually revert partial income.
--- Raised related issue #4774 to fix the transaction integrity issue.
--- Raised related issue #4777: transaction integrity issue due to how fincome taxes are calculated.
- This month's FI run: Failed due to database congestion causing a lock timeout
-- A lot of things happen at the same time on the server. When two things conflict, one of them wins and one loses, and the loser usually sees a "Lock Timeout" or a "Deadlock". Something caused the FI job to "lose", which killed the fincome run. The deadlock occurred during the repossession of a facility, but that detail isn't very important.
-- The direct cause of the lock contention is not yet confirmed. Most likely it was caused by a job called "newAccountWatch". This job has already been disabled, and we should see fewer timeouts when jobsdaily runs if it was indeed the cause.
--- Raised related issue #4771 to tweak the lock configuration so that people using the website have lower priority than jobs. This will make it less likely that a player can interrupt a long-running job.
--- Raised related issue #4772 to gain better visibility into job timing.
--- Raised related issue #4776 to stagger jobs so there is less chance of one trampling another.
--- Raised related issue #4773 to reduce the amount of unnecessary work done in the main FI code. Repossession of facilities can be moved into a separate job.
-- Resource contention was still present after newAccountWatch was disabled. The new suspect is jobsdaily "purgeInactivePlayers", which has also been disabled.
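
For anyone curious what paginating the facility query looks like in principle, here is a minimal sketch in Python with sqlite3. This is not the actual FI code: the facility table, its columns, and iter_facility_pages are made up for illustration. It pages by id (keyset pagination), so only one page of facilities is held in memory at a time.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_facility_pages(
    conn: sqlite3.Connection, page_size: int
) -> Iterator[List[Tuple[int, str]]]:
    """Yield facilities one page at a time, ordered by id.

    Hypothetical schema: facility(id INTEGER PRIMARY KEY, name TEXT).
    Assumes ids are positive, so 0 is a safe starting point.
    """
    last_id = 0
    while True:
        # Only rows after the last id we have seen, capped at page_size.
        rows = conn.execute(
            "SELECT id, name FROM facility WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]


def demo_pages() -> List[int]:
    """Build a tiny in-memory table and return the size of each page."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facility (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO facility (id, name) VALUES (?, ?)",
        [(i, f"facility-{i}") for i in range(1, 8)],
    )
    return [len(page) for page in iter_facility_pages(conn, 3)]
```

With 7 facilities and a page size of 3, demo_pages() returns [3, 3, 1]. Paging on the primary key rather than on an offset means each query only has to sort and return a small slice, which is also the general idea behind splitting a long-running query into shorter ones.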
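
The transaction integrity problem from the asim test run (partial changes persisted after an error) comes down to applying the whole payout as one atomic unit. A minimal sketch, again in Python with sqlite3 and not the actual FI code: the characters table, the credits column, and pay_income are hypothetical.

```python
import sqlite3
from typing import Iterable, List, Tuple


def pay_income(conn: sqlite3.Connection, payouts: Iterable[Tuple[int, int]]) -> None:
    """Apply all payouts atomically: every row is updated, or none are.

    Hypothetical schema: characters(id INTEGER PRIMARY KEY, credits INTEGER).
    """
    # Used as a context manager, the connection commits if the block
    # finishes normally and rolls back if an exception escapes it.
    with conn:
        for character_id, amount in payouts:
            if amount < 0:
                raise ValueError(f"negative payout for character {character_id}")
            conn.execute(
                "UPDATE characters SET credits = credits + ? WHERE id = ?",
                (amount, character_id),
            )


def demo_rollback() -> List[Tuple[int]]:
    """Run a payout that fails halfway and return the resulting balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE characters (id INTEGER PRIMARY KEY, credits INTEGER)")
    conn.executemany("INSERT INTO characters VALUES (?, ?)", [(1, 100), (2, 100)])
    conn.commit()
    try:
        # The first payout is applied, the second one raises.
        pay_income(conn, [(1, 50), (2, -1)])
    except ValueError:
        pass
    return conn.execute("SELECT credits FROM characters ORDER BY id").fetchall()
```

demo_rollback() returns [(100,), (100,)]: the first update is undone when the second payout fails, so nobody is left holding partial income that would need to be reverted by hand.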