“Tomas - Yesterday at 8:37 PM
what is the holdup with FI, by the way?
honest clarr - Yesterday at 8:53 PM
When it originally ran the database was out of space and that caused it to crash. Being out of space was unrelated to fi.
Sel tried to run it manually and it ran out of memory
honest clarr - Yesterday at 8:57 PM
I have some fixes on dev to reduce memory usage, however they need to be tested
honest clarr - Yesterday at 9:00 PM
Fi is a bit of an issue because it only runs monthly. It would be easier if it ran weekly but only gave 1/4 of the total amount. Things that don't run frequently are harder to maintain
honest clarr - Yesterday at 9:03 PM
We also need a way for either it to be split up into smaller chunks when ran, e.g. by sector and a way for asims to manually run it. Right now if an asim ran it, it would crash the game for every player for about 20 mins
Michael MaCleod - Yesterday at 9:09 PM
Silly question, but would it be possible to basically kill everything but the FI, let it run, then resume normal operations?
honest clarr - Yesterday at 9:18 PM
because of its runtime and memory usage, it can't be executed directly from the browser, which makes it difficult
it needs to be scheduled to run as a job by the server (like how travel finishes at wonky times) but even then the thing that finishes travel does not have the appropriate settings to be able to run fi”
Edit: apparently Clarr's symbol in his name broke my post.
Edited By: Tomas O`Cuinn on Year 19 Day 41 14:55
____________
It might be fixed in time for the next FI run. Regardless, we now have tools where we can do test runs on main (which won't send events/credits but will give us an idea of memory usage and whether the process finished correctly) and then can run "real" FI runs as many times as necessary to catch up. Those tools will come across with the next sync and we should be able to correct this issue then, and hopefully avoid similar future issues.
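To illustrate what a test run of this kind could look like, here is a minimal sketch. It is not the actual FI tool: the `facility` and `account` tables, the `run_income` function and the use of SQLite are all assumptions made for the example. The idea is that a test run does the same work as a real run but rolls back at the end, while reporting whether it finished and how much Python memory it peaked at.

```python
import sqlite3
import tracemalloc


def run_income(conn, dry_run):
    """Credit facility income to owners; a dry run discards all changes."""
    tracemalloc.start()
    try:
        # Hypothetical schema: facility(owner, income), account(owner, credits).
        rows = conn.execute("SELECT owner, income FROM facility").fetchall()
        for owner, income in rows:
            conn.execute(
                "UPDATE account SET credits = credits + ? WHERE owner = ?",
                (income, owner),
            )
        if dry_run:
            conn.rollback()  # test run: nothing is persisted
        else:
            conn.commit()  # real run: persist the payout
        finished = True
    except Exception:
        conn.rollback()  # failed run: nothing is persisted either
        finished = False
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"finished": finished, "peak_bytes": peak}


def demo_db():
    """Build a tiny in-memory database for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE facility (owner TEXT, income INTEGER)")
    conn.execute("CREATE TABLE account (owner TEXT PRIMARY KEY, credits INTEGER)")
    conn.executemany(
        "INSERT INTO facility VALUES (?, ?)", [("a", 100), ("a", 50), ("b", 30)]
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 0), ("b", 0)])
    conn.commit()
    return conn
```

Calling `run_income(demo_db(), dry_run=True)` reports on the run without changing any balances; the same call with `dry_run=False` pays out for real.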
FI failed again this month, but for a different reason this time.
To summarize the problems so far:
- Last month's FI run: Failed due to database space issues
-- Database server ran out of space due to an unrelated issue. (FI itself does not use much space).
-- Extra cleanups were already added to trim certain tables, however the actual space issue was not due to database size.
-- Had the database server not crashed, the run would likely have failed anyway due to memory usage.
- Last month's manual FI rerun by Sel: Failed due to memory usage
-- At the time, FI used a significant amount of memory when executed, and crashed when it hit the limit of 2.5 GB.
--- This is mainly due to three things:
--- an ever increasing number of facilities and characters
--- a leaky way pending income was stored
---- This has been fixed.
--- the large query that finds all facilities to be considered for income
---- This has been fixed by paginating the query so all facilities do not reside in memory at once.
- Last month's test asim run(s): Failed due to an nginx timeout, and also due to a database transaction issue
-- asim tool nginx timeout: An admin tool was added to allow asims to run and test-run FI. Asims being able to run fincome is important because it means less work for our simmaster when things go pear-shaped. However, during the test the tool hit an issue: if your page is inactive for 60 seconds, the nginx proxy cuts you off (and you usually end up with a blank page). The test on main failed because the query that fetches the facilities for consideration runs for longer than 60 seconds, due to the size of the result set that needs sorting. This query was already paginated, but will now also need to be partitioned.
--- Raised Related Issue: #4775: this will increase the duration the run takes, but will make it more "live", reducing the chance the run is killed by nginx
-- database transaction issue: When an error occurs during a run, no changes should be persisted. That is not what happened in the test run: partial changes were persisted. This is critical to fix because if an asim-executed fincome run fails, it would be very hard for an asim to manually revert partial income.
--- Raised Related issue: #4774 to fix the transaction integrity issue.
--- Raised Related issue: #4777 transaction integrity issue due to how fincome taxes are calculated.
- This month's FI run: Failed due to database congestion causing a lock timeout
-- A lot of things happen at the same time on the server. When two things conflict, one of them wins and one loses, and the loser usually sees a "Lock Timeout" or a "Deadlock". Something caused the FI job to "lose", which killed the fincome run. The deadlock occurred during the repossession of a facility, but that detail is not especially important.
-- We are unsure what was directly holding the contended lock.
-- Most likely it was contention caused by a job called "newAccountWatch". This job has already been disabled, and if it was indeed the cause we should see fewer timeouts when jobsdaily runs.
--- Raised Related Issue: #4771 to tweak the lock configuration so that people using the website have lower priority than jobs. This will make it less likely that a player can interrupt a long-running job.
--- Raised Related issue: #4772 to gain better visibility into jobs timing
--- Raised Related issue: #4776 stagger jobs so there is less chance of one trampling another one
--- Raised Related issue: #4773 to reduce the amount of unnecessary work done in the main FI code; repossession of facilities can be moved into a separate job
-- Resource contention was still present after newAccountWatch was disabled. The new suspect is the jobsdaily task "purgeInactivePlayers", which has also been disabled.
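
For anyone curious how two of the fixes above fit together (paginating the facility query, and not persisting partial changes when a run fails), here is a rough sketch. It is not the real fincome code: the table and column names, the page size, the "negative income" failure and the use of SQLite are all assumptions for illustration. Facilities are read one page at a time, keyed on the last id seen, so they never all sit in memory at once, and the whole payout is a single transaction that is rolled back on any error.

```python
import sqlite3

PAGE_SIZE = 2  # tiny for the demo; a real run would use a far larger page


def iter_facility_pages(conn, page_size=PAGE_SIZE):
    """Yield facilities page by page, ordered by id (keyset pagination)."""
    last_id = 0
    while True:
        page = conn.execute(
            "SELECT id, owner, income FROM facility "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield page
        last_id = page[-1][0]  # resume after the last id on this page


def pay_income(conn):
    """Pay every facility in one transaction; any error undoes all of it."""
    try:
        for page in iter_facility_pages(conn):
            for facility_id, owner, income in page:
                if income < 0:  # stand-in for any mid-run failure
                    raise ValueError(f"bad income for facility {facility_id}")
                conn.execute(
                    "UPDATE account SET credits = credits + ? WHERE owner = ?",
                    (income, owner),
                )
        conn.commit()
        return True
    except Exception:
        conn.rollback()  # no partial income is left behind
        return False


def make_demo_db(facilities):
    """Build a tiny in-memory database from (id, owner, income) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE facility (id INTEGER PRIMARY KEY, owner TEXT, income INTEGER)"
    )
    conn.execute("CREATE TABLE account (owner TEXT PRIMARY KEY, credits INTEGER)")
    conn.executemany("INSERT INTO facility VALUES (?, ?, ?)", facilities)
    conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 0), ("b", 0)])
    conn.commit()
    return conn
```

One trade-off worth noting: keeping the whole run in a single transaction gives all-or-nothing behaviour, but it also holds locks for longer, which is the kind of contention described above. Committing per partition would shorten lock times at the cost of needing some other way to undo or resume a half-finished run.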