Benjamin

A new engineer who occasionally shares thoughts on work (that's the plan, anyway?)

Work Notes - Emergency Release

Not long after I got to work on Wednesday, I received a message from my supervisor: an error message from the production service host.

To keep maintenance and management costs down and to improve stability, many companies now use Amazon's AWS services, such as cloud servers and cloud databases, and our company is no exception.

After logging into the account, you can see the machines' operation logs, including the exception information recorded when a program fails, which is used to track down the cause of the error. The supervisor assigns each logged exception to the engineer responsible for developing that part of the program, who then makes the fix.


What I ran into this time was that the program had not been thoroughly checked during development, and the wrong data source was passed in as a parameter.

The failing function can be roughly divided into three parts: the first and second parts update data, and the third part deletes data. The problem occurred in the second part, so its update was left unfinished and the third part never ran. Because the process did not complete as intended, the data fell out of sync, and data that should have been deleted could still be found in search, which is an obvious error.
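To make the failure mode concrete, here is a minimal sketch in Python. It is not the actual production code: the function `retire_item`, the two dictionaries standing in for data stores, and the `operator` key are all hypothetical names invented for illustration. It only shows how an exception in the second part leaves the first part applied and the third part skipped.

```python
def retire_item(primary, search_index, item_id, source):
    """Hypothetical three-part process: two updates, then a delete."""
    # Part 1: update the primary record.
    primary[item_id]["status"] = "retired"
    # Part 2: a second update that reads from `source`.
    # Passing the wrong data source raises KeyError here.
    primary[item_id]["retired_by"] = source["operator"]
    # Part 3: delete the entry from search. Never reached if part 2 fails.
    del search_index[item_id]


def run_demo():
    # In-memory stand-ins for the real data stores (assumed, not from the post).
    primary = {"item-1": {"status": "active"}}
    search_index = {"item-1": {"status": "active"}}
    wrong_source = {"user": "benjamin"}  # lacks the "operator" key

    try:
        retire_item(primary, search_index, "item-1", wrong_source)
    except KeyError as exc:
        failure = f"part 2 failed: missing key {exc}"
    else:
        failure = None

    # Part 1 took effect but part 3 did not, so the stores are out of sync.
    return failure, primary["item-1"]["status"], "item-1" in search_index
```

Calling `run_demo()` returns a failure message together with a record that is already marked as retired and a search entry that still exists, which matches the out-of-sync state described above. Checking that `source` contains the expected fields before part 1 runs would be one way to avoid the partial update.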


Understandably, the later this problem is fixed, the more users may run into it. The normal release process, the D→S→P order mentioned earlier, takes too long because of its acceptance steps, so this kind of production problem goes through the emergency release process: fix the problem and release the update as soon as possible.

After the fix is developed, first merge the code from the local machine into the D environment. Once the problem is confirmed to be solved, merge from the local machine into the P environment, that is, release it to production.
The purpose of this approach is to ensure that the release changes only the part of the program where the error occurred and does not cause further problems or version conflicts with other services. One example is the data format used between the back end and the front end: a major change to it must be checked and accepted on both sides together before everything works normally.

Keeping the programs that users rely on stable is the most important purpose of layered acceptance.


As mentioned earlier, the interruption in the second part kept the third part from deleting the data, and this leftover data also has to be cleaned up.

But production is production, and developers are not allowed to read or write the database directly. To delete the data, I first had to file a data change order in the system to record the content of the change, the statements used to change the data, the reason for the change, and so on. Then I wrote a data deletion script for the supervisor to run, and the supervisor replied with whether the deletion succeeded or failed, along with the message that came back.
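As a rough illustration of what such a deletion script might look like, here is a sketch using Python's built-in `sqlite3` module as a stand-in for the real database. The `items` table, the `make_demo_db` helper, and the `delete_stale_rows` function are assumptions made for this example, not the actual script. The idea is that the script either deletes exactly the expected rows or changes nothing, and always returns a message the supervisor can pass back.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database with a hypothetical `items` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
    )
    conn.commit()
    return conn


def delete_stale_rows(conn, ids):
    """Delete the given rows in one transaction and report the outcome."""
    placeholders = ",".join("?" for _ in ids)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            cur = conn.execute(
                f"DELETE FROM items WHERE id IN ({placeholders})", list(ids)
            )
            # Refuse a partial match so nothing unexpected is removed.
            if cur.rowcount != len(ids):
                raise ValueError(
                    f"expected to delete {len(ids)} rows, matched {cur.rowcount}"
                )
    except (sqlite3.Error, ValueError) as exc:
        return {"ok": False, "message": str(exc)}
    return {"ok": True, "message": f"deleted {len(ids)} rows"}
```

For example, `delete_stale_rows(make_demo_db(), [1, 2])` reports success, while asking for an id that does not exist reports a failure and leaves the table untouched. The returned dictionary plays the role of the success or failure reply described above.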

By then, several hours had passed.


If I can, I really want to be more careful and make as few mistakes as possible. The procedures required for a fix are somewhat cumbersome, and the work I had planned to do next gets pushed back even further, all just to return the program to a "normal state" with no new output to show for it.

CC BY-NC-ND 2.0
