The final rescue plan - safe mode

1 Introduction

For users of the Tmall client, keeping the app stable is a very important task, and protecting the start-up stage is one of the key parts of that work.

Tmall Safe Mode is a start-up protection solution dedicated to problems such as crashes during the app's start-up phase. It provides self-healing and synchronous hot-fix capabilities.

The origin of Tmall Safe Mode

  • Problem: in production, the app sometimes hits a crash that cannot be repaired online, leaving the user unable to use the app

  • Thinking:

  1. Can we avoid this problem? Is there a way for the program to automatically fix the problem?

  2. How can we better repair similar problems?

  • Conclusion:

We need a solution that ensures the app starts smoothly and resolves major issues: Safe Mode

2 Design of Tmall Safe Mode

Tmall Safe Mode focuses on solving problems in the start-up phase. It provides a unified solution across four areas (configuration backend, client capabilities, data, and testing) and also takes compatibility between different apps into account.

2.1 Configuration backend

A unified configuration backend with a grayscale (staged) release mechanism

2.2 Client Capability

  1. Tiered self-healing, invisible to the user, when the app crashes repeatedly

  2. Synchronous hot-fix capability

  3. Ability to trigger a specified function

  4. A registration mechanism, so that Safe Mode can easily be extended later

2.3 Data statistics and alarms

  1. Unified data platform

  2. Monitoring and alarms, so that problems are discovered in time

  3. Data such as the hot-fix success rate can be viewed

2.4 Quick test

  1. Optimize pre-emptive testing

  2. Reduce the difficulty of verifying Safe Mode in each regression run

3.5 The development of Tmall Safe Mode

Safe Mode has gone through four major versions so far, and its functionality has been improved continuously. The following figure describes the main functions of each version.

3.6 Principle of Tmall Safe Mode

Apps crash for many reasons, every app is designed differently, and capturing every kind of abnormal error is very difficult. So we took a different approach and defined an abnormal exit purely from the user's point of view, by setting a flag.

  1. How to determine an abnormal exit:
  • A flag value is recorded when the app starts

  • The flag value is cleared when any of the following occurs:

The app has run normally for 10 seconds after start-up

The user exits the app normally

The user switches the app from the foreground to the background

  • If an exception occurs during the start-up phase, the flag is not cleared, so the flag can be used to determine whether the client exited abnormally.

  • Each abnormal exit increments the flag value by 1
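
The flag logic above can be sketched roughly as follows. This is a minimal Python illustration, not Tmall's implementation: the `LaunchFlag` class, the JSON file, and the `crash_count` key are hypothetical names. It assumes the flag is incremented at every launch and reset to zero by any of the clear conditions, so the value read at launch equals the number of consecutive abnormal exits.

```python
import json
import tempfile
from pathlib import Path


class LaunchFlag:
    """Hypothetical persisted flag that counts consecutive abnormal exits."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _read(self) -> int:
        try:
            return int(json.loads(self.path.read_text())["crash_count"])
        except (OSError, ValueError, KeyError, TypeError):
            return 0  # a missing or corrupt flag is treated as "no crashes"

    def _write(self, count: int) -> None:
        self.path.write_text(json.dumps({"crash_count": count}))

    def on_launch(self) -> int:
        """Return the consecutive abnormal exits so far, then mark this launch."""
        count = self._read()
        # Assume the worst: if clear() is never reached, this launch counts as a crash.
        self._write(count + 1)
        return count

    def clear(self) -> None:
        """Call after 10 s of normal running, a normal exit, or a move to the background."""
        self._write(0)


flag = LaunchFlag(Path(tempfile.mkdtemp()) / "launch_flag.json")
# Two launches crash before clear() is reached; the third one survives.
history = [flag.on_launch(), flag.on_launch(), flag.on_launch()]
flag.clear()                     # the healthy third launch clears the flag
after_clear = flag.on_launch()   # the next launch starts from zero again
```

Writing the incremented value at launch, rather than in a crash handler, means the count stays correct even when the process is killed without any chance to run code.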

  2. Tiered execution policy for Safe Mode:
  • Safe Mode applies a tiered strategy based on the flag value. There are currently two levels: 2 consecutive crashes trigger level-1 Safe Mode, and 3 or more consecutive crashes trigger level-2 Safe Mode

  • Business lines can register behaviors to run in level-1 Safe Mode, for example clearing a business's cached data. When level-1 Safe Mode is entered, Safe Mode automatically calls the registered behaviors to try to repair the client

  • If level-1 Safe Mode cannot repair the app, it enters level-2 Safe Mode, which restores the app to its freshly installed state by emptying the three root directories Document, Library, and Cache
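
A rough Python sketch of this tiered policy is shown below. The names `safe_mode_level`, `SafeMode`, `register`, and `empty_directories` are hypothetical, and the sketch assumes that level 1 runs only the registered repair actions while level 2 only empties the given root directories.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def safe_mode_level(crash_count: int) -> int:
    """Map consecutive crashes to a level: 2 crashes -> 1, 3 or more -> 2, else 0."""
    if crash_count >= 3:
        return 2
    return 1 if crash_count == 2 else 0


def empty_directories(roots: Iterable[Path]) -> None:
    """Delete everything inside each root directory but keep the roots themselves."""
    for root in roots:
        for child in root.iterdir():
            if child.is_dir() and not child.is_symlink():
                shutil.rmtree(child)
            else:
                child.unlink()


class SafeMode:
    def __init__(self, data_roots: List[Path]) -> None:
        self.data_roots = data_roots
        self._level_one_actions: List[Callable[[], None]] = []

    def register(self, action: Callable[[], None]) -> None:
        """Let a business line register a level-1 repair action."""
        self._level_one_actions.append(action)

    def run(self, crash_count: int) -> int:
        level = safe_mode_level(crash_count)
        if level == 1:
            for action in self._level_one_actions:
                try:
                    action()
                except Exception:
                    pass  # one failing repair should not block the others
        elif level == 2:
            empty_directories(self.data_roots)
        return level


root = Path(tempfile.mkdtemp())
(root / "cache.db").write_text("stale")
calls: List[str] = []
mode = SafeMode([root])
mode.register(lambda: calls.append("clear business cache"))
levels = [mode.run(1), mode.run(2), mode.run(3)]
remaining = list(root.iterdir())
```

In this run, one crash does nothing, two crashes call the registered action, and three crashes empty the data root.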

  3. Hot-fix execution strategy:
  • Old hot-fix strategy: triggered in level-2 Safe Mode

Problem: the hot fix only triggers after 3 consecutive crashes, but when something goes wrong, too few users will open the app that many times. Can we repair faster?

  • New hot-fix strategy:

The hot fix is decoupled from any specific level: as soon as a configuration requiring a hot fix is found, the app blocks and applies the hot fix synchronously, which keeps the repair timely
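
The difference between the two strategies can be illustrated with a short Python sketch. All function names and the shape of the configuration are hypothetical; how the configuration is obtained and how a patch is applied are outside the sketch and passed in as callables.

```python
from typing import Callable, List, Optional


def should_hotfix_old(crash_count: int, hotfix_config: Optional[dict]) -> bool:
    """Old strategy: hot fix only once level-2 Safe Mode is reached (3+ crashes)."""
    return hotfix_config is not None and crash_count >= 3


def should_hotfix_new(crash_count: int, hotfix_config: Optional[dict]) -> bool:
    """New strategy: any hot-fix configuration triggers the fix, whatever the level."""
    return hotfix_config is not None


def start_app(
    crash_count: int,
    hotfix_config: Optional[dict],
    apply_hotfix: Callable[[dict], None],
    continue_startup: Callable[[], None],
) -> None:
    if hotfix_config is not None and should_hotfix_new(crash_count, hotfix_config):
        apply_hotfix(hotfix_config)  # synchronous: start-up waits until this returns
    continue_startup()


events: List[str] = []
start_app(
    crash_count=0,
    hotfix_config={"patch": "example"},  # placeholder configuration
    apply_hotfix=lambda config: events.append("hotfix"),
    continue_startup=lambda: events.append("startup"),
)
```

With the new strategy the patch is applied before start-up continues even though the app has not crashed yet, whereas the old check would have waited for the third consecutive crash.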

  4. Grayscale scheme:
  • Safe Mode defines a simple grayscale strategy: the configuration contains two copies, a grayscale one and an official one, together with the grayscale probability

  • The app uses a specific algorithm to calculate whether it meets the grayscale condition; if it does, it uses the grayscale configuration, otherwise it uses the official configuration
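
The article does not say which algorithm the app uses, so the Python sketch below swaps in one common choice as an assumption: hash a stable device identifier into a bucket in [0, 1) and compare it with the grayscale probability. The payload keys `gray`, `official`, and `gray_probability` and the function names are hypothetical.

```python
import hashlib


def in_gray_group(device_id: str, probability: float) -> bool:
    """Deterministically place a device in the grayscale group with the given probability."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < probability


def select_config(payload: dict, device_id: str) -> dict:
    """Pick the grayscale copy for devices in the group, otherwise the official copy."""
    if in_gray_group(device_id, payload["gray_probability"]):
        return payload["gray"]
    return payload["official"]


payload = {
    "gray": {"name": "gray"},
    "official": {"name": "official"},
    "gray_probability": 0.1,
}
chosen = select_config(payload, "device-123")
```

Hashing a stable identifier, instead of drawing a random number at each launch, keeps a given device in the same group for as long as the probability is unchanged.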

3.7 Thoughts on the ease of use of Tmall Safe Mode

At first we did not pay particular attention to ease of use, because the first two versions were integrated only by the Tmall app and differences between apps did not need to be considered. When we started integrating with other apps in the group, however, we found that their requirements differed considerably, and we also found shortcomings in Safe Mode. So we added ease-of-use considerations, mainly in the following areas:

  1. Access cost
  • From the integrator's point of view: improve the documentation and redefine the interfaces so that they are simple and clear, lowering the cost of access
  2. Unified configuration backend
  • To make configuration information easy to access, we use the Alibaba Cloud CDN service to build a unified configuration center, with configuration per app and per version
  3. Customization
  • Because apps differ in positioning and actual needs, Safe Mode supports customized features and lets the integrating side decide the specific behavior

  • For example, Tmall believes users mainly care whether the main functions are available, not whether the app has entered Safe Mode, so it does not show a separate prompt page. In talking with other apps, however, we found that some still want such a prompt page to tell the user what was done

  4. Grayscale mechanism
  • Safe Mode was initially positioned only to solve start-up crashes. In talking with other apps in the group, however, we found that they also need it to solve problems when the app has not crashed, and releasing a fix directly to all users is very risky. A grayscale mechanism was therefore urgently needed, so we implemented it in version 4.0
  5. Data analysis
  • Using the group's unified data platform makes it easy for integrators to query the relevant data, and also provides a basis for improving Safe Mode
  6. Quick test
  • Special handling was added for testing (simulating consecutive crashes) to improve test efficiency

Conclusion

Safe Mode has now reached version 4.0 and has been online for more than six months, effectively safeguarding the start-up of the Tmall app. We will continue to polish Safe Mode so that it protects the app even better.
