Learning Network Optimization for Mobile from Ctrip

Introduction

Ctrip's travel services are used by people around the world, so network optimization is its most important performance and user experience work. What follows is Ctrip's account of its exploration of network and application architecture optimization:

Highly reliable, low-latency App network services are essential to the steady growth of the wireless business. Over the past two years we have continuously optimized the performance of the App's network services, and by the end of Q2 this year we had largely met the goals of this phase of network service channel management and performance optimization. This article summarizes the lessons learned, to lay a foundation for future work.

Ctrip App wireless network service architecture

In 2014, Ctrip developed Mobile Gateway for its wireless services. There are two types: TCP Gateway and HTTP Gateway. TCP Gateway is designed for the native network services in the App and uses an application-layer protocol built on top of TCP, similar to an RPC mechanism. TCP Gateway combines the functions of an access layer and dynamic service routing. The access layer is implemented on Netty and manages the client's long-lived or short-lived TCP connections. Dynamic routing is implemented on Netflix's open source Zuul, which lets TCP Gateway provide routing, monitoring, anti-crawling, user authentication, resiliency, and security functions.

When a TCP service request arrives at the TCP Gateway, it is forwarded to the corresponding back-end service cluster according to the service number in the packet header, which decouples the back-end services. TCP Gateway forwards to the back-end business service cluster over an HTTP interface: the complete packet of the TCP service request is forwarded as the payload of an HTTP request, and when the HTTP response arrives, its payload is returned in full over the corresponding TCP connection.

HTTP Gateway serves the network services of Hybrid pages in the App and of the H5 web site. It exposes HTTP RESTful interfaces, and its logic is relatively simple: the core is dynamic forwarding of HTTP services.

For more details on the design of the Mobile Gateway, see Wang Xingchao's talk “Ctrip wireless Gateway” at QCon Shanghai 2015:

http://www.infoq.com/cn/presentations/ctrip-wireless-gateway

Implementation of App Network Service Based on TCP Protocol

Bandwidth and latency are the two factors that affect network service performance. Bandwidth is limited by the narrowest link in the network channel. Latency is the round-trip time of a packet between the client and the server. Bandwidth and latency differ greatly across network types (see below).

To improve network service performance with respect to bandwidth and latency, there are only two things we can do: choose the most suitable network channel whenever possible, and otherwise optimize how we use the channel we have.

Traditional non-IM Apps usually implement network services over HTTP (in the form of RESTful APIs). Ctrip implements them over TCP, which does add considerable development cost, such as designing an application-layer protocol, managing network connections, and handling exceptions. The following reasons nevertheless led us to build the App's network services on TCP:

Ctrip users are sometimes in scenic areas with very poor network conditions, so we need to optimize for weak networks, which is hard to do with a plain HTTP application-layer protocol.
The first HTTP request requires DNS resolution. We found that in the domestic environment, resolution of Ctrip domain names fails 2-3% of the time (including domain hijacking and resolution failures), which seriously affects the user experience.
HTTP is an application-layer protocol built on TCP. Its advantages are good encapsulation and mature client and server implementations. Its disadvantage is limited control: connecting, sending requests, and receiving responses cannot be customized or optimized. Even HTTP features such as Keep-Alive persistent connections and pipelining are constrained by proxies in the network path or by the server implementation, so they rarely deliver their full benefit.

Building on TCP gives us complete control over every stage of the network service life cycle:

  • Obtain the server's IP address

  • Establish the connection

  • Serialize the network request message

  • Send the network request

  • Receive the network response

  • Deserialize the network response message

Our network service channel management and optimization work addresses each of these stages.

TCP network service channel management and performance optimization

  1. Say goodbye to DNS and use IP addresses directly

For an HTTP-based network service, the first step of the first request is DNS resolution. Our statistics show a DNS resolution success rate of only 98%; the remaining 2% are resolution failures or DNS hijacking (the Local DNS returns an IP address that is not the origin's). DNS resolution also takes about 200 milliseconds on 3G and about 100 milliseconds on 4G, a noticeable delay. Because we connect over TCP, we skip DNS resolution and connect using a built-in IP list.

The Ctrip App has a built-in list of Server IPs, each with a weight. Each time a new connection is established, the IP with the highest weight is chosen. When the App starts, all IPs in the list have the same weight. A round of Ping operations is then started, and each IP's weight is calculated from its Ping latency. The principle is that an IP with a smaller Ping value should also have a relatively smaller network transmission delay. Some in the industry also use HTTP DNS to solve DNS hijacking while returning the Server IP best suited to the user's network. However, developing and deploying HTTP DNS carries a considerable cost, so we do not currently use it.

The built-in Server IP list can also be updated. Each time the App starts, a Mobile Config service (available over both TCP and HTTP) updates the Server IP list, and it supports separate Server IP lists for different product lines. The multi-IDC traffic distribution that traditional DNS resolution provides can therefore also be achieved this way.
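
The weighting scheme described above can be sketched roughly as follows. This is only an illustration, not Ctrip's implementation: the inverse-latency formula, the `ServerIP` class, the function names, and the sample addresses and Ping values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ServerIP:
    address: str
    weight: float = 1.0  # every IP starts with the same weight at App launch


def update_weights(servers, ping_ms):
    """Recompute weights from Ping latency: lower latency gives a higher weight."""
    for server in servers:
        latency = ping_ms.get(server.address)
        if latency is None:
            # Assumption: an IP whose Ping failed is given the lowest weight.
            server.weight = 0.0
        else:
            # Assumed formula: weight is inversely proportional to latency.
            server.weight = 1000.0 / max(latency, 1.0)


def pick_server(servers):
    """Choose the highest-weighted IP for the next new connection."""
    return max(servers, key=lambda s: s.weight)


# Hypothetical built-in list (documentation-range addresses) and Ping results.
servers = [ServerIP("203.0.113.10"), ServerIP("203.0.113.20"), ServerIP("203.0.113.30")]
update_weights(servers, {"203.0.113.10": 120.0, "203.0.113.20": 45.0, "203.0.113.30": None})
best = pick_server(servers)
print(best.address)
```

A list downloaded from the Mobile Config service could replace `servers` at startup, after which the same Ping round would re-rank it.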

  2. Socket connection optimization, reducing connection time

As with the Keep-Alive feature in HTTP, the most direct way to reduce network service time is to keep connections open. Each TCP three-way handshake takes one RTT (round-trip time) to complete, which means a delay of 100-300 milliseconds. TCP's Slow Start mechanism, which exists to handle network congestion, also hurts the transmission performance of a new connection.

The Ctrip App manages long connections through a long connection pool, which maintains several TCP connections to the server. Each network service takes an idle long connection from the pool, completes the service, and then puts the TCP connection back into the pool. We do not implement pipelining or multiplexing on a single TCP connection. Instead, we use the simplest FIFO mechanism, for two reasons:

It simplifies the service processing logic of Mobile Gateway and reduces development cost;
When the server returns several responses at the same time and one response packet is very large, using multiple long connections speeds up receiving the responses.
If the long connections are in use when a network service is initiated, or if a network service on a long connection fails, a short TCP connection is opened to carry the service. A short connection differs from a long one only in that it is closed as soon as the service completes.
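
A minimal sketch of the pool behaviour described above, assuming a stand-in `Connection` class instead of real sockets. The class and method names are hypothetical and chosen only for the example.

```python
class Connection:
    """Stand-in for a TCP connection; a real client would wrap a socket here."""

    def __init__(self, long_lived):
        self.long_lived = long_lived
        self.closed = False

    def close(self):
        self.closed = True


class LongConnectionPool:
    def __init__(self, size):
        # Long connections are opened up front and reused across services.
        self._idle = [Connection(long_lived=True) for _ in range(size)]

    def acquire(self):
        """Return an idle long connection, or a fresh short connection if none is idle."""
        if self._idle:
            return self._idle.pop(0)
        return Connection(long_lived=False)

    def release(self, conn):
        """Put a healthy long connection back; a short connection is simply closed."""
        if conn.long_lived and not conn.closed:
            self._idle.append(conn)
        else:
            conn.close()


pool = LongConnectionPool(size=1)
first = pool.acquire()    # takes the only long connection
second = pool.acquire()   # pool is busy, so this is a short connection
pool.release(second)      # closed once its service completes
pool.release(first)       # returned to the pool for reuse
```

Each connection carries one request at a time, which matches the no-pipelining, no-multiplexing choice explained above.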

Pipelining and multiplexing are different. HTTP/1.1 supports pipelining: the client can send several requests at once, but the server must return responses in the order the requests were sent. SPDY and HTTP/2 support multiplexing: responses can be returned out of order, and sending requests and receiving responses do not interfere with each other, which avoids the head-of-line blocking problem that HTTP/1.1 pipelining cannot fully solve.

References

http://stackoverflow.com/questions/10362171/is-spdy-any-different-than-http-multiplexing-over-keep-alive-connections

https://http2.github.io/faq/

The HTTP/1.1 pipelining feature mentioned in Reference 2 only partially resolves the head-of-line blocking problem, because a large or slow response can still block the responses behind it.

  3. Weak network and network jitter optimization

The Ctrip App introduces a network quality parameter, calculated from the network type and the end-to-end Ping value, and changes its network service strategy according to network quality:

Adjust the size of the long connection pool: for example, on a 2G/2.5G EDGE network the pool is reduced to 1 connection (operators limit the number of TCP connections to a single target IP), while on a WIFI network the pool size can be increased.
Dynamically adjust the TCP connect, write, and read timeouts.
When the network type switches, for example between WIFI and a mobile network, or from 4G/3G to 2G, the client's IP address changes and the TCP sockets that are already connected are bound to become invalid (each socket corresponds to a four-tuple: source IP, source port, destination IP, destination port). All idle long connections are therefore closed automatically, and in-flight network services are retried automatically according to their state.
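
As a rough illustration of the first two points, the strategy could be chosen from the network type and Ping value as below. The article gives only the pool size of 1 for 2G/EDGE; every other threshold, pool size, and timeout here is a made-up value for the example, and `Strategy` and `choose_strategy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    pool_size: int            # number of long connections to keep
    connect_timeout_s: float
    write_timeout_s: float
    read_timeout_s: float


def choose_strategy(network_type, ping_ms):
    """Map network quality to a service strategy (all numbers are illustrative)."""
    if network_type in ("2G", "EDGE"):
        # Operators limit TCP connections per target IP, so keep a single one
        # and allow much longer timeouts.
        return Strategy(1, 15.0, 20.0, 30.0)
    if ping_ms > 300:
        # Assumed threshold for a weak or jittery network.
        return Strategy(2, 10.0, 15.0, 20.0)
    if network_type == "WIFI":
        return Strategy(4, 5.0, 5.0, 10.0)
    return Strategy(3, 5.0, 8.0, 15.0)


edge = choose_strategy("EDGE", 500.0)
wifi = choose_strategy("WIFI", 30.0)
print(edge.pool_size, wifi.pool_size)
```

On a network type switch, a client following the third point would close its idle long connections and call `choose_strategy` again for the new network.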

  4. Data format optimization, reducing the amount of data transmitted and serialization time

The smaller the amount of data transferred, the shorter the transmission time on the same TCP connection. Ctrip previously used a data format of its own design. A later comparison with Google Protocol Buffers found that, for specific data types, Protocol Buffers reduced packet size by 20-30% and serialization and deserialization time by 10-20%, so core services are now gradually migrating to the Protocol Buffers format. Facebook has also shared its practice of using the FlatBuffers data format to improve performance, but our analysis found it unsuitable for Ctrip's business scenarios, so we do not use it.

  5. Introduce a retry mechanism to improve the success rate of network services

Inspired by the retransmission mechanism that TCP uses to guarantee reliable transmission, we introduced a retry mechanism at the application level to improve the success rate of network services. We found that more than 90% of network service failures are caused by connection failures, so a retry has a chance to connect successfully and complete the service. We also found that failures in three of the life cycle stages listed earlier (establishing the connection, serializing the request message, and sending the request) can be retried automatically, because we can be sure the request has not yet reached the server for processing and no idempotency problem arises (an idempotency problem would lead to duplicate orders and the like). When a network service is retried, it uses a short connection rather than a long connection.
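
The retry rule can be sketched as below: only failures in the three pre-server stages are retried, and the retry switches to a short connection. The `Stage` enum, `ServiceError`, `call_with_retry`, and the single-retry default are assumptions for the example, not details given by Ctrip.

```python
from enum import Enum


class Stage(Enum):
    CONNECT = 1
    SERIALIZE = 2
    SEND = 3
    RECEIVE = 4
    DESERIALIZE = 5


# Stages where the request cannot have reached the server yet, so a retry
# cannot cause duplicate processing.
RETRYABLE = {Stage.CONNECT, Stage.SERIALIZE, Stage.SEND}


class ServiceError(Exception):
    def __init__(self, stage):
        super().__init__(f"network service failed at {stage.name}")
        self.stage = stage


def call_with_retry(attempt, max_retries=1):
    """Run attempt(use_short_connection); retry only failures in RETRYABLE stages."""
    retries = 0
    use_short_connection = False
    while True:
        try:
            return attempt(use_short_connection)
        except ServiceError as err:
            if err.stage not in RETRYABLE or retries >= max_retries:
                raise
            retries += 1
            use_short_connection = True  # retries go over a short connection


calls = []


def flaky_service(use_short_connection):
    """Hypothetical service: the first attempt fails to connect, the second succeeds."""
    calls.append(use_short_connection)
    if len(calls) == 1:
        raise ServiceError(Stage.CONNECT)
    return "ok"


result = call_with_retry(flaky_service)
print(result, calls)
```

A failure in the receive or deserialize stage is raised to the caller immediately, since the server may already have processed the request.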

With the mechanisms above, the Ctrip App's network service success rate rose from the original 95.3%+ to 99.5%+ today, a significant improvement. (The success rate here is the end-to-end service success rate: the number of services that succeed on the client divided by the total number of requests, without distinguishing network conditions.)

Other Network Services

The Ctrip App also implements a number of other network service mechanisms to make business development easier. One is a priority mechanism: high-priority services get long connections first, while low-priority services use short connections by default. Another is a dependency mechanism: services are started or canceled automatically according to their dependencies, for example sub-services are canceled automatically when the main service fails.

During development we also learned some TCP socket tricks specific to mobile platforms:

On iOS, creating a connection with the native socket interface (the POSIX socket interface) does not activate the mobile network; you must use CFSocket or a higher-level network interface to activate it. So when the Ctrip App launches, it first registers some third-party SDKs and sends an HTTP request, which activates the mobile network.

Set the socket options sensibly: SO_KEEPALIVE keeps the TCP connection alive, SO_NOSIGPIPE suppresses the SIGPIPE signal, and TCP_NODELAY turns off the TCP Nagle algorithm.
Since iOS requires support for IPv6-only networks, the native socket code must support IPv6.
If you use select to handle non-blocking IO, make sure the different return values and the timeout parameter are handled correctly.
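
The socket tips above are about the POSIX C interface on mobile platforms; the sketch below shows the same ideas with Python's standard `socket` and `select` modules purely for illustration. The helper names are hypothetical. `SO_NOSIGPIPE` exists only on BSD-derived systems and may not be exposed by Python, so the code sets it only if the constant is present.

```python
import select
import socket


def configure(sock):
    """Apply the socket options discussed above."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle
    nosigpipe = getattr(socket, "SO_NOSIGPIPE", None)  # absent on many platforms
    if nosigpipe is not None:
        sock.setsockopt(socket.SOL_SOCKET, nosigpipe, 1)


def connect(host, port, timeout_s):
    """Connect via getaddrinfo so that IPv4 and IPv6-only networks both work."""
    last_error = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        try:
            configure(sock)
            sock.settimeout(timeout_s)
            sock.connect(addr)
            return sock
        except OSError as err:
            last_error = err
            sock.close()
    raise last_error or OSError("no usable address")


def wait_readable(sock, timeout_s):
    """Return True if data is ready to read, False if select timed out."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    return bool(readable)
```

Distinguishing the timeout case (an empty result from `select`) from a readable socket is the part the last tip warns about; errors surface as exceptions in Python rather than as a negative return value.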

Heartbeat mechanism for keeping long TCP connections available: for non-IM applications a heartbeat matters little, because users keep triggering requests that use the TCP connections. In Ctrip's business scenarios in particular, our statistics show that the heartbeat has minimal impact on service time and success rate, so it is now turned off. In the original mechanism, each idle TCP connection in the long connection pool sent a heartbeat packet to the Gateway every 60 seconds and the Gateway returned a heartbeat response packet, so that both sides could confirm the connection was still valid.

Hybrid network service optimization

A considerable share of the Ctrip App's business uses Hybrid technology and runs in a WebView, where all network services (HTTP requests) are controlled by the system. We cannot control them and therefore cannot optimize them, and the end-to-end service success rate is only about 97% (note: this refers to network service requests sent by page business logic, not static resource requests).

We adopted a technique called “TCP Tunnel for Hybrid” to optimize Hybrid network services. Unlike traditional HTTP acceleration products, we do not intercept HTTP requests and resend them; instead, the switch happens automatically in the network service layer of the Ctrip Hybrid framework.

As shown in the figure, the flow of the technical solution is as follows:

If the App supports TCP Tunnel for Hybrid, the Hybrid business sends its network service through the Hybrid interface to the TCP network communication layer in the App's Native layer. This module wraps the HTTP request and forwards it to the TCP Gateway as the payload of a TCP network service.

The TCP Gateway recognizes from the service number that this is a Hybrid forwarding service, unpacks it, and forwards the payload directly to the HTTP Gateway. This is transparent to the HTTP Gateway, which does not need to distinguish between HTTP requests sent by the App directly and those forwarded by the TCP Gateway.

After the back-end business service finishes processing, the HTTP response is returned through the HTTP Gateway to the TCP Gateway, which returns it as the payload to the App's TCP network communication layer.

The TCP network communication layer then deserializes the payload and hands it back to the Hybrid framework, which finally delivers it to the Hybrid business caller through an asynchronous callback. The whole process is also transparent to the caller, which is unaware that the TCP tunnel exists.
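
The wrapping and routing steps above can be illustrated with a toy frame format. Ctrip's actual wire protocol is not described in this article, so the header layout (payload length plus service number), the service number 9001, and all function names below are assumptions made for the example.

```python
import struct

# Assumed header: 4-byte payload length and 4-byte service number, network byte order.
HEADER = struct.Struct("!II")
HYBRID_FORWARD_SERVICE = 9001  # hypothetical service number for Hybrid forwarding


def wrap(service_number, payload):
    """App side: carry an HTTP request as the payload of a TCP service."""
    return HEADER.pack(len(payload), service_number) + payload


def unwrap(frame):
    """Split a frame back into its service number and payload."""
    length, service_number = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return service_number, payload


def gateway_route(frame):
    """TCP Gateway side: pick the forwarding target from the service number."""
    service_number, payload = unwrap(frame)
    if service_number == HYBRID_FORWARD_SERVICE:
        # The HTTP Gateway receives the original HTTP request unchanged.
        return "http-gateway", payload
    return "business-service", payload


http_request = b"GET /api/hotels?city=SHA HTTP/1.1\r\nHost: example.com\r\n\r\n"
target, forwarded = gateway_route(wrap(HYBRID_FORWARD_SERVICE, http_request))
print(target)
```

Because the payload passes through untouched, neither the HTTP Gateway nor the Hybrid caller needs to know the tunnel exists, which is the transparency property the steps above rely on.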

With this solution, the network service success rate of Hybrid business in the Ctrip App rose to more than 99%, and the average time consumed fell by 30%.

Overseas network service optimization

Ctrip has not yet deployed an overseas IDC, so overseas users of the App must access services in the domestic IDC, and their average service time is significantly higher than that of domestic users. We adopted a solution called “TCP Bypass for Oversea” to optimize overseas network service performance. It mainly uses Akamai's dedicated overseas network channel, with central office equipment deployed in Ctrip's domestic IDC, to accelerate traffic over the dedicated channel and improve the experience for overseas users.

Network services use the Akamai channel first. If a network service fails and the retry mechanism takes effect, the retry goes over the traditional Internet channel. With the Akamai channel Bypass technique, the average service time was 33% lower than with the traditional Internet channel alone, while the success rate of network services was maintained.

Discussion on Other Network Protocols

Over the past two years, our network service optimization has been based on TCP and has largely met its goals. In the same period, the newer application-layer protocols SPDY and HTTP/2 have gradually entered the mainstream, and the UDP-based QUIC protocol also looks very interesting and is worth further study.

SPDY & HTTP/2

SPDY is a TCP-based application-layer protocol developed by Google. It is no longer being developed, in favor of HTTP/2, whose design builds on SPDY's results. The core improvements in HTTP/2 target the pain points that hurt latency in HTTP/1.x:

Header compression: compresses redundant HTTP request and response headers.
Multiplexing: supports multiple simultaneous requests and responses on a single TCP connection.
Persistent connections (more thorough than in HTTP/1.x): reduce network connection time.
Server push: the server can proactively push data to the client.

Official performance tests show that SPDY or HTTP/2 reduces page load time by about 30%, but those results are for the Web. For network services in the App we are still testing internally; since our TCP-based optimizations are similar to those in SPDY and HTTP/2, the performance gain may not be significant.

QUIC

QUIC is an application-layer protocol developed by Google on top of UDP. UDP is connectionless and has no retransmission mechanism, so the application layer must guarantee reliability itself. In China, Tencent has already tried QUIC for weak networks. We are also testing it, and whether we adopt it will depend on the test results.

Conclusion

Technology is only a means; in the end it must show up in business results. Apart from requests that need a CDN, such as static resources, all App network services now go through a unified TCP channel, which gives us better performance tuning and business monitoring capabilities. Ctrip's TCP-based optimization of App network services is also the result of balancing various technical solutions. Although HTTP/2 and other new protocols are maturing, the flexibility of TCP, which lets us make our own targeted performance optimizations, still has particular advantages. We hope this summary of our practice is of some reference value to wireless technology practitioners in China.

Reference

app-network-service-and-performance-optimization-of-ctrip