Thinking in High Concurrency

One word: divide

  • Divide and conquer, with traffic split at multiple levels

  • Browser side, server front end, middle tier, database side

  • Every one of these layers is an opportunity to split the load

Some things about high concurrency dev

reference
Recently the IT media have hosted many industry technology conferences, and sites of every size, from giants such as Facebook and Baidu down to newly launched startups, have been sharing their technology with their peers. The technology and sheer processing power of sites like Facebook and Baidu are certainly eye-opening, but not every site has hundreds of millions of users or a massive volume of data to store, and not every site needs MapReduce-style parallel computing or HBase-style column storage. Technology exists to support the business, so let your current operating environment decide what you use. There is no need to chase fashion or to force a connection with whatever technology happens to be popular.

At recent conferences most eyes have been on these large sites, but the technical architecture of small and medium-sized sites deserves discussion and attention too. Not every engineer works for a large portal. Far more of them work, unnoticed, for small and medium-sized sites that are just getting started, and they make up more than 60% of the engineering population. While attention goes to the large portals, the technical growth and practical experience of small and medium-sized sites are even more worth sharing.

Large portals and small to medium-sized vertical sites alike are driven by stability, performance, and scalability. The experience shared by large sites is worth studying and borrowing from, but their more specific practices do not apply to every site. I would not presume to speak for systems built in other languages, but on systems developed in Java I can offer a few words:

JVM

The JVM parameters that the JEE container runs with directly affect the performance and processing capacity of the whole system. JVM tuning is mostly about memory management, and it breaks down into the following four areas:

  1. HeapSize: the heap size, which is effectively the JVM's memory usage policy. This is critical.
  2. GarbageCollector: configuration parameters select which of Java's four garbage collection algorithms (strategies) is used.
  3. StackSize: the stack is the JVM's per-thread memory area for method calls. Each thread has its own stack, so the stack size limits the number of threads that can be created.
  4. Debug / Log: the JVM can also be configured to write runtime logs and crash logs. This is critical too, because those logs tell you which parameters need adjusting.

JVM configuration tips can be found all over the Internet, but I still recommend reading the two official Sun documents below so that you understand the parameters you can configure:

  1. Java HotSpot VM Options
    http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
  2. Troubleshooting Guide for Java SE 6 with HotSpot VM http://www.oracle.com/technetwork/java/javase/index-137495.html

In addition, I doubt that every engineer works with these JVM parameters every day. If you forget the key ones, run java -X (uppercase X) to print a reminder.
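As a small illustration of the points above, the following sketch prints the heap limit and the startup flags that the running JVM actually received, which is a quick way to confirm that the configuration took effect. The class name and the flag values in the comment are illustrative assumptions, not recommended settings.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: reports the settings the current JVM is running with.
// Example launch (values are placeholders, tune them for your own system):
//   java -Xms512m -Xmx512m -Xss256k -XX:+HeapDumpOnOutOfMemoryError JvmSettingsSketch
public class JvmSettingsSketch {

    /** Upper bound of the heap in megabytes, roughly what -Xmx resolved to. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** The flags passed to the JVM at startup (heap, stack, GC, logging options). */
    static List<String> startupFlags() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
        System.out.println("JVM flags: " + startupFlags());
    }
}
```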

JDBC

JDBC parameters for MySQL were covered in an earlier article. In both single-machine and cluster environments, sensible JDBC configuration has a large effect on how database operations perform.
Some so-called high-performance open-source Java ORM frameworks do little more than switch on many of these JDBC parameters by default:

  1. For example: autoReconnect, prepStmtCacheSize, cachePrepStmts, useNewIO, blobSendChunkSize.
  2. For example, in a cluster environment: roundRobinLoadBalance, failOverReadOnly, autoReconnectForPools, secondsBeforeRetryMaster.

For details, refer to the official MySQL JDBC manual:
http://dev.mysql.com/doc/refman/5.1/en/connectors.html#cj-jdbc-reference
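These parameters are normally passed as query-string properties on the JDBC URL. The sketch below only assembles such a URL; the host, port, database name (appdb), and the cache size of 250 are placeholder assumptions, and the right values depend on your workload and driver version.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that builds a MySQL JDBC URL with tuning parameters.
public class JdbcUrlSketch {

    static String buildUrl(String host, int port, String database, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(entry.getKey() + "=" + entry.getValue());
        }
        return "jdbc:mysql://" + host + ":" + port + "/" + database + "?" + query;
    }

    /** Single-machine example using parameters named in the list above. */
    static String exampleUrl() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("cachePrepStmts", "true");    // cache prepared statements on the client
        params.put("prepStmtCacheSize", "250");  // placeholder value
        params.put("autoReconnect", "true");
        return buildUrl("localhost", 3306, "appdb", params);
    }
}
```

The resulting string would then be handed to DriverManager.getConnection or to the connection pool's configuration.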

Database connection pool (DataSource)

Frequently opening and closing connections between the application and the database creates a bottleneck and a lot of overhead. A JDBC connection pool is responsible for allocating, managing, and releasing database connections. It lets the application reuse an existing connection instead of establishing a new one, so the application does not have to connect and disconnect constantly. The pool can also release connections that have been idle longer than the maximum idle time, which prevents the connection leaks that occur when connections are never released. This technique significantly improves the performance of database operations.
One point here needs a little explanation:
A connection obtained from a pool still has to be closed. When the pool starts, it obtains a set of connections from the database in advance, and from then on the application no longer deals with the database directly. The application only "borrows" a connection from the pool, and what is borrowed must be returned. Picture 20 buckets standing by a pond: anyone who needs water takes a bucket and draws water with it. If 20 people take water and none of them returns the bucket, the next person can only stand and wait until someone brings one back. Earlier users have to return what they took, or later users are left waiting and the resource is blocked. In the same way, once the application has been handed a Connection object from the "pool" and has finished with it, it must return the connection, so that the pool's rule of "borrow and return" is upheld.
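The bucket analogy can be sketched as a tiny generic pool. This is a toy built on a BlockingQueue to show the borrow/return rule, not how any particular connection pool library is implemented; the class and method names are made up for the example.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical toy pool: a fixed set of resources that are borrowed and returned.
public class BorrowReturnPoolSketch<T> {

    private final BlockingQueue<T> idle;

    /** All resources are created up front, like the pool pre-opening its connections. */
    BorrowReturnPoolSketch(List<T> resources) {
        idle = new ArrayBlockingQueue<>(resources.size(), false, resources);
    }

    /** Waits up to timeoutMillis for a free resource; returns null if none was returned in time. */
    T borrow(long timeoutMillis) {
        try {
            return idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** Puts the resource back so the next caller can use it. */
    void giveBack(T resource) {
        idle.offer(resource);
    }
}
```

With a real pooled DataSource, returning the connection is usually done by calling close() on it, and wrapping the connection in try-with-resources is a simple way to make sure that always happens.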
References:
ref

Data access

Database server optimization and data access start with one question worth thinking through: which type of data belongs where. Future storage is likely to be mixed, with cache, NoSQL, DFS, and a database all present in one system. At home you need both tableware and everyday clothes, but you do not store them in the same kind of furniture; nobody, it seems, keeps dishes and clothes in the same cabinet. Data in a system is the same: different types of data need suitable storage environments. For file and image storage, classify first by access heat or by file size. Strongly relational data that needs transactions belongs in a traditional database; weakly relational data that does not need transactions can go in NoSQL; massive file storage can use a DFS that supports network storage; and whether to cache depends on the size of each stored item and on the read/write ratio.

Another point worth noting is read/write separation. In either a database or a NoSQL environment, reads outnumber writes most of the time, so the design must spread reads across multiple machines and also keep the data on those machines consistent. With MySQL you can run one master and several slaves, then add MySQL-Proxy or borrow some JDBC parameters (roundRobinLoadBalance, failOverReadOnly, autoReconnectForPools, secondsBeforeRetryMaster), so that subsequent application development gets read/write separation: the heavy read load is spread over multiple machines while the data stays consistent.
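If the routing is done in application code rather than by MySQL-Proxy or the driver, the idea looks roughly like the sketch below: writes go to the master, reads rotate over the slaves. The class is hypothetical, the node names are placeholders, and the "starts with SELECT" test is a deliberate simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical router: picks a database node for each statement.
public class ReadWriteRouterSketch {

    private final String master;
    private final List<String> slaves;
    private final AtomicInteger next = new AtomicInteger();

    ReadWriteRouterSketch(String master, List<String> slaves) {
        this.master = master;
        this.slaves = new ArrayList<>(slaves);
    }

    /** Returns the node that should execute the given SQL statement. */
    String route(String sql) {
        boolean isRead = sql.trim().toLowerCase().startsWith("select");
        if (!isRead || slaves.isEmpty()) {
            return master; // all writes, and reads when no slave exists, go to the master
        }
        // Round-robin over the slaves; floorMod keeps the index valid after overflow.
        int index = Math.floorMod(next.getAndIncrement(), slaves.size());
        return slaves.get(index);
    }
}
```

A production router would also have to send reads that occur inside a transaction, or that must see a just-written row, to the master, because replication to the slaves is not instantaneous.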

Cache

Caches generally fall into two kinds: local caches and distributed caches.

  1. Local cache. In Java, a local cache means putting data into a static collection and reading it back from that collection when it is needed. For high-concurrency use, ConcurrentHashMap or CopyOnWriteArrayList is recommended as the local cache. A cache is, concretely, a use of system memory, so the share of memory given to it needs to be kept in proportion. Caching more than is appropriate is counterproductive and slows the whole system down.

  2. Distributed cache. Generally used in distributed environments, where the cached data of every machine is stored centrally. Besides caching, it can also serve as a means of synchronizing or transferring data within a distributed system. The most commonly used products are Memcached and Redis.
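A local cache of the kind described in item 1 can be sketched with a static ConcurrentHashMap. The class name, the key format, and the loader function are assumptions for the example; the load counter exists only to show that the loader runs once per key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical local cache backed by a static concurrent map.
public class LocalCacheSketch {

    private static final ConcurrentMap<String, String> CACHE = new ConcurrentHashMap<>();

    /** Counts how often the loader actually ran (for demonstration only). */
    static final AtomicInteger LOADS = new AtomicInteger();

    /** Returns the cached value, loading and storing it on the first request for a key. */
    static String get(String key, Function<String, String> loader) {
        return CACHE.computeIfAbsent(key, k -> {
            LOADS.incrementAndGet();
            return loader.apply(k);
        });
    }
}
```

This map never evicts anything, so it only suits small, bounded data sets. Anything larger needs a size limit or expiry, which is the memory-proportion concern raised above.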

Reading and writing data is faster or slower depending on the storage medium, so the way to use a cache in a system is to bring your data closer to the CPU. Always keep the following picture in mind; it comes from Jeff Dean of Google (Ref):

[Figure: Cache-speed]

Concurrent / multithreading

In a high-concurrency environment, developers are advised to use the concurrency package that ships with the JDK (java.util.concurrent). Since JDK 1.5, the java.util.concurrent utilities have simplified multithreaded development. They fall into the following main parts:

  1. Thread pools. The thread pool interfaces (Executor, ExecutorService) and implementation classes (ThreadPoolExecutor, ScheduledThreadPoolExecutor) of the JDK thread pool framework manage the queuing and scheduling of tasks and allow controlled shutdown. Running a thread consumes CPU, and creating and ending threads also costs CPU and system resources, so a thread pool not only manages multiple threads effectively but also makes them run more efficiently.

  2. Local queues. The ConcurrentLinkedQueue class provides an efficient, scalable, thread-safe, non-blocking FIFO queue. Five implementations in java.util.concurrent support the extended BlockingQueue interface, which defines blocking versions of put and take: LinkedBlockingQueue, ArrayBlockingQueue, SynchronousQueue, PriorityBlockingQueue, and DelayQueue. These classes cover the most common usage contexts for producer-consumer, messaging, parallel tasking, and related concurrent designs.

  3. Synchronizers. Four classes support common special-purpose synchronization idioms. Semaphore is a classic concurrency tool. CountDownLatch is a very simple yet very useful utility for blocking until a given number of signals, events, or conditions hold. CyclicBarrier is a resettable multiway synchronization point that is useful in some styles of parallel programming. Exchanger lets two threads exchange objects at a rendezvous point, which is useful in several pipeline designs.

  4. Concurrent collections. The package also provides Collection implementations designed for multithreaded contexts: ConcurrentHashMap, ConcurrentSkipListMap, ConcurrentSkipListSet, CopyOnWriteArrayList, and CopyOnWriteArraySet. When many threads are expected to access a given collection, ConcurrentHashMap usually outperforms a synchronized HashMap, and ConcurrentSkipListMap usually outperforms a synchronized TreeMap. CopyOnWriteArrayList outperforms a synchronized ArrayList when reads and traversals greatly outnumber updates to the list.
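Three of the building blocks listed above can be combined in one short sketch: a fixed thread pool runs the tasks, a ConcurrentHashMap collects the results, and a CountDownLatch tells the caller when every task has finished. The word-counting task, the class name, and the 5-second wait are assumptions chosen for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical example: count words in parallel with java.util.concurrent tools.
public class ConcurrentToolsSketch {

    static ConcurrentMap<String, Integer> countWords(String[] words, int threads) {
        ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();
        CountDownLatch done = new CountDownLatch(words.length);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (String word : words) {
                pool.execute(() -> {
                    try {
                        counts.merge(word, 1, Integer::sum); // atomic per-key update
                    } finally {
                        done.countDown(); // signal completion even if the task fails
                    }
                });
            }
            // Block until every task has counted down, or give up after 5 seconds.
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown(); // controlled shutdown: no new tasks, running ones complete
        }
        return counts;
    }
}
```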

Queue

Queues fall into two categories: local queues and distributed queues.

Local queue: commonly used for batch writes of data that is not time-sensitive. Buffer the data in memory and write it out as one batch once a certain count is reached. BlockingQueue or a List / Map can be used to implement this.
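A minimal sketch of such a batch writer, assuming a BlockingQueue as the buffer. The class is hypothetical, and the in-memory list of flushed batches stands in for the real batch insert into the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical batch writer: buffers records and writes them out in groups.
public class BatchWriterSketch {

    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
    private final int batchSize;
    private final List<List<String>> flushed = new ArrayList<>();

    BatchWriterSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Buffers one record and triggers a batch write once enough have accumulated. */
    void add(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Drains up to batchSize records and writes them as one batch. */
    synchronized void flush() {
        List<String> batch = new ArrayList<>();
        buffer.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            flushed.add(batch); // stand-in for a single batched database write
        }
    }

    synchronized List<List<String>> flushedBatches() {
        return new ArrayList<>(flushed);
    }
}
```

In practice a timer would also call flush() periodically, so that a partly filled buffer does not sit unwritten for too long.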

Related information: Sun Java API.

Distributed queue: generally used as message middleware, the bridge that lets the subsystems of a distributed environment communicate with one another. In JEE environments the most widely used are Apache ActiveMQ and Sun's OpenMQ.

Lightweight MQ middleware has also been introduced here before, for example Kestrel and Redis (Ref http://www.javabloger.com/article/mq-kestrel-redis-for-java.html). I recently heard that LinkedIn's search technology team has released an MQ product, Kafka (Ref http://sna-projects.com/kafka), which is worth keeping an eye on.

Relevant information:

  1. ActiveMQ http://activemq.apache.org/getting-started.html

  2. OpenMQ http://mq.java.net/about.html

  3. Kafka http://sna-projects.com/kafka

  4. JMS article http://www.javabloger.com/article/category/jms

NIO

NIO appeared in JDK 1.4. Before Java 1.4, the I/O that the JDK provided was stream-oriented: reading or writing a file, for example, processed the data one byte at a time, with an input stream producing one byte and an output stream consuming one byte. Stream-oriented I/O is very slow, and at any moment a packet or a whole datagram has either been received or has not, so the caller must wait to find out. Java NIO's non-blocking technique follows the Reactor pattern: when content arrives the application is notified automatically, so there is no need for blocked waiting or busy loops, which greatly improves system performance. In practice NIO is mostly used in two areas: file read/write operations and network data streams. There are three core NIO objects to master: the selector (Selector), the channel (Channel), and the buffer (Buffer).
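The three core objects can be seen together in a very small sketch. To stay self-contained it uses an in-process Pipe rather than a network socket, which is an assumption made for the example; a server would register SocketChannels with the Selector in the same way. The class name and the 2-second select timeout are also illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of Selector + Channel + Buffer using an in-process pipe.
public class SelectorSketch {

    /** Writes a short message (under 1 KB) into a pipe and reads it back once the selector reports it. */
    static String readWhenNotified(String message) {
        try {
            Pipe pipe = Pipe.open();
            try (Selector selector = Selector.open();
                 Pipe.SinkChannel sink = pipe.sink();
                 Pipe.SourceChannel source = pipe.source()) {
                source.configureBlocking(false);                  // channels must be non-blocking to register
                source.register(selector, SelectionKey.OP_READ);  // ask to be told when data is readable
                sink.write(ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8)));

                ByteBuffer buffer = ByteBuffer.allocate(1024);
                if (selector.select(2000) > 0) {                  // wait for a readiness notification
                    for (SelectionKey key : selector.selectedKeys()) {
                        if (key.isReadable()) {
                            ((Pipe.SourceChannel) key.channel()).read(buffer);
                        }
                    }
                }
                buffer.flip();                                    // switch the buffer from writing to reading
                return StandardCharsets.UTF_8.decode(buffer).toString();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```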

A few extra remarks of my own:

  1. Memory-mapped files, part of Java NIO, are an efficient technique that can be used for separating cold and hot data in a cache: the cold part of the cached data is handled this way. The approach is faster than conventional stream-based or channel-based I/O. It makes the data in a file appear as the contents of an in-memory array, and only the part of the file that is actually read or written is mapped into memory, rather than the whole file being read in.

  2. The MySQL JDBC driver can also use NIO for its database communication to improve system performance (see the useNewIO parameter listed in the JDBC section).
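The memory-mapping approach from item 1 can be sketched as follows. Only the requested region of the file is mapped, not the whole file. The temporary file, its contents, and the offsets are made up for the demonstration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example: read one region of a file through a memory mapping.
public class MappedFileSketch {

    /** Maps only [offset, offset + length) of the file and returns it as ASCII text. */
    static String readSlice(Path file, long offset, int length) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] bytes = new byte[length];
            region.get(bytes); // the file contents are read like an in-memory array
            return new String(bytes, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a small sample file and reads back only its second half. */
    static String demo() {
        try {
            Path file = Files.createTempFile("cold-data", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, "hot-part|cold-part".getBytes(StandardCharsets.US_ASCII));
            return readSlice(file, 9, 9);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```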

Long connection / Servlet3.0

The long connection here means long polling. Previously, a browser (client) that cared about data changes on the server had to keep requesting the server, and with many clients that inevitably put heavy pressure on the server side; in-site messages in a forum are an example. The Servlet 3.0 specification now provides a new feature, asynchronous request processing, which can be used to hold such a long connection open. Using Servlet 3.0 asynchronous requests can greatly ease the pressure on the server side.

The principle in Servlet 3.0 is that a request is suspended and a wait timeout is set on it. If an event is triggered in the background during that time, the result is returned to the client as the response to the suspended request. If no event occurs within the timeout, the request is also returned to the client. Either way the client then issues the request again, and client and server keep interacting in this cycle.

It is like you coming to me and saying: if someone comes looking for me, tell me right away and I will come and see them. Previously you had to keep asking me whether anyone had come looking for you, whether or not anyone had, so both the person asking and the person being asked ended up exhausted.
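The suspend / wait / resume cycle can be illustrated without a servlet container. The sketch below is not the Servlet 3.0 API; it is a plain JDK stand-in (Java 9 or later, because of completeOnTimeout) in which each suspended request is a CompletableFuture. The class name and the "NO_EVENT" marker are assumptions for the example.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical long-polling hub: requests wait for an event or for a timeout.
public class LongPollSketch {

    private final Queue<CompletableFuture<String>> waiting = new ConcurrentLinkedQueue<>();

    /** Suspends a request: it completes with an event, or with "NO_EVENT" after the timeout. */
    CompletableFuture<String> poll(long timeoutMillis) {
        CompletableFuture<String> pending = new CompletableFuture<>();
        waiting.add(pending);
        return pending.completeOnTimeout("NO_EVENT", timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** A background event arrives: every suspended request is answered at once. */
    void publish(String event) {
        CompletableFuture<String> pending;
        while ((pending = waiting.poll()) != null) {
            pending.complete(event); // no effect on requests that already timed out
        }
    }
}
```

In a real Servlet 3.0 application the same roles are played by the container's asynchronous context and its timeout, and the client re-issues the request after every response.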

Log

Log4J is a commonly used tool. When a system has just gone live, the log level is usually set to INFO; once it is truly in production, it is usually set to ERROR. At any stage, though, what goes into the log deserves attention. Developers generally rely on log output to find problems or to tune performance, and the logs are the basis for reports on how the system is running and for troubleshooting.
In short, define a strategy that sends log output of different kinds and levels to different destinations, which makes analysis and management easier. Without an output strategy, once there are many machines and enough time has passed, you end up with a pile of chaotic logs and no idea where to begin troubleshooting. The output strategy is therefore the key to using logs well.

Reference: ref

Packaging / deployment

At design time it is best to split different types of functional modules into separate, coarse-grained projects in the IDE, so that they can be packaged as different jars and deployed to different environments. Take this scenario: every day, on a schedule, the system must remotely fetch that day's 100 news items and the weather forecasts for a number of cities from an SP (service provider). The volume of data per day is small, but front-end traffic is large, so the architecture clearly calls for read/write separation.

If the web project and the scheduled-task module are bundled in one project package, then scaling out means every machine carries both the web application and a timer. Because the modules are not separated, a timer runs on every machine, and the database ends up with duplicate data.

If the web application and the timer are developed as two projects, they can be packaged and deployed separately, for example 10 web instances to one timer. The front-end request load is spread out and no data is written twice.

Another benefit is sharing. In the scenario above both the web application and the timer need to access the database, so both projects would contain database code, and the code logic becomes messy. If a DAL layer is extracted into its own jar, the developers of the web and timer modules only need to reference that jar and develop business logic against its interfaces, without caring about the specific database operations, which other developers implement. The division of labor is clear and the teams do not get in each other's way.
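The split can be sketched in a few lines. Everything here is hypothetical: in a real build the interface and its implementation would live in the DAL jar, the timer job in the timer project, and the page rendering in the web project. They are nested in one class only to keep the example self-contained, and the in-memory list stands in for the database.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical layout of the news scenario: one DAL interface, two client modules.
public class DalLayerSketch {

    /** Shared DAL jar: the only type that web and timer code depend on. */
    interface NewsDao {
        void save(String headline);
        List<String> latest(int limit);
    }

    /** Stand-in implementation; a real one would hide the JDBC code. */
    static class InMemoryNewsDao implements NewsDao {
        private final List<String> rows = new ArrayList<>();

        public synchronized void save(String headline) {
            rows.add(headline);
        }

        public synchronized List<String> latest(int limit) {
            int from = Math.max(0, rows.size() - limit);
            return new ArrayList<>(rows.subList(from, rows.size()));
        }
    }

    /** Timer module: deployed once, the only writer. */
    static void timerJob(NewsDao dao, List<String> fetchedHeadlines) {
        for (String headline : fetchedHeadlines) {
            dao.save(headline);
        }
    }

    /** Web module: deployed on many machines, read-only. */
    static String renderFrontPage(NewsDao dao) {
        return String.join(" | ", dao.latest(2));
    }
}
```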

Framework

The popular, so-called lightweight SSH (Struts / Spring / Hibernate) stack is not lightweight at all for many small and medium-sized projects. Developers have to maintain not only the code but also cumbersome XML configuration files, and a single mistake in a configuration file may keep the whole project from running. There are plenty of products that need no configuration files and can replace the SSH (Struts / Spring / Hibernate) stack; I have introduced a number of them before (Ref).

I am not saying that SSH (Struts / Spring / Hibernate) must never be used. In my view its real value is that it standardizes development; it is not something you adopt, or drop, for the sake of performance.

The SSH stack suits very large projects with teams of hundreds that keep growing. Such projects need technology that is widely recognized and that most developers on the market already know, and because SSH (Struts / Spring / Hibernate) is relatively mature, it is the first choice there.

For a small team whose members are technically strong, however, a more concise framework can genuinely speed up development. Dropping SSH early in favor of something simpler is the wiser choice for small-team development.