Double 11 Black Technology: Tailor-Made Big Data Real-Time Computing
In the data era, big data computing has permeated every industry: business activity accumulates data, computation on that data creates new business value, and this cycle keeps pushing business forward. Behind the Double 11 e-commerce carnival for merchants and consumers, the contribution of big data computing is just as indispensable, especially the increasingly widespread use of real-time computing.
In the real world, data is continuously generated, collected, and computed in real time
To compute on data and mine its commercial value, the first problem to solve is the data itself. In the real world, data is usually generated continuously over time. When a user browses products, a series of mouse clicks produces a series of back-end records. When a driver uses mobile navigation, the GPS position is refreshed at intervals and keeps producing log data. The same holds when users read pushed news or search for songs, when surveillance cameras periodically capture images and upload them to cloud storage, and during live video streaming. The data behind all of these scenarios is produced continuously, and collecting this business data in real time forms a data stream. Once collected, streaming data can take part in computation immediately, and the results are fed straight into business applications. That is real-time computing. In fact, real-time computing entered everyday life long ago. Take weather forecasts: people used to receive one forecast per day, whereas now they can check forecasts in real time, and a forecast becomes more accurate as the forecast time approaches. That is the result of continuously updated monitoring data combined with real-time computation.
Tailored to individual interests: real-time computing helps products understand users better and better
More and more real-time data sources are emerging, and data volumes grow exponentially each year. This is good for real-time computing: it enables more application scenarios, better results, and possibly some revolutionary changes. So what can big data real-time computing do? During events such as NetEase Kaola's Double 11 and the 618 Haitao (cross-border shopping) Festival, large screens at NetEase display in real time the current total sales, each product category's share of sales, the order growth trend, and the locations of active users. Information along all of these dimensions keeps changing on a single screen, and the effect of every order by every user is reflected on it in real time. Beyond adding to the carnival atmosphere, this visual real-time application makes it easier to discover the value in the data, guide market operations, and support business decisions.
Financial risk control is another typical real-time computing scenario. For risk-sensitive financial business, visualizing data is far from enough. The stream computing system must use risk models and matching rules to analyze massive volumes of user behavior data in real time, detect anomalies, determine the risk level, and take the appropriate risk control measures, from automatic alerts to changes in the business process. The benefits of real-time financial risk control are that it is faster, more accurate, and broader in coverage. Many other event-driven scenarios similar to risk control can also be handled with real-time computing.
Real-time computing is also deeply embedded in recommendation. Whether for news, music, or reading, products have largely achieved "a thousand faces for a thousand people": the content pushed to each person is tailored to individual interests. A user's interest preferences are often computed and continuously updated from real-time data. Take news push as an example. When a user clicks a pushed item, the product analyzes that behavior in real time and updates the user's interest preferences. It keeps discovering new interests, understands the user better and better, and ultimately pushes content the user cares about more. Take music recommendation as another example. If a user has saved several sad songs over a period of time, real-time analysis lets the system pick up on this and push songs chosen to soothe the user. Scenarios like this can only be solved with real-time computing, and they best show its value. As more real-time computing scenarios are developed, people will feel ever more strongly that "everything is changing".
From "store first, compute later" to "compute while storing": real-time computing is no longer afraid of "big" data
If real-time computing is this good, what has to be done at the implementation level, and what difficulties and challenges must be addressed? First, viewed from the overall architecture, data computing comes down to three things: data input → computation → data output.
In the traditional computing model, with a database as the example, data is first stored in a table, the user runs a query to trigger computation in the database, and the database outputs the result once the computation finishes. This store-first, compute-later model does not work for big data real-time computing. The data to be computed is very large: the source data behind a single result may cover the past day and amount to hundreds of billions of records. If all of that data were recomputed every time new data arrived, the overhead would be enormous and the result would arrive far too slowly to count as real time. A more reasonable approach is to compute on data as it enters the real-time computing system. The data does not have to be stored first and can take part in computation directly. The computation is incremental: new data is combined with the result previously computed from historical data, so the same data is never computed twice. Only the results are then saved for business use, which greatly reduces storage pressure. At the same time, "big" also means high concurrency. The system may need to process tens of millions of new records per second, which no single machine can withstand. What big data real-time computing has to solve, then, is a series of technical problems under a distributed system architecture.
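The contrast between recomputing over all stored data and folding each new record into a saved result can be sketched in a few lines. This is a minimal single-process illustration, not how any particular engine implements it; the names `RunningSales` and `batch_total` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningSales:
    """Keeps only the aggregate state, not the raw order history."""
    total: float = 0.0
    count: int = 0

    def update(self, amount: float) -> None:
        # Incremental model: fold one new record into the previous result.
        # The cost per record is constant, however much history exists.
        self.total += amount
        self.count += 1

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


def batch_total(all_orders: list[float]) -> float:
    # Batch model, for comparison: rescans every stored record per query.
    return sum(all_orders)


orders = [120.0, 80.0, 100.0]
state = RunningSales()
for amount in orders:
    state.update(amount)

print(state.total, state.average)
```

Both approaches give the same total here, but the incremental version never needs the full list of orders, only the two numbers it carries forward.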
Distributed real-time computing poses challenges on many fronts. First, from acquisition through computation to output, the whole process must be low latency. Besides the compute nodes themselves using the incremental computing model, the upstream data transport module needs high throughput and data caching so that it can act as a buffer during traffic peaks, and the downstream output module needs optimizations such as data compression and batched output to keep the output real time. Real-time computing also places higher demands on other properties of the system. For example, at 11 o'clock on the eve of Double 11, a large number of consumers place orders and pay at the same moment, and the amount of data pouring into the real-time computing system in that instant is huge. The system must be able to process data in parallel, distribute the burst of traffic across hundreds or thousands of compute nodes, and aggregate the results from these nodes into an overall result, so that latency stays low even under high throughput.
The biggest challenges in moving from "batch computing" to "incremental computing" are accuracy and ease of use
A challenge as critical as low latency is accuracy. The incremental model differs from the traditional batch model, so past technical experience cannot simply be copied, or accuracy problems arise. The system has to define how new data is added to old results, and in some scenarios some previously computed values must even be removed from the old results to keep the final result accurate. Node failures are very common in a distributed system. A real-time stream computing system's ability to recover from failure is also important, because when a failure occurs the system must recover quickly or its output may stall.
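The accuracy point above, that new data must be added to old results and that some values occasionally have to be taken back out, can be illustrated with a small sketch. It is an assumption-laden example rather than a description of Sloth: the class `CategorySales`, the idea of a cancelled order triggering the removal, and the duplicate check for replayed records are all hypothetical.

```python
from collections import defaultdict


class CategorySales:
    """Running sales per category that supports adding and retracting orders."""

    def __init__(self) -> None:
        self.totals: dict[str, int] = defaultdict(int)
        # Remember each order's contribution so it can be removed later.
        self.orders: dict[str, tuple[str, int]] = {}

    def add(self, order_id: str, category: str, cents: int) -> None:
        if order_id in self.orders:
            # Same order seen again (for example, replayed after a failure):
            # ignore it so the total is not counted twice.
            return
        self.orders[order_id] = (category, cents)
        self.totals[category] += cents

    def retract(self, order_id: str) -> None:
        # Remove a previously counted value from the old result.
        entry = self.orders.pop(order_id, None)
        if entry is None:
            return
        category, cents = entry
        self.totals[category] -= cents


sales = CategorySales()
sales.add("o1", "books", 5000)
sales.add("o2", "books", 3000)
sales.add("o1", "books", 5000)  # duplicate delivery, ignored
sales.retract("o2")             # e.g. the order was cancelled
print(sales.totals["books"])
```

The trade-off is visible in the `orders` dictionary: to retract correctly, the system has to keep enough state to know what each record originally contributed.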
If a system cannot recover quickly from failure, real-time results are out of the question. At the same time, a failure must not break the incremental computing model. Otherwise the system degenerates to the batch model, cannot produce real-time results, and can hardly guarantee accurate ones. NetEase's big data team encountered and overcame these technical difficulties while developing Sloth, its in-house stream computing platform. As a platform product, Sloth has also put a lot of work into ease of use and multi-tenant isolation. For real-time computing, ease of use is a much-discussed topic. Writing a distributed program is harder for developers than writing a single-machine program, and writing a distributed real-time program is harder still. Admittedly, a number of open-source stream computing engines in the industry take care of much of the work. Developers can use these engines to build stream computing tasks without worrying about how tasks are distributed across compute nodes or how data is transmitted between them. They only need to focus on the computing logic and control the parallelism of each computing stage. Take counting the words in an article as an example. A distributed program for this may have three parts. First, several compute nodes work together to split each line of text into words. Second, another group of nodes counts the words (given the huge data volume, more than one node is needed). Third, a single compute node merges the partial counts from the upstream nodes into one total count. Even in this simplest scenario, roughly 200 lines of code need to be written.
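The three stages of the word-count example can be mimicked in a single process to show the data flow. This sketch leaves out everything that makes the real program long (job setup, networking, serialization, fault tolerance), and it assumes hash partitioning for stage two, which the article does not specify. The function names are hypothetical.

```python
from collections import Counter
from zlib import crc32


def split_stage(lines: list[str]) -> list[str]:
    # Stage 1: split each line of text into words.
    return [word.lower() for line in lines for word in line.split()]


def count_stage(words: list[str], num_workers: int) -> list[Counter]:
    # Stage 2: route each word to a worker by a stable hash, so that all
    # occurrences of a word are counted by the same worker.
    partials = [Counter() for _ in range(num_workers)]
    for word in words:
        worker = crc32(word.encode("utf-8")) % num_workers
        partials[worker][word] += 1
    return partials


def merge_stage(partials: list[Counter]) -> Counter:
    # Stage 3: a single node merges the partial counts into the total.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["real time computing", "real time data"]
counts = merge_stage(count_stage(split_stage(lines), num_workers=3))
print(counts)
```

In a real engine each stage would run on separate nodes with its own parallelism setting; here the list of `Counter` objects stands in for the workers of stage two.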
In real business scenarios, data passes through far more than three computing stages, and the computation is much more complex than basic summation. Even with a stream computing engine, developing a distributed real-time program remains difficult. Moreover, once development is finished, a lot of time still goes into debugging and maintaining the computing framework, and whenever the requirements change, all of that work has to be repeated, which is a painful process. How to make stream computing programs easier to write is the challenge a real-time computing platform has to meet. Before looking at how a real-time stream computing system addresses ease of use, consider how similar problems have been solved in the history of computer science. People wanted to make programming easier, so more and more high-level programming languages were invented. People wanted to make data computation easier, so databases and SQL (Structured Query Language) appeared. In the big data era, people struggled with the complexity of programming against compute engines for offline batch computing, and finally solved the problem by applying SQL to distributed offline computing systems. Now that real-time computing is developing rapidly, can the same problem be solved with SQL? The answer is yes, but many details need to be examined carefully.
A data stream in real-time stream computing can be understood as a dynamic table
As mentioned above, the offline batch model and the real-time incremental model differ, so when SQL is combined with batch computing and with stream computing, its semantics must differ as well. The main difference between batch and stream computing is that the former works on bounded data, while the latter works on unbounded data that is continuously collected into the system. When a SQL query runs over a set of offline data, it finishes and outputs a result. On a stream, the query never finishes, because data keeps flowing in. Under offline SQL semantics, no result is output until the query completes, which is clearly not what stream computing wants. The essence of streaming SQL should therefore be to define a series of stream computing tasks that output results while they run. Offline SQL operates on static tables, while streaming SQL operates on data streams. Are SQL's computational semantics (sums, averages, table joins, and so on) still valid on a data stream? Answering this question requires a conceptual shift. Offline SQL converts one static table into another static table. A data stream in real-time stream computing can be viewed as a dynamic table, that is, a table whose data keeps growing. The table differs from moment to moment, so running the SQL at each moment yields a different result. Stringing these results together like the frames of a film gives a dynamic result table. What streaming SQL does is convert one dynamic table into another dynamic table, which makes its semantics much easier to understand. The problem a real-time stream computing system has to solve thus reduces to how to compute over dynamic tables.
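The idea of a dynamic table turning into a dynamic result table can be made concrete with a tiny continuous query. The sketch below imitates a grouped sum over a stream and emits a snapshot of the result table after every input row. The query text, the `orders` data, and the function name `continuous_sum` are illustrative assumptions, and a production engine would typically emit only the changed rows rather than full snapshots.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def continuous_sum(stream: Iterable[tuple[str, int]]) -> Iterator[dict[str, int]]:
    """Roughly: SELECT category, SUM(amount) FROM orders GROUP BY category,
    re-evaluated incrementally as each row arrives."""
    result: dict[str, int] = defaultdict(int)
    for category, amount in stream:
        result[category] += amount
        # Emit one "frame" of the dynamic result table.
        yield dict(result)


orders = [("books", 30), ("toys", 20), ("books", 10)]
snapshots = list(continuous_sum(orders))
for frame in snapshots:
    print(frame)
```

Each printed frame is the answer the offline query would give if it were run on the rows seen so far, which is the sense in which the SQL semantics carry over to the stream.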
Streaming SQL engine optimization is currently the main direction for technical breakthroughs
The ease-of-use problem of real-time stream computing systems is solved with SQL, and practice with Sloth, NetEase's stream computing platform, confirms this. Users no longer need to learn the programming interfaces of various compute engines, tune distributed programs, or maintain their own stream computing systems. They only need to migrate the SQL that runs on the offline platform to the real-time stream computing platform to implement complex real-time logic. As the user's work shrinks, the platform's work inevitably grows. The hard part is converting a SQL query into actual computing logic, that is, building a compute engine that supports streaming SQL and plays a role similar to a database engine. As discussed earlier, the engine's logic must conform to the incremental computing model. In addition, for real-time computing to serve a variety of business scenarios, the engine must be able to connect to many kinds of storage, such as databases, message queues, and offline storage. The Double 11 real-time big screen is only one application. More real-time scenarios will follow: real-time text, image, and speech computing, online machine learning, real-time computing for the Internet of Things, and so on. Real-time data and real-time stream computing scenarios are growing exponentially, and real-time compute engines face considerable challenges. SQL-based descriptions of stream computing are also evolving and will increasingly incorporate attributes unique to stream computing, such as output triggers, handling of outdated data, and window partitioning under various rules.
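The stream-specific attributes just listed (window partitioning, output triggering, and handling of outdated data) can be shown together in one small sketch. It assumes fixed-size windows over event timestamps and uses a watermark, a common streaming technique that the article itself does not name, to decide when a window's result is emitted and when a record counts as outdated. The class `TumblingWindowSum` and its parameters are hypothetical.

```python
from collections import defaultdict


class TumblingWindowSum:
    """Sums amounts per fixed-size window of event time.

    A window's result is emitted once the watermark (largest timestamp
    seen minus allowed_lateness) reaches the window's end. Records that
    arrive for a window that has already closed are dropped as outdated.
    """

    def __init__(self, size: int, allowed_lateness: int = 0) -> None:
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.open_windows: dict[int, int] = defaultdict(int)
        self.watermark = 0
        self.dropped = 0

    def process(self, ts: int, amount: int) -> list[tuple[int, int]]:
        start = ts - ts % self.size  # window partitioning
        if start + self.size <= self.watermark:
            self.dropped += 1        # outdated data: its window is closed
            return []
        self.open_windows[start] += amount
        self.watermark = max(self.watermark, ts - self.allowed_lateness)
        # Output trigger: emit every window the watermark has passed.
        ready = sorted(s for s in self.open_windows
                       if s + self.size <= self.watermark)
        return [(s, self.open_windows.pop(s)) for s in ready]


window = TumblingWindowSum(size=10)
events = [(1, 5), (4, 5), (12, 7), (3, 100), (25, 1)]
outputs = [window.process(ts, amount) for ts, amount in events]
print(outputs, window.dropped)
```

Here the record with timestamp 3 arrives after its window has already been emitted, so it is dropped. Raising `allowed_lateness` would keep windows open longer at the cost of later output, which is the kind of trade-off a streaming SQL dialect has to let users express.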
Automatic optimization of the streaming SQL engine is currently a major direction for technical breakthroughs. I believe that as the technology advances, real-time stream computing will be applied ever more deeply and widely.