Data science is a very popular phrase these days. When it comes to solving large-scale data-driven problems, most experts agree that turning to data science is the current best practice. So what exactly is this science, and why is it so successful? Let's take a closer look, and let's also look at how it relates to the pipeline industry.
Data Science is the new star in town
Data science is considered a new discipline in our science portfolio. Just as computer science evolved from pure mathematics once computers were invented and became more powerful, data science draws on the technology and know-how of other disciplines. The momentum behind this area, together with the corresponding mindset and focus, has introduced the phrase into our common vocabulary and justifies treating it as a standalone science.
Data science utilizes the best capabilities and procedures of related traditional disciplines to develop new insights from the vast amounts of data that we humans create and host these days. The success of this new scientific field stems from the complexity of the systems we rely on today: we are no longer able to master our complex world with pen and paper. Computers are no longer used merely to speed up research, development, and established processes; they are needed to help us understand the essence of the often immensely high-dimensional complexity of the systems we create for our benefit.
Data science relies heavily on mathematics and statistics, computer science, and software engineering, as well as on our knowledge and experience in problem-solving. Traditionally, these areas have always overlapped to some degree. The cooperation between technical specialists facing a problem and software engineers, for example, has often produced a clever software algorithm to solve it. Working closely with scientists naturally leads to traditional research, and the development of artificial intelligence originated in the partnership between classical science and computer science. The complexity of today's systems, however, no longer allows for easy decisions on how to approach an issue. Will a software solution alone do the job? Is experience all that is needed? Or does the investigation require a strictly statistical approach?
Figure 1: Data Science can be understood as the combination of traditional scientific disciplines and existing domain expertise.
Nowadays, the complexity of a system is usually defined by the number of parameters describing it, and this number can reach dizzying heights. In all areas - be it industrial systems, finance, or medical applications - we quickly reach tens of thousands to millions of significant parameters once we try to make sure no substantial aspect has been forgotten. No single human can handle such a level of complexity, let alone understand every aspect of the data behind the parameters. Simply relying on experience is dangerous, even when it is used merely as a starting point for an analysis. A whole new way of working with the numbers is required, and this is what data science is all about: stepping back a bit and letting the data tell its own story, exploring which rules can be found in the high-dimensional parameter space, and thereby understanding what is happening. Data science is thus the result of established procedures moving closer together, combined with ever-increasing computational resources that make such powerful analyses possible.
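To make that idea a little more tangible, here is a deliberately simplified Python sketch - not a description of any production workflow - in which a few thousand synthetic parameters are handed to a standard dimensionality-reduction technique and asked which directions actually matter. All numbers and names are invented purely for illustration.

```python
# Toy sketch: letting high-dimensional data "tell its own story" by projecting
# thousands of parameters down to a few explanatory directions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Pretend each row is one observed system state described by thousands of
# parameters, most of which are driven by only a handful of underlying effects.
n_samples, n_params, n_hidden_effects = 500, 2000, 5
hidden = rng.normal(size=(n_samples, n_hidden_effects))
mixing = rng.normal(size=(n_hidden_effects, n_params))
observations = hidden @ mixing + 0.1 * rng.normal(size=(n_samples, n_params))

# Ask the data itself which directions explain most of the variation.
pca = PCA(n_components=10)
pca.fit(observations)
print(pca.explained_variance_ratio_.round(3))
# The first few components carry nearly all the variance, hinting that the
# apparent complexity hides a much simpler underlying structure.
```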
Putting the pieces together
Data science is so popular these days for good reasons. To start, it brings the disciplines together so they can take the necessary step forward jointly, instead of letting one of them dominate the others or even make them obsolete. A frequent concern in this regard is the fear of job losses caused by AI-enabled machines that might replace human beings thanks to their expected better accuracy and higher performance. I do not think this will be the case for projects that rely heavily on data science. In the end, data science provides a set of tools that - when used wisely - help humans focus and derive better results. The processes and their outcome remain in the hands of the human being.
Another reason for the growing popularity of data science lies in the fact that it offered the right tools and strategies at the right time. In an ever-accelerating world with growing expectations regarding product and service quality, the related data volumes simply explode. This is probably based on the naïve assumption that more always means better, and that generating more data to be analyzed is therefore a good thing. The internet of things (IoT) tells exactly the same story. Generating data has become ridiculously cheap, and almost every type of sensor can be connected to a network. The collected data forms a giant data lake of related but unstructured information that may contain all the ingredients for a successful business; in many people’s minds, therefore, “data is the new oil”. Data science offers the tools and approaches needed to stay on top of the data and to prevent users from drowning in the data lake. It is the only way to work with data efficiently these days and therefore a key to staying competitive.
How do we benefit from Data Science in the Pipeline Inspection Business?
ROSEN saw the potential of data science years ago, and indeed, it has changed the way we work on R&D projects and cooperate across several disciplines.
Only a couple of years ago, the traditional development cycle was based on a very robust continuous-improvement approach. Quality was improved by turning the ideas coming from R&D and data analysis into new capabilities of our in-line inspection tools and of the algorithms working on the data these tools generate. Every bit of feedback from a data evaluation was incorporated, and every new situation was implemented in a new, improved version of the algorithm. As a consequence, the mechanical design of our in-line inspection tools became more and more sophisticated, and the number and diversity of sensors used on a single tool increased steadily over the years, generating ever larger volumes of raw data for every single in-line inspection service we completed. The software we developed to support the entire course of the service became much more flexible, highly configurable, and adjustable to the needs of a demanding in-line inspection data analysis.
However, the effort required to provide this flexibility increased over the years. It became rather difficult to account for every single variation that pipelines, their operation, and all the possible anomalies around the world can present. As a result, the parameter space that had to be considered from a research perspective became impractical to handle with traditional methods.
What prevents us from sinking?
Instead of ignoring the influence of large parts of the possible parameter space on result quality, we decided to accept the real complexity of the system we were handling and looked for better tools to aid our work.
As of today, a modern, high-end in-line inspection tool delivers multiple terabytes of data from a single in-line inspection run. Impressive as that is, it does not form the basis of a successful data analysis on its own. The raw data has to be seen in the context of other available information pools, such as the results of supporting or prior inspections, geospatial data, and all the knowledge the operator can offer about the pipeline. On the one hand, this sounds like a good basis for meeting the expectations of a successful integrity check; on the other hand, each additional layer of information adds a possible source of errors and uncertainties and, more importantly, expands the parameter space exponentially. The result is a very complex view of the current state of the pipeline, and analyzing the corresponding data correctly and in sufficient detail to achieve the best possible result quality takes a lot of time and effort.
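As a toy illustration of what seeing the raw data in context can mean in practice, the following hypothetical Python sketch aligns a handful of made-up in-line inspection indications with a prior inspection and with operator records along the pipeline position. The column names, values, and tolerances are assumptions made purely for this example.

```python
# Hypothetical sketch: aligning an in-line inspection signal with additional
# information layers, keyed on the position along the pipeline.
import pandas as pd

ili = pd.DataFrame({
    "log_distance_m": [1200.0, 1250.5, 1310.2],
    "wall_loss_pct": [12.0, 34.0, 8.0],
})
prior_inspection = pd.DataFrame({
    "log_distance_m": [1201.0, 1251.0],
    "wall_loss_pct_prior": [10.0, 30.0],
})
operator_records = pd.DataFrame({
    "log_distance_m": [1249.8],
    "note": ["repair sleeve installed 2015"],
})

# merge_asof matches each indication to the nearest entry of another layer
# within a tolerance; every added layer also adds its own uncertainty.
combined = pd.merge_asof(
    ili.sort_values("log_distance_m"),
    prior_inspection.sort_values("log_distance_m"),
    on="log_distance_m", direction="nearest", tolerance=2.0,
)
combined = pd.merge_asof(
    combined, operator_records.sort_values("log_distance_m"),
    on="log_distance_m", direction="nearest", tolerance=2.0,
)
print(combined)
```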
The increasing volumes of data have not been a serious issue for us so far. Having offered pipeline inspections for decades, we have always faced large data volumes and have gathered a lot of experience in this regard. This is also why adopting the new possibilities of big data analytics and machine learning - developed in and for very different areas - was never just a matter of buying in at the right time for us. We needed solutions that would fit our grown infrastructure and our own data processing and analytics toolbox.
We have been exploring the available tools and solutions for some years now and - to the surprise of many - did not simply rely on the popular commercial products built around the Hadoop ecosystem. Because our R&D department insists on completely understanding the procedures and algorithms applied at any time, we found the best solutions in open-source software. This allows us to constantly review and check the correctness of the algorithms as well as the underlying infrastructure, to participate in the development of these tools, and to broaden the corresponding mindset.
As a result, we are able to deal with the ever-growing volumes of data we face every day. We can relieve our data analysts of part of the data onslaught so that they can focus on the most important aspects of data analysis, i.e., improving result quality to meet our customers' growing expectations. We are creating smart, machine-learning-enabled algorithms that support our experienced data analysts in their work. These algorithms are influenced by the decisions the analysts make in real time, which in turn allows us to tune in to the special features of every pipeline we analyze.
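The following minimal sketch, written in Python with invented features and labels, shows one generic way such a feedback loop can be wired up: an incremental classifier that is nudged by each confirmed analyst decision. It is an assumption-laden illustration of the general idea, not our actual implementation.

```python
# Hypothetical human-in-the-loop update step: the model adapts to the
# characteristics of the pipeline currently under analysis.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])        # e.g. 0 = ordinary feature, 1 = anomaly

# A model "pre-trained" offline on historical examples (simulated here).
model = SGDClassifier(random_state=0)
X_hist = rng.normal(size=(1000, 8))
y_hist = (X_hist[:, 0] + X_hist[:, 1] > 0).astype(int)
model.partial_fit(X_hist, y_hist, classes=classes)

def analyst_feedback(features, analyst_label):
    """Fold a single confirmed analyst decision back into the model."""
    model.partial_fit(features.reshape(1, -1), np.array([analyst_label]))

# During the evaluation of one pipeline, every confirmed call refines the model.
new_indication = rng.normal(size=8)
print("model suggestion:", model.predict(new_indication.reshape(1, -1))[0])
analyst_feedback(new_indication, analyst_label=1)
```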
Following this new concept comes at a cost
The machine learning algorithms we apply require prior training to ensure the best results; without it, there would be neither a time nor a quality advantage. This prior step of educating the computer - the actual machine learning - is the crucial and critical step. It is based on showing the computer matching and non-matching examples so that it can learn a proper rule for distinguishing between items and situations. This requires a well-chosen and meaningful set of examples covering all possible scenarios, which usually means millions of decisive and completely understood examples provided to the learning algorithm.
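The teaching step itself can be pictured with a very small supervised-learning sketch: a classifier is fitted on labeled "matching" and "non-matching" examples and then checked on held-out cases. Everything below - data, features, labels, model choice - is synthetic and only illustrates the principle.

```python
# Bare-bones illustration of the teaching step: the learner is shown labeled
# examples and derives a rule to separate them.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)

# Synthetic "examples": feature vectors with a known ground-truth label.
X = rng.normal(size=(5000, 12))
y = (1.5 * X[:, 0] + X[:, 3] - 0.5 * X[:, 7] > 0).astype(int)

# Held-out data stands in for situations the learner has not seen during
# training; poor coverage of real scenarios here is exactly what biases a model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

clf = RandomForestClassifier(n_estimators=200, random_state=1)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```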
Figure 2: We found clever ways to fill gaps in our ground truth database by creating all sorts of simulation frameworks, allowing us to perform large-scale analyses.
This is the main machine learning challenge in our industry. Getting access to these decisive examples usually requires large-scale investment in field verification results in order to acquire an independent measurement of a location for which prior in-line inspection data exists. Ideally, we would use a direct measurement method instead of one that requires data interpretation. This means literally digging up all sorts of anomalies that may be found in a pipeline, whether they affect its integrity or not, in order to prevent a bias in the algorithms. This demand sounds hard to address, doesn't it? The related price tag would be huge, and supporting such an approach is completely impractical from the operator's point of view. We know this, of course, and that is why we put a lot of effort into closing the gaps in our models by simulating all sorts of anomalies occurring in a pipeline with regard to all relevant measurement technologies we apply in our in-line inspection services. To do this at large scale, we have created the necessary simulation environments ourselves, again based on the same open-source tools we use in our data analysis algorithm development. The focus here is on filling the gaps for proper machine learning and on helping us understand the influence of every important parameter identified. These simulations are backed by a large number of laboratory measurements, pull-through and pump tests using real in-line inspection tools, and field verification campaigns, creating a complete and realistic basis for our machine-learning-based algorithms. The anticipated increase in result quality in the data evaluation process is continuously verified by repeated algorithm performance checks against this complete set of pipeline anomaly datasets reflecting real-world scenarios.
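Conceptually, the interplay between sparse field verifications, plentiful simulated examples, and a fixed set of real-world benchmark cases can be sketched as follows. The data generator, model choice, and numbers are hypothetical stand-ins, not a reflection of our actual simulation environments.

```python
# Schematic, heavily simplified version of the idea described above: sparse
# field-verified examples are augmented with simulated anomaly signatures, and
# performance is re-checked against a fixed set of real-world verification cases.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(7)

def make_examples(n, noise):
    """Stand-in for either field verifications or a physics-based simulator."""
    X = rng.normal(size=(n, 6))
    y = (X[:, 0] + 0.8 * X[:, 2] > 0).astype(int)
    return X + noise * rng.normal(size=X.shape), y

X_field, y_field = make_examples(150, noise=0.1)   # expensive, few
X_sim, y_sim = make_examples(5000, noise=0.3)      # cheap, many, less exact
X_bench, y_bench = make_examples(500, noise=0.1)   # fixed verification set

for name, (X, y) in {
    "field only": (X_field, y_field),
    "field + simulation": (np.vstack([X_field, X_sim]),
                           np.hstack([y_field, y_sim])),
}.items():
    model = GradientBoostingClassifier(random_state=7).fit(X, y)
    print(name, "benchmark accuracy:",
          round(accuracy_score(y_bench, model.predict(X_bench)), 3))
```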
There is still work to do
Now that we have found ways not to drown in the data lake, we are motivated to carry on with our work. Every piece of information is significant, and every real-world example counts, as it can be used to verify our simulations.
Figure 3: A very simple overview following the POF classification scheme. This is one example where we want to see the gaps filled; however, the required variation of parameters is much more complex and not easy to illustrate.
The next challenge seems to be the creation of a workflow to organize, manage, and share the necessary data and samples for the future benefit of everyone involved. This would serve as a reliable basis for future improvements in in-line inspection services, and therefore for assuring the value of pipeline-related assets, at least from a technical perspective.