Identification of two-phase flow patterns in vertical pipes using transformer neural networks

The architecture of the original TNN is based on the encoder-decoder, as shown in Figure 1. This architecture is widely used in tasks such as machine translation, where sequences of words are translated from one language to another. The architecture consists of two components: an encoder, which converts a sequence of input tokens into a sequence of embedding vectors, and a decoder, which iteratively uses the hidden state of the encoder to generate an output sequence of tokens, one token at a time.

2.1. Types of transformers

The Transformer architecture was originally designed for sequence-to-sequence tasks such as machine translation, but standalone encoder and decoder models were quickly adopted as well, resulting in three different types of Transformers [32]:

Encoder only: This model can convert text input sequences into numerical representations suitable for tasks such as text classification or named entity recognition. The representation computed for a given token in this architecture depends on bidirectional attention, meaning it relies on the left and right context of the token.

Decoder only: This type of model is able to automatically complete a sequence by iteratively predicting the most likely next word. In this case, the model relies on causal or autoregressive attention, which means that the representation computed for a token in this architecture only depends on the left-hand context.

Encoder-Decoder: They are used to model complex mappings from one text sequence to another, making them suitable for tasks such as machine translation and summarization.

For this work, the encoder-only model is used; therefore, the structure of the encoder block, shown in Figure 2, is explained in detail below. The embedding sequence enters this block and is processed through the following layers: a multi-head self-attention layer and a feed-forward layer.
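As a rough illustration of this structure, the following sketch assembles a single encoder block in PyTorch; it is not the authors' code, and the residual connections, layer normalization, and internal activation are standard Transformer choices assumed here for completeness.

```python
# Minimal sketch of an encoder block (multi-head self-attention + feed-forward).
# Assumes PyTorch; dimensions and internal activation are illustrative placeholders.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, embed_dim=32, num_heads=4, ff_dim=32, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),                      # placeholder activation (the paper compares several)
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                   # x: (batch, sequence_length, embed_dim)
        attn_out, _ = self.attn(x, x, x)    # multi-head self-attention sub-layer
        x = self.norm1(x + self.drop(attn_out))
        ff_out = self.ff(x)                 # position-wise feed-forward sub-layer
        return self.norm2(x + self.drop(ff_out))
```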
Self-attention is a mechanism that allows a neural network to assign a different amount of attention, that is, a different weight, to each element of a sequence. The main idea of self-attention is therefore to use the entire sequence to compute a weighted average of each embedding. In other words, given a sequence of token embeddings x_1, …, x_n, self-attention generates a sequence of new embeddings x'_1, …, x'_n, where each x'_i is a linear combination of the x_j, as shown in Equation (1):

$$x'_i = \sum_{j=1}^{n} w_{ji}\, x_j \qquad (1)$$

where the coefficients w_ji are the attention weights, normalized so that $\sum_j w_{ji} = 1$.

The self-attention process involves three vectors: Q (query), K (key), and V (value), obtained by multiplying each input embedding by weight matrices W learned during training. The dot product between the Q and K vectors is computed, the result is divided by the square root of the dimensionality of the original input vector (d_k), a softmax normalization is applied, and the result is multiplied by the vector V. For computational efficiency, the information is processed in matrix form. Equation (2) represents the attention process.

$$\mathrm{Attention}(Q, K, V) = Z = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \qquad (2)$$
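As a minimal sketch of Equation (2), assuming Q, K, and V are given as (sequence length × d_k) matrices, the scaled dot-product attention can be written as follows:

```python
# Scaled dot-product attention of Equation (2), sketched with NumPy.
# Q, K, V are (sequence_length, d_k) matrices; names and sizes are illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # dot products scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1 (the w_ji of Eq. (1))
    return weights @ V                   # Z: weighted average of the value vectors

# Toy example: a sequence of 3 tokens with d_k = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
Z = attention(Q, K, V)                   # shape (3, 4)
```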

2.2. Data structure

The database was constructed from existing experimental data reported in the literature on two-phase water-oil flow in vertical pipes. A total of 4864 data points were collected from 18 different authors, as shown in Table 1, which summarizes each author's data: the pipe diameter (D), the oil viscosity (μ_o) and density (ρ_o) used in the experiments, the type of oil used, the number of flow patterns identified, and the amount of data extracted.
Based on the protocol developed in each study, a total of 9 different flow patterns were identified, characterized by changes in the continuous phase, either water (w) or oil (o). These patterns include very fine dispersed droplets (VFD) o/w (265 cases) and w/o (204 cases), dispersed droplets (D) o/w (947 cases) and w/o (1292 cases), slug (S) o/w (459 cases) and w/o (656 cases), churn o/w (21 cases) and w/o (126 cases), annular (core flow, 480 cases), and transition flow (TF, 314 cases). Figure 3 provides a graphical representation of the behavior of the most prominent flow patterns.

Generally speaking, when a change from one flow pattern to another is evident, it is defined as a transition flow (TF). The studied database also contains well-developed information on the superficial velocities of the fluids involved and of their mixture, the volume fractions of water and oil, the diameter of the pipes used, the viscosity of the oil, and the type of flow pattern.

2.3. Transformer neural network model

The structure of the implemented model consists of the encoder of the TNN followed by a SoftMax layer at the model output. The input vector X consists of the superficial velocities of the fluids (oil and water), their sum, the in-situ volume fractions of the two fluids (hold-up), the viscosity of the oil, and the diameter of the circular cross-section pipe, as shown in Equation (3).

$$X = \left[\, J_o \quad J_w \quad J_{o+w} \quad \varepsilon_o \quad \varepsilon_w \quad D \quad \mu_o \,\right] \qquad (3)$$
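Purely as an illustration of Equation (3), the sketch below assembles one such input vector; the function name, argument names, and sample values are hypothetical.

```python
# Illustrative assembly of the input vector X of Equation (3).
# Field names and numeric values are hypothetical examples, not data from the paper.
import numpy as np

def build_input_vector(j_o, j_w, eps_o, eps_w, diameter, mu_o):
    """Return [J_o, J_w, J_{o+w}, eps_o, eps_w, D, mu_o] as a float array."""
    return np.array([j_o, j_w, j_o + j_w, eps_o, eps_w, diameter, mu_o],
                    dtype=np.float32)

x = build_input_vector(j_o=0.2, j_w=0.5,         # superficial velocities
                       eps_o=0.3, eps_w=0.7,     # in-situ volume fractions (hold-up)
                       diameter=0.05, mu_o=0.1)  # pipe diameter and oil viscosity
```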

The training optimizer chosen is Adam because it combines the characteristics of two methods: one is the momentum gradient descent algorithm, and the other is the RMSP (Root Mean Square Propagation) algorithm. Gradient descent with momentum speeds up the gradient descent algorithm by considering an exponentially weighted average of the gradients, meaning that using the average allows the algorithm to converge to the minimum faster. The method is defined in equations (4) and (5).

$$w_{t+1} = w_t - \alpha\, m_t \qquad (4)$$

Where

$$m_t = \beta\, m_{t-1} + (1 - \beta)\, \frac{\partial L}{\partial w_t} \qquad (5)$$

Here m_t is the aggregated gradient at the current time step t, initially set to 0; m_{t-1} is the aggregated gradient at time step t-1; w_t is the weight at time step t; w_{t+1} is the weight at time step t+1; and α is the learning rate at time step t. In addition, ∂L is the derivative of the loss function, ∂w_t is the derivative of the weights at time step t, and β = 0.9 is the moving-average parameter.
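For clarity, a single momentum update of Equations (4) and (5) might be sketched as follows (scalar form, illustrative only):

```python
# Gradient descent with momentum, Equations (4) and (5); scalar sketch.
def momentum_step(w, m_prev, grad, lr=0.001, beta=0.9):
    m = beta * m_prev + (1 - beta) * grad   # Eq. (5): exponentially weighted gradient
    w_next = w - lr * m                     # Eq. (4): weight update
    return w_next, m
```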

Instead of accumulating squared gradients, root mean square propagation (RMSP) uses an exponential moving average. Its definition is shown in the following formula (6).

$$w_{t+1} = w_t - \frac{\alpha}{\left(v_t + \epsilon\right)^{1/2}}\, \frac{\partial L}{\partial w_t} \qquad (6)$$

Where

$$v_t = \beta\, v_{t-1} + (1 - \beta)\left(\frac{\partial L}{\partial w_t}\right)^{2} \qquad (7)$$

where v_t is the aggregate of the squared past gradients, that is, the running accumulation of (∂L/∂w_{t-1})², initially set to v_t = 0; ε = 10⁻⁸ is a small positive constant that prevents division by zero when v_t → 0; and α = 0.001 is the learning rate parameter.
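Analogously, one RMSP update of Equations (6) and (7) can be sketched as (scalar form, illustrative only):

```python
# Root mean square propagation (RMSP), Equations (6) and (7); scalar sketch.
def rmsp_step(w, v_prev, grad, lr=0.001, beta=0.9, eps=1e-8):
    v = beta * v_prev + (1 - beta) * grad ** 2   # Eq. (7): moving average of squared gradients
    w_next = w - lr * grad / (v + eps) ** 0.5    # Eq. (6): update scaled by the RMS of gradients
    return w_next, v
```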

Therefore, Adam controls the gradient descent speed such that the global minimum is reached in small steps, and the steps are large enough to overcome local minima along the way, thereby achieving efficiency. In this case, equation (8) is used.

$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, \frac{\partial L}{\partial w_t}; \qquad v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\left(\frac{\partial L}{\partial w_t}\right)^{2} \qquad (8)$$

where β₁ = 0.9 and β₂ = 0.999 are the decay rates of the gradient averages used in the two previous methods.

Since m_t and v_t are initialized to 0 as in the previous methods, they tend to be biased towards 0 because β₁ and β₂ ≈ 1. Adam solves this problem by computing the bias-corrected estimates m̂_t and v̂_t. This controls the weights as the global minimum is approached and avoids large oscillations near it, meaning that the gradient-descent step adapts at each iteration. The equations used are (9) and (10):

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}} \qquad (9); \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}} \qquad (10)$$

so

$$w_{t+1} = w_t - \hat{m}_t \left(\frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\right)$$
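Combining Equations (8)-(10) with this final update, one complete Adam step can be sketched as follows (scalar form, illustrative only; t starts at 1):

```python
# One Adam update combining Eqs. (8), (9), (10) and the bias-corrected step; scalar sketch.
def adam_step(w, m_prev, v_prev, grad, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m_prev + (1 - beta1) * grad          # Eq. (8): first moment
    v = beta2 * v_prev + (1 - beta2) * grad ** 2     # Eq. (8): second moment
    m_hat = m / (1 - beta1 ** t)                     # Eq. (9): bias correction
    v_hat = v / (1 - beta2 ** t)                     # Eq. (10): bias correction
    w_next = w - m_hat * lr / (v_hat ** 0.5 + eps)   # final weight update
    return w_next, m, v
```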

The loss function implemented in the training of the developed model is the binary cross-entropy function, which is usually used in binary classification problems but can also be applied to problems where the variable to be predicted takes values between 0 and 1. This function is defined in Equation (11):

$$L_{BCE} = -\frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{C} \left[\, y_i \log \hat{y}_i + (1 - y_i) \log\!\left(1 - \hat{y}_i\right) \right] \qquad (11)$$

where y_i is the actual class to be predicted, ŷ_i is the predicted probability of the class, C is the number of classes, and n is the number of examples.
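As a small sketch of Equation (11), assuming one-hot labels and predicted probabilities arranged as (n × C) arrays:

```python
# Cross-entropy loss of Equation (11) for n examples and C classes; NumPy sketch.
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """y_true, y_pred: arrays of shape (n, C); y_pred holds probabilities in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    terms = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return -terms.sum(axis=1).mean()         # sum over classes, average over examples

# Toy example: 2 examples, 3 classes
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]])
loss = bce_loss(y_true, y_pred)
```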

Variations using 4 different activation functions are considered. The first is the ReLU function, which transforms an input value by zeroing out negative values and leaving positive values unchanged. It is defined in Equation (12).

$$\mathrm{ReLU}(x) = \max(0, x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases} \qquad (12)$$

The second activation function is the sigmoid function, which is used in models that need to predict the probability of an outcome, since it takes values between 0 and 1. Its definition is shown in Equation (13).

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \qquad (13)$$

In addition, the hyperbolic tangent function is implemented; it is continuous like the sigmoid function, but its output ranges from -1 to 1. It is defined in Equation (14):

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (14)$$

Finally, the GELU activation function is used, as defined in Equation (15); it adds nonlinearity by weighting the input with the Gaussian cumulative distribution function (expressed through the error function, erf), behaving similarly to the ReLU function.

$$\mathrm{GELU}(x) = \frac{1}{2}\, x \left[ 1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right) \right] \qquad (15)$$
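The four activation functions of Equations (12)-(15) can be expressed directly, for example with NumPy and SciPy's error function:

```python
# Activation functions of Equations (12)-(15), sketched with NumPy/SciPy.
import numpy as np
from scipy.special import erf

def relu(x):
    return np.maximum(0.0, x)                                    # Eq. (12)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                              # Eq. (13)

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # Eq. (14)

def gelu(x):
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))               # Eq. (15)
```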

First, the learning rate (LR) is varied over the values 0.005, 0.001, and 0.0005. The dropout rate takes values of 0.1, 0.3, and 0.5. The number of attention heads in the multi-head attention layer is set to 1, 2, and 4, and a single encoder block is implemented. The embedding dimension is 32, and the feed-forward layer dimension is also 32. For each specific configuration, 55 epochs were used during neural-network training with a batch size of 2800. To compare the performance of the different configurations, accuracy was evaluated as the number of correct predictions divided by the total number of data points at each stage.
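Purely as a hedged sketch of how such a configuration sweep could be organized (not the authors' implementation), the snippet below uses PyTorch's standard TransformerEncoderLayer with the dimensions quoted above; the linear embedding of the feature vector, its treatment as a length-1 sequence, the classifier head, and the training loop are assumptions.

```python
# Illustrative configuration sweep over learning rate, dropout, and attention heads.
# Model assembly and training details are assumptions, not the paper's exact code.
import itertools
import torch
import torch.nn as nn

N_FEATURES, N_CLASSES = 7, 9   # input vector X of Eq. (3); flow-pattern classes

class FlowPatternTNN(nn.Module):
    def __init__(self, num_heads, dropout, embed_dim=32, ff_dim=32):
        super().__init__()
        self.embed = nn.Linear(N_FEATURES, embed_dim)    # project features to an embedding
        self.encoder = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=ff_dim, dropout=dropout, batch_first=True)
        self.head = nn.Linear(embed_dim, N_CLASSES)

    def forward(self, x):                       # x: (batch, N_FEATURES)
        z = self.embed(x).unsqueeze(1)          # treat each sample as a length-1 sequence
        z = self.encoder(z).squeeze(1)          # single encoder block
        return torch.softmax(self.head(z), dim=-1)   # SoftMax output layer

for lr, dropout, heads in itertools.product([0.005, 0.001, 0.0005],
                                            [0.1, 0.3, 0.5],
                                            [1, 2, 4]):
    model = FlowPatternTNN(heads, dropout)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # ... train for 55 epochs with a batch size of 2800 and record the accuracy ...
```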

Furthermore, for the models selected on the basis of their performance, a confusion matrix was generated, from which the precision and accuracy for each flow pattern were extracted. First, the values of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) must be calculated. Precision is defined by Equation (16) and accuracy by Equation (17).

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (16)$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \qquad (17)$$
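Given a confusion matrix, Equations (16) and (17) can be evaluated per class as in the sketch below; the matrix values are a toy example.

```python
# Per-class precision and accuracy from a confusion matrix, Equations (16) and (17).
import numpy as np

def per_class_metrics(cm):
    """cm[i, j]: number of samples of true class i predicted as class j."""
    total = cm.sum()
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                      # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp                      # samples of the class that were missed
    tn = total - tp - fp - fn
    precision = tp / (tp + fp)                    # Eq. (16)
    accuracy = (tp + tn) / (tp + fn + fp + tn)    # Eq. (17)
    return precision, accuracy

# Toy 3-class confusion matrix (rows: true class, columns: predicted class)
cm = np.array([[50, 3, 2],
               [4, 45, 6],
               [1, 5, 40]])
precision, accuracy = per_class_metrics(cm)
```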
