### Experimental environment and data sets

The experimental environment is as follows: Debian 3.16.0-4-amd64, Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz processors, 256GB RAM, Apache Spark 2.2.1, Scala 2.11.8 and JDK 1.8.0_71. We built a Spark application with standalone cluster task scheduling on a 48-core server. The CRF implementation used in the experiments is an open-source CRF algorithm for Spark [12]. It uses Adam and AdaGrad optimizers implemented on Spark, which gives it better performance than other methods [13, 14].
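The text does not give the exact submission command; as a hedged sketch, an application in Spark standalone mode might be launched as follows (the master URL, class name, resource settings and jar name are all illustrative assumptions, not values from the paper):

```shell
# Hypothetical spark-submit invocation for a standalone cluster on the
# 48-core server; every name and setting below is an assumption.
spark-submit \
  --master spark://master-host:7077 \
  --class bner.Train \
  --executor-memory 32G \
  --total-executor-cores 48 \
  bner-crf.jar train.iob2 model.out
```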

The datasets used are the corpus (in IOB2 format) manually annotated in our previous work [8] for bacteria named entity recognition, and 50,000 unannotated biomedical abstracts downloaded from PubMed with the keywords “human”, “oral” and “bacteria”.
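In IOB2 format, each line holds a token and its tag, a `B-` tag opens an entity, `I-` tags extend it, and sentences are separated by blank lines. A minimal parsing sketch (the tab-separated layout and the `B-BACT`/`I-BACT` tag names are assumptions for illustration, not necessarily the corpus's actual conventions):

```scala
// Parse IOB2-formatted lines ("token<TAB>tag", sentences separated by blank
// lines) into sentences of (token, tag) pairs. Tag names are illustrative.
def parseIOB2(lines: Seq[String]): Seq[Seq[(String, String)]] = {
  val sentences = scala.collection.mutable.ListBuffer[Seq[(String, String)]]()
  val current = scala.collection.mutable.ListBuffer[(String, String)]()
  for (line <- lines) {
    if (line.trim.isEmpty) {
      // Blank line: close the current sentence, if any.
      if (current.nonEmpty) { sentences += current.toList; current.clear() }
    } else {
      val Array(token, tag) = line.split("\t")
      current += ((token, tag))
    }
  }
  if (current.nonEmpty) sentences += current.toList
  sentences.toList
}

// Recover entity strings: a B- tag starts a span, I- tags extend it.
def extractEntities(sentence: Seq[(String, String)]): Seq[String] = {
  val entities = scala.collection.mutable.ListBuffer[List[String]]()
  for ((token, tag) <- sentence) {
    if (tag.startsWith("B-")) entities += List(token)
    else if (tag.startsWith("I-") && entities.nonEmpty)
      entities(entities.length - 1) = entities.last :+ token
  }
  entities.map(_.mkString(" ")).toList
}
```

For example, a sentence tagged `Streptococcus B-BACT / mutans I-BACT / causes O` would yield the single entity "Streptococcus mutans".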

### Methods

In this paper, we mainly study a computing platform for bacteria named entity recognition based on conditional random fields and Spark. To begin with, we extracted 34 features, such as word features and affix features. We trained the CRF model on a training set in Spark and then evaluated the model's performance on a test set. Finally, we compared the Spark version with CRF++ on a single node under the same conditions to verify the efficiency of the system, and applied both to a large-scale unannotated corpus to compare their prediction speed.
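The 34 features themselves are specified in [8]; as a hedged illustration of what word- and affix-style features look like, a sketch (the particular feature names and choices below are assumptions, not the paper's actual feature set):

```scala
// Illustrative word- and affix-style features for one token. The real
// system uses the 34 sub-features from [8]; these are assumptions.
def wordFeatures(token: String): Map[String, String] = Map(
  "word"     -> token.toLowerCase,                          // word feature
  "isUpper"  -> token.headOption.exists(_.isUpper).toString, // capitalization
  "hasDigit" -> token.exists(_.isDigit).toString,            // digit presence
  "prefix3"  -> token.take(3).toLowerCase,                   // affix feature
  "suffix3"  -> token.takeRight(3).toLowerCase               // affix feature
)
```

Applying such a function to every token of a sentence produces the rows of the feature matrix consumed by the CRF.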

### Spark computing framework

Representative batch systems include MapReduce [15], Spark [9], Pregel [16] and Trinity [17]. Among them, Spark is implemented in Scala and is compatible with Hadoop's original ecosystem, while overcoming the shortcomings of MapReduce in iterative computing and interactive data analysis. In addition, it offers scalability, high reliability and load balancing, and has huge community support, so it has become the most active and efficient general-purpose computing platform for big data. The Resilient Distributed Dataset (RDD) [18] is the core data structure of Spark: the scheduling order of Spark is determined by the dependencies between RDDs, and an entire Spark program is expressed as operations on RDDs. With this in-memory computation model, Spark supports machine learning and other iterative computations well and has better computational efficiency than MapReduce.

### Conditional random field

The conditional random field was first proposed by Lafferty et al. in 2001 [19]. It is a discriminative undirected graphical model that models the conditional probability of a label sequence given an observation sequence. In the biomedical field, linear-chain CRFs are generally used for sequence labeling tasks such as named entity recognition and part-of-speech tagging.

Assuming X and Y are random variables, P(Y| X) is the conditional probability distribution of Y given X. If the random variable Y constitutes a Markov random field represented by an undirected graph G = (V,E),

$$ P\left(Y_v \mid X, Y_w, w \ne v\right) = P\left(Y_v \mid X, Y_w, w \sim v\right) $$

(1)

that is, Eq. (1) holds for any node v, then the conditional probability distribution P(Y|X) is called a conditional random field.

In Eq. (1), w ∼ v denotes all nodes w that share an edge with node v in the graph G = (V, E), w ≠ v denotes all nodes other than node v, and Y_{v} and Y_{w} are the random variables associated with nodes v and w.

Assume that X = (X_{1}, X_{2}, …, X_{n}) and Y = (Y_{1}, Y_{2}, …, Y_{n}) are both sequences of random variables arranged in a linear chain. If, given the random variable sequence X, the conditional probability distribution P(Y|X) of the random variable sequence Y satisfies the Markov property:

$$ P\left(Y_i \mid X, Y_1, \dots, Y_{i-1}, Y_{i+1}, \dots, Y_n\right) = P\left(Y_i \mid X, Y_{i-1}, Y_{i+1}\right) $$

(2)

where i = 1, 2, …, n (only one neighboring side is considered when i = 1 or i = n),

then P(Y|X) is a linear-chain conditional random field. In the labeling problem, X represents the input observation sequence and Y represents the corresponding output (state) sequence. Given that the random variable X takes the value x and Y takes the value y, the parametric form of the conditional probability is as follows:

$$ P(y \mid x) = \frac{1}{Z(x)} \exp \left\{ \sum_{i,k} \lambda_k t_k\left(y_{i-1}, y_i, x, i\right) + \sum_{i,l} \mu_l s_l\left(y_i, x, i\right) \right\} $$

(3)

$$ Z(x) = \sum_{y} \exp \left\{ \sum_{i,k} \lambda_k t_k\left(y_{i-1}, y_i, x, i\right) + \sum_{i,l} \mu_l s_l\left(y_i, x, i\right) \right\} $$

(4)

Here t_{k} and s_{l} are feature functions: t_{k} is a transition feature defined on adjacent labels and s_{l} is a state feature defined on a single label; each takes the value 1 when its feature is satisfied and 0 otherwise. λ_{k} and μ_{l} are the corresponding weights, and Z(x) is a normalization factor obtained by summing over all possible output sequences. A conditional random field is completely determined by its feature functions and their weights. The main tasks in training are feature selection and parameter estimation: the purpose of feature selection is to choose a feature set that can express the random process, and parameter estimation assigns a weight to each selected feature. Training essentially amounts to estimating the weight parameters of the feature functions by maximum likelihood. When training is complete, the maximum likelihood distribution and model parameters are obtained; for a new observation sequence X, the most likely output sequence Y is predicted with the trained model. Conditional random fields can make full use of contextual label information to achieve good labeling results.
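The definition in Eqs. (3) and (4) can be made concrete by brute force on a toy example: enumerate every label sequence y, score it with one transition feature and one state feature, and normalize by Z(x). This is only feasible for tiny inputs (real systems use dynamic programming); the tag set and the two hand-weighted features below are assumptions for illustration:

```scala
// Toy linear-chain CRF evaluated exactly per Eqs. (3)-(4).
val labels = Seq("B", "I", "O") // illustrative tag set

// Weighted sum of one transition feature t_k (B followed by I) and one
// state feature s_l (capitalized word tagged B), with weights 1.0 and 2.0.
def score(y: Seq[String], x: Seq[String]): Double = {
  val transition = (1 until y.length).map { i =>
    if (y(i - 1) == "B" && y(i) == "I") 1.0 else 0.0 // lambda_k * t_k
  }.sum
  val state = y.indices.map { i =>
    if (x(i).head.isUpper && y(i) == "B") 2.0 else 0.0 // mu_l * s_l
  }.sum
  transition + state
}

// All label sequences of length n over the tag set.
def allSequences(n: Int): Seq[Seq[String]] =
  if (n == 0) Seq(Seq.empty)
  else for (rest <- allSequences(n - 1); l <- labels) yield l +: rest

// P(y|x): exponentiated score divided by the partition function Z(x).
def probability(y: Seq[String], x: Seq[String]): Double = {
  val z = allSequences(x.length).map(s => math.exp(score(s, x))).sum // Eq. (4)
  math.exp(score(y, x)) / z                                          // Eq. (3)
}
```

By construction the probabilities over all label sequences sum to 1, and sequences that fire more features (e.g. "B I" over a capitalized first token) receive higher probability.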

The computational scale of conditional random field training is related to the size of the training set, the templates and the number of output tags. The input sentences in biological texts are generally very long, so training on large-scale data suffers from long optimization times and large memory consumption. Improving the efficiency of CRFs on massive data has therefore become an active research topic in biomedical named entity recognition. The work in [20] implements CRF training on large-scale multi-core parallel processing systems and can process data sets with hundreds of thousands of sequences and millions of features, significantly reducing computation time; by using a second-order Markov dependency during training, the model also achieved higher accuracy. The work in [21] handles complex computing tasks by decomposing the learning process into smaller and simpler sub-problems; it developed a core approach to learn CRF structure and parameters and accelerated the regression using increasingly parallel platforms. The work in [22] controls the number of non-zero coefficients by introducing penalties into the CRF model; setting execution time aside, it trains CRFs with hundreds of output tags and up to several billion features. In [23], CRF-RNN, a new neural network based on a mean-field approximation with Gaussian potential functions for CRFs, is proposed; applied to semantic image segmentation, it obtained the best result on the challenging Pascal VOC 2012 segmentation benchmark. The work in [24] achieves MapReduce-based parallel training of CRFs while ensuring the correctness of the training results, greatly reducing training time and improving performance. Although this MapReduce-based implementation can handle large-scale training sets and feature sets, its execution efficiency is not high enough. The work in [25] converts all data into RDDs stored in the memory of the cluster nodes, implementing SparkCRF, a distributed CRF running in a cluster environment. Experiments show that SparkCRF has high computing performance and good scalability, with the same accuracy level as the traditional single-node CRF++.

### Design and implementation of the system

The proposed system is written in Scala. First, we extracted features from the datasets on the Spark platform. The features used are the optimal 34 sub-features selected by the single optimal combination method in our previous work [8]; a feature matrix was generated in the next step. The training and prediction steps were executed using the open-source CRF toolkit based on Spark (we call it “Spark-CRF”). The flow chart of the bacteria named entity recognition system is shown in Fig. 1.

The system workflow includes two stages: training and prediction. Spark-CRF creates RDDs on the nodes, and user-defined Transformations and Actions are used for preprocessing, feature extraction, model training and prediction.
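In the prediction stage, a trained linear-chain CRF is typically decoded with the Viterbi algorithm, which finds the highest-scoring label sequence by dynamic programming. A self-contained sketch, assuming the per-position emission scores and tag-transition scores come from an already trained model (this is a standard decoder, not necessarily Spark-CRF's exact implementation):

```scala
// Viterbi decoding for a linear-chain model.
// emit(i)(j): score of tag j at position i (assumed from a trained model)
// trans(a)(b): score of transitioning from tag a to tag b
def viterbi(emit: Array[Array[Double]],
            trans: Array[Array[Double]]): Seq[Int] = {
  val n = emit.length
  val k = emit(0).length
  val delta = Array.ofDim[Double](n, k) // best score ending in tag j at i
  val back  = Array.ofDim[Int](n, k)    // backpointers
  for (j <- 0 until k) delta(0)(j) = emit(0)(j)
  for (i <- 1 until n; j <- 0 until k) {
    // Best predecessor tag for tag j at position i.
    val (best, arg) =
      (0 until k).map(p => (delta(i - 1)(p) + trans(p)(j), p)).maxBy(_._1)
    delta(i)(j) = best + emit(i)(j)
    back(i)(j) = arg
  }
  // Follow backpointers from the best final tag.
  var tag = (0 until k).maxBy(j => delta(n - 1)(j))
  val path = scala.collection.mutable.ListBuffer(tag)
  for (i <- (n - 1) to 1 by -1) { tag = back(i)(tag); tag +=: path }
  path.toList
}
```

The same decoder can be applied independently to every sentence, which is what makes large-scale prediction easy to parallelize across partitions.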

### Evaluation metrics

Precision (P), Recall (R) and F-Measure (F) are generally used to evaluate the performance of an NER system. They are defined as follows, respectively.

$$ P=\frac{TP}{TP+FP} $$

(5)

$$ R=\frac{TP}{TP+FN} $$

(6)

$$ F=\frac{2\times P\times R}{P+R} $$

(7)

Here, TP is the number of bacteria named entities correctly identified by the model, FP is the number of items the model incorrectly labels as bacteria named entities, and FN is the number of true bacteria named entities that the model fails to identify. P represents the precision, R represents the recall, and F-Measure is the harmonic mean of P and R.
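Eqs. (5)-(7) translate directly into code; a minimal sketch:

```scala
// Precision, recall and F-measure from TP/FP/FN counts, per Eqs. (5)-(7).
def precision(tp: Int, fp: Int): Double = tp.toDouble / (tp + fp)
def recall(tp: Int, fn: Int): Double = tp.toDouble / (tp + fn)
def fMeasure(p: Double, r: Double): Double = 2 * p * r / (p + r) // harmonic mean
```

For example, with TP = 8, FP = 2 and FN = 8, precision is 0.8, recall is 0.5, and the F-measure is 2 × 0.8 × 0.5 / 1.3 ≈ 0.615 (noticeably lower than the arithmetic mean 0.65, since the harmonic mean penalizes imbalance).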