Metrics for Automatic Evaluation of Text from NLP Models for Text to Scene Generation

— Performance metrics give us an indication of which model is better suited for which task. Researchers applying machine learning and deep learning models measure their performance through cost functions or evaluation criteria such as mean square error (MSE) for regression, and accuracy and F1-score for classification tasks. In NLP, however, performance measurement is more complex because both the ground truth and the obtained results can vary widely.


I. INTRODUCTION
In Natural Language Processing, models can exhibit bias arising from the dataset or from the performance evaluation criteria. Hence there is a need to apply standard benchmark metrics to evaluate the performance of models on NLP tasks. NLP is widely used in research for many applications such as machine translation, question answering, text summarization, image captioning, and sentiment analysis [1].
Automatic evaluation of natural language generation, for applications like machine translation and caption generation, requires comparing candidate sentences to annotated references. The goal is to evaluate semantic equivalence, although most methods rely on surface-form similarity. Generally, we evaluate machine-generated text against a target text (truth value): the generated text is the output produced by the model, and the target or reference text is the original ground-truth text. The performance of a subtask can be measured with intrinsic evaluation metrics, which focus on intermediary objectives, or with extrinsic evaluation, which focuses on performance on the final objective. Carefully choosing metrics is an important part of ensuring that the system we work with is usable [2].
The text generated by several NLP models coupled with ML techniques can be used to compare models in the NLG domain. The commonly used evaluation metrics are discussed below.

II. IMPLEMENTATION
Natural language generation requires comparing candidate sentences to annotated references. Given a reference set x = {x1, x2, x3, …, xk} with k sentences and a candidate set y = {y1, y2, y3, …, yl} with l sentences, an evaluation metric is a function z = f(x, y) ∈ R. The selection of an evaluation metric depends on the type of NLP task or application; choosing a better metric helps provide correlation with human judgment. Existing metrics can be broadly categorized into n-gram matching, edit distance, and embedding matching [3].
The intrinsic metrics used to evaluate NLP systems are as follows. Accuracy: the accuracy metric is used in classification tasks to measure the closeness of a measured value to a known value. It is typically used where the output variable is categorical or discrete.
Precision: the precision metric reports how many of the instances the classifier labeled as positive are actually positive.
Recall: recall measures how well the model retrieves the positive class. The recall value signifies the number of positive instances the model correctly identified as positive.
F1 Score: precision and recall are complementary metrics with an inverse relationship. If both metrics are equally important, the F1 score combines precision and recall into a single metric [4]. A short sketch of these four metrics is given below.
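The following is a minimal sketch of the four intrinsic metrics, assuming scikit-learn is available; the label vectors are hypothetical stand-ins for a binary classification task.

```python
# Minimal sketch of the intrinsic classification metrics above (scikit-learn
# assumed installed); the label vectors below are hypothetical.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # reference (ground-truth) labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # labels predicted by the classifier

print("Accuracy :", accuracy_score(y_true, y_pred))   # closeness to known values
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))         # harmonic mean of P and R
```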
The popular metrics available are built upon exact-match scores and are listed below.
Bilingual Evaluation Understudy (BLEU): The BLEU score evaluates the quality of text that has been translated by a machine from one natural language to another, and is a standard performance metric for machine translation models. The translation output is compared with the ground truth on unigrams, bigrams, or trigrams. Among the shortcomings of the BLEU score are that it does not consider meaning, sentence structure, or morphologically rich languages [7]. The BLEU score was used here to evaluate sentences related to interior design: 76,068 sentences were taken as the reference set and 54 non-repetitive sentences as the candidate set, yielding a precision of 47.74 and a BLEU score of 46.59 as shown in Fig. 2.
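The following is a minimal sketch of a sentence-level BLEU computation with NLTK (assumed installed); the sentences are hypothetical stand-ins for the interior-design reference and candidate sets described above.

```python
# Sentence-level BLEU with NLTK; weights=(0.5, 0.5) restricts matching to
# unigrams and bigrams, mirroring the bigram-pair evaluation reported later.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "sofa", "is", "placed", "near", "the", "window"]]
candidate = ["the", "sofa", "is", "near", "the", "window"]

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
score = sentence_bleu(reference, candidate, weights=(0.5, 0.5),
                      smoothing_function=smooth)
print(f"BLEU: {score:.4f}")
```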
METEOR: The Metric for Evaluation of Translation with Explicit ORdering (METEOR) is a metric for the evaluation of machine-translation output. It overcomes some of the pitfalls of the BLEU score, such as requiring exact word matches when calculating precision: METEOR allows synonyms and stemmed words to be matched with a reference word, so n-grams can be matched on stemmed forms and meanings. METEOR uses unigram precision and recall to compute a score.
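Below is a minimal sketch of METEOR scoring with NLTK (assumed installed). Recent NLTK versions expect pre-tokenized input and need the WordNet data for synonym and stem matching, so both assumptions are made explicit; the sentences are hypothetical.

```python
# METEOR with NLTK; WordNet data is required for synonym/stem matching.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.translate.meteor_score import meteor_score

reference = ["the", "lamp", "stands", "beside", "the", "armchair"]
candidate = ["a", "lamp", "is", "standing", "beside", "the", "armchair"]

# Stemmed forms ("stands"/"standing") can still match, unlike exact-match BLEU.
print(f"METEOR: {meteor_score([reference], candidate):.4f}")
```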
ROUGE: The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric measures recall. It is typically used for evaluating the quality of generated text and for machine translation tasks; however, since it measures recall, it is mainly used in summarization tasks [5].
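The following is a minimal sketch using Google's rouge-score package (an implementation choice assumed here; the paper does not name one). ROUGE-1 counts unigram overlap and ROUGE-L uses the longest common subsequence.

```python
# ROUGE-1 and ROUGE-L via the rouge-score package; sentences are hypothetical.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the table is placed at the center of the room",  # reference
    prediction="a table stands in the center of the room",   # candidate
)
for name, result in scores.items():
    print(name, f"P={result.precision:.2f} R={result.recall:.2f} "
                f"F={result.fmeasure:.2f}")
```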
CHRF Score: Character-level n-grams play an important role in automatic evaluation, both on their own and as components of more complex metrics [8]. The character n-gram based F-score, especially linguistically motivated variants based on part-of-speech tags and morphemes, correlates very well with human judgments, outperforming widely used metrics such as BLEU and TER.
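Below is a minimal sketch of the character n-gram F-score (chrF) using NLTK's implementation (assumed available); its default beta = 3.0 weights recall three times as much as precision. The sentence pair is hypothetical.

```python
# Character-level n-gram F-score (chrF) with NLTK.
from nltk.translate.chrf_score import sentence_chrf

reference = "the curtains match the colour of the wall"
candidate = "the curtains match the color of the wall"

# Character n-grams give near-matches like "colour"/"color" partial credit
# that word-level exact matching would miss.
print(f"chrF: {sentence_chrf(reference, candidate):.4f}")
```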
NIST: NIST provides the evaluation infrastructure in which MT system output is assessed for quality against the source files. The goal is to establish correlation between the metrics and human assessment; different types of human assessment are used.
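The following is a minimal sketch of NIST scoring with NLTK (assumed installed); NIST is BLEU-like but weights rarer, more informative n-grams more heavily. The sentences are hypothetical.

```python
# NIST score with NLTK; n=4 caps the n-gram order (the default is 5).
from nltk.translate.nist_score import sentence_nist

reference = ["the", "bookshelf", "is", "mounted", "on", "the", "wall"]
candidate = ["the", "bookshelf", "is", "fixed", "on", "the", "wall"]

print(f"NIST: {sentence_nist([reference], candidate, n=4):.4f}")
```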
The plots given below help us to understand the ROC based on the performance metrics. The scores are probabilistic values ranging from 0 to 1; the values shown are the scores of the different metrics for 100 and 300 sentences, as shown in Fig. 5.
BERT Score: The BERT score leverages pre-trained contextual embeddings from BERT and matches words in the candidate and reference sentences by cosine similarity. It correlates with human judgment at sentence-level evaluation. Moreover, BERT score computes precision, recall, and F1 measures, which is useful for evaluating different language generation tasks [12]. An accuracy of 97% was obtained for 900 sentences, as shown in Fig. 6.
BLEURT: BLEURT is an evaluation metric for natural language generation. It is built using multiple phases of transfer learning, starting from a pre-trained BERT model and then employing another pre-training phase on synthetic data [6]. Finally, it is trained on human annotations [10]. BLEURT can be run out of the box or fine-tuned for a specific application, as shown in Fig. 8.
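Below is a minimal sketch of BERT score using the bert-score package (an implementation choice assumed here; the paper does not name one). The package downloads a pre-trained model on first use, and the sentence pair is hypothetical.

```python
# BERTScore: cosine similarity over contextual token embeddings yields
# precision, recall, and F1 tensors for each candidate/reference pair.
from bert_score import score

candidates = ["a rug lies in front of the fireplace"]
references = ["there is a rug in front of the fireplace"]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```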

III. MATHEMATICAL FORMULATIONS WITH UNITS
The most commonly used approach for evaluating text generation is to count the number of n-grams that occur in both the reference x and the candidate y. Formally, let S(x) and S(y) be the lists of token n-grams (n ∈ Z+) in the reference x and the candidate y. The number of matched n-grams is Σ_{w ∈ S(y)} I[w ∈ S(x)], where I is an indicator function. The exact-match precision (Exact-Pn) and recall (Exact-Rn) scores are:

Exact-Pn = Σ_{w ∈ S(y)} I[w ∈ S(x)] / |S(y)|

Exact-Rn = Σ_{w ∈ S(x)} I[w ∈ S(y)] / |S(x)|
The unit of these metrics is a probability varying from 0 to 1, where 0 indicates the least probability and 1 indicates the highest probability.
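The sketch below is a direct transcription of the Exact-Pn and Exact-Rn formulas above, assuming simple whitespace tokenization; it is written from the formulas, not from the authors' code, and the example sentences are hypothetical.

```python
# Exact n-gram match precision and recall, transcribed from the formulas.
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """List of token n-grams S(.) of a tokenized sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def exact_pn_rn(x: str, y: str, n: int = 1):
    """x is the reference sentence, y the candidate sentence."""
    sx, sy = ngrams(x.split(), n), ngrams(y.split(), n)
    p = sum(w in sx for w in sy) / len(sy)  # Exact-Pn: matched / |S(y)|
    r = sum(w in sy for w in sx) / len(sx)  # Exact-Rn: matched / |S(x)|
    return p, r

p, r = exact_pn_rn("the chair is by the desk", "a chair is near the desk", n=1)
print(f"Exact-P1={p:.2f} Exact-R1={r:.2f}")
```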

IV. RESULTS AND DISCUSSIONS
We evaluated the metrics on a dataset of 1k sentences obtained by applying the RNN-LSTM model to human-annotated sentences, using the automatic metrics BLEU, CHRF, GLEU, METEOR, NIST, and ROUGE. The experimental setup uses 600 sentences as reference sentences and 300 as candidate sentences; a step-by-step evaluation is carried out on 10, 100, 200, and 300 sentences, and the comparative scores are recorded in Table I.
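The following is a hedged sketch of this step-by-step protocol: corpus-level BLEU is recomputed as the candidate pool grows. The `references` and `candidates` lists are hypothetical placeholders; the interior-design dataset itself is not reproduced here.

```python
# Step-by-step evaluation over growing candidate pools (placeholder data).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [["the sofa is near the window".split()]] * 300  # one ref each
candidates = ["a sofa is by the window".split()] * 300

smooth = SmoothingFunction().method1
for size in (10, 100, 200, 300):
    score = corpus_bleu(references[:size], candidates[:size],
                        weights=(0.5, 0.5), smoothing_function=smooth)
    print(f"{size:>3} sentences -> BLEU {score:.4f}")
```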
The scores show that the BLEU score increases with an increase in candidate sentences; these results are based on bigram pairs. The CHRF, GLEU, and ROUGE [5] scores also increase with the number of candidate sentences, whereas the NIST score decreases and the METEOR score is zero, since the sentences considered are interior-design related and hence adequacy and fluency errors exist [11].
The experimental results show that the ROUGE score works well for interior-design sentences by considering n-gram overlap scores. These scores correlate with human evaluation of summaries up to some level of accuracy. ROUGE scores can also be used to compare two candidate summarization systems, as shown in Fig. 8. The best evaluation policy is to collect human judgments, provided there is sufficient time and budget; recently, such scoring criteria have been used in summarization tasks. The experiments were further carried out for n iterations to check whether the values are consistent across n batches of the dataset. The scores remained approximately the same across iterations; hence, the ROUGE score helps with matching n-grams in all batches of reference and candidate summaries, as shown in Fig. 9.

V. CONCLUSION
The experimentation carried out shows that the ROUGE scores outperform the other metrics by considering overlapping n-grams. The METEOR scores are zero irrespective of the number of iterations because the dataset lacks fluency and adequacy. The interior-design dataset was generated by applying the RNN-LSTM model to human-annotated sentences; since the dataset is generated rather than translated, it is hard to obtain fluency and adequacy.