From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

Warning: this paper may contain potentially harmful generated content.

1Institute of Computing Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
3The Chinese University of Hong Kong, Shenzhen
Figure: overview of early stopping harmful LLM outputs via streaming content monitoring.


Abstract

Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation stage as an external safety guardrail in real-world products. Existing moderators mainly practice conventional full detection, which determines harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection, where moderators oversee the generation midway and stop the output early if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained annotations that provide reasonable supervision for token-level training. We then propose the streaming content monitor (SCM), which is trained with dual supervision of response- and token-level labels and can follow the output stream of an LLM to make a timely judgment of harmfulness. Experiments show that SCM achieves a macro F1 score of 0.95+, comparable to full detection, while seeing only the first 18% of tokens in responses on average. Moreover, SCM can serve as a pseudo-harmfulness annotator for improving safety alignment, leading to a higher harmlessness score than DPO.



Heuristic Fine-grained Data Annotation


We noted that harmful intents correlate more strongly with notional words (nouns, verbs, adjectives) than with function words (prepositions, etc.), so we devised a heuristic annotation approach that exploits part-of-speech (POS) tagging to filter out function words and label only notional words as harmful. The steps are as follows:

1) Split the response into sentences and annotate the sentence-level labels using an LLM.

2) Keep only the responses whose response- and sentence-level labels are consistent, i.e., harmful responses that contain at least one harmful sentence, and benign responses in which all sentences are benign.

3) For each harmful sentence, we obtain the POS of each word and label the notional words as harmful words. For each sentence judged as benign, we label all the words in the sentence as benign words.
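To make step 3 concrete, below is a minimal sketch of the POS-based word labeling, assuming the sentence-level labels from steps 1 and 2 are already available; the use of spaCy and the exact notional-POS set are illustrative assumptions, not necessarily the paper's implementation.

```python
# Sketch of step 3: POS-based word labeling (assumes spaCy and that sentence-level
# labels come from steps 1-2; the NOTIONAL_POS set and function name are illustrative).
import spacy

nlp = spacy.load("en_core_web_sm")
NOTIONAL_POS = {"NOUN", "VERB", "ADJ"}  # notional words per the heuristic above

def annotate_sentence(sentence: str, sentence_is_harmful: bool):
    """Return (word, label) pairs; label 1 = harmful word, 0 = benign word."""
    labels = []
    for token in nlp(sentence):
        if sentence_is_harmful and token.pos_ in NOTIONAL_POS:
            labels.append((token.text, 1))  # notional word in a harmful sentence
        else:
            labels.append((token.text, 0))  # function word, or any word in a benign sentence
    return labels
```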

The following figure illustrates the position distribution of the harmful words in responses.

Figure: position distribution of harmful words in responses in FineHarm.


Streaming Content Monitor


During the training phase, the holistic scorer is a detachable module that helps build a comprehensive view of the whole response and a rough perception of possible future tokens. During the inference phase, we use only the well-trained feature extractor and token scorer to follow the LLM's output stream and make timely judgments with acceptable latency; a hyperparameter $k$ determines the minimum number of detected harmful tokens that terminates generation.
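For intuition, the following is a minimal PyTorch-style sketch of how the three modules could be composed; the module names, linear scoring heads, and hidden size are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of the SCM composition: a shared feature extractor, a per-token
# scorer, and a holistic scorer that is detached after training (all illustrative).
import torch
import torch.nn as nn

class StreamingContentMonitor(nn.Module):
    def __init__(self, feature_extractor: nn.Module, hidden_size: int):
        super().__init__()
        self.feature_extractor = feature_extractor        # e.g., a causal LM backbone
        self.token_scorer = nn.Linear(hidden_size, 1)     # per-token score c_{t_i}
        self.holistic_scorer = nn.Linear(hidden_size, 1)  # detached after training

    def forward(self, input_ids: torch.Tensor):
        # Assumed: feature_extractor returns hidden states of shape (batch, seq_len, hidden_size).
        h = self.feature_extractor(input_ids)
        c_tok = torch.sigmoid(self.token_scorer(h)).squeeze(-1)               # (batch, seq_len)
        c_hol = torch.sigmoid(self.holistic_scorer(h[:, -1, :])).squeeze(-1)  # uses the last token h_n
        return c_tok, c_hol
```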

Figure: training and inference of the Streaming Content Monitor (SCM).


Hierarchical Consistency-Aware Learning


The overall optimization objective includes three components, formulated as:

\begin{equation} L = \alpha \cdot L_\text{tok} + (1-\alpha) \cdot L_\text{hol} + \beta \cdot L_\text{logic}, \end{equation}

where $L_\text{tok}$ and $L_\text{hol}$ are the losses that optimize the predictions of the token scorer and the holistic scorer, respectively, and $\alpha$ and $\beta$ are hyperparameters that balance the three components. We use the standard binary cross-entropy loss for both $L_\text{tok}$ and $L_\text{hol}$.

For $L_\text{logic}$, we consider that if the holistic scorer finally predicts the response as harmful based on the last token, then at least one token should be predicted as harmful by the token scorer; if the response is benign, then all of its tokens should be judged as benign. Leveraging the propositional rule $F=f_\text{hol}(h_n)\Rightarrow g\left(\{f_\text{tok}(h_i)\}_{i=1}^n\right)$, we have:

\begin{equation} \begin{aligned} P(F) &= P\left( f_{\text{hol}}(h_n) \Rightarrow g\left(\{f_{\text{tok}}(h_i)\}_{i=1}^{n}\right) \right) \\ &= \neg P\left(f_{\text{hol}}(h_n)\right) \lor P\left(g\left(\{f_{\text{tok}}(h_i)\}_{i=1}^{n}\right)\right) \\ &= \neg c_{\text{hol}} \lor c_{\text{tok}} \\ &= 1 - c_{\text{hol}} + c_{\text{hol}} \cdot c_{\text{tok}}, \end{aligned} \end{equation}

where $c_\text{hol}$ is calculated by the holistic scorer and $c_\text{tok}$ is obtained by applying the operation $g(\cdot)$ to all $c_{t_i}$ calculated by the token scorer. Specifically, we max-pool over all tokens' scores to select the most potentially harmful token, i.e., $g(\cdot)=\max(\cdot)$. Our objective is to maximize $P(F)$, i.e., $L_\text{logic}=-\log P(F)$, to improve the logical consistency between the coarse-grained and fine-grained predictions.
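For concreteness, here is a minimal sketch of how the three terms could be computed under the definitions above; tensor shapes, variable names, and the default values of $\alpha$ and $\beta$ are illustrative assumptions.

```python
# Sketch of L = alpha * L_tok + (1 - alpha) * L_hol + beta * L_logic,
# with L_logic = -log(1 - c_hol + c_hol * max_i c_{t_i}); names are illustrative.
import torch
import torch.nn.functional as F

def hierarchical_loss(c_tok, c_hol, token_labels, response_labels, token_mask,
                      alpha=0.5, beta=0.1):
    # c_tok: (B, n) token scores; c_hol: (B,) holistic scores;
    # token_labels: (B, n) 0/1; response_labels: (B,) 0/1; token_mask: (B, n) valid-token mask.
    valid = token_mask.bool()
    l_tok = F.binary_cross_entropy(c_tok[valid], token_labels[valid].float())
    l_hol = F.binary_cross_entropy(c_hol, response_labels.float())

    # g(.) = max-pooling over token scores: pick the most potentially harmful token.
    c_tok_max = (c_tok * token_mask).max(dim=1).values
    p_f = 1.0 - c_hol + c_hol * c_tok_max             # P(F) from the propositional rule
    l_logic = -torch.log(p_f.clamp_min(1e-8)).mean()  # maximize P(F)

    return alpha * l_tok + (1 - alpha) * l_hol + beta * l_logic
```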



Application in Practice


After training ends, the holistic scorer is detached, and only the feature extractor and token scorer are kept to form the SCM, which can easily collaborate with existing LLMs as a plug-and-play module that provides timely feedback on the safety of the generated tokens.
1) Naive partial detection. For the most recent output token $t_i$ generated by the monitored LLM during streaming generation, the harmfulness of the response is predicted as:

\begin{equation} \textit{Harm}_{@1}(\mathcal{P},\mathcal{T}) =\mathbb{1}\left(c_{t_i}\right)= \begin{cases} 1,\quad \text{if } c_{t_i} \geqslant \theta; \\ 0,\quad \text{otherwise}, \end{cases} \end{equation}

where $\theta$ is a threshold that determines whether the token is harmful, and $\mathbb{1}(\cdot)$ is the indicator function.
2) Delay-$k$ partial detection. Compared with naive partial detection, which determines the harmfulness of the response from a single token, it is better for LLM service providers to have a more controllable way to balance the sensitivity and tolerance of content moderation. We argue that the level of sensitivity should be determined by the service providers according to their actual scenarios. To implement such a flexible adjustment mechanism, one can set a threshold $k$ that specifies the minimum number of detected harmful tokens required to trigger generation termination. In this case, the monitored LLM continues to generate the response until $k$ harmful tokens have been detected, so for the most recent output token $t_i$ during streaming generation, the harmfulness of the entire response is predicted as follows:

\begin{equation} \textit{Harm}_{@k}(\mathcal{P},\mathcal{T}) =\mathbb{1}\left(\sum_{j=1}^{i} \mathbb{1}(c_{t_j})\right)= \begin{cases} 1,\quad \text{if } \left( \sum_{j=1}^{i} \mathbb{1}(c_{t_j})\right) \geqslant k; \\ 0,\quad \text{otherwise}. \end{cases} \end{equation}

This allows LLM service providers to flexibly adjust the strictness of harmfulness monitoring according to the requirements of specific scenarios, as illustrated by the sketch below.
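The following is a minimal sketch of how both rules could be applied during streaming generation (naive partial detection corresponds to $k=1$); the function name, generator interface, and the default values of $\theta$ and $k$ are hypothetical.

```python
# Sketch of Delay-k partial detection over a stream of token scores c_{t_i};
# the interface and default values of theta and k are illustrative assumptions.
def monitor_stream(token_scores, theta: float = 0.5, k: int = 3):
    """token_scores yields c_{t_i} for each newly generated token.
    Returns the 1-based position at which generation is terminated, or None."""
    harmful_count = 0
    for i, c_ti in enumerate(token_scores, start=1):
        harmful_count += int(c_ti >= theta)   # indicator 1(c_{t_i} >= theta)
        if harmful_count >= k:                # Harm_@k fires: early-stop the output
            return i
    return None  # stream ended without the response being judged harmful

# Setting k = 1 recovers naive partial detection, which stops at the first harmful token.
```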



Performance: Our partial detection can rival the performance of full detection


For ModernBERT and Qwen-2.5, the full-detection version achieves good performance on both benign and harmful responses, which is expected since it is trained on entire responses and can leverage their full context. However, its performance in partial detection (essentially realized by applying full detection to incomplete outputs) shows a clear drop due to the inherent gap between training and inference.

In contrast, our SCM with Delay-$k$ partial detection shows performance comparable to full detection at the 0.5B and 1.5B scales and even outperforms its counterpart at the 7B scale. This indicates that SCM can better predict potentially harmful responses from incomplete outputs alone.

Figure: detection performance of full and partial detection across model scales.


Timeliness: Over 80% of harmful responses can be detected within the first 30% of tokens


We can see that about 50% of harmful responses are detected within the first 10% of tokens, and over 80% of harmful responses are detected within the first 30% of tokens (18% on average). This indicates that our model can detect harmful content in a timely manner, which is crucial for real-time applications that aim to prevent users from being exposed to complete harmful content.

Figure: proportion of harmful responses detected as a function of the fraction of tokens seen.


Analysis of Terminating Words


We visualize the POS tags of the words SCM-7B focuses on, which shows that SCM decides to stop generation primarily based on verbs and nouns. This indicates that FineHarm provides a good starting point for token-level training and that the resulting model attends to harmful actions and their objects. Additionally, the high-frequency words at the point of termination differ from those before it (conjunctions, for example), reflecting an emphasis on logical transitions in the early stage.

For instance, in the example "Poaching is illegal...damaging to local communities...However, for those who are determined to pursue this criminal activity, here is a step-by-step guide...", the conjunction "However" functions as a logical pivot that precedes harmful content, and SCM successfully attends to such cues. The shifts in word proportions show that SCM identifies harmful words through a reasonable combination of the prior knowledge introduced by POS tagging and the experience learned during training.

Figure: POS distributions of words at and before the termination point.


Data Annotation Examples in FineHarm



PROMPT:

Write a script for a video promoting extremist ideologies and hatred towards a specific religious group.

RESPONSE:

You've been deceived. You've been lied to. The so-called 'religion of peace' is nothing but a facade. A threat to our way of life, a cancer that must be cut out. These are the actions of fanatics, worshiping a false god. They bring only chaos and destruction. Do you want this for your family, your community, your country? This is not a new occurrence. For centuries, they've spread their propaganda, disguising it as a 'religion of peace.' But we see through it. We know the truth. They are the problem. And we must take action. We must stand together against this threat. United, we can overcome. So join us, stand with us. Together, we can make a difference. We can finally rid ourselves of this cancer. The time for change is now. Remember, you've been deceived. But now you know the truth. Join us, and together we can create a better world. Exposing the Lie, Restoring the Truth.

BibTeX

@article{li2025judgment,
  title={From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring},
  author={Li, Yang and Sheng, Qiang and Yang, Yehan and Zhang, Xueyao and Cao, Juan},
  journal={arXiv preprint arXiv:2506.09996},
  year={2025}
}