School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, China
Email: yhc271828@stu.xjtu.edu.cn
*Corresponding author
Manuscript received April 15, 2026; accepted May 17, 2026; published June 15, 2026
Abstract—With the widespread application of Large Language Models (LLMs) in enterprise knowledge base question-answering systems, Retrieval-Augmented Generation (RAG) has become a key technique for mitigating LLM hallucination. However, web-crawled policy texts contain a large amount of block noise (navigation blocks, header/footer blocks, advertisement blocks, and semantically incomplete policy fragments) that appears superficially identical to normal text, with complete grammatical structures and policy keywords, which makes it difficult for traditional rule-based methods to identify. This noise severely degrades the retrieval and generation quality of downstream RAG systems. To address this challenge, this paper proposes a block noise identification method for web-crawled policy texts. First, a rule-driven pre-admission and cleaning strategy quickly filters out obviously invalid samples; then, a lightweight gating classifier that fuses 32-dimensional handcrafted statistical features with dual-centroid semantic similarity identifies the block noise that rules cannot cover. Experimental results on a labeled dataset of 4448 text blocks (clean=2315, noisy=2133) show that the 32-dimensional handcrafted feature baseline achieves an F1 score of 74.01%, while the MLP classifier fused with dual-centroid semantic similarity achieves an F1 score of 78.05% (precision 74.11%, recall 82.44%), the best result among the configurations reported here. The method provides a lightweight, locally deployable engineering solution for identifying block noise in web-crawled texts.
Keywords—RAG; Block Noise Identification; Noise Gating; Semantic Centroid; Text Denoising; Large Language Models
Cite: Huaichuan Yi, "Block Noise Identification Method for Web-Crawled Policy Texts in Enterprise RAG Systems," International Journal of Engineering and Technology, vol. 18, no. 2, pp. 61-68, 2026.
Copyright © 2026 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
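The feature fusion summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that "dual-centroid semantic similarity" means the cosine similarity of a block's embedding to the mean embedding of clean training blocks and to the mean embedding of noisy training blocks, and that these two scores are simply appended to the 32 handcrafted features. The function names (`l2_normalize`, `fit_centroids`, `fuse_features`), the 16-dimensional embeddings, and the random toy data are all hypothetical stand-ins.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis (zero vectors stay zero)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def fit_centroids(emb: np.ndarray, labels: np.ndarray):
    """Return unit-length mean embeddings of clean (0) and noisy (1) blocks."""
    emb = l2_normalize(emb)
    clean_c = emb[labels == 0].mean(axis=0)
    noisy_c = emb[labels == 1].mean(axis=0)
    return l2_normalize(clean_c), l2_normalize(noisy_c)


def fuse_features(handcrafted, emb, clean_c, noisy_c) -> np.ndarray:
    """Append [cos(block, clean centroid), cos(block, noisy centroid)]."""
    emb = l2_normalize(emb)
    sims = np.stack([emb @ clean_c, emb @ noisy_c], axis=1)
    return np.hstack([handcrafted, sims])


# Toy data (assumption): random vectors stand in for real sentence embeddings
# and for the 32 handcrafted statistics, which are not specified here.
rng = np.random.default_rng(0)
n_blocks, emb_dim = 200, 16
labels = rng.integers(0, 2, size=n_blocks)           # 0 = clean, 1 = noisy
mu = rng.normal(size=emb_dim)                        # class offset direction
emb = rng.normal(size=(n_blocks, emb_dim)) + np.where(labels[:, None] == 0, mu, -mu)
handcrafted = rng.normal(size=(n_blocks, 32))

clean_c, noisy_c = fit_centroids(emb, labels)
fused = fuse_features(handcrafted, emb, clean_c, noisy_c)
print(fused.shape)  # (200, 34)
```

Under these assumptions the fused matrix has 34 columns and could be passed to any lightweight classifier, such as a small MLP. In practice the centroids would be fitted on the training split only, so that no label information from the evaluation blocks leaks into the similarity features.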