Partial spoofing subtly manipulates segments of interest in an audio recording to alter its original meaning, and such fine-grained forgery poses great challenges to existing countermeasures designed for fully spoofed audio. Existing partially spoofed audio detection methods are highly effective at distinguishing clean segments from long-duration spoofed segments. However, their robustness remains limited when malicious attackers manipulate finer-grained segments (e.g., only a single phoneme) and apply post-processing operations to reduce detectable discontinuities. To address these challenges, we propose the Semantic-Aware Inconsistency Learning (SAIL) method for robust frame-level detection. It incorporates a Robust Augmentation Module (RAM), a Multi-Scale Semantic Inconsistency Learning (MSIL) module, and a Semantic Separation Module (SSM) to learn robust discriminative features by capturing multi-segment discontinuities and semantic inconsistencies introduced by partial spoofing manipulations. Specifically, the RAM suppresses the model's erroneous attention to the additional interference that post-processing operations introduce around subtle spoofed artifacts. The MSIL module then extracts the semantic inconsistency features left by manipulations, using attention mechanisms at different scales to highlight forgery cues at various granularities. Finally, the SSM refines these features for robust frame-level detection, employing contrastive learning to ensure that inconsistent semantic features are clearly separated in the feature space. Extensive experiments on three public datasets, ASVS2019PS, HAD, and LAV-DF, show that our method achieves the best performance in various noisy scenarios.
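
To make the multi-scale idea behind MSIL concrete, the following is a minimal PyTorch sketch of attention applied over frame-level features at several temporal scales. All names, shapes, and the particular scales are our own assumptions for illustration; the paper's exact architecture may differ.

```python
import torch
import torch.nn as nn


class MultiScaleInconsistency(nn.Module):
    """Sketch: highlight frame-level inconsistencies at several temporal scales."""

    def __init__(self, dim: int = 256, scales=(1, 2, 4), heads: int = 4):
        super().__init__()
        self.scales = scales
        # One self-attention block per temporal scale (assumed design choice).
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in scales
        )
        self.fuse = nn.Linear(dim * len(scales), dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) frame-level features from some front end.
        outs = []
        for s, attn in zip(self.scales, self.attn):
            # View the sequence at a coarser temporal scale by pooling.
            xs = nn.functional.avg_pool1d(
                x.transpose(1, 2), kernel_size=s, stride=s
            ).transpose(1, 2) if s > 1 else x
            y, _ = attn(xs, xs, xs)  # scale-specific self-attention
            # Upsample back to the original frame rate before fusion.
            if s > 1:
                y = nn.functional.interpolate(
                    y.transpose(1, 2), size=x.size(1), mode="nearest"
                ).transpose(1, 2)
            outs.append(y)
        # Fuse per-scale outputs into one inconsistency feature per frame.
        return self.fuse(torch.cat(outs, dim=-1))
```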
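
Similarly, the contrastive separation performed by the SSM could be realized with a frame-level supervised contrastive loss along the following lines; the temperature and masking details below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def frame_contrastive_loss(feats: torch.Tensor,
                           labels: torch.Tensor,
                           tau: float = 0.1) -> torch.Tensor:
    """feats: (N, D) frame embeddings; labels: (N,) 0 = bona fide, 1 = spoofed."""
    z = F.normalize(feats, dim=-1)
    sim = z @ z.t() / tau                              # pairwise similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-class mask
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = same & ~eye                                  # positives: same class, not self
    # Log-softmax over all other frames; self-similarity excluded from the denominator.
    logp = sim - torch.logsumexp(
        sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True
    )
    # Average -log p over each frame's positive pairs (skip frames with none).
    pos_counts = pos.sum(1).clamp(min=1)
    loss = -(logp * pos).sum(1) / pos_counts
    return loss[pos.sum(1) > 0].mean()
```

Minimizing this loss pulls bona fide frames together and pushes spoofed frames away from them, which is one plausible way to obtain the clear feature-space separation the abstract describes.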