From raw data to model update: an automatic categorization pipeline

Automatic categorization of bank transactions is often perceived as a classic supervised classification problem. In practice, it sits at the intersection of several structural challenges: transaction labels are noisy and poorly standardized, semantic ambiguity is high, the target taxonomy spans nearly 90 categories, and the class distribution is heavily imbalanced (e.g. grocery spending is frequent, while account seizures are rare).

The same label can cover different realities depending on context: "Amazon," for instance, can refer to an everyday purchase, a subscription, or platform-related fees (Marketplace charges or e-commerce activity costs). Furthermore, a single merchant can appear under many different variations in bank labels. These variations stem from payment systems, aggregators, or bank-specific formats, and result in sometimes significant differences in the character string: abbreviations, technical identifiers, location tags, or transactional suffixes. A transaction linked to Amazon, for example, may appear as "AMZN Mktp FR," "Amazon EU Sarl," "AMZN Digital," or "Amazon Prime."

In this context, model performance depends less on architecture than on the quality, diversity, and continuous updating of training data.

1. Data sampling strategy and human annotation

The pipeline begins with periodic sampling from an anonymized data warehouse. We combine two complementary strategies: random sampling, which provides an unbiased baseline reflecting the overall data distribution, and oversampling of rare classes, which improves coverage of minority categories. The resulting dataset stays anchored in the real distribution while still covering rare cases, balancing exploration of rare cases with exploitation of frequent transactions.
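As an illustration, the combination of the two strategies could look like the sketch below. It is not the production implementation: the field names `id` and `predicted_category`, the function name, and the idea of spotting rare classes through a provisional prediction (for example from the model currently in production) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_annotation_sample(transactions, n_random, n_per_rare_class,
                            rare_classes, seed=0):
    """Combine a uniform random sample with extra draws from rare classes.

    `transactions` is a list of dicts with the hypothetical keys "id" and
    "predicted_category" (a provisional category, assumed to be available).
    """
    rng = random.Random(seed)

    # 1) Uniform random baseline: reflects the overall distribution.
    baseline = rng.sample(transactions, min(n_random, len(transactions)))
    selected = {t["id"]: t for t in baseline}

    # 2) Oversampling: group the remaining rare-class transactions by class.
    pools = defaultdict(list)
    for t in transactions:
        if t["predicted_category"] in rare_classes and t["id"] not in selected:
            pools[t["predicted_category"]].append(t)

    # Draw up to n_per_rare_class additional transactions per rare class.
    for pool in pools.values():
        for t in rng.sample(pool, min(n_per_rare_class, len(pool))):
            selected[t["id"]] = t

    return list(selected.values())
```

Because the oversampled part deliberately distorts the class distribution, it can be useful to keep track of which transactions came from which strategy, so that the random baseline remains usable on its own for unbiased estimates.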

Selected transactions are then pre-annotated by a large language model (LLM), driven by a structured prompt that combines explicit instructions with few-shot examples. The prompt incorporates representative examples, enforces a structured output schema, and encodes business rules that guide the model toward predictions consistent with the target taxonomy. This step produces candidate labels in a controlled way, and these serve as a basis for validation and human annotation. Transactions for which the LLM has very high confidence are processed automatically, while ambiguous ones are submitted to expert annotators. Human corrections are fed back into the pipeline and inform the next model iteration, so accuracy improves progressively over time. This setup implements a pragmatic form of active learning: the system automatically selects the most informative examples for human annotation, which maximizes the impact of each correction and keeps the enrichment of training datasets efficient.
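The prompting and routing logic can be sketched as follows. This is a minimal example under stated assumptions, not the actual pipeline: the prompt wording, the JSON output format with a self-reported `confidence` field, the `0.95` threshold, and the function names are all hypothetical, and the call to the language model itself is left out.

```python
import json


def build_prompt(label, categories, examples, rules):
    """Assemble instructions, business rules, few-shot examples and output format."""
    lines = [
        "Classify the bank transaction label into exactly one category.",
        "Allowed categories: " + ", ".join(sorted(categories)),
        "Rules:",
    ]
    lines += [f"- {rule}" for rule in rules]
    lines.append("Examples:")
    for ex_label, ex_category in examples:
        lines.append(f"Label: {ex_label} -> " + json.dumps({"category": ex_category}))
    lines.append('Answer with JSON: {"category": <string>, '
                 '"confidence": <number between 0 and 1>}')
    lines.append(f"Label: {label}")
    return "\n".join(lines)


def route(raw_response, categories, threshold=0.95):
    """Return ("auto", category) or ("human_review", candidate_or_None).

    Anything that does not parse, falls outside the taxonomy, or is below
    the (hypothetical) confidence threshold goes to human annotators.
    """
    try:
        data = json.loads(raw_response)
        category = data["category"]
        confidence = float(data["confidence"])
    except (ValueError, KeyError, TypeError):
        return ("human_review", None)

    in_taxonomy = isinstance(category, str) and category in categories
    if in_taxonomy and confidence >= threshold:
        return ("auto", category)
    return ("human_review", category if in_taxonomy else None)
```

Sending malformed or out-of-taxonomy answers to human review rather than discarding them is a design choice in this sketch: such cases are often the ambiguous ones that benefit most from expert annotation.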

2. Model evaluation and deployment

Each new model version is evaluated on a stratified test set, never used during training, and compared against the model currently in production across several metrics: overall accuracy, weighted F1, and per-class performance. No regression on decision-critical categories is accepted, and in cases of parity or uncertainty, the existing model is retained. When all criteria are met, the new version can be deployed in shadow mode, where its predictions are evaluated in parallel with the production system without affecting live operations. This approach allows potential anomalies to be detected, the behavior of both versions to be compared, and model updates to be introduced safely.
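The promotion rule and the shadow comparison can be expressed compactly. The sketch below is one possible reading of the criteria above, with assumptions made explicit: the metric dictionaries, the use of per-class F1 as the per-class measure, the `min_gain` margin, and both function names are hypothetical.

```python
def should_promote(candidate, production, critical_classes, min_gain=0.0):
    """Decide whether a candidate model may move to shadow deployment.

    `candidate` and `production` are dicts with the hypothetical keys
    "accuracy", "weighted_f1" and "per_class_f1" (a dict keyed by class).
    """
    # No regression is accepted on decision-critical categories.
    for cls in critical_classes:
        if candidate["per_class_f1"][cls] < production["per_class_f1"][cls]:
            return False

    # Global metrics must not regress...
    no_regression = (candidate["accuracy"] >= production["accuracy"]
                     and candidate["weighted_f1"] >= production["weighted_f1"])
    # ...and at least one must improve by more than the margin; on parity,
    # the existing model is retained.
    strict_gain = (candidate["accuracy"] - production["accuracy"] > min_gain
                   or candidate["weighted_f1"] - production["weighted_f1"] > min_gain)
    return no_regression and strict_gain


def shadow_agreement(production_preds, candidate_preds):
    """Return the agreement rate and the indices where the two models differ."""
    if len(production_preds) != len(candidate_preds):
        raise ValueError("both models must score the same transactions")
    disagreements = [i for i, (p, c) in enumerate(zip(production_preds, candidate_preds))
                     if p != c]
    total = len(production_preds)
    rate = 1.0 if total == 0 else 1 - len(disagreements) / total
    return rate, disagreements
```

In shadow mode, the disagreement indices are usually more informative than the rate itself, since they point to the specific transactions worth inspecting before a switch.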

3. Traceability, governance, and regulatory compliance

To keep this process reliable and aligned with the principles of the AI Act, all data and transformations are subject to full versioning and traceability. Each model can be linked back to the annotated transactions that fed it, enabling precise lineage tracking. Experiment tracking logs parameters, metrics, and associated artifacts, while a data catalog enriched with business metadata facilitates navigation and traceability. Automated checks also detect anomalies upstream, such as corrupted data or schema inconsistencies, reducing the risk of errors in production.
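To make this more concrete, here is a minimal sketch of an upstream schema check and of a dataset fingerprint that could be stored alongside a model version for lineage purposes. The schema in `REQUIRED_FIELDS`, the function names, and the choice of a SHA-256 hash are assumptions made for the example; the tooling actually used is not described here.

```python
import hashlib
import json

# Hypothetical schema for an annotated transaction.
REQUIRED_FIELDS = {
    "id": str,
    "label": str,
    "amount": (int, float),
    "category": str,
}


def validate_rows(rows):
    """Return a list of (row_index, problem) pairs for rows violating the schema."""
    problems = []
    for i, row in enumerate(rows):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in row:
                problems.append((i, f"missing field: {field}"))
            elif not isinstance(row[field], expected_type):
                problems.append((i, f"wrong type for {field}"))
        # A blank label is treated as corrupted data.
        if isinstance(row.get("label"), str) and not row["label"].strip():
            problems.append((i, "empty label"))
    return problems


def dataset_fingerprint(rows):
    """Deterministic SHA-256 of a dataset, independent of row and key order."""
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
```

Logging the fingerprint of the training set with each experiment run gives a simple way to link a model back to the exact annotated transactions it was trained on.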

In conclusion

Ultimately, transaction categorization is an iterative process of continuous improvement. Model performance and design choices are regularly reassessed based on usage feedback, observed errors, and data evolution. The system is thus progressively refined to better meet business requirements, and in particular to handle new cases as they emerge.

Wissal El Achouri, Data Scientist at Algoan.