
What LLM Monitoring & Evaluation Can Learn From Systems Engineering


Working with LLMs is a systems engineering problem even more than it is an ML problem—a fact that most of the industry isn’t paying enough attention to right now.

A core feature of the Seek App is answering natural language questions using LLMs. At Seek, we’ve made a point to design our engine end-to-end using systems engineering principles like system decomposition, subsystem performance evaluation, and fault attribution. In this article, we’ll share how we implement these principles by giving an in-depth look at our infrastructure and model evaluation/monitoring setup.

But first, let’s look at why a systems engineering approach is necessary when building with LLMs.

Text Generation is a Systems Problem, Not Just an ML Problem

Most people using LLMs today are trying to solve the problem of autoregressive text generation, where decoder LLMs are leveraged to generate a complicated, highly structured sequence of text. This is a very different problem from simple ML model prediction.

A traditional machine learning model resembles a function: its inputs are feature vectors and its outputs are logits. Decoder LLMs/transformers technically work the same way—the input is a series of tokens, and the output is a vector of logits representing the probability distribution of the next token in the sequence. But that's not the full picture of how autoregressive text generation works.

Although autoregressive text generation utilizes predictions from a language model, it is not an inherent capability of the language model itself. Instead, a generated sequence of tokens is actually a series of many decoder LLM predictions, with each prediction conditioned on its direct predecessors.
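To make the distinction concrete, here is a minimal sketch of that loop. The model is reduced to a single function, `next_token_logits`, which is a hypothetical stand-in for one decoder forward pass; everything else—the loop, the stopping condition, the growing prefix—lives outside the model:

```python
import random

VOCAB_SIZE = 32  # toy vocabulary for illustration


def next_token_logits(tokens):
    # Stand-in for a decoder LLM forward pass: one logit per vocabulary
    # token, conditioned on the full prefix. (Hypothetical; a real
    # system would call the model here.)
    rng = random.Random(sum(tokens))
    return [rng.gauss(0, 1) for _ in range(VOCAB_SIZE)]


def generate(prompt_tokens, max_new_tokens, eos_id=0):
    # Autoregressive generation: each prediction is conditioned on all
    # tokens produced so far, not just the original prompt. The model
    # only ever predicts one next token; the *system* builds the text.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy
        tokens.append(next_id)
        if next_id == eos_id:  # system-level stopping rule
            break
    return tokens
```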

Autoregressive text generation is therefore more appropriately viewed as a system that depends on many subsystems, including:

  • Generation parameters and methods like temperature sampling, top-p sampling, and beam-search decoding that greatly affect the randomness, quality, and diversity of generated text
  • Constraints on the LLM decoders that affect what they can generate at certain points of the text
  • External system-level expectations around the format/structure of the generated text
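As an illustration of the first bullet, here is a self-contained sketch of temperature scaling combined with top-p (nucleus) sampling. The function and its defaults are illustrative, not any specific library's API:

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=random):
    # Temperature sharpens (<1) or flattens (>1) the distribution
    # before the softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus) sampling: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then sample within it.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

Small changes to `temperature` or `top_p` visibly change the randomness and diversity of generated text, which is why these parameters deserve evaluation as a subsystem in their own right.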

When working with LLMs in a production environment, evaluating subsystems like these is just as important—if not more important—than evaluating the models themselves. This is the best way to minimize risk and maximize quality for the entire text generation system.

A Systems Engineering Approach

Systems engineering has been woefully underutilized in the ML community. We believe it’s an essential field to draw from when building LLM-powered applications.  

It starts by thinking of the entire LLM engine as a multi-tiered system. At the top level, or highest tier, there’s the output of the system, AKA the “downstream task.” At Seek, this is the task of executing a database query. Under a systems engineering approach, we should monitor the output for system health indicators. At Seek, this includes the validity of the LLM output. For example, does it adhere to any syntax rules that we specify?
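A system-health check like this can be as simple as asking a database engine to parse the generated query without executing it. The sketch below uses an in-memory SQLite database as a stand-in for the target engine, with a hypothetical table schema:

```python
import sqlite3


def is_valid_sql(query: str) -> bool:
    # Top-tier health indicator: does the generated text parse as SQL?
    # EXPLAIN asks SQLite to plan the query without running it, so a
    # syntax error surfaces as an exception. (SQLite and the `orders`
    # schema are stand-ins for the real target database.)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    try:
        conn.execute(f"EXPLAIN {query}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```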

This approach also suggests that we assess and debug all lower-tier subcomponents independently to find performance issues or faults, e.g. by performing fault tree analysis. It’s essential to evaluate subsystems on an ongoing basis and rectify issues at a hyper-local level, because even a seemingly insignificant component could potentially jeopardize the downstream task.

For example, consider a question-answering system powered by a decoder LLM with a RAG (retrieval augmented generation) setup. If the embeddings generated by the RAG system's encoder model are of poor quality, the contextual documents passed to the decoder LLM may be irrelevant or even contradictory, resulting in answers that are wrong or unrelated to the question.
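One way to catch that failure at the subsystem level is a retrieval-quality check that scores retrieved documents against the query embedding before they ever reach the decoder. A minimal sketch, with an illustrative similarity threshold:

```python
import math


def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieval_health(query_vec, doc_vecs, min_similarity=0.3):
    # Subsystem-level metric: fraction of retrieved documents whose
    # similarity to the query clears a minimum bar. A low score flags
    # the encoder/retriever directly, instead of surfacing much later
    # as a wrong answer downstream. (Threshold is illustrative.)
    sims = [cosine(query_vec, d) for d in doc_vecs]
    ok = sum(s >= min_similarity for s in sims)
    return ok / len(sims) if sims else 0.0
```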

Case study: Prompt engineering

The problem of prompt engineering serves as a tangible example of why such a systems approach is valuable. Since LLMs can only hold a limited context window in memory, prompts often need to be broken down and then chained together as a series of steps.

As Matt Rickard points out in his article “Prompt Engineering Shouldn't Exist,” prompt engineering should therefore be approached as a systems engineering problem. The system is the series of steps that the LLM needs to execute to perform a complex task. Each step can be thought of as a subsystem, with its own inputs, outputs, and performance metrics.

The overall performance of the system is a function of the performance of these subsystems. It can be improved using systems engineering techniques that impose structure and observability on the subsystems—possibly including a purpose-built DSL for prompts, schema around LLM I/O, and the use of multiple runtimes.
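One minimal way to impose that structure is to model each step as a named subsystem with its own I/O contract and metrics. The `Step` class and field names below are illustrative, not a specific framework:

```python
import time


class Step:
    # One prompt-pipeline subsystem: a named transformation with its own
    # input/output contract and an optional output validator. `run` is a
    # plain function here; in practice it might format a prompt and call
    # an LLM. (Illustrative, not a real library.)
    def __init__(self, name, run, validate=lambda out: True):
        self.name, self.run, self.validate = name, run, validate


def execute_pipeline(steps, payload):
    # Execute the steps in order, recording per-subsystem metrics so
    # that a fault can be attributed to the exact step that caused it.
    metrics = []
    for step in steps:
        start = time.perf_counter()
        payload = step.run(payload)
        metrics.append({
            "step": step.name,
            "latency_s": time.perf_counter() - start,
            "output_valid": step.validate(payload),
        })
    return payload, metrics
```

With this shape, "which subsystem failed?" becomes a query over the metrics list rather than a guess about one opaque prompt.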

The systems engineering approach to evaluating LLMs is not limited to prompt engineering. It can similarly be applied to any aspect of an LLM system, from the underlying transformer model to the output parsing and downstream task execution. By treating these components as subsystems and evaluating their performance independently, we can gain a deeper understanding of the system's overall performance and make more informed decisions about where to focus our improvement efforts.

How it Works at Seek

To facilitate a systems engineering approach to LLM evaluation, we've constructed a robust infrastructure that caters to every stage of the process. Here are the key components:

  1. Modular NLP Library: Inspired by SpaCy and PyTorch, we developed an in-house library that enables rapid experimentation and easy component replacement.
  2. Production Data Collection and Annotation Validation: This pipeline, encompassing model monitoring services like Gantry and annotation/data validation systems like Argilla, ensures a comprehensive data collection and validation process.
  3. Automated Evaluations: Every time changes are merged into our staging branch, automated evaluations are triggered using a diverse set of datasets.
  4. Logging and Testing Framework: This framework captures performance data and allows for efficient identification and resolution of issues.
  5. Configuration Management and CI/CD Pipeline: A configuration management system paired with a CI/CD pipeline ensures reproducibility and seamless integration of system improvements.
  6. Data-Driven Improvement and Version Control: A data-driven approach aids in identifying system weaknesses, while a version control system tracks codebase changes and manages contributions.
  7. Monitoring, Alerting, and Deployment Management: With monitoring and alerting capabilities, we can quickly respond to production environment issues. Containerization and orchestration tools manage deployment and scaling.
  8. Automatic Model Card Generation: For each model version, a model card is automatically generated, providing detailed performance records and promoting transparency in our ML practices.
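As a rough illustration of item 8, automatic model-card generation can be as simple as rendering an evaluation summary to markdown. The field names below are illustrative, not Seek's actual schema:

```python
def render_model_card(name, version, metrics, datasets):
    # Turn an evaluation summary into a markdown model card so that
    # every model version ships with a performance record. (Fields and
    # layout are hypothetical.)
    lines = [
        f"# Model Card: {name} v{version}",
        "",
        "## Evaluation datasets",
        *[f"- {d}" for d in datasets],
        "",
        "## Metrics",
        *[f"- {k}: {v:.3f}" for k, v in sorted(metrics.items())],
    ]
    return "\n".join(lines)
```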

Our infrastructure not only ensures continuous monitoring and improvement of our systems but also provides a clear framework to tackle the complex challenges of LLM evaluation. By integrating these components, we've forged a pathway towards an efficient, data-driven, and transparent approach to evaluating our NLP engines.

Architecture Details

Automating the evaluation process requires workflows that are efficient, reliable, and repeatable. We currently use GitHub Actions to streamline the deployment process of the Seek App, so this became the logical starting point for our model monitoring and evaluation setup.

When a deploy occurs, it also triggers the following evaluation steps:

  1. Pull Dataset: Pull a pre-determined multi-fold dataset from our Postgres database to be used for evaluations with cross-validation.
  2. Initialize Pipeline: Instantiate the latest version of our text generation pipeline as it is defined in the concurrent deploy.
  3. Generate Responses: For each fold of the dataset, the test questions will propagate through the generation pipeline.
  4. Compute Generation Metrics: Upon completing a generation, the results are sent to our metrics suite, which calculates a range of metrics for each generated query.
  5. Aggregate Results: Collect the generations and metrics results for each fold and combine them into a single evaluation run.
  6. Push Evaluation Run: The evaluation run is pushed back to Postgres where a Fivetran connector will then mirror the results to our Snowflake database for analysis.
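The shape of that evaluation job can be sketched as follows. The dataset pull and the Postgres/Fivetran push are elided, and the function signatures are illustrative rather than our actual code:

```python
def run_evaluation(folds, pipeline, metric_fns):
    # For each fold of the dataset, push every test question through the
    # generation pipeline, score each generation with the metrics suite,
    # and aggregate everything into a single evaluation run. (Dataset
    # pull and the Postgres push are elided; names are illustrative.)
    run = {"folds": []}
    for fold_id, questions in enumerate(folds):
        results = []
        for q in questions:
            generation = pipeline(q)  # steps 2-3: pipeline propagation
            scores = {                # step 4: per-generation metrics
                name: fn(q, generation) for name, fn in metric_fns.items()
            }
            results.append({"question": q, "generation": generation, "scores": scores})
        run["folds"].append({"fold": fold_id, "results": results})
    return run                        # step 5: one aggregated run
```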

To analyze the results in Snowflake, we utilize the Seek App itself to generate insights and look at model performance. The repeatability and frequency of our evaluations allow for easy identification of performance trends such as model drift, and we track these trends in a dashboard. Any anomalies are quickly captured and documented, enabling our ML team to respond rapidly. Using this methodology, we're also able to track performance metrics across a wide variety of datasets, ensuring that the Seek App continues to meet the diverse needs of our customers.
