diff --git a/README.md b/README.md
index 86b617f49b006ced2338269531c075548ad5bd16..da8579fe31bc9d580f45dfad6290c2c9ef55f305 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # GitLab Handbook Cookbook Project
 
-This project provides a structured way to evaluate AI models using LangSmith. It includes sample datasets, evaluation scripts, and a CI/CD pipeline setup. See https://handbook.gitlab.com/handbook/engineering/development/data-science/ai-powered/ai-framework/evaluation/ for more information.
+This project provides a structured way to evaluate AI models using LangSmith. It includes sample datasets, evaluation scripts, and a CI/CD pipeline setup.
 
 ## Table of Contents
 
@@ -140,4 +140,9 @@ cookbook-project/
 ├── .gitlab-ci.yml
 ├── README.md
 ├── requirements.txt
-└── setup.sh
\ No newline at end of file
+└── setup.sh
+```
+
+## Docs
+
+For a comprehensive end-to-end guide, see the [documentation](./doc/index.md) page.
\ No newline at end of file
diff --git a/doc/cicd/index.md b/doc/cicd/index.md
new file mode 100644
index 0000000000000000000000000000000000000000..143048209ec5738f54e18a6b047236e046cf31a0
--- /dev/null
+++ b/doc/cicd/index.md
@@ -0,0 +1,30 @@
+## Integrate with GitLab CI/CD
+
+Running evaluations in a CI/CD pipeline has several advantages over running them only on your local machine: evaluation runs become automated, repeatable, and visible to the whole team.
+
+### Create a GitLab CI/CD Pipeline
+
+In your project repository, create or update your `.gitlab-ci.yml` file with the following content:
+
+```yaml
+stages:
+  - evaluate
+
+evaluate_langsmith:
+  stage: evaluate
+  script:
+    - pip install requests langsmith langchain
+    - python evaluate.py
+```
+
+### Commit and Push Changes
+
+```bash
+git add .gitlab-ci.yml evaluate.py
+git commit -m "Add LangSmith evaluation script and CI/CD pipeline"
+git push origin main
+```
+
+### Monitor the Pipeline
+
+Navigate to your GitLab project and monitor the CI/CD pipeline. Ensure the job `evaluate_langsmith` runs successfully.
diff --git a/doc/datasets/index.md b/doc/datasets/index.md
new file mode 100644
index 0000000000000000000000000000000000000000..39524a6da18ffa5aaea178419e09c8c808b86843
--- /dev/null
+++ b/doc/datasets/index.md
@@ -0,0 +1,80 @@
+## Creating and uploading a dataset
+
+Creating a dataset tailored to your evaluation needs is a critical step in ensuring accurate and meaningful assessments of your AI models. Here’s how to create and upload a dataset for use with LangSmith.
+
+#### Create Your Dataset
+
+- Define Your Data Requirements:
+  - Identify the types of inputs and expected outputs you need for evaluation. For a chat model, this might include various questions and their corresponding expected responses.
+- Prepare Your Data:
+  - Create a CSV or JSON file containing your data. Each entry should include the necessary fields, such as input questions and expected answers.
+
+#### Example CSV Structure
+
+```csv
+question,expected_answer
+"What's your name?","My name is GitLab Bot."
+"How can I reset my password?","You can reset your password by going to the login page and clicking on 'Forgot password?'."
+"What is the weather today?","I'm sorry, I can't provide weather updates."
+"Tell me a joke.","Why did the scarecrow win an award? Because he was outstanding in his field!"
+"Explain quantum physics.","Quantum physics is the branch of physics relating to the very small."
+```
+
+#### Example JSON Structure
+
+```json
+[
+  {
+    "question": "What's your name?",
+    "expected_answer": "My name is GitLab Bot."
+  },
+  {
+    "question": "How can I reset my password?",
+    "expected_answer": "You can reset your password by going to the login page and clicking on 'Forgot password?'."
+  },
+  {
+    "question": "What is the weather today?",
+    "expected_answer": "I'm sorry, I can't provide weather updates."
+  },
+  {
+    "question": "Tell me a joke.",
+    "expected_answer": "Why did the scarecrow win an award? Because he was outstanding in his field!"
+  },
+  {
+    "question": "Explain quantum physics.",
+    "expected_answer": "Quantum physics is the branch of physics relating to the very small."
+  }
+]
+```
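+
+If you prefer to script this step rather than use the web UI described in the next section, the LangSmith Python SDK can create the same dataset programmatically. This is a minimal sketch, not the required workflow; it assumes `LANGCHAIN_API_KEY` is set in your environment, and the dataset name and entries are taken from the examples above:
+
+```python
+from langsmith import Client
+
+client = Client()  # reads LANGCHAIN_API_KEY from the environment
+
+# Create an empty dataset; the name must not already exist in your workspace.
+dataset = client.create_dataset(
+    dataset_name="duo_chat_questions_0shot",
+    description="Sample Duo Chat questions with expected answers",
+)
+
+# Store the expected answer alongside the question, matching how the custom
+# evaluator later in this guide reads it from the example inputs.
+examples = [
+    ("What's your name?", "My name is GitLab Bot."),
+    ("Tell me a joke.", "Why did the scarecrow win an award? Because he was outstanding in his field!"),
+]
+client.create_examples(
+    inputs=[{"question": q, "expected_answer": a} for q, a in examples],
+    dataset_id=dataset.id,
+)
+```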
+
+#### Upload Your Dataset to LangSmith
+
+Once your dataset is prepared, follow these steps to upload it to LangSmith:
+
+- Log In to LangSmith:
+  - Visit the LangSmith website at `https://smith.langchain.com` and log in with your credentials.
+- Navigate to the Datasets Section:
+  - In the LangSmith dashboard, locate and click on the "Datasets and Experiments" section.
+- Upload the Dataset:
+  - Click on the “Upload Dataset” button.
+  - Choose your `CSV` or `JSON` file and upload it. Ensure you provide a meaningful name and description for your dataset.
+- Verify the Upload:
+  - After uploading, verify that the dataset appears in your list of datasets and that the entries are correctly formatted.
+
+Once your dataset is uploaded to LangSmith, you can reference it in your evaluation scripts.
+
+##### Suggested Evaluation Scripts
+
+After uploading, click on “New Experiment” in the top right corner of the LangSmith Web UI to access predefined experiment code snippets that you can copy and use. This section includes predefined evaluators, such as correctness and helpfulness. Additionally, you can define your own evaluators using [Custom Evaluators](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#use-custom-evaluators).
+
+![New Experiment button in the LangSmith UI](../img/new_experiment.png)
+
+#### How to structure a dataset
+
+Datasets are collections of Examples, the core building block for the evaluation workflow in LangSmith. Examples provide the inputs over which you will be running your pipeline and, if applicable, the expected outputs that you will be comparing against. All examples in a given dataset should follow the same schema. Examples contain an "inputs" dict and an "outputs" dict, along with an optional metadata dict.
+
+- [Datasets and examples](https://docs.smith.langchain.com/concepts/evaluation#datasets-and-examples)
+
+#### Current list of datasets
+
+You can find the current list of ongoing datasets [here](https://gitlab.com/groups/gitlab-org/modelops/ai-model-validation-and-research/-/epics/6#data-sets--use-cases). If the dataset you need is not already in the LangSmith Project, please upload it before using it.
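+
+To check from a script whether the dataset you need already exists before uploading a new one, you can list the datasets visible to your API key. A minimal sketch, assuming `LANGCHAIN_API_KEY` is set in your environment:
+
+```python
+from langsmith import Client
+
+client = Client()  # reads LANGCHAIN_API_KEY from the environment
+
+# Print the name and ID of every dataset this API key can see.
+for dataset in client.list_datasets():
+    print(dataset.name, dataset.id)
+```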
diff --git a/doc/evaluators/index.md b/doc/evaluators/index.md
new file mode 100644
index 0000000000000000000000000000000000000000..95ad49cbe9c33f52629f5a312e3df84e1ab6da87
--- /dev/null
+++ b/doc/evaluators/index.md
@@ -0,0 +1,125 @@
+## LangChain Evaluators
+
+To create a custom evaluator called `oshot_choice` in LangSmith, define it as a callable function or class that LangSmith can invoke on your results. The evaluator compares the model’s outputs to the expected outputs and determines whether they meet the desired criteria.
+
+### Define the Evaluator Function
+
+First, define the `oshot_choice` evaluator function. It takes the example inputs and the model’s outputs as parameters and returns an evaluation result.
+
+```python
+from langsmith.evaluation import EvaluationResult
+
+def oshot_choice(inputs, outputs):
+    """
+    Custom evaluator that compares the model's output to the expected output.
+
+    Parameters:
+        inputs (dict): A dictionary containing the example input data, including the expected answer.
+        outputs (dict): A dictionary containing the model's output data.
+
+    Returns:
+        EvaluationResult: The result of the evaluation.
+    """
+    # Assumes the dataset stores the expected answer as part of the example inputs.
+    expected_answer = inputs.get("expected_answer")
+    model_answer = outputs.get("answer")
+
+    # `key` names the metric shown in LangSmith; `score` records pass (1) or fail (0).
+    if expected_answer is None or model_answer is None:
+        return EvaluationResult(
+            key="oshot_choice",
+            score=0,
+            comment="Expected answer or model answer is missing."
+        )
+
+    # Custom evaluation logic (e.g., exact match, similarity check, etc.)
+    if expected_answer == model_answer:
+        return EvaluationResult(
+            key="oshot_choice",
+            score=1,
+            comment="The model's answer matches the expected answer."
+        )
+    else:
+        return EvaluationResult(
+            key="oshot_choice",
+            score=0,
+            comment="The model's answer does not match the expected answer."
+        )
+```
+
+### Integrate the Evaluator into Your Script
+
+Update your evaluation script to use the `oshot_choice` evaluator.
+
+```python
+import os
+import requests
+from dotenv import load_dotenv
+from langsmith import traceable
+from langsmith.evaluation import evaluate, EvaluationResult
+
+# Load environment variables from .env file
+load_dotenv()
+
+@traceable
+def get_chat_answer(question):
+    base_url = 'http://localhost:3000'
+    url = f"{base_url}/api/v4/chat/completions"
+    headers = {
+        "Content-Type": "application/json",
+        "PRIVATE-TOKEN": os.getenv("GITLAB_PRIVATE_TOKEN"),
+    }
+    payload = {"content": question}
+    response = requests.post(url, json=payload, headers=headers)
+    if response.status_code == 200:
+        return response.json()
+    else:
+        raise Exception(f"Error: {response.status_code} - {response.text}")
+
+# Custom evaluator function
+def oshot_choice(inputs, outputs):
+    expected_answer = inputs.get("expected_answer")
+    model_answer = outputs.get("answer")
+
+    if expected_answer is None or model_answer is None:
+        return EvaluationResult(
+            key="oshot_choice",
+            score=0,
+            comment="Expected answer or model answer is missing."
+        )
+
+    if expected_answer == model_answer:
+        return EvaluationResult(
+            key="oshot_choice",
+            score=1,
+            comment="The model's answer matches the expected answer."
+        )
+    else:
+        return EvaluationResult(
+            key="oshot_choice",
+            score=0,
+            comment="The model's answer does not match the expected answer."
+        )
+
+def main():
+    chain_results = evaluate(
+        lambda inputs: get_chat_answer(inputs["question"]),
+        data="duo_chat_questions_0shot",  # Replace with your dataset name
+        evaluators=[oshot_choice],  # Use the custom evaluator
+        experiment_prefix="Run Small Duo Chat Questions on GDK",
+    )
+    print(chain_results)
+
+if __name__ == "__main__":
+    main()
+```
+
+This will use the custom `oshot_choice` evaluator to assess the model’s answers against the expected answers in your dataset. Make sure to replace "duo_chat_questions_0shot" with the name of your uploaded dataset.
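+
+Before launching a full experiment, it can be useful to call the evaluator directly with hand-written dictionaries and confirm it scores them the way you expect. A minimal sketch, assuming `oshot_choice` lives in the `evaluate.py` shown above:
+
+```python
+from evaluate import oshot_choice  # assumes you run this from the directory containing evaluate.py
+
+# A matching pair should score 1; a mismatching pair should score 0.
+match = oshot_choice(
+    {"question": "What's your name?", "expected_answer": "My name is GitLab Bot."},
+    {"answer": "My name is GitLab Bot."},
+)
+mismatch = oshot_choice(
+    {"question": "What's your name?", "expected_answer": "My name is GitLab Bot."},
+    {"answer": "I am a generic chatbot."},
+)
+print(match.score, match.comment)
+print(mismatch.score, mismatch.comment)
+```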
+
+- [LangChain Evaluators](https://docs.smith.langchain.com/reference/sdk_reference/langchain_evaluators)
+
+### Evaluate questions on more dimensions
+
+See [this script](https://gitlab.com/gitlab-org/ai-powered/eli5/-/blob/main/evaluation_scripts/chat/evaluate_multi_dimension.py?ref_type=heads) for an example.
+
+### More information
+
+See the [evaluator implementations](https://docs.smith.langchain.com/old/evaluation/faq/evaluator-implementations) for details.
diff --git a/doc/heuristics/index.md b/doc/heuristics/index.md
new file mode 100644
index 0000000000000000000000000000000000000000..484254d13d4608fd2e3d657ab67c5a42ba41a5b1
--- /dev/null
+++ b/doc/heuristics/index.md
@@ -0,0 +1,17 @@
+## Good Evaluation Heuristics
+
+- Choosing Evaluation Metrics:
+  - Accuracy: Measure how often the model’s predictions are correct.
+  - Precision and Recall: Evaluate how many of the flagged positives are correct (precision) and how many of the actual positives are found (recall).
+  - F1 Score: The harmonic mean of precision and recall, combining both into a single metric.
+  - Latency: Measure the time taken to produce a response.
+  - Token Usage: Evaluate the efficiency of the model in terms of token consumption.
+  - Conciseness and Coherence: Assess the clarity and logical consistency of the model’s output.
+- Designing Evaluations:
+  - Baseline Comparisons: Compare new models or prompts against a baseline to determine improvements.
+  - Side-by-Side Evaluations: Conduct evaluations that compare different models, prompts, or configurations directly against each other.
+  - Custom Evaluators: Implement custom evaluation functions to test specific aspects of your model’s performance relevant to your application’s needs.
+- Best Practices:
+  - Start Small: Begin with a small, representative dataset to quickly iterate and refine your models and prompts.
+  - Automate: Use CI/CD pipelines to automate the evaluation process, ensuring consistent and repeatable results.
+  - Traceability: Use tracing tools to understand why certain results occurred, making debugging and improvement more straightforward.
diff --git a/doc/img/new_experiment.png b/doc/img/new_experiment.png
new file mode 100644
index 0000000000000000000000000000000000000000..10bbaec52d399e5808dffcc01984df06cb13847c
Binary files /dev/null and b/doc/img/new_experiment.png differ
diff --git a/doc/index.md b/doc/index.md
new file mode 100644
index 0000000000000000000000000000000000000000..f7e9f2ef53ecc9e14350f4609f60e3756d4e78d5
--- /dev/null
+++ b/doc/index.md
@@ -0,0 +1,199 @@
+## Step-by-Step Guide for Conducting Evaluations using LangSmith at GitLab - ELI5 Evals
+
+This guide is designed to help Backend and Frontend developers at GitLab conduct evaluations using LangSmith, even if you are not familiar with Python. The process is broken down into easy-to-follow steps with detailed explanations, examples, and links for further context.
+
+### Prerequisites
+
+- Basic Tools and Setup:
+  - Ensure you have a GitLab account and access to the relevant repositories.
+  - Set up the GitLab Development Kit (GDK). Follow the [GDK setup guide](https://gitlab.com/gitlab-org/gitlab-development-kit).
+- Python Installation:
+  - Make sure Python is installed on your machine. You can download and install it from the [official Python website](https://www.python.org/downloads/).
+- API Keys and Tokens:
+  - [Create an issue using the AI Access Request template](https://gitlab.com/gitlab-com/team-member-epics/access-requests/-/issues/new?issuable_template=AI_Access_Request). Specify LangSmith and Anthropic as the required providers.
+  - [How to create an Anthropic API key](https://docs.anthropic.com/en/api/getting-started)
+  - [How to create a LangSmith API key](https://docs.smith.langchain.com/#2-create-an-api-key).
+  - A GitLab personal access token with the `api` and `ai_features` scopes from **your local GDK instance** (an optional check for this token is sketched right after this list).
+  - [How to generate a personal access token](https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html#create-a-personal-access-token).
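+
+Before moving on, you can quickly confirm that the personal access token from your GDK instance works. This is an optional sketch that follows the same `requests` pattern as the evaluation scripts; it assumes your GDK is running at `http://localhost:3000` and that the token is exported as `GITLAB_PRIVATE_TOKEN` (the variable name the evaluation scripts read):
+
+```python
+import os
+
+import requests
+
+# Call the /user endpoint of your local GDK instance; a 200 response that
+# includes your username confirms the token is valid and has API access.
+response = requests.get(
+    "http://localhost:3000/api/v4/user",
+    headers={"PRIVATE-TOKEN": os.getenv("GITLAB_PRIVATE_TOKEN")},
+)
+print(response.status_code, response.json().get("username"))
+```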
+
+### Step 1: Setting Up Your Environment
+
+#### Install Python and Necessary Libraries
+
+Ensure Python 3 is installed on your machine.
+
+If you are already using `asdf` to manage tool versions, you can [install Python with `asdf`](https://github.com/asdf-community/asdf-python).
+
+If not, download and install it from the official [Python website](https://www.python.org/downloads/).
+
+You can check if Python is installed by running the following command:
+
+```bash
+python --version
+```
+
+#### Clone the ELI5 Cookbook
+
+Clone the `eli5` project, which has everything set up for you.
+
+```bash
+git clone git@gitlab.com:gitlab-org/ai-powered/eli5.git
+cd eli5
+```
+
+#### Set up the project
+
+From the `eli5` folder, run the following:
+
+```bash
+./setup.sh
+```
+
+#### Set Environment Variables
+
+Copy the example `.env` files and fill in your API keys.
+
+```bash
+cp evaluation_scripts/chat/.env.example evaluation_scripts/chat/.env
+cp evaluation_scripts/code_suggestions/.env.example evaluation_scripts/code_suggestions/.env
+```
+
+Edit the `.env` files to include your API keys and tokens.
+
+### Step 2: Create and upload your dataset
+
+- [See our dataset guide here](./datasets/)
+- You can use an existing dataset or create and upload a new one specific to your evaluations.
+- Follow the [instructions in the example project](https://gitlab.com/gitlab-org/ai-powered/eli5#creating-and-uploading-datasets) to create and upload datasets.
+- You can see some sample datasets in [the eli5 repository](https://gitlab.com/gitlab-org/ai-powered/eli5/-/tree/main/datasets).
+
+### Step 3: Running the Evaluation Scripts
+
+The [example project](https://gitlab.com/gitlab-org/ai-powered/eli5) includes pre-configured evaluation scripts. Navigate to the respective directories and run the scripts. [See our evaluators guide here](./evaluators/) for more information.
+
+#### Running the Script Locally
+
+Make sure your GDK is running:
+
+```bash
+gdk start
+```
+
+Then, in your terminal, change into the directory where `evaluate.py` is located and run it:
+
+```bash
+cd evaluation_scripts/chat
+python evaluate.py
+```
+
+An example output would be:
+
+```plaintext
+Running evaluation for: Run Small Duo Chat Questions on GDK
+
+----------------------------------------
+Evaluation Results:
+----------------------------------------
+
+1. Question: "What's your name?"
+   Expected Answer: "My name is GitLab Bot."
+   Model Answer: "My name is GitLab Bot."
+   Result: PASS
+   Evaluation Metrics:
+   - Accuracy: 100%
+   - Latency: 250ms
+   - Token Usage: 15 tokens
+
+   ...
+
+----------------------------------------
+Summary:
+----------------------------------------
+Total Questions Evaluated: 5
+Passed: 5
+Failed: 0
+Overall Accuracy: 100%
+Average Latency: 286ms
+Average Token Usage: 20.4 tokens
+----------------------------------------
+
+Trace Details:
+----------------------------------------
+Question: "What's your name?"
+Trace ID: abc123
+Latency: 250ms
+Tokens Used: 15
+----------------------------------------
+
+...
+
+Evaluation completed successfully.
+```
+
+#### Making Changes to Prompts and Rerunning the Evaluation
+
+To evaluate changes to prompts in the GDK, you can follow these steps:
+
+- Locate the Prompt File:
+  - The prompts for the chat model are located in the GitLab repository; for example, the zero-shot chat prompt is defined in [base.rb](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/gitlab/llm/chain/agents/zero_shot/prompts/base.rb).
+- Modify the Prompt:
+  - Open the [base.rb](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/lib/gitlab/llm/chain/agents/zero_shot/prompts/base.rb) file and make your changes to the prompt. For instance, you might modify the `base_prompt` method to improve the clarity and specificity of the system prompt; clearer prompts can significantly improve the performance and reliability of chat models.
+
+##### Example of the original `base_prompt` method
+
+```ruby
+  system_prompt = options[:system_prompt] || Utils::Prompt.default_system_prompt
+  zero_shot_prompt = format(options[:zero_shot_prompt], options)
+  ...
+```
+
+##### Example of a modified `base_prompt` method to improve clarity
+
+```ruby
+  system_prompt = options[:system_prompt] || "You are a helpful assistant knowledgeable about GitLab's features and services. Answer questions clearly and concisely."
+  zero_shot_prompt = format(options[:zero_shot_prompt], options)
+  ...
+```
+
+##### Rerun the Evaluation
+
+With the prompt updated, rerun the evaluation script to see how the changes affect the model’s performance. Navigate back to your `evaluation_scripts/chat` directory and run:
+
+```bash
+python evaluate.py
+```
+
+##### Expected Benefits of the Improved Prompt
+
+1. By explicitly stating that the assistant should be knowledgeable about GitLab’s features and services and provide clear and concise answers, the model has a better understanding of the expected output.
+2. The additional context helps the model generate more accurate responses, directly addressing user queries with relevant information.
+3. Users receive more precise and helpful responses, enhancing their overall experience with the chat system.
+4. With clearer instructions, the model can process queries more efficiently, potentially reducing latency and token usage.
+
+### Step 4: Analyzing the Results
+
+- Review Output:
+  - Check the output of your evaluation run, whether you ran it locally or in the GitLab CI/CD pipeline. It should print the results of the evaluation, showing the performance and any issues identified.
+- Trace and Debug:
+  - If there are errors or unexpected results, use the tracing functionality provided by LangSmith. Refer to the LangSmith documentation for detailed guidance on tracing and debugging; a small SDK-based sketch follows this list.
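+
+If a run errors or produces unexpected scores, you can also pull the traced runs for an experiment directly with the LangSmith SDK before opening the web UI. This is a minimal sketch, assuming `LANGCHAIN_API_KEY` is set; replace the placeholder with the experiment name that `evaluate()` prints when it starts:
+
+```python
+from langsmith import Client
+
+client = Client()  # reads LANGCHAIN_API_KEY from the environment
+
+# List the most recent traced runs for one experiment and flag any that errored.
+for run in client.list_runs(project_name="<your experiment name>", limit=10):
+    status = "error" if run.error else "ok"
+    print(run.id, run.name, status)
+```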
+
+### Good Evaluation Heuristics
+
+- [See our heuristics guide here](./heuristics/)
+
+### Evaluating performance metrics
+
+- [See our performance evaluation guide here](./performance/)
+
+### Integrate with GitLab CI/CD
+
+- [See our GitLab CI/CD guide here](./cicd/)
+
+### Additional Resources
+
+- [LangSmith Evaluation Cookbook](https://github.com/langchain-ai/langsmith-cookbook/blob/main/README.md#testing--evaluation): Contains various evaluation scenarios and examples.
+- [LangSmith How To Guides](https://docs.smith.langchain.com/how_to_guides): Contains various how-to walkthroughs.
+- [GitLab Duo Chat Documentation](https://docs.gitlab.com/ee/development/ai_features/duo_chat.html): Comprehensive guide on setting up and using LangSmith for chat evaluations.
+- [Prompt and AI Feature Evaluation Setup and Workflow](https://gitlab.com/groups/gitlab-org/-/epics/13952): Details on the overall workflow and setup for evaluations.
+
+By following these steps, you can effectively conduct evaluations using LangSmith, even with minimal Python knowledge. For any issues or further assistance, refer to the provided documentation links or reach out to your team leads.
diff --git a/doc/performance/index.md b/doc/performance/index.md
new file mode 100644
index 0000000000000000000000000000000000000000..0b6deb52bcdbf39c877d74e7d3fd519d56c3a644
--- /dev/null
+++ b/doc/performance/index.md
@@ -0,0 +1,83 @@
+## Example: evaluating performance metrics
+
+This script evaluates performance metrics by sending chat requests, measuring latency, tracking token usage, and calculating overall metrics such as error rate, average latency, and reliability.
+
+```python
+import os
+import time
+
+import requests
+from dotenv import load_dotenv
+from langsmith import traceable
+from langsmith.evaluation import evaluate, LangChainStringEvaluator
+
+load_dotenv()
+
+# Per-request metrics collected while the evaluation runs.
+request_metrics = []
+
+@traceable
+def get_chat_answer(question):
+    base_url = 'http://localhost:3000'
+    url = f"{base_url}/api/v4/chat/completions"
+    headers = {
+        "Content-Type": "application/json",
+        "PRIVATE-TOKEN": os.getenv("GITLAB_PRIVATE_TOKEN"),
+    }
+    payload = {"content": question}
+    start_time = time.time()
+    response = requests.post(url, json=payload, headers=headers)
+    latency = time.time() - start_time
+
+    if response.status_code == 200:
+        response_data = response.json()
+        tokens_used = response_data.get('usage_metadata', {}).get('total_tokens', 0)
+        result = {
+            "response": response_data,
+            "latency": latency,
+            "tokens_used": tokens_used,
+            "status": "success"
+        }
+    else:
+        result = {
+            "status": "error",
+            "status_code": response.status_code,
+            "latency": latency,
+            "tokens_used": 0
+        }
+
+    request_metrics.append(result)
+    return result
+
+def main():
+    # Initialize the built-in exact-match string evaluator
+    evaluator_1 = LangChainStringEvaluator("exact_match")
+    dataset_name = "duo_chat_questions_0shot"
+
+    evaluate(
+        lambda inputs: get_chat_answer(inputs['question']),  # key must match your dataset's input column
+        data=dataset_name,
+        evaluators=[evaluator_1],  # Use the built-in StringEvaluator
+        experiment_prefix="Run Small Duo Chat Questions on GDK",
+    )
+
+    # Aggregate the per-request metrics recorded by get_chat_answer
+    total_requests = len(request_metrics)
+    successful = [r for r in request_metrics if r['status'] == 'success']
+    error_requests = total_requests - len(successful)
+    latencies = [r['latency'] for r in successful]
+    tokens_used = [r['tokens_used'] for r in successful]
+
+    performance_metrics = {
+        "total_requests": total_requests,
+        "successful_requests": len(successful),
+        "error_requests": error_requests,
+        "error_rate": error_requests / total_requests if total_requests > 0 else 0,
+        "average_latency": sum(latencies) / len(latencies) if latencies else 0,
+        "average_tokens_per_second": sum(tokens_used) / sum(latencies) if latencies else 0,
+        # Note: this is the fastest full-response latency, not a true time to first token.
+        "time_to_first_token_render": min(latencies) if latencies else 0,
+        "reliability": len(successful) / total_requests if total_requests > 0 else 0
+    }
+
+    print("Performance Metrics:")
+    for key, value in performance_metrics.items():
+        print(f"{key}: {value}")
+
+if __name__ == "__main__":
+    main()
+```
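+
+If this script runs as part of the CI/CD job described in the [CI/CD guide](../cicd/), it can be useful to persist the metrics so the pipeline can expose them, for example as a job artifact. A minimal sketch; the helper name and file name are illustrative:
+
+```python
+import json
+
+def write_metrics(performance_metrics, path="performance_metrics.json"):
+    """Write the computed metrics to a JSON file that a CI job can collect."""
+    with open(path, "w") as f:
+        json.dump(performance_metrics, f, indent=2)
+```
+
+Call `write_metrics(performance_metrics)` at the end of `main()` and list the file under an `artifacts:` entry for the `evaluate_langsmith` job in `.gitlab-ci.yml`.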