
The User Interface & Ground-Truth Testing
Part 5 of 5 | ← Part 4 | Complete Series
Streamlit Overview
Streamlit lets you build data apps in pure Python—no HTML, CSS, or JavaScript needed.
```python
import streamlit as st
from datetime import datetime
import pandas as pd
import requests

# Page config
st.set_page_config(
    page_title="IPL AI Assistant",
    page_icon="🏏",
    layout="wide",
)

# Title
st.title("🏏 IPL AI Assistant")
st.caption("Predictions + Q&A Powered by ML")  # st.subtitle() doesn't exist; st.caption() does

# Tabs
tab1, tab2, tab3 = st.tabs(["💬 Chat", "🎯 Predict", "📊 Metrics"])
```
Tab 1: Chat Interface
```python
with tab1:
    st.header("Ask Anything")

    # Initialize session state
    if "messages" not in st.session_state:
        st.session_state.messages = []

    # Display chat history
    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])

    # User input
    if user_input := st.chat_input("Ask about IPL..."):
        st.session_state.messages.append({"role": "user", "content": user_input})
        with st.chat_message("user"):
            st.markdown(user_input)

        # Call backend
        try:
            response = requests.post(
                "http://localhost:8000/chat",
                json={"message": user_input},
                timeout=5,
            )
            response.raise_for_status()
            result = response.json()
            assistant_message = result.get("message", "I couldn't understand that.")
            st.session_state.messages.append({
                "role": "assistant",
                "content": assistant_message,
            })
            with st.chat_message("assistant"):
                st.markdown(assistant_message)
        except requests.RequestException as e:
            st.error(f"❌ Backend error: {e}")
```
Key Concepts:

- **Session State:** `st.session_state.messages` persists across reruns
  - When the user submits a message, Streamlit reruns the entire script
  - Session state preserves the conversation history
  - Without it, chat history disappears on each input
- **Chat Message:** `st.chat_message()` renders messages with role-based styling
  - "user" = right-aligned, blue background
  - "assistant" = left-aligned, gray background
- **Chat Input:** `st.chat_input()` provides a textbox with submission handling
  - Returns `None` until the user submits
  - Automatically clears after submission
Tab 2: Prediction Interface
```python
with tab2:
    st.header("Match Prediction Simulator")

    teams = [
        "Mumbai Indians",
        "Chennai Super Kings",
        "Royal Challengers Bangalore",
        "Kolkata Knight Riders",
        # ... all 10 teams
    ]

    col1, col2 = st.columns(2)
    with col1:
        st.subheader("Teams")
        batting_team = st.selectbox("Batting Team:", teams)
        bowling_team = st.selectbox(
            "Bowling Team:",
            options=[t for t in teams if t != batting_team],  # all teams except batting team
            index=0,
        )
    with col2:
        st.subheader("Venue")
        venue = st.text_input("Ground name:", "Wankhede")

    st.subheader("Pre-Match Form")
    col1, col2, col3, col4 = st.columns(4)
    with col1:
        h2h_rate = st.slider("H2H Win Rate (Batting Team)", 0.0, 1.0, 0.5, 0.05)
    with col2:
        overall_rate = st.slider("Overall Win Rate", 0.0, 1.0, 0.5, 0.05)
    with col3:
        venue_rate = st.slider("Venue Win Rate", 0.0, 1.0, 0.5, 0.05)
    with col4:
        rolling_rate = st.slider("Last 5 Matches Win Rate", 0.0, 1.0, 0.5, 0.05)

    st.subheader("Toss")
    col1, col2 = st.columns(2)
    with col1:
        toss_win = st.radio("Who won toss?", ["Batting Team", "Bowling Team"])
        toss_win = 1 if toss_win == "Batting Team" else 0
    with col2:
        toss_choice = st.radio("Toss choice?", ["Bat", "Field"])
        toss_choice = toss_choice.lower()

    # Predict button
    if st.button("🎯 Predict Match Outcome", use_container_width=True):
        try:
            response = requests.post(
                "http://localhost:8000/predict",
                json={
                    "batting_team": batting_team,
                    "bowling_team": bowling_team,
                    "venue": venue,
                    "h2h_rate": h2h_rate,
                    "overall_rate": overall_rate,
                    "venue_rate": venue_rate,
                    "rolling_rate": rolling_rate,
                    "toss_win": toss_win,
                    "toss_choice": toss_choice,
                },
                timeout=5,
            )
            response.raise_for_status()
            result = response.json()
            winner = result["winner"]
            confidence = result["confidence"]
            st.success(
                f"### 🏆 {winner} wins!\n"
                f"**Confidence:** {confidence:.1%}"
            )
            # Show prediction breakdown
            st.info(
                f"**Model:** {result['model']}\n\n"
                f"**Reasoning:**\n"
                f"- H2H: {h2h_rate:.0%}\n"
                f"- Form: {rolling_rate:.0%}\n"
                f"- Venue: {venue_rate:.0%}\n"
            )
        except requests.RequestException as e:
            st.error(f"❌ Prediction failed: {e}")
```
Key UI Patterns:
- st.selectbox() — Dropdown selector
- st.slider() — Range input (0.0-1.0)
- st.radio() — Single-choice radio buttons
- st.columns() — Grid layout (col1, col2, etc.)
- st.button() — Form submission
- st.success/error/info() — Colored alerts
Tab 3: Metrics & Transparency
```python
with tab3:
    st.header("Model Performance")

    # Load metrics
    import json
    with open("models/metrics.json") as f:
        metrics = json.load(f)

    col1, col2, col3 = st.columns(3)
    col1.metric("Test Accuracy", f"{metrics['accuracy']:.1%}")
    col2.metric("Precision", f"{metrics['precision']:.1%}")
    col3.metric("Recall", f"{metrics['recall']:.1%}")

    st.subheader("Confusion Matrix")
    st.image("models/confusion_matrix.png", use_column_width=True)

    st.subheader("Feature Importance")
    importance_df = pd.DataFrame({
        "Feature": ["h2h_rate", "rolling_rate", "venue_rate"],  # ... remaining features
        "Importance": [0.32, 0.28, 0.18],  # ... remaining values
    }).sort_values("Importance", ascending=False)
    st.bar_chart(importance_df.set_index("Feature"))

    st.subheader("Q&A Engine")
    st.info(
        "**Total Q&A Pairs:** 42,523\n\n"
        "**Vocabulary Size:** 18,394\n\n"
        "**Match Strategy:** TF-IDF + Cosine Similarity (threshold: 0.15)\n\n"
        f"**Coverage:** {42523 / 50000 * 100:.1f}% of expected cricket topics"
    )
```
Testing: The Foundation of Trust
Good tests = confidence in deployment.
Test Structure
```python
# tests/test_qa.py
import pytest
import pandas as pd
from joblib import load

from src.build_qa_model import answer_question

# Load test data
test_df = pd.read_csv("datasets/ipl_2008_2024_complete.csv")
qa_model = load("models/qa_model.joblib")

# Extract Q&A components
tfidf = qa_model["tfidf"]
Q_matrix = qa_model["Q_matrix"]
answers = qa_model["answers"]
```
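All of the tests rely on `answer_question` from Part 3. As a reminder of what it does — and so you can run the tests' logic without the real model files — here is a minimal self-contained sketch of a TF-IDF + cosine-similarity lookup with a threshold. The tiny corpus and answer strings are illustrative, not from the real 42K-pair dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def answer_question(question, tfidf, Q_matrix, answers, threshold=0.15):
    """Return (answer, score); answer is None when the best match is below threshold."""
    q_vec = tfidf.transform([question])
    sims = cosine_similarity(q_vec, Q_matrix)[0]
    best = int(sims.argmax())
    score = float(sims[best])
    if score < threshold:
        return None, score
    return answers[best], score

# Tiny illustrative corpus (stand-in for the real Q&A pairs)
known_questions = ["who won the 2024 final", "which team hit the most sixes"]
known_answers = ["KKR won the 2024 final.", "CSK hit the most sixes."]
tfidf = TfidfVectorizer().fit(known_questions)
Q_matrix = tfidf.transform(known_questions)

ans, score = answer_question("who won the final", tfidf, Q_matrix, known_answers)
miss, low = answer_question("zzz qqq", tfidf, Q_matrix, known_answers)
```

A near-match scores well above the threshold and returns its stored answer; gibberish shares no vocabulary, scores 0, and returns `None` — exactly the behavior Test 4 below asserts.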
Test 1: Specific Match Facts
```python
def test_match_lookup():
    """Can we answer specific match questions?"""
    questions = [
        "Who won the match on 2024-04-01 between MI and RR?",
        "How many runs were scored by MI in 2024-04-01?",
        "What was the result of MI vs RR on 2024-04-01?",
    ]
    for question in questions:
        answer, score = answer_question(
            question, tfidf, Q_matrix, answers, threshold=0.15
        )
        assert answer is not None, f"Failed on: {question}"
        assert len(answer) > 10, f"Answer too short: {answer}"
        assert score > 0.15, f"Confidence too low: {score}"
```
**Why data-driven?**

- Not hardcoded: no `assert answer == "Mumbai Indians wins"` literals
- CSV-based: pulls real facts from the dataset
- Robust: works even if answer phrasing changes
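The data-driven idea can be sketched concretely — derive the expected value from the CSV at test time, then assert against that. The column names (`date`, `team1`, `team2`, `winner`) here are hypothetical stand-ins for whatever the real dataset uses:

```python
import pandas as pd
from io import StringIO

# Tiny in-memory stand-in for datasets/ipl_2008_2024_complete.csv
csv = StringIO(
    "date,team1,team2,winner\n"
    "2024-04-01,MI,RR,RR\n"
    "2024-04-02,CSK,GT,CSK\n"
)
df = pd.read_csv(csv)

# Derive the expected fact from the data instead of hardcoding a string
row = df[(df["date"] == "2024-04-01") & (df["team1"] == "MI")].iloc[0]
expected_winner = row["winner"]

# A Q&A answer is then checked against the derived value, so the test
# survives both data updates and rewording of the answer text
assert expected_winner in ("MI", "RR")
```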
Test 2: Aggregate Statistics
```python
def test_most_wins():
    """Can we retrieve aggregate stats?"""
    questions = [
        "Which team has won most matches?",
        "Who has most IPL titles?",
        "Team with highest win percentage?",
    ]
    for question in questions:
        answer, score = answer_question(
            question, tfidf, Q_matrix, answers, threshold=0.15
        )
        # Verify answer is a valid team name
        assert answer is not None
        valid_teams = ["Mumbai Indians", "CSK", "RCB"]  # ... all team names
        assert any(team in answer for team in valid_teams)
```
Test 3: Head-to-Head
```python
def test_head_to_head():
    """Can we answer H2H questions?"""
    questions = [
        "Head-to-head record between MI and CSK?",
        "Does MI have winning record vs KKR?",
        "Who dominates MI vs RR?",
    ]
    for question in questions:
        answer, score = answer_question(
            question, tfidf, Q_matrix, answers, threshold=0.15
        )
        assert answer is not None
        # H2H answers contain numbers (win counts)
        assert any(char.isdigit() for char in answer)
```
Test 4: Threshold Behavior
```python
def test_threshold_protects_low_confidence():
    """Low-confidence matches are rejected."""
    nonsense = "xyzabc qwerty asdfgh"  # Gibberish
    answer, score = answer_question(
        nonsense, tfidf, Q_matrix, answers, threshold=0.15
    )
    # Model shouldn't hallucinate
    assert answer is None, f"Got answer for nonsense: {answer}"
    assert score < 0.15
```
Test 5: ML Model Predictions
```python
def test_model_predictions():
    """Can we predict match winners?"""
    from src.train import normalize_teams, engineer_features
    from joblib import load

    ml_model = load("models/model.joblib")
    pipeline = ml_model["pipeline"]

    # Create a test case
    test_row = test_df.iloc[0].copy()  # Use a real match
    test_row["date"] = "2023-05-01"  # Test on recent data

    # Engineer features
    features_df = engineer_features(
        test_df[test_df["date"] < "2023-01-01"],  # Historical data only
        test_row,
    )

    # Predict
    prediction = pipeline.predict(features_df)
    prob = pipeline.predict_proba(features_df)

    assert prediction[0] in [0, 1]  # Binary classification
    assert 0 <= max(prob[0]) <= 1  # Valid probability
    assert abs(sum(prob[0]) - 1.0) < 0.01  # Probabilities sum to 1
```
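The probability invariants asserted in Test 5 hold for any scikit-learn classifier, so you can verify them without the real model files. A toy stand-in on synthetic data (not the real pipeline or features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data standing in for the real features
X = np.array([[0.2, 0.3], [0.8, 0.7], [0.1, 0.4], [0.9, 0.6]])
y = np.array([0, 1, 0, 1])
clf = LogisticRegression().fit(X, y)

pred = clf.predict([[0.5, 0.5]])
prob = clf.predict_proba([[0.5, 0.5]])

assert pred[0] in (0, 1)                # binary label
assert 0.0 <= prob[0].max() <= 1.0      # valid probability
assert abs(prob[0].sum() - 1.0) < 1e-9  # each row of predict_proba sums to 1
```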
Test 6: Feature Sanity
```python
def test_feature_ranges():
    """Are features in valid ranges?"""
    from src.features import engineer_features

    # Get one match
    test_match = test_df.iloc[0]

    # Engineer
    features = engineer_features(test_df, test_match)

    # Check rates (should be 0-1)
    assert 0 <= features["h2h_rate"].values[0] <= 1
    assert 0 <= features["overall_rate"].values[0] <= 1
    assert 0 <= features["venue_rate"].values[0] <= 1
    assert 0 <= features["rolling_rate"].values[0] <= 1

    # Check binary fields
    assert features["toss_win"].values[0] in [0, 1]
```
Test 7: No Data Leakage
```python
def test_future_data_not_used():
    """Ensure no future data affects past predictions."""
    from src.features import engineer_features

    # Engineer a 2023 match using only pre-2023 history
    before_date = "2023-01-01"
    hist_data = test_df[test_df["date"] < before_date]
    test_match = test_df[
        (test_df["date"] >= before_date) &
        (test_df["date"] < "2023-02-01")
    ].iloc[0]

    features = engineer_features(hist_data, test_match)

    # Features should only use hist_data, not test_match
    # (This is enforced in engineer_features with before_date guards)
    assert features is not None

    # Verify: no 2023 data leaked into the historical rates
    assert all(hist_data["date"] < before_date)
```
Running Tests
```bash
# Install test dependencies
pip install pytest pytest-cov

# Run all tests
pytest tests/

# Run with verbose output
pytest tests/ -v

# Run specific test
pytest tests/test_qa.py::test_match_lookup -v

# Coverage report (requires pytest-cov)
pytest tests/ --cov=src --cov-report=html
```
Deployment
Option 1: Streamlit Cloud
```bash
# Push to GitHub
git push origin main

# Link in Streamlit Cloud (streamlit.io/cloud)
# - Select repository
# - Select app.py
# - Auto-deploys on push
```
Live in 2 minutes, updates automatically.
Option 2: Docker
```dockerfile
FROM python:3.11
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "src/streamlit_app.py", "--server.port=8501"]
```

```bash
docker build -t ipl-app .
docker run -p 8501:8501 ipl-app
```
Visit http://localhost:8501
Option 3: Heroku / Railway
```bash
# Deploy with one command
heroku create ipl-ai-assistant
git push heroku main
```
App runs at ipl-ai-assistant.herokuapp.com
Performance Monitoring
```python
# Log predictions so the dashboard can read them back
import logging
from datetime import datetime

# Write raw CSV rows (columns here are one possible layout:
# timestamp,hour,confidence) so pd.read_csv can parse the log file directly
logging.basicConfig(
    filename="predictions.log",
    level=logging.INFO,
    format="%(message)s",
)

@st.cache_data
def get_prediction_logs():
    return pd.read_csv(
        "predictions.log",
        names=["timestamp", "hour", "confidence"],
    )

# Dashboard
st.line_chart(
    get_prediction_logs()
    .groupby("hour")["confidence"]
    .mean()
)
```
Track:
- Average confidence over time
- Common queries
- Backend latency
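The aggregations behind those three metrics are plain pandas; a sketch with hypothetical log columns (`hour`, `query`, `confidence`, `latency_ms` are illustrative names, not the real schema):

```python
import pandas as pd

# Hypothetical logged predictions
logs = pd.DataFrame({
    "hour": ["10:00", "10:00", "11:00"],
    "query": ["predict MI vs CSK", "who won 2024?", "predict MI vs CSK"],
    "confidence": [0.6, 0.7, 0.9],
    "latency_ms": [12, 18, 15],
})

conf_by_hour = logs.groupby("hour")["confidence"].mean()  # confidence over time
top_queries = logs["query"].value_counts()                # common queries
p95_latency = logs["latency_ms"].quantile(0.95)           # backend latency tail
```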
Common Issues & Solutions
| Issue | Solution |
| --- | --- |
| "Connection refused" | Backend not running on localhost:8000 |
| Chat history disappears | Use `st.session_state`, not regular variables |
| Predictions slow | Enable model caching, use lazy loading |
| Tests fail on new data | Read expected values from the CSV, not hardcoded strings |
| Threshold too strict | Lower from 0.15 to 0.10 for more results |
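The "model caching" fix boils down to loading the model once and reusing the object; inside the app Streamlit's `st.cache_resource` plays this role. A framework-free sketch of the same idea with `functools.lru_cache` (the loader name and return value are hypothetical):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    """Load the model once; every later call returns the cached object."""
    # Stand-in for the slow load("models/model.joblib")
    return {"pipeline": object()}

first = get_model()   # triggers the (slow) load
second = get_model()  # served from cache, no reload
assert first is second
```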
Summary: The Complete System
| Component | Purpose | Technology |
| --- | --- | --- |
| Feature Engineering | Calculate pre-match metrics | pandas, numpy |
| ML Model | Predict match winners | scikit-learn |
| Q&A Engine | Answer cricket questions | TF-IDF, cosine similarity |
| FastAPI Backend | Intelligent routing, lazy loading | FastAPI, uvicorn |
| Streamlit Frontend | Chat, prediction, metrics | Streamlit |
| Testing | Verify correctness | pytest, CSV-based assertions |
You now have:
✅ Production-ready predictions (61.8% accuracy)
✅ Intelligent Q&A with 42K learning pairs
✅ Low-latency API (<20ms per request)
✅ Smooth UI with session persistence
✅ 22 tests ensuring reliability
✅ Multiple deployment options
What's Next?
Deploy this and message me your results! 🚀
Repository: https://github.com/jayakrishnayadav24/ipl-ai-assistant
Series Complete 🎉
← Part 4: Backend Routing | ← Return to Series
Built with 💚 for cricket fans. Questions? DM me!
Source: Dev.to


