Practitioner Track · Module 3

Data Literacy for GenAI

Understand how data flows through GenAI systems, learn to curate effective context, protect sensitive information, and verify AI outputs against source data.

25 min
175 XP
Jan 2026
Learning Objectives
  • Understand how GenAI uses data differently than traditional AI
  • Curate effective context for prompts and RAG systems
  • Protect sensitive data when using GenAI tools
  • Verify and ground AI outputs against authoritative sources

Why Data Literacy Matters for GenAI

GenAI doesn't learn from your data the way traditional AI does. When you prompt ChatGPT or Claude, you're not training a model—you're providing context for a pre-trained model to work with. This changes what "data literacy" means.

For GenAI practitioners, data literacy is about:

  • What goes in: Curating the right context for accurate, relevant outputs
  • What stays out: Protecting sensitive information from exposure
  • What comes back: Verifying outputs against authoritative sources

The GenAI Data Flow

| Stage | Traditional ML | GenAI |
| --- | --- | --- |
| Training | Your historical data trains the model | Pre-trained on public data (not yours) |
| Input | Structured features | Natural language prompts + context |
| Output | Predictions, classifications | Generated text, analysis, media |
| Learning | Model improves with more data | Model doesn't learn from your usage* |

*Note: Some enterprise agreements allow usage data for model improvement. Know your vendor's data policy.


Workplace Scenario: The Knowledge Gap

You're rolling out a GenAI assistant to help customer support agents find answers faster. The tool is connected to your knowledge base via RAG (Retrieval-Augmented Generation).

Early feedback is mixed:

  • "It gave me a policy from 2019 that we don't use anymore"
  • "It couldn't find anything about the new pricing tier"
  • "It confidently told a customer something that was just wrong"

The problem isn't the AI—it's the data feeding it.

This scenario plays out constantly. GenAI is only as good as the context it receives.


Curating Context: What Goes In

The RAG Reality

RAG (Retrieval-Augmented Generation) connects GenAI to your organization's knowledge. When a user asks a question, the system:

  1. Searches your knowledge base for relevant documents
  2. Includes those documents as context in the prompt
  3. Generates an answer grounded in your content

RAG quality depends entirely on knowledge base quality.
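The three steps above can be sketched in a few lines. This is a toy illustration, assuming a two-article knowledge base and naive keyword-overlap scoring in place of a real document store and vector search; the article IDs and functions are hypothetical.

```python
import re

# Minimal RAG sketch. The knowledge base and keyword-overlap scoring are
# toy stand-ins for a real document store and vector search.
KNOWLEDGE_BASE = [
    {"id": "KB-101", "title": "Refund Policy",
     "text": "Annual plans get prorated refunds within 30 days."},
    {"id": "KB-102", "title": "Pricing Tiers",
     "text": "The enterprise tier includes SSO and audit logs."},
]

def tokens(s: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def retrieve(query: str, top_k: int = 1) -> list[dict]:
    """Step 1: rank articles by keyword overlap with the query."""
    q = tokens(query)
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: len(q & tokens(doc["text"])),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query: str) -> str:
    """Steps 2-3: include retrieved articles as context so the answer is grounded."""
    docs = retrieve(query)
    context = "\n".join(f'[{d["id"]}] {d["title"]}: {d["text"]}' for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What is the refund policy for annual plans?")
print(prompt)
```

Notice that whatever `retrieve` returns is all the model gets to work with: if the wrong or stale article ranks highest, the generated answer inherits that flaw.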

Knowledge Base Health Checklist

| Dimension | Question | Impact on GenAI |
| --- | --- | --- |
| Currency | Is content up to date? | Outdated docs = outdated answers |
| Coverage | Are all topics documented? | Gaps = "I don't know" or hallucination |
| Accuracy | Is the content correct? | Wrong source = wrong answer |
| Findability | Can search retrieve it? | Unfindable = effectively doesn't exist |
| Clarity | Is it written clearly? | Ambiguous source = ambiguous output |

Common Knowledge Base Problems

Problem 1: The Graveyard
Your knowledge base has 10,000 articles. 6,000 haven't been updated in 3+ years. The AI retrieves outdated content and presents it as current policy.

Fix: Implement content lifecycle management. Archive or flag stale content. Add "last reviewed" dates that RAG can filter on.
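A "last reviewed" filter like the one described here is simple to sketch. The article IDs, metadata field name, and one-year window below are illustrative assumptions, not a prescribed schema.

```python
from datetime import date, timedelta

# Hypothetical article metadata carrying a "last_reviewed" date the
# retriever can filter on before anything reaches the prompt.
articles = [
    {"id": "KB-12", "last_reviewed": date(2025, 11, 1)},
    {"id": "KB-34", "last_reviewed": date(2019, 3, 15)},
]

def fresh_articles(articles, max_age_days=365, today=None):
    """Exclude articles that were not reviewed within the allowed window."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    return [a for a in articles if a["last_reviewed"] >= cutoff]

print([a["id"] for a in fresh_articles(articles, today=date(2026, 1, 15))])  # → ['KB-12']
```

Applied as a retrieval filter, the 2019 article simply never reaches the model, so it cannot be presented as current policy.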

Problem 2: The Gaps
A new product launched last quarter. The knowledge base has minimal documentation. The AI either says "I don't know" or hallucinates based on similar products.

Fix: Treat knowledge base updates as part of product launch readiness. No launch without documentation.

Problem 3: The Contradictions
Three different articles describe the return policy—each slightly differently. The AI retrieves one at random, giving inconsistent answers.

Fix: Establish single sources of truth for key topics. Consolidate duplicates. Use canonical URLs.


Interactive: Knowledge Base Triage

For each scenario, identify the knowledge base issue and recommend an action:


Scenarios:

  • AI misinterprets a policy because it's written in legal jargon (technically accurate, but poor comprehension)
  • AI can't answer questions about the new enterprise tier (sales team frustrated by gaps in product knowledge)
  • AI gives different refund timelines to different customers (three articles describe the policy differently)
  • AI cites a process that was replaced 6 months ago (support agents report getting outdated procedures)
  • AI retrieves irrelevant articles about similar topics (search returns tangentially related content)

Protecting Sensitive Data: What Stays Out

The Privacy Risk

Every prompt you send to a GenAI tool is data leaving your control. Depending on your vendor agreement, that data may be:

  • Stored in logs
  • Used to improve the model
  • Accessible to the vendor's employees
  • Subject to legal requests

The rule is simple: Don't put anything in a prompt that you wouldn't put in an email to an external party.

Data Classification for GenAI

| Classification | Examples | GenAI Guidance |
| --- | --- | --- |
| Public | Marketing materials, public docs | Safe to use freely |
| Internal | Internal processes, non-sensitive business info | Generally safe with enterprise tools |
| Confidential | Customer PII, financial data, HR records | Never in public tools; caution even in enterprise |
| Restricted | Trade secrets, M&A info, legal matters | Avoid GenAI entirely or use air-gapped solutions |

Common Mistakes

Mistake 1: Pasting Customer Data

"Summarize this customer complaint: [full name, account number, medical condition, detailed complaint]"

Better: Remove PII before pasting, or use anonymized examples.
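Removing PII before pasting can be partially automated. The sketch below is a minimal regex-based scrub; the patterns and the `ACCT-` account-number format are illustrative assumptions, and a real deployment would use a dedicated PII-detection service rather than hand-rolled regexes.

```python
import re

# Toy redaction patterns -- illustrative assumptions only. A production
# system would use a dedicated PII-detection service with far broader
# coverage (names, addresses, medical terms, etc.).
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "ACCOUNT": r"\bACCT-\d{6,}\b",      # hypothetical account-number format
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",  # US-style numbers only
}

def redact(text: str) -> str:
    """Replace detected PII with placeholder tokens before prompting."""
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

complaint = "Customer jane.doe@example.com (ACCT-123456, 555-867-5309) reports a billing error."
print(redact(complaint))  # → Customer [EMAIL] ([ACCOUNT], [PHONE]) reports a billing error.
```

Placeholder tokens like `[EMAIL]` preserve the structure of the complaint, so the model can still summarize it usefully without ever seeing the identifying details.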

Mistake 2: Sharing Proprietary Code

"Debug this code from our trading algorithm: [proprietary logic]"

Better: Abstract the problem, use pseudocode, or use an air-gapped coding assistant.

Mistake 3: Strategic Information

"Help me draft talking points for the acquisition of [company name]"

Better: Use generic placeholders until the information is public.

Enterprise vs. Consumer Tools

| Aspect | Consumer (ChatGPT free) | Enterprise (ChatGPT Enterprise, Claude for Work) |
| --- | --- | --- |
| Data retention | May retain prompts | Typically no retention |
| Model training | May use your data | Your data excluded |
| Access controls | None | SSO, audit logs |
| Compliance | Limited | SOC 2, HIPAA options |

Know your organization's approved tools. Using consumer tools for work data may violate policy—and create real risk.

Knowledge Check

Test your understanding with a quick quiz


Verifying Outputs: Trust but Verify

The Hallucination Problem

GenAI models generate plausible text—but plausible isn't the same as true. They can:

  • Invent facts that sound authoritative
  • Cite sources that don't exist
  • Confidently state outdated information
  • Mix accurate and inaccurate details seamlessly

Verification isn't optional. It's the core skill of GenAI literacy.

Verification Strategies

Strategy 1: Source Attribution
Ask the AI to cite its sources. Then check those sources.

```text
Prompt: "What is our refund policy for annual subscriptions?
Cite the specific knowledge base article."

Output: "According to KB-2847 'Subscription Refunds',
annual plans are eligible for prorated refunds..."

Verification: Open KB-2847. Confirm it says what the AI claims.
```
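The manual click-through can be backed by a lightweight automated check: flag any cited article ID that was not among the retrieved context documents. The `KB-` ID format and helper below are assumptions for illustration.

```python
import re

# Sketch of an automated grounding check. An answer citing an article
# that was never in the retrieved context is a hallucinated citation.
def uncited_sources(answer: str, context_ids: set[str]) -> set[str]:
    """Return citations in the answer that cannot be traced to the context."""
    cited = set(re.findall(r"KB-\d+", answer))
    return cited - context_ids

answer = "According to KB-2847, annual plans are eligible for prorated refunds."
print(uncited_sources(answer, {"KB-2847", "KB-1100"}))  # → set()
```

A non-empty result doesn't prove the answer is wrong, but it is exactly the "flag for manual lookup" signal the case study below describes.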

Strategy 2: Confidence Calibration
Ask the AI to rate its confidence. Low confidence = more verification needed.

```text
Prompt: "Answer this question and rate your confidence
(high/medium/low) based on how clearly our documentation
addresses it."
```

Strategy 3: Cross-Reference
For important outputs, verify against authoritative sources outside the AI.

| Output Type | Verify Against |
| --- | --- |
| Policy statements | Official policy documents |
| Technical claims | Documentation, SMEs |
| Data/statistics | Source systems, reports |
| Legal/compliance | Legal team review |

Strategy 4: The Smell Test
If something seems too good, too specific, or too convenient—verify it. AI is better at sounding right than being right.

Building Verification Habits

| Context | Verification Level |
| --- | --- |
| Internal brainstorming | Light (directional accuracy sufficient) |
| Customer communications | Medium (verify key claims) |
| Published content | High (fact-check everything) |
| Legal/financial/medical | Maximum (expert review required) |

Case Study: The Grounded Assistant

Company: B2B Software Company

Goal: Deploy a GenAI assistant to help support agents resolve tickets faster.

Initial Approach: Connected the AI to the full knowledge base (5,000+ articles) via RAG.

Problems Discovered:

  1. 40% of articles were outdated (last updated 2+ years ago)
  2. Multiple articles covered the same topics with conflicting info
  3. Agents trusted AI answers without verification, leading to customer complaints
  4. Sensitive customer data was being pasted into prompts

Data-First Response:

  1. Knowledge Base Cleanup: Archived 2,000 stale articles. Consolidated duplicates. Added "last reviewed" metadata.

  2. Source Attribution: Modified prompts to always cite the source article. Agents trained to click through and verify.

  3. Confidence Indicators: Added visual confidence scores. Low-confidence answers flagged for manual lookup.

  4. Data Handling Training: Trained agents on what data can/cannot be included in prompts. Added PII detection warnings.

Result: After 90 days, ticket resolution time decreased 25%. Customer complaints about incorrect information dropped 60%. Agents reported higher trust in the tool because they understood its limitations.

Key Learning: GenAI readiness isn't about the AI—it's about the data ecosystem around it.


The GenAI Data Readiness Assessment

Before deploying a GenAI solution, assess your data readiness:


Evaluate your readiness for a GenAI deployment. Be honest—gaps here cause problems later.

  • Content is current (reviewed within appropriate timeframes)
  • Key topics have single, authoritative sources (no contradictions)
  • Content is written clearly (not just technically accurate)
  • Search/retrieval returns relevant results for common queries
  • We have clear guidelines on what data can be used with GenAI
  • Users are trained on data classification and handling
  • We're using enterprise-grade tools with appropriate data agreements
  • There are controls to prevent accidental sensitive data exposure
  • Users understand that GenAI outputs require verification
  • There are processes to verify high-stakes outputs before use
  • Source attribution is enabled so users can check references
  • There's a process to report and correct AI errors
  • Knowledge base has clear ownership and update processes
  • Usage is monitored for policy compliance

Completion: Data Readiness Memo

To complete this module, prepare a Data Readiness Memo for a GenAI use case in your area.

Your memo should include:

  1. Use Case Summary: One sentence describing the GenAI application
  2. Knowledge/Context Assessment: What data will the AI use? Is it current, accurate, and accessible?
  3. Privacy Review: What sensitive data might users be tempted to include? How will you prevent exposure?
  4. Verification Plan: How will users verify AI outputs? What level of verification is appropriate?
  5. Recommendation: Ready to proceed, needs preparation, or significant gaps to address first?

Assessment Rubric:

| Criterion | What We're Looking For |
| --- | --- |
| Thoroughness | All four areas (knowledge, privacy, verification, governance) addressed |
| Realism | Honest assessment of gaps, not optimistic assumptions |
| Actionability | Clear recommendation with specific next steps |
| Risk Awareness | Appropriate mitigation strategies for identified risks |

Practical Exercise

Complete an artifact to demonstrate your skills


Key Takeaways

  • GenAI doesn't learn from your data—it uses data as context for pre-trained models
  • Knowledge base quality directly determines RAG output quality: garbage in, garbage out
  • Protect sensitive data: treat every prompt as potentially visible externally
  • Verification is the core skill: GenAI sounds authoritative even when wrong
  • Data readiness for GenAI = clean knowledge bases + data handling training + verification habits


Next Steps

In the next module, we'll explore Responsible AI and Ethics Essentials—understanding how to deploy GenAI fairly, transparently, and in compliance with organizational policies.