Practitioner Track · Module 3

Data Literacy for GenAI

Understand how data flows through GenAI systems, learn to curate effective context, protect sensitive information, and verify AI outputs against source data.

25 min
175 XP
Jan 2026
Learning Objectives
  • Understand how GenAI uses data differently than traditional AI
  • Curate effective context for prompts and RAG systems
  • Protect sensitive data when using GenAI tools
  • Verify and ground AI outputs against authoritative sources

Why Data Literacy Matters for GenAI

GenAI doesn't learn from your data the way traditional AI does. When you prompt ChatGPT or Claude, you're not training a model—you're providing context for a pre-trained model to work with. This changes what "data literacy" means.

For GenAI practitioners, data literacy is about:

  • What goes in: Curating the right context for accurate, relevant outputs
  • What stays out: Protecting sensitive information from exposure
  • What comes back: Verifying outputs against authoritative sources

The GenAI Data Flow

| Stage | Traditional ML | GenAI |
| --- | --- | --- |
| Training | Your historical data trains the model | Pre-trained on public data (not yours) |
| Input | Structured features | Natural language prompts + context |
| Output | Predictions, classifications | Generated text, analysis, media |
| Learning | Model improves with more data | Model doesn't learn from your usage* |

*Note: Some enterprise agreements allow usage data for model improvement. Know your vendor's data policy.


Workplace Scenario: The Knowledge Gap

You're rolling out a GenAI assistant to help customer support agents find answers faster. The tool is connected to your knowledge base via RAG (Retrieval-Augmented Generation).

Early feedback is mixed:

  • "It gave me a policy from 2019 that we don't use anymore"
  • "It couldn't find anything about the new pricing tier"
  • "It confidently told a customer something that was just wrong"

The problem isn't the AI—it's the data feeding it.

This scenario plays out constantly. GenAI is only as good as the context it receives.


Curating Context: What Goes In

The RAG Reality

RAG (Retrieval-Augmented Generation) connects GenAI to your organization's knowledge. When a user asks a question, the system:

  1. Searches your knowledge base for relevant documents
  2. Includes those documents as context in the prompt
  3. Generates an answer grounded in your content

RAG quality depends entirely on knowledge base quality.
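The three steps above can be sketched in a few lines. This is a toy illustration, assuming a two-article knowledge base and naive keyword-overlap scoring in place of a real document store and vector search; the article IDs and functions are hypothetical.

```python
import re

# Minimal RAG sketch. The knowledge base and keyword-overlap scoring are
# toy stand-ins for a real document store and vector search.
KNOWLEDGE_BASE = [
    {"id": "KB-101", "title": "Refund Policy",
     "text": "Annual plans get prorated refunds within 30 days."},
    {"id": "KB-102", "title": "Pricing Tiers",
     "text": "The enterprise tier includes SSO and audit logs."},
]

def tokens(s: str) -> set[str]:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def retrieve(query: str, top_k: int = 1) -> list[dict]:
    """Step 1: rank articles by keyword overlap with the query."""
    q = tokens(query)
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda doc: len(q & tokens(doc["text"])),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query: str) -> str:
    """Steps 2-3: include retrieved articles as context so the answer is grounded."""
    docs = retrieve(query)
    context = "\n".join(f'[{d["id"]}] {d["title"]}: {d["text"]}' for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What is the refund policy for annual plans?")
print(prompt)
```

Notice that whatever `retrieve` returns is all the model gets to work with: if the wrong or stale article ranks highest, the generated answer inherits that flaw.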

Knowledge Base Health Checklist

| Dimension | Question | Impact on GenAI |
| --- | --- | --- |
| Currency | Is content up to date? | Outdated docs = outdated answers |
| Coverage | Are all topics documented? | Gaps = "I don't know" or hallucination |
| Accuracy | Is the content correct? | Wrong source = wrong answer |
| Findability | Can search retrieve it? | Unfindable = effectively doesn't exist |
| Clarity | Is it written clearly? | Ambiguous source = ambiguous output |

Common Knowledge Base Problems

Problem 1: The Graveyard
Your knowledge base has 10,000 articles. 6,000 haven't been updated in 3+ years. The AI retrieves outdated content and presents it as current policy.

Fix: Implement content lifecycle management. Archive or flag stale content. Add "last reviewed" dates that RAG can filter on.
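A "last reviewed" filter like the one described here is simple to sketch. The article IDs, metadata field name, and one-year window below are illustrative assumptions, not a prescribed schema.

```python
from datetime import date, timedelta

# Hypothetical article metadata carrying a "last_reviewed" date the
# retriever can filter on before anything reaches the prompt.
articles = [
    {"id": "KB-12", "last_reviewed": date(2025, 11, 1)},
    {"id": "KB-34", "last_reviewed": date(2019, 3, 15)},
]

def fresh_articles(articles, max_age_days=365, today=None):
    """Exclude articles that were not reviewed within the allowed window."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    return [a for a in articles if a["last_reviewed"] >= cutoff]

print([a["id"] for a in fresh_articles(articles, today=date(2026, 1, 15))])  # → ['KB-12']
```

Applied as a retrieval filter, the 2019 article simply never reaches the model, so it cannot be presented as current policy.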

Problem 2: The Gaps
A new product launched last quarter. The knowledge base has minimal documentation. The AI either says "I don't know" or hallucinates based on similar products.

Fix: Treat knowledge base updates as part of product launch readiness. No launch without documentation.

Problem 3: The Contradictions
Three different articles describe the return policy—each slightly differently. The AI retrieves one at random, giving inconsistent answers.

Fix: Establish single sources of truth for key topics. Consolidate duplicates. Use canonical URLs.


Interactive: Knowledge Base Triage

For each scenario, identify the knowledge base issue and recommend an action:


Scenarios:

  • AI misinterprets a policy because it's written in legal jargon (technically accurate, but poor comprehension)
  • AI can't answer questions about the new enterprise tier (sales team frustrated by gaps in product knowledge)
  • AI gives different refund timelines to different customers (three articles describe the policy differently)
  • AI cites a process that was replaced 6 months ago (support agents report getting outdated procedures)
  • AI retrieves irrelevant articles about similar topics (search returns tangentially related content)

Protecting Sensitive Data: What Stays Out

The Privacy Risk

Every prompt you send to a GenAI tool is data leaving your control. Depending on your vendor agreement, that data may be:

  • Stored in logs
  • Used to improve the model
  • Accessible to the vendor's employees
  • Subject to legal requests

The rule is simple: Don't put anything in a prompt that you wouldn't put in an email to an external party.

Data Classification for GenAI

| Classification | Examples | GenAI Guidance |
| --- | --- | --- |
| Public | Marketing materials, public docs | Safe to use freely |
| Internal | Internal processes, non-sensitive business info | Generally safe with enterprise tools |
| Confidential | Customer PII, financial data, HR records | Never in public tools; caution even in enterprise |
| Restricted | Trade secrets, M&A info, legal matters | Avoid GenAI entirely or use air-gapped solutions |

Common Mistakes

Mistake 1: Pasting Customer Data

"Summarize this customer complaint: [full name, account number, medical condition, detailed complaint]"

Better: Remove PII before pasting, or use anonymized examples.
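Removing PII before pasting can be partially automated. The sketch below is a minimal regex-based scrub; the patterns and the `ACCT-` account-number format are illustrative assumptions, and a real deployment would use a dedicated PII-detection service rather than hand-rolled regexes.

```python
import re

# Toy redaction patterns -- illustrative assumptions only. A production
# system would use a dedicated PII-detection service with far broader
# coverage (names, addresses, medical terms, etc.).
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "ACCOUNT": r"\bACCT-\d{6,}\b",      # hypothetical account-number format
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",  # US-style numbers only
}

def redact(text: str) -> str:
    """Replace detected PII with placeholder tokens before prompting."""
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

complaint = "Customer jane.doe@example.com (ACCT-123456, 555-867-5309) reports a billing error."
print(redact(complaint))  # → Customer [EMAIL] ([ACCOUNT], [PHONE]) reports a billing error.
```

Placeholder tokens like `[EMAIL]` preserve the structure of the complaint, so the model can still summarize it usefully without ever seeing the identifying details.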

Mistake 2: Sharing Proprietary Code

"Debug this code from our trading algorithm: [proprietary logic]"

Better: Abstract the problem, use pseudocode, or use an air-gapped coding assistant.

Mistake 3: Strategic Information

"Help me draft talking points for the acquisition of [company name]"

Better: Use generic placeholders until the information is public.

Enterprise vs. Consumer Tools

| Aspect | Consumer (ChatGPT free) | Enterprise (ChatGPT Enterprise, Claude for Work) |
| --- | --- | --- |
| Data retention | May retain prompts | Typically no retention |
| Model training | May use your data | Your data excluded |
| Access controls | None | SSO, audit logs |
| Compliance | Limited | SOC 2, HIPAA options |

Know your organization's approved tools. Using consumer tools for work data may violate policy—and create real risk.

Knowledge Check

Test your understanding with a quick quiz


Verifying Outputs: Trust but Verify

The Hallucination Problem

GenAI models generate plausible text—but plausible isn't the same as true. They can:

  • Invent facts that sound authoritative
  • Cite sources that don't exist
  • Confidently state outdated information
  • Mix accurate and inaccurate details seamlessly

Verification isn't optional. It's the core skill of GenAI literacy.

Verification Strategies

Strategy 1: Source Attribution
Ask the AI to cite its sources. Then check those sources.

```text
Prompt: "What is our refund policy for annual subscriptions?
Cite the specific knowledge base article."

Output: "According to KB-2847 'Subscription Refunds',
annual plans are eligible for prorated refunds..."

Verification: Open KB-2847. Confirm it says what the AI claims.
```
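The manual click-through can be backed by a lightweight automated check: flag any cited article ID that was not among the retrieved context documents. The `KB-` ID format and helper below are assumptions for illustration.

```python
import re

# Sketch of an automated grounding check. An answer citing an article
# that was never in the retrieved context is a hallucinated citation.
def uncited_sources(answer: str, context_ids: set[str]) -> set[str]:
    """Return citations in the answer that cannot be traced to the context."""
    cited = set(re.findall(r"KB-\d+", answer))
    return cited - context_ids

answer = "According to KB-2847, annual plans are eligible for prorated refunds."
print(uncited_sources(answer, {"KB-2847", "KB-1100"}))  # → set()
```

A non-empty result doesn't prove the answer is wrong, but it is exactly the "flag for manual lookup" signal the case study below describes.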

Strategy 2: Confidence Calibration
Ask the AI to rate its confidence. Low confidence = more verification needed.

```text
Prompt: "Answer this question and rate your confidence
(high/medium/low) based on how clearly our documentation
addresses it."
```

Strategy 3: Cross-Reference
For important outputs, verify against authoritative sources outside the AI.

| Output Type | Verify Against |
| --- | --- |
| Policy statements | Official policy documents |
| Technical claims | Documentation, SMEs |
| Data/statistics | Source systems, reports |
| Legal/compliance | Legal team review |

Strategy 4: The Smell Test
If something seems too good, too specific, or too convenient—verify it. AI is better at sounding right than being right.

Building Verification Habits

| Context | Verification Level |
| --- | --- |
| Internal brainstorming | Light (directional accuracy sufficient) |
| Customer communications | Medium (verify key claims) |
| Published content | High (fact-check everything) |
| Legal/financial/medical | Maximum (expert review required) |

Case Study: The Grounded Assistant

Company: B2B Software Company

Goal: Deploy a GenAI assistant to help support agents resolve tickets faster.

Initial Approach: Connected the AI to the full knowledge base (5,000+ articles) via RAG.

Problems Discovered:

  1. 40% of articles were outdated (last updated 2+ years ago)
  2. Multiple articles covered the same topics with conflicting info
  3. Agents trusted AI answers without verification, leading to customer complaints
  4. Sensitive customer data was being pasted into prompts

Data-First Response:

  1. Knowledge Base Cleanup: Archived 2,000 stale articles. Consolidated duplicates. Added "last reviewed" metadata.

  2. Source Attribution: Modified prompts to always cite the source article. Agents trained to click through and verify.

  3. Confidence Indicators: Added visual confidence scores. Low-confidence answers flagged for manual lookup.

  4. Data Handling Training: Trained agents on what data can/cannot be included in prompts. Added PII detection warnings.

Result: After 90 days, ticket resolution time decreased 25%. Customer complaints about incorrect information dropped 60%. Agents reported higher trust in the tool because they understood its limitations.

Key Learning: GenAI readiness isn't about the AI—it's about the data ecosystem around it.


The GenAI Data Readiness Assessment

Before deploying a GenAI solution, assess your data readiness:


Evaluate your readiness for a GenAI deployment. Be honest—gaps here cause problems later.

  • Content is current (reviewed within appropriate timeframes)
  • Key topics have single, authoritative sources (no contradictions)
  • Content is written clearly (not just technically accurate)
  • Search/retrieval returns relevant results for common queries
  • We have clear guidelines on what data can be used with GenAI
  • Users are trained on data classification and handling
  • We're using enterprise-grade tools with appropriate data agreements
  • There are controls to prevent accidental sensitive data exposure
  • Users understand that GenAI outputs require verification
  • There are processes to verify high-stakes outputs before use
  • Source attribution is enabled so users can check references
  • There's a process to report and correct AI errors
  • Knowledge base has clear ownership and update processes
  • Usage is monitored for policy compliance

Completion: Data Readiness Memo

To complete this module, prepare a Data Readiness Memo for a GenAI use case in your area.

Your memo should include:

  1. Use Case Summary: One sentence describing the GenAI application
  2. Knowledge/Context Assessment: What data will the AI use? Is it current, accurate, and accessible?
  3. Privacy Review: What sensitive data might users be tempted to include? How will you prevent exposure?
  4. Verification Plan: How will users verify AI outputs? What level of verification is appropriate?
  5. Recommendation: Ready to proceed, needs preparation, or significant gaps to address first?

Assessment Rubric:

| Criterion | What We're Looking For |
| --- | --- |
| Thoroughness | All four areas (knowledge, privacy, verification, governance) addressed |
| Realism | Honest assessment of gaps, not optimistic assumptions |
| Actionability | Clear recommendation with specific next steps |
| Risk Awareness | Appropriate mitigation strategies for identified risks |

Practical Exercise

Complete an artifact to demonstrate your skills


Key Takeaways

  • GenAI doesn't learn from your data—it uses data as context for pre-trained models
  • Knowledge base quality directly determines RAG output quality: garbage in, garbage out
  • Protect sensitive data: treat every prompt as potentially visible externally
  • Verification is the core skill: GenAI sounds authoritative even when wrong
  • Data readiness for GenAI = clean knowledge bases + data handling training + verification habits


Next Steps

In the next module, we'll explore Responsible AI and Ethics Essentials—understanding how to deploy GenAI fairly, transparently, and in compliance with organizational policies.