Practitioner Track · Module 3
Data Literacy for GenAI
Understand how data flows through GenAI systems, learn to curate effective context, protect sensitive information, and verify AI outputs against source data.
- Understand how GenAI uses data differently than traditional AI
- Curate effective context for prompts and RAG systems
- Protect sensitive data when using GenAI tools
- Verify and ground AI outputs against authoritative sources
Why Data Literacy Matters for GenAI
GenAI doesn't learn from your data the way traditional AI does. When you prompt ChatGPT or Claude, you're not training a model—you're providing context for a pre-trained model to work with. This changes what "data literacy" means.
For GenAI practitioners, data literacy is about:
- What goes in: Curating the right context for accurate, relevant outputs
- What stays out: Protecting sensitive information from exposure
- What comes back: Verifying outputs against authoritative sources
The GenAI Data Flow
| Stage | Traditional ML | GenAI |
|---|---|---|
| Training | Your historical data trains the model | Pre-trained on public data (not yours) |
| Input | Structured features | Natural language prompts + context |
| Output | Predictions, classifications | Generated text, analysis, media |
| Learning | Model improves with more data | Model doesn't learn from your usage* |
*Note: Some enterprise agreements allow usage data for model improvement. Know your vendor's data policy.
Workplace Scenario: The Knowledge Gap
You're rolling out a GenAI assistant to help customer support agents find answers faster. The tool is connected to your knowledge base via RAG (Retrieval-Augmented Generation).
Early feedback is mixed:
- "It gave me a policy from 2019 that we don't use anymore"
- "It couldn't find anything about the new pricing tier"
- "It confidently told a customer something that was just wrong"
The problem isn't the AI—it's the data feeding it.
This scenario plays out constantly. GenAI is only as good as the context it receives.
Curating Context: What Goes In
The RAG Reality
RAG (Retrieval-Augmented Generation) connects GenAI to your organization's knowledge. When a user asks a question, the system:
1. Searches your knowledge base for relevant documents
2. Includes those documents as context in the prompt
3. Generates an answer grounded in your content
RAG quality depends entirely on knowledge base quality.
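To make those three steps concrete, here is a minimal sketch of the request cycle in Python. The in-memory corpus, the `search_knowledge_base` function, and the final model call are hypothetical stand-ins for a real search index and model API; only the shape of the flow is the point.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    title: str
    body: str

# Hypothetical stand-in for a real search index (vector or keyword).
def search_knowledge_base(query: str, top_k: int = 3) -> list[Document]:
    corpus = [
        Document("KB-001", "Refund Policy", "Annual plans are refundable pro rata."),
        Document("KB-002", "Pricing Tiers", "We offer Basic, Pro, and Enterprise tiers."),
    ]
    words = query.lower().split()
    return [d for d in corpus if any(w in d.body.lower() for w in words)][:top_k]

def build_prompt(question: str, docs: list[Document]) -> str:
    # Step 2: include retrieved documents as context in the prompt.
    context = "\n\n".join(f"[{d.doc_id}] {d.title}\n{d.body}" for d in docs)
    return ("Answer using ONLY the context below, and cite document IDs.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

question = "What is the refund policy for annual plans?"
docs = search_knowledge_base(question)   # Step 1: retrieve
prompt = build_prompt(question, docs)    # Step 2: assemble context
print(prompt)                            # Step 3: send this to your model API
```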
Knowledge Base Health Checklist
| Dimension | Question | Impact on GenAI |
|---|---|---|
| Currency | Is content up to date? | Outdated docs = outdated answers |
| Coverage | Are all topics documented? | Gaps = "I don't know" or hallucination |
| Accuracy | Is the content correct? | Wrong source = wrong answer |
| Findability | Can search retrieve it? | Unfindable = effectively doesn't exist |
| Clarity | Is it written clearly? | Ambiguous source = ambiguous output |
Common Knowledge Base Problems
Problem 1: The Graveyard
Your knowledge base has 10,000 articles. 6,000 haven't been updated in 3+ years. The AI retrieves outdated content and presents it as current policy.
Fix: Implement content lifecycle management. Archive or flag stale content. Add "last reviewed" dates that RAG can filter on.
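As an illustration, a "last reviewed" field makes staleness filterable at retrieval time. The sketch below assumes a hypothetical `last_reviewed` metadata field on each article and a two-year freshness window; adjust both to your own lifecycle policy.

```python
from datetime import date, timedelta

# Hypothetical article records carrying a `last_reviewed` metadata field.
articles = [
    {"id": "KB-101", "title": "Return policy", "last_reviewed": date(2024, 11, 2)},
    {"id": "KB-102", "title": "Legacy pricing", "last_reviewed": date(2019, 5, 14)},
]

MAX_AGE = timedelta(days=365 * 2)  # freshness window: tune to your lifecycle policy

def split_by_freshness(docs: list[dict]) -> tuple[list[dict], list[dict]]:
    # Fresh docs stay retrievable; stale docs get flagged for review or archival.
    today = date.today()
    fresh = [d for d in docs if today - d["last_reviewed"] <= MAX_AGE]
    stale = [d for d in docs if today - d["last_reviewed"] > MAX_AGE]
    return fresh, stale

fresh, stale = split_by_freshness(articles)
print("Retrievable:", [d["id"] for d in fresh])
print("Flag for review:", [d["id"] for d in stale])
```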
Problem 2: The Gaps
A new product launched last quarter. The knowledge base has minimal documentation. The AI either says "I don't know" or hallucinates based on similar products.
Fix: Treat knowledge base updates as part of product launch readiness. No launch without documentation.
Problem 3: The Contradictions
Three different articles describe the return policy, each slightly differently. The AI retrieves one at random, giving inconsistent answers.
Fix: Establish single sources of truth for key topics. Consolidate duplicates. Use canonical URLs.
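One way to surface candidate duplicates is a pairwise similarity pass over article text. The sketch below uses a crude, dependency-free token-overlap (Jaccard) score as a stand-in for embedding-based similarity; the threshold is a made-up starting point you would tune on real content.

```python
def jaccard(a: str, b: str) -> float:
    # Token-overlap similarity: crude, but needs no external libraries.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

articles = {
    "KB-201": "Items may be returned within 30 days for a full refund.",
    "KB-202": "Returns are accepted within 30 days for a full refund.",
    "KB-203": "Enterprise plans include a dedicated account manager.",
}

THRESHOLD = 0.5  # illustrative; tune on your own content
ids = list(articles)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        score = jaccard(articles[a], articles[b])
        if score >= THRESHOLD:
            print(f"Possible duplicates: {a} / {b} (similarity {score:.2f})")
```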
Interactive: Knowledge Base Triage
For each scenario, identify the knowledge base issue and recommend an action:
Protecting Sensitive Data: What Stays Out
The Privacy Risk
Every prompt you send to a GenAI tool is data leaving your control. Depending on your vendor agreement, that data may be:
- Stored in logs
- Used to improve the model
- Accessible to the vendor's employees
- Subject to legal requests
The rule is simple: Don't put anything in a prompt that you wouldn't put in an email to an external party.
Data Classification for GenAI
| Classification | Examples | GenAI Guidance |
|---|---|---|
| Public | Marketing materials, public docs | Safe to use freely |
| Internal | Internal processes, non-sensitive business info | Generally safe with enterprise tools |
| Confidential | Customer PII, financial data, HR records | Never in public tools; caution even in enterprise |
| Restricted | Trade secrets, M&A info, legal matters | Avoid GenAI entirely or use air-gapped solutions |
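If your tooling can tag data before it reaches a prompt, the table above reduces to a simple policy lookup. This is a hypothetical encoding, not a standard API; the levels and guidance strings just mirror the table.

```python
from enum import Enum

class Classification(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"

# Mirrors the guidance table above; adapt to your organization's policy.
GUIDANCE = {
    Classification.PUBLIC: "Safe to use freely.",
    Classification.INTERNAL: "Generally safe with approved enterprise tools.",
    Classification.CONFIDENTIAL: "Never in public tools; caution even in enterprise.",
    Classification.RESTRICTED: "Avoid GenAI entirely, or use air-gapped solutions.",
}

def allowed_in_consumer_tool(level: Classification) -> bool:
    # Only public data belongs in a consumer-grade GenAI tool.
    return level is Classification.PUBLIC

print(GUIDANCE[Classification.CONFIDENTIAL])
print(allowed_in_consumer_tool(Classification.INTERNAL))  # False
```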
Common Mistakes
Mistake 1: Pasting Customer Data
"Summarize this customer complaint: [full name, account number, medical condition, detailed complaint]"
Better: Remove PII before pasting, or use anonymized examples.
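For illustration, here is a naive redaction pass run before text enters a prompt. The regex patterns and the ACCT- account-number format are made up, and simple regexes will miss names, medical details, and most real-world PII; treat this as a sketch of the workflow, not a substitute for a proper PII scanner.

```python
import re

# Naive illustrative patterns; real PII detection needs a dedicated tool.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ACCOUNT": re.compile(r"\bACCT-\d{6,}\b"),  # hypothetical account-number format
}

def redact(text: str) -> str:
    # Replace matches with placeholder tags before the text enters a prompt.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

complaint = "Customer jane@example.com (ACCT-9917342) called from 555-120-4477."
print(redact(complaint))
# Customer [EMAIL] ([ACCOUNT]) called from [PHONE].
```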
Mistake 2: Sharing Proprietary Code
"Debug this code from our trading algorithm: [proprietary logic]"
Better: Abstract the problem, use pseudocode, or use an air-gapped coding assistant.
Mistake 3: Strategic Information
"Help me draft talking points for the acquisition of [company name]"
Better: Use generic placeholders until the information is public.
Enterprise vs. Consumer Tools
| Aspect | Consumer (ChatGPT free) | Enterprise (ChatGPT Enterprise, Claude for Work) |
|---|---|---|
| Data retention | May retain prompts | Typically no retention |
| Model training | May use your data | Your data excluded |
| Access controls | None | SSO, audit logs |
| Compliance | Limited | SOC 2, HIPAA options |
Know your organization's approved tools. Using consumer tools for work data may violate policy—and create real risk.
Verifying Outputs: Trust but Verify
The Hallucination Problem
GenAI models generate plausible text—but plausible isn't the same as true. They can:
- Invent facts that sound authoritative
- Cite sources that don't exist
- Confidently state outdated information
- Mix accurate and inaccurate details seamlessly
Verification isn't optional. It's the core skill of GenAI literacy.
Verification Strategies
Strategy 1: Source Attribution
Ask the AI to cite its sources. Then check those sources.

```
Prompt: "What is our refund policy for annual subscriptions?
Cite the specific knowledge base article."

Output: "According to KB-2847 'Subscription Refunds',
annual plans are eligible for prorated refunds..."

Verification: Open KB-2847. Confirm it says what the AI claims.
```
Strategy 2: Confidence Calibration
Ask the AI to rate its confidence. Low confidence = more verification needed.

```
Prompt: "Answer this question and rate your confidence
(high/medium/low) based on how clearly our documentation
addresses it."
```
Strategy 3: Cross-Reference
For important outputs, verify against authoritative sources outside the AI.
| Output Type | Verify Against |
|---|---|
| Policy statements | Official policy documents |
| Technical claims | Documentation, SMEs |
| Data/statistics | Source systems, reports |
| Legal/compliance | Legal team review |
Strategy 4: The Smell Test
If something seems too good, too specific, or too convenient, verify it. AI is better at sounding right than being right.
Building Verification Habits
| Context | Verification Level |
|---|---|
| Internal brainstorming | Light (directional accuracy sufficient) |
| Customer communications | Medium (verify key claims) |
| Published content | High (fact-check everything) |
| Legal/financial/medical | Maximum (expert review required) |
Case Study: The Grounded Assistant
Company: B2B Software Company
Goal: Deploy a GenAI assistant to help support agents resolve tickets faster.
Initial Approach: Connected the AI to the full knowledge base (5,000+ articles) via RAG.
Problems Discovered:
- 40% of articles were outdated (last updated 2+ years ago)
- Multiple articles covered the same topics with conflicting info
- Agents trusted AI answers without verification, leading to customer complaints
- Sensitive customer data was being pasted into prompts
Data-First Response:
- Knowledge Base Cleanup: Archived 2,000 stale articles. Consolidated duplicates. Added "last reviewed" metadata.
- Source Attribution: Modified prompts to always cite the source article. Agents trained to click through and verify.
- Confidence Indicators: Added visual confidence scores. Low-confidence answers flagged for manual lookup.
- Data Handling Training: Trained agents on what data can and cannot be included in prompts. Added PII detection warnings.
Result: After 90 days, ticket resolution time decreased 25%. Customer complaints about incorrect information dropped 60%. Agents reported higher trust in the tool because they understood its limitations.
Key Learning: GenAI readiness isn't about the AI—it's about the data ecosystem around it.
The GenAI Data Readiness Assessment
Before deploying a GenAI solution, assess your data readiness honestly. Gaps here cause problems later.
Completion: Data Readiness Memo
To complete this module, prepare a Data Readiness Memo for a GenAI use case in your area.
Your memo should include:
- Use Case Summary: One sentence describing the GenAI application
- Knowledge/Context Assessment: What data will the AI use? Is it current, accurate, and accessible?
- Privacy Review: What sensitive data might users be tempted to include? How will you prevent exposure?
- Verification Plan: How will users verify AI outputs? What level of verification is appropriate?
- Recommendation: Ready to proceed, needs preparation, or significant gaps to address first?
Assessment Rubric:
| Criterion | What We're Looking For |
|---|---|
| Thoroughness | All four assessment areas (knowledge/context, privacy, verification, recommendation) addressed |
| Realism | Honest assessment of gaps, not optimistic assumptions |
| Actionability | Clear recommendation with specific next steps |
| Risk Awareness | Appropriate mitigation strategies for identified risks |
Key Takeaways
- GenAI doesn't learn from your data—it uses data as context for pre-trained models
- Knowledge base quality directly determines RAG output quality: garbage in, garbage out
- Protect sensitive data: treat every prompt as potentially visible externally
- Verification is the core skill: GenAI sounds authoritative even when wrong
- Data readiness for GenAI = clean knowledge bases + data handling training + verification habits
Next Steps
In the next module, we'll explore Responsible AI and Ethics Essentials—understanding how to deploy GenAI fairly, transparently, and in compliance with organizational policies.