"Data Engineering for Large Models: Architecture, Algorithms, and Project Practice"¶
Book Overview¶
- Part 1: Infrastructure & Core Concepts (Building the Data Foundation)
- Part 2: Text Pre-training Data Engineering (Cleaning & Denoising)
- Part 3: Multimodal Data Engineering (Image-Text, Video & Audio)
- Part 4: Alignment & Synthetic Data Engineering (Instructions & Quality)
- Part 5: Application-level Data Engineering (RAG & Agent)
- Part 6: Capstone Projects (End-to-end Code Implementations)
Detailed Outline
Part 1: Infrastructure & Core Concepts
Goal: Establish a Data-Centric AI mindset and set up a high-performance data processing environment.
Chapter 1: Data Revolution in the LLM Era
- 1.1 Insights from Scaling Laws: Data quality > quantity — the paradigm shift from "big data" to "high-quality data."
- 1.2 LLM Data Lifecycle: Pre-training → SFT → RLHF → RAG.
- 1.3 Challenges & Opportunities: The interplay of heterogeneous multimodality, copyright compliance, and compute costs.
Chapter 2: Data Infrastructure Selection
- 2.1 Modern Data Stack:
  - Storage: Object storage (S3/MinIO) vs data lakes (Iceberg/Hudi).
  - Compute: Spark (traditional powerhouse) vs Ray Data (AI-native compute framework).
- 2.2 Data Formats & I/O Optimization (see the sketch below):
  - Parquet vs JSONL vs WebDataset (multimodal scenarios).
  - Compression algorithms and read performance optimization.
- 2.3 Data Version Control (DataOps): Managing PB-scale datasets with DVC and LakeFS.
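A minimal sketch of the JSONL-to-Parquet conversion behind 2.2, using pyarrow; the file names, the "text" column, and the zstd compression choice are illustrative assumptions.

```python
# Convert a JSONL shard to compressed Parquet and read it back (file names are placeholders).
import pyarrow.json as paj
import pyarrow.parquet as pq

# Read newline-delimited JSON into an Arrow table.
table = paj.read_json("shard-00000.jsonl")

# Write columnar Parquet with zstd compression; modest row groups keep reads selective.
pq.write_table(table, "shard-00000.parquet", compression="zstd", row_group_size=64_000)

# Reading back only the columns you need is where Parquet beats JSONL on I/O.
texts = pq.read_table("shard-00000.parquet", columns=["text"])
print(texts.num_rows)
```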
Part 2: Text Pre-training Data Engineering
Goal: Process massive unstructured text to build the model's linguistic cognitive foundation.
Chapter 3: Data Acquisition
- 3.1 Deconstructing Open-source Datasets: Deep analysis of Common Crawl, C4, RefinedWeb, and The Pile.
- 3.2 High-performance Web Crawling: Using the Trafilatura parsing library (sketch below) and designing a distributed crawler architecture.
- 3.3 Specialized Data Acquisition: Extraction strategies for code (GitHub), papers (ArXiv/S2ORC), and book data.
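A minimal extraction sketch for 3.2 using Trafilatura; the URL is a placeholder, and a production crawler would distribute fetching and respect robots.txt.

```python
# Fetch a page and extract its main text with Trafilatura (URL is a placeholder).
import trafilatura

url = "https://example.com/article"
downloaded = trafilatura.fetch_url(url)      # raw HTML, or None on failure

if downloaded:
    # extract() strips boilerplate (navigation, ads, footers) and returns the main content.
    text = trafilatura.extract(downloaded, include_comments=False, include_tables=False)
    if text:
        print(text[:500])
```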
Chapter 4: Cleaning & Deduplication
- 4.1 Heuristic Filtering Rules: Language identification (FastText), perplexity filtering, length and punctuation distribution.
- 4.2 Large-scale Deduplication (see the MinHash sketch below):
  - Fuzzy Deduplication: MinHash LSH algorithm principles and distributed implementation.
  - Intra-document Deduplication: Eliminating repeated paragraphs and navigation bars.
- 4.3 Privacy Cleaning (PII Removal): Using Presidio to identify and mask emails, IPs, phone numbers, and addresses.
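A toy illustration of the MinHash LSH deduplication in 4.2 using the datasketch library; the threshold, tokenization, and documents are illustrative, and at corpus scale the signatures and LSH banding run distributed on Spark or Ray.

```python
# Toy fuzzy deduplication with MinHash LSH (datasketch); threshold and tokenization are illustrative.
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    """MinHash over word tokens; production pipelines usually shingle into n-grams."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",   # near-duplicate of "a"
    "c": "completely unrelated sentence about data engineering",
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)
kept = []
for key, text in docs.items():
    sig = signature(text)
    if lsh.query(sig):        # a similar document was already kept -> drop this one
        continue
    lsh.insert(key, sig)
    kept.append(key)

print(kept)   # typically ['a', 'c']; 'b' is dropped as a near-duplicate of 'a'
```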
Chapter 5: Tokenization & Serialization
- 5.1 Tokenizer Principles: BPE, WordPiece, Unigram and their impact on model performance.
- 5.2 Efficient Vocabulary Construction: How to expand vocabularies for specific domains (e.g., medical, legal).
- 5.3 Data Mixing: Dynamic sampling strategies and Curriculum Learning data arrangement.
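A minimal data-mixing sketch for 5.3 with Hugging Face Datasets; the three toy sources and the 70/20/10 sampling weights are invented for illustration.

```python
# Mix corpora by sampling probability with Hugging Face Datasets (sources and weights are illustrative).
from datasets import Dataset, interleave_datasets

web   = Dataset.from_dict({"text": [f"web doc {i}" for i in range(1000)]})
code  = Dataset.from_dict({"text": [f"code file {i}" for i in range(1000)]})
books = Dataset.from_dict({"text": [f"book page {i}" for i in range(1000)]})

# 70% web, 20% code, 10% books; sampling (with oversampling) continues until every source is seen.
mixed = interleave_datasets(
    [web, code, books],
    probabilities=[0.7, 0.2, 0.1],
    seed=42,
    stopping_strategy="all_exhausted",
)
print(len(mixed), mixed[0])
```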
Part 3: Multimodal Data Engineering
Goal: Process images, video, and audio to support training of GPT-4V/Sora-class models.
Chapter 6: Image-Text Pair Processing
- 6.1 Data Paradigms: Image-text pairs (LAION-5B) vs interleaved documents (OBELICS/MMC4).
- 6.2 Image Acquisition & Preprocessing:
  - img2dataset high-concurrency downloading in practice.
  - GPU-accelerated decoding and transforms (NVIDIA DALI).
- 6.3 Multimodal Cleaning Pipeline (see the CLIP filtering sketch below):
  - Aesthetic Scoring: Using CLIP-based aesthetic predictors to keep visually appealing images.
  - Image-text Alignment Filtering: Removing samples whose image and caption do not match.
  - Safety Detection: NSFW and watermark detection.
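A sketch of the image-text alignment filtering in 6.3 with a CLIP checkpoint from transformers; the model name, image path, caption, and 0.25 similarity cut-off are placeholders, and a real pipeline scores batches on GPU.

```python
# Score image-caption alignment with CLIP and drop low-similarity pairs (threshold is illustrative).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("sample.jpg")              # placeholder path
caption = "a dog playing in the snow"         # the pair's alt-text / caption

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
    # Cosine similarity between the projected image and text embeddings.
    sim = torch.nn.functional.cosine_similarity(out.image_embeds, out.text_embeds).item()

keep = sim >= 0.25                            # illustrative cut-off on CLIP similarity
print(f"similarity={sim:.3f}, keep={keep}")
```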
Chapter 7: Recaptioning
- 7.1 Limitations of Alt-text: Why raw web descriptions are unusable.
- 7.2 Synthetic Caption Factory (see the BLIP-2 sketch below):
  - Using BLIP-2 / LLaVA / CogVLM to regenerate detailed captions.
  - Prompt Strategies: Controlling caption granularity (brief vs detailed).
- 7.3 OCR Enhancement: Extracting in-image text and fusing it into text descriptions (critical for document understanding).
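A recaptioning sketch for 7.2 with BLIP-2 from transformers; the checkpoint, prompt, and image path are illustrative, and a CUDA GPU is assumed.

```python
# Regenerate a detailed caption for an image with BLIP-2 (checkpoint and prompt are illustrative).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

name = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(name)
model = Blip2ForConditionalGeneration.from_pretrained(name, torch_dtype=torch.float16).to("cuda")

image = Image.open("sample.jpg")                       # placeholder path
prompt = "Describe this image in detail."              # prompt wording controls caption granularity

inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=80)

caption = processor.batch_decode(out, skip_special_tokens=True)[0].strip()
print(caption)
```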
Chapter 8: Video & Audio Data
- 8.1 Video Processing Pipeline: Scene Detection and keyframe extraction strategies.
- 8.2 Video Tokenization: Video compression and discrete representation.
- 8.3 Audio Alignment: Large-scale ASR with Whisper and Forced Alignment (timestamp alignment).
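A minimal ASR sketch for 8.3 with the openai-whisper package; the model size and audio path are placeholders, and word-level timestamps would come from a separate forced-alignment step.

```python
# Transcribe an audio clip with Whisper (model size and file path are illustrative).
import whisper

model = whisper.load_model("base")              # larger checkpoints trade speed for accuracy
result = model.transcribe("clip_0001.wav")      # placeholder audio path

print(result["text"])                           # full transcript
for seg in result["segments"]:                  # segment-level timestamps; word-level needs a forced aligner
    print(f"{seg['start']:7.2f}s - {seg['end']:7.2f}s {seg['text']}")
```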
Part 4: Alignment & Synthetic Data Engineering
Goal: Make models follow instructions and break through the bottleneck of human data.
Chapter 9: Instruction Fine-tuning Data (SFT Data)
- 9.1 Prompt Engineering for Data Production: Writing robust System Prompts.
- 9.2 Automated Construction Methods (see the Self-Instruct sketch below):
  - Self-Instruct: Leveraging strong models to generate instructions.
  - Evol-Instruct: Evolutionary strategies for instruction complexity.
- 9.3 Chain-of-Thought (CoT) Data: Constructing step-by-step reasoning samples.
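A minimal Self-Instruct-style sketch for 9.2 against an OpenAI-compatible chat API; the model name, seed tasks, and prompt wording are illustrative, and the full method also adds similarity-based filtering and response generation.

```python
# Self-Instruct-style bootstrapping: prompt a strong model with seed tasks to draft new instructions.
import json
from openai import OpenAI

client = OpenAI()   # expects OPENAI_API_KEY in the environment

seed_tasks = [
    "Summarize the following news article in three sentences.",
    "Write a Python function that checks whether a string is a palindrome.",
]

prompt = (
    "You generate diverse training instructions for a language model.\n"
    "Here are example instructions:\n"
    + "\n".join(f"- {t}" for t in seed_tasks)
    + "\nWrite 5 new, different instructions as a JSON list of strings."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",                       # illustrative model name
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,
)

# Assumes the model returned a bare JSON list; real pipelines validate and retry.
new_instructions = json.loads(resp.choices[0].message.content)
# Self-Instruct would now deduplicate these (e.g. ROUGE-L similarity) and generate responses for each.
print(new_instructions)
```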
Chapter 10: Synthetic Data
- 10.1 Textbook-quality Data (Textbooks Are All You Need): Synthesizing high-quality domain knowledge.
- 10.2 Code & Math Synthesis (see the verification sketch below):
  - PoT (Program of Thought): Generating code, executing it, and verifying data correctness via execution results.
- 10.3 Multimodal Instruction Synthesis: Using GPT-4o to construct complex image-based reasoning Q&A.
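A sketch of the execution-based verification in 10.2; the sample is invented, and a subprocess with a timeout stands in for a proper sandbox that would also restrict filesystem and network access.

```python
# Verify a synthetic Program-of-Thought sample by executing its code and checking the result.
import subprocess

sample = {
    "question": "What is the sum of the first 100 positive integers?",
    "program": "print(sum(range(1, 101)))",
    "expected_answer": "5050",
}

def verify(sample: dict, timeout: float = 5.0) -> bool:
    """Run the generated program in a separate interpreter and compare its stdout to the answer."""
    try:
        proc = subprocess.run(
            ["python", "-c", sample["program"]],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() == sample["expected_answer"]

print(verify(sample))   # True -> the sample is kept in the synthetic dataset
```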
Chapter 11: Human Preference Data (RLHF/DPO)
- 11.1 Preference Data Format: Constructing chosen vs rejected sample pairs (see the sketch below).
- 11.2 Annotation Platforms & Quality Control: Crowdsourcing management and IAA (Inter-Annotator Agreement) analysis.
- 11.3 RLAIF (AI Feedback): Using LLMs to replace human preference scoring.
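A minimal sketch of the chosen/rejected pair format in 11.1; the field names follow a common DPO-style convention but vary across training frameworks.

```python
# Write chosen/rejected preference pairs in a JSONL layout commonly used for DPO-style training.
import json

pairs = [
    {
        "prompt": "Explain what MinHash deduplication does.",
        "chosen": "MinHash approximates Jaccard similarity between documents so near-duplicates "
                  "can be grouped and removed at scale.",
        "rejected": "It deletes all duplicate files from your hard drive.",
    },
]

with open("preference_pairs.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```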
Part 5: Application-level Data Engineering (RAG & Agent)
Goal: Enterprise-oriented solutions for external knowledge base parsing and retrieval.
Chapter 12: RAG Data Pipeline
- 12.1 Deep Document Parsing:
  - Complex PDF processing: table reconstruction and multi-column recognition (Unstructured, LlamaParse).
- 12.2 Chunking Strategies: Semantic chunking, recursive chunking, and Parent-Child Indexing (see the sketch below).
- 12.3 Vectorization & Storage: Embedding model fine-tuning and vector database optimization.
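A recursive-chunking sketch for 12.2 using LangChain's text splitter; the chunk size, overlap, and input file are illustrative.

```python
# Recursive chunking with LangChain's text splitter (parameters are illustrative).
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("parsed_report.txt", encoding="utf-8").read()   # placeholder: output of the PDF parser

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,                           # characters per chunk
    chunk_overlap=50,                         # overlap keeps context across chunk boundaries
    separators=["\n\n", "\n", ". ", " "],     # split on paragraphs first, then sentences, then words
)

chunks = splitter.split_text(text)
print(len(chunks), chunks[0][:200])
```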
Chapter 13: Multimodal RAG
- 13.1 Cross-modal Retrieval: Using CLIP/SigLIP for "text-to-image" and "image-to-text" search (see the sketch below).
- 13.2 ColPali Architecture in Practice: Vision-language-model-based document retrieval (skipping OCR to understand document images directly).
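A text-to-image retrieval sketch for 13.1 using a CLIP encoder from sentence-transformers; the checkpoint, image paths, and query are placeholders.

```python
# Text-to-image retrieval with a CLIP encoder from sentence-transformers (paths are placeholders).
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")     # encodes both images and text into one space

image_paths = ["page_001.png", "page_002.png", "page_003.png"]          # placeholder corpus
image_embeddings = model.encode([Image.open(p) for p in image_paths], convert_to_tensor=True)

query = "bar chart of quarterly revenue"
query_embedding = model.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, image_embeddings, top_k=3)[0]
for hit in hits:
    print(image_paths[hit["corpus_id"]], round(hit["score"], 3))
```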
Part 6: Capstone Projects
Goal: Integrate all of the book's technical topics through five end-to-end projects with runnable code repositories.
Project 1: Building a "Mini-C4" Pre-training Set
- Scenario: From Common Crawl raw data (WARC) to high-quality Parquet data.
- Core Technologies: Trafilatura parsing, Spark/Ray distributed MinHash deduplication, KenLM quality filtering (see the sketch below).
- Output: Cleaned plain text corpus and processing pipeline.
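A sketch of the KenLM perplexity filter used in this project; the model path and threshold are placeholders, and perplexity is derived here from KenLM's total log10 score.

```python
# Perplexity-based quality filtering with a KenLM n-gram model (path and threshold are illustrative).
import kenlm

model = kenlm.Model("en.arpa.bin")     # placeholder: a KenLM model trained on clean reference text

def perplexity(text: str) -> float:
    """Convert KenLM's total log10 score into per-token perplexity."""
    log10_score = model.score(text, bos=True, eos=True)
    tokens = len(text.split()) + 1                      # +1 for the end-of-sentence token
    return 10.0 ** (-log10_score / tokens)

doc = "The quick brown fox jumps over the lazy dog."
# Keep documents whose perplexity under the reference LM is below a tuned threshold.
print(perplexity(doc), perplexity(doc) < 1000)
```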
Project 2: Domain Expert SFT (Legal/Medical)
- Scenario: Building industry expert fine-tuning data from unstructured PDF documents.
- Core Technologies: Self-Instruct instruction generation, CoT reasoning enhancement, data diversity balancing.
- Output: domain_expert.jsonl instruction fine-tuning dataset.
Project 3: Building a LLaVA Multimodal Instruction Set
- Scenario: Training a multimodal model that can understand images.
- Core Technologies: Using GPT-4o API for multi-turn image-text dialogues, Bounding Box data alignment, multi-image interleaved format processing.
- Output: Image-text dataset with visual instructions.
Project 4: Synthetic Math/Code Textbook
- Scenario: Improving small model logical reasoning capabilities.
- Core Technologies: Evol-Instruct evolutionary strategies, Python code execution sandbox verification, PoT data formatting.
- Output: Verified high-quality synthetic reasoning dataset.
Project 5: Multimodal RAG Financial Report Assistant
- Scenario: Retrieving and answering questions about annual reports with complex charts.
- Core Technologies: PDF table and chart parsing, multi-route recall (hybrid retrieval), ColPali visual retrieval.
- Output: A RAG knowledge base system supporting chart Q&A.
Appendix
- Appendix A: Common Tools Quick Reference (Hugging Face Datasets, LangChain, Ray Data).
- Appendix B: Data Compliance Checklist (Copyright, GDPR, robots.txt).
- Appendix C: Compute Cost Estimation (GPU/CPU consumption reference for different data processing scales).