## Detailed Property Descriptions & Annotation Guidelines ### Core Content Properties #### 1. Content Integrity **What we're measuring**: Completeness and technical quality of the content itself, regardless of navigation ratio. ##### Values & Criteria: **`complete`** - Full, intact content as intended - Content appears complete with proper beginning, middle, and end - All essential elements present (introduction, body, conclusion where appropriate) - No obvious truncation or missing sections - Example: Complete articles, full tutorials, intact documents **`mostly_complete`** - Minor elements missing but core content intact - Core content is complete but some secondary elements may be missing - Minor truncation that doesn't affect main message - Example: Article with truncated comments, missing sidebar content, partial author bio **`fragment`** - Incomplete content, missing significant portions - Missing introduction, conclusion, or substantial middle sections - Truncated mid-sentence or mid-paragraph - Content feels incomplete or cut off - Example: Search result snippets, article excerpts, broken crawls, partial downloads **`severely_degraded`** - Broken, unreadable, or corrupted content - Encoding errors, scrambled text, missing characters - Severely malformed HTML rendering as gibberish - Technical corruption making content unreadable - Example: �&$^%*@# characters, completely broken formatting, corrupted files ##### Key Decision Points: - **Content completeness**: Does the content feel like a complete unit of information? - **Technical integrity**: Is the content technically readable and properly formatted? - **Fragment vs. complete**: Independent of navigation - is the actual content complete? - **Degraded vs. fragment**: Degraded has technical issues; fragment is just incomplete **Note**: Documents may end with the special tag ``, indicating upstream length-based truncation due to processing constraints. Do not penalize Content Integrity due to this truncation signal; assess integrity based on the visible content's coherence and technical readability, ignoring the artificial cutoff. --- #### 2. Content Ratio **What we're measuring**: How much of the document is actual content vs. navigation, UI elements, and structural markup. ##### Values & Criteria: **`complete_content`** - 90-100% meaningful content - Full articles, papers, tutorials with minimal navigation - Clean text with proper paragraphs and structure - Example: A Wikipedia article, academic paper, complete blog post **`mostly_content`** - 70-89% meaningful content - Complete documents with some navigation elements (header, footer, sidebar) - Minor UI elements that don't disrupt reading - Example: News articles with standard website navigation **`mixed_content`** - 40-69% meaningful content - Significant navigation mixed throughout content - Multiple sidebars, ads, or UI elements interrupting text - Example: E-commerce product pages with reviews mixed with purchase options **`mostly_navigation`** - 10-39% meaningful content - Predominantly menus, links, headers, footers - Content overwhelmed by structural elements - Example: Site maps, navigation pages, heavily UI-focused pages **`minimal_content`** - 0-9% meaningful content - Almost entirely navigation, UI elements, or structural markup - Very little readable content present - Example: Empty pages, pure navigation menus, error pages with minimal text ##### Key Decision Points: - Focus on the **ratio of readable text to navigation/UI elements** - **Count only substantive content**, ignore boilerplate and structural elements - **Mixed vs. mostly_navigation**: Can you read it as coherent content despite distractions? --- #### 3. Content Length **What we're measuring**: Amount of substantive content, ignoring navigation and boilerplate. ##### Values & Criteria: **`substantial`** - 2,000+ words of meaningful content - Long-form, comprehensive content that provides in-depth coverage of a topic - Typically includes detailed analysis, multiple sections or chapters, extensive research, or thorough exploration of complex subjects - Examples: White papers, research reports, e-books, long-form journalism **`moderate`** - 500–2,000 words of meaningful content - Standard-length content that offers meaningful coverage while remaining focused and digestible - Balances depth with accessibility; provides enough detail to be informative without overwhelming readers - Examples: Typical blog posts, news articles, product reviews, how-to guides **`brief`** - 100–500 words of meaningful content - Short, focused content that delivers key information quickly and efficiently - Gets straight to the point while still providing value and context - Examples: News briefs, product descriptions, FAQs, short blog posts **`minimal`** - Under 100 words of meaningful content - Very short content that provides only essential information or serves as a quick reference - Designed for rapid consumption or specific micro-purposes - Examples: Social media posts, announcements, abstracts, snippets, navigation pages ##### Measurement Tips: - **Count only readable content of value**: include article body and substantive headings/captions; exclude headers/footers, menus/sidebars, related links, share/consent UI, pagination, ads, and boilerplate. - **Focus on substantive information**, not filler words - **Complete thoughts matter more than exact word counts** - **Contextual adjustment**: Thresholds are guidelines and can be adjusted based on specific use cases and typical content. Academic contexts may shift ranges upward, while social media contexts may shift them downward. --- ### Content Classification #### 4. One-Sentence Description **What we're looking for**: A very short, neutral description of what the document contains. ##### Field: **`one_sentence_description`** - Ultra-short neutral description of the document - Exactly one sentence - Target length: <100 characters - Focus on the main topic and, if useful, the document’s function - Examples of functions: tutorial, policy, news report, product page, navigation page - Neutral, descriptive tone (no hype or marketing language) ##### To Avoid: - Boilerplate intros: "This document...", "This article...", "In this guide..." - Calls to action: "Learn how to...", "Discover...", "Find out..." - User-facing phrasing: "You will learn...", "How do I..." - Non-essential details (dates, numbers) unless central to the topic ##### Examples: - "Beginner tutorial on React hooks and basic state management." - "News report on European Central Bank interest rate decisions." - "Internal policy for customer data retention and deletion." - "API reference for payment processing endpoints and error codes." - "Research paper analyzing housing price trends in major US cities." - "FAQ answering common questions about employee parental leave." - "Opinion essay arguing for stricter international climate change legislation." ##### Examples for low-quality or problematic documents (still annotate): - "Fragment of article discussing proposed changes to European data privacy laws." - "Keyword-stuffed promotional page about cheap car insurance quotes." - "Website navigation page listing links to product categories and help pages." - "Error page explaining that the requested resource could not be found." - "Affiliate landing page promoting multiple online casino bonus offers." - "Corrupted text with no identifiable topic or meaningful content." #### 5. Content Type **What we're measuring**: The functional structure and purpose of content. **Multi-type content**: Content can be assigned multiple type labels if it genuinely serves multiple purposes. Choose ALL applicable types rather than forcing a single primary choice. Always output an array for this property, even if only one type applies. ##### Values & Criteria: **`analytical`** - In-depth analysis, research, and critical examination - Provides detailed analysis or research on a topic - Develops arguments, evaluates evidence, or presents findings - Example: Research analysis, investigative reports, academic articles, expert commentary **`instructional`** - Teaching and how-to content - Explicitly teaches skills, concepts, or procedures - Step-by-step guidance or educational explanations - Example: Tutorials, how-to guides, educational content, training materials **`reference`** - Lookup materials, definitions, specifications - Designed for looking up specific information rather than reading through - Often organized alphabetically, categorically, or as lists - Example: Dictionaries, encyclopedias, API references, product catalogs **`procedural`** - Step-by-step processes and procedures - Sequential instructions or workflows - Process documentation with clear steps - Example: Recipes, installation guides, standard operating procedures, workflows **`qa_structured`** - Structured question-answer content - Formal Q&A format with clear questions and answers - Often expert responses to specific questions - Example: Stack Overflow, FAQ sections, structured Q&A sites **`conversational`** - Multi-party or turn-based dialogues (humans, bots, or both) - Casual or structured conversations between two or more participants - May include human–AI chats, forum threads, or comment chains - Example: Reddit threads, forum discussions, support chats, assistant chat logs **`creative`** - Entertainment, artistic, fictional content - Primary purpose is entertainment or artistic expression - Not primarily informational or instructional - Example: Short stories, poems, movie reviews, game content, fiction **`transactional`** - Commercial, shopping, service-oriented - Primary purpose is to facilitate a transaction or service - Focuses on products, services, or business processes - Example: Product listings, service descriptions, checkout pages **`boilerplate`** - Legal, policy, standard template text - Standard legal or policy language - Often repeated across multiple sites with minimal variation - Example: Terms of service, privacy policies, disclaimers, cookie banners, standard notices **`news_report`** - Straight reporting of events with minimal analysis - Describes events or facts in a neutral, descriptive tone - Time-bound news, updates, or reports - Example: Wire-service news articles, breaking-news updates **`opinion_editorial`** - Persuasive/opinionated commentary or editorials - Expresses a stance or argument; aims to persuade - May cite evidence but prioritizes viewpoint - Example: Op-eds, opinion columns, personal essays with clear stance **`review_critique`** - Evaluative reviews of products, media, or services - Provides judgments, ratings, or critiques - May include pros/cons, scoring systems - Example: Product reviews, film/book critiques, app store reviews (long-form) **`technical_documentation`** - Manuals, API docs, developer guides, READMEs - Primary goal is to instruct usage of software/hardware/APIs - Includes reference sections, examples, parameters, version notes - Example: API reference, library README, user manual **`specification_standard`** - Normative standards and formal specifications - Defines requirements, must/shall language, compliance criteria - Maintained by standards bodies or authoritative groups - Example: RFCs, ISO standards, formal protocol specs **`legal_document`** - Statutes, case law, contracts, regulatory texts - Binding or authoritative legal content - Formal legal language and structure - Example: Court opinions, legislation, contracts, regulatory rules **`press_release`** - Organization-issued announcements and PR materials - Promotional announcements framed as information - Quotes from executives, product/service announcements - Example: Company press releases, launch announcements **`structured_data`** - Tables, datasets, indices, catalogs with minimal prose - Predominantly tabular/listed data meant for lookup - Minimal narrative or explanatory text - Example: Product catalogs, schedules, statistical tables **`source_code`** - Code listings as primary content - Dominant content is program source code or scripts - May include lightweight comments or snippets without narrative - Example: Code files, gist-like pages, competitive programming solutions ##### Multi-Type Examples: - **Tutorial that analyzes different approaches** → `["instructional", "analytical"]` - **Educational reference manual** → `["instructional", "reference"]` - **Research paper with step-by-step methodology** → `["analytical", "procedural"]` - **Q&A site with analytical responses** → `["qa_structured", "analytical"]` - **API guide with examples** → `["technical_documentation", "reference", "instructional"]` - **RFC with rationale** → `["specification_standard", "analytical"]` - **Film review with interview snippets** → `["review_critique", "conversational"]` - **Helpdesk chat with an AI** → `["conversational", "transactional"]` - **Breaking news explainer** → `["news_report", "explanatory"]` --- #### 6. Business Sector **What we're measuring**: Business sector(s) or industry domain(s) for training sector-specific LLMs. **Multi-sector content**: Content can be assigned multiple sector labels if it genuinely spans multiple industries. Choose ALL applicable sectors rather than forcing a single primary choice or using "other". Always output an array for this property, even if only one sector applies. #### Values & Criteria: **`academic_research`** - Scholarly and research content - Peer-reviewed publications, academic papers - University-affiliated research and scholarship - Formal academic discourse and methodology - Example: Journal articles, conference papers, academic books, dissertations **`education_sector`** - Educational institutions and pedagogy - K-12 education, higher education administration - Educational technology, curriculum development - Teaching methodologies and educational policy - Example: School curricula, educational policy papers, teaching resources, edtech content **`technology_software`** - Software and information technology - Software development, programming, IT services - Digital products, platforms, and technology companies - Computer science and software engineering - Example: Software documentation, tech company content, programming guides, IT industry analysis **`hardware_electronics`** - Hardware devices and electronics industry - Semiconductors, consumer electronics, embedded systems, hardware design - Electronics manufacturing and supply chains - Example: Chip design docs, hardware datasheets, device manuals **`healthcare_medical`** - Healthcare and medical sector - Medical research, clinical practice, healthcare delivery - Hospitals, medical devices, healthcare policy - Public health and wellness - Example: Medical journals, clinical guidelines, healthcare administration, wellness content **`pharmaceutical_biotech`** - Pharmaceutical and biotechnology - Drug development, clinical trials, biotech research - Pharmaceutical industry, biotechnology companies - Life sciences and molecular biology applications - Example: Drug research papers, clinical trial reports, biotech industry analysis **`financial_services`** - Banking and financial services - Banking, investment, insurance, financial planning - Financial markets, fintech, payment systems - Asset management and financial advisory - Example: Financial analysis, banking documentation, investment guides **`legal_services`** - Legal sector and jurisprudence - Law firms, legal practice, court systems - Legal education, regulatory compliance - Litigation, contracts, legal advisory - Example: Legal briefs, court opinions, legal analysis, compliance guides **`government_public`** - Government and public administration - Government agencies, public policy, civic services - Regulatory bodies, public administration - Political institutions and governance - Example: Government reports, policy documents, regulatory filings, civic information **`manufacturing_industrial`** - Manufacturing and heavy industry - Industrial production, manufacturing processes - Supply chain, logistics, industrial equipment - Factory operations and industrial engineering - Example: Manufacturing specs, industrial reports, supply chain analysis, production guides **`mining_resources`** - Mining and natural resources - Exploration, extraction, and processing of minerals and resources - Resource markets and operations (metals, rare earths) - Example: Mining reports, resource exploration docs, commodity operations **`chemicals_materials`** - Chemicals and advanced materials - Petrochemicals, specialty chemicals, polymers, composites, advanced materials - Safety data sheets (SDS), process chemistry, materials science - Example: Material datasheets, REACH documentation, chemical process guides **`energy_utilities`** - Energy and utilities sector - Power generation, renewable energy, oil and gas - Electric utilities, water services, waste management - Energy infrastructure and grid management - Example: Energy industry reports, utility regulations, renewable energy research **`retail_commerce`** - Retail and e-commerce - Retail operations, e-commerce platforms - Consumer goods distribution, merchandising - Retail technology and customer experience - Example: Retail industry analysis, e-commerce guides, merchandising strategies **`wholesale_distribution`** - Wholesale trade and distribution - B2B wholesale, distributors, procurement, inventory and fulfillment - Supply relationships between manufacturers and retailers - Example: Distributor catalogs, wholesale operations, procurement guides **`real_estate_construction`** - Real estate and construction - Property development, construction industry - Real estate markets, property management - Architecture and building services - Example: Real estate analysis, construction specifications, property guides **`transportation_logistics`** - Transportation and logistics - Airlines, shipping, freight, public transit - Logistics operations, supply chain transportation - Vehicle fleet management, transportation infrastructure - Example: Logistics guides, transportation planning, shipping documentation **`travel_aviation`** - Travel industry and commercial aviation - Airlines, airports, OTA platforms, hospitality travel operations - Route planning, airline commercial, loyalty, IATA regulations - Example: Airline scheduling, fare rules, OTA partner docs **`automotive_industry`** - Automotive manufacturing and services - Vehicle manufacturers, automotive suppliers - Automotive technology, electric vehicles - Dealerships and automotive services - Example: Automotive engineering docs, vehicle technology papers, industry analysis **`telecommunications`** - Telecommunications industry - Telecom operators, network infrastructure - Mobile services, broadband, satellite communications - Telecommunications equipment and technology - Example: Telecom industry reports, network specifications, 5G technology papers **`media_entertainment`** - Media and entertainment industry - Film, television, music, gaming industries - Publishing, news media, content creation - Streaming services and digital media - Example: Entertainment industry analysis, media studies, content strategy **`gaming_industry`** - Video games and interactive entertainment - Game development, studios, engines, esports, live ops - Monetization models, community management, platform ecosystems - Example: Patch notes, game design docs, esports operations **`gambling_betting`** - Gambling, betting, and online casinos - Sportsbooks, casino games, lotteries, poker rooms - Affiliate landing pages, bonus/promotions, tipster content - Often high commercial bias and promotional framing **`advertising_marketing`** - Advertising, marketing, and PR - Brand strategy, campaign planning, performance marketing, martech - Agencies, in-house marketing, PR communications - Example: Campaign briefs, media plans, PR strategies **`hospitality_tourism`** - Hospitality and tourism sector - Hotels, restaurants, travel services - Tourism industry, destination management - Event planning and hospitality services - Example: Tourism studies, hospitality management, travel industry reports **`food_beverage_hospitality`** - Food & beverage and restaurant operations - Restaurant ops, menu engineering, supply chain, QSR/fast casual - Food safety, compliance, procurement for F&B - Example: Restaurant training manuals, HACCP docs, vendor specs **`agriculture_food`** - Agriculture and food production - Farming, agricultural technology, food processing - Agricultural supply chain, food safety - Agribusiness and agricultural policy - Example: Agricultural research, food industry reports, farming guides **`environmental_services`** - Environmental and sustainability services - Environmental consulting, ESG reporting, sustainability programs - Waste management services, remediation, impact assessments - Example: ESG reports, environmental impact assessments, sustainability frameworks **`aerospace_defense`** - Aerospace and defense industry - Aircraft manufacturing, space technology - Defense contractors, military systems - Aviation and space exploration - Example: Aerospace engineering papers, defense industry analysis, aviation guides **`insurance_industry`** - Insurance sector - Life, health, property, and casualty insurance - Reinsurance, actuarial science, risk assessment - Insurance technology and underwriting - Example: Actuarial studies, insurance policy analysis, risk management guides **`nonprofit_ngo`** - Nonprofit and NGO sector - Charitable organizations, international development - Social services, humanitarian organizations - Foundations and philanthropic institutions - Example: NGO reports, nonprofit management, development studies **`consulting_professional`** - Professional services and consulting - Management consulting, accounting firms - Business advisory, professional services firms - Corporate strategy and business transformation - Example: Consulting reports, professional services guides, business strategy papers **`human_resources`** - HR and people operations - Talent acquisition, compensation & benefits, performance management, L&D - HR tech, workforce planning, organizational development - Example: HR policy docs, job frameworks, talent strategy **`security_cyber`** - Security and cybersecurity - Information security, threat intelligence, risk management, compliance (e.g., SOC2) - Physical security operations and incident response - Example: Security guidelines, incident playbooks, vulnerability reports **`consumer_goods`** - Consumer products and CPG - Fast-moving consumer goods, household products - Personal care, food and beverage brands - Consumer product development and marketing - Example: CPG industry analysis, product development docs, consumer research **`general_interest`** - General audience content - Content for broad audiences without sector focus - General knowledge and miscellaneous topics - Cross-sector or sector-agnostic content - Example: General magazines, broad interest content, lifestyle articles **`other`** - Highly specialized or unclassifiable - Highly specialized niches not covered by existing sectors - Content with genuinely unclear sector classification - Unique content types that don't map to any defined sector - Example: Highly specialized technical niches, unique content formats ##### Multi-Sector Examples: - **Medical device regulations** → `healthcare_medical` + `pharmaceutical_biotech` + `government_public` - **Fintech software documentation** → `financial_services` + `technology_software` - **Agricultural biotechnology research** → `agriculture_food` + `pharmaceutical_biotech` --- #### 7. Technical Content **What we're measuring**: Type and intensity of specialized technical knowledge. **Multi-technical content**: Content can be assigned multiple technical content labels if it genuinely combines multiple technical domains. Choose ALL applicable technical types rather than forcing a single primary choice. Always output an array for this property, even if only one technical type applies. ##### Values & Criteria: **`code_heavy`** - Significant programming content - Multiple code examples, algorithms, or implementations - Technical programming concepts and methodologies - Software development focus - Example: Programming tutorials, API documentation, software guides **`math_heavy`** - Substantial mathematical content - Mathematical equations, proofs, or statistical analysis - Quantitative analysis and mathematical reasoning - Mathematical concepts and methodologies - Example: Mathematical papers, statistical analysis, quantitative research **`scientific`** - Research and scientific methodology content - Scientific research findings, experimental data - Scientific methodology and analysis - Peer-reviewed research content - Example: Research papers, scientific studies, experimental reports **`data_heavy`** - Substantial datasets, tables, and data analysis - Contains significant data tables, charts, or datasets - Focus on data interpretation and analysis - Statistical content with data presentations - Example: Research data, statistical reports, data analysis, survey results **`engineering`** - Engineering and applied technical content - Engineering design, systems, and applied technical solutions - Technical specifications for physical systems - Non-software engineering disciplines - Example: Mechanical engineering, civil engineering, technical specifications, design documents **`basic_technical`** - Some technical elements but not dominant - Light technical content mixed with general explanations - Technical concepts explained for general audience - Example: Technology articles for general audience, basic technical explanations **`non_technical`** - No significant technical content - General audience content without specialized technical knowledge - No programming, mathematical, engineering, or scientific focus - Example: General articles, humanities content, basic informational content ##### Multi-Technical Examples: - **Data science tutorial with code examples** → `["code_heavy", "math_heavy", "data_heavy"]` - **Engineering research with statistical analysis** → `["engineering", "scientific", "data_heavy"]` - **Computational biology paper** → `["code_heavy", "scientific"]` --- ### Quality and Value Assessment #### 8. Content Quality **What we're measuring**: Overall quality of content considering writing excellence, substantive value, and presentation quality regardless of authorship origin. #### Values & Criteria: **`excellent`** - Outstanding quality across all dimensions - Sophisticated writing with varied sentence structures and engaging style - Rich, appropriate vocabulary with error-free grammar and punctuation - High substantive value with clear insights or information - Professional presentation and formatting - Natural flow and logical organization - Example: High-quality publications, expert analyses, polished educational content, well-crafted professional documents **`good`** - High quality with minor imperfections - Grammatically correct with proper sentence structure - Appropriate vocabulary and tone for content type - Solid substantive value and clear information - Good organization and readable flow - Only occasional minor issues (1-2 typos per section) - Example: Quality journalism, professional websites, well-written blog posts, solid educational materials **`adequate`** - Acceptable quality for most purposes - Generally clear and understandable writing - Some grammatical errors but meaning remains clear - Reasonable substantive value though may lack depth - Basic organization and structure present - Minor formatting or presentation issues - Example: Casual blogs, user reviews, basic informational content, simple guides **`poor`** - Significant quality issues impacting utility - Multiple errors affecting comprehension or credibility - Unclear expression, confusing organization, or awkward phrasing - Limited substantive value or questionable information - Major formatting problems or unprofessional presentation - Difficult to extract reliable information - Example: Low-quality web content, poorly edited materials, confusing instructions **`unacceptable`** - Quality too low for productive use - Severely impaired communication with major errors - Incoherent, nonsensical, or corrupted content - No reliable substantive value - Broken formatting or technical corruption - Cannot determine intended meaning or extract useful information - Example: Corrupted text, severe translation errors, spam content, SEO content, completely broken formatting ##### Quality Assessment Guidelines: - **Comprehension**: Can the intended message be clearly understood? - **Substantive value**: Does the content provide useful information or insights? - **Technical presentation**: Is the content properly formatted and readable? - **Error impact**: Do errors significantly impede understanding or credibility? - **Professional standards**: Does the content meet basic standards for its intended purpose? **Language-Specific Quality Indicators:** - For non-Latin scripts (Arabic, Chinese, Japanese): Check for proper character encoding - For agglutinative languages (Turkish, Finnish): Adjust expectations for word count/density - For languages with different formality levels (Japanese, Korean): Assess appropriate register - Mixed-language documents: Evaluate code-switching quality and appropriateness --- #### 9. Information Density **What we're measuring**: Ratio of valuable information to redundancy, padding, and repetition. ##### Values & Criteria: **`dense`** - Efficient, information-packed content - Every sentence adds new information or insight - Minimal redundancy or unnecessary elaboration - Little to no repetition of the same concepts - Example: Technical specifications, concise academic writing, quality reference material **`adequate`** - Good information content with reasonable elaboration - Most content adds value with some acceptable elaboration - Minimal repetition within the document - Good balance of information and explanation - Example: Well-written articles, good tutorials with examples **`moderate`** - Mixed substantive content with noticeable padding - Some valuable information mixed with unnecessary elaboration - Noticeable repetition of key points for emphasis - Some sections feel padded or verbose - Example: Blog posts with some fluff, articles with repetitive conclusions **`thin`** - Low information content with significant problems - Much content doesn't add new information - High internal repetition and excessive redundancy - Significant padding to reach desired length - Example: SEO-optimized content, poorly edited writing **`empty`** - Dominated by repetition and meaningless content - Minimal actual information value - Dominated by repetition and copy-paste artifacts - Same ideas repeated multiple times without development - Example: Spam content, template-filled pages, keyword-stuffed articles ##### Common Repetition Patterns to Watch For: - **Same phrases repeated throughout** (especially in SEO content) - **Identical paragraphs** or sections (copy-paste errors) - **Circular reasoning** (saying the same thing in different ways) - **Template artifacts** (repeated boilerplate mixed with content) --- #### 10. Educational Value **What we're measuring**: Potential for teaching, learning, and knowledge transfer. ##### Values & Criteria: **`high`** - Clear instructional design and learning objectives - Explicitly teaches concepts or skills - Progressive skill building from basic to advanced - Clear learning objectives and outcomes - Comprehensive explanations with examples - Example: Quality tutorials, textbooks, structured courses, educational guides **`moderate`** - Good instructional value with some learning potential - Some instructional elements present - Explanations help build understanding - Transferable knowledge to other contexts - Good examples or illustrations - Example: How-to articles, explanatory content, informative guides **`basic`** - Limited educational content - Some explanations but not systematically instructional - Basic explanations of concepts - Limited learning potential or skill building - Example: Basic explanations, simple informational content **`minimal`** - Little educational value - Primarily informational rather than instructional - No clear learning objectives or skill building - Entertainment or commercial focus - Example: Entertainment content, basic news, commercial content **`none`** - No educational content - No instructional value or learning potential - Purely transactional, entertainment, or administrative - No knowledge transfer potential - Example: Pure entertainment, transactions, legal boilerplate ##### Disambiguation tips - Explanatory vs Educational: explanations alone ≠ educational design; require intent to teach plus scaffolding for Basic+ - Reference docs: typically Minimal; promote to Basic/Moderate when guided “how-to” segments or curated examples exist - Reviews/op-eds: None/Minimal unless they include actionable how-to guidance designed for learning ##### Automation heuristics - Keywords: Objectives/Outcomes, Lesson, Exercise/Quiz, Homework, Assessment, Syllabus, Module, Unit, Learning Goals - Structure: numbered steps + prerequisites/requirements → Basic; add practice tasks/solutions → Moderate; syllabus/modules/assessments → High - Signals of non-edu mix: heavy CTAs/ads or product pitches → cap at Minimal unless clear instructional scaffolding ##### Quick decision tree - Are there explicit learning goals or a syllabus? → High - Else, are there step-by-step instructions with examples/exercises? → Moderate - Else, are there explanatory sections intended to teach basics? → Basic - Else, is there any minor instructional element? → Minimal - Otherwise → None ##### Borderline examples - API reference with examples but no guidance → Minimal to Basic (depending on clarity/examples) - Blog post explaining concept with analogies and one example → Basic - Tutorial with tasks, checkpoints, and solutions → High - Product documentation with “Getting Started” and “How-To” flows → Moderate ##### Educational Indicators: - **Learning objectives**: Clear goals for what reader should learn - **Skill progression**: Builds from basic to advanced concepts - **Examples and practice**: Provides concrete examples or exercises - **Knowledge transfer**: Concepts applicable beyond immediate context --- #### 11. Reasoning Indicators **What we're measuring**: Presence and quality of logical reasoning, analysis, and explanatory content. ##### Values & Criteria: **`analytical`** - Complex reasoning and systematic analysis - Multi-step arguments with logical progression - Cause-effect analysis and systematic thinking - Considers multiple perspectives or variables - Draws conclusions from evidence and reasoning - Example: Research analysis, complex problem-solving, systematic evaluations **`explanatory`** - Clear explanations with logical flow - Explains how or why things work - Shows cause-effect relationships clearly - Educational reasoning that builds understanding - Logical connections between concepts - Example: Good tutorials, educational content, how-to explanations **`basic_reasoning`** - Simple logical connections - Some logical connections between ideas - Basic explanations of concepts or processes - Elementary analytical thinking - Simple cause-effect relationships - Example: Basic explanations, simple arguments, elementary analysis **`minimal`** - Limited reasoning, mostly descriptive - Primarily describes what rather than why or how - Few logical connections between ideas - Mostly factual statements without analysis - Little explanatory content - Example: Basic descriptions, simple factual content, minimal analysis **`none`** - No clear reasoning present - Purely descriptive content - Simple factual listing without connections - Narrative content without analysis - No logical argumentation or explanation - Example: Simple lists, basic narratives, pure description ##### Thinking-trace signals (what to look for) - Stepwise structure: numbered steps in proofs/derivations/solutions; “First… therefore… hence… so…” - Hypothesis and test: assumptions, intermediate results, counterexamples, sanity checks - Tool- or method-calls: named algorithms, theorems, lemmas, or procedures invoked and justified - Error analysis or reflection: “we tried X, failed because Y, so we…”, “limitations,” “edge cases” - Intermediate artifacts: scratch calculations, partial code reasoning, sub-problems and sub-claims ##### Disambiguation rules - Explanatory vs Analytical: explanations tell how; analytical shows multi-step inference with evidence and intermediate claims - Worked example vs Mere answer: worked examples expose steps and justification; mere answers without steps are not reasoning-rich - Procedural vs Reasoning: procedural lists actions; reasoning links actions via logic, evidence, or constraints ##### Automation heuristics - Lexical cues: because, therefore, thus, hence, suppose/assume, we conclude, by induction, lemma/theorem/proof, O(n), hypothesis, counterexample - Structure cues: presence of proof blocks, derivations (e.g., “Proof.”, “QED”, TeX environments), multi-step numeric calculations - Program reasoning: code comments like “// invariant”, “// complexity”, pre/post-conditions, test reasoning - Thresholding: count reasoning cues per 1k tokens; with ≥2 structural cues or ≥5 lexical cues → at least explanatory; proofs/derivations → analytical ##### Quick decision tree - Is there a proof/derivation or multi-step argument with intermediate claims? → analytical - Else, does it explain why/how with cause-effect and logical links? → explanatory - Else, are there simple logical connections or one-step justifications? → basic_reasoning - Else, does it mostly describe without connecting ideas? → minimal/none ##### Borderline examples - Answer-only solutions (final numeric result without steps) → minimal - Step-by-step math solution with intermediate equations → analytical - “How it works” article connecting 2–3 causal steps without data → explanatory - Troubleshooting log with attempts and justifications → analytical if causal chain is explicit; otherwise explanatory ##### Key Reasoning Patterns to Identify: - **Cause-effect**: "Because X, therefore Y" - **Problem-solution**: Identifies problems and proposes solutions - **Comparison**: Analyzes similarities and differences - **Logical progression**: Ideas build on previous ideas - **Evidence-based conclusions**: Draws conclusions from presented evidence --- ### Audience and Purpose #### 12. Audience Level **What we're measuring**: Intended sophistication level and background knowledge assumptions of the target audience. ##### Values & Criteria: **`expert`** - Highly specialized professional/academic content - Assumes deep domain expertise and advanced training - Uses technical terminology without explanation - Content for practitioners actively working in specialized fields - Example: Climate modeling methodology in Nature Climate Change, research papers, technical specifications, expert-to-expert communications **`advanced`** - Educated adult audience with analytical skills - Assumes higher education and critical thinking ability - Explains specialized concepts but uses sophisticated language - Intellectually challenging but accessible to educated generalists - Example: Complex climate change analysis in The Atlantic, quality journalism, policy analysis, advanced general interest content **`general`** - General adult audience - Accessible to most educated adults without specialized background - Explains technical concepts when introduced - Uses clear language while maintaining intellectual substance - Example: Quality journalism, general interest articles, accessible explanations of complex topics **`beginner`** - Introductory level with minimal prerequisites - Explains basic concepts and terminology - Builds up from fundamental principles - Assumes minimal prior knowledge of the subject area - Example: Introductory tutorials, beginner guides, basic explanations, getting-started content **`youth`** - Targeted at teenagers and young adults (ages 13-19) - Age-appropriate complexity with contemporary cultural references - Sophisticated enough for developing critical thinking but accessible - May address topics relevant to adolescent experiences and concerns - Example: High school educational content, young adult literature, teen-focused explanations, college prep materials **`children`** - Designed specifically for children - Simple language and concepts appropriate for young readers - Educational content designed for elementary/middle school levels - Age-appropriate topics and complexity - Example: Children's educational content, elementary school materials, simple explanations for young learners ##### Assessment Guidelines: - **Professional context**: Is this content designed for workplace use vs. general learning? - **Terminology density**: How much specialized vocabulary is used without explanation? - **Concept complexity**: How sophisticated are the ideas and their development? - **Background assumptions**: What education level and domain knowledge does the author assume? **Cross-Linguistic Considerations:** - Expert terminology density varies by language (German allows more compound terms) - Formality markers differ across cultures - Educational level assumptions vary by country's education system - Age-appropriate content differs across cultures --- #### 13. Commercial Bias **What we're measuring**: How much commercial interests influence the objectivity and informational value of content. ##### Values & Criteria: **`none`** - No commercial influence detected - Objective, informational presentation - No promotional language or commercial agenda - Focus purely on informing or educating - Example: Academic papers, objective journalism, educational content **`minimal`** - Slight commercial context but maintains objectivity - May mention products/services but in informational context - Maintains balanced, objective tone - Commercial mentions serve informational purpose - Example: Product reviews with balanced analysis, informational articles mentioning relevant products **`moderate`** - Some commercial influence on content - Mix of informational and promotional content - Some promotional language but still provides useful information - Commercial interests somewhat visible but not dominant - Example: Company blogs with useful information, sponsored content with actual value **`heavy`** - Strong commercial bias throughout - Primarily promotional with some informational elements - Heavy use of marketing language and persuasive techniques - Clear commercial agenda affects content objectivity - Example: Marketing articles disguised as information, heavily biased product comparisons **`pure_marketing`** - Entirely commercial/promotional content - No genuine informational value beyond promotion - Pure marketing copy or advertising material - Designed solely to drive sales or conversions - Example: Sales pages, pure advertising copy, promotional brochures ##### Key Indicators: - **Language tone**: Objective vs. promotional language - **Primary purpose**: Inform vs. persuade/sell - **Balance**: Are alternatives/drawbacks mentioned? - **Call-to-action**: Subtle information vs. obvious sales pitch --- #### 14. Time-Sensitivity **What we're measuring**: How time-sensitive the content is - whether its value degrades over time or remains stable. ##### Values & Criteria: **`evergreen`** - Content remains valuable indefinitely - Fundamental concepts, principles, theories - Historical information and established facts - Skills and techniques that don't change - Reference materials with lasting value - Example: Mathematical proofs, language grammar guides, classical literature analysis, basic cooking techniques **`slowly_changing`** - Content remains valuable for years - Best practices that evolve slowly - Technical content that updates every few years - Cultural and social topics with gradual change - Example: Programming language tutorials, academic textbooks, industry standards, educational curricula **`regularly_updating`** - Content valuable for months to a year - Industry trends and market analysis - Technology reviews and comparisons - Policy discussions and current research - Example: Software framework guides, business strategies, product reviews, research summaries **`time_sensitive`** - Content value degrades quickly - News and current events - Time-bound information (prices, schedules, availability) - Temporary situations or short-term trends - Real-time data and statistics - Example: Stock prices, weather reports, breaking news, event announcements, sales/promotions ##### Key Decision Points: - **Core question**: If someone reads this in 2 years, will it still be valuable? - **Update frequency**: How often does this type of information typically change? - **Temporal references**: Does the content heavily reference "now," "recently," "currently"? - **Subject matter stability**: Is this about unchanging principles or evolving situations? --- ### Safety and Compliance #### 15. Content Safety **What we're measuring**: Presence of inappropriate, harmful, or legally problematic content. ##### Values & Criteria: **`safe`** - Appropriate for all contexts - No concerning content of any type - Professional, appropriate language throughout - Suitable for general audiences including workplace settings **`mild_concerns`** - Minor issues that don't constitute major problems - Occasional mild profanity in context - Brief mentions of sensitive topics handled appropriately - Minor concerns that don't affect overall suitability - Example: Historical discussions of sensitive topics, professional content with mild language **`nsfw`** - Not safe for work or general audiences - Explicit sexual content or graphic descriptions - Adult themes requiring content warnings - Graphic violence or disturbing imagery descriptions - Example: Adult content, graphic medical descriptions, explicit violence **`harmful`** - Potentially harmful content requiring careful handling - Content promoting dangerous activities or self-harm - Hate speech targeting individuals or groups - Violent content glorifying harm to others - Example: Self-harm content, hate speech, dangerous "how-to" guides **`illegal`** - Illegal content requiring immediate rejection - Content promoting clearly illegal activities - Material that violates laws in major jurisdictions - Example: Terrorist content, child exploitation ##### Safety Assessment Guidelines: - **Context matters**: Medical/educational discussions of sensitive topics may be appropriate - **Intent matters**: Discussing harmful topics for educational purposes vs. promoting them - **Audience consideration**: Content appropriate for experts may not be safe for general audiences --- #### 16. PII Presence **What we're measuring**: Whether the content contains personally identifiable information that could identify private individuals. ##### Values & Criteria: **`no_pii`** - No personal information detected - No names of private individuals - No contact information (emails, phones, addresses) - No identification numbers - Public figures and officials mentioned by name are acceptable - Example: News articles about politicians, technical documentation, general information **`contains_pii`** - Contains potentially identifiable information - Names of private individuals (non-public figures) - Email addresses, phone numbers, physical addresses - ID numbers (SSN, passport, driver's license, employee IDs) - Medical information about identifiable individuals - Financial account information - Example: Personal blogs with full names, leaked databases, medical case studies with identifying info ##### Key Decision Points: - **Public vs. Private figures**: Politicians, celebrities, CEOs = public (no PII flag); private citizens = PII - **Context matters**: Academic paper authors and their institutional emails = typically no PII; personal emails in forums = PII - **Aggregated vs. Individual**: Statistical data = no PII; individual records = PII --- ### Geographic Relevance #### 17. Regional Relevance **What we're measuring**: Primary regional, cultural, or geopolitical sphere(s) that the content relates to, regardless of language used. **Multi-regional content**: Content can be assigned multiple regional labels if it genuinely spans multiple regions. Choose ALL applicable regions rather than forcing a single primary choice. Always output an array for this property, even if only one region applies. ##### Values & Criteria: **`european`** - European context (EU and broader Europe) - Content about European countries, EU policies, or pan-European topics - European cultural perspectives, social systems, or business practices - References to European cities, institutions, companies, or regulations - Includes: EU member states, UK, Switzerland, Norway, Balkans, etc. - Example: GDPR compliance, European Parliament elections, Schengen area travel, European football leagues **`north_american`** - North American context - Content about US, Canada, or Mexico - North American cultural perspectives, USMCA/NAFTA region topics - References to North American institutions, companies, or issues - Example: FDA regulations, Silicon Valley tech, NHL, US constitutional law, Canadian healthcare **`east_asian`** - East Asian context - Content about China, Japan, Korea (North/South), Taiwan, Mongolia - East Asian cultural perspectives, Confucian-influenced societies - References to East Asian economic models, companies, or social systems - Example: Gaokao exams, K-pop, Shenzhen tech hub, Japanese work culture, Taiwan semiconductor industry **`south_asian`** - South Asian context - Content about India, Pakistan, Bangladesh, Sri Lanka, Nepal, Bhutan, Afghanistan, Maldives - South Asian cultural perspectives, subcontinental issues - References to South Asian institutions, economies, or social structures - Example: IIT entrance exams, Bollywood, cricket leagues, monsoon impacts, caste system discussions **`southeast_asian`** - Southeast Asian context - Content about ASEAN countries (Indonesia, Thailand, Vietnam, Philippines, Malaysia, Singapore, etc.) - Southeast Asian regional perspectives and economic integration - References to ASEAN policies, regional companies, or cultural phenomena - Example: ASEAN economic community, Indonesian elections, Singapore financial sector, Thai tourism **`middle_eastern`** - Middle Eastern and North African context - Content about Arab states, Iran, Turkey, Israel, North Africa (MENA region) - Middle Eastern cultural perspectives, Islamic finance, regional conflicts - References to Middle Eastern institutions, oil economies, or geopolitics - Example: Gulf Cooperation Council, OPEC decisions, Middle East peace process, Islamic banking **`sub_saharan_african`** - Sub-Saharan African context - Content about African countries south of the Sahara - African Union topics, sub-Saharan development issues - References to African institutions, economies, or cultural topics - Example: M-Pesa mobile banking, African Union policies, safari tourism, ubuntu philosophy **`latin_american`** - Latin American context - Content about Central and South America, Caribbean - Latin American cultural perspectives, regional integration (Mercosur, etc.) - References to Latin American institutions, economies, or social movements - Example: Mercosur trade, telenovelas, Amazon rainforest, Latin American revolutions **`oceanian`** - Oceanian context - Content about Australia, New Zealand, Pacific Island nations - Oceanian perspectives, Pacific regional issues - References to Oceanian institutions, companies, or cultural topics - Example: ANZAC relations, Pacific Island climate change, Australian mining, Māori culture **`central_asian`** - Central Asian context - Content about Kazakhstan, Uzbekistan, Turkmenistan, Tajikistan, Kyrgyzstan - Central Asian perspectives, post-Soviet regional dynamics - Silk Road region, resource economies, nomadic heritage - Example: Silk Road initiatives, Caspian Sea resources, post-Soviet transitions **`russian_sphere`** - Russian/Post-Soviet context - Content about Russia, Belarus, and strong Russian influence areas - Post-Soviet perspectives, CIS (Commonwealth of Independent States) topics - Russian language content about regional (not global) topics - Example: Russian federal politics, CIS integration, post-Soviet economic transitions **`global`** - Genuinely international or universal - Content with truly global scope or application - International organizations, worldwide phenomena, global comparisons - Topics that transcend regional boundaries - Example: UN reports, climate change (global perspective), international standards, pandemic response **`culturally_neutral`** - No clear regional focus - Abstract, theoretical, or technical content without regional markers - Universal scientific, mathematical, or philosophical content - Content that could apply equally anywhere without modification - Example: Mathematical proofs, chemical formulas, abstract philosophy, programming concepts **`indeterminate`** - Cannot determine regional relevance - Insufficient content to identify regional focus - Mixed or contradictory regional signals - Fragment or corrupted content lacking regional context - Example: Technical specifications without context, isolated data tables ##### Multi-Regional Examples: - **EU-China trade relations** → `["european", "east_asian"]` - **NAFTA/USMCA impact on Mexican agriculture** → `["north_american", "latin_american"]` - **Indian diaspora in the Gulf states** → `["south_asian", "middle_eastern"]` - **Comparative study of healthcare systems globally** → `["global"]` ##### Regional Identification Guidelines: **Primary indicators:** - **Geographic references**: Countries, cities, regions, landmarks mentioned - **Institutional references**: Governments, companies, universities, organizations specific to region - **Cultural markers**: Holidays, customs, cultural phenomena, sports, entertainment - **Political/economic systems**: References to regional political structures, economic blocs - **Legal/regulatory frameworks**: Region-specific laws, regulations, standards - **Language context**: While not determinative, language can provide regional hints **Important distinctions:** - **Language ≠ Region**: Spanish content about Asian markets = `["east_asian"]`, not `["latin_american"]` - **Company origin vs. topic**: Apple (US company) operating in India = consider actual content focus - **Historical vs. current**: Historical content about ancient Rome = `["european"]` if discussing modern implications - **Diaspora content**: Content about diaspora communities should include both origin and current regions **Quality checks:** - If content is in a non-English language but discusses global topics → still mark as `["global"]` - If content compares multiple regions → mark all regions discussed substantially - If content is about a specific place but has universal applications → consider both regional and global tags --- #### 18. Country Relevance **What we're measuring**: Which specific country or countries (if any) the content is relevant to, globally. **Note**: Always output an array of country names for this property (even when only a single country applies). Use standard country names from any region worldwide (e.g., "germany", "france", "united_states", "united_kingdom", "china", "japan", "brazil", "india", "south_africa", "australia", "canada", etc.). The array may also contain the special values `supranational` or `none`. ##### Values & Criteria: **`{COUNTRY_NAME}`** - Content specifically relevant to a single country - Content explicitly about that country's politics, culture, institutions, or regulations - Content written from that country's cultural perspective - Content addressing that country's specific issues, regulations, or cultural phenomena - Content about that country's cities, companies, institutions, or country-specific topics - Example: For "germany" → German election coverage, Bundesliga content, German legal analysis - Example: For "united_states" → US election coverage, NFL content, US legal analysis - Example: For "japan" → Japanese politics, J-League content, Japanese cultural analysis - Only use country names listed in ISO-3166. Use "united_kingdom" instead of "england", "wales", etc. **`supranational`** - For content focused on supranational entities or regions - International organizations, regional blocs, global institutions - Content about supranational policies, international organizations, global governance - Pan-regional analysis that transcends individual countries - Multi-continental or global institutional content - Example: UN resolutions, NATO discussions, EU policy analysis, ASEAN agreements, WTO trade rules **`none`** - For content not specifically relevant to any country - Abstract, theoretical, or universal content without geographic specificity - Technical/scientific content that applies globally without country focus - Content that doesn't reference specific countries, cultures, or national contexts - Example: Mathematical proofs, universal scientific principles, abstract philosophical discussions ##### Country Identification Criteria: - **Political content**: Elections, government policies, political parties, political figures specific to the country - **Cultural content**: National traditions, cultural phenomena, historical events specific to the country - **Institutional references**: Government bodies, national companies, universities specific to the country - **Geographic focus**: Cities, regions, landmarks within the country as primary subjects - **Legal/regulatory**: Laws, regulations, legal frameworks specific to the country - **Economic content**: National economic policies, country-specific market analysis - **Sports/media**: National sports leagues, national teams, country-specific media outlets - **Social issues**: Social policies, demographic topics, social movements specific to the country ---