## Detailed Property Descriptions & Annotation Guidelines
### Core Content Properties
#### 1. Content Integrity
**What we're measuring**: Completeness and technical quality of the content itself, regardless of navigation ratio.
##### Values & Criteria:
**`complete`** - Full, intact content as intended
- Content appears complete with proper beginning, middle, and end
- All essential elements present (introduction, body, conclusion where appropriate)
- No obvious truncation or missing sections
- Example: Complete articles, full tutorials, intact documents
**`mostly_complete`** - Minor elements missing but core content intact
- Core content is complete but some secondary elements may be missing
- Minor truncation that doesn't affect main message
- Example: Article with truncated comments, missing sidebar content, partial author bio
**`fragment`** - Incomplete content, missing significant portions
- Missing introduction, conclusion, or substantial middle sections
- Truncated mid-sentence or mid-paragraph
- Content feels incomplete or cut off
- Example: Search result snippets, article excerpts, broken crawls, partial downloads
**`severely_degraded`** - Broken, unreadable, or corrupted content
- Encoding errors, scrambled text, missing characters
- Severely malformed HTML rendering as gibberish
- Technical corruption making content unreadable
- Example: <20>&$^%*@# characters, completely broken formatting, corrupted files
##### Key Decision Points:
- **Content completeness**: Does the content feel like a complete unit of information?
- **Technical integrity**: Is the content technically readable and properly formatted?
- **Fragment vs. complete**: Independent of navigation - is the actual content complete?
- **Degraded vs. fragment**: Degraded has technical issues; fragment is just incomplete
**Note**: Documents may end with the special tag `<content_truncated>`, indicating upstream length-based truncation due to processing constraints. Do not penalize Content Integrity due to this truncation signal; assess integrity based on the visible content's coherence and technical readability, ignoring the artificial cutoff.
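The truncation rule in the note above can be applied mechanically before any integrity judgment is made. The following is a minimal sketch, assuming the tag appears literally as the last non-whitespace token of the document; the helper name `strip_truncation_tag` is hypothetical and not part of any existing tool.

```python
from typing import Tuple

TRUNCATION_TAG = "<content_truncated>"


def strip_truncation_tag(text: str) -> Tuple[str, bool]:
    """Return (visible_text, was_truncated).

    Assumption: when present, the tag is the last non-whitespace token.
    """
    stripped = text.rstrip()
    if stripped.endswith(TRUNCATION_TAG):
        # Drop the tag and any whitespace that preceded it.
        return stripped[: -len(TRUNCATION_TAG)].rstrip(), True
    return text, False
```

Content Integrity would then be assessed on the returned visible text only, with the boolean kept as metadata rather than used as a penalty.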
---
#### 2. Content Ratio
**What we're measuring**: How much of the document is actual content vs. navigation, UI elements, and structural markup.
##### Values & Criteria:
**`complete_content`** - 90-100% meaningful content
- Full articles, papers, tutorials with minimal navigation
- Clean text with proper paragraphs and structure
- Example: A Wikipedia article, academic paper, complete blog post
**`mostly_content`** - 70-89% meaningful content
- Complete documents with some navigation elements (header, footer, sidebar)
- Minor UI elements that don't disrupt reading
- Example: News articles with standard website navigation
**`mixed_content`** - 40-69% meaningful content
- Significant navigation mixed throughout content
- Multiple sidebars, ads, or UI elements interrupting text
- Example: E-commerce product pages with reviews mixed with purchase options
**`mostly_navigation`** - 10-39% meaningful content
- Predominantly menus, links, headers, footers
- Content overwhelmed by structural elements
- Example: Site maps, navigation pages, heavily UI-focused pages
**`minimal_content`** - 0-9% meaningful content
- Almost entirely navigation, UI elements, or structural markup
- Very little readable content present
- Example: Empty pages, pure navigation menus, error pages with minimal text
##### Key Decision Points:
- Focus on the **ratio of readable text to navigation/UI elements**
- **Count only substantive content**; ignore boilerplate and structural elements
- **Mixed vs. mostly_navigation**: Can you read it as coherent content despite distractions?
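The percentage bands above translate directly into a lookup. The sketch below assumes the ratio is measured in characters and that content has already been separated from navigation by some other means; `content_percent` and `content_ratio_label` are hypothetical helper names. Lower bounds are treated as inclusive, so a value such as 89.5 lands in `mostly_content`.

```python
def content_percent(content_chars: int, total_chars: int) -> float:
    """Share of meaningful content, in percent (assumes character counts)."""
    if total_chars <= 0:
        return 0.0
    return 100 * content_chars / total_chars


def content_ratio_label(percent: float) -> str:
    """Map a content percentage (0-100) to a Content Ratio label."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent >= 90:
        return "complete_content"
    if percent >= 70:
        return "mostly_content"
    if percent >= 40:
        return "mixed_content"
    if percent >= 10:
        return "mostly_navigation"
    return "minimal_content"
```

The numeric band is only a starting point; the decision points above still apply when a page sits near a boundary.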
---
#### 3. Content Length
**What we're measuring**: Amount of substantive content, ignoring navigation and boilerplate.
##### Values & Criteria:
**`substantial`** - 2,000+ words of meaningful content
- Long-form, comprehensive content that provides in-depth coverage of a topic
- Typically includes detailed analysis, multiple sections or chapters, extensive research, or thorough exploration of complex subjects
- Examples: White papers, research reports, e-books, long-form journalism
**`moderate`** - 500-2,000 words of meaningful content
- Standard-length content that offers meaningful coverage while remaining focused and digestible
- Balances depth with accessibility; provides enough detail to be informative without overwhelming readers
- Examples: Typical blog posts, news articles, product reviews, how-to guides
**`brief`** - 100-500 words of meaningful content
- Short, focused content that delivers key information quickly and efficiently
- Gets straight to the point while still providing value and context
- Examples: News briefs, product descriptions, FAQs, short blog posts
**`minimal`** - Under 100 words of meaningful content
- Very short content that provides only essential information or serves as a quick reference
- Designed for rapid consumption or specific micro-purposes
- Examples: Social media posts, announcements, abstracts, snippets, navigation pages
##### Measurement Tips:
- **Count only readable content of value**: include article body and substantive headings/captions; exclude headers/footers, menus/sidebars, related links, share/consent UI, pagination, ads, and boilerplate.
- **Focus on substantive information**, not filler words
- **Complete thoughts matter more than exact word counts**
- **Contextual adjustment**: Thresholds are guidelines and can be adjusted based on specific use cases and typical content. Academic contexts may shift ranges upward, while social media contexts may shift them downward.
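The word-count bands can be sketched as a small function. The guidelines list 500 and 2,000 as endpoints of two adjacent bands; assigning those boundary values to the higher band is an assumption of this sketch, as is the hypothetical `thresholds` parameter, which models the contextual adjustment described above.

```python
from typing import Tuple


def content_length_label(
    word_count: int,
    thresholds: Tuple[int, int, int] = (100, 500, 2000),
) -> str:
    """Map a count of meaningful words to a Content Length label.

    thresholds = (brief_min, moderate_min, substantial_min).
    Assumption: boundary values belong to the higher band.
    """
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    brief_min, moderate_min, substantial_min = thresholds
    if word_count >= substantial_min:
        return "substantial"
    if word_count >= moderate_min:
        return "moderate"
    if word_count >= brief_min:
        return "brief"
    return "minimal"
```

For example, an academic corpus might pass `thresholds=(200, 1000, 4000)` to shift every band upward.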
---
### Content Classification
#### 4. One-Sentence Description
**What we're looking for**: A very short, neutral description of what the document contains.
##### Field:
**`one_sentence_description`**
- Ultra-short neutral description of the document
- Exactly one sentence
- Target length: <100 characters
- Focus on the main topic and, if useful, the document's function
- Examples of functions: tutorial, policy, news report, product page, navigation page
- Neutral, descriptive tone (no hype or marketing language)
##### To Avoid:
- Boilerplate intros: "This document...", "This article...", "In this guide..."
- Calls to action: "Learn how to...", "Discover...", "Find out..."
- User-facing phrasing: "You will learn...", "How do I..."
- Non-essential details (dates, numbers) unless central to the topic
##### Examples:
- "Beginner tutorial on React hooks and basic state management."
- "News report on European Central Bank interest rate decisions."
- "Internal policy for customer data retention and deletion."
- "API reference for payment processing endpoints and error codes."
- "Research paper analyzing housing price trends in major US cities."
- "FAQ answering common questions about employee parental leave."
- "Opinion essay arguing for stricter international climate change legislation."
##### Examples for low-quality or problematic documents (still annotate):
- "Fragment of article discussing proposed changes to European data privacy laws."
- "Keyword-stuffed promotional page about cheap car insurance quotes."
- "Website navigation page listing links to product categories and help pages."
- "Error page explaining that the requested resource could not be found."
- "Affiliate landing page promoting multiple online casino bonus offers."
- "Corrupted text with no identifiable topic or meaningful content."
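The constraints on `one_sentence_description` lend themselves to a quick automated check. This is a rough sketch only: the function name `check_description` is hypothetical, the opener list is taken from the "To Avoid" examples, and the one-sentence test is a simple punctuation heuristic that can misfire on abbreviations such as "U.S. cities".

```python
import re
from typing import List

# Openers drawn from the "To Avoid" list; matched case-insensitively.
DISCOURAGED_OPENER = re.compile(
    r"^(?:this document|this article|in this guide|learn how to|"
    r"discover|find out|you will learn|how do i)\b",
    re.IGNORECASE,
)


def check_description(description: str, max_chars: int = 100) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    text = description.strip()
    if len(text) >= max_chars:
        problems.append("too long")
    # Heuristic: sentence-ending punctuation followed by more text.
    if re.search(r"[.!?]\s+\S", text):
        problems.append("more than one sentence")
    if DISCOURAGED_OPENER.search(text):
        problems.append("starts with a discouraged opener")
    return problems
```

A check like this catches format violations; neutrality of tone still needs a human or model judgment.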
---
#### 5. Content Type
**What we're measuring**: The functional structure and purpose of content.
**Multi-type content**: Content can be assigned multiple type labels if it genuinely serves multiple purposes. Choose ALL applicable types rather than forcing a single primary choice. Always output an array for this property, even if only one type applies.
##### Values & Criteria:
**`analytical`** - In-depth analysis, research, and critical examination
- Provides detailed analysis or research on a topic
- Develops arguments, evaluates evidence, or presents findings
- Example: Research analysis, investigative reports, academic articles, expert commentary
**`instructional`** - Teaching and how-to content
- Explicitly teaches skills, concepts, or procedures
- Step-by-step guidance or educational explanations
- Example: Tutorials, how-to guides, educational content, training materials
**`reference`** - Lookup materials, definitions, specifications
- Designed for looking up specific information rather than reading through
- Often organized alphabetically, categorically, or as lists
- Example: Dictionaries, encyclopedias, API references, product catalogs
**`procedural`** - Step-by-step processes and procedures
- Sequential instructions or workflows
- Process documentation with clear steps
- Example: Recipes, installation guides, standard operating procedures, workflows
**`qa_structured`** - Structured question-answer content
- Formal Q&A format with clear questions and answers
- Often expert responses to specific questions
- Example: Stack Overflow, FAQ sections, structured Q&A sites
**`conversational`** - Multi-party or turn-based dialogues (humans, bots, or both)
- Casual or structured conversations between two or more participants
- May include human-AI chats, forum threads, or comment chains
- Example: Reddit threads, forum discussions, support chats, assistant chat logs
**`creative`** - Entertainment, artistic, fictional content
- Primary purpose is entertainment or artistic expression
- Not primarily informational or instructional
- Example: Short stories, poems, movie reviews, game content, fiction
**`transactional`** - Commercial, shopping, service-oriented
- Primary purpose is to facilitate a transaction or service
- Focuses on products, services, or business processes
- Example: Product listings, service descriptions, checkout pages
**`boilerplate`** - Legal, policy, standard template text
- Standard legal or policy language
- Often repeated across multiple sites with minimal variation
- Example: Terms of service, privacy policies, disclaimers, cookie banners, standard notices
**`news_report`** - Straight reporting of events with minimal analysis
- Describes events or facts in a neutral, descriptive tone
- Time-bound news, updates, or reports
- Example: Wire-service news articles, breaking-news updates
**`opinion_editorial`** - Persuasive/opinionated commentary or editorials
- Expresses a stance or argument; aims to persuade
- May cite evidence but prioritizes viewpoint
- Example: Op-eds, opinion columns, personal essays with clear stance
**`review_critique`** - Evaluative reviews of products, media, or services
- Provides judgments, ratings, or critiques
- May include pros/cons, scoring systems
- Example: Product reviews, film/book critiques, app store reviews (long-form)
**`technical_documentation`** - Manuals, API docs, developer guides, READMEs
- Primary goal is to instruct usage of software/hardware/APIs
- Includes reference sections, examples, parameters, version notes
- Example: API reference, library README, user manual
**`specification_standard`** - Normative standards and formal specifications
- Defines requirements, must/shall language, compliance criteria
- Maintained by standards bodies or authoritative groups
- Example: RFCs, ISO standards, formal protocol specs
**`legal_document`** - Statutes, case law, contracts, regulatory texts
- Binding or authoritative legal content
- Formal legal language and structure
- Example: Court opinions, legislation, contracts, regulatory rules
**`press_release`** - Organization-issued announcements and PR materials
- Promotional announcements framed as information
- Quotes from executives, product/service announcements
- Example: Company press releases, launch announcements
**`structured_data`** - Tables, datasets, indices, catalogs with minimal prose
- Predominantly tabular/listed data meant for lookup
- Minimal narrative or explanatory text
- Example: Product catalogs, schedules, statistical tables
**`source_code`** - Code listings as primary content
- Dominant content is program source code or scripts
- May include lightweight comments or snippets without narrative
- Example: Code files, gist-like pages, competitive programming solutions
##### Multi-Type Examples:
- **Tutorial that analyzes different approaches** → `["instructional", "analytical"]`
- **Educational reference manual** → `["instructional", "reference"]`
- **Research paper with step-by-step methodology** → `["analytical", "procedural"]`
- **Q&A site with analytical responses** → `["qa_structured", "analytical"]`
- **API guide with examples** → `["technical_documentation", "reference", "instructional"]`
- **RFC with rationale** → `["specification_standard", "analytical"]`
- **Film review with interview snippets** → `["review_critique", "conversational"]`
- **Helpdesk chat with an AI** → `["conversational", "transactional"]`
- **Breaking news explainer** → `["news_report", "analytical"]`
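Because this property must always be an array drawn from a fixed vocabulary, annotations can be normalized and validated before use. The sketch below is one possible approach; `normalize_content_types` is a hypothetical helper, and rejecting an empty array is an assumption rather than a stated rule.

```python
from typing import Iterable, List, Union

CONTENT_TYPES = frozenset({
    "analytical", "instructional", "reference", "procedural",
    "qa_structured", "conversational", "creative", "transactional",
    "boilerplate", "news_report", "opinion_editorial", "review_critique",
    "technical_documentation", "specification_standard", "legal_document",
    "press_release", "structured_data", "source_code",
})


def normalize_content_types(value: Union[str, Iterable[str]]) -> List[str]:
    """Coerce a label or labels to a de-duplicated list of known types."""
    labels = [value] if isinstance(value, str) else list(value)
    if not labels:
        # Assumption: every document receives at least one type.
        raise ValueError("at least one content type is required")
    unknown = sorted(set(labels) - CONTENT_TYPES)
    if unknown:
        raise ValueError(f"unknown content type(s): {unknown}")
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(labels))
```

A bare string such as `"reference"` comes back as `["reference"]`, and a label outside the vocabulary (for instance `"tutorial"`) raises an error instead of passing through silently.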
---
#### 6. Business Sector
**What we're measuring**: Business sector(s) or industry domain(s) for training sector-specific LLMs.
**Multi-sector content**: Content can be assigned multiple sector labels if it genuinely spans multiple industries. Choose ALL applicable sectors rather than forcing a single primary choice or using "other". Always output an array for this property, even if only one sector applies.
##### Values & Criteria:
**`academic_research`** - Scholarly and research content
- Peer-reviewed publications, academic papers
- University-affiliated research and scholarship
- Formal academic discourse and methodology
- Example: Journal articles, conference papers, academic books, dissertations
**`education_sector`** - Educational institutions and pedagogy
- K-12 education, higher education administration
- Educational technology, curriculum development
- Teaching methodologies and educational policy
- Example: School curricula, educational policy papers, teaching resources, edtech content
**`technology_software`** - Software and information technology
- Software development, programming, IT services
- Digital products, platforms, and technology companies
- Computer science and software engineering
- Example: Software documentation, tech company content, programming guides, IT industry analysis
**`hardware_electronics`** - Hardware devices and electronics industry
- Semiconductors, consumer electronics, embedded systems, hardware design
- Electronics manufacturing and supply chains
- Example: Chip design docs, hardware datasheets, device manuals
**`healthcare_medical`** - Healthcare and medical sector
- Medical research, clinical practice, healthcare delivery
- Hospitals, medical devices, healthcare policy
- Public health and wellness
- Example: Medical journals, clinical guidelines, healthcare administration, wellness content
**`pharmaceutical_biotech`** - Pharmaceutical and biotechnology
- Drug development, clinical trials, biotech research
- Pharmaceutical industry, biotechnology companies
- Life sciences and molecular biology applications
- Example: Drug research papers, clinical trial reports, biotech industry analysis
**`financial_services`** - Banking and financial services
- Banking, investment, insurance, financial planning
- Financial markets, fintech, payment systems
- Asset management and financial advisory
- Example: Financial analysis, banking documentation, investment guides
**`legal_services`** - Legal sector and jurisprudence
- Law firms, legal practice, court systems
- Legal education, regulatory compliance
- Litigation, contracts, legal advisory
- Example: Legal briefs, court opinions, legal analysis, compliance guides
**`government_public`** - Government and public administration
- Government agencies, public policy, civic services
- Regulatory bodies, public administration
- Political institutions and governance
- Example: Government reports, policy documents, regulatory filings, civic information
**`manufacturing_industrial`** - Manufacturing and heavy industry
- Industrial production, manufacturing processes
- Supply chain, logistics, industrial equipment
- Factory operations and industrial engineering
- Example: Manufacturing specs, industrial reports, supply chain analysis, production guides
**`mining_resources`** - Mining and natural resources
- Exploration, extraction, and processing of minerals and resources
- Resource markets and operations (metals, rare earths)
- Example: Mining reports, resource exploration docs, commodity operations
**`chemicals_materials`** - Chemicals and advanced materials
- Petrochemicals, specialty chemicals, polymers, composites, advanced materials
- Safety data sheets (SDS), process chemistry, materials science
- Example: Material datasheets, REACH documentation, chemical process guides
**`energy_utilities`** - Energy and utilities sector
- Power generation, renewable energy, oil and gas
- Electric utilities, water services, waste management
- Energy infrastructure and grid management
- Example: Energy industry reports, utility regulations, renewable energy research
**`retail_commerce`** - Retail and e-commerce
- Retail operations, e-commerce platforms
- Consumer goods distribution, merchandising
- Retail technology and customer experience
- Example: Retail industry analysis, e-commerce guides, merchandising strategies
**`wholesale_distribution`** - Wholesale trade and distribution
- B2B wholesale, distributors, procurement, inventory and fulfillment
- Supply relationships between manufacturers and retailers
- Example: Distributor catalogs, wholesale operations, procurement guides
**`real_estate_construction`** - Real estate and construction
- Property development, construction industry
- Real estate markets, property management
- Architecture and building services
- Example: Real estate analysis, construction specifications, property guides
**`transportation_logistics`** - Transportation and logistics
- Airlines, shipping, freight, public transit
- Logistics operations, supply chain transportation
- Vehicle fleet management, transportation infrastructure
- Example: Logistics guides, transportation planning, shipping documentation
**`travel_aviation`** - Travel industry and commercial aviation
- Airlines, airports, OTA platforms, hospitality travel operations
- Route planning, airline commercial, loyalty, IATA regulations
- Example: Airline scheduling, fare rules, OTA partner docs
**`automotive_industry`** - Automotive manufacturing and services
- Vehicle manufacturers, automotive suppliers
- Automotive technology, electric vehicles
- Dealerships and automotive services
- Example: Automotive engineering docs, vehicle technology papers, industry analysis
**`telecommunications`** - Telecommunications industry
- Telecom operators, network infrastructure
- Mobile services, broadband, satellite communications
- Telecommunications equipment and technology
- Example: Telecom industry reports, network specifications, 5G technology papers
**`media_entertainment`** - Media and entertainment industry
- Film, television, music, gaming industries
- Publishing, news media, content creation
- Streaming services and digital media
- Example: Entertainment industry analysis, media studies, content strategy
**`gaming_industry`** - Video games and interactive entertainment
- Game development, studios, engines, esports, live ops
- Monetization models, community management, platform ecosystems
- Example: Patch notes, game design docs, esports operations
**`gambling_betting`** - Gambling, betting, and online casinos
- Sportsbooks, casino games, lotteries, poker rooms
- Affiliate landing pages, bonus/promotions, tipster content
- Often high commercial bias and promotional framing
**`advertising_marketing`** - Advertising, marketing, and PR
- Brand strategy, campaign planning, performance marketing, martech
- Agencies, in-house marketing, PR communications
- Example: Campaign briefs, media plans, PR strategies
**`hospitality_tourism`** - Hospitality and tourism sector
- Hotels, restaurants, travel services
- Tourism industry, destination management
- Event planning and hospitality services
- Example: Tourism studies, hospitality management, travel industry reports
**`food_beverage_hospitality`** - Food & beverage and restaurant operations
- Restaurant ops, menu engineering, supply chain, QSR/fast casual
- Food safety, compliance, procurement for F&B
- Example: Restaurant training manuals, HACCP docs, vendor specs
**`agriculture_food`** - Agriculture and food production
- Farming, agricultural technology, food processing
- Agricultural supply chain, food safety
- Agribusiness and agricultural policy
- Example: Agricultural research, food industry reports, farming guides
**`environmental_services`** - Environmental and sustainability services
- Environmental consulting, ESG reporting, sustainability programs
- Waste management services, remediation, impact assessments
- Example: ESG reports, environmental impact assessments, sustainability frameworks
**`aerospace_defense`** - Aerospace and defense industry
- Aircraft manufacturing, space technology
- Defense contractors, military systems
- Aviation and space exploration
- Example: Aerospace engineering papers, defense industry analysis, aviation guides
**`insurance_industry`** - Insurance sector
- Life, health, property, and casualty insurance
- Reinsurance, actuarial science, risk assessment
- Insurance technology and underwriting
- Example: Actuarial studies, insurance policy analysis, risk management guides
**`nonprofit_ngo`** - Nonprofit and NGO sector
- Charitable organizations, international development
- Social services, humanitarian organizations
- Foundations and philanthropic institutions
- Example: NGO reports, nonprofit management, development studies
**`consulting_professional`** - Professional services and consulting
- Management consulting, accounting firms
- Business advisory, professional services firms
- Corporate strategy and business transformation
- Example: Consulting reports, professional services guides, business strategy papers
**`human_resources`** - HR and people operations
- Talent acquisition, compensation & benefits, performance management, L&D
- HR tech, workforce planning, organizational development
- Example: HR policy docs, job frameworks, talent strategy
**`security_cyber`** - Security and cybersecurity
- Information security, threat intelligence, risk management, compliance (e.g., SOC2)
- Physical security operations and incident response
- Example: Security guidelines, incident playbooks, vulnerability reports
**`consumer_goods`** - Consumer products and CPG
- Fast-moving consumer goods, household products
- Personal care, food and beverage brands
- Consumer product development and marketing
- Example: CPG industry analysis, product development docs, consumer research
**`general_interest`** - General audience content
- Content for broad audiences without sector focus
- General knowledge and miscellaneous topics
- Cross-sector or sector-agnostic content
- Example: General magazines, broad interest content, lifestyle articles
**`other`** - Highly specialized or unclassifiable
- Highly specialized niches not covered by existing sectors
- Content with genuinely unclear sector classification
- Unique content types that don't map to any defined sector
- Example: Highly specialized technical niches, unique content formats
##### Multi-Sector Examples:
- **Medical device regulations** → `["healthcare_medical", "pharmaceutical_biotech", "government_public"]`
- **Fintech software documentation** → `["financial_services", "technology_software"]`
- **Agricultural biotechnology research** → `["agriculture_food", "pharmaceutical_biotech"]`
---
#### 7. Technical Content
**What we're measuring**: Type and intensity of specialized technical knowledge.
**Multi-technical content**: Content can be assigned multiple technical content labels if it genuinely combines multiple technical domains. Choose ALL applicable technical types rather than forcing a single primary choice. Always output an array for this property, even if only one technical type applies.
##### Values & Criteria:
**`code_heavy`** - Significant programming content
- Multiple code examples, algorithms, or implementations
- Technical programming concepts and methodologies
- Software development focus
- Example: Programming tutorials, API documentation, software guides
**`math_heavy`** - Substantial mathematical content
- Mathematical equations, proofs, or statistical analysis
- Quantitative analysis and mathematical reasoning
- Mathematical concepts and methodologies
- Example: Mathematical papers, statistical analysis, quantitative research
**`scientific`** - Research and scientific methodology content
- Scientific research findings, experimental data
- Scientific methodology and analysis
- Peer-reviewed research content
- Example: Research papers, scientific studies, experimental reports
**`data_heavy`** - Substantial datasets, tables, and data analysis
- Contains significant data tables, charts, or datasets
- Focus on data interpretation and analysis
- Statistical content with data presentations
- Example: Research data, statistical reports, data analysis, survey results
**`engineering`** - Engineering and applied technical content
- Engineering design, systems, and applied technical solutions
- Technical specifications for physical systems
- Non-software engineering disciplines
- Example: Mechanical engineering, civil engineering, technical specifications, design documents
**`basic_technical`** - Some technical elements but not dominant
- Light technical content mixed with general explanations
- Technical concepts explained for general audience
- Example: Technology articles for general audience, basic technical explanations
**`non_technical`** - No significant technical content
- General audience content without specialized technical knowledge
- No programming, mathematical, engineering, or scientific focus
- Example: General articles, humanities content, basic informational content
##### Multi-Technical Examples:
- **Data science tutorial with code examples** → `["code_heavy", "math_heavy", "data_heavy"]`
- **Engineering research with statistical analysis** → `["engineering", "scientific", "data_heavy"]`
- **Computational biology paper** → `["code_heavy", "scientific"]`
---
### Quality and Value Assessment
#### 8. Content Quality
**What we're measuring**: Overall quality of content considering writing excellence, substantive value, and presentation quality regardless of authorship origin.
##### Values & Criteria:
**`excellent`** - Outstanding quality across all dimensions
- Sophisticated writing with varied sentence structures and engaging style
- Rich, appropriate vocabulary with error-free grammar and punctuation
- High substantive value with clear insights or information
- Professional presentation and formatting
- Natural flow and logical organization
- Example: High-quality publications, expert analyses, polished educational content, well-crafted professional documents
**`good`** - High quality with minor imperfections
- Grammatically correct with proper sentence structure
- Appropriate vocabulary and tone for content type
- Solid substantive value and clear information
- Good organization and readable flow
- Only occasional minor issues (1-2 typos per section)
- Example: Quality journalism, professional websites, well-written blog posts, solid educational materials
**`adequate`** - Acceptable quality for most purposes
- Generally clear and understandable writing
- Some grammatical errors but meaning remains clear
- Reasonable substantive value though may lack depth
- Basic organization and structure present
- Minor formatting or presentation issues
- Example: Casual blogs, user reviews, basic informational content, simple guides
**`poor`** - Significant quality issues impacting utility
- Multiple errors affecting comprehension or credibility
- Unclear expression, confusing organization, or awkward phrasing
- Limited substantive value or questionable information
- Major formatting problems or unprofessional presentation
- Difficult to extract reliable information
- Example: Low-quality web content, poorly edited materials, confusing instructions
**`unacceptable`** - Quality too low for productive use
- Severely impaired communication with major errors
- Incoherent, nonsensical, or corrupted content
- No reliable substantive value
- Broken formatting or technical corruption
- Cannot determine intended meaning or extract useful information
- Example: Corrupted text, severe translation errors, spam content, SEO content, completely broken formatting
##### Quality Assessment Guidelines:
- **Comprehension**: Can the intended message be clearly understood?
- **Substantive value**: Does the content provide useful information or insights?
- **Technical presentation**: Is the content properly formatted and readable?
- **Error impact**: Do errors significantly impede understanding or credibility?
- **Professional standards**: Does the content meet basic standards for its intended purpose?
**Language-Specific Quality Indicators:**
- For non-Latin scripts (Arabic, Chinese, Japanese): Check for proper character encoding
- For agglutinative languages (Turkish, Finnish): Adjust expectations for word count/density
- For languages with different formality levels (Japanese, Korean): Assess appropriate register
- Mixed-language documents: Evaluate code-switching quality and appropriateness
---
#### 9. Information Density
**What we're measuring**: Ratio of valuable information to redundancy, padding, and repetition.
##### Values & Criteria:
**`dense`** - Efficient, information-packed content
- Every sentence adds new information or insight
- Minimal redundancy or unnecessary elaboration
- Little to no repetition of the same concepts
- Example: Technical specifications, concise academic writing, quality reference material
**`adequate`** - Good information content with reasonable elaboration
- Most content adds value with some acceptable elaboration
- Minimal repetition within the document
- Good balance of information and explanation
- Example: Well-written articles, good tutorials with examples
**`moderate`** - Mixed substantive content with noticeable padding
- Some valuable information mixed with unnecessary elaboration
- Noticeable repetition of key points for emphasis
- Some sections feel padded or verbose
- Example: Blog posts with some fluff, articles with repetitive conclusions
**`thin`** - Low information content with significant problems
- Much content doesn't add new information
- High internal repetition and excessive redundancy
- Significant padding to reach desired length
- Example: SEO-optimized content, poorly edited writing
**`empty`** - Dominated by repetition and meaningless content
- Minimal actual information value
- Dominated by repetition and copy-paste artifacts
- Same ideas repeated multiple times without development
- Example: Spam content, template-filled pages, keyword-stuffed articles
##### Common Repetition Patterns to Watch For:
- **Same phrases repeated throughout** (especially in SEO content)
- **Identical paragraphs** or sections (copy-paste errors)
- **Circular reasoning** (saying the same thing in different ways)
- **Template artifacts** (repeated boilerplate mixed with content)
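Two of the patterns above, identical paragraphs and repeated phrases, can be surfaced with simple counting. The following is a minimal sketch, not a calibrated detector: `repetition_report` is a hypothetical name, paragraphs are assumed to be separated by blank lines, and the 4-word n-gram size is an arbitrary choice.

```python
import re
from collections import Counter
from typing import Dict


def repetition_report(text: str, ngram_size: int = 4) -> Dict[str, float]:
    """Count duplicate paragraphs and the share of repeated word n-grams."""
    paragraphs = [
        p.strip().lower() for p in re.split(r"\n\s*\n", text) if p.strip()
    ]
    # Every copy beyond the first counts as one duplicate.
    duplicate_paragraphs = sum(
        count - 1 for count in Counter(paragraphs).values() if count > 1
    )

    words = re.findall(r"\w+", text.lower())
    ngrams = [
        tuple(words[i:i + ngram_size])
        for i in range(len(words) - ngram_size + 1)
    ]
    # Number of n-gram positions whose n-gram occurs more than once.
    repeated = sum(c for c in Counter(ngrams).values() if c > 1)
    share = repeated / len(ngrams) if ngrams else 0.0
    return {
        "duplicate_paragraphs": duplicate_paragraphs,
        "repeated_ngram_share": share,
    }
```

High values on either signal suggest `thin` or `empty`, but circular reasoning and paraphrased repetition will not show up in counts like these and still need to be read for.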
---
#### 10. Educational Value
**What we're measuring**: Potential for teaching, learning, and knowledge transfer.
##### Values & Criteria:
**`high`** - Clear instructional design and learning objectives
- Explicitly teaches concepts or skills
- Progressive skill building from basic to advanced
- Clear learning objectives and outcomes
- Comprehensive explanations with examples
- Example: Quality tutorials, textbooks, structured courses, educational guides
**`moderate`** - Good instructional value with some learning potential
- Some instructional elements present
- Explanations help build understanding
- Transferable knowledge to other contexts
- Good examples or illustrations
- Example: How-to articles, explanatory content, informative guides
**`basic`** - Limited educational content
- Some explanations but not systematically instructional
- Basic explanations of concepts
- Limited learning potential or skill building
- Example: Basic explanations, simple informational content
**`minimal`** - Little educational value
- Primarily informational rather than instructional
- No clear learning objectives or skill building
- Entertainment or commercial focus
- Example: Entertainment content, basic news, commercial content
**`none`** - No educational content
- No instructional value or learning potential
- Purely transactional, entertainment, or administrative
- No knowledge transfer potential
- Example: Pure entertainment, transactions, legal boilerplate
##### Disambiguation tips
- Explanatory vs Educational: explanations alone ≠ educational design; require intent to teach plus scaffolding for Basic+
- Reference docs: typically Minimal; promote to Basic/Moderate when guided “how-to” segments or curated examples exist
- Reviews/op-eds: None/Minimal unless they include actionable how-to guidance designed for learning
##### Automation heuristics
- Keywords: Objectives/Outcomes, Lesson, Exercise/Quiz, Homework, Assessment, Syllabus, Module, Unit, Learning Goals
- Structure: numbered steps + prerequisites/requirements → Basic; add practice tasks/solutions → Moderate; syllabus/modules/assessments → High
- Signals of non-edu mix: heavy CTAs/ads or product pitches → cap at Minimal unless clear instructional scaffolding
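The keyword heuristic above can be sketched as a simple scan. This is a minimal illustration rather than a reference implementation: the function name `find_edu_keywords`, the whole-word matching, and the optional plural `s` are assumptions; only the keyword list comes from the heuristics.

```python
import re
from typing import List

# Keywords from the heuristics list; the matching rules below are assumptions.
EDU_KEYWORDS = [
    "objectives", "outcomes", "lesson", "exercise", "quiz", "homework",
    "assessment", "syllabus", "module", "unit", "learning goals",
]


def find_edu_keywords(text: str) -> List[str]:
    """Return the keywords that occur in text as whole words, case-insensitively."""
    lowered = text.lower()
    found = []
    for keyword in EDU_KEYWORDS:
        # Allow a trailing "s" so "modules" or "lessons" also count.
        if re.search(r"\b" + re.escape(keyword) + r"s?\b", lowered):
            found.append(keyword)
    return found
```

A keyword hit is only a signal. The structure cues and the cap for heavily commercial pages still decide the final label.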
##### Quick decision tree
- Are there explicit learning goals or a syllabus? → High
- Else, are there step-by-step instructions with examples/exercises? → Moderate
- Else, are there explanatory sections intended to teach basics? → Basic
- Else, is there any minor instructional element? → Minimal
- Otherwise → None
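The decision tree above maps directly onto an ordered chain of checks. The sketch below assumes the four yes/no answers have already been determined by an annotator or an upstream classifier; the function and parameter names are hypothetical.

```python
def educational_value(
    has_goals_or_syllabus: bool,
    has_steps_with_examples: bool,
    has_teaching_explanations: bool,
    has_minor_instruction: bool,
) -> str:
    """Walk the quick decision tree top to bottom and return the first matching label."""
    if has_goals_or_syllabus:
        return "high"
    if has_steps_with_examples:
        return "moderate"
    if has_teaching_explanations:
        return "basic"
    if has_minor_instruction:
        return "minimal"
    return "none"
```

Because the checks run in order, a document with both a syllabus and step-by-step instructions resolves to `high`, matching the "Else" wording of the tree.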
##### Borderline examples
- API reference with examples but no guidance → Minimal to Basic (depending on clarity/examples)
- Blog post explaining concept with analogies and one example → Basic
- Tutorial with tasks, checkpoints, and solutions → High
- Product documentation with “Getting Started” and “How-To” flows → Moderate
##### Educational Indicators:
- **Learning objectives**: Clear goals for what reader should learn
- **Skill progression**: Builds from basic to advanced concepts
- **Examples and practice**: Provides concrete examples or exercises
- **Knowledge transfer**: Concepts applicable beyond immediate context
---
#### 11. Reasoning Indicators
**What we're measuring**: Presence and quality of logical reasoning, analysis, and explanatory content.
##### Values & Criteria:
**`analytical`** - Complex reasoning and systematic analysis
- Multi-step arguments with logical progression
- Cause-effect analysis and systematic thinking
- Considers multiple perspectives or variables
- Draws conclusions from evidence and reasoning
- Example: Research analysis, complex problem-solving, systematic evaluations
**`explanatory`** - Clear explanations with logical flow
- Explains how or why things work
- Shows cause-effect relationships clearly
- Educational reasoning that builds understanding
- Logical connections between concepts
- Example: Good tutorials, educational content, how-to explanations
**`basic_reasoning`** - Simple logical connections
- Some logical connections between ideas
- Basic explanations of concepts or processes
- Elementary analytical thinking
- Simple cause-effect relationships
- Example: Basic explanations, simple arguments, elementary analysis
**`minimal`** - Limited reasoning, mostly descriptive
- Primarily describes what rather than why or how
- Few logical connections between ideas
- Mostly factual statements without analysis
- Little explanatory content
- Example: Basic descriptions, simple factual content, minimal analysis
**`none`** - No clear reasoning present
- Purely descriptive content
- Simple factual listing without connections
- Narrative content without analysis
- No logical argumentation or explanation
- Example: Simple lists, basic narratives, pure description
##### Thinking-trace signals (what to look for)
- Stepwise structure: numbered steps in proofs/derivations/solutions; “First… therefore… hence… so…”
- Hypothesis and test: assumptions, intermediate results, counterexamples, sanity checks
- Tool- or method-calls: named algorithms, theorems, lemmas, or procedures invoked and justified
- Error analysis or reflection: “we tried X, failed because Y, so we…”, “limitations,” “edge cases”
- Intermediate artifacts: scratch calculations, partial code reasoning, sub-problems and sub-claims
##### Disambiguation rules
- Explanatory vs Analytical: explanations tell how; analytical shows multi-step inference with evidence and intermediate claims
- Worked example vs Mere answer: worked examples expose steps and justification; mere answers without steps are not reasoning-rich
- Procedural vs Reasoning: procedural lists actions; reasoning links actions via logic, evidence, or constraints
##### Automation heuristics
- Lexical cues: because, therefore, thus, hence, suppose/assume, we conclude, by induction, lemma/theorem/proof, O(n), hypothesis, counterexample
- Structure cues: presence of proof blocks, derivations (e.g., “Proof.”, “QED”, TeX environments), multi-step numeric calculations
- Program reasoning: code comments like “// invariant”, “// complexity”, pre/post-conditions, test reasoning
- Thresholding: count reasoning cues per 1k tokens; with ≥2 structural cues or ≥5 lexical cues → at least explanatory; proofs/derivations → analytical
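One possible reading of the thresholding rule is sketched below. Several choices are assumptions and are marked as such: the regexes for structure cues, treating a proof opener as sufficient for `analytical`, and not scaling counts upward for texts shorter than 1k tokens (so one "because" in a ten-word sentence does not pass the threshold). Tokens are approximated by whitespace splitting.

```python
import re
from typing import Optional

# Lexical cues from the list above (O(n) is omitted for simplicity).
LEXICAL_CUES = [
    "because", "therefore", "thus", "hence", "suppose", "assume",
    "we conclude", "by induction", "lemma", "theorem", "proof",
    "hypothesis", "counterexample",
]
# Assumed regexes for the structure cues; adjust to the corpus at hand.
PROOF_OPENERS = re.compile(r"\bProof\.|\\begin\{proof\}")
STRUCTURAL_CUES = re.compile(r"\bProof\.|\bQED\b|\\begin\{(?:proof|align|equation)\*?\}")


def reasoning_floor(text: str) -> Optional[str]:
    """Return a lower bound for the Reasoning Indicators label, or None."""
    if PROOF_OPENERS.search(text):
        return "analytical"
    # Scale counts to a rate per 1k tokens; texts under 1k tokens are not scaled up.
    per_1k = 1000 / max(len(text.split()), 1000)
    lowered = text.lower()
    lexical = sum(
        len(re.findall(r"\b" + re.escape(cue) + r"\b", lowered))
        for cue in LEXICAL_CUES
    )
    structural = len(STRUCTURAL_CUES.findall(text))
    if structural * per_1k >= 2 or lexical * per_1k >= 5:
        return "explanatory"
    return None
```

The function returns a floor, not a final label: a `None` result means the cues alone are inconclusive and the decision tree should be applied by hand.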
##### Quick decision tree
- Is there a proof/derivation or multi-step argument with intermediate claims? → analytical
- Else, does it explain why/how with cause-effect and logical links? → explanatory
- Else, are there simple logical connections or one-step justifications? → basic_reasoning
- Else, does it mostly describe without connecting ideas? → minimal (use none when there are no logical connections or explanations at all)
##### Borderline examples
- Answer-only solutions (final numeric result without steps) → minimal
- Step-by-step math solution with intermediate equations → analytical
- “How it works” article connecting 2–3 causal steps without data → explanatory
- Troubleshooting log with attempts and justifications → analytical if causal chain is explicit; otherwise explanatory
##### Key Reasoning Patterns to Identify:
- **Cause-effect**: "Because X, therefore Y"
- **Problem-solution**: Identifies problems and proposes solutions
- **Comparison**: Analyzes similarities and differences
- **Logical progression**: Ideas build on previous ideas
- **Evidence-based conclusions**: Draws conclusions from presented evidence
---
### Audience and Purpose
#### 12. Audience Level
**What we're measuring**: Intended sophistication level and background knowledge assumptions of the target audience.
##### Values & Criteria:
**`expert`** - Highly specialized professional/academic content
- Assumes deep domain expertise and advanced training
- Uses technical terminology without explanation
- Content for practitioners actively working in specialized fields
- Example: Climate modeling methodology in Nature Climate Change, research papers, technical specifications, expert-to-expert communications
**`advanced`** - Educated adult audience with analytical skills
- Assumes higher education and critical thinking ability
- Explains specialized concepts but uses sophisticated language
- Intellectually challenging but accessible to educated generalists
- Example: Complex climate change analysis in The Atlantic, quality journalism, policy analysis, advanced general interest content
**`general`** - General adult audience
- Accessible to most educated adults without specialized background
- Explains technical concepts when introduced
- Uses clear language while maintaining intellectual substance
- Example: Quality journalism, general interest articles, accessible explanations of complex topics
**`beginner`** - Introductory level with minimal prerequisites
- Explains basic concepts and terminology
- Builds up from fundamental principles
- Assumes minimal prior knowledge of the subject area
- Example: Introductory tutorials, beginner guides, basic explanations, getting-started content
**`youth`** - Targeted at teenagers and young adults (ages 13-19)
- Age-appropriate complexity with contemporary cultural references
- Sophisticated enough for developing critical thinking but accessible
- May address topics relevant to adolescent experiences and concerns
- Example: High school educational content, young adult literature, teen-focused explanations, college prep materials
**`children`** - Designed specifically for children
- Simple language and concepts appropriate for young readers
- Educational content designed for elementary/middle school levels
- Age-appropriate topics and complexity
- Example: Children's educational content, elementary school materials, simple explanations for young learners
##### Assessment Guidelines:
- **Professional context**: Is this content designed for workplace use vs. general learning?
- **Terminology density**: How much specialized vocabulary is used without explanation?
- **Concept complexity**: How sophisticated are the ideas and their development?
- **Background assumptions**: What education level and domain knowledge does the author assume?
**Cross-Linguistic Considerations:**
- Expert terminology density varies by language (German allows more compound terms)
- Formality markers differ across cultures
- Educational level assumptions vary by country's education system
- Age-appropriate content differs across cultures
---
#### 13. Commercial Bias
**What we're measuring**: How much commercial interests influence the objectivity and informational value of content.
##### Values & Criteria:
**`none`** - No commercial influence detected
- Objective, informational presentation
- No promotional language or commercial agenda
- Focus purely on informing or educating
- Example: Academic papers, objective journalism, educational content
**`minimal`** - Slight commercial context but maintains objectivity
- May mention products/services but in informational context
- Maintains balanced, objective tone
- Commercial mentions serve informational purpose
- Example: Product reviews with balanced analysis, informational articles mentioning relevant products
**`moderate`** - Some commercial influence on content
- Mix of informational and promotional content
- Some promotional language but still provides useful information
- Commercial interests somewhat visible but not dominant
- Example: Company blogs with useful information, sponsored content with actual value
**`heavy`** - Strong commercial bias throughout
- Primarily promotional with some informational elements
- Heavy use of marketing language and persuasive techniques
- Clear commercial agenda affects content objectivity
- Example: Marketing articles disguised as information, heavily biased product comparisons
**`pure_marketing`** - Entirely commercial/promotional content
- No genuine informational value beyond promotion
- Pure marketing copy or advertising material
- Designed solely to drive sales or conversions
- Example: Sales pages, pure advertising copy, promotional brochures
##### Key Indicators:
- **Language tone**: Objective vs. promotional language
- **Primary purpose**: Inform vs. persuade/sell
- **Balance**: Are alternatives/drawbacks mentioned?
- **Call-to-action**: Subtle information vs. obvious sales pitch
---
#### 14. Time-Sensitivity
**What we're measuring**: How time-sensitive the content is - whether its value degrades over time or remains stable.
##### Values & Criteria:
**`evergreen`** - Content remains valuable indefinitely
- Fundamental concepts, principles, theories
- Historical information and established facts
- Skills and techniques that don't change
- Reference materials with lasting value
- Example: Mathematical proofs, language grammar guides, classical literature analysis, basic cooking techniques
**`slowly_changing`** - Content remains valuable for years
- Best practices that evolve slowly
- Technical content that updates every few years
- Cultural and social topics with gradual change
- Example: Programming language tutorials, academic textbooks, industry standards, educational curricula
**`regularly_updating`** - Content valuable for months to a year
- Industry trends and market analysis
- Technology reviews and comparisons
- Policy discussions and current research
- Example: Software framework guides, business strategies, product reviews, research summaries
**`time_sensitive`** - Content value degrades quickly
- News and current events
- Time-bound information (prices, schedules, availability)
- Temporary situations or short-term trends
- Real-time data and statistics
- Example: Stock prices, weather reports, breaking news, event announcements, sales/promotions
##### Key Decision Points:
- **Core question**: If someone reads this in 2 years, will it still be valuable?
- **Update frequency**: How often does this type of information typically change?
- **Temporal references**: Does the content heavily reference "now," "recently," "currently"?
- **Subject matter stability**: Is this about unchanging principles or evolving situations?
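The temporal-references check can be approximated with a word count. The sketch below is illustrative only: the marker list extends the three words named above with a few assumed near-synonyms, and what counts as "heavy" use is left to the annotator.

```python
import re

# Temporal markers named in the guideline plus a few assumed near-synonyms.
TEMPORAL_MARKERS = ["now", "recently", "currently", "today", "this week", "this year"]


def count_temporal_references(text: str) -> int:
    """Count whole-word, case-insensitive occurrences of temporal markers."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(marker) + r"\b", lowered))
        for marker in TEMPORAL_MARKERS
    )
```

Whole-word matching keeps words such as "known" from being counted as "now".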
---
### Safety and Compliance
#### 15. Content Safety
**What we're measuring**: Presence of inappropriate, harmful, or legally problematic content.
##### Values & Criteria:
**`safe`** - Appropriate for all contexts
- No concerning content of any type
- Professional, appropriate language throughout
- Suitable for general audiences including workplace settings
**`mild_concerns`** - Minor issues that don't constitute major problems
- Occasional mild profanity in context
- Brief mentions of sensitive topics handled appropriately
- Minor concerns that don't affect overall suitability
- Example: Historical discussions of sensitive topics, professional content with mild language
**`nsfw`** - Not safe for work or general audiences
- Explicit sexual content or graphic descriptions
- Adult themes requiring content warnings
- Graphic violence or disturbing imagery descriptions
- Example: Adult content, graphic medical descriptions, explicit violence
**`harmful`** - Potentially harmful content requiring careful handling
- Content promoting dangerous activities or self-harm
- Hate speech targeting individuals or groups
- Violent content glorifying harm to others
- Example: Self-harm content, hate speech, dangerous "how-to" guides
**`illegal`** - Illegal content requiring immediate rejection
- Content promoting clearly illegal activities
- Material that violates laws in major jurisdictions
- Example: Terrorist content, child exploitation
##### Safety Assessment Guidelines:
- **Context matters**: Medical/educational discussions of sensitive topics may be appropriate
- **Intent matters**: Discussing harmful topics for educational purposes vs. promoting them
- **Audience consideration**: Content appropriate for experts may not be safe for general audiences
---
#### 16. PII Presence
**What we're measuring**: Whether the content contains personally identifiable information that could identify private individuals.
##### Values & Criteria:
**`no_pii`** - No personal information detected
- No names of private individuals
- No contact information (emails, phones, addresses)
- No identification numbers
- Public figures and officials mentioned by name are acceptable
- Example: News articles about politicians, technical documentation, general information
**`contains_pii`** - Contains potentially identifiable information
- Names of private individuals (non-public figures)
- Email addresses, phone numbers, physical addresses
- ID numbers (SSN, passport, driver's license, employee IDs)
- Medical information about identifiable individuals
- Financial account information
- Example: Personal blogs with full names, leaked databases, medical case studies with identifying info
##### Key Decision Points:
- **Public vs. Private figures**: Politicians, celebrities, CEOs = public (no PII flag); private citizens = PII
- **Context matters**: Academic paper authors and their institutional emails = typically no PII; personal emails in forums = PII
- **Aggregated vs. Individual**: Statistical data = no PII; individual records = PII
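A pattern scan can surface candidates for review, although it cannot apply the public-versus-private and context rules above. The sketch below uses deliberately simple, assumed patterns for email addresses and US-style SSN formats; the names `EMAIL_RE`, `SSN_RE`, and `pii_candidates` are hypothetical.

```python
import re
from typing import Dict, List

# Simplified patterns (assumptions); real-world detection needs broader coverage.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pii_candidates(text: str) -> Dict[str, List[str]]:
    """Return strings that look like emails or SSNs so a reviewer can judge them."""
    return {
        "emails": EMAIL_RE.findall(text),
        "ssn_like": SSN_RE.findall(text),
    }
```

A match does not by itself imply `contains_pii`: an institutional email on an academic paper, for example, is typically labeled `no_pii` under the context rule.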
---
### Geographic Relevance
#### 17. Regional Relevance
**What we're measuring**: Primary regional, cultural, or geopolitical sphere(s) that the content relates to, regardless of language used.
**Multi-regional content**: Content can be assigned multiple regional labels if it genuinely spans multiple regions. Choose ALL applicable regions rather than forcing a single primary choice. Always output an array for this property, even if only one region applies.
##### Values & Criteria:
**`european`** - European context (EU and broader Europe)
- Content about European countries, EU policies, or pan-European topics
- European cultural perspectives, social systems, or business practices
- References to European cities, institutions, companies, or regulations
- Includes: EU member states, UK, Switzerland, Norway, Balkans, etc.
- Example: GDPR compliance, European Parliament elections, Schengen area travel, European football leagues
**`north_american`** - North American context
- Content about US, Canada, or Mexico
- North American cultural perspectives, USMCA/NAFTA region topics
- References to North American institutions, companies, or issues
- Example: FDA regulations, Silicon Valley tech, NHL, US constitutional law, Canadian healthcare
**`east_asian`** - East Asian context
- Content about China, Japan, Korea (North/South), Taiwan, Mongolia
- East Asian cultural perspectives, Confucian-influenced societies
- References to East Asian economic models, companies, or social systems
- Example: Gaokao exams, K-pop, Shenzhen tech hub, Japanese work culture, Taiwan semiconductor industry
**`south_asian`** - South Asian context
- Content about India, Pakistan, Bangladesh, Sri Lanka, Nepal, Bhutan, Afghanistan, Maldives
- South Asian cultural perspectives, subcontinental issues
- References to South Asian institutions, economies, or social structures
- Example: IIT entrance exams, Bollywood, cricket leagues, monsoon impacts, caste system discussions
**`southeast_asian`** - Southeast Asian context
- Content about ASEAN countries (Indonesia, Thailand, Vietnam, Philippines, Malaysia, Singapore, etc.)
- Southeast Asian regional perspectives and economic integration
- References to ASEAN policies, regional companies, or cultural phenomena
- Example: ASEAN economic community, Indonesian elections, Singapore financial sector, Thai tourism
**`middle_eastern`** - Middle Eastern and North African context
- Content about Arab states, Iran, Turkey, Israel, North Africa (MENA region)
- Middle Eastern cultural perspectives, Islamic finance, regional conflicts
- References to Middle Eastern institutions, oil economies, or geopolitics
- Example: Gulf Cooperation Council, OPEC decisions, Middle East peace process, Islamic banking
**`sub_saharan_african`** - Sub-Saharan African context
- Content about African countries south of the Sahara
- African Union topics, sub-Saharan development issues
- References to African institutions, economies, or cultural topics
- Example: M-Pesa mobile banking, African Union policies, safari tourism, ubuntu philosophy
**`latin_american`** - Latin American context
- Content about Central and South America, Caribbean
- Latin American cultural perspectives, regional integration (Mercosur, etc.)
- References to Latin American institutions, economies, or social movements
- Example: Mercosur trade, telenovelas, Amazon rainforest, Latin American revolutions
**`oceanian`** - Oceanian context
- Content about Australia, New Zealand, Pacific Island nations
- Oceanian perspectives, Pacific regional issues
- References to Oceanian institutions, companies, or cultural topics
- Example: ANZAC relations, Pacific Island climate change, Australian mining, Māori culture
**`central_asian`** - Central Asian context
- Content about Kazakhstan, Uzbekistan, Turkmenistan, Tajikistan, Kyrgyzstan
- Central Asian perspectives, post-Soviet regional dynamics
- Silk Road region, resource economies, nomadic heritage
- Example: Silk Road initiatives, Caspian Sea resources, post-Soviet transitions
**`russian_sphere`** - Russian/Post-Soviet context
- Content about Russia, Belarus, and areas under strong Russian influence
- Post-Soviet perspectives, CIS (Commonwealth of Independent States) topics
- Russian language content about regional (not global) topics
- Example: Russian federal politics, CIS integration, post-Soviet economic transitions
**`global`** - Genuinely international or universal
- Content with truly global scope or application
- International organizations, worldwide phenomena, global comparisons
- Topics that transcend regional boundaries
- Example: UN reports, climate change (global perspective), international standards, pandemic response
**`culturally_neutral`** - No clear regional focus
- Abstract, theoretical, or technical content without regional markers
- Universal scientific, mathematical, or philosophical content
- Content that could apply equally anywhere without modification
- Example: Mathematical proofs, chemical formulas, abstract philosophy, programming concepts
**`indeterminate`** - Cannot determine regional relevance
- Insufficient content to identify regional focus
- Mixed or contradictory regional signals
- Fragment or corrupted content lacking regional context
- Example: Technical specifications without context, isolated data tables
##### Multi-Regional Examples:
- **EU-China trade relations** → `["european", "east_asian"]`
- **NAFTA/USMCA impact on Mexican agriculture** → `["north_american", "latin_american"]`
- **Indian diaspora in the Gulf states** → `["south_asian", "middle_eastern"]`
- **Comparative study of healthcare systems globally** → `["global"]`
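Since this property must always be an array of known labels, annotation output can be checked mechanically. The following is a minimal sketch: the label set is copied from the values defined above, while the function name `normalize_regions` and the choice to reject unknown labels with `ValueError` are assumptions.

```python
from typing import Iterable, List, Union

# Labels defined in the Values & Criteria list for Regional Relevance.
ALLOWED_REGIONS = {
    "european", "north_american", "east_asian", "south_asian",
    "southeast_asian", "middle_eastern", "sub_saharan_african",
    "latin_american", "oceanian", "central_asian", "russian_sphere",
    "global", "culturally_neutral", "indeterminate",
}


def normalize_regions(value: Union[str, Iterable[str]]) -> List[str]:
    """Wrap a single label in a list, drop duplicates, and reject unknown labels."""
    labels = [value] if isinstance(value, str) else list(value)
    result: List[str] = []
    for label in labels:
        if label not in ALLOWED_REGIONS:
            raise ValueError(f"unknown region label: {label!r}")
        if label not in result:
            result.append(label)
    if not result:
        raise ValueError("at least one region label is required")
    return result
```

For example, `normalize_regions("european")` yields `["european"]`, so downstream code can rely on always receiving a list.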
##### Regional Identification Guidelines:
**Primary indicators:**
- **Geographic references**: Countries, cities, regions, landmarks mentioned
- **Institutional references**: Governments, companies, universities, organizations specific to region
- **Cultural markers**: Holidays, customs, cultural phenomena, sports, entertainment
- **Political/economic systems**: References to regional political structures, economic blocs
- **Legal/regulatory frameworks**: Region-specific laws, regulations, standards
- **Language context**: While not determinative, language can provide regional hints
**Important distinctions:**
- **Language ≠ Region**: Spanish content about East Asian markets = `["east_asian"]`, not `["latin_american"]`
- **Company origin vs. topic**: Apple (US company) operating in India = consider actual content focus
- **Historical vs. current**: Historical content about ancient Rome = `["european"]` if discussing modern implications
- **Diaspora content**: Content about diaspora communities should include both origin and current regions
**Quality checks:**
- If content is in a non-English language but discusses global topics → still mark as `["global"]`
- If content compares multiple regions → mark all regions discussed substantially
- If content is about a specific place but has universal applications → consider both regional and global tags
---
#### 18. Country Relevance
**What we're measuring**: Which specific country or countries, anywhere in the world, the content is relevant to (if any).
**Note**: Always output an array of country names for this property (even when only a single country applies). Use standard country names from any region worldwide (e.g., "germany", "france", "united_states", "united_kingdom", "china", "japan", "brazil", "india", "south_africa", "australia", "canada", etc.). The array may also contain the special values `supranational` or `none`.
##### Values & Criteria:
**`{COUNTRY_NAME}`** - Content specifically relevant to a single country
- Content explicitly about that country's politics, culture, institutions, or regulations
- Content written from that country's cultural perspective
- Content addressing that country's specific issues, regulations, or cultural phenomena
- Content about that country's cities, companies, institutions, or country-specific topics
- Example: For "germany" → German election coverage, Bundesliga content, German legal analysis
- Example: For "united_states" → US election coverage, NFL content, US legal analysis
- Example: For "japan" → Japanese politics, J-League content, Japanese cultural analysis
- Only use country names listed in ISO 3166-1, written in lowercase with underscores. Use "united_kingdom" instead of "england", "wales", etc.
**`supranational`** - For content focused on supranational entities or regions
- International organizations, regional blocs, global institutions
- Content about supranational policies, international organizations, global governance
- Pan-regional analysis that transcends individual countries
- Multi-continental or global institutional content
- Example: UN resolutions, NATO discussions, EU policy analysis, ASEAN agreements, WTO trade rules
**`none`** - For content not specifically relevant to any country
- Abstract, theoretical, or universal content without geographic specificity
- Technical/scientific content that applies globally without country focus
- Content that doesn't reference specific countries, cultures, or national contexts
- Example: Mathematical proofs, universal scientific principles, abstract philosophical discussions
##### Country Identification Criteria:
- **Political content**: Elections, government policies, political parties, political figures specific to the country
- **Cultural content**: National traditions, cultural phenomena, historical events specific to the country
- **Institutional references**: Government bodies, national companies, universities specific to the country
- **Geographic focus**: Cities, regions, landmarks within the country as primary subjects
- **Legal/regulatory**: Laws, regulations, legal frameworks specific to the country
- **Economic content**: National economic policies, country-specific market analysis
- **Sports/media**: National sports leagues, national teams, country-specific media outlets
- **Social issues**: Social policies, demographic topics, social movements specific to the country
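The output conventions for this property (an array, lowercase names with underscores, "united_kingdom" in place of its constituent nations) can be sketched as a small normalizer. Assumptions are marked in the comments: the helper name `normalize_countries`, the inclusion of Scotland and Northern Ireland in the mapping, and returning `["none"]` for an empty input. The sketch handles ASCII names only and does not validate names against ISO 3166.

```python
import re
from typing import Iterable, List

# Constituent nations folded into "united_kingdom"; the guideline names
# "england" and "wales", the other two entries are an assumed extension.
UK_PARTS = {"england", "wales", "scotland", "northern_ireland"}


def normalize_countries(names: Iterable[str]) -> List[str]:
    """Lowercase names, join words with underscores, and drop duplicates."""
    result: List[str] = []
    for name in names:
        key = re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")
        if key in UK_PARTS:
            key = "united_kingdom"
        if key and key not in result:
            result.append(key)
    # Assumption: an empty input means no country applies.
    return result or ["none"]
```

Applied to `["Germany", "England", "United Kingdom"]`, this produces `["germany", "united_kingdom"]`.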
---