Initialize project; model provided by the ModelHub XC community

Model: jorge-erdb/gpt-oss-20b-uncensored-sonnet-flavored-gguf
Source: Original Platform
ModelHub XC
2026-06-15 02:22:16 +08:00
commit 88f986a0ef
15 changed files with 1777 additions and 0 deletions

.gitattributes vendored Normal file

@@ -0,0 +1,40 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
gpt-oss-20B-uncensored-sonnet-flavored-v1-BF16.gguf filter=lfs diff=lfs merge=lfs -text
gpt-oss-20B-uncensored-sonnet-flavored-v1-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
gpt-oss-20B-uncensored-sonnet-flavored-v1-Q6_K.gguf filter=lfs diff=lfs merge=lfs -text
gpt-oss-20B-uncensored-sonnet-flavored-v1-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
hf/gpt-oss-20b-uncensored-sonnet-flavored-v1/tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md Normal file

@@ -0,0 +1,630 @@
---
license: apache-2.0
pipeline_tag: text-generation
tags:
- sonnet-flavored
- gpt-oss-20b
- gpt-oss
- moe
- mxfp4
- lora
- fine-tune
- sonnet-style
- claude-style
- distilled
- gguf
base_model:
- p-e-w/gpt-oss-20b-heretic-ara-v3
language:
- en
---
# gpt-oss-20b-uncensored-sonnet-flavored-v1
A LoRA fine-tune of [p-e-w/gpt-oss-20b-heretic-ara-v3](https://huggingface.co/p-e-w/gpt-oss-20b-heretic-ara-v3), p-e-w's Heretic ARA-derestricted version of GPT-OSS-20B, that transfers Claude Sonnet 4.6's response style. Training targeted final answers only, so the base weights' reasoning and capabilities are retained, along with their flaws.
This repo contains both the merged HF safetensors (in `hf/`) and a GGUF ladder with MXFP4 experts preserved natively across all variants.
---
## What changed and what didn't
This LoRA targets **response style only** — phrasing, tone, structure, register. It does not target reasoning capability, factual knowledge, or numerical calibration. For factual or numerical tasks, the underlying GPT-OSS-20B's calibration limits apply: this model will produce confidently-wrong answers in the same situations the base model does, just phrased in Sonnet's voice.
The base for this variant is p-e-w's heretic-ara-v3, which applies ARA (Attention Refusal Ablation) derestriction to GPT-OSS-20B. The LoRA was trained on top of that derestricted base, so the published merged weights inherit both the derestriction and the style transfer. The MoE experts (~19B of 20B parameters) are unchanged from the heretic base. The LoRA modified attention projections only, at rank 16.
The combination produces a model that retains GPT-OSS-20B's reasoning capability with refusals lifted (via heretic ARA) and Sonnet-flavored response style on top.
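To give a sense of how small an attention-only, rank-16 adapter is next to the ~19B expert parameters, here is a rough sketch of the parameter arithmetic. The projection names and shapes (`q_proj`, `k_proj`, `v_proj`, `o_proj`, width 2880, 64 query heads, 8 key/value heads, head size 64, 24 layers) are assumptions for illustration and are not taken from this card; check them against `config.json` in `hf/` before relying on the result. Only the rank comes from the text above.

```python
# Assumed attention shapes, for illustration only; verify against config.json.
HIDDEN = 2880        # assumed model width
Q_OUT = 64 * 64      # assumed 64 query heads x head size 64
KV_OUT = 8 * 64      # assumed 8 key/value heads x head size 64
LAYERS = 24          # assumed number of transformer blocks
RANK = 16            # LoRA rank stated in the model card

# (d_in, d_out) for each projection assumed to carry an adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen weight."""
    return rank * (d_in + d_out)


def total_lora_params(projections=PROJECTIONS, layers=LAYERS, rank=RANK) -> int:
    """Sum the adapter parameters over every projection in every layer."""
    per_layer = sum(lora_params(i, o, rank) for i, o in projections.values())
    return per_layer * layers


if __name__ == "__main__":
    print(f"adapter parameters: {total_lora_params():,}")
```

Under these assumed shapes the adapter comes to roughly 8 million parameters, which is consistent with the claim that the merge changes style while leaving the experts untouched.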
---
## Response Comparisons
<details>
<summary><b>Prompt 1:</b> Coffee shop subsidy critique</summary>
**Prompt:**
A city planner argues that because neighborhoods with an abundance of boutique coffee shops have lower crime rates, the city should heavily subsidize new coffee shops in high-crime areas to make them safer. What are the specific logical and statistical flaws in this policy proposal?
<details>
<summary>Base GPT-OSS-20B-heretic-ara-v3:</summary>
Below is a compact “checklist” of the most common reasoning and data-analysis problems that usually show up when people take a simple cross-sectional observation, “neighborhoods with more boutique coffee shops have lower crime”, and turn it into a policy directive. I've split the issues into two groups:
| # | **Logical flaw** | Example / Why it matters |
|---|------------------|--------------------------|
| L1 | **Correlation ≠ causation** | The observation merely tells you that two variables move together, not that one *causes* the other. A city planner might assume “coffee → calmer people → less crime” without any evidence of a causal mechanism. |
| L2 | **Reverse causality** | Low crime can attract coffee shops. A downtown that's safe becomes a hotspot for cafés, not the café that makes it safe. Assuming the wrong direction hides an important driver. |
| L3 | **Ignoring confounding factors** | Things that covary with both coffee shop density and crime (income, education, police presence, zoning, housing stock, community-centeredness) can explain away the association. Subsidizing cafés without controlling for those covariates may subsidize the wrong “factor.” |
| L4 | **Ecological fallacy** | The data are at the neighborhood level; that relationship might not hold for individual streets or households within that neighborhood. A policy designed around aggregated data can mistarget where it is actually needed. |
| L5 | **Endogeneity / simultaneity** | Neighborhood characteristics (e.g., property values) may be jointly determined by crime and café density. Standard OLS will bias the estimated effect because one variable is “tautologically” related to the other. |
| L6 | **Assuming a linear, uniform effect** | The impact of adding a café could be nonlinear (e.g., a 1st café helps a lot, but 5th cafés do little) or vary by café type, location, or existing social capital. A blanket subsidy ignores heterogeneity. |
| L7 | **Neglecting time dynamics** | Crime rates and café openings have lags (construction time, business failure risk). A policy that taxes the “instantaneous” correlation overlooks that crime may decline months or years after a café opens, or that cafés close shortly after opening. |
| L8 | **Omitting selection bias** | Cafés tend to open in areas that are already improving (gentrifying), so the relationship is partly an *selection* effect: coffee shops “chose” neighborhoods that were on a trajectory of lower crime. Subsidies may not change that trajectory. |
| L9 | **Assuming “more cafés = more benefit”** | If cafés saturate, additional ones add little incremental value to crime reduction. Subsidizing beyond that saturation point is an inefficient use of public money. |
| L10 | **Ignoring opportunity cost / distributional impacts** | The subsidy may crowd out other crime-prevention initiatives (lighting, policing, community programs) that actually have a larger effect. It may also divert funds from neighborhoods where cafés aren't as impactful but crime is high. |
| L11 | **Policy framing without evidence of ROI** | Even if cafés *do* reduce crime marginally, the cost-benefit ratio (expense of subsidies vs savings in policing, insurance, etc.) must be quantified before the policy is justified. |
| # | **Statistical flaw** | Example / Why it matters |
|---|----------------------|--------------------------|
| S1 | **Model specification error** | Using a simple OLS that regresses crime rate on café density without including key covariates leads to omitted variable bias and inflated R². |
| S2 | **Spatial autocorrelation not addressed** | Crime in one block is correlated with crime in neighboring blocks. Standard regression assumes independent observations, so errors are heteroskedastic and clustered; the standard errors will be wrong. |
| S3 | **Cross-sectional vs. panel data** | Relying on a single snapshot ignores dynamics. A causal test would need repeated measurements over time (before-after) or matched controls. |
| S4 | **Measurement error in key variables** | “Boutique coffee shop” can be defined in many ways (size, price point, brand). If the indicator variable is noisy, the estimated coefficient will be biased toward zero (“attenuation bias”). |
| S5 | **Sampling bias / non-representativeness** | If the sample only includes densely populated neighborhoods, results won't generalize to suburbs or rural areas. |
| S6 | **Regression to the mean** | High-crime neighborhoods may naturally see a drop in crime in the next year regardless of cafés. The policy might be picking up this statistical artifact. |
| S7 | **Failure to use causal inference techniques** | Simply observing a “big enough” coefficient is not proof of causality. Lacking instrumental variables, matching, or random assignment makes it hard to rule out alternative explanations. |
| S8 | **Linear vs. nonlinear relationship missing** | A linear model will mask threshold effects (e.g., crime drops only after a café density reaches 0.5 per square mile). |
| S9 | **Overfitting / data dredging** | If the policy decision is based on a dataset that was “poked around” until something looked significant, the result may be spurious. |
| S10 | **Inadequate robustness checks** | No sensitivity analyses (e.g., changing control variables, altering definition of crime, using alternative coffee shop measures) means we have no idea how stable the coefficient is. |
---
## How to Fix It (i.e., what evidence would support a subsidy)
1. **Design a proper causal study**
* **Randomized Controlled Trial (RCT):** Randomly offer subsidies to some high-crime neighborhoods and not others, then measure crime changes.
* **Difference-in-Differences (DiD):** Compare crime trends in subsidized vs. non-subsidized neighborhoods before and after the subsidy, controlling for common trends.
* **Instrumental Variables (IV):** Find a variable that predicts café density but not crime directly (e.g., historical coffee-shop zoning laws) to tease out causality.
2. **Use panel data and spatial econometrics**
* Cluster standard errors at the block or neighborhood level.
* Include fixed effects to absorb time-invariant neighborhood characteristics.
* Test for spatial spillover effects (e.g., does a café in one block affect crime in adjacent blocks?).
3. **Control for confounders**
* Income, education, employment rates, police presence, lighting, housing stock, community resources.
* Test for and include any variables that predict both café density and crime.
4. **Check for nonlinearity and heterogeneity**
* Use splines or piecewise regression to detect thresholds.
* Run subgroup analyses (e.g., only cafés that serve meals vs. just coffee, downtown vs. residential).
5. **Measure costs & benefits accurately**
* Determine the subsidy cost per café, the number of cafés that actually open, and their “maintenance” (do they stay open?).
* Quantify crime-related savings: fewer arrests, lower insurance premiums, higher property values, reduced policing costs.
6. **Test robustness**
* Alternate definitions of crime (violent vs. nonviolent).
* Different coffee-shop criteria (size, price tier, foot traffic).
* Vary the spatial resolution (census tract vs. block group).
7. **Avoid the ecological fallacy**
* Complement neighborhood-level analysis with micro-level data on individuals (e.g., who drinks coffee, who commits crimes) if feasible.
---
### Bottom line
- **Logical:**
- The planner's proposal is premised on an association that may be spurious or reversed.
- It ignores many other factors that actually drive crime.
- It assumes a “one-size-fits-all” subsidy will create the desired outcome, which may not hold across contexts.
- **Statistical:**
- The underlying data analysis likely suffers from omitted variables, spatial correlation, measurement error, and reverse causality.
- Without causal identification (RCT, DiD, IV, etc.) the coefficient of “café density” could be misleading.
Only after addressing these flaws, through better design, richer data, and rigorous causal inference, can a city plan to subsidize boutique coffee shops with confidence that the result will actually reduce crime in high-risk neighborhoods.
</details>
<details>
<summary>Uncensored Sonnet-flavored:</summary>
This is worth breaking into two categories: **distributional vs substantive** concerns so the analysis is actually useful.
## Logical/STATS Flaws
**Correlation vs causation conflated**
- The observed association may not mean coffee shops *cause* safety
- Coffee shops and crime may both be caused by *income* and *policing*
- Reverse causality: safer areas attract businesses
**Confounding variables ignored**
- After controlling for income, education, police presence, do coffee shops still predict crime?
- If not, subsidy addresses nothing
- This distinction is usually publishable academic work worth scrutinizing before policy
**Effect size & homogeneity assumed**
- What if the correlation is weak conditional on other variables
- The effect in high-crime areas may differ substantially from low-crime areas
- Uniform subsidy rate may be inappropriate
**Causal direction ambiguous**
- High-crime areas could have fewer shops *because* crime scares them away, making coffee shops a *symptom* rather than *intervention*
- Subsidy addressing symptoms adds up to nothing without underlying work
**Selection bias in the observation**
- The analysis likely used commercial data covering existing shops only
- This clusters by desirability measures pre-dating crime trends
- Time direction matters enormously
## Distributional/Systemic Flaws
- Might substitute for more effective evidence-backed interventions (policing, education, housing)
- Could exacerbate gentrification as coffee shops signal amenity improvement before residents benefit
- Doesn't address *ownership* vs *consumer service*: new chains don't create the same social functions
- Returns on complex social problems might be poorly served by simple chain cafes
## The Real Bad Habit Here
The planner seems to have **chosen a conveniently testable story** after observing data, not testing assumptions first. This leads to proposals that look good but ignore what *most of* competing explanations ruled out.
We can ask whether the proposal survives this structured interrogation before judging it: Does the coffee-shop coefficient remain after full controls? What drives coffee shop placement? Is there mediation evidence?
## The Honest Assessment
The data may be descriptive, not prescriptive. Some research on observation opportunities (nighttime pedestrian traffic) is more credible as causal, but planners often skip that framing step.
What plan would address the underlying factors directly? That's presumably cheaper and more reliable. The coffee shop subsidy is probably a **futile substitution** rather than effective supplement.
</details>
---
</details>
<details>
<summary><b>Prompt 2:</b> Hibernation premise-questioning</summary>
**Prompt:**
If a technological breakthrough allowed humans to safely and comfortably hibernate for up to six months a year to save money and energy, how would this alter global labor markets, real estate, and the structure of the traditional education system?
<details>
<summary>Base GPT-OSS-20B-heretic-ara-v3:</summary>
### The “Six-Month Hibernation” Revolution: A Systems-Level View
If a reliable, affordable hibernation system becomes mainstream, the way we think about work, living space, and learning will be reorganised along a set of **parallel axes**:
| Axis | What changes | Why it matters |
|------|--------------|----------------|
| **Time** | Human biologic time is split into *wake* (≈ 183 days) and *hibernation* (≈ 183 days). | All institutions that revolve around “days” (the labour market, real estate, schooling, infrastructure) must adapt to a new biannual rhythm. |
| **Productivity** | Workers give up ~1/2 of calendar time but can work on a compressed schedule or automate tasks. | Total output may stay flat or even grow if automation is high-impact; wages, hours, and social benefits will need redesign. |
| **Energy / Mobility** | Commutes and neighbourhood power consumption plummet during hibernation. | Savings in transport, utilities, and carbon, but spikes at wake-start/break points change demand patterns. |
| **Physical Space** | Houses and workplaces must house a “sleep-pods” infrastructure. | Real estate & urban planning pivot from “living-working-transporting” to “living-hibernating-working.” |
Below we unpack the three sectors you asked about (**labour markets, real estate, and education**) while noting knock-on effects that come from the new rhythm.
---
## 1. Global Labour Markets
| What's Shifting | Key Consequences | Implications for Workers/Employers |
|-----------------|------------------|-----------------------------------|
| **Compressed Hours** | Workdays can be longer but fewer: a 40-hour week could become five 8-hour *blocks* that are scheduled around the wake period, or 20-hour days spread over half the week. | • Workers need flexible contracts (e.g., “partial-time in 2025”) <br>• Employers can hire fewer people for the same output if automation is high |
| **Shift-Based & Gig Economy** | With wakes limited to 183 days, there's a natural demand for “on-call” workers who cover the gap. | • More opportunity for freelancers, but also increased competition <br>• Workers may need training to be skilled at switching between job types |
| **Automation & Human-in-the-Loop Roles** | Automation can cover repetitive tasks during long wake stretches, but humans are still needed for oversight, innovation, and creative tasks. | • Upskilling will focus on tech-ops, system oversight, and AI-human collaboration <br>• “Specialist” roles may gain premium value |
| **Wage & Benefits Redesign** | Pay is no longer a simple “per-hour” model; it needs to reflect biannual availability and the value of compressed work. | • Hourly wages could rise for the wake period; “overtime” may be structured differently <br>• Unemployment benefits need a new definition for an extended non-working half-year |
| **Cross-Border Flexibility** | Remote collaboration is now the default mode; no “workday” is tied to a single time zone. | • Global talent pools expand; companies can tap hourly labour wherever it's cheaper <br>• National borders blur in terms of job assignment |
| **Health & Well-Being** | Long hibernation spells risk circadian disruption, isolation, and reduced physical activity. | • Employers may subsidise home-office wellness programs <br>• Governments may incentivise “hibernation breaks” to keep people active (workshops, classes) |
### Macro-Economic Effects
- **Productivity per Hour** is likely to climb as automation handles the bulk of routine jobs, but total economic output could plateau if wage costs rise proportionally.
- **Labor Supply Curve** becomes more elastic: workers can switch between jobs with fewer constraints, making it easier to adjust to shifts in demand.
- **Unemployment & Welfare Systems** must accommodate a half-year “unemployment” period that is essentially involuntary for most.
---
## 2. Global Real Estate
| Change | Physical Design | Economic Impact |
|--------|-----------------|-----------------|
| **Hibernation Pods** | Homes grow an integrated, climate-controlled sleeping pod (essentially a mobile, personal “greenhouse”). It may include: <br>• Temperature & oxygen regulation <br>• Waste recycling <br>• Light-therapeutic features | • Standard housing now includes a “pod” that is part of the building's energy system. <br>• Greater emphasis on airtight construction and advanced HVAC. |
| **Space Utilisation** | Rooms become multifunctional: living, office, hibernation, gym, storage. | • Rise in compact, efficient single-person homes <br>• More vertical living units (multi-storey blocks) that stack pods next to living spaces |
| **Land Use & Urban Density** | Fewer commuters mean less demand for daily transit infrastructure; parking takes less land, while more space can be reused. | • Suburban sprawl slows; inner-city density increases <br>• Real-estate value shifts from “commute” proximity to “in-house convenience” (pod access) |
| **New Market Segments** | Home-builder segment for “hibernation-ready” homes <br>Corporate “hibernation hubs”: shared facilities where employees can hibernate on-site. | • Strong niche market for retrofit kits and design consultants <br>• Potential for “hub-and-spoke” residential complexes centred on a shared hibernation facility |
| **Energy & Sustainability** | During the wake period, houses use more power for lighting/temperature; during hibernation this is lower but still significant. | • Power grids shift from steady daily loads to biannual pulses <br>• Incentives for battery storage and renewable integration rise |
| **Real-Estate Finance** | Mortgage calculations now consider reduced daytime utility costs (less commuting fuel, heating). | • New “hibernation-discount” mortgage rates <br>• Higher value placed on houses with robust integrated pods |
### International Variations
- **High-Income Markets**: rapid adoption of pre-integrated pods; urban cores see densification and “pod-apartments.”
- **Emerging Markets**: retrofit opportunities in existing housing stock; lower tech costs but also greater variability in building standards.
- **Low-Density Rural Areas**: more emphasis on “extended” hibernation units that combine farming or remote-work spaces with pods.
---
## 3. The Structure of the Traditional Education System
| Dimension | How it Shifts |
|-----------|---------------|
| **Academic Calendar** | Instead of a *9-month* school year, we could have *6-month* active semesters separated by *hibernation weeks.* Schools would need to pace content within shorter active intervals. |
| **Mode of Delivery** | • Higher reliance on asynchronous learning (video lectures, AI tutors) during hibernation. <br>• Live sessions cluster at the beginning/end of wake periods for real-time interaction. |
| **Assessment** | • Competency-based model: students must demonstrate mastery before progressing, regardless of time spent. <br>• Continuous, small-scale assessments replace big midterms and finals. |
| **Curriculum Design** | • Interdisciplinary modules that can be completed in “micro-cycles.” <br>• Encouragement of “learning pods”: home-based study environments equipped for immersive work during hibernation. |
| **Teacher Roles** | • Shift from content delivery to facilitation, coaching, and moderation of online discussions. <br>• Use of AI-grade assistants reduces grading load; teachers spend more time on personalised guidance. |
| **Student Scheduling** | • Flexible enrolment: a student could start the semester at any time rather than a fixed September start. <br>• Opportunities for “double-enrolment” within a year (two separate 6-month courses). |
| **Higher Education & Apprenticeships** | • Micro-credentials, nano-degrees become routine; the notion of a single, rigid degree is replaced by a portfolio of validated skills. <br>• Apprenticeships could run through wake periods but provide “hands-on” labs during hibernation via remote or simulated environments. |
| **Institutional Infrastructure** | • Campus buildings repurposed for “learning lounges” that operate during wake periods. <br>• Virtual campuses (VR/AR) can be used during hibernation to keep sense of community. |
#### Practical Example: A 12-Month Bachelor's Degree
- **Semester 1 (Wake)**: Introductory courses, live labs.
- **Semester 2 (Hibernation)**: Self-paced modules with AI tutors; students can take online exams mid-semester.
- **Semester 3 (Wake)**: Advanced labs, project work.
- **Semester 4 (Hibernation)**: Capstone projects completed in a fully remote setting, submitted via an AI-graded platform.
The net effect: **Learning is decoupled from the calendar**; it becomes an ongoing, competency-based journey rather than a rigid date-bound process.
---
## Cross-Sector Synergies
| Question | Answer |
|----------|--------|
| **Could hibernation reduce inequalities?** | Yes, if everyone has access to quality pods and affordable connectivity, geographic barriers recede. However, initial investment disparities may widen unless subsidised. |
| **Will the economy become more seasonal?** | The calendar becomes “two-season”: a high-output wake season and a low-output hibernation season. Seasonal jobs shift from *weather* to *biannual cycles*. |
| **What about public services?** | Schools, hospitals, emergency services must adapt to biannual demand spikes. For example, ambulance staffing could be leaner during hibernation but stronger when the wake period resumes. |
| **Environmental effects?** | Net energy consumption may drop because peak daytime transport is offloaded to wake periods; however, nighttime peak loads could rise (home heating/lighting). Carbon budgets will shift accordingly. |
---
## Bottom Line
A six-month hibernation system changes *who* can do what, *when* they do it, and *how* they do it across society. Think of it as a leap from a **“full-day, fixed-time”** world to a **“compressed, biannual, flexible”** one. The consequences ripple outwards:
- **Labour markets** morph into a leaner, automation-augmented, gig-ready ecosystem with re-imagined wages and benefits.
- **Real estate** pivots to pod-centric design and densification, supported by new finance products and supply-chain adaptations.
- **Education** becomes modular, competency-oriented, and increasingly online, dissolving the idea of a 9-month “school year.”
Governments, businesses, and individuals will need to negotiate this new rhythm through policy, technology investment, and cultural adaptation. The key is to view hibernation not as a *pause* but as a *core component* of how we live, work, and learn in the 21st-century world.
</details>
<details>
<summary>Uncensored Sonnet-flavored:</summary>
## First, Unpack the Most Significant Assumptions
These need examination before generating consequences:
1. **Productivity is truly compressed rather than duplicated** - hibernation replaces something
2. **People will rationally maximize it** - humans historically overuse childcare even when cheaper alternatives exist
3. **Cost/energy savings offset *buildings and services* running continuously**
4. **Legal/social institutions adapt to new patterns quickly**
5. **All economic agents face identical technology (perfect competition assumption)**
Without acknowledging these, some predictions overshoot.
## Labor Market Effects with Reasoning
**Strongly likely:**
- *Shift redundancy collapses* - cleaning/midnight/first-light roles compress significantly
- Emergency/readiness staffing requires smarter design (automation + human backup strategies)
- **Caregiving changes definition** - child/nursery shift periods fragment into arbitrary daytime chunks, actually increasing operating costs if parents avoid them
**Weaker:**
- **Total productivity per individual likely unchanged** - compressed year vs normal year doesn't add output unless hibernation activates latent potential (big leap)
- Employers currently charge for *presence* before *productivity*, so compensation structures likely shift toward hourly rates rather than salaried coverage
- Seasonal labor demand fluctuates dramatically
**Unexpected effects:**
- Geographical job-market compression - people can live anywhere they'd hibernate, reducing clustering of skilled labor around city centers (like Stockholm did permanently with commuter-line reductions)
- People may rationally *decompress* hibernation periods to manage care costs rather than round 6 months (human suboptimality is real)
- Overcapacity for "present" while underutilizing "productivity" creates business model disruption
## Real Estate Effects (Weighed Carefully)
**Honest assessment:**
| Reason | Effect Likelihood |
|--------|-----------------|
| Two human decades vs 45 years mean ~7,200 days / >18 months build | Medium (building inventory grows slower) |
| One individual occupies space twice as often | Low-medium - process costs prevent full realization |
| Location doesn't need to be near work | High - reduces housing-demand spillover from employment centers |
| Remote desirable when you can hibernate | Medium-high |
| Night economy compressed | Real effect |
**Probably real changes:**
- Deregulation of zoning where presence isn't time-constrained
- Growth of autonomous living pods as transitional solutions for daytime while hibernating
- Urban cores become more recreational/cultural rather than residential-dense
- Hotel/visitor economies reorganize
**Less obvious but important:**
The total *capacity* built may actually exceed productivity demand at transition's end, creating real estate asset bubbles until equilibrating - akin to what happened with digital infrastructure/buildout.
## Education System Effects (Complicated)
| Variable | Expected Change | Explanation |
|----------|---------------|-------------|
| Academic calendar size | No substantive change | Children age same rate; compressed learning periods possible but difficult |
| Childcare hours | Decreases dramatically | Parents hibernate infants and toddlers, eliminating daycare gaps - energy-heavy implication |
| Student enrollment timing | Shifts later into year | Economic incentive to birth when childcare still needed (inverse of current patterns) |
| Adult education schedules | New problems | People might compress credentialing during mandatory presence |
**Where speculation tends to go wrong:**
- School calendars shifting isn't automatically productive; compressed instruction demands different pedagogical models
- Dormant day (child awake while parent hibernates) won't spontaneously become educated - labor market maximizes inactive time often
## Connections Most Underappreciated by Speculators
The *handicap function* reasoning: Economics advanced because you put human beings in contextually appropriate placements. With process A delivering output B, alternative processes that also deliver B don't affect B but shift factor usage. That's why many efficiency booms transform specific industries but not total income until factor prices adjust.
So much depends on whether institutions recognize and adapt productivity signals instead of following comfort-triggered patterns.
## The Real Question
The *energy savings* argument collapses if saving comfortable sleep still costs people the same as working - essentially this is backward engineering across people's willingness to pay for presence.
Most decent technology breakthroughs affect **productivity levels** rather than **people's activity schedules**. Sleep-hibernation would reflect the same lag between indicators and behavior.
## Bottom Line on Worthwhile Disruption
| Area | Effect Likely | Certainty |
|------|--------------|----------|
| Night-operations / late-mover advantage | Changes | High |
| Housing-to-population ratio | Adjusts slowly | Medium |
| Caretime economics | Significant | Medium |
| Total productive output | Modest (unless latent capacity exploited) | Low-medium |
| Education duration | Minor real effect, major institutional lag | Medium |
The most striking consequences probably involve **geographical labor-market decoupling** and overnight/evening economy reconfiguration - things speculators currently focus on weakly while institutions continue slow adaptation.
Will you choose which mechanism the breakthrough emphasizes?
</details>
---
</details>
---
## Files
| File | Type | Purpose |
|---|---|---|
| `hf/gpt-oss-20b-uncensored-sonnet-flavored-v1/` | safetensors | Merged HF format with MXFP4 experts native; for transformers / vLLM |
| `gpt-oss-20B-uncensored-sonnet-flavored-v1-Q8_0.gguf` | GGUF | Q8_0 non-experts, MXFP4 experts |
| `gpt-oss-20B-uncensored-sonnet-flavored-v1-Q6_K.gguf` | GGUF | Q6_K non-experts, MXFP4 experts |
| `gpt-oss-20B-uncensored-sonnet-flavored-v1-Q5_K_M.gguf` | GGUF | Q5_K_M non-experts (with fallbacks, see note), MXFP4 experts |
| `gpt-oss-20B-uncensored-sonnet-flavored-v1-BF16.gguf` | GGUF | Reference BF16 non-experts, MXFP4 experts |
---
## Training Details
| Detail | Value |
|---|---|
| Base model | [p-e-w/gpt-oss-20b-heretic-ara-v3](https://huggingface.co/p-e-w/gpt-oss-20b-heretic-ara-v3) |
| Original architecture | MoE — 20B total / 3.6B active, 32 experts top-4 |
| Derestriction method | Heretic v1.2.0 ARA (Attention Refusal Ablation) by p-e-w |
| Method | LoRA (rank 16) via [Unsloth](https://github.com/unslothai/unsloth) |
| Training dataset | Sonnet 4.6 distilled responses by [TeichAI](https://huggingface.co/TeichAI) |
| Training type | Final-answer-only (no reasoning targeting) |
The LoRA adapter is not included in this v1 release. Future versions will publish the standalone adapter alongside the merged weights.
---
## Quantization Details
| Detail | Value |
|---|---|
| Quants | BF16, Q8_0, Q6_K, Q5_K_M (all with native MXFP4 experts) |
| Quantized by | [jorge-erdb](https://huggingface.co/jorge-erdb) |
| Method | [llama.cpp](https://github.com/ggml-org/llama.cpp) with explicit MXFP4 expert preservation |
| Importance matrix | None applied to v1; planned for future versions |
**MXFP4 experts preserved natively across the entire ladder.** The MoE expert tensors (~19B of 20B params) remain in their original MXFP4 format in every quant. Only non-expert tensors (router, attention, embeddings, layernorms) vary in precision across the ladder. No experts are decompressed and requantized at any point.
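
As a rough cross-check of the "~19B of 20B" figure, the expert weight count can be estimated from the shapes in the bundled `config.json` (24 layers, 32 experts, `hidden_size` and `intermediate_size` both 2880). The sketch below is not taken from this card: it assumes each expert holds a fused gate/up projection of shape hidden × 2·intermediate plus a down projection of shape intermediate × hidden, ignores biases, and assumes the usual MXFP4 layout of 32 four-bit values sharing one 8-bit scale (4.25 bits per weight).

```python
# Values taken from the bundled config.json.
NUM_LAYERS = 24
NUM_EXPERTS = 32
HIDDEN = 2880
INTERMEDIATE = 2880

# Assumption: MXFP4 packs 32 four-bit values with one shared 8-bit scale.
MXFP4_BITS_PER_WEIGHT = 4 + 8 / 32  # 4.25


def expert_param_count() -> int:
    """Weights in all expert MLPs, ignoring biases (assumed tensor shapes)."""
    gate_up = HIDDEN * 2 * INTERMEDIATE  # fused gate and up projections
    down = INTERMEDIATE * HIDDEN
    return NUM_LAYERS * NUM_EXPERTS * (gate_up + down)


def mxfp4_bytes(n_params: int) -> int:
    """Approximate storage for n_params weights kept in MXFP4."""
    return int(n_params * MXFP4_BITS_PER_WEIGHT / 8)


if __name__ == "__main__":
    n = expert_param_count()
    print(f"expert params: {n / 1e9:.2f}B")
    print(f"MXFP4 expert storage: {mxfp4_bytes(n) / 1e9:.2f} GB")
```

Under these assumptions the experts come to about 19.1B weights, in line with the "~19B" figure above, and occupy roughly 10 GB regardless of which quant is chosen.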
---
## Download
```bash
pip install -U "huggingface_hub[cli]"
# BF16 (full precision non-experts, MXFP4 experts)
huggingface-cli download jorge-erdb/gpt-oss-20b-uncensored-sonnet-flavored-gguf --include "*BF16*.gguf" --local-dir ./
# Q8_0
huggingface-cli download jorge-erdb/gpt-oss-20b-uncensored-sonnet-flavored-gguf --include "*Q8_0.gguf" --local-dir ./
# Q6_K
huggingface-cli download jorge-erdb/gpt-oss-20b-uncensored-sonnet-flavored-gguf --include "*Q6_K.gguf" --local-dir ./
# Q5_K_M
huggingface-cli download jorge-erdb/gpt-oss-20b-uncensored-sonnet-flavored-gguf --include "*Q5_K_M.gguf" --local-dir ./
# Merged HF safetensors (for transformers / vLLM)
huggingface-cli download jorge-erdb/gpt-oss-20b-uncensored-sonnet-flavored-gguf --include "hf/*" --local-dir ./
```
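
The same downloads can be scripted from Python. This is a minimal sketch that reuses the glob patterns from the commands above and calls `snapshot_download` from `huggingface_hub`; the `QUANT_PATTERNS` mapping and the `download` helper are illustrative names, not part of this repository.

```python
# Glob patterns copied from the CLI commands above.
REPO_ID = "jorge-erdb/gpt-oss-20b-uncensored-sonnet-flavored-gguf"
QUANT_PATTERNS = {
    "BF16": "*BF16*.gguf",
    "Q8_0": "*Q8_0.gguf",
    "Q6_K": "*Q6_K.gguf",
    "Q5_K_M": "*Q5_K_M.gguf",
    "hf": "hf/*",  # merged HF safetensors
}


def download(quant: str, local_dir: str = "./") -> str:
    """Download one variant and return the local directory path."""
    # Imported lazily so the mapping can be inspected without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[QUANT_PATTERNS[quant]],
        local_dir=local_dir,
    )
```

For example, `download("Q6_K")` would fetch only the file matching `*Q6_K.gguf` into the current directory.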
---
## Credits
- **LoRA training & quantization**: [jorge-erdb](https://huggingface.co/jorge-erdb)
- **Derestriction**: [p-e-w](https://huggingface.co/p-e-w) — Heretic v1.2.0 ARA
- **Distilled training data**: [TeichAI](https://huggingface.co/TeichAI) — Sonnet 4.6 distillation
- **Base model**: [OpenAI](https://huggingface.co/openai) — GPT-OSS-20B
- **Style source**: Anthropic — Claude Sonnet 4.6 (indirectly via distillation)
- **Tooling**: [Unsloth](https://github.com/unslothai/unsloth), [llama.cpp](https://github.com/ggml-org/llama.cpp)
---
<p align="center">
<img alt="gpt-oss-20b" src="https://raw.githubusercontent.com/openai/gpt-oss/main/docs/gpt-oss-20b.svg">
</p>
<p align="center">
<a href="https://gpt-oss.com"><strong>Try gpt-oss</strong></a> ·
<a href="https://cookbook.openai.com/topic/gpt-oss"><strong>Guides</strong></a> ·
<a href="https://arxiv.org/abs/2508.10925"><strong>Model card</strong></a> ·
<a href="https://openai.com/index/introducing-gpt-oss/"><strong>OpenAI blog</strong></a>
</p>
<br>
Welcome to the gpt-oss series, [OpenAI's open-weight models](https://openai.com/open-models) designed for powerful reasoning, agentic tasks, and versatile developer use cases.
We're releasing two flavors of these open models:
- `gpt-oss-120b` — for production, general purpose, high reasoning use cases that fit into a single 80GB GPU (like NVIDIA H100 or AMD MI300X) (117B parameters with 5.1B active parameters)
- `gpt-oss-20b` — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)
Both models were trained on our [harmony response format](https://github.com/openai/harmony) and should only be used with the harmony format as it will not work correctly otherwise.
> [!NOTE]
> This model card is dedicated to the smaller `gpt-oss-20b` model. Check out [`gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) for the larger model.
# Highlights
* **Permissive Apache 2.0 license:** Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
* **Configurable reasoning effort:** Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
* **Full chain-of-thought:** Gain complete access to the model's reasoning process, facilitating easier debugging and increased trust in outputs. It's not intended to be shown to end users.
* **Fine-tunable:** Fully customize models to your specific use case through parameter fine-tuning.
* **Agentic capabilities:** Use the model's native capabilities for function calling, [web browsing](https://github.com/openai/gpt-oss/tree/main?tab=readme-ov-file#browser), [Python code execution](https://github.com/openai/gpt-oss/tree/main?tab=readme-ov-file#python), and Structured Outputs.
* **MXFP4 quantization:** The models were post-trained with MXFP4 quantization of the MoE weights, making `gpt-oss-120b` run on a single 80GB GPU (like NVIDIA H100 or AMD MI300X) and the `gpt-oss-20b` model run within 16GB of memory. All evals were performed with the same MXFP4 quantization.
---
# Inference examples
## Transformers
You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the [harmony response format](https://github.com/openai/harmony). If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our [openai-harmony](https://github.com/openai/harmony) package.
To get started, install the necessary dependencies to set up your environment:
```
pip install -U transformers kernels torch
```
Once set up, you can run the model with the snippet below:
```py
from transformers import pipeline
import torch

model_id = "openai/gpt-oss-20b"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
```
Alternatively, you can run the model via [`Transformers Serve`](https://huggingface.co/docs/transformers/main/serving) to spin up an OpenAI-compatible webserver:
```
transformers serve
transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-20b
```
[Learn more about how to use gpt-oss with Transformers.](https://cookbook.openai.com/articles/gpt-oss/run-transformers)
## vLLM
vLLM recommends using [uv](https://docs.astral.sh/uv/) for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.
```bash
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve openai/gpt-oss-20b
```
[Learn more about how to use gpt-oss with vLLM.](https://cookbook.openai.com/articles/gpt-oss/run-vllm)
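
Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server listens on `localhost:8000` with the `/v1/chat/completions` route; adjust `BASE_URL` if your deployment differs. The helper names are illustrative.

```python
import json
import urllib.request

# Assumption: the server started by `vllm serve` listens on localhost:8000.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, model: str = "openai/gpt-oss-20b") -> urllib.request.Request:
    """Build a POST request for a single-turn chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Explain quantum mechanics clearly and concisely.")` requires the server to be up; `build_chat_request` can be inspected without it.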
## PyTorch / Triton
To learn about how to use this model with PyTorch and Triton, check out our [reference implementations in the gpt-oss repository](https://github.com/openai/gpt-oss?tab=readme-ov-file#reference-pytorch-implementation).
## Ollama
If you are trying to run gpt-oss on consumer hardware, you can use Ollama by running the following commands after [installing Ollama](https://ollama.com/download).
```bash
# gpt-oss-20b
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
```
[Learn more about how to use gpt-oss with Ollama.](https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama)
## LM Studio
If you are using [LM Studio](https://lmstudio.ai/) you can use the following commands to download.
```bash
# gpt-oss-20b
lms get openai/gpt-oss-20b
```
Check out our [awesome list](https://github.com/openai/gpt-oss/blob/main/awesome-gpt-oss.md) for a broader collection of gpt-oss resources and inference partners.
---
# Download the model
You can download the model weights from the [Hugging Face Hub](https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4) using the Hugging Face CLI:
```shell
# gpt-oss-20b
huggingface-cli download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/
pip install gpt-oss
python -m gpt_oss.chat gpt-oss-20b/original/
```
# Reasoning levels
You can adjust the reasoning level that suits your task across three levels:
* **Low:** Fast responses for general dialogue.
* **Medium:** Balanced speed and detail.
* **High:** Deep and detailed analysis.
The reasoning level can be set in the system prompt, e.g., "Reasoning: high".
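
A minimal sketch of setting the level through the message list, following the "Reasoning: high" convention described here. The `with_reasoning` helper is a hypothetical name for illustration. The chat template shipped in this repository also documents a `reasoning_effort` keyword (default `"medium"`); whether your runtime forwards that keyword is something to verify for your setup.

```python
VALID_LEVELS = ("low", "medium", "high")


def with_reasoning(messages: list[dict], level: str = "medium") -> list[dict]:
    """Return a new message list with a system prompt that sets the reasoning level."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    system = {"role": "system", "content": f"Reasoning: {level}"}
    return [system] + list(messages)


chat = with_reasoning(
    [{"role": "user", "content": "Summarize the trade-offs of MoE models."}],
    level="high",
)
```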
# Tool use
The gpt-oss models are excellent for:
* Web browsing (using built-in browsing tools)
* Function calling with defined schemas
* Agentic operations like browser tasks
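
For function calling, the bundled chat template reads each tool's `function.name`, `function.description`, and `function.parameters` (with `properties` and `required`). The sketch below shows a tool definition in that shape; `get_weather` and its fields are hypothetical, and `field_names` only mirrors how the template marks optional parameters with `?`.

```python
# Hypothetical tool, for illustration only.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}


def field_names(tool: dict) -> list[str]:
    """List parameter names, marking optional ones with '?' as the template does."""
    params = tool["function"].get("parameters") or {}
    required = params.get("required") or []
    return [
        name if name in required else name + "?"
        for name in params.get("properties", {})
    ]
```

A list such as `[get_weather]` is what would be passed as `tools` when applying the chat template.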
# Fine-tuning
Both gpt-oss models can be fine-tuned for a variety of specialized use cases.
This smaller model `gpt-oss-20b` can be fine-tuned on consumer hardware, whereas the larger [`gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) can be fine-tuned on a single H100 node.
# Citation
```bibtex
@misc{openai2025gptoss120bgptoss20bmodel,
title={gpt-oss-120b & gpt-oss-20b Model Card},
author={OpenAI},
year={2025},
eprint={2508.10925},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.10925},
}
```

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1f3853a70e1dbfe8fde7fca43a12ba822c5218ed3795b95138bafc9db976ef12
size 13792638656

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:656f0eecac7593b69f7ee2e71c36e66108d15c6180d4186954eb9ce0d997a314
size 11760487616

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c1a9eb3670f7a96244480d8f6c75810a163a1123aba9eeb06e7d001a81978d82
size 12040999616

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:75a7fd87f133ce6c26408db5d8ec45c1347102ce981b18539f553dd177357f63
size 12109566656
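
Each three-line stanza above is a Git LFS pointer: a spec version, a `sha256` object id, and a size in bytes. A downloaded file can be checked against its pointer with a short script. This is a sketch using only the standard library; the helper names are illustrative.

```python
import hashlib
import os


def parse_lfs_pointer(text: str) -> dict:
    """Parse 'key value' lines of a Git LFS pointer into algo, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at path has the pointer's size and hash."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    h = hashlib.new(pointer["algo"])
    with open(path, "rb") as f:
        # Hash in chunks so multi-gigabyte files are never loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]
```

Checking the size first avoids hashing a file that is obviously truncated.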

@@ -0,0 +1,331 @@
{#-
In addition to the normal inputs of `messages` and `tools`, this template also accepts the
following kwargs:
- "builtin_tools": A list, can contain "browser" and/or "python".
- "model_identity": A string that optionally describes the model identity.
- "reasoning_effort": A string that describes the reasoning effort, defaults to "medium".
#}
{#- Tool Definition Rendering ============================================== #}
{%- macro render_typescript_type(param_spec, required_params, is_nullable=false) -%}
{%- if param_spec.type == "array" -%}
{%- if param_spec['items'] -%}
{%- if param_spec['items']['type'] == "string" -%}
{{- "string[]" }}
{%- elif param_spec['items']['type'] == "number" -%}
{{- "number[]" }}
{%- elif param_spec['items']['type'] == "integer" -%}
{{- "number[]" }}
{%- elif param_spec['items']['type'] == "boolean" -%}
{{- "boolean[]" }}
{%- else -%}
{%- set inner_type = render_typescript_type(param_spec['items'], required_params) -%}
{%- if inner_type == "object | object" or inner_type|length > 50 -%}
{{- "any[]" }}
{%- else -%}
{{- inner_type + "[]" }}
{%- endif -%}
{%- endif -%}
{%- if param_spec.nullable -%}
{{- " | null" }}
{%- endif -%}
{%- else -%}
{{- "any[]" }}
{%- if param_spec.nullable -%}
{{- " | null" }}
{%- endif -%}
{%- endif -%}
{%- elif param_spec.type is defined and param_spec.type is iterable and param_spec.type is not string and param_spec.type is not mapping and param_spec.type[0] is defined -%}
{#- Handle array of types like ["object", "object"] from Union[dict, list] #}
{%- if param_spec.type | length > 1 -%}
{{- param_spec.type | join(" | ") }}
{%- else -%}
{{- param_spec.type[0] }}
{%- endif -%}
{%- elif param_spec.oneOf -%}
{#- Handle oneOf schemas - check for complex unions and fallback to any #}
{%- set has_object_variants = false -%}
{%- for variant in param_spec.oneOf -%}
{%- if variant.type == "object" -%}
{%- set has_object_variants = true -%}
{%- endif -%}
{%- endfor -%}
{%- if has_object_variants and param_spec.oneOf|length > 1 -%}
{{- "any" }}
{%- else -%}
{%- for variant in param_spec.oneOf -%}
{{- render_typescript_type(variant, required_params) -}}
{%- if variant.description %}
{{- "// " + variant.description }}
{%- endif -%}
{%- if variant.default is defined %}
{{ "// default: " + variant.default|tojson }}
{%- endif -%}
{%- if not loop.last %}
{{- " | " }}
{% endif -%}
{%- endfor -%}
{%- endif -%}
{%- elif param_spec.type == "string" -%}
{%- if param_spec.enum -%}
{{- '"' + param_spec.enum|join('" | "') + '"' -}}
{%- else -%}
{{- "string" }}
{%- if param_spec.nullable %}
{{- " | null" }}
{%- endif -%}
{%- endif -%}
{%- elif param_spec.type == "number" -%}
{{- "number" }}
{%- elif param_spec.type == "integer" -%}
{{- "number" }}
{%- elif param_spec.type == "boolean" -%}
{{- "boolean" }}
{%- elif param_spec.type == "object" -%}
{%- if param_spec.properties -%}
{{- "{\n" }}
{%- for prop_name, prop_spec in param_spec.properties.items() -%}
{{- prop_name -}}
{%- if prop_name not in (param_spec.required or []) -%}
{{- "?" }}
{%- endif -%}
{{- ": " }}
{{ render_typescript_type(prop_spec, param_spec.required or []) }}
{%- if not loop.last -%}
{{-", " }}
{%- endif -%}
{%- endfor -%}
{{- "}" }}
{%- else -%}
{{- "object" }}
{%- endif -%}
{%- else -%}
{{- "any" }}
{%- endif -%}
{%- endmacro -%}
{%- macro render_tool_namespace(namespace_name, tools) -%}
{{- "## " + namespace_name + "\n\n" }}
{{- "namespace " + namespace_name + " {\n\n" }}
{%- for tool in tools %}
{%- set tool = tool.function %}
{{- "// " + tool.description + "\n" }}
{{- "type "+ tool.name + " = " }}
{%- if tool.parameters and tool.parameters.properties %}
{{- "(_: {\n" }}
{%- for param_name, param_spec in tool.parameters.properties.items() %}
{%- if param_spec.description %}
{{- "// " + param_spec.description + "\n" }}
{%- endif %}
{{- param_name }}
{%- if param_name not in (tool.parameters.required or []) -%}
{{- "?" }}
{%- endif -%}
{{- ": " }}
{{- render_typescript_type(param_spec, tool.parameters.required or []) }}
{%- if param_spec.default is defined -%}
{%- if param_spec.enum %}
{{- ", // default: " + param_spec.default }}
{%- elif param_spec.oneOf %}
{{- "// default: " + param_spec.default }}
{%- else %}
{{- ", // default: " + param_spec.default|tojson }}
{%- endif -%}
{%- endif -%}
{%- if not loop.last %}
{{- ",\n" }}
{%- else %}
{{- ",\n" }}
{%- endif -%}
{%- endfor %}
{{- "}) => any;\n\n" }}
{%- else -%}
{{- "() => any;\n\n" }}
{%- endif -%}
{%- endfor %}
{{- "} // namespace " + namespace_name }}
{%- endmacro -%}
{%- macro render_builtin_tools(browser_tool, python_tool) -%}
{%- if browser_tool %}
{{- "## browser\n\n" }}
{{- "// Tool for browsing.\n" }}
{{- "// The `cursor` appears in brackets before each browsing display: `[{cursor}]`.\n" }}
{{- "// Cite information from the tool using the following format:\n" }}
{{- "// `【{cursor}†L{line_start}(-L{line_end})?】`, for example: `【6†L9-L11】` or `【8†L3】`.\n" }}
{{- "// Do not quote more than 10 words directly from the tool output.\n" }}
{{- "// sources=web (default: web)\n" }}
{{- "namespace browser {\n\n" }}
{{- "// Searches for information related to `query` and displays `topn` results.\n" }}
{{- "type search = (_: {\n" }}
{{- "query: string,\n" }}
{{- "topn?: number, // default: 10\n" }}
{{- "source?: string,\n" }}
{{- "}) => any;\n\n" }}
{{- "// Opens the link `id` from the page indicated by `cursor` starting at line number `loc`, showing `num_lines` lines.\n" }}
{{- "// Valid link ids are displayed with the formatting: `【{id}†.*】`.\n" }}
{{- "// If `cursor` is not provided, the most recent page is implied.\n" }}
{{- "// If `id` is a string, it is treated as a fully qualified URL associated with `source`.\n" }}
{{- "// If `loc` is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.\n" }}
{{- "// Use this function without `id` to scroll to a new location of an opened page.\n" }}
{{- "type open = (_: {\n" }}
{{- "id?: number | string, // default: -1\n" }}
{{- "cursor?: number, // default: -1\n" }}
{{- "loc?: number, // default: -1\n" }}
{{- "num_lines?: number, // default: -1\n" }}
{{- "view_source?: boolean, // default: false\n" }}
{{- "source?: string,\n" }}
{{- "}) => any;\n\n" }}
{{- "// Finds exact matches of `pattern` in the current page, or the page given by `cursor`.\n" }}
{{- "type find = (_: {\n" }}
{{- "pattern: string,\n" }}
{{- "cursor?: number, // default: -1\n" }}
{{- "}) => any;\n\n" }}
{{- "} // namespace browser\n\n" }}
{%- endif -%}
{%- if python_tool %}
{{- "## python\n\n" }}
{{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).\n\n" }}
{{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.\n\n" }}
{%- endif -%}
{%- endmacro -%}
{#- System Message Construction ============================================ #}
{%- macro build_system_message() -%}
{%- if model_identity is not defined %}
{%- set model_identity = "You are ChatGPT, a large language model trained by OpenAI." %}
{%- endif %}
{{- model_identity + "\n" }}
{{- "Knowledge cutoff: 2024-06\n" }}
{{- "Current date: " + strftime_now("%Y-%m-%d") + "\n\n" }}
{%- if reasoning_effort is not defined %}
{%- set reasoning_effort = "medium" %}
{%- endif %}
{{- "Reasoning: " + reasoning_effort + "\n\n" }}
{%- if builtin_tools %}
{{- "# Tools\n\n" }}
{%- set available_builtin_tools = namespace(browser=false, python=false) %}
{%- for tool in builtin_tools %}
{%- if tool == "browser" %}
{%- set available_builtin_tools.browser = true %}
{%- elif tool == "python" %}
{%- set available_builtin_tools.python = true %}
{%- endif %}
{%- endfor %}
{{- render_builtin_tools(available_builtin_tools.browser, available_builtin_tools.python) }}
{%- endif -%}
{{- "# Valid channels: analysis, commentary, final. Channel must be included for every message." }}
{%- if tools -%}
{{- "\nCalls to these tools must go to the commentary channel: 'functions'." }}
{%- endif -%}
{%- endmacro -%}
{#- Main Template Logic ================================================= #}
{#- Set defaults #}
{#- Render system message #}
{{- "<|start|>system<|message|>" }}
{{- build_system_message() }}
{{- "<|end|>" }}
{#- Extract developer message #}
{%- if messages[0].role == "developer" or messages[0].role == "system" %}
{%- set developer_message = messages[0].content %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set developer_message = "" %}
{%- set loop_messages = messages %}
{%- endif %}
{#- Render developer message #}
{%- if developer_message or tools %}
{{- "<|start|>developer<|message|>" }}
{%- if developer_message %}
{{- "# Instructions\n\n" }}
{{- developer_message }}
{{- "\n\n" }}
{%- endif %}
{%- if tools -%}
{{- "# Tools\n\n" }}
{{- render_tool_namespace("functions", tools) }}
{%- endif -%}
{{- "<|end|>" }}
{%- endif %}
{#- Render messages #}
{%- set last_tool_call = namespace(name=none) %}
{%- for message in loop_messages -%}
{#- At this point only assistant/user/tool messages should remain #}
{%- if message.role == 'assistant' -%}
{#- Checks to ensure the messages are being passed in the format we expect #}
{%- if "content" in message %}
{%- if "<|channel|>analysis<|message|>" in message.content or "<|channel|>final<|message|>" in message.content %}
{{- raise_exception("You have passed a message containing <|channel|> tags in the content field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
{%- endif %}
{%- endif %}
{%- if "thinking" in message %}
{%- if "<|channel|>analysis<|message|>" in message.thinking or "<|channel|>final<|message|>" in message.thinking %}
{{- raise_exception("You have passed a message containing <|channel|> tags in the thinking field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }}
{%- endif %}
{%- endif %}
{%- if "tool_calls" in message %}
{#- We need very careful handling here - we want to drop the tool call analysis message if the model #}
{#- has output a later <|final|> message, but otherwise we want to retain it. This is the only case #}
{#- when we render CoT/analysis messages in inference. #}
{%- set future_final_message = namespace(found=false) %}
{%- for future_message in loop_messages[loop.index:] %}
{%- if future_message.role == 'assistant' and "tool_calls" not in future_message %}
{%- set future_final_message.found = true %}
{%- endif %}
{%- endfor %}
{#- We assume max 1 tool call per message, and so we infer the tool call name #}
{#- in "tool" messages from the most recent assistant tool call name #}
{%- set tool_call = message.tool_calls[0] %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{%- if message.content and message.thinking %}
{{- raise_exception("Cannot pass both content and thinking in an assistant message with tool calls! Put the analysis message in one or the other, but not both.") }}
{%- elif message.content and not future_final_message.found %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }}
{%- elif message.thinking and not future_final_message.found %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
{%- endif %}
{{- "<|start|>assistant to=" }}
{{- "functions." + tool_call.name + "<|channel|>commentary " }}
{{- (tool_call.content_type if tool_call.content_type is defined else "json") + "<|message|>" }}
{{- tool_call.arguments|tojson }}
{{- "<|call|>" }}
{%- set last_tool_call.name = tool_call.name %}
{%- elif loop.last and not add_generation_prompt %}
{#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #}
{#- This is a situation that should only occur in training, never in inference. #}
{%- if "thinking" in message %}
{{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }}
{%- endif %}
{#- <|return|> indicates the end of generation, but <|end|> does not #}
{#- <|return|> should never be an input to the model, but we include it as the final token #}
{#- when training, so the model learns to emit it. #}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }}
{%- else %}
{#- CoT is dropped during all previous turns, so we never render it for inference #}
{{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }}
{%- set last_tool_call.name = none %}
{%- endif %}
{%- elif message.role == 'tool' -%}
{%- if last_tool_call.name is none %}
{{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
{%- endif %}
{{- "<|start|>functions." + last_tool_call.name }}
{{- " to=assistant<|channel|>commentary<|message|>" + message.content|tojson + "<|end|>" }}
{%- elif message.role == 'user' -%}
{{- "<|start|>user<|message|>" + message.content + "<|end|>" }}
{%- endif -%}
{%- endfor -%}
{#- Generation prompt #}
{%- if add_generation_prompt -%}
<|start|>assistant
{%- endif -%}
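
To make the template's message loop easier to follow, here is a simplified Python sketch of how plain user and assistant turns are rendered. It deliberately omits the system and developer blocks, tool calls, and the `thinking` field, so it is an approximation of the template above rather than a replacement for it; `render_turns` is an illustrative name.

```python
def render_turns(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render user/assistant turns the way the template's main loop does."""
    parts = []
    for i, message in enumerate(messages):
        role, content = message["role"], message["content"]
        if role == "user":
            parts.append("<|start|>user<|message|>" + content + "<|end|>")
        elif role == "assistant":
            is_last = i == len(messages) - 1
            # A trailing assistant turn without a generation prompt is the
            # training case, which ends in <|return|> instead of <|end|>.
            closer = "<|return|>" if is_last and not add_generation_prompt else "<|end|>"
            parts.append(
                "<|start|>assistant<|channel|>final<|message|>" + content + closer
            )
        else:
            raise ValueError(f"unsupported role in this sketch: {role!r}")
    if add_generation_prompt:
        parts.append("<|start|>assistant")
    return "".join(parts)
```

With a single user message and the default `add_generation_prompt=True`, the result ends in `<|start|>assistant`, which is where generation begins.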

@@ -0,0 +1,77 @@
{
  "architectures": [
    "GptOssForCausalLM"
  ],
  "attention_bias": true,
  "attention_dropout": 0.0,
  "torch_dtype": "bfloat16",
  "eos_token_id": 200002,
  "experts_per_token": 4,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2880,
  "initial_context_length": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 2880,
  "layer_types": [
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention"
  ],
  "max_position_embeddings": 131072,
  "model_type": "gpt_oss",
  "num_attention_heads": 64,
  "num_experts_per_tok": 4,
  "num_hidden_layers": 24,
  "num_key_value_heads": 8,
  "num_local_experts": 32,
  "output_router_logits": false,
  "pad_token_id": 199999,
  "quantization_config": {
    "modules_to_not_convert": [
      "model.layers.*.self_attn",
      "model.layers.*.mlp.router",
      "model.embed_tokens",
      "lm_head"
    ],
    "quant_method": "mxfp4"
  },
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "beta_fast": 32.0,
    "beta_slow": 1.0,
    "factor": 32.0,
    "original_max_position_embeddings": 4096,
    "rope_type": "yarn",
    "truncate": false
  },
  "rope_theta": 150000,
  "router_aux_loss_coef": 0.9,
  "sliding_window": 128,
  "swiglu_limit": 7.0,
  "tie_word_embeddings": false,
  "transformers_version": "4.56.2",
  "use_cache": true,
  "vocab_size": 201088
}
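
A few relationships between these fields are easy to miss. The sketch below derives them from a subset of the values above. The reading that the YaRN-scaled context equals `factor` times `original_max_position_embeddings` is the conventional interpretation of these fields and is an assumption here, as is the `derived` helper itself.

```python
# Subset of the config values shown above.
config = {
    "hidden_size": 2880,
    "head_dim": 64,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "max_position_embeddings": 131072,
    "rope_scaling": {"factor": 32.0, "original_max_position_embeddings": 4096},
}


def derived(cfg: dict) -> dict:
    """Compute attention widths, the GQA ratio, and the scaled context length."""
    rope = cfg["rope_scaling"]
    return {
        # Query width is heads * head_dim, which need not equal hidden_size.
        "q_width": cfg["num_attention_heads"] * cfg["head_dim"],
        "kv_width": cfg["num_key_value_heads"] * cfg["head_dim"],
        "queries_per_kv_head": cfg["num_attention_heads"] // cfg["num_key_value_heads"],
        "scaled_context": int(rope["factor"] * rope["original_max_position_embeddings"]),
    }


info = derived(config)
```

Here the query projection is 4096 wide against a hidden size of 2880, eight query heads share each key/value head, and the scaled context matches `max_position_embeddings`.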

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b3cccd7c87e4ff3388623e92c53bf9d60ba306ce28899384c09b85775a22a7d0
size 4878004880

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e8505c5488443be1dd5ccdbc507255e0555e4c6ce6b316365383ea62f8d8d0b5
size 4935173736

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9881de2098dbb81ff24e30bc22399dca5d877c60ef862df6a0f4c6d1f509a0fe
size 3948139160

@@ -0,0 +1,467 @@
{
"metadata": {
"total_parameters": 1804459584,
"total_size": 13761264768
},
"weight_map": {
"lm_head.weight": "model-00001-of-00003.safetensors",
"model.embed_tokens.weight": "model-00001-of-00003.safetensors",
"model.layers.0.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.0.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.0.mlp.experts.down_proj_blocks": "model-00001-of-00003.safetensors",
"model.layers.0.mlp.experts.down_proj_scales": "model-00001-of-00003.safetensors",
"model.layers.0.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.0.mlp.experts.gate_up_proj_blocks": "model-00001-of-00003.safetensors",
"model.layers.0.mlp.experts.gate_up_proj_scales": "model-00001-of-00003.safetensors",
"model.layers.0.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.0.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.0.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.0.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.0.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.0.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.0.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.0.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.0.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.0.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.0.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.1.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.1.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.1.mlp.experts.down_proj_blocks": "model-00001-of-00003.safetensors",
"model.layers.1.mlp.experts.down_proj_scales": "model-00001-of-00003.safetensors",
"model.layers.1.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.1.mlp.experts.gate_up_proj_blocks": "model-00001-of-00003.safetensors",
"model.layers.1.mlp.experts.gate_up_proj_scales": "model-00001-of-00003.safetensors",
"model.layers.1.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.1.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.1.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.1.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.1.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.1.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.1.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.1.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.1.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.1.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.1.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.1.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.10.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.10.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.10.mlp.experts.down_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.10.mlp.experts.down_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.10.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.10.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.10.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.10.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.10.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.10.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.10.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.10.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.10.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.10.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.10.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.10.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.10.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.10.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.10.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.11.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.11.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.11.mlp.experts.down_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.11.mlp.experts.down_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.11.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.11.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.11.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.11.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.11.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.11.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.11.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.11.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.11.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.11.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.11.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.11.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.11.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.11.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.11.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.12.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.12.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.12.mlp.experts.down_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.12.mlp.experts.down_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.12.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.12.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.12.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.12.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.12.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.12.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.12.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.12.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.12.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.12.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.12.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.12.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.12.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.12.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.12.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.13.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.13.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.13.mlp.experts.down_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.13.mlp.experts.down_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.13.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.13.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.13.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.13.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.13.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.13.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.13.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.13.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.13.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.13.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.13.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.13.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.13.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.13.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.13.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.14.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.14.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.14.mlp.experts.down_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.14.mlp.experts.down_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.14.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.14.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.14.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.14.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.14.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.14.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.14.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.14.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.14.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.14.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.14.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.14.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.14.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.14.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.14.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.15.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.15.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.15.mlp.experts.down_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.15.mlp.experts.down_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.15.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.15.mlp.experts.gate_up_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.15.mlp.experts.gate_up_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.15.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.15.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.15.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.15.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.15.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.15.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.15.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.15.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.15.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.15.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.15.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.15.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.16.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.16.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.16.mlp.experts.down_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.16.mlp.experts.down_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.16.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.16.mlp.experts.gate_up_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.16.mlp.experts.gate_up_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.16.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.16.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.16.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.16.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.16.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.16.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.16.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.16.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.16.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.16.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.16.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.16.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.17.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.17.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.17.mlp.experts.down_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.17.mlp.experts.down_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.17.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.17.mlp.experts.gate_up_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.17.mlp.experts.gate_up_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.17.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.17.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.17.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.17.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.17.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.17.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.17.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.17.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.17.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.17.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.17.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.17.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.18.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.18.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.18.mlp.experts.down_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.18.mlp.experts.down_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.18.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.18.mlp.experts.gate_up_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.18.mlp.experts.gate_up_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.18.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.18.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.18.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.18.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.18.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.18.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.18.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.18.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.18.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.18.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.18.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.18.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.19.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.19.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.19.mlp.experts.down_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.19.mlp.experts.down_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.19.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.19.mlp.experts.gate_up_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.19.mlp.experts.gate_up_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.19.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.19.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.19.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.19.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.19.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.19.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.19.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.19.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.19.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.19.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.19.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.19.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.2.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.2.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.2.mlp.experts.down_proj_blocks": "model-00001-of-00003.safetensors",
"model.layers.2.mlp.experts.down_proj_scales": "model-00001-of-00003.safetensors",
"model.layers.2.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.2.mlp.experts.gate_up_proj_blocks": "model-00001-of-00003.safetensors",
"model.layers.2.mlp.experts.gate_up_proj_scales": "model-00001-of-00003.safetensors",
"model.layers.2.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.2.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.2.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.2.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.2.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.2.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.2.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.2.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.2.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.2.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.2.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.2.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.20.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.20.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.20.mlp.experts.down_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.20.mlp.experts.down_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.20.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.20.mlp.experts.gate_up_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.20.mlp.experts.gate_up_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.20.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.20.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.20.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.20.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.20.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.20.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.20.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.20.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.20.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.20.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.20.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.20.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.21.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.21.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.21.mlp.experts.down_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.21.mlp.experts.down_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.21.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.21.mlp.experts.gate_up_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.21.mlp.experts.gate_up_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.21.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.21.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.21.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.21.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.21.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.21.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.21.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.21.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.21.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.21.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.21.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.21.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.22.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.22.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.22.mlp.experts.down_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.22.mlp.experts.down_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.22.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.22.mlp.experts.gate_up_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.22.mlp.experts.gate_up_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.22.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.22.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.22.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.22.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.22.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.22.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.22.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.22.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.22.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.22.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.22.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.22.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.23.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.23.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.23.mlp.experts.down_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.23.mlp.experts.down_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.23.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.23.mlp.experts.gate_up_proj_blocks": "model-00003-of-00003.safetensors",
"model.layers.23.mlp.experts.gate_up_proj_scales": "model-00003-of-00003.safetensors",
"model.layers.23.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.23.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.23.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.23.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.23.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.23.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.23.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.23.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.23.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.23.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.23.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.23.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.3.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.3.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.3.mlp.experts.down_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.3.mlp.experts.down_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.3.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.3.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.3.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.3.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.3.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.3.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.3.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.3.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.3.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.3.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.3.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.3.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.3.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.3.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.3.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.4.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.4.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.4.mlp.experts.down_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.4.mlp.experts.down_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.4.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.4.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.4.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.4.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.4.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.4.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.4.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.4.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.4.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.4.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.4.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.4.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.4.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.4.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.4.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.5.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.5.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.5.mlp.experts.down_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.5.mlp.experts.down_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.5.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.5.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.5.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.5.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.5.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.5.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.5.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.5.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.5.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.5.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.5.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.5.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.5.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.5.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.5.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.6.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.6.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.6.mlp.experts.down_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.6.mlp.experts.down_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.6.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.6.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.6.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.6.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.6.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.6.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.6.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.6.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.6.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.6.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.6.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.6.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.6.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.6.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.6.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.7.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.7.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.7.mlp.experts.down_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.7.mlp.experts.down_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.7.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.7.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.7.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.7.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.7.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.7.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.7.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.7.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.7.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.7.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.7.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.7.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.7.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.7.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.7.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.8.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.8.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.8.mlp.experts.down_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.8.mlp.experts.down_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.8.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.8.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.8.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.8.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.8.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.8.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.8.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.8.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.8.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.8.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.8.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.8.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.8.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.8.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.8.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.9.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.9.mlp.experts.down_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.9.mlp.experts.down_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.9.mlp.experts.down_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.9.mlp.experts.gate_up_proj_bias": "model-00001-of-00003.safetensors",
"model.layers.9.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
"model.layers.9.mlp.experts.gate_up_proj_scales": "model-00002-of-00003.safetensors",
"model.layers.9.mlp.router.bias": "model-00001-of-00003.safetensors",
"model.layers.9.mlp.router.weight": "model-00001-of-00003.safetensors",
"model.layers.9.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.layers.9.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.9.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.9.self_attn.o_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.9.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.9.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.9.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.layers.9.self_attn.sinks": "model-00001-of-00003.safetensors",
"model.layers.9.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.layers.9.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.norm.weight": "model-00001-of-00003.safetensors"
}
}
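
The weight map above assigns each tensor name to one of three shard files, and a single layer can span more than one shard (layer 14, for example, has tensors in all three). The following is a minimal sketch of how such a map could be inspected. It uses a hand-copied excerpt of the entries rather than the full index, and the helper names `group_by_shard` and `shards_for_layer` are hypothetical, not part of any library.

```python
from collections import defaultdict

# Hand-copied excerpt of the "weight_map" entries shown above (assumption:
# the real index file contains many more entries than these five).
WEIGHT_MAP_EXCERPT = {
    "model.layers.14.mlp.experts.down_proj_blocks": "model-00003-of-00003.safetensors",
    "model.layers.14.mlp.experts.gate_up_proj_blocks": "model-00002-of-00003.safetensors",
    "model.layers.14.mlp.router.weight": "model-00001-of-00003.safetensors",
    "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
    "model.norm.weight": "model-00001-of-00003.safetensors",
}


def group_by_shard(weight_map):
    """Invert the tensor -> shard mapping into shard -> sorted tensor names."""
    shards = defaultdict(list)
    for tensor_name, shard_file in weight_map.items():
        shards[shard_file].append(tensor_name)
    return {shard: sorted(names) for shard, names in sorted(shards.items())}


def shards_for_layer(weight_map, layer):
    """Return the sorted shard files that hold any tensor of the given layer."""
    # The trailing dot keeps layer 1 from matching layers 10 through 19.
    prefix = f"model.layers.{layer}."
    return sorted(
        {shard for name, shard in weight_map.items() if name.startswith(prefix)}
    )
```

Calling `shards_for_layer(WEIGHT_MAP_EXCERPT, 14)` on this excerpt lists all three shard files, which is why a loader generally has to consult the index instead of assuming one shard per layer.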


@@ -0,0 +1,23 @@
{
"bos_token": {
"content": "<|startoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|return|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}
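
The special-token map above wraps each token string in an object with tokenizer flags. As a rough sketch, the plain strings can be pulled out as shown below. The JSON is inlined from the listing above, the helper name `token_strings` is hypothetical, and the fallback for plain-string entries is an assumption about how other maps of this kind may be written, not something this file needs.

```python
import json

# Inlined copy of the special-token map shown above.
SPECIAL_TOKENS_JSON = """
{
  "bos_token": {"content": "<|startoftext|>", "lstrip": false,
                "normalized": false, "rstrip": false, "single_word": false},
  "eos_token": {"content": "<|return|>", "lstrip": false,
                "normalized": false, "rstrip": false, "single_word": false},
  "pad_token": {"content": "<|endoftext|>", "lstrip": false,
                "normalized": false, "rstrip": false, "single_word": false}
}
"""


def token_strings(special_tokens_map):
    """Map each role (bos_token, eos_token, ...) to its literal token string."""
    result = {}
    for role, entry in special_tokens_map.items():
        # Entries here are objects with a "content" field; plain strings are
        # passed through unchanged (assumption: some maps use that form).
        result[role] = entry["content"] if isinstance(entry, dict) else entry
    return result


TOKENS = token_strings(json.loads(SPECIAL_TOKENS_JSON))
```

Note that the end-of-sequence token (`<|return|>`) and the padding token (`<|endoftext|>`) are distinct strings in this map, so code that treats padding as end-of-sequence would not match this configuration.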


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0614fe83cadab421296e664e1f48f4261fa8fef6e03e63bb75c20f38e37d07d3
size 27868174
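
The three lines above are a Git LFS pointer: the repository stores this small text stub, and the real file is fetched separately by its SHA-256 digest. A minimal sketch of parsing such a pointer follows. The helper name `parse_lfs_pointer` is hypothetical, and the parser assumes each line is a key and a value separated by a single space, as in the stub shown here.

```python
# Pointer text copied from the listing above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:0614fe83cadab421296e664e1f48f4261fa8fef6e03e63bb75c20f38e37d07d3
size 27868174
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of an LFS pointer into a small dictionary."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


POINTER = parse_lfs_pointer(POINTER_TEXT)
```

The `size` field gives the byte length of the actual file (27,868,174 bytes here), which can be compared against a downloaded copy before trusting it.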

File diff suppressed because one or more lines are too long