# TinyTimV1: Fine-tuning TinyLlama on Finnegans Wake

A project exploring the fine-tuning of TinyLlama-1.1B on James Joyce's *Finnegans Wake* to generate Joyce-inspired text.

## Overview

This project fine-tunes the TinyLlama-1.1B-Chat model on the complete text of James Joyce's *Finnegans Wake*, creating a language model capable of generating text in Joyce's distinctive experimental style. The model learns to replicate the complex wordplay, neologisms, and stream-of-consciousness narrative techniques characteristic of Joyce's final work.

## Files

- `process_wake.py` - Preprocesses the raw text, removes page numbers, and splits it into manageable chunks
- `fine_tune_joyce.py` - Main training script using HuggingFace Transformers
- `text_gen.py` - Text generation script for the fine-tuned model
- `finn_wake.txt` - Complete text of *Finnegans Wake* (1.51 MB)
- `finn_wake.csv` - Processed dataset in CSV format
- `finn_wake_dataset/` - Tokenized dataset directory

## Usage

### 1. Data Preprocessing

```bash
python process_wake.py
```

This removes page numbers and splits the text into 100-word chunks for training.
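
A minimal sketch of what this step might do, assuming page numbers sit on lines of their own and the CSV uses a `text` column (both assumptions; see `process_wake.py` for the actual implementation):

```python
import re

import pandas as pd

# Load the raw text (filename from the Files section above).
with open("finn_wake.txt", encoding="utf-8") as f:
    text = f.read()

# Drop standalone page numbers; the exact pattern handled by
# process_wake.py is assumed here to be digit-only lines.
text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)

# Split into 100-word chunks for training.
words = text.split()
chunks = [" ".join(words[i:i + 100]) for i in range(0, len(words), 100)]

# Write the processed dataset to CSV, matching finn_wake.csv.
pd.DataFrame({"text": chunks}).to_csv("finn_wake.csv", index=False)
```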

### 2. Fine-tuning

```bash
python fine_tune_joyce.py
```

Fine-tunes TinyLlama on the processed dataset for 3 epochs on CPU.
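
The core of the training script likely resembles the following sketch. The hyperparameters come from the Model Details section below; the output directory name is hypothetical, and `use_cpu` assumes a recent transformers release:

```python
from datasets import load_from_disk
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without one
model = AutoModelForCausalLM.from_pretrained(base_model)

# Tokenized dataset directory listed in the Files section.
dataset = load_from_disk("finn_wake_dataset")

args = TrainingArguments(
    output_dir="tinytim_v1",        # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    save_steps=500,                 # checkpoint every 500 steps
    use_cpu=True,                   # training was done on CPU
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```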

### 3. Text Generation

```bash
python text_gen.py
```

Generates Joyce-inspired text using the fine-tuned model.
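
A generation setup consistent with the sampling parameters listed under Model Details might look like this (the model path is the hypothetical output directory from the sketch above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "tinytim_v1"  # path to the fine-tuned model (assumed)
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

prompt = "ae left to go to ireland and found a fairy"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling parameters from the Model Details section.
output = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```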

## Model Details

- **Base model:** TinyLlama-1.1B-Chat-v1.0
- **Training data:** *Finnegans Wake* (~1.5 MB of text)
- **Training parameters:**
  - 3 epochs
  - Batch size: 1
  - Max sequence length: 128 tokens
- **Generation parameters:**
  - Temperature: 0.7
  - Top-k: 50, Top-p: 0.95

## Example Output

**Input:** "ae left to go to ireland and found a fairy"

The model generates text continuing in Joyce's experimental style, with invented words, Irish references, and complex linguistic play.

## Requirements

- transformers
- datasets
- pandas
- torch

## Installation

```bash
pip install transformers datasets pandas torch
```

## Notes

- Training was performed on CPU due to resource constraints
- Model checkpoints are saved every 500 steps
- Training can be resumed from a saved checkpoint (see the sketch below)
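
With the `Trainer` object from the fine-tuning sketch above, resuming is a one-liner:

```python
# Continue from the most recent checkpoint in output_dir.
trainer.train(resume_from_checkpoint=True)
```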