Under the Hood of Large Language Models

Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)

1. Gather a Representative Mini-Corpus

In practice, you should use a folder containing domain-specific documents (in .txt format) that truly represent the language patterns you want your tokenizer to learn. This corpus should include all the specialized vocabulary, technical terms, and syntactic constructions common in your target domain.

For this demonstration, we'll create a small sample of legal text that mimics the distinctive characteristics of legal language, including:

  • Formal legal terminology (plaintiff, defendant, pursuant)
  • Legal references (Rule 12(b), Section 2.3)
  • Specialized document structures (WHEREAS clauses)
  • Contractual language patterns ("remains in full force and effect")

When building a real-world tokenizer, you would want to collect thousands of documents from your target domain to ensure comprehensive coverage of domain-specific language.

from pathlib import Path

corpus = [
    "The plaintiff hereby files a motion to dismiss under Rule 12(b).",
    "The defendant shall pay damages as determined by the court.",
    "Pursuant to Section 2.3, the agreement remains in full force and effect.",
    "WHEREAS, the Parties desire to amend the Master Services Agreement (MSA)."
]

Path("data").mkdir(exist_ok=True)
with open("data/legal_demo.txt", "w", encoding="utf-8") as f:
    for line in corpus:
        f.write(line + "\n")

Let me break down this code snippet that creates a small corpus of legal text samples and saves it to a file:

1. Creating a corpus of legal text samples:

corpus = [
    "The plaintiff hereby files a motion to dismiss under Rule 12(b).",
    "The defendant shall pay damages as determined by the court.",
    "Pursuant to Section 2.3, the agreement remains in full force and effect.",
    "WHEREAS, the Parties desire to amend the Master Services Agreement (MSA)."
]

This part defines a Python list named corpus containing four example sentences that demonstrate typical legal language patterns. Each string is a sample that combines legal terminology (plaintiff, defendant, pursuant), legal references (Rule 12(b), Section 2.3), document structure elements (WHEREAS), and contractual language ("remains in full force and effect").

2. Creating a directory and saving the corpus to a file:

Path("data").mkdir(exist_ok=True)
with open("data/legal_demo.txt", "w", encoding="utf-8") as f:
    for line in corpus:
        f.write(line + "\n")

This section handles file operations:

  • First, it creates a directory named "data" using Path("data").mkdir(exist_ok=True). Path is imported from Python's standard pathlib module, and the exist_ok=True parameter ensures the code won't raise an error if the directory already exists.
  • Then, it opens a file named "legal_demo.txt" inside the "data" directory in write mode ('w') with UTF-8 encoding.
  • It iterates through each line in the corpus list and writes it to the file, adding a newline character ('\n') after each line to ensure each sample appears on a separate line.

For a real project, stream large text files: contracts, statutes, filings, domain wikis, etc. The key is coverage—include the symbols and structures your model will see in production.
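As a sketch of that streaming approach (the folder path and helper name here are illustrative, not part of the project code), you can lazily yield lines from every .txt file in a directory so the full corpus never has to fit in memory:

```python
from pathlib import Path

def iter_corpus_lines(folder):
    """Yield non-empty lines from every .txt file in `folder`,
    one line at a time, so the corpus is streamed rather than
    loaded into memory all at once."""
    for path in sorted(Path(folder).glob("*.txt")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    yield line
```

Tokenizer-training libraries such as Hugging Face's tokenizers can consume an iterator like this directly, which is what makes the streaming pattern practical for corpora of thousands of documents.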