Synthetic Data Generation & Data preprocessing

Synthetic data generation for industrial applications

Synthetic data is a form of data that mimics the real-world patterns generated through machine learning algorithms.

Importance of synthetic data

Synthetic data is very important for industrial uses because of several key reasons: Here is a mindmap diagram illustrating the importance of data synthesis for industrial applications, specifically for fine-tuning Large Language Models (LLMs):

1. Overcoming Data Scarcity

Data synthesis helps generate diverse datasets where real-world data is limited or hard to collect. For example, in industries such as healthcare, synthetic patient data can be created to train models without compromising privacy. This approach allows for a more comprehensive dataset that models can learn from, enhancing their ability to generalize across different scenarios.

2. Improving Model Robustness

Data synthesis can help reduce model bias and increase its generalization capabilities, By generating synthetic data that covers rare or underrepresented scenarios. For Example, in autonomous vehicle development, synthetic data can simulate rare driving conditions (like extreme weather which real-data is more rare) to ensure the model performs well in a variety of situations, thus improving safety and reliability.

3. Accelerating Development

Data synthesis can significantly speed up the development cycle of LLMs by reducing the time and costs associated with data collection and labeling. For example, for companies developing natural language processing (NLP) applications, synthetic data can quickly provide the diverse linguistic input needed for training, enabling faster iteration and deployment of models.

4. Data Protection and Privacy:

By generating synthetic datasets that exclude personally identifiable information and sensitive data, user privacy is effectively safeguarded. These datasets can be used for research and development purposes.

5. Data Augmentation:

Moreover, generative models provide the advantage of generating novel training data that can significantly improve real-world datasets. This method is especially beneficial when obtaining more real data is expensive or time-consuming.

6. Data Availability and Accessibility:

In situations where access to authentic data is constrained or inadequate, generative models offer a viable solution, enabling researchers and developers to manipulate data representations for their research or applications.

Attention

When creating and utilizing synthetic data, it’s essential to be mindful of its accuracy, potential biases, privacy concerns, the risk of overfitting, the expenses involved, and adherence to legal standards.

Techniques for generating synthetic data

Now that we discussed the importance of generating synthetic datasets, We will introduce multiple techniques to generate synthetic data, out of which these are the most prominent:

Drawing Numbers from a Distribution

Sampling from a distribution is a widely used method in synthetic data generation, where numbers are selected from a statistical distribution to mimic real-world data patterns. Using Python and the NumPy library, one can easily generate datasets that adhere to a specified distribution, such as a normal distribution. An example of this is generating a dataset with values normally distributed around a mean of 0 and a standard deviation of 1.

Agent-based Modeling (ABM):

Agent-based modeling is a simulation technique where individual agents interact with each other in a complex system. ABM is useful for examining interactions between agents such as people, cells, or computer programs. Python packages such as Mesa can quickly create agent-based models using built-in core components and visualize them in a browser-based interface. For example simulating predator-prey dynamics in an ecosystem.

Note

Using effective prompt strategies can steer the model to produce better, consistent, and more factual responses.

Prompts:

Prompt:: Produce 10 exemplars for sentiment analysis. Examples are categorized as either positive or negative. Produce 2 negative examples and 8 positive examples. Use this format for the examples: Q: <sentence> A: <sentiment>

Output::
  • Q: I just got the best news ever! A: Positive

  • Q: We just got a raise at work! A: Positive

  • Q: I’m so proud of what I accomplished today. A: Positive

  • Q: I’m having the best day ever! A: Positive

  • Q: I’m really looking forward to the weekend. A: Positive

  • Q: I just got the best present ever! A: Positive

  • Q: I’m so happy right now. A: Positive

  • Q: I’m so blessed to have such an amazing family. A: Positive

  • Q: The weather outside is so gloomy. A: Negative

  • Q: I just got some terrible news. A: Negative

This is very useful. We actually use this example for a different test in another section of the guides.

Note

In our case, we generated synthetic data from GPT-4, by the use of the following prompt:

Prompt::

Instruction: As an industrial engineer with extensive expertise in your field, your task is to generate a synthetic dataset that accurately reflects the complexities and nuances of industrial engineering.

Input: generate synthetic data about production to be near factual, generate real industrial situations.

Output: Json file <input, context, output> and the number of questions is 500.

The output was our json file containing synthetic data.

Data preprocessing for LLMs

Data: What’s It All About?

Data is information collected or generated for analysis and interpretation. It encompasses a wide range of formats, including numerical values, textual content, images, and more. In the realm of computing and analysis, data serves as the cornerstone for deriving insights, facilitating informed decision-making, and driving innovation.

Alternative text for the image

Data Preprocessing and LLMs

Data preprocessing is a pivotal phase in our pipeline, essential for achieving high-quality results. It encompasses refining raw text data to ensure smoother training, faster convergence, and superior model performance. By standardizing input data and employing specialized techniques tailored for large language models, we effectively reduce complexity and enhance overall efficiency.

Alternative text for the image

Data cleaning

Note

Data cleaning is a fundamental aspect of data pre-processing for training LLMs. This technique involves identifying and rectifying inaccuracies, inconsistencies, and irrelevant elements within the raw text data.

Data cleaning methods :

1. Handling missing values

In any dataset, missing values can crop up when certain observations or features lack data. These gaps pose a significant challenge, potentially skewing predictions and introducing bias into our models. To prevent this, we need to address missing values effectively, ensuring our predictions remain reliable and accurate.

2. Noise reduction

Think of data noise as those pesky distractions – irrelevant or random bits of information that sneak into our dataset. They’re like static on a radio, distorting the true signal and making it harder for our models to predict accurately. By filtering out this noise, we can unveil the genuine patterns hiding within our data and make more reliable predictions.

3. Consistency checks

Warning

Consistency checks are like quality control for data, making sure everything follows the same formats and rules. When data doesn’t match up due to errors or glitches, it can cause confusion and inaccuracies in model training.

4. Deduplication

Duplicates in data often stem from errors during data entry or glitches in the system. These duplicates can skew the distribution of data and result in biased model training. By eliminating duplicates, we ensure the dataset is more accurate and representative, thereby improving the performance of the LLM.

Text cleaning and normalization

Removing noise and normalization

Removing noise from text and standardizing it are essential for enhancing the quality of our textual data, which in turn improves the performance of natural language processing tasks. Various techniques and tools are available to address these challenges.

To effectively manage this task, we can utilize several libraries that are designed to help clean and standardize text data. These libraries provide robust functionalities for preprocessing text, making it cleaner and more uniform for subsequent analysis.

Lowercasing is a text preprocessing step where all letters in the text are converted to lowercase. This step is implemented so that the algorithm does not treat the same words differently in different situations.

Punctuation and special characters removal is a text preprocessing step where you remove all punctuation marks (such as periods, commas, exclamation marks, emojis etc.) from the text to simplify it and focus on the words themselves.

Stop-words are the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc) and do not add much information to the text. Examples of a few stop words in English are “the”, “a”, “an”, “so”, “what”. For this technique, we will use the NLTK library.

This preprocessing step is to remove any URLs present in the data.

It’s important to remove HTML tags from the text data. Indeed, if we work with text data obtained from HTML sources, the text may contain HTML tags, which are not desirable for text analysis.

Stemming is a linguistic normalization that simplifies words to their base form or root by removing prefixes and suffixes, helping the natural language processing and information retrieval.

Lemmatization simplifies words to their base or root form, known as the lemma, making it easier to analyze a word. The difference between stemming and lemmatization is that the latter transforms words to their standardized form and aims to return a valid word by applying linguistic rules and context.

Warning

The difference between Lemmatization and Stemming is simple :Lemmatization reduces words to their canonical form,while stemming reduces them to their root form

Alternative text for the image

Removing redundant whitespace is an essential step in text preprocessing to ensure a clean and standardized representation of the text. Extra whitespaces can lead to inconsistencies and negatively impact the performance of natural language processing (NLP) tasks.

LAB

Note

After preprocessing our data, the next step will be to feed it into our large language model. The concern here is whether our LLM will have a very good understanding of our cleaned data. In this context, we will discuss an important concept known as Tokenization.

Tokenization and word embedding

Tokenization and word embedding are crucial steps in natural language processing (NLP) that help convert textual data into numerical formats that can be easily processed by machine learning models.

1.Tokenization

Tokenization, in the realm of Natural Language Processing (NLP) and machine learning, refers to the process of converting a sequence of text into smaller parts, known as tokens. These tokens can be as small as characters or as long as words. The aim of tokenization is to represent text in a manner that’s meaningful for machines without losing its context.

  1. Word Tokenization: Splits text into individual words.

    • Example: “Hello, world!” becomes [“Hello”, “,”, “world”, “!”]

  2. Sentence Tokenization: Splits text into individual sentences.

    • Example: “Hello world. How are you?” becomes [“Hello world.”, “How are you?”]

  3. Subword Tokenization: Breaks words into subword units, often used in transformer models like BERT.

    • Example: “unhappiness” becomes [“un”, “##happiness”]

    Alternative text for the image
  • NLTK (Natural Language Toolkit): Provides functions for word and sentence tokenization along with additional NLP functionalities like stemming, lemmatization, and POS tagging.

    # Importing necessary functions from NLTK
    import nltk
    from nltk.tokenize import word_tokenize, sent_tokenize
    
    # Download the NLTK data files (only the first time)
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')
    nltk.download('wordnet')
    
    # Input text
    text = "Hello world. How are you?"
    
    # Word tokenization
    words = word_tokenize(text)
    
    # Sentence tokenization
    sentences = sent_tokenize(text)
    
    # Part-of-Speech (POS) tagging
    pos_tags = nltk.pos_tag(words)
    
    # Lemmatization
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(word) for word in words]
    
    # Output results
    print("Word Tokenization:", words)
    print("Sentence Tokenization:", sentences)
    print("POS Tagging:", pos_tags)
    print("Lemmatization:", lemmas)
    

    Output:

    Word Tokenization: ['Hello', 'world', '.', 'How', 'are', 'you', '?']
    Sentence Tokenization: ['Hello world.', 'How are you?']
    POS Tagging: [('Hello', 'NNP'), ('world', 'NN'), ('.', '.'), ('How', 'WRB'), ('are', 'VBP'), ('you', 'PRP'), ('?', '.')]
    Lemmatization: ['Hello', 'world', '.', 'How', 'are', 'you', '?']
    
  • spaCy: Efficient tokenization and additional NLP functionalities such as POS tagging, named entity recognition (NER), and dependency parsing.

    # Importing spaCy and loading the English model
    import spacy
    
    # Load the small English model
    nlp = spacy.load("en_core_web_sm")
    
    # Input text
    text = "Hello world. How are you?"
    
    # Processing the text through the spaCy pipeline
    doc = nlp(text)
    
    # Extracting tokens
    tokens = [token.text for token in doc]
    
    # POS tagging
    pos_tags = [(token.text, token.pos_) for token in doc]
    
    # Named Entity Recognition (NER)
    entities = [(entity.text, entity.label_) for entity in doc.ents]
    
    # Dependency Parsing
    dependencies = [(token.text, token.dep_, token.head.text) for token in doc]
    
    # Output results
    print("spaCy Tokenization:", tokens)
    print("POS Tagging:", pos_tags)
    print("Named Entities:", entities)
    print("Dependencies:", dependencies)
    

    Output:

    spaCy Tokenization: ['Hello', 'world', '.', 'How', 'are', 'you', '?']
    POS Tagging: [('Hello', 'INTJ'), ('world', 'NOUN'), ('.', 'PUNCT'), ('How', 'ADV'), ('are', 'AUX'), ('you', 'PRON'), ('?', 'PUNCT')]
    Named Entities: []
    Dependencies: [('Hello', 'intj', 'Hello'), ('world', 'ROOT', 'world'), ('.', 'punct', 'world'), ('How', 'advmod', 'are'), ('are', 'ROOT', 'are'), ('you', 'nsubj', 'are'), ('?', 'punct', 'are')]
    
  • transformers: Tokenization for transformer-based models, and access to pre-trained transformer models for tasks such as text classification, named entity recognition, and question answering.

    # Importing the BertTokenizer from transformers library
    from transformers import BertTokenizer
    
    # Loading the BERT tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    
    # Input text
    text = "Hello world. How are you?"
    
    # Tokenizing the text using BERT tokenizer
    tokens = tokenizer.tokenize(text)
    
    # Output results
    print("BERT Tokenization:", tokens)
    

    Output:

    BERT Tokenization: ['hello', 'world', '.', 'how', 'are', 'you', '?']
    

2.Word Embedding

Word embedding is a technique to represent words in continuous vector space where semantically similar words are mapped to nearby points.

  • To capture the semantic meaning of words.

  • To reduce the dimensionality of text data for easier processing by machine learning models.

1. One-Hot Encoding

One-hot encoding is a basic and straightforward method to represent words. Each word in the vocabulary is represented as a binary vector, where only the index corresponding to that word is set to 1, and all other indices are set to 0.

  • Example: In a vocabulary of three words [“cat”, “dog”, “fish”], the one-hot encodings would be: - “cat” -> [1, 0, 0] - “dog” -> [0, 1, 0] - “fish” -> [0, 0, 1]

  • Advantages: - Simple to implement. - Intuitive and easy to understand.

  • Limitations: - High dimensionality: For a large vocabulary, the vectors become very large and sparse. - No semantic meaning: Does not capture any relationship between words. “cat” and “dog” are as dissimilar as “cat” and “fish”.

2. Word2Vec

Word2Vec is a popular word embedding technique that produces dense, continuous vector representations of words. It uses neural networks to learn the embeddings from large corpora of text. Word2Vec offers two models:

  • CBOW (Continuous Bag of Words): - Predicts the target word from the context words. - Faster and works well with frequent words.

  • Skip-gram: - Predicts the context words from the target word. - Better for smaller datasets and rare words.

CBOW aims to predict a target word based on its context, which consists of the surrounding words in a given window. It a feedforward neural network with a single hidden layer. The input layer represents the context words, and the output layer represents the target word. The hidden layer contains the learned continuous vector representations (word embeddings) of the input words.

SentencePiece Tokenization

Skip-gram is a slightly different word embedding technique in comparison to CBOW as it does not predict the current word based on the context.This variant takes only one word as an input and then predicts the closely related context words. That is the reason it can efficiently represent rare words.

SentencePiece Tokenization
LAB

3. Bag of Words (BOW)

Bag of Words (BOW) is a simple but powerful approach to vectorizing text. As the name suggests, the bag-of-words technique does not consider the position of a word in a document. Instead, all the words are dropped into a big bag. The idea is to count the number of times each word appears in each document without considering its position or grammatical role.

  • Example: Consider the three following sentences from the well-known Surfin’ Bird song, and count the number of times each word appears in each sentence. Let’s first list all the words in the verses:

BOW on Surfin’ Bird

Each word is now associated with its own column of numbers, its own vector:

Each word has its own vector

Note

The matrix calculated on this simple example of three sentences can be generalized to many documents in the corpus. Each document is a row, and each token is a column. Such a matrix is called the document-term matrix. It describes the frequency of terms that occur in a collection of documents and is used as input to a machine learning classifier.

Note that the size of the document-term matrix is: number of documents * size of vocabulary.

The vector size of each token equals the number of documents in the corpus. With this BOW approach, a large corpus has long vectors, and a small corpus has short vectors (as in the Surfin’ Bird example above, where each vector only has three numbers).

Note

A corpus is a set of texts or documents. A document is any distinct text varying in size, format, or structure and can include sentences, paragraphs, tweets, SMS, comments, reviews, articles, books, etc.

LAB

4. GloVe (Global Vectors for Word Representation)

GloVe is a word embedding technique that combines the advantages of both matrix factorization techniques and local context window methods. It leverages word co-occurrence statistics from a corpus to learn word vectors.

  • Example: GloVe creates word vectors based on the frequency of words appearing together in a context window across the entire corpus.

  • Advantages: - Captures both global statistical information and local context. - Efficient and produces meaningful embeddings.

  • Limitations: - Context-independent: Similar to Word2Vec, it doesn’t capture different meanings of words in different contexts.

5. FastText

FastText extends Word2Vec by representing each word as an n-gram of characters, which helps in generating embeddings for words that were not seen during training (out-of-vocabulary words).

  • Example: The word “unhappiness” can be represented as character n-grams [“un”, “unh”, “hap”, “happ”, “ppiness”, “iness”].

  • Advantages: - Handles out-of-vocabulary words better. - Produces more fine-grained embeddings by considering subword information.

  • Limitations: - Increased computational complexity compared to Word2Vec.