Data synthesis - GitHub issues

What's the problem?

Many Wordplay examples need text content or data to be interesting -- a hollow algorithm shell without interesting content tends not to inspire. The internet can be a source of such content, but using it can require challenging data wrangling. And there are benefits to generating data or content that is unique to a project.

What's the design idea?

Use generative AI approaches to synthesize data and content for programs, both for testing and for creative purposes. Several open research questions need answers to make this possible:

What is the difference between generating data and text?
Can generative AI approaches generate structured data that has particular statistical properties?
How can generative approaches maintain provenance of the source of text or data?
How might generated data or content be kept around and swapped in and out, like a collection to be used and reused in and outside of a project?
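
To make the questions about statistical properties and provenance more concrete, here is a minimal Python sketch. It is not a generative AI approach and not part of Wordplay: it uses a plain random generator, and the names `SyntheticColumn`, `generate_normal_column`, and `meets_target` are hypothetical. The idea it illustrates is that any generator's output could be checked against target statistics, and that the record of how the data was produced could travel with the data.

```python
import random
import statistics
from dataclasses import dataclass, field


@dataclass
class SyntheticColumn:
    """A generated column plus a record of how it was produced (hypothetical structure)."""
    values: list
    provenance: dict = field(default_factory=dict)


def generate_normal_column(mean: float, stdev: float, count: int, seed: int) -> SyntheticColumn:
    """Draw `count` values from a normal distribution and record the parameters used."""
    rng = random.Random(seed)  # seeded so the same request reproduces the same data
    values = [rng.gauss(mean, stdev) for _ in range(count)]
    provenance = {
        "generator": "random.gauss",
        "mean": mean,
        "stdev": stdev,
        "count": count,
        "seed": seed,
    }
    return SyntheticColumn(values=values, provenance=provenance)


def meets_target(column: SyntheticColumn, mean: float, stdev: float, tolerance: float) -> bool:
    """Check that the sample mean and standard deviation are within `tolerance` of the targets."""
    sample_mean = statistics.mean(column.values)
    sample_stdev = statistics.stdev(column.values)
    return abs(sample_mean - mean) <= tolerance and abs(sample_stdev - stdev) <= tolerance


# Example: a column of synthetic heights in centimeters (illustrative values only).
heights = generate_normal_column(mean=170.0, stdev=10.0, count=5000, seed=42)
print(meets_target(heights, 170.0, 10.0, tolerance=1.0))
print(heights.provenance)
```

The same `meets_target` check could in principle be applied to data produced by a generative model, which is one way to frame the second question above as something testable.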

Who benefits?

Anyone creating Wordplay projects centered around data and content.

Design specification

(This section should be included after a design proposal is ready and approved, and the buildable tag is added. This text can remain until then. Designers should add their proposal here, not in a comment).

I forgot to assign this previously, but I conducted rudimentary research on this issue, specifically on the question "What is the difference between generating data and text?" I will paste my research below, but I will not be continuing with this issue during the summer quarter.

https://www.geeksforgeeks.org/difference-between-data-mining-and-text-mining/
https://www.cms-connected.com/News-Archive/March-2019/What%E2%80%99s-the-Difference-Between-Data-and-Text-Mining#:~:text=While%20data%20mining%20handles%20structured,as%20in%20social%20media%20feeds.

GENERATING DATA

Objective: The primary goal of data generation is to create structured data that can be used for analysis, training machine learning models, simulations, testing software, or other specific applications.

Characteristics
Structured Output: Generated data is often in the form of structured formats such as tables, CSV files, or databases.
Consistency and Format: The data needs to follow specific schemas or formats, with defined types (e.g., integers, floats, strings) and constraints (e.g., ranges, unique values).
Statistical Properties: Generated data may need to adhere to certain statistical properties, distributions, or real-world correlations.
Synthetic Data: Often involves creating synthetic datasets that mimic real-world data while avoiding privacy issues or limitations of actual data availability.
Applications: Used in fields like data science, machine learning (for training and validation), software testing, simulations, and research.

Methods
Random Generation: Using algorithms to produce random values within specified constraints.
Simulation Models: Creating data based on simulations of real-world processes.
Data Augmentation: Expanding existing datasets by slightly modifying existing data points.
Algorithmic Synthesis: Using algorithms or models to create data that follows specific patterns or distributions.

GENERATING TEXT

Objective: The main goal of text generation is to produce coherent and contextually appropriate written content that resembles human language, whether for creative writing, conversational agents, automated reporting, or content creation.

Characteristics
Unstructured Output: Generated text is typically unstructured or semi-structured, presented in paragraphs, sentences, or dialogue.
Coherence and Relevance: Text must be coherent, contextually relevant, and grammatically correct, often adhering to the stylistic and contextual norms of human language.
Creativity and Expression: Text generation involves a degree of creativity, producing new and diverse expressions, ideas, and narratives.
Natural Language Processing (NLP): Involves understanding and generating human language using NLP techniques.
Applications: Used in chatbots, virtual assistants, automated content creation (news, blogs), creative writing, translations, summarization, and more.

Methods
Language Models: Using models like GPT (Generative Pre-trained Transformer) that are trained on large corpora of text data to generate human-like text.
Templates and Rules: Using predefined templates and rules for more structured text generation, such as automated emails or simple reports.
Machine Learning: Leveraging machine learning techniques to predict the next word or sentence based on the context provided.

DIFFERENCES (data vs text)

Structure/format: highly structured; unstructured
Objectives: create datasets for analysis, modeling, or testing; produce human-like, contextually appropriate content for communication and expression
Methods/tools: algorithms for random generation, simulations, or data synthesis; NLP models and language processing techniques
Uses: scientific, technical, and analytical fields; creative, communicative, and interactive applications
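
The contrast between the two kinds of output can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SCHEMA`, `generate_rows`, `TEMPLATE`, and `describe` are hypothetical names, the field names and value ranges are made up, and the text side uses the simple "templates and rules" method rather than a language model. The first half produces structured rows that obey types, ranges, and a uniqueness constraint; the second half turns each row into a sentence.

```python
import random

# Hypothetical schema: each field has a type plus constraints (range, uniqueness, options).
SCHEMA = {
    "id": {"type": "int", "min": 1000, "max": 9999, "unique": True},
    "age": {"type": "int", "min": 18, "max": 90},
    "score": {"type": "float", "min": 0.0, "max": 1.0},
    "city": {"type": "choice", "options": ["Seattle", "Lagos", "Osaka"]},
}


def generate_rows(schema: dict, count: int, seed: int) -> list:
    """Random generation within constraints: one dict per row, following the schema."""
    rng = random.Random(seed)
    # Track values already used for fields marked unique.
    used = {name: set() for name, spec in schema.items() if spec.get("unique")}
    rows = []
    for _ in range(count):
        row = {}
        for name, spec in schema.items():
            # Redraw until a unique field gets an unused value.
            # Assumes `count` is small relative to the range of any unique field.
            while True:
                if spec["type"] == "int":
                    value = rng.randint(spec["min"], spec["max"])
                elif spec["type"] == "float":
                    value = round(rng.uniform(spec["min"], spec["max"]), 3)
                else:
                    value = rng.choice(spec["options"])
                if name not in used or value not in used[name]:
                    break
            if name in used:
                used[name].add(value)
            row[name] = value
        rows.append(row)
    return rows


# Template-based text generation: structured values slotted into a fixed sentence.
TEMPLATE = "Participant {id} from {city} is {age} years old and scored {score}."


def describe(row: dict) -> str:
    """Render one structured row as a sentence using the template."""
    return TEMPLATE.format(**row)


rows = generate_rows(SCHEMA, count=3, seed=1)
for row in rows:
    print(row)
    print(describe(row))
```

The rows can be validated mechanically against the schema, while the quality of the sentences (coherence, style, relevance) has no equally simple check, which is one practical face of the structured versus unstructured distinction.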

wordplaydev / wordplay

Data synthesis #388
