enhance dataset metadata, preprocessing

noamoss commented 1 year ago

example output


 {
    "purpose": "This dataset contains information about conditions in species other than cattle, poultry, pigs, and sheep & goats, collected by the Food Standards Agency at approved meat establishments.",
    "description": "The dataset provides details on various species, inspection types, conditions, year and month, country, number of conditions, throughput, number of throughput plants, and percentage of throughput. The information is gathered from approved meat establishments by the Food Standards Agency, which is responsible for food safety and food hygiene across the UK.",
    "keywords": "Food Standards Agency; meat establishments; species conditions; inspection types; throughput",
    "data general": "This dataset contains species, inspection types, conditions in species, year and month, country, number of conditions, throughput, number of throughput plants, and percentage of throughput.",
    "data fields": [
        {
            "name": "Species",
            "type": "categorical",
            "range": "Solipeds, Small wild game in feather, Deer, Large Wild Game, Small wild game in fur, alpacas, Farmed deer, Ratites",
            "distribution": "Top values: Solipeds, Small wild game in feather, Deer, Large Wild Game, Small wild game in fur",
            "missing_values_ratio": 0.0
        },
        {
            "name": "InspectionType",
            "type": "categorical",
            "range": "Conditions, Offal, Carcase, AnteMortem, Total Rejection",
            "distribution": "Top values: Conditions, Offal, Carcase, AnteMortem, Total Rejection",
            "missing_values_ratio": 0.0
        },
        {
            "name": "Condition",
            "type": "categorical",
            "range": "Contamination, Trauma, Other, Abnormal Smell, Colour, Fascioliasis (fluke), Emaciation/Cachexia, Pneumonia Mycoplasma like, Lung Worm, Lung Abscesses, etc.",
            "distribution": "Top values: Contamination, Trauma, Other, Abnormal Smell, Colour",
            "missing_values_ratio": 0.0
        },
        {
            "name": "YearMonth",
            "type": "datetime",
            "range": "2017-10-01 to 2020-12-01",
            "distribution": null,
            "missing_values_ratio": 0.0
        },
        {
            "name": "Country",
            "type": "categorical",
            "range": "England, Wales",
            "distribution": "Top values: England, Wales",
            "missing_values_ratio": 0.0
        },
        {
            "name": "NumberOfConditions",
            "type": "integer",
            "range": "1 to 5954",
            "distribution": "Mean: 300.252, Median: 4.0",
            "missing_values_ratio": 0.0
        },
        {
            "name": "Throughput",
            "type": "integer",
            "range": "1 to 238697",
            "distribution": "Mean: 84927.563, Median: 118405.0",
            "missing_values_ratio": 0.0
        },
        {
            "name": "NumberOfThroughputPlants",
            "type": "integer",
            "range": "1 to 133",
            "distribution": "Mean: 29.905, Median: 5.0",
            "missing_values_ratio": 0.0
        },
        {
            "name": "PercentageOfThroughput",
            "type": "float",
            "range": "0.001 to 100.0",
            "distribution": "Mean: 5.377, Median: 1.301",
            "missing_values_ratio": 0.0
        }
    ],
    "example questions": [
        "What is the most common condition found in inspected species?",
        "How does the percentage of throughput vary across different species?",
        "Which species has the highest number of conditions?",
        "How does the number of conditions change over time?",
        "What are the most common inspection types for different species?"
    ]
}
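
For reference, a minimal sketch of how a per-field entry like the ones in "data fields" above could be computed. This is an illustration only, not the code that produced the output: `profile_field` is a hypothetical helper, a column is assumed to be a plain Python list, and `None` is assumed to mark a missing value.

```python
from collections import Counter
from statistics import mean, median


def profile_field(name, values, top_n=5):
    """Summarise one column (hypothetical helper; None = missing value)."""
    present = [v for v in values if v is not None]
    missing_ratio = round(1 - len(present) / len(values), 3) if values else 0.0

    # Treat the column as numeric only if every present value is an int or float.
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        all_int = all(isinstance(v, int) for v in present)
        return {
            "name": name,
            "type": "integer" if all_int else "float",
            "range": f"{min(present)} to {max(present)}",
            "distribution": f"Mean: {round(mean(present), 3)}, Median: {median(present)}",
            "missing_values_ratio": missing_ratio,
        }

    # Everything else is reported as categorical, most frequent values first.
    counts = Counter(present)
    top = [value for value, _ in counts.most_common(top_n)]
    return {
        "name": name,
        "type": "categorical",
        "range": ", ".join(str(v) for v in counts),
        "distribution": "Top values: " + ", ".join(str(v) for v in top),
        "missing_values_ratio": missing_ratio,
    }
```

Datetime columns such as YearMonth are not handled here; they would need their own branch.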

noamoss commented 1 year ago

@pwalsh @OriHoch (not sure why, but I can't mention Adam here...) any comments or thoughts regarding the example above?

OriHoch commented 1 year ago

@akariv

OriHoch commented 1 year ago

looks good, very helpful

maybe this is the next step, but I would suggest splitting this data into chunks to vectorize for the semantic DB, and also trying to eliminate irrelevant details and thinking about what would benefit a semantic search. The basic data we input to the vector DB is short metadata and embedding text.

The following is a possible example, but this is something we will need to test with the vector DB and frontend to see which data is useful and how the frontend can query it

"purpose": "This dataset contains information about conditions in species other than cattle, poultry, pigs, and sheep & goats, collected by the Food Standards Agency at approved meat establishments."
- metadata: type=purpose
- embedding: Food Safety, Quality Control, Regulatory Compliance, Scientific Research, Policy Development, Consumer Information
"description": "The dataset provides details on various species, inspection types, conditions, year and month, country, number of conditions, throughput, number of throughput plants, and percentage of throughput. The information is gathered from approved meat establishments by the Food Standards Agency, which is responsible for food safety and food hygiene across the UK."
- metadata: type=description
- embedding: inspection types, conditions, year and month, country, number of conditions, throughput, number of throughput plants, and percentage of throughput. Information gathered from approved meat establishments by the Food Standards Agency, which is responsible for food safety and food hygiene across the UK. Included species: fish, game, exotic meats, equine, birds.
"keywords": "Food Standards Agency; meat establishments; species conditions; inspection types; throughput"
- I don't see the added value for this, as it's included already in the description
"data general": "This dataset contains species, inspection types, conditions in species, year and month, country, number of conditions, throughput, number of throughput plants, and percentage of throughput."
- I don't see the added value for this, as it's included already in the description
data fields - might be useful to split it into several embeddings for different types of fields, I think that numerical fields are not useful
- metadata: type=data_field, embedding: Species: Solipeds, Small wild game in feather, Deer, Large Wild Game, Small wild game in fur, alpacas, Farmed deer, Ratites
- metadata: type=data_field, embedding: InspectionType: Conditions, Offal, Carcase, AnteMortem, Total Rejection
- metadata: type=data_field, embedding: Condition: Contamination, Trauma, Other, Abnormal Smell, Colour, Fascioliasis (fluke), Emaciation/Cachexia, Pneumonia Mycoplasma like, Lung Worm, Lung Abscesses
example questions -
- metadata: type=question, embedding: "What is the most common condition found in inspected species?"
- metadata: type=question, embedding: "How does the percentage of throughput vary across different species?"
- metadata: type=question, embedding: "Which species has the highest number of conditions?"
- metadata: type=question, embedding: "How does the number of conditions change over time?"
- metadata: type=question, embedding: "What are the most common inspection types for different species?"
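
The split above could be sketched roughly as follows. This is only an illustration under assumptions: `build_chunks` is a hypothetical name, the input is assumed to be one description dict shaped like the example output, and the actual vector DB schema still needs to be tested as noted above.

```python
def build_chunks(description):
    """Turn one dataset description into (metadata, embedding text) chunks."""
    chunks = []

    # "keywords" and "data general" are skipped: they repeat the description.
    for key in ("purpose", "description"):
        text = description.get(key)
        if text:
            chunks.append({"metadata": {"type": key}, "embedding": text})

    fields = description.get("data fields", [])
    if not isinstance(fields, list):
        fields = []  # assumption: ignore outputs where fields are not a list

    for field in fields:
        # Keep only categorical fields; numerical ones are left out.
        if str(field.get("type", "")).lower() != "categorical":
            continue
        name = field.get("name") or field.get("field")
        chunks.append({
            "metadata": {"type": "data_field"},
            "embedding": f"{name}: {field.get('range', '')}",
        })

    for question in description.get("example questions", []):
        chunks.append({"metadata": {"type": "question"}, "embedding": question})

    return chunks
```

Each chunk would then be embedded separately, with `metadata` stored next to the vector so the frontend can filter by type.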

noamoss commented 1 year ago

@OriHoch,

thanks; before applying these suggestions, I will provide 2-3 more examples here to confirm that you want to drop the "keywords" and "data general" fields; if so, I will omit them from my code.

> I think that numerical fields are not useful

The value of numerical values, I assume, provides more context to the area (for example, measurement units) or hints for the availability of the relevant data (based on the "missing data ratio"). I might be wrong - but would it not be worth a check?

do you want me to transfer the output to the format you showed in your comment or leave it as is for now?

@akariv (weird, mention still does not work for me here :) ) + @pwalsh Before I do that, please confirm that you agree with Ori's direction, so I won't need to revert later.

OriHoch commented 1 year ago

> The value of numerical values, I assume, provides more context to the area (for example, measurement units) or hints for the availability of the relevant data (based on the "missing data ratio"). I might be wrong - but would it not be worth a check?

I just don't see how this numerical data will be beneficial to end-user queries. Maybe you could give some example user queries that you think would be able to use this data.

> do you want me to transfer the output to the format you showed in your comment or leave it as is for now?

I think you can leave it as is for now

noamoss commented 1 year ago

I simplified the prompt a bit:

- no detailed numeric values
- I left the purpose and the data general, as sometimes we get some extra relevant information - we can always remove one of them if we see it is redundant.

Here are some updated examples:

Description:
 {
    "title": "Congestion on Locally Managed A Roads",
    "purpose": "To provide official statistics on congestion on locally managed A roads in England.",
    "description": "This dataset contains the latest provisional official statistics on congestion on locally managed A roads in England. The data is collected and maintained by the Department for Transport, a ministerial department supported by 22 agencies and public bodies. The dataset aims to support the transport network and infrastructure planning in the UK.",
    "keywords": "congestion; locally managed A roads; England; Department for Transport; traffic; infrastructure planning",
    "data general": "The dataset contains information on congestion levels, traffic volume, and road management on locally managed A roads in England.",
    "data fields": {
        "road_type": "categorical (A roads); no missing values",
        "congestion_level": "continuous; range of data points values: low to high; distribution: varies by location and time; ratio of missing values: low",
        "traffic_volume": "continuous; range of data points values: low to high; distribution: varies by location and time; ratio of missing values: low",
        "road_management": "categorical (locally managed); no missing values"
    },
    "example questions": [
        "What is the relationship between congestion levels and traffic volume on locally managed A roads in England?",
        "How do congestion levels on locally managed A roads in England vary by location and time?",
        "What are the trends in traffic volume on locally managed A roads in England?",
        "Are certain regions in England experiencing higher congestion levels on their locally managed A roads compared to others?"
    ]
}

Description:
 {
    "title": "Workforce Management Information for Export Credits Guarantee Department",
    "purpose": "To provide monthly management information on staff numbers and paybill costs in UK Civil Service departments, their agencies, and their executive NDPBs.",
    "description": "This dataset contains public data on staff numbers and paybill costs for the UK Export Finance's Export Credits Guarantee Department. The dataset includes monthly information on both payroll and non-payroll staff, as well as details on workforce costs, broken down by department, agency, and executive NDPB. The data is useful for understanding workforce trends and managing human resources within the UK Civil Service.",
    "keywords": "workforce management; staff numbers; paybill costs; civil service; UK Export Finance; Export Credits Guarantee Department; payroll; non-payroll; agencies; executive NDPBs",
    "data general": "The dataset contains monthly management information on staff numbers, paybill costs, payroll and non-payroll staff, and breakdowns by department, agency, and executive NDPB.",
    "data fields": [
        {
            "field": "Department",
            "type": "categorical",
            "range": "Various UK Civil Service departments",
            "distribution": "Varies by department",
            "missing_values_ratio": "Low"
        },
        {
            "field": "Agency",
            "type": "categorical",
            "range": "Various UK Civil Service agencies",
            "distribution": "Varies by agency",
            "missing_values_ratio": "Low"
        },
        {
            "field": "Executive NDPB",
            "type": "categorical",
            "range": "Various UK Civil Service executive NDPBs",
            "distribution": "Varies by executive NDPB",
            "missing_values_ratio": "Low"
        },
        {
            "field": "Month",
            "type": "date",
            "range": "Monthly data points",
            "distribution": "Uniform across months",
            "missing_values_ratio": "Low"
        },
        {
            "field": "Staff Numbers",
            "type": "integer",
            "range": "Various staff numbers",
            "distribution": "Varies by department, agency, and executive NDPB",
            "missing_values_ratio": "Low"
        },
        {
            "field": "Paybill Costs",
            "type": "integer or float",
            "range": "Various paybill costs",
            "distribution": "Varies by department, agency, and executive NDPB",
            "missing_values_ratio": "Low"
        }
    ],
    "example questions": [
        "What is the trend in staff numbers for a specific department over time?",
        "How do paybill costs compare between different agencies?",
        "Which executive NDPBs have the highest payroll staff numbers?",
        "What is the proportion of non-payroll staff in a given department or agency?"
    ]
}

Description:
 {
    "title": "Quarterly Minor Planning Decisions",
    "purpose": "This dataset provides information about planning decisions made by district planning authorities, categorized by development type, speed of decision, and authority.",
    "description": "The dataset contains public data on quarterly minor planning decisions made by district planning authorities in the United Kingdom. It is provided by the Ministry of Housing, Communities and Local Government, whose mission is to create great places to live and work, and to empower local people to shape their area. The dataset includes various development types, the speed of decision-making, and the authority involved in the decision.",
    "keywords": "planning decisions; district planning authorities; development type; speed of decision; authority; housing; communities; local government",
    "data general": "In this dataset, users can expect to find information about planning decisions made by district planning authorities, such as the development type, the speed of the decision-making process, and the authority responsible for the decision.",
    "data fields": [
        {
            "field": "Development Type",
            "type": "Categorical",
            "range": "Various development types",
            "distribution": "Varies depending on the development type",
            "missing_values_ratio": "Minimal"
        },
        {
            "field": "Speed of Decision",
            "type": "Continuous",
            "range": "Varies from fast to slow",
            "distribution": "Varies depending on the speed of the decision-making process",
            "missing_values_ratio": "Minimal"
        },
        {
            "field": "Authority",
            "type": "Categorical",
            "range": "Various district planning authorities",
            "distribution": "Varies depending on the authority",
            "missing_values_ratio": "Minimal"
        }
    ],
    "example questions": [
        "What is the average speed of decision-making for different development types?",
        "Which district planning authorities have the fastest decision-making process?",
        "How does the speed of decision-making vary across different authorities and development types?",
        "Are certain development types associated with faster or slower decision-making processes?"
    ]
}

whiletrue-industries / odds

enhance dataset metadata, preprocessing #23