
kmeow

The problem

  • Working with an arbitrary set of tools for metagenome sequence analysis (or bioinformatics pipelines in general) is daunting for users who do not already know which tools a toolset offers, how those tools are meant to be combined, or which parameters are appropriate.

  • An example: I have sequencing results from an environmental isolate. How do I find out what it is (bacterium, archaeon, or eukaryote) and whether it’s an auxotroph for a given medium component?

  • Stretch goal: also try to determine whether any of the organism’s genes have been altered through synthetic engineering or are simply wild type.

The solution

  • Large language models such as GPT-3+ can translate natural language prompts into structured queries.

  • We may use GPT-3+ as part of a standardized approach for defining appropriate metagenome sequence processing pipelines.

  • LangChain can be used to supply additional context, which, combined with the user prompt, should yield a more informative result.

  • While fine-tuning can deliver improved results, it is out of scope for this work.
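
As a rough illustration of combining tool context with a user prompt, here is a minimal sketch in plain Python. It is not the project's code and it does not use LangChain's API. `ToolDescription`, `build_prompt`, `ask`, and the tool names are hypothetical, and the language model is represented by any callable that maps a prompt string to a response string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToolDescription:
    """Hypothetical record describing one tool in the toolset."""
    name: str
    purpose: str


def build_prompt(question: str, tools: List[ToolDescription]) -> str:
    """Prepend a description of the available tools to the user's question."""
    lines = [
        "You are helping a user plan a metagenome sequence analysis.",
        "Available tools:",
    ]
    for tool in tools:
        lines.append(f"- {tool.name}: {tool.purpose}")
    lines.append("")
    lines.append(f"User question: {question}")
    lines.append(
        "Answer with an ordered list of tool invocations and a short explanation."
    )
    return "\n".join(lines)


def ask(
    question: str,
    tools: List[ToolDescription],
    complete: Callable[[str], str],
) -> str:
    """Send the context-enriched prompt to a model.

    `complete` is an assumption: any function that takes a prompt and
    returns the model's text, for example a thin wrapper around an LLM client.
    """
    return complete(build_prompt(question, tools))
```

Keeping the model behind a plain callable means the prompt construction can be checked without network access, and the same function can later be wired to whichever client the project settles on.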

The desired outcome

  • A standard set of reusable code for interfacing with GPT-3+ and an API defining a collection of bioinformatics tools.

  • A natural-language explanation of how to do what the user wants to do.
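
One way an API defining a collection of tools might look is sketched below, assuming the model is asked to return its plan as a JSON list of steps. `ToolSpec`, `ToolRegistry`, `validate_plan`, and the plan layout (`"tool"` and `"parameters"` keys) are illustrative assumptions, not the repository's actual interface.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical description of one tool and its accepted parameters."""
    name: str
    description: str
    # Maps parameter name to a short human-readable explanation.
    parameters: Dict[str, str] = field(default_factory=dict)


class ToolRegistry:
    """Holds the known tools and checks model-proposed plans against them."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        if spec.name in self._tools:
            raise ValueError(f"tool already registered: {spec.name}")
        self._tools[spec.name] = spec

    def validate_plan(self, plan_json: str) -> List[dict]:
        """Parse a JSON plan and reject unknown tools or parameters."""
        steps = json.loads(plan_json)
        if not isinstance(steps, list):
            raise ValueError("plan must be a JSON list of steps")
        for step in steps:
            if not isinstance(step, dict):
                raise ValueError("each step must be a JSON object")
            spec = self._tools.get(step.get("tool"))
            if spec is None:
                raise ValueError(f"unknown tool: {step.get('tool')!r}")
            unknown = set(step.get("parameters", {})) - set(spec.parameters)
            if unknown:
                raise ValueError(
                    f"unknown parameters for {spec.name}: {sorted(unknown)}"
                )
        return steps
```

Validating the plan before anything is executed guards against a model naming tools or parameters that do not exist in the toolset.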

Languages

Python 100.0%

Created May 17, 2023
Updated September 2, 2023
caufieldjh/kmeow