[Solved] How to represent DNA sequences for neural networks?


Why not learn the numerical representations for each base?

This is a common problem in Neural Machine Translation, where we seek to encode words, which carry meaning, as numbers (naively, as simple integers). The core idea is that a word should not be represented by a single arbitrary number, but by a learned dense vector. The process of finding this vector representation is called embedding.

In this scheme, bases that are more closely related to one another may end up with vector representations closer together in n-dimensional space (where n is the size of the vector). This is a simple concept that can be difficult to visualize the first time. Your choice of embedding size (a hyperparameter) should likely be small, since your vocabulary contains only four symbols (try a size of 2-5).

As an example, some embedding mappings of size 4 (the numerical values are illustrative only):

G -> [1.0, 0.2, 0.1, 0.2]
A -> [0.2, 0.5, 0.7, 0.1]
T -> [0.1, 0.2, 1.0, 0.5]
C -> [0.4, 0.4, 0.5, 0.8]

The exact technique of generating and optimizing embeddings is a topic in itself; hopefully the concept is useful to you.
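
As a minimal sketch of how this might look in practice, here is an example using PyTorch's nn.Embedding (the base-to-index mapping, the embedding size, and the example sequence are arbitrary choices for illustration); the vectors start out random and are optimized jointly with the rest of the network during training:

import torch
import torch.nn as nn

# Arbitrary assignment of each base to an integer index.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq):
    """Map a DNA string to a tensor of integer indices."""
    return torch.tensor([BASE_TO_INDEX[b] for b in seq], dtype=torch.long)

# 4 possible bases, each embedded as a dense vector of size 4.
embedding = nn.Embedding(num_embeddings=4, embedding_dim=4)

indices = encode("GATTACA")   # shape: (7,)
vectors = embedding(indices)  # shape: (7, 4), learned during training
print(vectors)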

Alternative

If you want to avoid embeddings (since your “vocabulary” is limited to 4 symbols), you can assign a scalar value to each base. If you do this, you should normalize your mappings to the range [-1, 1], as in the sketch below.
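
A minimal sketch of this scalar mapping (the particular values assigned to each base are an arbitrary, evenly spaced choice within [-1, 1]):

# Arbitrary evenly spaced scalar values in [-1, 1], one per base.
BASE_TO_SCALAR = {"A": -1.0, "C": -1/3, "G": 1/3, "T": 1.0}

def encode_scalar(seq):
    """Map a DNA string to a list of floats in [-1, 1]."""
    return [BASE_TO_SCALAR[b] for b in seq]

print(encode_scalar("GATTACA"))
# [0.333..., -1.0, 1.0, 1.0, -1.0, -0.333..., -1.0]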
