

K-mer Neural Embedding Performance Analysis Using Amino Acid Codons

Muhammad Nabeel Asim; Muhammad Imran Malik; Andreas Dengel; Sheraz Ahmed
In: 2020 International Joint Conference on Neural Networks (IJCNN). International Joint Conference on Neural Networks (IJCNN-2020), July 19-24, Glasgow, United Kingdom, Pages 1-8, ISBN 978-1-7281-6927-9, IEEE, 2020.


The exponential growth of genome-wide gene expression assays, together with public access to them, opens new horizons for machine learning methods in genetic analysis. In this work, domain-specific pre-trained k-mer embeddings of DNA sequences are generated using the FastText approach. Sequence co-expression pattern information is embedded into 200-dimensional vectors by training a FastText model on 317,151 DNA sequence samples (in k-mer representation). We propose a novel idea: using the information of the various codons that encode each amino acid to evaluate the learned sequence vectors. We employ two different techniques to compare the generated task-specific k-mer embeddings with state-of-the-art, publicly available generic k-mer embeddings of the genome. First, we apply a dimensionality reduction approach, PCA, to reduce the DNA sequence representations to 50 features while preserving almost 85% of the sequence feature information. The t-SNE algorithm is then used to visualize the k-mer embeddings and to check whether different codons representing the same amino acid lie closer to each other than codons representing different amino acids. Second, to assess the analogy of k-mer embeddings, the generated domain-specific k-mer embeddings are compared with the state-of-the-art k-mer embeddings by estimating the cosine similarity among codon vectors that represent the same amino acid. Overall, we believe that task-specific distributed representations of k-mers would be useful for DNA methylation and histone occupancy prediction tasks.
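
The codon-based evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of k = 3 with stride 1, the small codon table (a subset of the standard genetic code), and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the vectors would be the 200-dimensional embeddings learned by the trained model.

```python
import math
from itertools import combinations


def kmers(sequence, k=3, stride=1):
    """Split a DNA sequence into k-mers (overlapping when stride < k)."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Subset of the standard genetic code: codons grouped by the amino acid they encode.
SYNONYMOUS_CODONS = {
    "Phe": ["TTT", "TTC"],
    "Lys": ["AAA", "AAG"],
    "Gly": ["GGT", "GGC", "GGA", "GGG"],
}


def mean_synonymous_similarity(embeddings, codon_groups):
    """Average cosine similarity over all pairs of codons encoding the same amino acid.

    Codons missing from `embeddings` are skipped; returns NaN if no pair is available.
    """
    scores = []
    for codons in codon_groups.values():
        present = [c for c in codons if c in embeddings]
        for a, b in combinations(present, 2):
            scores.append(cosine_similarity(embeddings[a], embeddings[b]))
    return sum(scores) / len(scores) if scores else float("nan")


# Hypothetical toy vectors, invented for this example only.
toy_embeddings = {
    "TTT": [1.0, 0.0, 0.0],
    "TTC": [0.9, 0.1, 0.0],
    "AAA": [0.0, 1.0, 0.0],
    "AAG": [0.1, 0.9, 0.0],
}

if __name__ == "__main__":
    print(kmers("ATGCGT"))
    print(mean_synonymous_similarity(toy_embeddings, SYNONYMOUS_CODONS))
```

Under this scheme, a higher mean similarity among synonymous codons would suggest that an embedding places codons for the same amino acid closer together, which is the property the two sets of embeddings are compared on.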
