Predicting Protein Secondary Structures with CNN + BiLSTM

Introduction

Protein secondary structure prediction is a cornerstone of bioinformatics. Understanding how proteins fold and interact helps researchers unravel biological function, design drugs, and engineer novel proteins. Traditional experimental methods like X-ray crystallography and NMR spectroscopy are highly accurate but slow and expensive. Machine learning (ML) offers a faster, scalable alternative: predicting secondary structures directly from amino acid sequences.

In this project, we developed an ML model combining Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks to predict three classes of secondary structures:

  • H – Helix
  • E – Beta Sheet
  • C – Coil

Using Kaggle’s GPU resources, the model achieved over 71% overall accuracy, demonstrating that deep learning can effectively capture patterns in protein sequences.


Dataset and Preprocessing

Dataset

The dataset was sourced from Kaggle and contains peptide sequences with corresponding secondary structures in Q3 format. Each entry includes:

  • A peptide (amino acid) sequence
  • Its per-residue secondary structure string in Q3 format (H, E, C)

Preprocessing

  1. Encoding Amino Acid Sequences

    • One-hot encoding—fast, simple representation
    • Pretrained embeddings—ProtBERT, TAPE, or ESM2 for richer features
  2. Label Encoding
    Converted H, E, C into numerical labels: 0, 1, 2.

  3. Train-Validation-Test Split

    • 80% training
    • 10% validation
    • 10% test
# Example preprocessing snippet: 80/10/10 split
from sklearn.model_selection import train_test_split

# Hold out 20% of the data, then split that pool evenly into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(sequences, labels, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
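The two encoding steps described above can be sketched together. The alphabet ordering and the helper names (`AMINO_ACIDS`, `one_hot`, `LABEL_MAP`) are illustrative assumptions, not taken from the project code:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed ordering)
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
LABEL_MAP = {"H": 0, "E": 1, "C": 2}   # Q3 labels -> numerical classes

def one_hot(seq):
    """Encode a peptide sequence as a (len, 20) one-hot matrix."""
    mat = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for i, aa in enumerate(seq):
        mat[i, AA_INDEX[aa]] = 1.0
    return mat

x = one_hot("MKV")                 # shape (3, 20), one row per residue
y = [LABEL_MAP[s] for s in "HEC"]  # [0, 1, 2]
```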

Model Architecture

The model combines CNN and BiLSTM layers to capture both local and long-range dependencies in protein sequences.

Structure

  1. CNN Layer – extracts local motifs from amino acid sequences
  2. BiLSTM Layer – models sequential dependencies across the sequence
  3. Fully Connected Layer – maps features to secondary structure classes
  4. Softmax Activation – outputs probability scores for each class
# Example model skeleton (PyTorch)
import torch.nn as nn

class ProteinSSPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Conv1d(in_channels=20, out_channels=64, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(input_size=64, hidden_size=128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, 3)  # 2 x 128 BiLSTM features -> 3 classes: H, E, C
        self.softmax = nn.Softmax(dim=-1)  # drop this and train on raw logits with nn.CrossEntropyLoss

    def forward(self, x):
        # x: (batch, 20, seq_len) -- one-hot channels first, as Conv1d expects
        x = self.cnn(x)            # (batch, 64, seq_len)
        x = x.permute(0, 2, 1)     # (batch, seq_len, 64) for the batch_first LSTM
        x, _ = self.bilstm(x)      # (batch, seq_len, 256)
        x = self.fc(x)             # per-residue class scores: (batch, seq_len, 3)
        return self.softmax(x)

Training Strategy

Optimization Techniques
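A minimal sketch of a typical optimization setup for this task, assuming the Adam optimizer and cross-entropy loss over per-residue logits (the actual hyperparameters are not stated here). A single Conv1d stands in for the full CNN + BiLSTM model so the snippet runs on its own:

```python
import torch
import torch.nn as nn

# Stand-in per-residue classifier so the snippet is self-contained;
# in practice this would be the ProteinSSPredictor defined earlier.
model = nn.Conv1d(20, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # expects raw logits; applies log-softmax internally

x = torch.randn(8, 20, 50)         # batch of 8 encoded sequences, length 50
y = torch.randint(0, 3, (8, 50))   # per-residue labels: 0=H, 1=E, 2=C

for epoch in range(3):
    optimizer.zero_grad()
    logits = model(x)              # (8, 3, 50): class dim second, as CrossEntropyLoss expects
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
```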


Results and Evaluation

The model was evaluated on unseen test data:

Performance Metrics

Structure        Accuracy
H (Helix)        76.21%
E (Beta Sheet)   63.26%
C (Coil)         70.92%
Overall          71.01%

Confidence Statistics

Structure   Mean Confidence   Std Dev
H           0.8013            0.1763
E           0.7272            0.1723
C           0.6969            0.1511
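Per-class confidence statistics like these can be derived from the model's softmax outputs. A hedged sketch with synthetic probabilities, since the actual evaluation pipeline is not shown here:

```python
import numpy as np

# Synthetic softmax outputs, one row of 3 class probabilities per residue
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=1000)

preds = probs.argmax(axis=1)   # predicted class per residue
conf = probs.max(axis=1)       # confidence = probability of the predicted class

for cls, name in enumerate("HEC"):
    mask = preds == cls
    print(name, round(conf[mask].mean(), 4), round(conf[mask].std(), 4))
```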

Classification Report

Class   Precision   Recall    F1-Score
H       76.37%      76.21%    76.29%
E       66.37%      63.26%    64.78%
C       68.79%      70.92%    69.84%
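As a sanity check, the F1 column above is the harmonic mean of the precision and recall columns:

```python
# F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R)
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

report = {"H": (0.7637, 0.7621), "E": (0.6637, 0.6326), "C": (0.6879, 0.7092)}
for cls, (p, r) in report.items():
    print(cls, round(f1(p, r) * 100, 2))
# H 76.29 / E 64.78 / C 69.84 -- matching the table
```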

Insights:

  • Helices (H) are predicted best, with the highest per-class accuracy (76.21%) and the highest mean confidence (0.8013).
  • Beta sheets (E) are the hardest class, trailing in both accuracy (63.26%) and F1-score (64.78%).
  • Confidence is highest for H and lowest for C, while accuracy is lowest for E, suggesting the model is somewhat over-confident on beta sheets.


Future Improvements

  1. Transformers for Protein Sequences

    • Use ESM2 or ProtBERT embeddings in place of one-hot inputs, or replace the BiLSTM with a transformer encoder, for richer sequence representations.
  2. Scaling with Distributed ML

    • Train on larger datasets using cloud-based frameworks.
  3. Tertiary Structure Prediction

    • Implement Variational Autoencoders (VAEs) or diffusion models for 3D structure modeling.
  4. Generative AI for Protein Design

    • Extend to synthetic protein generation for drug discovery.

Conclusion

This project demonstrates that ML can predict protein secondary structures effectively, achieving roughly 71% accuracy with a CNN + BiLSTM architecture. Leveraging free GPU resources makes the approach accessible to researchers and students alike. Future work will focus on transformers, scaling, and generative applications in biotech.

All code, datasets, and results are open-source, welcoming contributions and collaboration from the bioinformatics and ML community.

Code and dataset: https://www.kaggle.com/datasets/allanwandia/secondary-protein-structure-prediction
Full paper: https://github.com/DarkStarStrix/CSE-Repo-of-Advanced-Computation-ML-and-Systems-Engineering/blob/main/Papers/Computer_Science/Machine_Learning/Protein_Structure_Prediction.pdf