Predicting Protein Secondary Structures with CNN + BiLSTM
Introduction
Protein secondary structure prediction is a cornerstone of bioinformatics. Understanding how proteins fold and interact helps researchers unravel biological function, design drugs, and engineer novel proteins. Traditional experimental methods like X-ray crystallography and NMR spectroscopy are highly accurate but slow and expensive. Machine learning (ML) offers a faster, scalable alternative: predicting secondary structures directly from amino acid sequences.
In this project, we developed an ML model combining Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks to predict three classes of secondary structures:
- H – Helix
- E – Beta Sheet
- C – Coil
Using Kaggle’s GPU resources, the model achieved over 71% overall accuracy, demonstrating that deep learning can effectively capture patterns in protein sequences.
Dataset and Preprocessing
Dataset
The dataset was sourced from Kaggle and contains peptide sequences with corresponding secondary structures in Q3 format. Each entry includes:
- Amino acid sequence
- Secondary structure labels (H, E, C)
- Sequence metadata (length, non-standard amino acids, etc.)
Preprocessing
Encoding Amino Acid Sequences
- One-hot encoding – a fast, simple representation
- Pretrained embeddings – ProtBERT, TAPE, or ESM2 for richer features
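As a minimal sketch of the one-hot option, assuming the 20 standard amino acids (the helper names below are illustrative, not taken from the project code):
# One-hot encoding sketch (helper names are illustrative)
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    # Encode a peptide string as a (length, 20) matrix; non-standard residues stay all-zero
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        if aa in AA_INDEX:
            encoded[pos, AA_INDEX[aa]] = 1.0
    return encoded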
Label Encoding
Converted H, E, C into numerical labels: 0, 1, 2.
Train-Validation-Test Split
- 80% training
- 10% validation
- 10% test
# Example preprocessing snippet: 80/10/10 split
from sklearn.model_selection import train_test_split

# Hold out 20% of the data, then split that half-and-half into validation and test sets
X_train, X_temp, y_train, y_temp = train_test_split(sequences, labels, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
Model Architecture
The model combines CNN and BiLSTM layers to capture both local and long-range dependencies in protein sequences.
Structure
- CNN Layer – extracts local motifs from amino acid sequences
- BiLSTM Layer – models sequential dependencies across the sequence
- Fully Connected Layer – maps features to secondary structure classes
- Softmax Activation – outputs probability scores for each class (folded into the cross-entropy loss during training, so the model itself returns raw logits)
# Example model skeleton (PyTorch)
import torch.nn as nn

class ProteinSSPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv1d expects (batch, channels, length); 20 input channels for one-hot amino acids
        self.cnn = nn.Conv1d(in_channels=20, out_channels=64, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(input_size=64, hidden_size=128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, 3)  # 2 x 128 BiLSTM outputs -> 3 classes: H, E, C

    def forward(self, x):
        x = self.cnn(x)           # (batch, 64, length)
        x = x.transpose(1, 2)     # (batch, length, 64) for the batch_first LSTM
        x, _ = self.bilstm(x)     # (batch, length, 256)
        return self.fc(x)         # per-residue logits; CrossEntropyLoss applies softmax internally
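A quick shape check with a dummy batch (the batch and sequence sizes here are arbitrary):
# Sanity check with random input (sizes are illustrative)
import torch

model = ProteinSSPredictor()
dummy = torch.randn(8, 20, 100)  # (batch, one-hot channels, sequence length)
logits = model(dummy)
print(logits.shape)              # torch.Size([8, 100, 3]) -> per-residue class logits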
Training Strategy
- Loss Function: Categorical Cross-Entropy
- Optimizer: Adam
- Epochs: 30 with early stopping
- Hardware: Kaggle T4 GPU
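A minimal training-loop sketch consistent with these settings; train_loader, val_loader, and the patience value are assumptions, not project code:
# Training loop sketch (train_loader/val_loader and patience are assumptions)
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ProteinSSPredictor().to(device)
criterion = nn.CrossEntropyLoss()  # categorical cross-entropy over per-residue logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(30):
    model.train()
    for X, y in train_loader:  # X: (batch, 20, length), y: (batch, length) integer labels
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(X)  # (batch, length, 3)
        loss = criterion(logits.reshape(-1, 3), y.reshape(-1))
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(X.to(device)).reshape(-1, 3),
                                 y.to(device).reshape(-1)).item()
                       for X, y in val_loader) / len(val_loader)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping on stalled validation loss
            break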
Optimization Techniques
- Batch processing & caching – reduces training latency
- GPU-only training – minimizes CPU-GPU overhead
- Learning rate adjustments – ensure stable convergence
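As an illustration of the first and last points, batching and learning-rate scheduling might look like this (train_dataset and the scheduler settings are assumptions):
# Batching and LR scheduling sketch (train_dataset and settings are assumptions)
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import ReduceLROnPlateau

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True,
                          num_workers=2, pin_memory=True)  # pinned memory speeds host-to-GPU copies
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=2)
# After each epoch: scheduler.step(val_loss) halves the LR when validation loss stalls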
Results and Evaluation
The model was evaluated on unseen test data:
Performance Metrics
| Structure | Accuracy |
|---|---|
| H (Helix) | 76.21% |
| E (Beta Sheet) | 63.26% |
| C (Coil) | 70.92% |
| Overall | 71.01% |
Confidence Statistics
| Structure | Mean Confidence | Std Dev |
|---|---|---|
| H | 0.8013 | 0.1763 |
| E | 0.7272 | 0.1723 |
| C | 0.6969 | 0.1511 |
Classification Report
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| H | 76.37% | 76.21% | 76.29% |
| E | 66.37% | 63.26% | 64.78% |
| C | 68.79% | 70.92% | 69.84% |
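For reference, a report of this shape can be produced with scikit-learn, assuming flattened per-residue NumPy arrays y_true, y_pred, and softmax probabilities probs (all three names are assumptions):
# Evaluation sketch (y_true/y_pred/probs are assumed per-residue arrays)
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=["H", "E", "C"], digits=4))

# Mean confidence per class: average of the winning softmax probability
for cls, name in enumerate(["H", "E", "C"]):
    conf = probs[y_pred == cls, cls]
    print(f"{name}: mean={conf.mean():.4f} std={conf.std():.4f}")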
Insights:
- Helix and coil structures were predicted with higher accuracy.
- Beta sheets (E) had lower performance, potentially due to class imbalance.
- Early stopping at epoch 27 prevented overfitting.
Future Improvements
Transformers for Protein Sequences
- Replace BiLSTM with ESM2 or ProtBERT embeddings for richer sequence representation.
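A hedged sketch of what that swap could look like with the Hugging Face transformers library (the small ESM2 checkpoint here is chosen purely for illustration):
# ESM2 per-residue embedding sketch (checkpoint choice is illustrative)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
esm = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

with torch.no_grad():
    inputs = tokenizer("MKTAYIAKQR", return_tensors="pt")
    features = esm(**inputs).last_hidden_state  # (1, length + 2 special tokens, 320)
# These per-residue features could replace the one-hot input to the CNN front end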
Scaling with Distributed ML
- Train on larger datasets using cloud-based frameworks.
Tertiary Structure Prediction
- Implement Variational Autoencoders (VAEs) or diffusion models for 3D structure modeling.
Generative AI for Protein Design
- Extend to synthetic protein generation for drug discovery.
Conclusion
This project demonstrates that ML can predict protein secondary structures effectively, achieving 71% overall accuracy with a CNN + BiLSTM architecture. Leveraging free GPU resources makes the approach accessible to researchers and students alike. Future work will focus on transformers, scaling, and generative applications in biotech.
All code, datasets, and results are open-source, and contributions from the bioinformatics and ML communities are welcome. Code and dataset: https://www.kaggle.com/datasets/allanwandia/secondary-protein-structure-prediction Full paper: https://github.com/DarkStarStrix/CSE-Repo-of-Advanced-Computation-ML-and-Systems-Engineering/blob/main/Papers/Computer_Science/Machine_Learning/Protein_Structure_Prediction.pdf