Predicting Protein Secondary Structures with CNN + BiLSTM
Introduction
Protein secondary structure prediction is a cornerstone of bioinformatics. Understanding how proteins fold and interact helps researchers unravel biological function, design drugs, and engineer novel proteins. Traditional experimental methods like X-ray crystallography and NMR spectroscopy are highly accurate but slow and expensive. Machine learning (ML) offers a faster, scalable alternative: predicting secondary structures directly from amino acid sequences.
In this project, we developed an ML model combining Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory (BiLSTM) networks to predict three classes of secondary structures:
- H – Helix
- E – Beta Sheet
- C – Coil
Using Kaggle’s GPU resources, the model achieved over 71% overall accuracy, demonstrating that deep learning can effectively capture patterns in protein sequences.
Dataset and Preprocessing
Dataset
The dataset was sourced from Kaggle and contains peptide sequences with corresponding secondary structures in Q3 format. Each entry includes:
- Amino acid sequence
- Secondary structure labels (H, E, C)
- Sequence metadata (length, non-standard amino acids, etc.)
Preprocessing
Encoding Amino Acid Sequences
- One-hot encoding – a fast, simple representation
- Pretrained embeddings – ProtBERT, TAPE, or ESM2 for richer features
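As a minimal sketch of the one-hot option, assuming the 20 standard amino acids (the helper names below are illustrative, not taken from the project code):
# One-hot encoding sketch (helper names are illustrative)
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    # Encode a peptide string as a (length, 20) matrix; non-standard residues stay all-zero
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        if aa in AA_INDEX:
            encoded[pos, AA_INDEX[aa]] = 1.0
    return encoded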
Label Encoding
Converted H, E, C into numerical labels: 0, 1, 2.
Train-Validation-Test Split
- 80% training
- 10% validation
- 10% test
# Example preprocessing snippet: 80/10/10 split
from sklearn.model_selection import train_test_split

# Hold out 20% of the data, then split that half-and-half into validation and test sets
X_train, X_temp, y_train, y_temp = train_test_split(sequences, labels, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
Model Architecture
The model combines CNN and BiLSTM layers to capture both local and long-range dependencies in protein sequences.
Structure
- CNN Layer – extracts local motifs from amino acid sequences
- BiLSTM Layer – models sequential dependencies across the sequence
- Fully Connected Layer – maps features to secondary structure classes
- Softmax Activation – outputs probability scores for each class (folded into the cross-entropy loss during training, so the model itself returns raw logits)
# Example model skeleton (PyTorch)
import torch.nn as nn

class ProteinSSPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv1d expects (batch, channels, length); 20 input channels for one-hot amino acids
        self.cnn = nn.Conv1d(in_channels=20, out_channels=64, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(input_size=64, hidden_size=128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, 3)  # 2 x 128 BiLSTM outputs -> 3 classes: H, E, C

    def forward(self, x):
        x = self.cnn(x)           # (batch, 64, length)
        x = x.transpose(1, 2)     # (batch, length, 64) for the batch_first LSTM
        x, _ = self.bilstm(x)     # (batch, length, 256)
        return self.fc(x)         # per-residue logits; CrossEntropyLoss applies softmax internally
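A quick shape check with a dummy batch (the batch and sequence sizes here are arbitrary):
# Sanity check with random input (sizes are illustrative)
import torch

model = ProteinSSPredictor()
dummy = torch.randn(8, 20, 100)  # (batch, one-hot channels, sequence length)
logits = model(dummy)
print(logits.shape)              # torch.Size([8, 100, 3]) -> per-residue class logits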
Training Strategy
- Loss Function: Categorical Cross-Entropy
- Optimizer: Adam
- Epochs: 30 with early stopping
- Hardware: Kaggle T4 GPU
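A minimal training-loop sketch consistent with these settings; train_loader, val_loader, and the patience value are assumptions, not project code:
# Training loop sketch (train_loader/val_loader and patience are assumptions)
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ProteinSSPredictor().to(device)
criterion = nn.CrossEntropyLoss()  # categorical cross-entropy over per-residue logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(30):
    model.train()
    for X, y in train_loader:  # X: (batch, 20, length), y: (batch, length) integer labels
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(X)  # (batch, length, 3)
        loss = criterion(logits.reshape(-1, 3), y.reshape(-1))
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(X.to(device)).reshape(-1, 3),
                                 y.to(device).reshape(-1)).item()
                       for X, y in val_loader) / len(val_loader)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping on stalled validation loss
            break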
Optimization Techniques
- Batch processing & caching – reduces training latency
- GPU-only training – minimizes CPU-GPU overhead
- Learning rate adjustments – ensure stable convergence
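As an illustration of the first and last points, batching and learning-rate scheduling might look like this (train_dataset and the scheduler settings are assumptions):
# Batching and LR scheduling sketch (train_dataset and settings are assumptions)
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import ReduceLROnPlateau

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True,
                          num_workers=2, pin_memory=True)  # pinned memory speeds host-to-GPU copies
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=2)
# After each epoch: scheduler.step(val_loss) halves the LR when validation loss stalls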
Results and Evaluation
The model was evaluated on unseen test data:
Performance Metrics
| Structure | Accuracy |
|---|---|
| H (Helix) | 76.21% |
| E (Beta Sheet) | 63.26% |
| C (Coil) | 70.92% |
| Overall | 71.01% |
Confidence Statistics
| Structure | Mean Confidence | Std Dev |
|---|---|---|
| H | 0.8013 | 0.1763 |
| E | 0.7272 | 0.1723 |
| C | 0.6969 | 0.1511 |
Classification Report
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| H | 76.37% | 76.21% | 76.29% |
| E | 66.37% | 63.26% | 64.78% |
| C | 68.79% | 70.92% | 69.84% |
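For reference, a report of this shape can be produced with scikit-learn, assuming flattened per-residue NumPy arrays y_true, y_pred, and softmax probabilities probs (all three names are assumptions):
# Evaluation sketch (y_true/y_pred/probs are assumed per-residue arrays)
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred, target_names=["H", "E", "C"], digits=4))

# Mean confidence per class: average of the winning softmax probability
for cls, name in enumerate(["H", "E", "C"]):
    conf = probs[y_pred == cls, cls]
    print(f"{name}: mean={conf.mean():.4f} std={conf.std():.4f}")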
Insights:
- Helix and coil structures were predicted with higher accuracy.
- Beta sheets (E) had lower performance, potentially due to class imbalance.
- Early stopping at epoch 27 prevented overfitting.
Future Improvements
Transformers for Protein Sequences
- Replace BiLSTM with ESM2 or ProtBERT embeddings for richer sequence representation.
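A hedged sketch of what that swap could look like with the Hugging Face transformers library (the small ESM2 checkpoint here is chosen purely for illustration):
# ESM2 per-residue embedding sketch (checkpoint choice is illustrative)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
esm = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

with torch.no_grad():
    inputs = tokenizer("MKTAYIAKQR", return_tensors="pt")
    features = esm(**inputs).last_hidden_state  # (1, length + 2 special tokens, 320)
# These per-residue features could replace the one-hot input to the CNN front end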
Scaling with Distributed ML
- Train on larger datasets using cloud-based frameworks.
Tertiary Structure Prediction
- Implement Variational Autoencoders (VAEs) or diffusion models for 3D structure modeling.
Generative AI for Protein Design
- Extend to synthetic protein generation for drug discovery.
Conclusion
This project demonstrates that ML can predict protein secondary structures effectively, achieving 71% overall accuracy with a CNN + BiLSTM architecture. Leveraging free GPU resources makes the approach accessible to researchers and students alike. Future work will focus on transformers, scaling, and generative applications in biotech.
All code, datasets, and results are open-source, and contributions from the bioinformatics and ML communities are welcome. Code and dataset: https://www.kaggle.com/datasets/allanwandia/secondary-protein-structure-prediction Full paper: https://github.com/DarkStarStrix/CSE-Repo-of-Advanced-Computation-ML-and-Systems-Engineering/blob/main/Papers/Computer_Science/Machine_Learning/Protein_Structure_Prediction.pdf