Sequence features

One-hot encoding and charges of wild-type and mutated amino acids: We constructed two distinct vectors that represent the wild-type and mutated amino acids. Each amino acid was encoded as a binary vector of length 20, with a value of 1 at the position corresponding to that amino acid and 0s elsewhere (data not provided here). We constructed an additional vector that encodes the charges of the wild-type and mutated amino acids (data not provided here).
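The encoding described above can be illustrated with a minimal sketch. The alphabetical ordering of the 20 amino acids, the charge table (Asp/Glu as -1, Lys/Arg as +1, all others including His as 0) and the function names are assumptions made for illustration; the exact encoding used in this work is not specified here.

```python
# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed formal charges at physiological pH; His is treated as neutral here.
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def one_hot(aa):
    """Return a length-20 binary vector with a 1 at the index of `aa`."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(aa)] = 1
    return vec


def charge_vector(wt, mut):
    """Return the assumed charges of the wild-type and mutated residues."""
    return [CHARGE.get(wt, 0), CHARGE.get(mut, 0)]


# Example: a Lys-to-Glu variant.
wt_vec = one_hot("K")
mut_vec = one_hot("E")
charges = charge_vector("K", "E")
```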

Phosphomimetic or acetylation mimicking: A variant was considered phosphomimetic if the amino acid changed from Ser (S) or Thr (T) to Asp (D) or Glu (E), and acetylation mimicking if the amino acid changed from Lys (K) to Gln (Q) (data not provided here).
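These two rules translate directly into boolean checks on the wild-type and mutated residues. The sketch below uses one-letter codes; the function names are hypothetical.

```python
def is_phosphomimetic(wt, mut):
    """True if Ser/Thr is replaced by Asp/Glu."""
    return wt in {"S", "T"} and mut in {"D", "E"}


def is_acetylation_mimicking(wt, mut):
    """True if Lys is replaced by Gln."""
    return wt == "K" and mut == "Q"


# Example: T-to-E is phosphomimetic, K-to-Q is acetylation mimicking.
flags = (is_phosphomimetic("T", "E"), is_acetylation_mimicking("K", "Q"))
```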

ATP binding pocket: We counted the known ATP-binding sites at the alignment position equivalent to the variant. We obtained the list of known ATP-binding sites in human kinases from UniProt (version 2023_02).
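One way to read this count is as the number of kinases with an annotated ATP-binding site at the given alignment column. The sketch below follows that reading. The input structure (a mapping from kinase name to the set of alignment positions annotated as ATP binding) and the kinase names in the example are hypothetical, and the mapping from UniProt residue numbers to alignment positions is assumed to have been done beforehand.

```python
def count_atp_binding(aln_pos, atp_sites_by_kinase):
    """Count kinases with an ATP-binding annotation at alignment column `aln_pos`.

    atp_sites_by_kinase: dict mapping kinase name -> set of alignment positions
    (hypothetical structure, pre-mapped from sequence to alignment coordinates).
    """
    return sum(1 for sites in atp_sites_by_kinase.values() if aln_pos in sites)


# Toy example with made-up kinases and positions.
toy_sites = {
    "KINASE_A": {30, 52},
    "KINASE_B": {52},
    "KINASE_C": {75},
}
n_at_52 = count_atp_binding(52, toy_sites)
```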

Post-translational modification information: We incorporated known post-translational modification (PTM) information for the variant position and its adjacent positions (window size = 5) as a feature vector with a length equal to the number of possible PTM types (phosphorylation, acetylation, methylation, etc.). The presence of a specific PTM type was represented by 1 and its absence by 0. We repeated the procedure to incorporate known PTM information at the alignment position equivalent to the variant position and at its adjacent positions (window size = 5). In this case, each element in the vector encoded the number of kinases harbouring the corresponding PTM type at the given position in the alignment.
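A minimal sketch of both PTM feature sets follows. It assumes that a window size of 5 means the variant position plus two positions on either side, that one vector is produced per window position, and that only three PTM types exist; the input dictionaries and function names are hypothetical.

```python
# Illustrative subset of PTM types; the full list is not reproduced here.
PTM_TYPES = ["phosphorylation", "acetylation", "methylation"]


def window_positions(pos, window=5):
    """Positions covered by a window centred on `pos` (assumed: +/- window // 2)."""
    half = window // 2
    return range(pos - half, pos + half + 1)


def ptm_presence(pos, ptms_at, window=5):
    """Binary PTM vector for each sequence position in the window.

    ptms_at: dict mapping sequence position -> set of PTM types at that position.
    """
    return [
        [1 if ptm in ptms_at.get(p, set()) else 0 for ptm in PTM_TYPES]
        for p in window_positions(pos, window)
    ]


def ptm_counts(aln_pos, ptms_by_kinase, window=5):
    """Per-PTM-type kinase counts for each alignment position in the window.

    ptms_by_kinase: dict mapping kinase -> {alignment position -> set of PTM types}.
    """
    return [
        [
            sum(1 for sites in ptms_by_kinase.values() if ptm in sites.get(p, set()))
            for ptm in PTM_TYPES
        ]
        for p in window_positions(aln_pos, window)
    ]
```

The first function corresponds to the per-kinase presence/absence encoding, the second to the alignment-wide counts.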

Loss/gain of amino acids in known mutations: We also incorporated the number of times an amino acid was observed as the wild-type (loss) or mutated (gain) residue for each mutation type (i.e. activating, deactivating, and resistance) at the alignment position equivalent to the variant (and its adjacent positions; window size = 5). We initially set the count to zero for all amino acids at all alignment positions. For each loss of an amino acid at an alignment position in a given mutation type, we decreased the corresponding count by 1; for each gain, we increased it by 1.
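The counting scheme can be sketched as a signed tally keyed by mutation type, alignment position and amino acid. The tuple layout of the input and the function name are assumptions for illustration; the example mutations are made up.

```python
from collections import defaultdict


def build_loss_gain_counts(known_mutations):
    """Tally losses (-1) and gains (+1) per (mutation type, alignment position, amino acid).

    known_mutations: iterable of (aln_pos, wt, mut, mutation_type) tuples
    (hypothetical layout). Unseen keys default to 0, matching the zero initialisation.
    """
    counts = defaultdict(int)
    for aln_pos, wt, mut, mutation_type in known_mutations:
        counts[(mutation_type, aln_pos, wt)] -= 1   # wild-type residue is lost
        counts[(mutation_type, aln_pos, mut)] += 1  # mutated residue is gained
    return counts


# Toy example with made-up mutations at alignment position 10.
toy_mutations = [
    (10, "V", "F", "activating"),
    (10, "V", "L", "activating"),
    (10, "L", "V", "resistance"),
]
toy_counts = build_loss_gain_counts(toy_mutations)
```

Features for a variant would then be read from this tally at the equivalent alignment position and its neighbours within the window.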

Evolutionary features

We extracted log scores for each amino acid and position from the profile hidden Markov model of the alignment and used the wild-type and mutated scores as features. We did the same for three additional alignments determined after pan-proteome comparisons and ortholog/paralog determination (all homologs, best-per-species orthologs and exclusive paralogs, as used previously). We also subsetted the orthologs based on the phylogeny into eukaryotes, metazoa, vertebrates, and mammals and used conservation across each subset as features.

Structural features

We used AlphaFold2 structures for each kinase to determine the secondary structure, accessibility and backbone phi/psi angles using DSSP. We used IUPred to determine disorder scores. We scored intra-protein side-chain-to-side-chain contacts using Mechismo. For all values, we determined log-odds values for each amino acid in each environment and used these values and their mutant-wild-type differences as features (as described previously).

Additional files

The mapping of each kinase's wild-type (WT) sequence to its domain sequence is provided in the file below.