PubChem, InChI, SMILES, and uniqueness


Solution 1

Unfortunately, PubChem is right: the two structures have the same InChI string and key, since the zwitterion and the neutral form contain the same total number of protons. So the discrepancy is by design.
I had also always thought that InChI was designed to distinguish between such forms, but this turns out to be one of the limitations of the system. The issue is addressed in section 13.2 of the Technical FAQ of the InChI Trust:

The different protonation states of the same compound will have InChIKeys differing only by the protonation indicator (unless both states have a number of inserted/removed protons greater than 12; in this case the protonation flag will also be the same, ‘A’).
This is exemplified below by standard InChIKeys as well as standard InChI strings for neutral, zwitterionic, anionic and cationic states of glycine (note that neutral and zwitterionic states do not differ in the total number of protons so they have the same standard InChI/InChIKey):
[Figure: InChI strings and InChIKeys for the protonation states of glycine]
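The rule quoted above can be checked mechanically by splitting an InChIKey at its hyphens. The following is a minimal sketch in plain Python, not part of any InChI toolkit; the helper names `split_inchikey` and `same_except_protonation` are made up for illustration. The glycine key is the standard one; the anion key is written here by applying the quoted rule (only the final character changes), and the meaning assigned to the individual characters is an assumption that should be checked against the InChI documentation before relying on it.

```python
def split_inchikey(key):
    """Split an InChIKey into its three hyphen-separated blocks."""
    first, second, protonation = key.split("-")
    if len(first) != 14 or len(second) != 10 or len(protonation) != 1:
        raise ValueError("not a well-formed InChIKey: %r" % key)
    return {
        "skeleton": first,              # hash of the connectivity information
        "other_layers": second[:8],     # hash of the remaining layers
        "standard": second[8] == "S",   # assumed: 'S' standard, 'N' non-standard
        "version": second[9],
        "protonation": protonation,     # assumed: 'N' means no protons added or removed
    }


def same_except_protonation(key_a, key_b):
    """True if two keys agree in everything but the protonation indicator."""
    a, b = split_inchikey(key_a), split_inchikey(key_b)
    a.pop("protonation")
    b.pop("protonation")
    return a == b


glycine = "DHMQDGOQFOQNFH-UHFFFAOYSA-N"    # neutral and zwitterionic glycine share this key
glycinate = "DHMQDGOQFOQNFH-UHFFFAOYSA-M"  # anion key, assumed from the quoted rule

print(split_inchikey(glycine)["protonation"])
print(same_except_protonation(glycine, glycinate))
```

Because the neutral and zwitterionic forms do not differ in the total number of protons, there is no character in the key that could tell them apart.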

Solution 2

InChI is intended to ignore tautomeric forms. As Martin indicates, this also means zwitterions are considered identical to the neutral form.

Unlike you and Martin, I'm not sure I see this as a bug, since predicting the most stable tautomer or zwitterion/neutral is a complicated issue.

If you want to keep track of zwitterions, I think SMILES is a better format, since you can specify exactly what you want as far as explicit hydrogens and charges. You'll need to stick to a particular toolkit to create a canonical ordering.

Moreover, there's a complicated relationship between CIDs and InChI / InChI keys. There are other cases where PubChem will have separate records for compounds that might be "the same" under InChI.

  • Axial or non-traditional stereochemistry. For example hexahelicene (CID=98863) should have two enantiomers, but the InChI reflects no stereochemistry.
  • "Extra" stereo centers. PubChem takes depositions in the SD file format, which allows 2D stereochemistry in wedge/hash notation. If you take the actual 3D geometry, you might realize the wedges/hashes were used for appearance, not for indicating a stereo center (i.e., multiple CIDs generate the same InChI from the full 3D molecule).

There are also cases where PubChem indicates an InChI key computed from the 2D depiction in the SD file, but there are missing or undefined stereo centers.

So I'd say that because of incomplete stereochemistry and/or inconsistencies in representations, PubChem CIDs will not always match up with "structural uniqueness" and this is by design, both for PubChem and InChI.
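As a small illustration of this mismatch, the two phenylalanine records from the question can be grouped by each identifier. This is only a sketch in plain Python: the `records` list is typed in by hand from the values quoted in the question, and `group_cids` is a hypothetical helper, not a PubChem API.

```python
from collections import defaultdict

# Values copied from the two PubChem entries quoted in the question.
records = [
    {
        "cid": 6140,
        "smiles": "C1=CC=C(C=C1)CC(C(=O)O)N",
        "inchi": "InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)/t8-/m0/s1",
    },
    {
        "cid": 6925665,
        "smiles": "C1=CC=C(C=C1)CC(C(=O)[O-])[NH3+]",
        "inchi": "InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)/t8-/m0/s1",
    },
]


def group_cids(records, field):
    """Map each distinct value of `field` to the list of CIDs that carry it."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record["cid"])
    return dict(groups)


by_inchi = group_cids(records, "inchi")
by_smiles = group_cids(records, "smiles")

print(len(by_inchi), "distinct InChI for", len(records), "CIDs")
print(len(by_smiles), "distinct SMILES for", len(records), "CIDs")
```

Grouping by InChI collapses the two CIDs into one bucket, while grouping by SMILES keeps them apart, which is the behaviour described above.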

The moral of the story, frankly, is that chemistry is complicated and coming up with "perfect" unique identifiers is incredibly hard.

Solution 3

Martin's answer led me to discover an important extension of InChI that does allow for specification of some tautomer and zwitterion identification.

  1. InChI identifiers that begin with InChI=1S/... are standard InChI. In standard InChI, the InChI identifier "must be the same for any arrangement of mobile hydrogen atoms", to quote Section 6 of the Technical FAQ of the InChI Trust.

  2. However, non-standard InChIs are also possible. These start with InChI=1/.... Note the missing S. In non-standard InChI, there can be an extra layer beginning with /f that is called the fixed hydrogen layer.

  3. Playing around with rdkit (via its Python API), I was able to produce a non-standard InChI that I think corresponds to PubChem compound 6925665.

from rdkit import Chem

# zwitterionic phenylalanine, written with explicit charges
zwitterion_phe_smiles = 'C1=CC=C(C=C1)CC(C(=O)[O-])[NH3+]'
zwitterion_phe_mol = Chem.MolFromSmiles(zwitterion_phe_smiles)

# produces "standard" InChI, so not explicitly zwitterionic
print(Chem.MolToInchi(zwitterion_phe_mol))

# produces "non-standard" InChI with a fixed-H layer, so the zwitterion can be identified
print(Chem.MolToInchi(zwitterion_phe_mol, options='/FixedH'))

# going from the respective InChIs back to SMILES
# (the standard InChI produces the neutral SMILES)
zwitterion_nonstandard_inchi = 'InChI=1/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)/f/h10H'
standard_inchi = 'InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)'

print(Chem.MolToSmiles(Chem.MolFromInchi(zwitterion_nonstandard_inchi)))
print(Chem.MolToSmiles(Chem.MolFromInchi(standard_inchi)))
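The difference between the standard and non-standard strings can also be inspected without a toolkit, simply by splitting on "/". The sketch below is plain Python and makes simplifying assumptions: `parse_inchi` is a hypothetical helper, it treats everything after the /f marker as belonging to the fixed-H part, and it does no validation beyond checking the prefix.

```python
def parse_inchi(inchi):
    """Split an InChI string into prefix, formula, main layers and fixed-H layers."""
    prefix, formula, *rest = inchi.split("/")
    if not prefix.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    version = prefix[len("InChI="):]
    main, fixed_h = [], []
    target = main
    for layer in rest:
        if layer.startswith("f"):
            # Everything from /f onwards is treated as the fixed-H part.
            target = fixed_h
            if len(layer) > 1:
                fixed_h.append(("formula", layer[1:]))
            continue
        # Each layer is stored as (one-letter tag, layer body).
        target.append((layer[0], layer[1:]))
    return {
        "standard": version.endswith("S"),  # "1S" is standard, "1" is non-standard
        "formula": formula,
        "main": main,
        "fixed_h": fixed_h,
    }


standard_inchi = 'InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)'
zwitterion_nonstandard_inchi = 'InChI=1/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)/f/h10H'

std = parse_inchi(standard_inchi)
non = parse_inchi(zwitterion_nonstandard_inchi)

print(std["standard"], std["fixed_h"])
print(non["standard"], non["fixed_h"])
print(std["main"] == non["main"])
```

In this example the main layers of the two strings are identical, and the only extra information in the non-standard string sits in the fixed-H part.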


So for zwitterions, tautomer-specific InChIs are possible. Worryingly, however, some kinds of tautomerism are not handled even by non-standard InChI, again quoting from section 6 of the FAQ:

In its current state, InChI recognizes the most common form of H migration (for the full list, see Table 6, Section IVb of the InChI Technical Manual). However, several ways of tautomeric migration that are not supported by default may appear important for some chemists. In particular, these are keto-enol and long-range tautomerisms.

Why PubChem chose to list standard InChI instead of non-standard InChI is not entirely clear to me. I suppose it is difficult to figure out programmatically when non-standard InChI would be required. Ideally, PubChem would have both standard and non-standard InChIs for each of their molecules, but I'm not sure when, if ever, they will make that change.


Author: Curt F.

Updated on May 24, 2020


  • Curt F.
    Curt F. about 2 years
    1. PubChem compound 6140 is L-phenylalanine in its neutral (not zwitterionic) form. According to PubChem, this molecule has the following SMILES and InChI identifiers:

      • SMILES: C1=CC=C(C=C1)CC(C(=O)O)N
      • InChI: InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)/t8-/m0/s1
    2. PubChem compound 6925665 is the zwitterionic form of L-phenylalanine (protonated amine and deprotonated carboxylate). PubChem has decided that this species should be called "(2S)-2-azaniumyl-3-phenylpropanoate". The SMILES and InChI identifiers are:

      • SMILES: C1=CC=C(C=C1)CC(C(=O)[O-])[NH3+]
      • InChI: InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12)/t8-/m0/s1
    3. Confusion. My confusion is why these different compounds have different PubChem entries (CIDs) and different SMILES identifiers, but the same InChI identifier. The different SMILES identifiers each appear to reflect the respective structures displayed by PubChem, but the single InChI identifier given for both compounds seems to reflect only the neutral form. I even put the InChI into rdkit and converted it back to InChI. The result was the same, and rdkit interpreted this InChI as the neutral (not zwitterionic) species. Why is the InChI duplicated when the structures are distinct?

    Here are some possibilities:

    1. PubChem is in error. They should change the InChI for the zwitterionic compound. (If so, to what?)
    2. PubChem is right but is using definitions of compound and CID that are different than mine.
    3. I have some kind of fundamental misunderstanding of the purpose of InChI, which I had thought would uniquely specify a molecular structure. But InChI is designed to handle ambiguities like zwitterion vs. neutral.
    4. rdkit interprets InChIs weird.
    5. Something else.
    • Leitouran
      Leitouran about 4 years
      I know this discussion is long gone, but... I've faced a similar struggle. I needed to cluster all compounds having identical first three layers in standard InChI. I made a Perl script to create these clusters. In case anyone is interested, I can share it.
    • Curt F.
      Curt F. about 4 years
      Please do share! Also I think your answer will be deleted soon because it is not really an answer. If you can comment (not answer) to an existing answer or the original question with a link to the script, thank you!
  • Curt F.
    Curt F. almost 7 years
    Great answer. I had no idea that InChI was designed this way.
  • Martin - マーチン
    Martin - マーチン almost 7 years
    @Curt Yeah, I thought the same thing as you, and to be frank it does not make any sense to me why it should be that way. Maybe they address this in the second version, but knowing IUPAC they'll take their time *sigh*.
  • Curt F.
    Curt F. almost 7 years
    Based on your link, it seems like specifying zwitterions and some tautomers might be possible with a non-standard InChI (InChI=1/... instead of InChI=1S/...) that includes a fixed-hydrogen layer. If I figure it out I might post a separate answer, but I never would have known if not for your link, so thanks again.
  • Geoff Hutchison
    Geoff Hutchison almost 7 years
    Basically you're not supposed to use non-standard InChIs in a public-facing service. A standardized InChI is what it is - but you'd need to know the exact flags passed to the code to understand a non-standard one.
  • Curt F.
    Curt F. almost 7 years
    PubChem clearly has different ideas about what a "compound" is than standard InChI does, or else the different CIDs wouldn't exist in the first place. To me using the same InChI for different CIDs obscures this fact. Plus, there is a widespread conception, which I now realize to be very wrong, that "there is a 1:1 correspondence between every organic chemical structure and a single InChI". Maybe that's the real issue.
  • Geoff Hutchison
    Geoff Hutchison almost 7 years
    I don't think PubChem obscures the difference. If you search by InChI or InChI key, you'll get multiple CIDs. I agree that many people don't realize that InChI is imperfect and there is not a 1:1 correspondence between a chemical structure and an InChI.
  • Geoff Hutchison
    Geoff Hutchison almost 7 years
    I do think that many people using PubChem don't realize its limitations (e.g., as listed in my answer). That included myself when we started to build PQR.
  • Curt F.
    Curt F. almost 7 years
    Thanks for the suggestions on SMILES; the discussion of PubChem / InChI is very illuminating. On whether InChI's treatment of zwitterions/tautomers is a bug or a feature, I agree that predicting the "most stable" one is very complicated, but how is stability related to the InChI representation? What would the rate of interconversion between gaseous CID 6140 and CID 6925665 be in a near vacuum at, say, 100 or 200 kelvin? My guess would be very low, which is why I think of them as separate molecules. So to me separate InChIs would be good.
  • Geoff Hutchison
    Geoff Hutchison almost 7 years
    @CurtF. Your comment refers to gaseous compounds. Keep in mind that most organic chemistry is done in solution; what's the rate of interconversion between these compounds in water or another protic solvent? As I said, this is why it's incredibly hard to come up with "perfect" chemical identifiers.
  • Curt F.
    Curt F. almost 7 years
    Oh yeah I don't think anyone is after "perfect". Just better. Or at least more consistent.
  • Martin - マーチン
    Martin - マーチン almost 5 years
    @CurtF. Although the FAQ is still the same, there is a new version of InChI available. However, I doubt they address this issue, but you still might be interested.