Forrest provides consulting services and software development through Mib Software. He can be contacted at [email protected].
Many scripting languages, including Perl and PHP, lack the semantic error reporting expected of a good C++ compiler. Their interpreters parse only for syntactic errors and do not report common or potential programming errors such as misspelled variable and function names, mismatched types in assignments and expressions, mismatched numbers and types of function parameters, functions that end without returning a value, unused local variables, and variables used before assignment.
Consequently, you commonly see requests on mailing lists for program-checking tools that can report these errors. One approach to developing such a tool might be to write a complete parser and semantic analyzer. My approach, however, translates the source into C++, so that existing C++ tools can perform the analysis. To that end, I present PHP2C++, a working PHP-to-C++ translator (available electronically; see "Resource Center," page 5, and http://www.mibsoftware.com/).
PHP (http://www.php.net/) is a scripting language with a syntax based on C. The PHP interpreter parses the source and any included files for syntactic errors only, and otherwise does limited error checking. Typical of many scripting languages, the lack of semantic analysis can let errors remain undetected:
- PHP does not require variables to be declared, so misspelled names pass the syntactic check.
- PHP has weak typing of variables, with many automatic conversions and very little type checking in expressions.
- Unknown function names are found only when the function call is attempted at run time.
- The number of function call parameters is checked only when the function is run.
- Run-time checking of the types of function call parameters can be done by the programmer, but typically is not.
Simplifying Translation
Language translation is a hard problem. The requirements for this type of translation are simplified by a few reasonable assumptions and limitations.
- Translation is to be from one imperative language to another with input files known to be syntactically valid. (Syntax is already verified by the scripting language.)
- The primary goal is a translation for C++ lexical analysis. Since the result will not be executed, 100 percent operational equivalence is not necessary.
- Programmers will be stepping through the errors and warnings. Some errors and omissions in the translation can be tolerated. (99 percent translation is fine. Not translating deprecated or unused language features is fine.)
- Programmers can supply translation hints to guide the translation.
"Translation hints" are techniques I've previously used in other tools I've developed. For example, hints are instructions to the translator to insert or delete snippets of code in the output. Hints appear in the source file as comments, so that they do not change the original operation.
For PHP2C++, an insertion hint is a PHP comment starting with /*c++: and ending with */. PHP2C++ strips off the beginning and end sequence from the comment, and places the rest into the C++ output. A deletion hint starts with /*php:*/ and continues to the end of the next comment (which is typically /**/); the PHP code in between will not appear in the C++ output.
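The two hint forms can be sketched as a small text filter. This is an illustration only, not code from PHP2C++: the function name applyHints is hypothetical, and the sketch makes no attempt to skip hint-like text inside PHP string literals.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the PHP2C++ source): copies PHP text to the
// output while applying the insertion and deletion hints described above.
std::string applyHints(const std::string& php) {
    const std::string insertOpen = "/*c++:";
    const std::string deleteOpen = "/*php:*/";
    const std::string commentOpen = "/*";
    const std::string commentClose = "*/";
    std::string out;
    std::size_t pos = 0;
    while (pos < php.size()) {
        if (php.compare(pos, insertOpen.size(), insertOpen) == 0) {
            // Insertion hint: emit the comment body without its delimiters.
            std::size_t bodyStart = pos + insertOpen.size();
            std::size_t end = php.find(commentClose, bodyStart);
            if (end == std::string::npos) {
                out += php.substr(pos);  // unterminated hint: copy unchanged
                break;
            }
            out += php.substr(bodyStart, end - bodyStart);
            pos = end + commentClose.size();
        } else if (php.compare(pos, deleteOpen.size(), deleteOpen) == 0) {
            // Deletion hint: skip everything up to the end of the next comment.
            std::size_t next = php.find(commentOpen, pos + deleteOpen.size());
            std::size_t end = (next == std::string::npos)
                ? std::string::npos
                : php.find(commentClose, next + commentOpen.size());
            if (end == std::string::npos) break;  // no closing comment: drop the rest
            pos = end + commentClose.size();
        } else {
            out += php[pos++];  // ordinary text passes through
        }
    }
    return out;
}
```

With this sketch, the input `/*c++:int _s_n;*/$n = 1;` yields `int _s_n;$n = 1;`, and any PHP placed between `/*php:*/` and a following `/**/` is omitted from the output.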
Resolving Differences
There are four groups of differences between imperative programming languages for the translator to accommodate. Many scripting languages are similar, and some of the issues and techniques also apply when other source languages are translated.
Differences in syntax. By assuming the input is already in valid syntax, and ignoring infrequently used features, accommodating syntax differences is often the easiest part of the translation. PHP2C++ checks a list of functions responsible for recognizing certain PHP tokens and words. When a function recognizes the syntax, it translates to the C++ equivalent and returns the number of source characters processed. It may set state variables in order to affect later translation.
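A minimal sketch of that recognizer-list design follows. It is not the PHP2C++ source: the names TranslatorState, recognizeVariable, recognizeWord, and translate are hypothetical, and only two rules are shown (the $ to _s_ renaming, and PHP's elseif keyword becoming else if). Each recognizer returns the number of source characters it consumed, or 0 when it does not match.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Shared state (hypothetical; the fields are illustrative only).
struct TranslatorState {
    std::string out;       // C++ text produced so far
    std::string lastWord;  // example of state one recognizer leaves for later ones
};

using Recognizer = std::size_t (*)(const std::string&, std::size_t, TranslatorState&);

static bool isNameChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Length of the run of name characters starting at pos.
static std::size_t nameLength(const std::string& src, std::size_t pos) {
    std::size_t n = 0;
    while (pos + n < src.size() && isNameChar(src[pos + n])) ++n;
    return n;
}

// "$name" becomes "_s_name".
std::size_t recognizeVariable(const std::string& src, std::size_t pos, TranslatorState& st) {
    if (src[pos] != '$') return 0;
    std::size_t n = nameLength(src, pos + 1);
    if (n == 0) return 0;
    st.out += "_s_" + src.substr(pos + 1, n);
    return n + 1;
}

// A bare word is copied, except that "elseif" becomes "else if".
std::size_t recognizeWord(const std::string& src, std::size_t pos, TranslatorState& st) {
    std::size_t n = nameLength(src, pos);
    if (n == 0) return 0;
    st.lastWord = src.substr(pos, n);
    st.out += (st.lastWord == "elseif") ? "else if" : st.lastWord;
    return n;
}

std::string translate(const std::string& src) {
    const std::vector<Recognizer> recognizers = {recognizeVariable, recognizeWord};
    TranslatorState st;
    std::size_t pos = 0;
    while (pos < src.size()) {
        std::size_t used = 0;
        for (Recognizer r : recognizers) {
            used = r(src, pos, st);
            if (used != 0) break;  // first recognizer that matches wins
        }
        if (used == 0) {           // nothing matched: copy one character
            st.out += src[pos];
            used = 1;
        }
        pos += used;
    }
    return st.out;
}
```

Returning a character count keeps each recognizer independent of the others: the driver loop only needs to know how far to advance. The sketch does not treat string literals or comments specially, which a real translator would have to do.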
Differences in names and scope. The characters allowed in variable and function names can be different, and may or may not be required to match case. All PHP variable names begin with $. To conform to the C++ standard, PHP2C++ converts the $ to _s_, which changes the names appearing in the error message output as well.
Names in C++ are case sensitive. If the source language has case-insensitive names (PHP function names are case insensitive), the translator could normalize them. I chose not to, so that the C++ compiler reports any inconsistent capitalization.
Scoping rules between languages can be slightly different. For example, variable names in a PHP function body are local unless declared with the global keyword. (Unintentionally omitting the global declaration is a frequent PHP programmer error I wanted to catch.) This version of PHP2C++ translates each global declaration into a declaration of a local variable. (This is lexically correct, but a more advanced translation would allow operational equivalence of accessing global variables.)
Because C++ requires declaration before use and all statements to appear within function bodies, four passes are made through each PHP source file to produce one C++ output file.
- The first pass writes function declarations.
- The second pass writes class declarations, including member function bodies.
- The third pass writes main(), which contains all PHP source that is outside function bodies. (A side effect is that comments between functions will appear in the body of main(). This does not affect the lexical scan.)
- The last pass writes all the function bodies.
Differences in compiled versus interpreted behavior. Many scripting languages have "interpreted" strings, in which the values of variables named within a string are inserted at run time. This feature is commonly used in PHP. PHP2C++ splits such strings into variable and constant sections joined with an overloaded + operator.
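As a rough illustration of the string splitting, the sketch below takes the contents of a double-quoted PHP string (without the surrounding quotes) and produces a C++ expression. The function name splitInterpolated is hypothetical and the exact text PHP2C++ emits may differ. Only simple $name references and the \$ escape are handled; PHP forms such as {$var} and $arr[key] are left out. The result assumes the variables are objects of a string class with an overloaded + operator, since adding a plain int to a string literal in C++ would be pointer arithmetic.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: "Hello $name!" becomes "Hello " + _s_name + "!".
std::string splitInterpolated(const std::string& body) {
    std::string expr;     // the C++ expression being built
    std::string literal;  // constant text collected since the last variable
    auto append = [&expr](const std::string& term) {
        if (!expr.empty()) expr += " + ";
        expr += term;
    };
    auto flushLiteral = [&]() {
        if (!literal.empty()) {
            append("\"" + literal + "\"");
            literal.clear();
        }
    };
    auto isNameStart = [](char c) {
        return std::isalpha(static_cast<unsigned char>(c)) || c == '_';
    };
    auto isNameChar = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };
    std::size_t i = 0;
    while (i < body.size()) {
        if (body[i] == '\\' && i + 1 < body.size()) {
            // "\$" is a literal dollar sign; other escapes pass through unchanged.
            if (body[i + 1] == '$') literal += '$';
            else literal.append(body, i, 2);
            i += 2;
        } else if (body[i] == '$' && i + 1 < body.size() && isNameStart(body[i + 1])) {
            std::size_t j = i + 1;
            while (j < body.size() && isNameChar(body[j])) ++j;
            flushLiteral();
            append("_s_" + body.substr(i + 1, j - i - 1));  // renamed variable
            i = j;
        } else {
            literal += body[i++];
        }
    }
    flushLiteral();
    return expr.empty() ? "\"\"" : expr;
}
```

Because each variable becomes a separate operand, a misspelled name inside a string shows up to the C++ compiler as an undeclared identifier, just as it would in ordinary code.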
A few infrequently used PHP features that have no easy C++ implementation are not translated. PHP has a feature called "variable variables" where a string variable can be used to look up the value of another variable or function by name at run time. Another feature allows the file name of the include statement to be an interpreted expression at run time. Both features would require a translation hint to remove the errors from the C++ compiled output.
Differences in declarations and strong versus weak typing. Like many scripting languages, PHP uses weak typing: There are no variable declarations and there is automatic type juggling. In contrast, C++ uses strong typing: Variable declarations are required before use, and variables do not change type. Translating from weak to strong typing is the most challenging part of the translation.
A complex translator could automatically infer the type of a variable from context, or more simply insert a declaration of type "mixed" for every variable mentioned in the PHP source. Unfortunately, this would hide misspellings and inappropriate type mixing. Misspelled variables would be declared along with the correct spellings. The mixed class would overload most expression operators and result in legal C++, but would defeat semantic type checking.
The technique I decided to use requires you to use translation hints to insert variable declarations. This is usually possible by inserting a "shorthand" hint for a type (one of /*int*/, /*void*/, /*bool*/, /*string*/, /*mixed*/, /*array*/, and /*string_array*/). Example 1 shows hints inserted into a PHP function header and the resulting C++ output. (Hints for function return types must appear after the PHP function keyword.)
You should strive to use the most accurate variable type for each hint. Avoid using the mixed type, because that will prevent C++ type checking. When array elements will always contain a string, use the string_array type instead of array to permit additional checking.
Type Juggling
For the purpose of program audit, you need to be cautious in declaring C++ classes with operator overloads and constructors that mimic most of the automatic type juggling of PHP. (For instance, the statement $I = '20 fish' + 30.2; results in $I being a variable of floating-point type with the value 50.2.) I chose to allow integers to be converted to strings automatically, but to flag the conversion of strings to integers as a potential error.
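One way to get that asymmetry in C++ is to give the string class a converting constructor from int but no conversion operator back to int. The class below is a hypothetical stand-in, not the class shipped with PHP2C++; the name PhpString and its members are assumptions made for illustration.

```cpp
#include <string>
#include <type_traits>

// Hypothetical stand-in for the translator's string class.
class PhpString {
public:
    PhpString(const char* s) : value_(s) {}
    PhpString(int n) : value_(std::to_string(n)) {}  // int -> string is allowed
    // Deliberately no operator int(): string -> int must be a compile error.

    const std::string& str() const { return value_; }

    // Concatenation; an int operand is converted through the constructor above.
    friend PhpString operator+(const PhpString& a, const PhpString& b) {
        return PhpString((a.value_ + b.value_).c_str());
    }

private:
    std::string value_;
};

// The compiler itself confirms the one-way conversion.
static_assert(std::is_convertible<int, PhpString>::value,
              "integers convert to strings implicitly");
static_assert(!std::is_convertible<PhpString, int>::value,
              "strings do not convert to integers");
```

With this class, `PhpString label = "count: "; label + 5` compiles, while `int n = label;` is rejected, which is the diagnostic a programmer auditing the PHP would want to see.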
Strings and arrays are common conditional expressions in PHP. PHP2C++ avoids the need for an (int) operator for strings and arrays by enclosing all if()/while() conditions in !!(). This is sometimes unnecessary but always lexically correct.
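The !!() wrapping works because a class only needs operator! for the doubled negation to compile; no conversion to int or bool is required. The classes below are hypothetical stand-ins, not the PHP2C++ classes. The truth rules in the operator bodies follow PHP (an empty string, the string "0", and an empty array are false), although for a translation that is only compiled and never run, the bodies would not matter.

```cpp
#include <string>
#include <vector>

// Hypothetical string class with operator! but no conversion to bool or int.
class PhpString {
public:
    PhpString(const char* s) : value_(s) {}
    bool operator!() const { return value_.empty() || value_ == "0"; }
private:
    std::string value_;
};

// Hypothetical array class; an empty array is false in a condition.
class PhpArray {
public:
    void push(const PhpString& s) { items_.push_back(s); }
    bool operator!() const { return items_.empty(); }
private:
    std::vector<PhpString> items_;
};

// Counts how many of the two arguments are true in a condition.
int countTruthy(const PhpString& s, const PhpArray& a) {
    int n = 0;
    if (!!(s)) ++n;  // translated form of: if ($s)
    if (!!(a)) ++n;  // translated form of: if ($a)
    return n;
}
```

Writing `if (s)` against these classes would not compile, so the doubled negation is what keeps ordinary PHP conditions from producing spurious errors.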
Using PHP2C++
When using PHP2C++, the first step is to obtain and compile the PHP2C++ source. You run PHP2C++ by specifying the source filename as the first command-line argument and the output filename as the second. When you run a C++ compiler on the output, do not be alarmed by many errors if you have not inserted translation hints for variable declarations.
Each function body typically requires a small number of hints. Because PHP2C++ inserts #line directives into the output, your IDE will take you directly to the offending PHP source line, where you can add hints or correct errors.
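The #line directive is a standard C++ preprocessor feature: it sets the line number and file name that the compiler reports for the code that follows. A minimal sketch of emitting one is shown here; the function name lineDirective is hypothetical, how often PHP2C++ writes a directive is not stated in this article, and file names containing backslashes or quotes would need escaping that the sketch omits.

```cpp
#include <string>

// Builds a "#line" directive so that compiler diagnostics for the following
// C++ text refer to the given PHP file and line instead of the generated file.
std::string lineDirective(int phpLine, const std::string& phpFile) {
    return "#line " + std::to_string(phpLine) + " \"" + phpFile + "\"\n";
}
```

For example, lineDirective(12, "page.php") produces the text `#line 12 "page.php"` followed by a newline, and an error in the next generated statement is then reported against line 12 of page.php.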
After the hints are added, there should be only a few remaining C++ errors, which fall into one of the following categories:
- Programmer errors (misspellings and the like). Action: Correct them.
- Lines where PHP run-time type juggling comes into play. Action: Consider changing the PHP (use strval() and intval() functions, for example, which will improve the PHP slightly, and satisfy the C++ compiler at the same time).
- A feature not supported by this version of PHP2C++. Action: You may need to inspect the C++ that was output to determine what is not supported. You can choose to use a translation hint to resolve it, or just ignore it.
If your PHP uses classes and declares more than one, be aware that Microsoft C++ compilers stop reporting errors after the first class with errors. (The function bodies in later class declarations are not even scanned for syntax errors!) I prefer the error reporting of the GNU C++ compiler, but I have used PHP2C++ successfully with Microsoft compilers when I use translation hints to remove all the errors in earlier class translations.
Conclusion
The focus of this version of PHP2C++ is translation for lexical analysis, to validate PHP code after a reasonable effort to add translation hints. I've used it to accelerate debugging and auditing. Beyond in-house use, it has helped me find several errors in popular and mature public PHP projects.
A future version of PHP2C++ will translate for operational equivalence, which requires a more complicated translator, class implementations, and a large function library.
DDJ