6.6.5.14 Conversion to/from C

When creating a Scheme string from a C string or when converting a Scheme string to a C string, the concept of character encoding becomes important.

In C, a string is just a sequence of bytes, and the character encoding describes the relation between these bytes and the actual characters that make up the string. For Scheme strings, character encoding is not an issue (most of the time), since in Scheme you usually treat strings as character sequences, not byte sequences.

Converting to C and converting from C each have their own challenges.

When converting from C to Scheme, it is important that the sequence of bytes in the C string be valid with respect to its encoding. ASCII strings, for example, can’t contain any bytes greater than 127. Such a byte is ill-formed in ASCII and cannot be converted into a Scheme character.
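As an illustration of what "ill-formed" means for ASCII, the following minimal sketch scans a byte sequence for the first byte greater than 127. The helper name first_non_ascii is hypothetical and is not part of Guile; it only shows the kind of check a decoder has to perform.

```c
#include <stddef.h>

/* Hypothetical helper, not part of libguile.
   Return the index of the first byte in str that is not valid ASCII
   (that is, greater than 127), or len if every byte is valid.  */
size_t
first_non_ascii (const char *str, size_t len)
{
  size_t i;

  for (i = 0; i < len; i++)
    /* Cast to unsigned char: plain char may be signed, in which case
       bytes above 127 would compare as negative numbers.  */
    if ((unsigned char) str[i] > 127)
      return i;

  return len;
}
```

A return value equal to len means the whole buffer is well-formed ASCII.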

Problems can occur in the reverse operation as well. Not all character encodings can hold all possible Scheme characters. Some encodings, like ASCII for example, can only describe a small subset of all possible characters. So, when converting to C, one must first decide what to do with Scheme characters that can’t be represented in the C string.

Converting a Scheme string to a C string will often allocate fresh memory to hold the result. You must take care that this memory is eventually freed. In many cases, this can be achieved by using scm_dynwind_free inside an appropriate dynwind context. See Dynamic Wind.

C Function: SCM scm_from_locale_string (const char *str)
C Function: SCM scm_from_locale_stringn (const char *str, size_t len)

Creates a new Scheme string that has the same contents as str when interpreted in the character encoding of the current locale.

For scm_from_locale_string, str must be null-terminated.

For scm_from_locale_stringn, len specifies the length of str in bytes, and str does not need to be null-terminated. If len is (size_t)-1, then str does need to be null-terminated and the real length will be found with strlen.

If the C string is ill-formed, an error will be raised.

Note that these functions should not be used to convert C string constants, because there is no guarantee that the encoding of the current locale will match the execution character set, which the compiler uses for string and character constants. Most modern C compilers use UTF-8 by default, so to convert C string constants we recommend scm_from_utf8_string.

C Function: SCM scm_take_locale_string (char *str)
C Function: SCM scm_take_locale_stringn (char *str, size_t len)

Like scm_from_locale_string and scm_from_locale_stringn, respectively, but also frees str with free eventually. Thus, you can use these functions when you would otherwise free str immediately after creating the Scheme string. In certain cases, Guile can then use str directly as its internal representation.

C Function: char * scm_to_locale_string (SCM str)
C Function: char * scm_to_locale_stringn (SCM str, size_t *lenp)

Returns a C string with the same contents as str in the character encoding of the current locale. The C string must eventually be freed with free, possibly by using scm_dynwind_free. See Dynamic Wind.

For scm_to_locale_string, the returned string is null-terminated and an error is signaled when str contains #\nul characters.

For scm_to_locale_stringn, when lenp is not NULL, str may contain #\nul characters and the length of the returned string in bytes is stored in *lenp. The returned string will not be null-terminated in this case. If lenp is NULL, scm_to_locale_stringn behaves like scm_to_locale_string.

If a character in str cannot be represented in the character encoding of the current locale, the default port conversion strategy is used. See Ports, for more on conversion strategies.

If the conversion strategy is error, an error will be raised. If it is substitute, a replacement character, such as a question mark, will be inserted in its place. If it is escape, a hex escape will be inserted in its place.

C Function: size_t scm_to_locale_stringbuf (SCM str, char *buf, size_t max_len)

Puts str as a C string in the current locale encoding into the memory pointed to by buf. The buffer at buf has room for max_len bytes and scm_to_locale_stringbuf will never store more than that. No terminating '\0' will be stored.

The return value of scm_to_locale_stringbuf is the number of bytes that are needed for all of str, regardless of whether buf was large enough to hold them. Thus, when the return value is larger than max_len, only max_len bytes have been stored and you probably need to try again with a larger buffer.

For most situations, string conversion should occur using the current locale, such as with the functions above. But there may be cases where one wants to convert strings from a character encoding other than the locale’s character encoding. For these cases, the lower-level functions scm_to_stringn and scm_from_stringn are provided. These functions should seldom be necessary if one is properly using locales.

C Type: scm_t_string_failed_conversion_handler

This is an enumerated type that can take one of three values: SCM_FAILED_CONVERSION_ERROR, SCM_FAILED_CONVERSION_QUESTION_MARK, and SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE. They indicate a strategy for handling characters that cannot be converted to or from a given character encoding. SCM_FAILED_CONVERSION_ERROR indicates that a conversion should throw an error if some characters cannot be converted. SCM_FAILED_CONVERSION_QUESTION_MARK indicates that a conversion should replace unconvertible characters with the question mark character. SCM_FAILED_CONVERSION_ESCAPE_SEQUENCE requests that a conversion replace each unconvertible character with an escape sequence.

While all three strategies apply when converting Scheme strings to C, only SCM_FAILED_CONVERSION_ERROR and SCM_FAILED_CONVERSION_QUESTION_MARK can be used when converting C strings to Scheme.

C Function: char *scm_to_stringn (SCM str, size_t *lenp, const char *encoding, scm_t_string_failed_conversion_handler handler)

This function returns a newly allocated C string from the Guile string str. The length of the returned string in bytes is stored in *lenp. The character encoding of the C string is passed as the ASCII, null-terminated C string encoding. The handler parameter gives a strategy for dealing with characters that cannot be converted into encoding.

If lenp is NULL, this function will return a null-terminated C string. It will throw an error if the string contains a null character.

The Scheme interface to this function is string->bytevector, from the ice-9 iconv module. See Representing Strings as Bytes.

C Function: SCM scm_from_stringn (const char *str, size_t len, const char *encoding, scm_t_string_failed_conversion_handler handler)

This function returns a Scheme string from the C string str. The length in bytes of the C string is input as len. The encoding of the C string is passed as the ASCII, null-terminated C string encoding. The handler parameter gives a strategy for dealing with unconvertible characters.

The Scheme interface to this function is bytevector->string. See Representing Strings as Bytes.

The following conversion functions are provided as a convenience for the most commonly used encodings.

C Function: SCM scm_from_latin1_string (const char *str)
C Function: SCM scm_from_utf8_string (const char *str)
C Function: SCM scm_from_utf32_string (const scm_t_wchar *str)

Return a Scheme string from the null-terminated C string str, which is ISO-8859-1-, UTF-8-, or UTF-32-encoded. These functions should be used to convert hard-coded C string constants into Scheme strings.

C Function: SCM scm_from_latin1_stringn (const char *str, size_t len)
C Function: SCM scm_from_utf8_stringn (const char *str, size_t len)
C Function: SCM scm_from_utf32_stringn (const scm_t_wchar *str, size_t len)

Return a Scheme string from the C string str, which is ISO-8859-1-, UTF-8-, or UTF-32-encoded, of length len. len is the number of bytes pointed to by str for scm_from_latin1_stringn and scm_from_utf8_stringn; it is the number of elements (code points) in str in the case of scm_from_utf32_stringn.

C Function: char *scm_to_latin1_stringn (SCM str, size_t *lenp)
C Function: char *scm_to_utf8_stringn (SCM str, size_t *lenp)
C Function: scm_t_wchar *scm_to_utf32_stringn (SCM str, size_t *lenp)

Return a newly allocated, ISO-8859-1-, UTF-8-, or UTF-32-encoded C string from Scheme string str. An error is thrown when str cannot be converted to the specified encoding. If lenp is NULL, the returned C string will be null-terminated, and an error will be thrown if the C string would otherwise contain null characters. If lenp is not NULL, the string is not null-terminated, and the length of the returned string is stored in *lenp. The length is the number of bytes for scm_to_latin1_stringn and scm_to_utf8_stringn; it is the number of elements (code points) for scm_to_utf32_stringn.

It is not often the case, but sometimes when you are dealing with the implementation details of a port, you need to encode and decode strings according to the encoding and conversion strategy of the port. There are some convenience functions for that purpose as well.

C Function: SCM scm_from_port_string (const char *str, SCM port)
C Function: SCM scm_from_port_stringn (const char *str, size_t len, SCM port)
C Function: char* scm_to_port_string (SCM str, SCM port)
C Function: char* scm_to_port_stringn (SCM str, size_t *lenp, SCM port)

Like scm_from_stringn and friends, except they take their encoding and conversion strategy from a given port object.