Recently I was updating an API to perform compression for potentially large incoming requests. A request body looks something like this:
{
"Payload": "{big chunk of compressible text here}"
}
During code review, some questions were raised around string encoding. For example, how do we know the encoding of an incoming JSON request body? Is it safe to always assume UTF-8?
First things first: by the time you're dealing with in-memory string objects in C# (for example, when your controller code starts executing), they're encoded as UTF-16. Since this is part of the C# specification, we can take it as ground truth, even on platforms like Linux.
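To make that concrete, here's a minimal sketch comparing what string.Length reports with the number of bytes the same text takes in each encoding. The class and helper names are my own, purely for illustration.

```csharp
using System;
using System.Text;

public static class StringEncodingDemo
{
    // Returns how many bytes the given text occupies in the given encoding.
    public static int ByteCount(string text, Encoding encoding) =>
        encoding.GetByteCount(text);

    public static void Main()
    {
        // Five characters, one of them (U+00EF) outside ASCII.
        string text = "na\u00EFve";

        // string.Length counts UTF-16 code units, not bytes.
        Console.WriteLine(text.Length);                        // 5
        // Encoding.Unicode is UTF-16LE: two bytes per code unit here.
        Console.WriteLine(ByteCount(text, Encoding.Unicode));  // 10
        // In UTF-8 the ASCII letters take one byte each and U+00EF takes two.
        Console.WriteLine(ByteCount(text, Encoding.UTF8));     // 6
    }
}
```

The in-memory representation never changes. An encoding only comes into play when a string is turned into bytes or read back from them.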
What about beforehand? ASP.NET Core is reading a byte stream off the network, detecting and deserializing JSON, and spitting out C# objects on the other side. How does it know the encoding of those bytes?
This is the question HTTP's Content-Type header exists to answer. The header can carry a charset parameter that specifies the encoding.
Content-Type: application/json; charset=UTF-8
Alright, simple! ASP.NET Core just needs to read the charset from the Content-Type header.
One problem. The application/json media type doesn't actually define a charset parameter, so nothing obliges a client to send one, and many don't.
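Both cases are easy to see with MediaTypeHeaderValue from System.Net.Http.Headers. This is just a sketch of reading the header value on its own, not how ASP.NET Core does it internally. The GetCharset helper is a name I made up.

```csharp
using System;
using System.Net.Http.Headers;

public static class ContentTypeDemo
{
    // Returns the charset parameter of a Content-Type value, or null if it has none.
    public static string GetCharset(string contentType) =>
        MediaTypeHeaderValue.Parse(contentType).CharSet;

    public static void Main()
    {
        // Charset declared: the parameter value comes back as sent.
        Console.WriteLine(GetCharset("application/json; charset=UTF-8") ?? "(none)"); // UTF-8

        // Charset omitted: nothing in the header tells us the encoding.
        Console.WriteLine(GetCharset("application/json") ?? "(none)");                // (none)
    }
}
```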
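The world isn't as clear cut as we'd like. However, since we're working with JSON, the option space is limited. Earlier JSON RFCs (RFC 4627 and RFC 7159) allowed only UTF-8, UTF-16, and UTF-32, and the current one (RFC 8259) goes further, requiring UTF-8 for JSON exchanged between systems that aren't part of a closed ecosystem.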
So, what does ASP.NET Core do with JSON content with no declared encoding? There doesn't seem to be any official documentation on how the framework will behave. We'll have to dive into the source code.
Warning: since this behavior is undocumented, it could change at any time. It's an implementation detail.
For this investigation I'll be looking at the ASP.NET Core 3.1 source.
ASP.NET Core reads the Content-Type header and matches the media type to an input formatter. For application/json, the framework by default uses SystemTextJsonInputFormatter to deserialize the JSON into a model object. This is where the encoding becomes important. The formatter accepts only UTF-8 and UTF-16. Other encodings are rejected (sorry, UTF-32).
If the charset is UTF-16, the formatter will transcode the request to UTF-8 before deserialization. The underlying JSON parser, System.Text.Json.JsonSerializer, is explicitly optimized for working with UTF-8, and for byte input it accepts only that encoding.
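Here's a rough sketch of that transcode-then-parse idea using only public APIs. It is not how the formatter is implemented: it buffers the whole body in a byte array for simplicity, and the class and method names are hypothetical.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class TranscodeDemo
{
    // Converts UTF-16LE bytes to UTF-8 bytes.
    public static byte[] Utf16LeToUtf8(byte[] utf16Bytes) =>
        Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

    // Reads the "Payload" property from UTF-8 encoded JSON.
    public static string ReadPayload(byte[] utf8Json)
    {
        using JsonDocument doc = JsonDocument.Parse(new ReadOnlyMemory<byte>(utf8Json));
        return doc.RootElement.GetProperty("Payload").GetString();
    }

    public static void Main()
    {
        // Simulate a request body that arrived as UTF-16LE.
        byte[] utf16Body = Encoding.Unicode.GetBytes("{\"Payload\":\"hello\"}");

        // Transcode first, then hand UTF-8 bytes to the parser.
        byte[] utf8Body = Utf16LeToUtf8(utf16Body);
        Console.WriteLine(ReadPayload(utf8Body)); // hello
    }
}
```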
Finally, if there's no charset available, SystemTextJsonInputFormatter will assume the content is UTF-8 and try its best to deserialize (link to source).
// We want to do our best effort to read the body of the request even in the
// cases where the client doesn't send a content type header or sends a content
// type header without encoding. For that reason we pick the first encoding of the
// list of supported encodings and try to use that to read the body. This encoding
// is UTF-8 by default in our formatters, which generally is a safe choice for the
// encoding.
return SupportedEncodings[0];
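Putting the pieces together, the selection logic described above can be sketched roughly like this. This is my own simplified approximation, not the framework's code: the SupportedEncodings list here is a stand-in for the formatter's, and SelectEncoding is a hypothetical helper that returns null where the real formatter would reject the request.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingSelectionSketch
{
    // Stand-in for the formatter's supported list, with UTF-8 first.
    public static readonly IReadOnlyList<Encoding> SupportedEncodings =
        new Encoding[] { Encoding.UTF8, Encoding.Unicode };

    // Returns the encoding to read the body with, or null to reject the request.
    public static Encoding SelectEncoding(string charset)
    {
        // No charset declared: fall back to the first supported encoding.
        if (string.IsNullOrEmpty(charset))
        {
            return SupportedEncodings[0];
        }

        // Charset declared: it has to match one of the supported encodings.
        foreach (Encoding candidate in SupportedEncodings)
        {
            if (string.Equals(candidate.WebName, charset, StringComparison.OrdinalIgnoreCase))
            {
                return candidate;
            }
        }

        return null;
    }

    public static void Main()
    {
        Console.WriteLine(SelectEncoding(null).WebName);      // utf-8
        Console.WriteLine(SelectEncoding("UTF-16").WebName);  // utf-16
        Console.WriteLine(SelectEncoding("utf-32") == null);  // True
    }
}
```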
I tested this behavior to confirm it.
First I sent a request with a JSON body encoded as UTF-16 (specifically UTF-16LE, little endian), which is supported as long as you include the charset parameter.
// UTF-16LE; StringContent sets the Content-Type charset from this encoding.
Encoding utf16le = new UnicodeEncoding(bigEndian: false, byteOrderMark: true, throwOnInvalidBytes: true);
HttpContent content = new StringContent(JsonConvert.SerializeObject(body), utf16le, "application/json");
HttpResponseMessage response = await client.PostAsync("http://localhost:5000/submit", content);
No sweat. ASP.NET Core deserialized things just fine.
Next I tried an identical request but with the charset removed.
Encoding utf16le = new UnicodeEncoding(bigEndian: false, byteOrderMark: true, throwOnInvalidBytes: true);
HttpContent content = new StringContent(JsonConvert.SerializeObject(body), utf16le, "application/json");
// Remove the charset parameter so the server has to guess the encoding.
content.Headers.ContentType.CharSet = string.Empty;
HttpResponseMessage response = await client.PostAsync("http://localhost:5000/submit", content);
Not so nice this time! The API returned 400 Bad Request with the following validation error:
{"type":"https://tools.ietf.org/html/rfc7231#section-6.5.1","title":"One or more validation errors occurred.","status":400,"traceId":"|694dce14-41170f2541b5204f.","errors":{"$":["'0x00' is an invalid start of a property name. Expected a '\"'. Path: $ | LineNumber: 0 | BytePositionInLine: 1."]}}
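This is expected if the framework assumes UTF-8 when the content is actually UTF-16. In UTF-16LE the opening brace is encoded as the two bytes 0x7B 0x00. A UTF-8 parser reads 0x7B as '{', then hits 0x00 at byte position 1, right where it expects the quote that starts a property name.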
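The same failure can be reproduced without a web server by handing UTF-16LE bytes directly to System.Text.Json, which treats byte input as UTF-8. This is a small sketch with names of my own choosing. It only checks whether parsing succeeds and doesn't depend on the exact error text.

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class WrongEncodingDemo
{
    // Returns true if the bytes parse as UTF-8 JSON, false if the parser rejects them.
    public static bool ParsesAsUtf8Json(byte[] body)
    {
        try
        {
            using JsonDocument doc = JsonDocument.Parse(new ReadOnlyMemory<byte>(body));
            return true;
        }
        catch (JsonException)
        {
            return false;
        }
    }

    public static void Main()
    {
        const string json = "{\"Payload\":\"hello\"}";
        byte[] utf16 = Encoding.Unicode.GetBytes(json);

        // The first four bytes: '{' and '"', each followed by a zero byte.
        Console.WriteLine(BitConverter.ToString(utf16, 0, 4));              // 7B-00-22-00

        Console.WriteLine(ParsesAsUtf8Json(Encoding.UTF8.GetBytes(json)));  // True
        Console.WriteLine(ParsesAsUtf8Json(utf16));                         // False
    }
}
```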
OK, we've done a bit of digging. What can we learn from all of this? First, we don't need to worry about the encoding of incoming JSON in our application logic. ASP.NET Core will either reject the request (as it did with the UTF-16 body that had no charset) or transcode it to UTF-8 before deserializing to C# objects. Second, it may be helpful to know that the framework doesn't support UTF-32 encoding, at least for content it needs to deserialize from JSON.