@e0qdk

e0qdk@reddthat.com · 8 days ago

I was curious, so I did some searches on this topic for you and found these pages:

The second link in particular notes:

The reason that things are much easier with all ASCII data is that practically every Unicode encoding in existence maps bytes 0x00…0x7f to the corresponding code points, so byte strings and Unicode strings that contain the same all-ASCII data are basically equivalent, even semantically. What usually trips people up with non-ASCII data is that the semantic meaning of bytes in the range 0x80…0xff changes from one encoding to another.

But, thinking like a systems programmer again, for many purposes the semantic meaning of bytes 0x80…0xff doesn’t matter. All that matters is that those bytes are preserved unchanged by whatever operations are done. Typical operations like tokenizing strings, looking for markers indicating particular types of data, etc. only need to care about the meaning of bytes in the range 0x00…0x7f; bytes in the range 0x80…0xff are just along for the ride.

So the trick for beating Python 3 strings into submission is to put in encoding and decoding calls where you need to, choosing a single-byte encoding that doesn’t mutate 0x80…0xff. There are many of these; most of the Latin-{1…6} sequence (aka ISO-8859-1…10) has this property. What you do not want to do is pick utf-8 or any of the multibyte Asian encodings. Latin-1 will do fine; in fact it has an advantage over the others in memory consumption, which we’ll describe below.
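As a rough illustration of the pass-through trick quoted above (my own sketch, not code from the linked page): the sample bytes and the `;`-separated layout are made up for the example. It decodes with Latin-1, tokenizes on ASCII markers only, and re-encodes to get the original bytes back.

```python
# Arbitrary bytes in an unknown encoding; only the ASCII markers ';' and '='
# matter to us. The sample data is invented for this example.
raw = b"name=caf\xe9;blob=\x80\x9f\xff;end"

# Latin-1 maps every byte 0x00..0xff to the code point with the same number,
# so this decode cannot fail and loses nothing.
text = raw.decode("latin-1")

# Tokenize on ASCII markers; the high bytes are just along for the ride.
parts = text.split(";")
blob = parts[1].split("=", 1)[1].encode("latin-1")  # back to the raw bytes

# Re-joining and re-encoding reproduces the input exactly.
round_tripped = ";".join(parts).encode("latin-1")

# By contrast, UTF-8 rejects this input outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(blob, round_tripped == raw, utf8_ok)
```

The strings produced this way are only containers for bytes: anything in the 0x80…0xff range should not be displayed or interpreted as text unless the data really is Latin-1.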

Whether depending on this is actually correct or not is beyond me, but it seems like people have actually been using that pass-through behavior in practice and put it into things like Python2 -> 3 migration guides.

The first link suggests that the seemingly undefined ranges are valid as C0 and C1 control codes, which may be why it doesn’t throw errors.
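One way to check the control-code point in Python itself (a small sketch using only the standard `unicodedata` module; it says nothing about how other languages or tools treat these bytes) is to decode all 256 byte values as Latin-1 and look at which ones Unicode classifies as control characters:

```python
import unicodedata

# Decode every possible byte value with Latin-1.
decoded = bytes(range(256)).decode("latin-1")

# Each byte becomes the code point with the same number.
same_numbers = [ord(c) for c in decoded] == list(range(256))

# Collect the code points in Unicode general category "Cc" (control).
controls = [ord(c) for c in decoded if unicodedata.category(c) == "Cc"]

print(same_numbers)
print([hex(c) for c in controls])
```

The control characters turn out to be 0x00…0x1f (C0), 0x7f (DEL), and 0x80…0x9f (C1), so every byte has a defined code point to land on and there is nothing for the decoder to reject.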

e0qdk@reddthat.com · 1 month ago

I don’t know how to do it with KDE’s tools, but on the command line with ffmpeg you can do something like this:

ffmpeg -i video_track.mp4 -i audio_jp.m4a -i audio_en.m4a -map 0:v -map 1:a -map 2:a -metadata:s:a:0 language=jpn -metadata:s:a:1 language=eng -c:v copy -c:a copy output.mp4

Breaking it down, it:

runs ffmpeg
with three inputs (-i flag) – a video file, and two audio files.
The streams are explicitly mapped into the result, counting the inputs from 0 – i.e. -map 0:v maps input 0 (the first file) as video (v) to the output file and -map 1:a maps the next input as audio (a), etc.
It sets the metadata for the audio tracks: -metadata:s:a:0 language=jpn sets the first audio track (again counting from 0…) to Japanese; the second metadata option sets the next audio track to English.
-c:v copy specifies that the video stream should be copied as-is (i.e. don’t re-encode – remove this if you DO need to re-encode)
-c:a copy specifies that the audio streams should be copied as-is (i.e. don’t re-encode – remove this if you DO need to re-encode)
output.mp4 – finally, list the name of the file you want the result written into.
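If you end up doing this for more files or more languages, the same command can be assembled as an argument list in Python. This is just a sketch: `build_mux_command` is a name I made up, and it only reproduces the flags from the command above rather than anything else ffmpeg supports.

```python
import subprocess  # only needed if you uncomment the run() call below


def build_mux_command(video, audio_tracks, output):
    """Build the ffmpeg argument list for one video plus N audio tracks.

    audio_tracks is a list of (path, language_code) pairs in output order.
    Hypothetical helper; mirrors the command-line example above.
    """
    cmd = ["ffmpeg", "-i", video]
    for path, _lang in audio_tracks:
        cmd += ["-i", path]              # inputs 1, 2, ... are the audio files
    cmd += ["-map", "0:v"]               # video comes from input 0
    for n in range(len(audio_tracks)):
        cmd += ["-map", f"{n + 1}:a"]    # audio from inputs 1..N
    for n, (_path, lang) in enumerate(audio_tracks):
        # Output audio streams are counted from 0.
        cmd += [f"-metadata:s:a:{n}", f"language={lang}"]
    cmd += ["-c:v", "copy", "-c:a", "copy", output]  # no re-encoding
    return cmd


cmd = build_mux_command(
    "video_track.mp4",
    [("audio_jp.m4a", "jpn"), ("audio_en.m4a", "eng")],
    "output.mp4",
)
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually run ffmpeg
```

Passing a list to `subprocess.run` avoids shell quoting problems with file names that contain spaces.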

See documentation here: https://ffmpeg.org/ffmpeg.html

If you need another language in the future, I think the language abbreviations are the three-letter codes from here: https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes – but I’m not certain on that.