DOS vs UNIX Line Endings

BeebMaster
Posts: 3970
Joined: Sun Aug 02, 2009 5:59 pm
Location: Lost in the BeebVault!
Contact:

DOS vs UNIX Line Endings

Post by BeebMaster »

I think this is a very old problem, but one that has only just started to bug me.

I've been on Linux for about 12 years now, so all my text files have had UNIX line endings since then (ASCII 10 line end character). I've never even really noticed it until I adapted my (ARMBASIC) HTML generator for my website to run on 6502 BASIC as well. I thought it would be nice if I could edit the caption files that are fed into the HTML generator on a Beeb as well as doing the generation itself.

So I loaded up my template in Edit, and there are no line breaks at all; it's just all the text littered with CTRL-J characters, so it's not readable and not editable. In Wordwise Plus, View and Inter-Word it's even worse, because the control characters just show as spaces, so I have no idea where the line breaks should be.

*TYPEing it displays it correctly.

I made a version of the template with DOS line endings (ASCII 13 followed by ASCII 10), which is a bit better. In Edit it shows the CTRL-J characters at the beginning of each line, and in the others these are just shown as a space as before, at the beginning of the lines.

I don't really want to move everything to DOS line endings, not least because it will break my HTML generator which can't cope correctly with two consecutive end of line characters which are only supposed to be a single line break. The generator itself outputs ASCII 13 line breaks, so maybe I've unwittingly created my own standard! That seems sensible and normal to me, and displays correctly when loaded in a Beeb editor or word-processor, or *TYPEd, or loaded in a Linux text editor.

UNIX line endings can be converted on RISC OS in !Edit using the CR<>LF facility, which is handy to aid readability, but it modifies the file to change from UNIX to DOS endings, so that's not all that satisfactory either.

Is there any way on a Beeb I can edit a text file which will correctly display UNIX line endings as a new-line?

Or is there a way on a Linux text editor (Kate, Gedit etc) for me to specify my own (per-file preferably) line endings so I can just use the "BM Standard" of ASCII 13?
BeebMaster
Posts: 3970
Joined: Sun Aug 02, 2009 5:59 pm
Location: Lost in the BeebVault!
Contact:

Re: DOS vs UNIX Line Endings

Post by BeebMaster »

A solution has presented itself, which often happens when writing down the problem.

I did a regex replace \n with \r and whilst it jumbles everything up, after re-loading in Kate, the file displays correctly, and will display correctly on the Beeb. My HTML generator always checks for either CHR$10 or CHR$13 so the new "BeebMaster Standard" line endings don't affect it.
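For anyone wanting to do the same replace outside an editor, here is a minimal Python sketch of both halves: turning LF into CR, and reading lines back by treating either character as a break, as the generator is described as doing. The function names and the sample caption text are made up for illustration; this is not the generator's actual code.

```python
def lf_to_cr(data: bytes) -> bytes:
    """Replace every LF (ASCII 10) with CR (ASCII 13)."""
    return data.replace(b"\n", b"\r")


def split_lines(data: bytes) -> list[bytes]:
    """Split on CR or LF, counting each single character as one line break."""
    return data.replace(b"\n", b"\r").split(b"\r")


# Hypothetical caption text with UNIX line endings.
unix_text = b"First caption\nSecond caption"
beeb_text = lf_to_cr(unix_text)

print(beeb_text)
print(split_lines(beeb_text))
```

Because each CR or LF counts separately, a DOS-style CR LF pair comes out as a line break plus an empty line, which fits the earlier remark about two consecutive end-of-line characters causing trouble.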
scruss
Posts: 336
Joined: Sun Jul 01, 2018 4:12 pm
Location: Toronto
Contact:

Re: DOS vs UNIX Line Endings

Post by scruss »

BeebMaster wrote:
Sat Nov 14, 2020 11:22 am
Or is there a way on a Linux text editor (Kate, Gedit etc) for me to specify my own (per-file preferably) line endings so I can just use the "BM Standard" of ASCII 13?
Kate claims to have line ending auto detection, but I've never used it. Gedit (3.36.2, at least) silently keeps whatever line endings it was fed, offering to change under "Save As". Emacs does too, but you probably don't want to go there.

The traditional way of doing this is using the unix2(dos|mac) / (mac|dos)2unix tools. On this Ubuntu system, they're in the dos2unix package. Capabilities vary from system to system, and the dos2unix I used on Solaris systems in 1997 was very different from the one I've got here. Common to all of them is that they silently overwrite files:

Code: Select all

unix2mac file.txt
will convert file.txt from LF line endings to CR endings, overwriting the original. Gotta be careful with that. The version I have uses some unusual conventions, with '-o' doing something quite unexpected. A sensible invocation might be:

Code: Select all

unix2mac -n unixfile.txt macfile.txt
`macfile.txt` is created, while `unixfile.txt` is left untouched.

None of these utilities add the expected Ctrl-Z at end of file that CP/M needs, but you probably don't need that.
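If the tools aren't installed, the "write a new file, leave the original alone" behaviour of the `-n` form can be approximated in a few lines of Python. This is only a sketch under simple assumptions: the input is assumed to use LF endings throughout, the function name is invented, and refusing to overwrite an existing output file is my own choice rather than anything unix2mac does.

```python
from pathlib import Path


def convert_to_new_file(src: Path, dst: Path, newline: bytes = b"\r") -> None:
    """Copy src to dst, replacing each LF with `newline`; src is never modified."""
    if dst.exists():
        # Deliberately cautious: never clobber an existing file.
        raise FileExistsError(f"refusing to overwrite {dst}")
    dst.write_bytes(src.read_bytes().replace(b"\n", newline))
```

Calling `convert_to_new_file(Path("unixfile.txt"), Path("macfile.txt"))` would then mirror the second command above, and passing `newline=b"\r\n"` gives DOS endings instead.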
guesser
Posts: 548
Joined: Mon Jun 26, 2006 10:21 pm
Contact:

Re: DOS vs UNIX Line Endings

Post by guesser »

BeebMaster wrote:
Sat Nov 14, 2020 11:22 am
Or is there a way on a Linux text editor (Kate, Gedit etc) for me to specify my own (per-file preferably) line endings so I can just use the "BM Standard" of ASCII 13?
It's more of a per-project than per-file thing, but specifying what an editor should use for line endings in different types of file is one of the things EditorConfig does: https://editorconfig.org/
Coeus
Posts: 2024
Joined: Mon Jul 25, 2016 12:05 pm
Contact:

Re: DOS vs UNIX Line Endings

Post by Coeus »

scruss wrote:
Sat Nov 14, 2020 5:53 pm
None of these utilities add the expected Ctrl-Z at end of file that CP/M needs, but you probably don't need that.
See https://github.com/SteveFosdick/Utils The utilities txt2bbc, txt2cpm, and txt2dos all write files with the corresponding line ending and, in the case of txt2cpm, with the ^Z at the end. They also don't need you to specify what the line endings are already, and use a common, simple finite state machine to handle any of CR only, LF only and CR/LF.
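To show the idea rather than the real code, here is a rough Python sketch of a two-state machine that accepts CR, LF or CR/LF and writes whichever ending is asked for. It is not taken from the Utils repository; the function name and the `eol` parameter are assumptions made for this example.

```python
def normalise(data: bytes, eol: bytes = b"\r") -> bytes:
    """Rewrite CR, LF or CR LF line endings as `eol`, one byte at a time."""
    out = bytearray()
    after_cr = False  # state: was the previous byte a CR?
    for byte in data:
        if byte == 13:            # CR always ends a line
            out += eol
            after_cr = True
        elif byte == 10:          # LF ends a line unless it completes a CR LF pair
            if not after_cr:
                out += eol
            after_cr = False
        else:                     # ordinary byte: copy it through
            out.append(byte)
            after_cr = False
    return bytes(out)


mixed = b"dos\r\nunix\nbeeb\rend"
print(normalise(mixed))             # CR-only output
print(normalise(mixed, b"\r\n"))    # DOS-style output
```

The only state needed is "did I just see a CR?", which is why the input format never has to be specified up front.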

Also, we tend to think of CR-only as the standard text file format of the BBC (like classic Mac OS), but there wasn't really a standard at the start as far as I can tell. The BBC Micro lacked a supplied text editor until the Master. BASIC does not make it at all easy to use text files and has its own format for PRINT# and INPUT#. *EXEC uses CR only, but only because that's the ASCII code the Return key generates and thus is used by OSWORD 0 to terminate an input line. *BUILD uses CR because it is usually used to create files for *EXEC, whereas *SPOOL copies everything sent to OSWRCH and so ends up with CR/LF as the line endings. The BCPL editor uses that format too.
Last edited by Coeus on Sat Nov 14, 2020 7:18 pm, edited 1 time in total.
Coeus
Posts: 2024
Joined: Mon Jul 25, 2016 12:05 pm
Contact:

Re: DOS vs UNIX Line Endings

Post by Coeus »

scruss wrote:
Sat Nov 14, 2020 5:53 pm
Kate claims to have line ending auto detection, but I've never used it. Gedit (3.36.2, at least) silently keeps whatever line endings it was fed, offering to change under "Save As". Emacs does too, but you probably don't want to go there.
Geany (GTK-based, light IDE) also deals with different line endings, as does Notepad++ on Windows. It's a feature I'd expect most modern text editors to have (though I suspect the original Notepad still doesn't), even to the point that if an editor you like is open source and doesn't have this feature, I'd raise a bug and then maybe consider submitting a patch/pull request.
Richard Russell
Posts: 2071
Joined: Sun Feb 27, 2011 10:35 am
Location: Downham Market, Norfolk
Contact:

Re: DOS vs UNIX Line Endings

Post by Richard Russell »

Coeus wrote:
Sat Nov 14, 2020 7:01 pm
there wasn't really a standard at the start as far as I can tell.
Doesn't OSNEWL at &FFE7 (which as far as I know has been in the MOS from the start) effectively set the standard as LFCR? It's quite unusual (LF, CRLF and CR probably all being more common).
I am suffering from 'cognitive decline' and depression. If you have a comment about the style or tone of this message please report it to the moderators by clicking the exclamation mark icon, rather than complaining on the public forum.
1024MAK
Posts: 10544
Joined: Mon Apr 18, 2011 5:46 pm
Location: Looking forward to summer in Somerset, UK...
Contact:

Re: DOS vs UNIX Line Endings

Post by 1024MAK »

There was definitely no standard line ending control character in the past. That’s why dot matrix printers often had a DIP switch to select LF or CR or both (also auto advance or ignore).

Really, both should be used, as LF and CR are different things... But of course, that used up more valuable memory, so most of the time, instead only one control code was used.

And every computer system/manufacturer did their own thing (Sinclair’s ZX80 and ZX81 used 0x76, which is the Z80 HALT instruction code).

Mark
scruss
Posts: 336
Joined: Sun Jul 01, 2018 4:12 pm
Location: Toronto
Contact:

Re: DOS vs UNIX Line Endings

Post by scruss »

Coeus wrote:
Sat Nov 14, 2020 7:01 pm
See https://github.com/SteveFosdick/Utils The utilities txt2bbc, txt2cpm, and txt2dos all write files with the corresponding line ending …
Thanks. I've been making do with an awk one-liner and requiring LF eols because who wouldn't?

Code: Select all

awk '{printf("%s\r\n", $0);} END {printf("%c", 26);}'
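For comparison, a Python sketch that should do the same job as the awk one-liner: every LF-terminated line gets CR LF, and a single Ctrl-Z (byte 26) is appended at the end. Like the awk version it assumes LF-only input; the function name is made up for the example.

```python
def unix_to_cpm(data: bytes) -> bytes:
    """Turn LF-terminated lines into CR LF lines and append Ctrl-Z (0x1A)."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        # A trailing LF ends the last line; it does not start an empty one.
        lines.pop()
    return b"".join(line + b"\r\n" for line in lines) + b"\x1a"


print(unix_to_cpm(b"10 PRINT \"HELLO\"\n20 END\n"))
```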
1024MAK wrote:
Sat Nov 14, 2020 8:32 pm
And every computer system/manufacturer did their own thing (Sinclair’s ZX80 and ZX81 used 0x76, which is the Z80 HALT instruction code).
The ZX80/81 weren't even close to ASCII in any way, something I only found out recently. And I really don't want to know what my PDP-8 clone uses internally. In OS-8 BASIC, it does have a sort-of ASCII code function, but A-Z is way down where I'd expect control characters to be.
sweh
Posts: 2361
Joined: Sat Mar 10, 2012 12:05 pm
Location: New York, New York
Contact:

Re: DOS vs UNIX Line Endings

Post by sweh »

I use the `tr` command:

Code: Select all

tr '\012' '\015' < unixfile > beebfile
tr '\015' '\012' < beebfile > unixfile
Rgds
Stephen
jgharston
Posts: 4299
Joined: Thu Sep 24, 2009 12:22 pm
Location: Whitby/Sheffield
Contact:

Re: DOS vs UNIX Line Endings

Post by jgharston »

From BBC BASIC you can read lines of text agnostic to line endings with the FNrd() function in StringIO, adapted from Richard's original:

Code: Select all

   90 REM rd(in%) - Read a <cr>, <lf>, <cr><lf>, <lf><cr> or <eof> terminated string from in%
  100 REM -----------------------------------------------------------------------------------
  110 DEFFNrd(i%):LOCALA%,B%,A$:REPEAT:A%=BGET#i%:IFA%<>10ANDA%<>13:A$=A$+CHR$A%
  120 UNTILA%=10ORA%=13OREOF#i%:IFNOTEOF#i%:B%=BGET#i%:IFA%=B%OR(B%<>13ANDB%<>10):PTR#i%=PTR#i%-1
  130 =A$
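For readers more at home on the Linux side, the same logic can be sketched in Python: read up to the first CR or LF, then peek at the next byte and step back unless it is the other terminator of a pair. This is a loose translation for illustration, not part of StringIO; the name `read_line` is invented and the stream is assumed to be seekable.

```python
import io


def read_line(stream: io.BufferedIOBase) -> bytes:
    """Read one line ended by CR, LF, CR LF, LF CR or end of file."""
    line = bytearray()
    while True:
        ch = stream.read(1)
        if ch == b"":                  # end of file ends the line
            return bytes(line)
        if ch in (b"\n", b"\r"):
            break
        line += ch
    nxt = stream.read(1)
    # Keep nxt consumed only when it is the other half of a two-byte ending;
    # a repeated terminator or an ordinary byte belongs to the next line.
    if nxt and (nxt == ch or nxt not in (b"\n", b"\r")):
        stream.seek(-1, io.SEEK_CUR)
    return bytes(line)


sample = io.BytesIO(b"dos\r\nunix\nbeeb\rlast")
print([read_line(sample) for _ in range(4)])
```

As in the BASIC version, two identical terminators in a row are treated as two line ends, so blank lines survive.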

Code: Select all

$ bbcbasic
PDP11 BBC BASIC IV Version 0.32
(C) Copyright J.G.Harston 1989,2005-2020
>_
Bobbi
Posts: 605
Joined: Thu Sep 24, 2020 12:32 am
Contact:

Re: DOS vs UNIX Line Endings

Post by Bobbi »

I run into this all the time (Apple II uses CR (13) as the line ending for example.) My solution is to use the Linux command line util tr.

tr '\r' '\n' < infile > outfile will convert CR to LF.

tr '\n' '\r' < infile > outfile will do the reverse.

MS-DOS CR+LF endings are more of a pain in the ass to deal with. There are lots of editors that can load one and save in a different format.

EDIT: Should have read all the thread before responding. @sweh beat me to it with tr. Seconded!
Bobbi
Posts: 605
Joined: Thu Sep 24, 2020 12:32 am
Contact:

Re: DOS vs UNIX Line Endings

Post by Bobbi »

On another subject, I think the PDP-8 usually packs two six-bit chars in a 12-bit word. Nothing like ASCII.

PDP-10 has 36-bit words and packs six chars to a word.
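As a rough illustration of six characters per 36-bit word, here is a Python sketch. It assumes the DEC SIXBIT convention, where each code is the ASCII value minus 32 and only the characters from space to underscore are representable; whether a given system used exactly this scheme is an assumption, and the function names are invented.

```python
def pack_sixbit(text: str) -> int:
    """Pack up to six characters into one 36-bit word, first character highest."""
    text = text.upper()
    if len(text) > 6:
        raise ValueError("at most six characters fit in a 36-bit word")
    word = 0
    for ch in text.ljust(6):           # pad with spaces, which encode as 0
        code = ord(ch) - 32            # assumed SIXBIT mapping: ASCII minus 32
        if not 0 <= code < 64:
            raise ValueError(f"{ch!r} has no six-bit code")
        word = (word << 6) | code
    return word


def unpack_sixbit(word: int) -> str:
    """Recover the six characters from a 36-bit word."""
    return "".join(chr(((word >> shift) & 0o77) + 32) for shift in range(30, -1, -6))


word = pack_sixbit("BEEB")
print(oct(word), repr(unpack_sixbit(word)))
```

Note that there is no room in a 64-character set for control codes at all, so a line ending has to be represented some other way on such systems.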
BeebMaster
Posts: 3970
Joined: Sun Aug 02, 2009 5:59 pm
Location: Lost in the BeebVault!
Contact:

Re: DOS vs UNIX Line Endings

Post by BeebMaster »

Thanks for all the replies. Like I said at the beginning, I haven't really ever given this much thought before. I suppose I always assumed line-endings were &D because that's what the Beeb seems to do, and it isn't all that often you have to look at a hex dump of a text file generated elsewhere. Sounds like some company called Apple have beaten me to it with &D endings, but whatever became of them, eh, so I'm still claiming to have invented "BeebMaster line endings"!

For about 12 years I always used Gedit on Linux, and as has been noted, at time of save it gives you the choice between DOS or UNIX line endings (and also encoding). However, the death-knell was sounded when they took away the File menu, and then I was finding it struggling to load large text files (dmesg dumps or BeebSCSI logs etc), so I looked for something else and settled on Kate. Line-endings and encoding have to be set in the preferences, so it's not as easy to switch them per-file. It actually has a 6502 assembler display mode! For now doing a regex replace of \n with \r seems to work, and it survives a save, so I can use that on a Beeb and have it display nicely. (Actually, I think Kate auto-converts \r back to \n on load, because if you repeat the regex, it will do it again, but it doesn't seem to spoil the saved file).

On the Master, I've been using Edit, but it doesn't have on-screen word wrap. View doesn't automatically wrap to the margins even using READ to load the file, and after using FORMAT it seems to muck things up a bit. Even Inter-Word has let me down, as it skips the control character I use as a delimiter in the text file (|) when spooling the output, so I can't use that either. Probably that was a bad choice of delimiter, but it's too late now. It's a real shame, as it was looking good in 106-characters-per-line mode.

Might end up having to write my own Beeb text editor. When I've got a couple of years to spare. I did start writing my own word-processor once upon a time, but got stuck with how to manage text being inserted in the middle of existing text.
Bobbi
Posts: 605
Joined: Thu Sep 24, 2020 12:32 am
Contact:

Re: DOS vs UNIX Line Endings

Post by Bobbi »

I wrote a text editor for the Apple II during the summer. It is not a trivial task. (If you're curious the source code is here ... https://github.com/bobbimanners/emaille ... pps/edit.c)

For Linux, you may want to give Sublime Text a try. You can download it here: https://www.sublimetext.com/ People keep recommending it to me, but I am a vi diehard.
Coeus
Posts: 2024
Joined: Mon Jul 25, 2016 12:05 pm
Contact:

Re: DOS vs UNIX Line Endings

Post by Coeus »

1024MAK wrote:
Sat Nov 14, 2020 8:32 pm
Really, both should be used, as LF and CR are different things... But of course, that used up more valuable memory, so most of the time, instead only one control code was used.
So that's an output-centric view of a text file, i.e. something that can be copied byte for byte to an output device such as a screen or printer and have it display correctly. Looking from an input perspective, users expect to hit a single key to signal end of line, and the usual key, Return, generates CR, so to have a single convention there has to be some translation going on somewhere. CR could be expanded to CR/LF on input, upon writing the file to disc, or upon output. The BBC Micro definitely leans towards the "translate on output" option with the provision of OSASCI, though as I said earlier there are exceptions, so it doesn't seem to me to be a strong convention in the way that CRLF is for CP/M, DOS and Windows. Mark already mentioned printers which accept a variety of line endings.
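To make "translate on output" concrete, here is a small Python model of the behaviour as described in this thread: a CR-only file is passed through an OSASCI-like routine, which hands CR to an OSNEWL-like routine that emits LF then CR, and passes everything else straight through. It is a toy model of the idea, not MOS code, and the Python function names merely borrow the routine names.

```python
def osnewl(out: bytearray) -> None:
    """Model of OSNEWL: send LF then CR to the output stream."""
    out += b"\n\r"


def osasci(out: bytearray, byte: int) -> None:
    """Model of OSASCI: expand CR into a newline, pass other bytes unchanged."""
    if byte == 13:
        osnewl(out)
    else:
        out.append(byte)


def type_file(data: bytes) -> bytes:
    """Return the byte stream the screen driver would see for a CR-only file."""
    out = bytearray()
    for byte in data:
        osasci(out, byte)
    return bytes(out)


print(type_file(b"LINE ONE\rLINE TWO\r"))
```

The stored file keeps one byte per line end, and the two-byte sequence only ever exists on the way to the screen or printer.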

Unix's choice of LF (which it calls newline) seems strange at first as this requires translation both on input and on output (which is done by the terminal driver). My guess as to why is that CR on its own is useful for overstrike, i.e. to print a line in bold one can send the print head of an impact printer back to the left and print the line again, or even print parts of it again for selective bold. Characters can also be combined this way, for example to get accents or underline. Advancing to the next line without returning the print head to the left is less useful.

It is also worth remembering that this "stream of bytes" view of text files, or even all files, is far from universal. The APIs of mainframe operating systems often presented a file as a sequence of records, which may be fixed or variable length, where for text files one record would be a line.
1024MAK
Posts: 10544
Joined: Mon Apr 18, 2011 5:46 pm
Location: Looking forward to summer in Somerset, UK...
Contact:

Re: DOS vs UNIX Line Endings

Post by 1024MAK »

Coeus wrote:
Mon Nov 16, 2020 10:20 am
1024MAK wrote:
Sat Nov 14, 2020 8:32 pm
Really, both should be used, as LF and CR are different things... But of course, that used up more valuable memory, so most of the time, instead only one control code was used.
So that's an output-centric view of a text file, i.e. something that can be copied byte for byte to an output device such as a screen or printer and have it display correctly.
Yeah, agree. But as with a lot of things, it’s a hangover from ASCII control codes and teletypewriters/terminals and line printers...

Really there should have been an open international agreement/standard... (because if there is not the ‘right’ standard, just add your own :lol: ).

Mark
Richard Russell
Posts: 2071
Joined: Sun Feb 27, 2011 10:35 am
Location: Downham Market, Norfolk
Contact:

Re: DOS vs UNIX Line Endings

Post by Richard Russell »

Coeus wrote:
Mon Nov 16, 2020 10:20 am
the usual key, Return, generates CR
I'm not sure that it "generates" CR, at least not on the modern platforms I mostly deal with. Typically pressing the Enter key results in a 'key down' event with the key being identified by a symbolic constant (e.g. VK_ENTER) that could in principle be anything. When I receive that event I do write CR into the input buffer, since that's what BBC BASIC programs expect, but that's a choice of the application program rather than something determined by the OS.
1024MAK
Posts: 10544
Joined: Mon Apr 18, 2011 5:46 pm
Location: Looking forward to summer in Somerset, UK...
Contact:

Re: DOS vs UNIX Line Endings

Post by 1024MAK »

Coeus wrote:
Mon Nov 16, 2020 10:20 am
Return, generates CR ...
But does it do this on all systems? I’m not sure. It’s entirely possible that on some systems the LF code is generated instead. And that assumes that a control code is actually generated and placed in the text stream/buffer/file in the first place.

Heck on some keyboards the ‘Return’ key is called “Enter” and Sinclair being different (as usual) called it “New Line” on some of their computers.

[BTW I’ve not actually researched this, I’m just asking the questions, because never assume anything, other than, it’s likely there is a different way of doing something...]

Edit: Richard got in while I was editing.

Mark
Coeus
Posts: 2024
Joined: Mon Jul 25, 2016 12:05 pm
Contact:

Re: DOS vs UNIX Line Endings

Post by Coeus »

Richard Russell wrote:
Mon Nov 16, 2020 11:30 am
I'm not sure that it "generates" CR, at least not on the modern platforms I mostly deal with. Typically pressing the Enter key results in a 'key down' event with the key being identified by a symbolic constant (e.g. VK_ENTER) that could in principle be anything. When I receive that event I do write CR into the input buffer, since that's what BBC BASIC programs expect, but that's a choice of the application program rather than something determined by the OS.
Again, it is not universal. GUI environments tend to present keystrokes as events and indeed the keycode may be unrelated to ASCII and may already have been through various translations, but an ASCII (or Unicode) translation may also be part of that event or available separately - something that is very useful for the letter keys.

CR for the Return key probably comes from the asynchronous serial terminals, possibly dating back to the teletype, and was certainly the case in the VT100 era. CP/M and Unix both seem to be written around this serial terminal idea, even when no physically separate terminal is present, and the BBC Micro continues this idea in having all screen drawing be the result of a stream of bytes sent to the VDU driver rather than a series of procedure calls made by the application program. OSRDCH returns CR for the Return key, i.e. even when reading a character at a time. In most GUI environments I would expect the translation from the keycode in a keydown event for the Return key to a character to result in CR, though other options are possible.
Richard Russell
Posts: 2071
Joined: Sun Feb 27, 2011 10:35 am
Location: Downham Market, Norfolk
Contact:

Re: DOS vs UNIX Line Endings

Post by Richard Russell »

Coeus wrote:
Mon Nov 16, 2020 12:03 pm
GUI environments tend to present keystrokes as events and indeed the keycode may be unrelated to ASCII and may already have been through various translations, but an ASCII (or Unicode) translation may also be part of that event.
True. In the systems I'm most familiar with, keys corresponding to 'printing' characters generate events reporting the ASCII code (or more likely Unicode these days) but non-printing keys like Enter, Backspace, Delete etc. are identified only by symbolic constants. Again it's not universal, but commonly pressing Shift modifies the code received from printing keys, but not the ID of non-printing keys (if you want Shift and/or Ctrl to modify the code you have to do that yourself).
BeebMaster
Posts: 3970
Joined: Sun Aug 02, 2009 5:59 pm
Location: Lost in the BeebVault!
Contact:

Re: DOS vs UNIX Line Endings

Post by BeebMaster »

BeebMaster wrote:
Sun Nov 15, 2020 10:57 am
Even Inter-Word has let me down, as it skips the control character I use as a delimiter in the text file (|) when spooling the output, so I can't use that either. Probably that was a bad choice of delimiter, but it's too late now. It's a real shame, as it was looking good in 106-characters-per-line mode.
I fixed that: it's necessary to change the "pad" character (which I think nowadays would more likely be called a hard space), which defaults to the | character. But now I've found something really, really annoying. I have to spool the file when I've finished editing it, so that it's a pure text file rather than an Inter-Word format file, and when doing so it inserts a space before the carriage return at the end of each line! I can't stop it doing that!
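One possible workaround, if the spooled file passes through Linux before being fed to the generator, is to strip that extra space afterwards. A minimal Python sketch, assuming exactly one unwanted space is inserted before each CR (the function name is invented for the example):

```python
def strip_space_before_cr(data: bytes) -> bytes:
    """Remove one space immediately before each CR (ASCII 13)."""
    return data.replace(b" \r", b"\r")


spooled = b"First line \rSecond line \r"
print(strip_space_before_cr(spooled))
```

Only one space per line end is removed, so a line that genuinely ended with a space before spooling would keep it.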