Web automation, e.g. AT&T's "Unified Messaging" (voicemail) and downloading all that content. - BALUG-Talk

19 Nov 2024


Ah, lovely web automation!  :-)
So, lately I gave myself a little mini-project:
AT&T's "Unified Messaging" (voicemail).  I wanted to "cut the cord" -
bye-bye landline - porting ye olde landline # to mobile.
But first, I wanted to download all of my content from
AT&T's "Unified Messaging" (voicemail).
AT&T's "Unified Messaging" (UM/um), in addition to ye olde phone DTMF
("Touch Tone") interface to the voicemail, also has a web interface.
So, web interface.  It essentially works as a web GUI to email in an
"INBOX": messages are stored as email, and within an email item the
voicemail is a .wav attachment, plus a text attachment having the
transcript as its body - which will generally be empty if it wasn't
able to transcribe it.  And generally an html attachment, an html
version of that text attachment.
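
To make that message layout concrete, here's a minimal sketch (in
Python with only the standard library, not the Perl the actual program
is written in) of pulling those three pieces out of one raw message.
The MIME types checked (audio/*, text/plain, text/html), the function
names, and the made-up demo message are all assumptions for
illustration; the real AT&T messages may be laid out differently.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_voicemail(raw: bytes) -> dict:
    """Return the parts found in one raw message, keyed by file extension.

    Assumed (not confirmed) MIME types: audio/* for the voicemail,
    text/plain for the transcript, text/html for its html version.
    An empty transcript is skipped, so no '.txt' key is produced for it.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    parts = {}
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        ctype = part.get_content_type()
        if part.get_content_maintype() == "audio":
            parts[".wav"] = payload
        elif ctype == "text/plain" and payload.strip():
            parts[".txt"] = payload
        elif ctype == "text/html":
            parts[".html"] = payload
    return parts


def demo_message(transcript: str = "Hello, please call me back.\n") -> bytes:
    """Build a made-up message shaped roughly like the description above."""
    msg = EmailMessage()
    msg["Subject"] = "Voice message (made-up example)"
    msg["From"] = "caller@example.invalid"
    msg.set_content(transcript)
    msg.add_alternative("<p>" + transcript.strip() + "</p>\n", subtype="html")
    # Placeholder bytes, not a real WAV file.
    msg.add_attachment(b"RIFF....WAVE", maintype="audio", subtype="wav",
                       filename="voice.wav")
    return msg.as_bytes()


if __name__ == "__main__":
    for ext, data in split_voicemail(demo_message()).items():
        print(ext, len(data), "bytes")
```

Writing each value out under a common base name plus its extension
would give the per-message set of files described further down.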
And, "of course", Perl also has the lovely WWW::Mechanize.
So ... I got to programming.  mitmproxy was also handy to figure out
some bits going on within the SSL/TLS communications between client
(e.g. web browser) and AT&T server(s).
And got the key bits of that finished up
this past Sunday.  And got 'er all nicely downloaded.
$ um.att.com
um.att.com: Inbox is empty.  Exiting
$
That's what it outputs at the end, when there's nothing left to
download.
It also handles deleting the "email" item (message and related) from the
AT&T "INBOX" once it's successfully downloaded.
$ cd ~/.um.att.com.d/data
$ ls -A1 | sed -e 's/^.*\././' | sort | uniq -c | sort -k 1,1bnr -k 2,2
    117 .eml
    117 .wav
    113 .txt
    112 .html
$
Very nicely handles it all.
.eml is the full raw "original" email as AT&T has it in the "INBOX",
.wav files are the raw audio portion thereof,
.txt the text transcript (or no file if that part was empty), and
.html the html equivalent of that text.
Ah, I was wondering about why one less .html than .txt ... peeking
further, the .txt has:
Message too short for transcription
And that original .eml has no html part, and the .wav ... yeah, no words
in that audio.
Alas, I didn't clean out quite all the junk before downloading
everything ... and the slight mismatch makes that bit of
junk pretty easy to spot ... likewise grep on the .txt files is rather
handy.
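
For spotting exactly which messages are short a file, a small sketch
(Python, standard library only) that groups file names by base name and
reports the extensions each one lacks.  It assumes all files from one
message share the same base name and differ only in extension; the
function names and the sample names are made up.

```python
from collections import defaultdict
from pathlib import Path

EXPECTED = (".eml", ".wav", ".txt", ".html")


def extensions_by_stem(names):
    """Map each base name (text before the last dot) to its set of extensions."""
    by_stem = defaultdict(set)
    for name in names:
        p = Path(name)
        if p.suffix:  # ignore names with no extension
            by_stem[p.stem].add(p.suffix)
    return by_stem


def missing_extensions(names, expected=EXPECTED):
    """Return {base name: sorted list of expected extensions it lacks}."""
    result = {}
    for stem, exts in extensions_by_stem(names).items():
        lacking = sorted(set(expected) - exts)
        if lacking:
            result[stem] = lacking
    return result


if __name__ == "__main__":
    sample = ["a.eml", "a.wav", "a.txt", "a.html",
              "b.eml", "b.wav", "b.txt"]
    print(missing_extensions(sample))
```

Feeding it os.listdir() of the data directory would list the messages
lacking a .txt and/or .html.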
So, the file names start with ISO date and time, which is derived from
the Date: header, which is the timestamp of when the end of the message
was received.  Likewise that same time data is used to set the mtime on
the files.  File names also contain data from the Subject: and From:
fields, generally identifying caller name/number, or, when the caller
wasn't identified (CNID), unknown caller / Identity Withheld, e.g.:
... unknown caller ... Identity withheld <unknown_caller...>
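
The Date:-header-to-file-name-and-mtime step can be sketched roughly
like so (again Python standard library, not the actual Perl).  The
exact ISO format used here (datetime.isoformat()), the function names,
and the sample header are assumptions; the real program's file name
format may well differ.

```python
import os
from email import message_from_bytes, policy


def date_header_datetime(raw: bytes):
    """Parse the message's Date: header into a datetime."""
    msg = message_from_bytes(raw, policy=policy.default)
    # With policy.default the Date header object exposes .datetime
    return msg["Date"].datetime


def iso_prefix(when) -> str:
    """ISO 8601 date and time, to use as the leading part of a file name."""
    return when.isoformat()


def set_mtime(path: str, when) -> None:
    """Set both atime and mtime of path to the message timestamp."""
    t = when.timestamp()
    os.utime(path, (t, t))


if __name__ == "__main__":
    # Made-up header for illustration
    raw = (b"Date: Tue, 19 Nov 2024 10:15:30 -0800\r\n"
           b"Subject: made-up example\r\n\r\nbody\r\n")
    print(iso_prefix(date_header_datetime(raw)))
```

With the mtime set that way, ls -lt and friends sort the files by when
the message was received rather than when it was downloaded.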
The program itself:
https://www.mpaoli.net/~michael/bin/um.att.com
Ah, one of these days I need to tweak Apache configuration so it
"knows", e.g. that file (and that name and location), can be handled
like plain text, not a binary.  Yeah, I know there's a "magic" type
option that can read the files and make intelligent guess on that, but
that's excessive overhead for most cases - so really need to just
configure the exceptions ... down to directory or even per-file basis.
(On my to-do list ... with thousands of other items yet to be done ...
at least maybe when I get around to it).
And ... maybe even others might find it handy, or a handy starting point.
Though this one was done almost / mostly as a one-off/one-shot.
Though until the number completes being ported over, very handy to still
check if anything has shown up there, and download it if so.
It might need some adjustments to handle some other email messages,
e.g. the ones from AT&T about the INBOX being nearly full.  Looks
like I probably won't have need for that (nor have example data to
match it to and test it on).  And I didn't handle the more general
email case (which I think UM will also accept and have in "INBOX"),
as I only ever used UM for voicemail.