AWS troubleshooting: how to fix a broken EBS volume (bad superblock on xfs)

As great as EBS volumes on Amazon Web Services are, they can break and refuse to ever mount again, even though your data may still be there intact: a simple corruption in the filesystem structure can cause a lot of damage. In this post I show you how to move all that data onto a new EBS drive, so keep calm and read slowly.

So, you try to mount your drive after some updates and you get an error like this in dmesg | tail:

[56439860.329754] XFS (xvdf): Corruption detected. Unmount and run xfs_repair

So you unmount your drive, invoke xfs_repair, and you get this…

$ sudo xfs_repair -n /dev/xvdf
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
..........................................

and no good secondary superblock is found. (Remember that -n runs xfs_repair in no-modify mode; if even the read-only scan can't locate a usable superblock, repairing in place isn't going to get you anywhere.)

Don’t panic; this is what you have to do next to solve the issue:

  1. Go to your AWS dashboard, EC2 section.
  2. Click on “Volumes”
  3. Find the broken volume.
  4. Create a snapshot of the broken volume (this takes a while)
  5. Create a new volume, the same size as (or larger than) your old drive, out of the snapshot you just created (this takes a while)
  6. Attach your new volume to the same EC2 instance (no need to reboot or anything). If the old drive was mapped to /dev/xvdf, the new one will be mapped to /dev/xvdg (notice how the last letter increases alphabetically). If you prefer the command line, there's a rough CLI sketch of steps 4–6 right after this list.
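Here is what steps 4–6 look like from a terminal, assuming you have the AWS CLI installed and configured with credentials for the right account. This is only a sketch: every ID, zone, and size below is a placeholder, and the web console works just as well.

# 4. snapshot the broken volume (all IDs are placeholders)
aws ec2 create-snapshot --volume-id vol-broken123 --description "broken xfs volume rescue"

# 5. once the snapshot completes, create a same-size-or-larger volume from it,
#    in the same availability zone as your instance
aws ec2 create-volume --snapshot-id snap-abc123 --availability-zone us-east-1a --size 200

# 6. attach the new volume to the same instance; it should show up as the next device letter
aws ec2 attach-volume --volume-id vol-new456 --instance-id i-0123456 --device /dev/xvdg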

Now here's a gotcha: Amazon will not create your new drive with the same filesystem type (xfs); for some reason it will come up formatted as ext2.

$ sudo file -s /dev/xvdg
/dev/xvdg: Linux rev 1.0 ext2 filesystem data (mounted or unclean), UUID=2e35874f-1d21-4d2d-b42b-ae27966e0aab (large files)

Here you have two options:
1. Live with the new ext2 filesystem; make sure your /etc/fstab is updated to look something like this:
/dev/xvdg /path/to/mount/to auto defaults,nobootwait,noatime 0 0

or 2. Copy the contents of your drive to a temporary location, usually inside /mnt, which has plenty of space thanks to the ephemeral drive EC2 instances come with; then mkfs.xfs the new volume and copy the contents back (this is what I did, since I chose to create a larger drive and the ext2 filesystem that came on the new volume only recognized the size of the snapshot). A rough sketch of this dance follows below.
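For option 2, the sequence looks roughly like this. It's a sketch under a few assumptions: the new volume is /dev/xvdg, you want it mounted on /data, and /mnt (the ephemeral drive) has enough free space to hold a full copy; adjust paths to your setup.

sudo mkdir -p /mnt/tmp-copy
sudo mount /dev/xvdg /data                 # the ext2 volume built from the snapshot
sudo rsync -a /data/ /mnt/tmp-copy/        # stage everything on the ephemeral drive
sudo umount /data
sudo mkfs.xfs -f /dev/xvdg                 # reformat as xfs (this wipes the volume!)
sudo mount /dev/xvdg /data
sudo rsync -a /mnt/tmp-copy/ /data/        # copy the data back onto the fresh xfs filesystem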

Hope this saved your ass, leave a note if I did.

Remember to never take any irreversible action until you have a disk snapshot; try your best to never lose data.

How to have a Play framework app autostart during boot on Elastic Beanstalk CentOS ec2 instances

So you've created an Elastic Beanstalk environment, and you have a Play framework distribution which you've created using play dist (either on your local environment or right there on the server, whatever you prefer).

play dist outputs a my-app-1.0.zip file which has a self-contained version of your app with all the necessary libraries and a start script.

After you unzip it, you end up with a my-app-1.0/lib/ folder and a start script.

[ec2-user@ip-10-235-8-106 bullq-1.0]$ ls -l
total 24
drwxrwxr-x 2 ec2-user ec2-user 4096 Sep 27 15:35 lib
-rwxrwxr-x 1 ec2-user ec2-user 4328 Sep 27 15:35 start

Make sure the start script is executable: chmod +x start.
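If you're doing this by hand on the instance, the whole unpack-and-test sequence is just a few commands (assuming the zip was already copied up, e.g. with scp, and keeping the my-app-1.0 placeholder name):

unzip my-app-1.0.zip
cd my-app-1.0
chmod +x start
./start        # quick manual test; the script writes a RUNNING_PID file, Ctrl-C to stop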

Now, all of this lives on the first EC2 instance of your Elastic Beanstalk environment. If you're like me and you've used Ubuntu/Debian for your server management, things can be slightly different here, since Amazon chose CentOS for their default image. Here I'll show you how to make your Play app start automatically when the server boots, because you want every new machine that may be instantiated to have your app installed and the service running as soon as the machine is up.

Create a /etc/init.d/myappd script
(I'm using 'myapp' here as an example; your app can be named whatever it's named, so replace accordingly)

#!/usr/bin/env bash
# myappd
# Script to start|stop|restart myappd from /etc/init.d/
# By Gubatron – @gubatron – gubatron@gmail.com

# replace 'myapp' in these variables with the name of your app
PID_FILE=/home/ec2-user/myapp/dist/myapp-1.0/RUNNING_PID
DAEMON_NAME=myappd
DAEMON_PATH=/home/ec2-user/myapp
DAEMON=$DAEMON_PATH/dist/myapp-1.0/start

test -x $DAEMON || exit 0

set -e

function killDAEMON() {
  echo "start kill daemon"
  # the Play start script writes its PID to RUNNING_PID
  kill -9 $(cat $PID_FILE)
  echo "end kill daemon"
}

function removePIDFile() {
  if [ -e $PID_FILE ]
  then
    rm -f $PID_FILE
  fi
}

case $1 in
start)
  removePIDFile
  echo "Starting $DAEMON_NAME... $DAEMON"
  nohup $DAEMON &
  ;;
restart)
  echo "Hot restart of $DAEMON_NAME"
  killDAEMON
  removePIDFile
  echo "Starting $DAEMON_NAME... $DAEMON"
  nohup $DAEMON &
  ;;
stop)
  echo "Stopping $DAEMON_NAME"
  killDAEMON
  removePIDFile
  ;;
*)
  echo "Usage: $DAEMON_NAME {start|restart|stop}" >&2
  exit 1
  ;;
esac

exit 0
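Once the script is saved, make it executable and give it a quick manual test before wiring it to boot (same placeholder paths and names as above):

sudo chmod +x /etc/init.d/myappd
sudo /etc/init.d/myappd start     # should launch the app and create RUNNING_PID
sudo /etc/init.d/myappd stop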

 

Wire it to autostart

The simplest way I found to have this script run when the server boots was to add it at the end of the
/etc/rc.local file. (On Ubuntu you'd register the new script with the update-rc.d command.)

#!/bin/sh
#
# This script will be executed after all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.

touch /var/lock/subsys/local

/etc/init.d/myappd start
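Before trusting a reboot, it's worth a quick sanity check that rc.local is executable and that the init script actually brings the app up (names assumed from the examples above):

ls -l /etc/rc.local               # should be executable; chmod +x it if not
sudo /etc/init.d/myappd start     # run the same line rc.local will run at boot
ps aux | grep '[m]yapp'           # confirm the JVM is running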

 

can’t ssh to ec2 ubuntu instance, /etc/fstab breaks bootup due to missing ebs volume [SOLVED]


So the /etc/fstab file on your root volume looked like this:

LABEL=cloudimg-rootfs / ext4 defaults 0 0
/dev/xvdf /mnt/backups auto defaults,comment=cloudconfig 0 2

By mistake you deleted the EBS volume that you had mounted on /mnt/backups (or whatever folder), and you restarted your Ubuntu instance, not knowing that if /etc/fstab failed during boot the machine would never get as far as starting application-layer network services like ssh on port 22…

You can ping the machine, but you can't ssh in, and Amazon support won't respond (or will tell you to fuck yourself).

You learn that Ubuntu has had this bug for a while, but it can be worked around by adding the nobootwait option to the volume's fstab entry.

You wish your /etc/fstab looked like this, but you can't get in, and Amazon doesn't give you any option in their dashboard to open a console and fix the problem…

LABEL=cloudimg-rootfs / ext4 defaults 0 0
/dev/xvdf /mnt/backups auto defaults,nobootwait,comment=cloudconfig 0 2

No worries, I have a fix that will let you edit that file, boot back up, and try to recover things. You may have lost that EBS volume, but you won't have to set up this machine again.

1. Make a snapshot of the root volume on that instance. This will take a while.
2. Make a new EBS volume from that snapshot and put it in the availability zone where the EC2 instance lives.
3. Create an identical temporary new EC2 instance in the same zone.
4. Attach the volume you created in step 2 to the new instance.
5. ssh to the new machine.
6. Run sudo fdisk -l; you should see all the attached devices, with something like this referring to the attached EBS volume:

Disk /dev/xvdf: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/xvdf doesn't contain a valid partition table

Ignore that last message; Ubuntu's cloud images put the filesystem directly on the device with no partition table, so it will still mount just fine.

7. Create a folder to mount the disk on: sudo mkdir /mnt/old-volume
8. Mount it: sudo mount -t auto /dev/xvdf /mnt/old-volume
9. Edit /mnt/old-volume/etc/fstab and fix it (add nobootwait, or remove the entry for the volume that no longer exists).
10. Unmount /mnt/old-volume, stop the temporary instance, and detach the repaired volume.
11. Stop the original instance and detach its broken root volume (at /dev/sda1).
12. Attach the repaired volume to the original instance as /dev/sda1 (there's a rough CLI sketch of this volume swap right after the list).
13. Start the original instance.
14. ssh to it (it will have a new IP address, so make sure to update your DNS or load-balancing entries).
15. Terminate the temporary instance and all the volumes that you won’t need.
16. Get to work.
17. Leave a tip below. 😉
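For reference, steps 10–13 (the volume swap) can also be scripted with the AWS CLI if you have it configured. This is only a sketch; every ID below is a placeholder, and the console works just as well.

aws ec2 stop-instances --instance-ids i-original123                 # stop the original instance
aws ec2 detach-volume --volume-id vol-brokenroot123                 # detach its broken root volume
aws ec2 attach-volume --volume-id vol-repaired456 --instance-id i-original123 --device /dev/sda1
aws ec2 start-instances --instance-ids i-original123                # boot it back up with the fixed root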

[SOLVED] Java7 SMTP Issue (Caused by: sun.security.pkcs11.wrapper.PKCS11Exception)

So you had your little program that would use AWS to send emails, and all of a sudden after a Java 7 update you get a stack trace like this:

[java]
javax.mail.MessagingException: Could not connect to SMTP host: email-smtp.us-east-1.amazonaws.com, port: 465;
nested exception is:
javax.net.ssl.SSLException: Server key
at com.sun.mail.smtp.SMTPTransport.openServer(SMTPTransport.java:1962)
at com.sun.mail.smtp.SMTPTransport.protocolConnect(SMTPTransport.java:654)
at javax.mail.Service.connect(Service.java:295)
at javax.mail.Service.connect(Service.java:176)
at javax.mail.Service.connect(Service.java:196)

Caused by: javax.net.ssl.SSLException: Server key
at sun.security.ssl.Handshaker.throwSSLException(Handshaker.java:1274)

Caused by: java.security.spec.InvalidKeySpecException: Could not create EC public key
at sun.security.pkcs11.P11ECKeyFactory.engineGeneratePublic(P11ECKeyFactory.java:169)
at java.security.KeyFactory.generatePublic(KeyFactory.java:334)

Caused by: sun.security.pkcs11.wrapper.PKCS11Exception: CKR_DOMAIN_PARAMS_INVALID
at sun.security.pkcs11.wrapper.PKCS11.C_CreateObject(Native Method)
at sun.security.pkcs11.P11ECKeyFactory.generatePublic(P11ECKeyFactory.java:233)
[/java]

It seems like the SSL socket factory in Java 7 is missing some of the ciphers when it comes to setting up the SSL socket cipher suites (see SSLSocket.setEnabledCipherSuites).

So… the question now is: how do I tell JavaMail to use the cipher suites I used to use in Java 6?

Easy: when you're setting up your Session properties, make sure to include the following key:

[java]properties.put("mail.smtp.ssl.ciphersuites", "SSL_RSA_WITH_RC4_128_MD5 SSL_RSA_WITH_RC4_128_SHA TLS_RSA_WITH_AES_128_CBC_SHA TLS_DHE_RSA_WITH_AES_128_CBC_SHA TLS_DHE_DSS_WITH_AES_128_CBC_SHA SSL_RSA_WITH_3DES_EDE_CBC_SHA SSL_DHE_RSA_WITH_3DES_EDE_CBC_SHA SSL_DHE_DSS_WITH_3DES_EDE_CBC_SHA SSL_RSA_WITH_DES_CBC_SHA SSL_DHE_RSA_WITH_DES_CBC_SHA SSL_DHE_DSS_WITH_DES_CBC_SHA SSL_RSA_EXPORT_WITH_RC4_40_MD5 SSL_RSA_EXPORT_WITH_DES40_CBC_SHA SSL_DHE_RSA_EXPORT_WITH_DES40_CBC_SHA SSL_DHE_DSS_EXPORT_WITH_DES40_CBC_SHA TLS_EMPTY_RENEGOTIATION_INFO_SCSV");[/java]

which specifies, as a space-separated String, the list of cipher suites the SSL sockets will use. At some point that list gets split into a String[] and passed to SSLSocket.setEnabledCipherSuites.

This is how my session properties look, in case you want to know:

[java]
Properties properties = new Properties();
properties.put("mail.transport.protocol", "smtp");
properties.put("mail.smtp.auth", true);
properties.put("mail.smtp.ssl.enable", true);
properties.put("mail.smtp.port", emailSMTPPort);
properties.put("mail.smtp.host", emailSMTPHost);
properties.put("mail.smtp.ssl.ciphersuites", "SSL_RSA_WITH_RC4_128_MD5 SSL_RSA_WITH_RC4_128_SHA TLS_RSA_WITH_AES_128_CBC_SHA TLS_DHE_RSA_WITH_AES_128_CBC_SHA TLS_DHE_DSS_WITH_AES_128_CBC_SHA SSL_RSA_WITH_3DES_EDE_CBC_SHA SSL_DHE_RSA_WITH_3DES_EDE_CBC_SHA SSL_DHE_DSS_WITH_3DES_EDE_CBC_SHA SSL_RSA_WITH_DES_CBC_SHA SSL_DHE_RSA_WITH_DES_CBC_SHA SSL_DHE_DSS_WITH_DES_CBC_SHA SSL_RSA_EXPORT_WITH_RC4_40_MD5 SSL_RSA_EXPORT_WITH_DES40_CBC_SHA SSL_DHE_RSA_EXPORT_WITH_DES40_CBC_SHA SSL_DHE_DSS_EXPORT_WITH_DES40_CBC_SHA TLS_EMPTY_RENEGOTIATION_INFO_SCSV");
[/java]

Cheers, and don't forget to tip if I saved your ass.